We all know the error log used by SQL Server to audit failed logins by default. But is it good enough? For many scenarios, it isn’t. It’s actually missing a lot of important information. But, worry not, for there are better solutions!
Problem
There are many different reasons why a login would fail. Every time this happens, it’s logged as an error event in the SQL Server instance. In most cases, this would also be logged in the SQL Server error log. But, the details are limited to an IP address and, in the best-case scenario, a username.
In a real customer use case, we noticed many repeating login failures in the SQL error log, but the information provided wasn’t enough to troubleshoot the issue. For example, it had crucial information missing – such as client hostname, program name, and for some failure types – even the login name.
First Solution: Default Server Trace
Did you know that the default server trace also records information about failed login events in your SQL Server instance? It includes very useful information such as the client hostname, program name, login name, target database, and even the host process ID.
To get the failed login events from the default server trace, all you gotta do is query all the events with EventClass=20. This identifies all the “Login Failed” event types. You can use the query below:
SELECT trc.*
FROM fn_trace_getinfo(default) AS inf
CROSS APPLY fn_trace_gettable (convert(nvarchar(255), inf.value),default ) AS trc
WHERE inf.property = 2 AND inf.value IS NOT NULL
AND trc.EventClass= 20
ORDER BY trc.StartTime DESC
There are a few issues with it, though:
- As the default server trace, this log can contain a whole lot of data other than failed logins, and because of that it could take a while to filter through what you need due to its size (at least, in OLTP-heavy environments).
- This is a system-generated server trace, and as such – may not have satisfactory retention depth, file roll-over settings, etc. for your specific use-cases.
- It’s still possible that not all login-related failures will be recorded in this trace. Especially the more “exotic” kinds of login failures (such as those related to connection pooling, for example).
Still, the default server trace should be more than good enough for most use cases. But, even for those use cases when it’s not enough, we still have another possible solution.
Second Solution: Extended Events
Luckily, since these failed logins are also user error events, this means we can set up an extended event session to capture all the details about it. We would also have much more control over it, since we’d be the ones setting it up (i.e. file roll-over and retention settings, specific event types, etc.).
The one major downside to this is that, since you’re the one who has to set it up, it will only start helping you from that moment on. Obviously, it won’t help you if you’re investigating a server that doesn’t have it already running.
In any case, as part of my research before setting up this extended event session, I did the following:
- I queried from the sys.messages table for any errors that contain the words “login” or “SSPI” in its text message.
- I browsed the results and picked every error that looked like something that might be related to failed logins.
- I queried the sys.dm_xe_objects table (WHERE object_type = ‘action’) in order to get the full list of additional actions that might be retrievable for the error_logged extended event.
- I prepared an extended event script that would log all the error types that I picked, and add all the additional actions possible. I’ve also configured its retention settings so that it won’t have too much impact on storage if anything goes sideways.
The final result is the script that you can see below. It contains two scripts actually: one that creates the extended event session and activates it, and one that queries from it to display the collected events and their details.
At the time of this writing, the additional details (actions) that I found to be supported with these events are:
- Client host name
- Client program name
- User name
- Database ID
- Session ID
I wish it would’ve been possible to also log the client process ID as in the default server trace, but unfortunately it doesn’t seem to be supported (that would’ve helped pinpoint the exact process causing the issue). Still, it’s a whole lot more than what’s offered in the regular error log.
The events included at the time of this writing include the following errors:
- Error 11248, Severity 16: A corrupted message has been received. The SSPI login header is invalid.
- Error 17806, Severity 20: SSPI handshake failed with error code 0x%x, state %d while establishing a connection with integrated security; the connection has been closed. Reason: %.ls %.ls %.*ls
- Error 17830, Severity 20: Network error code 0x%x occurred while establishing a connection; the connection has been closed. This may have been caused by client or server login timeout expiration. Time spent during login: total %d ms, enqueued %d ms, network writes %d ms, network reads %d ms, establishing SSL %d ms, network reads during SSL %d ms, network writes during SSL %d ms, secure calls during SSL %d ms, enqueued during SSL %d ms, negotiating SSPI %d ms, network reads during SSPI %d ms, network writes during SSPI %d ms, secure calls during SSPI %d ms, enqueued during SSPI %d ms, validating login %d ms, including user-defined login processing %d ms.%.*ls
- Error 17817, Severity 20: Unsupport login ack packet response received when opening client connection.%.*ls
- Error 17828, Severity 20: The prelogin packet used to open the connection is structurally invalid; the connection has been closed. Please contact the vendor of the client library.%.*ls
- Error 17832, Severity 20: The login packet used to open the connection is structurally invalid; the connection has been closed. Please contact the vendor of the client library.%.*ls
- Error 17892, Severity 20: Logon failed for login ‘%.*ls’ due to trigger execution.%.*ls
- Error 17897, Severity 20: Session recovery feature data used in login record to open or recover a connection is structurally or semantically invalid; the connection has been closed. Please contact the vendor of the client library.%.*ls
- Error 18056, Severity 20: The client was unable to reuse a session with SPID %d, which had been reset for connection pooling. The failure ID is %d. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.
- Error 18061, Severity 20: The client was unable to join a session with SPID %d. This error may have been caused by an earlier operation failing or a change in permissions since the session was established. Check the error logs for failed operations immediately before this error message.
- Error 18401, Severity 14: Login failed for user ‘%.*ls’. Reason: Server is in script upgrade mode. Only administrator can connect at this time.%.*ls
- Error 18451, Severity 14: Login failed for user ‘%.*ls’. Only administrators may connect at this time.%.*ls
- Error 18452, Severity 14: Login failed. The login is from an untrusted domain and cannot be used with Integrated authentication.%.*ls
- Error 18456, Severity 14: Login failed for user ‘%.*ls’.%.*ls%.*ls
- Error 18458, Severity 14: Login failed. The number of simultaneous users already equals the %d registered licenses for this server. To increase the maximum number of simultaneous users, obtain additional licenses and then register them through the Licensing item in Control Panel.%.*ls
- Error 18459, Severity 14: Login failed. The workstation licensing limit for SQL Server access has already been reached.%.*ls
- Error 18460, Severity 14: Login failed. The number of simultaneous users has already reached the limit of %d licenses for this ‘%ls’ server. Additional licenses should be obtained and installed or you should upgrade to a full version.%.*ls
- Error 18461, Severity 14: Login failed for user ‘%.*ls’. Reason: Server is in single user mode. Only one administrator can connect at this time.%.*ls
- Error 18470, Severity 14: Login failed for user ‘%.*ls’. Reason: The account is disabled.%.*ls
- Error 18486, Severity 14: Login failed for user ‘%.*ls’ because the account is currently locked out. The system administrator can unlock it. %.*ls
- Error 18487, Severity 14: Login failed for user ‘%.*ls’. Reason: The password of the account has expired.%.*ls
- Error 18488, Severity 14: Login failed for user ‘%.*ls’. Reason: The password of the account must be changed.%.*ls
- Error 26078, Severity 20: Client disconnected during login
- Error 33147, Severity 20: Federated Authentication Feature data used in login record to open a connection is structurally or semantically invalid; the connection has been closed. Please contact the vendor of the client library.%.*ls.
- Error 40623, Severity 20: Reauthentication failed for login “%.*ls”. Within the past reauthentification interval, the login has become invalid due to a password change, a dropped login, or other cause. Please retry login.
I think I got all of them based on what I found in sys.messages, but if I missed something important, please let me know in the comments!
Take note of the event session settings for file roll-over, retention, memory usage, etc. You can fine-tune these settings to whatever you feel best fits your specific use-cases.
Conclusion
The default server trace and the extended events session turned out to be invaluable in the customer’s production environment. With the help of the extended events session, we even started detecting logon errors that weren’t even visible at all otherwise (due to some exotic application-level issues). In fact, this was so useful, that we decided to make this XE session a must-have standard on all production servers.
I hope you’ll find a good use for these solutions, and make them an inseparable part of your production environments and a valuable troubleshooting tool in your DBA tool belt.
If you have any thoughts on this, ideas, objections, questions, please comment below.