
How to Troubleshoot THREADPOOL Waits and Deadlocked Schedulers

In this blog post I’ll help you troubleshoot the rare but critical issues of Deadlocked Schedulers and high THREADPOOL waits – using practical means.

Also, there's a very useful T-SQL script towards the end.

TL;DR: get the script here

Background

To understand deadlocked schedulers and THREADPOOL waits, we must first understand what tasks, worker threads, and schedulers are.

If this topic is new to you, please refer to the following resources for some required reading before proceeding:

In short, high THREADPOOL waits can happen when SQL Server doesn’t have enough “worker threads” to handle new tasks, which could cause SQL Server to hang and refuse connections. When a task is waiting for a worker thread to become available, that wait type is called THREADPOOL wait.
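
For some quick context, here's a minimal sketch (using only documented DMVs) of how you could check whether THREADPOOL waits have been accumulating, and how many worker threads are currently in use. Interpreting the numbers is up to you – this is just a starting point:

    -- Cumulative THREADPOOL waits since the last restart (or wait-stats clear)
    SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type = N'THREADPOOL';

    -- Current worker thread usage vs. the instance-wide maximum
    SELECT SUM(current_workers_count) AS current_workers,
           (SELECT max_workers_count FROM sys.dm_os_sys_info) AS max_workers
    FROM sys.dm_os_schedulers
    WHERE [status] = N'VISIBLE ONLINE';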

A background process called the “Scheduler Monitor” identifies when the same worker threads have been “stuck” in the same state for 60 seconds or more. In that case, it flags the condition as a Deadlocked Scheduler, which can cause dropped connections, rollbacks, and even failovers.

When a Deadlocked Scheduler event happens, SQL Server will automatically generate a memory dump file (SQLDump#####.mdmp), and log the incident in the SQL Server Error Log.
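
If you suspect such an incident has already happened, a quick way to check is searching the error log. Note that xp_readerrorlog is undocumented, and the exact message wording can vary between versions, so treat the search strings below as assumptions:

    -- Search the current SQL Server error log (second parameter 1 = SQL Server log)
    EXEC master.dbo.xp_readerrorlog 0, 1, N'Deadlocked Schedulers';

    -- The related error 17884 message is also worth searching for
    EXEC master.dbo.xp_readerrorlog 0, 1, N'have not been picked up by a worker thread';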

Such incidents usually happen due to one or more sessions locking some kind of database resource (such a session would be our “head blocker”), causing huge blocking chains of other processes trying to access the same resource. Since each blocked session takes up a worker thread, the server can run out of available worker threads (depending on the scenario and circumstances, of course).

There are other possible causes as well, such as queries with parallelism running amok, or the server CPU being overworked. But such cases are rare in comparison.

What follows below are various possible solutions and recommendations for troubleshooting scenarios of high THREADPOOL waits and Deadlocked Schedulers. Sit back, and enjoy the show.

What does Microsoft Support say?

(hint: it’s not Ring-ding-ding-ding-dingeringeding 🦊)

Usually, when contacting Microsoft Customer Support Services (CSS) regarding a deadlocked schedulers incident, they reply with the following canned recommendations:

  1. Update your SQL Server version to the latest CU.
  2. Consider increasing the number of max worker threads.
  3. Set the database to READ_COMMITTED_SNAPSHOT, which can also help reduce blocking.
  4. Set MaxDOP based on the recommendations at https://support.microsoft.com/en-us/kb/2806535

These are what you might call “wide-net” workarounds, or “quick fixes”, or “spray-and-pray”… Unfortunately, these recommendations won't necessarily address your actual root cause. Instead, they throw a big wrecking ball at the problem and hope that the little spider gets caught up in the wreckage.

Kinda like that

If you ask the Microsoft support engineer nicely, they can dig deeper into the memory dump file (.mdmp), utilizing their access to the source code and private symbols, and find the actual offender (i.e. the head blocker).

But all that is predicated on being able to afford Microsoft support in the first place, having the time to wait for their response (which can take days and sometimes even weeks), and actually having a memory dump file.

SIDE NOTE: It's important to stress that sometimes there won't be any head blocker at all, which could mean there's actually an issue at the server level – such as inefficient/insufficient hardware, a problematic SQL Server build version, or a bad misconfiguration of MaxDOP or the Cost Threshold for Parallelism (CTFP). In such cases, Microsoft's “spray-and-pray” recommendations would indeed be the right fix.
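
For reference, applying those canned recommendations typically looks something like the sketch below. The database name and the MaxDOP value are placeholders – evaluate each change against your own workload before running anything:

    -- Check the effective worker thread count vs. the configured setting
    SELECT max_workers_count FROM sys.dm_os_sys_info;
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max worker threads';   -- 0 = let SQL Server calculate it automatically

    -- Reduce blocking with read committed snapshot isolation (test the impact first!)
    ALTER DATABASE [YourDatabase] SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

    -- Cap parallelism per Microsoft's MaxDOP guidance (the value 8 here is just an example)
    EXEC sp_configure 'max degree of parallelism', 8;
    RECONFIGURE;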

Real-Time Troubleshooting

Microsoft documentation, as well as most bloggers and MVPs, recommends finding your actual root cause by connecting to the server using the DAC (Dedicated Admin Connection) and looking for the “head blockers” while the incident is happening in real-time.
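
For illustration, once connected (ideally via the DAC, e.g. by prefixing the server name with ADMIN:), a hedged sketch like the following can help spot candidate head blockers. Keep in mind that an idle session holding locks won't have a row in sys.dm_exec_requests, so this is not a complete solution:

    -- Sessions that are blocking others, but are not themselves blocked,
    -- are our candidate head blockers.
    SELECT r.session_id, r.blocking_session_id, r.wait_type, r.wait_time,
           r.command, r.[status], t.[text] AS sql_text
    FROM sys.dm_exec_requests AS r
    OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.blocking_session_id = 0
      AND r.session_id IN (SELECT blocking_session_id
                           FROM sys.dm_exec_requests
                           WHERE blocking_session_id <> 0);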

Unfortunately, that kind of solution is not really feasible when you only find out about these incidents after the fact (when your server has already crashed/failed-over). Or you may find yourself fumbling frantically to get that damn head blocker during the most stressful situation possible.

(tell me if any of this sounds familiar: “Everything is freezing, I need to connect to the server ASAP! Oh right I need to connect via DAC! Oh damn I need to connect locally! Move already you damn RDP! Come on SSMS, open faster!!! Oh no where did I put that damn script?! Where’s the head blocker?! Argh I dropped my keys under the car at the most inopportune moment! Everything is burning and the cluster has already failed over!!! 🔥🔥🔥🔥 AAAAAH damnit, I missed it! Now I shall never know what done did the thing! 😭”).

One might think about some kind of mechanism to capture these events automatically, such as an extended events session recording high THREADPOOL waits. But that’s only going to give you the symptom. Tasks that are stuck on the THREADPOOL wait are way at the back of the line. What you really need is to know who’s at the front, i.e. your root cause, i.e. your head blocker.

The sp_server_diagnostics Procedure

Meet sp_server_diagnostics – a system stored procedure that can give you all kinds of data about the current state of the SQL Server instance. It's available in SQL Server 2012 and later.

It outputs the state of several “components”, represented in XML form, most of which we'll ignore for the purpose of this post. Today we'll focus specifically on the “query_processing” component.
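
For a quick look at what it returns, here's a minimal sketch that captures a single snapshot into a temp table (the column list follows the documented result set – adjust it if your build differs) and keeps only the query_processing row:

    CREATE TABLE #diag (
        create_time    DATETIME,
        component_type SYSNAME,
        component_name SYSNAME,
        [state]        INT,
        state_desc     SYSNAME,
        [data]         NVARCHAR(MAX)
    );

    INSERT INTO #diag
    EXEC sys.sp_server_diagnostics;   -- without @repeat_interval, it returns a single snapshot

    SELECT create_time, state_desc, CAST([data] AS XML) AS query_processing_data
    FROM #diag
    WHERE component_name = N'query_processing';

    DROP TABLE #diag;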

This component contains very useful information such as the top wait types, the most CPU-intensive sessions, and – most importantly for us – details about blocked and blocking processes:

Example contents of a query_processing component with some blockage

This sounds like a very useful set of details that we could use to troubleshoot THREADPOOL waits and find the potential head blocker, right?

But… How can we investigate this data retroactively?

System_Health Extended Events

Luckily, SQL Server periodically stores the output of sp_server_diagnostics in the form of an extended event called “sp_server_diagnostics_component_result”, which is also recorded by the built-in system_health session.

The problem with the system_health extended events session is that it records lots of additional events and cycles quite rapidly, which limits its retention depth and makes investigation harder.

You could create your own Extended Events session that records these events to a file target, using your own retention and rollover settings (see the sketch below). But it will only start helping you after you set it up. What if the incident has already happened?
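
Here's roughly what such a dedicated session could look like – a hedged sketch, where the session name, file name, and retention numbers are arbitrary placeholders:

    CREATE EVENT SESSION [track_server_diagnostics] ON SERVER
    ADD EVENT sqlserver.sp_server_diagnostics_component_result
        (SET collect_data = (1))   -- same customizable field the built-in system_health session sets
    ADD TARGET package0.event_file
        (SET filename = N'track_server_diagnostics.xel',
             max_file_size = (50),          -- MB per rollover file
             max_rollover_files = (20))     -- retention depth
    WITH (STARTUP_STATE = ON);

    ALTER EVENT SESSION [track_server_diagnostics] ON SERVER STATE = START;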

Well, if by any chance you have an FCI or AG set up, then you're in luck! Because there's one more alternative solution:

SQLDIAG Extended Event Files

In Failover Cluster Instance (FCI) and Availability Group (AG) environments, SQL Server automatically records the output of sp_server_diagnostics into the SQL Server LOG folder (the same one where the ERRORLOG files are located), in the form of extended event files (.xel) with the keyword “SQLDIAG” in the middle of their file names.

Their retention depth is more favorable than what’s possible in the system_health session, and their content is specifically focused on the sp_server_diagnostics output.

Investigating the Server Diagnostics data

When trying to learn how to investigate those SQLDIAG files, you'll tend to find a couple of Microsoft-made utilities, such as RML Utilities and SQL Nexus, that can (reportedly) parse PSSDIAG and SQLDIAG files and display their contents in a nice-looking report.

Unfortunately, those are exceedingly outdated tools that are no longer properly maintained, and it can be nearly impossible to get them to work properly.

Following several real-world use cases that urged me to investigate these SQLDiag files, I found myself having to create my own alternative solution.

Luckily, since these are standard XEL files we’re talking about, the functionality to open them up is already available in T-SQL via the sys.fn_xe_file_target_read_file system function, and everything inside can be retrieved via XML shredding.
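
Before getting to the full script, here's a simplified sketch of that underlying approach. The file-name pattern, the error-log-folder lookup, and the exact event and field names (which differ slightly between system_health captures and the FCI/AG SQLDIAG files) are assumptions you may need to adjust:

    -- Locate the folder where the ERRORLOG (and the SQLDIAG .xel files) live (Windows path assumed)
    DECLARE @LogPath NVARCHAR(4000) =
        CAST(SERVERPROPERTY('ErrorLogFileName') AS NVARCHAR(4000));
    SET @LogPath = LEFT(@LogPath, LEN(@LogPath) - CHARINDEX('\', REVERSE(@LogPath)) + 1);

    SELECT
        xed.event_xml.value('(event/@timestamp)[1]', 'datetime2')                        AS [utc_time],
        xed.event_xml.value('(event/data[@name="component"]/text)[1]', 'nvarchar(128)')  AS [component],
        xed.event_xml.value('(event/data[@name="state"]/text)[1]', 'nvarchar(128)')      AS [health_state],
        xed.event_xml.query('(event/data[@name="data"]/value/*)[1]')                     AS [component_data]
    FROM sys.fn_xe_file_target_read_file(@LogPath + N'*_SQLDIAG_*.xel', NULL, NULL, NULL) AS f
    CROSS APPLY (SELECT CAST(f.event_data AS XML) AS event_xml) AS xed
    WHERE f.object_name IN (N'sp_server_diagnostics_component_result', N'component_health_result');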

Eventually, I wrote this T-SQL script:

👉 Troubleshoot_high_THREADPOOL_waits_and_Deadlocked_Schedulers.sql

Since this post has already become long enough, I will not explain this script in depth, but it's filled with inline comments and explanations, so you can examine and understand it on your own.

The script is versatile enough to support reading from SQLDIAG files as well as from the system_health extended events session (or any other XEL files containing sp_server_diagnostics_component_result events).

Its output can look something like this:

Output from an actual environment that had a Deadlocked Schedulers incident

The columns that can help us the most are described below.

During incidents of worker thread starvation (whose symptoms are high THREADPOOL waits and Deadlocked Schedulers), you can see a point in time when the pendingTasks value starts to climb until it reaches some kind of peak, and then drops back down when the head blocker either gets killed or released, or when everything crashes due to a Deadlocked Scheduler event.

During these incidents (specifically, when they start and when they end), it’s most interesting to investigate the possibleHeadBlockers column to find the possible root cause(s) for the whole mess – specifically, which command could potentially be the head blocker.

The contents of this column are constructed from the XML report by looking for blocking session IDs (SPIDs) that are either blocked by something which is not a session (i.e. they would also show up in the blockedByNonSession column), or not found to be blocked at all.

Unfortunately, from what I’ve seen, the XML data does not tell us what type of wait(s) the blockers were stuck on… But we do get their execution stack, which may help you find the exact statement that’s causing all the trouble. For example:

There’s the bastard! Get ‘im!

Once you find your head blocker, investigate why it's causing the blocking, and resolve that issue (whether by improving its performance, changing its isolation level, breaking it up into smaller workloads, or whatever). There's no one-size-fits-all solution for this part. It's old-fashioned performance tuning on a case-by-case basis.

Once you've resolved your head blocker, you will hopefully have resolved your actual root cause – without having to resort to less optimal “spray-and-pray” solutions such as upgrading the server hardware, changing MaxDOP/CTFP, or (god forbid) manually increasing the “max worker threads” setting.

Conclusion

With the tips in this post, and the corresponding T-SQL script, I hope I've shown you how to practically troubleshoot high THREADPOOL waits and Deadlocked Schedulers, without necessarily relying on Microsoft Support to dig into your dump files. Such incidents can have a very critical effect on production systems, and it's important to find and resolve their actual root cause.

To summarize: with the tools I provided for you here, you should have more than enough to show those head blockers what's what!

NOTE: My experience with actual real-world use cases has thus far confirmed that the details I find in the SQLDIAG / system_health files (using my T-SQL script) match Microsoft Support's findings from the corresponding SQLDump files.
