Following an incident at a customer’s production environment, Nathan Lifshes and I realized that we stumbled upon a yet-unknown bug in SQL Server, causing an access violation error, memory dumps, dropped connections, and even cluster fail-overs.
In this page:
Applies To
The following SQL Server bug was found and has been reproduced in every single version of SQL Server that we tested. These include:
- SQL Server 2012 SP4-GDR
- SQL Server 2014 SP1-GDR
- SQL Server 2016 SP2-CU13
- SQL Server 2017 RTM
- SQL Server 2019 RTM
- SQL Server 2019 CU4
- SQL Server 2019 CU8
- Azure SQL Database
- Azure SQL Managed Instance
These are all the SQL Server versions that we could get our hands on without messing up production environments. We suspect that this bug exists in all SQL Server versions, but we can’t tell for certain. So far, though, it’s 9 out of 9, including the very latest SQL Server versions as of today.
Steps to Reproduce
The Access Violation error is triggered when an execution plan with parallelism involves specific system table functions. We found that the error occurs ONLY with parallel execution plans.
Therefore, in order to reproduce it, you’ll need:
- SQL Server instance with MaxDOP setting not equal to 1
- At least 2 available CPU cores
We found that we were able to reliably reproduce the error using SELECT queries with a CROSS APPLY clause, joining a temporary table with a large number of rows, with one of the system table functions, and also performing some kind of aggregation in the query (COUNT with GROUP BY was found to be sufficient).
In some cases, a parallel plan could not be triggered unless we also added Query Trace Flag 8649.
The following T-SQL script can be used as-is to reproduce the error:
WARNING: DO NOT RUN IN PRODUCTION ENVIRONMENTS!
IF OBJECT_ID('tempdb..#objects') IS NOT NULL DROP TABLE #objects;
CREATE TABLE #objects
(
[object_id] INT
, [index_id] INT NULL
, [partition_number] INT NULL
);
INSERT INTO #objects
SELECT TOP (20000) -- Change this TOP number to as high as needed to trigger parallelism
p.object_id
, p.index_id
, p.partition_number
FROM sys.partitions AS p WITH(NOLOCK)
CROSS JOIN sys.all_columns c1
CROSS JOIN sys.all_columns c2
-- Use CROSS APPLY with some kind of aggregate function to trigger parallelism:
SELECT
DB_NAME()
, p.object_id
, p.index_id
, p.partition_number
, COUNT(*)
FROM #objects AS p WITH(NOLOCK)
CROSS APPLY sys.dm_db_index_operational_stats(db_id(),p.object_id,p.index_id,p.partition_number) AS dmf
GROUP BY
p.object_id
, p.index_id
, p.partition_number
The error has been reproduced using the function sys.dm_db_index_operational_stats as well as sys.dm_db_stats_properties. Although we didn’t test it on ALL functions so, as far as we know, it may apply to other functions as well. Both of these functions are visible in the execution plan as a “Table Valued Function” operator.
Symptoms
Once you reproduce the access violation error, you will see the following exception:
Msg 0, Level 11, State 0, Line 55
A severe error occurred on the current command. The results, if any, should be discarded.
After the error occurs, you can run the following query to find recently created SQLDump files:
SELECT *
, log_filename = REPLACE([filename], '.mdmp', '.log')
, txt_filename = REPLACE([filename], '.mdmp', '.txt')
FROM sys.dm_server_memory_dumps
WHERE creation_time > DATEADD(hour, -24, GETDATE())
ORDER BY creation_time DESC
By looking into the corresponding SQLDump####.txt file, or in the SQL Server Error Log, you will find the c0000005 EXCEPTION_ACCESS_VIOLATION exception code, and then the following message in the next line:
Access Violation occurred reading address 0000000000000000
If you reproduce the error in Azure SQL DB or Azure SQL Managed Instance, the exception will look like this:
Msg 64, Level 20, State 0, Line 0
A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 – The specified network name is no longer available.)
Unfortunately, we found no way to retrieve similar SQLDump files or Error Log messages when using Azure SQL DB, since these functionalities don’t exist in this managed service.
No corresponding messages were found in sys.event_log either.
In Azure SQL Managed Instance, the access violation event was visible in the SQL Server Error Log, and the instance appears to have been restarted automatically as a result of it.
The access violation error does NOT happen with all system table functions. For example, it doesn’t happen with sys.dm_io_virtual_file_stats, despite the execution plan being with parallelism:
The following T-SQL script was used to test it:
IF OBJECT_ID('tempdb..#files') IS NOT NULL DROP TABLE #files;
CREATE TABLE #files
(
[database_id] INT
, [file_id] INT NULL
);
INSERT INTO #files
SELECT TOP (30000)
main.database_id
, main.file_id
FROM sys.master_files AS main WITH(NOLOCK)
CROSS JOIN sys.all_columns AS c1
CROSS JOIN sys.all_columns AS c2
SELECT
DB_NAME()
, main.database_id
, main.file_id
, COUNT(*)
FROM #files AS main WITH(NOLOCK)
CROSS APPLY sys.dm_io_virtual_file_stats(main.database_id, main.file_id) AS ios
GROUP BY
main.database_id
, main.file_id
OPTION(QUERYTRACEON 8649);
I also wasn’t able to reproduce the error using a user-defined multi-line table function. Even such that was wrapping the same problematic system functions:
CREATE FUNCTION dbo.TempUserFunction (@object_id INT, @index_id INT, @partition_number INT)
RETURNS @ret TABLE
(
-- columns returned by the function
objectID int NULL,
indexID int NULL,
partitionNumber int NULL,
dummy INT NULL
)
AS
BEGIN
INSERT INTO @ret
SELECT @object_id, @index_id, @partition_number, c.forwarded_fetch_count
FROM sys.dm_db_index_operational_stats(DB_ID(), @object_id, @index_id, @partition_number) AS c
OPTION(QUERYTRACEON 8649)
RETURN;
END
GO
SELECT
DB_NAME()
, p.object_id
, p.index_id
, p.partition_number
, COUNT(*)
FROM #objects AS p WITH(NOLOCK)
CROSS APPLY dbo.TempUserFunction(p.object_id,p.index_id, p.partition_number) AS udf
GROUP BY
p.object_id
, p.index_id
, p.partition_number
OPTION(QUERYTRACEON 8649)
In this scenario, a parallel plan could not be generated at all, and the error was not reproduced.
However, a user-defined inline table function that did the exact same thing, did indeed trigger parallelism and reproduced the same error (as expected):
CREATE FUNCTION dbo.TempUserInlineFunction (@object_id INT, @index_id INT, @partition_number INT)
RETURNS TABLE
AS RETURN
(
SELECT @object_id AS objectId, @index_id AS indexId
, @partition_number AS partitionNumber, c.forwarded_fetch_count
FROM sys.dm_db_index_operational_stats(DB_ID(), @object_id, @index_id, @partition_number) AS c
)
GO
SELECT
DB_NAME()
, p.object_id
, p.index_id
, p.partition_number
, COUNT(*)
FROM #objects AS p WITH(NOLOCK)
CROSS APPLY dbo.TempUserInlineFunction(p.object_id,p.index_id, p.partition_number) AS ios
GROUP BY
p.object_id
, p.index_id
, p.partition_number
OPTION(QUERYTRACEON 8649)
Workaround
Disabling parallelism (for example, using OPTION(MAXDOP 1)) has prevented the access violation error from being reproduced. For example, this does not reproduce the error:
IF OBJECT_ID('tempdb..#objects') IS NOT NULL DROP TABLE #objects;
CREATE TABLE #objects
(
[object_id] INT
, [index_id] INT NULL
, [partition_number] INT NULL
);
INSERT INTO #objects
SELECT TOP (20000)
p.object_id
, p.index_id
, p.partition_number
FROM sys.partitions AS p WITH(NOLOCK)
CROSS JOIN sys.all_columns c1
CROSS JOIN sys.all_columns c2
SELECT
DB_NAME()
, p.object_id
, p.index_id
, p.partition_number
, COUNT(*)
FROM #objects AS p WITH(NOLOCK)
CROSS APPLY sys.dm_db_index_operational_stats(db_id(),p.object_id,p.index_id,p.partition_number) AS dmf
GROUP BY
p.object_id
, p.index_id
, p.partition_number
OPTION (MAXDOP 1);
Conclusion
This bug has already been submitted to Microsoft Technical Support and to the SQL Server product team. We’ve received word that a fix is already underway and should be addressed in a future release.
Until this is fixed, we strongly advise to use OPTION(MAXDOP 1) to make sure parallelism is disabled when using system table functions such as sys.dm_db_index_operational_stats and sys.dm_db_stats_properties.
Were you able to reproduce the error with other system functions? Did you get different results from what’s described in this post? Please let us know about it in the comments below!
cover image by testbytes from Pixabay
EDIT: Based on several tests, the error does not seem to be reproducible in SQL Server 2019 CU11. All execution plans appear to be forced to run serially, regardless of parallelism-related query hints and instance configurations, and thus the crash is avoided.