Access Violation error when querying from a system table function with parallelism

Following an incident at a customer’s production environment, Nathan Lifshes and I realized that we stumbled upon a yet-unknown bug in SQL Server, causing an access violation error, memory dumps, dropped connections, and even cluster fail-overs.

In this page:

Applies To

The following SQL Server bug was found and has been reproduced in every single version of SQL Server that we tested. These include:

  • SQL Server 2012 SP4-GDR
  • SQL Server 2014 SP1-GDR
  • SQL Server 2016 SP2-CU13
  • SQL Server 2017 RTM
  • SQL Server 2019 RTM
  • SQL Server 2019 CU4
  • SQL Server 2019 CU8
  • Azure SQL Database
  • Azure SQL Managed Instance

These are all the SQL Server versions that we could get our hands on without messing up production environments. We suspect that this bug exists in all SQL Server versions, but we can’t tell for certain. So far, though, it’s 9 out of 9, including the very latest SQL Server versions as of today.

Steps to Reproduce

The Access Violation error is triggered when an execution plan with parallelism involves specific system table functions. We found that the error occurs ONLY with parallel execution plans.

Therefore, in order to reproduce it, you’ll need:

  • SQL Server instance with MaxDOP setting not equal to 1
  • At least 2 available CPU cores

We found that we were able to reliably reproduce the error using SELECT queries with a CROSS APPLY clause, joining a temporary table with a large number of rows, with one of the system table functions, and also performing some kind of aggregation in the query (COUNT with GROUP BY was found to be sufficient).

In some cases, a parallel plan could not be triggered unless we also added Query Trace Flag 8649.

The following T-SQL script can be used as-is to reproduce the error:

WARNING: DO NOT RUN IN PRODUCTION ENVIRONMENTS!

IF OBJECT_ID('tempdb..#objects') IS NOT NULL DROP TABLE #objects;
CREATE TABLE #objects
(
  [object_id] INT
, [index_id] INT NULL
, [partition_number] INT NULL
);

INSERT INTO #objects
SELECT TOP (20000) -- Change this TOP number to as high as needed to trigger parallelism
	  p.object_id
	, p.index_id
	, p.partition_number
FROM sys.partitions AS p WITH(NOLOCK)
CROSS JOIN sys.all_columns c1
CROSS JOIN sys.all_columns c2

-- Use CROSS APPLY with some kind of aggregate function to trigger parallelism:
SELECT
	  DB_NAME()
	, p.object_id
	, p.index_id
	, p.partition_number
	, COUNT(*)
FROM #objects AS p WITH(NOLOCK)
CROSS APPLY sys.dm_db_index_operational_stats(db_id(),p.object_id,p.index_id,p.partition_number) AS dmf
GROUP BY
	  p.object_id
	, p.index_id
	, p.partition_number

The error has been reproduced using the function sys.dm_db_index_operational_stats as well as sys.dm_db_stats_properties. Although we didn’t test it on ALL functions so, as far as we know, it may apply to other functions as well. Both of these functions are visible in the execution plan as a “Table Valued Function” operator.

These screenshots could only be made possible using the “Display Estimated Execution Plan” functionality

Symptoms

Once you reproduce the access violation error, you will see the following exception:

Msg 0, Level 11, State 0, Line 55
A severe error occurred on the current command. The results, if any, should be discarded.

After the error occurs, you can run the following query to find recently created SQLDump files:

SELECT *
, log_filename = REPLACE([filename], '.mdmp', '.log')
, txt_filename = REPLACE([filename], '.mdmp', '.txt')
FROM sys.dm_server_memory_dumps
WHERE creation_time > DATEADD(hour, -24, GETDATE())
ORDER BY creation_time DESC

By looking into the corresponding SQLDump####.txt file, or in the SQL Server Error Log, you will find the c0000005 EXCEPTION_ACCESS_VIOLATION exception code, and then the following message in the next line:

Access Violation occurred reading address 0000000000000000

If you reproduce the error in Azure SQL DB or Azure SQL Managed Instance, the exception will look like this:

Msg 64, Level 20, State 0, Line 0
A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 – The specified network name is no longer available.)

Unfortunately, we found no way to retrieve similar SQLDump files or Error Log messages when using Azure SQL DB, since these functionalities don’t exist in this managed service.
No corresponding messages were found in sys.event_log either.

In Azure SQL Managed Instance, the access violation event was visible in the SQL Server Error Log, and the instance appears to have been restarted automatically as a result of it.

The access violation error does NOT happen with all system table functions. For example, it doesn’t happen with sys.dm_io_virtual_file_stats, despite the execution plan being with parallelism:

The following T-SQL script was used to test it:

IF OBJECT_ID('tempdb..#files') IS NOT NULL DROP TABLE #files;
CREATE TABLE #files
(
  [database_id] INT
, [file_id] INT NULL
);

INSERT INTO #files
SELECT TOP (30000)
	  main.database_id
	, main.file_id
FROM sys.master_files AS main WITH(NOLOCK)
CROSS JOIN sys.all_columns AS c1
CROSS JOIN sys.all_columns AS c2

SELECT
	  DB_NAME()
	, main.database_id
	, main.file_id
	, COUNT(*)
FROM #files AS main WITH(NOLOCK)
CROSS APPLY sys.dm_io_virtual_file_stats(main.database_id, main.file_id) AS ios
GROUP BY
	  main.database_id
	, main.file_id
OPTION(QUERYTRACEON 8649);

I also wasn’t able to reproduce the error using a user-defined multi-line table function. Even such that was wrapping the same problematic system functions:

CREATE FUNCTION dbo.TempUserFunction (@object_id INT, @index_id INT, @partition_number INT)
RETURNS @ret TABLE 
(
    -- columns returned by the function
    objectID int NULL,
    indexID int NULL,
    partitionNumber int NULL,
    dummy INT NULL
)
AS
BEGIN
	INSERT INTO @ret
	SELECT @object_id, @index_id, @partition_number, c.forwarded_fetch_count
	FROM sys.dm_db_index_operational_stats(DB_ID(), @object_id, @index_id, @partition_number) AS c
	OPTION(QUERYTRACEON 8649)

	RETURN;
END
GO

SELECT
	  DB_NAME()
	, p.object_id
	, p.index_id
	, p.partition_number
	, COUNT(*)
FROM #objects AS p WITH(NOLOCK)
CROSS APPLY dbo.TempUserFunction(p.object_id,p.index_id, p.partition_number) AS udf
GROUP BY
	  p.object_id
	, p.index_id
	, p.partition_number
OPTION(QUERYTRACEON 8649)

In this scenario, a parallel plan could not be generated at all, and the error was not reproduced.

However, a user-defined inline table function that did the exact same thing, did indeed trigger parallelism and reproduced the same error (as expected):

CREATE FUNCTION dbo.TempUserInlineFunction (@object_id INT, @index_id INT, @partition_number INT)
RETURNS TABLE
AS RETURN
(
	SELECT @object_id AS objectId, @index_id AS indexId
	      , @partition_number AS partitionNumber, c.forwarded_fetch_count
	FROM sys.dm_db_index_operational_stats(DB_ID(), @object_id, @index_id, @partition_number) AS c
)
GO

SELECT
	  DB_NAME()
	, p.object_id
	, p.index_id
	, p.partition_number
	, COUNT(*)
FROM #objects AS p WITH(NOLOCK)
CROSS APPLY dbo.TempUserInlineFunction(p.object_id,p.index_id, p.partition_number) AS ios
GROUP BY
	  p.object_id
	, p.index_id
	, p.partition_number
OPTION(QUERYTRACEON 8649)

Workaround

Disabling parallelism (for example, using OPTION(MAXDOP 1)) has prevented the access violation error from being reproduced. For example, this does not reproduce the error:

IF OBJECT_ID('tempdb..#objects') IS NOT NULL DROP TABLE #objects;
CREATE TABLE #objects
(
  [object_id] INT
, [index_id] INT NULL
, [partition_number] INT NULL
);

INSERT INTO #objects
SELECT TOP (20000)
	  p.object_id
	, p.index_id
	, p.partition_number
FROM sys.partitions AS p WITH(NOLOCK)
CROSS JOIN sys.all_columns c1
CROSS JOIN sys.all_columns c2

SELECT
	  DB_NAME()
	, p.object_id
	, p.index_id
	, p.partition_number
	, COUNT(*)
FROM #objects AS p WITH(NOLOCK)
CROSS APPLY sys.dm_db_index_operational_stats(db_id(),p.object_id,p.index_id,p.partition_number) AS dmf
GROUP BY
	  p.object_id
	, p.index_id
	, p.partition_number
OPTION (MAXDOP 1);

Conclusion

This bug has already been submitted to Microsoft Technical Support and to the SQL Server product team. We’ve received word that a fix is already underway and should be addressed in a future release.

Until this is fixed, we strongly advise to use OPTION(MAXDOP 1) to make sure parallelism is disabled when using system table functions such as sys.dm_db_index_operational_stats and sys.dm_db_stats_properties.

Were you able to reproduce the error with other system functions? Did you get different results from what’s described in this post? Please let us know about it in the comments below!

cover image by testbytes from Pixabay

5 comments

  1. We have had the same issue although in our case, it happens 2-3 times a day, so we cannot easily reproduce.

    Our server has MAXDOP = 2 and uses 4 CPUs.

    In a stored procedure we wrote, we read data from : sys.dm_filestream_non_transacted_handles.

    We have tried adding OPTION (MAXDOP 1) but the error still happens, although less often it seems.

    Thanks

    Like

    • Interesting… So you see the same kind of details in your dump txt files as we did?
      You should probably take these dump files and attach them to a support case with Microsoft.

      Like

      • It seems we do :

        SqlDumpExceptionHandler: Process 603 generated fatal exception c0000005 EXCEPTION_ACCESS_VIOLATION. SQL Server is terminating this process.

        Like

      • Eric, since your issue is still reproduced even with MAXDOP 1, it may not be related to the issue described in this post. There are many different possible use cases where ACCESS VIOLATION exceptions would happen. Most of which are fixed by Microsoft patches. I’d advise to do the following:

        1. Run DBCC CHECKDB to make sure you don’t have any corruption in your databases.
        2. Install the latest service pack and cummulative update for your SQL version.
        3. If the issue still persists, open a support case with Microsoft and attach the dump files and any related .log, .txt and .xel files: https://support.serviceshub.microsoft.com/supportforbusiness/create

        Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.