One fateful night, cold and dreary, I’ve stumbled upon an apparently little known fact about SQL Server – specifically SQL Server collations and how they affect performance… Yes, really!
While normalizing a table in their database, one of my clients had weird and inconsistent performance problems and couldn’t find a solution for it for quite a while.
They had two comparable queries that performed a LIKE query on the normalized column:
- The Original (Un-Normalized) Query: Scanned 2.5 million rows and executed for 3 seconds. It looked something like this:
SELECT someColumns FROM MyTable WHERE StringColumn LIKE 'SomeText%'
- The New (Normalized) Query: Scanned 10 million similar rows and executed for 50 seconds.
It looked something like this:
SELECT someColumns FROM MyTableNormalized AS main JOIN MyLookup AS lu ON main.TextID = lu.ID WHERE lu.StringColumn LIKE 'SomeText%'
Each query was executed against a different table, but the data in the tables was identical (the row number difference was due to the normalization). The column we searched against was varchar(255) in both tables.
For a while we couldn’t wrap our heads around the reason for the huge difference in performance. Something just didn’t seem right.
During my investigation I noticed an intriguing difference in the execution plans, that in the first query (the fast one) an implicit conversion was performed for the varchar column, but no such conversion occurred in the slow query. At first I didn’t pay attention to it because, as a DBA, I was trained to think that implicit or explicit conversions HURT performance, and not the other way around – so this couldn’t have been the reason.
But eventually (and with a little push from Shy Engelberg – Thank you Shy!) I took a closer look to find out why this conversion was happening in the first place.
We found that the collation of the varchar column was different between the two tables.
The first table had a Latin SQL collation, yet the second table had an ANSI collation (or windows collation if you prefer to call it that).
We tried to change the collation of the second table to the SQL collation as well – and presto! The query now executed for 5 seconds – 10 times faster! (and only 2 times slower than the 3 second query, a much more sensible difference).
I looked for materials regarding this issue, and couldn’t find much except this (apparently important) white paper: http://support.microsoft.com/kb/322112
Here are the relevant highlights:
Unicode sorting rules are much more complex than the rules for a non-Unicode SQL sort order. When SQL Server compares Unicode data, the characters are assigned a weight that is dynamically modified based on the collation’s locale. The data is also modified by comparison style settings such as width, accent, or Kana-sensitivity. The Unicode sort routines support more intelligent sort behaviors like word sorting […]
If you are storing and handling your data by using non-Unicode data types (char, varchar, text), and [if] you are using a SQL collation, string comparisons will be performed with a non-Unicode SQL sort order.
If you are storing and handling your data by using non-Unicode data types (char, varchar, text), and [if] you are using a Windows collation, string comparisons will be performed with the Unicode sorting rules. This may cause certain operations that are unusually dependent on string sorting performance to take longer and to use more CPU than a similar operation that is performed with a SQL collation.
If you are using Unicode data types (nchar, nvarchar, ntext), there is no difference in the sorting behavior for the SQL and the Windows collations. Both will use Unicode sorting rules.
In short, windows collations perform as slow as Unicode data types, and SQL collations perform much faster.
However, in such cases the sorting order must be taken into consideration because it could change the results. Though in the case of this client, the first table was the original way to go anyway, so we’re actually “fixing” the sorting order difference (in case such existed).
Later we also tested the performance of a binary collation and saw that it was even faster in queries that required comparison and grouping, but it performed worse in queries that required sorting. This makes sense since binary collations must be parsed textually before being sorted – a heavier operation than doing the same for SQL collations – as opposed to a comparison operation which can be as fast as doing XOR of two values.
Choosing SQL or Windows Collations seem like another consideration we need to take when designing a database and checking for performance – but beware! Changing the collation may change the sorting order of your text, so if your application depends on a certain form of sorting – you need to make sure the new collation matches it.
The same white paper mentioned above also claims that changing indexes or how a query is written will make a bigger difference than changing collations, but it’s surprising what a huge difference in performance a collation caused in the case of my client, and we did try to rewrite the query several times and changing indexes, but nothing made a big difference as much as the change in collation.
Did you encounter a similar scenario? Did you have similar (or even more interestingly – different) results? Tell us about it in the comments!
Thanks again to Shy Engelberg for his very helpful advice on this one.
This article was originally published by Eitan Blumin on November, 2011, in www.madeiradata.com