This month’s #tsql2sday is hosted by the @AirborneGeek (t|b), who asks us to take a lesson from something pilots do frequently: learning from accidents and mistakes made by others. As a long-time SQL Server Consultant DBA, I have learned from quite a lot of mistakes made (mostly) by others, seeing as a significant part of my job is to come in and fix such mistakes. So, today I’ll use this opportunity to talk about one such interesting incident.
My post today is published on the Madeira Data Solutions blog. In the post, I guide you through our investigation of a real-world case, and the interesting things we learned as a result:
[The] customer contacted us with complaints about recurring performance issues. They couldn’t quite put their finger on anything specific, but they reported the following symptoms:
- End-users intermittently getting “a general sense of slowness while using their system”.
- Users experiencing sudden disconnections from the database.
- Maintenance jobs failing occasionally.
Looking into the history of the failed jobs, we found that the failures appeared to be caused by disconnections in the Availability Group. These correlated with corresponding events in the AlwaysOn_Health extended events session (as did the disconnections reported by end-users): state changes into RESOLVING and back, WSFC disconnections… The works.
However, we couldn’t pinpoint the cause of these disconnections [at first].
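For readers who want to run a similar correlation themselves, here is a minimal sketch of how the replica state changes could be pulled out of the AlwaysOn_Health session. It is not the query used in the investigation; it assumes the session still writes to its default `.xel` files in the instance's log directory, and the column aliases are my own.

```sql
-- Sketch only: lists Availability Group replica state changes
-- (e.g. into RESOLVING and back) recorded by the AlwaysOn_health session.
-- Assumption: the session uses its default file target in the instance log directory.
SELECT
    xe.event_xml.value('(/event/@timestamp)[1]', 'datetime2(3)') AS event_time_utc,
    xe.event_xml.value('(/event/data[@name="availability_group_name"]/value)[1]', 'nvarchar(128)') AS ag_name,
    xe.event_xml.value('(/event/data[@name="previous_state"]/text)[1]', 'nvarchar(60)') AS previous_state,
    xe.event_xml.value('(/event/data[@name="current_state"]/text)[1]', 'nvarchar(60)') AS current_state
FROM
(
    -- Read the raw events from the session's files and keep only state changes
    SELECT CAST(event_data AS xml) AS event_xml
    FROM sys.fn_xe_file_target_read_file(N'AlwaysOn_health*.xel', NULL, NULL, NULL)
    WHERE [object_name] = N'availability_replica_state_change'
) AS xe
ORDER BY event_time_utc;
```

The event timestamps are in UTC, while the job history in msdb is recorded in the server's local time, so convert one of them before lining the two up.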