The other day I had to tune a query that went awry when it was moved to new hardware. The old runtime was around one hour, but on the newer hardware it had ballooned to 5 hours and 20 minutes. The main reason I could think of was that the new hardware lived in our new IaaS solution, and as usual the storage engineers had designed for availability, not performance. The CrystalMark numbers were half those of our own old hardware.
When SQL Server needs to retrieve data from a table, it checks whether it should use an index or a table scan. The threshold for this choice is roughly 30% of the number of 8K pages in the table, not the number of rows.
The statistics concerning the number of pages and rows can be changed with a little-known clause of the UPDATE STATISTICS command:
UPDATE STATISTICS <tablename> WITH PAGECOUNT = 1000000
I recently used this in a case where a vendor application contained a bad SQL statement that didn’t use the index. The table was deemed too small for index usage by SQL Server’s optimizer, but the subsequent table scan led to extensive blocking in a split-second environment. Being a vendor application, I couldn’t change the SQL statements, and after examining some alternatives a plan guide was considered the appropriate way. So I changed the table statistics to have a large page count, and the optimizer was now convinced that an index was best. I captured the new execution plan with the index usage into a plan guide (the how-to of that can be found here).
To reset the page count, use the command
DBCC UPDATEUSAGE (0, '<tablename>')
This command can help you find the threshold of plan changes (called the ‘tipping point’). See Kimberly Tripp’s post for an intro: https://www.sqlskills.com/blogs/kimberly/the-tipping-point-query-answers/
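Putting the pieces together, a minimal sketch of the trick might look like this. Note that PAGECOUNT is an undocumented option, and dbo.Orders is a hypothetical table name standing in for the vendor table:

```sql
-- Inflate the page count so the optimizer believes the table is large
-- and favors the index (undocumented option, test environments first):
UPDATE STATISTICS dbo.Orders WITH PAGECOUNT = 1000000;

-- ... run the query, capture the index-using execution plan into a plan guide ...

-- Afterwards, restore the real page and row counts:
DBCC UPDATEUSAGE (0, 'dbo.Orders');
```

The plan guide then keeps serving the captured plan even after the statistics are reset.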
If you set up replication – for example, a snapshot replication to refresh the ACC environment with PROD data – a correction of the identities is needed for the applications to run correctly.
The code below checks the identities and gives the syntax to correct the identity values:
SELECT
    TABLE_SCHEMA + '.' + TABLE_NAME AS table_name,
    IDENT_SEED(TABLE_SCHEMA + '.' + TABLE_NAME) AS Seed,
    IDENT_INCR(TABLE_SCHEMA + '.' + TABLE_NAME) AS Increment,
    IDENT_CURRENT(TABLE_SCHEMA + '.' + TABLE_NAME) AS Current_Identity,
    'DBCC CHECKIDENT(''' + TABLE_SCHEMA + '.' + TABLE_NAME + ''', RESEED, '
        + CAST(IDENT_CURRENT(TABLE_SCHEMA + '.' + TABLE_NAME) + 1 AS VARCHAR(10)) + ')' AS Command
FROM INFORMATION_SCHEMA.TABLES
WHERE OBJECTPROPERTY(OBJECT_ID(TABLE_SCHEMA + '.' + TABLE_NAME), 'TableHasIdentity') = 1
  AND TABLE_TYPE = 'BASE TABLE'
  AND TABLE_NAME NOT LIKE 'Mspeer%'
  AND TABLE_NAME NOT IN ('sysarticles', 'syspublications')
ORDER BY TABLE_SCHEMA, TABLE_NAME
In order to view deadlock information in SQL Server (2012 and beyond), you need the VIEW SERVER STATE permission. The deadlock reports are captured by the system_health Extended Events session, which runs by default:
Open this event file; it may take a while for all events to load from the file. After that, from the Extended Events menu, choose Grouping, and group on NAME.
Click OK and the events are grouped by, well, name:
Open up the xml_deadlock_report group and choose one of the deadlock reports. Click on the Deadlock tab to get the graph:
If you click on the Details tab, you can double-click on the VALUE field to get the XML with all the deadlock details.
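If you prefer T-SQL over clicking through SSMS, a sketch like the following reads the deadlock reports straight from the system_health ring buffer target (keep in mind the ring buffer recycles, so older deadlocks may already be gone, and the timestamps are in UTC):

```sql
-- Extract xml_deadlock_report events from the system_health ring buffer.
SELECT
    xed.value('@timestamp', 'datetime2') AS deadlock_time_utc,
    xed.query('.')                       AS deadlock_report
FROM (
    SELECT CAST(st.target_data AS XML) AS target_data
    FROM sys.dm_xe_session_targets AS st
    JOIN sys.dm_xe_sessions        AS s
      ON s.address = st.event_session_address
    WHERE s.name = 'system_health'
      AND st.target_name = 'ring_buffer'
) AS src
CROSS APPLY src.target_data.nodes(
    'RingBufferTarget/event[@name="xml_deadlock_report"]') AS x(xed);
```

Save the deadlock_report XML with an .xdl extension and SSMS will render the same graph as above.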
In this post I will explain how to create execution plans from production on your developer laptop for SQL Server:
(1) Download and install Plan Explorer, the SQL Server Management Studio add-in from SentryOne
(2) Export the database objects, including statistics and histograms, from production as follows:
(3) Create the database on your laptop and make sure to switch off automatic statistics creation and updates in the database properties, so the imported production histograms are not overwritten:
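The same settings can be applied by script; a sketch, assuming the clone database is called MyCloneDb:

```sql
-- Keep SQL Server from replacing the imported production histograms
-- with freshly computed ones based on the (empty) local tables.
ALTER DATABASE [MyCloneDb] SET AUTO_CREATE_STATISTICS OFF;
ALTER DATABASE [MyCloneDb] SET AUTO_UPDATE_STATISTICS OFF;
```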
(4) Copy the parallel-processing configuration parameters from production:
EXEC sp_configure 'cost threshold for parallelism', x;
EXEC sp_configure 'max degree of parallelism', x;
RECONFIGURE;
You are now ready to check the execution plans of your new SQL statements in your new environment. If you look at the table sizes (with this query) you will see that your new tables have the statistics in place, but all tables are really empty:
In the next post we’ll have a look at the execution plans.
The other day I had an interesting problem involving an unstable SQL Server AG cluster on a virtual machine (EC2) in Amazon’s AWS. It appeared that if a data-integrity job and an index-rebuild job overlapped, the node simply went unresponsive and got evicted. A failover kicked in, and we could only revive the primary node by rebooting it and reverting the node eviction. What on earth could have caused this? I have never seen a server go down because of a standard Ola maintenance job on an RDBMS, so the troubleshooter in me was intrigued by what was going on. Such a severe and reproducible problem didn’t seem like a small bug to me; instead it seemed we had hit upon a misunderstood concept.
How it started
So let me tell you what happened on day 1, when we first got a notification of the node failover. At first we didn’t know what caused the breakdown, so Pavlov made us look at the different logfiles to collect diagnostic data and come up with possible reproduction scenarios. (You might have a look at Microsoft’s Tiger Toolbox for failover analysis here.)
We saw a few telling entries and some red herrings, such as the loss of connection with the witness server. The primary node lost all network connectivity, but wasn’t that simply a sign of resource starvation? Also, the problem seemed to occur after the Windows OS had received a patch. That got us sidetracked for a while before we ruled it out as a coincidence.
The first solid clue came when we noticed that the problem occurred around the same time in the evening. There was little activity on the database, except a daily index rebuild and corruption check, as demanded by the vendor of the application. Surely an index rebuild couldn’t lead to starvation to such an extent that the instance is brought down? Moreover, we had just had the instance increased in both memory and CPU resources by a factor of 4. To make it even more interesting: before the fourfold increase of the server size we didn’t have any problems.
Is it the infrastructure?
And there I got my first suspicion: it had to do with the infrastructure footprint we were placing. This came with a little problem: it was hosted in the cloud, where we didn’t have access to overcommitment figures, settings or logfiles of the infrastructure. There were also some cloud-specific details to consider, some of which I had never heard of before. For example, in an unrelated incident, downscaling a node took me a few days because of G2 storage credit starvation(?). Over the years I’ve read all the fine print there is on SQL Server, but on Amazon, not so much. This was going to be a learning experience.
The final hint came from a different source. Someone from cloud ops couldn’t start up a database server because it sometimes hit a resource limit. Aha! We were in some sort of resource consumption group; I should have investigated the contract with the cloud provider more closely. I hadn’t done that yet. After all, the cloud is like a tap for resources, and in this wonderful brave new scalable world there shouldn’t be such a thing as resource limits, no?
My suspicion was memory starvation, and I noticed some funnies: according to Task Manager, SQL Server took 112 megabytes, but Resource Monitor showed a reserved amount of over 11 GB. I now had a suspect, guilty by association: the ‘Lock Pages in Memory’ setting. This is a setting for SQL Server in a virtual-machine environment to keep the memory reserved for the database. The so-called balloon driver snatches away any unused memory from the virtual machines, which is not a good idea for software like databases, which rely heavily on cache for their performance.
Indeed, after removing this setting (in the group policy; for details see here) the crash didn’t occur anymore, and Task Manager and Resource Monitor now showed a consistent picture.
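As a side note: on SQL Server 2016 SP1 and later you can check from within the instance whether Lock Pages in Memory is actually in effect, without going through the group policy editor:

```sql
-- CONVENTIONAL      = Lock Pages in Memory not in use
-- LOCK_PAGES        = Lock Pages in Memory active
-- LARGE_PAGES       = large pages in use (implies locked pages)
SELECT sql_memory_model, sql_memory_model_desc
FROM sys.dm_os_sys_info;
```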
Being a database engineer, I like to know the performance and setup of the underlying infrastructure. I didn’t bother to check this with Amazon, assuming that I would not be able to see any logs or errors on their side. But they do have some fine print, which I need to heed if I want to be able to advise on the best buy in the cloud.
What are the top 3 mistakes I have seen during cloudmigrations?
I have worked for various companies that have embarked on a cloud migration project. What struck me is that, although the reasons for going to the cloud differ per company, they seem to hit the same problems. Some of them overcame the issues, but others fell out of the cloud tree, hitting every branch on the way down.
So here is my top 3 of mistakes, from companies that went for a digital transformation to those going for a simple IaaS solution (if you allow me to label that as a cloud migration for the sake of this post).
MISTAKE 1: The Big Bang Theory
Let’s start off with an important mistake: we’re going to the cloud without doing a pilot project; we’re going Big Bang. It goes like this: the upper brass chooses a provider and sets up the migration plan. The opinion of their own technicians is skipped, because they wouldn’t agree anyway. Their jobs would be on the line, so their opinion can be discarded as biased. In one case, the company’s own staff was deemed inferior, as one manager told me: “We can’t win against players who play in the Champions League”, referring to the cloud provider.
I’ve seen this run amok in several ways. With IaaS migrations I’ve seen twice that the performance was far inferior or the service was bad (many outages, without good diagnostic capabilities at the IaaS provider). With cloud migrations aiming at a digital transformation, it turned out that cloud technologies are not yet mainstream. Real expertise is still under construction, and mistakes will be made.
Lesson: you would like to make these kinds of mistakes during a pilot project, not during a mass migration.
MISTAKE 2: The cloud is just an on-prem datacenter
The cloud is a nice greenfield situation and promises a digital transformation unburdened by legacy systems. The services and machines in the cloud are new and riddled with extra bells and whistles. No more dull monolithic architecture, but the opportunities of a microservice architecture. Gradually it is understood that the cloud entails a bit more than this. For example: how are responsibilities reallocated among the staff, can the company switch to a scrum culture, how cloud-savvy are the architects and technicians? On that last point, I’ve seen an architect propose to give every application its own database instance. Such a lack of understanding of the revenue model of a provider will make your design of the database landscape fall apart.
Lesson: A cloud transformation without a company transformation in terms of organisation, processes, methods, IT architecture and corporate culture is a ‘datacenter in the cloud’: new hat, same cowboy.
MISTAKE 3: The cloud fixes a bad IT department
Some situations are too sensitive to be discussed openly, so the next mistake I learned about from people in whisper mode: the quality of the in-house IT department doesn’t suffice, and the brave new cloud will lead us to a better IT department. The changes are all good: on-premises datacenters become unnecessary, some jobs such as database engineer are no longer needed, no more patch policies or outages. The cloud works as an automagical beautifier.
In short, managers who were unable to create a satisfactory IT department now have an opportunity to save the day with a cloud migration. They can bring in a consultancy company to migrate to the new promised land and climb the learning curve together before the budget runs out. The IT department used to be a trainwreck, but from this controlled explosion called ‘cloud migration’ something beautiful will grow out of the resulting wasteland. I leave it to your imagination what could go wrong here.
These are the top 3 mistakes I have seen in the field. Surely there are more nuanced issues, such as how cheap and safe the cloud really is, but that is a subject I would like to discuss in a different post.
“Database administration is a dying breed; don’t go there“; that was the advice I received twenty-one years ago. It is a popular idea that springs up from time to time. The latest argument comes from the cloud: database services are well automated, so who needs a good DBA?
I have seen two main arguments for the diminishing role of the DBA, repeated in blog posts and first-hand conversations. First: new technology makes the DBA obsolete. The latest databases are self-tuning, the administration is automated, and relational databases will be taken over by NoSQL, Hadoop, MongoDB, XML, object-oriented databases, or the cloud.
This argument fades away as soon as the new technology falls off its hype cycle. NoSQL didn’t replace the RDBMS, administration tasks become more difficult in an ever-growing landscape, and the cloud merely shifts the work to optimization, capacity management and tasks higher up the value chain.
The second argument stems from an underestimation of DBA work: “We don’t need DBAs because our developers know enough about databases“. I have heard similar statements during job interviews. When I asked why a DBA role was being considered by this company, I was told that “the developers don’t have time anymore for DBA work“. This was a software house where no one needed a DBA until they realised they NEEDED a DBA. A need for a firefighter role to fix the data model, performance or data corruption. A firefighter who would silently do his work without causing any delay in the software build process, and bring about fundamental changes in the structure without using up other resources. There was no ambition to raise the maturity level of the build process, no vision on operational intelligence or business intelligence: innovation was confined to application features. For efficiency and control over the build process they used Scrum; that would suffice.
There is one funny moment in that job interview I would like to share with you. After hearing their situation, I asked them if they had a lot of incidents now, and whether they thought that was part of the deal of writing software. I forgot the exact answer they gave, but I didn’t forget that the IT manager interviewing me was called away … for an incident. I concluded the interview without seeing him again.
To advocate or not?
SQL Server guru Brent Ozar had an encounter with a CTO who said: “I thought I’d save money in the cloud by not having a DBA, but what I’m learning is that in the cloud, I actually get a return on my DBA investments.” Surely, for performance projects in the cloud, picking up the euros is visible to everyone. But streamlining and compacting a database landscape for better agility is reserved for good technical leadership that aims for a mature company. The central question for a company is: how do you see your DBA?
A plan guide is a way to add a hint to a SQL statement when you can’t change the SQL statement at the source. This performance trick is hidden deep in the performance toolbox and shouldn’t be the first weapon of choice. I found implementing plan guides a bit of a delicate exercise, both on Oracle and SQL Server. A small and insignificant copy/paste error can leave the plan guide inoperable. I therefore use the following method for SQL Server, which consists of 5 steps:
(1) retrieve the SQL_handle from the cache
(2) retrieve the execution plan and store it in a plan guide
(3) store the plan guide in a file
(4) edit the execution plan in this file
(5) drop the old plan guide and load the new one from the edited file
So let’s have a look at those steps in turn:
(1) retrieve the SQL_handle from the cache
SELECT sql_handle, text
FROM sys.sysprocesses
CROSS APPLY sys.dm_exec_sql_text(sql_handle)
WHERE text LIKE '%TOP 50%'
Of course, you should adjust the WHERE clause to find the query you’re looking for.
(2) retrieve the executionplan and store this in a planguide
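For this step, a sketch along the following lines can be used; the plan guide name and the TOP 50 filter are assumptions carried over from step (1):

```sql
-- Find the plan handle of the cached statement and freeze its
-- execution plan into a plan guide in one go.
DECLARE @plan_handle varbinary(64);

SELECT TOP (1) @plan_handle = qs.plan_handle
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE '%TOP 50%';

EXEC sp_create_plan_guide_from_handle
    @name        = N'MyPlanGuide',   -- hypothetical plan guide name
    @plan_handle = @plan_handle;
```

The resulting plan guide can then be scripted out to a file for the editing in steps (3) and (4).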
Anyone who has never been to a SQLbits conference is in for a little shock. A conference by name, but a geekfest in appearance, SQLbits is a special beast in the world of SQL Server conventions. Most eye-catching is “the infamous SQLbits party”; this year the party had a roaring-twenties theme: gambling, Al Capone, Charleston dancers and reliving the frolics of a post-war period.
What is good about this conference is that there’s a lot of room for networking. As you may have noticed, IT folks are better with computers than with people, and a little more guidance in meeting people is helpful. For example: there are tables for every industry (banking, healthcare, etc.) so you can team up with people in your branch. There was also a pub quiz where we were put in groups of about 6 people and had to answer freakish questions like “How tall was the brother in the lyrics of ‘Can’t Stand Losing You’ by The Police?”. I’m embarrassed to admit that I knew the answer to that one.
On to the technical sessions. It’s impossible to mention all 150 of them, but generally the quality of them was very good. I went to most of the Powershell sessions, as this is my focus this year. Rob Sewell showed us how to build a threshold monitoring with Pester – a wonderful idea – and connected it to PowerBI for management consumption.
Wonderful conference; I hope to go next year as well.
Right before the Saturday, Microsoft announced the withdrawal of their support for SQLSaturdays. What would this mean for the next events? I couldn’t find an official statement from Microsoft to share, but the message came across: SQL Server isn’t hot anymore. The real bets are on different cards, such as artificial intelligence and big data.
But there was nothing of this impending doom on that sunny Saturday. We were offered 35 sessions on subjects ranging from Business Intelligence, Azure, AI & Machine Learning and performance to specific subjects such as “U-SQL custom extractor for multi-gigabyte XML”, plus reminders that we should learn about PowerShell (“dbachecks – The PowerShell Module to validate your SQL Estate”) or Extended Events (“Simplifying XEvents Management with dbatools”).
A new kid on the block is the GDPR. Privacy is well valued in this new law, so far so good, but implementing it is a complex task with unknown unknowns along the way. Even Brent Ozar freaked out and no longer sells his services to inhabitants of Europe. “GDPR, A European Horror Story” by John Martin gave a nice overview of the problems to come. He urged us to follow the recent news about Apollo.io, a spam-factory sales-engagement startup, whose breach this summer may lead to the first serious test case of the GDPR after Europeans found their data via haveibeenpwned.com.