Tune a query from 5 hours to 10 seconds

The other day I had to tune a query that went awry when it was moved to new hardware. The old runtime was around one hour, but on the newer hardware it had ballooned to 5 hours and 20 minutes. The main reason I could think of was that the new hardware was hosted on our new IaaS solution, and as usual the storage engineers had designed for availability, not performance. The CrystalMark numbers were half those of our own old hardware.
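Before diving into the query itself, you may want to put numbers on that storage claim. A minimal sketch, measuring the average I/O latency per database file from inside SQL Server (sys.dm_io_virtual_file_stats accumulates since the last restart, so compare like with like between the old and the new hardware):

SELECT
    DB_NAME(vfs.database_id)                                    AS database_name,
    mf.physical_name,
    1.0 * vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)   AS avg_read_latency_ms,
    1.0 * vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0)  AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
   AND mf.file_id     = vfs.file_id
ORDER BY avg_read_latency_ms DESC;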

Join me for this tuning demo (level 300+):

click for video

Your next skillset – it’s not IT

Until recently, if you wanted to work on your professional development as an IT worker, you only had to follow the certification path of your vendor. After all, it's the hardcore technical skills that will land you your next job.

Certifications, combined with experience, are enough for a successful career. What more do you need? Of course, as your skills improve, you steer into the category where consulting skills and soft skills are required. Good communication increases your effectiveness. We are all familiar with the examples of IT professionals who certainly make a valuable contribution to the team, but whom you would rather not send to a customer. Fine craftsman, but keep him indoors.

But if you want to grow as a professional, how do you do it? What if an extra certificate does not move your career forward, and you encounter roadblocks of a different kind? For some of these, there are well-known self-help books, such as Getting Things Done or The Seven Habits of Highly Effective People. There are also many webcasts and training programs aimed at social and communication skills. These can offer eye-openers, but they also have their limitations. They do not know your situation, your company culture, your character, the local cultural chemistry, the politics and the hidden agendas. You have to sharpen your skills through osmosis, and there is no dialogue at all.

This problem has grown significantly in recent years, at least for me. With the cloud, many new technologies have come within reach, companies are being disrupted or are trying to stay ahead with a digital transformation, and the role of the DBA is being redefined. These three developments move at a different pace at every company, and that can lead to substantial growing pains. Even within a company, people dwell on different parts of the hype cycle. This causes me substantial problems.

Personal guidance is indispensable, but how do you obtain it? Asking for help and feedback from the people around you can be difficult, especially when the environment is strongly political, or when there simply are no conversation partners available. A few leading IT people are trying to show us the way here. Brent Ozar has his Level 500 Guide to Career Internals. It's a webcast, so the communication is still a one-way street. Another interesting offer came from Paul Randal: Personal Guidance by Email.

Having a well-informed discussion partner in this rapidly changing world is invaluable. When technical challenges are no longer real challenges and you have to determine your position in these volatile times, personal mentorship is the thing that can make the difference. Let’s hope more mentors will rise to this demand.

Becoming a cloud DBA (II)

Cost driven cloud movements

One of the advantages of the cloud is that it offers a build-or-buy decision for your database infrastructure. For some companies, stuck with legacy systems and outdated hardware, the prospect of a greenfield deployment seems very appealing. Let's get rid of the old junk and enter the rocky road of transformation, possibly under the supervision of a third party, for advice or for eventual blame gaming.

It's true that you can get into the cloud like this and save money, but the same forces of mismanagement that created the 'old junk' are still present when moving to the cloud. And mismanagement in the cloud can be a costly 'learning experience'. Or, looking at it the other way around: the better you are in control, the less appealing the cloud is money-wise. For example, Dropbox started out on Amazon, but decided it was better off building its own infrastructure, saving $75 million over two years. That is the tongue-in-cheek observation, or paradox: the better you are on-premises, the less the cloud has to offer. And the more the cloud has to offer, the less you can take advantage of it.

Furthermore, the law of conservation of misery dictates the presence of at least some disadvantages, and the cloud is no exception. Clouds can go down for several hours in high-profile incidents, calling the promised uptimes into question. Or they can suffer from capacity problems, as some reported that "Azure seems to be full". But apart from the single-point-of-failure problem, some databases will run into problems in the cloud and are exported back to on-premises machines, a move called repatriation. The reasons for such a repatriation can be financial or technical. As an example of the latter: the cloud can be prone to the noisy neighbour effect, and certain workloads won't cope well with that. Or the network latency to the cloud is prohibitive for certain applications.
Surely this can be remedied with a more expensive tier in the cloud, but the on-premises datacenter comes back into scope once the ROI of these solutions is calculated.

Gartner puts it this way: "What remains on-premises are business processes that are mission-critical and require greater oversight and more detailed levels of control than is available via cloud infrastructure and hosted models."

So, what’s the DBA role here?

If money is the main motivator, then TCO calculations are on your task list. For you as a DBA it revolves around the data placement strategy, and there is a whole information war going on about the TCO involved. "Microsoft's cost comparisons are misleading", says this article from Amazon ("Fact-checking the truth on TCO for running Windows workloads in the cloud"). Or have a look at Amazon's TCO calculator: simply insert the number of vCPUs and the amount of memory and the cost comparison is made. The omission of the virtual-to-physical CPU ratio is an apparently missing vital element. Be prepared to face a lot of TCO gaming out there, and be sure to understand the revenue model of your cloud vendor.

You need to be able to measure how much bang you actually get for your buck in the cloud. As I showed in one of my previous posts, a lot of DBAs are not impressed with cloud performance. Therefore, your second task will be benchmarking the cloud. This can be a science on its own, but you may get by with capturing and replaying a workload with the open-source toolset WorkloadTools, of which I'll post some of my test results in a later post. Be advised that, apart from measuring the capacity of the cloud, latency may also play a role in your application. If you need to go deeper into benchmarking than a replay, you're into serious reading such as this one.

Conclusion

"The cloud is just someone else's computer" – and this surely rings true in this scenario. Money is the main motivator, so making TCO calculations and benchmarks is on your task list. As most companies will have a hybrid cloud/on-premises environment, you therefore need to be able to advise where to place the workloads. And be advised that capacity planning is different too. In the on-premises scenario, you can calculate your capacity requirements every few years and add some healthy overcapacity to the mix. In the cloud, you will have to revise the capacity every few months, depending on the volatility of your environment. You may want to automate the reporting of resource consumption in the cloud, so hone your skills in PowerShell, Power BI, Bash or any other favorite scripting tool. Tasks to be automated are on their way to you.

Becoming a cloud DBA (I)

Rule 1: Know why your company goes to the cloud

"The cloud is going to change the role of the DBA". That message is omnipresent in webcasts, at conferences and perhaps at your company as well. It is presented as the big gamechanger, making you move up the value chain and leave behind the chores of patching, setting up high availability systems or writing backup strategies. Or will it?

There are certainly some changes in the DBA role and tasks, but to what extent depends on several factors. As you may have noticed, the role of the DBA is very broad, and every company draws its own boundaries of responsibility for the DBA. The cloud won't change that, but there are still some main roads laid out for you. An important one is your company's motivation to go to the cloud. For simplicity, I will consider two basic reasons. The first is the idea that the cost of the cloud is lower than that of the on-premises machines.

Some companies are stuck with old, deprecated hardware and stuffed datacenters, and therefore look forward to happily leaving the legacy behind and enjoying the OPEX model (that is, pay as you go, with no upfront hardware costs). As their datacenters are maxed out with hardware that has exceeded its economical life, the cloud presents itself as an easy way out without a costly transformation.
The second main reason is a more organizational one: the struggling company is stuck with many political trench lines running through the IT department, with Windows admins, network admins, DBAs and developers as the warring factions in an ongoing trench war.
How wonderful would it be if all the infrastructure were maintained by a single someone else? Some place where all the infrastructure is available with a few clicks and all political lines are redrawn? It exists: it's the cloud! Yes, the cloud promises to be the crowbar that gets organizational changes done.

Of course there are other reasons to go to the cloud. For example, several companies fear a disruptive digital innovation in their market and realise that their IT department needs to switch from a business-supporting role to a business-driving one. But for the sake of simplicity (and to keep the focus on the DBA role) I will only look at the two aforementioned reasons to go to the cloud, which I will dub "lower costs" and "vision".

It is important for the DBA to know which of these two settings (s)he is in, as this will determine the tasks and skills that are needed. We'll have a look at the different roles in both scenarios in later posts.

Apart from the cost factor, there is another element that influences the DBA role, and that is whether you have paying customers, internal or external, or not. An internal customer is, for example, a different department in your company, such as the Business Intelligence team, which can buy more horsepower for their data warehouse from you. Or you have external customers, much like the cloud vendors themselves. Either way, if you do have paying customers, you're a cloud reseller and you buy the resources to create your portfolio. Alternatively, if you have no paying customers but instead need to service the database needs with a fixed budget, this will reflect on your choices and tasks in the cloud.

In the coming posts, we’ll see how.

Continue at Part II: Cost Driven Cloud Movements

What I do better than the cloud

SQL Server guru Brent Ozar asked the DBAs an interesting question in one of his blog posts:

what do you think you do better than the cloud?

It's a good question, and it lies at the heart of the transformation from production DBA to cloud DBA. Where does the cloud fall short, and where can the DBA make a difference? What is the place of the cloud DBA, and where can he add extra value?

Some of the respondents took the liberty to discuss the shortcomings of the cloud, as the difference between hype and reality had become obvious to them, so the answers also give an insight into the issues encountered by DBAs who work with cloud technologies.

Most remarks were about performance and cost control. In third place came 'nothing' as what they could do better. Further down the ranking were lower system visibility, security concerns, support quality and so on.

As for performance, the DBAs are not impressed: "..our local data center blows Azure out of the water." It seems that the virtual-to-physical CPU ratio isn't as good in the cloud as it is on-premises. In your own datacenter you can check this ratio and correct any over-commitment, but in the cloud you're stuck with what you're given. Of course you can buy yourself extra performance up to the equivalent of your large on-premises machine, but the costs quickly become prohibitive. It is for this very reason that performance management is an area where the cloud DBA can save money. As Brent Ozar was told in a conversation with a CTO:

"..the CTO told me, 'I thought I'd save money in the cloud by not having a DBA, but what I'm learning is that in the cloud, I actually get a return on my DBA investments.'"

In second place is cost management. As one respondent noted: "Make a plan on prem and run with it for 5 years. Make a plan in the cloud and you are revisiting it every 8-10 months". There are indeed many moving parts: price structures can change, extra load may require the purchase of heavier machines, and there is "constant resource usage monitoring and $$ discussions". From my own experience, I agree with the comment "many people know too little about costs and paradigm shifts which make for some huge intangible but very real costs to just up and move". And this is where the cloud DBA steps in.

So?

So, back to the question: what can the DBA do better than the cloud? From the potpourri of answers it seems that DBAs are in a position to evaluate cost versus performance, or 'bang for the buck'. This is backed by studies in the field:

“Businesses increasingly rely on DBAs to help them understand the pricing implications and ROI of moving to the cloud.”

https://www.snowflake.com/blog/3-trends-that-are-transforming-the-role-of-the-dba/

“… seasoned DBAs are seen as key influencers in the overall data management infrastructure and their input is sought for decisions such as when to move data to the cloud.”

http://www.dbta.com/Editorial/News-Flashes/New-2018-Study-Identifies-Emerging-Trends-in-Database-Administration-122639.aspx

Conclusion

The foregoing shows that the DBA, in the new role of cloud data asset manager, can add a lot of value. However, it takes one crucial element to achieve this: the company must realize that cost is a design parameter. Without this, any cloud adventure may end up pretty costly. It's the task of the DBA to convey this message.

Changing the table pagecount

When SQL Server needs to create an execution plan to retrieve data from a table, it checks whether it can use an index. Sometimes it will choose an index, sometimes a full table scan. The threshold for this decision is not based on the portion of rows, but on the portion of 8 KB pages of the table. An old rule of thumb says that if around 30% of the pages of a table have to be retrieved, the query will not use a nonclustered index but a table scan.
You can tamper with the statistics on the number of pages and rows with a little-known clause of the UPDATE STATISTICS command:

UPDATE STATISTICS <schema>.<tablename> WITH PAGECOUNT = 1000000

I recently used this in a case where a vendor application contained a bad SQL statement that didn't use the index. The table was deemed too small for index usage by SQL Server's optimizer, but the resulting table scan led to extensive blocking in a split-second environment. Being a vendor application, I couldn't change the SQL statements, and after examining some alternatives a plan guide was considered the appropriate solution. So I changed the table statistics to report a large page count, and the optimizer was now convinced that an index was best. I captured the new execution plan, with the index usage, into a plan guide (a how-to can be found here).
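For reference, a rough sketch of the sequence. The table name dbo.SmallLookupTable and the plan guide name are made-up placeholders, not the vendor's actual objects:

-- 1. Make the table look big to the optimizer (test-only clause of UPDATE STATISTICS)
UPDATE STATISTICS dbo.SmallLookupTable WITH PAGECOUNT = 1000000;

-- 2. Run the vendor query once so the new, index-using plan lands in the plan cache,
--    then look up its plan handle
DECLARE @handle varbinary(64), @offset int;
SELECT TOP (1) @handle = qs.plan_handle, @offset = qs.statement_start_offset
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE '%SmallLookupTable%'
  AND st.text NOT LIKE '%dm_exec_query_stats%';   -- exclude this lookup itself

-- 3. Freeze the cached plan in a plan guide
EXEC sp_create_plan_guide_from_handle
     @name = N'pg_SmallLookupTable_seek',
     @plan_handle = @handle,
     @statement_start_offset = @offset;

-- 4. Reset the faked page count afterwards
DBCC UPDATEUSAGE (0, 'dbo.SmallLookupTable');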

To reset the page count, use the command:

DBCC UPDATEUSAGE (0, '<schema>.<tablename>')

This technique can also help you find the threshold where the plan changes (the so-called 'tipping point'). See Kimberly Tripp's post for an intro: https://www.sqlskills.com/blogs/kimberly/the-tipping-point-query-answers/

Mind you, the tipping point depends on many variables, such as MAXDOP settings, the SQL Server version, the optimizer version, et cetera. You will need some time in the lab when you want to find the tipping point for your table.
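As a small helper for that lab work, here is a sketch that shows the row and page counts currently recorded in the metadata for a table (dbo.MyTable is a hypothetical placeholder):

SELECT
    OBJECT_NAME(ps.object_id)  AS table_name,
    ps.row_count,
    ps.in_row_data_page_count  AS data_pages
FROM sys.dm_db_partition_stats AS ps
WHERE ps.object_id = OBJECT_ID('dbo.MyTable')
  AND ps.index_id IN (0, 1);   -- heap or clustered index only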

Set identities on MS SQL Server

If you set up replication – for example, a snapshot replication to refresh the ACC environment with PROD data – the identity values need to be corrected for the applications to run correctly.

The code below checks the identities and gives the syntax to correct the identity values:

SELECT
    IDENT_SEED(TABLE_SCHEMA + '.' + TABLE_NAME)    AS Seed,
    IDENT_INCR(TABLE_SCHEMA + '.' + TABLE_NAME)    AS Increment,
    IDENT_CURRENT(TABLE_SCHEMA + '.' + TABLE_NAME) AS Current_Identity,
    TABLE_SCHEMA + '.' + TABLE_NAME                AS table_name,
    'DBCC CHECKIDENT(''' + TABLE_SCHEMA + '.' + TABLE_NAME + ''', RESEED, '
        + CAST(IDENT_CURRENT(TABLE_SCHEMA + '.' + TABLE_NAME) + 1 AS VARCHAR(10)) + ')' AS Command
FROM
    INFORMATION_SCHEMA.TABLES
WHERE
    OBJECTPROPERTY(OBJECT_ID(TABLE_SCHEMA + '.' + TABLE_NAME), 'TableHasIdentity') = 1
    AND TABLE_TYPE = 'BASE TABLE'
    AND TABLE_NAME NOT LIKE 'Mspeer%'
    AND TABLE_NAME NOT IN ('sysarticles', 'syspublications')
ORDER BY TABLE_SCHEMA, TABLE_NAME

sample output

Viewing deadlock info

To view deadlock information in SQL Server (2012 and later), you need the VIEW SERVER STATE permission. The deadlock reports are captured by the built-in Extended Events session called system_health:

Double-click the event_file

Open this event_file; it may take a while for all events to load from the file. After that, from the Extended Events menu, choose Grouping and group on NAME.

Click OK and the events are grouped by, well, name:

87 deadlocks found

Open up the xml_deadlock_report group and choose one of the deadlock reports. Click on the DEADLOCK tab to get the graph:

If you click on the DETAILS tab, you can double-click on the VALUE to get the XML with all the deadlock details.
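If you prefer T-SQL over clicking through the UI, here is a minimal sketch that reads the deadlock reports straight from the system_health .xel files (it needs the same VIEW SERVER STATE permission):

SELECT
    ed.value('(event/@timestamp)[1]', 'datetime2')  AS deadlock_time,
    ed.query('(event/data/value/deadlock)[1]')      AS deadlock_graph
FROM (
    SELECT CAST(event_data AS xml) AS ed
    FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL)
    WHERE object_name = 'xml_deadlock_report'
) AS x
ORDER BY deadlock_time DESC;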

Execution Plans I: setup

In this post I will explain how to reproduce production execution plans on your developer laptop for SQL Server:

(1) Download and install the SQL Server Management Studio add-in Plan Explorer from SentryOne.

(2) Export the database objects from production, including statistics and histograms, as follows:

Generate DB objects
Press the Advanced Button
switch on Script Statistics and Histograms

(3) Create the database on your laptop and make sure to switch off automatic statistics creation in the database properties:
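The checkboxes in the database properties correspond to these database options; a sketch, assuming the clone database is called StatsOnlyClone:

ALTER DATABASE StatsOnlyClone SET AUTO_CREATE_STATISTICS OFF;
ALTER DATABASE StatsOnlyClone SET AUTO_UPDATE_STATISTICS OFF;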

(4) Copy parallel processing configuration parameters from production:

EXEC sp_configure 'cost threshold for parallelism', x   -- use the value from production
EXEC sp_configure 'max degree of parallelism', x        -- use the value from production
RECONFIGURE

You are now ready to check the execution plans of your new SQL statements in your new environment. If you look at the table sizes (with this query) you will see that your new tables have the statistics in place, but that all tables are actually empty:

Example from AdventureWorks Database

In the next post we'll have a look at the execution plans.

Instability in the cloud

A short CSI story with a hidden suspect

The other day I had an interesting problem involving an unstable SQL Server AG cluster on a virtual machine (EC2) in Amazon's AWS. It appeared that if a data integrity job and an index rebuild job overlapped, the node simply went unresponsive and got evicted.
A failover kicked in, and we could only revive the primary node by rebooting it and reverting the node eviction. What on earth could have caused this? I had never seen a server go down because of a standard Ola maintenance job on an RDBMS, so the troubleshooter in me was intrigued by what was going on. Such a severe and reproducible problem didn't seem like a small bug to me; instead it seemed we had hit a misunderstood concept.

How it started

So let me tell you what happened on day 1, when we first got a notification of the node failover. At first we didn't know what caused the breakdown, so Pavlov made us look at the various log files to collect diagnostic data and come up with possible reproduction scenarios. (You might have a look at Microsoft's Tiger Toolbox for failover analysis here.)

We saw a few telling entries and some red herrings, such as the loss of connection with the witness server. The primary node lost all network connectivity, but wasn't that simply a sign of resource starvation?
Also, the problem seemed to occur after the Windows OS had received a patch. That got us sidetracked for a while before we ruled it out as a coincidence.

The first solid clue came when we noticed that the problem occurred around the same time in the evening. There was little activity on the database, except for a daily index rebuild and corruption check, as demanded by the vendor of the application. Surely an index rebuild couldn't lead to starvation to such an extent that the instance is brought down? Moreover, we had just had the instance increased in both memory and CPU resources by a factor of 4. To make it even more interesting: before the fourfold increase of the server size we didn't have any problems.

Is it the infrastructure?

And there I got my first suspicion: it had to do with the infrastructure footprint we were placing. This came with a little problem: it was hosted in the cloud, where we didn't have access to over-commitment figures, settings and log files of the infrastructure. There were also some cloud-specific details to consider, some of which I had never heard of before. For example, in an unrelated incident, downscaling a node took me a few days because of G2 storage credit starvation(?). Over the years I've read all the fine print there is on SQL Server, but on Amazon, not so much.
This was going to be a learning experience.

Memory starvation

The final hint came from a different source. Someone from cloud ops couldn't start up a database server because this sometimes hit a resource limit. Aha! We're in some sort of resource consumption group; I should have investigated the contract with the cloud provider more closely. I hadn't done that yet. After all, the cloud is like a tap for resources, and in this wonderful brave new scalable world there shouldn't be anything like resource limits, right?

My suspicion was memory starvation, and I noticed some oddities: according to Task Manager, SQL Server took 112 megabytes, but Resource Monitor showed a reserved amount of over 11 GB.
I now had a suspect, guilty by association: the 'Lock Pages in Memory' setting. This is a setting for SQL Server in a virtual machine environment to keep its memory reserved for the database. The so-called balloon driver snatches away unused memory from the virtual machines, which is not a good idea for software like databases that rely heavily on their cache for performance.

Indeed, after removing this setting (in the group policy; for details see here) the crash didn't occur anymore, and both Task Manager and Resource Monitor now showed a consistent picture.
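On SQL Server 2016 SP1 and later you can verify from T-SQL which memory model the instance actually ended up with after changing the policy; a quick check:

SELECT sql_memory_model_desc    -- CONVENTIONAL, LOCK_PAGES or LARGE_PAGES
FROM sys.dm_os_sys_info;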

Lesson?

Being a database engineer, I like to know the performance and setup of the underlying infrastructure. I didn't bother to check this with Amazon, assuming that I would not be able to see any logs or errors on their side. But they do have some fine print, which I need to heed if I want to be able to advise on the best buy in the cloud.

Cloud migration: top 3 mistakes

What are the top 3 mistakes I have seen during cloud migrations?

I have worked for various companies that started a cloud migration project. What struck me is that, although the reasons for going to the cloud differ per company, they seem to hit the same problems. Some of them overcame the issues, but others fell out of the cloud tree, hitting every branch on the way down.

So here is my top 3 of mistakes, from companies that went for a digital transformation to the ones going for a simple IaaS solution (if you allow me to label that as a cloud migration for the sake of this post).

MISTAKE 1: The Big Bang Theory

Let's start off with an important mistake: we're going to the cloud without doing a pilot project, we're going big bang. It goes like this: the upper brass chooses a provider and sets up the migration plan. The opinion of their own technicians is skipped, because they wouldn't agree anyway. Their jobs would be on the line, so their opinion can be discarded as biased. In one case, the company's own staff was deemed inferior, as one manager told me: "We can't win from players who play the Champions League", referring to the cloud provider.

I've seen this run amok in several ways. With IaaS migrations I've seen twice that the performance was far inferior or the service was bad (many outages, without good diagnostic capabilities at the IaaS provider). With cloud migrations aimed at a digital transformation, it turned out that cloud technologies are not yet mainstream. Real expertise is still under construction, and mistakes will be made.

Lesson: you would like to make these kinds of mistakes during a pilot project, not during a mass migration.

MISTAKE 2: The cloud is just an on-prem datacenter

The cloud is a nice greenfield situation and promises a digital transformation unburdened by legacy systems. The services and machines in the cloud are new and loaded with extra bells and whistles. No more dull monolithic architecture, but the opportunities of a microservice architecture. Gradually it is understood that the cloud entails a bit more than this. For example, how are the responsibilities reallocated among the staff, can the company switch to a scrum culture, how cloud-savvy are the architects and technicians? On that last point, I've seen an architect propose to give every application its own database instance. Such a lack of understanding of the revenue model of a provider will make your design of the database landscape fall apart.

Lesson: A cloud transformation without a company transformation in terms of organisation, processes, methods, IT architecture and corporate culture is a ‘datacenter in the cloud’: new hat, same cowboy.

MISTAKE 3: The cloud fixes a bad IT department

Some situations are too sensitive to be discussed openly, so the next mistake was told to me in whisper mode: the quality of the company's own IT department doesn't suffice, and the brave new cloud will lead us to a better IT department. The changes are all good: on-premises datacenters become unnecessary, some jobs such as database engineer are no longer needed, no more patch policies or outages. The cloud works as an automagical beautifier.

In short, managers who were unable to create a satisfactory IT department now have an opportunity to save the day with a cloud migration. They can bring in a consultancy company to migrate to the new promised land and climb the learning curve together before the budget runs out. The IT department used to be a train wreck, but from this controlled explosion called 'cloud migration' something beautiful will grow out of the resulting wasteland. I leave it to your imagination what could go wrong here.

These are the top 3 mistakes I have seen in the field. Surely there are more nuanced issues, such as how cheap and safe the cloud really is, but that is a subject I would like to discuss in a different post.