Instability in the cloud

A short CSI story with a hidden suspect

The other day I had an interesting problem involving an unstable SQL Server AG cluster on a virtual machine (EC2) in Amazon’s AWS. It appeared that if a data integrity job and an index rebuild job overlapped, the node simply went unresponsive and got evicted.
A failover kicked in and we could only revive the primary node by rebooting it and reverting the node eviction. What on earth could have caused this? I had never seen a server go down because of a standard Ola maintenance job on an RDBMS, so the troubleshooter in me was intrigued by what was going on. Such a severe and reproducible problem didn’t look like a small bug to me; it seemed we had hit on a misunderstood concept.

How it started

So let me tell you what happened on day 1, when we first got a notification of the node failover. At first we didn’t know what caused the breakdown, so Pavlov made us look at the different log files to collect diagnostic data and come up with possible reproduction scenarios. (You might have a look at Microsoft’s Tiger Toolbox for failover analysis here)

We saw a few telling entries and some red herrings, such as the loss of connection with the witness server. The primary node lost all network connectivity, but wasn’t that simply a sign of resource starvation?
Also, the problem seemed to occur after the Windows OS had been patched. That got us sidetracked for a while before we ruled this out as a coincidence.

The first solid clue came when we noticed that the problem occurred around the same time every evening. There was little activity on the database, except a daily index rebuild and corruption check, as demanded by the vendor of the application. Surely an index rebuild couldn’t lead to starvation to such an extent that the instance is brought down? Moreover, we had just had the instance increased in both memory and CPU resources by a factor of 4. To make it even more interesting: before the fourfold increase of the server size we didn’t have any problems.
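
Whether two maintenance jobs really ran at the same time is easy to check once their start times and durations are exported from the job history. What follows is only a minimal sketch in Python: the job names, dates and durations are made up for illustration and are not the figures from this incident.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class JobRun(NamedTuple):
    name: str
    start: datetime
    duration: timedelta

    @property
    def end(self) -> datetime:
        return self.start + self.duration


def overlap(a: JobRun, b: JobRun) -> timedelta:
    """Return how long two job runs were active at the same time."""
    latest_start = max(a.start, b.start)
    earliest_end = min(a.end, b.end)
    # A negative difference means the runs did not overlap at all.
    return max(earliest_end - latest_start, timedelta(0))


# Hypothetical runs, not taken from the real incident.
integrity_check = JobRun("integrity check", datetime(2020, 1, 6, 20, 0), timedelta(minutes=90))
index_rebuild = JobRun("index rebuild", datetime(2020, 1, 6, 21, 0), timedelta(minutes=45))

print(overlap(integrity_check, index_rebuild))
```

Running this over a few weeks of job history and lining the overlaps up against the failover timestamps would show whether the crashes follow the overlap or just the time of day.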

Is it the infrastructure?

And there I got my first suspicion: it had to do with the infrastructure footprint we were placing. This came with a little problem: it was hosted in the cloud, where we didn’t have access to overcommitment figures, settings and log files on the infrastructure. Also, there were some cloud-specific details to consider, some of which I had not heard of before. For example, in an unrelated incident a downscaling of a node took me a few days because of G2 storage credit starvation(?). Over the years I’ve read all the fine print there is of SQL Server, but not that of Amazon.
This was going to be a learning experience.

Memory starvation

The final hint came from a different source. Someone from cloudops couldn’t start up a database server because it sometimes hit a resource limit. Aha! We were in some sort of resource consumption group. I should have investigated the contract with the cloud provider more closely, and I hadn’t done that yet. After all, the cloud is like a tap for resources and in this wonderful brave new scalable world there shouldn’t be something like resource limits, no?

My suspicion was memory starvation, and I noticed some oddities: according to Task Manager, SQL Server took 112 megabytes, but Resource Monitor showed a reserved amount of over 11 GB.
I now had a suspect, guilty by association: the ‘Lock Pages in Memory’ setting. This is a setting for SQL Server in a virtual machine environment to keep the memory reserved for the database. The so-called balloon driver process snatches away any unused memory from the virtual machines, which is not a good idea for software like databases that rely heavily on cache for their performance.
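
The gap between the two tools is what locked pages look like from the outside: memory allocated as locked pages is not counted in the working set that Task Manager shows. A small sketch of that arithmetic in Python follows. It assumes the working set is read from Task Manager and the locked figure from the locked_page_allocations_kb column of sys.dm_os_process_memory; the class and function names are hypothetical, and the numbers only roughly mirror the ones above.

```python
from dataclasses import dataclass


@dataclass
class ProcessMemory:
    working_set_kb: int              # what Task Manager reports for the process
    locked_page_allocations_kb: int  # assumed to come from sys.dm_os_process_memory


def hidden_memory_share(mem: ProcessMemory) -> float:
    """Fraction of the process memory that the working set does not show."""
    total_kb = mem.working_set_kb + mem.locked_page_allocations_kb
    if total_kb == 0:
        return 0.0
    return mem.locked_page_allocations_kb / total_kb


# Illustrative values: about 112 MB working set and about 11 GB of locked pages.
observed = ProcessMemory(
    working_set_kb=112 * 1024,
    locked_page_allocations_kb=11 * 1024 * 1024,
)

print(f"{hidden_memory_share(observed):.1%} of the memory is invisible in Task Manager")
```

With figures like these, nearly all of the memory held by the instance sits outside the number most people look at first.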

Indeed, after removing this setting (in the group policy, for details see here) the crash didn’t occur anymore and both Task Manager and Resource Monitor now showed a consistent picture.

Lesson?

Being a database engineer, I like to know the performance and setup of the underlying infrastructure. I didn’t bother to check this out with Amazon, assuming that I would not be able to see any logs or errors on their side. But they do have some fine print, which I need to heed if I want to be able to advise on the best buy in the cloud.

Cloud migration: top 3 mistakes

What are the top 3 mistakes I have seen during cloud migrations?

I have worked for various companies that have embarked on a cloud migration project. What struck me is that, although the reasons for going to the cloud differ per company, they seem to hit the same problems. Some of them overcame the issues, but others fell out of the cloud tree, hitting every branch on the way down.

So here is my top 3 of mistakes, from companies that went for a digital transformation to the ones going for a simple IaaS solution (if you allow me to label that as a cloud migration for the sake of this post).

MISTAKE 1: The Big Bang Theory

Let’s start off with an important mistake: we’re going to the cloud without doing a pilot project, we’re going Big Bang. It goes like this: the upper brass chooses a provider and sets up the migration plan. The opinion of their own technicians is skipped, because they wouldn’t agree anyway. Their jobs would be on the line, so their opinion can be discarded as biased. In one case, the company’s own staff was deemed inferior, as one manager told me: “We can’t beat players who play in the Champions League”, referring to the cloud provider.

I’ve seen this run amok in several ways. With an IaaS migration I’ve twice seen that the performance was way inferior or the service was bad (many outages without good diagnostic capabilities at the IaaS provider). With cloud migrations aimed at a digital transformation it turned out that cloud technologies are not yet mainstream. Real expertise is still under construction, and mistakes will be made.

Lesson: One would like to make these kinds of mistakes during a pilot project, not during a mass migration.

MISTAKE 2: The cloud is just an on-prem datacenter

The cloud is a nice greenfield situation and promises a digital transformation unburdened by legacy systems. The services and machines in the cloud are new and loaded with extra bells and whistles. No more dull monolithic architecture, but opportunities for a microservice architecture. Gradually it is understood that the cloud entails a bit more than this. For example, how are the responsibilities reallocated among the staff, can the company switch to a scrum culture, how cloud-savvy are the architects and technicians? On that last point, I’ve seen an architect propose that every application should have its own database instance. Such a lack of understanding of the revenue model of a provider will make your design of the database landscape fall apart.

Lesson: A cloud transformation without a company transformation in terms of organisation, processes, methods, IT architecture and corporate culture is a ‘datacenter in the cloud’: new hat, same cowboy.

MISTAKE 3: The cloud fixes a bad IT department

Some situations are too sensitive to be discussed openly, so the next mistake I understood from people in whisper-mode: the quality of the company’s own IT department doesn’t suffice and the brave new cloud will lead us to a better IT department. The changes are all good: on-premises datacenters are unnecessary, some jobs such as database engineers are not needed anymore, no more patch policies or outages. The cloud works as an automagical beautifier.

In short, managers who were unable to create a satisfactory IT department now have an opportunity to save the day with a cloud migration. They can bring in a consultancy company to migrate to the new promised lands and climb the learning curve together before the budget runs out. The IT department used to be a train wreck, but due to this controlled explosion called ‘cloud migration’ something beautiful will grow from the resulting wastelands. I leave it to your imagination what could go wrong here.

These are the top 3 mistakes I have seen in the field. Surely there are more nuanced issues, such as how cheap and safe the cloud really is, but that is a subject I would like to discuss in a different post.

Being a cloud DBA

“Database administration is a dying breed; don’t go there”; that was the advice I received twenty-one years ago. It is a popular idea that springs up from time to time. The latest argument came from the cloud: database services are well-automated, so who needs a good DBA?

I have seen two main arguments for the diminishing role of the DBA, repeated in blog posts and first-hand conversations.
First: new technology makes the DBA obsolete. The latest databases are self-tuning, the administration is automated, and relational databases will be taken over by NoSQL or Hadoop, MongoDB, XML, object-oriented databases, the cloud.

This argument fades away as soon as that new technology falls from its hype cycle. NoSQL didn’t replace the RDBMS, administration tasks are more difficult in an ever-increasing landscape, and the cloud merely shifts the work to optimisation, capacity management and tasks higher in the value chain.

The second argument stems from an underestimation of the DBA work: “We don’t need DBAs because our developers know enough about databases”. I have heard similar statements during job interviews. When I asked why a DBA role was being considered by this company I was told that “the developers don’t have time anymore for DBA work”. This was a software house where no one needed a DBA until they realised they NEEDED a DBA. A need for a firefighter role to fix the data model, performance or data corruption. A need for a firefighter who would silently do his work without causing any delay in the software build process, and bring about fundamental changes in the structure without using up other resources. There was no ambition to raise the maturity level of the build process, no vision on operational intelligence or business intelligence: innovation was confined to application features. For efficiency and control over the build process they used Scrum; that would suffice.

There is one funny moment from that job interview I would like to share with you. After hearing their situation, I asked them if they had a lot of incidents and if they thought that was part of the deal of writing software. I forgot the exact answer they gave, but I didn’t forget that the IT manager interviewing me was called away ... for an incident.
I concluded the interview without seeing him again.

To advocate or not?

SQL Server guru Brent Ozar had an encounter with a CTO who said “I thought I’d save money in the cloud by not having a DBA, but what I’m learning is that in the cloud, I actually get a return on my DBA investments.” Surely, for performance projects in the cloud, the euros to be picked up are visible to everyone. But streamlining and compacting a database landscape for better agility is reserved for good technical leadership that is aiming for a mature company. The central question for a company is: how do you see your DBA?