Note: Once again I accidentally scheduled a technical article to publish on a holiday. In honour of Canadian Thanksgiving I am republishing this article on Tuesday. -MDG
One of my clients called and asked me why some of their servers were running so terribly slowly. Actually, that’s not entirely true… they told me that they were working on Server1, Server2, and Server3 and all three needed more CPUs and more RAM. Because we live in a virtual world, this is easy enough to do. It took me all of five minutes to accomplish this for the three servers, and that included the time it took me to walk back to my desk via the coffee machine.
I did not respond so hastily when over the course of the next few weeks I was asked to increase the resources again… and again. “What are you guys doing, trying to run NASA?” “No, we are developers working on our tools, and they are just too slow!”
Rather than increase the resources again, I decided to do some investigating. I wanted to see why these computers (servers with 12GB of RAM and 2 quad-core virtual CPUs) were running so slowly… and yes, I checked to make sure that it was not just greedy users who wanted more, more, more. The computers really were running – no, that is the wrong term – they were crawling, far slower than they should have been.
After checking several possibilities over the next few days I figured out that somebody had taken VM snapshots of these servers – rogue VM snapshots, because there actually is a written company policy about the proper and acceptable use of VM snapshots – months earlier, and they had just continued to grow… like mold.
The way VM snapshots work – and I should mention at this point that they work about the same in VMware as in Hyper-V – is that the virtual memory and hard drive files are paused and made read-only, and delta files are created for both. From that point on, every change is written to the delta files rather than to the originals. You will not see any difference from within the virtual machine – the memory will continue to work as it had, as will the hard drive – but the files that comprise the virtual machine will change.
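To make the base-plus-delta idea concrete, here is a toy model in Python. It is only a sketch under loud assumptions: the class name `SnapshottedDisk` and the block-number dictionary are mine for illustration, and real delta formats in VMware and Hyper-V are far more involved than this.

```python
class SnapshottedDisk:
    """Toy model of a virtual disk with one snapshot: a frozen base plus a delta.

    Hypothetical illustration only; not how any hypervisor stores its files.
    """

    def __init__(self, base_blocks):
        # Once the snapshot is taken, the base is treated as read-only.
        self._base = dict(base_blocks)
        # Every write made after the snapshot lands here instead.
        self._delta = {}

    def write(self, block_no, data):
        # The base is never touched; the change goes to the delta.
        self._delta[block_no] = data

    def read(self, block_no):
        # Check the delta first, then fall back to the frozen base.
        if block_no in self._delta:
            return self._delta[block_no]
        return self._base[block_no]

    def delta_blocks(self):
        # The delta grows with every distinct block written after the snapshot.
        return len(self._delta)


disk = SnapshottedDisk({0: "boot", 1: "app v1", 2: "data"})
disk.write(1, "app v2")   # changes an existing block
disk.write(3, "logs")     # adds a new block

print(disk.read(1))         # app v2 (served from the delta)
print(disk.read(0))         # boot (served from the base)
print(disk.delta_blocks())  # 2
```

From inside the virtual machine nothing looks different, because every read still returns the right data. Outside it, the delta keeps collecting changes for as long as the snapshot exists.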
The snapshot file will continue to grow… and grow… and grow. As you can see from this image, the file is at about 12.5GB in size. Not too bad, right? Well look at this:
Did I forget to mention that while the Virtual Memory snapshot file is shown in Datastore Browser, the actual delta files are not (just like the Flat files are hidden)? This is what we see when we connect to the host and look at what is going on under the hood.
This VM Snapshot is less than an hour old. Over time the file will grow… to ridiculous sizes. And yes, eventually your virtual machine will slow down… and then crawl… and then, eventually, it might stop. However, if you were to look at your performance monitors, both from within and from outside the virtual machine, the performance baselines will look perfectly normal. The performance of course will not, and that is where things get dicey.
So Why Use Them?
Virtual Machine Snapshots (or Checkpoints, as Microsoft has taken to calling them) are a great tool when used responsibly. They should never be considered a long-term solution to anything. What they are is a great way to step forward into the unknown… you have a change to make, a patch to apply, a program (or even an operating system) to upgrade, and you are worried that something will corrupt. Before going ahead with the change you can take a VM Snapshot, make the change, and once you have confirmed that it worked you can delete the snapshot. If the change did indeed hork something, you can revert to the moment in time before you started, and all is good.
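The two possible endings, revert or delete, can be sketched with a pair of toy Python functions. This is a simplified illustration and not the hypervisor's actual procedure; `revert`, `delete_snapshot`, and the dictionaries standing in for disk contents are hypothetical names of my own.

```python
def revert(base, delta):
    """Go back to the snapshot: the delta is thrown away, the base is untouched."""
    return dict(base)


def delete_snapshot(base, delta):
    """Keep the changes: fold the delta into the base, newer values win."""
    merged = dict(base)
    merged.update(delta)
    return merged


base = {"app": "v1", "config": "old"}   # state when the snapshot was taken
delta = {"app": "v2"}                   # the upgrade written afterwards

print(revert(base, delta))           # {'app': 'v1', 'config': 'old'}
print(delete_snapshot(base, delta))  # {'app': 'v2', 'config': 'old'}
```

Note that deleting a snapshot does not discard your changes; it merges them back into the original files. The longer the snapshot has lived, the more there is to merge.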
…But don’t keep them longer than you need to!
I mentioned that the client in question has a written company policy about the proper and acceptable use of VM snapshots. That is for a couple of reasons:
- If you follow the policy, you don’t just take a snapshot – you name it and make notes.
- When only one person takes the snapshots, that person can keep a diary of what snapshots there are; they can know who requested them, and they can then follow up with the requesting party to make sure they can be deleted.
When rogue administrators (Have I mentioned before how I loathe letting anyone who doesn’t need administrative rights have administrative rights?) take snapshots without following the proper procedures – which include deleting the snapshots when they are no longer needed – you will run into problems. However, when the proper policy is followed, this will never become an issue.
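A snapshot diary of the kind described above does not need to be fancy. The following Python sketch is one possible shape for it, and everything in it is an assumption of mine rather than the client's actual policy: the `SnapshotEntry` fields, the sample entries, and the three-day follow-up threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SnapshotEntry:
    vm: str
    name: str           # the snapshot gets a name...
    notes: str          # ...and notes, as the policy requires
    requested_by: str   # who to follow up with
    taken_on: date


def due_for_follow_up(register, today, max_age_days=3):
    """Return entries older than max_age_days, oldest first.

    max_age_days=3 is an arbitrary illustrative threshold.
    """
    cutoff = today - timedelta(days=max_age_days)
    stale = [entry for entry in register if entry.taken_on < cutoff]
    return sorted(stale, key=lambda entry: entry.taken_on)


# Hypothetical diary entries.
register = [
    SnapshotEntry("Server1", "pre-patch", "Before monthly patches", "alice", date(2024, 1, 2)),
    SnapshotEntry("Server2", "pre-upgrade", "Before tool upgrade", "bob", date(2024, 1, 9)),
]

for entry in due_for_follow_up(register, today=date(2024, 1, 10)):
    print(f"{entry.vm}: '{entry.name}' requested by {entry.requested_by}, taken {entry.taken_on}")
```

With the sample data, only the Server1 snapshot is flagged, because it is the only one older than the threshold on the chosen date. The point is simply that each snapshot has an owner and a date, so somebody knows when to ask whether it can be deleted.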
VM Snapshots: Good or Bad?
Just like any potentially dangerous tool, the answer is both good and bad. When used properly they are great, but with time they become rotten to the core.
How do I know if I have them?
If you spend any amount of time in vCenter, you know that there is no simple way to determine what VM snapshots are in your environment… short of going into the Snapshot Manager for each VM and checking. However, if you are an avid reader of this blog you may have caught an article I wrote a little over a year ago called How do YOU Manage?. In it I mentioned a tool I love called RVTools. Among the myriad reports it will generate for you is one called vSnapshot, and when you use it while connected to your vCenter environment it will list all of the snapshots you have.
You can download it from http://robware.net/. While it is free (Rob calls it ‘Nice to haveware’) there is a Donate button, and although it is in Dutch, it will allow you to donate through PayPal. I just did by the way… as a way of saying Thanks! to Rob for the hard work he puts into it that I was then able to benefit from!
If you use PowerCLI (also discussed in the article) there is a way to get the same information in PowerShell, which is:
Get-VM | Get-Snapshot | Format-List
…And for those of you running System Center Virtual Machine Manager and not vCenter Server, there is a PowerShell script for you too. It is available here, and is a free download from the TechNet Script Library.
I have been telling people for years that Snapshots/Checkpoints are good but dangerous. As I always say: If you cannot measure it, you cannot manage it. Using these tools will allow you to measure, manage, and then eliminate VM Snapshots in a timely manner… before they become a problem.