If you have been reading my blog over the past few weeks, you know that I have converted my beloved HP ProLiant DL585 G2 into a hybrid virtualization host/SAN device by installing Microsoft Windows Server 2008 R2 SP1 with the free Microsoft iSCSI Software Target 3.3. Although the host has nearly 750GB of storage, I did not allocate that much for the target – I was going to have a number of virtual machines that were not highly available (non-HAVMs) that would account for some of the storage, and I just did not think that my first software target needed that much. I created two virtual disks on my RAID-5 array, one of 128GB and one of 8GB. I created Cluster Shared Volumes and placed my first two HAVMs (highly available virtual machines) on my cluster… easy peasy!
For the past couple of weeks I have been ignoring alerts from HP Insight with Microsoft System Center Essentials that I was running very low on storage on one of my drives. Never do that. This morning I came into the office to find that both HAVMs on that SAN target were in a paused state, and I did not have to look any further to know what had happened. I had assumed that the alerts meant my array was near capacity, and I paid no attention because I knew that all of the files on that array were of a fixed size. Had I looked further, I would have seen that it was one of the VHDs attached to the software target that was nearing capacity. While the VHDs created by Microsoft iSCSI Software Target 3.3 are fixed in size, the VHDs I had created inside that volume for my file server (SWMI-FS1) and for my domain controller (SWMI-DC2) were dynamically expanding… it was only a matter of time before they grew to fill the 128GB… and as I always tell my students, dynamically expanding disks will expand to fill the volume, and then they (and everything else running on that volume) will stop… period.
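The trap described above is overcommitment: the combined maximum size of the dynamically expanding VHDs was larger than the volume holding them. A minimal Python sketch of that check follows. The file names and every size except the 128GB volume are hypothetical values chosen for illustration, not figures from my environment.

```python
from dataclasses import dataclass

GB = 1024 ** 3


@dataclass
class DynamicVhd:
    name: str
    current_bytes: int  # space the VHD file occupies on the volume today
    max_bytes: int      # size the guest sees; the file may grow up to this


def overcommit_bytes(volume_bytes, vhds):
    """Return how far the VHDs' combined maximum size exceeds the volume (0 if it fits)."""
    worst_case = sum(v.max_bytes for v in vhds)
    return max(0, worst_case - volume_bytes)


# Hypothetical sizes, for illustration only.
vhds = [
    DynamicVhd("SWMI-FS1.vhd", 90 * GB, 127 * GB),
    DynamicVhd("SWMI-DC2.vhd", 30 * GB, 60 * GB),
]
volume_bytes = 128 * GB  # the 128GB virtual disk from the post

excess = overcommit_bytes(volume_bytes, vhds)
if excess:
    print(f"Overcommitted by {excess / GB:.0f} GB: these disks can fill the volume")
```

If the result is greater than zero, the disks can eventually fill the volume no matter how much free space it shows today.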
Fortunately, the steps required to resolve this issue were pretty straightforward, and both my HAVMs were back up and running within twenty minutes (I had to take a phone call, or else it would have been quicker!).
- I disabled the virtual disk and then mounted the VHD locally on my storage server to see if there were any unnecessary files that could be deleted. It is strongly advised that CSVs be used exclusively to store files for Hyper-V, and that nothing else be stored on the volume. I am pretty religious about this practice, so I did not expect to find anything worth deleting. I promptly dismounted the drive from the local system.
- In my Failover Cluster I took the volume off-line. There were several warnings that this would cause any services running on that volume to stop, but that ship had already sailed so I was not worried. However, if you are running these steps as a preventive measure, it is important to take heed of these warnings.
- After verifying that I had sufficient space to do so, I extended the volume: In the iSCSI Software Target snap-in (it is available as a stand-alone MMC console, and it is also integrated into your computer’s Server Manager, under Storage) I right-clicked on the volume and selected Extend Virtual Disk. I was given the option of extending it onto an unformatted SCSI drive, but did not want the complication of extending a virtual disk that was on a RAID array onto a non-RAID disk. Instead I selected the free space that I had on the array – rather than using all of the free space, I like to use powers of two – so I extended the 128GB volume by 128GB, to 256GB.
- I re-mounted the virtual disk to the local system and discovered there was now ample free space into which to extend my partition, which I did. It is important to note that if I had not dismounted the VHD in the first step, I would have had to dismount and then remount it in order for the system to detect the extended volume.
- In my Failover Cluster I brought the volume back on-line. When that was successful, I was then able to bring my clustered resources – my highly available virtual machines – back on-line. They had not crashed – Hyper-V (with which Failover Cluster Manager is designed to integrate) identified the pending doom, and cleanly paused them. That way when it was time to come back on-line it took under a minute for all of them to be running perfectly… and for both Failover Cluster Manager and then HP Insight with Microsoft System Center Essentials to report that the target drive now had 50.5% free space!
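The arithmetic behind the extension can be sketched in a few lines of Python. This is only an illustration: the used-space figure is an assumption back-calculated from the 50.5% free space reported after the extension, and the amount of free space left on the array is a hypothetical number.

```python
def free_percent(capacity_gb, used_gb):
    """Percentage of the volume that is still free."""
    return 100.0 * (capacity_gb - used_gb) / capacity_gb


def extended_capacity(capacity_gb, extend_by_gb, array_free_gb):
    """Return the new capacity, refusing to extend past what the array can hold."""
    if extend_by_gb > array_free_gb:
        raise ValueError("not enough free space on the array for this extension")
    return capacity_gb + extend_by_gb


# Assumption: about 126.72GB in use, inferred from 50.5% free on 256GB.
used_gb = 126.72
# Hypothetical: free space remaining on the RAID-5 array before extending.
array_free_gb = 200

before = free_percent(128, used_gb)
new_capacity = extended_capacity(128, 128, array_free_gb)
after = free_percent(new_capacity, used_gb)

print(f"Free before: {before:.1f}%  Free after: {after:.1f}% of {new_capacity}GB")
```

Doubling the volume does not simply double the free space: the data already on the disk stays the same size, so nearly all of the added 128GB shows up as free.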
This incident has reminded me of a few lessons that we must always remember.
Firstly, never ignore an alert, even if you think you know what it is. Alerts are our systems’ way of telling us that something is getting ready to stop working, and that is never good. If you get an alert that you thought you had accounted for (as I did with my free space alert) then there is a good chance that it is a similar but different warning, and you should take the time to investigate.
Secondly, it is important when architecting your solutions to plan for expansion. I was able to recover from this potential jackpot because when I built the server I planned for expanded storage, so when it happened I did not have to go out and buy extra hard drives – I simply had to extend my VHD. With that being said, I have to remember that I have now used up that expansion space, so over the next few weeks I will plan to purchase a couple of extra drives to regain my breathing space for next time.
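If you want a second opinion on what a free-space alert is really about, a few lines of Python can report the free space for the exact volume in question. This is a minimal sketch and not a replacement for your monitoring tools: the 15% threshold and the function names are assumptions made for illustration, and the path is whatever volume you want to check.

```python
import shutil


def is_low(total_bytes, free_bytes, threshold=0.15):
    """True when the free fraction of the volume is below the threshold."""
    if total_bytes <= 0:
        return False  # nothing sensible to report for an empty or virtual volume
    return free_bytes / total_bytes < threshold


def low_space_warning(path, threshold=0.15):
    """Return a warning string for the volume containing path, or None if it is fine."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    if is_low(usage.total, usage.free, threshold):
        pct = 100.0 * usage.free / usage.total
        return f"{path}: only {pct:.1f}% free, investigate now"
    return None


# Check the volume that holds the current directory.
message = low_space_warning(".")
print(message or "Free space is above the threshold")
```

Running it against each volume separately (the array, and the mounted VHD) would have shown me which one the alert was actually about.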
Thirdly, I have to thank whoever you thank in cases like these that I had selected the right tools for the job. I hear far too often from IT Pros trying to build patchwork solutions on the cheap that they are planning to cut corners… Now, as an MVP I am fortunate enough that I get the software for free, but I built my iSCSI software target on high-quality Hewlett-Packard ProLiant servers with SAS (serial attached SCSI) drives, instead of building a white-box and putting a bunch of SATA disks into it. Solid hardware and the right drives and RAID level meant that there were no complications or stumbling blocks in this recovery, and I did not need to restore anything from backup (which I could have done, but it was unnecessary).
Hopefully, if you ever find yourself needing to use these steps, you will have as simple a time recovering as I did… the best way to ensure that is to plan your infrastructure out properly!