It is Friday morning in Tokyo, and there is a line out the door. If you didn’t know any better you would think that they were lined up to get an autograph from the latest pop icon.
However, if you look at the sign on the door, it does not say ‘Tokyo Arena’ or ‘Tokyo Hilton.’ It says IT Service Desk, and the throngs lined up are users, each with a laptop in hand. They all seem to be having similar problems: either they cannot log in at all, or Outlook crashes when they receive HTML-based e-mail.
If I were a Help Desk Technician, I might be thinking right now that this was a bad day to get out of bed. If I were an IT Director, I would be (figuratively) screaming for answers, needing my team to find the root cause… Is it malware? Are we under attack? Was it simply some massive incompetence that killed our systems?
It wouldn’t be long before I discovered the answer. Are we under attack? No. Is it malware? No… at least, not by the most commonly accepted definition of the term. What we were facing was a patch from Microsoft. Patch KB3097877, part of the November 10 patch rollout cycle, was to blame for our myriad issues.
With that knowledge, as an IT Director, I would be setting forth the following plan:
- Train the Support Counter techs to resolve the issue (as found in this article from Microsoft);
- Ensure the patch was immediately removed from WSUS; and
- Once the ‘crisis’ was over, I would bring the interested parties into a room and do a post-mortem… that is, figure out what went wrong, and how to prevent it from happening in the future.
The second point is easy: once you know which patch it is, all you have to do is have a WSUS admin mark it as DECLINED. The first point is stressful for the support techs, but they are well trained and will handle it.
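Declining a patch can be done from the WSUS console, or scripted. The sketch below is only a minimal Python model of that step, not the real WSUS API (which is .NET, via `Microsoft.UpdateServices.Administration`, or the UpdateServices PowerShell cmdlets); the in-memory catalog, field names, and the second catalog entry are invented for illustration.

```python
# Illustrative only: models the "decline the bad patch" step.
# The real WSUS interface is .NET/PowerShell; these dict fields are hypothetical.

def decline_patch(updates, kb_number):
    """Mark every update matching the given KB article as Declined."""
    declined = []
    for update in updates:
        if kb_number in update.get("kb_articles", []):
            update["approval"] = "Declined"
            declined.append(update["title"])
    return declined

# A tiny stand-in for the WSUS update list (second entry is a placeholder).
catalog = [
    {"title": "KB3097877 (kernel-mode driver update)",
     "kb_articles": ["3097877"], "approval": "Approved"},
    {"title": "KB0000000 (unrelated example update)",
     "kb_articles": ["0000000"], "approval": "Approved"},
]

print(decline_patch(catalog, "3097877"))
# → ['KB3097877 (kernel-mode driver update)']
```

Once declined, WSUS stops offering the update to clients; machines that already installed it still need the hands-on fix from point one.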
It is during the third point – the post-mortem – that I would be looking at my team and wanting them all to simultaneously burst into flames. Because someone – one of the people whom I trust with my infrastructure, and therefore with the entire company’s ability to work – would have to look at me and say ‘We accept and push out all patches immediately, without testing them.’
If I am an extremely diligent IT Director, I will know that our IT Department Policy and Procedures Statement contains a policy about applying patches, and it likely says that patches should be applied only after proper testing. If we are a less stringent company, the policy might instead say that patches should be applied only after a reasonable delay has passed and the appropriate forums and blogs on the Internet have declared them okay.
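Either version of that policy boils down to a simple gate before broad deployment. Here is a minimal Python sketch of such a gate; the field names and the 7-day deferral window are assumptions for illustration, not anything from an actual policy document.

```python
from datetime import date, timedelta

DEFERRAL_DAYS = 7  # example "reasonable delay" before broad deployment

def eligible_for_deployment(patch, today):
    """A patch may go out only after the deferral window AND internal testing."""
    aged = (today - patch["released"]) >= timedelta(days=DEFERRAL_DAYS)
    return aged and patch["passed_testing"]

# KB3097877 shipped November 10; a lab pass would have caught the Outlook crashes.
patch = {"kb": "3097877",
         "released": date(2015, 11, 10),
         "passed_testing": False}

print(eligible_for_deployment(patch, date(2015, 11, 12)))  # → False
```

Under either rule – wait, or test, or both – KB3097877 would never have reached the whole company on day one.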
If there is no such policy then the blame lies with me. I can glare at the others, I can even yell if I am a bad leader. However the buck stops here.
If, however, there is such a policy, I would be looking at the team and asking them why it hadn’t been followed. I imagine they would look at me quizzically, and someone would say ‘This is just what we do… it’s never caused problems before!’
I might look at the admin who said that and ask if he wears a seat belt when he drives a car. I might ask if he wears a life vest when he goes boating. Chances are if you don’t, nothing will happen. You wear them to be safe and increase your chances of survival if something does happen. It is the reason we test patches (or let others test them) before we apply them.
The admins’ failure to test patches might cost hundreds of thousands of dollars in lost productivity… and yet it is almost certain that nobody will lose their job. They probably won’t even get a reprimand. None of that is necessary. What is necessary is that we learn from this. Patches do not break things very often, but we have to remember that they can, and because of that we must take the proper steps – do our due diligence – to make sure we don’t get hit.