Antonette's Lessons Learned Story #5
"Grace under pressure: When good commissioning goes bad"
As told by Annalisa:
So there we were, on Saturday morning at 8:15 am, testing the fire suppression system for the Computer Room. Sounds easy, right? Sure it does. How hard could it be? After all, the pre-action fire suppression system consists of only a few items: smoke detectors, a low air pressure switch, a tamper switch, and a water flow switch. We have a commissioning document, reviewed by two of the most anal-retentive people in all the land. This document went through no fewer than seven revisions before the final document was approved!
So, at 8:15 am, we are coming down the home stretch, dreaming of going home and enjoying our Saturday. Then "it" happens. We reach behind the control valve and activate the water flow switch. An alarm comes in on the panel and we return the switch to normal. Attempts to clear the alarm are unsuccessful. I ask the fire testing contractor, "Gee, how come it says 'AC Trouble' rather than 'Zone X alarm' this time?" He says, "I don't know, but I can't reset the system to normal." Another fire test contractor comes out of the computer room and says, "Did you guys do anything? That room just got awfully quiet!" We investigate and discover that we have taken out all the servers in the room. And there were A LOT!
We call the maintenance supervisor and he starts to look for a breaker or something that we must have tripped. The lights were still on in the room and the rest of the building had power; only the HVAC units and the computer equipment were not operating. This verified that the local power company didn't have an outage. However, we were curious as to why the servers went down, since they have a dedicated UPS serving them. We then check in the basement and we find that the UPS was knocked offline too! So now we know why the servers have no power, and we have to get them back up. We now attempt to restart the UPS.
We crank the rotary switch on the UPS through its positions, from 4 (normal) to 0 (off). The servers upstairs kick on and then off as we rotate the switch back to the startup position. We attempt to restart the UPS as per the instructions on the screen and it fails due to a DC power overload. Nobody can figure out how to reset it. I pipe up with "If you can get me a manual, I can get it back online." Off we go in search of a manual for the UPS. Now, funny thing, the manual for the UPS is online at the manufacturer's website. Unfortunately, the UPS also supplies power to the phone system. Yup, that's right, the phones don't work either. Finally, the electrician finds a manual and I start to read.
At this point, we have quite a bedraggled (and dismayed, I mean, come on, it is Saturday after all!) group assembled to do damage control. The IT Manager, a few of his people, the Facilities Manager, an electrician, and the company with the maintenance contract on the UPS have all been called in to add their expertise to this mess. Of course the top question of the hour is: Why didn't this ever happen before? This is an existing system that is required to be tested by law each year. So, why didn't we know it was going to happen?
The IT Manager and his group proceed to inform us that we have just killed 192 servers. I'm thinking, "Oh crap, I just took out all of their North American Operations!" I've reviewed the user's manual for the UPS and I think I know how to restart the UPS without further damage to the servers; however, the Facilities Manager is having none of this! His butt is already on the line due to the first power loss incident and he's concerned about how the servers will fare after being kicked on and off so far this morning. So, IT begins to shut down all 192 (!) servers in an orderly manner. In the meantime, the electrician and the maintenance company rep check out the UPS. It appears to be fine, none the worse for wear after its busy morning.
So the first problem we have is to get the UPS back up and running so we can retrace our testing from the morning. The maintenance rep takes the battery strings out of the equation to reduce the DC voltage coming into the UPS. I figure out that the rotary switch in the maintenance bypass cabinet is the one that is giving us so much trouble. Now we try to restart the UPS. We start the UPS and, as instructed on the screen, wait 30 seconds before turning to position 2. Failure. Someone mentions making sure the capacitors are discharged. The maintenance rep checks it out and, sure enough, even though the battery strings are disconnected, it takes a minimum of 10 minutes for the capacitors to discharge, which then enables the UPS to be started. So off we go, turn to position 2, and position 3, following the steps on the screen.
Oh, and just so you know, when you have a maintenance bypass cabinet, the startup procedure gets just a little more complicated. I'm flipping back and forth in the UPS manual, between the startup procedure and the maintenance bypass cabinet section. The UPS screen says, "Turn rotary switch to position 4." So I interject, "DON'T touch that switch! First you have to switch the load from the maintenance bypass to the UPS, then you place it in position 4." They take a chance and follow my instructions and voila! It works! We do a little happy dance before trudging back upstairs.
Finally, we are ready to go again. We start again and retrace our steps from the top. Again, we get to the water flow switch portion of the test and everything is hunky-dory. We confirm that the water flow switch does in fact de-energize the UPS and all the equipment in the room. We proceed back to the basement. The UPS restarts easily and perfectly, now that we know all the steps.
We then retest the water flow switch, except this time, somehow, we know about a handy dandy bypass which we think will solve our problems. Miraculously, it works! When the system is in bypass, water flow switch activation does not de-energize the equipment in the room or the UPS. Success is ours! We finish the testing and finally we get to go home a mere 7 hours after we started.
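For anyone who has to write this down for the next tester, the behavior we observed can be sketched as a tiny model. This is only an illustration of what the story describes, not the panel's actual logic; the class, field, and return-value names are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class PreActionPanel:
    """Toy model of the interlock we observed (all names are hypothetical)."""

    bypass_engaged: bool = False  # the hidden switch by the cabinet hinge
    ups_energized: bool = True    # the UPS feeds the servers and the phones

    def activate_water_flow_switch(self) -> str:
        # With the bypass off, a water flow signal also drops the UPS
        # and everything it serves in the computer room.
        if not self.bypass_engaged:
            self.ups_energized = False
            return "alarm + power shutdown"
        # With the bypass on, the panel still alarms but power stays up.
        return "alarm only"


# First run of the morning: nobody knew the bypass existed.
morning = PreActionPanel()
print(morning.activate_water_flow_switch(), morning.ups_energized)

# Retest with the system in bypass.
retest = PreActionPanel(bypass_engaged=True)
print(retest.activate_water_flow_switch(), retest.ups_energized)
```

The whole point of the sketch is the single `bypass_engaged` flag: one undocumented input decided whether the test was routine or took the room down.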
I slink in on Monday morning and, as usual, update Antonette about my adventure. I ask for a heads-up if I should be working on my resume. Quite calmly she tells me I did everything I could and that yes, I still have a job; of course, if the company closes, we are all done for.
The next day Antonette has a meeting with the same client, no joke. I'm thinking this cannot be good. But it's about another job, and the project manager graciously tells Antonette that I'm a heroine. When she returns, she tells me how he proceeded to tell everyone in the room how I held together under pressure while everyone else was losing it, and that I calmly found and successfully executed the restart sequence. Rumor is spreading throughout the company of what a heroine I am. And Antonette proudly concurred: if there is one thing Annalisa can do, it is hold her own under pressure.
Why did it happen? Because of the way the fire suppression system cabinet door opened, with the display panel mounted on the front, an observer could not see the bypass switch, which is located in the lower left corner by the hinge. Activating the bypass switch takes less than 3 seconds and, due to the configuration noted above, is virtually invisible to anyone not aware of its existence. Only the prior vendor testing the system knew of its existence. No notes, no documentation, no vendor.
Lesson learned: Documentation. It's vital.
