I typically go onsite for switch software updates. They’re just about the only thing I don’t have a good fallback mechanism for in most of the networking stacks that I support. If a host server update fails, I can reset it through iLO or iDRAC. If a firewall update fails, I mostly have High Availability configurations, so a single failure won’t ruin my night. However, I am always present for Cisco Catalyst updates. The failure scenarios are too many, and my recovery options too few.
This past Friday I was doing a simple update, from 15.1 to 15.2(4)E6, on a pair of non-stacked Catalyst 2960-X switches. I’d done two previous updates in this environment without issue, and after my onsite maintenance windows had been delayed a few times, I had to just schedule it to be done remotely. What could go wrong?
I backed up all my configurations, then pulled the Cisco-recommended software down to the switch with archive download-sw, using the /overwrite and /reload options. I watched the upgrade proceed normally, remembering that there is often a long stretch where the switch appears unresponsive due to console display errors during upgrades. Then I saw it start to reboot. And I waited.
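For reference, the remote procedure amounts to two commands. The TFTP server address and image filename below are placeholders, not my actual environment:

! save a copy of the running config off-box first
copy running-config tftp://192.0.2.10/switch01-confg
! pull the new image, overwrite the old one, and reload when the copy finishes
archive download-sw /overwrite /reload tftp://192.0.2.10/c2960x-universalk9-tar.152-4.E6.tar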
After 20 minutes my remote session hadn’t come back up. I connected to the VPN and found that I could ping and SSH to the switch, but couldn’t ping any of the connected network devices. I log in to the switch, run terminal monitor, and start looking for the problem. show ver shows me that the upgrade was successful. I can ping other switches and servers from inside this switch. So what’s wrong?
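These are roughly the checks I ran from that SSH session; the addresses are placeholders, not my real network:

terminal monitor                   ! copy syslog output to this vty session
show version | include Version     ! confirms the switch booted the new image
ping 10.10.10.2                    ! another switch - replies
ping 10.10.10.21                   ! a connected server - replies from the switch, but not from my workstation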
After a few minutes, the following message comes up in the terminal:
%ILET-1-DEVICE_AUTHENTICATION_FAIL: The FlexStack Module inserted in
this switch may not have been manufactured by Cisco or with Cisco's
authorization. If your use of this product is the cause of a support
issue, Cisco may deny operation of the product, support under your
warranty or under a Cisco technical support program such as
Smartnet. Please contact Cisco's Technical Assistance Center for
more information.
But I’m not using any FlexStack modules, and all my hardware is legitimate. What’s going on? I search for this message in the Cisco support forums and find a link to Bug ID CSCur56395, which states:
If this issue is seen AFTER UPGRADE, then hard power-cycle is required
Great.
You can try a reload, but it won’t work. You could try a downgrade back to the previous version, but I don’t know whether that works (let me know if it does). It seemed too risky to me, and I’ve never tried it; I hope to test it in the lab if I can recreate the issue. In my case I had to call a coworker who lives nearby to go onsite and power the switch down.
Sorry if you read this far hoping for a quick solution to this problem. It’s time to call your datacenter smart hands, or lace up your boots and head onsite yourself. If you happen to be onsite already, laptop balanced on top of the KVM, reading this post, then you are very lucky: just unplug the switch for 5 minutes, do some stretches, plug it back in, and all will be well again.
Postmortem notes for next time:
- My hosts’ uplinks should be balanced between the two switches. Fix that next time I’m onsite. This outage wouldn’t have needed a repair at 11pm on a Friday if the hosts had simply failed over to the other switch.
- The UPS should have had a network card in it. I’m not sure I would have used it in this scenario, but in some cases it would be helpful to reset one of the power banks in the UPS by telnetting to it from inside the failed switch (see the sketch below). In this case there was no management card in the UPS, and I would rather not risk a dirty shutdown of Exchange. But had I been prepared for this, I could have arranged servers and switches across the APC’s power banks to minimize unsafe shutdowns while still allowing remote reboots.
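To make that concrete: the failed switch is still reachable over SSH, so the idea is to hop from it to the UPS’s management card and cycle only the outlet bank feeding that switch. This is only a sketch; the IP address is a placeholder, and the outlet-control commands differ between card models and firmware versions.

! from the still-reachable switch, open a session to the UPS management card
telnet 192.0.2.50
! log in to the card and use its own CLI or menu to cycle the outlet group
! that feeds only this switch (exact commands vary by card and firmware)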