First let me say I hate when things don't fail in predictable ways.

Ok. Now back to our regularly scheduled programming.

A few weeks ago, I'm in a remote office, about 5 hours from here. We're
doing a bunch of final stuff on day 2, and the power goes out. We need
to leave in 3 hours to be back here at a certain time. We wait and
wait, no power. The whole block is out. So we clean up and do what we
can and then leave.

About 3 hours into the trip we get a call. Power is back up, but nobody
can get to anything. No internet. So I call and open up a trouble
ticket. An hour later I also find out they can't get to the server
either. That would have been a useful piece of info to know. The
internet circuit is up; they can ping my router.

I get one of the users to run some tests for me. Most machines can't see
anything. Some machines can ping some things, and some get 90% or higher
packet loss when they ping. I have them go back and power off the switch
(an HP ProCurve 5304xl modular switch), wait, and then power it back on.
No go. Nothing changes. Meanwhile, there had been a major accident in the
opposite lanes of the interstate, which we had already passed, and they
had closed that side down, so there was no going back at the moment.

Everything I had them try points to the switch. We also found out that
their UPS wasn't working properly, so we grabbed a replacement UPS and
brought it up, along with a 24-port switch I had just in case.
I get up there and I can't get out; I can ping some things and not
others. The switch is just goofy. On a hunch, I hook up the other switch,
download a TFTP server and the latest flash image for the big switch,
manage to find a working port on it, and flash it. It takes forever due
to the packet loss, but it goes. Reboot, and poof, everything works.
Weird, I think; it must have somehow corrupted the flash in that outage.

Fast forward 3 weeks. The power goes out, and the exact same thing happens.
Remotely, I manage to upload a new flash image. It takes over an hour.
Reboot... it doesn't fix the issue. It's still down, with the same
symptoms. So I get the switch yanked and a smaller one put in. We bring
the switch back here to this office and it runs fine. I get on the phone
with HP, who sends me out a new backplane/chassis for it.

Today I get a call. People can't log in. I talk to one of the guys, and
he can't even get an IP address. He's in a conf room, and I think, ok,
maybe that cable didn't get reconnected. We check it out, and it's
connected and the port is lit. Then I remember there's a small switch
in the conf room. I say, "Look at the bottom. Are the ports lit?" "Yes,
4 of them." 4? "Who else is in there?" "Just me."

Ok, I have him unplug his cable and the cable going from the wall to the
switch, and 2 ports go out, but 2 are still lit and solid, not blinking...

Ok, has anyone deduced the sole cause of their networking problems
over the last few weeks yet? It hit me as soon as he said he had 4
lights lit....

I said, "Do me a favor. Take the caps off the table." (There are 3
inserts set down into the table, each with 2 network jacks, 2 phone
jacks, and a plastic cover.)

"Is there, by any chance, a network cable with its two ends plugged into
both network jacks?" "Yup."


So here's what happens. The large ProCurve has a connection running to
that conf room, where it feeds a small unmanaged 7-port ProCurve switch.
That small switch now has a loop. So, effectively, one port on the large
switch is looping back on itself, and broadcast frames circulate through
it endlessly. That'd screw things up royally.
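To see why one patch cable does this much damage, here's a toy model of what an unmanaged switch does with a single broadcast frame when two of its ports are cabled together. It's a rough sketch, not anything ProCurve-specific: the function name `simulate_loop`, the port numbering, and the idea of discrete "ticks" are all assumptions for illustration. The key fact it relies on is that an Ethernet frame has no hop count, so nothing ever retires the copies going around the loop.

```python
from __future__ import annotations


def simulate_loop(num_ports: int, uplink: int, loop: tuple[int, int], ticks: int) -> list[int]:
    """Toy model (hypothetical) of an unmanaged switch flooding one broadcast.

    `loop` is the pair of ports joined by a patch cable. Returns, per tick,
    how many copies of the frame the switch sends back up the uplink.
    """
    a, b = loop
    peer = {a: b, b: a}  # a copy sent out one end re-enters on the other
    in_flight = [uplink]  # ingress ports of copies arriving this tick
    sent_up = []
    for _ in range(ticks):
        next_in = []
        up = 0
        for ingress in in_flight:
            # A broadcast is flooded out every port except the one it came in on.
            for port in range(num_ports):
                if port == ingress:
                    continue
                if port == uplink:
                    up += 1  # this copy goes back to the big switch
                elif port in peer:
                    next_in.append(peer[port])  # comes straight back in
                # any other port: delivered to a host, or nothing plugged in
        sent_up.append(up)
        in_flight = next_in
    return sent_up
```

With 7 ports, the uplink on port 0 and the table cable joining ports 5 and 6, `simulate_loop(7, 0, (5, 6), 5)` returns `[0, 2, 2, 2, 2]`: after the first tick, the small switch sends two fresh copies up to the big switch every tick, indefinitely. A real switch forwards at line rate, so a "tick" is on the order of microseconds, and the big switch in turn floods every returned broadcast out all of its other ports in that VLAN.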

Now getting back to my first statement about the way things fail. Why
would reflashing the switch cause it to function again, seemingly without
issue, for 3 weeks? Why would a power cycle cause the switch to lose
its mind? And why wouldn't reflashing it again fix the problem as before?

Had it stayed "broken" after the flash, we would eventually have stumbled
onto that problem, since we did look for loops in the rack at the time.

Oddly enough, the second time it happened I was able to get into the GUI
of the switch, and it showed every active port pegged at 100%.

So anyway, problem solved, but sheesh, the way it presented was odd
because it was so inconsistent. I would think that if you create a loop,
it would stay screwed up until you fix it.

Now I have to figure out who to thwack up there for doing that.