Turn It Off and Count to Ten

Turn it off, wait ten seconds, then turn it back on again. That’s the first advice you will usually get from a helpdesk. Some users find it annoying; here is the explanation.

This is almost always the best initial response to a problem. It is usually best done as soon as the problem becomes apparent, but when to act can be difficult to decide and we’ll come back to that.

There are several reasons for the ten-second wait:

All digital logic circuits use smoothed DC power but the mains/line power is AC and subject to glitches. The power supply circuits store a bit of energy in capacitors – often enough to power things for several seconds. Just because the lights went out doesn’t mean the processor stopped – it will probably be the last thing to actually stop. Ten seconds is often long enough although user guides will sometimes say a minute – or even five minutes (people lack patience for that).

Power supplies contain anti-surge components that are designed to cope with a longish break in power, not a one-second break. Switching things off and on again within a second can do long-term damage to these.

Sometimes you may need to unplug equipment altogether because the off-switch doesn’t really control the processor, just the visible lights and motors. If the thing has batteries you may need to remove them as well – and yes, the device may lose all its settings. So turn the power off first and treat removing the batteries as a second step.
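The first reason above, energy stored in the power supply capacitors, can be put into rough numbers. The sketch below assumes a single capacitor feeding a constant-power load and estimates how long it takes for the voltage to sag from its starting value to the point where the electronics give up. The function name and every figure in it are illustrative assumptions, not measurements of any real device.

```python
def holdup_seconds(capacitance_f, v_start, v_min, load_w):
    """Rough time a capacitor can feed a constant-power load.

    Usable energy is the difference in stored energy (0.5 * C * V^2)
    between the starting voltage and the lowest voltage the circuit
    can still run from. Dividing by the load power gives seconds.
    """
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    if v_min > v_start:
        raise ValueError("v_min cannot exceed v_start")
    usable_joules = 0.5 * capacitance_f * (v_start ** 2 - v_min ** 2)
    return usable_joules / load_w


if __name__ == "__main__":
    # Assumed example values: a 330 microfarad capacitor charged to 325 V,
    # with the supply giving up at 200 V.
    for load in (100.0, 2.0):  # watts: busy device vs. nearly idle device
        seconds = holdup_seconds(330e-6, 325.0, 200.0, load)
        print(f"{load:6.1f} W load: about {seconds:.2f} s")
```

With these assumed values a busy 100 W load drains the capacitor in a fraction of a second, but once the heavy loads have dropped out a 2 W remnant could keep going for several seconds, which is why the processor may be the last thing to stop.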


Explanation

What you are doing is a “cold boot”.

Digital hardware, firmware and software are very complex and can get into states that the design team did not foresee: endless loops, waiting for an event that will never happen, or simply executing an invalid instruction, perhaps because the program jumped into memory that is supposed to hold data.

It is also possible that program memory was corrupted by a “soft error”. Radiation, whether alpha decay within the chip or cosmic rays hitting it, can flip individual bits or groups of bits to the wrong state. Dynamic Random Access Memory (DRAM) and cache memory are particularly vulnerable, and expensive equipment will often use error-correcting memory (ECC RAM) in an attempt to overcome this.
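To give a feel for how error correction can undo a single flipped bit, here is a toy sketch using the classic Hamming(7,4) code: four data bits are stored with three extra parity bits, and the pattern of failed parity checks points at the damaged position. Real ECC memory uses wider codes implemented in hardware, so treat this as an illustration of the principle only; the function names are our own.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # checks positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return the 4 data bits, correcting at most one flipped bit."""
    c = list(codeword)  # work on a copy so the caller's list is untouched
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The three check results, read as a binary number, give the
    # 1-based position of the flipped bit (0 means no error detected).
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


if __name__ == "__main__":
    stored = hamming74_encode([1, 0, 1, 1])
    damaged = list(stored)
    damaged[4] ^= 1  # simulate a soft error in one bit
    print("recovered:", hamming74_decode(damaged))
```

This scheme copes with one flipped bit per codeword. Two flips in the same word would be "corrected" to the wrong value, which is one reason error correction is an attempt to overcome soft errors rather than a guarantee.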

Turning the power off will stop the processor(s) involved and empty the working memory. When the device restarts it reloads memory and registers into a state that is known and expected. Whatever triggered the problem will hopefully not happen again. Soft errors are sometimes called Single Event Upsets (SEU) and they genuinely can be caused by cosmic rays – aircraft and satellites are more vulnerable.

A helpful side effect of turning something off, then on again is that it does help make sure you actually had it turned on in the first place. Asking users if they have actually turned something on may provoke them, but failure to ask can waste a surprising amount of time.

If the problem does recur, try to be alert to the sequence of events that leads up to it: by avoiding that sequence you may avoid the issue, and some brands will respond with a firmware update if they know there is a problem. Murphy’s law suggests a crash won’t recur when you are expecting it; it will lull you into complacency and hit just when there is a deadline to meet. People don’t like upgrading firmware but it is often the fix for erratic faults.

Turning things off then on again can also fix network issues. When a machine boots it either has a fixed IP address or it expects to get one from the local router. A device will usually make several DHCP requests for an address, but after a while it slows its requests down or stops asking; a restart makes it start asking again.

With so much equipment network-connected there is also the possibility that odd behaviour means it’s been hacked. Malware won’t usually draw attention to itself because stealth is of the essence, but its authors might get things wrong. Whether turning it off and on will help probably depends on whether the malware can overwrite firmware or not. Powering off, disconnecting the network and seeing what happens may indicate something, but finding out what is really going on will need network monitoring. How to know you’ve been hacked is beyond our remit in this write-up.
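The way devices slow down their DHCP requests can be sketched as an exponential backoff. The example below is loosely modelled on the retransmission guidance in RFC 2131 (first retry after about four seconds, doubling up to about a minute, with a little random jitter). Actual clients vary a great deal, so the function name and default values here are assumptions for illustration.

```python
import random


def dhcp_retry_delays(attempts, first_delay=4.0, max_delay=64.0,
                      jitter=1.0, rng=None):
    """Return the wait in seconds before each of `attempts` retries.

    The base delay doubles after every retry until it reaches
    max_delay; each wait is nudged by a random amount in
    [-jitter, +jitter] so that many devices do not retry in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    base = first_delay
    for _ in range(attempts):
        delays.append(max(0.0, base + rng.uniform(-jitter, jitter)))
        base = min(base * 2, max_delay)
    return delays


if __name__ == "__main__":
    waits = dhcp_retry_delays(6)
    print("waits between requests:", [round(w, 1) for w in waits])
    print("total time spent waiting: about", round(sum(waits)), "seconds")
```

A device that booted while the router was still starting up can end up minutes into this schedule, or have given up altogether, by the time an address is available. Power cycling it puts it back at the start with short waits.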

Fast Reaction

Is it best to turn it off straight away – or wait and let an “expert” look at things?

The simple answer is that if you have a fault in something and you have read this far, it is probably too late for the quick reaction. One reason for being quick is that if a power device goes into latch-up it will overheat and destroy itself. Likewise, if the processor has crashed it has presumably left whatever it was controlling in an unpredictable state, such as one set of coils in a stepper motor left turned on. If the device starts buzzing or making some other peculiar noise, a fast reaction might prevent damage.

This conflicts with the need to write error messages down if you are to have any hope of finding out what is going wrong with a recurring fault. There is no universal right answer, but on the whole we’d say that if a thing isn’t working it may as well be off as on.