An ethernet network host watchdog
ArticleCategory: [Choose a category, do not translate
this]
Hardware
AuthorImage:[Here we need a little image from you]
TranslationInfo:[Author + translation history. mailto: or
http://homepage]
original in en Guido Socher
AboutTheAuthor:[A small biography about the author]
Guido likes Linux because it is a really good system to
develop your own hardware.
Abstract:[Here you write a little summary]
A watchdog is a piece of equipment that supervises other systems
and resets them in case it detects that those systems are failing.
Such watchdogs can be used to make systems more reliable. Reliability is
a major cost factor in many cases. Think about remote equipment where
it might take hours to get on site and service it. In some cases it
is even impossible to get there. Think about satellites in outer space.
The reliability of the watchdog itself is of course also crucial. A small,
independent watchdog device is therefore generally better than a
software-only solution implemented in the system itself. The Linux kernel,
for example, has such a software watchdog called "softdog".
This softdog can help a lot to
improve the reachability of a server, but it cannot cover all possible
cases because it is part of the failing system. Finally, a watchdog can never cover a total equipment failure. It is a
good remedy for temporary problems that go away after a reboot.
ArticleIllustration:[This is the title picture for your
article]
ArticleBody:[The article body]
The idea
The idea of a network equipment watchdog is based on the requirements and ideas
of a customer who needed to improve the reliability of telecommunication
equipment.
This equipment hung once in a while, and he had to monitor the system manually
around the clock to be able to reset it whenever it got stuck.
He wanted a device that would automatically monitor the system and
recover it.
Ping
A simple way of detecting whether network equipment is up is to send a ping
(ICMP echo request) and see if there is a reply.
A watchdog could therefore just ping that equipment.
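The ping check described above can be sketched in a few lines of Python. This is only an illustration of the idea on a Linux host, not the firmware of the watchdog itself. It assumes the iputils ping command is available; the function names and the IP address in the usage note are made up for this example.

```python
import subprocess


def build_ping_command(host, timeout_s=2):
    # Linux iputils ping: -c 1 sends a single echo request,
    # -W sets how many seconds to wait for the reply.
    return ["ping", "-c", "1", "-W", str(timeout_s), host]


def host_is_up(host, timeout_s=2, run=subprocess.run):
    """Return True if the host answered one ICMP echo request.

    The run parameter exists only so that the check can be tried
    without a real network; by default it is subprocess.run.
    """
    result = run(
        build_ping_command(host, timeout_s),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # ping exits with status 0 when at least one reply was received.
    return result.returncode == 0
```

A call such as host_is_up("10.0.0.5") would then report whether the (hypothetical) host 10.0.0.5 replied within two seconds.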
The problem, however, is a system that is "half up". Think of a
web server. The network interface might be up while the Apache web server
application has died. In this case the machine would answer pings but the
web server would not actually work. We could poll a specific web page to detect this.
A web server is, however, only one very specific case. How can we generalize the
solution to other systems? One could run a script on the server itself that
executes a number of tests to see if the system is in good shape.
If everything is OK, the script sends a ping to the watchdog. In this
case it is not the watchdog that originates the ping but the "health check
script" on the monitored equipment, which pings the watchdog once in a while to
say "I am OK".
Only if those pings are missing for a period of time will the watchdog
reset the system.
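A health check script of this kind might look roughly like the following Python sketch. It is not the script used by the customer. The watchdog address, the URL and all function names are assumptions chosen for illustration, and the web page test is just one example of a check.

```python
import http.client
import subprocess
import urllib.request

WATCHDOG_IP = "192.168.0.30"  # hypothetical address of the watchdog


def web_page_ok(url="http://localhost/", timeout_s=5):
    """Example check: the local web server answers with HTTP status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, broken reply, ...
        return False


def send_alive_ping(host=WATCHDOG_IP, run=subprocess.run):
    """Send one ping to the watchdog to say "I am OK"."""
    run(["ping", "-c", "1", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)


def health_check(checks, send_alive=send_alive_ping):
    """Run all checks; ping the watchdog only if every one of them passes."""
    if all(check() for check in checks):
        send_alive()
        return True
    # Stay silent: the missing ping is what tells the watchdog about the fault.
    return False
```

Such a script could be started periodically, for example from cron, with a call like health_check([web_page_ok]). Further checks are added simply by appending more functions to the list.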
Time to reboot
We must pay special attention to the way systems reboot. Let's say we
expect an "alive signal" (ping/reply) from the monitored network equipment
every 30 seconds. After, say, 2 missing ping/replies we would initiate a reboot,
in other words a little after 60 seconds. The system reboots, but the time it takes
to do that might be 5 or 10 minutes. We must avoid rebooting the system
during startup, otherwise it will never finish starting up.
The solution is to put the watchdog into a "passive state" after a reset. In
this state it continues to monitor the system but does not initiate a
new reset. Only when the watchdog receives the first "I am alive" indication
again does it go back into an "active state", in which it would again initiate
a reboot/reset in case of a failure.
This way it does not really matter how long the startup takes.
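The active/passive logic can be written down as a small state machine. The Python sketch below is a simulation of the idea and not the code running on the watchdog. The class and method names are invented, the 30 second interval and the 2 missed signals are the example values from the text, and starting in the passive state after power-up is an assumption.

```python
class Watchdog:
    ACTIVE = "active"
    PASSIVE = "passive"

    def __init__(self, interval_s=30, max_missed=2, reset=lambda: None):
        # With the example values a reset is due a little after 2 * 30 = 60 s.
        self.timeout_s = interval_s * max_missed
        self.reset = reset            # callback that would pulse the relay
        self.state = self.PASSIVE     # assumption: passive until the first alive signal
        self.last_alive = None
        self.reset_count = 0

    def alive(self, now):
        """Called for every alive signal; now is a time stamp in seconds."""
        self.last_alive = now
        self.state = self.ACTIVE

    def tick(self, now):
        """Called periodically; returns True if a reset was initiated."""
        if self.state != self.ACTIVE:
            return False              # passive: keep watching, never reset
        if now - self.last_alive > self.timeout_s:
            self.reset()
            self.reset_count += 1
            self.state = self.PASSIVE  # do not reset again during the reboot
            return True
        return False
```

After a reset, tick() keeps returning False no matter how much time passes. Only the next call to alive() arms the watchdog again.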
The hardware
The tuxgraphics Ethernet board provides pin PD7 for connecting a
relay. Relays usually have one contact that opens and one that closes.
Depending on whether you want to reset the monitored equipment or
disconnect it from power for a moment, you can use one of the two relay contacts.
The Ethernet board just supplies current to the relay at the time
of the reset/restart.
The hardware is therefore very simple. Just take the standard tuxgraphics
Ethernet board and connect a relay to it.
The tuxgraphics host watchdog
The watchdog is configurable via its own web pages. Just point your
web browser to it and you can see the state of the system, how often it had
to be reset, whether the watchdog is active or passive, etc. You can also
configure whether pings are sent from the watchdog or whether the system
pings the watchdog.
The watchdog has its own online help. Have a look.
References/Download