I've been working on an implementation of NRPE for Netduino Plus 2.
NRPE is a protocol spoken by Nagios for system monitoring. So far I've hooked up a flood sensor (the reason I started the project) and a temperature/humidity sensor (so I could get the mechanics of multiple checks figured out early, and allow for easier expansion).
I feel like I'm near the end of the project, but during development I've had some threading- and resource-related crashes, so I've made an effort to test the code under conditions that go way beyond what will happen in deployment. I do NOT want to be going down to my basement every few days or weeks to manually reboot the device if it hangs. I am OK with it getting into a bad state briefly, SO LONG as it can recover by itself within a few minutes and keep going.
I've been testing under three scenarios.
Normal Usage: In deployment, the device will be polled every 1-5 minutes per service. This is very light duty, and it is quite possible that my current code can already sustain this load indefinitely without crashing.
In my testing, though, I have stepped up the load two more levels to try to find any lurking networking/threading/resource bugs:
Stiff Breeze: One copy of this shell script running against the device. To run it you will need check_nrpe, which is part of NRPE:
#!/bin/bash
COUNTER=0
while [ $COUNTER -lt 100 ]; do
    ./check_nrpe -n -H noah.doodle.local -c check_temp
    ./check_nrpe -n -H noah.doodle.local -c check_flood
done
[ Yes, I'm deliberately not incrementing the counter, so this runs forever. ]
Gale: Multiple copies of the above shell script running in parallel against the device.
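For anyone who wants to reproduce Gale without opening several terminals, here is a rough sketch of a launcher. The function names (poll_loop, gale) and the CHECK_NRPE, HOST and ITERATIONS variables are made up for this example and are not part of NRPE; ITERATIONS=0 keeps the run-forever behaviour of the script above.

```bash
#!/bin/bash
# Sketch only: start several copies of the polling loop in parallel.
# CHECK_NRPE, HOST and ITERATIONS are hypothetical knobs for this example.
CHECK_NRPE="${CHECK_NRPE:-./check_nrpe}"
HOST="${HOST:-noah.doodle.local}"
ITERATIONS="${ITERATIONS:-0}"   # 0 means loop forever, like the original script

# One "Stiff Breeze" loop: alternate the two checks.
poll_loop() {
    local i=0
    while [ "$ITERATIONS" -eq 0 ] || [ "$i" -lt "$ITERATIONS" ]; do
        "$CHECK_NRPE" -n -H "$HOST" -c check_temp
        "$CHECK_NRPE" -n -H "$HOST" -c check_flood
        i=$((i + 1))
    done
}

# "Gale": run the given number of loops as background jobs and wait for them.
gale() {
    local copies="$1" n
    for ((n = 0; n < copies; n++)); do
        poll_loop &
    done
    wait
}
```

Calling `gale 4` after these definitions starts four loops against the device; Ctrl-C stops them.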
Under Stiff Breeze, I can consistently get the device to crash in 2-4 hours. When it crashes, I first get a network error 10054 (connection reset), probably when some incoming connection that the Netduino is trying to respond to times out. Then I get some 10048 (address already in use) errors on the aborted stream. Finally I bounce back to the main loop, but evidently the network stack is dead by then, as the Netduino no longer responds to ping. It must be power cycled/rebooted before it responds to ping again.
Under Gale, I can get the same thing to happen in a matter of minutes.
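To put a number on the time to failure, something like the following could run alongside the load scripts. It is only a sketch: watch_until_dead and the PROBE, INTERVAL and MAX_FAILS variables are names invented for this example, and it assumes a ping that accepts `-c 1` (Linux and macOS do; other platforms may need a different PROBE).

```bash
#!/bin/bash
# Sketch only: report how long the device keeps answering a probe.
# PROBE, INTERVAL and MAX_FAILS are hypothetical knobs for this example.
HOST="${HOST:-noah.doodle.local}"
PROBE="${PROBE:-ping -c 1}"      # assumption: ping supports -c
INTERVAL="${INTERVAL:-5}"        # seconds between probes
MAX_FAILS="${MAX_FAILS:-3}"      # consecutive failures that count as "dead"

watch_until_dead() {
    local start fails=0
    start=$(date +%s)
    while [ "$fails" -lt "$MAX_FAILS" ]; do
        # PROBE is left unquoted on purpose so "ping -c 1" splits into words.
        if $PROBE "$HOST" >/dev/null 2>&1; then
            fails=0
        else
            fails=$((fails + 1))
        fi
        sleep "$INTERVAL"
    done
    echo "$HOST stopped responding after $(( $(date +%s) - start )) seconds"
}
```

Requiring several consecutive failures keeps a single dropped ping from being counted as a crash.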
(I haven't done long-term Normal Usage testing, as it seems likely that it will take days for an error to occur, if it is going to occur at all.)
Am I being unrealistic about what the device can do? Or is there a coding error here I can correct?
You do not need my sensor setup to simulate the crash (although I've included a Fritzing screenshot in the zip for the interested). All you need is the ability to run check_nrpe in a loop as shown above.
Thanks very much!