The Netduino forums have been replaced by new forums at community.wildernesslabs.co.
This site has been preserved for archival purposes only
and the ability to make new accounts or posts has been turned off.
Network stops working on the hour or one minute after
I've been having a problem where my netduino network app stops working after a variable length of time. Sometimes it runs for a few minutes, sometimes a few hours, but sooner or later the network just stops working.
Oddly, after it happened a few times, I got a hunch that it was always stopping at about 1 minute past each hour.
A bold claim, you may think, and you'll want to see some evidence. And here it is...
My app interfaces to a weather station via a serial port. It has 3 main sections as follows:
A command interpreter which listens on a server socket for client connections and processes commands.
A weather data broadcaster that runs on its own thread, polls the weather station every 20 seconds for current data and sends it out in a UDP broadcast. The broadcast payload contains the weather data serialized as rudimentary XML and contains a UTC timestamp, that is set using DateTime.UtcNow.
A class that uses a timer to flash the on-board LED with different cadences depending on what's going on (e.g. slow flash = idle, fast flash = client is connected).
When the app starts, it performs the follwoing actions:
Waits for the network to become available (Netduino is configured to use DHCP but always gets the same address from the router).
Uses SNTP to get the time from pool.ntp.org and sets the system clock.
Spins up the broadcaster thread and configures it to send broadcasts every 20 seconds.
starts the LED cadence timer and sets the cadence to the 'idle' pattern.
opens a server socket, configures it to listen and calls socket.Accept() to wait for a connection on the assigned port. This is essentially the main application loop, which spends most of its time blocked waiting for a client connection. When a connection is received, the connected socket is handed off to a worker thread to process any and all requests from that client, while the main loop goes around and waits to accept the next connection.
Most of the actual work is done in worker threads or in a timer callback, while the main application loop spends most of its time blocked on socket.Accept().
There is a corresponding PC application that listens for the UDP broadcasts and displays the data, including the timestamp, so I can tell at a glance what time the netduino last successfully sent a broadcast.
Once I had noticed that the problem occurred almost on the hour, I thought I might be jumping to conclusions so I thought I'd better do some more controlled tests. Over a 24-hour period I ran the code several times, noting the time when I pressed the Netduino reset button, the time that the PC received its first data broadcast and the time that the broadcasts stopped. Here are the results:
Start time First broadcast Last broadcast received
(PC time, UTC+1) (UTC) (UTC)
=================================================
21:12:00 n/a22:00:59
00:17:0023:17:2806:00:51
10:57:0009:57:2910:00:50
11:16:0010:16:5114:01:01
21:05:4520:06:1422:00:46
Spooky! These points of data make a beautiful line... Something odd is happening just after each hour change.
I can reason about my code a draw a few conclusions:
I know the PC app hasn't crashed because if I press RESET on the Netduino, it continues receiving new broadcasts.
I know the Netduino hasn't completely reset because it would re-start the broadcasts and command interpreter.
When the network traffic stops, BOTH the listener AND the broadcaster are broken. These are in completely separate independent bits of code, one running in its own thread. I find it unlikely therefore that they should both stop working at the same moment, unless the problem is related to the network stack.
I know the netduino hasn't hung completely because the timer-based flashing LED continues to flash, indicating that the Netduino is in some part alive and well and running my code.
I think it HAS to be a network related issue. My hunch, and it IS just a hunch with no supporting evidence, is that there's something odd going on with DHCP. My router is configured to give a "permanent lease" and the Netduino has an address reservation configured in the router, so it SHOULD always gets the same IP address (etc) from the router.
Do you have Wireshark, and either an old-fashioned hub or a switch/router which will copy all packets to all ports? The easiest way to debug the networking traffic-on-the-wire side of this is by using Wireshark.
Is there any special action you take based on packets received after the start of a new hour? Extra threads (or blocking threads) that happen on that hour marker? Networking does live at the managed code layer--so if there's a blocked thread within an event handler...NETMF may not be so happy to let network traffic through.
This is a very curious issue, very interesting. Let's figure it out.
My code doesn't depend on time in any way, other than indirectly, i.e.:
It uses a timer to flash the LED
It uses Thread.Sleep in the broadcast loop
It reads DateTime.UtcNow in order to put it in the network broadcast data.
Other than that, there's no time-dependent code, which is why it is such an odd issue.
I have Microsoft's Network Monitor which is a lot like WireShark. I'm not sure how I can capture traffic from the Netduino to the router though, as it all goes through a switch. I would see the broadcasts of course, but I suspect none of the 'direct' traffic. I'll investigate whether there is something I can do with SysLog on the router.
Thanks for taking an interest. I very much appreciate your help in getting to the bottom of this...
One question. Would socket.Accept() ever time out and return without there being an incoming connection?
Regards,
Tim Long
P.S. I'd be happy to share the code if you think it will help.
The router logging is pretty useless, however my main switch is a managed switch and has a feature that lets me put one of the ports into 'promiscuous mode' so I can monitor all the traffic going through it. Currently I don't have a long enough cable so it will take me a day or two to get one, then I can try monitoring the traffic.
The broadcast loop is on its own thread, right? Not blocking a managed event from finishing its processing?
I have Microsoft's Network Monitor which is a lot like WireShark. I'm not sure how I can capture traffic from the Netduino to the router though, as it all goes through a switch. I would see the broadcasts of course, but I suspect none of the 'direct' traffic. I'll investigate whether there is something I can do with SysLog on the router.
You'll want Wireshark. It's awesome. And then in the toolbar, add the filter "eth.dst == 5c:86:4a:xx:xx:xx || eth.src == 5c:86:4a:xx:xx:xx" where xx:xx:xx is the rest of the MAC address on your Netduino. If you want to see broadcast traffic too, add "|| eth.dst == ff:ff:ff:ff:ff:ff".
One question. Would socket.Accept() ever time out and return without there being an incoming connection?
I don't believe that socket.Accept() should timeout unless the socket is closed/disposed. But if you're seeing different behavior (and it's inconsistent with the desktop implementation) we can tweak the behavior to match expectations.
P.S. I'd be happy to share the code if you think it will help.
Let's see if we can provide enough info that you can track down the issue on your side...and if we run into a brick wall this might be a good option for deep debug. Let's try to get it to fail quicker (or 100% consistently) first though...unfortunately intermittent non-reproducible issues can't get quick resolutions. [It sounds like this is reproducible though...which is great!]
however my main switch is a managed switch and has a feature that lets me put one of the ports into 'promiscuous mode' so I can monitor all the traffic going through it.
Perfect. Let's get you a copy of Wireshark and see what's actually happening on the wire.
Also...if you can simplify your code down to a few dozen lines of code that all run on one thread (if it's not already that small), that will be really useful for reproducing the issue.
It's absolutely reproducible - the network stack fails just after the hour... but not EVERY hour, unfortunately; sometimes it will go for 4 or 5 hours and then fail, right on the hour. I haven't figured out how to make it fail on demand but there is probably a lot I can do to narrow it down.
I never noticed the pattern until I started setting my system clock via SNTP; before that it looked completely random.
I will see if I can make a repro solution with as little code as possible in it. I think you're dead right on that, literally *anything* could be happening; the root cause could be something that the weather station is doing (it has its own clock and it is suspiciously about 2 minutes behind my PC clock) so I need to eliminate all that as a possible cause. Watch this space.
I managed to catch the one-minute-past-the-hour bug happening today in the debugger. It looks like there is a deadlock down at the link layer. When the issue happened, my code was still running but the thread that was sending UDP broadcasts suddenly stops working. I managed to get a stack trace and this is what's happening:
Unfortunately I can't see the parameter values, but it looks like the WaitHandle.WaitOne is never timing out and never getting signalled. Therefore, the entire thread is deadlocked.
I know this happened at approximately 1 minute past the hour, because the thread sends a broadcast every 10 seconds and the last successful broadcast was in my debug output as follows:
There's a time stamp right at the end of the JSON document, 06:01:35.626Z.
This problem is very reproducible. It doesn't happen every hour, but when it does happen it is always just after 1 minute past the hour.
It isn't just the sending side that's hung. My main thread sits listening for socket connections and that is frozen too (I can no longer telnet into it, the connection simply times out in the telnet app). Looking at the call stack for that thread, it's a very familiar picture:
I've been meditating on the 'one-minute-past-the-hour' thing and having a general poke around in the firmware. It's a lot to take in all in one session, but I think I've uncovered a fairly critical race condition right here (STM32_time_functions.cpp, line 168):
If the timer overflows at the point marked, then the overflow (carry) has been missed but the timer has wrapped back to 0, so that would result in time _going backwards_ at which point HAL_COMPLETION::DequeueAndExec() will not be able to schedule any further events until the timer overflows again.
This might explain what I'm seeing where lots of things all seem to just stop working at one minute past the hour (some hours).
A rough calculation suggests that the timer will wrap every 71 minutes (I was hoping it would be 61... but that odd 1-minute is in there). I would have thought the chances of hitting this race condition are fairly small, but I've been doing embedded development since 1986 and I'm seen some pretty strange things happen...
It strikes me that I should get set up to build the firmware at this point, if I am to chase this down any further. What do I need? Is there a HowTo on building the firmware?
After I've spent what amounts to several days digging into this issue, it would be nice to know that someone is at least reading these threads... I would call this a 'show stopper'...
Digging a bit deeper, when the network stack is frozen, I have several threads all stuck in WaitHandle.WaitOne(), backtracking up the call stack shows that the calls originated as follows:
Main thread - TcpSocket.Accept
ILinkLayer.SendFrame
ArpResolver.SendArpGenericThread
IpV4Layer.LoopbackInBackgroundThread
ICMPv4Handler.SendIcmpMessagesThread
DHCPv4Client.DhcpStateMachine
EthernetInterface.Send
TcpSocket.TcpStateMachine
It looks like the network stack is well and truly deadlocked. Interestingly, my target system is still alive and flashing an LED, that thread is based on a timer callback originating from a System.Treading.Timer,and that is going just fine.
Since nothing I can do should be able to deadlock the network stack like that, I'd say this is an indicator for a pretty major bug, either in the network stack itself or in the underlying firmware.
I think this is serious enough that someone ought to be paying it a lot more attention than it feels like they are. At the very least, this is a critical bug if not a showstopper!
So the good news here is that it appears that the stack is not locking up the board (unlike issues we had with lwIP). The bad news is that there is some weird blocking issue going on.
My first two guesses: 1. NETMF tends to execute one managed hardware event handler at a time, queued, in series. If a project has code in a managed event handler which is not exiting and which is blocking the network stack from getting interrupts...you can see this kind of behavior. This is one of the tradeoffs of the NETMF threading model. [Side note: we are working on some concepts that would actually improve NETMF in this area without breaking backwards code compatibility. But nothing shipping today.] 2. A core runtime bug. If we can reproduce the bug in managed code, and we can whittle down the sample code to a manageable size, we can debug the native issue and work with Microsoft to remedy it in the NETMF core.
Can you send me a copy of the source, along with repro instructions on how to repro the issue consistently? And also your board type, firmware version, etc.? We'll repro the issue here and work with you to resolve the issue (either through guidance on how to modify the sample or through a core bugfix).
Thanks so much for all your effort testing out the new bits!
If the timer overflows at the point marked, then the overflow (carry) has been missed but the timer has wrapped back to 0, so that would result in time _going backwards_ at which point HAL_COMPLETION::DequeueAndExec() will not be able to schedule any further events until the timer overflows again.
Sleuthing Awesome.
NETMF uses MachineTicks for all timer/thread scheduling (which is based off of the overflow bits added to the core timer bits. There's always a possibility of a bug in the timing code--but other bug scenarios are a lot more likely. We beat up on timer code quite a bit.
After I've spent what amounts to several days digging into this issue, it would be nice to know that someone is at least reading these threads...
We're definitely here, following along. Just trying to get a handle on what is going on behind the scenes. This post was really helpful. See below. And thanks for all your test time!
Digging a bit deeper, when the network stack is frozen, I have several threads all stuck in WaitHandle.WaitOne(), backtracking up the call stack shows that the calls originated as follows:
Main thread - TcpSocket.Accept
ILinkLayer.SendFrame
ArpResolver.SendArpGenericThread
IpV4Layer.LoopbackInBackgroundThread
ICMPv4Handler.SendIcmpMessagesThread
DHCPv4Client.DhcpStateMachine
EthernetInterface.Send
TcpSocket.TcpStateMachine
Okay, now this is interesting. This is your _stack trace_??? Can you grab a screen capture of this? DHCP and ICMP shouldn't really be involved at all with a TCP packet send...
This scenario really feels like a "NETMF event queue is being blocked" kind of issue... Let's drill into your code sample and figure out what's up.
I might have caused confusion there. That wasn't a stack trace as such. What I did was look at each thread in turn, that was blocked in a WaitOne() call, and backtrack up the stack to the first bit of managed code, i.e. where in Netduino.IP the call originated from. So each item in the list is essentially the top of the managed code stack for each thread.
The difficulty at the moment is that I have custom hardware that you will not have. I can certainly give you access to my code but its not open source - can you provide me with a BitBucket login so I can add you to the repo? I would push to GitHub but they don't allow private repos.
I will also see if I can reproduce the lockup without having to have my custom hardware attached. It would be good to eliminate that from the picture anyway.
Today I tried a simple app that just toggles some LEDs and uses an AutoResetEvent to synchronize two threads, that is not locking up at all as far as I can tell. I'm going to introduce some network code into it later today and see if that has any effect.
OK so here is the next installment. The simple program I mentioned above, it has been running for over 18 hours and I _thought_ it was working perfectly, but I've just noticed that the Debug.Print output has stopped. It stopped at... wait for it... 14:01:45 UTC. One minute past the hour!!
OK this is a really simple program and requires only a few LEDs so it is something you can easily reproduce in-house. The program uses 4 LEDs on D8..D11. It uses a pair of threads, called Ping and Pong, each thread controls one LED, the two threads use an AutoResetEvent to synchornize. I repeated that by creating another instance of the class, so another 2 threads and another 2 LEDs, just because I already had them plugged in. So a total of 4 LEDs and 4 threads. That is all working fine, the LEDs are blinking away even as I type.
However, the Ping thread produces Debug.Print output and that has stopped working.
I started the program at 04:13 UTC (yes I am a night owl!); the debug output stopped at 14:01:45 UTC and it is now 18:30 UTC. So it took 10 hours for the debug output to stop.
I've attached the full Debug.Print output in case its relevant.
I'm using Netduino Plus 2, and the firmware is the Netduino.IP release, which I think you are calling Netduino.IP v1.0.0. I'm monitoring the debug output in Visual Studio 2013 Ultimate.
I will try to leave the system running as long as possible in case you want me to look at anything else, but probably easier if you try to repro it yourself, in house.
One month later and no reply. I guess that clearly demonstrates the level of commitment we can expect from SecretLabs when it comes to chasing down really serious showstopper bugs.
Come on Chris, I put literally days of effort into this, I'm really disappointed that you haven't picked up the ball and run with it.
Also the com port issues, for which I was able to reproduce your findings, there pretty much useless to me at this point, because I can not use flow control
Could you please take some time and investigate seems that our forum members are trying to get stuff done but are hold back by this timer issue at least give Tim a reply..