The Netduino forums have been replaced by new forums at community.wildernesslabs.co. This site has been preserved for archival purposes only and the ability to make new accounts or posts has been turned off.

N+ 2 "hangs" after a while. How to debug?

Tags: debug, debugging

7 replies to this topic

#1 Niels R.

    Member • 28 posts

Posted 03 July 2013 - 06:24 AM

I have an N+ 2 running a simple program: get the values of 2 sensors every 30 seconds and send them to a web service.

 

After a while the web service stops receiving requests from the N+. When I look at the N+ I still see the LAN LED flicker from time to time.

 

I've deployed the same code, with some additional Debug.Print() lines, to another N+ and ran it connected to VS.

 

After a while I see the debug messages (and requests at the web service) have stopped, but no Exception message is shown in the debugger window.

 

How do I start debugging this? Any tips?

Additionally: How can I monitor the memory/CPU usage of my device?

 

Niels



#2 Niels R.

    Member • 28 posts

Posted 04 July 2013 - 01:47 PM

Anyone?



#3 ziggurat29

    Advanced Member • 244 posts

Posted 04 July 2013 - 03:33 PM

Here are some things I have done that may or may not help:

 

*  run your unit attached to the debugger until it fails.  there sometimes is useful output you cannot otherwise catch.

*  oh, before that, can you reliably stimulate the failure?  the more reliably you can stimulate it, the more confidence you have in your tests.

*  sometimes the easiest way to find a bug is to make it worse; a kind of 'homeopathic debugging'.  For example, if you think you have memory problems, you could allocate a chunk to waste some and see how that changes things.  Obviously you have to do this judiciously: allocating a big chunk will simply make the system fail outright.  Rather, can you allocate a small chunk that surely should not cause a failure by itself, yet causes you to fail significantly sooner or more often?

*  since the code is .net, run it on the desktop.  you will be able to debug more easily, and if it fails there then you know you have a bug.  if it does run there, then you haven't disproved a bug, but you have a little more confidence at least.

*  make a logger and log to a file.  this is particularly useful if it is impractical to run in the debugger, or if the presence of the debugger masks your problem (it happens).  alas, be aware of some existing file writing issues (specifically, folks have had problems with their final writes not being committed, and that sort of thing really diminishes the value of a debug log since of course the interesting bits up to the point of failure are at the end).  also know that for certain root causes, logging can mask the origin of the problem. also know that for certain failures, logging can exacerbate your problem! but don't let that stop you.

*  if your code is running in a loop, toggle the onboard led; this is perhaps the least disruptive and simplest way of showing 'my code is running here', which can be useful to prove the existence of lockups and paths of execution (see the heartbeat sketch just after this list).  also, you have two lights at your disposal:  the blue onboard led, but you can also programmatically turn off the white power one, so that makes a potential second.
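
Here is a minimal heartbeat sketch of that idea in C# for NETMF.  It assumes the SecretLabs SDK's Pins.ONBOARD_LED constant and namespace; adjust those for your particular board/SDK, and drop your real work in where the comment says so:

    // minimal 'my code is running here' heartbeat (sketch, not drop-in code)
    using System.Threading;
    using Microsoft.SPOT.Hardware;
    using SecretLabs.NETMF.Hardware.Netduino;   // assumed SDK namespace; adjust for your board

    public class Program
    {
        public static void Main()
        {
            OutputPort led = new OutputPort(Pins.ONBOARD_LED, false);
            bool state = false;

            while (true)
            {
                // ... your real work here (read the 2 sensors, call the web service) ...

                // toggle the LED once per pass; if it stops blinking,
                // this loop (or something it calls) is stuck
                state = !state;
                led.Write(state);

                Thread.Sleep(30000);   // the 30-second sensor interval from the original post
            }
        }
    }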

 

 

some other stuff:

*  you said a 'lockup'.  try to qualify 'lockup' to mean either a thread that stops running, or a piece of code that is running but doesn't execute the way you want (which simply manifests the same symptoms).  I mention this because there are some instances of a halted thread related to networking that you can find elsewhere on this forum (sorry, no direct link, but I remember a recent thread titled something like 'we have reproduced a lockup').

*  if it is a halted thread of execution, see if other threads in your app are similarly halted, or just that one.  you only have a single thread?  well, make another.  (actually I would suggest this as a design choice anyway, but that's separate)

*  if the other threads are responsive, then it could be some problem with some driver (e.g., to the earlier point, the networking).  if you can find the call that halts, then you might be able to figure out a way to work around it.

 

ok, you asked about memory:

*  Debug.Print ( "mem:  " + Debug.GC(true) );

  this will print out your currently free memory.  it will also stimulate compaction.  (NB: I wrote this from memory as I posted, so double-check for typos)

*  netmf is supposed to trigger automatic GC when needed, and maybe it does much of the time, but I'm telling you from experience that you will probably want to call Debug.GC(true) yourself periodically (a small sketch of doing this follows this list).  I have definitely restored a system to functionality simply by adding those calls prior to functional steps which create a bunch of temp stuff, but I suggest you don't do this as a first line of defense.  Rather, find the failing point, see if you can repeatably stimulate the failure, then add the line and see if the behaviour changes dramatically.

*  when considering the free mem as a result of Debug.GC(true), realize that this is after compaction.  Meaning that the act of looking at the memory available has changed the availability of that memory (for the better).  So if it is failing and you notice 'I have 30K available', then really it was probably failing when you had even more memory available, but fragmented such that it was not usable.

*  netmf, when running low, will first become 'softly' unstable.  by this I mean that calls will internally fail due to memory problems, but not throw exceptions that you can catch and handle, leaving it in what appears to the program to be a functional state, but in fact not so much.  If you have your debugger attached, you can see the internal failures happen, with the call returning rather than throwing.  I find this starts to happen when you get down to about 20K free as reported by GC(true), but this is a soft limit that I'm sure depends on my app's use of memory.
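
To make the periodic check concrete, here is a small sketch of a background thread that logs free memory with Debug.GC(true) once a minute.  The 20K threshold and the one-minute interval are only the rule-of-thumb numbers from this post, nothing official:

    // periodic free-memory logger / compactor (sketch)
    using System.Threading;
    using Microsoft.SPOT;

    public static class MemoryMonitor
    {
        public static void Start()
        {
            new Thread(Run).Start();
        }

        private static void Run()
        {
            while (true)
            {
                // force a collection + compaction and report what's left
                uint free = Debug.GC(true);
                Debug.Print("free mem after GC: " + free.ToString());

                if (free < 20 * 1024)
                {
                    Debug.Print("WARNING: memory getting low; expect 'soft' failures");
                }

                Thread.Sleep(60000);   // check once a minute
            }
        }
    }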

 

Eventually, you may get to a point where you decide you have found the bug, and it's out of your control to fix it.  Hopefully not, but it can happen.  Then you have to consider work-arounds.  Some might be:

*  can you detect yourself being in a state that will result in failure, in time to do something about it?

*  if you can, is it a viable option to:  stop your web service, GC(true), then start it back up?  No?

*  if you can't detect it beforehand and react, can you observe it from afar?  If only one thread is 'locked up', can you use a second, non-locked-up thread to observe that the first has failed, and at least reset the board?  (you can also forcibly terminate a thread, but this generally introduces more instability.  better to just reboot; boot is quite quick on an embedded system.)  a rough sketch of this observe-and-reboot pattern follows.
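
As a sketch of that observe-and-reboot idea (the names and the 2-minute limit are mine, purely as an example):  the worker thread stamps a shared 'last alive' time each pass, and a monitor thread reboots the board via PowerState.RebootDevice if the stamp goes stale:

    // 'observe from afar and reboot' sketch
    using System;
    using System.Threading;
    using Microsoft.SPOT;
    using Microsoft.SPOT.Hardware;

    public static class LockupMonitor
    {
        private static DateTime _lastAlive = DateTime.Now;
        private static readonly object _lock = new object();

        // the worker thread calls this every time it completes a loop pass
        public static void ImAlive()
        {
            lock (_lock) { _lastAlive = DateTime.Now; }
        }

        public static void Start()
        {
            new Thread(Run).Start();
        }

        private static void Run()
        {
            while (true)
            {
                DateTime last;
                lock (_lock) { last = _lastAlive; }

                if (DateTime.Now - last > new TimeSpan(0, 2, 0))   // example: 2 minutes
                {
                    Debug.Print("worker appears hung; rebooting");
                    PowerState.RebootDevice(false);   // hard reboot
                }

                Thread.Sleep(5000);
            }
        }
    }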

 

Here are a couple of tales from the trenches from my personal experience:

 

*  I have a component which operates a GSM modem.  I use regexes to do my parsing.  These are monstrously convenient, but they are also monstrously memory intensive.  I cannot precompile all the machines I need beforehand, so I build them on the fly as needed and then throw them away afterwards.  While this is 'wasteful' of CPU, I actually have way more than enough CPU resources and am tight on RAM, so this is a good time/size tradeoff.  But it introduces a new problem of memory fragmentation.  It was difficult to narrow down the point of failure, because it moved around in the code, though it stayed near-ish one area.  I could deliberately manipulate the problem by allocating chunks of a few K, and then I could also mask the problem by doing a GC(true).  So I concluded probably memory fragmentation issues.  (I say 'probably' because 'proven' would require an actual heap walk, which is not easy to do.  But I was satisfied with 'probably', and instead I used testing over several days of continuous running to make the uncertainty diminish.  Oh, BTW, two boards are better than one, if you can afford it.  You can have one board doing a lifecycle test while you continue to develop on the other.)

 

*  As a design style, I mostly prefer blocking calls and threads.  However, I always (well, almost always) use a variation with a timeout value.  This helps because your thread can at least actively let someone know that it is alive when it times out: flick an LED, check in with a watchdog, log a message, whatever.  And 'whatever' can be stuff you haven't thought of yet, like 'add a call to GC(true)'.

 

* I use a bunch of serial ports.  The orthodox way of using the serial ports is to set up a DataAvailable event handler that is invoked when data is available.  However, I (and others) have found this event to inexplicably stop being fired.  I ran this same code on the desktop, using one of those FTDI USB serial port devices connected to my hardware, and found it to run perfectly, so I felt pretty good about the code being 'correct'.  I was initially worried about re-engineering the code to not use the DataAvailable event, but then I realized that I could emulate that event with a worker thread, and surgically slip that into my already developed code (a sketch of that follows this list).  Point being that sometimes you have to devise a workaround to a defect that does not originate in your own code.

 

*  I (and others) have had some deadlock issues with networking.  In my experience, I can reliably deadlock when I perform a socket connect() to a host that is reachable, but not listening on the port to which I am trying to connect (e.g., the listening server is down for whatever reason).  I have also found that this only deadlocks the thread making that call -- all the others are still responsive.  That call does not have a timeout option (and I don't know if it would work anyway, since it's not supposed to lock like this to begin with).  However, it is possible for me to reliably detect that the 'lockup' has occurred, and I can reliably reboot the board in the case that it has.  Beautiful?  No, it's hideous, but my alternative is to not deliver product.  This scenario, though a 'negative case' (in the positive case there is always my infrastructure server listening for incoming calls), is actually easy to stimulate:  if I need to upgrade the server binary, then it will be offline during the brief moment that I replace the binary and restart it.  This will then cause all my clients (i.e. boards) in the field to become immediately deadlocked, each requiring a power cycle.  But since I have modded the firmware to detect this scenario and reboot, those units are now able to restore themselves to functionality.
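
On the serial-port tale a couple of items up:  the 'emulate the event with a worker thread' workaround can look something like the sketch below.  It assumes NETMF's System.IO.Ports.SerialPort; the port name, baud rate, buffer size, and the callback delegate are all placeholders:

    // polling worker thread standing in for the serial data-received event (sketch)
    using System.IO.Ports;
    using System.Threading;

    public class PolledSerialReader
    {
        public delegate void DataHandler(byte[] data, int count);

        private readonly SerialPort _port;
        private readonly DataHandler _onData;
        private readonly byte[] _buffer = new byte[256];

        public PolledSerialReader(string portName, int baud, DataHandler onData)
        {
            _onData = onData;
            _port = new SerialPort(portName, baud, Parity.None, 8, StopBits.One);
            _port.ReadTimeout = 100;
            _port.Open();
            new Thread(PollLoop).Start();
        }

        private void PollLoop()
        {
            while (true)
            {
                // poll instead of relying on the event that sometimes stops firing
                if (_port.BytesToRead > 0)
                {
                    int n = _port.Read(_buffer, 0, _buffer.Length);
                    if (n > 0)
                    {
                        _onData(_buffer, n);   // hand the chunk to your existing handler code
                    }
                }
                else
                {
                    Thread.Sleep(20);   // small sleep so we don't spin the CPU
                }
            }
        }
    }

Usage would be roughly new PolledSerialReader("COM1", 9600, MyExistingHandler), where MyExistingHandler is whatever your event handler used to do.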

 

HTH



#4 Chris Walker

    Secret Labs Staff • Moderators • 7767 posts • Location: New York, NY

Posted 04 July 2013 - 04:43 PM

Hi Niels,

My first instinct is that there may be something wrong in the lwIP network stack which is locking up when a bad network condition happens. Do you happen to have a copy of Wireshark? If you can capture the Netduino<->router communication, you may be able to snag the last data to flow between the two. If we can repro that scenario, we can figure out where the bug is and work to get it fixed in lwIP itself. Watchdogs are also a good overall solution for working around those types of complex bugs, rebooting your device if something locks up in the native-code networking stack.

Chris

#5 Niels R.

    Member • 28 posts

Posted 05 July 2013 - 06:28 AM

ziggurat29: Thank you very, very much for the lengthy answer. This gives me a lot to work with!!! I really appreciate it.

 

I was thinking in the direction of the network stack, as I had this kind of "hang-up" during early development when I was fiddling with the network cables etc.  I have spent some time writing code to avoid these kinds of situations, but it looks like it isn't failsafe at all.

 

Do you mind sharing the "detection code" you use? Or give me some pointers to work with?

 

Chris: As stated above I'm indeed suspecting the network stack. Do you have any info or example code regarding watchdogs?

 

Niels



#6 cranberry

    New Member • 5 posts

Posted 05 July 2013 - 07:02 AM

Niels, which firmware is your ND running? Which version of the .NET SDK do you have installed?

 

My experience is that the .NET Framework SDK 4.3 is very buggy (also concerning networking/the network stack). I'm using code for measuring my power consumption and generation and also sending data to a web service. I'm running FW 4.2.2.2 and version 4.2 of the SDKs.



#7 Niels R.

    Member • 28 posts

Posted 05 July 2013 - 07:05 AM

Niels, which firmware is your ND running? Which version of the .NET SDK do you have installed?

My experience is that the .NET Framework SDK 4.3 is very buggy. I'm using code for measuring my power consumption and generation and also sending data to a web service. I'm running FW 4.2.2.2 and version 4.2 of the SDKs.

 

I don't use beta firmware as I don't want to spend too much time running into undocumented problems. My devices have firmware v4.2.2.2 and are using the v4.2 SDKs (as you do).



#8 ziggurat29

    Advanced Member • 244 posts

Posted 05 July 2013 - 07:50 PM

...

I was thinking in the direction of the network stack, as I had this kind of "hang-up" during early development when I was fiddling with the network cables etc.  I have spent some time writing code to avoid these kinds of situations, but it looks like it isn't failsafe at all.

 

Do you mind sharing the "detection code" you use? Or give me some pointers to work with?

...

 

I don't mind sharing my detection code at all.  You'll be underwhelmed; it's rather simple and not really network specific.  Rather, it is simply 'nonresponsive thread' detection, and it depends on your not using blocking calls (or at least, using blocking calls with a timeout).  I think it will be easier to explain verbally.  Then if you still want code, I'll provide it.

 

1)  I have a 'client thread' that connects to a server, and sends (and receives) data.  In my case it is a socket-level protocol, and over a persistent connection, and not web based, but that detail shouldn't really matter.

 

2)  that thread's logic, oversimplified, looks like this:

    while ( should be connected )
    {
        // 'point A'
        if ( ! connect )
        {
            log failure
            sleep before retrying
        }
        else
        {
            // 'point B'
            if ( poll ( timeoutR, read ) )
            {
                // read and dispatch data
            }
            else if ( haveStuffToWrite and poll ( timeoutW, write ) )
            {
                // write stuffs
            }
            else
            {
                // nothing to do this go-round
            }
        }
        // 'point C'
    }

 

  So, this is basically a blocking loop with timeouts (so strictly it is not really blocking, but you get it) that should always come back around at least once every timeoutR + timeoutW, if not more frequently.  This is the key point.

 

3)  at the points labelled as 'point A', 'B' and 'C', I perform a watchdog 'checkin' action to a separate application component, called, ironically enough, the 'watchdog service'.

 

4)  this 'watchdog service' is a separate worker thread structured this way:

 

member vars:

  hashtable: mapping 'tag' to 'entry'.  a 'tag' is an arbitrary integer that I will explain, and an 'entry' is a struct containing an expiration time (and I add an arbitrary text string for logging)

  hardware watchdog pin (I will explain this later; it's not crucial for this discussion, but I want you to know it's there)

 

  thread function:

    while ( should be running )
    {
        foreach hashtable entry
            if ( entry is expired )
                log failure, and entry text string
                issue reboot

        whack hardware watchdog (if you have one)

        sleep ( watchdog polling interval, about 250ms )
    }

 

  and this component exposes two methods:

 

  checkin ( tag, duration )

  cancel checkin ( tag )

 

5)  at the points labelled 'point A', 'B' and 'C' in the 'client thread' of bullet point 1 above, the calls to 'checkin' and 'cancel checkin' are made.

 

6)  finally, the 'tag' is an arbitrary integer that each of my subsystems has.  for example, the 'client thread' in bullet point 1 above would have some arbitrary number assigned to it.  It can be hard-coded, so long as each 'subsystem' you are tracking uses its own tag consistently in its calls to 'checkin' and 'cancel checkin'.

 

//end=============

 

OK, so here's how it comes together, if you haven't figured it out already:  in your thread of execution that manages your 'thing', be it network or whatever, at any place you want, make a call to 'checkin' with a timeout that specifies 'if you don't hear back from me in XXX seconds, reboot the board'.  Then make sure that you can do another checkin before that time elapses in normal cases.  Subsequent checkins (on the same tag) will delete and replace previous checkins.  If you don't want to track the service anymore, you issue 'cancel checkin', which will just remove the entry from the hashtable of things being tracked.

 

At 'point A', I call checkin with a timeout of 60 sec, because connect can sometimes be a slow operation in real-world cases, but longer than that probably means the board is hung.

 

At 'point B', I call checkin with a timeout equal to timeoutR + timeoutW plus some overhead for correctly processing the data in worst-case scenarios, which in my case I have at 5 sec.

 

At 'point C', my service is shutting down and no longer needs to be tracked, so I call 'cancel checkin'.  (A rough C# sketch of this watchdog service follows.)
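
For reference, here is a rough C# sketch of that software watchdog service.  The class and method names are my own, the reboot uses PowerState.RebootDevice, the hardware-watchdog 'whack' is left as a comment because it is board-specific, and the 250ms polling interval is the value mentioned above:

    // software watchdog service (sketch of the design described above)
    using System;
    using System.Collections;
    using System.Threading;
    using Microsoft.SPOT;
    using Microsoft.SPOT.Hardware;

    public static class WatchdogService
    {
        private class Entry
        {
            public DateTime Expires;
            public string Description;   // free-form text for the reboot log
        }

        private static readonly Hashtable _entries = new Hashtable();   // tag -> Entry

        // "if you don't hear from me again within 'duration', reboot the board"
        public static void Checkin(int tag, TimeSpan duration, string description)
        {
            Entry e = new Entry();
            e.Expires = DateTime.Now + duration;
            e.Description = description;
            lock (_entries) { _entries[tag] = e; }   // replaces any previous checkin for this tag
        }

        // stop tracking this subsystem (e.g. it is shutting down cleanly)
        public static void CancelCheckin(int tag)
        {
            lock (_entries) { _entries.Remove(tag); }
        }

        public static void Start()
        {
            new Thread(Run).Start();
        }

        private static void Run()
        {
            while (true)
            {
                lock (_entries)
                {
                    foreach (DictionaryEntry de in _entries)
                    {
                        Entry e = (Entry)de.Value;
                        if (DateTime.Now > e.Expires)
                        {
                            Debug.Print("watchdog expired: " + e.Description + " -- rebooting");
                            PowerState.RebootDevice(false);
                        }
                    }
                }

                // whack a hardware watchdog here too, if you have one

                Thread.Sleep(250);   // the ~250ms polling interval mentioned above
            }
        }
    }

So at 'point A', for example, the call would look something like WatchdogService.Checkin(NET_CLIENT_TAG, new TimeSpan(0, 1, 0), "net client: connect"), with NET_CLIENT_TAG being whatever arbitrary integer you assign to that subsystem, and 'point C' would call WatchdogService.CancelCheckin(NET_CLIENT_TAG).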

 

OK, so the hardware watchdog bit.  If you don't have one, then ignore it.  However, it is entirely possible for the whole system to become locked up, including the 'watchdog service' I described above.  Then you're hosed -- you will need to physically reboot the system.  So, if you do have a hardware watchdog, then the 'watchdog service' would have the responsibility of whacking it at the point I marked above (at the end of its polling loop).

 

So this mechanism is not only compatible with a hardware watchdog, it actually makes it more useful:  hardware watchdogs typically cannot be stopped, cannot have their time period altered (usually about 1/3 sec), and cannot conditionally monitor several components.  Aside from convenience, this is a bit of a serious issue, because 3rd party (or firmware) code that you cannot modify (such as the internals of what 'connect' does) may easily block for longer than the hardware watchdog timeout period, and then you are hosed.  You can't use it at all.

 

Lastly, in this description I showed my network service running as a worker thread.  You don't really have to do this if you don't want.  The 'watchdog service' does have to run as a separate thread, but if your existing code is a single thread of execution that is fine, because you always can issue 'cancel checkin' at the end of the code sequence you are monitoring.

 

Also, lastly lastly, if you do use this approach, you will be much happier in life if you avail yourself of some logging mechanism and use the free-form text string (or whatever) I indicated in the 'entry' struct of the watchdog service.  That way, you can log why you are rebooting the board:  which code point failed.  Oh, and while developing, don't actually reboot -- that is really annoying when single-stepping code!

 

OK, so that's my numbskull hung-thread detection mechanism:  a software watchdog service.  It's also compatible with an optional hardware watchdog.

 

 

I don't use beta firmware as I don't want to spend too much time running into undocumented problems. My devices have firmware v4.2.2.2 and are using the v4.2 SDKs (as you do).

 

Wise.  And truthfully, I don't use anything greater than 4.2.2.1, because the x.2 source hasn't ever been posted, and I personally experienced some I2C anomalies with x.2 that seemed to go away in x.1.






