Here are some things I have done that may or may not help:
* Run your unit attached to the debugger until it fails. There is sometimes useful output you cannot otherwise catch.
* Oh, before that: can you reliably stimulate the failure? The more reliably you can stimulate it, the more confidence you can have in your tests.
* Sometimes the easiest way to find a bug is to make it worse; a kind of 'homeopathic debugging'. For example, if you think you have memory problems, you could allocate a chunk to waste some, and see how that changes things. You have to do this judiciously, because obviously you can allocate a big chunk and make the system fail outright. Rather, can you allocate a small chunk that should surely not cause a failure by itself, yet causes you to fail significantly sooner or more often?
* Since the code is .NET, run it on the desktop. You will be able to debug more easily, and if it fails there then you know you have a bug. If it does run there, you haven't disproved a bug, but you have at least a little more confidence.
* Make a logger and log to a file. This is particularly useful if it is impractical to run in the debugger, or if the presence of the debugger masks your problem (it happens). Be aware of some existing file-writing issues, though: folks have had problems with their final writes not being committed, and that sort of thing really diminishes the value of a debug log, since of course the interesting bits up to the point of failure are at the end. Also know that for certain root causes, logging can mask the origin of the problem, and that for certain failures, logging can exacerbate your problem! But don't let that stop you.
* If your code is running in a loop, toggle the onboard LED; this is perhaps the least disruptive and simplest way of showing 'my code is running here', which can be useful to prove the existence of lockups and paths of execution. Also, you have two lights at your disposal: the blue onboard LED, but you can also programmatically turn off the white power one, so that makes a potential second.
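The LED-heartbeat idea might look like this minimal sketch. (Pin names are board-specific; `Pins.ONBOARD_LED` and the Netduino SDK namespace are assumptions here -- substitute whatever your board's SDK provides.)

```csharp
using System.Threading;
using Microsoft.SPOT.Hardware;
using SecretLabs.NETMF.Hardware.Netduino; // board-specific; adjust for your hardware

public class HeartbeatDemo
{
    public static void Main()
    {
        // ONBOARD_LED is the Netduino name; your board's pin enum will differ
        OutputPort led = new OutputPort(Pins.ONBOARD_LED, false);
        bool state = false;

        while (true)
        {
            DoWork();         // your actual loop body
            state = !state;
            led.Write(state); // if the blinking stops, this loop has stopped
        }
    }

    private static void DoWork()
    {
        Thread.Sleep(250);    // placeholder for real work
    }
}
```

If the blink rate changes or the blinking stops, you get a rough, zero-cost indication of where execution went without attaching anything.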
Some other stuff:
* You said a 'lockup'. Try to qualify 'lockup' to mean either a thread that stops running, or a piece of code that is running but doesn't execute the way you want (which simply manifests the same symptoms). I mention this because there are some instances of the halted-thread kind, related to networking, that you can find elsewhere on this forum (sorry, no direct link, but I remember a recent thread titled something like 'we have reproduced a lockup').
* If it is a halted thread of execution, see if other threads in your app are similarly halted, or just that one. You only have a single thread? Well, make another. (Actually I would suggest this as a design choice anyway, but that's separate.)
* If the other threads are responsive, then it could be a problem with some driver (e.g., the networking point above). If you can find the call that halts, then you might be able to figure out a way to work around it.
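One way to tell a single halted thread from a system-wide lockup is to have each worker stamp a shared "last seen" time while an observer reports who has gone quiet. A minimal sketch (the thread bodies and names are illustrative):

```csharp
using System;
using System.Threading;
using Microsoft.SPOT;

public class LivenessDemo
{
    // One "last checked in" timestamp per worker (names are illustrative)
    static DateTime _workerA = DateTime.UtcNow;
    static DateTime _workerB = DateTime.UtcNow;

    public static void Main()
    {
        new Thread(() =>
        {
            while (true) { Thread.Sleep(100); /* real work here */ _workerA = DateTime.UtcNow; }
        }).Start();

        new Thread(() =>
        {
            while (true) { Thread.Sleep(100); /* real work here */ _workerB = DateTime.UtcNow; }
        }).Start();

        // The main thread acts as observer: report which workers have gone quiet
        while (true)
        {
            Thread.Sleep(5000);
            DateTime now = DateTime.UtcNow;
            Debug.Print("A quiet for " + (now - _workerA).Seconds + "s, " +
                        "B quiet for " + (now - _workerB).Seconds + "s");
        }
    }
}
```

If only one timestamp goes stale, you're looking at a wedged call in that thread (likely a driver); if they all go stale at once, the problem is system-wide.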
OK, you asked about memory:
* Debug.Print("mem: " + Debug.GC(true));
This will print out your currently free memory. It will also stimulate a compaction. (NB: I wrote this from memory as I post, so double-check for typos.)
* NETMF is supposed to stimulate automatic GC when needed, and maybe it does much of the time, but I'm telling you from experience that you will probably want to do a Debug.GC(true) yourself periodically. I have definitely restored a system to functionality simply by adding those calls prior to functional steps which create a bunch of temp objects. But I suggest you don't do this as a first line of defense. Rather, find the failing point, see if you can repeatably stimulate the failure, then add the line and see if the behaviour changes dramatically.
* When considering the free memory reported by Debug.GC(true), realize that this figure is measured after compaction, meaning the act of looking at the memory available has changed the availability of that memory (for the better). So if it is failing and you notice 'I have 30k available', then really it was probably failing when you had more memory available, but fragmented such that it was not usable.
* NETMF, when running low, will first become 'softly' unstable. By this I mean that calls will internally fail due to memory problems, but not throw exceptions that you can catch and handle, leaving the system in what appears to the program to be a functional state, but in fact not so much. If you have your debugger attached, you can see the internal failures happen, with the call returning rather than throwing. I find this starts to happen when you get down to about 20K free as reported by GC(true), but this is a soft limit that I'm sure depends on my app's use of memory.
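Putting those last few points together, a small helper for periodic memory reporting plus a forced collection before an allocation-heavy step might look like this. (The 20K threshold is just the rule of thumb above, not a documented NETMF limit.)

```csharp
using Microsoft.SPOT;

public class MemoryHelper
{
    // Rule-of-thumb threshold from experience, not a documented limit
    const uint LowWaterMark = 20 * 1024;

    // Force a collection + compaction and report what's free afterwards.
    // Remember: the number is post-compaction, so the pre-call situation
    // may have been worse (fragmented) than this figure suggests.
    public static uint Report(string where)
    {
        uint free = Debug.GC(true);
        Debug.Print(where + " free mem: " + free);
        if (free < LowWaterMark)
        {
            Debug.Print(where + ": entering the 'softly unstable' zone");
        }
        return free;
    }
}

// Usage, e.g. around a step that churns a lot of temporary objects:
//   MemoryHelper.Report("before parse");
//   ParseBigResponse();   // hypothetical allocation-heavy step
//   MemoryHelper.Report("after parse");
```

Sprinkling Report() calls at a few choke points gives you a trend over time, which is more useful than any single reading.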
Eventually, you may get to a point where you decide you have found the bug and it's out of your control to fix it. Hopefully not, but it can happen. Then you have to consider workarounds. Some might be:
* Can you detect yourself being in a state that will result in failure, in time to do something about it?
* If you can, is it a viable option to stop your web service, GC(true), then start it back up? No?
* If you can't detect it beforehand and react, can you observe it from afar? If only one thread is 'locked up', can you use a second, non-locked-up thread to observe that the first has failed, and at least reset the board? (You can also forcibly terminate a thread, but this generally introduces more instability; better to just reboot. Boot is quite quick on an embedded system.)
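The observe-from-afar idea can be sketched as a software watchdog: the watched thread "kicks" a timestamp at each point it proves liveness, and an observer thread reboots the board if the kicks stop. (The 30-second window is an arbitrary example; PowerState.RebootDevice comes from Microsoft.SPOT.Hardware, as I recall.)

```csharp
using System;
using System.Threading;
using Microsoft.SPOT;
using Microsoft.SPOT.Hardware;

public class SoftWatchdog
{
    static DateTime _lastKick = DateTime.UtcNow;

    // The watched thread calls this at each point it proves liveness
    public static void Kick() { _lastKick = DateTime.UtcNow; }

    public static void StartObserver()
    {
        new Thread(() =>
        {
            while (true)
            {
                Thread.Sleep(10 * 1000);
                long quietTicks = (DateTime.UtcNow - _lastKick).Ticks;
                // 30s without a kick: assume the watched thread is wedged.
                // Don't try to kill it -- that adds instability; just reboot.
                if (quietTicks > TimeSpan.TicksPerSecond * 30)
                {
                    Debug.Print("watched thread is quiet; rebooting");
                    PowerState.RebootDevice(false); // hard reboot
                }
            }
        }).Start();
    }
}
```

This pairs naturally with the timeout-variant blocking calls mentioned below: every timeout expiry is a convenient place to call Kick().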
Here are a couple of tales from the trenches, from my personal experience:
* I have a component which operates a GSM modem. I use regexes to do my parsing. These are monstrously convenient, but they are also monstrously memory-intensive. I cannot precompile all the machines I need beforehand, so I build them on the fly as needed and throw them away afterwards. While this is 'wasteful' of CPU, I actually have way more than enough CPU resources and am tight on RAM, so this is a good time/size tradeoff. But it introduces a new problem of memory fragmentation. It was difficult to narrow down the point of failure, because it moved around in the code, but stayed near-ish one area. I could deliberately manipulate the problem by allocating chunks of a few K, and I could also mask the problem by doing a GC(true). So I concluded it was probably memory fragmentation. (I say 'probably' because 'proven' would require an actual heap walk, which is not easy to do. I was satisfied with 'probably', and instead used testing over several days of continuous running to diminish the uncertainty. Oh, BTW, two boards are better than one, if you can afford it: you can have one board doing a lifecycle test while you continue to develop on the other.)
* As a design style, I mostly prefer blocking calls and threads. However, I always (well, almost always) use a variation with a timeout value. This helps because your thread can at least actively let someone know it is alive when it times out: flick an LED, check in with a watchdog, log a message, whatever. And 'whatever' can be stuff you haven't thought of yet, like 'add a call to GC(true)'.
* I use a bunch of serial ports. The orthodox way of using them is to set up a DataAvailable event handler that is invoked when data is available. However, I (and others) have found this event to inexplicably stop being fired. I ran the same code on the desktop, using one of those FTDI USB serial-port devices connected to my hardware, and found it to run perfectly, so I felt pretty good about the code being 'correct'. I was initially worried about re-engineering the code to not use the DataAvailable event, but then I realized I could emulate that event with a worker thread, and surgically slip that into my already-developed code. The point being that sometimes you have to devise a workaround for a defect that does not originate in your own code.
* I (and others) have had some deadlock issues with networking. In my experience, I can reliably deadlock when I perform a socket connect() to a host that is reachable but not listening on the port to which I am trying to connect (e.g., the listening server is down for whatever reason). I have also found that this only deadlocks the thread making that call; all the others are still responsive. That call does not have a timeout option (and I don't know if one would work anyway, since it's not supposed to lock like this to begin with). However, it is possible for me to reliably detect that the 'lockup' has occurred, and I can reliably reboot the board when it has. Beautiful? No, it's hideous, but my alternative is to not deliver product. This scenario, though a 'negative case' (in the positive case there is always my infrastructure server listening for incoming calls), is actually easy to stimulate: if I need to upgrade the server binary, then it will be offline during the brief moment that I replace the binary and restart it. This will then cause all my clients (i.e., boards) in the field to become immediately deadlocked, each requiring a power cycle. But since I have modded the firmware to detect this scenario and reboot, those units are now able to restore themselves to functionality.
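Since connect() itself offers no timeout, the detect-and-reboot approach from that last tale amounts to running the connect on a worker thread and watching it from outside. A sketch under those assumptions (the method name and timeout are illustrative; Thread.Join(milliseconds) returns false if the thread is still running):

```csharp
using System.Net;
using System.Net.Sockets;
using System.Threading;
using Microsoft.SPOT;
using Microsoft.SPOT.Hardware;

public class ConnectGuard
{
    // Attempt a socket connect on a worker thread so that, if the call
    // wedges (host reachable but not listening), only that thread hangs.
    // Returns the connected socket, null on a normal failure, or reboots
    // the board if the call never comes back at all.
    public static Socket ConnectOrReboot(IPEndPoint server, int timeoutMs)
    {
        Socket s = new Socket(AddressFamily.InterNetwork,
                              SocketType.Stream, ProtocolType.Tcp);
        bool connected = false;

        Thread worker = new Thread(() =>
        {
            try { s.Connect(server); connected = true; }
            catch (SocketException) { /* normal failure: fall through */ }
        });
        worker.Start();

        if (!worker.Join(timeoutMs))
        {
            // connect() never returned: the worker is wedged. Terminating
            // it tends to add instability, so just reboot the board.
            Debug.Print("connect() wedged; rebooting");
            PowerState.RebootDevice(false);
        }
        return connected ? s : null;
    }
}
```

Hideous, as I said, but the board restores itself instead of sitting dead in the field waiting for a power cycle.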