The Netduino forums have been replaced by new forums at community.wildernesslabs.co. This site has been preserved for archival purposes only and the ability to make new accounts or posts has been turned off.

netduino/MF performance

Started by Daman, Mar 16 2011 03:25 AM

24 replies to this topic

#1 Daman

New Member

Members
5 posts

Posted 16 March 2011 - 03:25 AM

I have just started playing with the netduino and an FPGA prototyping board. My first project was a very simple program to read a thermistor and display the result on two 7-segment LEDs. I used the FPGA to drive the LEDs.

Excited to get something up and running quickly, I wrote a quick and dirty class that reads an analog input, converts it to a resistance, and then does a brute force search for that resistance in a table to convert it to a temperature (it had to loop about 200 times to find the correct resistance at room temperature). To smooth out the signal, it created a 1k sliding average window.

So I ran the program and thought it was locked up because it was taking so long to get the first reading. It turned out it was taking 20 seconds to initialize the 1k array with the first 1k samples.

Well, since I had the FPGA and a logic analyzer, I started to take some timings, and here is what I discovered. I created a piece of code that simply toggled one of the digital outputs:

myoutputport.Write(true);
myoutputport.Write(false);

I measured a rising edge to falling edge time of ~95 uS.

The following code snippet takes ~23 uS to run:

i = (i + 1) % count;
f = (byte)((sum / count)/10);

i, f, sum, and count are all ints. Given that the netduino has a 48MHz clock, the above code takes about 1100 clock cycles to execute (23 uS x 48 MHz). This cannot be correct. I must be doing something wrong.

I am fairly confident the timings are correct. I put the above code into a large loop so I could do some wall clock timings, and the results confirmed what I was measuring with my 160 nS clock in the FPGA.

I have tried creating release code but get the same results. Can that be? Is this CLR overhead? Do I have something misconfigured?

Any help understanding would be appreciated.

Thanks
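The sliding average described in this post can be kept with a running sum, so that each new sample costs one subtraction and one addition, and a usable average is available before the window is full. The following is only a minimal sketch of that idea, not the poster's class: the name SlidingAverage and its members are hypothetical, and it assumes integer readings.

```csharp
// Hypothetical helper, not code from the thread: a fixed-size sliding average
// that keeps a running sum instead of re-adding the whole window each time.
public class SlidingAverage
{
    private readonly int[] samples;   // circular buffer of the most recent readings
    private int next;                 // index of the slot to overwrite next
    private int filled;               // number of valid samples stored so far
    private long sum;                 // running total of the valid samples

    public SlidingAverage(int windowSize)
    {
        samples = new int[windowSize];
    }

    // Adds a reading and returns the average of the samples seen so far,
    // limited to the most recent windowSize readings.
    public int Add(int reading)
    {
        if (filled == samples.Length)
            sum -= samples[next];     // window is full: drop the oldest sample
        else
            filled++;                 // window is still filling up

        samples[next] = reading;
        sum += reading;
        next = (next + 1) % samples.Length;

        return (int)(sum / filled);
    }
}
```

With a window of 3, adding 10, 20, 30 and 40 yields averages of 10, 15, 20 and 30. On an interpreted runtime every statement is expensive, so the sketch does not remove the per-sample cost; it only avoids waiting for the array to fill before the first reading.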

#2 Mario Vernari

Advanced Member

Members
1768 posts

LocationVenezia, Italia

Posted 16 March 2011 - 05:16 AM

Hi Daman.
I did not fully understand what your problem is.

What I can tell you is that the CLR is a very complex layer. Once you have written your C# code, the compiler creates an intermediate language which is "portable". This is compiled again to native opcodes and sent to the Netduino.
The "overhead" of the CLR lies essentially in the memory management, which is garbage-collected automatically. This process makes it hugely simpler to develop applications and gives you a much safer way to play with objects, but it has a cost in terms of performance.

Just to give an idea of the timings, a tight loop that only reads/writes a port is able to run at about 8.5KHz. That is really "disappointing" for a 48MHz chip, but if you were just toggling an output, it's probably easier to write in assembler or C. Instead, C# + Micro Framework has huge advantages when you want to embed very complex blocks: e.g. sockets, displays, threading, etc.

Hope it helps.
Cheers
Mario

Biggest fault of Netduino? It runs by electricity.

#3 Nevyn

Advanced Member

Members
1072 posts

LocationNorth Yorkshire, UK

Posted 16 March 2011 - 06:51 AM

The following code snippet takes ~23 uS to run:

i = (i + 1) % count;
f = (byte)((sum / count)/10);

The code on the microframework is interpreted and some of the overhead will be due to the instructions being translated each time they are run as no JIT compiler is available.

Regards,
Mark

To be or not to be = 0xFF

Blogging about Netduino, .NET, STM8S and STM32 and generally waffling on about life

Follow @nevynuk on Twitter

#4 Ravenheart

Member

Members
18 posts

LocationBulgaria

Posted 16 March 2011 - 03:31 PM

Hi Daman.
I did not fully understand what your problem is.

What I can tell you is that the CLR is a very complex layer. Once you have written your C# code, the compiler creates an intermediate language which is "portable". This is compiled again to native opcodes and sent to the Netduino.
The "overhead" of the CLR lies essentially in the memory management, which is garbage-collected automatically. This process makes it hugely simpler to develop applications and gives you a much safer way to play with objects, but it has a cost in terms of performance.

Just to give an idea of the timings, a tight loop that only reads/writes a port is able to run at about 8.5KHz. That is really "disappointing" for a 48MHz chip, but if you were just toggling an output, it's probably easier to write in assembler or C. Instead, C# + Micro Framework has huge advantages when you want to embed very complex blocks: e.g. sockets, displays, threading, etc.

Hope it helps.
Cheers
Mario

The .NET Micro Framework is an interpreter. Once your assembly has been stripped down, it is sent to the Netduino and interpreted when it runs. At no point does your C# project get compiled to native code.

EDIT: And I was slower to answer

#5 Corey Kosak

Advanced Member

Members
276 posts

LocationHoboken, NJ

Posted 17 March 2011 - 01:07 AM

Can that be? Is this CLR overhead?

To the above responses I would add the following. Many people on the board are interested in the same performance issues. The central idea is to promote a hybrid programming style where much of the logic of your program is in C#, but a few timing-critical core routines are in native code. How this native code should be provided is the key question, and there are a few ideas being bandied about:

compile your own custom entry point into the firmware
use a dynamic code generation library like my own highly-experimental Fluent library
wait a little longer to find out what Chris Walker has cooked up
wait some unknown amount of time to see if my TinyNGEN project amounts to anything

I personally believe that things are in flux right now, and although it can be frustrating at times, I bet in a few months' time we are going to see some interesting advances in this area.

PS: can you say what FPGA prototyping board you are using? That sounds like an interesting project.

#6 Daman

New Member

Members
5 posts

Posted 17 March 2011 - 02:10 AM

Thank you all for your timely responses.

I understand that the code is compiled into CLI that is then interpreted by the CLR, and that MF does not use JIT, so the instructions are always interpreted.

But execution still seems orders of magnitude slower than I would expect. So here is a code snippet:

long ticksPerMicroSec = TimeSpan.TicksPerMillisecond / 1000; //1000 uS = 1 mS

now = DateTime.Now.Ticks;
later = DateTime.Now.Ticks;
elapsedTime = (later - now);
Debug.Print("Tick Count with no operations = " + elapsedTime + " ticks");

now = DateTime.Now.Ticks;
i = i % count;
f = (byte) ((sum / count) / 10);
k = 0;
later = DateTime.Now.Ticks;
elapsedTime = (later - now - 2134) / ticksPerMicroSec; //find elapsed tick count - count with no operations
Debug.Print("Execution time minus ticks with no operation = " + elapsedTime + "uS");
now = DateTime.Now.Ticks;
k = 0;
later = DateTime.Now.Ticks;
elapsedTime = (later - now - 2134) / ticksPerMicroSec;
Debug.Print("Execution time minus ticks with no operation = " + elapsedTime + "uS");

And here is the printout from the debug window.

Tick Count with no operations = 2134 ticks
Execution time minus ticks with no operation = 234uS
Execution time minus ticks with no operation = 42uS

42uS to execute the statement k=0;???

That is over 1000 machine instructions! It does not seem like the CLR would have that much overhead.

Can anyone verify this timing?

I still think I have some configuration setting set incorrectly.

#7 Daman

New Member

Members
5 posts

Posted 17 March 2011 - 02:29 AM

...
PS: can you say what FPGA prototyping board you are using? That sounds like an interesting project.

It is based on the Altera Cyclone II family.

EP2C5T144 Altera CycloneII FPGA mini Development Board

I got the following USB Blaster clone

Mini Altera FPGA CPLD USB Blaster programmer JTAG

You can find a number of people selling similar items on eBay. These just happen to be the vendors I used. The board and blaster took about a week to arrive from Hong Kong.

Altera offers a free Web Edition of Quartus II (a compiler with a number of other tools, including the SignalTap II Logic Analyzer).

Altera now also provides a free starter edition of ModelSim.

Breadboarding is not my favorite pastime. Programming glue logic in an FPGA is more fun than wiring discrete logic.

#8 Corey Kosak

Advanced Member

Members
276 posts

LocationHoboken, NJ

Posted 17 March 2011 - 03:30 AM

Interesting. Below is a different program that calculates 17.64 microseconds for k=0

That's pretty slow (on the other hand, when saying 1000 machine instructions you appear to assume the CPU can sustain 1 machine instruction per clock cycle which I wouldn't think is possible, especially when doing writes to memory. But I don't really know how to calculate cycle times on this thing)

using System;
using Microsoft.SPOT;

namespace NetduinoApplication3 {
  public class Program {
    private static int k;

    public static void Main() {
      const int count=1000;
      var start=DateTime.Now;
      for(var i=0; i<count; ++i) {
        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 

        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 
        k=0; k=0; k=0; k=0; k=0; 
      }
      var end=DateTime.Now;

      const int numberOfAssignmentsPerLoop=100;
      const int totalNumberOfAssignments=count*numberOfAssignmentsPerLoop;
      var elapsed=(end-start);
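      // Seconds and Milliseconds are components of the TimeSpan, so this assumes the run takes under a minute.
      var totalMicroseconds=elapsed.Seconds*1000000+elapsed.Milliseconds*1000;
      // The value is in microseconds per assignment, matching the "uS" printed below.
      var microsecondsPerAssignment=((double)totalMicroseconds)/totalNumberOfAssignments;
      Debug.Print("time per assignment="+microsecondsPerAssignment+" uS");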
    }
  }
}

#9 Mario Vernari

Advanced Member

Members
1768 posts

LocationVenezia, Italia

Posted 17 March 2011 - 05:28 AM

Sorry, but is using DateTime.Now the same thing as using Utility.GetMachineTime()? On PCs, DateTime is not reliable for short timing measurements.
Mario

#10 Nevyn

Advanced Member

Members
1072 posts

LocationNorth Yorkshire, UK

Posted 17 March 2011 - 07:02 AM

42uS to execute the statement k=0;???

And to call the method which gets the time from the system. If you disassemble the three statements

now = DateTime.Now.Ticks;
k = 0;
later = DateTime.Now.Ticks;

you get

.locals init (
[0] int64 now,
[1] int64 later,
[2] int64 k,
[3] valuetype [mscorlib]System.DateTime CS$0$0000)
L_0000: nop 
L_0001: call valuetype [mscorlib]System.DateTime [mscorlib]System.DateTime::get_Now()
L_0006: stloc.3 
L_0007: ldloca.s CS$0$0000
L_0009: call instance int64 [mscorlib]System.DateTime::get_Ticks()
L_000e: stloc.0 
L_000f: ldc.i4.0 
L_0010: conv.i8 
L_0011: stloc.2 
L_0012: call valuetype [mscorlib]System.DateTime [mscorlib]System.DateTime::get_Now()
L_0017: stloc.3 
L_0018: ldloca.s CS$0$0000
L_001a: call instance int64 [mscorlib]System.DateTime::get_Ticks()
L_001f: stloc.1

So you finally get the time now somewhere in L_0009 and later somewhere in L_001a. The value you are showing is the accumulation of all of the statements in between. Assuming that the code executed at L_0009 and L_001a is equivalent, the value of 42us is the price for executing all of the statements from L_000e to L_001a inclusive.

Regards,
Mark

#11 CW2

Advanced Member

Members
1592 posts

LocationCzech Republic

Posted 17 March 2011 - 07:17 AM

That's pretty slow (on the other hand, when saying 1000 machine instructions you appear to assume the CPU can sustain 1 machine instruction per clock cycle which I wouldn't think is possible, especially when doing writes to memory. But I don't really know how to calculate cycle times on this thing)

If you are interested, instruction cycle timings are described (e.g.) in ARM7TDMI Technical Reference manual (pdf). However, the calculation is not trivial, because of the pipelined architecture (instructions overlap considerably) and multiple bus interface cycle types. IMHO the average number of clock ticks is about 1.5 per instruction [citation needed].
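The relationship between a measured time, the clock rate and a rough instruction count can be written out as plain arithmetic. This is only a sketch: the class and method names are hypothetical, and the cycles-per-instruction figure is the poster's own estimate of about 1.5, not a measured value.

```csharp
// Hypothetical helper for back-of-the-envelope estimates; not part of NETMF.
public static class CycleBudget
{
    // Clock cycles that elapse in the given time: microseconds x MHz.
    public static double Cycles(double microseconds, double clockMHz)
    {
        return microseconds * clockMHz;
    }

    // Rough native instruction count, given an ASSUMED average number of
    // clock cycles per instruction (about 1.5 is the estimate quoted above).
    public static double Instructions(double microseconds, double clockMHz,
                                      double cyclesPerInstruction)
    {
        return Cycles(microseconds, clockMHz) / cyclesPerInstruction;
    }
}
```

For the timings quoted in this thread at 48 MHz, Cycles(42, 48) gives 2016 cycles, or about 1344 instructions at 1.5 cycles each, and Cycles(23, 48) gives 1104 cycles, or about 736 instructions. Either way the count is in the hundreds to thousands for a statement or two of C#.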

#12 Corey Kosak

Advanced Member

Members
276 posts

LocationHoboken, NJ

Posted 17 March 2011 - 11:58 AM

And to call the method which gets the time from the system. If you disassemble the three statements

Well, his code attempts to measure that overhead and subtract it from the calculated times. (this is the reason for the magic constant 2134 in his code). My approach for hiding that overhead (and also the loop overhead) was to run the instruction many times.
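The two approaches described here, subtracting a measured baseline and repeating the operation many times, combine into one formula. The sketch below is illustrative only: the names are hypothetical, and it assumes the usual .NET tick of 100 ns (10 ticks per microsecond).

```csharp
// Hypothetical helper; it shows the arithmetic only and takes no measurements.
public static class Amortized
{
    // .NET ticks are 100 ns each, so 10 ticks make one microsecond.
    private const double TicksPerMicrosecond = 10.0;

    // Average cost of one operation in microseconds, after subtracting a
    // separately measured baseline (timer calls, loop overhead) from the total.
    public static double MicrosecondsPerOperation(long elapsedTicks,
                                                  long baselineTicks,
                                                  int operations)
    {
        return (elapsedTicks - baselineTicks) / TicksPerMicrosecond / operations;
    }
}
```

With a single operation, the 2134-tick baseline (213.4 uS) is about five times larger than the 42 uS being measured, so a small error in the baseline dominates the result. Spread over 100,000 assignments taking roughly 1.764 seconds, the same baseline is on the order of 0.01% of the total, which is why the repeated version is less sensitive to it.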

#13 Daman

New Member

Members
5 posts

Posted 17 March 2011 - 06:30 PM

Well, his code attempts to measure that overhead and subtract it from the calculated times. (this is the reason for the magic constant 2134 in his code). My approach for hiding that overhead (and also the loop overhead) was to run the instruction many times.

Not sure what the compiler did with your code. Since k is not used later on the right-hand side of an expression, the compiler may have optimized out the assignment. It may also optimize out the redundant assignments. I have to go to an appointment right now but will try this myself later.

Question to Mark - How do you view the CLI assembly code?

Thanks all,
Bill

#14 Corey Kosak

Advanced Member

Members
276 posts

LocationHoboken, NJ

Posted 17 March 2011 - 07:22 PM

Not sure what the compiler did with your code.

I checked my exe with the 'ildasm' tool and found that the assignments were still there. If you want me to confirm your exact program, can you post the whole source, so there is no ambiguity about what are the types of i,f,k etc and whether they are local or member variables etc.

#15 Nevyn

Advanced Member

Members
1072 posts

LocationNorth Yorkshire, UK

Posted 17 March 2011 - 07:50 PM

Question to Mark - How do you view the CLI assembly code?

I fed the compiled code to Reflector. There was a free version available, but it now appears to have disappeared and has been replaced by a commercial version.

Regards,
Mark

#16 Cuno

Advanced Member

Members
144 posts

LocationZürich / Switzerland

Posted 17 March 2011 - 07:52 PM

i = (i + 1) % count;
f = (byte)((sum / count)/10);

This code should not cause any memory allocations, so memory management overhead should not be responsible for the observed slow speed.

The first measurements of one of my colleagues indicate that NETMF code is about 100 times slower than C code translated by the optimizing ARM compilers. This is certainly disappointing. Even with all the advantages of managed code, I think an interpreter should not use more than about 20 instructions on average for interpreting a simple intermediate code instruction. Not 100, and certainly not 1000. It would be interesting to take a look at how many MSIL instructions have been executed in your loop, and which ones.

So it looks like there is room for making the interpreter faster. For example, instead of checking whether it is time for rescheduling threads after each and every MSIL instruction, the usual approach would be to only perform this check when a jump backwards is executed, i.e., in a loop. I guess there must be other ways to improve the interpreter, even though it has more work to do than a comparable Java byte code interpreter.

One overhead of the NETMF interpreter is that it checks for breakpoints after every instruction. This overhead should only be present in a debug build, not in a release build.

I think it is worthwhile to improve the interpreter, but it will never come close to the native speed of the hardware. A good MSIL compiler for the complete MSIL instruction set would probably take up a few hundred KB of RAM, so it is not an option for small microcontrollers. An offline MSIL compiler would be great, and there are a couple of attempts in this direction, but I haven't seen one where I hold my breath yet.

As for me, I'll therefore gladly stick with the interpreter, but I am also interested in hybrid approaches for critical pieces of code, in particular if they don't involve a mix of C# and C code, and if the safety of managed code may only be violated in isolated, easily identifiable parts of an application's code.

Cuno
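The suggestion about polling for rescheduling only on backward jumps can be illustrated with a toy interpreter. This is not NETMF code and says nothing about how the real interpreter is written: the opcodes, the encoding (opcode followed by one operand) and all names are invented for this sketch, which only counts how often a scheduler check would run under each policy.

```csharp
// Toy interpreter, invented for illustration. Each instruction is two ints:
// an opcode followed by an operand (unused by most opcodes).
public static class ToyInterpreter
{
    public const int Nop = 0;             // do nothing
    public const int Dec = 1;             // counter = counter - 1
    public const int JumpIfPositive = 2;  // jump to operand index if counter > 0
    public const int Halt = 3;            // stop execution

    // Runs the program and returns how many times the scheduler check ran.
    // checkEveryInstruction = true polls after each instruction;
    // false polls only when control flow moves backwards.
    public static int Run(int[] code, int counter, bool checkEveryInstruction)
    {
        int checks = 0;
        int pc = 0;
        while (code[pc] != Halt)
        {
            int op = code[pc];
            int operand = code[pc + 1];
            int nextPc = pc + 2;

            if (op == Dec)
                counter--;
            else if (op == JumpIfPositive && counter > 0)
                nextPc = operand;

            // A program can only loop through a backward jump, so polling
            // there still bounds the work done between two checks.
            bool backwardJump = nextPc <= pc;
            if (checkEveryInstruction || backwardJump)
                checks++;

            pc = nextPc;
        }
        return checks;
    }
}
```

For a four-instruction loop body (Nop, Nop, Dec, JumpIfPositive back to index 0) started with a counter of 1000, polling after every instruction performs 4000 checks, while polling on backward jumps performs 999. How much time that saves in practice depends on the cost of each check, which this sketch does not model.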

#17 Chris Walker

Secret Labs Staff

Moderators
7767 posts

LocationNew York, NY

Posted 17 March 2011 - 08:16 PM

Cuno, Thanks for the insights. Should we as a community dig into the .NET MF interpreter code and propose/contribute speed improvements back to Microsoft? Chris

#18 Corey Kosak

Advanced Member

Members
276 posts

LocationHoboken, NJ

Posted 18 March 2011 - 12:45 AM

I fed the compiled code to Reflector.

In case people aren't aware, 'ildasm' comes with Visual Studio and, for the pay version, appears in Start -> Microsoft Visual Studio 2010 -> Microsoft Windows SDK Tools. I assume/hope it comes with the free version of VS as well.

#19 Nevyn

Advanced Member

Members
1072 posts

LocationNorth Yorkshire, UK

Posted 18 March 2011 - 06:37 AM

In case people aren't aware, 'ildasm' comes with Visual Studio and, for the pay version, appears in Start -> Microsoft Visual Studio 2010 -> Microsoft Windows SDK Tools. I assume/hope it comes with the free version of VS as well.

I've always preferred Reflector as it can also show the code in C# and VB but if all you need is to see the IL then ildasm is just as good.

Regards,
Mark

#20 CW2

Advanced Member

Members
1592 posts

LocationCzech Republic

Posted 18 March 2011 - 08:41 AM

One overhead of the NETMF interpreter is that it checks for breakpoints after every instruction. This overhead should only be present in a debug build, not in a release build.

IMHO you are not right here, breakpoint checking has to be present in the release [firmware] build - this is the one that is published, without breakpoints you would not be able to debug the application (from Visual Studio). Perhaps the overhead should only take place when a debugger is attached (if it is not done so already).

I would be interested in your measurement results for an 'RTM' build (compiled with the /p:Flavor=RTM option) that has debugging disabled (and "some CLR diagnostic functionality may be eliminated").
