netduino/MF performance
#1
Posted 16 March 2011 - 03:25 AM
#2
Posted 16 March 2011 - 05:16 AM
#3
Posted 16 March 2011 - 06:51 AM
The following code snippet takes ~23 uS to run:
i = (i + 1) % count;
f = (byte)((sum / count)/10);
The code on the Micro Framework is interpreted; some of the overhead is due to the instructions being translated each time they are run, as no JIT compiler is available.
Regards,
Mark
To be or not to be = 0xFF
Blogging about Netduino, .NET, STM8S and STM32 and generally waffling on about life
Follow @nevynuk on Twitter
#4
Posted 16 March 2011 - 03:31 PM
The .NET Micro Framework is an interpreter; once your assembly is stripped down it gets sent to the Netduino and is interpreted when it runs. At no point does your C# project get compiled to native code.
Hi Daman.
I did not quite understand what your problem is.
What I am telling you is that the CLR is a very complex layer. Once you have written your C# code, the compiler creates an intermediate language which is "portable". This is compiled again to native opcodes and sent to the Netduino.
The "overhead" of the CLR stands essentially in the memory management, which is garbage-collected automatically. This process gives you an huge simplicity to develop applications, a really safer way to play with objects, but it has a cost in terms of performance.
Just to give an idea of the timings, a tight loop doing nothing but a port read/write is able to run at about 8.5 kHz. That is really "disappointing" for a 48 MHz chip, but if you were just toggling an output, it would probably be easier to write it in assembler or C. Instead, C# + Micro Framework has huge advantages when you want to embed very complex blocks: e.g. sockets, displays, threading, etc.
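For context, here is a minimal sketch of the kind of tight toggle loop being described. It assumes the standard Netduino SDK namespaces (Microsoft.SPOT.Hardware and SecretLabs.NETMF.Hardware.Netduino); the choice of pin D0 is arbitrary, and it only runs on the device itself, so treat it as illustrative rather than a benchmark harness.

```csharp
using Microsoft.SPOT.Hardware;
using SecretLabs.NETMF.Hardware.Netduino;

public class ToggleLoop
{
    public static void Main()
    {
        // Each Write() call goes through the NETMF interpreter, so this
        // loop toggles in the kHz range rather than anywhere near the
        // 48 MHz clock rate the hardware could support natively.
        OutputPort port = new OutputPort(Pins.GPIO_PIN_D0, false);
        while (true)
        {
            port.Write(true);
            port.Write(false);
        }
    }
}
```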
Hope it helps.
Cheers
Mario
EDIT: And I was slower to answer
#5
Posted 17 March 2011 - 01:07 AM
Can that be? Is this CLR overhead?
To the above responses I would add the following. Many people on the board are interested in the same performance issues. The central idea is to promote a hybrid programming style where much of the logic of your program is in C#, but a few timing-critical core routines are in native code. How this native code should be provided is the key question, and there are a few ideas being bandied about:
- compile your own custom entry point into the firmware
- use a dynamic code generation library like my own highly-experimental Fluent library
- wait a little longer to find out what Chris Walker has cooked up
- wait some unknown amount of time to see if my TinyNGEN project amounts to anything
I personally believe that things are in flux right now, and although it can be frustrating at times, I bet in a few months' time we are going to see some interesting advances in this area.
PS: can you say what FPGA prototyping board you are using? That sounds like an interesting project.
#6
Posted 17 March 2011 - 02:10 AM
I understand that the code is compiled into CIL that is then interpreted by the CLR, and that MF does not use a JIT, so the instructions are always interpreted.
But execution still seems an order of magnitude slower than I would expect. So here is a code snippet:
long ticksPerMicroSec = TimeSpan.TicksPerMillisecond / 1000; //1000 uS = 1 mS
now = DateTime.Now.Ticks;
later = DateTime.Now.Ticks;
elapsedTime = (later - now);
Debug.Print("Tick Count with no operations = " + elapsedTime + " ticks");
now = DateTime.Now.Ticks;
i = i % count;
f = (byte) ((sum / count) / 10);
k = 0;
later = DateTime.Now.Ticks;
elapsedTime = (later - now - 2134) / ticksPerMicroSec; //find elapsed tick count - count with no operations
Debug.Print("Execution time minus ticks with no operation = " + elapsedTime + "uS");
now = DateTime.Now.Ticks;
k = 0;
later = DateTime.Now.Ticks;
elapsedTime = (later - now - 2134) / ticksPerMicroSec;
Debug.Print("Execution time minus ticks with no operation = " + elapsedTime + "uS");
And here is the printout from the debug window.
Tick Count with no operations = 2134 ticks
Execution time minus ticks with no operation = 234uS
Execution time minus ticks with no operation = 42uS
42uS to execute the statement k=0;???
That is over 1000 machine instructions! It does not seem like the CLR would have that much overhead.
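As a sanity check on that "over 1000 machine instructions" figure, here is the back-of-the-envelope arithmetic: at the 48 MHz clock mentioned in this thread, 42 uS corresponds to roughly 2000 clock cycles (cycles, not instructions one-for-one). This is plain C#, nothing Netduino-specific:

```csharp
using System;

public class CycleBudget
{
    public static void Main()
    {
        const double clockHz = 48e6;         // 48 MHz core clock, per the thread
        const double elapsedSeconds = 42e-6; // the measured 42 uS
        // Cycles elapsed = clock frequency * elapsed time
        int cycles = (int)Math.Round(clockHz * elapsedSeconds);
        Console.WriteLine(cycles + " clock cycles"); // 2016 clock cycles
    }
}
```

So even at one instruction per cycle the interpreter would be spending on the order of two thousand cycles on a single trivial IL statement, which is the overhead being questioned here.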
Can anyone verify this timing?
I still think I have some configuration set incorrectly.
#7
Posted 17 March 2011 - 02:29 AM
...
PS: can you say what FPGA prototyping board you are using? That sounds like an interesting project.
It is based on the Altera Cyclone II family.
EP2C5T144 Altera CycloneII FPGA mini Development Board
I got the following USB Blaster clone
Mini Altera FPGA CPLD USB Blaster programmer JTAG
You can find a number of people selling similar items on eBay. These just happen to be the vendors I used. The board and blaster took about a week to arrive from Hong Kong.
Altera offers a free web edition of Quartus II (a compiler with a number of other tools, including the SignalTap II Logic Analyzer).
Altera now also provides a free starter edition of ModelSim.
Breadboarding is not my favorite pastime. Programming glue logic in an FPGA is more fun than wiring discrete logic.
#8
Posted 17 March 2011 - 03:30 AM
That's pretty slow. (On the other hand, when saying 1000 machine instructions you appear to assume the CPU can sustain 1 machine instruction per clock cycle, which I wouldn't think is possible, especially when doing writes to memory. But I don't really know how to calculate cycle times on this thing.)
using System;
using Microsoft.SPOT;

namespace NetduinoApplication3
{
    public class Program
    {
        private static int k;

        public static void Main()
        {
            const int count = 1000;
            var start = DateTime.Now;
            for (var i = 0; i < count; ++i)
            {
                // 100 assignments of k = 0, ten per line
                k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0;
                k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0;
                k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0;
                k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0;
                k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0;
                k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0;
                k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0;
                k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0;
                k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0;
                k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0; k = 0;
            }
            var end = DateTime.Now;
            const int numberOfAssignmentsPerLoop = 100;
            const int totalNumberOfAssignments = count * numberOfAssignmentsPerLoop;
            var elapsed = (end - start);
            var totalMicroseconds = elapsed.Seconds * 1000000 + elapsed.Milliseconds * 1000;
            var secondsPerAssignment = ((double)totalMicroseconds) / totalNumberOfAssignments;
            Debug.Print("time per assignment=" + secondsPerAssignment + " uS");
        }
    }
}
#9
Posted 17 March 2011 - 05:28 AM
#10
Posted 17 March 2011 - 07:02 AM
42uS to execute the statement k=0;???
And to call the method which gets the time from the system. If you disassemble the three statements
now = DateTime.Now.Ticks; k = 0; later = DateTime.Now.Ticks;
you get
.locals init ([0] int64 now, [1] int64 later, [2] int64 k, [3] valuetype [mscorlib]System.DateTime CS$0$0000)
L_0000: nop
L_0001: call valuetype [mscorlib]System.DateTime [mscorlib]System.DateTime::get_Now()
L_0006: stloc.3
L_0007: ldloca.s CS$0$0000
L_0009: call instance int64 [mscorlib]System.DateTime::get_Ticks()
L_000e: stloc.0
L_000f: ldc.i4.0
L_0010: conv.i8
L_0011: stloc.2
L_0012: call valuetype [mscorlib]System.DateTime [mscorlib]System.DateTime::get_Now()
L_0017: stloc.3
L_0018: ldloca.s CS$0$0000
L_001a: call instance int64 [mscorlib]System.DateTime::get_Ticks()
L_001f: stloc.1
So you finally get the time now somewhere in L_0009 and later somewhere in L_001a. The value you are showing is the accumulation of all of the statements in between. Assuming that the code executed at L_0009 and L_001a is equivalent, the value of 42uS is the price of executing all of the statements from L_000e to L_001a inclusive.
Regards,
Mark
#11
Posted 17 March 2011 - 07:17 AM
That's pretty slow (on the other hand, when saying 1000 machine instructions you appear to assume the CPU can sustain 1 machine instruction per clock cycle which I wouldn't think is possible, especially when doing writes to memory. But I don't really know how to calculate cycle times on this thing)
If you are interested, instruction cycle timings are described (e.g.) in the ARM7TDMI Technical Reference Manual (pdf). However, the calculation is not trivial, because of the pipelined architecture (instructions overlap considerably) and multiple bus interface cycle types. IMHO the average number of clock ticks is about 1.5 per instruction [citation needed].
#12
Posted 17 March 2011 - 11:58 AM
And to call the method which gets the time from the system. If you disassemble the three statements
Well, his code attempts to measure that overhead and subtract it from the calculated times (this is the reason for the magic constant 2134 in his code). My approach for hiding that overhead (and also the loop overhead) was to run the instruction many times.
#13
Posted 17 March 2011 - 06:30 PM
Well, his code attempts to measure that overhead and subtract it from the calculated times. (this is the reason for the magic constant 2134 in his code). My approach for hiding that overhead (and also the loop overhead) was to run the instruction many times.
Not sure what the compiler did with your code. Since k is not used later on the right-hand side of an expression, the compiler may have optimized out the assignment. It may also optimize out the redundant assignments. I have to go to an appointment right now but will try this myself later.
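One way to guard against that kind of dead-store elimination is to make the assigned value observably used afterwards. A sketch (the Debug.Print call is just an illustrative way to keep the store live; checking the IL with ildasm or Reflector confirms it either way):

```csharp
using System;
using Microsoft.SPOT;

public class AssignmentTiming
{
    private static int k;

    public static void Main()
    {
        long now = DateTime.Now.Ticks;
        k = 0;
        long later = DateTime.Now.Ticks;
        // Reading k afterwards means the assignment cannot be treated as
        // a dead store and silently removed by the compiler.
        Debug.Print("k = " + k + ", elapsed = " + (later - now) + " ticks");
    }
}
```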
Question to Mark - How do you view the CLI assembly code?
Thanks all,
Bill
#14
Posted 17 March 2011 - 07:22 PM
Not sure what the compiler did with your code.
I checked my exe with the 'ildasm' tool and found that the assignments were still there. If you want me to confirm your exact program, can you post the whole source, so there is no ambiguity about what are the types of i,f,k etc and whether they are local or member variables etc.
#15
Posted 17 March 2011 - 07:50 PM
Question to Mark - How do you view the CLI assembly code?
I fed the compiled code to Reflector. There was a free version available, but it now appears to have disappeared and been replaced by a commercial version.
Regards,
Mark
#16
Posted 17 March 2011 - 07:52 PM
#17
Posted 17 March 2011 - 08:16 PM
#18
Posted 18 March 2011 - 12:45 AM
I fed the compiled code to Reflector.
In case people aren't aware, 'ildasm' comes with Visual Studio and, for the pay version, appears in Start -> Microsoft Visual Studio 2010 -> Microsoft Windows SDK Tools. I assume/hope it comes with the free version of VS as well.
#19
Posted 18 March 2011 - 06:37 AM
In case people aren't aware, 'ildasm' comes with Visual Studio and, for the pay version, appears in Start -> Microsoft Visual Studio 2010 -> Microsoft Windows SDK Tools. I assume/hope it comes with the free version of VS as well.
I've always preferred Reflector as it can also show the code in C# and VB but if all you need is to see the IL then ildasm is just as good.
Regards,
Mark
#20
Posted 18 March 2011 - 08:41 AM
One overhead of the NETMF interpreter is that it checks for breakpoints after every instruction. This overhead should only be present in a debug build, not in a release build.
IMHO you are not right here; breakpoint checking has to be present in the release [firmware] build - this is the one that is published. Without breakpoints you would not be able to debug the application (from Visual Studio). Perhaps the overhead should only take place when a debugger is attached (if it is not done so already).
I would be interested in your measurement results for an 'RTM' build (compiled with the /p:Flavor=RTM option) that has debugging disabled (and where "some CLR diagnostic functionality may be eliminated").