
The Netduino forums have been replaced by new forums. This site has been preserved for archival purposes only, and the ability to make new accounts or posts has been turned off.

A different approach to speeding up managed code

23 replies to this topic

#21 CW2


    Advanced Member

  • Members
  • 1592 posts
  • Location: Czech Republic

Posted 10 July 2011 - 07:36 AM

Is there a different part of the code that does the actual IL instruction interpretation on the Netduino?

Please refer to CLR_RT_Thread::Execute_IL(...) in CLR\Core\Interpreter.cpp, and also functions in Execution.cpp.

#22 Roceh



  • Members
  • PipPip
  • 11 posts

Posted 10 July 2011 - 08:41 AM

One fairly simple optimization is direct threading: using labels as jump targets instead of a switch statement. That won't really help on the ARM7TDMI, since it has no branch prediction, but I will have a look at it, as I'm currently doing a Cortex-M4 port (*PLUG*), which should benefit from it. Also, some of the other (non-GCC) compilers may not support it, as it is not ANSI C. Update: I had a look and Keil does support this as well, and reading some more on this technique, it also reduces the instruction count (no range-check instructions), so it is still worthwhile for the Netduino.

#23 BitFlipper


    Advanced Member

  • Members
  • 61 posts

Posted 10 July 2011 - 02:37 PM

Please refer to CLR_RT_Thread::Execute_IL(...) in CLR\Core\Interpreter.cpp, and also functions in Execution.cpp.

Excellent, thanks for the info.

I did a proof of concept test by writing a minimal interpreter that can fully interpret the following managed method:

static void Main(string[] args)
{
    int result = 0;

    for (int idx = 0; idx < 10000000; idx++)
    {
        result += 1;
    }
}

The IL for this method looks like this:

.method private hidebysig static 
    void Main (string[] args) cil managed 
{
    // Method begins at RVA 0x2050
    // Code size 30 (0x1e)
    .maxstack 2
    .locals init (
        [0] int32 result,
        [1] int32 idx,
        [2] bool CS$4$0000
    )
    IL_0000: nop
    IL_0001: ldc.i4.0
    IL_0002: stloc.0
    IL_0003: ldc.i4.0
    IL_0004: stloc.1
    IL_0005: br.s IL_0011
    // loop start (head: IL_0011)
        IL_0007: nop
        IL_0008: ldloc.0
        IL_0009: ldc.i4.1
        IL_000a: add
        IL_000b: stloc.0
        IL_000c: nop
        IL_000d: ldloc.1
        IL_000e: ldc.i4.1
        IL_000f: add
        IL_0010: stloc.1

        IL_0011: ldloc.1
        IL_0012: ldc.i4 10000000
        IL_0017: clt
        IL_0019: stloc.2
        IL_001a: ldloc.2
        IL_001b: brtrue.s IL_0007
    // end loop

    IL_001d: ret
} // end of method Program::Main

I implemented a dual stack: one stack holds fixed-size structures, each of which stores the actual location of a variable on the second, raw value stack (a byte*), along with its size and type. This makes it easy to traverse the stack.

All of the IL instructions shown above are implemented. I cheated by not yet automatically allocating the locals based on the method metadata (I still have some problems tracking all of that down inside the raw PE file); instead I just push three values of the correct types onto the stack before starting the interpreter. I also have not implemented calling into other functions; that will come later. (Branching instructions turn out to be some of the simplest and fastest to simulate: just change the program counter. Calls, however, raise the issue of pushing all of the correct locals, and the parameters as well.) There is also no GC and there are no ref types yet.

So how slow is it? The above application takes 2.5 seconds to complete the loop of 10,000,000 iterations. Please keep in mind this is the unoptimized, managed version of the interpreter running on the PC. I am using an array of delegates to map instructions to actions, and these have overhead compared to something similar in C++, but even so I am encouraged by the results. The same Main method, when run on the real .NET Framework, takes about 32 ms to complete. So that means my managed interpreter runs ~78 times slower than running the code "natively" on .NET.

I plan to port this to C++ and see how this can be improved. I imagine using C++, and doing some optimizations could result in a good bump in performance.

At least to me this proof of concept shows that you can get better results than what we see right now with the .Net Micro Framework.

Also note that this was measured on the full .NET version running on a PC. I'm not sure how this will translate to an ARM processor, but I imagine it would scale roughly, since it is the interpreter implementation in C++ that I believe to be inefficient.

#24 Simon B


    New Member

  • Members
  • 5 posts

Posted 10 October 2011 - 03:15 AM


Here's a wild idea...

We can see that the .NET MF has an ARM JITter. I wonder if one could use that code as the basis for a custom IL-to-native compiler. What are the licensing implications? How hard would it be? Is it just too crazy to work?

It could even be converted into a C# tool that runs as a post-build step, takes a managed DLL as input, and produces a native ARM file.

Or something along those lines...

Am I a lot more stupid than I thought, or does the above post indicate that in the next release (4.2) of the .NET Micro Framework, the TinyCLR will include a JIT compiler targeting ARM processors?

Doesn't that solve everyone's problem with IL vs. native code? If the CLR embedded on the device is able to JIT-compile to ARM native instructions, we should get within 10% of the speed of executing native code anyway (after a brief delay while loading the code)?

AOT compilation of the code would be better than JIT, but either way it results in native (not interpreted) instruction execution... surely?

or am I insane?


home    hardware    projects    downloads    community    where to buy    contact Copyright © 2016 Wilderness Labs Inc.  |  Legal   |   CC BY-SA
This webpage is licensed under a Creative Commons Attribution-ShareAlike License.