Monday, June 10, 2013

GCC profiling to VCD waveforms

During a few after work drinks with a friend, James McCombe, he mentioned that he had been working on a method of augmenting C code with macros that generated signals that could be read by a waveform visualizer and plotted on the screen. This scratched a few itches over the next week or so and I couldn't help myself but hack something together to look at the visualizations.

I put this together using gcc instrument-functions, attributes and the usual bit of rushed together python.


INSTRUMENTATION FUNCTIONS

gcc has a cool feature that inserts a function call before and after every function has been called.  It's triggered by building the final executable with -finstrument-functions. i.e.:

gcc test.c -finstrument-functions -g

{-g is needed to build for debug / keep all the symbols in the executable}


You then need to declare a couple of functions and link them into the executable that will get called before jumping to a new function and just before returning from a function:

void __attribute__((no_instrument_function))
__cyg_profile_func_enter(void *this_fn, void *call_site
{
    printf("Func ptr called: %p from address %p\n", this_fn, call_site );
}


void __attribute__((no_instrument_function))
 __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    printf("Func returning to %p from address %p\n", this_fn, call_site );
}


GCC ATTRIBUTES

To stop the instrumentation functions from being called recursively, you have to indicate to gcc that it should remove these functions from the list of targets used when inserting its instrumentation hooks.

The basic premise of this is as follows:

void __attribute__((NAME)) FISH(void ) 

A full list is available here and is well worth a perusal, even if a bunch of the options here are not portable between architectures or even instruction sets. A few of my favourites are:

: constructor / destructor

These insert functions before and after main(), allowing you to do all sorts of devious initialization of things when dynamically linking code against an executable. i.e.


void __attribute__((no_instrument_function))
     __attribute__((constructor))
constructor_begin (void)
{
    printf("Main is about to be called\n");
}

void __attribute__((no_instrument_function))

     __attribute__((constructor))
constructor_end (void)
{
    printf("Main has exited (or exit was called)\n");
}


: weak

The weak attribute indicates that when linking together an executable, if multiple functions exist with the same name, deprecate any functions marked as 'weak'. This enables executables to override functions in libraries for example if they so want.


: section

The last trick I like to use is to put a bunch of functions into a specific section in the ELF file. This attribute allows you to arbitrarily specify a section name for a funtion to live in (or list of functions). In the embedded space, this is very handy when specifying functions that want to execute from onchip SRAM or even ROM.



PUTTING IT TOGETHER

Using the instrumentation trick, the constructor attributes and a little bit of code to log the transactions to a file, we can quickly augment the executable to output a trace history of its function calls at run time.

Code over here:


A few considerations made here to minimize the overhead of the tracing code:

  • The gcc profiling side periodically flushes out the trace buffer to disk (not writing every time).  Although the file system is almost always buffered, fwrite still has a considerable overhead to call for 8 bytes and so local buffering of the data is needed
  • We need to minimize the amount of data written at the expense of a little computation. The file system is a magnitude order slower than the CPU and so using a few cycles per function to much reduce the amount of data written to disk is worth while to get better run times

For the python parsing side, the code is here:


The main trick for performance is building up a dictionary (hash map) of all the addresses in the executable before parsing the file, to provide a nice O(1) lookup to get the function name being invoked.

However, this wasn't the real issue. After the hour long hack completed to get this working, I moved onto getting larger traces to run (specifically, tracing an open src C based JPEG decoder as a practical exercise). This produced trace files over 2GB in size, implying that even with the most compact of data structures, you'd be hard pressed to process them in place and so the entire function had to be rewritten to stream in / out the trace file and generated waveform. The history is in GIT if you prefer the original version, but you will need a lot of memory to get it to run :)


WAVES

So the most interesting bit of this is the visualizations. We'll start off with a trivial piece of test code found here:


Basically this code has a few function calls, the last one (FISH3) recurses 4 times, each time delaying for 1ms for the parameter passed in. The waveform for this is below:


The waveform shows the runtime is pretty much 10ms in total (the leading 1ms is a printf). FISH3 glitches each time it recurses, the delay actually happening as the function talks back up the stack of recursive calls, giving us a nice 1 / 2 / 3 / 4ms cadence.

Another excerpt below is from the JPEG decoder:



You can see in red that we are in a function here decoding a single JPEG block (njDecodeBlock). The zero cycle transitions (look like glitches) on the red trace is where a sub function returns and the njDecodeBlock continues to run. The other functions are those called by the njDecodeBlock function and their respective run times.

Looking at a full JPEG decoder (this guy) running over a 100x75 wide image gives us a much more interesting chart showing roughly the breakdown of function run times. Note this image is actually slightly quantized to make it more understandable (using the -q option in the python script) - very useful for eyeballing an application at a high level.



There are a few issues we want to resolve to make this more useful:

  • To view a callstack so its easy to understand, the waves really need to be ordered such that main() is in the top column and the stack calls down into the other functions.
  • Remove the cost of the trace buffer write itself. It seems to only be 1ms in 61ms of run time, but we should be able to account for this and hide it from the user viewing the waveform

Once again, thanks to James for his idea - next round of cocktails is on me!