Saturday, March 5, 2011

For loops just aren't what they used to be

Sometimes, compilers are too damned good at optimization.

Intro

My PhD dissertation currently centers around a knowledge and reasoning engine and middleware called KaRL, part of my Madara toolsuite. In a recent paper, I wanted to do some performance testing of the KaRL distributed reasoner, and so I attacked the testing from three vectors: reasoning throughput (the number of rules per second the engine could perform without distributed knowledge), dissemination throughput (the number of rules per second sent over the wire in a LAN), and dissemination latency.

To make things more interesting, I decided to form a baseline for reasoning throughput. How about optimal C++ performance with a for loop and increments (e.g. ++var)? Oh, and it needs to be portable across Windows and Linux. Easy enough, right?


Problems, Solutions, and More Problems

The first problem on the docket was timer precision. I decided to go with ACE_High_Res_Timer after some unsuccessful and highly error-prone usage of the underlying gethrtime. The High_Res_Timer class also corrects for global scale factor issues in the return values of QueryPerformanceCounter() on Windows. So far, so good.
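For reference, here is a minimal sketch of timing a block of code with ACE_High_Res_Timer; the body being timed is just a placeholder, not my actual benchmark:

    #include <iostream>
    #include "ace/High_Res_Timer.h"

    int main (int, char **)
    {
      ACE_High_Res_Timer timer;
      ACE_hrtime_t elapsed_ns = 0;

      timer.start ();
      // ... the loop under test goes here ...
      timer.stop ();

      // elapsed_time () reports the interval in nanoseconds
      timer.elapsed_time (elapsed_ns);
      std::cout << "elapsed: " << elapsed_ns << " ns" << std::endl;
      return 0;
    }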

The results on my Linux and Windows machines were right in line with what I expected. Through function inlining, expression tree caching, and various other mechanisms, we are able to efficiently parse KaRL logics at greater than 1 MHz. However, when I started comparing to my supposed baseline, I discovered that the ACE_High_Res_Timer was reporting that the optimized C++ for loop of ++var was performing at an amazing 60 GHz to over 1 THz... on a 2.5 GHz processor.

What the heck was going on here?

It turns out that modern C++ compilers will completely optimize out for loops if they can. My specific issue, which remains unsolved in a portable manner, involved a for loop with a simple accumulator (var) that is incremented a certain number of times. I had started a timer before the for loop and stopped it after the loop was over, but the assembly generated from the C++ programs contained no for loop at all in the function. Instead, the compiler simply moved the final value the loop would have produced into var. The timer was effectively reporting the time it took to query the system for the nanosecond precision timers, since the couple of assembly instructions that remained didn't amount to any measurable nanoseconds at all.
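The offending pattern looked roughly like the following (the names are illustrative, not my actual benchmark code). At -O2/-O3 in g++ or in a Visual Studio Release build, the compiler can prove that var equals the iteration count on exit and replaces the entire loop with a single move:

    #include <cstdint>

    uint64_t count_up (uint64_t iterations)
    {
      uint64_t var = 0;
      // the optimizer folds this whole loop into "return iterations"
      for (uint64_t i = 0; i < iterations; ++i)
        ++var;
      return var;
    }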


Remarks on Known Solutions

In Visual Studio, I was able to circumvent the issue in two ways: first, by using __asm { nop }, which effectively inserts a no-op (an exchange of eax with itself), and second, by using volatile, which prevents the compiler from optimizing accesses to the variable and from fully taking advantage of registers.
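A sketch of the first workaround, assuming a 32-bit x86 build (MSVC's __asm blocks are not available when targeting x64); per the above, the inline assembly keeps the optimizer from discarding the loop:

    #include <cstdint>

    uint64_t count_up_msvc (uint64_t iterations)
    {
      uint64_t var = 0;
      for (uint64_t i = 0; i < iterations; ++i)
      {
        ++var;
        __asm { nop }   // opaque to the optimizer, so the loop body survives
      }
      return var;
    }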

In g++, unfortunately, I was only able to use volatile, which means that if I wanted to test the actual loop, I had to give up every other optimization the compiler might have been able to do.

Using volatile turned out to be the only portable approach I could think of, and internet searching seemed to confirm that suspicion. You would think there would be some way to specifically tell each compiler not to optimize out for loops in a particular function or file, though.
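For completeness, a sketch of the volatile variant discussed above (the only portable option I found), at the cost of forcing every increment through memory:

    #include <cstdint>

    uint64_t count_up_volatile (uint64_t iterations)
    {
      volatile uint64_t var = 0;   // reads and writes must actually happen
      for (uint64_t i = 0; i < iterations; ++i)
        ++var;                     // load, increment, store on every pass
      return var;
    }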


Downloads

Solution, which unfortunately can't get around -O3 optimization in g++ and Release mode in Visual Studio.
