Sunday, March 6, 2011
Undocumented ACE_OS::sleep caveats
For those in need of sleep in microseconds, understand that Windows provides no such mechanism.
Intro
Recently, I needed a method for setting a hertz publication rate on a publisher that would work on both Linux and Windows. The publication rate should be able to go up to at least 1 MHz, which requires a sleep mechanism capable of 1,000,000,000 ns / 1,000,000 == 1,000 ns (1 us) of precision. Consequently, the sleep would be required to function at the microsecond level.
Tools and methodologies
I decided to stick with the ACE library, specifically the ACE_OS::sleep(const ACE_Time_Value &) call. On the surface, this should allow us to sleep for microseconds, and it does - with one small caveat: the operating system needs a sleep mechanism capable of actual microsecond (us) precision.
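On paper, the call is straightforward. Here is a minimal sketch of the intended usage, assuming the standard ACE headers (the rate of 250 kHz and the publish placeholder are mine, just for illustration):

    #include "ace/OS_NS_unistd.h"
    #include "ace/Time_Value.h"

    int main (void)
    {
      const long hertz = 250000;                  // desired publish rate
      const long period_usec = 1000000L / hertz;  // 4 us per event

      // ACE_Time_Value takes (seconds, microseconds)
      ACE_Time_Value period (0, period_usec);

      for (int i = 0; i < 10; ++i)
      {
        // the actual publish call would go here
        ACE_OS::sleep (period);   // only as precise as the OS allows
      }
      return 0;
    }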
Problems
In WIN32 mode, the ACE_OS::sleep call uses the ::Sleep method provided by the Windows operating system. Unfortunately, ::Sleep only works at millisecond precision. This means that you either blast (i.e. no sleep statement at all), or you specify a hertz rate of <= 1 kHz (1 ms of sleep).
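To make the truncation concrete, a sketch of what happens on each platform (based on my reading of the ACE source, where the WIN32 branch converts the requested time to whole milliseconds before calling ::Sleep - treat the exact conversion as an assumption):

    #include "ace/OS_NS_unistd.h"
    #include "ace/Time_Value.h"

    int main (void)
    {
      ACE_Time_Value half_ms (0, 500);   // request 500 us

      // On WIN32, ACE_OS::sleep (tv) reduces to roughly ::Sleep (tv.msec ()),
      // and 500 us truncates to 0 ms - i.e., no sleep at all (a blast).
      // On POSIX, the same call sleeps for the full 500 us.
      ACE_OS::sleep (half_ms);
      return 0;
    }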
Solutions
One potential solution is bursting events and then sleeping for 1 ms. The trick is to work out a bursting pattern whose single 1 ms sleep accounts for the sum of all the microsecond sleeps that should have occurred over that period. This isn't modeling exactly what you want, but the alternative is to allow only bursting or rates <= 1 kHz. In other words, there is no beautiful, portable solution here that isn't going to cause stress on whatever you are trying to test (bursting is always a worst case for any software library).
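A sketch of the burst approach, assuming a hypothetical publish () standing in for the real dissemination call: each 1 ms sleep window emits the number of events that microsecond sleeps would have spread across that same window.

    #include "ace/OS_NS_unistd.h"
    #include "ace/Time_Value.h"

    void publish (void);   // hypothetical stand-in for the real publisher

    void burst_at_hertz (unsigned long hertz)
    {
      // events that should have fired during each 1 ms window
      const unsigned long events_per_ms = hertz / 1000;

      ACE_Time_Value one_ms (0, 1000);

      for (;;)
      {
        // burst everything the window owes us...
        for (unsigned long i = 0; i < events_per_ms; ++i)
          publish ();

        // ...then pay for the whole window with a single sleep
        ACE_OS::sleep (one_ms);
      }
    }

Note that this naive version ignores the time the burst itself consumes, so the achieved rate will fall slightly short of the target; compensating requires timing the burst and shortening the sleep accordingly.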
Downloads
KaRL Dissemination Test - Tuned to burst mode on Windows and to simply sleep for microseconds on POSIX.
Saturday, March 5, 2011
For loops just aren't what they used to be
Intro
My PhD dissertation currently centers on a knowledge and reasoning engine and middleware called KaRL, part of my Madara toolsuite. In a recent paper, I wanted to do some performance testing of the KaRL distributed reasoner, so I attacked the testing from three vectors: reasoning throughput (the number of rules per second the engine could perform without distributed knowledge), dissemination throughput (the number of rules per second sent over the wire in a LAN), and dissemination latency.
To make things more interesting, I decided to form a baseline for reasoning throughput. How about optimal C++ performance with a for loop and increments (e.g. ++var)? Oh, and it needs to be portable across Windows and Linux. Easy enough, right?
Problems, Solutions, and More Problems
The first problem on the docket was timer precision. I decided to go with ACE_High_Res_Timer after some unsuccessful and highly error-prone usage of the underlying gethrtime. The High_Res_Timer class corrects for global scale factor issues in the return values of QueryPerformanceCounter() on Windows. So far, so good.
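For reference, a minimal sketch of the timer usage, using the elapsed_time overload that reports nanoseconds:

    #include "ace/High_Res_Timer.h"

    int main (void)
    {
      ACE_High_Res_Timer timer;
      ACE_hrtime_t elapsed_ns = 0;

      timer.start ();
      // ... code under test ...
      timer.stop ();

      timer.elapsed_time (elapsed_ns);   // nanoseconds, scale-factor corrected
      return 0;
    }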
The results on my Linux and Windows machines were right in line with what I expected. Through function inlining, expression tree caching, and various other mechanisms, we are able to efficiently parse KaRL logics at greater than 1 MHz. However, when I started comparing to my supposed baseline, I discovered that the ACE_High_Res_Timer was reporting that the optimized C++ for loop of ++var was performing at an amazing 60 GHz to over 1 THz... on a 2.5 GHz processor.
What the heck was going on here?
It turns out that modern C++ compilers will completely optimize out for loops if they can. My specific issue, which remains unsolved in a portable manner, concerned a for loop with a simple accumulator (var) that is incremented a certain number of times. I had started a timer before the for loop and stopped it after the loop was over, but the assembly generated from the C++ program contained no loop at all in that function. In fact, the compiler simply moved the final value the loop would have produced into var. The timer was effectively reporting the time it took to query the system for the nanosecond-precision timers, since the couple of assembly instructions in between were not enough to amount to any nanoseconds at all.
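A reconstruction of the kind of baseline loop described above (the names are mine, not from the original test):

    #include "ace/High_Res_Timer.h"

    ACE_hrtime_t time_increments (unsigned long long iterations)
    {
      unsigned long long var = 0;

      ACE_High_Res_Timer timer;
      timer.start ();

      // an optimizing compiler folds this whole loop into var = iterations,
      // so the timer ends up bracketing almost nothing
      for (unsigned long long i = 0; i < iterations; ++i)
        ++var;

      timer.stop ();

      ACE_hrtime_t elapsed_ns = 0;
      timer.elapsed_time (elapsed_ns);
      return elapsed_ns;
    }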
Remarks on Known Solutions
In Visual Studio, I was able to circumvent the issue in two ways: first, by using __asm { nop }, which effectively inserts a no-op (an exchange of eax with itself), and second, by using volatile, which prevents the compiler from optimizing accesses to the variable and from fully taking advantage of registers.
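The Visual Studio version of the workaround looks roughly like this (note that __asm blocks are only accepted by 32-bit MSVC builds, which is an additional portability limit):

    unsigned long long var = 0;

    for (unsigned long long i = 0; i < iterations; ++i)
    {
      ++var;
      // the no-op keeps MSVC from deleting the loop body
      __asm { nop }
    }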
In g++, unfortunately, I was only able to use volatile, which means that if I want to test the actual loop, I have to take away every other optimization the compiler might be able to do. Using volatile turns out to be the only portable approach I could think of, and internet searching seemed to confirm these suspicions. I would think there would be some way to tell each compiler to simply not optimize out for loops in a particular function or file, though.
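And the volatile variant, the only version I found that both compilers honor:

    // volatile forces every increment through memory, which keeps the
    // loop alive but also gives up register optimizations entirely
    volatile unsigned long long var = 0;

    for (unsigned long long i = 0; i < iterations; ++i)
      ++var;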
Downloads
Solution, which unfortunately can't get around -O3 optimization in g++ and Release mode in Visual Studio.