Compiler Optimization Of MechAssault 2
By Kyle Wilson
Saturday, January 22, 2005

MechAssault 2 is finally on shelves. Though we lost sleep over performance issues right up until the very end, the reviews are quite positive regarding frame rate. Xbox Addict says that the game "surprisingly, despite the serious graphical upgrade to the game engine, doesn't chug on the frame-rate." IGN says, "The framerate is generally solid." GameZone reports that "the game trucks along at a great framerate." This is all something of a miracle considering that a few weeks before we mastered, the first scene the player saw when the game started ran at a torturous fifteen frames per second.
How did we get from 15 fps to 30 fps? Or, put in the way budget-conscious game programmers are comfortable with, how did we get our frame time down from 66 ms to 33 ms? One of our kick-ass gameplay programmers, Nate Payne, did a slew of tweaks to speed up ray casting and to avoid updating unseen objects. I added a new "costs" profile which would measure and display the total time cost for a particular type of templated game object to execute or for a particular model to render, and how many of those game objects or models were being processed. That data gave our designers the information they needed to tune critical models and to figure out which enemies to hide or to spawn in later. Most of our speed-up, I admit, came from data changes.
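The costs profile itself was simple in concept. The real system hooked into our engine's update and render loops; a minimal sketch of the idea, with every name here my own illustrative stand-in, might look like this:

```cpp
#include <cstdio>
#include <map>
#include <string>

// Hypothetical sketch: accumulate per-type update time and instance
// counts, then dump a report. The real system hooked into our templated
// game-object update loop; all names here are illustrative.
class CostProfiler
{
public:
    // Called when an object of 'type' finishes updating or rendering.
    void Record(const std::string& type, double milliseconds)
    {
        Entry& e = m_entries[type];
        e.totalMs += milliseconds;
        ++e.count;
    }

    // Print one line per type: total cost and how many were processed.
    void DumpAndReset()
    {
        std::map<std::string, Entry>::const_iterator it;
        for (it = m_entries.begin(); it != m_entries.end(); ++it)
            std::printf("%-32s %8.3f ms  x%d\n",
                        it->first.c_str(), it->second.totalMs, it->second.count);
        m_entries.clear();
    }

private:
    struct Entry
    {
        Entry() : totalMs(0.0), count(0) {}
        double totalMs;
        int count;
    };
    std::map<std::string, Entry> m_entries;
};
```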
But by changing our compiler optimization settings, we also squeezed out a significant performance gain--nearly 10%--that was basically free. More importantly, on a console with only 64 megs of memory, the same settings saved us over a megabyte and a half of memory.
Exception Handling
The first thing I did was to turn off exception handling (remove /EHsc). Although we catch exceptions in our unit tests, the MA2 game itself doesn't use exception handling at all. We include some Boost library code that could conceivably throw an exception, but nothing that our code does with Boost ought to trigger those exceptions. If they occur, they're every bit as much of a bug as a divide by zero or a null pointer dereference, and the game will crash just as hard. I think we left MA2 pretty nearly crash-free.
In principle I prefer exception handling to error codes. Exceptions save you a lot of wrapping of functions in FAIL_RETURN macros and allow you to signal errors in constructors, which is very awkward in an error-code system. But exceptions do have a cost. If you enable exception handling, the compiler will build a look-up table for your entire program indicating which local objects need to be destroyed if an exception occurs at any given point in any function.
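To see the wrapping that exceptions spare you, here's a hedged sketch of the error-code style. 'Result', the loader functions, and this particular FAIL_RETURN definition are my stand-ins, not MA2's actual code:

```cpp
// Illustrative only: every fallible call has to be wrapped by hand, and
// constructors can't participate at all, since they return nothing.
enum Result { kOk, kOutOfMemory, kFileNotFound };

#define FAIL_RETURN(expr)       \
    do {                        \
        Result r = (expr);      \
        if (r != kOk) return r; \
    } while (0)

Result LoadMech(const char*)     { return kOk; } // stand-in stubs
Result LoadTextures(const char*) { return kOk; }

Result LoadLevel(const char* path)
{
    FAIL_RETURN(LoadMech(path));
    FAIL_RETURN(LoadTextures(path));
    return kOk;
}
```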
Before I started tweaking our compiler settings, the amount of memory consumed by our code and associated global and static variables--what we called our "module size"--was 11241K. That was the amount of memory that had been used when main() was first entered. After turning off exception handling, that number went down to 10501K, for a savings of about 700K. In an input-recording demo I used for profiling throughout my tweaking of compiler options, frame rate improved by about 1%.
Because we use Boost, I had to add our own empty definition for boost::throw_exception. And I added a #pragma warning(disable:4530) to turn off the warning letting me know that exceptions were, in fact, off, but that some of our external libraries were still trying to throw. That's okay. We don't pretend to handle exceptions.
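Assuming BOOST_NO_EXCEPTIONS is defined for the build, the stub is a one-liner; Boost routes every would-be throw through this user-supplied hook. The halt-on-error body below is my choice for illustration, not necessarily what MA2 shipped with:

```cpp
#include <cstdlib>
#include <exception>

namespace boost
{
    // Called by Boost in place of a throw statement when
    // BOOST_NO_EXCEPTIONS is defined.
    void throw_exception(std::exception const& /*e*/)
    {
        // These "can't happen" in practice; if one does, it's a bug,
        // so halt hard rather than pretend to recover.
        std::abort();
    }
}
```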
Link-Time Code Generation
Next, I turned on link-time code generation (/LTCG at link time, with /GL passed to the compiler), or whole program optimization (WPO). Whole program optimization does an extra optimization pass over your compiled program, optimizing not just within a single translation unit (cpp file/obj file), but across translation units and even across statically-linked libraries. As far as I can tell from VC++ Program Manager Kang Su Gatlin's PowerPoint slides on the subject, the primary optimization that LTCG does is identify pointers which are free of aliasing. Aliasing occurs when multiple pointers may point to the same memory location. Because the compiler cannot prove that the locations are distinct, it may be forced to write a register value back to memory, then re-read memory on every access through another pointer. The more possible occurrences of aliasing you have in your code, the less the compiler is able to optimize. With WPO, the linker can search for usage of a particular pointer variable across function calls to determine whether aliasing can actually occur.
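As an illustration (my example, not one from Gatlin's slides), consider a loop that accumulates through one pointer while reading through another:

```cpp
// Without alias analysis the compiler must assume 'total' might point
// into 'values', so *total is written back to memory and re-read on
// every iteration instead of living in a register.
void Accumulate(float* total, const float* values, int count)
{
    for (int i = 0; i < count; ++i)
        *total += values[i];
}
```

If link-time analysis can see every call site and prove that no caller passes overlapping pointers, *total can stay in a register for the whole loop.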
Whole program optimization is very, very slow. With it enabled, our link times for FINAL builds on MA2 (the build target that we actually ship) jumped to over fifteen minutes. Our code size grew by 130K. But our performance improved by another 6.5%.
Turning on whole program optimization did uncover several bugs involving undefined behavior. Without WPO, these had somehow not crashed. With WPO, they did. In one case, a member function call was being made through a pointer to uninitialized memory. In several other cases, code assumed that a stack array and a pointer to that array were interchangeable. That is, it assumed that memset(&charArray, 0, sizeof(charArray)) would have the same effect as memset(charArray, 0, sizeof(charArray)). This is not guaranteed by the standard, and appears to change under WPO.
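The first of those is easy to picture. A minimal sketch of that class of bug (my reconstruction, not MA2's actual code):

```cpp
struct Enemy
{
    virtual void Update() {}
};

void Tick()
{
    Enemy* enemy;    // never initialized: points at garbage
    enemy->Update(); // undefined behavior -- without WPO the garbage
                     // happened to look harmless; with WPO it crashed
}
```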
Optimize for Size or Speed?
I've read a number of times that changing compiler settings to optimize for size (/O1) rather than optimizing for speed (/O2) can actually make a program faster. The theory is that most modern applications are hurt more by cache misses than by straight-up code execution time, and the less space the code takes up, the fewer cache misses will occur. But every time I've tried setting my options to minimize size, my code has gotten slower. MA2 was no exception. When I changed all our libraries to /O1, performance dropped by about 5%, although the code did shrink to a trim 8673K in initial size, two megs smaller than with the /O2 option.
Adrian Stone, our graphics programmer, suggested that I try switching most of our libraries back to optimizing for speed and keep the "optimize for size" setting on our two most bloated libraries. Both libraries are game-level code full of game-specific entity definitions: they took up a great deal of code space, but their objects accounted for a relatively small fraction of memory at runtime and an even smaller fraction of our execution time. Most of our profile is spent in scene hierarchy update and rendering set-up.
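We made the split per-library in our project settings, but it's worth noting that MSVC can flip the same knob at file or even function granularity with #pragma optimize. A sketch of the idea (the function here is a made-up example):

```cpp
#pragma optimize("s", on)   // favor small code here, as /O1 does

void UpdateGarrisonTurret() // bulky, rarely-hot entity logic
{
    // ...
}

#pragma optimize("", on)    // revert to the project's /O settings
```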
With that change--two libraries optimized for size, the rest for speed--our performance jumped to 8.2% faster than my original profile, and our initial code and data size edged back up to 9597K, or 1644K smaller than when I'd started. I experimented with switching other libraries back and forth, but every other change made performance worse. Apparently I'd lucked onto the ideal settings right away.
Any opinions expressed herein are in no way representative of those of my employers.