Friday, February 6, 2009

Delayed-Branch-Slot Optimization and the Result

Background

 An Instruction Set Simulator (ISS) is a program that takes a stream of target instructions and simulates the effects of executing that code stream on a host machine. Usually the target and the host are of different architectures, but nothing prevents you from running, say, an x86 simulator on a PC. Based on this definition, an ISS is nothing but an ordinary interpreter, and since it interprets machine code instead of a high-level language, it is free from lexical analysis, parsing, byte-code compilation, garbage collection, and all the other complications one normally finds in modern high-level language interpreters. It would seem, then, that an ISS is a trivial (if often labor-intensive) construct to make.

 Well, not quite so…

 The tricky thing is that you want your simulator to run fast, really fast. A naive implementation with a central “read-evaluate-loop” structure can hardly exceed a speed of 10 million interpreted target instructions per second (MIPS). While 10 MIPS may be acceptable for running standalone, user-mode applications, it is hardly enough for running an OS, which easily takes billions of instructions just to boot.
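
 For concreteness, here is a minimal sketch of such a central loop, in C. Everything in it (the Cpu structure, the two toy opcodes, the field encodings) is illustrative, not taken from our actual simulator:

    /* Naive "read-evaluate-loop" interpreter: fetch, decode and execute
       one instruction per iteration.  All names are illustrative. */
    #include <stdint.h>

    enum { OP_ADD, OP_HALT };

    typedef struct {
        uint32_t regs[32];
        uint32_t pc;
    } Cpu;

    void run(Cpu *cpu, const uint32_t *mem)
    {
        for (;;) {
            uint32_t insn = mem[cpu->pc / 4];              /* fetch */
            switch (insn >> 26) {                          /* decode */
            case OP_ADD:                                   /* execute */
                cpu->regs[(insn >> 16) & 31] =
                    cpu->regs[(insn >> 21) & 31] + cpu->regs[(insn >> 11) & 31];
                break;
            case OP_HALT:
                return;
            }
            cpu->pc += 4;                                  /* advance */
        }
    }

 The fetch-decode dispatch is repeated for every single target instruction, and that is exactly where the cycles go.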

 Threaded code is an elegant control structure that strikes a sweet balance between implementation complexity and run-time efficiency. A threaded-code ISS is capable of running system-level code at about 100 MIPS with full accuracy at the instruction level.
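
 Threaded dispatch is commonly built on GCC's computed-goto extension: each decoded instruction carries the address of its service routine, and each routine ends by jumping directly to the next one, with no central loop. The following toy sketch illustrates the idea; it is not our production code, and all names are made up:

    /* Toy direct-threaded interpreter (GCC computed-goto extension). */
    #include <stdio.h>

    enum { OP_ADD, OP_PRINT, OP_HALT };

    typedef struct {
        void *handler;                  /* label address of the routine */
        int   operand;
    } DecodedInsn;

    static int run(DecodedInsn *prog, void **label_table)
    {
        /* First call: merely export the label addresses to the decoder. */
        if (label_table) {
            label_table[OP_ADD]   = &&op_add;
            label_table[OP_PRINT] = &&op_print;
            label_table[OP_HALT]  = &&op_halt;
            return 0;
        }

        DecodedInsn *ip = prog;
        int acc = 0;
        goto *ip->handler;              /* enter the thread */

    op_add:
        acc += ip->operand;
        ++ip;
        goto *ip->handler;              /* dispatch: one indirect jump */

    op_print:
        printf("%d\n", acc);
        ++ip;
        goto *ip->handler;

    op_halt:
        return acc;
    }

    int main(void)
    {
        void *labels[3];
        run(NULL, labels);              /* obtain the label addresses */

        DecodedInsn prog[] = {          /* a "decoded" toy program */
            { labels[OP_ADD],   40 },
            { labels[OP_ADD],    2 },
            { labels[OP_PRINT],  0 },
            { labels[OP_HALT],   0 },
        };
        return run(prog, NULL) == 42 ? 0 : 1;
    }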

 This is the technique we actually adopted in developing an ISS for a RISC processor. While happy with the end performance, we wanted to see if we could squeeze more runtime efficiency out of it. One thing that left me less than satisfied with our implementation was the way we coped with the branch delay slot.

 Conceptually, a processor executes instructions sequentially, one after another, until it reaches a branch or jump instruction, where control transfers to an effective address other than the one that follows. However, in some RISC machines, to make better use of the pipeline, the branch takes effect only after the instruction immediately following it has executed. The position following the branch instruction is called the branch delay slot. This is one of the few occasions where the nature of the hardware becomes visible to programmers.

 One way to simulate this behavior -- perhaps the most obvious one -- is to keep a Boolean flag, which is set by every branch/jump instruction if the branch/jump is taken, and cleared otherwise. This flag is checked at the end of the service routine of every normal instruction: if it is set, we jump to a pre-calculated target address; if it is not, we simply increase the instruction pointer and move on to the next instruction.
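
 Sketched in C, the flag-based scheme might look like this (the names and the 4-byte instruction size are assumptions for illustration):

    /* Flag-based delay-slot handling: every branch records its target
       and sets a flag; EVERY normal instruction must test the flag. */
    #include <stdint.h>

    typedef struct {
        uint32_t pc;
        uint32_t branch_target;
        int      branch_pending;        /* set by a taken branch/jump */
    } Cpu;

    /* epilogue executed after every normal instruction */
    static void advance(Cpu *cpu)
    {
        if (cpu->branch_pending) {      /* we were in a delay slot */
            cpu->pc = cpu->branch_target;
            cpu->branch_pending = 0;
        } else {
            cpu->pc += 4;               /* ordinary fall-through */
        }
    }

    /* service routine of a taken branch: record, don't jump yet */
    static void exec_taken_branch(Cpu *cpu, uint32_t target)
    {
        cpu->branch_target  = target;
        cpu->branch_pending = 1;
        cpu->pc += 4;                   /* execute the delay slot first */
    }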

 A slightly better solution, which does away with the conditional statement associated with the Boolean flag, is to maintain two instruction pointers, ip and next_ip, at the same time, so that we always jump to the address pointed to by next_ip after the execution of every instruction.
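
 In the same illustrative style, the two-pointer variant removes the test entirely:

    /* Two-pointer delay-slot handling: `ip` is the instruction being
       executed, `next_ip` is where control goes afterwards.  The
       epilogue is unconditional; a branch merely redirects next_ip. */
    #include <stdint.h>

    typedef struct {
        uint32_t ip;
        uint32_t next_ip;
    } Cpu;

    /* epilogue of EVERY instruction, branch or not */
    static void advance(Cpu *cpu, uint32_t target)
    {
        cpu->ip      = cpu->next_ip;
        cpu->next_ip = target;
    }

    static void exec_normal(Cpu *cpu)
    {
        /* ... interpretive code ... */
        advance(cpu, cpu->next_ip + 4); /* plain fall-through */
    }

    /* the instruction at the old next_ip (the delay slot) still
       executes before control reaches the branch target */
    static void exec_taken_branch(Cpu *cpu, uint32_t target)
    {
        /* ... interpretive code ... */
        advance(cpu, target);
    }

 The conditional is gone, but every instruction still pays for the extra pointer bookkeeping.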

 Both of these solutions use an auxiliary variable to simulate the correct execution flow of the processor in the presence of a branch delay slot. In either case, run-time overhead is added to every normal instruction, even though most instructions do not sit in a delay slot.

 Can we do better?

Idea for Optimization

 It did not take long before I realized that whether an instruction sits in a delay slot is a matter that can be resolved in the decoding phase, i.e., it is independent of the state of the processor. Hence, for each instruction that may possibly reside in a delay slot, we create two specialized intermediate-form instructions. The service routines of the two specialized forms have most of their interpretive code in common but have different dispatching semantics. An instruction that does not sit in a delay slot simply increases the instruction pointer and jumps to the service routine of the next instruction. An instruction that sits in a delay slot must in general check whether the branch was taken (for conditional branch/jump instructions); if so, it sets the instruction pointer to the target address calculated by the preceding branch/jump instruction and transfers control to the service routine of the instruction at that address. If the branch was not taken, the behavior is the same as for instructions outside a delay slot.
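
 A sketch of the two specializations, again with illustrative names, might look like this; the interpretive body is shared, and only the dispatch epilogue differs:

    /* Two decode-time specializations of one and the same instruction. */
    #include <stdint.h>

    typedef struct DecodedInsn DecodedInsn;
    struct Cpu;

    struct DecodedInsn {
        void (*exec)(struct Cpu *, DecodedInsn *);
        /* ... pre-decoded operands ... */
    };

    struct Cpu {
        DecodedInsn *ip;            /* points into the decoded stream */
        DecodedInsn *branch_target; /* set by the preceding branch */
        int          branch_taken;  /* set by a taken (conditional) branch */
    };

    static void add_body(struct Cpu *cpu) { (void)cpu; /* shared code */ }

    /* version used when the instruction is NOT in a delay slot */
    static void add_normal(struct Cpu *cpu, DecodedInsn *self)
    {
        add_body(cpu);
        cpu->ip = self + 1;         /* plain fall-through, no test */
    }

    /* version used when the instruction IS in a delay slot */
    static void add_in_slot(struct Cpu *cpu, DecodedInsn *self)
    {
        add_body(cpu);
        if (cpu->branch_taken) {    /* preceding branch was taken */
            cpu->branch_taken = 0;
            cpu->ip = cpu->branch_target;
        } else {
            cpu->ip = self + 1;     /* conditional branch not taken */
        }
    }

 The test has not vanished, but it is now paid only by instructions that actually sit in a delay slot, not by every instruction in the stream.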

 In essence, this is another application of the instruction-specialization optimization typically found in instruction set simulators, where an intermediate form of decoded instructions is used to avoid decoding the same stream of instructions over and over again. The difference is that, in this case, instead of specializing on the bit pattern of an individual instruction group, we specialize on the spatial position of individual instructions relative to special (branch/jump) instructions.

Implementation

 Once the idea is clear, the implementation is quite straightforward.

 We translate the raw instruction stream into the decoded form on a page-by-page basis. A page is the smallest unit (normally 4KB, though this may vary in modern architectures) into which the Memory Management Unit (MMU) of the processor breaks up the memory address space, as a whole, for critical OS-related operations such as address translation and protection. Doing so has the extra benefit that in-page branching, which is by far the more common case compared with off-page branching, does not need to go through the MMU, an expensive operation that could degrade the performance of the simulation significantly.
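
 The decoded form can be organized as a simple array, one entry per raw instruction on the page. A minimal sketch of the data structure (sizes and names are illustrative):

    /* One decoded page: a direct-mapped array of decoded instructions,
       so an in-page branch target is just an index into insns[]. */
    #include <stdint.h>

    #define PAGE_SIZE       4096u
    #define INSNS_PER_PAGE  (PAGE_SIZE / 4)   /* 4-byte RISC insns */

    typedef struct {
        void    *handler;       /* service-routine address (threaded code) */
        uint32_t operands;      /* pre-decoded operand fields */
    } DecodedInsn;

    typedef struct {
        uint32_t    vaddr;                    /* target page address */
        DecodedInsn insns[INSNS_PER_PAGE];
    } DecodedPage;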

 While decoding the raw instructions of a given page, we maintain a local flag that is set when the instruction being decoded is a branch/jump instruction and cleared otherwise. This flag is checked for every instruction: if it is set, we choose the delay-slot version of the interpretive code of that instruction; if it is clear, we use the normal version. Note that this is much the same as what the pre-optimization implementation did, only this time it is done in the decoding phase instead of at run time.
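
 Continuing the sketch above, the decode loop might read as follows; the decode_* helpers and is_branch_or_jump are hypothetical stand-ins for the real decoder:

    #include <stdbool.h>

    bool is_branch_or_jump(uint32_t raw);                  /* hypothetical */
    void decode_normal (DecodedInsn *out, uint32_t raw);   /* plain version */
    void decode_in_slot(DecodedInsn *out, uint32_t raw);   /* slot version */

    /* Decode one page, choosing a specialization per instruction.
       Returns the final value of the flag, i.e. whether the page's
       last instruction is a branch/jump. */
    static bool decode_page(DecodedPage *page, const uint32_t *raw,
                            bool in_slot)
    {
        for (unsigned i = 0; i < INSNS_PER_PAGE; i++) {
            if (in_slot)
                decode_in_slot(&page->insns[i], raw[i]);
            else
                decode_normal(&page->insns[i], raw[i]);
            in_slot = is_branch_or_jump(raw[i]);  /* next insn's position */
        }
        return in_slot;
    }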

 The only problem with this scheme is what to do with the first instruction on a page: where does the initial value of the flag come from? While a cross-page branch delay slot is uncommon, the architecture does not exclude it from happening, so we must handle it in our interpreter.

 The answer is that we maintain a similar, but non-local, flag in each decoded page, whose value is determined by the last instruction on that page. The local flag then gets its initial value from the state flag of the previous page. If the previous page does not exist when this page is decoded, the initial value defaults to false.
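
 Putting the pieces together (and assuming the DecodedPage sketch above gains a Boolean ends_in_branch field; lookup_decoded_page is a hypothetical page-cache query that returns NULL for pages not yet decoded):

    #include <stdbool.h>
    #include <stdint.h>

    DecodedPage *lookup_decoded_page(uint32_t vaddr);      /* hypothetical */

    static void translate_page(DecodedPage *page, uint32_t vaddr,
                               const uint32_t *raw)
    {
        const DecodedPage *prev = lookup_decoded_page(vaddr - PAGE_SIZE);
        bool in_slot = prev ? prev->ends_in_branch : false;  /* default */

        page->vaddr          = vaddr;
        page->ends_in_branch = decode_page(page, raw, in_slot);
    }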

 The correctness of this algorithm rests on the following reasoning: if the first instruction on a page is in a delay slot, then we must have arrived there by executing instructions on the immediately preceding page, which therefore must have its decoded page in memory, with its in-slot flag correctly set.

Result

 Happy with this discovery (to the best of my knowledge, no one has come up with a similar idea before; this is a very specific application domain anyway), I re-implemented the original threaded-code instruction set simulator with the new algorithm, expecting a performance improvement of somewhere close to 20%.

 How did I come up with this 20% number? Simple. The intrinsic figure of merit for the performance of an instruction set simulator is the average number of host instructions executed to interpret one target instruction. This number is independent of the clock frequency of the host processor. While it certainly depends on the instruction set architecture of the host processor, we have reason to believe that it does not vary significantly from architecture to architecture. I estimated that on average our original simulator executes 20 host instructions to interpret one target instruction, and that by specializing non-branching instructions with regard to their relative position in the code stream, I could shave off at least 3 host instructions on average. One does not need to be particularly good at math to compute the final result: 20/(20 - 3) ≈ 1.18, or close to a 20% speedup. It is a very rough, but nevertheless valid, estimation.

 To my surprise, the benchmark showed that the new simulator executes at 105 MIPS (million instructions per second), against 115 MIPS for the original one. The delay-slot “optimization” I had labored over actually brings down the performance by almost ten percent!

 How do I explain this seemingly contradictory result?

 It could be that the time the simulator spends on decoding is not negligible compared with the time it spends on executing. In that case, the overhead the new simulator incurs in the decoding phase could offset what it gains in the execution phase. But a more likely reason, I suspect, is that the interpretive code was inadvertently restructured in the new simulator into a memory layout that makes poorer use of the host machine's cache system. This is just a guess; I don't have a definite answer to this puzzle yet.

Conclusion

 So what is the lesson of this story?

 While not directly relevant, and though I by no means regret this small adventure of optimizing an already fast simulator, I could not help but think of the famous remark D. Knuth made many years ago: “premature optimization is the root of all evil”.

 Optimization is hard, harder than most of us would expect. A simple solution is often the best solution. When optimization is necessary, we should try to optimize “in the large”. For example, by replacing the simple, classic read-evaluate-loop control structure with direct-threaded code, we got an order-of-magnitude performance boost. Surely that is worth the effort for a performance-critical application. But don't waste your time on tiny piecemeal optimizations that you merely think ought to work. Even when such an optimization works, it may take its toll on something else, say, code clarity, which in an engineering sense is more important than a, say, 5% increase in runtime efficiency.

 Let me end here by quoting another great mind: “keep things simple, but no simpler.”