The Pentium collects lots of information about code execution, and now you can get access to it.
When Intel announced the Pentium processor in March 1993, I immediately ordered the three-volume user's manual. For people like me, who wanted to write the fastest, most efficient code possible, volume 3 appeared to be the most useful. Imagine my chagrin, then, when every interesting section on optimization contained a reference to Appendix H, which consists of a single, illuminating paragraph stating that the information I desired is "considered Intel confidential and proprietary." This information is only available to those willing to sign a nondisclosure agreement with Intel.
From the published Pentium documentation and other sources, I knew that the Pentium could return detailed statistics on all major parts of its CPU--just the type of information that is essential for code optimization. The best place to look for such information was in the new, documented RDMSR (read machine-specific register) and WRMSR (write machine-specific register) instructions. These instructions work on a set of 64-bit MSRs (machine-specific registers) contained in the Pentium.
To use RDMSR and WRMSR, you move the register identifier (i.e., the number) of the desired MSR into register ECX. Invoking RDMSR will then transfer the contents of the indicated MSR into the paired registers EDX:EAX, while WRMSR copies EDX:EAX into the internal register. The Pentium user's manual documents MSRs 0h, 1h, and 0Eh, and also states that MSRs 3h and 0Fh, as well as values above 13h, are reserved and illegal. I felt sure the undocumented registers held the key to the optimization information I wanted.
As the first step in deciphering the undocumented registers, I wrote a test program that dumped the contents of the MSRs. (I quickly discovered that any attempt to read MSR 0Ah halted my PC, so until somebody finds a use for it, I suggest leaving that one alone.) Running the test program, I found that the content of most of the registers was static. The exception was MSR10h, which was changing rapidly indeed. Guessing that MSR 10h might contain a running cycle count, I divided the value contained in 10h by my processor's 60-MHz clock speed. My hunch paid off when I ended up with a nice display of the number of seconds since I had last powered-up. Using RDMSR to read MSR10h gives you the highest precision counter available to 80x86 programs. By reading the value in MSR10h before and after a block of code, you'll know exactly how long the processor took to execute the block, down to the last cycle.
These results parallel the ones you get when you use the RDTSC (read timestamp counter) (0F/31) instruction. Mike Schmid revealed the existence of this instruction in the January issue of Dr. Dobb's Journal. As with many of the MSRs, RDTSC is not documented anywhere, except in the instruction decoding tables, where it fits right between WRMSR (0F/30) and RDMSR(0F/32). A quick comparison of RDTSC and RDMSR shows that both access the same running cycle count, with RDTSC being an alternative and slightly faster way to retrieve the data.
Unfortunately, RDMSR and RDTSC are kernel mode (ring 0) instructions. My PC crashed when I ran these instructions inside a DOS box or with a memory manager. I am guessing that you can enable ring 3 access to RDTSC, maybe byusing MSR 0Eh (test register 12 in the Intel manual), which is documented as "new feature control," or MSR 0Dh, which seems to contain a value similar to MSR 0Eh; however, I have yet to discover how to enable ring 3access.
My next break in deciphering the MSRs came during a visit to a U.S-based developer. There, I saw a utility that displayed a number of interesting statistics about programs running on a Pentium machine. The utility could dynamically display one or two internal counters from a list of 38 different hardware events. The statistics were all related to different aspects of processor performance and were just the information I needed to perform informed code optimization on the Pentium.
For example, when the developers used the utility to profile another program, the utility revealed that the target program was generating a lot of accesses to misaligned memory variables. A simple recompile of the target program, using double word (4-byte) alignment, resulted in a 21/2-times speedup.
The developers realized that the utility would be useful for other programmers, so they obtained permission from Intel to distribute the program, as long as the source code was kept secret. I obtained a copy of the executable file to see if I could figure out how it accessed the Pentium statistics.
My first obstacle was creating a disassembled listing. I converted the program code into a list of Define Byte (DB xxh) statements. I encapsulated this naked code within an assembly program wrapper, ran Borland's TASM (Turbo Assembler), and then converted the object file into a listing. Next, I located the RDMSR and WRMSR byte sequences (the Pentium wasn't around when my object disassembler was written) and started working backwards from there. After a few days of tracing and testing, I found out how the internal counters work.
The controller for the Pentium hardware counters is MSR 11h; more specifically, the lower 32 bits of MSR 11h. The first 16 bits determines the data that will end up in MSR 12h, while the second 16 bits determines the counter that will report its results in MSR 13h, which is the nineteenth and last MSR on a Pentium. An obvious extension for Intel's next CPU, the P6 (Hexium, anyone?) would be to use all 64 bits of MSR 11h and add two more stat counters as MSR 14h and 15h. The lack of more MSRs limits you to accessing no more than two counters at a time.
The encoding of each 16-bit block of MSR 11h is identical. The first 6 bits (0 to 5) are an index into the list of available hardware events (see the table "Pentium Counters" on page 191). When set, bit 6 enables counting of events in the operating-system rings 0, 1, and 2, while bit 7 enables ring 3 monitoring.
Bit 8 indicates whether you want to collect the number of hardware events or the CPU cycles that the events use. Thus by setting up both counters to track the same item, with one counting events and the other counting cycles, you get a measurement of the average time it takes to complete the tracked event.
Using this information, I wrote P5Stat, a profiling program that accesses the Pentium hardware counters. P5Stat accepts another program name on the command line and then sets out to execute the indicated program 20 times. The first time through ensures that all the caches are loaded, while on each of the next 19 runs P5Stat collects two of the 38 different hardware counters available. After the last run, P5Stat dumps all the results to standard output, where it can be redirected to a file for later use.
P5Stat has proven useful in code optimization. For example, I recently used it on WC 5.26, a freeware word count program that I wrote almost three years ago. I discovered that without optimization the dual-pipeline Pentium gave a 43 percent speedup compared to running all the code in a single pipe (i.e., on a 486). Using P5Stat to identify crucial bottlenecks, I rearranged the inner loop of the counting function for the new version, WC5.40. This required more instructions, but P5Stat showed that I had achieved nearly 100 percent filling of the dual pipes, resulting in an actual counting speed of 1.5 cycles per byte, or 40 MBps on my 60-MHz Pentium. This is a 33 percent speedup over the previous Pentium version of WC. (See the "Program Listings" on page 9 for information on how to obtain P5Stat and WC 5.40.)
The profiling information available to Pentium programmers is a powerful aid in software development. With the information in this article, you can access these features and use them to identify bottlenecks and inefficient coding practices in your programs. I hope Intel makes official information available to all programmers and that such useful features are incorporated into other architectures such as Alpha, PowerPC, and SPARC.
Using the Hardware CountersFirst, define macros for the new instructions:
| Index ||Name|
|2||Data TLB (translation look-aside buffer) miss|
|3||Data read miss|
|4||Data write miss|
|5||Write (hit) to M or E state lines|
|6||Data cache lines written back|
|7||Data cache snoops|
|8||Data cache snoop hits|
|9||Memory accesses in both pipes|
|B||Misaligned data memory references|
|D||Code TLB miss|
|E||Code cache miss|
|F||Any segment register load|
|13||BTB (branch target buffer) hits|
|14||Taken branch or BTB hit|
|17||Instructions executed in the v-pipe|
|18||Bus utilization (clocks)|
|19||Pipeline stalled by write backup|
|1A||Pipeline stalled by data memory read|
|1B||Pipeline stalled by write to E or M line|
|1C||Locked bus cycle|
|1D||I/O read or write cycle|
|1E||Noncacheable memory references|
|1F||AGI (Address Generation Interlock)|
|23||Breakpoint 0 match|
|24||Breakpoint 1 match|
|25||Breakpoint 2 match|
|26||Breakpoint 3 match|
|28||Data read or data write|
|29||Data read miss or data write miss|
Then when you want to use the specific counters, I suggest that you read the current value of MSR 11h first and only modify the part you need to set up your code
RDMSR MACRO db 0fh, 032h
WRMSR MACRO db 0fh, 030h
Now any read of MSR 12 will retrieve the current value of hardware event Nr1, while MSR 13h contains event Nr2:
mov ecx,11h RDMSR
and eax,0FE00FE00h ; save the upper 7 bits in each half
or eax,Nr1_idx+Nr1_ctrl+(Nr2_idx+Nr2_ctrl) shl 16
Finally, insert the code to be tested here.
mov ecx, 13h
mov ecx, 12h
call disp64 ; display first count
call disp64 ; display second count