Beyond the z196 Pipeline
In the January/February 2012 print edition, I covered some aspects of how the IBM zEnterprise 196 (z196) processor executes instructions through its instruction pipeline. But there is more to the z196 processor than just the pipeline. Here’s an overview of some of the other elements, including branch prediction, special-purpose co-processors and enhanced translation look-aside buffers (TLBs). But first, let’s take a brief look at the nest.
The Cache Topology
The nest is the part of the processor responsible for accessing data in memory and maintaining proper coherency of the data. It’s extremely important because one of the major delays in instruction execution is the delay to access data. To reduce the delay, in addition to main memory, the nest includes a number of levels of buffers, called caches, to hold some of the data closer to the instruction processor.
In computer design, closer is faster. Figure 1 shows the topology of the caches on z10 and on z196. The z10 has three levels of cache: a small, low-latency cache called L1 associated with each processor; a larger but slower cache called L1.5, also associated with each processor; and a much larger, higher-latency cache called L2, which is shared by all of the processors in a book. The upper part of the figure is a schematic view of the cache topology on the z10. The L2 caches on each book communicate with each other and maintain cache coherency across all of the processors. But even three levels of cache to buffer data and avoid accesses all the way out to main memory would not have been enough to keep the z196 pipeline rolling smoothly. The z196 therefore has an additional level of cache, shared by the four processors on a chip and thus a chip-level cache, shown in the lower part of the figure. The cache levels are renamed so that on the z196, L1 and L2 are the two processor-private caches, L3 is the new chip-level cache and L4 is the cache shared by all processors on a book. With the addition of the on-chip L3 cache, the z196 is much better able to keep its 5.2 GHz out-of-order pipeline fed with instructions and data.
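To make the renaming of the levels concrete, here is a minimal Python sketch of the two topologies as ordered lists, nearest level first. It is a toy model, not a description of the hardware: the names Z10_LEVELS, Z196_LEVELS and nearest_hit are hypothetical, and the model ignores details such as the split L1 caches and traffic between books. Only the level names and their sharing scope come from the description above.

```python
# Cache levels, nearest to the processor first, as (name, shared-by) pairs.
# Names and scopes follow the article; everything else is illustrative.
Z10_LEVELS = [("L1", "processor"), ("L1.5", "processor"), ("L2", "book")]
Z196_LEVELS = [("L1", "processor"), ("L2", "processor"),
               ("L3", "chip"), ("L4", "book")]


def nearest_hit(levels, holding):
    """Return the nearest level whose name is in `holding`.

    `holding` is the set of level names that currently contain the
    requested line. If no level holds it, the request goes to memory.
    """
    for name, _scope in levels:
        if name in holding:
            return name
    return "memory"


if __name__ == "__main__":
    # A line present only in the chip-level and book-level caches.
    print(nearest_hit(Z196_LEVELS, {"L3", "L4"}))
    # A line not cached anywhere on a z10.
    print(nearest_hit(Z10_LEVELS, set()))
```

The point of the ordering is simply that each additional level gives a request one more chance to be satisfied before it has to travel farther out.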
As a further optimization, there are actually two L1 caches, one for instruction fetches and one for operand fetches. These caches are exclusive in that the processor cannot maintain the same cache line in both of these L1 caches. This is why there’s such a large performance penalty for “self-modifying code.” If a program modifies instructions that are in the instruction cache, then the containing cache line must be purged from the instruction cache, loaded into the data cache and then pushed up to L2 from the data cache so it can again be loaded into the instruction cache. Actually, it doesn’t even have to be instructions that are modified. Any modification to a 256-byte cache line that is in the instruction cache initiates this process, so programmers and compilers have to be careful.
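The arithmetic behind that warning can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 256-byte line size comes from the text above, while the function names and the example addresses are hypothetical, and real placement decisions are made by the assembler, compiler or binder rather than by code like this.

```python
LINE_SIZE = 256  # bytes per cache line, as stated in the article


def line_index(addr, line_size=LINE_SIZE):
    """Return the index of the cache line that contains byte address `addr`."""
    return addr // line_size


def write_shares_line_with_code(write_addr, code_start, code_len,
                                line_size=LINE_SIZE):
    """True if a store to `write_addr` lands in a line that also holds code.

    The code occupies bytes code_start .. code_start + code_len - 1.
    """
    if code_len <= 0:
        return False
    first = line_index(code_start, line_size)
    last = line_index(code_start + code_len - 1, line_size)
    return first <= line_index(write_addr, line_size) <= last


if __name__ == "__main__":
    # Hypothetical layout: 0x90 bytes of code starting at 0x1000.
    # A work area at 0x10F0 sits in the same 256-byte line as the code.
    print(write_shares_line_with_code(0x10F0, 0x1000, 0x90))  # True
    # Moving the work area to the next line boundary avoids the sharing.
    print(write_shares_line_with_code(0x1100, 0x1000, 0x90))  # False
```

In this example the work area at 0x10F0 is not part of the code at all, yet a store to it touches the same line as the instructions. Placing modifiable data on a separate 256-byte boundary avoids that.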
Bob Rogers is a z/OS designer and evangelist. An IBM Distinguished Engineer, he frequently presents at SHARE and other conferences.