CPU Threading Efficiency: How to Improve L2/L3 Cache Hits
One aspect of gaining better performance has traditionally been overlooked: improving L2/L3 cache hit rates.
For the most part, AIX performance specialists have focused on executing threads as fast as possible. While this is obviously important, simply having threads execute quickly isn't enough, because this approach doesn't consider the rate and latency of L2/L3 cache misses. To address this, attention must be paid to keeping L2/L3 cache content undiluted: that is, configuring the system so that fewer virtual CPUs from different LPARs share any given CPU core. As shorthand, I refer to this as "CPU threading efficiency."
Relative to the count of CPU cores, a workload can exhibit a few threads running strenuously, many threads switching rapidly, or a varying blend of both. This series of articles will define these characteristics, highlight how the POWER8 core improves workload performance, discuss POWER8 SMT-1/2/4/8 mode considerations, and offer models of CPU core threading for efficiency, throughput and performance.
(Note: The intent with these articles isn't to examine the physical POWER8 CPU complex. Rather, the focus is narrowed to the characteristics of a single CPU core. To make a clear distinction, I use the term "CPUcore" to refer to one core on a POWER8 system.)
CPU Under-Threading, and a Bit About Over-Threading
CPU under-threading and over-threading are two distinct problems. Under-threading is wasteful, expensive and rarely justified. Over-threading is frustrating, expensive and difficult to recognize. While neither is likely to cause a system outage, both extremes add overhead to and degrade the potential productivity of the entire POWER8 system. (Of course, either can be warranted for certain circumstances and workload characteristics, but that discussion is beyond the scope of this article.)
Under-threading is a sustained state of maintaining too few executing threads across too many CPUcores of a given LPAR. It's wasteful of CPUcores that would otherwise be used by other LPARs. Under-threading is encouraged when enterprises purchase more software CPU licenses than are needed.
Over-threading is a sustained state of maintaining too many executing threads across too few CPUcores of a given LPAR. It's wasteful of CPU cycles because of the overwhelming overhead of too many threads concurrently loading/storing instructions and data (called load/store unit overhead, or LSU overhead, addressed below). Over-threading is often induced by purchasing too few software CPU licenses, which confines the workload to too few CPUcores.
(As you can probably imagine, of these two scenarios, under-threading is the far more common problem. For this reason, this series of articles will have little more about over-threading, though I do intend to return to this topic in the future.)
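One rough way to reason about these two extremes is the ratio of sustained runnable software threads to available hardware threads (CPUcores × SMT mode). The sketch below is a heuristic of my own construction for illustration only; the threshold values are assumptions, not AIX or POWER8 guidance.

```python
def threading_state(runnable_threads, cpucores, smt_mode=4,
                    low=0.5, high=1.5):
    """Classify a sustained workload as under-, well-, or over-threaded.

    runnable_threads: average count of runnable software threads
    cpucores: CPUcores assigned to the LPAR
    smt_mode: hardware threads per CPUcore (POWER8 supports 1, 2, 4 or 8)
    low/high: illustrative thresholds, not vendor guidance
    """
    hw_threads = cpucores * smt_mode
    ratio = runnable_threads / hw_threads
    if ratio < low:
        return "under-threaded"   # CPUcores sit idle; wasted licenses
    if ratio > high:
        return "over-threaded"    # LSU and dispatch contention likely
    return "well-threaded"

# Example: 6 runnable threads on 8 CPUcores in SMT-4 (32 hardware threads)
print(threading_state(6, 8, smt_mode=4))   # under-threaded
```

The point of the sketch is only that both extremes are visible from the same ratio; real diagnosis would use sustained AIX run-queue and utilization data, not a single snapshot.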
How efficiently CPUcores run depends on how they are fed thread instructions and data from L2/L3 cache; this cannot be overstated. A CPUcore can only execute threads residing in its L1 instruction cache (L1-I) and L1 data cache (L1-D). Thus a primary priority is the load/store traffic from L2/L3 cache that feeds the L1 caches, activity managed by the CPUcore's load/store unit (LSU) circuitry. Threads are instructions that generally process streams of data, and those instructions must reside in L1-I before being dispatched to the CPUcore's execution pipelines. The loading of thread instructions and data into CPUcore cache is LSU overhead; a thread is not dispatched for execution until its instructions are L1-I ready. All threads incur LSU overhead, because all threads load instructions and data for processing, and store results.
The hypervisor owns and underlies every LPAR, mediating everything that goes to the Power Systems hardware, including LSU overhead. The hypervisor's claim on CPUcore attention is more urgent than even that of the AIX root user. Relative to AIX user/kernel workloads, the hypervisor is so lightweight and efficient that performance specialists seldom think of it. Even so, it's important to consider the hypervisor: giving it excess work is bad for overall efficiency, while giving it less work is good for overall performance. And yes, we can give the hypervisor less to do. This is an innate benefit of optimal CPU threading.
Extending the Duration of Thread Performance
Let's now consider today's trend of large-scale, long-duration workloads (e.g., Big Data and analytics applications), which often process disproportionately large concurrent streams of data for hours or even days at a time. For such workloads on shared processor LPARs, pervasive CPU under-threading greatly increases the sharing of CPUcores among shared processor LPARs. This dilutes the hit ratio of instructions and data in the L2/L3 cache of every SPLPAR assigned to those shared CPUcores and can lead to erratic thread performance.
If you stop to consider your own environment, the previous sentence is either completely relevant or utterly nonsensical. Why? It's a matter of scale.
If your shared processor LPARs house light workloads of fewer or shorter-duration threads on small-scale POWER/AIX systems, you are easily running with exceptional performance and efficiency out of the box; in your experience, the statement above seems nonsensical. However, if your shared processor LPARs house heavy workloads of many longer-duration threads on large-scale E850/E870/E880 enterprise-class systems, the statement above looms large in your experience, and the information I'll share in this series of articles is written specifically for you. Again, the difference is the scale of the workload and the LPAR configuration.
As noted, CPU under-threading dilutes the hit ratio of instructions and data in the L2/L3 cache of its attending CPUcores. When virtual CPUs of different LPARs too often share the same CPUcores, they dilute the L2/L3 cache content of every other LPAR. In contrast, while a CPUcore is executing a thread, there is a duration of exceptional thread performance for as long as thread instructions and data are readily available. As we continue with this series I'll discuss several tactics that can improve CPU threading efficiency.