AIX > Administrator > Performance

Slicing Time: Curing the Context Switch in AIX

AIX Context Switch

Over the past year I’ve shown you how to use advanced tools to diagnose performance problems with AIX systems. We’ve covered topics like kernel tracing and all of the reports that can be extracted from that data, CURT for detailed CPU usage, and SPLAT for system lock analysis. We've also examined PerfPMR, IBM’s go-to utility for diagnosing system issues.

Now that we’ve laid a solid groundwork for performance analysis, let’s shift gears. In this and future articles, I'll show you how to tune your systems based on information you’ve gleaned from your diagnostic data. Being able to diagnose AIX performance issues is important, but the next step is what really matters. Now you're ready to flip the switches, turn the dials and actually fix these issues.

Context Switching in Context

Let's start with a fairly common performance problem: frequent context switching. A context switch occurs when a CPU stops processing a working thread to process another thread. The CPU is changing its operating parameters (or its “context”) and prioritizing another program.

Some degree of context switching is normal in any computer operating system—not just AIX. In fact, it's vital. Without context switching, systems couldn’t process more than one executable; you would run one thread and that would be it. Obviously a system that runs only one thread isn't getting much work done, so you need a mechanism that tells a CPU to shift its attention from one thread to another at an appropriate time.

There are actually dozens of conditions that initiate a context switch, including self-blocking threads and threads that are blocked by still other threads. A mechanism called CPU Decay determines how long a thread may remain prioritized on a CPU. Programs have built-in timeouts that limit a thread’s CPU time. In some cases, a thread may have its CPU usage curtailed to allow other more important system activity to take place.

In short, context switching is fine, but it's a matter of degree. If your system is doing nothing but context switching, then you have a serious performance problem.

So we must first determine the difference between normal and abnormal context switching. There are no hard and fast rules for this. It simply requires a thorough understanding of your system’s workload and how that workload uses logical CPUs. More than black and white performance statistics, you must develop an intuition about your system: is it operating correctly or isn't it? And it isn't enough to know if your system is operating as it should. You need to know why—and on either count.

Analyzing the Numbers

Many utilities can be used to examine context switching, including Topas, NMON, lparstat and a host of other diagnostic programs. I prefer vmstat, because it provides system CPU and memory usage in a single concise display. Plus, the most recent incarnations of vmstat incorporate something called the PURR, which gives you accurate statistics in an IBM Power Systems SMT environment.

If you start vmstat in a vanilla manner—say, by issuing “ vmstat –w 2 “ at an AIX command prompt, you'll be presented with a number of columns of data. For our discussion, we’ll focus on only one of these columns: cs, which of course stands for "context switch." The cs column lies under the “faults” heading of your vmstat output, and its data will tell you whether the rate of context switching on your system is normal.

But let’s back up a bit. To start our diagnosis, we need two pieces of critical data. The first is a consistent sampling rate for our vmstats, or a consistent number of seconds between each line of vmstat output. Taking one set of vmstats with a 10-second sampling rate and another with a 2-second sampling rate will completely invalidate our average context switch findings. Consistency is needed so that our up and down rates of context switching are in proportion from one set of statistics to another.

The next thing we need to know is how many logical CPUs we have in our system, and how many hardware threads we’re dealing with. Recall that every Power Systems physical CPU has a certain number of hardware threads to which software threads are mapped for execution. Each of these hardware threads is presented to AIX as a logical CPU: POWER5 and POWER6 CPUs have two hardware threads which are presented to AIX as two logical CPUs, POWER7 CPUs have four logical processors (LPs) and POWER8 CPUs have eight LPs. Working threads will be dispatched to run on logical processors.

Mark J. Ray has been working with AIX for 23 years, 18 of which have been spent in performance. His mission is to make the diagnosis and remediation of the most difficult and complex performance issues easy to understand and implement.



Like what you just read? To receive technical tips and articles directly in your inbox twice per month, sign up for the EXTRA e-newsletter here.



Advertisement

Advertisement

2017 Solutions Edition

A Comprehensive Online Buyer's Guide to Solutions, Services and Education.

Achieving a Resilient Data Center

Implement these techniques to improve data-center resiliency.

AIX Enhancements -- Workload Partitioning

The most exciting POWER6 enhancement, live partition mobility, allows one to migrate a running LPAR to another physical box and is designed to move running partitions from one POWER6 processor-based server to another without any application downtime whatsoever.

IBM Systems Magazine Subscribe Box Read Now Link Subscribe Now Link iPad App Google Play Store
AIX News Sign Up Today! Past News Letters