Finding the Bottleneck
For whatever reason, I/O seems to be the one component of system performance that doesn't receive much scrutiny in performance analysis. There's also confusion about which tools to use and how to interpret their data. How many vmstat-derived charts have you seen where the analyst has included the CPU wait column in the chart showing utilized CPU resources? This practice is incorrect for two reasons. First, the CPU is typically put into a wait state upon a cache miss, and designers of modern processors use several technologies (fine- and coarse-grained multithreading and simultaneous multithreading) to mitigate the CPU "wait" state. Second, modern disks use DMA, which relieves the processor of all of the I/O work except request initiation.
Also, if a logical volume is spread across many, or even hundreds of, disks in a large storage area network (SAN) environment, the I/O-related vmstat values are ambiguous. Therefore, if non-zero values are shown in the "wa" column, that's an indicator to look at I/O using another tool.
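The "wa" check described above can be sketched as a small shell helper. This is a minimal sketch, not part of the original analysis: the function name wa_max is hypothetical, and it assumes only that the vmstat output contains a header line with a field labelled "wa". It reports the largest "wa" value seen, locating the column by name rather than by position.

```shell
# Hypothetical helper (name is an assumption): print the largest value
# found in the "wa" column of vmstat-style output read from stdin.
wa_max() {
    awk '
        BEGIN { max = 0 }
        # Until the header is found, look for the field labelled "wa".
        col == 0 {
            for (i = 1; i <= NF; i++) if ($i == "wa") col = i
            next
        }
        # Data lines: keep the largest numeric value in that field.
        # Repeated headers and blank lines are skipped by the numeric test.
        ($col ~ /^[0-9]+$/) && ($col + 0 > max) { max = $col + 0 }
        END { print max }
    '
}
```

For example, `vmstat 5 12 | wa_max` would sample every five seconds, twelve times, and print the peak "wa" value. A non-zero result is only a prompt to look at I/O with a more detailed tool, not a measurement of disk load.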
Without a doubt, my tool of choice for analyzing I/O on AIX* is filemon. While iostat can provide some details, this investigation will take me deeper than iostat can go. I also believe that analyzing I/O requires fundamental knowledge of queuing theory, functional knowledge of disk mechanics and I/O workloads, and some knowledge of the Logical Volume Manager (LVM) and the Virtual Memory Manager (VMM).
This article doesn't define the numerous parameters that can be measured with filemon. For this analysis, I'll focus on physical volume metrics. (Note: Any performance data contained in this article was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration.)
Actual Analysis
In a production environment, several machines were clustered as Lightweight Directory Access Protocol (LDAP) replica servers (using a DB2* database). A capacity study was initiated to determine the performance bottleneck and guide the direction of an upcoming hardware upgrade. The workload was determined to be open with two classes: database reads and database updates. The transaction component deemed most critical was the read response time.
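As a rough illustration of what an open workload with two classes implies, a textbook open queueing model treats each disk as a single service center. The symbols below are assumptions introduced here for illustration and do not come from the study: $\lambda_c$ is the arrival rate of class $c$ and $S_c$ is its mean service time at the disk. The formulas are the standard utilization law and the usual residence-time approximation, not necessarily the model used in this analysis.

```latex
U = \lambda_{\text{read}} S_{\text{read}} + \lambda_{\text{update}} S_{\text{update}},
\qquad
R_{\text{read}} \approx \frac{S_{\text{read}}}{1 - U},
\quad 0 \le U < 1
```

Under these assumptions, read response time rises steeply as the combined utilization $U$ approaches 1, so update traffic can degrade reads even when the read rate itself is unchanged.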
The data shown in Figure 1 (see below) was collected on a production server using the following command:
# filemon -T 1000000 -u -O lv,pv -o $HOME/filemon.out; sleep 3600; trcstop
(Note: The data-collection period in this analysis was 3,600 seconds. This long collection period follows best practices in capacity analysis. However, 60 to 180 seconds is typically used for performance analysis.)
Tom Farwell is a technical editor for IBM Systems Magazine, Open Systems edition. He can be reached through www.tomfarwellconsulting.com.