Breaking the Bottleneck
An overview of GPFS components and terminology.
Planning for GPFS
All nodes in a GPFS cluster should run the same release of the GPFS code if they mount the same file systems. It's recommended that they also run the same AIX* version and release level. IBM has registered port 1191 with IANA for GPFS-specific traffic, and GPFS can also use OpenSSH and OpenSSL to secure communications, so any firewalls between nodes need to allow traffic on port 1191 and on the ports those services use. The GPFS daemon is called mmfsd and must be running on the nodes.
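One rough way to confirm that a firewall isn't blocking GPFS traffic is to attempt a TCP connection to port 1191 on each node. The following is a minimal sketch, not a GPFS tool: the function names are hypothetical, and it assumes mmfsd is already running and listening on TCP port 1191 on the target node.

```python
import socket

GPFS_PORT = 1191  # port registered with IANA for GPFS traffic


def port_reachable(host, port=GPFS_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        # create_connection resolves the name and completes the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures
        return False


def check_nodes(hosts, port=GPFS_PORT):
    """Map each host name to whether the port could be reached (hypothetical helper)."""
    return {host: port_reachable(host, port) for host in hosts}
```

Calling check_nodes with the host names from the cluster's nodelist returns a dictionary of results. A False entry only says the connection failed; it doesn't distinguish a firewall rule from a daemon that isn't running.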
Before creating a cluster, several planning steps must be taken. Some of these can have significant performance impacts. Specifically, four parameters - pagepool, maxFilesToCache, maxStatCache and maxMBpS - must be determined, and attention needs to be paid to the correct selection of the blocksize and RAID stripe size. Where NSDs are being used, it's critical that all network tunables (no -a) be reviewed to ensure there are sufficient network buffers and that ipqmaxlen is set to an optimal value. Since this is a high I/O file system, settings for memory and I/O (vmo and ioo) as well as pbufs should be carefully monitored and regularly reviewed.
One of the first steps is to define the GPFS root directory for definitions. If we use /usr/local/gpfs, then we would expect to see the configuration file (mmfs.cfg), nodelist and the disk description file (disk.desc) in that directory. Examples of the contents of these files are in Appendix A, and a sample planning spreadsheet can be viewed here.
GPFS Configuration Settings
The following list highlights common GPFS configuration settings:
- Pagepool - The pagepool is used for I/O buffers to cache user data and indirect blocks. It's always pinned, and the default is fairly small. It's used to implement read/write requests asynchronously using the read-ahead and write-behind mechanisms. Increasing the pagepool increases the amount of data available in the cache for applications to use. This parameter is critical where applications perform significant amounts of random I/O.
- maxFilesToCache - This is the total number of different files that can be cached at any one time. This needs to be set to a large enough value to handle the number of concurrently open files and allow for caching those files.
- maxStatCache - This is additional pageable memory that's used to cache file attributes that aren't in the regular file cache. It defaults to 4 * maxFilesToCache.
- maxMBpS (definition from the provided default mmfs.cfg) - maxMBpS is an estimate of how many MBps of data can be transferred in or out of a single node. The value is used in calculating the amount of I/O that can be performed to effectively prefetch data for readers and/or write-behind data from writers. The maximum number of I/Os in progress concurrently will be 2 * min(nDisks, maxMBpS * avgIOtime / blockSize), where nDisks is the number of disks that make up a file system; avgIOtime is a measured average of the last 16 full-block I/O times; and blockSize is the size of a full block in the file system (e.g., 256K). By lowering this value, you can artificially limit how much I/O one node can put on all the virtual shared disk (VSD) servers, which helps if there are many nodes that could overrun a few VSD servers. Setting this too high will usually not hurt because of other limiting factors such as the size of the pagepool, or the number of prefetchThreads or worker1Threads.
- Blocksize - The blocksize determines the largest file system size and should be set to the application buffer size or the stripe size on the RAID set. If this is done incorrectly, performance will suffer significantly. Once the blocksize is set, the minimum space required for a file will be 1/32 of the blocksize, so choosing this setting also requires an understanding of the file sizes involved.
- prefetchThreads - This is the maximum number of threads that can be dedicated to prefetching data for files that are read sequentially.
- worker1Threads - This is the maximum number of threads that can be used for controlling sequential write-behind.
- worker2Threads - This is the maximum number of threads that can be used for controlling other operations.
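The arithmetic behind three of these settings can be sketched in a few lines of Python. This is an illustration of the formulas quoted above, not GPFS code: the function names are hypothetical, the unit conversions (1 MB taken as 2**20 bytes, avgIOtime given in milliseconds) are assumptions, and the sample values are made up for the example.

```python
def max_concurrent_ios(n_disks, max_mbps, avg_io_time_ms, block_size_bytes):
    """Evaluate 2 * min(nDisks, maxMBpS * avgIOtime / blockSize).

    Assumptions: max_mbps is in MB per second with 1 MB = 2**20 bytes,
    and avg_io_time_ms is the average full-block I/O time in milliseconds.
    """
    bytes_per_second = max_mbps * 1024 * 1024
    # Number of full blocks the node could move during one average I/O time
    blocks_per_io_time = bytes_per_second * avg_io_time_ms / 1000 / block_size_bytes
    return 2 * min(n_disks, blocks_per_io_time)


def min_file_allocation(block_size_bytes):
    """Smallest space a file occupies: 1/32 of the blocksize."""
    return block_size_bytes // 32


def default_max_stat_cache(max_files_to_cache):
    """Default maxStatCache: 4 * maxFilesToCache."""
    return 4 * max_files_to_cache


# Made-up sample values: 8 disks, maxMBpS of 150, 10 ms average I/O, 256K blocks
block_size = 256 * 1024
print(max_concurrent_ios(8, 150, 10, block_size))
print(min_file_allocation(block_size))
print(default_max_stat_cache(1000))
```

With these sample numbers the bandwidth term (6 blocks) is smaller than the disk count (8), so maxMBpS rather than nDisks is what limits the concurrent I/O. Raising maxMBpS past the point where the term exceeds nDisks has no further effect on the result.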