A Remedy for What Ails You
IBM can prescribe an array of high-availability solutions
Illustration by Adam McCauley
The resiliency of your business depends on how highly available your solutions are. How do you choose the prescription to best meet your needs? IBM can help. We've seen how independent auxiliary storage pools (IASPs) combined with IBM i5/OS, clustering and journaling provide the flexibility you need to build reliable, cost-effective, easy-to-manage high-availability (HA) and disaster-recovery solutions. The purpose of this article is to give a high-level overview of some solutions that can be built using IASPs and IBM technologies, such as switched disk, mirroring options and remote mirror and copy.
Many factors go into building highly available systems. I could spend hours on the details of each one, but I'll focus here primarily on two factors: recovery point objective (RPO) and recovery time objective (RTO). The RPO, combined with the critical data point, determines the point to which data must be recovered after a system loss or failure - in other words, the amount of data loss you can tolerate. The RTO is the time after a disaster by which business functions must be restored - that is, how soon you'll need to recover and how long the company can be without critical computer services. No HA solution can be successfully implemented without knowing and meeting these business requirements. An HA assessment may be needed to truly understand them.
Switched Disks
Switched disks are DASD units contained in a tower connected to two IBM System i systems or partitions via a high-speed link (HSL) loop (see Figure 1). One of the two systems owns the tower and all of its resources; the tower can be switched to the other system if the owning system encounters an outage or needs to be powered down for maintenance. Switched disks require careful planning and an understanding of the hardware requirements. For example, you'll need to plan which systems will participate, how to set up your HSL network, which disk units will participate, which RAID sets the IASP will be built over and more. Without this careful planning, the switched-disk solution may not perform as desired. Because the distance between systems is limited by the length of the HSL cable, this solution provides campus-level HA. The IASP you create using this solution will consist only of disks in the towers being switched between the two systems.
Only one set of disks is required for this solution, and there's no replication of data. Therefore, the RPO for this solution is the last transaction committed to disk. Several factors determine the switch time of the tower, and you must take this switch time into consideration when calculating RTO. Customers using this solution generally see switch times of 10 to 15 minutes. Once the tower is switched, additional time is needed to get critical system functions running.
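As a rough illustration of folding switch time into an RTO check, here is a minimal Python sketch. The class name, the 20-minute restart figure and the 30-minute RTO are hypothetical values chosen for the example; only the 10-to-15-minute switch range comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class RecoveryEstimate:
    """Rough outage estimate for a switched-disk failover, in minutes."""
    switch_minutes: float   # time to switch the tower to the backup system
    restart_minutes: float  # hypothetical: time to restart critical functions

    @property
    def total_minutes(self) -> float:
        # Total outage is the switch time plus the restart time.
        return self.switch_minutes + self.restart_minutes

    def meets_rto(self, rto_minutes: float) -> bool:
        """True if the estimated outage fits within the required RTO."""
        return self.total_minutes <= rto_minutes


# Worst-case switch time from the quoted range (15 minutes);
# the restart time and the RTO are made-up example values.
estimate = RecoveryEstimate(switch_minutes=15, restart_minutes=20)
print(estimate.total_minutes)   # 35
print(estimate.meets_rto(30))   # False
```

In this made-up case the estimated outage exceeds the RTO, which suggests either shortening the restart procedure or revisiting the requirement.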
Cross-Site Mirroring
Cross-site mirroring (XSM) provides site disaster protection by keeping a copy of the IASP at another location (see Figure 2). This is sometimes called geographic mirroring because the mirrored systems can be geographically dispersed. XSM provides HA with more backup nodes than switched disks. The initial synchronization occurs when you make the disk pool available after you configure XSM. While XSM is active, changes to the production-copy data are transmitted synchronously to the mirror-copy system (and optionally written asynchronously to disk there) across TCP/IP connections, so it's imperative to size your bandwidth requirements properly. XSM is primarily used for distances of less than 100 km. i5/OS V5R4 enhancements let the mirror copy be detached and then separately made available to perform save operations, create reports or perform data mining. However, no synchronization occurs while the mirror copy is detached.
Geographic mirroring is logical mirroring, not physical mirroring. The two disk pools must have similar disk capacities, but the mirror copy may have different numbers and types of disk units as well as different types of disk protection.
XSM increases CPU load, so there must be sufficient excess CPU capacity; if there isn't, you may need to add processors. As a general rule, to achieve optimal performance for XSM, particularly during synchronization, increase your machine pool size by the amount given by the following formula:
271.5 MB + (0.2 MB x number of disk units in IASPs)
To prevent the performance adjuster from reducing the machine pool size, you should set QPFRADJ to zero.
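The machine pool formula above is easy to evaluate by hand, but a short Python sketch makes the arithmetic explicit. The function name and the 40-disk-unit configuration are hypothetical, chosen only for illustration.

```python
def machine_pool_increase_mb(disk_units_in_iasps: int) -> float:
    """Extra machine pool size in MB: 271.5 MB + 0.2 MB per IASP disk unit."""
    if disk_units_in_iasps < 0:
        raise ValueError("disk unit count cannot be negative")
    return 271.5 + 0.2 * disk_units_in_iasps


# Hypothetical configuration: 40 disk units across the IASPs.
print(f"{machine_pool_increase_mb(40):.1f} MB")  # 279.5 MB
```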
Since XSM is a mostly synchronous technology, you can achieve a relatively tight RPO. Switch times generally take less than 15 minutes. Take this switch time into consideration when calculating an RTO. You'll still need some time to get critical system functions running.
Copy Services and FlashCopy
Copy Services is an optional feature of the IBM System Storage Enterprise Storage Server, DS6000 and DS8000. It brings to open-systems environments powerful data-copying and mirroring technologies that were previously available only for mainframe storage. You can use Copy Services functions with i5/OS to make point-in-time copies of your data using the FlashCopy function and to mirror your data for disaster recovery using remote mirror and copy functions.
While FlashCopy alone doesn't provide HA, it's impossible to discuss System Storage copy services without mentioning FlashCopy. FlashCopy is generally positioned as a backup strategy, similar to Save While Active, and therefore more relevant in a discussion of disaster recovery. The exception is Global Mirror, which leverages FlashCopy to obtain data consistency and is therefore worth mentioning here.
FlashCopy makes a single point-in-time copy of a logical unit number (LUN) that resides in the same storage server. This is also known as a time-zero copy. The target copy is totally independent of the source LUNs and is available in a matter of seconds for both read and write access once the FlashCopy command has been processed. When FlashCopy runs, a bitmap representing the relationship between the source and target volume is created (see Figure 3) and remains until the FlashCopy is removed. The point-in-time copy created by FlashCopy is typically used where you need a copy of production data produced with minimal application downtime. It can be used for online backup, testing of new applications or copying a database for data-mining purposes. The copy looks exactly like the original source volume and is an instantly available, binary copy.
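To make the role of the bitmap more concrete, here is a toy Python sketch of a generic copy-on-write point-in-time copy. It's an assumption-laden illustration of the general technique, not the storage server's actual implementation; the class and method names are invented for the example.

```python
from typing import List, Optional


class PointInTimeCopy:
    """Toy copy-on-write snapshot; NOT the storage server's real algorithm."""

    def __init__(self, source: List[bytes]) -> None:
        self.source = source                                  # live production blocks
        self.target: List[Optional[bytes]] = [None] * len(source)
        self.copied = [False] * len(source)                   # bitmap: block on target?

    def write_source(self, block: int, data: bytes) -> None:
        """Preserve the time-zero block on the target before overwriting it."""
        if not self.copied[block]:
            self.target[block] = self.source[block]
            self.copied[block] = True
        self.source[block] = data

    def read_target(self, block: int) -> Optional[bytes]:
        """Read the target: its own block if present, else the unchanged source."""
        if self.copied[block]:
            return self.target[block]
        return self.source[block]

    def write_target(self, block: int, data: bytes) -> None:
        """Write to the target; later source writes then leave this block alone."""
        self.target[block] = data
        self.copied[block] = True


volume = [b"A", b"B", b"C"]
snap = PointInTimeCopy(volume)      # "time zero": nothing is copied yet
snap.write_source(1, b"X")          # production keeps changing
print(volume)                                    # [b'A', b'X', b'C']
print([snap.read_target(i) for i in range(3)])   # [b'A', b'B', b'C']
```

Because only the bitmap is set up at time zero, the copy is usable immediately, and data moves only when a block is about to change.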
New options facilitate the creation of FlashCopy consistency groups. With FlashCopy consistency groups, I/O activity to a volume is held off until you issue the Consistency Created task with the FlashCopy Consistency Group option. You can use this option - combined with IASPs, clustering and journaling - to achieve a zero-downtime backup window on the System i platform. There are, however, specific caveats to using this type of copy for backup. Because the IASP isn't varied off, the contents of memory aren't destaged to disk, and Copy Services, being hardware replication, copies only what's on disk. Journaling is therefore essential so that, at the very least, the journal receiver holding the last known change on disk is replicated. It's recommended that you quiesce the IASP before performing the FlashCopy function to ensure that all contents of memory are destaged to disk. If the IASP isn't quiesced, the copy is essentially crash-consistent.
Today's System i user has more choices for HA than ever before.
James McCord is a consulting educator in the System i Technology Center in Rochester, Minn. His group conducts education and performs consulting on the System i topics of IASP enablement, XSM, IBM System Storage and Copy Services for System i platform, System i performance and more.