How Resilient Is Your IBM Z System of Record?
The IBM Z Resiliency Maturity Model maps progression for mainframe clients with different levels of resiliency requirements.
By Bob Abrams and David Petersen
07/22/2019
To ensure that IS organizations deliver resilient service, IT architects must have systems that rapidly adapt to unforeseen events, including disruptions, business demands or security threats. The IBM Z* server provides that level of resiliency for systems of record, not only for core CICS* workloads, but also for new technologies like machine learning and analytics applications as well as a web infrastructure that supports high transaction rates for 24-7-365 operations.
Customers, employees, shareholders, supply chain partners and communities depend on your systems’ continuous availability (CA). While your business can recover from an hour of downtime, additional hours of interruption from poor restoration objectives could jeopardize your organization’s long-term health and relevance.
Mainframe shops typically introduce a z/OS* cluster with several z/OS images sharing a common time reference, called a z/OS Parallel Sysplex, as an initial step to achieving CA. Redundant system resources can then recover at the point of an incident and enable a rapid cross-system restart. For planned outages, existing resources can absorb the workload. A further step is to enable workloads to share databases and other logical resources in the Parallel Sysplex, which shortens recovery times and reduces or eliminates outages with little or no data loss.
CA is achieved by mitigating both planned and unplanned outages. Recovery objectives for data center component failures are often expressed in seconds or minutes, depending on the technology used and how well the installation’s system recovery actions are automated. Disaster recovery (DR) is also critical to handling events that result in a full data center outage. Restoration objectives for disasters are usually expressed in hours, including the time to switch operations over to an alternate data center using replicated data. The IBM Z platform is designed to help installations meet these recovery objectives.
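To make the distinction between component-level and disaster-level objectives concrete, here is a minimal sketch of checking a measured recovery time against a target. The event names and target values are hypothetical and chosen only for illustration; real objectives come from the business and are not given in this article.

```python
from datetime import timedelta

# Hypothetical targets for illustration only; the article gives no specific values.
OBJECTIVES = {
    "component_failure": timedelta(minutes=5),  # objectives in seconds or minutes
    "site_disaster": timedelta(hours=4),        # objectives in hours
}


def meets_objective(event_type: str, measured_recovery: timedelta) -> bool:
    """Return True if the measured recovery time is within the target for the event type."""
    return measured_recovery <= OBJECTIVES[event_type]


if __name__ == "__main__":
    print(meets_objective("component_failure", timedelta(seconds=90)))
    print(meets_objective("site_disaster", timedelta(hours=6)))
```

Tracking results like these after each recovery test is one way to see whether automation is keeping recovery times inside the stated objectives.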
The IBM Z Resiliency Maturity Model
IBM developed a Resiliency Maturity Model that maps a progression for mainframe clients with different levels of resiliency requirements. The model guides organizations through assessing current resiliency investments and determining new targets for business workloads. It also helps with determining practical steps to achieve availability goals within the data center, and suggests DR considerations between data centers to be included as part of assessments.
The Resiliency Maturity Model describes four resiliency levels that start with single server clients, and lead up to a full CA environment that is fault-tolerant within the primary data center, and exploits GDPS Continuous Availability for key workloads and DR. Horizontally, the model describes service characteristics at each level, such as automation and systems management, along with the compute requirements and data availability at that level, all within the data center. The model also offers data recovery options between data centers for DR.
The following describes the four levels of the resiliency model, which are also depicted in Figure 1, below:
The IBM Z architecture has built-in self-detection, error correction and redundancy to eliminate single points of failure, delivering the best reliability of any enterprise system in the industry, along with 99.999% availability. Transparent processor sparing and dynamic memory sparing enable concurrent maintenance and seamless scaling without downtime. Business workloads that fail can be restarted in place. If z/OS or the LPAR fails, another instance of the workload (in another LPAR) can absorb the work. If the server fails, the workload can be restarted on another server. In that case, the workload impact could be prolonged because data copies must be located and recovery performed manually.
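As a rough illustration of what an availability figure such as 99.999% implies, the sketch below converts a percentage into a yearly downtime budget. The function name and the use of an average 365.25-day year are assumptions made for this example; 99.999% works out to a little over five minutes per year.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        print(f"{pct}% -> {allowed_downtime_minutes(pct):.2f} minutes per year")
```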
By installing a second IBM z14* in the environment, component resource sharing begins the road to full system redundancy. A spare drawer with flex memory can be dynamically reconfigured to recover from a hardware failure. Service management processes are expected to be maturing and reactive. Furthermore, by setting up Metro (synchronous) Mirror, and Global (asynchronous) Mirror for longer distances, with IBM Copy Services Manager (CSM), fast data replication can be enabled with little or no data loss.
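The difference between synchronous and asynchronous mirroring can be sketched with a toy model. This is not the Copy Services Manager interface or any IBM API; the class and method names are hypothetical, and the model only illustrates why a synchronous copy loses no acknowledged writes while an asynchronous copy can lose whatever has not yet been shipped.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicatedVolume:
    """Toy model of a primary volume mirrored to a secondary volume."""
    synchronous: bool
    primary: list = field(default_factory=list)
    secondary: list = field(default_factory=list)
    pending: list = field(default_factory=list)  # writes queued for shipping (async only)

    def write(self, record: str) -> None:
        self.primary.append(record)
        if self.synchronous:
            # Synchronous: the write completes only once the secondary also has it.
            self.secondary.append(record)
        else:
            # Asynchronous: the write completes at once and is shipped later.
            self.pending.append(record)

    def drain(self) -> None:
        """Ship all queued writes to the secondary (asynchronous catch-up)."""
        self.secondary.extend(self.pending)
        self.pending.clear()

    def records_lost_on_primary_failure(self) -> int:
        """Count writes that exist on the primary but not on the secondary."""
        return len(self.primary) - len(self.secondary)


if __name__ == "__main__":
    sync_vol = ReplicatedVolume(synchronous=True)
    async_vol = ReplicatedVolume(synchronous=False)
    for rec in ("a", "b", "c"):
        sync_vol.write(rec)
        async_vol.write(rec)
    print(sync_vol.records_lost_on_primary_failure())
    print(async_vol.records_lost_on_primary_failure())
```

The trade-off in the model mirrors the one in the text: the synchronous path ties every write to the secondary, which is why it suits shorter distances, while the asynchronous path tolerates distance at the cost of a small window of potential data loss.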
3. Fault Tolerant: Mitigating Planned and Unplanned Outages
At this level, Parallel Sysplex is introduced with two or more servers, multiple LPARs and associated DASD. z/OS drives dynamic workload routing to application regions on any system in the sysplex. Databases are shared across members of the sysplex, avoiding unnecessary database replicas. By introducing GDPS, the system can take advantage of Global Mirror or Metro Mirror data replication, with GDPS handling all recovery automation. In failover situations, business impact can be reduced to minutes with improved automation. This also enables DR by replicating data to another site, with GDPS providing the failover capability. Metro Mirror and Global Mirror solutions can be combined into three- and four-site solutions. Auto-detection and auto-restart features allow the system to recover in 20-30 minutes.
4. Fault Tolerant Within the Primary Data Center; GDPS Continuous Availability for Key Workloads; DR
GDPS Continuous Availability consists of two sites separated by virtually unlimited distances, running the same applications with the same data sources, to provide cross-site workload balancing, CA and DR. This enables workloads to fail over to another sysplex for planned or unplanned workload outages within seconds.
The Resiliency Maturity Model enables you to determine your systems’ recovery needs, invest toward those targets, and periodically re-evaluate organization plans, competition and stakeholders’ expectations in light of changing needs.
Bob Abrams is a senior technical staff member at IBM, responsible for IBM Z and z/OS resiliency.
David Petersen is a Distinguished Engineer at IBM, responsible for the overall IBM Z continuous availability, disaster recovery and business continuity strategy.