A Recommendation for High-Availability Options in TPC Benchmarks
By Dean Brock, Data General
Abstract: This article describes what high availability (HA) means in today's Information Technology world and outlines a number of ways in which HA can be implemented. It further recommends that the TPC organization consider implementing HA benchmark tests. The article was submitted by a representative of a TPC member company; the views and opinions in it are his own and do not reflect any official TPC plan or viewpoint.
Introduction
Each day it becomes more evident that the information era is here: global economies, competition, ever-changing technologies and markets, population mobility, etc. The ability of a business to respond quickly and intelligently in this environment depends directly on the timeliness, integrity and availability of relevant information resources. Accordingly, the Information Technology industry has responded to this need with a variety of Highly Available system solutions. The cost of these solutions is driven by performance, database size, and the degree of availability required to support mission-critical applications. Thus, the selection of a High Availability (HA) solution is a significant investment and a crucial decision that can determine a business's success or failure.
Therefore, it is entirely appropriate and desirable that the TPC define both the criteria and the method for evaluating High Availability solutions. This would be a natural evolution from essentially performance-only benchmarks (such as TPC-C and TPC-D) toward something much closer to the real world of production data processing.
What Is High Availability?
High Availability (HA) means access to data and applications whenever needed and with an acceptable level of performance. HA deals with the service aspect of the "system" as an unbroken whole and as perceived by its end-users. In this context, reliability (of hardware and software components) and performance (response time/throughput, tpm, etc.) are parts of system availability. HA is the proportion of time a system is productive and is usually expressed as a percentage. In the spectrum of system availability, HA systems fall between 99.9% and 100% availability.
Availability is expressed as MTTF / (MTTF+MTTR), where:
- MTTF (mean-time-to-failure) is the average time that a system runs (without failing) after it has been set up or repaired.
- MTTR (mean-time-to-repair) is the average time needed to repair (or restore) a failed system.
In today's business environment, few enterprises can afford to be without access to mission-critical applications for more than 8 hours, nor can they tolerate more than one (1) failure per year (about 8000 hours of operation). Furthermore, few end-users would consider a system to be "available" if performance drops below some level or if the system is only available to some percentage/special subset of the user community. Considering these facts for a minimal, "entry level" HA system, the MTTF is 8000 hours and the MTTR is 8 hours, which yields an availability of 99.9%. The term "high availability" is further defined (arbitrarily) to mean that the system is operating at greater than 80% performance for more than 60% of the user community. Systems that do not meet these basic requirements cannot be considered HA.
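To make the arithmetic concrete, here is a minimal sketch in Python (an illustration only; the figures are the example values from the preceding paragraph, not TPC requirements):

    # Availability = MTTF / (MTTF + MTTR), computed here as a fraction.
    def availability(mttf_hours, mttr_hours):
        return mttf_hours / (mttf_hours + mttr_hours)

    # Example values from the text: one failure per ~8000 hours, 8 hours to repair.
    mttf, mttr = 8000.0, 8.0
    a = availability(mttf, mttr)
    annual_downtime = (1 - a) * 24 * 365   # expected hours of downtime per year

    print("Availability: %.3f%%" % (a * 100))                      # roughly 99.900%
    print("Expected downtime: %.1f hours/year" % annual_downtime)  # roughly 8.8 hours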
For the purposes of this paper, it is assumed that hardware reliability, system diagnostics, software QA, etc. are sufficient to provide this basic 99.9% level of High Availability.
99.9+% HA Solutions
The goal of HA is to minimize any disruption in service. HA systems have features and capabilities and employ strategies specifically designed to minimize expensive system downtime. Downtime can be categorized as either planned or unplanned. Planned downtime (representing about 25% of all downtime) is time that the system is unavailable due to scheduled maintenance such as software/hardware upgrades and system backups. The approach used to minimize planned downtime is to:
- Provide for backups, maintenance, upgrades while the system is up and running ("hot").
- Reduce the time to perform those tasks that can be done only when the system is down.
Unplanned downtime is time that the system is unavailable due to component faults or environmental faults. Human error and "natural" disasters (including power outages) are examples of "environmental faults." The approach used to minimize unplanned downtime is to:
- Minimize the number of faults and the effect/recovery time of faults in a system.
- Avoid a single-point-of-failure by utilizing redundant parts and rerouting (failover).
- Reduce the impact of environmental faults by using UPS and off-site data mirroring and/or replication.
- Provide for "hot" repair of failed components.
The approaches used for systems configured to meet 99.9+% HA requirements, and their incremental impact on availability, can be categorized as follows:
|
- On-Line Backup/Recovery
To minimize service disruption, backups (both system and database) should be on-line (i.e., taken on an active/hot system) or very fast. For example, this could be implemented using integrated system and database utilities and/or by mirroring disk storage. The backup/restore solution would also need to be very robust and should be able to handle single-point-of-failure situations.
The best strategy for backup would take advantage of the crests and troughs of an IT operation and have a minimal impact on throughput. On the other hand, implementation configurations and costs will vary (and therefore need to be quantified in terms of price, performance, and availability). Also, valuable transactional data, high system change rates, large database sizes and/or low backup rates may require a nearly continuous backup strategy. In any case, the maximum backup time and performance impact can be measured (e.g., the transactions-per-minute (tpm) rate with an on-line backup running during the measurement interval).
The recovery process would be subject to the MTTR requirement for system availability. Partial or incremental recovery of OS, user and database objects must be supported and conform to performance availability requirements.
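As a purely hypothetical illustration of how the performance impact of an on-line backup might be reported, the following Python sketch compares the tpm rate of a measurement interval run with and without a backup in progress (the transaction counts are invented for the example, not measured results):

    def tpm(transactions_committed, interval_minutes):
        # Throughput for a measurement interval, in transactions per minute.
        return transactions_committed / interval_minutes

    # Invented numbers, for illustration only:
    baseline_tpm = tpm(1200000, 60)   # steady state, no backup running
    backup_tpm = tpm(1020000, 60)     # same workload with an on-line backup active

    degradation = 1.0 - backup_tpm / baseline_tpm
    print("baseline %.0f tpm, with backup %.0f tpm, impact %.1f%%"
          % (baseline_tpm, backup_tpm, degradation * 100))   # impact 15.0%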
- No Single Point of Failure (NSPF)
This means that the failure of any single system component would not prevent continued access to applications and data. In this sense, the "system" is the system-under-test (SUT) for the benchmark. To meet the NSPF requirement, all components would have to be replicated and failover would need to be automatic. The failure of any one component would result in a system configuration (of surviving system components) capable of executing the entire workload at no less than 80% of the measured performance for at least 60% of the users of the initial system.
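The 80%/60% criterion could be checked mechanically; the small Python sketch below is one possible formulation (the threshold values come from the arbitrary definition given earlier):

    def survives_single_failure(degraded_tpm, baseline_tpm, users_served, total_users):
        # After any single component failure, the surviving configuration must still
        # deliver at least 80% of measured performance to at least 60% of the users.
        return (degraded_tpm >= 0.80 * baseline_tpm
                and users_served >= 0.60 * total_users)

    print(survives_single_failure(850, 1000, 700, 1000))   # True: 85% performance, 70% of users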
- Multi-Node Clusters
Two (2) or more nodes in a failover or active (shared disk) configuration. The nodes in such a cluster may or may not share peripheral devices and disk storage.
- Multi-Path (Virtual) LAN
Multi-Path LAN (MPL) I/O is the ability to switch from one LAN interface to another in the event of an interface or cable failure. The back-up LAN controller is configured as a "hot spare."
- Multi-Path Disk I/O
Multi-path disk I/O allows a single system to take advantage of multiple I/O paths to its physical disks, to ride through controller failures and continue data access. I/O requests are usually sent to the first SCSI controller. If an I/O fails for any reason (host bus adapter fault, cable disconnection, or storage processor failure), the I/O request is restarted on the second controller. Data is transferred via the second storage processor to the target disks, transparent to the application. As with the other redundant features, the goal is to minimize disruption to applications and enable the system to keep running until the next scheduled service period.
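A minimal sketch of the retry logic described above might look like the following (Python; PathFailure and the path objects are hypothetical stand-ins for what would really live in the device driver):

    class PathFailure(Exception):
        """Raised by an I/O path when a request cannot be completed."""

    def read_block(block_number, paths):
        # Try the primary controller path first; on any failure, transparently
        # restart the request on the next configured path.
        last_error = None
        for path in paths:
            try:
                return path.read(block_number)
            except PathFailure as error:
                last_error = error        # remember the fault and try the next path
        raise last_error                  # every path failed: surface the error

The application above this layer simply sees a successful (if slightly slower) I/O, which is the transparency described above.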
- Fault-Tolerant Disk Storage
Storage failures are responsible for about 25% of all system failures. Fault-tolerant storage systems eliminate virtually all downtime caused by failures in the storage subsystem. Essentially, these are disk array subsystems based on Redundant Array of Independent Disks (RAID) technology. Generally, they protect applications against the following kinds of failures:
- Failure of a single drive (when it is configured in a RAID 1, RAID 3 or RAID 5 group).
- Failure of an internal SCSI bus (if no RAID group has multiple drives on the bus).
- Failure of a power supply (if redundant supplies are configured).
- Failure of a cooling fan (if a redundant cooling fan is configured).
- Failure of a cache memory module (providing that caches are mirrored).
Ideally, every major component, including the power supplies, drives, fans, and storage control processors, is fully redundant, allowing for continuous operation through any single component failure. In the event of component failure, hot repair capabilities enable on-line maintenance. Active dual storage control processors provide dual data paths to ensure uninterrupted data access.
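To illustrate why such a group can ride through the loss of any one member, here is a toy Python sketch of XOR parity (the principle behind RAID 3/5 protection, not any vendor's implementation):

    def xor_blocks(blocks):
        # Byte-wise XOR of equal-sized blocks; the XOR of the data blocks is the parity.
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                result[i] ^= byte
        return bytes(result)

    d0, d1, d2 = b"data0blk", b"data1blk", b"data2blk"
    parity = xor_blocks([d0, d1, d2])

    # If the drive holding d1 fails, its contents can be rebuilt from the survivors:
    rebuilt = xor_blocks([d0, d2, parity])
    assert rebuilt == d1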
- Hot Repair of System Components
The objective of hot repair is to shrink or eliminate MTTR and minimize (in combination with other HA approaches such as On-Line Backup/Recovery) downtime due to common system administrative tasks such as hardware/software upgrades and additions or changes to the configuration. Generally, hot repair requires duplication or redundancy of major system components.
- Upgrade Software
This would likely require at least two (2) processing nodes (each capable of meeting performance availability requirements). OS, DBMS and application upgrades would be accomplished on one node while the other continues workload processing. Then the workload would be switched to the upgraded node, and the upgrade process repeated for the other node(s).
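One way to picture that rolling-upgrade procedure is the sketch below (Python; the Node class and its methods are hypothetical cluster-management operations, not a particular product's interface):

    class Node:
        # Hypothetical stand-in for a cluster node's management interface.
        def __init__(self, name):
            self.name = name
        def drain_workload(self):
            print(self.name, "moving its workload to the surviving node(s)")
        def install(self, version):
            print(self.name, "installing", version)
        def rejoin_cluster(self):
            print(self.name, "rejoining the cluster and resuming work")

    def rolling_upgrade(nodes, new_version):
        # Upgrade one node at a time; the other node(s) keep processing the workload,
        # so the service stays up (at reduced redundancy) throughout.
        for node in nodes:
            node.drain_workload()
            node.install(new_version)
            node.rejoin_cluster()

    rolling_upgrade([Node("node-a"), Node("node-b")], "OS/DBMS v2")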
- Replace/Add Component
Hardware failures often require the physical disconnection (for repair) and re-connection (after repair) of major components such as processing nodes. Also, it is often necessary (and desirable) to add new components (such as peripheral devices, processing nodes, etc.) to adapt to changing requirements or meet peak/unplanned demands on the system. At this level of HA, it should be possible to dynamically modify the physical and logical configuration of the system.
- Replace Disk(s)
The disk storage subsystem should also support a "hot-spare". This drive, which should be at least as big as the largest drive in any RAID group, would be substituted for a faulty drive automatically. This means that the drive would be reformatted to the size of the failed drive and data rebuilt on it, without operator or system intervention. When the faulty drive is eventually replaced, the data would be migrated to the replacement drive, and the "hot-spare" drive would then return to its standby state.
- Disaster Backup/Recovery
This is the use of hot/warm sites to store copies of critical data in order to recover systems lost in catastrophic situations. The ideal solution for this would be distributed clusters (each node remote from the other(s)) with remotely mirrored disk storage. The main issues that should be addressed by this level of HA are:
- The consistency and currency of data and applications between the primary system and the remote system;
- The speed of switching the workload from the prime to the backup system and, if necessary, making the backup system current;
- The performance impact on the primary system of maintaining the backup system in a state of currency that is consistent with a 99.999+% level of availability.
Several approaches that could be used instead of or in conjunction with distributed clusters are:
- Symmetric Replication
The database replicate (on the backup or failover system) is identical to the primary database on the production system. Any change made to the primary database is "replicated" on the backup database. Since this is not a "2-phase commit" process, there is a small window of vulnerability in which changes made to the primary system could be lost in transit. The performance impact of replicating can be significant, but switching between backup and primary systems can be very fast and fairly transparent to the end-users.
- Standby Databases
This is a special case of replication. The "standby" database starts as an identical copy of the primary database. It is resident on the backup system and is in a continuous "recovery" mode of operation. Changes (updates, inserts, deletes) to the primary database are recorded in transaction logs, which are periodically archived. Newly archived logs are then transmitted to the backup site and applied to the standby database. In a best-case scenario, the standby system is behind (in terms of data currency) the primary system by no more than the changes recorded on the current transaction log. So, depending on the rate of change, this could be a very viable HA solution with minimal impact on primary system performance. Note: perhaps the other issue that should be addressed is the time and impact of switching back to the primary system (resetting back to the original configuration after recovery from certain non-destructive disasters).
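A highly simplified sketch of that log-shipping loop, in Python, might look like the following (the directory names are invented, and apply_log stands in for the DBMS-specific step that replays an archived log against the standby database):

    import shutil
    import time
    from pathlib import Path

    ARCHIVE_DIR = Path("/primary/archived_logs")   # invented paths, for illustration
    STANDBY_DIR = Path("/standby/incoming_logs")

    def ship_and_apply(apply_log, poll_seconds=60):
        # Periodically transmit newly archived transaction logs to the backup site
        # and apply them, so the standby database stays in continuous recovery.
        shipped = set()
        while True:
            for log in sorted(ARCHIVE_DIR.glob("*.arc")):
                if log.name not in shipped:
                    shutil.copy(log, STANDBY_DIR / log.name)   # transmit to backup site
                    apply_log(STANDBY_DIR / log.name)          # replay on the standby
                    shipped.add(log.name)
            time.sleep(poll_seconds)   # standby lags by at most the current, unarchived log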
- Remote Data Mirroring
This is classical disk mirroring, except over a long distance. Depending on whether hardware or software mirroring is used, the performance impact may be minimal or significant. Switch-over to the backup site can be quick and virtually transparent to end-users. The loss of data is zero, although a "system crash" type of database recovery is still needed.
Conclusion
How useful is the fact that a system can perform a zillion tpm or execute a query in milliseconds without some knowledge of its availability to do the job, and of the cost of achieving the appropriate level of performance and availability?
The technology to address this issue is generally available and in use throughout the industry. The performance impact and costs associated with HA can be quantified and measured. The information obtained by using the widely accepted, industry standard TPC benchmarks would be of great value to businesses in their IT investment decisions.