Origins of the TPC and the first 10 years
by Kim Shanley, Transaction Processing Performance Council
February, 1998
Preface
In my view, the TPC's history can be best understood by focusing on two of its major organizational activities: 1) creating good benchmarks; 2) creating a good process for reviewing and monitoring those benchmarks. Good benchmarks are like good laws. They lay the foundation for civilized (fair) competition. But if we have good benchmarks, why do we need all the overhead of a process for reviewing and monitoring the benchmark results? Similarly, you might ask, if we have good laws, why do we need police, lawyers, and judges? The answer to both questions is the same. Laws and benchmarks are not, in and of themselves, enough. And by this I don't mean to imply that it's simply human nature to break or bend the rules. The TPC has found that no matter how clear-cut the rules appear to be when the benchmark specifications are written, there are always gray areas, and yes, loopholes left in the benchmark law. There must be a way of addressing and resolving these gray areas and loopholes in a fair manner. And yes, even "good laws," said Aristotle, "if they are not obeyed, do not constitute good government." Therefore, there must be a means of stopping those who would break or bend the rules. While this book is primarily a technical overview of the industry's benchmarks, the TPC's history is about both benchmark law and benchmark order.
The State of Nature
In writing this early history of the TPC, I've drawn heavily upon the account by Omri Serlin published in the second edition of this handbook. It was through Omri's initiative and leadership that the TPC was founded.
In the early 1980's, the industry began a race that has accelerated over time: the automation of daily end-user business transactions. The first application that received widespread focus was automated teller machine (ATM) transactions, but we've seen this automation trend ripple through almost every area of business, from grocery stores to gas stations. As opposed to the batch-computing model that dominated the industry in the 1960's and 1970's, this new online model of computing had relatively unsophisticated clerks and consumers directly conducting simple update transactions against an on-line database system. Thus, the on-line transaction processing (OLTP) industry was born, an industry that now represents billions of dollars in annual sales.
Given the stakes--even at this point in the race--over who could claim the best OLTP system, the competition among computer vendors was intense. But how to prove who was the best? The answer, of course, was a test--a benchmark. Beginning in the mid-1980's, computer system and database vendors began to make performance claims based upon the TP1 benchmark, a benchmark originally developed within IBM that then found its way into the public domain. This benchmark purported to measure the performance of a system handling ATM transactions in a batch mode, without the network or user interaction (think-time) components of the system workload (similar in design to what later turned out to be TPC-B). The TP1 benchmark had two major flaws. First, by ignoring the network and user interaction components of an OLTP workload, the system under test (SUT) could generate inflated performance numbers. Second, the benchmark was poorly defined and there was no supervision or control of the benchmark process. As a result, the TP1 marketing claims, not surprisingly, had little credibility with the press, market researchers (among them Omri Serlin), or users. The situation also deeply frustrated vendors, who felt that their competitors' marketing claims, based upon flawed benchmark implementations, were ruining every vendor's credibility.
Early Attempts at Civilized Competition
In the April 1, 1985 issue of Datamation, Jim Gray, in collaboration with 24 others from academia and industry, published (anonymously) an article titled "A Measure of Transaction Processing Power." This article outlined a test for on-line transaction processing which was given the title of "DebitCredit." Unlike the TP1 benchmark, Gray's DebitCredit benchmark specified a true system-level benchmark in which the network and user interaction components of the workload were included. In addition, it outlined several other key features of the benchmarking process that were later incorporated into the TPC process:
- Total system cost published with the performance rating. Total system cost included all hardware and software used to successfully run the benchmark, including five years of maintenance costs. Until this concept became law in the TPC process, vendors often quoted only part of the overall system cost that generated a given performance rating.
- The test was specified in terms of high-level functional requirements rather than any particular hardware or software platform or code-level requirements. This allowed any company to run the benchmark if it could meet those functional requirements.
- The benchmark workload scaled up with the power of the system: the number of users and the size of the database tables increased proportionally as the system produced higher transaction rates. This scaling prevented the workload from being overwhelmed by the rapidly increasing power of OLTP systems.
- The overall transaction rate would be constrained by a response time requirement. In DebitCredit, 95 percent of all transactions had to be completed in less than 1 second (a sketch of these scaling and response-time rules appears after this list).
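To make these two rules concrete, here is a minimal sketch, in Python, of how proportional scaling and a percentile response-time check might be expressed. The 95-percent-under-one-second constraint comes from the DebitCredit description above; the per-tps scaling coefficients and the function names are illustrative assumptions of mine, not figures from any specification.

# A minimal sketch of DebitCredit-style scaling and response-time rules.
# The per-tps coefficients below are illustrative placeholders, not official
# values from any benchmark specification.

def scaled_configuration(target_tps):
    """Scale the emulated users and database table sizes with the claimed tps."""
    return {
        "emulated_users": int(target_tps * 10),      # assumed: 10 users per tps
        "account_rows": int(target_tps * 100000),    # assumed: 100,000 accounts per tps
        "teller_rows": int(target_tps * 10),         # assumed: 10 tellers per tps
        "branch_rows": int(target_tps * 1),          # assumed: 1 branch per tps
    }

def meets_response_time_rule(response_times_sec, percentile=0.95, limit_sec=1.0):
    """Return True if at least `percentile` of the transactions finished under `limit_sec`."""
    if not response_times_sec:
        return False
    within_limit = sum(1 for t in response_times_sec if t < limit_sec)
    return within_limit / len(response_times_sec) >= percentile

# Example: a 100 tps claim requires a proportionally larger configuration,
# and the claimed rate only counts if the response-time rule is satisfied.
config = scaled_configuration(100)
ok = meets_response_time_rule([0.3, 0.5, 0.7, 0.8] * 24 + [1.2, 1.4, 1.6, 1.8])  # 96% under 1 s -> True

The point of the sketch is simply that both the database population and the user population are functions of the reported transaction rate, so a vendor cannot raise the rate without also growing the workload behind it.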
The TPC Lays Down the Law
While Gray's DebitCredit ideas were widely praised by industry opinion makers, the DebitCredit benchmark had about the same success in curbing bad benchmarking as Prohibition did in stopping excessive drinking. In fact, according to industry analysts like Omri Serlin, the situation only got worse. Without a standards body to supervise the testing and publishing, vendors began to publish extraordinary marketing claims based on both TP1 and DebitCredit. They often deleted key requirements from DebitCredit to improve their performance results.
From 1985 through 1988, vendors used TP1 and DebitCredit--or their own interpretations of these benchmarks--to muddy the already murky performance waters. Omri Serlin had had enough. He spearheaded a campaign to see if this mess could be straightened out. On August 10, 1988, eight companies that Serlin had convinced to join the effort formed the Transaction Processing Performance Council (TPC).
TPC-A
Using the model and the consensus that had already developed around the DebitCredit benchmark, the TPC published its first benchmark, TPC Benchmark A (TPC-A), within one year (November 1989). TPC-A differed from DebitCredit in the following respects:
- The requirement that 95 percent of all transactions complete in less than 1 second was relaxed to require that 90 percent of transactions complete in less than 2 seconds.
- The number of emulated terminals interacting with the SUT was reduced to a requirement of 10 terminals per tps, and the cost of the terminals was included in the system price.
- TPC-A could be run in a local or wide-area network configuration (DebitCredit had specified only WANs).
- The production-oriented requirements of the benchmark were strengthened to prevent the reporting of peak, unsustainable performance ratings. Specifically, the ACID requirements (atomicity, consistency, isolation, and durability) were bolstered and specific tests added to ensure ACID viability.
Finally, TPC-A specified that all benchmark testing data should be publicly disclosed in a Full Disclosure Report.
The first TPC-A results were announced in July 1990. Four years later, at the peak of the benchmark's popularity, 33 companies were publishing TPC results and TPC-A results had been published on 115 different systems. In total, about 300 TPC-A benchmark results were published.
The first TPC-A result was 33 tpsA at a cost of $25,500 per transaction per second, or tpsA. The highest TPC-A result ever recorded was 3,692 tpsA at a cost of $4,873 per tpsA. In summary, the highest tpsA rating had increased by a whopping factor of 111, and the price/performance had improved by a factor of five. Does this increase in the top tpsA ratings correspond to an identical increase in the real-world performance of OLTP systems during this period? Even keeping in mind that this is a comparison of peak, not average, benchmark ratings, the answer is no. The increase in tpsA ratings is just too great. The increase can be attributed to four major factors: 1) the first benchmark test is usually run for bragging rights and is grossly unoptimized compared to later results; 2) real performance increases in hardware and software products; 3) vendors improving their products to eliminate performance bugs exposed by the benchmark; 4) vendors playing the benchmarking game effectively--learning from each other how best to run the benchmark. So yes, there is a gamesmanship aspect to the TPC benchmark competition, but it should not obscure the fact that TPC benchmarks have provided an objective measure of a truly vast increase in the computing power of hardware and software during this period. Indeed, the benchmarks have accelerated some of these software improvements. Nor should the marketing gamesmanship invalidate the legacy achievement of TPC-A: for the first time, the industry had an objective and standard means of comparing the performance of a vast number of systems.
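For readers who want to reproduce the arithmetic behind these improvement factors, here is a minimal check in Python using only the figures quoted above (the variable names are mine):

# Improvement factors computed from the first and best TPC-A results quoted above.
first_tpsA, best_tpsA = 33, 3692                  # throughput ratings
first_price, best_price = 25500, 4873             # dollars per tpsA

throughput_gain = best_tpsA / first_tpsA          # ~111.9, the "factor of 111"
price_perf_gain = first_price / best_price        # ~5.2, the "factor of five"

The same two-line calculation applied to the TPC-B figures in the next section (102.94 to 2,025 tpsB, and $4,167 to $254 per tpsB) yields factors of roughly 19.7 and 16.4, consistent with the factors of 19 and 16 reported there.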
TPC-B
While TPC-A leveraged the industry consensus built up around DebitCredit, the TPC was much more ambivalent about publishing a benchmark based on the TP1 model. The ambivalence about what was to become the TPC's second benchmark, TPC-B, lasted throughout the life of TPC-B. Everyone within the TPC organization believed that one of TPC-A's principal strengths was that it was an end-to-end system benchmark that exercised all aspects of an OLTP system. Furthermore, this OLTP system--users working on terminals, conducting simple transactions over a LAN connected to a database server--was a model of computing that everyone could intuitively understand.
As described earlier, TP1 (and later TPC-B) was the batch version of DebitCredit, without the network and user interaction (terminals) figured into the workload. A strong block of companies within the TPC, including hardware companies who sold "servers" (as opposed to end-to-end system solutions) and database software companies, felt that the TPC-B model was more representative of the customer environments they sold into. The anti-TPC-B crowd, on the other hand, argued that the partial-system model that TPC-B represented would reduce the stress on key system resources and would therefore produce artificially high transaction rates. In addition, while the tps rates would be artificially high, the total system cost would be artificially low, since the network and terminal pricing would be eliminated, thereby artificially boosting TPC-B's price/performance ratings. Finally, the anti-TPC-B spokespeople argued that since TPC-A and TPC-B were so much alike, using an identical "tps" throughput rating, users and the press would be confused. Whether the TPC-B proponents won the argument can be debated, but what cannot be debated is that they eventually won the day: in August 1990, TPC-B was published as the second TPC benchmark standard. TPC-B used the same transaction type as TPC-A (a banking transaction) but cut out the network and user interaction components of the TPC-A workload. What was left was a batch transaction processing benchmark.
The first TPC-B results were published in mid-1991, and by June 1994, at the peak of its popularity, TPC-B results had been published on 73 systems. In total, about 130 TPC-B tests were published--roughly 2.5 times fewer than TPC-A. The first TPC-B result was 102.94 tpsB at a cost of $4,167 per tpsB. The highest TPC-B rating was 2,025 tpsB, and the best price/performance number was $254 per tpsB. In summary, the top TPC-B ratings increased by a factor of 19 and the price/performance rating improved by a factor of 16.
It's more difficult to give the legacy of TPC-B an unconditional stamp of success. It never received the user or market analyst acceptance that TPC-A did, but many within the Council and the industry perceived a real value in this "database server" benchmark model. This belief fueled a later failed TPC effort around TPC-S (more on this later), and continues to influence the development of TPC benchmarks today. In January 1998, the TPC announced the formation of a Web Commerce benchmark (TPC-W), which will measure the OLTP and browsing performance of the web server only, excluding the network and human interaction components of the overall system. So, TPC-B proponents may not have won the public debate on the merits of TPC-A versus TPC-B, but they may take some measure of satisfaction that the back-end model of benchmarking lives on.
Political Reform Begins Immediately
The TPC was a major improvement over the state of nature that existed previously. However, as the Council was to learn, most of the work of building a successful benchmarking organization and process lay ahead of it. In this sense, the early TPC was not unlike the early American colonists who idealistically believed that they could eliminate the endless political and legal conflict of the Old World by passing laws abolishing lawyers. While such a law still appeals to a significant minority, in our more judicious moments most would agree that it's not possible.
As soon as vendors began to publish TPC results, complaints from rival vendors began to surface. Every TPC result had to be accompanied by a Full Disclosure Report (FDR). But what happened when people reviewed the FDR and didn't like what they read? How could a protest be registered, and how would it be adjudicated? Even if a member of the public or a vendor representative were, so to speak, to make a citizen's arrest of a benchmark violator, there was no police force or court system to turn the perpetrator over to for further investigation or, if need be, prosecution. It became apparent to the Council that without an active process for reviewing and challenging benchmark compliance, there was no way that the TPC could guarantee the level playing field it had promised the industry.
Throughout 1990 and 1991, the TPC embarked on a political journey to fix this hole in its process. The Technical Advisory Board (TAB), which was originally constituted as just an advisory board, became the arm of the TPC where the public or member companies could challenge published TPC benchmark results. The TAB process, which remains in place today, established a fair, deliberative mechanism for reviewing benchmark compliance challenges. Once the TAB has thoroughly researched and reviewed a challenge, it makes a recommendation to the full Council. The full Council then hears the TAB's report, discusses and debates the challenge, and then votes on it. If the Council finds the result non-compliant in a significant or major way, the result is immediately removed as an official TPC result.
Benchmarking Versus Benchmarketing
By the spring of 1991, the TPC was clearly a success. Dozens of companies were publishing multiple TPC-A and TPC-B results. Not surprisingly, these companies wanted to capitalize on the TPC's cachet and leverage the investment they had made in TPC benchmarking. Several companies launched aggressive advertising and public relations campaigns based around their TPC results. In many ways, this was exactly why the TPC was created: to provide objective measures of performance. What was wrong, therefore, with companies wanting to brag about their good results? What was wrong is that there was often a large gap between the objective benchmark results and the benchmark marketing claims--a gap that, over the years, has been dubbed "benchmarketing."
So the TPC was faced with an ironic situation. It had poured an enormous amount of time and energy into creating good benchmarks and even a good benchmark review process. However, the TPC had no means to control how those results were used once they were approved. The resulting problems generated intense debates within the TPC.
Out of these Council debates emerged the TPC's Fair Use policies adopted in June, 1991.
- When TPC results are used in publicity, the use is expected to adhere to basic standards of fidelity, candor, and due diligence, the qualities that together add up to, and define, Fair Use of TPC Results.
- Fidelity: Adherence to facts; accuracy
- Candor: Above-boardness; needful completeness
- Due Diligence: Care for integrity of TPC results
Have the TPC's Fair Use policies worked? By and large, they have been effective in stopping blatant misuse or misappropriation of the TPC's trademark and good name. In other words, very few companies claim TPC results when, in fact, they don't have them. In general, TPC member companies have done a fair job of policing themselves to stop or correct fair use violations that have occurred. At times, the TPC has acted strongly, issuing cease and retraction orders, or levying fines for major violations.
It must be said, however, that there remains today among the press, market researchers, and users, a sense that the TPC hasn't gone far enough in stamping out benchmarketing. This issue has two sides. On the one hand, companies spend hundreds of thousands of dollars, even millions of dollars, running TPC benchmarks to demonstrate objective performance results. It's quite legitimate, therefore, for these companies to market the results of these tests and compare them with the results of their competitors. On the other hand, no company has a right to misrepresent or mislead the public, regardless of how legitimate the benchmark tests may be. So where does the "war" on benchmarketing stand today? Much like the war on crime, the war on benchmarketing persists, and the TPC continues to wage an active campaign to eliminate it.
Codifying the Spirit of the Law
With the creation of a good review and fair use process, and with dozens of companies publishing regularly on the TPC-A and TPC-B benchmarks, the TPC may be forgiven for lapsing into a self-satisfied belief that the road ahead was smooth. That sense of well-being was torpedoed in April, 1993 when the Standish Group, a Massachusetts-based consulting firm, charged that Oracle had added a special option (discrete transactions) to its database software, with the sole purpose of inflating Oracle's TPC-A results. The Standish Group claimed that Oracle had "violated the spirit of the TPC" because the discrete transaction option was something a typical customer wouldn't use and was, therefore, a benchmark special. Oracle vehemently rejected the accusation, stating, with some justification, that they had followed the letter of the law in the benchmark specifications. Oracle argued that since benchmark specials, much less the spirit of the TPC, were not addressed in the TPC benchmark specifications, it was unfair to accuse them of violating anything.
The benchmarking process, which sprang from the discredited TP1 and DebitCredit days, has always been treated with a fair degree of skepticism by the press. So the Standish Group's charges against Oracle and the TPC attracted broad press coverage. Headlines like the one in the May 17, 1993 issue of Network World were not uncommon: "Report Finds Oracle TPC results to be misleading; says option discredits TPC-A as benchmark."
Whether Oracle's discrete transaction option was truly a benchmark special was never formally discussed or decided by the TPC. The historical relevance of this incident is that it spurred the TPC into instituting several major changes to its benchmark review process.
New Anti-Benchmark Special Prohibition
TPC benchmark rules had always required companies to run the benchmark tests on commercially available software. However, after the Standish Group charges, the Council realized that it had no real protection from companies that purposely designed a benchmark special component into their commercially available software. In other words, this special component could be buried in some obscure corner of the overall product code and only be used when the vendor wanted to run a TPC test. If the TPC was formed to create fair, relevant measures of performance, then yes, the benchmark special was a violation of the TPC's spirit and thus had to be prohibited.
In September, 1993, the Council drew a line in the sand by passing Clause 0.2, a sweeping prohibition against benchmark specials that has become part of the bedrock of the TPC process to ensure fair, relevant benchmarks:
- Specifically prohibited are benchmark systems, products, technologies or pricing...whose primary purpose is performance optimization of TPC benchmark results without any corresponding applicability to real-world applications and environments. In other words, all "benchmark special" implementations that improve benchmark results but not real-world performance or pricing, are prohibited.
Clause 0.2 in TPC-A and TPC-B went into effect in June, 1994. Oracle decided not to test its discrete transaction option against the new anti-benchmark special rules in the specifications and withdrew all of its results by October, 1994. Let it also be noted that Oracle remains a TPC member and strong supporter of the organization.
New TPC Auditing Process
As a result of the 1993 controversies, the TPC realized that the millions of dollars being invested in running TPC benchmarks would be completely wasted if the credibility of the results were challenged. The TPC's process of FDR review was fine, but it was invoked only after a result had been published and publicized. Yes, the TPC could yank a result from the official results list after it was found to be non-compliant, and even fine a company for violating the specifications, but the damage to the company's competitors and to the TPC's credibility would already have been done. In summary, it wasn't enough to catch the bad horse after it had left the barn. The goal was to stop the bad horse from ever getting out of the barn.
The result of these discussions, passed in September and December, 1993, was the creation of a group of TPC certified auditors who would review and approve every TPC benchmark test and result before it was even submitted to the TPC as an official benchmark or publicized. While TPC benchmarks are still reviewed and challenged on a regular basis, the TPC auditing system has been very effective in preventing most of the bad horses from ever leaving the barn.
New and Better Benchmarks
From the outset, I have said that the TPC's history is about both benchmark law and benchmark order. From the last sections of this chapter, the reader might have received the false impression that the TPC is exclusively a political organization endlessly embroiled in public controversies and institutional reform. The "benchmark order" activity of the TPC is certainly important, but the TPC's day-to-day focus is to build better benchmarks.
TPC-A was a major accomplishment in bringing order out of chaos, but TPC-A was primarily a codification of the simplistic TP1 and DebitCredit workloads created in the mid-1980's. However, what was very clear even as the TPC members approved TPC-A in late 1989 was that better, more robust and realistic workloads would be required for the 1990's.
Two benchmark development efforts were launched in 1990: TPC-C, the next-generation OLTP benchmark, and TPC-D, a decision support benchmark.
Both TPC-C, which was approved as a new benchmark in July, 1992, and TPC-D, which was approved in April, 1994, are covered in other chapters of this book, and therefore I'll only add a few comments on them here.
The first TPC-C result, published in September, 1992, was a 54 tpmC result with a cost per tpmC of $188,562. As of this date (January 1998), more than five years later, the top result is 52,871 tpmC with a cost per tpmC of $135. We have witnessed the same tremendous improvement in the top TPC-C numbers as we did for TPC-A, and for the same reasons: 1) real-world performance and cost improvements and 2) an increased knowledge of how to run the benchmark. (Again, keep in mind that by looking only at peak numbers, and not averages, we're seeing an exaggerated inflationary effect.) Currently, there are 143 official TPC-C results, a higher total than TPC-A had at the peak of its popularity.
The first TPC-D result was a 100 GB result in December, 1995, with a throughput rating of 84 QthD and a price/performance rating of $52,170 per QphD. Today, the top 100 GB throughput result is 1,205 QthD at $1,877 per QphD. Currently, there are 28 official TPC-D results. Why so few? TPC-D is only 2.5 years old, compared to TPC-C's venerable 6 years, and TPC-D is more expensive and complex to run.
Both TPC-C and TPC-D have gained widespread acceptance as the industry's premier benchmarks in their respective fields (OLTP and decision support). But the increase in the power of computing systems is relentless, and benchmark workloads must continually be enhanced to keep them relevant to real-world performance. Currently, a major new revision of TPC-C is being planned for release in early 1999. A major new revision of TPC-D is being planned for mid-1998 and another one for 1999.