TPC-D Detailed Description

[Note: The following is part of an interview by Kim Shanley, TPC's Chief Operating Officer, with Jack Stephens, Informix, and Francois Raab, Information Paradigm, that appeared in the April 15, 1995 TPC Quarterly Report.]

Overview of TPC-D Design

Shanley: Let's talk a little bit more detail about the benchmark itself. Obviously this technical development effort has been going on for over four years. That connotes that you ran into some major technical challenges in developing it. Could you outline some of those challenges and how you resolved them?

Raab: We started with having to define a consistent understanding of a decision support model. What does it mean to do decision support? Is it ad hoc, is it canned, is it large, is it small, is it often, is it not often? Is it live data, is it archived data? We had to define everything that goes into it. Then came some more specific issues. One had to do with whether skewed data were going to be part of the benchmark, and it was decided that they would not be, mainly because of the difficulty of implementing skew inside the benchmark. Another question that had to be resolved was that of random queries, or rather controlled spontaneity of the queries, and the difficulty of achieving it.

Stephens: We also had to deal with comparability between results. The key in developing the benchmark was that there were three or four spectrums for which we had to either pick discrete points or set bounds. One was prepared versus ad hoc; another was how random a world we could model. After all, you can't have reproducibility and utter randomness. Next was the question of comparing a one gigabyte system to a one terabyte system: is that reasonable? Maybe yes, maybe no. Can we allow arbitrary data volumes, like the OLTP benchmarks, or do we need to pick discrete scaling points? What do we need to do about timing constraints? What level of knowledge of the data can we allow when you're creating indices or laying out the data set? How do we get a consistent data set? What kind of tools do we need to build to make sure that people can run the benchmark in an easy and reproducible fashion? It's a pretty broad spectrum of stuff.

Shanley: When people see a benchmark they're not familiar with, they ask, "Why doesn't it do this? Why did you leave that out?" The preamble of the benchmark says that the queries and the data populating the database have been chosen to have broad industry-wide relevance while maintaining a sufficient degree of ease of implementation, which is, I think, one fundamental key in benchmarking. Could you elaborate on this trade-off, and perhaps on a couple of other major ones you faced in designing the benchmark?

Stephens: The tradeoff, of course, is that you can't measure everything. We had to decide what pieces we would approach and what pieces we wouldn't. And one of the ones that Francois mentioned a minute ago, that we consciously left out even though it is an interesting component of the world, is skewed or non-uniform data distribution. It's very likely that if you look at all fifty states, the majority of antifreeze is bought in the northern half of the country and not the southern half of the country. And it would be unreasonable in the "real world" to assume that if you got 50 units of antifreeze sold a year, then one unit is bought in every state in the union. It's pretty unlikely. But in fact, that's precisely the sort of tradeoff that we needed to make because we needed to find a way to take a very complex area like Decision Support and produce something that we could scale down to a reasonable set of experiments and a reasonable set of representative data for those experiments. What that led us to is the benchmark we have where we do in fact have uniform distribution. And while that's a shortfall, it makes it possible to get a benchmark out that has some real value and some real merit and a means to produce reliable and reproducible results. It also means that there's work to go on in the future as we evolve the benchmark and get a better understanding of it to add some complexity back in. But the challenge was to get the right amount of complexity in early while still assuring that what we put out came out as quickly as we could make it.

Raab: I think it's important to say that we have done something similar to what we did with TPC-C. We're not saying that TPC-C is the way to build an order entry system; rather, we're saying TPC-C is representative of the type of work that an order entry system puts on a system. It's the same here. We are not saying this is the best way to build a decision support system. We are saying this benchmark executes work that is extremely similar to, and very representative of, the type of work that is executed by a typical decision support system.

Shanley: Right. After all, this is a benchmark intended for comparison purposes, not an application to be modeled by a particular customer.

Raab: That's correct. The key here is whether the benchmark exercises the same components of the system as would your typical decision support system. And if it does, then we've succeeded, because it will show the strengths and the weaknesses of such a system in any type of decision support environment, however the customer chooses to implement it for their own specific needs.

Stephens: Our benchmark is representative of a large part of the decision support market, but not all of it. And as our understanding of what people do in decision support matures and as the products mature, it will doubtless turn out that there are other parts of the spectrum that merit their own benchmark and that we will get refinements on TPC-D that focus on other parts of that spectrum. And that's as it should be. But for the first entry, what we've tried to do is get as broad a coverage as possible.

Shanley: So do you address the issue of the live database versus the separate database specifically for decision support processing?

Raab: This is not a production system against which you run decision support queries. This is a database extracted from the production system, just as the benchmark's preamble shows: you have the OLTP system on one side, with its own database, and next to it you have the decision support system that was extracted from the OLTP system. The point of discussion that we often run into when trying to define the decision support model portrayed by the benchmark is whether the database, once extracted, should be updated or not. There are basically two ways of doing business in the decision support world. You can build a totally new database, run queries against it until you've run all the queries you can think of or until you've exhausted all your query needs, and then build a new database and do it again. The other model is that you build a database once and then, as you move forward in time, you refresh that database: you update it with new data and you get rid of stale data. It is the latter model that we have followed in the benchmark. But in both models the queries are executed against an extraction of the data from the business operations database.

Shanley: I guess the trade-off is that you can refresh a database much faster than you can build a new one, but the downside is that the execution of the queries is going to slow down dramatically while you're doing the refreshing.

Stephens: And that's another one of those decisions that the market is still in the process of making. Some people will use a "two mode" approach: run queries, refresh data, run queries, etc. Also, that refresh may be partial or it may be complete. Others will adopt a "unified" approach: concurrently querying against the data and incrementally moving the data forward in time. That's a business decision and a customer decision that in fact we didn't make in the benchmark. You can run this either way and still produce valid results and valid information about the systems you're testing.

Raab: I think it's important to always keep in mind that there are decision support applications where you do not want to run on data that changes, data that gets updated. There are applications in the decision support world where it is crucial that you run your queries against stable data, where you want to look at, say, the data of the last five years. That data is not changing, because you're looking at the past. Those applications exist and those applications do not need to be refreshed.

Shanley: Could you elaborate on the tradeoff with taking purely a historical decision support approach and one requiring refreshed data?

Stephens: There are two approaches. One says, "I'm concerned with what has already completed, so I'm looking at a specific point in time." And the other is the question of, "I want some level of currency." And then there's this whole spectrum of how current is current. Is it data as of last year, last quarter, last month, yesterday, 20 minutes ago, or is it one transaction at a time? And that entire spectrum is valid, but it has trade-offs and appropriateness for the kind of use you have for the answers.

Raab: Kim, to close on that point, we kept an update component in the benchmark because we felt that it was a very important part of decision support altogether, and that ignoring the update component would not have given us as complete a picture. But we did not over-emphasize that component.

Shanley: I guess one follow-up question on that is, you say you give people the option of running the benchmark while the updates are happening concurrently with the query execution, or not. Will those be two separate metrics or can you really compare a system that's doing the queries at the same time as the update and a system that's just doing the queries?

TPC-D Implementation Details

Stephens: The short answer is that the details of how the updates were done and whether they were co-resident with the queries or not is in the full disclosure report and that yes, we do think the results are comparable.

Shanley: How easy will it be to cheat on this benchmark?

Stephens: What we've done is to say that you can only implement this thing in ANSI 92 SQL or something very close to it. And we've also set in place a set of very specific rules and evaluations and a process for pre-approving the exact implementation, the exact phrasing of the queries that you're going to run. Since you have to write it in ANSI 92 SQL and you have to run it against a known data set and a known schema and so forth, it's much more likely that the results that you see out of a benchmark are comparable to those you would see in the real world because the phrasings and the methods will be similar.

Shanley: Can you give me some concept of how long it would take a company from scratch to produce a decent D result?

Stephens: I'll let you know in a couple of months. D comes with a soft appendix that includes the data generation utility and the query generation utility pre-written as portable code, the query phrasings, and answer sets ready for comparison. So we've tried to give people a pretty solid leg up. And I think also that because the implementations are a little more constrained, there will be less exploration in that area than there might be in some other benchmarks. But it's a new work load, it stretches new parts of the product set, both hardware and software, so it's not going to be a trivial effort, by any means.

Raab: All benchmarks take time, whether we're dealing with standard benchmarks or customer benchmarks. And the better the number that you're shooting for, the longer it takes. I think standard benchmarks in the end save time for all parties, because by having them and being able to use them in various situations, we avoid a lot of reinventing of the wheel and developing new benchmark paradigms. And we distill it down to the bare capabilities of the system to run in a given environment.

Shanley: Just to press you a little bit further on this, we're not talking about one week to implement this benchmark from scratch, are we?

Raab: Jack, wouldn't you say that given all the pieces that are supplied, you could in two to three weeks have a running implementation of the benchmark?

Stephens: Yes, I think that's reasonable. I think you could get a first set of raw numbers in three weeks.

Shanley: Can any system run TPC-D, and what type of system will really produce competitive numbers? Can a PC run this benchmark?

Stephens: A PC that supports ANSI SQL and ACID properties and can access at least a gigabyte of data could certainly run the benchmark. In fact, I know people who have done it. Now, whether or not they'll be competitive, that's hard to anticipate. At this moment it's possible that, at least from a price standpoint, they would begin to be competitive. But one of the real differences between OLTP and decision support is that OLTP at some level is volume-insensitive, because you're going out to update a particular account or to access a particular customer record. And while you have to be able to get to all of the data, you're looking at a minute fraction of it at any given point in time or within any given query. In decision support, you not only have to be able to get to the data, you have to be able to get to all of it, now, to answer a given question. When you combine that with the fact that the volumes can get very large, the PC may not be appropriate for very large data sets any time soon, just because of the stress on the system. If you have to actually traverse six million rows in the line item table to answer query 1, that may represent a fairly substantial barrier to lower-end systems.

Raab: We do have a very wide spectrum from one gigabyte to one terabyte. This is a very large spectrum of system sizes and power. So we should expect to see a very wide range of systems executing that benchmark.
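
[Note: To put this spectrum in perspective, here is a rough, back-of-the-envelope sizing sketch in Python. The six-million-row figure for the line item table comes from Jack's example above; the scale points and the assumption that the table grows linearly with database size are illustrative, and the exact cardinalities and permitted database sizes are defined in the specification.]

    # Back-of-the-envelope sizing across the TPC-D scale spectrum.
    # Assumption (illustrative): the line item table holds roughly six million
    # rows per gigabyte of database size and grows linearly with scale; the
    # exact cardinalities are defined in the specification.
    ROWS_PER_GB = 6_000_000

    for scale_gb in (1, 10, 100, 1000):        # 1 GB up to 1 TB
        rows = scale_gb * ROWS_PER_GB
        print(f"~{scale_gb:4d} GB database: ~{rows:,} line item rows to touch "
              f"by a query that scans the whole table")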

Shanley: Can you run D in a client/server mode or is this strictly a host-based benchmark?

Stephens: The difference is, I think, that in an OLTP world, if you had two systems, a front end and a back end, you would segment the work load such that the front end did one stage of the transaction and the back end did another stage of the transaction. So you'd break it up in time. The focus of decision support work isn't on staging the work and then keeping the pipeline going, but on taking this massive wall of data and chopping it up into manageable pieces.

Shanley: Which leads me into the kind of systems or architectures that you think will do well on TPC-D. It sounds like we're leading into SMP type of systems that are going to do very well on this.

Stephens: A system that can segment and parallelize the data will do really well. And whether that's SMP or a uniprocessor or a distributed cluster or something, who knows.

Raab: Again, we're addressing a very broad range, from 1 gigabyte to 1 terabyte, so there's room for many system architectures to do well in their own category.

Stephens: Another factor is response times. A lot of people will say that decision support at a gigabyte makes no sense, that it's too small. I would disagree with that, because I think there's a very valid section of the market that says, "I don't have to look at a lot of data, but I have to get the answers very rapidly." So it's perfectly reasonable to say, "Okay, I'm doing stock floor trading in this particular area, and I've only got a gigabyte of data, but I need to be able to look at that data and make snap decisions very rapidly to feed back to the people who are trying to run the business." It really is the entire spectrum that's valid.

Shanley: Let's talk about the processing or the execution of the queries in terms of, do they have to be processed randomly, can there be any pre-processing or pre-ordering of them? I guess, "What is ad-hoc?" is another way of phrasing it.

Raab: I call it controlled spontaneity.

Shanley: That's a beautiful phrase.

Stephens: Yes, that's a fine way of stating it. What we were trying to do was to get to a managed level of randomness. We phrase the SQL queries as templates, where we're going to do a select from these tables and we're going to return this many rows and we're going to take this average and so forth, but we leave the parameters for those queries open. So in one run you may look for suppliers in France, in another one in Germany. Or in one case you want parts from sizes 10 to a hundred, in another from a hundred to a thousand. The result is 17 templates which actually expand out to something on the order of a thousand specific queries. I don't know the exact number. And that allows us to have an appropriate degree of randomness. You can't a priori know the exact query that will be run, only the style of work load that it will exercise. In addition, what we've said is that there are specific orderings, and the orderings for the queries are well-known, they're in the spec. A particular stream has to run query 7 and then 9 and then 14 and then 2 and so forth, a permutation of the 17. And if you do a multiple stream execution, then each stream has to do a different permutation so that different queries will be working together on a given system.
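
[Note: To make the "controlled spontaneity" concrete, here is a minimal Python sketch of the mechanism Jack describes: query templates whose substitution parameters are filled in from a seeded random number generator, plus a distinct permutation of the query set for each stream. The template text, parameter values, and helper names are hypothetical illustrations, not the actual TPC-D query phrasings, which are produced by the query generation utility supplied with the benchmark.]

    import random

    # Hypothetical templates in the spirit of the TPC-D ones: the SQL shape is
    # fixed, the substitution parameters are filled in at generation time.
    TEMPLATES = {
        7: "SELECT ... FROM supplier, nation WHERE n_name = '{nation}' ...",
        9: "SELECT ... FROM part WHERE p_size BETWEEN {lo} AND {hi} ...",
    }

    def generate_query(qnum, rng):
        # Fill in a template's parameters from a seeded random number generator.
        if qnum == 7:
            return TEMPLATES[7].format(nation=rng.choice(["FRANCE", "GERMANY", "JAPAN"]))
        lo = rng.choice([10, 100])
        return TEMPLATES[9].format(lo=lo, hi=lo * 10)

    rng = random.Random(1995)        # a known seed keeps the run reproducible
    print(generate_query(7, rng))
    print(generate_query(9, rng))

    # A multi-stream (throughput) run executes all 17 queries in every stream,
    # each stream in its own ordering so that different queries overlap on the
    # system. The real benchmark uses the fixed permutations listed in the
    # spec; the shuffle below merely stands in for "a different permutation".
    queries = list(range(1, 18))
    for stream in range(3):
        order = queries[:]
        rng.shuffle(order)
        print(f"stream {stream}: {order}")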

Shanley: Let's talk about the metrics. We have a single-user query processing power metric, QppD, and a multi-user query throughput metric, QthD. Why two metrics? Give me your rationale for the development of these two different metrics.

Raab: Here again, with decision support we're looking at a system that can be characterized either by its power or by its throughput. What are you trying to do? Are you trying to bring all of the power you have to resolve a single problem as quickly as possible, or are you trying to resolve as many problems as possible in a certain unit of time? The latter is throughput. Those two aspects are different, and different enough to warrant their own separate metrics.

Shanley: And these factors are balanced in the price performance metric?

Raab: They both contribute to the price performance metric, that's correct.

Stephens: This is another one of the infamous spectrums. You give two customers the same system, the same number of disks, the same CPU and MIPS, the same amount of memory and so forth, and one will attempt to solve a particular problem very, very rapidly and the other will attempt to solve as many problems as he can in the time that he's got. The car analogy is a great one. You can take the same engine and with the transmission make the car go very, very fast or make it pull a trailer that's very, very heavy. Same issue. And what we tried to do is address that entire range and not make a judgment as to which is more valuable, which is why they contribute equally to the price performance result. And then it's up to the customer to decide where in that spectrum he falls.
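
[Note: A compact way to see how the two measures weigh in equally is the composite metric, reproduced here as the editors understand it from the TPC-D specification; the precise definitions, including the timing rules behind QppD and QthD themselves, are given in the spec.]

    \[
      \mathit{QphD@Size} \;=\; \sqrt{\mathit{QppD@Size} \times \mathit{QthD@Size}},
      \qquad
      \text{Price/Performance} \;=\; \frac{\text{total system price}}{\mathit{QphD@Size}}
    \]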

Shanley: Let's talk about some comments that have come in saying that performance on this benchmark has a lot to do with query optimization and indexing capabilities. Could you explain?

Raab: Whenever you submit a query to a database, there are a lot of ways to go about getting the data. The real challenge for the product developers who are going to use this benchmark to characterize their software is to make sure that they're making intelligent choices for the person who's asking the question. Part of that is when to look at the raw data and when to look at the index. Another is what order to look at the tables in. If you're going to have to look at five tables to get an answer, do you look at them 1-2-3-4-5 or 5-3-4-2-1? How do you combine intermediate results? All of that falls under the broad category of query optimization: how do I go about taking the high-level question being asked in the SQL and breaking it into discrete actions? That's one of the real differentiators that will separate out products in decision support and in this benchmark. Another way to put it, one way that I like to look at it, is that query optimization is how the RDBMS can take best advantage of all the hardware resources available. And I think, to elevate the debate on this just a tad, this is really what this benchmark will ultimately measure: how well the software can take advantage of the hardware that it runs on, and, with new hardware architectures, how well the RDBMS will be able to adapt to them.
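
[Note: The join-ordering part of this can be made concrete with a toy Python sketch: enumerate the possible orders for a five-table join and score each with a deliberately crude cost model. The table names and row counts are illustrative, roughly one-gigabyte-scale cardinalities, and the cost model is invented; a real optimizer works from much richer statistics.]

    from itertools import permutations

    # Toy illustration of one optimizer decision: the order in which to join
    # tables. Row counts are illustrative; the cost model is invented.
    tables = {"region": 5, "nation": 25, "customer": 150_000,
              "orders": 1_500_000, "lineitem": 6_000_000}

    def estimated_cost(order, selectivity=0.1):
        # Pretend each join scans the next table once per row of the current
        # intermediate result and keeps 10% of the combined rows.
        cost, intermediate = 0, 1
        for name in order:
            cost += intermediate * tables[name]
            intermediate = max(1, int(intermediate * tables[name] * selectivity))
        return cost

    best = min(permutations(tables), key=estimated_cost)   # 5! = 120 candidate orders
    print("cheapest order under the toy model:", " -> ".join(best))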

Stephens: I think that's absolutely it. When you phrase a query, you're telling the RDBMS what you want to do, and it's up to it to understand the data and the hardware and software that's available, and to know the best way to get you what you want to know.

Shanley: I/O performance has also been mentioned as having a major impact on results. Any further comments on the I/O side of things?

Raab: It's a very I/O intensive benchmark, but then all the OLTP benchmarks were very I/O intensive too. I think it's I/O intensive in different ways, though. The I/O patterns are different and therefore it will stress the I/O subsystems in ways that they have not been stressed in the past.

Stephens: Database benchmarks by definition have a serious and substantial I/O component. If you can't get the data to the CPU, the CPU can't do intelligent things with it. But in the case of OLTP where you were going out and getting singleton records randomly selected, "Go give me account 7, 700, 7,000, and lots of them very rapidly," that was one I/O pattern. In decision support, it's "Go get all of them." That's a very different I/O pattern. It may be that in the course of an hour they move the same amount of data, but the assumptions you make, the optimizations you make, the design tradeoffs that you make for one versus the other are very different, and it will be interesting to see how different systems react.