Tuesday, December 16, 2008

SQL the Hard Way


Venice Beach, California. Problems that are difficult in imperative computer languages can be very simple in a declarative language such as SQL and vice versa. Let's look at an example of building a report of how many users come to a site every day. Back in 1995 we built a a site that would hand out a unique session key to every new user. Since the keys were an ascending sequence of integers, we realized that we could keep track of how many users came to the site by simply inserting an audit row every day showing the value of the session key generator. Here's the history table:

create table history (
sample_time date, -- when did the cookie have the value
sample_value integer
);

insert into history values (to_date('1998-03-01 23:59:00','YYYY-MM-DD HH24:MI:SS'),75000);
insert into history values (to_date('1998-03-02 23:59:00','YYYY-MM-DD HH24:MI:SS'),76000);
insert into history values (to_date('1998-03-03 23:59:00','YYYY-MM-DD HH24:MI:SS'),77000);
insert into history values (to_date('1998-03-04 23:59:00','YYYY-MM-DD HH24:MI:SS'),78000);

select * from history order by sample_time;

SAMPLE_TIM SAMPLE_VALUE
---------- ------------
1998-03-01 75000
1998-03-02 76000
1998-03-03 77000
1998-03-04 78000

Note: the Oracle DATE data type is able to represent time down to one-second precision, sort of like an ANSI TIMESTAMP(0); in Oracle 9i and newer you'd probably use the timestamp type.

Now that we are accumulating the data, it should be easy to write a report page, eh? It turns out to be very easy to extract the rows of a table in order, as in the last SELECT above. However, it is impossible for a row to refer to "the last row in this SELECT". You could have write a C#, Java, PL/SQL, or VB procedure to walk through the rows in order, just setting things up on the first pass through the loop and then doing the appropriate subtraction for subsequent rows. However, running into the arms of an imperative computer language is considered untasteful in the SQL world.

In SQL you start by thinking about what you want and working backward. If the history table has N rows, we want an interval table with N-1 rows. Each row should have the start time, end time, time interval, and cookie interval. Any time that you need information from two different rows in the database, the way to get it is with a JOIN. Since there is only one underlying table, though, this will have to be a self-join:


SQL> select h1.sample_time, h2.sample_time
from history h1, history h2;

SAMPLE_TIM SAMPLE_TIM
---------- ----------
1998-03-04 1998-03-04
1998-03-01 1998-03-04
1998-03-02 1998-03-04
1998-03-03 1998-03-04
1998-03-04 1998-03-01
1998-03-01 1998-03-01
1998-03-02 1998-03-01
1998-03-03 1998-03-01
1998-03-04 1998-03-02
1998-03-01 1998-03-02
1998-03-02 1998-03-02
1998-03-03 1998-03-02
1998-03-04 1998-03-03
1998-03-01 1998-03-03
1998-03-02 1998-03-03
1998-03-03 1998-03-03

16 rows selected.

A note about syntax is in order here. In an SQL FROM list, one can assign a correlation name to a table. In this case, h1 and h2 are assigned to the two copies of history from which we are selecting. Then we can refer to h1.sample_time and get "the sample_time column from the first copy of the history table."

The main problem with this query, though, has nothing to do with syntax. It is the fact that we have 13 rows too many. Instead of N-1 rows, we specified the Cartesian product and got NxN rows. We've successfully done a self-join and gotten all the pairings we need, but also all possible other pairings. Now it is time to specify which of those pairings we want in our final report:


select h1.sample_time as s1,
h2.sample_time as s2
from history h1, history h2
where h2.sample_time > h1.sample_time;

S1 S2
---------- ----------
1998-03-01 1998-03-04
1998-03-02 1998-03-04
1998-03-03 1998-03-04
1998-03-01 1998-03-02
1998-03-01 1998-03-03
1998-03-02 1998-03-03

6 rows selected.
Note first that we've given correlation names to the columns as well, resulting in a report labelled with "s1" and "s2" (in a database with a smarter SQL parser than Oracle's, we could also use these as shorthand in the WHERE clause). The critical change here is the WHERE clause, which states that we only want intervals where s2 is later than s1. That kills off 10 of the rows from the Cartesian product but there are still three unwanted rows, e.g., the pairing of 1998-03-01 and 1998-03-04. We only want the pairing of 1998-03-01 and 1998-03-02. Let's refine the WHERE clause:

select h1.sample_time as s1,
h2.sample_time as s2
from history h1, history h2
where h2.sample_time = (select min(h3.sample_time)
from history h3
where h3.sample_time > h1.sample_time)
order by h1.sample_time;

S1 S2
---------- ----------
1998-03-01 1998-03-02
1998-03-02 1998-03-03
1998-03-03 1998-03-04

3 rows selected.
Note that we are now asking the database, for each of the potential row pairings, to do a subquery:

select min(h3.sample_time)
from history h3
where h3.sample_time > h1.sample_time
This will scan the history table yet again to find the oldest sample that is still newer than s1. In the case of an unindexed history table, this query should probably take an amount of time proportional to the number of rows in the table cubed (N^3). If we'd done this procedurally, it would have taken time proportional to N*log(N) (the limiting factor being the sort for the ORDER BY clause). There are a couple of lessons to be learned here: (1) Sometimes declarative languages can be difficult to use and vastly less efficient than procedural languages and (2) it is good to have a fast database server.

Given the correct row pairings, it is easy to add syntax to the SELECT list to subtract the times and cookie values.


select h1.sample_time as s1,
h2.sample_time as s2,
h2.sample_time - h1.sample_time as gap_time,
h2.sample_value - h1.sample_value as gap_cookie
from history h1, history h2
where h2.sample_time = (select min(h3.sample_time)
from history h3
where h3.sample_time > h1.sample_time)
order by h1.sample_time;

S1 S2 GAP_TIME GAP_COOKIE
---------- ---------- ---------- ----------
1998-03-01 1998-03-02 1 1000
1998-03-02 1998-03-03 1 1000
1998-03-03 1998-03-04 1 1000

3 rows selected.

Formulating SQL queries can be an art and most programmers need time and experience to get good at thinking declaratively.

No comments:

Post a Comment