Tuesday, December 16, 2008

How RDBMS Works

How Does This RDBMS Thing Work?

Botaniska Tradgarden.  Visby, Gotland. Database researchers love to talk about relational algebra, n-tuples, normal form, and natural composition, while throwing around mathematical symbols. This patina of mathematical obscurity tends to distract your attention from their bad suits and boring personalities, but is of no value if you just want to use a relational database management system.

In fact, this is all you need to know to be a Caveman Database Programmer: A relational database is a big spreadsheet that several people can update simultaneously.

How can multiple people use a spreadsheet such as Microsoft Excel at the same time? Would they take turns at the keyboard and mouse? Fight over the keyboard and mouse? Crowd their heads together in front of the display?

We need to get rid of the mouse/keyboard/windows idea of interaction. It doesn't make sense when there are many simultaneous users. Instead imagine that the database management system is sitting by itself in a dark closet. Users write down their requests on strips of paper and slip them under the door. A request might be "create a table", "insert a row into a table", "update an existing row in a table", "give me a report of the information contained in all the rows in a table that meet the following criteria...". If a request requires an answer, the database chews on the question for awhile, then pushes a paper report back out underneath the door.

Let's examine how this works in greater detail.

Each table in the database is one spreadsheet. You tell the RDBMS how many columns each row has. For example, in our mailing list database, the table has two columns: name and e-mail. Each entry in the database consists of one row in this table. An RDBMS is more restrictive than a spreadsheet in that all the data in one column must be of the same type, e.g., integer, decimal, character string, or date. Another difference between a spreadsheet and an RDBMS is that the rows in an RDBMS are not ordered. You can have a column named row_number and ask the RDBMS to return the rows ordered according to the data in this column, but the row numbering is not implicit as it would be with a spreadsheet program. If you do define a row_number column or some other unique identifier for rows in a table, it becomes possible for a row in another table to refer to that row by including the value of the unique ID.

Here's what some SQL looks like for the mailing list application:

create table mailing_list (
email varchar(100) not null primary key,
name varchar(100)
);
The table will be called mailing_list and will have two columns, both variable length character strings. We've added a couple of integrity constraints on the email column. The not null will prevent any program from inserting a row where name is specified but email is not. After all, the whole point of the system is to send people e-mail so there isn't much value in having a name with no e-mail address. The primary key tells the database that this column's value can be used to uniquely identify a row. That means the system will reject an attempt to insert a row with the same e-mail address as an existing row. This sounds like a nice feature, but it can have some unexpected performance implications. For example, every time anyone tries to insert a row into this table, the RDBMS will have to look at all the other rows in the table to make sure that there isn't already one with the same e-mail address. For a really huge table, that could take minutes, but if you had also asked the RDBMS to create an index for mailing_list on email then the check becomes almost instantaneous. However, the integrity constraint still slows you down because every update to the mailing_list table will also require an update to the index and therefore you'll be doing twice as many writes to the hard disk.

That is the joy and the agony of SQL. Inserting two innocuous looking words can cost you a factor of 1000 in performance. Then inserting a sentence (to create the index) can bring you back so that it is only a factor of two or three. (Note that many RDBMS implementations, including Oracle, automatically define an index on a column that is constrained to be unique.)

Anyway, now that we've executed the Data Definition Language "create table" statement, we can move on to Data Manipulation Language: an INSERT.


insert into mailing_list (name, email)
values ('Philip Greenspun','philg@mit.edu');
Note that we specify into which columns we are inserting. That way, if someone comes along later and does
alter table mailing_list add (phone_number varchar(20));
(the Oracle syntax for adding a column), our INSERT will still work. Note also that the string quoting character in SQL is a single quote. Hey, it was the '70s. If you visit the newsgroups beginning with comp.databases right now, you can probably find someone asking "How do I insert a string containing a single quote into an RDBMS?" Here's one harvested back in 1998:
demaagd@cs.hope.edu (David DeMaagd) wrote:

>hwo can I get around the fact that the ' is a reserved character in
>SQL Syntax? I need to be able to select/insert fields that have
>apostrophies in them. Can anyone help?

You can use two apostrophes '' and SQL will treat it as one.

===========================================================
Pete Nelson | Programmers are almost as good at reading
weasel@ecis.com | documentation as they are at writing it.
===========================================================
We'll take Pete Nelson's advice and double the single quote in "O'Grady":
insert into mailing_list (name, email)
values ('Michael O''Grady','ogrady@fastbuck.com');

Having created a table and inserted some data, at last we are ready to experience the awesome power of the SQL SELECT. Want your data back?

select * from mailing_list;
If you typed this query into a standard shell-style RDBMS client program, for example Oracle's SQL*PLUS, you'd get ... a horrible mess. That's because you told Oracle that the columns could be as wide as 100 characters (varchar(100)). Very seldom will you need to store e-mail addresses or names that are anywhere near as long as 100 characters. However, the solution to the "ugly report" problem is not to cut down on the maximum allowed length in the database. You don't want your system failing for people who happen to have exceptionally long names or e-mail addresses. The solution is either to use a fancier tool for querying your database or to give SQL*Plus some hints for preparing a report:

SQL> column email format a25
SQL> column name format a25
SQL> column phone_number format a12
SQL> set feedback on
SQL> select * from mailing_list;

EMAIL NAME PHONE_NUMBER
------------------------- ------------------------- ------------
philg@mit.edu Philip Greenspun
ogrady@fastbuck.com Michael O'Grady

2 rows selected.
Note that there are no values in the phone_number column because we haven't set any. As soon as we do start to add phone numbers, we realize that our data model was inadequate. This is the Internet and Joe Typical User will have his pants hanging around his knees under the weight of a cell phone, beeper, and other personal communication accessories. One phone number column is clearly inadequate and even work_phone and home_phone columns won't accommodate the wealth of information users might want to give us. The clean database-y way to do this is to remove our phone_number column from the mailing_list table and define a helper table just for the phone numbers. Removing or renaming a column turns out to be impossible in Oracle 8, so we

drop table mailing_list;

create table mailing_list (
email varchar(100) not null primary key,
name varchar(100)
);

create table phone_numbers (
email varchar(100) not null references mailing_list,
number_type varchar(15) check (number_type in ('work','home','cell','beeper')),
phone_number varchar(20)
);
Note that in this table the email column is not a primary key. That's because we want to allow multiple rows with the same e-mail address. If you are hanging around with a database nerd friend, you can say that there is a relationship between the rows in the phone_numbers table and the mailing_list table. In fact, you can say that it is a many-to-one relation because many rows in the phone_numbers table may correspond to only one row in the mailing_list table. If you spend enough time thinking about and talking about your database in these terms, two things will happen:
  1. You'll get an A in an RDBMS course at a mediocre state university.
  2. You'll pick up readers of Psychology Today who think you are sensitive and caring because you are always talking about relationships. [see "Using the Internet to Pick up Babes and/or Hunks" at http://philip.greenspun.com/wtr/getting-dates before following any of the author's dating advice]
Another item worth noting about our two-table data model is that we do not store the user's name in the phone_numbers table. That would be redundant with the mailing_list table and potentially self-redundant as well, if, for example, "robert.loser@fastbuck.com" says he is "Robert Loser" when he types in his work phone and then "Rob Loser" when he puts in his beeper number, and "Bob Lsr" when he puts in his cell phone number while typing on his laptop's cramped keyboard. A database nerd would say that that this data model is consequently in "Third Normal Form". Everything in each row in each table depends only on the primary key and nothing is dependent on only part of the key. The key for the phone_numbers table is the combination of email and number_type. If you had the user's name in this table, it would depend only on the email portion of the key.

Anyway, enough database nerdism. Let's populate the phone_numbers table:


SQL> insert into phone_numbers values ('ogrady@fastbuck.com','work','(800) 555-1212');

ORA-02291: integrity constraint (SCOTT.SYS_C001080) violated - parent key not found
Ooops! When we dropped the mailing_list table, we lost all the rows. The phone_numbers table has a referential integrity constraint ("references mailing_list") to make sure that we don't record e-mail addresses for people whose names we don't know. We have to first insert the two users into mailing_list:

insert into mailing_list (name, email)
values ('Philip Greenspun','philg@mit.edu');
insert into mailing_list (name, email)
values ('Michael O''Grady','ogrady@fastbuck.com');


insert into phone_numbers values ('ogrady@fastbuck.com','work','(800) 555-1212');
insert into phone_numbers values ('ogrady@fastbuck.com','home','(617) 495-6000');
insert into phone_numbers values ('philg@mit.edu','work','(617) 253-8574');
insert into phone_numbers values ('ogrady@fastbuck.com','beper','(617) 222-3456');
Note that the last four INSERTs use an evil SQL shortcut and don't specify the columns into which we are inserting data. The system defaults to using all the columns in the order that they were defined. Except for prototyping and playing around, I don't recommend ever using this shortcut.

The first three INSERTs work fine, but what about the last one, where Mr. O'Grady misspelled "beeper"?

ORA-02290: check constraint (SCOTT.SYS_C001079) violated
We asked Oracle at table definition time to check (number_type in ('work','home','cell','beeper')) and it did. The database cannot be left in an inconsistent state.

Let's say we want all of our data out. Email, full name, phone numbers. The most obvious query to try is a join.


SQL> select * from mailing_list, phone_numbers;

EMAIL NAME EMAIL TYPE NUMBER
---------------- ---------------- ---------------- ------ --------------
philg@mit.edu Philip Greenspun ogrady@fastbuck. work (800) 555-1212
ogrady@fastbuck. Michael O'Grady ogrady@fastbuck. work (800) 555-1212
philg@mit.edu Philip Greenspun ogrady@fastbuck. home (617) 495-6000
ogrady@fastbuck. Michael O'Grady ogrady@fastbuck. home (617) 495-6000
philg@mit.edu Philip Greenspun philg@mit.edu work (617) 253-8574
ogrady@fastbuck. Michael O'Grady philg@mit.edu work (617) 253-8574

6 rows selected.
Yow! What happened? There are only two rows in the mailing_list table and three in the phone_numbers table. Yet here we have six rows back. This is how joins work. They give you the Cartesian product of the two tables. Each row of one table is paired with all the rows of the other table in turn. So if you join an N-row table with an M-row table, you get back a result with N*M rows. In real databases, N and M can be up in the millions so it is worth being a little more specific as to which rows you want:

select *
from mailing_list, phone_numbers
where mailing_list.email = phone_numbers.email;

EMAIL NAME EMAIL TYPE NUMBER
---------------- ---------------- ---------------- ------ --------------
ogrady@fastbuck. Michael O'Grady ogrady@fastbuck. work (800) 555-1212
ogrady@fastbuck. Michael O'Grady ogrady@fastbuck. home (617) 495-6000
philg@mit.edu Philip Greenspun philg@mit.edu work (617) 253-8574

3 rows selected.
Probably more like what you had in mind. Refining your SQL statements in this manner can sometimes be more exciting. For example, let's say that you want to get rid of Philip Greenspun's phone numbers but aren't sure of the exact syntax.

SQL> delete from phone_numbers;

3 rows deleted.
Oops. Yes, this does actually delete all the rows in the table and is perhaps why Oracle makes the default in SQL*Plus that you're inside a transaction. I.e., nothing happens that other users can see until you type "commit". If we type "rollback", therefore, we have an opportunity to enter the statement that we wished we'd typed:
delete from phone_numbers where email = 'philg@mit.edu';
There is one more SQL statement that is worth learning. In the 20th century parents encouraged their children to become software engineers rather than trying their luck in Hollywood. In the 21st century, however, it might be safer to pitch scripts than to sit at a computer waiting for your job to be outsourced to a Third World country. Suppose therefore that Philip Greenspun gives up programming and heads out to Beverly Hills. Clearly a change of name is in order and here's the SQL:

SQL> update mailing_list set name = 'Phil-baby Greenspun' where email = 'philg@mit.edu';

1 row updated.

SQL> select * from mailing_list;

EMAIL NAME
-------------------- --------------------
philg@mit.edu Phil-baby Greenspun
ogrady@fastbuck.com Michael O'Grady

2 rows selected.

As with DELETE, it is not a good idea to play around with UPDATE statements unless you have a WHERE clause at the end.

No comments:

Post a Comment