Suppose you wish to create a database to track the interesting information pertaining to a large collection of compact discs. It is useful to simply list data items that should be stored. One possible list might include CD title, record label, artist and song title as shown in the table below.
Compact Disc Database | |||
ARTIST | CD TITLE | RECORD LABEL | SONGS |
Macy Gray | On How Life Is | Epic | Do Something I Try Caligula ... |
Dixie Chicks | Fly | Monument | Goodbye Earl Ready To Run ... |
Sarah McLachlan | Surfacing | Arista | Full Of Grace Witness Adia Building A Mystery ... |
Sarah McLachlan | Mirrorball | Arista | Building A Mystery Fear Adia Good Enough ... |
At first glance, this table seems to meet our needs. On closer inspection, however, there are several problems.
Sarah McLachlan is repeated twice for each of her CDs. This repetition is a problem for many reasons. First, when entering artists in the database, the same artist name must be entered repeatedly. Second, and more important, if an artist's name changes, it must be updated in multiple places.
For instance, if McLachlan is misspelled the programmer will have to update the data in multiple rows. The same issue would arise if Ms. McLachlan's name changes in the future (à la Jefferson Airplane or John Cougar). As more Sarah McLachlan CDs are added to the collection, the amount of effort required to maintain data consistency would obviously increase.
Another problem with the table lies in the way songs are stored: as a list of songs in a single column. Using this data meaningfully will be problematic. Imagine having to enter and maintain such a list. What if we wish to store the length of songs as well as their names? Or we wish to search by song title?
This is where database design comes into play. One of the main purposes of database design is to eliminate redundancy from a database through techniques of normalization. Let's begin first with fundamental relational database design concepts.