The Derek Zoolander School for College Kids Who Can't Database Good

One of the most useful classes I took during my time here at Penn State was CMPSC 431W: Database Management Systems. Amazingly, this is one of two classes offered by the CS department here on databases, and it is not required to graduate. The only required class isn’t a database class, but a Java programming class that teaches students about basic SQL since one of your projects uses JDBC to connect to an extremely rudimentary database and requires basic knowledge of implicit joins and creating tables with a GUI. So, for a while, I thought databases were easy and once you knew how to type SELECT ... FROM ... WHERE ...; you knew all that a non-DBA needed to know about them.

When I took 431W, I was told by past students that the class has you build a basic web application with PHP or something and then hook it up to a database to display a list of data on a page, like video games or food and you learn that if you change the content of the database tables, the list changes (wow!). However, when I got to the class, the instructor was different and we had a new project: design a database for a specific use case (nurses in a hospital) and then implement a simplified relational database management system ((R)DBMS, like a MySQL or MariaDB) in C to run the SQL code you wrote to create and query this database.

Certainly a taller order than the previous semesters had led me to believe.

So I went through the semester and learned all about how DBMSs work and the general algorithms behind performing SELECTs and JOINs and how there is infuriatingly no right way™ of doing things in the world of structuring a database. And while I learned that database administrator was an entry that I would strike off my list of desired careers, I also learned a lot about making database code more efficient and what would be useful for a developer to know as well as what you could probably get away without knowing. Here’s a simple list the major takeaways I got from the course, in case your CS department overlooks this stuff, like Penn State’s does.

Where does the data come from?

There’s really only two ways to store data on a computer: memory and the disk, and since databases store very large amounts of data, they opt for the disk. A DBMS essentially keeps track of its records across files, then pulls the data out of the files and into memory for you when you need them. Now, it could just use the filesystem management built into the operating system and rely on the OS to manage its in-memory pages, but databases work differently than most other file use cases, so DBMSs re-implement their own paging algorithms. You don’t have to know how they work, just that it’s not literally using plain old file operations. Also, this is an oversimplification of how everything works aimed at showing how you could write a very simple DBMS, not how production code for any commercial SQL database software works.

Database tables are stored in a flat binary list of records, which are of fixed length. The record length is known since the columns in a relational database table are of a fixed type with an assigned length, hence why you have to declare a VARCHAR length, for instance. So, if you had a table of students with columns fname of type VARCHAR(10), lname of type VARCHAR(25), age of type INTEGER and student_id of type CHARACTER(10), then you would know that —assuming characters are 1 byte and integers are 4 bytes—that each record has a total length of 10+25+4+10=49 bytes and that the first 10 bytes of the record corresponds to fname and so on.

Keeping with this example, when you type a SELECT statement, you can move record by record until you match the WHERE conditions (knowing where to look by the byte offset), then return the corresponding record(s). Again, this is a very simplified look at things, but hopefully can tell you a bit about what goes on under the hood.

How Should I Structure My Data?

Designing your data model is a very complicated process with no correct answers and a lot of trade-offs, and is totally dependent on your situation. However, if you want a collection of hard and fast rules on how your tables should be laid out, you most likely want to fit the database normalization forms. There are a lot of form orders, but you pretty much never need to go past third normal form (3NF).

First Normal Form (1NF)

Let’s say you are storing a list of customers in a database, specifically their first and last names and their phone numbers (and an ID column that is the PK). At first nothing seems wrong with this design, and you can add a few customers with their respective information:

First Name	Last Name	Phone Number
Dale	Cooper	(798)-456-7910
Audrey	Horne	(465)-867-5309
Harry	Truman	(832)-483-0929

However, now we have to add Bobby Briggs, who’s a wonderful guy, but he recently signed up for cable or internet with Comcast or something and they threw in a free landline and hey, who doesn’t want a free landline? So now he has two phone numbers and wants to provide both, but you only have one phone number column, so you just serialize it as two numbers, comma-separated. Now you have this:

First Name	Last Name	Phone Number
Dale	Cooper	(798)-456-7910
Audrey	Horne	(465)-867-5309
Harry	Truman	(832)-483-0929
Bobby	Briggs	(383)-284-3892,(203)-382-1121

That new cell in the bottom right is not compliant with 1NF; it’s bad database design, since now if you wanted to query for customers with the phone number (203)-382-1121, it wouldn’t return anyone, since Bobby’s “phone number” is seen by the DBMS as “(383)-284-3892,(203)-382-1121,” which is not equivalent. If you want it to work as intended, you have to write a LIKE SQL query and have your application code handle parsing the results and make sure the developer intimately knows the table structure so that they also write all of their applications to do this too, and it just turns into a big mess.

Put simply, first normal form requires that each table cell is atomic, holding only a single piece of data that isn’t serialized and doesn’t mean anything other than what it says. To make our example comply with 1NF, you could split the “Phone Number” column into “Mobile Phone Number” and “Home Phone Number,” for example.

Second Normal Form (2NF)

A table is in second normal form (2NF) if all of its non-key columns are dependent on the primary key and no other candidate key. Essentially, this means that for any row in the table, if I want to know a non-key column value, I have to know the PK value and can’t rely on any other column (or set of columns) that is unique for every row.

For example, let’s say we have a table of big-wigs at tech companies in a table:

Name (PK)	Company	Position	Location
Satya Nadella	Microsoft	CEO	Redmond, WA
Tim Cook	Apple	CEO	Cupertino, CA
Bill Gates	Microsoft	Co-Founder	Redmond, WA
Jeff Bezos	Amazon.com	CEO	Seattle, WA

Here, we see that {Name} is the PK, but the combination {Company, Position} could also be a PK, and location depends on part of that candidate key, specifically Company. This is a violation of 2NF. To fix it, break out the table into 2 tables:

People table

Name (PK)	Company (FK)	Position
Satya Nadella	Microsoft	CEO
Tim Cook	Apple	CEO
Bill Gates	Microsoft	Co-Founder
Jeff Bezos	Amazon.com	CEO

Companies table

Company Name (PK)	Location
Microsoft	Redmond, WA
Amazon.com	Seattle, WA
Apple	Cupertino, CA

This accomplishes a couple things: firstly, it makes our data returned a bit cleaner; if we want to know who the CEO of Microsoft is, we should get Satya Nadella, and if we want more information about him, I can look it up elsewhere in the database. Second, it removes an unnecessary dependency in our People table between company and location. Now, if Amazon moves headquarters, we don’t have to update every row with Amazon as the company in the table, we just have to update the location in the Companies table and the rest will take care of itself.

Third Normal Form (3NF)

Now that we’ve removed any dependencies on a partial candidate key, we want to remove any dependencies on any column in the table that isn’t the primary key. This is pretty simple to spot, as shown below, in the table below:

Game (PK)	Winning Team	Winning Team Location
Super Bowl XLIX	Patriots	New England
Super Bowl 50	Broncos	Denver
Super Bowl LI	Patriots	New England
Super Bowl LII	Eagles	Philadelphia

Spot the issue? The game name is the primary key (and it is the only candidate key), but the Winning Team Location column doesn’t depend on it or any candidate key, it only depends on the Winning Team column. That’s a 3NF violation, and we can fix it by moving Teams and Super Bowls into different tables:

Super Bowls

Game (PK)	Winning Team (FK)
Super Bowl XLIV	Patriots
Super Bowl 50	Broncos
Super Bowl LI	Patriots
Super Bowl LII	Eagles

Teams

Team Name (PK)	Location
Patriots	New England
Broncos	Denver
Eagles	Philadelphia

Similar to 2NF, we’ve now removed redundant data from the Super Bowls table, so if we ever need to change a city associated with a team (say, if they move), we only have to update it in one place. Makes sense, right? You should always do this, right?

Right?

Wrong. There are situations where the cleanliness and theoretical perfectness of these normal forms are outweighed by a practical concern: in a lot of cases, NFL teams are written in the form [City] [Team] (i.e. “Philadelphia Eagles” is far more common than “Eagles”). In order to get the data out of the database to display it on, say, a scoreboard, you need to perform a JOIN between the Teams and Super Bowls tables, which is slower than just grabbing the rows out of the Super Bowls table. And let’s be honest, NFL teams moving is pretty rare, so you can afford to update at most 6 rows in a database if the Steelers move cities. So, it may make more sense to just keep the location in the Super Bowls table when it comes to practical performance.

What in God’s name is an index?

The following may be a conversation you may have (or had) at some point during an internship during your college career (I certainly had it happen at least once):

You: “Gosh, this database is slow!”

“Real” developer: “Yeah, maybe we should index this table.”

You: “Heh yeah, good idea. That might help…”

And then you go do something else (or get the “real” developers coffee, if you work at a shitty company) because you have not the slightest clue what an index is, what it does, or how it will make your database calls faster.

Let’s demystify indexes a bit. Remember that database data is stored in files, and we’ll even go so far as to pretend that a DBMS can access all of the data in memory. So, if there’s an array of data and you’re searching for any item that satisfies some condition (i.e. a SELECT...WHERE clause), you could perform a linear search if the data is unsorted since the DBMS has no way of guaranteeing that it didn’t miss something along the way than checking every single row’s values, and—as you learned in CS 101 (or CMPSC 122, for my fellow Nittany Lions)—linear searches are kind of slow since they have, well, linear time complexity. And God forbid you want to perform a JOIN, then you’d have nested loops!

The other way you could search the data is by taking your flat data and putting it in some sort of data structure, then using a more efficient searching algorithm on that structure to find what you’re looking for in sub-linear time. The process of putting the data from a table into one of these data structures is exactly what indexing is.

For example, if we have a table of students with columns for their student ID numbers (the primary key), first name, last name, and starting year, then our table would be sorted by student ID number, since that’s the primary key. And that’s all well and good until someone wants to write a web application for TAs to enter grades by last name rather than ID number because its more convenient. At that point, it would make sense for you to index the table. When you do, the DBMS makes a table that is sorted by last name and has mappings for each row of the form (last name => student ID). Now we can find the info for a student by finding their last name in the index table, then finding the corresponding entry to the primary key in the original table, and just like that our search is executed in log(n) time.

This is a rudimentary example, and real DBMSs have a plethora of data structures that they can make indexes from. Two of the most popular indices are hashes and B+ trees, and they both have different use cases. Are you checking for equality, looking for one specific row? A hashmap may work best. Looking for a range of values with more relative comparators? A B+ tree should do the trick.

Index all the things?

So right about now indexes are probably looking pretty damn sweet. In fact, you might be tempted to index everything in every database you ever make. Do not do this. Indexes (or indices, I don’t know which is correct) do speed up reads pretty significantly, but they also slow down writes. Why? Well, since they are essentially copying a subset of their table’s columns, they have to be updated (and sorted!) whenever a new row is added to the table or a row is updated that involves the columns being indexed.

In our example above, every new student would have to be inserted into both the original table and the index, and not only that, but inserted into the correct place in the index, since the data is sorted. Now, since students are only inserted rarely (twice a year, say) but read far more often, this may be a good trade-off to make. But, it’s something to think about.

That, right there, is the biggest takeaway I had from my databases class: data model design is less about what the best way of doing something is and more about what the best way of doing it given the current situation is. If you learned one thing from this post, I hope that’s it.

And if that sounds horrible, then I hope you—like me—also learned that being a database administrator is not your calling.