php[architect] logo

Want to check out an issue? Sign up to receive a special offer.

Database Normalization For Developers

Posted by on April 27, 2023

Not every team has a dedicated database administrator. I’ve **never** worked on a team that has had someone solely focused on making sure our tables are structured perfectly with all the forms of normalization being applied. I, like a lot of developers, have had to learn difficult lessons about database design the hard way by constantly learning from my mistakes.

In this article, we’re going to go over how to normalize our database. Knowing normalization will make our lives as developers who also administer databases easier by reducing code duplication.

Normalization

One of the concepts that are key to good database design is making sure our data is normalized. The goal of normalization is to divide larger tables into smaller tables and link them together using relationships. By doing this we reduce data redundancy, ensure data is stored logically, and help prevent anomalies when working with the data.

Developers who are new to schema design may need to work through the process of normalizing their database again and again until it becomes second nature. That’s okay. It doesn’t need to be perfect right out of the gate.

Why Do We Want Our Data To Be Normalized?

Let’s say we have a `users` table where we keep information about our users. We’re going to track several pieces of information.

  • Email address
  • Password
  • Active state
  • Listing of previous passwords (to prevent reuse)
  • Office Name
  • Office City
  • Office Zip

Our database schema will look like the following.

users
id email password active previous_passwords office_name office_phone office_city office_zip
1 alice@example.com hash 1 hash1,hash2,hash3 office_name 555-555-5555 Saginaw 48609
2 avery@example.com hash 1 hash4,hash5 office_nae 555-555-5555 Saginaw 48609

There are two issues to note here. The first is that we’re tracking the previous passwords a user had using a comma-separated list which makes it a challenge to update. The second is that because we’re tracking the office names in each row someone has misspelled it in the row with the id of 2 (“office_nae” vs “office_name”) and if it’s happened once, chances are that it’s going to happen again.

Tracking the office name in the `users` table also makes it hard to generate a list of all possible offices because an office can’t “exist” in our database until it’s been assigned to a user.

## Normalization Overview
Edgar F. Codd, the inventor of the relational model that’s the basis of relational database management systems like SQL, proposed several forms that our database can apply to be normalized. These forms eventually grew into the following list.

  • UNF: Unnormalized form
  • 1NF: First normal form
  • 2NF: Second normal form
  • 3NF: Third normal for
  • EKNF: Elementary key normal form
  • BCNF: Boyce–Codd normal form
  • 4NF: Fourth normal form
  • ETNF: Essential tuple normal form
  • 5NF: Fifth normal form
  • DKNF: Domain-key normal form
  • 6NF: Sixth normal form

We’re going to focus on the first three forms because they give us the biggest return on investment for our application.

First Normal Form

When a table is the 1st normal form it follows two rules:

1. The table contains a unique identifier, also called the primary key, that is used to identify the row.
2. Each column contains atomic values (values that can not be broken down)

We already added an `id` column to take care of number one but let’s work on number two. For this rule to hold, each column should contain values that can not be broken down into smaller pieces. The `previous_passwords` column contains values that can and should be broken down into smaller pieces to satisfy this rule. We can do that by creating a new table (`user_passwords`) to keep track.

users
id email password active office_name office_phone office_city office_zip
1 alice@example.com hash 1 office_name 555-555-5555 Saginaw 48609
2 avery@example.com hash 1 office_nae 555-555-5555 Saginaw 48609
user_passwords
id user_id password
1 1 hash1
2 1 hash2
3 1 hash3
4 2 hash4
5 2 hash5

By separating these values into a new table we can more easily manipulate them and prevent issues when inserting the next previous password.

Second Normal Form

When a table is the 2nd normal form it follows two rules:

1. Is already in 1st normal form
2. All the non-key columns are dependent on the primary key of the table

For a table to obey the second rule, all of the columns in our `users` table should contain data that pertains to a user and nothing else. Their email, password, and active status are dependent on the primary key (id) and can’t easily be broken out from the user table. However, the office information fields aren’t dependent on the primary key so they violate this rule. To fix this we can split the office information into a new table.

users
id email password active office_id
1 alice@example.com hash 1 1
2 avery@example.com hash 1 1
offices
id name phone zip city
1 office_name 555-555-5555 48609 Saginaw

The bonus to this change is that we’ve now added a helpful table that can house all of our offices even if they aren’t assigned to a user.

Third Normal Form

When a table is the 3rd normal form it follows two rules:

1. Is already in 2nd normal form
2. It contains columns that are non-transitively dependent on the primary key

When columns are transitively dependent one column’s data relies on another column through a third column. For example, our offices’ city column is dependent on the zip column which is dependent on the office’s id. To fix this we’ll split out the zip in a new table.

zipcodes
id city
48609 Saginaw

This is where we get into the area of normalization where our return on investment gets a little questionable because the difference between these two items is questionable from a support perspective. I could have easily kept the zip codes and city names where they were without losing any sleep.

More after this word from our sponsor.

Foreign Key Constraints

Now that we’ve separated all our data into multiple tables we have a new problem and that’s making sure all our relationships are pointing to valid data. For instance, right now we can delete a row from the users table, and if we’re not careful all their user_passwords entries will remain. This may not be a problem but these orphan records (records that are associated with a deleted record) can cause inconsistencies in our data.

Relational databases allow us to define foreign key constraints that prevent us from creating orphaned records. By making sure that if a table references another table that it’s always a valid link.

Adding to our user_passwords table

When we created our `user_passwords` table we added a column called `user_id` that is used to associate the user with their old passwords. Right now we can insert a row that references a non-existent user.


mysql> INSERT INTO user_passwords (user_id, password) VALUES (10000000000000, 'test');
Query OK, 1 row affected (0.01 sec)

Now let’s add a foreign key constraint to the `user_passwords` table on the `user_id` column that references the id in the `users` table.


mysql> ALTER TABLE user_passwords ADD FOREIGN KEY (user_id) REFERENCES users (id);
Query OK, 0 rows affected (0.06 sec)
Records: 0 Duplicates: 0 Warnings: 0

This created a relationship so if we try to insert bad data again we’ll get an error.


mysql> INSERT INTO user_passwords (user_id, password) VALUES (10000000000000, 'test');
ERROR 1452 (23000): Cannot add or update a child row: a foreign key constraint fails (`devel`.`user_passwords`, CONSTRAINT `user_passwords_ibfk_1` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`))

What To Do About Deletions?

A downside to setting up foreign key constraints is that if we do want to delete a user and their previous passwords we need to perform two queries. One to delete all the previous passwords and then one to delete the user. If we don’t care what happens to the records it’s kind of a bother. Thankfully, when we define our foreign key we can specify what should happen in these cases.

The MySQL family of databases allows us to define the action that will happen when we delete a parent.

1. Cascade – Child rows are automatically updated or deleted.
2. Set null – Child rows are set to NULL.
3. Restrict – This prevents the deletion and is the default behavior
4. Set default – Child rows are set to the default value (not supported by all database engines)

Let’s just use the `cascade` option to automatically delete the row.


mysql> ALTER TABLE user_passwords
ADD FOREIGN KEY (user_id)
REFERENCES users (id)
ON DELETE cascade
ON UPDATE cascade;
Query OK, 0 rows affected (0.11 sec)
Records: 0 Duplicates: 0 Warnings: 0

Then deletes become **much** simpler.

mysql> INSERT INTO user_passwords (user_id, password) VALUES (100, 'test1');
Query OK, 1 row affected (0.01 sec)
mysql> INSERT INTO user_passwords (user_id, password) VALUES (100, 'test2');
Query OK, 1 row affected (0.02 sec)
mysql> INSERT INTO user_passwords (user_id, password) VALUES (100, 'test3');
Query OK, 1 row affected (0.01 sec)
mysql> delete from users where id = 100;
Query OK, 1 row affected (0.01 sec)


mysql> select * from user_passwords where user_id = 100;
Empty set (0.00 sec)

Personally, I find it best to manually delete child rows because it’s possible to lose data by deleting one row and having it cascade down into a huge portion of our database. This has bitten me in the past so learn from my mistake.

Performance With Foreign Key Constraints

There is a performance penalty to using foreign key constraints as the database is going to make sure we have valid data in each of the foreign keyed columns. It’s generally not a large performance penalty but it’s something to think about if an application is doing lots of writes.

What You Need To Know

  • Normalization is the process of dividing larger tables into smaller tables and linking them together using relationships
  • Multiple forms to follow
  • Foreign key constraints help prevent data inconsistencies

Tags: ,
 

Leave a comment

Use the form below to leave a comment: