Notes on Database Normalization
What is it? Why do we need it? How is it done?
Table of contentsData redundancyData integrity3 kinds of anomalyUpdate anomaliesInsertion anomaliesDeletion anomaliesExtensions should require minimal redesignThe normal forms1NF2NF3NFBCNF (Boyce-Codd Normal Form)4NFThings not covered
Normalization is a database design methodology. Proposed by Edgar F. Codd in 1970 - and subsequently developed in the following years with some collaborators - database normalization gives us a framework for structuring relational (usually) databases. Its aim is to reduce data redundancy, and to improve data integrity. This is done by adhering to the "normal forms", of which there are technically 10 (or 11 if you count "Unnormalized form", which I don't). Fortunately, however, a database is generally considered to be normalized by following the first three norms.
Data redundancy is the duplication of data in multiple records. Whether the data is recorded in the same table, or represented across multiple tables, it is still considered redundant. There are two reasons we want to avoid data redundancy:
- It uses unnecessary resources. There is more data to store in the first place. That means our data takes up more space, and it will take longer to query it.
- It can easily lead to corrupted data, thanks to update anomalies.
Data integrity concerns its reliability over time. Imagine we have a database of users. Each user has an email address stored. If I query for Hans Andersen's email address and the database returns email@example.com, I want to know that running the same query will return me the same address in five years' time (assuming that Hans's address is not deliberately changed).
This predicatability can be compromised by duplicate data or storing data with invalid
types, to name a couple. For example, if you want to store some phone numbers, it might be
a bad idea to use the
VARCHAR type. This would allow non-numeric characters to be
listed, and therefore invalid phone numbers.
There are three main dangers for databases that have not been normalized:
One issue with having the same data expressed in multiple places, is that it has to be updated in multiple places. So if for some reason an update is only partially successful, some records may update while others will not. It will then be possible to get two different answers to the same query.
An insertion anomaly is when you want to add a new record to a table, but can't because of the way the data has been structured. To repurpose the example given on Wikipedia, consider this table of faculty members:
|ID||Faculty Name||Faculty Hire Date||Course Code|
|2||Snarky Bufflins||01/03/1991||The Canterbury Ales: Chaucer and his beers|
Each faculty member has a unique ID, a name, the date they were hired and a course code.
Now imagine we have a new member of the faculty starting so we need to make a new record.
Marston Porkhabit will be give a unique ID, their name will be entered and their start
date, 01/09/2021, will be recorded. But, for various reasons that are none of your
business actually, they are working on their own research initially. They are not
currently taking a course. It will be impossible to add a record for Marston unless we
null for the Course Code, which may be undesirable.
A deletion anomaly occurs when deleting data holding one fact will also delete data
holding another fact. The Faculty Table suffers from this. It is not possible to delete a
course without also removing a faculty member; unless, again, we are able to set the
Course Code to
This is just a principle that is worth mentioning before we get onto the normal forms. You should be able to add new types of data to a database without changing its structure too much. Otherwise, each change in structure could potentially break existing applications that interact with the database.
- Every record (row) must be unique
- Each cell contains only one value
- All attributes (columns that aren't keys) are dependent on the keys
- In other words, if you have a column or columns that are related to each other, or some other data, but not the primary key, then they belong in a different table
- All columns/fields can be determined only by the key in the table, and not by any other column
BCNF is a slightly stronger version of 3NF developed to address a few edge cases. Its definition is slightly complex and beyond the scope of this introduction. In short, a relation (table) will be in BCNF if it satisfies 3NF and does not contain any overlapping candidate keys.
- No multi-valued dependencies
- Let's assume your primary key is a customer ID. What if that customers has multiple addresses? Or multiple credit cards? Multiple subscriptions?
- Rather than storing those multi-valued dependencies on the Customer table, they can go in their own address, credit_card and subscription tables
- That way, we prevent obvious update anomalies
Database normalization is an enormous topic. Since it is the subject of multiple hefty textbooks, please be aware that we have not even scratched the surface here.
It might be useful to be aware that there are disagreements over which normal forms are useful to adhere to. I have seen it suggested on Stack Overflow that one should aim for 5NF and use an algorithm to achieve it. Others consider 3NF to be more than enough. The professors taking CS186 at Berkley consider 2NF and 3NF to be only of historical interest, preferring to focus on 1NF and BCNF. Though, presumably, adhering to BCNF implies also covering 2NF and 3NF!
There are also cases in which normalization is less desirable. Splitting your functional dependencies over multiple tables can require joins, which results in slower queries. In fact, there is a process called denormalization that seeks to address the cases in which this is a problem for certain business cases.
And these are just, of the things that I know I haven't covered, those that I have bothered to note! So please make sure that, if you have learned anything here, you consult other resources to learn more about database normalization 👍