Tuesday, March 27, 2012

Normalising Your Database - Higher Normal Forms


Are We There Yet?
Designing your tables to comply with 3NF (third normal form) is usually sufficient to ensure good design so, most of the time, you can stop right here. The higher normal forms address less common data problems. They are included here so you'll know what they are and what to do about them if you come across them. These normal forms do get a bit more complicated.
The truth is there are times when you will want to denormalize your data. That means you may sometimes want to put the data in two normalized tables back into one denormalized table. The reasons for doing this are usually associated with performance, for example, the speed at which queries run. But at least it will be a conscious decision and represents one of the beauties of normalization. Properly normalized tables can always be put back together with no loss or gain of data.

Higher Normal Forms - Boyce-Codd Normal Form:
A table is in third normal form (3NF) and all determinants are candidate keys.
Boyce-Codd normal form (BCNF) can be thought of as a "new" third normal form. It was introduced to cover situations that the "old" third normal form did not address. Keep in mind the mean of a determinant (determines the value in another field) and candidate keys (qualify for designation as primary key). This normal form applies to situations where you have overlapping candidate keys.
If a table has no non-key fields, it is automatically in BCNF.
A Practical Approach
Look for potential problems in updating existing data (modification anomaly) and in entering new data (insertion anomaly).
Imagine that you were designing a table for a college to hold information about courses, students, and teaching assistants. You have the following business rules.
§  Each course can have many students.
§  Each student can take many courses.
§  Each course can have multiple teaching assistants (TAs).
§  Each TA is associated with only one course.
§  For each course, each student has one TA.
Some sample data:
CourseNum
Student
TA
ENG101
Jones
Clark
ENG101
Grayson
Chen
ENG101
Samara
Chen
MAT350
Grayson
Powers
MAT350
Jones
O'Shea
MAT350
Berg
Powers
COURSES_STUDENTS_TA's
To uniquely identify each record, you could choose CourseNum + Student as a primary key. This would satisfy third normal form also because the combination of CourseNum and Student determines the value in TA. Another candidate key would be Student + TA. In this case, you have overlapping candidate keys (Student is in both). The second choice, however, would not comply with third normal form because the CourseNum is not determined by the combination of Student and TA; it only depends on the value in TA (see the business rules). This is the situation that Boyce-Codd normal form addresses; the combination of Student + TA could not be considered to be a candidate key.
If you wanted to assign a TA to a course before any students enrolled, you couldn't because Student is part of the primary key. Also, if the name of a TA changed, you would have to update it in multiple records.
If you assume you have just these fields, this data would be better stored in three tables: one with CourseNum and Student, another with Student and TA, and a third with CourseNum and TA.
*CourseNum
*Student
ENG101
Jones
ENG101
Grayson
ENG101
Samara
MAT350
Grayson
MAT350
Jones
MAT350
Berg
COURSES

*Student
*TA
Jones
Clark
Grayson
Chen
Samara
Chen
Grayson
Powers
Jones
O'Shea
Berg
Powers
STUDENTS

*CourseNum
*TA
ENG101
Clark
ENG101
Chen
MAT350
O'Shea
MAT350
Powers
TA's
Above, showing tables that comply with BCNF

Fourth Normal Form:
A table is in Boyce-Codd normal form (BCNF) and there are no multi-valued dependencies.
multi-valued dependency occurs when, for each value in field A, there is a set of values for field B and a set of values for field C but fields B and C are not related.
A Practical Approach
Look for repeated or null values in non-key fields.
A multi-valued dependency occurs when the table contains fields that are not logically related. An often used example is the following table:
*Movie
*Star
*Producer
Once Upon a Time
Julie Garland
Alfred Brown
Once Upon a Time
Mickey Rooney
Alfred Brown
Once Upon a Time
Julie Garland
Muriel Humphreys
Once Upon a Time
Mickey Rooney
Muriel Humphreys
Moonlight
Humphrey Bogart
Alfred Brown
Moonlight
Julie Garland
Alfred Brown
MOVIES
A movie can have more than one star and more than one producer. A star can be in more than one movie. A producer can produce more than one movie. The primary key would have to include all three fields and so this table would be in BCNF. But you have unnecessarily repeated values, with the data maintenance problems that causes, and you would have trouble with deletion anomalies.
The Star and the Producer really aren't logically related. The Movie determines the Star and the Movie determines the Producer. The answer is to have a separate table for each of those logical relationships - one holding Movie and Star and the other with Movie and Producer, as shown below:
*Movie
*Star
Once Upon a Time
Julie Garland
Once Upon a Time
Mickey Rooney
Moonlight
Humphrey Bogart
Moonlight
Julie Garland
STARS

*Movie
*Producer
Once Upon a Time
Alfred Brown
Once Upon a Time
Muriel Humphreys
Moonlight
Alfred Brown
PRODUCERS
Above, showing tables that comply with 4NF
Below is another example of a common design error, and it's easily spotted by all the missing or blank values.
DeptCode
ProjectNum
ProjectMgrID
Equipment
PropertyID
IS
36-272-TC
EN1-15
CD-ROM
657
IS


VGA desktop monitor
305
AC
35-152-TC
EN1-15


AC


Dot-matrix printer
358
AC


Calculator with tape
239
TW
30-452-T3
EN1-10
486 PC
275
TW
30-457-T3
EN1-15


TW
31-124-T3
EN1-15
Laser printer
109
TW
31-238-TC
EN1-15
Hand-held scanner
479
RI


Fax machine
775
MK


Laser printer
858
MK


Answering machine
187
TW
31-241-TC
EN1-15
Standard 19200 bps modem
386
SL


486 Laptop PC
772
SL


Electronic notebook
458
PROJECTS_EQUIPMENT
Above, a table with many null values (note: it also does not comply with 3NF and BCNF)
It's the same problem here because not all of the data is logically related. As usual, the answer is more tables - one to hold the information on the equipment assigned to departments (with PropertyID as the primary key) and another with projects and departments. You'd have to know the business rules to know whether a project might involve more than one department or manager and be able to figure out the primary key. Assuming a project can have only one manager and be associated with only one department, the tables would be as follows.
*PropertyID
Equipment
DeptCode
657
CD-ROM
IS
305
VGA desktop monitor
IS
358
Dot-matrix printer
AC
239
Calculator with tape
AC
275
486 PC
TW
109
Laser printer
TW
479
Hand-held scanner
TW
775
Fax machine
RI
858
Laser printer
MK
187
Answering machine
MK
386
Standard 19200 bps modem
TW
772
486 Laptop PC
SL
458
Electronic notebook
SL
EQUIPMENT

*ProjectNum
ProjectMgrID
DeptCode
36-272-TC
EN1-15
IS
35-152-TC
EN1-15
AC
30-452-T3
EN1-10
TW
30-457-T3
EN1-15
TW
31-124-T3
EN1-15
TW
31-238-TC
EN1-15
TW
31-241-TC
EN1-15
TW
PROJECTS_EQUIPMENT
Above, tables that eliminate the null values and comply with 4NF

Fifth Normal Form:
A table is in fourth normal form (4NF) and there are no cyclic dependencies.
cyclic dependency can occur only when you have a multi-field primary key consisting of three or more fields. For example, let's say your primary key consists of fields A, B, and C. A cyclic dependency would arise if the values in those fields were related in pairs of A and B, B and C, and A and C.
Fifth normal form is also called projection-join normal form. A projection is a new table holding a subset of fields from an original table. When properly formed projections are joined, they must result in the same set of data that was contained in the original table.
A Practical Approach
Look for the number of records that will have to be added or maintained
Following is some sample data about buyers, the products they buy, and the companies they buy from on behalf of MegaMall, a large department store.
*Buyer
*Product
*Company
Chris
jeans
Levi
Chris
jeans
Wrangler
Chris
shirts
Levi
Lori
jeans
Levi
BUYING
Above, a table with cyclic dependencies
The primary key consists of all three fields. One data maintenance problem that occurs is that you need to add a record for every buyer who buys a product for every company that makes that product or they can't buy from them. That may not appear to be a big deal in this sample of 2 buyers, 2 products, and 2 companies (2 X 2 X 2 = 8 total records). But what if you went to 20 buyers, 50 products, and 100 companies (20 X 50 X 100 = 100,000 potential records)? It quickly gets out of hand and becomes impossible to maintain.
You might be tempted to solve this by dividing this into the following two tables.
*Buyer
*Product
Chris
jeans
Chris
shirts
Lori
jeans
BUYERS

*Product
*Company
jeans
Wrangler
jeans
Levi
shirts
Levi
PRODUCTS
However, if you joined the two tables above on the Product field, it would produce a record not part of the original data set (it would say that Lori buys jeans from Wrangler). This is where the projection-join concept comes in.
The correct solution would be three tables:
*Buyer
*Product
Chris
jeans
Chris
shirts
Lori
jeans
BUYERS

*Product
*Company
jeans
Wrangler
jeans
Levi
shirts
Levi
PRODUCTS

*Buyer
*Company
Chris
Levi
Chris
Wrangler
Lori
Levi
COMPANIES
Above, tables that comply with 5NF
When the first two tables are joined by Product and the result joined to the third table by Buyer and Company, the result is the original set of data.
In our scenario of 20 buyers, 50 products, and 100 companies, you would have, at most, 1000 records in the Buyers table (20 X 50), 5000 records in the Products table (50 X 100), and 2000 records in the Companies table (20 X 100). With a maximum of 8000 records, these tables would be much easier to maintain than the possible 100,000 records we saw earlier.

No comments:

Post a Comment