Logo Bases de données
Français     English

Applications of Databases to Humanities and Social Sciences


Fundamentals of Normalization


The theory of normalization is related to the relational data model created by Codd in 1970. This model is based on the concept of relation.

The relational data model

Concept of relation

A relation is a 2 dimensions array, whose columns represent the properties of the relation (attributes or fields), and whose lines correspond to the various values (instances, occurrences) of these attributes. All the values of an attribute can belong only to one type : integer, real, character string..., and all the lines must be different, from where the concept of key of a relation.

Key of a relation

The key of a relation is the minimum subset of the attributes which allows to identify each line in a single way. All relation must obligatorily have a key.

Schema of a relation

It corresponds to the "condensed" definition of the relation and is composed of the name of the relation followed by the list of the attributes which make this relation, the key attributes being underlined. For example the relation R made up of the attributes A1, A2 and A3 whose schema is : R(A1,A2,A3).

The array with the lines corresponds to a possible "extension" of the relation :

A1
A2
A3
1
Wording 1
50
2
Wording 2
60
3
Wording 3
50
...
 

The concept of relation is simple and intuitive, but thus led to many anomalies if one applies it in a strict sense. The most significant anomalies are the following ones :

  • lack of semantic connections between data.
  • redundancy of data.
  • "abnormal" behaviors of the relationsin in case of updates : modifications, deletions.

We will use, to illustrate the various concepts presented, the following attributes which intervene in the management of education :

Mnemonic
Wording
Type
Coefficient
Coefficient of the subjecy
Integer
Nomat
Name of the subject
String(30)
Nomens
Name of the teacher
String(30)
Numat
Number of the subject
Integer
Numens
Number of the teacher
Integer

We will suppose that a teacher teaches only in one subject, but that a subject is taught by several teachers. There are 5 teachers per subject and there are 10 subjects in the course. We define, on these attributes, the relation EDUCATION.

EDUCATION(Numens, Nomens, Numat, Nomat, Coefficient)

"Problems" of the relation EDUCATION

We notice first of all that this relation satisfies the basic definition completely and can thus be implemented and used just as it is in a relational DBMS. Indeed, the values of the attributes belong only to one type and the key of the relation (Numens) allows to distinguish, in a single way, each line.
This relation obtained, without structuring of information, presents however many anomalies.

At the semantic level

The key of this relation is the number of the teacher (Numens). That means that this attribute identifies, characterizes, the relation and thus that all the other attributes of this relation are properties depending semantically on this attribute. That implies, among others, that the coefficient of a subject, or its name, is a direct characteristics of a teacher, which is, obviously, wrong.
This problem of modeling is due to the fact that the relation EDUCATION does not represent an object, an entity, of the real world but an amalgam of several entities.

On the level of redundancy of information

In the relation EDUCATION, the characteristics of a subject (its name and its coefficient) will have to be repeated as many time as there are teachers in the subject. In our example 5 times. This repetition, which can be avoided, will involve risks of partial update of the data and thus possibilities of inconsistency.

The size, in characters, necessary to store this relation is the following one : 5*10*(2+30+2+30+2)= 3300

We considered that an integer was represented on 2 bytes.

On the level of updates

- It is impossible to keyboard the characteristics of a subject if it is not attached to, at least, a teacher. Indeed, that would like to say that the key of the relation (Numens) would have value NULL what is prohibited by the constraint on the unicity of the key.

-If there is only one teacher for a subject, the departure of this teacher implies the deletion of all information of the relation for this teacher and thus, in this case, those which relate to the subject.

All these anomalies are due to the fact that the relation EDUCATION is not in a "normal" form.

Normalization of the relations

The setting in a "normal" form of relations, or normalization, aims at removing all the abnormal behaviors described above, and is based on the concept of functional dependency.

Functional dependency (FD)

It is said that there is a functional dependency between an attribute A1 and an attribute A2, one notes A1 - > A2, if knowing a value of A1 one can associate only one value of A2. One also says that A1 determines A2. A1 is the source of the functional dependency and A2 the goal.

In our example, we have the following FDs:

Numens -> Nomens, Numat, Nomat, Coefficient
Numat -> Nomat, Coefficient

Elementary functional dependency

It is said that a functional dependency is elementary if the source does not comprise superfluous attributes. The question about the elementarity of a FD should thus arise only when the left part of the FD comprises several attributes.

In our example, all the FDs are elementary. On the other hand, if we have the following FDs :

A, B -> C
A -> C

The functional dependence A, B - > C is not elementary since B is superfluous.

Direct functional dependency

It is said that functional dependency A - > B is direct if there is not any attribute C such as one can have A - > C and C - > B. In other words, that means that the dependency between A and B cannot be obtained by transitivity.

In our example, the FD: Numat - > Nomat, Coefficient is direct. On the other hand the FD Numens - > Nomat is not direct.

We have indeed : Numens -> Numat and Numat -> Nomat.

The normal forms

First normal form(1NF)

It is said that a relation is in 1NF if all the attributes which composes it cannot be subdivided.
The relation EDUCATION is in 1NF.

Second normal form (2NF)

It is said that a relation is in 2NF if it is in 1NF and if all the FDs between the key and the other attributes are elementary.
The relation EDUCATION is in 2NF.

Third normal form (3NF)

It is said that a relation is in 3NF if it is in 2NF and if all the FDs between the key and the other attributes are elementary and direct.
The relation EDUCATION is not in 3NF.

Boyce-Codd normal form (BCNF)

It is said that a relation is in BCNF if it is in 3NF and if the only FDs which exist are those which connect the key to the other not-key attributes.
The relation EDUCATION is not in BCNF.

Relations in 3NF

The problems of the relation EDUCATION come from the fact that this relation is in 2NF. To eliminate the abnormal behaviors, it is necessary to structure the attributes of this relation so to break it up into several relations in 3NF.
To build the relations in 3NF relating to our example, we start from the whole FDs set :

Numens -> Nomens, Numat, Nomat, Coefficient
Numat -> Nomat, Coefficient

After eliminations of transitivities, we obtain:

Numens -> Nomens, Numat
Numat -> Nomat, Coefficient

What gives us the following relations :

TEACHER(Numens, Nomens, Numat)
SUBJECT(Numat, Nomat, Coefficient)

These two relations are in BCNF.

For detailed information on the design process, refer, in these pages, at the column DB Design.

Advantages of the relations in 3NF (BCFN)

At the semantic level

The 2 preceding relations correspond to objects, entities, of the real world. The attributes properties of each relation are direct characteristics of the key of the associated relation.

On the level of redundancy of information

Information characteristic of a subject (its name and its coefficient) will be present only once and not 5 times as in the relation EDUCATION.
The redundancy of information is thus reducted to the bare minimum, without loss of information, which limits the risks of inconsistency.

Moreover, the size of the data is following :

Size of the TEACHER relation : 5*10*(2+30+2)= 1700
Size of the SUBJECT relation: 10*(2+30+2)= 340

The size of the database is 2040 characters, thus a reduction of about 40% compared to the database EDUCATION (3300 characters), without loss of information.

On the level of updates

- It is possible to keyboard the characteristics of a subject even if it is not attached to a teacher. The capture is carried out only in the relation SUBJECT.

- If there is only one teacher for a subject, the departure of this teacher implies the deletion of information relating to this teacher (a row in the TEACHER relation) and not those which relate to the subject.

The two relations in 3NF, TEACHER and SUBJECT, thus constitute the "best" possible representation for the data of our example.

The construction of an "optimal" structure of data involves the determination of the relations in 3NF (BCFN), built starting from the whole set of the source data.

The process for databases design presented in these pages (DB Design), allows to directly obtain relations in 3NF (BCNF).

Back to structuring of data

 

Visitors :

©Marc Grange, February 2001 | Last update: August 4, 2011