|
Fundamentals of Normalization
The theory of normalization is related to the relational data model
created by Codd in 1970. This model is based on the concept of relation.
The relational
data model
Concept of relation
A relation is a 2 dimensions array, whose
columns represent the properties of the relation (attributes or
fields), and whose lines correspond to the various values (instances,
occurrences) of these attributes. All the values of an attribute
can belong only to one type : integer, real, character string...,
and all the lines must be different, from where the concept of key
of a relation.
Key of a relation
The key of a relation is the minimum subset
of the attributes which allows to identify each line in a single
way. All relation must obligatorily have a key.
Schema of a relation
It corresponds to the "condensed" definition
of the relation and is composed of the name of the relation followed
by the list of the attributes which make this relation, the key
attributes being underlined. For example the relation R made up
of the attributes A1, A2 and A3 whose schema is : R(A1,A2,A3).
The array with the lines corresponds to
a possible "extension" of the relation :
|
A1
|
A2
|
A3
|
|
1
|
Wording 1
|
50
|
|
2
|
Wording 2
|
60
|
|
3
|
Wording 3
|
50
|
|
...
|
|
|
The concept of relation is simple and intuitive,
but thus led to many anomalies if one applies it in a strict sense.
The most significant anomalies are the following ones :
- lack of semantic connections between data.
- redundancy of data.
- "abnormal" behaviors of the relationsin in case of updates
: modifications, deletions.
We will use, to illustrate the various concepts
presented, the following attributes which intervene in the management
of education :
|
Mnemonic
|
Wording
|
Type
|
|
Coefficient
|
Coefficient of the subjecy
|
Integer
|
|
Nomat
|
Name of the subject
|
String(30)
|
|
Nomens
|
Name of the teacher
|
String(30)
|
|
Numat
|
Number of the subject
|
Integer
|
|
Numens
|
Number of the teacher
|
Integer
|
We will suppose that a teacher teaches only in one
subject, but that a subject is taught by several teachers. There
are 5 teachers per subject and there are 10 subjects in the course.
We define, on these attributes, the relation EDUCATION.
EDUCATION(Numens, Nomens, Numat, Nomat,
Coefficient)
"Problems"
of the relation EDUCATION
We notice first of all that this relation satisfies
the basic definition completely and can thus be implemented and
used just as it is in a relational DBMS. Indeed, the values of the
attributes belong only to one type and the key of the relation (Numens)
allows to distinguish, in a single way, each line.
This relation obtained, without structuring of information, presents
however many anomalies.
At the semantic level
The key of this relation is the number of the teacher
(Numens). That means that this attribute identifies, characterizes,
the relation and thus that all the other attributes of this relation
are properties depending semantically on this attribute. That implies,
among others, that the coefficient of a subject, or its name, is
a direct characteristics of a teacher, which is, obviously, wrong.
This problem of modeling is due to the fact that the relation EDUCATION
does not represent an object, an entity, of the real world but an
amalgam of several entities.
On the level of redundancy of information
In the relation EDUCATION, the characteristics of
a subject (its name and its coefficient) will have to be repeated
as many time as there are teachers in the subject. In our example
5 times. This repetition, which can be avoided, will involve risks
of partial update of the data and thus possibilities of inconsistency.
The size, in characters, necessary to store this relation
is the following one : 5*10*(2+30+2+30+2)= 3300
We considered that an integer was represented on 2
bytes.
On the level of updates
- It is impossible to keyboard the characteristics
of a subject if it is not attached to, at least, a teacher. Indeed,
that would like to say that the key of the relation (Numens) would
have value NULL what is prohibited by the constraint on the unicity
of the key.
-If there is only one teacher for a subject, the departure
of this teacher implies the deletion of all information of the relation
for this teacher and thus, in this case, those which relate to the
subject.
All these anomalies are due to the fact that the relation
EDUCATION is not in a "normal" form.
Normalization
of the relations
The setting in a "normal" form of relations,
or normalization, aims at removing all the abnormal behaviors described
above, and is based on the concept of functional dependency.
Functional dependency (FD)
It is said that there is a functional dependency between
an attribute A1 and an attribute A2, one notes A1 - > A2, if knowing
a value of A1 one can associate only one value of A2. One also says
that A1 determines A2. A1 is the source of the functional dependency
and A2 the goal.
In our example, we have the following FDs:
Numens -> Nomens, Numat, Nomat, Coefficient
Numat -> Nomat, Coefficient
Elementary functional dependency
It is said that a functional dependency
is elementary if the source does not comprise superfluous attributes.
The question about the elementarity of a FD should thus arise only
when the left part of the FD comprises several attributes.
In our example, all the FDs are elementary.
On the other hand, if we have the following FDs :
A, B -> C
A -> C
The functional dependence A, B - > C is
not elementary since B is superfluous.
Direct functional dependency
It is said that functional dependency A
- > B is direct if there is not any attribute C such as one can
have A - > C and C - > B. In other words, that means that the dependency
between A and B cannot be obtained by transitivity.
In our example, the FD: Numat - > Nomat,
Coefficient is direct. On the other hand the FD Numens - > Nomat
is not direct.
We have indeed : Numens -> Numat and
Numat -> Nomat.
The normal
forms
First normal form(1NF)
It is said that a relation is in 1NF if all the attributes
which composes it cannot be subdivided.
The relation EDUCATION is in 1NF.
Second normal form (2NF)
It is said that a relation is in 2NF if it is in 1NF
and if all the FDs between the key and the other attributes are
elementary.
The relation EDUCATION is in 2NF.
Third normal form (3NF)
It is said that a relation is in 3NF if it is in 2NF
and if all the FDs between the key and the other attributes are
elementary and direct.
The relation EDUCATION is not in 3NF.
Boyce-Codd normal form (BCNF)
It is said that a relation is in BCNF if it is in
3NF and if the only FDs which exist are those which connect the
key to the other not-key attributes.
The relation EDUCATION is not in BCNF.
Relations in 3NF
The problems of the relation EDUCATION come from the
fact that this relation is in 2NF. To eliminate the abnormal behaviors,
it is necessary to structure the attributes of this relation so
to break it up into several relations in 3NF.
To build the relations in 3NF relating to our example, we start
from the whole FDs set :
Numens -> Nomens, Numat, Nomat, Coefficient
Numat -> Nomat, Coefficient
After eliminations of transitivities, we obtain:
Numens -> Nomens, Numat
Numat -> Nomat, Coefficient
What gives us the following relations :
TEACHER(Numens, Nomens, Numat)
SUBJECT(Numat, Nomat, Coefficient)
These two relations are in BCNF.
For detailed information on the design process, refer,
in these pages, at the column DB
Design.
Advantages
of the relations in 3NF (BCFN)
At the semantic level
The 2 preceding relations correspond to objects, entities,
of the real world. The attributes properties of each relation are
direct characteristics of the key of the associated relation.
On the level of redundancy of information
Information characteristic of a subject (its name
and its coefficient) will be present only once and not 5 times as
in the relation EDUCATION.
The redundancy of information is thus reducted to the bare minimum,
without loss of information, which limits the risks of inconsistency.
Moreover, the size of the data is following :
Size of the TEACHER relation : 5*10*(2+30+2)= 1700
Size of the SUBJECT relation: 10*(2+30+2)= 340
The size of the database is 2040 characters, thus
a reduction of about 40% compared to the database EDUCATION (3300
characters), without loss of information.
On the level of updates
- It is possible to keyboard the characteristics of
a subject even if it is not attached to a teacher. The capture is
carried out only in the relation SUBJECT.
- If there is only one teacher for a subject, the
departure of this teacher implies the deletion of information relating
to this teacher (a row in the TEACHER relation) and not those which
relate to the subject.
The two relations in 3NF, TEACHER and SUBJECT, thus
constitute the "best" possible representation for the
data of our example.
| The construction of an "optimal" structure of data involves
the determination of the relations in 3NF (BCFN), built starting
from the whole set of the source data. |
The process for databases design presented in these
pages (DB Design), allows to directly
obtain relations in 3NF (BCNF).
Back to structuring of data
|