|
Fundamentals of Normalization
The theory of normalization is related to the relational data model
created by Codd in 1970. This model is based on the concept of
relation.
The
relational data model
Concept of relation
A relation is a 2 dimensions array, whose
columns represent the properties of the relation (attributes or
fields), and whose lines correspond to the various values (instances,
occurrences) of these attributes. All the values of an attribute can
belong only to one type : integer, real, character string..., and all
the lines must be different, which leads to the concept of key of a
relation.
Key of a relation
The key of a relation is the minimum subset
of the attributes which allows to identify each line in a single way.
All relation must obligatorily have a key.
Schema of a relation
It corresponds to the "condensed"
definition of the relation and is composed of the name of the relation
followed by the list of the attributes which make this relation, the
key attributes being underlined. For example the relation R made up of
the attributes A1, A2 and A3 whose schema is : R(A1,A2,A3).
The array with the lines corresponds to a
possible "extension" of the relation :
|
A1
|
A2
|
A3
|
|
1
|
Wording 1
|
50
|
|
2
|
Wording 2
|
60
|
|
3
|
Wording 3
|
50
|
|
...
|
|
|
The concept of relation is simple and
intuitive, but thus led to many anomalies if we applie it in a strict
sense. The most significant anomalies are the following ones :
- lack of semantic connections between data.
- redundancy of data.
- "abnormal" behaviors of relations in case of updates
: modifications, deletions.
We will use, to illustrate the various
concepts presented, the following attributes which intervene in the
management of education :
|
Mnemonic
|
Wording
|
Type
|
|
Coefficient
|
Coefficient of the subject
|
Integer
|
|
Nomat
|
Name of the subject
|
String(30)
|
|
Nomens
|
Name of the teacher
|
String(30)
|
|
Numat
|
Number of the subject
|
Integer
|
|
Numens
|
Number of the teacher
|
Integer
|
We will suppose that a teacher teaches only
in one subject, but that a subject is taught by several teachers. There
are 5 teachers per subject and there are 10 subjects in the course. We
define, on these attributes, the relation EDUCATION.
EDUCATION(Numens, Nomens,
Numat, Nomat, Coefficient)
"Problems"
of the relation EDUCATION
We notice first of all, that this relation
completely satisfies the basic definition, and can thus be implemented
and used just as it is in a relational DBMS. Indeed, the values of the
attributes belong only to one type and the key of the relation (Numens)
allows to distinguish, in a single way, each line.
This relation obtained, without structuring of information, presents
however many anomalies.
At the semantic level
The key of this relation is the number of
the teacher (Numens). This means that this attribute identifies,
characterizes, the relation and thus that all the other attributes of
this relation are properties depending semantically on this attribute.
This implies, among others, that the coefficient of a subject, or its
name, is a direct characteristic of a teacher, which is, obviously,
wrong.
This problem of modeling is due to the fact that the relation EDUCATION
does not represent an object, an entity, of the real world but an
amalgam of several entities.
On the level of redundancy of
information
In the relation EDUCATION, the
characteristics of a subject (its name and its coefficient) will have
to be repeated as many time as there are teachers in the subject. In
our example 5 times. This repetition, which can be avoided, will
involve risks of partial update of the data and thus possibilities of
inconsistency.
The size, in characters, necessary to store
this relation is the following one : 5*10*(2+30+2+30+2)= 3300
We considered that an integer was
represented on 2 bytes.
On the level of updates
- It is impossible to enter the
characteristics of a subject if it is not attached to, at least, a
teacher. Indeed, that would like to say that the key of the relation
(Numens) would have value NULL what is prohibited by the constraint on
the unicity of the key.
-If there is only one teacher for a
subject, the departure of this teacher implies the deletion of all
information of the relation for this teacher and thus, in this case,
those which relate to the subject.
All these anomalies are due to the fact
that the relation EDUCATION is not in a "normal" form.
Normalization
of the relations
The setting in a "normal" form of
relations, or normalization, aims at removing all the abnormal
behaviors described above, and is based on the concept of functional
dependency.
Functional dependency (FD)
It is said that there is a functional
dependency between an attribute A1 and an attribute A2, one notes A1 -
> A2, if knowing a value of A1 one can associate only one value of
A2. One also says that A1 determines A2. A1 is the source of the
functional dependency and A2 the goal.
In our example, we have the following FDs:
Numens -> Nomens, Numat, Nomat,
Coefficient
Numat -> Nomat, Coefficient
Elementary functional dependency
It is said that a functional dependency is
elementary if the source does not comprise superfluous attributes. The
question about the elementarity of a FD should thus arise only when the
left part of the FD comprises several attributes.
In our example, all the FDs are elementary.
On the other hand, if we have the following FDs :
A, B -> C
A -> C
The functional dependence A, B - > C is
not elementary since B is superfluous.
Direct functional dependency
It is said that functional dependency A -
> B is direct if there is not any attribute C such as one can have A
- > C and C - > B. In other words, that means that the dependency
between A and B cannot be obtained by transitivity.
In our example, the FD: Numat - > Nomat,
Coefficient is direct. On the other hand the FD Numens - > Nomat is
not direct.
We have indeed : Numens -> Numat and
Numat -> Nomat.
The
normal forms
First normal form(1NF)
It is said that a relation is in 1NF if all
the attributes which composes it cannot be subdivided.
The relation EDUCATION is in 1NF.
Second normal form (2NF)
It is said that a relation is in 2NF if it
is in 1NF and if all the FDs between the key and the other attributes
are elementary.
The relation EDUCATION is in 2NF.
Third normal form (3NF)
It is said that a relation is in 3NF if it
is in 2NF and if all the FDs between the key and the other attributes
are elementary and direct.
The relation EDUCATION is not in 3NF.
Boyce-Codd normal form (BCNF)
It is said that a relation is in BCNF if it
is in 3NF and if the only FDs which exist are those which connect the
key to the other not-key attributes.
The relation EDUCATION is not in BCNF.
Relations
in 3NF
The problems of the relation EDUCATION come
from the fact that this relation is in 2NF. To eliminate the abnormal
behaviors, it is necessary to structure the attributes of this relation
and to break it up into several relations in 3NF.
To build the relations in 3NF relating to our example, we start from
the whole FDs set :
Numens -> Nomens, Numat, Nomat,
Coefficient
Numat -> Nomat, Coefficient
After eliminations of transitivities, we
obtain:
Numens -> Nomens, Numat
Numat -> Nomat, Coefficient
What gives us the following relations :
TEACHER(Numens, Nomens, Numat)
SUBJECT(Numat, Nomat, Coefficient)
These two relations are in BCNF.
For detailed information on the design
process, refer, in these pages, at the topic DB Design.
Advantages
of the relations in 3NF (BCFN)
At the semantic level
The 2 preceding relations correspond to
objects, entities, of the real world. The attributes properties of each
relation are direct characteristics of the key of the associated
relation.
On the level of redundancy of
information
Information characteristic of a subject
(its name and its coefficient) will be present only once and not 5
times as in the relation EDUCATION.
The redundancy of information is thus reducted to the bare minimum,
without loss of information, which limits the risks of inconsistency.
Moreover, the size of the data is following
:
Size of the TEACHER relation :
5*10*(2+30+2)= 1700
Size of the SUBJECT relation: 10*(2+30+2)= 340
The size of the database is 2040
characters, thus a reduction of about 40% compared to the database
EDUCATION (3300 characters), without loss of information.
On the level of updates
- It is possible to enter the
characteristics of a subject even if it is not attached to a teacher.
The data entry is carried out only in the relation SUBJECT.
- If there is only one teacher for a
subject, the departure of this teacher implies the deletion of
information relating to this teacher (a row in the TEACHER relation)
and not those which relate to the subject.
The two relations in 3NF, TEACHER and
SUBJECT, thus constitute the "best" possible representation for the
data of our example.
| The construction of an "optimal" structure of
data involves the determination of the relations in 3NF (BCFN), built
starting from the whole set of the source data. |
The process for databases design presented
in these pages (DB Design), allows to
directly obtain relations in 3NF (BCNF).
Back to structuring of data
|