aspects of classification by neural networks, including links
between neural networks and Bayesian statistical classification,
incremental learning,...
The project includes theoretical work on classification algorithms,
simulations and benchmarks, especially on realistic industrial
data. Hardware implementation, especially the VLSI option, is the
last objective.
The set of databases available is to be used for tests and benchmarks
of machine-learning classification algorithms.
The databases are split into two parts: ARTIFICIAL (artificially generated)
databases, mainly used for preliminary tests, and REAL ones, used for
objective benchmarks and comparisons of methods.
The choice of the databases has been guided by various parameters, such
as availability of published results concerning conventional
classification algorithms, size of the database, number of attributes,
number of classes, overlapping between classes and non-linearities of
the borders,... Results of PCA and DFA preprocessing of the REAL
databases are also included, together with several measures useful for
characterizing the databases (statistics, fractal dimension,
dispersion,...).
All these databases and their preprocessing are available together
with a PostScript technical report describing in detail the different
databases ('Databases.ps.Z' - 45 pages - 777781 bytes) and a report
on the comparative benchmarking studies ('Benchmarks.ps.Z' - 113 pages -
1927571 bytes) of various algorithms, either well known in the
Statistical and Neural Network communities (MLP, RCE, LVQ, k_NN, GQC)
or developed in the framework of the Elena project (IRVQ, PLS).
A LaTeX bibfile containing more than 90 entries corresponding to
the Elena partners bibliography related to the project is also
available ('Elena.bib') in the same directory.
All files are available by anonymous ftp from the following directory:
ftp://ftp.dice.ucl.ac.be/pub/neural-nets/ELENA/databases
The databases are split into two parts: the 'ARTIFICIAL' ones, being
generated in order to obtain some defined characteristics, and for
which the theoretical Bayes error can be computed, and the 'REAL'
ones, collected in existing real-world applications.
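As an illustration of what a computable theoretical Bayes error means,
the sketch below evaluates it in closed form for a deliberately simple
case: two one-dimensional normal classes with equal variance and equal
priors. This toy setting and the function name
`bayes_error_equal_gaussians` are assumptions made for the example; they
are not the distributions used to generate the Elena ARTIFICIAL databases.

```python
import math

def bayes_error_equal_gaussians(mu0, mu1, sigma):
    """Bayes error for two 1-D normal classes N(mu0, sigma^2) and
    N(mu1, sigma^2) with equal priors (illustrative assumption)."""
    # The optimal boundary is the midpoint between the two means; each
    # class loses the tail lying beyond it, so the total error is
    # Phi(-|mu1 - mu0| / (2 * sigma)), Phi being the standard normal CDF.
    z = abs(mu1 - mu0) / (2.0 * sigma)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

if __name__ == "__main__":
    # Fully overlapping classes cannot do better than chance.
    print(bayes_error_equal_gaussians(0.0, 0.0, 1.0))
    # Means two standard deviations apart.
    print(bayes_error_equal_gaussians(0.0, 2.0, 1.0))
```

No classifier can achieve a lower error rate than this value on such
data, which is why it serves as a reference limit for the benchmarks.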
The ARTIFICIAL databases ('Gaussian', 'Clouds' and 'Concentric')
were generated according to the following requirements:
- heavy intersection of the class distributions,
- high degree of nonlinearity of the class boundaries,
- various dimensions of the vectors,
- already published results on these databases.
They are restricted to two-class problems, since we believe they yield
answers to the most essential questions.
The ARTIFICIAL databases are mainly used for rapid test purposes on newly
developed algorithms.
The REAL databases ('Satimage', 'Texture', 'Iris' and 'Phoneme') were
selected according to the following requirements:
- classical databases in the field of classification (Iris),
- already published results on these databases (Phoneme,
from the ROARS ESPRIT project and 'Satimage' from the STATLOG ESPRIT
project),
- various dimensions of the vectors,
- sufficient number of vectors (to avoid the ``empty space phenomenon''),
- the 'Texture' database, generated at INPG for the Elena project is
interesting for its high number of classes (11).
##############################################################################
###########
# DETAILS #
###########
The 'Benchmarks' technical report
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The 'Benchmarks.ps' Elena report is related to the benchmarking studies of
various classifiers. Most of the classifiers which were used for the
benchmark comparative studies are well known by the neural network
and machine learning community. These are the k-Nearest Neighbour
(k_NN) classifier, selected for its powerful probability density
estimation properties; the Gaussian Quadratic Classifier (GQC), the
most classical statistical parametric simple classification method; the
Learning Vector Quantizer (LVQ), a powerful non-linear iterative
learning algorithm proposed by Kohonen; the Reduced Coulomb Energy
(RCE) algorithm, an incremental Region Of Influence algorithm; the
Inertia Rated Vector Quantizer (IRVQ) and the Piecewise Linear
Separation (PLS) classifiers, developed in the framework of the Elena
project.
The main objectives of the 'Benchmarks.ps' Elena report are the
following:
- to provide an overall comprehensive view of the general problem of
comparative benchmarking studies and to propose a useful common
test basis for existing and further classification methods,
- to obtain objective comparisons of the different chosen classifiers on
the set of databases described in this report (each classifier being
used with its optimal configuration for each particular database),
- to study the possible links between the data structures of the databases
viewed by some parameters, and the behavior of the studied classifiers
(mainly the evolution of their optimal configuration parameters),
- to study the links between the preprocessing methods and the
classification algorithms from the point of view of performance and
hardware constraints (especially computation times and memory requirements).
Databases format
~~~~~~~~~~~~~~~~
All the databases available are in the following format (after decompression) :
- All files containing the databases are stored as ASCII files so that
they are easy to edit and check.
- In a file, each of the n lines holds one sample vector (instance)
and consists of d floating-point numbers (the attributes) followed
by the class label (which must be an integer).
Example:
1.51768 12.65 3.56 1.30 73.08 0.61 8.69 0.00 0.14 1
1.51747 12.84 3.50 1.14 73.27 0.56 8.55 0.00 0.00 0
1.51775 12.85 3.48 1.23 72.97 0.61 8.56 0.09 0.22 1
1.51753 12.57 3.47 1.38 73.39 0.60 8.55 0.00 0.06 1
1.51783 12.69 3.54 1.34 72.95 0.57 8.75 0.00 0.00 3
1.51567 13.29 3.45 1.21 72.74 0.56 8.57 0.00 0.00 1
There are NO missing values.
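A minimal sketch of a reader for this format is given below, in Python.
The function names `parse_elena_lines` and `load_elena` are hypothetical;
the only things assumed about the files are what is stated above
(whitespace-separated floats, integer label last, no missing values).

```python
def parse_elena_lines(lines):
    """Parse lines of 'd floats followed by an integer class label'.

    Returns (vectors, labels): a list of attribute lists and a list of ints.
    """
    vectors, labels = [], []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:              # tolerate blank lines
            continue
        # Every sample must have the same number of attributes.
        if vectors and len(fields) - 1 != len(vectors[0]):
            raise ValueError("line %d: expected %d attributes"
                             % (lineno, len(vectors[0])))
        vectors.append([float(x) for x in fields[:-1]])
        labels.append(int(fields[-1]))
    return vectors, labels

def load_elena(path):
    """Read an uncompressed database file (e.g. 'phoneme.dat')."""
    with open(path) as f:
        return parse_elena_lines(f)

if __name__ == "__main__":
    sample = [
        "1.51768 12.65 3.56 1.30 73.08 0.61 8.69 0.00 0.14 1",
        "1.51747 12.84 3.50 1.14 73.27 0.56 8.55 0.00 0.00 0",
    ]
    X, y = parse_elena_lines(sample)
    print(len(X), "samples,", len(X[0]), "attributes, labels", y)
```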
To retrieve a database, you MUST use the ftp binary mode.
If you aren't in this mode, simply type 'binary' at the ftp prompt.
EXAMPLE: to get the "phoneme" database :
cd REAL
cd phoneme
binary
get phoneme.txt
get phoneme.dat.Z
get ...
cd ...
...
quit
After your ftp session, you simply have to type
'uncompress phoneme.dat.Z'
to get the uncompressed datafile.
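The same session can be scripted. The sketch below uses Python's standard
`ftplib`; the helper name `fetch_database` and the `main` function are
hypothetical, and it assumes the server and directory listed above are
still reachable. `retrbinary` always transfers in binary mode, so no
separate 'binary' command is needed. Decompression is still left to
'uncompress'.

```python
import os
from ftplib import FTP

def fetch_database(ftp, category, name, dest_dir="."):
    """Download name.txt and name.dat.Z from <category>/<name>,
    relative to the current directory of the connected ftp object."""
    ftp.cwd(category)
    ftp.cwd(name)
    fetched = []
    for filename in (name + ".txt", name + ".dat.Z"):
        local_path = os.path.join(dest_dir, filename)
        with open(local_path, "wb") as out:
            # Binary transfer: compressed files are corrupted otherwise.
            ftp.retrbinary("RETR " + filename, out.write)
        fetched.append(local_path)
    return fetched

def main():
    # Host and path are the ones given in this README.
    with FTP("ftp.dice.ucl.ac.be") as ftp:
        ftp.login()  # anonymous login
        ftp.cwd("pub/neural-nets/ELENA/databases")
        print(fetch_database(ftp, "REAL", "phoneme"))
```

Calling `main()` performs the download; it is not invoked automatically
so that importing the module never opens a network connection.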
Contents of the 'ARTIFICIAL' directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The databases of this directory contain only the 'ARTIFICIAL'
classification problems.
The present 'ARTIFICIAL' databases are only two-class problems, since these
yield answers to the most essential questions.
For each problem, the confusion matrix corresponding to the theoretical
Bayes boundary is provided with the confusion matrix obtained by a k_NN
classifier (k chosen to reach the minimum of the total Leave-One-Out error).
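The procedure of choosing k by Leave-One-Out error and reporting a
confusion matrix can be sketched as follows. This is a plain, brute-force
illustration and not the code used for the Elena reports; the function
names, the Euclidean distance and the tie-breaking rule (smallest label
wins) are assumptions of the example.

```python
from collections import Counter

def loo_knn_predictions(X, y, k):
    """Leave-one-out k_NN: classify each sample from the other n-1."""
    preds = []
    for i, xi in enumerate(X):
        # Squared Euclidean distance to every other sample.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(X) if j != i
        )
        votes = Counter(y[j] for _, j in dists[:k])
        # Majority vote; ties go to the smallest label (arbitrary choice).
        best = max(votes.values())
        preds.append(min(c for c, v in votes.items() if v == best))
    return preds

def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        m[index[t]][index[p]] += 1
    return m

def best_k(X, y, candidates):
    """Return (k, error count) minimising the total leave-one-out error."""
    results = []
    for k in candidates:
        preds = loo_knn_predictions(X, y, k)
        errors = sum(p != t for p, t in zip(preds, y))
        results.append((errors, k))
    errors, k = min(results)   # fewest errors, then smallest k
    return k, errors
```

The cost grows quadratically with the number of patterns, so on
databases of several thousand samples a faster neighbour search would be
preferable to this pure-Python loop.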
These databases were selected for preliminary tests and to study the
behavior of the implemented algorithms on some particular problems:
- Overlapping classes:
The classifier should have the ability to form a decision boundary
that minimizes the amount of misclassification for all of the overlapping
classes.
- Nonlinear separability:
The classifier should be able to build decision regions that separate
classes of any shape and size.
There is one subdirectory for each database. In this subdirectory,
there are:
- A text file providing detailed information about the related database
('databasename.txt').
- The compressed database ('databasename.dat.Z').
The different patterns of each database are presented in a random order.
- For bidimensional databases, a PostScript file representing the 2-D
datasets (those files are in eps format).
For each subdirectory, the directoryname is the same as the name chosen
for the concerned database. Here are the directorynames with a brief
description.
- 'clouds'
Bidimensional distributions: class 0 is the sum of three different
normal distributions while class 1 is another normal distribution,
overlapping class 0.
5000 patterns, 2500 in each class.
This allows the study of the classifier behavior for heavy intersection
of the class distributions and for high degree of nonlinearity of the
class boundaries.
- 'gaussian'
A set of seven databases corresponding to the same problem, but with
dimensionality ranging from 2 to 8.
This allows the study of the classifier behavior for different
dimensionalities of the input vectors, for heavily overlapping
distributions and for nonlinear separability.
These databases were already studied by Kohonen in:
Kohonen, T. and Barna, G. and Chrisley, R., "Statistical Pattern
Recognition with Neural Networks: Benchmarking Studies",
IEEE Int. Conf. on Neural Networks, SOS Printing, San Diego, 1988.
In this paper, the performance of three basic types of neural-like
networks (Backpropagation network, Boltzmann machine and Learning
Vector Quantization) is evaluated and compared to the theoretical limit.
- 'concentric'
Bidimensional uniform concentric circular distributions.
2500 instances, 1579 in class 1, 921 in class 0.
This database may be used to study the behavior of the classifier with
respect to linear separability when some classes are nested in others
without overlapping.
Contents of the 'REAL' directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The databases of this directory contain only the real
classification problem sets selected for the Elena benchmarking studies.
There is one subdirectory for each database. In this subdirectory,
there are:
- a text file giving detailed information about the related database
(`databasename.txt'),
- the compressed original database in the Elena format
(`databasename.dat.Z'); the different patterns of each database being
presented in a random order.
- Through a normalization process, each original feature is given
the same importance in a subsequent classification process.
A typical method is first to center each feature separately and then
to reduce it to unit variance; this process has been applied to all
the REAL Elena databases in order to build the ``CR'' databases
contained in the ``databasename_CR.dat.Z'' files.
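The centering and reduction step described above can be sketched as
follows. The function name `center_reduce` is hypothetical, and the
variance is computed by dividing by n; dividing by n-1 is the other
common convention, and this README does not say which one was used to
build the ``CR'' files.

```python
import math

def center_reduce(X):
    """Center each feature to zero mean and scale it to unit variance."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    # Population standard deviation (divide by n): an assumption.
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n)
        for j in range(d)
    ]
    # A constant feature has zero variance; map it to 0.0 instead of
    # dividing by zero.
    return [
        [(row[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
         for j in range(d)]
        for row in X
    ]

if __name__ == "__main__":
    for row in center_reduce([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]):
        print(row)
```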
Principal Component Analysis (PCA) is a very classical method in pattern
recognition [Duda73]. PCA reduces the sample dimension in a linear way
for the best representation in lower dimensions, keeping the maximum of
inertia. The best axis for the representation is however not necessarily
the best axis for the discrimination. After PCA, features are selected
according to the percentage of initial inertia which is covered by the
different axes, and the number of features is determined according to
the percentage of initial inertia to keep for the classification
process. This selection method has been applied to every REAL database
after centering and reduction (thus to the databasename_CR.dat files).
When quasi-linear correlations exist between some initial features,
these redundant dimensions are removed by PCA and this preprocessing is
then recommended. In this case, before PCA, the determinant of the
data covariance matrix is near zero; the database is thus badly
conditioned for any process which uses this information (the quadratic
classifier for example).
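The selection rule based on the percentage of inertia can be sketched as
follows, starting from a list of eigenvalues of the covariance matrix.
The function names and the eigenvalues in the usage lines are made up
for illustration; they are not taken from any Elena database.

```python
def inertia_percentages(eigenvalues):
    """Percentage of the total inertia carried by each principal axis,
    in decreasing order."""
    ev = sorted(eigenvalues, reverse=True)
    total = sum(ev)
    return [100.0 * v / total for v in ev]

def components_to_keep(eigenvalues, target_percent):
    """Smallest number of leading axes whose cumulative inertia
    reaches target_percent."""
    cumulative = 0.0
    percentages = inertia_percentages(eigenvalues)
    for count, pct in enumerate(percentages, start=1):
        cumulative += pct
        # Small tolerance against floating-point rounding.
        if cumulative >= target_percent - 1e-9:
            return count
    return len(percentages)

if __name__ == "__main__":
    ev = [4.0, 3.0, 2.0, 1.0]             # hypothetical eigenvalues
    print(inertia_percentages(ev))         # share of inertia per axis
    print(components_to_keep(ev, 70.0))    # axes needed for 70 %
```

An eigenvalue close to zero contributes almost no inertia and signals
one of the quasi-linear correlations mentioned above.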
The following files, related to PCA are also available for the REAL databases:
- ``databasename_PCA.dat.Z'', the projection of the ``CR'' database on its
principal components (sorted in a decreasing order of the related
inertia percentage),
- ``databasename_corr_circle.ps.Z'', a graphical representation of the
correlation between the initial attributes and the first two
principal components,
- ``databasename_proj_PCA.ps.Z'', a graphical representation of the
projection of the initial database on the first two principal
components,
- ``databasename_EV.dat'', a file with the eigenvalues and associated
inertia percentages.
Discriminant Factorial Analysis (DFA) can be applied to a learning
database where each learning sample belongs to a particular class
[Duda73]. The number of discriminant features selected by DFA is
determined by the number of classes (c) and the number of input
dimensions (d); it is equal to the minimum of d and c-1.
In the usual case where d is greater than c, the output dimension is
equal to the number of classes minus one and the discriminant
axes are selected in order to maximize the between-class variance and to
minimize the within-class variance. The discrimination power
(ratio of the projected between-class variance to the projected
within-class variance) is not the same for each discriminant axis: this
ratio decreases from one axis to the next. So for a problem with many
classes, this preprocessing will not always be efficient, as the last
output features will be less discriminant. This analysis uses the
inverse of the global covariance matrix, so the covariance matrix must
be well conditioned (for example, a preliminary PCA must be applied to
remove the linearly correlated dimensions). The DFA preprocessing
method has been applied to the first 18 principal components of the
'satimage_PCA' and 'texture_PCA' databases (thus keeping only the
first 18 attributes of these databases before applying the DFA
preprocessing) in order to build the 'satimage_DFA.dat.Z' and
'texture_DFA.dat.Z' database files, having respectively 5 and 10
dimensions (the 'satimage' database having 6 classes and 'texture'
11).
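The two quantities discussed above, the DFA output dimension and the
discrimination power of an axis, can be sketched as follows. The function
names are hypothetical, and `discrimination_power` only evaluates the
ratio for samples already projected on one axis; it does not compute the
discriminant axes themselves.

```python
def dfa_output_dim(d, c):
    """Number of discriminant features: the minimum of d and c-1."""
    return min(d, c - 1)

def discrimination_power(values, labels):
    """Between-class over within-class variance of 1-D projections.

    values: projections of the samples on one axis; labels: their classes.
    The common 1/n factor cancels in the ratio, so sums of squares are used.
    Raises ZeroDivisionError if every class is perfectly concentrated.
    """
    grand_mean = sum(values) / len(values)
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    between = within = 0.0
    for vals in groups.values():
        m = sum(vals) / len(vals)
        between += len(vals) * (m - grand_mean) ** 2
        within += sum((v - m) ** 2 for v in vals)
    return between / within

if __name__ == "__main__":
    print(dfa_output_dim(18, 6))    # satimage: 18 components, 6 classes
    print(dfa_output_dim(18, 11))   # texture: 18 components, 11 classes
```

With 18 input dimensions this gives 5 features for 'satimage' and 10 for
'texture', matching the dimensions of the DFA files described above.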
For each subdirectory, the directoryname is the same as the name chosen
for the contained database. Here are the directorynames with a brief
numerical description of the available databases.
- phoneme
French and Spanish phoneme recognition problem.
The aim is to distinguish between nasal (AN, IN, ON) and oral
(A, I, O, E, E') vowels.
5404 patterns, 5 attributes (the normalized amplitudes of the five
first harmonics), 2 classes.
This database was used in the European ESPRIT 5516 project ROARS.
The aim of this project is the development and the implementation of a
real-time analytical system for French and Spanish phoneme
recognition.
- texture
The aim is to distinguish between 11 different textures (Grass lawn,
Pressed calf leather, Handmade paper, Raffia looped to a high pile, Cotton
canvas, ...), each pattern (pixel) being characterised by 40 attributes
built by the estimation of fourth order modified moments in four orientations:
0, 45, 90 and 135 degrees.
5500 patterns, 11 classes of 500 instances (each class refers to a type
of texture in the Brodatz album).
The original source of this database is:
P. Brodatz "Textures: A Photographic Album for Artists and Designers",
Dover Publications, Inc., New York, 1966.
This database was generated by the Laboratory of Image Processing
and Pattern Recognition (INPG-LTIRF Grenoble, France) in the development
of the Esprit project ELENA No. 6891 and the Esprit working group ATHOS
No. 6620.
- satimage (*)
Classification of the multi-spectral values of an image of the Landsat
satellite. Each line contains the pixel values in four spectral bands
of each of the 9 pixels in a 3x3 neighbourhood and a number indicating
the classification label of the central pixel (corresponding to the type
of soil: red soil, cotton crop, grey soil, ...).
The aim is to predict this classification, given the multi-spectral
values.
6435 instances, 36 attributes (4 spectral bands x 9 pixels in
neighbourhood), 6 classes.
This database was used in the European StatLog project, which
involves comparing the performance of machine learning,
statistical, and neural network algorithms on data sets from real-world
industrial areas including medicine, finance, image analysis, and
engineering design:
D. Michie, D.J. Spiegelhalter, and C.C. Taylor, editors.
Machine Learning, Neural and Statistical Classification.
Ellis Horwood Series In Artificial Intelligence,
England, 1994.
- iris (*)
This is perhaps the best known database to be found in the pattern
recognition literature. Fisher's paper is a classic in the field
and is referenced frequently to this day. (See Duda & Hart, for
example.) The data set contains 3 classes of 50 instances each,
where each class refers to a type of iris plant. One class is
linearly separable from the other 2; the latter are NOT linearly
separable from each other.
4 attributes (sepal length, sepal width, petal length and petal width).
(*) These databases are taken from the anonymous ftp "UCI Repository Of
Machine Learning Databases and Domain Theories"
(ics.uci.edu: pub/machine-learning-databases):
Murphy, P. M. and Aha, D. W. (1992). "UCI Repository of machine
learning databases" [Machine-readable data repository]. Irvine, CA:
University of California, Department of Information and Computer Science.
[Duda73]
Duda, R.O. and Hart, P.E.,
Pattern Classification and Scene Analysis,
John Wiley & Sons, 1973.
##############################################################################
The ELENA PROJECT
~~~~~~~~~~~~~~~~~
Neural networks are now known as powerful methods for empirical
data analysis, especially for approximation (identification,
control, prediction) and classification problems. The ELENA project
investigates several aspects of classification by neural networks,
including links between neural networks and Bayesian statistical
classification, incremental learning (control of the network size
by adding or removing neurons),...
URL: http://www.dice.ucl.ac.be/neural-nets/ELENA/ELENA.html
ELENA is an ESPRIT III Basic Research Action project (No. 6891).
It involves:
INPG (Grenoble, F),
UPC (Barcelona, E),
EPFL (Lausanne, CH),
UCL (Louvain-la-Neuve, B),
Thomson-Sintra ASM (Sophia Antipolis, F),
EERIE (Nimes, F).
The coordinator of the project can be
contacted at:
Prof. Christian Jutten,
INPG-LTIRF,
46 av. Felix Viallet,
F-38031 Grenoble Cedex,
France
Phone: +33 76 57 45 48,
Fax: +33 76 57 47 90,
e-mail: chris at tirf.inpg.fr
A simulation environment (PACKLIB) has been developed in the project;
it is a smart graphical tool allowing fast programming and
interactive analysis. The PACKLIB environment greatly simplifies the
user's task by requiring the user to write only the basic code of the
algorithms, while the whole graphical input, output and relationship
framework is handled by the environment itself. PACKLIB is used for
extensive benchmarks in the ELENA project and in other situations
(image processing, control of mobile robots,...). Currently, PACKLIB
is being tested by beta users and a demo version is available in the
public domain.
URL: http://www.dice.ucl.ac.be/neural-nets/ELENA/Packlib.html
##############################################################################
IF YOU HAVE ANY PROBLEM, QUESTION OR PROPOSITION, PLEASE E_MAIL the following.
VOZ Jean-Luc or Michel Verleysen
Universite Catholique de Louvain
DICE - Lab. de Microelectronique
3, place du Levant
B-1348 LOUVAIN-LA-NEUVE
E_mail : voz at dice.ucl.ac.be, verleysen at dice.ucl.ac.be