[Bioforum] IEEE ICDM'08 Data Mining Contest

Maurizio Atzori via bioforum@net.bio.net (by icdm08-publicity from isti.cnr.it)
Tue Sep 16 09:24:41 EST 2008


IEEE International Conference on Data Mining
ICDM 2008 Data Mining Contest:
Radioxenon monitoring for verification of
the Comprehensive nuclear-Test-Ban Treaty

Organisers
----------

Kurt Ungar, Trevor Stocki and Jing Yi
Health Canada

Nathalie Japkowicz
University of Ottawa

Arno Siebes
Universiteit Utrecht

Website: http://www.cs.uu.nl/groups/ADA/icdm08cup/


The IEEE ICDM 2008 Data Mining Contest is, simply put, about keeping the
world safe using data mining. This contest is about developing and
testing data mining techniques to verify worldwide compliance with the
global ban on nuclear tests. Such tests can be detected by measuring the
amounts of specific xenon isotopes. Of course, it is not quite that
simple: these isotopes are also emitted during various legal activities.


Timeline for the Contest
------------------------
1) September 12, 2008: Release of the Training Data and Test Data sets
and the Software tools that will be used to evaluate the results.
2) October 22, 2008: Results from the labelled set are due.
3) November 8, 2008: Results obtained on the unlabelled data set are due.
4) December 15-19, 2008: Results of the competition are announced at the
conference.


General Description of the Problem
-----------------------------------
Compliance verification of the Comprehensive Nuclear-Test-Ban Treaty
(CTBT), when the treaty enters into force, will employ four remote
sensing technologies to detect nuclear explosions.  Only radionuclide
detection can unequivocally establish that an explosion was due to a
nuclear detonation. Radioactive noble gases (specifically the isotopes
Xe-131m, Xe-133m, Xe-133, and Xe-135) are sampled and measured in a
procedure called radionuclide monitoring. Different relative
combinations of these isotopes correspond to different signatures that
can be mapped to distinct sources (such as nuclear power plants, medical
isotope production facilities, or various types of weapons).
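
As a minimal sketch of what a "relative combination" could look like in
practice, the snippet below normalises one four-isotope measurement to
shares of the total activity concentration. The function name, the
dictionary layout and the example values are assumptions made for
illustration; the contest does not prescribe this feature.

```python
from typing import Dict

ISOTOPES = ("Xe-131m", "Xe-133m", "Xe-133", "Xe-135")


def relative_composition(sample: Dict[str, float]) -> Dict[str, float]:
    """Return each isotope's share of the total activity concentration."""
    total = sum(sample[iso] for iso in ISOTOPES)
    if total <= 0:
        raise ValueError("total activity concentration must be positive")
    return {iso: sample[iso] / total for iso in ISOTOPES}


# Hypothetical measurement, in arbitrary but consistent units.
example = {"Xe-131m": 0.5, "Xe-133m": 0.5, "Xe-133": 3.0, "Xe-135": 1.0}
shares = relative_composition(example)
```

Two samples with very different absolute levels but the same shares
would then be treated as having the same signature.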

The problem of attributing a specific observation of airborne
concentrations of radioxenon to an explosion is twofold.  Firstly, in
the first few weeks after an explosion the four isotopes are expected
to be present in “fingerprint” relative concentrations quite distinct
from those of background sources. However, since the
CTBT stations are not located at the source of the explosion, the
radioxenon is detected at a location which can be well over a thousand
kilometres away. This atmospheric transport process can take weeks,
which can increase the complexity of this signature.  Secondly, one can
never observe radioxenon emitted purely from an explosion source, only
admixtures of this gas with the radioxenon released from all background
sources. These two points constitute an interesting data mining
problem for the Preparatory Commission for the Comprehensive
Nuclear-Test-Ban Treaty Organization (CTBTO).
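
To illustrate why a transport time of weeks complicates the signature,
here is a rough sketch that applies simple exponential decay to each
isotope. The half-lives are approximate values from general nuclear
data, not from this announcement, and the sketch deliberately ignores
ingrowth from parent nuclides and atmospheric dilution, so it is an
illustration of the effect rather than a transport model.

```python
from typing import Dict

# Approximate half-lives in days (assumed values, rounded).
HALF_LIFE_DAYS = {
    "Xe-131m": 11.84,
    "Xe-133m": 2.19,
    "Xe-133": 5.24,
    "Xe-135": 0.38,
}


def decay(sample: Dict[str, float], days: float) -> Dict[str, float]:
    """Scale each activity by 0.5 ** (elapsed time / half-life)."""
    return {
        iso: activity * 0.5 ** (days / HALF_LIFE_DAYS[iso])
        for iso, activity in sample.items()
    }


# Hypothetical release with equal activities of all four isotopes.
released = {iso: 1.0 for iso in HALF_LIFE_DAYS}
after_two_weeks = decay(released, 14.0)
```

Because the four half-lives differ widely, the relative concentrations
seen at a distant station differ from those at the source, and the
difference depends on the unknown travel time.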


Description of the dataset to be used
-------------------------------------

Radioxenon measurements from four to five CTBTO monitoring sites will be
provided. These will comprise a few hundred to a few thousand sets of
observations of the four isotopes for each site. A synthesized set of
explosion observations at these same sites will be added to actual
radioxenon concentrations caused by background sources.  The data sets
are composed of two classes, Background (B) and Background plus
Explosion (B+E). Each class has a set of quadruplets representing the
four activity concentrations of Xe-131m, Xe-133m, Xe-133, and Xe-135 for
a given air sample.
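
The construction of the two classes can be sketched as follows,
assuming that "added" means element-wise addition of a synthesized
explosion quadruplet to a measured background quadruplet. The function
name and the numbers are hypothetical; the organisers' actual synthesis
procedure is not described here.

```python
from typing import List, Tuple

# Order assumed for illustration: Xe-131m, Xe-133m, Xe-133, Xe-135.
Quadruplet = Tuple[float, float, float, float]


def add_explosion(background: Quadruplet, explosion: Quadruplet) -> Quadruplet:
    """Element-wise sum of background and synthesized explosion activity."""
    b1, b2, b3, b4 = background
    e1, e2, e3, e4 = explosion
    return (b1 + e1, b2 + e2, b3 + e3, b4 + e4)


# Hypothetical values in arbitrary but consistent units.
background = (0.25, 0.0, 1.5, 0.125)
explosion = (0.25, 0.5, 2.0, 0.5)

labelled: List[Tuple[Quadruplet, str]] = [
    (background, "B"),
    (add_explosion(background, explosion), "B+E"),
]
```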

We will be issuing labelled data sets containing both classes during the
first phase of the competition, while teams develop a classification
method appropriate for this task. In a second phase, we will issue a new
data set also containing data from both classes, but we will withhold
the label. This testing data set will be used for our final evaluation.


Description of the computational tasks
--------------------------------------

Two versions of the data sets will be provided. In the first, each
datum will be described by its station of origin, a unique randomly
assigned tracking number allowing the contest evaluators to trace the
datum back to the original scenario of explosion release, and a label
stating whether it is Background or Background plus Explosion. In the
second version, each datum will be described by its station of origin
(using the same stations as the first data set) and a unique randomly
assigned tracking number allowing the contest evaluators to trace the
datum back to the original scenario of explosion release. The second
set of data will contain cases of B or B+E, but which is which will be
unknown to the contestants. The first version of the data will be
employed in Tasks 1 and 2. The second version of the data will be
employed in Task 3.
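
One way to hold both versions in the same structure is a record whose
label is optional. The class and field names below are hypothetical,
as are the station name and tracking number; the actual file layout of
the contest data is not specified in this announcement.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Datum:
    station: str                  # station of origin
    tracking_number: int          # unique, randomly assigned
    concentrations: Tuple[float, float, float, float]
    label: Optional[str] = None   # "B" or "B+E"; None when withheld


def withhold_label(datum: Datum) -> Datum:
    """Return a copy of the datum as it would appear in the second version."""
    return replace(datum, label=None)


first_version = Datum("STATION_A", 48213, (0.5, 0.5, 3.5, 0.625), "B+E")
second_version = withhold_label(first_version)
```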

Task 1: The first task is to classify as accurately as possible the
results as Background or Explosion over the entire set of stations
provided, using one classifier. Contestants may combine data as they
see fit.  They may separately tune classifier parameters for each
station but they may not have separate classifier parameter types for
each station nor separate classifiers.  Contestants may report on more
than one classifier for this task.
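
One possible reading of this rule is a single classifier type whose
parameter values are tuned per station. The sketch below uses a
deliberately simple threshold rule on total activity; the score
function, the tuning criterion (training accuracy) and the station
names are all assumptions for illustration, not a recommended method.

```python
from typing import Dict, List, Tuple

Sample = Tuple[float, float, float, float]
StationData = Tuple[List[Sample], List[str]]  # samples and their labels


def score(x: Sample) -> float:
    """Hypothetical feature: total activity concentration."""
    return sum(x)


def tune_threshold(samples: List[Sample], labels: List[str]) -> float:
    """Pick the threshold with the highest training accuracy."""
    best_t, best_acc = 0.0, -1.0
    for t in sorted(score(x) for x in samples):
        hits = sum((score(x) >= t) == (y == "B+E")
                   for x, y in zip(samples, labels))
        acc = hits / len(samples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def fit(train: Dict[str, StationData]) -> Dict[str, float]:
    """Same rule everywhere; only the threshold value differs by station."""
    return {station: tune_threshold(xs, ys)
            for station, (xs, ys) in train.items()}


def predict(thresholds: Dict[str, float], station: str, x: Sample) -> str:
    return "B+E" if score(x) >= thresholds[station] else "B"


train = {
    "S1": ([(1.0, 0.0, 0.0, 0.0), (2.0, 0.0, 0.0, 0.0),
            (5.0, 0.0, 0.0, 0.0), (6.0, 0.0, 0.0, 0.0)],
           ["B", "B", "B+E", "B+E"]),
    "S2": ([(10.0, 0.0, 0.0, 0.0), (20.0, 0.0, 0.0, 0.0),
            (50.0, 0.0, 0.0, 0.0), (60.0, 0.0, 0.0, 0.0)],
           ["B", "B", "B+E", "B+E"]),
}
thresholds = fit(train)
```

The toy stations differ only in background level, which is why a shared
threshold would fail while a per-station threshold separates both.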

Task 2: In the second task, conversely, the contestant is requested to
identify an optimal algorithm for each station given.
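
Selecting an algorithm per station could be sketched as picking, for
each station, the candidate with the highest accuracy on held-out
labelled data. The two candidate rules, their cut-off values and the
validation data are invented for illustration, and "optimal" is
interpreted here as "most accurate", which the announcement does not
state explicitly.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[float, float, float, float]
Classifier = Callable[[Sample], str]
StationData = Tuple[List[Sample], List[str]]


def accuracy(clf: Classifier, samples: List[Sample], labels: List[str]) -> float:
    return sum(clf(x) == y for x, y in zip(samples, labels)) / len(samples)


def best_per_station(candidates: Dict[str, Classifier],
                     validation: Dict[str, StationData]) -> Dict[str, str]:
    """Map each station to the name of its most accurate candidate."""
    chosen = {}
    for station, (xs, ys) in validation.items():
        chosen[station] = max(
            candidates, key=lambda name: accuracy(candidates[name], xs, ys))
    return chosen


# Two hypothetical candidate rules.
candidates: Dict[str, Classifier] = {
    "total": lambda x: "B+E" if sum(x) >= 4.0 else "B",
    "xe135": lambda x: "B+E" if x[3] >= 1.0 else "B",
}

validation = {
    "S1": ([(1.0, 0.0, 0.0, 0.0), (5.0, 0.0, 0.0, 0.0)], ["B", "B+E"]),
    "S2": ([(5.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 2.0)], ["B", "B+E"]),
}
chosen = best_per_station(candidates, validation)
```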

Task 3: In the third task, the contestants will apply the classifiers
developed in Tasks 1 and 2 to the second data set and report their
results for evaluation.

The primary goal of this contest is to produce methods that are broadly
applicable over different station background measurement distributions
and explosion source hypotheses.  The best methods will also have a very
efficient learning curve.  Recognition will also be given to methods
more proficient in properly categorizing data arising from specific
classes of explosion release hypotheses or station background types,
because these methods add a forensic or diagnostic dimension.


