Agnostic Learning vs. Prior Knowledge Challenge

The Data

The validation set labels are now available for both the agnostic learning track and the prior knowledge track!

We have formatted five datasets from various application domains. To facilitate entering results for all five datasets, all tasks are two-class classification problems. Download the report for more details on the datasets. These datasets were used previously in the Performance Prediction Challenge, which you may check to get baseline results (the same representation as the "agnostic track data" was used, but the patterns and features were randomized differently).

The aim of the present challenge is to predict the test labels as accurately as possible on ALL five datasets, using either data representation:

All five "agnostic learning" track datasets (45.6 MB). The data are preprocessed in a feature representation as close as possible to the raw data. You will have no knowledge of what the features are, and therefore no opportunity to use knowledge about the task to improve your method. You should use completely self-contained learning machines and not use information disclosed to the "prior knowledge track" participants about the nature of the data.
All five "prior knowledge" track datasets (58.9 MB). The data are in their original format and you have access to all the information about what they are. Make use of this information to create learning machines that are smarter than those trained on the agnostic data: better feature extraction, better kernels, etc.

Individual datasets can also be downloaded from this table:

Name	Domain	Number of examples (train/valid/test)	Raw data (for the prior knowledge track)	Preprocessed data (for the agnostic learning track)
ADA	Marketing	4147/415/41471	14 features, comma separated format, 0.6 MB.	48 features, non-sparse format, 0.6 MB.
GINA	Handwriting recognition	3153/315/31532	784 features, non-sparse format, 7.7 MB.	970 features, non-sparse format, 19.4 MB.
HIVA	Drug discovery	3845/384/38449	Chemical structure in MDL-SD format, 30.3 MB.	1617 features, non-sparse format, 7.6 MB.
NOVA	Text classification	1754/175/17537	Text, 14 MB.	16969 features, sparse format, 2.3 MB.
SYLVA	Ecology	13086/1309/130857	108 features, non-sparse format, 6.2 MB.	216 features, non-sparse format, 15.6 MB.

During the challenge, participants have access only to labeled training data and unlabeled validation and test data. The validation labels will be made available one month before the end of the challenge. The final ranking will be based on test data results, revealed only when the challenge is over.

Dataset Formats

Agnostic learning track

All "agnostic learning" datasets share the same format, and each includes 5 files in ASCII format:

dataname.param - Parameters and statistics about the data.
dataname_train.data - Training set (a sparse or a regular matrix, patterns in lines, features in columns).
dataname_valid.data - Validation set.
dataname_test.data - Test set.
dataname_train.labels - Labels (truth values of the classes) for training examples.

The matrix data formats used are (in all cases, each line represents a pattern):

dense matrices - a space-delimited file with a new-line character at the end of each line.
sparse binary matrices - for each line of the matrix, a space-delimited list of the indices of the non-zero values, with a new-line character at the end of each line.
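
For readers not using Matlab, the two matrix formats above can be parsed with a few lines of Python. This is only a minimal sketch, not the official sample code: the function names `parse_dense` and `parse_sparse_binary` are hypothetical, the number of features must be supplied by the caller, and the sketch assumes that indices in the sparse format start at 1 (adjust `index_base` if your files use 0-based indices).

```python
from typing import List


def parse_dense(text: str) -> List[List[float]]:
    """Parse a dense matrix: one pattern per line, space-delimited values."""
    return [
        [float(value) for value in line.split()]
        for line in text.splitlines()
        if line.strip()  # ignore stray blank lines
    ]


def parse_sparse_binary(
    text: str, num_features: int, index_base: int = 1
) -> List[List[int]]:
    """Parse a sparse binary matrix into dense 0/1 rows.

    Each line lists the indices of the non-zero features of one pattern.
    An empty line is kept: it is a pattern with no non-zero feature.
    ASSUMPTION: indices are 1-based unless index_base says otherwise.
    """
    rows = []
    for line in text.splitlines():
        row = [0] * num_features
        for token in line.split():
            column = int(token) - index_base
            if not 0 <= column < num_features:
                raise ValueError(f"feature index {token} out of range")
            row[column] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Tiny made-up inputs, not taken from the challenge data.
    print(parse_dense("1 0 2.5\n3 4 5\n"))
    print(parse_sparse_binary("1 3\n\n2\n", num_features=3))
```

To read a real file, pass its contents as text, for example `parse_dense(open(path).read())`. Expanding a sparse matrix into dense rows keeps the sketch simple but can use a lot of memory on a dataset with many features such as NOVA, where a dedicated sparse structure may be preferable.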

If you are a Matlab user, you can download some sample code to read and check the data (CLOP users, the sample code is part of CLOP).

Prior knowledge track

For the "prior knowledge" datasets there may be up to 7 files in ASCII format:

dataname.param - Parameters and statistics about the data.
dataname_train.xxx - Training set.
dataname_valid.xxx - Validation set.
dataname_test.xxx - Test set.
dataname_train.labels - Binary class labels for training examples, which should be used as truth values. The problem is to predict binary labels on validation and test data.
dataname_train.mlabels - Original multiclass labels for training examples, as additional prior knowledge. Do not use as target values!
dataname.feat - Identity of the features for ADA, GINA, and SYLVA. The raw data of HIVA and NOVA are not in a feature set representation.

The extension .xxx varies from dataset to dataset: for ADA, GINA, and SYLVA, which are in a feature set representation, xxx=data. For HIVA, which uses the MDL-SD format, xxx=sd. For NOVA, which uses a plain text format, xxx=txt.
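
The naming rules above can be summarized in a small lookup. The sketch below is illustrative only: the helper `prior_knowledge_files` is hypothetical, and it assumes that the `dataname` prefix is the dataset name exactly as the caller writes it (the actual capitalization used in the archives is not stated here). Since the text says "up to 7 files", some of the listed files, such as the `.mlabels` file, may be absent for a given dataset.

```python
from typing import List

# Extension of the train/valid/test files for each dataset, as described above.
RAW_EXTENSION = {
    "ADA": "data",
    "GINA": "data",
    "SYLVA": "data",
    "HIVA": "sd",   # MDL-SD chemical structure format
    "NOVA": "txt",  # plain text
}

# Only the datasets in a feature set representation come with a .feat file.
HAS_FEAT_FILE = {"ADA", "GINA", "SYLVA"}


def prior_knowledge_files(dataname: str) -> List[str]:
    """List the file names that may exist for one prior knowledge dataset.

    ASSUMPTION: the file prefix is `dataname` exactly as passed in.
    """
    key = dataname.upper()
    if key not in RAW_EXTENSION:
        raise ValueError(f"unknown dataset: {dataname}")
    ext = RAW_EXTENSION[key]
    files = [
        f"{dataname}.param",
        f"{dataname}_train.{ext}",
        f"{dataname}_valid.{ext}",
        f"{dataname}_test.{ext}",
        f"{dataname}_train.labels",
        f"{dataname}_train.mlabels",  # may be absent; never use as targets
    ]
    if key in HAS_FEAT_FILE:
        files.append(f"{dataname}.feat")
    return files


if __name__ == "__main__":
    for name in ("ada", "hiva", "nova"):
        print(name, prior_knowledge_files(name))
```

A loader built on this would typically check which of the listed files exist before opening them, rather than assuming that all of them are present.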

Additional "prior knowledge" on the datasets is found in this report.