PASCAL Agnostic Learning
Prior Knowledge

The challenge is now over, but it remains open for post-challenge submissions!

IMPORTANT: Entries made since February 1st, 2007 might be using validation data, which is now available for training.

The Data

The validation set labels are now available for both the agnostic learning track and the prior knowledge track!

We have formatted five datasets from various application domains. To facilitate entering results for all five datasets, all tasks are two-class classification problems. Download the report for more details on the datasets. These datasets were used previously in the Performance Prediction Challenge, which you may check to get baseline results (the same representation as the "agnostic track data" was used, but the patterns and features were randomized differently).

The aim of the present challenge is to predict the test labels as accurately as possible on ALL five datasets, using either data representation:

Individual datasets can also be downloaded from this table:

Name | Domain | Num. ex. (tr/val/te) | Raw data (for the prior knowledge track) | Preprocessed data (for the agnostic learning track)
ADA | Marketing | 4147/415/41471 | 14 features, comma-separated format, 0.6 MB | 48 features, non-sparse format, 0.6 MB
GINA | Handwriting recognition | 3153/315/31532 | 784 features, non-sparse format, 7.7 MB | 970 features, non-sparse format, 19.4 MB
HIVA | Drug discovery | 3845/384/38449 | Chemical structure in MDL-SD format, 30.3 MB | 1617 features, non-sparse format, 7.6 MB
NOVA | Text classification | 1754/175/17537 | Text, 14 MB | 16969 features, sparse format, 2.3 MB
SYLVA | Ecology | 13086/1309/130857 | 108 features, non-sparse format, 6.2 MB | 216 features, non-sparse format, 15.6 MB
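
The split sizes in the table follow a common pattern: in each dataset the validation set is about a tenth of the training set, and the test set is about ten times the training set. The short Python sketch below simply recomputes these shares from the numbers in the table. The names `SPLITS` and `split_fractions` are illustrative and are not part of any challenge software.

```python
# Split sizes copied from the table above: (training, validation, test).
SPLITS = {
    "ADA": (4147, 415, 41471),
    "GINA": (3153, 315, 31532),
    "HIVA": (3845, 384, 38449),
    "NOVA": (1754, 175, 17537),
    "SYLVA": (13086, 1309, 130857),
}


def split_fractions(name):
    """Return the (train, validation, test) shares of all examples for one dataset."""
    sizes = SPLITS[name]
    total = sum(sizes)
    return tuple(n / total for n in sizes)


for name in SPLITS:
    tr, val, te = split_fractions(name)
    print(f"{name:6s} train={tr:.3f} val={val:.3f} test={te:.3f}")
```

In every case roughly 90% of the examples are in the test set, so the labeled training data is a small fraction of what the final ranking is computed on.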

During the challenge, participants have access only to labeled training data and to unlabeled validation and test data. The validation labels will be made available one month before the end of the challenge. The final ranking will be based on test data results, revealed only when the challenge is over.

Dataset Formats

Agnostic learning track

All "agnostic learning" datasets are in the same format and include 5 files in ASCII format:

The matrix data formats used are (in all cases, each line represents a pattern):

If you are a Matlab user, you can download some sample code to read and check the data (for CLOP users, the sample code is part of CLOP).
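
For readers not using Matlab, here is a minimal Python sketch of reading a non-sparse matrix file. It relies only on the statement above that each line represents a pattern; the assumption that values on a line are numeric and separated by whitespace is ours and should be checked against the actual files. The function name `parse_dense_lines` is hypothetical, and the sparse format is not covered.

```python
def parse_dense_lines(lines, delimiter=None):
    """Parse text lines into a list of float rows, one row per pattern.

    delimiter=None splits on any whitespace (an assumption about the
    non-sparse files); pass "," for comma-separated numeric values.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rows.append([float(v) for v in line.split(delimiter)])
    # Every pattern should have the same number of features.
    widths = {len(r) for r in rows}
    if len(widths) > 1:
        raise ValueError(f"inconsistent row lengths: {sorted(widths)}")
    return rows


# Tiny in-memory example standing in for the lines of a data file.
sample = ["1 0 3.5", "2 4 0", ""]
matrix = parse_dense_lines(sample)
print(len(matrix), "patterns,", len(matrix[0]), "features")
```

To read a real file, pass an open file object, since iterating over it yields its lines. Comparing the resulting number of rows and columns with the table above is a quick sanity check.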

Prior knowledge track

For the "prior knowledge" datasets there may be up to 7 files in ASCII format:

The extension .xxx varies from dataset to dataset: for ADA, GINA, and SYLVA, which are in a feature set representation, xxx=data. For HIVA, which uses the MDL-SD format, xxx=sd. For NOVA, which uses a plain text format, xxx=txt.
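
The extension rule above can be written as a small lookup. This is only an illustrative Python sketch: the names `RAW_EXTENSION` and `raw_extension` are hypothetical, and the mapping contains nothing beyond what the paragraph states.

```python
# Raw-data file extension per dataset, as described in the paragraph above.
RAW_EXTENSION = {
    "ADA": "data",    # feature set representation
    "GINA": "data",   # feature set representation
    "SYLVA": "data",  # feature set representation
    "HIVA": "sd",     # MDL-SD format
    "NOVA": "txt",    # plain text format
}


def raw_extension(dataset):
    """Return the raw-data extension (without the dot) for a dataset name."""
    return RAW_EXTENSION[dataset.upper()]


print(raw_extension("hiva"))
```

Keeping the mapping in one place lets a loader choose the right parser per dataset without scattering string comparisons through the code.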

Additional "prior knowledge" on the datasets is found in this report.