The challenge is now over, but it remains open for post-challenge submissions!
The validation set labels are now available, for the agnostic learning track and for the prior knowledge track!
We have formatted five datasets from various application domains. To facilitate entering results for all five datasets, all tasks are two-class classification problems. Download the report for more details on the datasets. These datasets were used previously in the Performance Prediction Challenge, which you may check to get baseline results (the same representation as the "agnostic track data" was used, but the patterns and features were randomized differently).
The aim of the present challenge is to predict the test labels as accurately as possible on ALL five datasets, using either data representation:
Name | Domain | Number of examples (training/validation/test) | Raw data (for the prior knowledge track) | Preprocessed data (for the agnostic learning track) |
---|---|---|---|---|
ADA | Marketing | 4147/415/41471 | 14 features, comma-separated format, 0.6 MB. | 48 features, non-sparse format, 0.6 MB. |
GINA | Handwriting recognition | 3153/315/31532 | 784 features, non-sparse format, 7.7 MB. | 970 features, non-sparse format, 19.4 MB. |
HIVA | Drug discovery | 3845/384/38449 | Chemical structure in MDL-SD format, 30.3 MB. | 1617 features, non-sparse format, 7.6 MB. |
NOVA | Text classification | 1754/175/17537 | Text, 14 MB. | 16969 features, sparse format, 2.3 MB. |
SYLVA | Ecology | 13086/1309/130857 | 108 features, non-sparse format, 6.2 MB. | 216 features, non-sparse format, 15.6 MB. |
During the challenge, participants have access only to labeled training data and to unlabeled validation and test data. The validation labels will be made available one month before the end of the challenge. The final ranking will be based on test data results, revealed only when the challenge is over.
All "agnostic learning" data sets are in the same format and include 5 files in ASCII format:
The matrix data formats used are (in all cases, each line represents a pattern):
If you are a Matlab user, you can download some sample code to read and check the data (for CLOP users, the sample code is already part of CLOP).
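For readers not using Matlab, the following is a minimal Python sketch of how a non-sparse ASCII matrix file might be read, relying only on the fact stated above that each line represents a pattern. It is not the official sample code. The function name `read_dense_matrix` is hypothetical, and the sketch assumes that feature values on a line are separated by whitespace and are numeric; check both assumptions against the actual files before relying on it.

```python
import io


def read_dense_matrix(stream):
    """Read a non-sparse ASCII matrix: one pattern per line.

    Assumption (not confirmed by the challenge page): feature values
    are numeric and separated by whitespace.
    """
    patterns = []
    for line in stream:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        patterns.append([float(value) for value in fields])

    # Every pattern should have the same number of features.
    widths = {len(row) for row in patterns}
    if len(widths) > 1:
        raise ValueError(f"inconsistent feature counts: {sorted(widths)}")
    return patterns


# Usage with an in-memory stand-in for a data file (made-up values).
sample = io.StringIO("1 0 3\n4 5 6\n")
patterns = read_dense_matrix(sample)
print(len(patterns), len(patterns[0]))  # 2 3
```

With a real file, the same function could be called as `read_dense_matrix(open(path))`, where `path` points to one of the downloaded data files. The consistency check is a cheap way to catch a truncated download or a file in a different format, such as the sparse NOVA data.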
For the "prior knowledge" data sets there may be up to 7 files in ASCII format:
Additional "prior knowledge" on the datasets is found in this report.