Data Sets

If you are interested in experimenting with the MIL data sets used in our 2002 NIPS paper, you can download a compressed tar file containing them here (~6 MB). The table below shows some statistics for these data sets. In the "Features" column, the number in parentheses indicates the sparsity of the features. It is computed as follows: for each bag, count the features that are non-zero in at least one instance of that bag; the reported number is the maximum of this count across all bags.
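The sparsity statistic described above can be sketched in a few lines of Python. This is only an illustration, not the script used to produce the table: the function name, the representation of each instance as a dict from feature index to value, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict


def max_nonzero_features_per_bag(bag_ids, instances):
    """Return the largest number of distinct active features in any bag.

    bag_ids[i] is the bag of instance i; instances[i] is a sparse row,
    given here (by assumption) as a dict {feature_index: value}.
    """
    active = defaultdict(set)  # bag id -> features non-zero in some instance
    for bag, row in zip(bag_ids, instances):
        active[bag].update(j for j, v in row.items() if v != 0)
    return max((len(feats) for feats in active.values()), default=0)


# Hypothetical toy data: two bags, three instances.
bag_ids = [1, 1, 2]
instances = [{0: 0.5, 3: 1.0}, {3: 2.0, 7: 0.1}, {2: 1.0, 5: 0.0}]

# Bag 1 activates features {0, 3, 7}; bag 2 activates only {2}.
print(max_nonzero_features_per_bag(bag_ids, instances))  # 3
```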

A MATLAB version of the above data sets is now available here (~4.3 MB). Each .mat file includes three variables: "bag_ids", "features" and "labels". "bag_ids" maps each instance to its bag. "features" is self-explanatory, although the features in these files may be normalized to have unit variance (please check). Finally, the "labels" variable gives the label of each instance if it is known, and otherwise the label of the enclosing bag (until these data sets are fully labeled, the latter is the case). This format is otherwise undocumented. Please refer to the original data sets and descriptions to interpret this data correctly.
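As a rough sketch of how the three variables might be used from Python, the code below reads a .mat file with SciPy's `loadmat` and groups instance indices by bag. Several things here are assumptions rather than documented facts: that "bag_ids" and "labels" are stored as dense numeric arrays, that a larger label value means "positive", and the helper names `load_mil_mat` and `group_by_bag`, which are hypothetical. Please check the loaded arrays before relying on it.

```python
from collections import defaultdict


def load_mil_mat(path):
    """Read one .mat file (requires SciPy; not run in the demo below).

    Assumes "bag_ids" and "labels" are dense numeric arrays.
    """
    from scipy.io import loadmat

    data = loadmat(path)
    bag_ids = data["bag_ids"].ravel().tolist()
    labels = data["labels"].ravel().tolist()
    return bag_ids, data["features"], labels


def group_by_bag(bag_ids, labels):
    """Map each bag to its instance indices and to a single bag label."""
    members = defaultdict(list)
    for idx, bag in enumerate(bag_ids):
        members[bag].append(idx)
    # While every instance carries its bag's label, max() simply recovers it.
    # Assumption: a larger label value means "positive".
    bag_labels = {bag: max(labels[i] for i in idxs) for bag, idxs in members.items()}
    return dict(members), bag_labels


# Hypothetical toy data standing in for the contents of a .mat file.
members, bag_labels = group_by_bag([1, 1, 2, 2, 2], [1, 1, -1, -1, -1])
print(members)     # {1: [0, 1], 2: [2, 3, 4]}
print(bag_labels)  # {1: 1, 2: -1}
```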

Details

Name                                 Features (Non-Zero)  Positive Bags  Negative Bags  Positive Instances  Negative Instances
TREC9 (pretest) 1/data_200x200.svm   66552 (31)           200            200            1580                1644
TREC9 (pretest) 2/data_200x200.svm   66153 (31)           200            200            1715                1629
TREC9 (pretest) 3/data_200x200.svm   66144 (31)           200            200            1626                1620
TREC9 (pretest) 4/data_200x200.svm   67085 (32)           200            200            1754                1637
TREC9 (pretest) 7/data_200x200.svm   66823 (31)           200            200            1746                1621
TREC9 (pretest) 9/data_200x200.svm   66627 (33)           200            200            1684                1616
TREC9 (pretest) 10/data_200x200.svm  66082 (32)           200            200            1818                1635
Elephant/data_100x100.svm            230 (143)            100            100            762                 629
Fox/data_100x100.svm                 230 (143)            100            100            647                 673
Tiger/data_100x100.svm               230 (143)            100            100            544                 676
Musk/musk1norm.svm                   166                  47             45             207                 269
Musk/musk2norm.svm                   166                  39             63             1017                5581