If you would like to experiment with the MIL data sets used in our 2002 NIPS paper, you can download them as a compressed tar file here (~6 MB). The table below gives some statistics for them. In the "Features" column, the number in parentheses indicates how sparse the features are for that data set. It is computed as follows: for each bag, count the features that are non-zero in at least one instance of the bag; the number shown is the maximum of this count across all bags.
A MATLAB version of the above data sets is now available here (~4.3 MB). Each .mat file includes three variables: "bag_ids", "features" and "labels". "bag_ids" maps each instance to its bag. "features" is self-explanatory, although the features in these files may be normalized to have unit variance (please check). Finally, the "labels" variable gives the label of each instance if it is known, and otherwise the label of the enclosing bag (until these data sets are fully labeled, the latter is the case). This format is otherwise undocumented. Please refer to the original data sets and descriptions to interpret this data correctly.
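As a rough illustration of how the three variables fit together, the sketch below groups instances into bags and computes per-feature variances for the normalization check suggested above. It works on plain Python sequences. The function names `group_by_bag` and `column_variances` and the toy values are hypothetical, and the sketch assumes that a positive label is numerically larger than a negative one (for example +1 versus -1), which the files do not document. A reader such as `scipy.io.loadmat` can load .mat files in Python, but the array shapes and storage of these particular files would need to be checked before converting them to this form.

```python
from collections import defaultdict
from statistics import pvariance


def group_by_bag(bag_ids, features, labels):
    """Return {bag_id: (list_of_instance_rows, bag_label)}.

    Assumption: positive labels compare greater than negative ones, so the
    maximum instance label in a bag is the bag label. While every instance
    carries the label of its enclosing bag, the maximum is simply that label.
    """
    rows = defaultdict(list)
    bag_label = {}
    for bag, row, label in zip(bag_ids, features, labels):
        rows[bag].append(row)
        bag_label[bag] = max(label, bag_label.get(bag, label))
    return {bag: (rows[bag], bag_label[bag]) for bag in rows}


def column_variances(features):
    """Population variance of each feature (column) over all instances."""
    return [pvariance(column) for column in zip(*features)]


# Toy stand-ins for the three variables (not taken from the real files).
bag_ids = [1, 1, 2]
features = [[0.0, 2.0], [2.0, 2.0], [4.0, 0.0]]
labels = [1, 1, -1]

bags = group_by_bag(bag_ids, features, labels)
variances = column_variances(features)

for bag, (instances, label) in sorted(bags.items()):
    print(bag, len(instances), label)
print(variances)
```

If the features in a file have been normalized, the variances should all be close to 1. Whether the population or the sample variance was used for the normalization is not stated, so small deviations are to be expected either way.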
Name | Features (Non-Zero) | Positive Bags | Negative Bags | Positive Instances | Negative Instances |
TREC9 (pretest) 1/data_200x200.svm | 66552 (31) | 200 | 200 | 1580 | 1644 |
TREC9 (pretest) 2/data_200x200.svm | 66153 (31) | 200 | 200 | 1715 | 1629 |
TREC9 (pretest) 3/data_200x200.svm | 66144 (31) | 200 | 200 | 1626 | 1620 |
TREC9 (pretest) 4/data_200x200.svm | 67085 (32) | 200 | 200 | 1754 | 1637 |
TREC9 (pretest) 7/data_200x200.svm | 66823 (31) | 200 | 200 | 1746 | 1621 |
TREC9 (pretest) 9/data_200x200.svm | 66627 (33) | 200 | 200 | 1684 | 1616 |
TREC9 (pretest) 10/data_200x200.svm | 66082 (32) | 200 | 200 | 1818 | 1635 |
Elephant/data_100x100.svm | 230 (143) | 100 | 100 | 762 | 629 |
Fox/data_100x100.svm | 230 (143) | 100 | 100 | 647 | 673 |
Tiger/data_100x100.svm | 230 (143) | 100 | 100 | 544 | 676 |
Musk/musk1norm.svm | 166 | 47 | 45 | 207 | 269 |
Musk/musk2norm.svm | 166 | 39 | 63 | 1017 | 5581 |
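The parenthesized count in the "Features" column follows the definition given above and can be sketched in a few lines. This is a minimal illustration, not the script used to produce the table. It assumes each instance is held as a dictionary from feature index to value and each bag as a list of such instances. The function name `max_nonzero_features_per_bag` and the toy bags are hypothetical, and parsing the .svm files into this form is left out.

```python
def max_nonzero_features_per_bag(bags):
    """Maximum, over bags, of the number of distinct features that are
    non-zero in at least one instance of the bag."""
    best = 0
    for instances in bags.values():
        active = set()
        for instance in instances:
            # Keep only indices whose stored value is actually non-zero.
            active.update(idx for idx, value in instance.items() if value != 0)
        best = max(best, len(active))
    return best


# Toy bags (not taken from the real data sets).
toy_bags = {
    "a": [{1: 0.5, 3: 1.0}, {3: 2.0, 7: 0.0}],          # active: {1, 3}
    "b": [{2: 1.0}, {4: 1.0}, {5: 1.0}, {5: 0.0}],      # active: {2, 4, 5}
}

sparsity = max_nonzero_features_per_bag(toy_bags)
print(sparsity)  # 3
```

Explicitly stored zeros (feature 7 in bag "a") are not counted, so the result depends only on which features are truly non-zero.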