
Data Sets

If you are interested in experimenting with the multiple-instance learning (MIL) data sets used in our 2002 NIPS paper, you can download a compressed tar file containing them here (~6 MB). The table below lists some statistics for each data set. In the "Features" column, the number in parentheses indicates the sparsity of the features for that data set. It is computed as follows: for each bag, count the features that are non-zero in at least one instance of that bag; the reported number is the maximum of this count over all bags.
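The sparsity statistic described above can be sketched in a few lines of Python. This is only an illustration, not the code used to produce the table: the function name `max_nonzero_features_per_bag` and the input representation (one dictionary per instance, mapping feature index to value) are assumptions made for the example.

```python
from collections import defaultdict


def max_nonzero_features_per_bag(bag_ids, instances):
    """Return the largest number of distinct non-zero features found in any bag.

    bag_ids[i] is the bag that instance i belongs to (hypothetical layout);
    instances[i] maps feature index -> feature value for instance i.
    """
    active = defaultdict(set)  # bag id -> features non-zero in at least one instance
    for bag, instance in zip(bag_ids, instances):
        active[bag].update(f for f, value in instance.items() if value != 0)
    # Maximum across bags of the cardinality of each bag's feature set.
    return max((len(features) for features in active.values()), default=0)


# Toy data: bag 1 has two instances, bag 2 has one.
bag_ids = [1, 1, 2]
instances = [
    {0: 0.5, 3: 1.0},
    {3: 2.0, 7: 0.1},
    {2: 1.0, 5: 0.0},  # feature 5 is stored explicitly but is zero
]
result = max_nonzero_features_per_bag(bag_ids, instances)
```

In this toy input, bag 1 uses features 0, 3 and 7, while bag 2 uses only feature 2, so the statistic is determined by bag 1.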

A MATLAB version of the above data sets is now available here (~4.3 MB). Each .mat file includes three variables: "bag_ids", "features" and "labels". "bag_ids" maps each instance to its bag. "features" is self-explanatory, although the features in these files may be normalized to have unit variance (please check). Finally, "labels" gives the label of each instance if it is known, and otherwise the label of the enclosing bag (until these data sets are fully labeled, the latter is the case). This format is otherwise undocumented. Please refer to the original data sets and descriptions to interpret this data correctly.
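As a rough illustration of how the three variables fit together, the following Python sketch groups instances by bag. Since the format is undocumented, several things here are assumptions to verify against the actual files: that each row of "features" is one instance, that "bag_ids" and "labels" have one entry per instance, and that a positive label is numerically larger than a negative one. The helper names `group_by_bag` and `load_mil_mat` are hypothetical; `scipy.io.loadmat` is the standard SciPy reader for .mat files.

```python
import numpy as np


def group_by_bag(bag_ids, features, labels):
    """Group instances by bag id.

    Assumes (unverified) that row i of `features` is instance i and that
    `bag_ids` and `labels` hold one entry per instance.
    """
    bag_ids = np.asarray(bag_ids).ravel()  # loadmat returns 2-D arrays
    labels = np.asarray(labels).ravel()
    bags = {}
    for bag in np.unique(bag_ids):
        rows = np.flatnonzero(bag_ids == bag)  # instance indices in this bag
        bags[int(bag)] = {
            "features": features[rows],
            "labels": labels[rows],
            # Assumption: positive labels are numerically larger, so the
            # maximum is positive if any instance in the bag is positive.
            "bag_label": labels[rows].max(),
        }
    return bags


def load_mil_mat(path):
    """Read one .mat file (path supplied by the caller) and group it by bag."""
    from scipy.io import loadmat  # imported here so group_by_bag needs only NumPy

    data = loadmat(path)
    return group_by_bag(data["bag_ids"], data["features"], data["labels"])


# Toy arrays shaped like loadmat output (row vectors for ids and labels).
bag_ids = np.array([[1, 1, 2]])
features = np.array([[0.0, 1.0], [2.0, 0.0], [0.5, 0.5]])
labels = np.array([[1, 1, -1]])
bags = group_by_bag(bag_ids, features, labels)
```

Because the instance labels currently repeat the label of the enclosing bag, every instance in a bag should carry the same value; a bag with mixed labels would suggest either that instance-level labels have been added or that the row/column assumption above is wrong.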

Details

Name	Features (Non-Zero)	Positive Bags	Negative Bags	Positive Instances	Negative Instances
TREC9 (pretest) 1/data_200x200.svm	66552 (31)	200	200	1580	1644
TREC9 (pretest) 2/data_200x200.svm	66153 (31)	200	200	1715	1629
TREC9 (pretest) 3/data_200x200.svm	66144 (31)	200	200	1626	1620
TREC9 (pretest) 4/data_200x200.svm	67085 (32)	200	200	1754	1637
TREC9 (pretest) 7/data_200x200.svm	66823 (31)	200	200	1746	1621
TREC9 (pretest) 9/data_200x200.svm	66627 (33)	200	200	1684	1616
TREC9 (pretest) 10/data_200x200.svm	66082 (32)	200	200	1818	1635
Elephant/data_100x100.svm	230 (143)	100	100	762	629
Fox/data_100x100.svm	230 (143)	100	100	647	673
Tiger/data_100x100.svm	230 (143)	100	100	544	676
Musk/musk1norm.svm	166	47	45	207	269
Musk/musk2norm.svm	166	39	63	1017	5581