Our contributions for RapidMiner can be found here.
Dataset | Description |
---|---|
aspects-DB | The dataset contains aspects and the corresponding image URLs. Please follow the README for further information. |
Dataset | Description |
---|---|
captioning_in_the_wild.zip | Crowdsourced annotations of image captions from the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M). The dataset contains responses on the subjectivity, visibility, appeal and intent of around 2.2k image titles. |
Here you can find the datasets that have been generated at MADM for research purposes. Detailed information about each dataset is available on its specific page.
Topic | Name | Description |
---|---|---|
Document security | Doctor bills | The dataset contains genuine and forged doctor bills. The forgeries were made by re-engineering genuine documents. |
Document security | MIC dataset | The data set contains print-outs from color laser printers and copiers that show Machine Identification Codes (MIC), also known as “yellow dots” or counterfeit protection system codes. |
Document security | StaVer dataset | The data set contains scanned invoices with color logos, color text and various kinds of stamps. |
Document security | Scan Distortion dataset | The dataset contains grayscale invoices from the same source as well as copies of genuine invoices, intended for detecting and measuring scanning distortions. |
Document security | Distorted Text-Lines dataset | The dataset contains synthetic grayscale document images with single-column text where the last paragraph is either rotated or misaligned. Different fonts and font sizes are used. |
Document security | DFKI Printing Technique dataset | This dataset contains documents printed on 7 inkjet and 13 laser printers. |
Dataset | Description |
---|---|
YouTube-22concepts | A dataset of YouTube video clips tagged with 22 different concepts for experiments with automatic video annotation. |
Dataset | Description |
---|---|
AudioPairBank | A Large-Scale Tag-Pair-Based Audio Dataset (385.5 hours, 1116 classes) |
Dataset Generator | Description |
---|---|
dfki-bayes-data-generator-1.05.zip | Python code for generating synthetic datasets with known Bayes error rate and defined statistical properties. |
Dataset | Description |
---|---|
EuroSAT (RGB color space images) | EuroSAT: A land use and land cover classification dataset based on Sentinel-2 satellite images. |
EuroSAT (all 13 bands) | EuroSAT: A land use and land cover classification dataset based on Sentinel-2 satellite images. |
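
To illustrate what a "known Bayes error rate" means for the synthetic dataset generator listed above, here is a minimal sketch. It is not the code from dfki-bayes-data-generator; it assumes the simplest possible setting (two equally likely classes drawn from unit-variance 1-D Gaussians), and all function names are hypothetical. In that setting the Bayes error has the closed form Φ(-d/2), where d is the distance between the class means, so a classifier's measured error can be compared against the theoretical optimum.

```python
import math
import random


def bayes_error(distance):
    """Bayes error for two equal-prior, unit-variance 1-D Gaussians
    whose means are `distance` apart: Phi(-distance / 2)."""
    return 0.5 * math.erfc(distance / (2.0 * math.sqrt(2.0)))


def generate(n, distance, seed=0):
    """Draw n (value, label) pairs; class 0 is centred at 0, class 1 at `distance`."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        label = rng.randrange(2)  # equal class priors (assumption of this sketch)
        samples.append((rng.gauss(label * distance, 1.0), label))
    return samples


def optimal_classifier_error(samples, distance):
    """Error rate of the Bayes-optimal rule, which thresholds at the midpoint."""
    threshold = distance / 2.0
    errors = sum((x > threshold) != bool(label) for x, label in samples)
    return errors / len(samples)


DISTANCE = 2.0
data = generate(20000, DISTANCE)
theoretical = bayes_error(DISTANCE)
empirical = optimal_classifier_error(data, DISTANCE)
print(f"theoretical Bayes error: {theoretical:.4f}")
print(f"empirical error of the optimal rule: {empirical:.4f}")
```

With enough samples, the empirical error of the optimal rule should approach the theoretical value, and no classifier can be expected to do better on average.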
Datasets for unsupervised anomaly detection can be found below. The outlier label must not be used for detection, only for evaluation. The first row of each file contains the column names. For the UCI datasets, permission for republication has been granted. For more information, please refer to http://archive.ics.uci.edu/ml/
More unsupervised anomaly detection datasets for evaluation can now be found on the Harvard Dataverse: http://dx.doi.org/10.7910/DVN/OPQMVF
Dataset | Records | Dimensions | % outliers | Description |
---|---|---|---|---|
dfki-artificial-3000-unsupervised-ad.zip | 3000 | 2 | 1.23 | Artificial test data set with 4 normal distributions (one of which with low density), a micro cluster and local anomalies. |
breast-cancer-unsupervised.csv | 367 | 30 | 2.72 | Modified “Breast Cancer Wisconsin (Diagnostic)” dataset from the UCI machine learning repository. Original version available here. |
pen-local-unsupervised.csv | 6724 | 16 | 0.15 | Modified “Pen-Based Recognition of Handwritten Digits” dataset from the UCI machine learning repository. Original version available here. |
pen-global-unsupervised.csv | 809 | 16 | 11.1 | Modified “Pen-Based Recognition of Handwritten Digits” dataset from the UCI machine learning repository. Original version available here. |
shuttle-unsupervised.csv | 46464 | 9 | 1.89 | Modified “Statlog (Shuttle)” dataset from the UCI machine learning repository. Original version available here. |
satellite-unsupervised.csv | 5100 | 36 | 1.49 | Modified “Statlog (Landsat Satellite)” dataset from the UCI machine learning repository. Original version available here. |
annthyroid-unsupervised.csv | 6916 | 21 | 3.61 | Modified “Thyroid Disease” dataset from the UCI machine learning repository. See version “ann-thyroid”. Original version available here. |
kdd99-unsupervised-ad.csv | 620089 | 38 | 0.17 | Modified “KDD Cup 1999” dataset from the UCI machine learning repository. Only HTTP connections selected. Original version available here. |
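
The evaluation protocol for the anomaly detection datasets (score the records without the label, then use the label only to measure quality) can be sketched as follows. This is an illustrative example, not the procedure used by the dataset authors. The column name `outlier`, the label values `o`/`n`, the tiny inline CSV and the k-nearest-neighbour scoring are all assumptions made for the sketch; check the header row of the actual file before adapting it.

```python
import csv
import io
import math

# Tiny stand-in for one of the CSV files. Column names and label values
# are hypothetical; the real files define their own header row.
SAMPLE_CSV = """x,y,outlier
0.0,0.1,n
0.1,0.0,n
0.2,0.1,n
0.1,0.2,n
0.0,0.0,n
5.0,5.0,o
"""


def load_dataset(handle, label_column="outlier", outlier_value="o"):
    """Read a CSV whose first row holds column names; split features and labels."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        # Remove the label so it can never leak into the detector's input.
        labels.append(row.pop(label_column) == outlier_value)
        features.append([float(value) for value in row.values()])
    return features, labels


def knn_scores(features, k=2):
    """Anomaly score = distance to the k-th nearest neighbour (labels unused)."""
    scores = []
    for i, point in enumerate(features):
        distances = sorted(
            math.dist(point, other) for j, other in enumerate(features) if j != i
        )
        scores.append(distances[k - 1])
    return scores


def roc_auc(scores, labels):
    """Probability that a random outlier scores higher than a random normal record."""
    outliers = [s for s, is_outlier in zip(scores, labels) if is_outlier]
    normals = [s for s, is_outlier in zip(scores, labels) if not is_outlier]
    wins = sum((o > n) + 0.5 * (o == n) for o in outliers for n in normals)
    return wins / (len(outliers) * len(normals))


features, labels = load_dataset(io.StringIO(SAMPLE_CSV))
scores = knn_scores(features, k=2)   # detection step: features only
auc = roc_auc(scores, labels)        # evaluation step: labels used here only
print(f"ROC AUC: {auc:.3f}")
```

To run this on a downloaded file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle, for example `open("breast-cancer-unsupervised.csv", newline="")`. The pairwise distance loop is quadratic in the number of records, so a library implementation would be preferable for the larger files.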