A curated list of weak and distant supervision papers and tools.
- Introduction
- Contributing
- Overview Texts
- Surveys
- Foundational Papers
- Books
- Libraries & Tools
- Datasets & Benchmarks
Weak supervision and distant supervision provide ways to (semi-)automatically generate training data for machine learning systems quickly and cheaply when manually annotated training data is scarce. The idea is popular and actively researched in fields like natural language processing and computer vision. Here, we list interesting papers and tools to help newcomers from both the research and the application side try out weak supervision.
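To make the idea concrete, here is a minimal, library-agnostic sketch of weak supervision for sentiment classification: small keyword lists stand in for an external knowledge source, and a heuristic turns unlabeled text into (noisy) training labels. The keyword lists, texts, and function names are purely illustrative.

```python
# Minimal sketch of weak supervision: derive noisy training labels
# from a heuristic instead of manual annotation.
# The keyword lists below are illustrative, not from any real lexicon.

POSITIVE = {"great", "excellent", "love"}
NEGATIVE = {"terrible", "awful", "hate"}

def weak_label(text):
    """Return 1 (positive), 0 (negative), or None (abstain)."""
    tokens = set(text.lower().split())
    if tokens & POSITIVE:
        return 1
    if tokens & NEGATIVE:
        return 0
    return None  # heuristic does not apply; no label generated

corpus = [
    "I love this phone, the camera is great",
    "Awful battery life, terrible screen",
    "It arrived on Tuesday",
]

# Keep only the examples the heuristic could label; these noisy
# (text, label) pairs can then train an ordinary supervised classifier.
training_data = [(t, weak_label(t)) for t in corpus if weak_label(t) is not None]
```

The resulting labels contain errors (a sentence like "I don't love it" would be mislabeled), which is exactly why much of the research below deals with learning from noisy labels.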
This list was started by the organizers of the WeaSuL Workshop on Weakly Supervised Learning at ICLR'21, and we welcome contributions that extend it.
If you want to contribute to this list, just create a pull request or a new issue. For a paper or tool, please provide all the necessary information (authors, title, conference, link, topic tags, short description). If you are unsure, feel free to open an issue to discuss it. If you encounter any typos, just let us know. Thanks!
Texts that give a quick start into the topic.
- A brief introduction to weakly supervised learning (National Science Review, 2018) This paper provides an overview of weak supervision and its approaches.
- A Visual Guide to Low-Resource NLP This article gives a quick introduction with visual examples to low-resource NLP techniques including different weak supervision methods.
Surveys give a broad overview of a field and can allow you to quickly get insights into current trends and issues for future work.
- Image Classification with Deep Learning in the Presence of Noisy Labels: A Survey (arXiv, 2021) [CV] A survey on handling errors in weakly supervised or otherwise noisily labeled data for computer vision.
- A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios (NAACL, 2021) [NLP] Section 4 covers both methods for weak supervision in different NLP tasks as well as how to handle noisy labels.
- A Brief Survey of Relation Extraction Based on Distant Supervision (ICCS, 2019) [NLP, RE] A survey specifically on distant supervision for relation extraction.
- Relation Extraction Using Distant Supervision: A Survey (ACM Computing Surveys, 2018) [NLP, RE] Another survey specifically on distant supervision for relation extraction.
- A Survey of Noise Reduction Methods for Distant Supervision (AKBC, 2013) [NLP, RE] A survey that covers different ways to handle annotation errors in distantly supervised relation extraction data.
Important steps in how we came to the current state of the art.
- Distant Supervision for Relation Extraction Without Labeled Data (ACL, 2009) [NLP, RE]
- Data Programming: Creating Large Training Sets, Quickly (NIPS, 2016) [ML]
- Practical Weak Supervision: Doing More with Less Data (O'Reilly Media, 2021) [NLP, CV] A book on building natural language processing and computer vision projects with weakly labeled datasets using the Snorkel tool.
Open-source libraries and tools already providing implementations that get you started quickly.
- Cleanlab [ML, CV, NLP] "Python package for machine learning with noisy labels. cleanlab cleans labels and supports finding, quantifying, and learning with label errors in datasets."
- Knodle [ML, CV, NLP] "Modular weakly supervised learning with PyTorch."
- Snorkel [ML, CV, NLP] "Programmatically build and manage training data."
- ANEA [NLP] "A tool to automatically annotate named entities in unlabeled text based on entity lists, for use as distant supervision."
- skweak [NLP] "It provides labeling functions to automatically label documents and aggregates their results to obtain a labeled version of the corpus."
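The libraries above differ in scope, but most revolve around the same core step: applying several labeling functions to each example and aggregating their possibly conflicting, possibly abstaining votes into one training label. Below is a library-agnostic sketch of that aggregation using simple majority vote (the real tools use more sophisticated label models); all labeling functions and names are illustrative.

```python
# Library-agnostic sketch of labeling-function aggregation by majority vote.
from collections import Counter

ABSTAIN = -1  # convention: a labeling function may decline to vote

# Three toy labeling functions for spam detection (1 = spam, 0 = not spam);
# purely illustrative heuristics.
def lf_contains_link(text):
    return 1 if "http://" in text or "https://" in text else ABSTAIN

def lf_contains_offer(text):
    return 1 if "free" in text.lower() else ABSTAIN

def lf_long_message(text):
    # Long messages tend to be regular mail in this toy setting.
    return 0 if len(text.split()) > 8 else ABSTAIN

LFS = [lf_contains_link, lf_contains_offer, lf_long_message]

def majority_vote(text):
    """Aggregate non-abstaining LF votes; None if every LF abstains."""
    votes = Counter(lf(text) for lf in LFS)
    votes.pop(ABSTAIN, None)  # abstentions carry no information
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

Tools like Snorkel replace the majority vote with a generative label model that estimates each labeling function's accuracy and correlations, which usually yields cleaner labels than simple voting.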
Datasets generated through weak and distant supervision. These works can provide both insights into how to generate weakly supervised data as well as to evaluate your learning algorithms on them.
- NoisyNER. Analysing the Noise Model Error for Realistic Noisy Label Data (AAAI, 2021) [NLP, NER] A named entity recognition dataset with different label sets created through distant supervision and manual rules.
- Learning from Rules Generalizing Labeled Exemplars (ICLR, 2020) [ML, NLP] Four NLP datasets on text classification and slot filling, labeled using rules.
- Beyond Synthetic Noise: Deep Learning on Controlled Noisy Labels (PMLR, 2020) [CV] A dataset with noisy labels obtained through web crawling.
- Learning with Noisy Labels for Sentence-level Sentiment Classification (EMNLP, 2019) [NLP, Sentiment] A sentiment analysis dataset created through weak supervision leveraging the context of sentences.
- Food-101N. CleanNet: Transfer Learning for Scalable Image Classifier Training With Label Noise (CVPR, 2018) [CV] An image classification dataset of food images with noisy labels.
- Cross-lingual Name Tagging and Linking for 282 Languages (ACL, 2017) [NLP, NER, multilingual] A multilingual named entity recognition dataset derived from Wikipedia (WikiAnn).
- WebVision. WebVision Database: Visual Learning and Understanding from Web Data (arXiv, 2017) [CV] An image classification dataset created through weak supervision from web crawling.
- Clothing1M. Learning From Massive Noisy Labeled Data for Image Classification (CVPR, 2015) [CV] An image classification dataset that leverages the context of images for automatic labeling.
- TinyImages. 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition (TPAMI, 2008) [CV] A large image dataset for object and scene recognition.