# Multi-modal-Deep-Learning

Recent multi-modal deep learning advances (a list of papers and highlights).

Jingfeng Yang

----
## Introduction

### Prelude

There have been many recent advances in using unified models (e.g., the Transformer) to create representations for multiple modalities. Some of them even fuse multiple modalities so that different modalities can help each other. Here, modalities include not only natural language, vision, and speech, but also formal language (e.g., code) and (semi-)structured knowledge (e.g., tables, knowledge graphs, etc.). This is a list of recent important papers in this field. Contributions are welcome.
- [Introduction](#introduction)
- [Prelude](#prelude)
- [Resources](#resources)
- [Natural Language](#natural-language)
- [Vision](#vision)
- [Speech](#speech)
- [Formal Language / Code](#formal-language)
- [Structured Knowledge](#structured-knowledge)
- [Modality Infusion](#modality-infusion)
## Resources
* [Microsoft UniLM series](https://github.com/microsoft/unilm/)

## Natural Language

* BERT, RoBERTa, BART, SpanBERT, T5, GPT-k, etc.

* [data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/pdf/2202.03555.pdf), arXiv, Feb 2022.
## Vision

### Supervised Vision Tasks

* [ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929.pdf), ICLR 2021.

* [DeiT: Training data-efficient image transformers & distillation through attention](https://arxiv.org/pdf/2012.12877.pdf), Dec 2020.

* [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/pdf/2103.14030.pdf), Aug 2021.

### Unsupervised Vision Representation Learning

* [DINO: Emerging Properties in Self-Supervised Vision Transformers](https://arxiv.org/pdf/2104.14294.pdf), arXiv, Apr 2021.

* [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254), arXiv, Jun 2021.

* [SimMIM: A Simple Framework for Masked Image Modeling](https://arxiv.org/pdf/2111.09886.pdf), arXiv, Nov 2021.

* [MAE: Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/pdf/2111.06377.pdf), arXiv, Nov 2021.

* [data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/pdf/2202.03555.pdf), arXiv, Feb 2022.
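The masked-image-modeling objective behind several of these papers can be illustrated with a toy: hide a subset of patch vectors, reconstruct them from the visible ones, and score the loss only on the masked positions. The sketch below is a heavily simplified illustration in that spirit (the "predictor" is just the mean of visible patches), not the actual method of SimMIM, MAE, or any other listed paper.

```python
import random

def masked_reconstruction_loss(patches, mask_ratio=0.5, seed=0):
    # Toy masked-prediction objective. All specifics are illustrative
    # assumptions, not any paper's real model.
    rng = random.Random(seed)
    n = len(patches)
    masked_idx = set(rng.sample(range(n), max(1, int(n * mask_ratio))))
    visible = [p for i, p in enumerate(patches) if i not in masked_idx]
    # Stand-in "predictor": reconstruct every masked patch as the mean
    # of the visible patches.
    mean_patch = [sum(col) / len(visible) for col in zip(*visible)]
    # Mean squared error, computed ONLY over masked positions -- the
    # defining trait of the objective.
    loss, count = 0.0, 0
    for i in masked_idx:
        for a, b in zip(patches[i], mean_patch):
            loss += (a - b) ** 2
            count += 1
    return loss / count

patches = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
loss = masked_reconstruction_loss(patches)
```

In the real methods the predictor is a Transformer trained by gradient descent; the toy only shows where the loss is applied.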
## Speech

### Unsupervised Speech Representation Learning

* [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/pdf/2006.11477.pdf), arXiv, Jun 2020.

* [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/pdf/2106.07447.pdf), arXiv, Jun 2021.

* [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/pdf/2110.13900.pdf), arXiv, Oct 2021.

* [data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/pdf/2202.03555.pdf), arXiv, Feb 2022.

### Unsupervised Automatic Speech Recognition (ASR)

* [wav2vec-U: Unsupervised Speech Recognition](https://arxiv.org/pdf/2105.11084.pdf), arXiv, May 2021.
## Formal Language

* [CodeBERT: A Pre-Trained Model for Programming and Natural Languages](https://arxiv.org/pdf/2002.08155.pdf), EMNLP 2020 (Findings).

* [Codex: Evaluating Large Language Models Trained on Code](https://arxiv.org/pdf/2107.03374.pdf), Jul 2021.

* [GraphCodeBERT: Pre-training Code Representations with Data Flow](https://arxiv.org/pdf/2009.08366.pdf), ICLR 2021.

* [AlphaCode: Competition-Level Code Generation with AlphaCode](https://storage.googleapis.com/deepmind-media/AlphaCode/competition_level_code_generation_with_alphacode.pdf).
## Structured Knowledge

* [UNIFIEDSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models](https://arxiv.org/abs/2201.05966), arXiv, Jan 2022.

### Table

* [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/pdf/2004.02349.pdf), ACL 2020.

* [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://arxiv.org/pdf/2107.07653.pdf), ICLR 2022.

* [TableFormer: Robust Transformer Modeling for Table-Text Encoding](https://openreview.net/pdf?id=EHzvRqy6kD), ACL 2022.

### Knowledge Graph

* [COMET: Commonsense Transformers for Automatic Knowledge Graph Construction](https://arxiv.org/pdf/1906.05317.pdf), ACL 2019.

* [(COMET-)ATOMIC-2020: On Symbolic and Neural Commonsense Knowledge Graphs](https://arxiv.org/pdf/2010.05953.pdf), arXiv, Oct 2020.

### Retrieved Passages as Knowledge

* [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/pdf/2002.08909.pdf), arXiv, Feb 2020.

* [MARGE: Pre-training via Paraphrasing](https://proceedings.neurips.cc/paper/2020/file/d6f1dd034aabde7657e6680444ceff62-Paper.pdf), NeurIPS 2020.
## Modality Infusion

### Vision and Natural Language

* [DALL·E: Zero-Shot Text-to-Image Generation](https://arxiv.org/pdf/2102.12092.pdf), Feb 2021.

* [CLIP: Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/pdf/2103.00020.pdf), arXiv, Feb 2021.