Training DPCNNs with GULF

The GULF experiments with text classification in the Appendix of the GULF paper [Johnson & Zhang, 2020] used DPCNN [Johnson & Zhang, 2017] as a base model. train_yelp.py and train_yelp_embed.py are provided to reproduce these experiments using the the polarized Yelp dataset from [Zhang et al.,2015]. Also note that using this code, regular training (i.e., without using GULF) of DPCNNs can also be done, and that with slight modification, other datasets can be used.

Examples without embedding learning

To perform GULF2 with 'ini:random' in the 'large' training data setting: python3 train_yelp.py
To perform regular training in the 'large' training data setting: python3 train_yelp.py --alpha 1 --num_stages 1
To get help: python3 train_yelp.py -h

Examples with embedding learning

DPCNNs optionally take additional features from embeddings trained on unlabeled data for a language-modeling-like objective. To do this as in the GULF paper,

To train the embedding of 3-word regions as a function of a bag of words to a 250-dim vector
```
 python3 train_yelp_embed.py
```
The learned embedding is written to emb/yelppol-n1r3-emb.pth.
To train the embedding of 5-word regions as a function of a bag of {1,2,3}-grams to a 250-dim vector:
```
 python3 train_yelp_embed.py  --n_max  3
```
The learned embedding is written to emb/yelppol-n3r5-emb.pth.

To perform supervised training with GULF, using the embeddings obtained above

 python3 train_yelp.py  --x_emb  emb/yelppol-n3r5-emb.pth  emb/yelppol-n1r3-emb.pth

To perform supervised training without GULF, using the embeddings obtained above

 python3 train_yelp.py  --x_emb  emb/yelppol-n3r5-emb.pth  emb/yelppol-n1r3-emb.pth --alpha 1 --num_stages 1

Ensemble

How to make an ensemble of DPCNNs trained with GULF as in the GULF paper is explained here.

Example configurations

code	CPU cores	CPU memory	GPU
train_yelp.py	1	24GB	1
train_yelp_embed.py	7	32GB	1

GPU device memory: 12GB

NOTES for DPCNN users

This pyTorch version of DPCNN preserves the essence of DPCNN, but its details are not exactly the same as the DPCNN paper or the original C++ version. This is a result of pursuing an efficient implementation in pyTorch and some simplification. For example, the original work used the bag-of-word representation for target regions (to be predicted) and minimized squared error with negative sampling. This pyTorch version minimizes the log loss without sampling where the target probability is set by equally distributing the probability mass among the words in the target regions. However, even after modifications, embedding learning in the pyTorch version is slower than the C++ version.
In the DPCNN paper, 5- and 9-word regions of uni-grams and {1,2,3}-grams were used. Due to the above-mentioned changes in the embedding learning implementation, this choice may not be optimal for the datasets tested in the DPCNN paper. The effective setting should be experimentally chosen newly for each dataset. Our choice for the polarized Yelp is shown above.
This code downloads tokenized text (and labels, etc.) of the poloarized Yelp dataset. To use DPCNNs on some other dataset, tokenized text (and labels, etc.) must be prepared by the user. Please see the downloaded files at data/ to find out the file format and file naming conventions. You need to prepare *.tok.txt (tokens), *.cat (labels), and *.catdic (class label dictionary) at a minimum.

Data source

This code downloads a tokenized version of the Yelp dataset. The original Yelp dataset (before tokenization) was compiled by [Zhang et al., 2015].

References

[Johnson & Zhang, 2020] Guided Learning of Nonconvex Models through Successive Functional Gradient Optimization. Rie Johnson and Tong Zhang. ICML-2020.
[Johnson & Zhang, 2017] Deep pyramid convolutional neural networks for text categorization. Rie Johnson and Tong Zhang. ACL-2017.
[Zhang et al., 2015] Character-level convolutional networks for text classification. Xiang Zhang, Junbo Zhao, and Yann LeCun. NIPS-2015.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dpcnn.md

dpcnn.md

Training DPCNNs with GULF

Files

dpcnn.md

Latest commit

History

dpcnn.md

File metadata and controls

Training DPCNNs with GULF