Commit f5fb5ac

Extensive model and notebook updates

cgpotts committed Jul 3, 2020
1 parent b00a3d1 commit f5fb5ac
Showing 67 changed files with 10,667 additions and 5,041 deletions.
89 changes: 88 additions & 1 deletion README.md
@@ -1,8 +1,95 @@
# CS224u: Natural Language Understanding

Code for [the Stanford course](https://web.stanford.edu/class/cs224u/). The code is written to run under Python 3.7; [setup.ipynb](setup.ipynb) provides additional details.
Code for [the Stanford course](https://web.stanford.edu/class/cs224u/).

Fall 2020

# Instructors

* [Bill MacCartney](https://nlp.stanford.edu/~wcmac/)
* [Christopher Potts](https://web.stanford.edu/~cgpotts/)


# Core components


## `setup.ipynb`

Details on how to get set up to work with this code.


## `tutorial_*` notebooks

Introductions to Jupyter notebooks, scientific computing with NumPy and friends, and PyTorch.


## `torch_*.py` modules

A generic optimization class (`torch_model_base.py`) and subclasses for GloVe, Autoencoders, shallow neural classifiers, RNN classifiers, tree-structured networks, and grounded natural language generation.

`tutorial_pytorch_models.ipynb` shows how to use these modules as a general framework for creating original systems.
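
A hypothetical sketch of that pattern, on the assumption that subclasses supply `build_graph` and `build_dataset` (the notebook is the authoritative reference):

```python
import torch
import torch.nn as nn
import torch.utils.data

from torch_model_base import TorchModelBase


class SoftmaxClassifier(TorchModelBase):
    """Toy subclass: a single linear layer trained by the base class."""

    def __init__(self, input_dim, n_classes, **kwargs):
        self.input_dim = input_dim
        self.n_classes = n_classes
        super().__init__(**kwargs)

    def build_graph(self):
        # The nn.Module that the inherited optimization loop will train.
        return nn.Linear(self.input_dim, self.n_classes)

    def build_dataset(self, X, y=None):
        # Package raw arrays as a Dataset for the training loop.
        X = torch.FloatTensor(X)
        if y is None:
            return torch.utils.data.TensorDataset(X)
        return torch.utils.data.TensorDataset(X, torch.LongTensor(y))
```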


## `np_*.py` modules

Reference implementations for the `torch_*.py` models, designed to reveal more about how the optimization process works.
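
For instance, a single hand-written stochastic gradient step for softmax regression looks like this sketch (names and hyperparameters are illustrative, not the repo's API):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sgd_step(W, x, y_onehot, lr=0.1):
    """One stochastic gradient step for softmax regression."""
    probs = softmax(x @ W)                 # forward pass
    grad = np.outer(x, probs - y_onehot)   # d(cross-entropy)/dW
    return W - lr * grad                   # parameter update
```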


## `vsm_*` and `hw_wordsim.ipynb`

A unit on vector space models of meaning, covering traditional methods like PMI and LSA as well as newer methods like Autoencoders and GloVe. `vsm.py` provides a lot of the core functionality, and `torch_glove.py` and `torch_autoencoder.py` are the learned models that we cover. `vsm_03_retrofitting.ipynb` is an extension that uses `retrofitting.py`.
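
For a flavor of the traditional methods, here is a sketch of PPMI reweighting for a word-by-context count matrix (`vsm.py` has the course's own implementation):

```python
import numpy as np

def ppmi(X):
    """Positive PMI for a word-by-context count matrix X."""
    total = X.sum()
    row = X.sum(axis=1, keepdims=True) / total   # P(word)
    col = X.sum(axis=0, keepdims=True) / total   # P(context)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((X / total) / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0                 # zero-count cells
    return np.maximum(pmi, 0.0)                  # keep positive values only
```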


## `sst_*` and `hw_sst.ipynb`

A unit on sentiment analysis with the [English Stanford Sentiment Treebank](https://nlp.stanford.edu/sentiment/treebank.html). The core code is `sst.py`, which includes a flexible experimental framework. All the PyTorch classifiers are put to use as well: `torch_shallow_neural_network.py`, `torch_rnn_classifier.py`, and `torch_tree_nn.py`.
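
The framework's basic shape is a feature function paired with a model-fitting function. A hypothetical pair of that shape (the exact `sst.experiment` interface is defined in `sst.py`):

```python
from collections import Counter
from sklearn.linear_model import LogisticRegression

def unigrams_phi(text):
    """Feature function: map an example to a bag-of-words count dict."""
    return Counter(text.lower().split())

def fit_maxent(X, y):
    """Model-fitting function: vectorized features in, fitted model out."""
    mod = LogisticRegression(max_iter=1000)
    mod.fit(X, y)
    return mod
```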


## `rel_ext*` and `hw_rel_ext.ipynb`

A unit on relation extraction with distant supervision.
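
Distant supervision means treating any sentence that mentions both entities of a knowledge-base triple as a (noisy) positive example for that triple's relation. A toy illustration, independent of the `rel_ext` code:

```python
# Hypothetical one-triple knowledge base: (subject, relation) -> object.
KB = {("Stanford", "located_in"): "California"}

def label_sentence(sentence, subj, obj):
    """Return candidate relation labels for an entity pair in a sentence."""
    if subj in sentence and obj in sentence:
        return [rel for (s, rel), o in KB.items() if s == subj and o == obj]
    return []

label_sentence("Stanford is in California.", "Stanford", "California")
# -> ['located_in']
```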


## `nli_*` and `hw_wordentail.ipynb`

A unit on Natural Language Inference. `nli.py` provides core interfaces to a variety of NLI datasets, and an experimental framework. All the PyTorch classifiers are again in heavy use: `torch_shallow_neural_network.py`, `torch_rnn_classifier.py`, and `torch_tree_nn.py`.
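
For the word-entailment homework, a standard baseline featurizes a word pair by combining the two words' vectors. A sketch, with concatenation as the (assumed) combination function:

```python
import numpy as np

def word_entail_featurize(pairs, lookup, combine=np.concatenate):
    """Map (word1, word2) pairs to vectors via a vector `lookup` dict."""
    return np.array([combine((lookup[w1], lookup[w2])) for w1, w2 in pairs])
```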


## `colors*`, `torch_color_describer.py`, and `hw_colors.ipynb`

A unit on grounded natural language generation, focused on generating context-dependent color descriptions using the [English Stanford Colors in Context dataset](https://cocolab.stanford.edu/datasets/colors.html).
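
The corpus-reading interface is defined in `colors.py` (its diff appears below). A usage sketch; the filename is a placeholder:

```python
from colors import ColorsCorpusReader

corpus = ColorsCorpusReader(
    "colors_corpus.csv",       # placeholder path to the corpus file
    word_count=None,           # keep utterances of all lengths
    normalize_colors=True)     # rescale the HLS values

for ex in corpus.read():       # one ColorsCorpusExample per game round
    print(ex.parse_turns())    # utterance split on the turn boundary
    break
```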


## `contextualreps.ipynb`

Using pretrained parameters from [Hugging Face](https://huggingface.co) and [AllenNLP](https://allennlp.org) for featurization and fine-tuning.
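
A minimal featurization sketch with the `transformers` library (the notebook is the authoritative reference; the model name is one common choice, not necessarily the notebook's):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("natural language understanding", return_tensors="pt")
with torch.no_grad():
    last_hidden = model(**inputs)[0]   # [batch, tokens, hidden]
features = last_hidden.mean(dim=1)     # mean-pool tokens into one vector
```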


## `evaluation_*.ipynb` and `projects.md`

Notebooks covering key experimental methods and practical considerations, and tips on writing up and presenting work in the field.


## `utils.py`

Miscellaneous core functions used throughout the code.


## `test/`

To run these tests, use

```py.test -vv test/*```

or, for just the tests in `test_shallow_neural_classifiers.py`,

```py.test -vv test/test_shallow_neural_classifiers.py```

If the above commands don't work, try

```python3 -m pytest -vv test/test_shallow_neural_classifiers.py```


## License

The materials in this repo are licensed under the [Apache 2.0 license](LICENSE) and a [Creative Commons Attribution-ShareAlike 4.0 International license](https://creativecommons.org/licenses/by-sa/4.0/).
27 changes: 18 additions & 9 deletions colors.py
@@ -5,25 +5,28 @@
import matplotlib.patches as mpatch

__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"
__version__ = "CS224u, Stanford, Fall 2020"


TURN_BOUNDARY = " ### "


class ColorsCorpusReader:
"""Basic interface for the Stanford Colors in Context corpus:
"""
Basic interface for the Stanford Colors in Context corpus:
https://cocolab.stanford.edu/datasets/colors.html
Parameters
----------
src_filename : str
Full path to the corpus file.
word_count : int or None
If int, then only examples with `word_count` words in their
'contents' field are included (as estimated by the number of
whitespace tokens). If None, then all examples are returned.
normalize_colors : bool
The colors in the corpus are in HLS format with values
[0, 360], [0, 100], [0, 100]. If `normalize_colors=True`,
@@ -43,7 +46,8 @@ def __init__(self, src_filename, word_count=None, normalize_colors=True):
self.normalize_colors = normalize_colors

def read(self):
"""The main interface to the corpus.
"""
The main interface to the corpus.
As in the paper, turns taken in the same game and round are
grouped together into a single `ColorsCorpusExample` instance
@@ -72,7 +76,8 @@ def _word_count_filter(self, row):


class ColorsCorpusExample:
"""Interface to individual examples in the Stanford Colors in
"""
Interface to individual examples in the Stanford Colors in
Context corpus.
Parameters
@@ -81,6 +86,7 @@ class ColorsCorpusExample:
This contains all of the turns associated with a given game
and round. The assumption is that all of the key-value pairs
in these dicts are the same except for the 'contents' key.
normalize_colors : bool
The colors in the corpus are in HLS format with values
[0, 360], [0, 100], [0, 100]. If `normalize_colors=True`,
@@ -124,7 +130,8 @@ def __init__(self, rows, normalize_colors=True):
self.speaker_context = self._get_reps_in_order('speaker')

def parse_turns(self):
""""Turns the `contents` string into a list by splitting on
""""
Turns the `contents` string into a list by splitting on
`TURN_BOUNDARY`.
Returns
@@ -135,7 +142,8 @@ def parse_turns(self):
return self.contents.split(TURN_BOUNDARY)

def display(self, typ='model'):
"""Prints examples to the screen in an intuitive format: the
"""
Prints examples to the screen in an intuitive format: the
utterance text appears first, followed by the three color
patches, with the target identified by a black border in the
'speaker' and 'model' variants.
@@ -213,9 +221,10 @@ def _get_target_index(self, field):

@staticmethod
def _check_row_alignment(rows):
"""We expect all the dicts in `rows` to have the same
keys and values except for the keys associated with the
messages. This function tests this assumption holds.
"""
We expect all the dicts in `rows` to have the same keys and
values except for the keys associated with the messages. This
function tests that this assumption holds.
"""
keys = set(rows[0].keys())