Code to pre-process dialogue corpora, identify referring expressions (REs), and categorize the REs that occur.
Note that these preprocessing steps assume that you have another directory, data, in the same directory as ReferringExpressions.
Specifically, you should have downloaded CallHome and Switchboard into the following paths:
../data/callhome_english
../data/swda
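Before running anything, it can help to confirm that both corpus directories are where the scripts expect them. The following is a minimal sketch rather than part of the repository: the helper name missing_corpora is hypothetical, and it assumes the two relative paths listed above are resolved from the repository root.

```python
from pathlib import Path

# Relative paths copied from the instructions above.
EXPECTED_DIRS = ("../data/callhome_english", "../data/swda")


def missing_corpora(repo_root="."):
    """Return the expected corpus directories that do not exist.

    Hypothetical helper: repo_root is assumed to be the ReferringExpressions
    directory, so each path is resolved relative to it.
    """
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


def report_missing(repo_root="."):
    """Print one line per missing corpus directory."""
    for directory in missing_corpora(repo_root):
        print(f"Missing corpus directory: {directory}")
```

Calling report_missing() from the repository root prints nothing when both corpora are in place.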
Link to download CallHome
Link to download Switchboard
Switchboard has already been separated into .csv files with tags for NPs and dialogue acts, so the preprocessing just involves concatenating these files and incorporating metadata.
python src/data/preprocess_switchboard.py
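To illustrate what "concatenating these files and incorporating metadata" can look like, here is a rough sketch using only the standard library. It is not the code in preprocess_switchboard.py: the function name, the source_file column, and the shape of the metadata mapping (file stem to a dict of extra columns) are all assumptions made for the example.

```python
import csv
from pathlib import Path


def concatenate_csvs(csv_paths, metadata_by_file):
    """Merge rows from several CSV files, adding per-file metadata columns.

    metadata_by_file is a hypothetical mapping from a file's stem
    (name without extension) to a dict of extra column values.
    """
    rows = []
    for path in csv_paths:
        path = Path(path)
        extra = metadata_by_file.get(path.stem, {})
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Keep the original columns, record the source file,
                # then attach that file's metadata.
                rows.append({**row, "source_file": path.stem, **extra})
    return rows
```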
The CallHome corpus is in a raw text format, so preprocessing includes parsing the transcripts into separate turns and writing them to a .csv file.
python src/data/preprocess_callhome.py
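The parsing step can be sketched as follows. This is an illustration, not the logic of preprocess_callhome.py: it assumes each transcript line has the form "start end speaker: text" (for example "1.50 3.20 A: hello there"), treats every such line as one turn, and skips anything else, such as comment or blank lines. The function and column names are hypothetical.

```python
import csv
import io
import re

# Assumed line format: "<start> <end> <speaker>: <text>"
TURN_PATTERN = re.compile(r"^(\d+\.\d+)\s+(\d+\.\d+)\s+(\w+):\s*(.*)$")
FIELDNAMES = ["start", "end", "speaker", "text"]


def parse_turns(lines):
    """Convert raw transcript lines into a list of turn dicts."""
    turns = []
    for line in lines:
        match = TURN_PATTERN.match(line.strip())
        if match is None:
            continue  # comment, blank, or otherwise unrecognized line
        start, end, speaker, text = match.groups()
        turns.append(
            {"start": float(start), "end": float(end), "speaker": speaker, "text": text}
        )
    return turns


def turns_to_csv(turns):
    """Serialize parsed turns as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(turns)
    return buffer.getvalue()
```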
Now you can use spaCy to identify noun chunks in each turn. Substitute either switchboard or callhome for {DATASET_NAME}.
python src/features/identify_np.py --dataset={DATASET_NAME}
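For orientation, noun-chunk extraction with spaCy can look roughly like the sketch below. It is not the contents of identify_np.py. The helper names and output columns are hypothetical, and the model name en_core_web_sm is an assumption; any English pipeline with a dependency parser should expose doc.noun_chunks. The pipeline is passed in as an argument so the extraction function does not depend on a particular model.

```python
def load_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline (assumed model name; must be installed separately)."""
    import spacy  # imported lazily so the helper below has no hard dependency

    return spacy.load(model_name)


def noun_chunks_per_turn(nlp, turns):
    """Run nlp over each turn and collect its noun chunks.

    nlp is any callable returning an object with a noun_chunks iterable,
    such as a loaded spaCy pipeline.
    """
    results = []
    for turn_id, text in enumerate(turns):
        doc = nlp(text)
        for chunk in doc.noun_chunks:
            results.append(
                {
                    "turn_id": turn_id,
                    "np": chunk.text,
                    "start_char": chunk.start_char,
                    "end_char": chunk.end_char,
                }
            )
    return results
```

With spaCy and a model installed, a call such as noun_chunks_per_turn(load_pipeline(), ["I saw the big dog"]) returns one row per noun chunk found in each turn.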
You might also want to know how well spaCy performs. Because the Switchboard files already carry NP tags, they can serve as a reference for comparison. Run:
python src/features/evaluate_spacy_tags.py
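One common way to score this kind of comparison is exact-match precision, recall, and F1 over NP spans. The sketch below shows that calculation under the assumption that both the spaCy output and the Switchboard tags can be reduced to hashable span identifiers such as (start, end) tuples; it is not necessarily the metric evaluate_spacy_tags.py reports.

```python
def span_scores(predicted, gold):
    """Return (precision, recall, f1) for exact-match spans.

    predicted and gold are iterables of hashable span identifiers,
    for example (start, end) tuples. Empty inputs score 0.0.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```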
Once you've identified each NP, you can further categorize it into one of the pre-specified bins (e.g. full NP, PRP_3rd, etc.).
python src/features/analyze_nps.py --dataset={DATASET_NAME}
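As a rough illustration of binning, the sketch below assigns an NP to a pronoun bin when it is a single pronoun and to "full NP" otherwise. Only the labels full NP and PRP_3rd come from the description above; PRP_1st, PRP_2nd, the word lists, and the single-token rule are assumptions for the example and may not match the bins used by analyze_nps.py.

```python
# Hypothetical pronoun lists; the repository's actual bins may differ.
FIRST_PERSON = {"i", "me", "we", "us", "myself", "ourselves"}
SECOND_PERSON = {"you", "yourself", "yourselves"}
THIRD_PERSON = {
    "he", "him", "she", "her", "it", "they", "them",
    "himself", "herself", "itself", "themselves",
}


def categorize_np(np_text):
    """Assign an NP string to a coarse bin based on its surface form."""
    tokens = np_text.lower().split()
    if len(tokens) == 1:
        word = tokens[0]
        if word in FIRST_PERSON:
            return "PRP_1st"
        if word in SECOND_PERSON:
            return "PRP_2nd"
        if word in THIRD_PERSON:
            return "PRP_3rd"
    # Anything that is not a lone pronoun falls into the default bin.
    return "full NP"
```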
You can also run a separate script to identify the length of each NP.
python src/features/analyze_lengths.py --dataset={DATASET_NAME}
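Length can be measured in several ways. The sketch below counts whitespace-separated tokens and averages them, which is an assumption made for illustration; analyze_lengths.py may use a different unit, such as spaCy tokens or characters.

```python
def np_length(np_text):
    """Length of an NP in whitespace-separated tokens (assumed unit)."""
    return len(np_text.split())


def mean_np_length(nps):
    """Average token length over a list of NP strings; 0.0 if the list is empty."""
    lengths = [np_length(np_text) for np_text in nps]
    return sum(lengths) / len(lengths) if lengths else 0.0
```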
Finally, you can produce a report by knitting the .Rmd file at src/reports/analysis_report.Rmd.