
Discourse Relations extracted from BBC News Corpus for training a shallow discourse parser

BBC Discourse Relations

In this repository, we provide files that contain information about the processed BBC News Corpus and its extracted discourse relations. Because the uploaded data is anonymized, we also provide scripts both for producing these files and for reconstructing the original extracted relations. If you make use of these datasets, please consider citing the publication:

R. Knaebel and M. Stede. "Semi-Supervised Tri-Training for Explicit Discourse Argument Expansion". In Proceedings of LREC 2020.

Create BBC Corpus

For corpus preparation, we refer to the make_corpus.py script. It takes the path to one of the downloaded raw BBC corpora and writes all information into a single JSON file. The format is comparable to the CoNLL-2016 shared task format. Corpus links:

python3 make_corpus.py CORPUS_PATH JSON_PATH.json
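To illustrate the kind of conversion make_corpus.py performs, here is a minimal sketch that turns one raw article into a CoNLL-2016-style document entry with per-sentence token lists and character offsets. The function name `make_document`, the sentence splitter, and the exact field layout are assumptions for illustration, not the repository's actual implementation.

```python
import json
import re

def make_document(doc_id, text):
    """Build a CoNLL-2016-style document entry (hypothetical layout):
    sentences as lists of [surface, char_begin, char_end] token triples."""
    sentences = []
    offset = 0
    for sent in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = []
        for m in re.finditer(r"\S+", sent):
            # locate the token in the full text to get document-level offsets
            begin = text.index(m.group(), offset)
            end = begin + len(m.group())
            tokens.append([m.group(), begin, end])
            offset = end
        sentences.append(tokens)
    return {"DocID": doc_id, "sentences": sentences}

doc = make_document("bbc-001", "Prices rose. Markets fell sharply.")
print(json.dumps(doc))
```

The real script additionally walks the corpus directory and writes one JSON file covering all documents, as in the command above.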

Dehydrate

For removing textual information, we use the dehydrate.py script. It writes a flattened JSON structure that contains only the TokenList entries and the corresponding document id.

python3 dehydrate.py RELATIONS_PATH > RELATION_ID.json
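The core of dehydration can be sketched as stripping every surface-text field from a relation while keeping token positions and the document id. The exact key names (`DocID`, `Arg1`, `Arg2`, `Connective`, `TokenList`, `RawText`) follow the CoNLL-2016 relation layout; treat them as assumptions about this repository's files.

```python
import json

def dehydrate(relation):
    """Keep only token positions and the document id; drop surface text.
    Key names follow the CoNLL-2016 relation layout (assumed here)."""
    slim = {"DocID": relation["DocID"], "ID": relation.get("ID")}
    for part in ("Arg1", "Arg2", "Connective"):
        # copy TokenList only; RawText (and any other text) is discarded
        slim[part] = {"TokenList": relation[part]["TokenList"]}
    return slim

rel = {
    "DocID": "bbc-001", "ID": 0, "Sense": ["Contingency.Cause"],
    "Arg1": {"RawText": "Prices rose", "TokenList": [0, 1]},
    "Arg2": {"RawText": "markets fell", "TokenList": [3, 4]},
    "Connective": {"RawText": "so", "TokenList": [2]},
}
slim = dehydrate(rel)
print(json.dumps(slim))
```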

Hydrate

For the reverse conversion, we use the hydrate.py script. It combines the extracted TokenLists with the corpus file and thus reconstructs the original extraction.

python3 hydrate.py JSON_PATH.json RELATION_ID.json > RELATION_FULL.json
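Conceptually, hydration re-attaches surface text by looking each TokenList index up in the document's token sequence from the corpus file. The sketch below assumes the corpus maps a DocID to a flat token list; the actual corpus JSON is richer, so this is an illustration of the idea, not the script itself.

```python
def hydrate(slim, corpus):
    """Re-attach surface text to a dehydrated relation using the corpus
    token sequence for its document (key names are assumptions)."""
    tokens = corpus[slim["DocID"]]  # flat token list of the document
    full = {"DocID": slim["DocID"]}
    for part in ("Arg1", "Arg2", "Connective"):
        idx = slim[part]["TokenList"]
        full[part] = {"TokenList": idx,
                      "RawText": " ".join(tokens[i] for i in idx)}
    return full

corpus = {"bbc-001": ["Prices", "rose", "so", "markets", "fell"]}
slim = {"DocID": "bbc-001",
        "Arg1": {"TokenList": [0, 1]},
        "Arg2": {"TokenList": [3, 4]},
        "Connective": {"TokenList": [2]}}
full = hydrate(slim, corpus)
print(full["Arg1"]["RawText"])
```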
