GitHub - worldofnick/NLP-Information-Extraction: Information extraction system for government terrorist documents

Repository contents (112 commits):
developset
scoring program
.gitignore
Article.py
ExtractedInfo.py
README.txt
Truecaser.py
classifier.py
classify_incident.py
detect_org.py
detect_perp_individual.py
detect_targets.py
detect_victims.py
detect_weapons.py
extracted.templates
infoextract
killingverbs.txt
main.py
nltk_download.py
orgs.txt
pattern_matcher.py
score-ie.pl
test2.py
weapons.txt

======================================================================
TESTED ON CADE, Lab 1, Machine #19
======================================================================

How to test?
----------------------------------------------------------------------
1. Change directory (cd) to the unzipped location

1.1 Make the script executable: chmod +x infoextract

2. Run "./infoextract <input-file-location>"

3. Score with "perl score-ie.pl AGGREGATE.templates developset/answers/AGGREGATE"
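
The steps above can also be scripted. The following is a minimal sketch, not part of the repository: the helper names (make_executable, build_commands, run_pipeline) are hypothetical, and the paths passed in are whatever you would type on the command line. It assumes perl is on the PATH and that the script is run on a POSIX system.

```python
import stat
import subprocess
from pathlib import Path
from typing import List


def make_executable(script: Path) -> None:
    """Equivalent of `chmod +x`: add execute bits for user, group, and other."""
    mode = script.stat().st_mode
    script.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


def build_commands(input_file: str, templates: str, answers: str) -> List[List[str]]:
    """Return the extraction (step 2) and scoring (step 3) commands."""
    return [
        ["./infoextract", input_file],
        ["perl", "score-ie.pl", templates, answers],
    ]


def run_pipeline(repo_dir: str, input_file: str, templates: str, answers: str) -> None:
    """Run steps 1.1, 2, and 3 inside repo_dir, stopping at the first failure."""
    repo = Path(repo_dir)
    make_executable(repo / "infoextract")
    for cmd in build_commands(input_file, templates, answers):
        # check=True raises CalledProcessError if a step exits non-zero.
        subprocess.run(cmd, cwd=repo, check=True)
```

Example call, with the input path left as a placeholder: run_pipeline(".", "<input-file-location>", "AGGREGATE.templates", "developset/answers/AGGREGATE").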
======================================================================

A) Resources

NLTK: Used for tokenization (https://www.nltk.org)
spaCy: Used for NER (https://spacy.io)
Truecaser: Used to convert uppercase text to correctly cased text ** NOT CURRENTLY USED ** (https://github.com/nreimers/truecaser)
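
To illustrate what "converting uppercase text to correctly cased text" means, here is a minimal standard-library sketch. It is NOT the algorithm used by nreimers/truecaser or by Truecaser.py in this repository; it only lowercases the text, capitalizes sentence-initial words, and restores casing from a small lookup table. The names KNOWN_CASING and simple_truecase are hypothetical.

```python
import re

# Hypothetical lookup of preferred casings; a real truecaser would learn
# these from a correctly cased corpus rather than hard-code them.
KNOWN_CASING = {"nltk": "NLTK", "spacy": "spaCy", "ner": "NER"}
SENTENCE_END = {".", "!", "?"}


def simple_truecase(text: str, known: dict = KNOWN_CASING) -> str:
    """Lowercase all-caps text, then restore known casings and sentence capitals."""
    # Split into word tokens and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    out = []
    start_of_sentence = True
    for tok in tokens:
        if tok in known:
            word = known[tok]
        elif start_of_sentence and tok[0].isalpha():
            word = tok.capitalize()
        else:
            word = tok
        if tok in SENTENCE_END:
            start_of_sentence = True
        elif tok[0].isalnum():
            start_of_sentence = False
        out.append(word)
    # Re-join tokens, attaching every punctuation mark to the preceding word
    # (a simplification: opening brackets and quotes are attached too).
    return re.sub(r"\s+([^\w\s])", r"\1", " ".join(out))


print(simple_truecase("NLTK IS USED FOR TOKENIZATION. SPACY IS USED FOR NER."))
```

A lookup table like this only covers words listed in advance, which is why a statistical truecaser is the usual choice for real documents.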
----------------------------------------------------------------------

B) Time Per Article

The program processes the entire developset/texts directory in under a minute on a MacBook Pro.
----------------------------------------------------------------------

C) Contributions

Nick Porter:
File I/O, Data Pipeline, CADE Script
Text Case Correction
Incident Classification
Weapon Detection
Organization Detection
Victim Detection

Snehashish Mishra:
Data Pipeline
Organization Detection
Victim Detection
Perp Individual Detection
Target Detection
----------------------------------------------------------------------

D) Limitations

The system runs quickly, but detection accuracy is weak for some of the categories. It needs a better chunker and a sequence tagger (currently in development). The NER system for organizations also needs improvement.
