NERtwork

Overview

NERtwork is a collection of scripts to help you create a network graph of co-occurring named entities using open source tools. This is done by using spaCy to identify named entities in the documents, then using NetworkX to create a bipartite projected network, which is exportable in different formats.

Software required:

  • spaCy
  • Anaconda (strongly recommended, especially for Windows; includes the following 3 requirements, which can also be installed separately)
    • Python 3
    • Pandas
    • NetworkX
  • Cygwin (Windows only)
  • OpenRefine
  • Any network visualization software (Palladio is great for beginners or basic exploration; Gephi is recommended for more experienced users or for more extensive research projects. Anything that will accept an edge and node CSV will work, including Cytoscape, D3.js, or Flourish.)

NERtwork previously used Stanford NER—you can still view the batch script and documentation for that in the NERtwork_Stanford folder.

Process

More detailed steps are provided in the sections below.

  1. Run batchner.py over a folder of text files. It will create a spreadsheet with a count of every named entity in every document in the folder.
  2. [Optional, but strongly encouraged] Use OpenRefine or spreadsheet software to normalize, or correct and combine, entities that are the same (e.g. New York, NY, N.Y.). This will significantly improve your results.
  3. [Required if you completed step 2] Run entity-collapser.py to consolidate duplicate entities created by refinement.
  4. [Optional] Join any relevant metadata so that you may subset the collection based on archival series, author, year, or whatever other metadata is useful.
  5. Run nertwork.py to perform a bipartite network projection and save the resulting node and edge lists.
  6. Import node and edge lists into network visualization software of your choice!

batchner.py

This Python script will run the spaCy named entity recognition model (en_core_web_sm by default) on each text file in a folder, count unique entities, and write the results to a spreadsheet.

The final spreadsheet (called entities.csv) will have the text filename, the entity recognized, the type of entity (organization, location, person), and the number of times that each entity occurred as that type within the document. Note: It's possible for the same word to be tagged as more than one type of entity within a document.
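
For orientation, the core of the process can be sketched in a few lines of spaCy and standard-library code. This is a simplified illustration rather than the script itself; the output header names and the set of entity labels kept are assumptions:

```python
# A simplified sketch of the batchner idea, not the actual batchner.py;
# the CSV header names and the entity-label filter are assumptions.
from collections import Counter
from pathlib import Path
import csv

import spacy

nlp = spacy.load("en_core_web_sm")  # the default model

rows = []
for txt_file in sorted(Path(".").glob("*.txt")):
    doc = nlp(txt_file.read_text(encoding="utf-8"))
    # Count each (entity text, entity type) pair within this document;
    # limiting to person/place/organization labels is an assumption here.
    counts = Counter(
        (ent.text, ent.label_)
        for ent in doc.ents
        if ent.label_ in {"PERSON", "GPE", "LOC", "ORG"}
    )
    for (entity, ent_type), n in counts.items():
        rows.append([txt_file.name, entity, ent_type, n])

with open("entities.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["doc", "entity", "entityType", "count"])
    writer.writerows(rows)
```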

Requirements

This script only works on text (.txt) files, but it will work on as many text files as you'd like without any further interaction on your part.

By default, you will need to place batchner.py in the same folder as your texts, but you can edit the script's paths if you prefer a different layout.

Default Folder Setup

As is, the script will run spaCy on every text (.txt) file within a folder. This expects that all of the text files and the batchner.py script are all within the same folder.

Running the Script

Mac OS X

Once all of your files are properly arranged as above:

  1. Open Terminal
  2. Navigate to the folder containing these files (using $ cd) [if you have a folder 'project_folder' on the Desktop, type (without the $) $ cd Desktop/project_folder.]
  3. Type $ python batchner.py and hit enter.

This will take a bit to run (it should generally do 4-5 files per minute, depending on the length of the documents), but it will print all of the results into a file in the same folder called entities.csv.

Windows

Download and install Cygwin. Once your files are arranged as above:

  1. Open Cygwin
  2. Navigate to the folder containing these files (using $ cd) [if you have a folder 'project_folder' on the Desktop, type (without the $) $ cd /cygdrive/c/Users/YOUR-USERNAME/Desktop/project_folder, making sure to insert your Windows user name into the YOUR-USERNAME space.]
  3. Type $ python batchner.py and hit enter.

This will take a bit to run (it should generally do 4-5 files per minute, depending on the length of the documents), but it will print all of the results into a file in the same folder called entities.csv.

Refining the entity list

Requirements

  • OpenRefine (not absolutely required, but strongly recommended)
  • Python 3
  • A batchner output

Normalizing entities

You may find a number of entities that mean the same thing, but are spelled or represented differently. For example, you may know that "B. Obama" and "Barack Obama" are the same person, or that "Chicago, Illinois", "Chicago", and "Chicago, IL" all refer to the same place. You'll likely want to edit those to make them all the same. If you have a really small dataset, you may be able to edit them in spreadsheet software, but I'd strongly recommend using OpenRefine to locate potential duplicates and edit in bulk. You'll likely want to use the text facet and clustering features.

Once that is done, you'll likely have some duplicate data. For example, you may decide to normalize all variations of Chicago to just "Chicago". If "Chicago, IL" is used once in a document and "Chicago" is also used once, you will be left with two identical rows saying Chicago was used once in the document.

entity-collapser.py

This Python script will save you the time of finding all of these duplicates and adding the counts together.

  1. Save entity-collapser.py to your machine
  2. In a command line window (Terminal for Mac, Anaconda Prompt for Windows), navigate to the directory with your entity-collapser.py file and type python entity-collapser.py path/to/your/refined/batchner/file
  3. The resulting spreadsheet with merged and re-counted entities will append _refined to the end of the filename.

Thank you to Devin Higgins for creating this script.
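
For reference, the collapsing step amounts to a group-and-sum. Here is a minimal pandas sketch of the idea (not the script itself; the column names are assumptions based on the output described above):

```python
# A minimal group-and-sum sketch, not the actual entity-collapser.py;
# column names are assumptions.
import pandas as pd

df = pd.read_csv("entities.csv")  # your refined batchner file

# After normalization, "Chicago, IL" and "Chicago" both read "Chicago",
# leaving two rows for the same (doc, entity, entityType); summing the
# counts merges them into a single row.
collapsed = df.groupby(["doc", "entity", "entityType"], as_index=False)["count"].sum()
collapsed.to_csv("entities_refined.csv", index=False)
```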

Adding metadata

At this point, it may be beneficial to include any relevant metadata you have about your documents so that you can subset the collection based on archival series, author, year, or whatever other metadata is useful. This will vary greatly depending on the metadata you have, but you can look to the Fannie Lou Hamer Papers metadata join script or the Illinois Library's OpenRefine join tutorial as examples.
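
If your metadata lives in its own spreadsheet keyed by filename, a pandas merge is one way to attach it. The sketch below is purely illustrative; the metadata.csv file, its columns, and the join key are made-up examples:

```python
# A hypothetical metadata join; metadata.csv, its columns, and the key
# are illustrative, so adjust them to your own data.
import pandas as pd

entities = pd.read_csv("entities_refined.csv")
metadata = pd.read_csv("metadata.csv")  # e.g. columns: doc, series, author, year

# A left join keeps every entity row, even for documents with no metadata.
joined = entities.merge(metadata, on="doc", how="left")
joined.to_csv("entities_with_metadata.csv", index=False)
```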

The next script allows limited subsetting (explained below), but for more complex needs, you may need to create separate batchner CSV files using OpenRefine's faceting or a custom Python script.

nertwork.py

Takes a batchner output, creates a bipartite projected network of named entity co-occurrences, and saves them in your desired file format. In other words, it counts the co-occurrence of named entities within a set of documents and represents them as connected nodes in a network.

Requirements

  • Anaconda (or NetworkX, Pandas, Python)
  • A batchner output
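
To make the projection concrete, here is a rough NetworkX sketch of the idea described above. It is an illustration, not the script itself, and the column names are assumptions:

```python
# A rough illustration of a bipartite projection with NetworkX, not the
# actual nertwork.py; column names are assumptions.
import networkx as nx
from networkx.algorithms import bipartite
import pandas as pd

df = pd.read_csv("entities_refined.csv")

# Build the two-sided graph: documents on one side, entities on the other,
# with an edge wherever an entity appears in a document.
B = nx.Graph()
B.add_nodes_from(df["doc"].unique(), bipartite=0)
B.add_nodes_from(df["entity"].unique(), bipartite=1)
B.add_edges_from(df[["doc", "entity"]].itertuples(index=False, name=None))

# Project onto the entity side: entities are connected when they share a
# document, and each edge's weight counts how many documents they share.
G = bipartite.weighted_projected_graph(B, df["entity"].unique())
```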

Options

  • -i [REQUIRED] sets the location of the batchner file you want to construct a network from
  • -proj_name [recommended] allows you to give your project a name that will be at the beginning of the filename. (default = nertwork)
  • -subcol [optional] specifies the column you want to use to create a subset. This is case-sensitive. If you use subcol, you must also use subname. (default = none)
  • -subname [optional; required if subcol used] searches in the subcol for a character, word or phrase to create a subset of the data. If your search term includes spaces or special characters put quotes or double quotes on each side of the term. The search string is also included in the filename, if used. (default = none)
  • -entity [optional] allows you to limit your network to only specific types of entities (default = none)
    • none will create a network including all types of entities,
    • person, location, or organization will create a network of only that type of entity
    • all will create 4 networks: all types, person, location, and organization.
  • -minweight [optional] will filter out entity relationships with weights below a specified number. This filter is applied at the end of the process, so projection is done based on the complete list, and the output is filtered by the projected weight. This can be helpful in really large graphs. (default = 0)
  • -out [optional] will determine what type of output files you want (default = csv; see the sketch after this list)
    • csv creates two CSVs: an edge list and a node list
    • gexf creates a GEXF (Graph Exchange XML Format) file
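
To make the last two options concrete, here is roughly what weight filtering and GEXF output look like in NetworkX, continuing from the projection sketch above (the threshold value is just an example):

```python
# Continuing from the projected graph G above (illustrative): roughly what
# -minweight and -out do, with the threshold made up for the example.
import networkx as nx

minweight = 5  # e.g. -minweight=5

# Drop edges whose projected weight falls below the threshold, then any
# nodes left without edges.
weak = [(u, v) for u, v, d in G.edges(data=True) if d["weight"] < minweight]
G.remove_edges_from(weak)
G.remove_nodes_from(list(nx.isolates(G)))

nx.write_gexf(G, "nertwork.gexf")  # the -out=gexf case
```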

Subsetting

You can create networks from subsets of the data using subcol to select a column and subname to facet on characters in that column. Although this will technically work on any column, this is likely not helpful unless you have joined additional data, or want to use a segment of information from the doc name.

  • subcol must perfectly match (including case) the name of the column.
  • subname is a case-insensitive literal string search that only needs to match on part of a field in a dataset. (Ex: -subname=correspondence will match 'Correspondence: Letter to J. Smith' as well as 'Memo: Regarding earlier correspondence')
  • If your term has spaces or special characters, add single or double quotes around your argument (note: some special characters will still cause problems, and you may need to edit your metadata or use another method for subsetting)
  • Your search term will be stripped of spaces and special characters (if any) and added to the final filename.
  • Warning: If the data in the subcol column is not a string, it will be converted to a string for matching. This also means that, for the time being at least, you cannot search for numbers above/below a point or dates before/after a date.
  • If you want to facet on multiple terms or across multiple columns, need to match special characters, or need to facet on something other than a string, I recommend using spreadsheet software, OpenRefine, or Python to create alternative subset data (see the sketch below for a starting point).
  • Subsetting is a new feature and has not been tested very much. If you've run into a problem that isn't described above, or have a recommended solution to improve this, please file an issue.
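
If you do end up pre-building a subset yourself, a pandas filter mirrors the matching behavior described above. This is a hedged sketch; the filename, column, and search term are examples, and it approximates rather than reproduces nertwork.py's internals:

```python
# A hypothetical pre-built subset (illustrative); the filename, column, and
# search term are made up.
import pandas as pd

df = pd.read_csv("entities_with_metadata.csv")

# Case-insensitive literal substring match on a stringified column, mirroring
# the -subcol/-subname behavior described above.
subset = df[df["series"].astype(str).str.contains("correspondence", case=False, regex=False)]
subset.to_csv("entities_correspondence.csv", index=False)
```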

Instructions

  1. Save the nertwork.py file to your machine
  2. In a command line window (Terminal for Mac, Anaconda Prompt for Windows), navigate to the directory with your nertwork.py file and type python nertwork.py -i=path/to/your/batchner/file, plus any other flags you may want

Examples

  • python nertwork.py -i=data/my_dataset.csv -proj_name=my_dataset -entity=person -minweight=5
    • This will run on a file named "my_dataset.csv" in the data folder within your current directory, add a "my_dataset_" prefix to every file it creates, build the network on shared person names only, and save only nodes and edges with a weight of at least 5.
  • python nertwork.py -i=https://gist.githubusercontent.com/brandontlocke/f568ffdeaf7fdb17e27872916645591b/raw/f1a28d05ce2a4cdbbe11c09404cdf83c7a5fb1c1/inaugural-batchner -proj_name=inauguralspeeches
    • This will take a batchner output file from GitHub, prefix every file with 'inauguralspeeches_', and include all entities. This file is available on the web, so feel free to run tests with this command.