
Storywrangler API

The Storywrangler project is a natural language processing instrument designed to carry out an ongoing, day-scale curation of over 100 billion tweets containing roughly 1 trillion 1-grams from 2008 to 2021. For each day, we break tweets into unigrams, bigrams, and trigrams spanning over 100 languages. We track ngram usage frequencies and generate Zipf distributions for words, hashtags, handles, numerals, symbols, and emojis. We make the data set available through an interactive time series viewer, and as downloadable time series and daily distributions. Although Storywrangler leverages Twitter data, our method of extracting and tracking dynamic changes of ngrams can be extended to any similar social media platform. We showcase a few examples of the many possible avenues of study we aim to enable, including how social amplification can be visualized through contagiograms. The project is intended to enable or enhance the study of any large-scale temporal phenomena where people matter, including culture, politics, economics, linguistics, public health, conflict, climate change, and data journalism.

All ngram timeseries are stored and served on Hydra, a server at the Vermont Complex Systems Center. Further details about our backend infrastructure and our Twitter stream processing framework can be found on our Gitlab repository.

If you can connect to the UVM VPN at sslvpn2.uvm.edu using your UVM credentials, then you can access our database using this Python module. Unfortunately, you cannot use this package without being connected to the UVM network for the time being. We do hope to have a workaround eventually; in the meantime, if you would like to use our ngrams dataset in your research, we provide an easy way to download daily ngram timeseries as JSON files via our web service.
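
As a rough illustration, pulling one of those daily JSON timeseries from the web service might look like the sketch below. The endpoint path and query parameter here are placeholders, not a documented API; consult the web service itself for the actual URL scheme.

import requests

# Placeholder endpoint -- substitute the actual URL exposed by the web service.
url = "https://storywrangling.org/api/ngrams/virus"  # hypothetical path
resp = requests.get(url, params={"language": "en"})  # hypothetical parameter
resp.raise_for_status()
timeseries = resp.json()  # daily ngram timeseries as JSON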

If there is a large subset of ngrams you would like from our database, please send us an email.

You can install the latest version by cloning the repo and running the setup.py script in your terminal:

git clone https://gitlab.com/compstorylab/storywrangling.git
cd storywrangling
python setup.py install

To work on the package itself, install it in development mode instead:

git clone https://gitlab.com/compstorylab/storywrangling.git
cd storywrangling
python setup.py develop

Import our library and create an instance of the Storywrangler() class.

from datetime import datetime
from storywrangling import Storywrangler

storywrangler = Storywrangler()

The Storywrangler() class provides a set of methods to access our database. We outline some of the main methods below.

Please ensure you are connected to the UVM VPN to bypass the university firewall.

You can get a dataframe of usage rates for a single ngram timeseries by using the get_ngram() method.

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| ngram | str | required | target 1-, 2-, or 3-gram |
| lang | str | "en" | target language (ISO code) |
| start_time | datetime | datetime(2010, 1, 1) | starting date for the query |
| end_time | datetime | last_updated | ending date for the query |

See ngrams_languages.json for a list of all supported languages.

Example code

ngram = storywrangler.get_ngram(
  "Black Lives Matter",
  lang="en",
  start_time=datetime(2010, 1, 1),
  end_time=datetime(2020, 1, 1),
)

Expected output

A single Pandas dataframe (see ngram_example.tsv).

| Column | Description |
| --- | --- |
| time | Pandas DatetimeIndex |
| count | usage rate in all tweets (AT) |
| count_no_rt | usage rate in original tweets (OT) |
| freq | normalized frequency in all tweets (AT) |
| freq_no_rt | normalized frequency in original tweets (OT) |
| rank | usage tied-rank in all tweets (AT) |
| rank_no_rt | usage tied-rank in original tweets (OT) |
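
For instance, here is a minimal sketch of plotting the returned timeseries, assuming the dataframe is indexed by the time column described above:

import matplotlib.pyplot as plt

# Plot the normalized frequency in all tweets (AT) over time.
ngram["freq"].plot(logy=True, title="Black Lives Matter")
plt.ylabel("normalized frequency")
plt.show()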

If you have a list of ngrams, then you can use the get_ngrams_array() method to retrieve a dataframe of usage rates in a single language.

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| ngrams_list | list | required | a list of 1-, 2-, or 3-grams |
| lang | str | "en" | target language (ISO code) |
| start_time | datetime | datetime(2010, 1, 1) | starting date for the query |
| end_time | datetime | last_updated | ending date for the query |

Example code

ngrams = ["Higgs", "#AlphaGo", "CRISPR", "#AI", "LIGO"]
ngrams_df = storywrangler.get_ngrams_array(
  ngrams,
  lang="en",
  start_time=datetime(2010, 1, 1),
  end_time=datetime(2020, 1, 1),
)

All ngrams should be in one language and one database collection.

Expected output

A single Pandas dataframe (see ngrams_array_example.tsv).

| Column | Description |
| --- | --- |
| time | Pandas DatetimeIndex |
| ngram | requested ngram |
| count | usage rate in all tweets (AT) |
| count_no_rt | usage rate in original tweets (OT) |
| freq | normalized frequency in all tweets (AT) |
| freq_no_rt | normalized frequency in original tweets (OT) |
| rank | usage tied-rank in all tweets (AT) |
| rank_no_rt | usage tied-rank in original tweets (OT) |
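
The dataframe comes back in long form, with one row per (time, ngram) pair. A short sketch of pivoting it into one frequency column per ngram, assuming the time column described above serves as the index:

# Pivot the long-form result into one column of frequencies per ngram.
wide = ngrams_df.reset_index().pivot(index="time", columns="ngram", values="freq")
print(wide.head())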

To request a list of ngrams across several languages, you can use the get_ngrams_tuples() method.

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| ngrams_list | list(tuples) | required | a list of ("ngram", "iso-code") tuples |
| start_time | datetime | datetime(2010, 1, 1) | starting date for the query |
| end_time | datetime | last_updated | ending date for the query |

Example code

examples = [
  ('😊', '_all'),
  ('2018', '_all'),
  ('Christmas', 'en'),
  ('Pasqua', 'it'),
  ('eleição', 'pt'),
  ('sommar', 'sv'),
  ('Olympics', 'en'),
  ('World Cup', 'en'),
  ('#AlphaGo', 'en'),
  ('gravitational waves', 'en'),
  ('black hole', 'en'),
  ('Papa Francesco', 'it'),
  ('coronavirus', 'en'),
  ('Libye', 'fr'),
  ('Suriye', 'tr'),
  ('Росія', 'uk'),
  ('ثورة', 'ar'),
  ('Occupy', 'en'),
  ('Black Lives Matter', 'en'),
  ('Brexit', 'en'),
  ('#MeToo', 'en'),
]
ngrams_array = storywrangler.get_ngrams_tuples(
  examples,
  start_time=datetime(2010, 1, 1),
  end_time=datetime(2020, 1, 1),
)

Expected output

A single Pandas dataframe (see ngrams_multilang_example.tsv).

| Column | Description |
| --- | --- |
| time | Pandas DatetimeIndex |
| ngram | requested ngram |
| lang | requested language |
| count | usage rate in all tweets (AT) |
| count_no_rt | usage rate in original tweets (OT) |
| freq | normalized frequency in all tweets (AT) |
| freq_no_rt | normalized frequency in original tweets (OT) |
| rank | usage tied-rank in all tweets (AT) |
| rank_no_rt | usage tied-rank in original tweets (OT) |
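
Since every (ngram, lang) pair is stacked in one dataframe, here is a quick sketch of finding the date on which each series peaked, again assuming a time index as described above:

# Date of peak normalized frequency for each (ngram, lang) pair.
peaks = ngrams_array.groupby(["ngram", "lang"])["freq"].idxmax()
print(peaks)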

To get a timeseries of usage rates for a given language, you can use the get_lang() method.

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| lang | str | "_all" | target language (ISO code) |
| start_time | datetime | datetime(2010, 1, 1) | starting date for the query |
| end_time | datetime | last_updated | ending date for the query |

See supported_languages.json for a list of all supported languages.

Example code

lang = storywrangler.get_lang(
    "en",
    start_time=datetime(2010, 1, 1),
)

Expected output

A single Pandas dataframe (see lang_example.tsv).

| Column | Description |
| --- | --- |
| time | Pandas DatetimeIndex |
| count | usage rate of all tweets (AT) |
| count_no_rt | usage rate of original tweets (OT) |
| freq | normalized frequency of all tweets (AT) |
| freq_no_rt | normalized frequency of original tweets (OT) |
| rank | usage tied-rank of all tweets (AT) |
| rank_no_rt | usage tied-rank of original tweets (OT) |
| num_1grams | volume of 1-grams in all tweets (AT) |
| num_1grams_no_rt | volume of 1-grams in original tweets (OT) |
| num_2grams | volume of 2-grams in all tweets (AT) |
| num_2grams_no_rt | volume of 2-grams in original tweets (OT) |
| num_3grams | volume of 3-grams in all tweets (AT) |
| num_3grams_no_rt | volume of 3-grams in original tweets (OT) |
| unique_1grams | number of unique 1-grams in all tweets (AT) |
| unique_1grams_no_rt | number of unique 1-grams in original tweets (OT) |
| unique_2grams | number of unique 2-grams in all tweets (AT) |
| unique_2grams_no_rt | number of unique 2-grams in original tweets (OT) |
| unique_3grams | number of unique 3-grams in all tweets (AT) |
| unique_3grams_no_rt | number of unique 3-grams in original tweets (OT) |
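
Because the dataframe carries parallel AT and OT columns, one simple derived quantity is the daily share of 1-gram volume contributed by retweets. A sketch using the columns above:

# Fraction of daily 1-gram volume contributed by retweets.
rt_share = 1 - lang["num_1grams_no_rt"] / lang["num_1grams"]
print(rt_share.tail())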

To get the Zipf distribution of all ngrams in our database for a given language on a single day, please use the get_zipf_dist() method:

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| date | datetime | required | target date |
| lang | str | "en" | target language (ISO code) |
| ngrams | str | "1grams" | target database collection |
| max_rank | int | None | max rank cutoff (optional) |
| min_count | int | None | min count cutoff (optional) |
| top_n | int | None | limit results to the top N ngrams, applied after the query (optional) |
| rt | bool | True | if True, apply cutoffs using all-tweet (AT) statistics; if False, use original tweets (OT, w/out RTs) |
| ngram_filter | str | None | regex filter applied to results (optional, see below) |

Example code

ngrams_zipf = storywrangler.get_zipf_dist(
  date=datetime(2010, 1, 1),
  lang="en",
  ngrams="1grams",
  max_rank=1000,
  rt=False
)

Expected output

A single Pandas dataframe (see ngrams_zipf_example.tsv).

| Column | Description |
| --- | --- |
| ngram | requested ngram |
| count | usage rate in all tweets (AT) |
| count_no_rt | usage rate in original tweets (OT) |
| freq | normalized frequency in all tweets (AT) |
| freq_no_rt | normalized frequency in original tweets (OT) |
| rank | usage tied-rank in all tweets (AT) |
| rank_no_rt | usage tied-rank in original tweets (OT) |
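
A standard check on this output is the rank-frequency (Zipf) plot on log-log axes. A minimal sketch using the rank and freq columns described above:

import matplotlib.pyplot as plt

# Rank-frequency (Zipf) plot for the requested day's 1-grams.
plt.loglog(ngrams_zipf["rank"], ngrams_zipf["freq"], marker=".", linestyle="none")
plt.xlabel("rank")
plt.ylabel("normalized frequency")
plt.show()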

To get a list of the narratively dominant ngrams of a given day (English by default), as compared to the year before, please use the get_divergence() method. Each ngram is ranked daily by its 1-year rank-divergence (with α = 1/4), computed using our Allotaxonometry and rank-turbulence divergence instrument.

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| date | datetime | required | target date |
| lang | str | "en" | target language (ISO code) |
| ngrams | str | "1grams" | target database collection |
| max_rank | int | None | max rank cutoff (optional) |
| rt | bool | True | if True, apply cutoffs using all-tweet (AT) statistics; if False, use original tweets (OT, w/out RTs) |

Example code

ngrams = storywrangler.get_divergence(
    date=datetime(2010, 1, 1),
    lang="en",
    ngrams="1grams",
    max_rank=1000,
    rt=True
)

Expected output

A single Pandas dataframe (see ngrams_divergence_example.tsv).

| Column | Description |
| --- | --- |
| ngram | requested ngram |
| rd_contribution | rank-turbulence divergence (RTD) contribution in all tweets (AT) |
| rd_contribution_no_rt | RTD contribution in original tweets (OT) |
| normed_rd | normalized RTD in all tweets (AT) |
| normed_rd_no_rt | normalized RTD in original tweets (OT) |
| time_1 | reference date |
| rank_1 | usage rank at reference date in all tweets (AT) |
| rank_1_no_rt | usage rank at reference date in original tweets (OT) |
| time_2 | current date |
| rank_2 | usage rank at current date in all tweets (AT) |
| rank_2_no_rt | usage rank at current date in original tweets (OT) |
| rank_change | new rank relative to trending ngrams in all tweets (AT) |
| rank_change_no_rt | new rank relative to trending ngrams in original tweets (OT) |
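
To surface the most narratively dominant ngrams of the day, you can sort by the RTD contribution column (the result may already come back in this order). A short sketch using the columns above:

# Top 10 ngrams by rank-turbulence divergence contribution (all tweets).
top10 = ngrams.sort_values("rd_contribution", ascending=False).head(10)
print(top10[["rd_contribution", "rank_1", "rank_2"]])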

Ngram filters ensure that results for the daily Zipf distribution and rank-divergence queries include only the specified types of ngrams. All filters are applied using Mongo regex operations.

Filters are supported on the get_zipf_dist() and get_divergence() methods.

There are two types of regex queries: inclusionary and exclusionary. An inclusionary filter matches ngrams against a standard Mongo regex query, {"$regex": <regex pattern>}, whereas an exclusionary filter excludes the regex matches using {"$not": {"$regex": <regex pattern>}}.

For inclusionary queries on ngrams of order n > 1, the regex is dynamically resized so that every 1-gram in the result must match the query. For example, handle-filtered 3-gram queries will filter through the regex ^(@\S+) (@\S+) (@\S+)$.
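
A sketch of that resizing logic, assuming a plain space-join of the 1-gram pattern (the actual implementation in the repo may differ):

# Expand a 1-gram pattern so that every word of an n-gram must match it.
def expand_filter(core: str, n: int) -> str:
    return "^" + " ".join([core] * n) + "$"

print(expand_filter(r"(@\S+)", 3))  # ^(@\S+) (@\S+) (@\S+)$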

The handle and hashtag filters do not enforce strictly valid Twitter handles or hashtags; rather, they match handle- and hashtag-like strings.

Ranks and frequencies are not adjusted to account for the filtered Zipf distributions; i.e., the rank and freq columns are calculated from the original, unfiltered data. Setting max_rank will therefore yield somewhat arbitrary results; use top_n to select the top N ngrams of the filtered results.

| Filter | Description (1-gram pattern) |
| --- | --- |
| handles | include only handle-like strings: `^(@\S+)` |
| hashtags | include only hashtag-like strings: `^(#\S+)` |
| handles_hashtags | include only handle- and hashtag-like strings: `^([@\|#]\S+)` |
| no_handles_hashtags | include only strings that are not handle- or hashtag-like: `^(?<![@#])(\b[\S]+)` |
| latin | include only Latin characters, with hyphens and apostrophes: `^([A-Za-z0-9]+[\‘\’\'\-]?[A-Za-z0-9]+)$` |
| no_punc | exclude punctuation: `([!…”“\"#@$%&'\(\)\*\+\,\-\.\/\:\;<\=>?@\[\]\^_{\|}~]+)` |

Example code

ngrams_zipf = storywrangler.get_zipf_dist(
  date=datetime(2010, 1, 1),
  lang="en",
  ngrams="1grams",
  max_rank=1000, # pull from 1grams ranked in top 1000 of unfiltered data
  ngram_filter='latin',
  top_n=10, # limit results to top 10 1grams in filtered data
  rt=False
)

In addition to our historical daily ngrams database, we provide a 15-minute resolution data stream covering the past 30 days:

| Language | ISO | Language | ISO | Language | ISO |
| --- | --- | --- | --- | --- | --- |
| English | en | Spanish | es | Portuguese | pt |
| Arabic | ar | Korean | ko | French | fr |

To use our realtime stream, create an instance of the Realtime() class.

from datetime import datetime
from storywrangling import Realtime

api = Realtime()

The Realtime() class provides a set of methods similar to those found in the Storywrangler() class.

You can get a dataframe of usage rates for a single n-gram timeseries by using the get_ngram() method.

Example code

ngram = api.get_ngram("virus", lang="en")

If you have a list of n-grams, then you can use the get_ngrams_array() method to retrieve a dataframe of usage rates in a single language.

Example code

ngrams = ["the pandemic", "next hour", "new cases", "😭 😭", "used to"]
ngrams_df = api.get_ngrams_array(ngrams_list=ngrams, lang="en")

To request a list of n-grams across several languages, you can use the get_ngrams_tuples() method.

Example code

examples = [
    ('covid19', 'en'),
    ('cuarentena', 'es'),
    ('quarentena', 'pt'),
    ('فيروس', 'ar'),
    ('#BTS', 'ko'),
    ('Brexit', 'fr'),
    ('virus', 'id'),
    ('Suriye', 'tr'),
    ('coronavirus', 'hi'),
    ('Flüchtling', 'de'),
    ('Pasqua', 'it'),
    ('карантин', 'ru'),
]
ngrams_array = api.get_ngrams_tuples(examples)

To get the Zipf distribution for a given 15-minute batch, please use the get_zipf_dist() method:

Example code

ngrams_zipf = api.get_zipf_dist(
  dtime=None,  # datetime(Y, m, d, H, M)
  lang="en",
  ngrams='1grams',
  max_rank=None,
  min_count=None,
  rt=True
)

See the following paper for more details, and please cite it if you use our dataset:

Alshaabi, T., Adams, J. L., Arnold, M. V., Minot, J. R., Dewhurst, D. R., Reagan, A. J., Danforth, C. M., & Dodds, P. S. Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter. Science Advances (2021).

For more information regarding our tweet language identification and detection framework, please see the following paper:

Alshaabi, T., Dewhurst, D. R., Minot, J. R., Arnold, M. V., Adams, J. L., Danforth, C. M., & Dodds, P. S. The growing amplification of social media: Measuring temporal and social contagion dynamics for over 150 languages on Twitter for 2009–2020. EPJ Data Science (2021).