Lambda Function to Collect GDELT Data

This is an AWS Lambda function, built from the AWS base image, that reads the GDELT v2.0 CSVs and uses Pandas to extract the sources, URLs, and topics into parquet files, which it places in an S3 bucket.

Currently, the function creates two new parquet files in an S3 bucket every 15 minutes. Both filenames start with the most recent 15-minute UTC timestamp, formatted as 'YYYYmmDDHHMMSS'. The '-scores' file has the columns 'topics' (proper nouns parsed out of articles by GDELT), 'counts' (number of times a given topic is mentioned), 'src_counts' (number of unique sources mentioning the topic), and 'scores' (ln(counts) + 2ln(src_counts) + 1). The score is essentially the logarithm of a weighted geometric mean of article and source counts, with source counts weighted twice as heavily. The '-urls' file has the same length and row order as the '-scores' file, with two columns of lists, 'sources' and 'urls', matched by index. The aggregate counts and scores for the past 24 hours are kept in the 'day-scores' file, which is added to each time a new file arrives and subtracted from as each older set falls out of the 24-hour window.
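The naming and scoring rules above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's actual code: the function names `latest_quarter_hour_prefix`, `score`, and `update_day_counts` are hypothetical, and the rolling update uses plain dictionaries where the real function presumably works on Pandas data.

```python
import math
from datetime import datetime, timezone


def latest_quarter_hour_prefix(now: datetime) -> str:
    """Floor a timezone-aware datetime to the latest 15-minute UTC mark.

    Returns the filename prefix formatted as YYYYmmDDHHMMSS.
    """
    now = now.astimezone(timezone.utc)
    floored = now.replace(
        minute=now.minute - now.minute % 15, second=0, microsecond=0
    )
    return floored.strftime("%Y%m%d%H%M%S")


def score(count: int, src_count: int) -> float:
    """Score from the README: ln(counts) + 2*ln(src_counts) + 1."""
    return math.log(count) + 2 * math.log(src_count) + 1


def update_day_counts(day: dict, added: dict, expired: dict) -> dict:
    """Add the newest window's counts and subtract the window that aged out.

    Hypothetical helper: topics whose count drops to zero are removed.
    """
    updated = dict(day)
    for topic, n in added.items():
        updated[topic] = updated.get(topic, 0) + n
    for topic, n in expired.items():
        remaining = updated.get(topic, 0) - n
        if remaining > 0:
            updated[topic] = remaining
        else:
            updated.pop(topic, None)
    return updated
```

Because ln(counts) + 2ln(src_counts) equals ln(counts * src_counts^2), a topic mentioned 10 times by 3 sources scores ln(90) + 1, so doubling the number of sources raises the score more than doubling the number of mentions.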

Per GDELT, new CSV files are published every 15 minutes, so this function will be run via the EventBridge Scheduler every 15 minutes as well (with exponential backoff in case an update is late). At the end of each call, the data is written to DynamoDB via a batch write.
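As a rough sketch of the retry behavior described above, the following shows one way to wait with exponential backoff when the expected file has not appeared yet, plus a helper that splits rows into groups of 25, the per-request limit of DynamoDB's BatchWriteItem. The names `fetch_with_backoff` and `chunked`, the retry count, and the base delay are all assumptions for illustration; the repository may handle both steps differently.

```python
import time
from typing import Callable, Iterator, List, Optional


def fetch_with_backoff(
    fetch: Callable[[], Optional[bytes]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call fetch() until it returns data, doubling the wait after each miss.

    fetch is a caller-supplied function that returns None when the
    expected file is not available yet.
    """
    for attempt in range(max_attempts):
        data = fetch()
        if data is not None:
            return data
        # Do not sleep after the final failed attempt.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None


def chunked(items: List[dict], size: int = 25) -> Iterator[List[dict]]:
    """Yield consecutive groups of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Passing `sleep` as a parameter keeps the function easy to test without real delays. If boto3 is used for the write, its `Table.batch_writer()` context manager already groups items into batches, so a helper like `chunked` would only be needed when calling the low-level batch API directly.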

Currently using only the Global Knowledge Graph (GKG) file. Documentation found here

ajbrent/gdelt_process

Folders and files

Latest commit

History

Repository files navigation

Lambda Function to Collect GDELT Data

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages