Article Analysis Assistant
This program builds a network of article references and connects authors and keywords; such a network is usually called a "citation graph".
Various software packages and online systems exist for this purpose; a brief review of them can be found here.
This tool lets you create a graph of articles and analyze it. It is designed as a CLI (command-line interface) and can also be used as a Python library.
Clone repository:
git clone https://github.com/EhsanBitaraf/triple-a.git
or
git clone git@github.com:EhsanBitaraf/triple-a.git
Create a virtual environment:
python -m venv venv
Activate the virtual environment:
Windows
.\venv\Scripts\activate
Linux
$ source venv/bin/activate
Install poetry:
pip install poetry
Install dependencies:
poetry install
Run the CLI:
poetry run python triplea/cli/aaa.py
Get the list of PMIDs for a search term and store them in state 0:
term = '("Electronic Health Records"[Mesh]) AND ("National"[Title/Abstract]) AND Iran'
get_article_list_all_store_to_kg_rep(term)
Move articles forward from state 1:
move_state_forward(1)
Get a list of PMIDs in state 0 and save it to a file for debugging:
import json

data = get_article_list_from_pubmed(1, 10,'("Electronic Health Records"[Mesh]) AND ("National"[Title/Abstract])')
# Alternative, broader query; this call overrides the result above
data = get_article_list_from_pubmed(1, 10,'"Electronic Health Records"')
data1 = json.dumps(data, indent=4)
with open("sample1.json", "w") as outfile:
    outfile.write(data1)
Open the previously saved file for debugging:
# The context manager closes the file automatically
with open('sample1.json') as f:
    data = json.load(f)
Get one article from the knowledge repository and save it to a file:
data = get_article_by_pmid('32434767')
data = json.dumps(data, indent=4)
with open("one-article.json", "w") as outfile:
    outfile.write(data)
Save titles for annotation:
with open("article-title.txt", "w", encoding="utf-8") as file:
    la = get_article_by_state(2)
    for a in la:
        try:
            article = Article(**a.copy())
        except Exception:
            # Skip records that cannot be parsed into an Article
            continue
        file.write(article.Title + "\n")
You can use NLP (Natural Language Processing) methods to extract information from the structure of the article and add it to your graph. For example, you can extract NER (named-entity recognition) terms from the title of the article and add them to the graph. Here's how to create a custom NER.
The following command shows the overall command help. Each command also has its own separate help.
python .\triplea\cli\aaa.py --help
Get the list of article identifiers (PMIDs) based on a search term and save them to the knowledge repository in the first state (0). Use this command:
python .\triplea\cli\aaa.py search --searchterm [searchterm]
Even the PMID itself can be used in the search term.
python .\triplea\cli\aaa.py search --searchterm 36467335
The preparation of an article for graph extraction consists of several steps arranged in a pipeline. Each step is identified by a number in the state value. The following table describes the state numbers:
List of state numbers
State | Description |
---|---|
0 | article identifier saved |
1 | article details saved (JSON form) |
2 | parse details info |
3 | Get Citation |
4 | NER Title |
5 | extract graph |
-1 | Error |
There are two ways to run the pipeline. In the first method, you give the number of an existing state, and all articles in that state move forward by one state.
In the second method, you give the final state number, and each article below that state keeps moving forward until it reaches the specified final state.
The first method is executed with the next command and the second with the go command.
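The difference between the two modes can be sketched with a small in-memory model. This is only an illustration, not the tool's implementation: the `articles` dictionary and the `move_next` and `go_to` functions are hypothetical names, and each step simply increments the state instead of doing real work such as fetching details or citations.

```python
# Hypothetical in-memory model: PMID -> pipeline state (not the real Arepo).

def move_next(articles, state):
    """Move every article that is exactly in `state` forward by one state."""
    moved = 0
    for pmid, current in articles.items():
        if current == state:
            articles[pmid] = current + 1
            moved += 1
    return moved


def go_to(articles, end):
    """Advance every article below `end`, one state at a time, until it reaches `end`."""
    for state in range(0, end):
        move_next(articles, state)


articles = {"36467335": 0, "31398071": 2, "32434767": 3}
move_next(articles, 0)  # only the article in state 0 moves, to state 1
go_to(articles, 3)      # every article below state 3 ends up in state 3
print(articles)
```

In this model, articles in the error state (-1) are never picked up, because `go_to` only walks through states 0 and above; whether the real tool behaves the same way is an assumption.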
This command moves articles from the current state to the next state:
python .\triplea\cli\aaa.py next --state [current state]
For example, to move all articles in state 0 to state 1:
python .\triplea\cli\aaa.py next --state 0
The go command:
python .\triplea\cli\aaa.py go --end [last state]
python .\triplea\cli\aaa.py go --end 3
You can try the NER method to extract the major topic of an article's title by using the following command. This command is standalone and intended for testing; its result is not stored in the Arepo.
python .\triplea\cli\ner.py --title "The Iranian Integrated Care Electronic Health Record."
Supported import file types are .bib, .enw and .ris:
python .\triplea\cli\import.py "C:\...\bc.ris"
For detailed information:
python .\triplea\cli\aaa.py export_graph --help
Create a graph in graphml format and save it to the file test.graphml:
python .\triplea\cli\aaa.py export_graph -g gen-all -f graphml -o .\triplea\test
Create a graph in gexf format and save it to the file C:\Users\Dr bitaraf\Documents\graph\article.gexf. This graph contains articles, authors, affiliations and the relations between them:
python .\triplea\cli\aaa.py export_graph -g article-author-affiliation -f gexf -o "C:\Users\Dr bitaraf\Documents\graph\article"
Create a graph in graphdict format and save it to the file C:\Users\Dr bitaraf\Documents\graph\article.json. This graph contains articles, references, citing articles and the relations between them:
python .\triplea\cli\aaa.py export_graph -g article-reference -g article-cited -f graphdict -o "C:\Users\Dr bitaraf\Documents\graph\article.json"
Several visualizers are used to display graphs in this program. These include:
Alchemy.js : Alchemy.js is a graph drawing application built almost entirely in d3.
InteractiveGraph : InteractiveGraph provides a web-based interactive visualization and analysis framework for large graph data, which may come from a GSON file
netwulf : Interactive visualization of networks based on Ulf Aslak's d3 web app.
python .\triplea\cli\aaa.py visualize -g article-reference -g article-cited -p 8001
python .\triplea\cli\aaa.py visualize -g gen-all -p 8001
python .\triplea\cli\aaa.py visualize -g article-topic -g article-keyword -p 8001
The analysis info command calculates specific metrics for the entire graph. These metrics include the following:
- Graph Type:
- SCC:
- WCC:
- Reciprocity :
- Graph Nodes:
- Graph Edges:
- Graph Average Degree :
- Graph Density :
- Graph Transitivity :
- Graph max path length :
- Graph Average Clustering Coefficient :
- Graph Degree Assortativity Coefficient :
python .\triplea\cli\aaa.py analysis -g gen-all -c info
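A few of the metrics listed above can be sketched without any dependencies. The tool itself relies on NetworkX, so this is not its implementation; the `graph_info` function, the made-up edge list, and the exact definitions (a directed graph, with average degree counting in-degree plus out-degree) are assumptions chosen for illustration.

```python
def graph_info(edges):
    """Compute a few basic metrics for a directed graph given as (source, target) pairs."""
    edge_set = set(edges)  # drop duplicate edges
    nodes = {node for edge in edge_set for node in edge}
    n, m = len(nodes), len(edge_set)
    # Reciprocity: share of edges whose reverse edge also exists.
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return {
        "nodes": n,
        "edges": m,
        # Each directed edge adds one out-degree and one in-degree.
        "average_degree": 2 * m / n if n else 0.0,
        # A directed graph without self-loops has at most n * (n - 1) edges.
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
        "reciprocity": mutual / m if m else 0.0,
    }


# Made-up citation edges: (citing article, cited article).
edges = [("A", "B"), ("B", "A"), ("A", "C"), ("D", "C")]
info = graph_info(edges)
print(info)
```

Metrics such as transitivity, clustering, or assortativity need more involved algorithms and are better left to a graph library.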
Creates a graph with all possible nodes and edges and calculates and lists the sorted degree centrality for each node.
python .\triplea\cli\aaa.py analysis -g gen-all -c sdc
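As a rough illustration of what a sorted degree centrality listing contains, the sketch below ranks nodes by degree divided by (n - 1), which is the usual normalization. The `sorted_degree_centrality` function and the edge list are hypothetical, and the tool's own output format may differ.

```python
from collections import Counter


def sorted_degree_centrality(edges):
    """Return (node, centrality) pairs sorted from most to least central."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    n = len(degree)
    if n <= 1:
        # Centrality is not meaningful for fewer than two nodes.
        return [(node, 0.0) for node in degree]
    ranked = [(node, d / (n - 1)) for node, d in degree.items()]
    # Highest centrality first; ties are broken by node name.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Made-up edges between four articles.
edges = [("A", "B"), ("A", "C"), ("D", "C"), ("A", "D")]
ranking = sorted_degree_centrality(edges)
for node, centrality in ranking:
    print(node, round(centrality, 3))
```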
Article Repository (Arepo) is a database that stores the information of articles and graphs. Different databases can be used. We have used the following databases here:
- TinyDB - TinyDB is a lightweight document-oriented database
- MongoDB - MongoDB is a source-available cross-platform document-oriented database program
To get general information about the articles, nodes and edges in the database, use the following command.
python .\triplea\cli\aaa.py arepo -c info
output:
Number of article in article repository is 122
0 Node(s) in article repository.
0 Edge(s) in article repository.
122 article(s) in state 3.
Get article data by PMID
python .\triplea\cli\aaa.py arepo -pmid 31398071
output:
Title : Association between MRI background parenchymal enhancement and lymphovascular invasion and estrogen receptor status in invasive breast cancer.
Journal : The British journal of radiology
DOI : 10.1259/bjr.20190417
PMID : 31398071
PMC : PMC6849688
State : 3
Authors : Jun Li, Yin Mo, Bo He, Qian Gao, Chunyan Luo, Chao Peng, Wei Zhao, Yun Ma, Ying Yang,
Keywords: Adult, Aged, Breast Neoplasms, Female, Humans, Lymphatic Metastasis, Magnetic Resonance Imaging, Menopause, Middle Aged, Neoplasm Invasiveness, Receptors, Estrogen, Retrospective Studies, Young Adult,
Get article data by PMID and save it to the file article.json.
python .\triplea\cli\aaa.py arepo -pmid 31398071 -o article.json
For detailed information:
python .\triplea\cli\aaa.py config --help
Get environment variable:
python .\triplea\cli\aaa.py config -c info
Set new environment variable:
python .\triplea\cli\aaa.py config -c update
Below is a summary of important environment variables in this project:
Environment Variables | Description | Default Value |
---|---|---|
TRIPLEA_DB_TYPE | The type of database used in the project. The database layer is separate, so different databases can be used; MongoDB and TinyDB are currently supported. TinyDB suits small-scale use and MongoDB large-scale use. | TinyDB |
AAA_TINYDB_FILENAME | File name of TinyDB | articledata.json |
AAA_MONGODB_CONNECTION_URL | Standard Connection String Format For MongoDB | mongodb://user:[email protected]:27017/ |
AAA_MONGODB_DB_NAME | Name of MongoDB Collection | articledata |
AAA_TPS_LIMIT | Transaction Per Second Limitation | 1 |
AAA_PROXY_HTTP | An HTTP proxy is a server that acts as an intermediary between a client and PubMed server. When a client sends a request to a server through an HTTP proxy, the proxy intercepts the request and forwards it to the server on behalf of the client. Similarly, when the server responds, the proxy intercepts the response and forwards it back to the client. | |
AAA_PROXY_HTTPS | HTTPS Proxy | |
AAA_REFF_CRAWLER_DEEP | 1 | |
AAA_CITED_CRAWLER_DEEP | 1 |
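
To make the role of the defaults concrete, here is a minimal sketch of reading some of these variables with a fallback. How the tool actually loads its configuration is not shown in this document, so `load_settings` and `DEFAULTS` are hypothetical names, and treating AAA_TPS_LIMIT as an integer is an assumption.

```python
import os

# Defaults copied from the table above; variables without a clear default are omitted.
DEFAULTS = {
    "TRIPLEA_DB_TYPE": "TinyDB",
    "AAA_TINYDB_FILENAME": "articledata.json",
    "AAA_MONGODB_DB_NAME": "articledata",
    "AAA_TPS_LIMIT": "1",
}


def load_settings(environ=None):
    """Return settings, preferring values from the environment over the defaults."""
    environ = os.environ if environ is None else environ
    settings = {name: environ.get(name, default) for name, default in DEFAULTS.items()}
    # Environment values are strings; convert the rate limit to a number (assumed integer).
    settings["AAA_TPS_LIMIT"] = int(settings["AAA_TPS_LIMIT"])
    return settings


# Passing a plain dict instead of os.environ keeps the example reproducible.
print(load_settings({"TRIPLEA_DB_TYPE": "MongoDB"}))
```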
poetry run pytest
poetry run pytest --cov
For graph analysis:
For NLP:
For data storage:
For visualization of networks:
For CLI:
For packaging and dependency management:
With this tool, you can create datasets in different formats. Here are examples of these datasets.
PubMed query:
"Biological Specimen Banks"[Mesh] OR BioBanking OR biobank OR dataBank OR "Bio Banking" OR "bio bank"
39,023 results
Search with this command:
python .\triplea\cli\aaa.py search --searchterm "\"Biological Specimen Banks\"[Mesh] OR BioBanking OR biobank OR dataBank OR \"Bio Banking\" OR \"bio bank\" "
This query returned 39,023 results as of 2023/01/02, and retrieving them produced the following error:
"ERROR":"Search Backend failed: Exception:\n\'retstart\' cannot be larger than 9998. For PubMed, ESearch can only retrieve the first 9,999 records matching the query. To obtain more than 9,999 PubMed records, consider using EDirect that contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved. For details see https://www.ncbi.nlm.nih.gov/books/NBK25499/"
Because this query had more than 10,000 results, the following note from the NCBI documentation applies:
To retrieve more than 10,000 UIDs from databases other than PubMed, submit multiple esearch requests while incrementing the value of retstart (see Application 3). For PubMed, ESearch can only retrieve the first 10,000 records matching the query. To obtain more than 10,000 PubMed records, consider using EDirect that contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved.
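The paging behaviour and its limit can be sketched as follows. This only builds ESearch request URLs and sends nothing; the `esearch_urls` function and the batch size of 1000 are assumptions for illustration, the parameter names follow the request URL used elsewhere in this document, and the 9998 ceiling is taken from the error message above.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
MAX_RETSTART = 9998  # limit reported by the PubMed error message


def esearch_urls(term, total, retmax=1000):
    """Yield ESearch URLs that page through the retrievable part of a result set."""
    retstart = 0
    while retstart < total and retstart <= MAX_RETSTART:
        # Never ask for records beyond the retrievable window.
        size = min(retmax, total - retstart, MAX_RETSTART + 1 - retstart)
        query = urlencode({
            "db": "pubmed", "term": term, "retmode": "json",
            "retstart": retstart, "retmax": size,
        })
        yield f"{ESEARCH_URL}?{query}"
        retstart += size


# For a query with 39,023 hits, only the first 9,999 records are covered.
for url in esearch_urls("biobank", 39023):
    print(url)
```

Reaching the remaining records requires a different strategy, such as EDirect or splitting the query into smaller ones, which this sketch does not attempt.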
This query was added later:
"bio-banking"[Title/Abstract] OR "bio-bank"[Title/Abstract] OR "data-bank"[Title/Abstract]
9,012 results
python .\triplea\cli\aaa.py search --searchterm " \"bio-banking\"[Title/Abstract] OR \"bio-bank\"[Title/Abstract] OR \"data-bank\"[Title/Abstract] "
After running this, get the repository info:
Number of article in article repository is 47735
Export in graphml format:
python .\triplea\cli\aaa.py export_graph -g article-reference -g article-keyword -f graphml -o .\triplea\datasets\biobank.graphml
Keyword Checking:
"Breast Neoplasms"[Mesh]
"Breast Cancer"[Title]
"Breast Neoplasms"[Title]
"Breast Neoplasms"[Other Term]
"Breast Cancer"[Other Term]
"Registries"[Mesh]
"Database Management Systems"[Mesh]
"Information Systems"[MeSH Major Topic]
"Registries"[Other Term]
"Information Storage and Retrieval"[MeSH Major Topic]
"Registry"[Title]
"National Program of Cancer Registries"[Mesh]
"Registries"[MeSH Major Topic]
"Information Science"[Mesh]
"Data Management"[Mesh]
Final Pubmed Query:
("Breast Neoplasms"[Mesh] OR "Breast Cancer"[Title] OR "Breast Neoplasms"[Title] OR "Breast Neoplasms"[Other Term] OR "Breast Cancer"[Other Term]) AND ("Registries"[MeSH Major Topic] OR "Database Management Systems"[MeSH Major Topic] OR "Information Systems"[MeSH Major Topic] OR "Registry"[Other Term] OR "Registry"[Title] OR "Information Storage and Retrieval"[MeSH Major Topic])
url:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=("Breast+Neoplasms"[Mesh]+OR+"Breast+Cancer"[Title]+OR+"Breast+Neoplasms"[Title]+OR+"Breast+Neoplasms"[Other+Term]+OR+"Breast+Cancer"[Other+Term])+AND+("Registries"[MeSH+Major+Topic]+OR+"Database+Management+Systems"[MeSH+Major+Topic]+OR+"Information+Systems"[MeSH+Major+Topic]+OR+"Registry"[Other+Term]+OR+"Registry"[Title]+OR+"Information+Storage+and+Retrieval"[MeSH+Major+Topic])&retmode=json&retstart=1&retmax=10
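A response to a request like this one, with retmode=json, can be reduced to a hit count and a list of PMIDs. The sketch below assumes the usual ESearch JSON layout with an `esearchresult` object holding `count` and `idlist`; the `parse_esearch` function is a hypothetical helper and the sample body is made up and trimmed.

```python
import json


def parse_esearch(body):
    """Extract the total hit count and the PMIDs from an ESearch JSON response."""
    result = json.loads(body)["esearchresult"]
    return int(result["count"]), list(result["idlist"])


# Trimmed, made-up response body with the assumed shape.
sample = (
    '{"esearchresult": {"count": "2", "retmax": "10", "retstart": "1", '
    '"idlist": ["31398071", "36467335"]}}'
)
count, pmids = parse_esearch(sample)
print(count, pmids)
```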
You can download the resulting network, including the relationships between articles and keywords, in graphdict format from here. You can also download a manipulated version of this graph in graphml format from here.
It is not yet complete.
Various tools have been developed to visualize graphs. We have done a brief review and selected a few tools to use in this program.
In this project, we used one of the most powerful libraries for graph analysis. Using NetworkX, we generated many indicators to check a citation graph. Some materials in this regard are given here. You can use other libraries as well.
In the architecture of this software, the structure of the article is stored in the database, and this structure also contains the summary of the article. This makes it possible to run NLP processes such as keyword extraction and topic extraction, which can be completed in the future.
This topic is very interesting from a research point of view, so I have included the articles that were interesting here.
We used flake8 and black libraries to increase code quality. More information can be found here.
If you use Triple A for your scientific work, consider citing us! We're published in IEEE.
@INPROCEEDINGS{10139229,
author={Jafarpour, Maryam and Bitaraf, Ehsan and Moeini, Ali and Nahvijou, Azin},
booktitle={2023 9th International Conference on Web Research (ICWR)},
title={Triple A (AAA): a Tool to Analyze Scientific Literature Metadata with Complex Network Parameters},
year={2023},
volume={},
number={},
pages={342-345},
doi={10.1109/ICWR57742.2023.10139229}}
TripleA is available under the Apache License.