The current status of the project is experimental 🔬
CORD19.__init__
method
Construct a CORD19 object. Initialize with a single path or a list of paths pointing to the directory with JSON files, e.g., /../dir/*.json
import coronanlp
from pathlib import Path

root = Path('/path/to/Datasets/AllenAI/CORD-19/2020-03-13/')
source = [p for p in root.iterdir() if p.is_dir()]
cord19 = coronanlp.CORD19(source, sort_first=True)
print(cord19)
...
# CORD19(papers: 13202, files_sorted: True, source: [
# comm_use_subset, noncomm_use_subset, pmc_custom_license, biorxiv_medrxiv,
# ])
CORD19.sample
method
name | type | description |
---|---|---|
k | int | Sample size; if k < max_ids, a random sample is returned. Otherwise, pass -1 for all ids. |
s | int | Split of ids; the split index is based on the order of the path(s) passed and the number of directories. |
seed | Optional[Union[int, float]] | Random seed if not None. |
returns | List[int] | Returns a list of paper ids. |
sample = cord19.sample(s=0) # e.g., All ids from the biorxiv_medrxiv directory
sample = cord19.sample(k=-1) # Returns all ids available in order.
sample = cord19.sample(k=1000) # 1K randomly selected ids, e.g., from all 45941
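For reproducible sampling, the seed argument from the table above can be fixed (a minimal sketch; the value 42 is arbitrary):
sample = cord19.sample(k=1000, seed=42)  # Same 1K ids on every run.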
CORD19.batch
method
name | type | description |
---|---|---|
sample | List[int] | A list of paper ids (integers), e.g., from CORD19.sample. |
minlen | int | Minimum length of a string to consider valid. |
workers | int | Number of cores to use; if None, obtained from cpu_count(). |
returns | SentenceStore | The newly constructed object. |
sample = cord19.sample(-1)
sentence_store = cord19.batch(sample, minlen=15)
...
# files/papers: 24% ████ | 3112/13202 [09:49<19:51, 8.47it/s]
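The workers argument from the table above can also be set explicitly (a sketch; 4 is an arbitrary core count):
sentence_store = cord19.batch(sample, minlen=15, workers=4)  # Limit the pool to 4 cores.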
After extracting the text from the JSON files, pre-processing, cleaning, and sentence-tokenizing, we get a newly constructed SentenceStore object with a container, Dict[int, List[str]], holding all the sentences.
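As a rough illustration of that container shape (hypothetical ids and sentences, not actual dataset contents):
# paper id -> list of that paper's sentences
store_like = {
    101: ['We analyzed pulmonary artery endothelial cells (hPAECs).'],
    102: ['In case of laboratory confirmation, ...'],
}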
Using the spaCy model en_core_sci_sm is recommended for building high-quality sentences.
- Below we can see the contrast in how the models recognize sentence boundaries; en_core_web_sm starts splitting at " ( ", which doesn't make sense. If it does this here, where else will it do it? Given that the text is scientific literature, we can expect equations, many abbreviations, and symbols, all destroyed because we used the wrong tokenizer. Meanwhile, en_core_sci_sm has no issues 🥇 (scispacy - spaCy models for biomedical text processing).
# en_core_web_sm
2: We analyzed ... pulmonary artery endothelial cells (hPAECs).
3: The effect of ... electric resistance, molecule trafficking, calcium (
4: Ca 2+ ) homeostasis, gene expression and proliferation studies.
# en_core_sci_sm
2: We analyzed ... pulmonary artery endothelial cells (hPAECs).
3: The effect of ... electric resistance, molecule trafficking, calcium (Ca 2+ ) homeostasis, gene expression and proliferation studies.
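The comparison above can be reproduced with a few lines of spaCy (a minimal sketch; it assumes both models are installed, e.g., en_core_sci_sm via scispacy, and uses placeholder text in place of the full passage):
import spacy

text = ("We analyzed ... pulmonary artery endothelial cells (hPAECs). "
        "The effect of ... calcium (Ca 2+ ) homeostasis ...")
for name in ("en_core_web_sm", "en_core_sci_sm"):
    nlp = spacy.load(name)
    print(f"# {name}")
    for i, sent in enumerate(nlp(text).sents, 1):
        print(f"{i}: {sent.text}")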
coronanlp.core.SentenceStore.__str__
method
- Wow 🤔 13,202 papers produce 1,890,230 sentences and 60,847,005 tokens! It takes about ~30 minutes on an SSD and around ~40 on an HDD, but it really depends on hardware.
print(sentence_store)
...
# SentenceStore(avg_seqlen=32.19, num_papers=13202, num_sents=1890230, num_tokens=60847005)
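Those totals also check out against the reported average sequence length: 60,847,005 tokens over 1,890,230 sentences is about 32.19 tokens per sentence.
print(round(60847005 / 1890230, 2))
...
# 32.19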
SentenceTransformer.__init__
method
encoder = coronanlp.SentenceTransformer('model_name_or_path')
sentence_embeddings = encoder.encode(list(sentence_store), batch_size=8)
...
# batches: 96% ██████████████ | 36585/38168 [17:10<01:17, 20.38it/s]
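As a quick sanity check on the result (a sketch; it assumes the encoder returns a NumPy array, and 768 is the hidden size of BERT-base models such as SciBERT):
print(sentence_embeddings.shape)
...
# (1890230, 768)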
coronanlp.indexing.fit_index_ivf_hnsw
method
The following method requires faiss, which should be easy to install on both Linux and macOS. Windows, however, needs a build from source; read more here: windows support? | issue<1437>
Install faiss with CUDA or CPU. (I have only used the CPU version and haven't experienced any issues; even at 2 million dense vectors it runs extremely fast.) See the official repo: faiss github
- cpu-version: conda install faiss-cpu -c pytorch
- gpu-version: replace X with your CUDA version from [8.0, 9.0, 10.0]: conda install faiss-gpu cudatoolkit=X.0 -c pytorch
index_ivf = coronanlp.fit_index_ivf_hnsw(sentence_embeddings, metric='l2')
print(index_ivf.is_trained)
...
# True
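With the index trained, retrieval can be sanity-checked directly through faiss (a sketch; the query is just the first sentence embedding, and nprobe=64 mirrors the value used in the answer example below):
import numpy as np

index_ivf.nprobe = 64
query = np.asarray(sentence_embeddings[:1], dtype=np.float32)
dists, ids = index_ivf.search(query, 5)  # 5 nearest sentence ids
print(ids, dists)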
ScibertQuestionAnswering.__init__
method
The following table defines the minimum arguments required to construct a new ScibertQuestionAnswering
object.
name | type | description |
---|---|---|
papers | Union[str, SentenceStore] | A path or papers object. |
index | Union[str, faiss.IndexIVFFlat] | A path or index object. |
encoder | Union[str, SentenceTransformer] | A path or model object. |
from coronanlp.engine import ScibertQuestionAnswering
qa = ScibertQuestionAnswering(sentence_store, index_ivf, encoder)
print(qa.all_model_devices)
...
# {'summarizer_model_device': device(type='cuda'),
# 'sentence_transformer_model_device': device(type='cuda'),
# 'question_answering_model_device': device(type='cpu')}
QuestionAnsweringEngine.answer
method
name | type | description |
---|---|---|
question | str | Question(s) to query the QA model (multiple questions are split by sentence tokenization). |
topk | int | Number of answers to return (chosen by likelihood). |
top_p | int | Number of nearest-neighbor sentences to retrieve from the index for the context. |
nprobe | int | Number of index cells to probe during the search. |
mode | str | Compressor mode to use on the context; 'bert' or 'freq'. |
returns | QuestionAnsweringOutput | Object holding the predictions of the model(s). |
The following keyword arguments can be used in the QuestionAnsweringEngine.answer method and are then passed to transformers.QuestionAnsweringPipeline.__call__. Click here to read the official HuggingFace documentation.
**kwargs | type | description |
---|---|---|
doc_stride | int | (default=128) If the input is too long, it is split into chunks with this overlap. |
max_answer_len | int | (default=15) Max length of predicted answers. |
max_seq_len | int | (default=384) Max length of the total sentence (context + question) after tokenization. |
max_question_len | int | (default=64) Max length of the question after tokenization. |
handle_impossible_answer | bool | (default=False) Whether or not we accept impossible as an answer. |
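As a sketch of how these forward through the answer method (the question variable is defined in the example below; the kwarg values here are arbitrary):
preds = qa.answer(question, topk=5, top_p=25, nprobe=64, mode='bert',
                  max_answer_len=30, handle_impossible_answer=True)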
Open book question-answering on CORD-19 literature
- base-model: scibert-scivocab-cased
- finetuned: CORD-19 dataset, nli-stsb-mean-tokens (using the sentence-transformers library).
- downstream-task: question-answering, SQuAD 2.0
question = ("What has been published concerning systematic, holistic approach to"
" diagnostics (from the public health surveillance perspective to being"
" able to predict clinical outcomes)?")
preds = qa.answer(question, topk=5, top_p=25, nprobe=64, mode='bert')
preds.popempty() # quickly pop any empty answers from the list.
print(preds.ids, preds.dist)
...
- Each sentence id with its respective distance (nearest-neighbor to query/question):
(array([[179943, 38779, 48340, 171641, 11026, 16090, 132451, 10551,
231547, 32627, 203359, 123822, 231157, 59945, 333167, 203328,
37302, 74584, 1534, 425932, 261597, 268659, 397260, 27072,
117127]]),
array([[114.0309 , 119.05531, 127.31623, 128.10754, 134.01633, 135.51642,
137.91711, 138.64822, 139.26003, 141.16678, 142.14966, 144.36464,
147.06775, 147.06958, 147.22888, 148.9502 , 149.55885, 149.9512 ,
150.12485, 150.1834 , 150.33147, 151.04956, 151.89029, 152.3189 ,
152.47305]], dtype=float32))
- In the following output, we can see all the predicted answers along with each answer's information.
list(preds)
...
[
ModelOutput(score=0.02010919339954853, start=11, end=35, answer='laboratory confirmation,'),
ModelOutput(score=0.00906957034021616, start=133, end=143, answer='sequencing.'),
ModelOutput(score=0.00279330019839108, start=40, end=143, answer='national reference, ...'),
ModelOutput(score=0.002625082153826952, start=11, end=21, answer='laboratory')
]
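Since each ModelOutput carries a score, the most likely answer can also be pulled out directly (a small sketch using only the fields shown above):
best = max(preds, key=lambda o: o.score)
print(best.answer)
...
# laboratory confirmation,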
- Get all [(start, end)] spans of each answer. Both styles produce the same result.
for span in preds: print(span.start, span.end)
# or
preds.spans()
...
[(11, 35), (133, 143), (40, 143), (11, 21)]
- Highlighting the predicted answer:
from coronanlp import render_output as render
output = preds[2] # e.g., iterate all answers: [o.answer for o in preds]
answer = output.answer
render(answer=answer, context=preds.c, question=preds.q)
* Question
> What has been published concerning systematic, holistic approach to
diagnostics (from the public health surveillance perspective to being
able to predict clinical outcomes)?
* Context:
In case of laboratory confirmation, `<< the national reference laboratory
aims to obtain material from regional laboratories for further sequencing. ANSWER >>`
An alternate measure of program success is the extent to which screening delays the
first importation of cases into the community, possibly providing additional time
to train medical staff, deploy public health responders or refine travel policies
(Cowling et al., Unlike the UK national strategy documents and plans, the US National
Health Information Infrastructure Strategy document (also known as "Information for Health")
refers explicitly to GIS and real-time health and disease monitoring and states that "public
health will need to include in its toolkit integrated data systems; high-quality
community-level data; tools to identify significant health trends in real-time data streams;
and geographic information systems" [48]. This thinktank is used to address matters related
to combating new issues and to expand on the Saudi reach to the international scientific
community. For the purpose of this paper, the following definition is used, "Public health
surveillance is the ongoing systematic collection, analysis, interpretation and
dissemination of health data for the planning, implementation and evaluation of
public health action" (see Section 2.3 below). Finally, we discuss policy and
logistical and technological obstacles to achieving a potential transformation
of public health microbiology.
- Example of the render_output() method shown above in a Jupyter notebook: