adding links in a knowledge base & BIO-export in a separate column #8

Open
shigapov opened this issue Aug 5, 2022 · 5 comments

@shigapov

shigapov commented Aug 5, 2022

Is it possible in your tool to add links to entities in a knowledge base and export them to BIO format in a separate column? If yes, how can I do that? If not, would you consider adding that functionality?

@hehuan2112
Collaborator

hehuan2112 commented Aug 9, 2022

Thank you so much for your feedback! My understanding is that you want to add knowledge base links to entities and have those links included in the BIO-format export. For example, suppose we have the following annotation in the sentence "I had a fever last week.":

<SYMP id="S0" text="fever" spans="8~13" cui="C0015967" />

In the UMLS KB, the CUI for fever is C0015967 and the link is https://uts.nlm.nih.gov/uts/umls/concept/C0015967. So you want C0015967 to be added to the annotation and exported in BIO format as a separate column. For example:

I      O
had    O
a      O
fever  B-SYMP  C0015967
last   O
week   O
.      O

If the above example is correct, the approach we have used before is to add the knowledge base link or ID as an attribute of the entity concept in the schema. Annotators can then fill in that attribute for each entity while annotating.

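For example, add a cui attribute in the annotation schema:
[screenshot: cui attribute defined in the annotation schema]
Then users can enter the cui while annotating:
[screenshot: entering the cui value for an annotated entity]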

Once the annotation is finished, the value of the cui attribute can be used for linking or any other task. We usually write Python scripts to parse and convert the annotation XML. At present, the exported BIO file contains only the tokens and entity labels, without any attributes, but we can certainly add all attribute values in a future version if that is needed. Could you tell us more about what your task requires? We can discuss how to support it in our tool. :)
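As a rough illustration of the kind of conversion script mentioned above, the sketch below reads a MedTator-style XML string and produces BIO lines with the cui value as an extra column. It is not the MedTator exporter: the function name `to_bio_with_attr`, the `DEMO` root element, the regex tokenizer, and the restriction to single `start~end` spans are all assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample in the MedTator XML layout (root element name is made up).
SAMPLE = """<DEMO>
<TEXT><![CDATA[I had a fever last week.]]></TEXT>
<TAGS>
<SYMP id="S0" text="fever" spans="8~13" cui="C0015967" />
</TAGS>
</DEMO>"""


def to_bio_with_attr(xml_string, attr="cui"):
    """Return tab-separated BIO lines; entity tokens get `attr` as a third column."""
    root = ET.fromstring(xml_string)
    text = root.findtext("TEXT")

    # Collect (start, end, label, attribute value); assumes one span per tag.
    entities = []
    for tag in root.find("TAGS"):
        start, end = (int(x) for x in tag.get("spans").split("~"))
        entities.append((start, end, tag.tag, tag.get(attr, "")))

    lines = []
    # Naive tokenizer (an assumption): word runs or single punctuation marks.
    for m in re.finditer(r"\w+|[^\w\s]", text):
        label, value = "O", ""
        for start, end, name, attr_value in entities:
            if start <= m.start() and m.end() <= end:
                prefix = "B" if m.start() == start else "I"
                label, value = f"{prefix}-{name}", attr_value
                break
        # Drop the third column when there is no attribute value.
        lines.append("\t".join(filter(None, [m.group(), label, value])))
    return lines


print("\n".join(to_bio_with_attr(SAMPLE)))
```

Passing a different `attr` name (for example `"QID"`) would pick up another attribute in the same way. A real tokenizer would have to match the one used for the model being trained.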

@shigapov
Author

shigapov commented Aug 10, 2022

Oh, that's great! So attributes can already be added to the XML files.

You described the task very well. Indeed, my use case is both NER and NEL, so we need the links. Moreover, if all attributes are already saved in the XML files, we could also put nested named entities in a separate column (e.g., George Washington as PER inside George Washington University Hospital as ORG). That's very useful. Thank you!

I've tested adding the links just now with the following text from German Wikipedia:

Die BASF SE mit Sitz in Ludwigshafen am Rhein ist ein börsennotierter Chemiekonzern. Sie ist in 90 Ländern vertreten und betreibt 238 Produktionsstandorte. 111.047 Mitarbeiter erwirtschafteten 2021 einen Umsatz von 78,6 Milliarden Euro.[1] Nach Umsatz ist die BASF damit der größte Chemiekonzern weltweit. Das Unternehmen hat seinen Ursprung in der 1865 in Mannheim gegründeten Badischen Anilin- & Sodafabrik. Weil dort kein geeignetes Areal zur Verfügung stand, wurde das neue Werk noch im selben Jahr am gegenüberliegenden Rheinufer in Ludwigshafen gebaut.

The exported xml looks like:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE collection SYSTEM "BioC.dtd">
<collection><source/><date>Wed Aug 10 2022 09:09:33 GMT+0200 (Central European Summer Time)</date><key/><document><id>basf_se.txt.xml</id><passage><offset>0</offset><text>Die BASF SE mit Sitz in Ludwigshafen am Rhein ist ein börsennotierter Chemiekonzern. Sie ist in 90 Ländern vertreten und betreibt 238 Produktionsstandorte. 111.047 Mitarbeiter erwirtschafteten 2021 einen Umsatz von 78,6 Milliarden Euro.[1] Nach Umsatz ist die BASF damit der größte Chemiekonzern weltweit. Das Unternehmen hat seinen Ursprung in der 1865 in Mannheim gegründeten Badischen Anilin- &amp; Sodafabrik. Weil dort kein geeignetes Areal zur Verfügung stand, wurde das neue Werk noch im selben Jahr am gegenüberliegenden Rheinufer in Ludwigshafen gebaut.</text><annotation id="O0"><location length="7" offset="4"/><text>BASF SE</text><infon key="QID">Q9401</infon></annotation><annotation id="L0"><location length="21" offset="24"/><text>Ludwigshafen am Rhein</text><infon key="QID">Q2910</infon></annotation><annotation id="O1"><location length="4" offset="260"/><text>BASF</text><infon key="QID">Q9401</infon></annotation><annotation id="L1"><location length="8" offset="357"/><text>Mannheim</text><infon key="QID">Q2119</infon></annotation><annotation id="O2"><location length="30" offset="378"/><text>Badischen Anilin- &amp; Sodafabrik</text><infon key="QID">Q9401</infon></annotation></passage></document></collection>

The links are there, which is great, but the ampersand & was converted into &amp;, so the length of the last named entity no longer matches the exported text. With more named entities, the subsequent offsets and lengths would be off as well. My goal is to apply NER/NEL to OCR-ed texts with many exotic characters. What would you suggest to avoid any conversions during export?

P.S. The exported BIO file has &, by the way.

@hehuan2112
Collaborator

hehuan2112 commented Aug 11, 2022

Yes, the attributes can be exported in the BioC format:

<infon key="attribute_name">attribute_value</infon>

In the default MedTator XML format, the attributes are also saved as tag attributes (e.g., <SYMP id="S0" text="fever" spans="8~13" attribute_name="attribute_value" />).

And thank you for finding the bug! I just checked and found that it is caused by the default XML encoding of text nodes. When converting the annotation to BioC format, the createTextNode() function is called to create the content of a <text></text> node that stores the document or the annotated tokens. But when a text node created this way is serialized, some special characters are automatically converted to the &xxx; entity format, which causes the bug.

So I have switched to a function that saves the text data in a CDATA section, which preserves the original characters. The output now looks like the following:

<document>
   <id>sample-5dsw7.xml</id>
   <passage>
      <offset>0</offset>
      <text><![CDATA[I have fever & headache about < 10 days.]]></text>
      <annotation id="S0">
         <location length="16" offset="7" />
         <text><![CDATA[fever & headache]]></text>
         <infon key="certainty">positive</infon>
         <infon key="comment">NA</infon>
      </annotation>
      <annotation id="S1">
         <location length="9" offset="30" />
         <text><![CDATA[< 10 days]]></text>
         <infon key="certainty">positive</infon>
         <infon key="comment">NA</infon>
      </annotation>
   </passage>
</document>
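The difference between a text node and a CDATA section can be reproduced outside the browser. The sketch below uses Python's xml.dom.minidom rather than the JavaScript DOM that MedTator runs on, so it is only an analogy under that assumption, but the serialization rules for the two node types are the same in XML.

```python
from xml.dom.minidom import Document, parseString

doc = Document()
sample = "fever & headache"

# A plain text node: "&" is escaped when the element is serialized.
as_text = doc.createElement("text")
as_text.appendChild(doc.createTextNode(sample))

# A CDATA section: the characters are written out unchanged.
as_cdata = doc.createElement("text")
as_cdata.appendChild(doc.createCDATASection(sample))

escaped_xml = as_text.toxml()
cdata_xml = as_cdata.toxml()

# Parse both serializations back and show the string a parser returns.
for xml in (escaped_xml, cdata_xml):
    parsed = parseString(xml).documentElement.firstChild.data
    print(xml, "->", parsed)
```

Both forms should read back as the same string, so the escaping only matters to code that counts characters in the raw file instead of going through an XML parser.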

When other tools or scripts parse this XML, the CDATA content is read exactly as written. Our repository also has a script, medtator_kits.py, that shows how to parse the MedTator XML file in Python. Parsing the BioC XML format should be a similar process.
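For the BioC side, a minimal sketch of such a parsing script might look like the following. It is not taken from medtator_kits.py; the function name `check_offsets` and the sample document are made up for illustration, and the sample deliberately mixes an escaped `&amp;` with CDATA to show that both decode to the same characters.

```python
import xml.etree.ElementTree as ET

# Hypothetical BioC fragment; S0 uses an escaped entity, S1 uses CDATA.
BIOC = """<document>
  <id>sample.xml</id>
  <passage>
    <offset>0</offset>
    <text><![CDATA[I have fever & headache about < 10 days.]]></text>
    <annotation id="S0">
      <location length="16" offset="7" />
      <text>fever &amp; headache</text>
      <infon key="certainty">positive</infon>
    </annotation>
    <annotation id="S1">
      <location length="9" offset="30" />
      <text><![CDATA[< 10 days]]></text>
      <infon key="certainty">positive</infon>
    </annotation>
  </passage>
</document>"""


def check_offsets(bioc_xml):
    """Return (id, sliced text, infons, offsets_match) for each annotation."""
    root = ET.fromstring(bioc_xml)
    results = []
    for passage in root.iter("passage"):
        base = int(passage.findtext("offset"))
        text = passage.findtext("text")  # already unescaped by the parser
        for ann in passage.findall("annotation"):
            loc = ann.find("location")
            start = int(loc.get("offset")) - base
            end = start + int(loc.get("length"))
            infons = {i.get("key"): i.text for i in ann.findall("infon")}
            sliced = text[start:end]
            results.append((ann.get("id"), sliced, infons,
                            sliced == ann.findtext("text")))
    return results


for row in check_offsets(BIOC):
    print(row)
```

One caveat for OCR text with unusual characters: Python indexes strings by code point, while JavaScript indexes by UTF-16 code unit. If the offsets were produced from JavaScript string indices (an assumption about the exporter), characters outside the Basic Multilingual Plane would shift the offsets when read in Python.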

I have deployed the fix to our public version at https://ohnlp.github.io/MedTator/ so you can try it now. The fix will also be included in the next release. Thank you again for your feedback! Please let us know if you find any other issues.

@shigapov
Author

shigapov commented Aug 11, 2022

Nice! Thanks a lot! I've just tested it. Indeed, the CDATA content now has a proper ampersand. But in the tag attributes the ampersand is still converted. Please take a look:

<?xml version="1.0" encoding="UTF-8" ?>
<NEW_SCHEMA>
<TEXT><![CDATA[Die BASF SE mit Sitz in Ludwigshafen am Rhein ist ein börsennotierter Chemiekonzern. Sie ist in 90 Ländern vertreten und betreibt 238 Produktionsstandorte. 111.047 Mitarbeiter erwirtschafteten 2021 einen Umsatz von 78,6 Milliarden Euro.[1] Nach Umsatz ist die BASF damit der größte Chemiekonzern weltweit. Das Unternehmen hat seinen Ursprung in der 1865 in Mannheim gegründeten Badischen Anilin- & Sodafabrik. Weil dort kein geeignetes Areal zur Verfügung stand, wurde das neue Werk noch im selben Jahr am gegenüberliegenden Rheinufer in Ludwigshafen gebaut.]]></TEXT>
<TAGS>
<ORG spans="4~11" text="BASF SE" id="O0" QID="NA"/>
<ORG spans="260~264" text="BASF" id="O1" QID="NA"/>
<LOC spans="538~550" text="Ludwigshafen" id="L0" QID="NA"/>
<ORG spans="378~408" text="Badischen Anilin- &amp; Sodafabrik" id="O2" QID="NA"/>
<LOC spans="357~365" text="Mannheim" id="L1" QID="NA"/>
<LOC spans="24~36" text="Ludwigshafen" id="L2" QID="NA"/>
</TAGS>
<META/>
</NEW_SCHEMA>

That's the internal MedTator xml format, right? In the BioC xml files the tag has a proper ampersand now. Thank you!

@hehuan2112
Collaborator

hehuan2112 commented Aug 12, 2022

Yes, that's the MedTator XML format. It saves the annotated text and other attributes as XML tag attribute values.
As you found, attribute values in this format are encoded differently from the text content of an XML node.
We must escape the special XML characters (just a few: <, >, &, ", ') in attribute values to ensure they can be parsed correctly by libraries.

So when you parse the XML content, you don't need to worry about those escaped characters.
The XML parser library unescapes them automatically.
For example, in Python:

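from xml.dom.minidom import parse

# full_path_to_xml: path to a MedTator XML file
dom = parse(full_path_to_xml)
nodes = dom.getElementsByTagName('TAGS')[0].childNodes
for node in nodes:
    # childNodes also contains whitespace text nodes, which have no attributes
    if node.nodeType != node.ELEMENT_NODE:
        continue
    # values come back unescaped, e.g. "&amp;" in the file is "&" here
    for name, value in node.attributes.items():
        print(name, value)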

and in JavaScript:

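// XML_TEXT_CONTENT: the content of a MedTator XML file as a string
var parser = new DOMParser();
var xmlDoc = parser.parseFromString(XML_TEXT_CONTENT, "text/xml");
// .children holds element nodes only, so whitespace text nodes are skipped
var elems = xmlDoc.getElementsByTagName('TAGS')[0].children;
// attribute values are returned unescaped, e.g. "&amp;" becomes "&"
var attrs = elems[0].getAttributeNames();
var value = elems[0].getAttribute(attrs[0]);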
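To confirm that the spans stay valid despite the escaping, a short check like the one below can compare each tag's text attribute with the slice of the document it points to. This is only a sketch: the function name `check_spans` and the sample sentence are invented for illustration, and it assumes each tag has a single `start~end` span.

```python
import xml.etree.ElementTree as ET

# Hypothetical MedTator XML; the attribute value contains an escaped ampersand.
MEDTATOR_XML = """<NEW_SCHEMA>
<TEXT><![CDATA[Die BASF SE entstand aus der Badischen Anilin- & Sodafabrik.]]></TEXT>
<TAGS>
<ORG spans="4~11" text="BASF SE" id="O0" QID="NA"/>
<ORG spans="29~59" text="Badischen Anilin- &amp; Sodafabrik" id="O2" QID="NA"/>
</TAGS>
<META/>
</NEW_SCHEMA>"""


def check_spans(xml_string):
    """Map each tag id to (unescaped text attribute, span_matches_text)."""
    root = ET.fromstring(xml_string)
    text = root.findtext("TEXT")
    report = {}
    for tag in root.find("TAGS"):
        start, end = (int(x) for x in tag.get("spans").split("~"))
        # tag.get() returns the unescaped value: "&amp;" in the file is "&" here.
        value = tag.get("text")
        report[tag.get("id")] = (value, text[start:end] == value)
    return report


print(check_spans(MEDTATOR_XML))
```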
