adding links in a knowledge base & BIO-export in a separate column #8
Comments
Thank you so much for your feedback! My understanding is that you want to add knowledge base links to entities and have those links be added in the BIO-format export. For example, we have the following annotation in the sentence "I had a fever last week": <SYMP id="S0" text="fever" spans="8~13" cui="C0015967" /> In the UMLS KB, fever's CUI is C0015967 and the link is https://uts.nlm.nih.gov/uts/umls/concept/C0015967. So you want to have the
If the above example is correct, our approach in the past has been to add the knowledge base link or ID as an attribute of the entity concept, so that users can fill it in while annotating. For example, adding Once the annotation is finished, we can use the value of
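To make the requested output concrete, here is a minimal sketch of a BIO export with the knowledge base link in a separate column. It is not MedTator's exporter: the function `to_bio_rows`, the annotation dictionaries, the whitespace tokenization, and the `_` placeholder for tokens without a link are all assumptions made for illustration. Only the sentence, the span `8~13`, the CUI, and the UMLS URL come from the example above.

```python
import re

# Base URL taken from the UMLS link in the example above
UMLS_URL = "https://uts.nlm.nih.gov/uts/umls/concept/"


def to_bio_rows(text, annotations):
    """Return (token, BIO label, KB link) rows for whitespace-separated tokens.

    `annotations` is a hypothetical list of dicts with the keys
    'tag', 'start', 'end' (character offsets) and 'cui'.
    """
    rows = []
    for match in re.finditer(r"\S+", text):
        label, link = "O", "_"  # "_" marks "no link" (an assumed placeholder)
        for ann in annotations:
            # A token belongs to an entity if it lies inside the entity span
            if match.start() >= ann["start"] and match.end() <= ann["end"]:
                prefix = "B" if match.start() == ann["start"] else "I"
                label = f"{prefix}-{ann['tag']}"
                link = UMLS_URL + ann["cui"]
                break
        rows.append((match.group(), label, link))
    return rows


text = "I had a fever last week"
annotations = [{"tag": "SYMP", "start": 8, "end": 13, "cui": "C0015967"}]
rows = to_bio_rows(text, annotations)
for token, label, link in rows:
    print(f"{token}\t{label}\t{link}")
```

Each output line has three tab-separated columns, so the link column can be dropped or kept by downstream NER and NEL tools independently.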
Oh, that's great! So the attributes can already be added to the XML files. You described the task very well. Indeed, my use case is both NER and NEL, which is why we need the links. Moreover, if all attributes are already in the XML files, we could also put nested named entities into a separate column (e.g., I've tested adding the links just now with the following text from German Wikipedia:
The exported xml looks like:
The links are there, that's great, but the ampersand P.S. The exported BIO file has |
Yes, the attributes can be exported in the BioC format:
<infon key="attribute_name">attribute_value</infon>
In the default MedTator XML format, the attributes are also saved as tag attributes (e.g.,
And thank you for finding the bug! I just checked and found that
It is caused by the default XML encoding when creating a text node. When converting the annotation to BioC format, the
So, I have changed the function to another one that saves text data in a CDATA section to preserve its original characters. The output now looks like the following:
<document>
<id>sample-5dsw7.xml</id>
<passage>
<offset>0</offset>
<text><![CDATA[I have fever & headache about < 10 days.]]></text>
<annotation id="S0">
<location length="16" offset="7" />
<text><![CDATA[fever & headache]]></text>
<infon key="certainty">positive</infon>
<infon key="comment">NA</infon>
</annotation>
<annotation id="S1">
<location length="9" offset="30" />
<text><![CDATA[< 10 days]]></text>
<infon key="certainty">positive</infon>
<infon key="comment">NA</infon>
</annotation>
</passage>
</document>
When other tools or scripts parse the XML, the text will be parsed as is. We also have a script in our repository that shows how to parse the MedTator XML file with Python, for reference: medtator_kits.py. Parsing the BioC format XML should be a similar process. I have deployed the fixed version to our public site, https://ohnlp.github.io/MedTator/, so you can try it now. The fix will also be included in the next release. Thank you again for your feedback! Please let us know if you find any issues.
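As a rough illustration of parsing the BioC output, the sketch below reads a shortened copy of the sample above with Python's standard `xml.etree.ElementTree`. The function name `read_annotations` and the returned dictionary layout are assumptions for this example, not part of MedTator or `medtator_kits.py`. It shows that a CDATA section is returned as ordinary text, and that the `offset` and `length` values locate the same string in the passage text.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the BioC sample shown above
BIOC_SNIPPET = """<document>
  <id>sample-5dsw7.xml</id>
  <passage>
    <offset>0</offset>
    <text><![CDATA[I have fever & headache about < 10 days.]]></text>
    <annotation id="S0">
      <location length="16" offset="7" />
      <text><![CDATA[fever & headache]]></text>
      <infon key="certainty">positive</infon>
    </annotation>
  </passage>
</document>"""


def read_annotations(xml_text):
    """Collect id, text, offset-based slice, and infons for each annotation."""
    root = ET.fromstring(xml_text)
    results = []
    for passage in root.iter("passage"):
        passage_offset = int(passage.findtext("offset"))
        passage_text = passage.findtext("text")  # CDATA arrives as plain text
        for ann in passage.iter("annotation"):
            loc = ann.find("location")
            # Annotation offsets are shifted by the passage offset (0 here)
            start = int(loc.get("offset")) - passage_offset
            end = start + int(loc.get("length"))
            results.append({
                "id": ann.get("id"),
                "text": ann.findtext("text"),
                "from_offsets": passage_text[start:end],
                "infons": {i.get("key"): i.text for i in ann.iter("infon")},
            })
    return results


annotations = read_annotations(BIOC_SNIPPET)
print(annotations[0])
```

Comparing `text` with `from_offsets` is a cheap sanity check that the ampersand and the less-than sign did not shift any character offsets.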
Nice! Thanks a lot! I've just tested it. Indeed, the CDATA content now has a proper ampersand. But in the tags, the ampersand is converted. Please take a look:
That's the internal MedTator xml format, right? In the BioC xml files the tag has a proper ampersand now. Thank you! |
Yes, that's the MedTator XML format, and it saves the annotated text and other attributes as XML tag attribute values. The XML parser unescapes those values for you, so when you parse the XML content, you don't need to worry about the escaped characters. In Python:

from xml.dom.minidom import parse

dom = parse(full_path_to_xml)
nodes = dom.getElementsByTagName('TAGS')[0].childNodes
for node in nodes:
    # Skip whitespace text nodes, which have no attributes
    if node.nodeType != node.ELEMENT_NODE:
        continue
    # Print each attribute name with its unescaped value
    for name, value in node.attributes.items():
        print(name, value)

and in JavaScript:

var parser = new DOMParser();
var xmlDoc = parser.parseFromString(XML_TEXT_CONTENT, "text/xml");
var elems = xmlDoc.getElementsByTagName('TAGS')[0].children;
var attrs = elems[0].getAttributeNames();
var value = elems[0].getAttribute(attrs[0]);
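To see the unescaping in action, here is a small self-contained check. The XML fragment is a hypothetical stand-in written in the style of the tag shown earlier in this thread (the `doc` wrapper element and the attribute values are assumptions, not an actual MedTator export). The file stores `&amp;`, but the parser hands back a plain `&`.

```python
from xml.dom.minidom import parseString

# Hypothetical fragment: the attribute value is stored escaped as "&amp;"
SAMPLE = (
    '<doc><TAGS>\n'
    '<SYMP id="S0" spans="7~23" text="fever &amp; headache" />\n'
    '</TAGS></doc>'
)

dom = parseString(SAMPLE)
tags = dom.getElementsByTagName("TAGS")[0]
# Keep element nodes only; the newlines inside TAGS are text nodes
elements = [n for n in tags.childNodes if n.nodeType == n.ELEMENT_NODE]
text_value = elements[0].getAttribute("text")
print(text_value)  # prints: fever & headache
```

The escaping only exists in the serialized file, so scripts that go through an XML parser never need to replace `&amp;` by hand.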
Is it possible in your tool to add links to entities in a knowledge base and export them to BIO format in a separate column? If yes, how can I do that? If not, would you consider adding that functionality?