example data processing warning using google colab #54

Open
amscosta opened this issue Feb 26, 2024 · 4 comments

@amscosta

Hello,
The following warning is issued when processing one of the .xml files from the example data:
Processing: paperetl/file/data/0.xml
/usr/local/lib/python3.10/dist-packages/paperetl/file/tei.py:35: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument features="xml" into the BeautifulSoup constructor.
soup = BeautifulSoup(stream, "lxml")

Any clue how to avoid/correct that?
Thanks a lot.

@amscosta
Author

I am using the Colab notebook.

@davidmezzetti
Member

You can ignore it like this:

import warnings
from bs4 import XMLParsedAsHTMLWarning

warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)
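
If silencing the warning for the whole Colab session feels too broad, the filter can be limited to the processing step. This is a minimal sketch using only the standard library; the helper name `ignoring` is hypothetical and not part of paperetl or BeautifulSoup.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def ignoring(category):
    """Ignore warnings of `category` inside the with-block only.

    warnings.catch_warnings() saves the current filters on entry and
    restores them on exit, so the rest of the session is unaffected.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=category)
        yield


# Usage in the notebook (assumes bs4 is installed, as in the traceback above):
#
#   from bs4 import XMLParsedAsHTMLWarning
#   with ignoring(XMLParsedAsHTMLWarning):
#       ...  # run the paperetl processing step here
```

Note that `warnings.catch_warnings()` changes global state and is not thread-safe, so this only suits single-threaded use such as a notebook cell.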

@amscosta2022

Thanks.
But the message says "using an XML parser will be more reliable".

@davidmezzetti
Member

Feel free to fork this project and try. It doesn't work in the tests I've run.
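
For anyone who does try this in a fork, here is a small sketch of one difference between the two parsers that can break existing lookups. It assumes `bs4` and `lxml` are installed, and the TEI snippet is made up for illustration. Whether this is what breaks paperetl's tests is an assumption, not something confirmed in this thread.

```python
from bs4 import BeautifulSoup

# Made-up TEI-like fragment with mixed-case tag names
tei = "<TEI><teiHeader><biblStruct>x</biblStruct></teiHeader></TEI>"

# HTML parser (what tei.py uses today): tag names are lowercased
html_soup = BeautifulSoup(tei, "lxml")
# XML parser (what the warning suggests): tag names keep their case
xml_soup = BeautifulSoup(tei, "xml")

print(html_soup.find("biblstruct") is not None)  # lowercase lookup, HTML parser
print(xml_soup.find("biblstruct") is not None)   # lowercase lookup, XML parser
print(xml_soup.find("biblStruct") is not None)   # original-case lookup, XML parser
```

If the lookups in the project were written against lowercased names, switching to `features="xml"` would require updating each of them to the original case, which may be why a one-line parser swap is not enough.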
