Get rid of the redundant note info from the summary #138

Open
kathy-lee opened this issue Jun 27, 2024 · 3 comments

Comments

@kathy-lee

kathy-lee commented Jun 27, 2024

Hi there, thanks for making this great library.
I have a question about the summary of a Wikipedia page. Take https://en.wikipedia.org/wiki/South_Bruce_Peninsula as an example: when I get its summary with this library ('wikipage.summary'), the returned text starts with an extra note, as follows:
"""
South Bruce Peninsula is not to be confused with the Municipality of South Bruce, Ontario

South Bruce Peninsula is a town at the base of the Bruce Peninsula of Ontario, Canada, in Bruce County between Lake Huron and Georgian Bay. It was formed on January 1, 1999, when the town of Wiarton, the village of Hepworth, and the townships of Albemarle and Amabel were amalgamated. The new municipality was created to provide necessary political representation, administrative support, and necessary municipal services on behalf of the residents.
Tourism, particularly cottage rental and providing services to visitors, is the major industry in the area. Many cottages are found along Sauble Beach (North and South).
"""
But the first paragraph is not actually part of the summary; it is an additional note. Is there a way to get rid of it ("South Bruce Peninsula is not to be confused with the Municipality of South Bruce, Ontario")?

Many thanks in advance!

@kathy-lee
Author

Hi @barrust, could you give me a hint, please? Thank you!

@barrust
Owner

barrust commented Jul 15, 2024

Really, there is nothing that can be done on the library's side, because the summary simply pulls the first section of the page. The summary is not part of the API itself; it really just comes from the extraction extension. The only real solution for your case would be to do some NLP or text processing to determine whether part or all of the text is unnecessary.

Something simple might be to do something like this:

from mediawiki import MediaWiki

wikipedia = MediaWiki()
p = wikipedia.page("South Bruce Peninsula")

# Keep only non-empty paragraphs that do not contain the hatnote phrase
summary = []
for para in p.summary.split("\n"):
    if para and "not to be confused with" not in para:
        summary.append(para)

print("\n".join(summary))

It's not perfect and doesn't support all languages (which is why it cannot be added directly to the library), but it should do what you are looking for.
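
For reuse, the same filtering idea could be wrapped in a small helper that works on any summary string. This is only a sketch, not something the library provides: the function name `strip_hatnotes`, the `DEFAULT_MARKERS` tuple, and the shortened sample text are all made up for illustration, and the match is made case-insensitive as an extra assumption.

```python
from typing import Iterable

# Hypothetical marker phrases (not part of the library); extend this tuple
# for other note wordings or for other languages.
DEFAULT_MARKERS = ("not to be confused with",)


def strip_hatnotes(summary: str, markers: Iterable[str] = DEFAULT_MARKERS) -> str:
    """Drop empty lines and paragraphs containing any marker phrase."""
    lowered = [m.lower() for m in markers]
    kept = []
    for para in summary.split("\n"):
        # Compare in lowercase so "Not to be confused with" is caught too.
        if para and not any(m in para.lower() for m in lowered):
            kept.append(para)
    return "\n".join(kept)


# Shortened sample modelled on the summary quoted in this issue.
sample = (
    "South Bruce Peninsula is not to be confused with the Municipality of South Bruce, Ontario\n"
    "\n"
    "South Bruce Peninsula is a town at the base of the Bruce Peninsula of Ontario, Canada.\n"
    "Tourism is the major industry in the area."
)
print(strip_hatnotes(sample))
```

With a real page, the call would be `strip_hatnotes(p.summary)`. Passing a different `markers` tuple is one way to cover other phrasings, though a plain substring match will also drop a genuine paragraph that happens to contain the phrase.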

@kathy-lee
Author

Thank you for the reply, I get your point :)
