
Stopwords being ignored #70

Open
chaturv3di opened this issue Aug 19, 2022 · 5 comments

Comments


chaturv3di commented Aug 19, 2022

I am passing the set of English stopwords that I create from yake/StopwordsList/stopwords_en.txt:

text = "YAKE! is a light-weight unsupervised automatic keyword extraction method which rests on text statistical features extracted from single documents to select the most important keywords of a text. Our system does not need to be trained on a particular set of documents, neither it depends on dictionaries, external-corpus, size of the text, language or domain. To demonstrate the merits and the significance of our proposal, we compare it against ten state-of-the-art unsupervised approaches (TF.IDF, KP-Miner, RAKE, TextRank, SingleRank, ExpandRank, TopicRank, TopicalPageRank, PositionRank and MultipartiteRank), and one supervised method (KEA). Experimental results carried out on top of twenty datasets (see Benchmark section below) show that our methods significantly outperform state-of-the-art methods under a number of collections of different sizes, languages or domains. In addition to the python package here described, we also make available a demo, an API and a mobile app."

import os
import yake

language = "en"
max_ngram_size = 5
deduplication_threshold = 0.9
deduplication_algo = 'seqm'
windowSize = 1
numOfKeywords = 5

# Location of the file downloaded from https://github.com/LIAAD/yake/blob/master/yake/StopwordsList/stopwords_en.txt
stopwords_file = os.path.join(home_dir, "data_txt", "yake_stopwords_en.txt")
with open(stopwords_file, 'r') as sw_f:
    yake_stopwords = set(sw_f.read().lower().split("\n"))

yake_kw_extractor = yake.KeywordExtractor(lan=language,
                                          n=max_ngram_size,
                                          dedupLim=deduplication_threshold,
                                          dedupFunc=deduplication_algo,
                                          windowsSize=windowSize,
                                          top=numOfKeywords,
                                          features=None,
                                          stopwords=yake_stopwords)

yake_kw_extractor.extract_keywords(text)

And the results end up containing stopwords like of, a, from, etc.

[('trained on a particular set', -60.326928913747196),
 ('keywords of a text', -0.665864990295941),
 ('important keywords of a text', -0.31206738772455755),
 ('light-weight unsupervised automatic keyword extraction', 0.00029233948201177757),
 ('statistical features extracted from single', 0.0008477866813335354)]

If I invoke the method with stopwords=None, the results don't change either. Am I doing something silly here?

Thanks a lot.


secsilm commented Oct 10, 2023

I guess the stopword-removal step happens at the end, i.e.:

  1. split words
  2. extract candidates
  3. score, dedupe, and remove stopwords

@JeremyBrent

@chaturv3di I am running into the same issue, have you found a solution?

@chaturv3di
Author

Unfortunately not.

@JeremyBrent

Not sure if secsilm was referring to this, but I am thinking about applying my stopwords in a post-processing step outside the YAKE class.
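A minimal sketch of that post-processing idea (hypothetical helper, not part of YAKE; assumes `keywords` is the list of (phrase, score) tuples returned by extract_keywords() and `stopwords` is a set like the one loaded earlier in this thread):

```python
# Hypothetical post-filter: drop any extracted phrase that still
# contains a stopword token.
def drop_stopword_phrases(keywords, stopwords):
    kept = []
    for phrase, score in keywords:
        if not any(tok in stopwords for tok in phrase.lower().split()):
            kept.append((phrase, score))
    return kept

# Toy example using results from this thread:
stopwords = {"of", "a", "from", "on"}
keywords = [
    ("keywords of a text", -0.665),
    ("light-weight unsupervised automatic keyword extraction", 0.0003),
]
print(drop_stopword_phrases(keywords, stopwords))
# → [('light-weight unsupervised automatic keyword extraction', 0.0003)]
```

The obvious downside is that it discards candidates instead of cleaning them, so you may end up with fewer than `top` keywords.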

@chaturv3di
Author

chaturv3di commented Mar 14, 2024

That works, but it's not elegant. E.g. if I want phrases of up to 4 words without stopwords, and I remove stopwords in post-processing, then I'd need to fetch phrases of up to 6 words and hope that no more than 2 of the words in each phrase are stopwords. That is clunky and increases the compute time.

OTOH, there doesn't seem to be another option right now.
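For the record, the clunky workaround I mean would look something like this (hypothetical helper, not a YAKE API; assumes the extractor was built with n=6 and that mechanically joining the remaining tokens is acceptable):

```python
# Hypothetical sketch of the over-fetch-then-strip workaround: extract
# longer phrases, remove stopword tokens from each one, and keep the
# stripped phrase only if it ends up within the desired length.
def strip_and_truncate(keywords, stopwords, max_words=4):
    out = []
    for phrase, score in keywords:
        content = [t for t in phrase.split() if t.lower() not in stopwords]
        if 0 < len(content) <= max_words:
            out.append((" ".join(content), score))
    return out

stopwords = {"on", "a", "of"}
keywords = [("trained on a particular set", -60.3)]
print(strip_and_truncate(keywords, stopwords))
# → [('trained particular set', -60.3)]
```

Note the joined result ("trained particular set") can be ungrammatical, which is part of why this isn't a satisfying substitute for stopword handling inside candidate generation.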
