
Stopwords being ignored #70

Open
chaturv3di opened this issue Aug 19, 2022 · 5 comments

Comments


chaturv3di commented Aug 19, 2022

I am passing the set of English stopwords that I create from yake/StopwordsList/stopwords_en.txt:

text = "YAKE! is a light-weight unsupervised automatic keyword extraction method which rests on text statistical features extracted from single documents to select the most important keywords of a text. Our system does not need to be trained on a particular set of documents, neither it depends on dictionaries, external-corpus, size of the text, language or domain. To demonstrate the merits and the significance of our proposal, we compare it against ten state-of-the-art unsupervised approaches (TF.IDF, KP-Miner, RAKE, TextRank, SingleRank, ExpandRank, TopicRank, TopicalPageRank, PositionRank and MultipartiteRank), and one supervised method (KEA). Experimental results carried out on top of twenty datasets (see Benchmark section below) show that our methods significantly outperform state-of-the-art methods under a number of collections of different sizes, languages or domains. In addition to the python package here described, we also make available a demo, an API and a mobile app."

import os
import yake

language = "en"
max_ngram_size = 5
deduplication_threshold = 0.9
deduplication_algo = 'seqm'
windowSize = 1
numOfKeywords = 5

# Location of the file downloaded from https://github.com/LIAAD/yake/blob/master/yake/StopwordsList/stopwords_en.txt
stopwords_file = os.path.join(home_dir, "data_txt", "yake_stopwords_en.txt")
with open(stopwords_file, 'r') as sw_f:
    yake_stopwords = set(sw_f.read().lower().split("\n"))

yake_kw_extractor = yake.KeywordExtractor(lan=language,
                                          n=max_ngram_size,
                                          dedupLim=deduplication_threshold,
                                          dedupFunc=deduplication_algo,
                                          windowsSize=windowSize,
                                          top=numOfKeywords,
                                          features=None,
                                          stopwords=yake_stopwords)

yake_kw_extractor.extract_keywords(text)

And the results end up containing stopwords like of, a, from, etc.

[('trained on a particular set', -60.326928913747196),
 ('keywords of a text', -0.665864990295941),
 ('important keywords of a text', -0.31206738772455755),
 ('light-weight unsupervised automatic keyword extraction', 0.00029233948201177757),
 ('statistical features extracted from single', 0.0008477866813335354)]

If I invoke the method with stopwords=None, the results don't change either. Am I doing something silly here?

Thanks a lot.


secsilm commented Oct 10, 2023

I guess the stopword-removal step happens at the end, i.e.:

  1. split words
  2. extract candidates
  3. score, dedupe, and remove stopwords

@JeremyBrent

@chaturv3di I am running into the same issue, have you found a solution?

@chaturv3di
Author

Unfortunately not.

@JeremyBrent

Not sure if secsilm was referring to this, but I am thinking about applying my stopwords in a post-processing step outside the YAKE class.
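A minimal sketch of that post-processing idea (hypothetical helper, not part of YAKE; assumes `keywords` is the list of (phrase, score) tuples returned by extract_keywords() and `stopwords` is a set like the one loaded earlier in this thread):

```python
# Hypothetical post-filter: drop any extracted phrase that still
# contains a stopword token.
def drop_stopword_phrases(keywords, stopwords):
    kept = []
    for phrase, score in keywords:
        if not any(tok in stopwords for tok in phrase.lower().split()):
            kept.append((phrase, score))
    return kept

# Toy example using results from this thread:
stopwords = {"of", "a", "from", "on"}
keywords = [
    ("keywords of a text", -0.665),
    ("light-weight unsupervised automatic keyword extraction", 0.0003),
]
print(drop_stopword_phrases(keywords, stopwords))
# → [('light-weight unsupervised automatic keyword extraction', 0.0003)]
```

The obvious downside is that it discards candidates instead of cleaning them, so you may end up with fewer than `top` keywords.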

@chaturv3di
Author

chaturv3di commented Mar 14, 2024

That works, but it's not elegant. E.g. if I want phrases of up to 4 words without stopwords, and I remove stopwords in post-processing, then I'd need to fetch phrases of up to 6 words and hope that no more than 2 of the words in each phrase are stopwords. That is clunky and increases the compute time.

OTOH, there doesn't seem to be another option right now.
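For the record, the clunky workaround I mean would look something like this (hypothetical helper, not a YAKE API; assumes the extractor was built with n=6 and that mechanically joining the remaining tokens is acceptable):

```python
# Hypothetical sketch of the over-fetch-then-strip workaround: extract
# longer phrases, remove stopword tokens from each one, and keep the
# stripped phrase only if it ends up within the desired length.
def strip_and_truncate(keywords, stopwords, max_words=4):
    out = []
    for phrase, score in keywords:
        content = [t for t in phrase.split() if t.lower() not in stopwords]
        if 0 < len(content) <= max_words:
            out.append((" ".join(content), score))
    return out

stopwords = {"on", "a", "of"}
keywords = [("trained on a particular set", -60.3)]
print(strip_and_truncate(keywords, stopwords))
# → [('trained particular set', -60.3)]
```

Note the joined result ("trained particular set") can be ungrammatical, which is part of why this isn't a satisfying substitute for stopword handling inside candidate generation.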
