Norwegian is not supported #65

Open
emilmuller opened this issue Mar 11, 2019 · 0 comments

I'm doing keyword extraction on Norwegian text. Without Pattern installed, stop words end up among the extracted keywords. For example, if I extract keywords from the first paragraph on Albert Einstein in the Norwegian Wikipedia:

Albert Einstein var en tyskfødt teoretisk fysiker og nobelprisvinner som er mest kjent for å ha formulert relativitetsteorien og vist at masse og energi er ekvivalente ved masseenergiloven, E = mc2. Gjennom den spesielle relativitetsteorien revolusjonerte han mekanikken og presiserte tidsbegrepet. Han var sentral i utviklingen av kvantemekanikken og er grunnleggeren av moderne kosmologi. Han regnes for å være en av de mest betydningsfulle vitenskapsmenn i det 20. århundre.

I get the following keywords:

  • i
  • og
  • han
  • hans
  • av
  • for å
  • ble
  • om
  • einstein var en
  • ved
  • som er mest
  • relativitetsteorien
  • det
  • fysikk
  • med
  • den
  • verden
  • verdens
  • enn
  • vitenskapelige
  • århundre
  • århundrets
  • person
  • første årene
  • professor

"i", "og", "av", "for", "å", "ble", "om", etc. are stop words, so the result is unusable.
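As a stopgap, the keyword list can be post-filtered against a stop word list. The snippet below is a minimal sketch of that idea, not part of summa: `filter_keywords` is a hypothetical helper, and `NORWEGIAN_STOP_WORDS` is a small hand-picked subset chosen to cover the output above, not a complete Norwegian stop word list.

```python
# Hypothetical post-filter; neither name below comes from summa.
# Illustrative subset only, NOT a complete Norwegian stop word list.
NORWEGIAN_STOP_WORDS = frozenset({
    "i", "og", "han", "hans", "av", "for", "å", "ble", "om", "ved",
    "det", "med", "den", "enn", "var", "en", "som", "er",
})


def filter_keywords(keywords, stop_words=NORWEGIAN_STOP_WORDS):
    """Strip leading/trailing stop words from each phrase and drop empty results."""
    kept = []
    for phrase in keywords:
        tokens = phrase.lower().split()
        # Trim stop words from the front of the phrase.
        while tokens and tokens[0] in stop_words:
            tokens.pop(0)
        # Trim stop words from the end of the phrase.
        while tokens and tokens[-1] in stop_words:
            tokens.pop()
        if tokens:
            cleaned = " ".join(tokens)
            # Preserve the original order and skip duplicates.
            if cleaned not in kept:
                kept.append(cleaned)
    return kept


raw = ["i", "og", "for å", "einstein var en", "relativitetsteorien"]
print(filter_keywords(raw))
```

This only cleans up the output. It does not change how the keyword graph is built, so stop words still influence the ranking.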

After installing Pattern, importing summa fails with:

>>> from summa.summarizer import summarize
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\summa\__init__.py", line 1, in <module>
    from summa import commons, graph, keywords, pagerank_weighted, \
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\summa\keywords.py", line 5, in <module>
    from .preprocessing.textcleaner import clean_text_by_word as _clean_text_by_word
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\summa\preprocessing\textcleaner.py", line 8, in <module>
    from pattern.en import tag
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\__init__.py", line 61, in <module>
    from pattern.text.en.inflect import (
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\__init__.py", line 80, in <module>
    from pattern.text.en import wordnet
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\wordnet\__init__.py", line 57, in <module>
    nltk.data.find("corpora/" + token)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 673, in find
    return find(modified_name, paths)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 660, in find
    return ZipFilePathPointer(p, zipentry)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\compat.py", line 228, in _decorator
    return init_func(*args, **kwargs)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 506, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\compat.py", line 228, in _decorator
    return init_func(*args, **kwargs)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 1055, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\..\..\..\..\zipfile.py", line 1222, in __init__
    self._RealGetContents()
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\..\..\..\..\zipfile.py", line 1289, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

So I cannot use Pattern (see issue #30), which leaves Norwegian unusable and effectively unsupported. I assume the same applies to other languages.
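The traceback ends with `nltk.data.find` opening a corpus archive that `zipfile` rejects. One possible cause, which is an assumption and not confirmed here, is a truncated or corrupted download in the NLTK data directory. The sketch below uses only the standard library to list `*.zip` files that are not valid archives; `find_broken_zips` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def find_broken_zips(root):
    """Return paths of *.zip files under `root` that are not valid zip archives."""
    broken = []
    for path in sorted(Path(root).rglob("*.zip")):
        # is_zipfile() checks for the zip end-of-central-directory record
        # without extracting anything.
        if not zipfile.is_zipfile(str(path)):
            broken.append(path)
    return broken
```

To use it, call `find_broken_zips` on the NLTK data directory, for example `Path.home() / "nltk_data"` (one of NLTK's default search locations; the actual location on this machine is an assumption). Any file it reports can be deleted and downloaded again.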

emilmuller changed the title from "Can't use Pattern" to "Norwegian is not supported" on Mar 11, 2019