quelqu'hui

tokenizer for contemporary french.

text	tokens
peut-on	`peut` `-on`
prends-les	`prends` `-les`
Villar-les-bois	`Villar-les-bois`
lecteur-rice-x-s	`lecteur-rice-x-s`
correcteur·rices	`correcteur·rices`
mais.maintenant	`mais` `.` `maintenant`
relecteur.rice.s	`relecteur.rice.s`
autre(s)	`autre(s)`
(autres)	`(` `autres` `)`
(autre(s))	`(` `autre(s)` `)`
www.on-tenk.com.	`www.on-tenk.com` `.`
oui..?	`oui` `..?`
aujourd'hui	`aujourd'hui`
c'est	`c'` `est`
dedans/dehors	`dedans` `/` `dehors`
02/10/2024	`02/10/2024`
:-)	`:-)`
(:happy:)	`(` `:happy:` `)`

usage

use as a tokenizer in a spacy pipeline:

import quelquhui
import spacy

nlp = spacy.load('fr_core_news_sm')
nlp.tokenizer = quelquhui.Toquenizer(nlp.vocab)

if you save the pipeline and want to load it back:

nlp2 = spacy.load("./model_output", config={
    "nlp": {"tokenizer": {"@tokenizers": "quelquhui_tokenizer"}}
})

use as an independent tokenizer (with no dependencies):

import quelquhui

qh = quelquhui.light.Toquenizer()
doc = qh("la machine à (b)rouiller le temps s'est peut-être dérailler...")

installation

pip install git+https://github.com/thjbdvlt/quelquhui

configuration

a few options can be set to modify the tokenizer's behavior:

import quelquhui

qh = quelquhui.Toquenizer(
    abbrev = ["ref", "ed[s]"], # regex is supported
    inclusive = True, # default
    emoticon = True, # default
    url = True, # default
    regexurl = r"(?:\w+://|www\.)[\S]+[\w/]", # default
    regexemoticon = r":-?[\)\(]", # (default one is too long to be reproduced here.)
    chars = {
        "APOSTROPHE": "'`´’",  # default
        "HYPHEN": "-–—",  # default
        # characters set here replace the defaults;
        # the others are left unchanged.
        # the complete list with default values can be found with
        # `quelquhui.default.Chars.__dict__`
    },
    words = {
        "ELISION": ["j", "s", "jusqu"], # ...
        "INVERSION": ["on", "y", "ci"], # ...
        "SUFF_FEMININE": ["e", "rice", "ère"], # ...
        "SUFF_NONBINARY": ["x"],
        "SUFF_PLURAL": ["s", "x"],
        # there are only these 5.
        # (the default lists for the first three are longer.)
    }
)

how it works

1. split text on spaces, then re-split each chunk using a few functions (applied in a loop) that produce frozen tokens, i.e. tokens that won't be tokenized again by the following functions/steps. typical frozen tokens are urls and text-emoji like `:happy:`, which can be hard to tokenize in cases like `(:happy:)`: the regex looking for emoticons must not match the `:)` there, so the rules have to be applied in a specific order.
2. for each resulting substring:
    i. list the characters on which words must be split, typically punctuation marks such as commas or periods. these are the candidate token boundaries.
    ii. list the characters that must be kept together with their neighbours, even if they were listed in step 2.i.
    iii. remove 2.ii from 2.i, and split on the remaining splitting characters.
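the steps above can be sketched in a few lines of plain python. this is only an illustration of the idea, not the library's actual code: the patterns and the function names (`freeze`, `split_with_exceptions`, `tokenize`) are assumptions made up for the example.

```python
import re

# Hypothetical patterns for illustration; not the library's real defaults.
FREEZE_PATTERNS = [
    re.compile(r":\w+:"),    # text-emoji such as :happy: (must be tried first)
    re.compile(r":-?[)(]"),  # emoticons such as :) or :-(
]
SPLIT = re.compile(r"[.,()]")      # step 2.i: candidate token boundaries
KEEP = re.compile(r"\.(?=rice|s)")  # step 2.ii: exceptions to keep together


def freeze(chunk):
    """Return (text, frozen) pieces; frozen pieces are never re-split."""
    pieces = [(chunk, False)]
    for pattern in FREEZE_PATTERNS:  # the order of the patterns matters
        next_pieces = []
        for text, frozen in pieces:
            if frozen:
                next_pieces.append((text, True))
                continue
            pos = 0
            for m in pattern.finditer(text):
                if m.start() > pos:
                    next_pieces.append((text[pos:m.start()], False))
                next_pieces.append((m.group(), True))
                pos = m.end()
            if pos < len(text):
                next_pieces.append((text[pos:], False))
        pieces = next_pieces
    return pieces


def split_with_exceptions(text):
    """Step 2.iii: split on the 2.i positions that are not also in 2.ii."""
    keep = {m.start() for m in KEEP.finditer(text)}
    cuts = [m.start() for m in SPLIT.finditer(text) if m.start() not in keep]
    tokens, pos = [], 0
    for i in cuts:
        if i > pos:
            tokens.append(text[pos:i])
        tokens.append(text[i])  # the splitting character is its own token
        pos = i + 1
    if pos < len(text):
        tokens.append(text[pos:])
    return tokens


def tokenize(text):
    tokens = []
    for chunk in text.split():  # step 1: split on spaces
        for piece, frozen in freeze(chunk):
            tokens.extend([piece] if frozen else split_with_exceptions(piece))
    return tokens


print(tokenize("(:happy:) relecteur.rice.s mais.maintenant"))
# ['(', ':happy:', ')', 'relecteur.rice.s', 'mais', '.', 'maintenant']
```

if the two freeze patterns were swapped, the emoticon pattern would grab the `:)` at the end of `(:happy:)` first, which is why the order is fixed.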

period

in most cases, a period is a token distinct from the word it follows: a period ending a sentence obviously isn't part of that word. but in some cases the period is part of the word (abbreviations: p. 10), and in others the period and the letters following it must be kept in the same token (inclusive language: auteur.rice.s). these cases are exceptions, so they are handled in 2.ii: they are removed from the periods found in 2.i. the pattern in 2.i is `\.` (match a period wherever it is, without any condition), while the pattern in 2.ii could be, in simplified form, `(?<=[^a-z][a-z])\.|\.(?=rice|s)` (match a period if it is preceded by a single letter or followed by rice or s).
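to make the subtraction concrete, here is a small sketch that applies the two simplified patterns quoted above and keeps only the periods that remain token boundaries. the function name `period_boundaries` is hypothetical, and the patterns are the simplified ones from this section, not the library's full defaults.

```python
import re

SPLIT = re.compile(r"\.")                              # 2.i: every period
KEEP = re.compile(r"(?<=[^a-z][a-z])\.|\.(?=rice|s)")  # 2.ii: simplified exceptions


def period_boundaries(text):
    """Return the indexes of the periods that are token boundaries."""
    candidates = {m.start() for m in SPLIT.finditer(text)}
    exceptions = {m.start() for m in KEEP.finditer(text)}
    return sorted(candidates - exceptions)


print(period_boundaries("auteur.rice.s"))    # []
print(period_boundaries("mais.maintenant"))  # [4]
print(period_boundaries("voir p. 10."))      # [10]
```

because the pattern is simplified, it is too greedy in places: any period followed by an `s` is kept, so something like `mais.si` would not be split here.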

hyphen

in most cases, a hyphen isn't a token boundary, because in french the hyphen is a sign that says "these two words are actually one word", as in Vaison-la-romaine. but in some cases it is a boundary, mostly in verb-subject inversion. these cases are easily described and handled with a regular expression, because the subjects are always personal pronouns: `-(?=je|tu|...`. there are also a few cases where the following word is not a pronominalized subject but a pronominalized object, as in prends-les, which is also easily handled with a regular expression. hence the pattern for the hyphen in 2.i is not unconditional and simple (as it is for the period), but rather complex and conditional (match a hyphen if it is followed by a pronominalized subject or object).
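a minimal sketch of such a conditional hyphen pattern, using only the standard `re` module. the pronoun list is a short, hypothetical one chosen for the example (the library's own lists are longer), and the extra `(?![\w-])` check is an assumption added here so that place names like Villar-les-bois stay whole.

```python
import re

# Hypothetical, deliberately short list of pronouns for illustration.
PRONOUNS = ["je", "tu", "il", "elle", "on", "nous", "vous",
            "ils", "elles", "le", "la", "les"]

# Longest alternatives first, so that "les" is tried before "le".
_ALTS = "|".join(sorted(PRONOUNS, key=len, reverse=True))

# Zero-width match just before a hyphen that is followed by a pronoun,
# provided the pronoun ends the word (no letter or hyphen after it).
HYPHEN_SPLIT = re.compile(rf"(?=-(?:{_ALTS})(?![\w-]))")


def split_inversion(word):
    """Split a word before each hyphen that introduces a pronoun."""
    # re.split accepts patterns that match the empty string (Python 3.7+).
    return [part for part in HYPHEN_SPLIT.split(word) if part]


print(split_inversion("peut-on"))          # ['peut', '-on']
print(split_inversion("prends-les"))       # ['prends', '-les']
print(split_inversion("Villar-les-bois"))  # ['Villar-les-bois']
```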

dependencies

python3
optional: spacy
