A minimalistic, general-purpose tokenizer generator.
Use npm to install:
$ npm install doken
Import doken and create a tokenizer with regular expression rules:
const {createTokenizer, regexRule} = require('doken')

const tokenizeJSON = createTokenizer({
  rules: [
    // A type starting with an underscore is matched but not emitted
    regexRule('_whitespace', /\s+/y, {lineBreaks: true}),
    regexRule('brace', /[{}]/y),
    regexRule('bracket', /[\[\]]/y),
    regexRule('colon', /:/y),
    regexRule('comma', /,/y),
    regexRule('string', /"([^"\n\\]|\\[^\n])*"/y),
    // The decimal point is escaped so that it only matches a literal "."
    regexRule('number', /(-|\+)?\d+(\.\d+)?/y),
    regexRule('boolean', /(true|false)\b/y),
    regexRule('null', /null\b/y)
  ]
})

// The tokenizer returns an iterator; spread it to collect all tokens
let tokens = tokenizeJSON(`{"a": "Hello World!"}`)
console.log([...tokens])
A rule object contains the following fields:
- type <string> - The type of the token this rule generates. If type starts with an underscore _, the token will not be emitted by the tokenizer.
- lineBreaks <boolean> (Optional) - Set this property to true if this rule might match line breaks and you want to track it correctly.
- match <Function> - A function with the following signature:
  (input: string, position: number) => null | { length: number, value: any }
  This function will try to get the token of the given type at the given position in input if applicable. Return null if input at position is not a token with the given type, otherwise return an object. The first length characters of input after position will be the matched token. You can optionally return a value which can contain any data that will be attached to the token. If value is not given, it will default to the first length characters of input after position.
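As a minimal sketch of the match contract described above, a rule can be written by hand as a plain object. The name identifierRule and its upper-casing of the value are illustrative assumptions, not part of doken.

```javascript
// Hypothetical custom rule: matches a run of ASCII letters as an 'identifier' token.
const identifierRule = {
  type: 'identifier',
  lineBreaks: false,
  match(input, position) {
    let end = position
    while (end < input.length && /[A-Za-z]/.test(input[end])) end++
    // No letters at this position: this rule does not apply.
    if (end === position) return null
    const text = input.slice(position, end)
    // `value` is optional; here it carries an upper-cased copy of the match.
    return {length: text.length, value: text.toUpperCase()}
  }
}

console.log(identifierRule.match('foo bar', 0)) // { length: 3, value: 'FOO' }
console.log(identifierRule.match('foo bar', 3)) // null
```

Since rules is documented as an array of rule objects, an object of this shape should be usable there alongside rules built with regexRule.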
A token will be represented by an object with the following fields:
- type <string> | null - The type of the token or null if no given rules match the input.
- length <number> - The length of the token.
- value <any> - The value generated by the rule.
- pos <number> - The zero-based position of the first character of the token.
- row <number> - The zero-based row of the first character of the token.
- col <number> - The zero-based column of the first character of the token.
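The fields above are enough to report where a token sits in the input. The following sketch uses a hypothetical helper, describeToken, and hand-written token objects; the exact length and value that doken attaches to an unmatched (type null) token are assumed here.

```javascript
// Hypothetical helper: formats a token's location for error messages.
// row and col are zero-based, so 1 is added for human-readable output.
function describeToken(token) {
  const kind = token.type === null ? 'unexpected input' : token.type
  return `${kind} ${JSON.stringify(token.value)} at line ${token.row + 1}, column ${token.col + 1}`
}

// Hand-written sample tokens with the documented fields (not produced by doken here).
const sample = {type: 'boolean', length: 4, value: 'true', pos: 1, row: 0, col: 1}
const unknown = {type: null, length: 1, value: '@', pos: 7, row: 2, col: 3}

console.log(describeToken(sample))  // boolean "true" at line 1, column 2
console.log(describeToken(unknown)) // unexpected input "@" at line 3, column 4
```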
createTokenizer(options)
- options <object>
  - rules <Array<Rule>>
  - strategy 'first' | 'longest' (Optional) - Default: 'first'
- Returns: <Function>

Generates a tokenize function with the following signature:
(input: string) => IterableIterator<Token>
This function will attempt to tokenize the given input, yielding the tokens matched by the given rules one by one.
Set strategy to 'longest' to use the rule that matches the most characters instead of the first rule that matches.
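To illustrate the difference between the two strategies, here is a standalone sketch of how a rule could be chosen at a single position. It is not doken's implementation: pickMatch and stickyRule are hypothetical helpers, and the tie-breaking for 'longest' (earlier rules win) is an assumption.

```javascript
// Illustrative only: picks a rule for one position; not doken's implementation.
function pickMatch(rules, input, position, strategy = 'first') {
  let best = null
  for (const rule of rules) {
    const result = rule.match(input, position)
    if (result == null) continue
    const candidate = {type: rule.type, length: result.length}
    // 'first': the earliest rule in the list that matches wins.
    if (strategy === 'first') return candidate
    // 'longest': keep the longest match; earlier rules win ties (an assumption).
    if (best == null || candidate.length > best.length) best = candidate
  }
  return best
}

// Hypothetical helper that builds a rule object from a sticky regex.
const stickyRule = (type, regex) => ({
  type,
  match(input, position) {
    regex.lastIndex = position
    const m = regex.exec(input)
    return m == null ? null : {length: m[0].length}
  }
})

const rules = [stickyRule('keyword', /if/y), stickyRule('identifier', /[a-z]+/y)]

console.log(pickMatch(rules, 'iffy', 0, 'first'))   // { type: 'keyword', length: 2 }
console.log(pickMatch(rules, 'iffy', 0, 'longest')) // { type: 'identifier', length: 4 }
```

With 'first', rule order decides the outcome, so more specific rules need to come before more general ones. With 'longest', the order matters less, at the cost of trying every rule at each position.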
regexRule(type, regex, options)
- type <string>
- regex <RegExp>
- options <object> (Optional)
  - lineBreaks <boolean> (Optional) - Set this property to true if this rule might match line breaks and you want to track it correctly.
  - value <Function> (Optional) - A function for calculating the token value out of the match.
  - condition <Function> (Optional) - A function for indicating whether to discard the match or not.
- Returns: <Rule>

Returns a rule that attempts to match the input string with the given regex.
value can be set to a function (match: RegExpExecArray) => any. The generated token will have the returned value as its value.
condition can be set to a function (match: RegExpExecArray) => boolean. Return false to discard the matched result and go on with the next rule.
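The value and condition callbacks are plain functions of a regex match, so they can be sketched and tried without the tokenizer. The option objects, the RESERVED set, and the regexes below are illustrative assumptions; the commented regexRule calls show where they would be passed.

```javascript
// Option objects of the shape described above; names and regexes are illustrative.
const numberOptions = {
  // Turn the matched text into a number for the token's value.
  value: match => parseFloat(match[0])
}

const RESERVED = new Set(['true', 'false', 'null'])
const identifierOptions = {
  // Returning false discards the match so that a later rule can try.
  condition: match => !RESERVED.has(match[0])
}

// With doken installed, these would be passed as the third argument, e.g.
//   regexRule('number', /-?\d+(\.\d+)?/y, numberOptions)
//   regexRule('identifier', /[a-z]+/y, identifierOptions)

// The callbacks can be tried on their own with a plain RegExp match:
console.log(numberOptions.value(/-?\d+(\.\d+)?/y.exec('3.14,')))  // 3.14
console.log(identifierOptions.condition(/[a-z]+/y.exec('null')))  // false
console.log(identifierOptions.condition(/[a-z]+/y.exec('name')))  // true
```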