
Tokenization overhaul #77140

Open
2 of 11 tasks
alexdima opened this issue Jul 10, 2019 · 21 comments
Assignees
Labels
feature-request Request for new features or functionality semantic-tokens Semantic tokens issues tokenization Text tokenization
Milestone

Comments

@alexdima
Member

alexdima commented Jul 10, 2019

The current tokenisation story of VS Code is based on TM grammars, which are pretty powerful, but we are running into their limits when we want to do more than a top-down scanner can. Also, once you implement a TM interpreter, you realise how inefficiently regular expressions must be evaluated, and how TM grammars were never meant to do much more than simple colouring using just a few rules... The fact that we now have these complex grammars that end up producing beautiful tokens is more a testament to the amazing computing power available to us than to the design of the TM grammar semantics.

At the time when TM grammars were introduced and became popular, there were no language servers available which understood the semantics of a language. Therefore, TM grammars were also used to colour semantic constructs. The introduction of LSP has brought us language servers for many languages, and we want to leverage this power to reduce the complexity of the tokenizer/classifier. There is already an effort under way to specify how such an API might look in LSP at microsoft/vscode-languageserver-node#367

In any case, for smarter languages where we offer great extensions, such as for TypeScript or C++, we have noticed two different patterns emerge to try and compensate for these limitations.


Complex TM grammar approach (TypeScript)

This approach was taken by TypeScript, where we now have immense regular expressions, which are a testament to the smartness of their author, but which are potentially very slow to evaluate on the UI thread.


Text Editor Decorations (C++)

This approach was taken by C++, where we now receive potentially unbounded amounts of text editor decorations, pushed by the C++ extension to correct or enhance the TM grammar's tokens. The limits of text editor decorations are starting to show; I have collected some of the issues under this query. Due to their memory cost, complexity, and breadth of usage (i.e. we cannot touch the editing logic around them at this point), text editor decorations are not the right tool for this job...


Both approaches show that there is a real need for something more, and that folks who care can get really creative and smart in tackling this need even where we fall short as a platform. This issue is about overhauling how tokenization works in VS Code and tries to address multiple goals at once:

1. Move tokenization off the UI thread

Today, TM tokenization runs on the UI thread. Even more interesting, we have numerous features (such as typing } or typing (, ', ", etc.) where we need to know synchronously, at the time we interpret the typed character, if we are in a comment, in a string, in a regex, or somewhere else... So we have code paths where we end up tokenizing the current line synchronously (given that the line above is tokenized) in order to find out the exact context we are in, and then we make a decision based on that.

We have looked into this and built a prototype where we removed the synchronous tokenization... Moving this kind of classification off the UI thread entirely would result in severe flakiness... In other words, sometimes pressing ' would insert '|' and sometimes only '|, in the same file, in the same location, based purely on typing speed and the time it takes to send tokens over from the web worker. Having an editor where typing something does one thing 90% of the time and another thing 10% of the time based on typing speed would IMHO be completely unacceptable.

As a one-off approach, we have written a fast classifier for comments, strings and regexes for TS, in TS. We will experiment to see if this classifier can be used synchronously on the UI thread to determine what to do when these characters (}, ', etc.) are typed. The challenge here lies in making it incremental (not starting from the beginning of the file on each keystroke). Also, since these syntax constructs are "rare" relative to the entire body of text, a line-based representation would not be a good one. Another idea is that perhaps we shouldn't store the locations of strings, comments, etc. at all, but only the save-points between them, given that the classifier would be fast enough to compute them again.
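As an illustration of the save-point idea (all names below are hypothetical, not the actual prototype), such a classifier could store only the state each line starts in, so that after an edit it restarts at the first dirty line and stops as soon as a save-point comes out unchanged:

```typescript
// Hypothetical sketch: an incremental classifier that tracks only whether
// each line *starts* inside a block comment or a template string.
// A real TS classifier also needs line comments, regexes, normal strings, etc.
type State = 'code' | 'comment' | 'template';

// Scan one line, starting in `state`, and return the state at the line's end.
function classifyLine(line: string, state: State): State {
  let i = 0;
  while (i < line.length) {
    if (state === 'code') {
      if (line.startsWith('/*', i)) { state = 'comment'; i += 2; }
      else if (line[i] === '`') { state = 'template'; i += 1; }
      else { i += 1; }
    } else if (state === 'comment') {
      if (line.startsWith('*/', i)) { state = 'code'; i += 2; }
      else { i += 1; }
    } else { // template string
      if (line[i] === '\\') { i += 2; }                 // skip escaped character
      else if (line[i] === '`') { state = 'code'; i += 1; }
      else { i += 1; }
    }
  }
  return state;
}

class Classifier {
  // Save-points: entry[n] is the state at the start of line n.
  private entry: State[] = ['code'];

  // Re-classify from the first edited line downwards; stop early when a
  // save-point is unchanged (valid only if no line below was edited too).
  update(lines: string[], firstDirtyLine: number): void {
    let state = this.entry[firstDirtyLine] ?? 'code';
    for (let n = firstDirtyLine; n < lines.length; n++) {
      state = classifyLine(lines[n], state);
      if (this.entry[n + 1] === state) { return; }
      this.entry[n + 1] = state;
    }
  }

  stateAt(line: number): State { return this.entry[line] ?? 'code'; }
}
```

With this shape, the synchronous question "am I inside a string or comment here?" only needs the line's entry state plus a single-line scan, rather than retokenizing from the top of the file.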

Another idea circulating was to enable writing Monarch classifiers and contributing them from extensions. This would avoid some of the bad design choices of TM, but would still mean evaluating regexes written by extensions on the UI thread. Yet another idea was to have a "fast" TM grammar that only deals with strings, comments and regexes, and another, normal one for tokens -- again with the same problem of running extension-authored regexes on the UI thread. Another idea was to build some base parser with components such as C-style comments, C-style strings, etc., which could be exercised by extensions (i.e. some kind of higher-order constructs than regexes). Or maybe we should just hand-write parsers for the top 90% of our languages to detect strings, comments and regexes... We have not yet taken any clear decision, as we still need to experiment in this area to learn more...

2. Accept tokens from the extension host (semantic tokenization)

Moving TM grammars off the UI thread is good for reducing our freezes and crashes, but still does not address the fundamental limitations of TM. Here we need to add API such that semantic tokens can be pushed from the extension host. These tokens should behave very much like text editor decorations, but have a different implementation where we can represent them with a lot less memory (just 2 or 3 32-bit numbers, like we do with the other tokens). We should also tweak the way they are adjusted around typing to make the most sense for tokens...

They also need to be updatable incrementally and only as needed. There are discussions about using the visible-ranges APIs to prioritize the regions which should receive semantic tokens first. We have not yet begun drafting an API or a reference implementation.
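For concreteness, a compact representation along those lines could pack each semantic token into two 32-bit integers. The field layout below is purely illustrative, not a committed design:

```typescript
// Illustrative packing: one semantic token as two 32-bit integers.
// word 0: start column within the line
// word 1: length (16 bits) | token type (8 bits) | modifier bit set (8 bits)
function encodeToken(startColumn: number, length: number,
                     tokenType: number, modifiers: number): [number, number] {
  const meta = ((length & 0xffff) << 16) | ((tokenType & 0xff) << 8) | (modifiers & 0xff);
  return [startColumn, meta >>> 0]; // >>> 0 keeps the value an unsigned 32-bit int
}

function decodeToken(token: [number, number]) {
  const [startColumn, meta] = token;
  return {
    startColumn,
    length: (meta >>> 16) & 0xffff,
    tokenType: (meta >>> 8) & 0xff,
    modifiers: meta & 0xff,
  };
}
```

Compared with one decoration object per token, thousands of such pairs fit in a single typed array, which is what makes the memory cost acceptable.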

3. (low priority) Enable the integration of other tokenization engines

This is just here to remind us to keep in mind that we might want to move away from the inefficient TM grammars completely at one point in the future. There is a lot of love for Tree-Sitter nowadays and we might want to investigate using it, or we might want to roll our own story, since we do actually have a lot of experience in this area...


Tasks

  • write a fast TS classifier of comments, strings, regex (done here with tests here )
  • integrate the fast TS classifier and use it for synchronous classification instead of the TM engine
    • figure out how to manage checkpoints and produce classifications incrementally
  • be able to run TM grammars on a web worker in the rich client
  • be able to run TM grammars on a web worker in the web ui
  • move the TS TM grammar to a web worker and send tokens back in batches.
  • move the TS TM grammar to a web worker and implement greedy viewport tokenization on it.
  • once we have async TM tokens, it is not that big of a leap to have async semantic tokens, so explore pushing tokens from the extension host:
    • should we write another tokens store that is not line based since these tokens should be more "rare"?
    • how do we manage two tokens stores, one owned by TM, and one owned by the extension host, how do we merge them to give a consistent picture to the editor view?
    • having TM running in a web worker works because we own both the UI side and the worker side of things, so we know to remember the last N edits, until they get confirmed by the web worker, and the web worker knows which lines to retokenize given it trusts the UI side to update tokens around editing in certain ways. How should we spec this? We need to spec how the editor adjusts tokens when editing and how we expect that the extension host pushes new tokens in the edited areas...
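To make the last point concrete, here is one possible rule set (hypothetical, not a spec) for how the UI side could provisionally shift stored token ranges on an insert while waiting for fresh tokens from the worker or extension host:

```typescript
// Hypothetical sketch: provisional adjustment of token offsets on insert.
// The editor applies this immediately; the authoritative tokens arrive later.
interface Token { start: number; end: number; }

function adjustForInsert(tokens: Token[], offset: number, insertedLength: number): Token[] {
  return tokens.map(t => {
    if (t.end <= offset) { return t; }                  // token entirely before the edit
    if (t.start >= offset) {                            // token entirely after the edit
      return { start: t.start + insertedLength, end: t.end + insertedLength };
    }
    return { start: t.start, end: t.end + insertedLength }; // edit inside the token: grow it
  });
}
```

The spec work is in agreeing on rules like these, so that the worker can trust the UI side to have applied exactly the same adjustment until confirmed tokens arrive.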
@alexdima alexdima added the plan-item VS Code - planned item for upcoming label Jul 10, 2019
@alexdima alexdima self-assigned this Jul 10, 2019
@alexdima alexdima added this to the July 2019 milestone Jul 10, 2019
@alexdima alexdima modified the milestones: July 2019, August 2019 Jul 29, 2019
@alexdima alexdima changed the title Investigate running TextMate in a separate thread Tokenization overhaul Aug 9, 2019
@paulyoung

It would be great if the proposal to add syntax highlighting to the language server protocol could be considered. I think the latest on that is here: microsoft/vscode-languageserver-node#367

@fwcd
Contributor

fwcd commented Aug 16, 2019

Tree-Sitter support would be amazing! It does have a fast parser and grammars for many popular languages (including syntax-highlighters that map Tree-Sitter AST nodes to TextMate scopes).

@jeff-hykin

jeff-hykin commented Aug 17, 2019

Excellent post @alexandrudima. Having maintained the C++ TM grammar and helped get Tree-sitter working, I can say I've experienced the limitations of the Decorations API and TextMate first-hand.

To add to your point about complex regexes, the C++ grammar has a single pattern (regex + capture groups) that is 20,000 characters long: more characters than the entire Go TM grammar file.

Base Parser Idea

Another idea was to build some base parser, with components such as C-style comments, C-style strings, etc

I'm confident in saying a base parser is most likely impractical.

  • Ruby and Perl have custom-delimited strings like qq() and %w()
  • Markdown, LaTeX, and HTML have their own string-like and code-area customizations
  • YAML has unquoted strings
  • C++ has <> strings for #includes, and unquoted strings for #error and #warning
  • C++ has special string literals, raw strings, and assembly strings
  • there are many variations of interpolation and escapes, especially for shell languages
  • Python has triple quotes
  • and many more edge cases

If there were a functional base parser, it would not be very generic; it would be more akin to a repo of every language combined.

Normal + Streamlined Grammar Idea

TM grammar that only deals with strings, comments and regexes and another normal one for tokens
we still need to experiment in this area to learn more...

I do think this is realistic, and even easy. Although it doesn't fix the underlying problem, it could prevent or reduce many of the severe slowdowns. It is only a matter of removing patterns from an existing grammar. Defaulting to a single synchronous grammar, but allowing a language to specify both a synchronous (semantic) grammar and an asynchronous (highlighting) grammar, could provide a lot of benefits for heavyweight grammars like C++ and TypeScript.

Adding this would allow exploring the move of TM grammars off the UI thread without the full risk. And making it a toggleable feature would let users help debug the non-UI-thread path early on.

[x] write a fast TS classifier of comments, strings, regex (done here with tests here )

Is there any performance data on this, in comparison to the normal TS tmLanguage and/or a comments/regex/strings-only tmLanguage?

Non-Synchronous Solutions

we have numerous features (such as typing a } or typing (, ', ", etc) where we need to know synchronously, at the time we interpret the typed character if we are in a comment, in a string, in a regex,

Is there a way to make those features asynchronous as well? Part 1 makes it seem like those features must stay synchronous. However:

sometimes pressing ' would insert '|' and sometimes only '|
based purely on typing speed and the time it takes to send tokens over from the web worker

this makes it seem like web workers are a non-viable solution. But then the checklist (especially the last item) makes it sound like web workers are a planned and viable solution. Am I missing something?

Tokens from the Extension Host

how do we manage two tokens stores, one owned by TM, and one owned by the extension host, how do we merge them to give a consistent picture to the editor view?

It seems to me that the extension tokens will need a start and an end, and that the extension tokens would then always override/split the TM tokens.

Semantic Representation

TM scopes cannot effectively express basic concepts; a TM token cannot be both a constant and a function, or both a variable and a function, or both a class and an object. Whatever parser is used, whether it is a language server extension or Tree-sitter, it would be very unfortunate if its semantic information were crippled by having to use TM scopes to communicate meaning.

If grammar scopes and extension semantic-token information are known statically, it seems there could be a way to generate an efficient function (similar to the trie used currently) that accepts a scope + semantic information as input and returns an enumerated style as a 32-bit output. If the semantic information is needed on the token for uses other than colour, the semantic info could be made immutable, placed in a set, and a 32-bit enumeration stored on the token itself. This would introduce one extra step in looking up semantic information. Storing semantic information directly on the token would mean limiting the amount/kind of semantic information; there would need to be a universal standard.
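One hypothetical shape for that extra indirection: intern each distinct scopes-plus-labels combination once, and store only the resulting small integer on the token (all names below are invented for illustration):

```typescript
// Hypothetical sketch: tokens carry a small integer id; the id indexes an
// immutable table of (scopes, semantic labels), so a token can be both a
// 'variable' and a 'function' without TM scopes having to express that.
class MetadataInterner {
  private ids = new Map<string, number>();
  private table: { scopes: string[]; labels: string[] }[] = [];

  intern(scopes: string[], labels: string[]): number {
    // Sort labels so the same combination always produces the same key.
    const key = scopes.join(' ') + '|' + labels.slice().sort().join(' ');
    let id = this.ids.get(key);
    if (id === undefined) {
      id = this.table.length;
      this.ids.set(key, id);
      this.table.push({ scopes, labels });
    }
    return id;
  }

  lookup(id: number) { return this.table[id]; }
}
```

In practice the number of distinct combinations in a file is small, so the id fits comfortably in the 32-bit token metadata while the full semantic information stays retrievable.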

@alexzielenski

alexzielenski commented Aug 17, 2019

Having recently crafted a language server using text decorations for semantic highlighting as well, I'm in agreement with @jeff-hykin that the most practical and not too difficult solution would be the "Normal + Streamlined Grammar Idea".

I've found that writing a very simple and barebones TM grammar for common/context-insensitive tokens like keywords and comments works very well. I then override the default highlight where necessary with semantic coloring using text decorations.

Thus, a standard API for an extension to identify ranges of text by a token id number, which the extension could also statically map to either tmScopes or a custom tokenColorCustomization (for extensions which might like to provide rainbow colouring or some other new innovation I can't imagine), would be a more than sufficient solution for this task.

This has the added benefit of allowing tmThemes to take part in this semantic colorization via the familiar tmScope concept. tmScopes are not perfect, however, and it would be nice to attach more information to them beyond what's possible with the "dot"-separated identifier.

@jeff-hykin

jeff-hykin commented Aug 17, 2019

API for an extension to identify ranges of text as a token id number,

@alexzielenski When you say token id, are you talking about the tokens from this? For example:

{ startIndex:  8, scopes: ['source.js','meta.function.js'] },

Which gets converted into:

                 // bbbbbbbbb fffffffff FFF TTT LLLLLLLL
[ 8, 16793623 ]  // 000000010 000000001 000 000 00010111



would be a more than sufficient solution for this task.

There likely needs to be a way to feed text changes/updates to the extension too. Telling it what line and what character(s) changed, and allowing the extension to update its previously provided tokens.



map these token id numbers to either tmScopes or a custom tokenColorCustomization

I'd highly recommend side-stepping TM scopes and instead allowing the extensions to add labels to tokens. A TM token can't be both a variable and a function, even if semantically thats what the token is. It would make for bad habits right out of the bat to have extensions be unable to accurately inform VS Code about the token.

@alexzielenski

@jeff-hykin When I said token id i just meant an arbitrary number which the extension could assign meaning to. So, the extension could provide:

{ startIndex: 8, endIndex: 12, tokenId: 5 }

And also specify a map to the api:

registerTokenTypes(tid => {
    switch (tid) {
        case 5: return "entity.name.function" // or a TextDecoration
    } 
})

Regarding incremental text updates: I am already making use of an API provided by vscode to its language server to receive text deltas. I'm unsure if there exists an equivalent API for client-side vanilla vscode extensions, though.

I'd highly recommend side-stepping TM scopes and instead allowing the extensions to add labels to tokens.

I'm especially all for this - to allow the language-server/extension to assign arbitrary labels to a given token.

@jeff-hykin

jeff-hykin commented Aug 17, 2019

the extension could provide:
{ startIndex: 8, endIndex: 12, tokenId: 5 }
And also specify a map to the api:

Ah I see, that's a great idea. That would greatly reduce the amount of data needed for communication. If there were a static way to provide such a mapping, I bet it could be combined with the trie generated from the theme to make for a direct way to go from tokenId: 5 to style data.

I am already making use of an API provided by vscode to its language server to receive text deltas.

Thanks for correcting me. I guess a functioning token API really is the only thing needed to get this working on a language server.

Since there is already a text-delta API, I guess it would be best to figure out how to port it to normal extensions (or it may already be available; I'm not sure either).

@mattacosta
Contributor

mattacosta commented Aug 20, 2019

For comparison, I benchmarked a few engines:

Tokenization performance (PHP)
First tokenization run only. Does not include V8 optimizations from subsequent runs (like after editing a file).

File: PHPMailer.php
Average of 10 iterations. All times in milliseconds (lower is better).

                WASM     JS      TreeSitter   TextMate
  Average       24.95    38.73   78.24        795.75
  Min           23.58    36.48   75.81        789.64
  Max           26.84    41.59   80.21        809.50
  Standard Dev   1.01     1.57    1.39          6.07
  Tokens        25921    25921   17559        30607

There are a number of implementation details to consider, but even the slowest non-TM engine is ~10x faster. Instead of creating a fast tokenizer for each language, I think it would be more efficient to create an API for other engines and immediately get the benefit of whatever is available. This doesn't even factor in that the non-TM implementations all support incremental changes as well.

@AnyhowStep

Will this overhaul fix syntax highlighting for multiline generic function/method calls?

img

It regularly happens to me where I try to split a generic function/method call to multiple lines and the syntax highlighter just... Freaks out.

It highlights stuff in white, blue, green, and other exotic colors. The colors chosen seem random and differ from case to case.

@jeff-hykin

jeff-hykin commented Aug 22, 2019

@AnyhowStep You might want to report that to the TypeScript grammar repo; I think that's solvable with the current tools.

To answer your question though: Tree-sitter would likely not have that issue. This thread is currently solving a prerequisite problem: Tree-sitter (and other parsers) don't have an efficient, real-time way to tell VS Code about the structure of the code. We need a way to identify tokens, a way to store information about each token, a way for extensions to incrementally update those tokens, a way for themes to assign styles to tokens, and likely a way to do all of those asynchronously to prevent freezing.

@AnyhowStep

AnyhowStep commented Aug 23, 2019

Ah. I didn't even know where to go to report it. I looked it up and it looks like others have reported it and the issues have been marked as "won't fix" and "design limitation".

microsoft/TypeScript-TmLanguage#767
microsoft/TypeScript-TmLanguage#745
microsoft/TypeScript-TmLanguage#479

microsoft/TypeScript-TmLanguage#475 (comment)

This is by design as the grammar lookup doesn't allow multiple lines at a time.

So, not solvable with current tools =(

@jeff-hykin

jeff-hykin commented Aug 23, 2019

Ah. I didn't even know where to go to report it.

@AnyhowStep, yeah it's definitely not obvious where to go. Mentioning it and being redirected is a valid way to find out.

Thanks for including the issue links. I understand the details of the problem now: the fundamental problem isn't perfectly solvable with TM, and this situation is exactly what Tree-sitter is designed to fix, so this issue is (very indirectly) working on that problem. However, having written the syntax for the very similar case of C++ typed function calls, I can tell you the TM grammar is capable of doing a better job and could prevent the cascade of broken colors. The last issue you link can be solved without making any assumptions by fixing the JSX pattern (as the reporter mentions). The first two issues, which are less serious, can be imperfectly fixed with a convenient assumption. I might open an issue there and see what can be done.

@DanTup
Contributor

DanTup commented Aug 26, 2019

This is by design as the grammar lookup doesn't allow multiple lines at a time.

Multiline comments and strings work OK; is this different from those?

(though this is probably a bit off-topic for here)

@segevfiner
Contributor

Had this one with Go: #76073. The TM grammar for it was sourced from the Atom extension, which is no longer willing to maintain it, since they have moved on to implementing syntax highlighting using Tree-sitter, that is, semantic colouring. So I guess this is just one more syntax-highlighting issue that will be resolved if VS Code (probably the Go extension) adopts semantic colouring for Go.

@alexdima alexdima modified the milestones: January 2020, February 2020 Jan 27, 2020
@alexdima alexdima modified the milestones: February 2020, March 2020 Feb 26, 2020
@alexdima alexdima added feature-request Request for new features or functionality and removed plan-item VS Code - planned item for upcoming labels Mar 11, 2020
@alexdima alexdima modified the milestones: March 2020, On Deck Mar 30, 2020
@jasonwilliams
Contributor

Hey @alexdima
Is there still ambition for the overhaul or was it deemed not possible?
Can the synchronous assumptions be solved?

@ebkgne

ebkgne commented Jul 23, 2021

Hello, I've read in thread #64681 that this would solve our problem -- is that the case? And is this issue being worked on, or is it idle at the moment? Thank you.
