
On demand indexing #1260

Merged
merged 38 commits into from
Apr 8, 2022

Conversation

robertoaloi
Member

@robertoaloi robertoaloi commented Apr 1, 2022

Historically, Erlang LS performed a full indexing of the workspace at startup. While this approach works reasonably well for a typical, small Erlang project, it is problematic for mono-repos, where the initial indexing can take several minutes. During that time, operations such as find references or rename may return unexpected or incorrect results.
Furthermore, keeping the full index in memory makes the whole server less responsive, especially on machines with limited RAM, and slows down operations.

This PR switches the indexing approach to an incremental one, which is more similar to what modern editors (most notably IntelliJ) tend to do.

During startup, Erlang LS still performs an initial scan of all files belonging to the workspace but, instead of fully indexing each file, it only loads into memory the minimum amount of information required to satisfy requests from clients.

The Points of Interest (POIs) and other information are calculated on demand, every time they are requested. Once calculated, the POIs are cached in memory, so that subsequent requests are faster.
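The parse-on-first-request idea can be sketched like this (a simplified illustration: `els_parser:parse/1` and `els_dt_document:insert/1` are real modules in the code base, but the module name, function name, and exact flow here are assumptions for exposition):

```erlang
-module(lazy_pois_sketch).
-export([pois/1]).

%% Serve cached POIs if this document was already parsed...
pois(#{pois := POIs}) when POIs =/= undefined ->
    POIs;
%% ...otherwise parse on first request and cache the result,
%% so subsequent requests hit the in-memory copy.
pois(#{text := Text} = Document) ->
    {ok, POIs} = els_parser:parse(Text),
    ok = els_dt_document:insert(Document#{pois => POIs}),
    POIs.
```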

To provide answers to requests that rely on global information (most notably getReferences), we adopt the so-called IDE trick (to use the wording from @michalmuskala, who introduced the trick to me): first we do a quick scan of the code base to find potential candidates, then we index those files and run a more accurate analysis.
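The two-phase trick can be sketched as follows. `els_dt_document:find_candidates/1` and `els_indexing:ensure_deeply_indexed/1` appear in the PR's diff; the final lookup call is a placeholder, not the actual API:

```erlang
%% Phase 1: cheap textual scan over all in-memory documents.
%% Phase 2: deeply index only the candidates, then run the
%% accurate, POI-based reference lookup on them.
references(Pattern) ->
    Candidates = els_dt_document:find_candidates(Pattern),
    [ok = els_indexing:ensure_deeply_indexed(Uri) || Uri <- Candidates],
    lookup_references(Pattern).  %% placeholder for the POI-based lookup
```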

This PR reduces both the initial indexing time and the memory footprint of Erlang LS significantly.

The process used to buffer incremental text edits was also rewritten as a gen_server, and a few race conditions around text synchronization were identified and fixed, which should make the language server more robust.

@robertoaloi
Member Author

Also, please try out this branch on your code bases if you get a chance and let me know if anything breaks!

@plux
Contributor

plux commented Apr 4, 2022

Seems interesting.
I think it would be appropriate to add a config to toggle this behavior.

@robertoaloi
Member Author

Seems interesting. I think it would be appropriate to add a config to toggle this behavior.

Fair point. My only concern is that we would effectively end up with two versions of the indexing process, which could make things harder to maintain (and different users may experience subtle bugs when using one or the other). Having a single way of doing things would be much simpler. On the other hand, a toggle would be a good way to try this out until it proves to be a valid approach.

@@ -145,18 +145,21 @@ new(Uri, Text) ->

-spec new(uri(), binary(), atom(), kind()) -> item().
new(Uri, Text, Id, Kind) ->
{ok, POIs} = els_parser:parse(Text),
MD5 = erlang:md5(Text),
#{ uri => Uri
, id => Id
, kind => Kind
, text => Text
Contributor

Is it necessary to store the Text if the POIs are not (yet) stored? If Text is still stored, it could be used to grep the in-memory binary. If it's not stored, it could spare some memory.

Member Author

Ideally, it would be good to perform the in-memory grep, but it may take some time before we get there. We should check how the Text is used and where. There's some fairly complex logic around apply-edit and the index buffer which we may want to simplify...

Member Author

Switched to an in-memory search (which can definitely be optimized) instead of the grep.

@gomoripeti
Contributor

Is Mode = shallow still necessary if there will be an "ondemand" mode? They might be orthogonal.

@robertoaloi
Member Author

Is Mode = shallow still necessary if there will be an "ondemand" mode? They might be orthogonal.

Probably not. In the latest version I'm working on all indexing is "shallow".

@@ -201,3 +205,14 @@ wrapping_functions(Document, Line, Column) ->
wrapping_functions(Document, Range) ->
#{start := #{character := Character, line := Line}} = Range,
wrapping_functions(Document, Line, Character).

-spec find_candidates(binary()) -> [uri()].
find_candidates(Pattern) ->
Member Author

Let's just say there's room for improvement here...

Contributor

great start, though :)

Member Author

At least filtering OTP modules now.

, type_application
, import_entry
]),
[register_reference(Uri, POI) || POI <- POIs]
Contributor

Do I understand correctly that every time I look up references of an entity, it will call find_by and try to insert a lot of references into the ref_db (the ets entries might be there already, so some of the inserts might be no-ops)? Why do we store the references in ets then? Can't we run the matching on the POIs list in place?
Alternatively, can we mark Uris which have already registered their references, so we can skip lookup_pois + register_references for them?

Member Author

In my latest version (soon to be pushed), we only do this when expanding the POIs and enforcing a full indexing.

Member Author

My major concern here is that, depending on the pattern, we may end up indexing a lot of files. We need to find a good strategy to reduce the number of candidates significantly.

@@ -253,11 +199,11 @@ index_dir(Dir, Mode, SkipGeneratedFiles, GeneratedFilesTag) ->

-spec entries_apps() -> [{string(), 'deep' | 'shallow'}].
entries_apps() ->
[{Dir, 'deep'} || Dir <- els_config:get(apps_paths)].
[{Dir, 'shallow'} || Dir <- els_config:get(apps_paths)].
Contributor

👍

@@ -110,7 +110,7 @@ find_header(Id) ->
{ok, Uri};
[] ->
FileName = atom_to_list(Id) ++ ".hrl",
els_indexing:find_and_index_file(FileName)
els_indexing:find_and_deeply_index_file(FileName)
Member Author

I'm trying to make it more explicit whether indexing is deep or shallow. Until now it was very hard to follow, since we were relying on default parameters.

Contributor

How does indexing of header files work? Are their items included in the modules that include the header?

How would the following situation behave:

% foo.hrl
-type header_type() :: any().

% foo.erl
-module(foo).
-include("foo.hrl").
-export_type([header_type/0]).


% bar.erl
-type x() :: foo:header_type().

Would this be properly resolved?

@@ -43,17 +44,20 @@
%%==============================================================================
-type id() :: atom().
-type kind() :: module | header | other.
-type source() :: otp | app | dep.
Member Author

We were not storing references for OTP modules during the initial indexing.
Now that indexing happens on demand, we need a way to distinguish OTP modules from app/dep modules when reading a document.


%%==============================================================================
%% Item Definition
%%==============================================================================

-record(els_dt_document, { uri :: uri() | '_'
-record(els_dt_document, { uri :: uri() | '_' | '$1'
Member Author

The fact that Dialyzer complains about this makes me sad...

@@ -201,3 +205,14 @@ wrapping_functions(Document, Line, Column) ->
wrapping_functions(Document, Range) ->
#{start := #{character := Character, line := Line}} = Range,
wrapping_functions(Document, Line, Character).

-spec find_candidates(binary()) -> [uri()].
find_candidates(Pattern) ->
Member Author

At least filtering OTP modules now.

ok = els_dt_document_index:insert(ModuleItem),
index_signatures(Document),
index_references(Document, Mode, Reset).
-spec deep_index(els_dt_document:item()) -> ok.
Member Author

Trying to move all indexing logic into one place (the fact that we have different tables for signatures or references should be considered an implementation detail).

"[filename=~s, uri=~s] "
"~p:~p:~p", [FullName, Uri, Type, Reason, St]),
{error, {Type, Reason}}
?LOG_DEBUG("Shallow indexing file. [filename=~s] [uri=~s]",
Member Author

Given there's no parsing happening here, removing the try.

, [{{'$1', '$3'}}]}],
All = ets:select(name(), MS),
Fun = fun({Uri, Text}) ->
case binary:matches(Text, Pattern) of
Member Author

A very common/short name could easily result in too many candidates. How can we protect against that?

Contributor

The main issue I see is that if we are looking for references to something like

to() -> paris.

this matches all kinds of text and atoms, since there is no tokenisation going on:

"string containing to"
atom_containing_to

I think doing some cheap tokenisation would cut down false positives significantly for these cases.
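A minimal sketch of the cheap tokenisation idea: split the text on non-identifier characters so that `to` matches only whole tokens, cutting false positives like `atom_containing_to` (a literal word inside a string would still match, since strings are split too). The module name, function name, and exact separator set are assumptions:

```erlang
-module(tokenise_sketch).
-export([contains_word/2]).

%% True if Word occurs as a whole token in Text, not as a substring
%% of a longer identifier.
-spec contains_word(binary(), binary()) -> boolean().
contains_word(Text, Word) ->
    Tokens = binary:split(Text, separators(), [global, trim_all]),
    lists:member(Word, Tokens).

%% Characters that cannot be part of an unquoted atom or variable.
separators() ->
    [<<C>> || C <- " \t\n\r()[]{},;:+*/=<>|#\"'"].
```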

-spec find_candidates(binary()) -> [uri()].
find_candidates(Pattern) ->
All = ets:tab2list(name()),
Fun = fun(#els_dt_document{uri = Uri, text = Text}) ->
Contributor

Does that mean we read all files of the project into memory? What impact does this have on memory and indexing time?

How does binary:match compare to grep performance-wise? (Seems it should be quicker, since the file IO is already done.)

Member Author

That's unchanged. The loading time is minimal and not a concern (I will try to get some actual numbers).
At this point, the match performance is not an issue; the part that takes time is the indexing of a potentially big number of candidates (e.g. if you search for callers of a function called "get").

Member Author

By implementing support for WorkDoneProgress we can at least report progress to the user, hopefully show intermediate results, and allow cancelling a specific request.

undefined ->
ok;
_ ->
ok = els_buffer_server:flush(Buffer)
Contributor

@TheGeorge TheGeorge Apr 6, 2022

I wonder if we cannot be even more clever here and make flushing delayed. Looking at the LSP spec, I see that we can return partial completion results. So what we can do to increase responsiveness is this:

  • if there are no pending changes, just do what we do now
  • if there are pending changes, trigger flushing and return completion results based on current knowledge
  • when indexing is complete, return full completion results

Member Author

Partial results are not yet implemented, I will add this as a follow up.

@@ -44,8 +44,18 @@ handle_request({completion, Params}, State) ->
}
, <<"textDocument">> := #{<<"uri">> := Uri}
} = Params,
ok = els_index_buffer:flush(Uri),
Member Author

A bit of context may help here.

Completion requests typically come at the same time as edits. With the current implementation of els_index_buffer it could happen that, during a completion, edits are not yet processed (there's some basic protection against this, but things can still fail).

The completion provider assumes a specific line length when reading the text after the flush, but it's not guaranteed that all edits have been applied. Fast typists can see the provider (and eventually the whole server) crash.

With the new approach, the idea is to block callers of the flush for a certain amount of time (currently 200ms); if edits keep coming, the countdown restarts.

This should also prevent indexing from happening multiple times.
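The debouncing behaviour described here can be sketched as a small gen_server (this is an illustrative sketch, not the actual els_buffer_server: module name, message names, and the placeholder apply_edits/1 are all assumptions; only the 200ms figure comes from the discussion):

```erlang
-module(debounce_sketch).
-behaviour(gen_server).
-export([start_link/0, apply_edit/2, flush/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(DEBOUNCE, 200).  %% ms of quiet time before flushing

start_link() -> gen_server:start_link(?MODULE, [], []).
apply_edit(Pid, Edit) -> gen_server:cast(Pid, {edit, Edit}).
flush(Pid) -> gen_server:call(Pid, flush, infinity).

init([]) ->
    {ok, #{edits => [], waiting => [], timer => undefined}}.

handle_call(flush, From, #{waiting := W} = State) ->
    %% Don't reply yet: the caller stays blocked until the timer fires.
    {noreply, restart_timer(State#{waiting => [From | W]})}.

handle_cast({edit, Edit}, #{edits := E} = State) ->
    %% A new edit arrived: buffer it and restart the countdown.
    {noreply, restart_timer(State#{edits => [Edit | E]})}.

handle_info(timeout_flush, #{edits := E, waiting := W} = State) ->
    apply_edits(lists:reverse(E)),  %% e.g. apply edits and re-index once
    [gen_server:reply(From, ok) || From <- W],
    {noreply, State#{edits => [], waiting => [], timer => undefined}}.

restart_timer(#{timer := T} = State) ->
    _ = T =/= undefined andalso erlang:cancel_timer(T),
    State#{timer => erlang:send_after(?DEBOUNCE, self(), timeout_flush)}.

apply_edits(_Edits) -> ok.  %% placeholder for the real work
```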

Contributor

Hmm... I'm not sure I fully understand how the processing of everything happens.

Can you confirm that didChange notifications are processed asynchronously, i.e. there's no guarantee that a later request will observe an earlier change? It seems like this one thing introduces a lot of conceptual complexity in how everything fits together.

(_, Words) ->
Words
end,
lists:foldl(Fun, sets:new(), Tokens);
Contributor

@michalmuskala michalmuskala Apr 7, 2022

sets in general is pretty slow and uses a lot of memory (it's frequently actually worse than a list with linear search), unless you use sets:new([{version, 2}]) for the map-based representation.

That said, in general it seems like shipping this data back and forth from ets might be really expensive.

An alternative would be to use a plain map to store the data: #{atom() | string() => true}

And query it with a match spec like:

  MS = [{#els_dt_document{ uri = '$1'
                         , source = '$2'
                         , buffer = '_'
                         , kind = '_'
                         , text = '_'
                         , md5 = '_'
                         , pois = '_'
                         , words = '$3'
                         }
        , [{'=/=', '$2', otp}, {'is_map_key', Search, '$3'}]
        , ['$1']}],

This way there's no need to copy the data out of ETS

Member Author

Version 2 is only supported from OTP 24, right? Ideally, Erlang LS shouldn't care about the OTP version used by a user's project, but for the time being we have some assumptions in the code base that force us to be backward compatible. So I will need to add it as an ifdef.
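One way to keep the faster map-based sets representation on OTP 24+ while staying backward compatible is a conditional wrapper (a sketch: the module and function names are assumptions; `?OTP_RELEASE` is predefined since OTP 21):

```erlang
-module(compat_sets_sketch).
-export([new_set/0]).

-if(?OTP_RELEASE >= 24).
new_set() -> sets:new([{version, 2}]).  %% map-based representation
-else.
new_set() -> sets:new().                %% legacy representation
-endif.
```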

Member Author

Thanks for this! I will leave this as a follow up, so that we can actually measure the eventual improvement.

@robertoaloi
Member Author

Seems interesting. I think it would be appropriate to add a config to toggle this behavior.

@plux I attempted to make this configurable, but then realized that a bunch of optimizations have been dropped as part of the change (e.g. the delete_by_uri). Making the whole feature configurable would mean reintroducing them for the non-incremental indexing (they don't really make sense with the new setup), which makes the code quite hairy. I am leaning towards just switching the behaviour and holding off a new release until this proves solid enough. That would allow us to dogfood the new code base. And from the first experiments, it looks promising.

@robertoaloi robertoaloi changed the title WIP - On demand indexing On demand indexing Apr 8, 2022
@robertoaloi robertoaloi marked this pull request as ready for review April 8, 2022 07:22
Contributor

@michalmuskala michalmuskala left a comment

Some comments from me.

Some of them concern the feature itself; others are to clarify the behaviour of ELS, since I'm not fully familiar with it.

apps/els_lsp/priv/code_navigation/src/purge_references.erl (outdated)
%%==============================================================================
%% Callbacks for the gen_server behaviour
%%==============================================================================
-behaviour(gen_server).
Contributor

Could you explain a bit why we need a buffer server in the first place, rather than keeping the text directly and using plain functions for processing edits?

Member Author

The buffer was introduced as part of support for incremental text sync and I just ported it to a gen_server, so it's not new with this feature.

The main intention was to buffer didChange events, so that re-indexing would be retriggered not too often (a didChange event is sent at every keystroke). But I agree that it complicates things and probably introduces some races.


@@ -150,13 +150,13 @@ lookup_document(Uri) ->
{ok, Document};
{ok, []} ->
Path = els_uri:path(Uri),
{ok, Uri} = els_indexing:index_file(Path),
{ok, Uri} = els_indexing:shallow_index(Path, app),
Contributor

How do we know this module is part of the application? Could the lookup request come for any document?

Member Author

The Source (app) is only used to decide whether to index references or not. In principle, we should be able to detect from the Uri whether it refers to an app, otp or dep. In practice, we just store references even for otp modules, which shouldn't have a significant effect for the user. But it's a fair question, so I will add a comment to eventually follow up.

-spec delete_by_uri(uri()) -> ok.
delete_by_uri(Uri) ->
case filename:extension(Uri) of
<<".erl">> ->
Contributor

Hmm... It looks like we also index headers. Should there be some processing for them in here as well?

Member Author

Signatures are currently indexed by MFA and the respective info is indexed while the module itself is indexed (taking into account the chain of header files). The reason the check is here is that we also process headers for indexing, and that caused an issue: it was deleting all, say, file specs when indexing file.hrl, which is of course incorrect. Indexing of headers should be reconsidered, but I would keep that out of scope for this PR.

do_index(Document, Mode, LookupResult =/= [])
-spec ensure_deeply_indexed(uri()) -> ok.
ensure_deeply_indexed(Uri) ->
{ok, #{pois := POIs} = Document} = els_utils:lookup_document(Uri),
Contributor

In general it seems to me that these lookups happen very often and they seem rather expensive, given they copy a lot of data out of ets.

I guess one way would be to look up the document at the beginning of the request and then pass the document around, rather than just the Uri, though I understand this would be a huge effort to change.

Member Author

That sounds like a great optimization that can be implemented as a follow up.

deep_index(Document) ->
#{id := Id, uri := Uri, text := Text, source := Source} = Document,
{ok, POIs} = els_parser:parse(Text),
ok = els_dt_document:insert(Document#{pois => POIs}),
Contributor

Could there be some concurrent modification to Document that would mean this overrides the data?

In particular, could there be a concurrent didOpen request that adds a buffer pid, but this would erase it?

Member Author

Yes, that could happen (even if it would require two didOpen requests for the same file, which is not very likely). That said, I'm considering removing the buffer altogether and finding an alternative solution for avoiding multiple re-index operations.

extract_pattern({include, Id}) ->
include_id(Id);
extract_pattern({include_lib, Id}) ->
include_id(Id);
Contributor

Hmm... It looks like in the word splitting we save full strings and then compare for equality, but here we only extract the rootname.

I think this means that include matching would never actually match, since we have indexed strings like "foo.hrl" but we're looking here for "foo". Similarly, for include_lib we'd index "app/include/foo.hrl".

Member Author

That is correct, need to fix.

Member Author

Actually, we don't even need to index strings, with the exception of includes, so I may apply the same transformation on indexing.
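Applying the same transformation on both sides could look like this sketch, where both "foo.hrl" and "app/include/foo.hrl" normalise to the token <<"foo">> at indexing time and at lookup time (include_id/1 appears in the diff above; this body is illustrative, not the actual implementation):

```erlang
include_id(Id) ->
    %% Strip the directory part and the extension, keep only the root name.
    unicode:characters_to_binary(filename:rootname(filename:basename(Id))).
```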

Member Author

Fixed.

@robertoaloi
Member Author

@michalmuskala For some reason I cannot reply inline to this comment:

#1260 (comment)

That should be handled correctly, but it would be great to encode that in a test case.

Contributor

@michalmuskala michalmuskala left a comment

With the clarifying comments and fixes, this looks good to go for me.

@robertoaloi robertoaloi merged commit a9727c9 into main Apr 8, 2022
@robertoaloi robertoaloi deleted the store-pois-on-demand branch April 8, 2022 12:57