
On demand indexing #1260

Merged
merged 38 commits into from
Apr 8, 2022

Conversation

robertoaloi
Member

@robertoaloi robertoaloi commented Apr 1, 2022

Historically, Erlang LS performed a full indexing of the workspace at startup. While this approach works reasonably well for a typical, small Erlang project, it is problematic for mono-repos, where the initial indexing can take several minutes. During that time, operations such as find references or rename may return unexpected or incorrect results.
Furthermore, keeping the full index in memory makes the whole server less responsive, especially on machines with limited RAM, and slows down operations.

This PR switches the indexing approach to an incremental one, which is more similar to what modern editors (most notably IntelliJ) tend to do.

During startup, Erlang LS still performs an initial scan of all files belonging to the workspace but, instead of fully indexing each file, it only loads into memory the minimum amount of information required to satisfy requests from clients.

The Points of Interest (POIs) and other information are calculated on demand, every time they are requested. Once calculated, the POIs are cached in memory, so that subsequent requests are faster.
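The parse-on-first-request idea can be sketched like this (a simplified illustration: `els_parser:parse/1` and `els_dt_document:insert/1` are real modules in the code base, but the module name, function name, and exact flow here are assumptions for exposition):

```erlang
-module(lazy_pois_sketch).
-export([pois/1]).

%% Serve cached POIs if this document was already parsed...
pois(#{pois := POIs}) when POIs =/= undefined ->
    POIs;
%% ...otherwise parse on first request and cache the result,
%% so subsequent requests hit the in-memory copy.
pois(#{text := Text} = Document) ->
    {ok, POIs} = els_parser:parse(Text),
    ok = els_dt_document:insert(Document#{pois => POIs}),
    POIs.
```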

To provide answers to requests that rely on global information (most notably getReferences), we adopt the so-called IDE trick (to use the wording from @michalmuskala, who introduced the trick to me): first we do a quick scan of the code base to find potential candidates, then we index those files and run a more accurate analysis.
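The two-phase trick can be sketched as follows. `els_dt_document:find_candidates/1` and `els_indexing:ensure_deeply_indexed/1` appear in the PR's diff; the final lookup call is a placeholder, not the actual API:

```erlang
%% Phase 1: cheap textual scan over all in-memory documents.
%% Phase 2: deeply index only the candidates, then run the
%% accurate, POI-based reference lookup on them.
references(Pattern) ->
    Candidates = els_dt_document:find_candidates(Pattern),
    [ok = els_indexing:ensure_deeply_indexed(Uri) || Uri <- Candidates],
    lookup_references(Pattern).  %% placeholder for the POI-based lookup
```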

This PR reduces both the initial indexing time and the memory footprint of Erlang LS significantly.

The process used to buffer incremental text edits was also rewritten as a gen_server, and a few race conditions around text synchronization were identified and fixed, which should make the language server more robust.

@robertoaloi
Member Author

Also, please try out this branch on your code bases if you get a chance and let me know if anything breaks!

@plux
Contributor

plux commented Apr 4, 2022

Seems interesting.
I think it would be appropriate to add a config to toggle this behavior.

@robertoaloi
Member Author

Seems interesting. I think it would be appropriate to add a config to toggle this behavior.

Fair point. My only concern is that we would effectively end up with two versions of the indexing process, which could make things harder to maintain (and different users may experience subtle bugs when using one or the other). Having a single way of doing things would be much simpler. On the other hand, a toggle would be a good way to try this out until it proves to be a valid approach.

@@ -145,18 +145,21 @@ new(Uri, Text) ->

-spec new(uri(), binary(), atom(), kind()) -> item().
new(Uri, Text, Id, Kind) ->
{ok, POIs} = els_parser:parse(Text),
MD5 = erlang:md5(Text),
#{ uri => Uri
, id => Id
, kind => Kind
, text => Text
Contributor

Is it necessary to store the Text if the POIs are not (yet) stored? If Text is still stored, it could be used to grep the in-memory binary. If it's not stored, it could spare some memory.

Member Author

Ideally, it would be good to perform the in-memory grep, but it may take some time before we get there. We should check how the Text is used and where. There's some fairly complex logic around apply-edit and the index buffer which we may want to simplify...

Member Author

Switched to an in-memory search (which can definitely be optimized) instead of the grep.

@gomoripeti
Contributor

Is Mode = shallow still necessary if there will be an "ondemand" mode? They might be orthogonal.

@robertoaloi
Member Author

Is Mode = shallow still necessary if there will be an "ondemand" mode? They might be orthogonal.

Probably not. In the latest version I'm working on all indexing is "shallow".

@@ -201,3 +205,14 @@ wrapping_functions(Document, Line, Column) ->
wrapping_functions(Document, Range) ->
#{start := #{character := Character, line := Line}} = Range,
wrapping_functions(Document, Line, Character).

-spec find_candidates(binary()) -> [uri()].
find_candidates(Pattern) ->
Member Author

Let's just say there's room for improvement here...

Contributor

great start, though :)

Member Author

At least filtering OTP modules now.

, type_application
, import_entry
]),
[register_reference(Uri, POI) || POI <- POIs]
Contributor

Do I understand correctly that every time I look up references of an entity, it will call find_by and try to insert a lot of references into the ref_db (the ets entries might be there already, so some of the inserts might be no-ops)? Why do we store the references in ets then? Can't we run the matching on the POIs list in place?
Alternatively, can we mark Uris which have already registered their references, so we can skip lookup_pois + register_references for them?

Member Author

In my latest version (soon to be pushed), we only do this when expanding the POIs and enforcing a full indexing.

Member Author

My major concern here is that, depending on the pattern, we may end up indexing a lot of files. We need to find a good strategy to reduce the number of candidates significantly.

@@ -253,11 +199,11 @@ index_dir(Dir, Mode, SkipGeneratedFiles, GeneratedFilesTag) ->

-spec entries_apps() -> [{string(), 'deep' | 'shallow'}].
entries_apps() ->
[{Dir, 'deep'} || Dir <- els_config:get(apps_paths)].
[{Dir, 'shallow'} || Dir <- els_config:get(apps_paths)].
Contributor

👍

@@ -110,7 +110,7 @@ find_header(Id) ->
{ok, Uri};
[] ->
FileName = atom_to_list(Id) ++ ".hrl",
els_indexing:find_and_index_file(FileName)
els_indexing:find_and_deeply_index_file(FileName)
Member Author

I'm trying to make it more explicit whether indexing is deep or shallow. Until now it was very hard to follow, since we were relying on default parameters.

Contributor

How does indexing of header files work? Are their items included in the modules that include the header?

How would the following situation behave:

% foo.hrl
-type header_type() :: any().

% foo.erl
-module(foo).
-include("foo.hrl").
-export_type([header_type/0]).


% bar.erl
-type x() :: foo:header_type().

Would this be properly resolved?

@@ -43,17 +44,20 @@
%%==============================================================================
-type id() :: atom().
-type kind() :: module | header | other.
-type source() :: otp | app | dep.
Member Author

We were not storing references for OTP modules during the initial indexing.
Now that indexing happens on demand, we need a way to distinguish OTP modules from app/dep modules when reading a document.


%%==============================================================================
%% Item Definition
%%==============================================================================

-record(els_dt_document, { uri :: uri() | '_'
-record(els_dt_document, { uri :: uri() | '_' | '$1'
Member Author

The fact that Dialyzer complains about this makes me sad...

@@ -201,3 +205,14 @@ wrapping_functions(Document, Line, Column) ->
wrapping_functions(Document, Range) ->
#{start := #{character := Character, line := Line}} = Range,
wrapping_functions(Document, Line, Character).

-spec find_candidates(binary()) -> [uri()].
find_candidates(Pattern) ->
Member Author

At least filtering OTP modules now.

ok = els_dt_document_index:insert(ModuleItem),
index_signatures(Document),
index_references(Document, Mode, Reset).
-spec deep_index(els_dt_document:item()) -> ok.
Member Author

Trying to move all indexing logic into one place (the fact that we have different tables for signatures or references should be considered an implementation detail).

"[filename=~s, uri=~s] "
"~p:~p:~p", [FullName, Uri, Type, Reason, St]),
{error, {Type, Reason}}
?LOG_DEBUG("Shallow indexing file. [filename=~s] [uri=~s]",
Member Author

Given there's no parsing happening here, removing the try.

, [{{'$1', '$3'}}]}],
All = ets:select(name(), MS),
Fun = fun({Uri, Text}) ->
case binary:matches(Text, Pattern) of
Member Author

A very common/short name could easily result in too many candidates. How can we protect against that?

Contributor

The main issue I see is that if we are looking for references to something like

to() -> paris.

this matches all kinds of text and atoms, since there is no tokenisation going on:

"string containing to"
atom_containing_to

I think doing some cheap tokenisation would cut down false positives significantly for these cases.
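A minimal sketch of the cheap tokenisation idea: split the text on non-identifier characters so that `to` matches only whole tokens, cutting false positives like `atom_containing_to` (a literal word inside a string would still match, since strings are split too). The module name, function name, and exact separator set are assumptions:

```erlang
-module(tokenise_sketch).
-export([contains_word/2]).

%% True if Word occurs as a whole token in Text, not as a substring
%% of a longer identifier.
-spec contains_word(binary(), binary()) -> boolean().
contains_word(Text, Word) ->
    Tokens = binary:split(Text, separators(), [global, trim_all]),
    lists:member(Word, Tokens).

%% Characters that cannot be part of an unquoted atom or variable.
separators() ->
    [<<C>> || C <- " \t\n\r()[]{},;:+*/=<>|#\"'"].
```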

-spec find_candidates(binary()) -> [uri()].
find_candidates(Pattern) ->
All = ets:tab2list(name()),
Fun = fun(#els_dt_document{uri = Uri, text = Text}) ->
Contributor

Does that mean we read all files of the project into memory? What impact does this have on memory and indexing time?

How does binary:match compare to grep performance-wise? (Seems it should be quicker, since the file IO is already done.)

Member Author

That's unchanged. The loading time is minimal and not a concern (I will try to get some actual numbers).
At this point, the match performance is not an issue; the part that takes time is the indexing of a potentially big number of candidates (e.g. if you search for callers of a function called "get").

Member Author

By implementing support for WorkDoneProgress we can at least report progress to the user, hopefully show intermediate results, and allow cancelling a specific request.

undefined ->
ok;
_ ->
ok = els_buffer_server:flush(Buffer)
Contributor

@TheGeorge TheGeorge Apr 6, 2022

I wonder if we cannot be even more clever here and make flushing delayed. Looking at the LSP spec, I see that we can return partial completion results. So what we can do to increase responsiveness is this:

  • if there are no pending changes, just do what we do now
  • if there are pending changes, trigger flushing and return completion results based on current knowledge
  • when indexing is complete, return full completion results

Member Author

Partial results are not yet implemented, I will add this as a follow up.

@@ -44,8 +44,18 @@ handle_request({completion, Params}, State) ->
}
, <<"textDocument">> := #{<<"uri">> := Uri}
} = Params,
ok = els_index_buffer:flush(Uri),
Member Author

A bit of context may help here.

Completion requests typically come at the same time as edits. With the current implementation of els_index_buffer it could happen that, during a completion, edits are not yet processed (there's some basic protection against this, but things can still fail).

The completion provider assumes a specific line length when reading the text after the flush, but it's not guaranteed that all edits have been applied. Fast typists can see the provider (and eventually the whole server) crash.

With the new approach, the idea is to block callers of the flush for a certain amount of time (currently 200ms); if edits keep coming, the countdown restarts.

This should also prevent indexing from happening multiple times.
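The debouncing behaviour described here can be sketched as a small gen_server (this is an illustrative sketch, not the actual els_buffer_server: module name, message names, and the placeholder apply_edits/1 are all assumptions; only the 200ms figure comes from the discussion):

```erlang
-module(debounce_sketch).
-behaviour(gen_server).
-export([start_link/0, apply_edit/2, flush/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(DEBOUNCE, 200).  %% ms of quiet time before flushing

start_link() -> gen_server:start_link(?MODULE, [], []).
apply_edit(Pid, Edit) -> gen_server:cast(Pid, {edit, Edit}).
flush(Pid) -> gen_server:call(Pid, flush, infinity).

init([]) ->
    {ok, #{edits => [], waiting => [], timer => undefined}}.

handle_call(flush, From, #{waiting := W} = State) ->
    %% Don't reply yet: the caller stays blocked until the timer fires.
    {noreply, restart_timer(State#{waiting => [From | W]})}.

handle_cast({edit, Edit}, #{edits := E} = State) ->
    %% A new edit arrived: buffer it and restart the countdown.
    {noreply, restart_timer(State#{edits => [Edit | E]})}.

handle_info(timeout_flush, #{edits := E, waiting := W} = State) ->
    apply_edits(lists:reverse(E)),  %% e.g. apply edits and re-index once
    [gen_server:reply(From, ok) || From <- W],
    {noreply, State#{edits => [], waiting => [], timer => undefined}}.

restart_timer(#{timer := T} = State) ->
    _ = T =/= undefined andalso erlang:cancel_timer(T),
    State#{timer => erlang:send_after(?DEBOUNCE, self(), timeout_flush)}.

apply_edits(_Edits) -> ok.  %% placeholder for the real work
```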

Contributor

Hmm... I'm not sure I fully understand how the processing of everything happens.

Can you confirm that didChange notifications are processed asynchronously, i.e. there's no guarantee that a later request will observe an earlier change? It seems like this one thing introduces a lot of conceptual complexity in how everything fits together.

(_, Words) ->
Words
end,
lists:foldl(Fun, sets:new(), Tokens);
Contributor

@michalmuskala michalmuskala Apr 7, 2022

sets in general is pretty slow and uses a lot of memory (it's frequently actually worse than a list with linear search), unless you use sets:new([{version, 2}]) for the map-based representation.

That said, in general it seems like shipping this data back and forth from ets might be really expensive.

An alternative would be to use a plain map to store the data: #{atom() | string() => true}

And query it with a match spec like:

  MS = [{#els_dt_document{ uri = '$1'
                         , source = '$2'
                         , buffer = '_'
                         , kind = '_'
                         , text = '_'
                         , md5 = '_'
                         , pois = '_'
                         , words = '$3'
                         }
        , [{'=/=', '$2', otp}, {'is_map_key', Search, '$3'}]
        , ['$1']}],

This way there's no need to copy the data out of ETS

Member Author

Version 2 is only supported from OTP 24, right? Ideally, Erlang LS shouldn't care about the OTP version used by a user's project, but for the time being we have some assumptions in the code base that force us to be backward compatible. So I will need to add it as an ifdef.
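One way to keep the faster map-based sets representation on OTP 24+ while staying backward compatible is a conditional wrapper (a sketch: the module and function names are assumptions; `?OTP_RELEASE` is predefined since OTP 21):

```erlang
-module(compat_sets_sketch).
-export([new_set/0]).

-if(?OTP_RELEASE >= 24).
new_set() -> sets:new([{version, 2}]).  %% map-based representation
-else.
new_set() -> sets:new().                %% legacy representation
-endif.
```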

Member Author

Thanks for this! I will leave this as a follow up, so that we can actually measure the eventual improvement.

@robertoaloi
Member Author

Seems interesting. I think it would be appropriate to add a config to toggle this behavior.

@plux I attempted to make this configurable, but then realized that a bunch of optimizations have been dropped as part of the change (e.g. the delete_by_uri). Making the whole feature configurable would mean reintroducing them for the non-incremental indexing (they don't really make sense with the new setup), which makes the code quite hairy. I am leaning towards just switching the behaviour and holding off a new release until this proves solid enough. That would allow us to dogfood the new code base. And from the first experiments, it looks promising.

@robertoaloi robertoaloi changed the title WIP - On demand indexing On demand indexing Apr 8, 2022
@robertoaloi robertoaloi marked this pull request as ready for review April 8, 2022 07:22
Contributor

@michalmuskala michalmuskala left a comment

Some comments from me.

Some of them concern the feature itself; others are to clarify the behaviour of ELS, since I'm not fully familiar with it.

apps/els_lsp/priv/code_navigation/src/purge_references.erl (outdated)
%%==============================================================================
%% Callbacks for the gen_server behaviour
%%==============================================================================
-behaviour(gen_server).
Contributor

Could you explain a bit why we need a buffer server in the first place, rather than keeping the text directly and using plain functions for processing edits?

Member Author

The buffer was introduced as part of support for incremental text sync and I just ported it to a gen_server, so it's not new with this feature.

The main intention was to buffer didChange events, so that re-indexing would be retriggered not too often (a didChange event is sent at every keystroke). But I agree that it complicates things and probably introduces some races.


@@ -150,13 +150,13 @@ lookup_document(Uri) ->
{ok, Document};
{ok, []} ->
Path = els_uri:path(Uri),
{ok, Uri} = els_indexing:index_file(Path),
{ok, Uri} = els_indexing:shallow_index(Path, app),
Contributor

How do we know this module is part of the application? Could the lookup request come for any document?

Member Author

The Source (app) is only used to decide whether to index references or not. In principle, we should be able to detect from the Uri whether it refers to an app, otp or dep. In practice, we just store references even for otp modules, which shouldn't have a significant effect for the user. But it's a fair question, so I will add a comment to eventually follow up.

-spec delete_by_uri(uri()) -> ok.
delete_by_uri(Uri) ->
case filename:extension(Uri) of
<<".erl">> ->
Contributor

Hmm... It looks like we also index headers. Should there be some processing for them in here as well?

Member Author

Signatures are currently indexed by MFA and the respective info is indexed while the module itself is indexed (taking into account the chain of header files). The reason the check is here is that we also process headers for indexing, and that caused an issue: it was deleting all, say, file specs when indexing file.hrl, which is of course incorrect. Indexing of headers should be reconsidered, but I would keep that out of scope for this PR.

do_index(Document, Mode, LookupResult =/= [])
-spec ensure_deeply_indexed(uri()) -> ok.
ensure_deeply_indexed(Uri) ->
{ok, #{pois := POIs} = Document} = els_utils:lookup_document(Uri),
Contributor

In general it seems to me that these lookups happen very often and they seem rather expensive, given they copy a lot of data out of ets.

I guess one way would be to look up the document at the beginning of the request and then pass the document around, rather than just the Uri, though I understand this would be a huge effort to change.

Member Author

That sounds like a great optimization that can be implemented as a follow up.

deep_index(Document) ->
#{id := Id, uri := Uri, text := Text, source := Source} = Document,
{ok, POIs} = els_parser:parse(Text),
ok = els_dt_document:insert(Document#{pois => POIs}),
Contributor

Could there be some concurrent modification to Document that would mean this overrides the data?

In particular, could there be a concurrent didOpen request that adds a buffer pid, but this would erase it?

Member Author

Yes, that could happen (even if it would require two didOpen requests for the same file, which is not very likely). That said, I'm considering removing the buffer altogether and finding an alternative solution for avoiding multiple re-index operations.

extract_pattern({include, Id}) ->
include_id(Id);
extract_pattern({include_lib, Id}) ->
include_id(Id);
Contributor

Hmm... It looks like in the word splitting we save full strings and then compare for equality, but here we only extract the rootname.

I think this means that include matching would never actually match, since we have indexed strings like "foo.hrl" but we're looking here for "foo". Similarly, for include_lib we'd index "app/include/foo.hrl".

Member Author

That is correct, need to fix.

Member Author

Actually, we don't even need to index strings, with the exception of includes, so I may apply the same transformation on indexing.
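Applying the same transformation on both sides could look like this sketch, where both "foo.hrl" and "app/include/foo.hrl" normalise to the token <<"foo">> at indexing time and at lookup time (include_id/1 appears in the diff above; this body is illustrative, not the actual implementation):

```erlang
include_id(Id) ->
    %% Strip the directory part and the extension, keep only the root name.
    unicode:characters_to_binary(filename:rootname(filename:basename(Id))).
```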

Member Author

Fixed.

@robertoaloi
Member Author

@michalmuskala For some reason I cannot reply inline to this comment:

#1260 (comment)

That should be handled correctly, but it would be great to encode that in a test case.

Contributor

@michalmuskala michalmuskala left a comment

With the clarifying comments and fixes, this looks good to go for me.

@robertoaloi robertoaloi merged commit a9727c9 into main Apr 8, 2022
@robertoaloi robertoaloi deleted the store-pois-on-demand branch April 8, 2022 12:57