Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String interning #385

Open
SimonSapin opened this issue Dec 13, 2022 · 6 comments
Open

String interning #385

SimonSapin opened this issue Dec 13, 2022 · 6 comments
Labels
apollo-compiler issues/PRs pertaining to semantic analysis & validation triage
Milestone

Comments

@SimonSapin
Copy link
Contributor

SimonSapin commented Dec 13, 2022

The HIR in apollo-compiler uses String a lot. Cloning, hashing, and comparing those strings repeatedly has cost that can add up. This can be solved by having some kind of string "interning" that deduplicates strings in memory, so that for example equality would be as cheap as a pointer comparison.

Some secondary goals are:

  • Strings that become unused are freed eventually. Not doing this enables interesting designs but would be a DoS vector for a long-lived server processing GraphQL input from untrusted sources
  • Mixing up interned strings from different compiler instances does not give incorrect results (such as equality false negatives)
    • Ideally this gives correct results
    • Perhaps having this panic instead is acceptable?
  • Interned strings are represented by a lifetime-free type that implements AsRef<str> , Eq, Hash, etc. That is, using them does not require a separate access a string pool / interner.

With a quick search I found a number of interning crates that don’t do all of this. Servo’s string-cache does but comes with non-trivial complexity for features that are maybe not as useful here. (Namely: multiple static sets of "well-known" strings that can be interned without allocation and in const context, small string optimization)

Perhaps something custom to apollo-rs would make sense. I think the design might involve atomic reference-counting, precomputed hash, and a process-global mutex-protected hash table. It could use https://crates.io/crates/weak-table, or something even simpler since we don’t need a weak counter (as it would always be 1). But maybe a full custom hash table is too much wheel-reinvention.

@SimonSapin SimonSapin added triage apollo-compiler issues/PRs pertaining to semantic analysis & validation labels Dec 13, 2022
@SimonSapin
Copy link
Contributor Author

Let’s call Symbol this hypothetical type with Deref<Target = str> and cheap Clone.

Ideally to reduce copying we’d convert strings to Symbol as early as possible (in the lexer) and keep them as such as long as they are needed. In the HIR we define data structures so we can choose to store Symbol instead of String.

The AST however is backed by Rowan, which has its own opinion about string storage. GreenNodeBuilder takes a &str and has an internal cache/interner. GreenToken contains a rowan-private ThinArc smart pointer.

I’m not sure how to reconcile both.

@lrlna lrlna added this to the 06/2023 milestone May 26, 2023
@SimonSapin
Copy link
Contributor Author

Deduplicating indentical strings is nice to save memory and to make == very fast, but perhaps not as important for GraphQL documents as it is for HTML element names in Servo. (When you routinely have many thousands of <div> elements you’d rather not allocate the string "div" that many times.)

A smart string type with reference counting (fast clone) and pre-computed hash (fast Hash and fast path for ==) would already bring a lot of benefits over the status quo, without the implementation complexity of full dedup.

@SimonSapin
Copy link
Contributor Author

The WIP mut branch introduces a NodeStr type which is similar to Arc<str>, replacing many allocations and string copies with atomic integer operations.

We decided against pre-computed hashing, since hashing a pre-computed 8 bytes hash is not much less work than hashing names that are rarely dozens of bytes.

Deduplication is also not done. An explicit per-document interner would not work since a NodeStr can be taken from one parsed document and reusing when mutating another. A global implicit interner would require non-trivial synchronization.

@SimonSapin
Copy link
Contributor Author

#868 changes the memory representation of NodeStr / Name. We considered but rejected a process-global string interner that would String::leak newly-inserted strings and give out &'static str. However I think a variation of it is possible that gives out Arc<str>, and stores a weak reference to let unused strings be deallocated. Either way, this interner would be synchronized by a single RwLock. It remains to be seen whether the benefit of deduplicated strings is worth the synchronization cost of RwLock. Benchmarking is not easy because this cost happens mostly under multi-threaded contention, so a synthetic benchmark may not be representative of real-world workloads.

Reopening to track ideas in case we want to reconsider this later.

@SimonSapin
Copy link
Contributor Author

I just came across this post again, about the cost of many threads cloning the same Arc: https://pkolaczk.github.io/server-slower-than-a-laptop/

If the Router ever has a tight loop that clones a Name from its schema, an Arc-based string interner could potentially trigger this same performance problem. But maybe enough code manipulates &Name without touching the atomic reference counter?

@goto-bus-stop
Copy link
Member

My impression is that apollo-federation today clones Names a lot a lot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
apollo-compiler issues/PRs pertaining to semantic analysis & validation triage
Projects
None yet
Development

No branches or pull requests

3 participants