String interning #385

SimonSapin · 2022-12-13T13:21:22Z

The HIR in apollo-compiler uses String a lot. Cloning, hashing, and comparing those strings repeatedly has cost that can add up. This can be solved by having some kind of string "interning" that deduplicates strings in memory, so that for example equality would be as cheap as a pointer comparison.

Some secondary goals are:

Strings that become unused are freed eventually. Not doing this enables interesting designs but would be a DoS vector for a long-lived server processing GraphQL input from untrusted sources
Mixing up interned strings from different compiler instances does not give incorrect results (such as equality false negatives)
- Ideally this gives correct results
- Perhaps having this panic instead is acceptable?
Interned strings are represented by a lifetime-free type that implements AsRef<str> , Eq, Hash, etc. That is, using them does not require a separate access a string pool / interner.

With a quick search I found a number of interning crates that don’t do all of this. Servo’s string-cache does but comes with non-trivial complexity for features that are maybe not as useful here. (Namely: multiple static sets of "well-known" strings that can be interned without allocation and in const context, small string optimization)

Perhaps something custom to apollo-rs would make sense. I think the design might involve atomic reference-counting, precomputed hash, and a process-global mutex-protected hash table. It could use https://crates.io/crates/weak-table, or something even simpler since we don’t need a weak counter (as it would always be 1). But maybe a full custom hash table is too much wheel-reinvention.

The text was updated successfully, but these errors were encountered:

SimonSapin · 2023-01-13T12:23:50Z

Let’s call Symbol this hypothetical type with Deref<Target = str> and cheap Clone.

Ideally to reduce copying we’d convert strings to Symbol as early as possible (in the lexer) and keep them as such as long as they are needed. In the HIR we define data structures so we can choose to store Symbol instead of String.

The AST however is backed by Rowan, which has its own opinion about string storage. GreenNodeBuilder takes a &str and has an internal cache/interner. GreenToken contains a rowan-private ThinArc smart pointer.

I’m not sure how to reconcile both.

SimonSapin · 2023-06-01T13:39:48Z

Deduplicating indentical strings is nice to save memory and to make == very fast, but perhaps not as important for GraphQL documents as it is for HTML element names in Servo. (When you routinely have many thousands of <div> elements you’d rather not allocate the string "div" that many times.)

A smart string type with reference counting (fast clone) and pre-computed hash (fast Hash and fast path for ==) would already bring a lot of benefits over the status quo, without the implementation complexity of full dedup.

SimonSapin · 2023-09-24T19:34:58Z

The WIP mut branch introduces a NodeStr type which is similar to Arc<str>, replacing many allocations and string copies with atomic integer operations.

We decided against pre-computed hashing, since hashing a pre-computed 8 bytes hash is not much less work than hashing names that are rarely dozens of bytes.

Deduplication is also not done. An explicit per-document interner would not work since a NodeStr can be taken from one parsed document and reusing when mutating another. A global implicit interner would require non-trivial synchronization.

SimonSapin · 2024-06-18T15:51:55Z

#868 changes the memory representation of NodeStr / Name. We considered but rejected a process-global string interner that would String::leak newly-inserted strings and give out &'static str. However I think a variation of it is possible that gives out Arc<str>, and stores a weak reference to let unused strings be deallocated. Either way, this interner would be synchronized by a single RwLock. It remains to be seen whether the benefit of deduplicated strings is worth the synchronization cost of RwLock. Benchmarking is not easy because this cost happens mostly under multi-threaded contention, so a synthetic benchmark may not be representative of real-world workloads.

Reopening to track ideas in case we want to reconsider this later.

SimonSapin · 2024-07-04T06:30:18Z

I just came across this post again, about the cost of many threads cloning the same Arc: https://pkolaczk.github.io/server-slower-than-a-laptop/

If the Router ever has a tight loop that clones a Name from its schema, an Arc-based string interner could potentially trigger this same performance problem. But maybe enough code manipulates &Name without touching the atomic reference counter?

goto-bus-stop · 2024-07-04T09:22:41Z

My impression is that apollo-federation today clones Names a lot a lot

SimonSapin added triage apollo-compiler issues/PRs pertaining to semantic analysis & validation labels Dec 13, 2022

lrlna added this to the 06/2023 milestone May 26, 2023

SimonSapin closed this as completed Sep 24, 2023

SimonSapin reopened this Jun 18, 2024

SimonSapin mentioned this issue Jun 27, 2024

Prepare [email protected] release #875

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

String interning #385

String interning #385

SimonSapin commented Dec 13, 2022 •

edited

Loading

SimonSapin commented Jan 13, 2023

SimonSapin commented Jun 1, 2023

SimonSapin commented Sep 24, 2023

SimonSapin commented Jun 18, 2024

SimonSapin commented Jul 4, 2024

goto-bus-stop commented Jul 4, 2024

String interning #385

String interning #385

Comments

SimonSapin commented Dec 13, 2022 • edited Loading

SimonSapin commented Jan 13, 2023

SimonSapin commented Jun 1, 2023

SimonSapin commented Sep 24, 2023

SimonSapin commented Jun 18, 2024

SimonSapin commented Jul 4, 2024

goto-bus-stop commented Jul 4, 2024

SimonSapin commented Dec 13, 2022 •

edited

Loading