Limit number of synonyms that can be created #109390
base: main
Conversation
Documentation preview:
@@ -82,7 +82,8 @@ dependencies {
     internalClusterTestImplementation(project(":test:framework")) {
         exclude group: 'org.elasticsearch', module: 'server'
     }
+    internalClusterTestImplementation(project(path: ':modules:reindex'))
Needed to add the update-by-query and match_only_text plugins for the new IT to work.
FWIW, `project(path: ':modules:reindex')` can be simplified to `project(':modules:reindex')`.
import static org.elasticsearch.action.synonyms.SynonymsTestUtils.randomSynonymsSet;

public class SynonymsManagementAPIServiceIT extends ESSingleNodeTestCase {
I created an IT to test edge cases that had to do with max synonyms sets - YAML seemed impractical for this 😉
reloadAnalyzers(synonymsSetId, false, l2, updateStatus);
checkSynonymSetExists(synonymsSetId, listener.delegateFailureAndWrap((l1, obj) -> {
    // Count synonym rules to check if we're at maximum
    client.prepareSearch(SYNONYMS_ALIAS_NAME)
First count the number of synonyms, to ensure we're not at the limit.
We could be more fine-grained here and allow updating a single rule when we're already at the limit - but I don't think the added complexity for this edge case is worth it. Happy to reconsider or add this later.
It's still possible to add synonyms in parallel and to end up with more rules than the max allowed. It's probably an edge case but I don't like the fact that we'll ignore these synonyms silently when creating the analyzer for the index.
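The parallel-writes concern can be illustrated outside Elasticsearch with a minimal sketch: a separate "count, then insert" step can be interleaved by two writers and overshoot the limit, while an atomic reserve step cannot. All class and method names below are hypothetical, not part of this PR.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch of an atomic capacity guard. A plain
 * "read the count, compare, then write" sequence is racy; reserving
 * the slot with compareAndSet makes check-and-insert atomic.
 */
class SynonymCapacityGuard {
    private final int maxRules;
    private final AtomicInteger ruleCount = new AtomicInteger();

    SynonymCapacityGuard(int maxRules) {
        this.maxRules = maxRules;
    }

    /** Atomically reserves a slot; returns false when at capacity. */
    boolean tryReserve() {
        while (true) {
            int current = ruleCount.get();
            if (current >= maxRules) {
                return false; // at capacity: reject the new rule
            }
            if (ruleCount.compareAndSet(current, current + 1)) {
                return true; // slot reserved: safe to index the rule
            }
            // another writer raced us; re-read the count and retry
        }
    }
}
```

The real fix would need the equivalent guarantee at the index level (or a serialized writer, as discussed further down in this thread), since the count here lives in a search response, not in shared memory.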
@@ -7,8 +7,7 @@

 Creates or updates a synonyms set.

-NOTE: Synonyms sets are limited to a maximum of 10,000 synonym rules per set.
-Synonym sets with more than 10,000 synonym rules will provide inconsistent search results.
+NOTE: Synonyms sets are limited to a maximum of 100,000 synonym rules per set.
Do we expect this increase in allowed rules to have any adverse side effects in terms of performance?
It will have an impact on heap when retrieving them - I need to perform some testing to check that memory usage isn't excessive on small clusters.
On non-API-related operations, this should be no different from using file-based synonyms - and there's no limit set there.
Should we add a note that we recommend fewer than 10,000?
That makes sense - I'll be adding that.
Thinking about this again - I guess we're mainly limited by heap size here. It's hard to justify putting a limit on this, for example, and not on file-based synonyms, which would have the same problem.
I'll try to do some memory usage tests and come back later with a proposal for users.
Maybe we should avoid any recommendation in terms of sizing, and just add a note that large synonyms sets put extra demand on memory.
@elasticmachine update branch
@Override
public void onResponse(PagedResult<SynonymRule> synonymRulePagedResult) {
    // TODO This fails in CI but passes locally. Why?
    assertEquals(rulesNumber, synonymRulePagedResult.totalResults());
I don't understand why this fails in CI. Results should be updated correctly, as we're doing a refresh when we update on the API side 🤔
Timing?
Thanks so much for doing this @carlosdelest !
.execute(l1.delegateFailureAndWrap((searchListener, searchResponse) -> {
    long synonymsSetSize = searchResponse.getHits().getTotalHits().value;
    if (synonymsSetSize >= MAX_SYNONYMS_SETS) {
        // We could potentially update a synonym rule when we're at max capacity, but we're keeping this simple
We could consider supporting updates here by adding a must_not clause on the synonym rule ID to the query - then the count would be below the max if the rule already existed.
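This suggestion can be sketched with an in-memory stand-in for the rule index: excluding the upserted rule's own ID from the count (what the must_not clause would do in the real bool query) allows updating an existing rule at capacity while still rejecting brand-new rules. Names are illustrative, not from the PR.

```java
import java.util.Set;

/**
 * Hypothetical sketch of the capacity check with self-exclusion.
 * existingRuleIds stands in for the documents in the synonyms index.
 */
class UpsertCapacityCheck {
    static boolean canUpsert(Set<String> existingRuleIds, String ruleId, int maxRules) {
        // Count rules other than the one being upserted; this mirrors a
        // bool query with a must_not clause on the rule ID.
        long others = existingRuleIds.stream().filter(id -> !id.equals(ruleId)).count();
        return others < maxRules;
    }
}
```

With two existing rules and a limit of two, updating one of them is allowed while adding a third is rejected.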
That's a good way of doing this! Again, I feel it's a bit complicated for such an edge case.
I do think it's worth taking this edge case into consideration, because you know someone is going to try it and file a bug later.
Agreed, it's not really an edge case to be at capacity and to update the existing rules. I think we need a more robust approach here like executing all updates on the master node and executing the updates sequentially.
> it's not really an edge case to be at capacity and to update the existing rules
This edge case happens only if we're updating an individual synonym rule when we're at max capacity. Updating rules in batch means that we first remove all the rules and then apply the updates in a bulk request.
It sounds weird to me to update individual rules when we have 100k synonyms in a synonym set - at that scale, I would think that users do batch updates of rules, which effectively replace the existing ones.
But if y'all think this needs to be dealt with, I will! I'll update the code and ping when ready 👍
> I think we need a more robust approach here like executing all updates on the master node and executing the updates sequentially
So do you think that this should be a TransportMasterNodeAction? As this updates an index and not the cluster state, what would be the advantages of applying the action on the master node?
As we're doing a bulk request under the hood, doesn't that mean that we're applying the updates sequentially?
> Updating rules in batch means that first we remove all the rules and apply the updates in a bulk request.
This is another case of inconsistent updates; we need to ensure some ordering here.
> So do you think that this should be a TransportMasterNodeAction? As this updates an index and not the cluster state, what would be the advantages of applying the action on the master node?
Having a single place where updates can occur - but as you noticed, that won't be enough. We also need to ensure that all updates are applied sequentially: not just within a single bulk request, but globally, when multiple synonyms updates run in parallel. See MetadataMappingService for an example of a service that applies updates sequentially.
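As a rough illustration of the sequential-update idea (hypothetical names, not the MetadataMappingService API): funnel every update through a single-threaded queue, so the capacity check and the write can never interleave with another update.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Hypothetical sketch: serialize all synonym updates through one
 * thread so check-then-write is effectively atomic across callers,
 * in the spirit of services that apply updates one at a time.
 */
class SequentialSynonymUpdater {
    private final ExecutorService queue = Executors.newSingleThreadExecutor();
    private final int maxRules;
    private int ruleCount = 0; // only touched on the queue thread

    SequentialSynonymUpdater(int maxRules) {
        this.maxRules = maxRules;
    }

    /** Submits an add; the capacity check and the write run serially. */
    Future<Boolean> addRule(String rule) {
        return queue.submit(() -> {
            if (ruleCount >= maxRules) {
                return false; // reject: set is at capacity
            }
            ruleCount++; // the real service would index the rule here
            return true;
        });
    }

    void shutdown() {
        queue.shutdown();
    }
}
```

Because all mutations go through one executor thread, two concurrent callers can never both observe "below the limit" and both write.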
Thanks @jimczi , I will take a look into that.
I think we should create a separate issue for ensuring updates are applied sequentially, as this is not directly related to this change?
I think it's related, we cannot limit the number of synonyms without ensuring that updates are applied sequentially.
@@ -55,6 +55,38 @@ setup:
       - synonyms: "bye => goodbye"
         id: "test-id-2"

+---
Is it worth adding a test here to test that you can't insert a new synonym after reaching the max?
I did that as part of the integration tests - it would be very inconvenient to create 100k docs in a YAML test, and we can use randomization to increase coverage as well.
@@ -18,8 +18,7 @@ This provides an alternative to:
 Synonyms sets can be used to configure <<analysis-synonym-graph-tokenfilter,synonym graph token filters>> and <<analysis-synonym-tokenfilter,synonym token filters>>.
 These filters are applied as part of the <<analysis,analysis>> process by the <<search-analyzer,search analyzer>>.

-NOTE: Synonyms sets are limited to a maximum of 10,000 synonym rules per set.
-Synonym sets with more than 10,000 synonym rules will provide inconsistent search results.
+NOTE: Synonyms sets are limited to a maximum of 100,000 synonym rules per set.
Should we add a note that we recommend less than 10,000?
@@ -351,7 +351,7 @@ public static Reader getReaderFromFile(Environment env, String filePath, String

 public static Reader getReaderFromIndex(String synonymsSet, SynonymsManagementAPIService synonymsManagementAPIService) {
     final PlainActionFuture<PagedResult<SynonymRule>> synonymsLoadingFuture = new PlainActionFuture<>();
-    synonymsManagementAPIService.getSynonymSetRules(synonymsSet, 0, 10_000, synonymsLoadingFuture);
+    synonymsManagementAPIService.getSynonymSetRules(synonymsSet, synonymsLoadingFuture);
Retrieves all the rules in a synonyms set. The maximum is not specified here so that the service can determine it from the index setting, for bwc reasons.
}

// Used for testing, so we don't need to test for MAX_SYNONYMS_SETS and put unnecessary memory pressure on the test cluster
SynonymsManagementAPIService(Client client, int maxSynonymsSets) {
Instead of testing with a lot of synonyms, which made tests slow and consumed a lot of heap, I decided to artificially lower the limit in tests.
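The pattern described here - a production constructor that pins the real limit plus a package-private overload that lets tests inject a small one - looks roughly like this (illustrative names, not the actual Elasticsearch class):

```java
/**
 * Hypothetical sketch of the test-only constructor pattern: production
 * code always uses the default limit, while tests can lower it so
 * limit behaviour is exercised without creating 100k rules.
 */
class SynonymsService {
    static final int DEFAULT_MAX_SYNONYMS_SETS = 100_000;

    private final int maxSynonymsSets;

    // Production entry point: always uses the default limit.
    SynonymsService() {
        this(DEFAULT_MAX_SYNONYMS_SETS);
    }

    // Package-private, for tests: inject a small limit to keep tests
    // fast and light on heap.
    SynonymsService(int maxSynonymsSets) {
        this.maxSynonymsSets = maxSynonymsSets;
    }

    int maxSynonymsSets() {
        return maxSynonymsSets;
    }
}
```

Keeping the overload package-private confines the relaxed limit to the test sources in the same package.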
Pinging @elastic/es-search (Team:Search)
I've marked this ready for review after dealing with some testing issues, which AFAICT were caused by the sheer number of synonyms being tested. I know @jimczi was keen on adding a cluster setting for limiting the max number of synonyms if needed. I can add that as a separate PR if this one looks good.
Hi @carlosdelest, I've created a changelog YAML for you.
Looking really good! Just one comment/discussion point.
The approach doesn't seem very robust and could suffer from the same issue we're seeing today (silent ignoring of indexed synonym rules). I am also concerned that a synonym set at capacity will be difficult to manage.
I am in favour of a more robust approach here even if that increases the complexity.
Will address API limits first in this PR: #109981. Then I'll address enforcing the correct order via the master node and raising the synonyms limit.
Pinging @elastic/es-search-foundations (Team:Search Foundations)
Hi @carlosdelest, I've updated the changelog YAML for you.
Closes #108785
Updates the maximum number of synonyms to 100,000, versus the current limit of 10,000.
Also includes checks at the API level to avoid creating synonyms sets with more rules than the maximum.