Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit number of synonyms that can be created #109390

Open
wants to merge 22 commits into
base: main
Choose a base branch
from

Conversation

carlosdelest
Copy link
Member

Closes #108785

Update the maximum number of synonyms to 100,000, versus the current limit of 10,000.

Also, include checks at the API level to avoid creating synonyms sets with more than 10,000 rules.

Copy link

github-actions bot commented Jun 5, 2024

Documentation preview:

@carlosdelest carlosdelest added >bug :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Meta label for search team and removed v8.15.0 labels Jun 5, 2024
@@ -82,7 +82,8 @@ dependencies {
internalClusterTestImplementation(project(":test:framework")) {
exclude group: 'org.elasticsearch', module: 'server'
}

internalClusterTestImplementation(project(path: ':modules:reindex'))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed to add update by query and match_only_text plugins for the new IT to work

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW project(path: ':modules:reindex') can be simplified to project(':modules:reindex').


import static org.elasticsearch.action.synonyms.SynonymsTestUtils.randomSynonymsSet;

public class SynonymsManagementAPIServiceIT extends ESSingleNodeTestCase {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created an IT to test edge cases that had to do with max synonyms sets - YAML seemed impractical for this 😉

reloadAnalyzers(synonymsSetId, false, l2, updateStatus);
checkSynonymSetExists(synonymsSetId, listener.delegateFailureAndWrap((l1, obj) -> {
// Count synonym rules to check if we're at maximum
client.prepareSearch(SYNONYMS_ALIAS_NAME)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First count number of synonyms, to ensure we're not at the limit.

We could be more fine grained here, and allow to update a single rule when we're already at the limit - but I don't think the added complexity for this edge case is worth it. Happy to reconsider or add this later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still possible to add synonyms in parallel and to end up with more rules than the max allowed. It's probably an edge case but I don't like the fact that we'll ignore these synonyms silently when creating the analyzer for the index.

@@ -7,8 +7,7 @@

Creates or updates a synonyms set.

NOTE: Synonyms sets are limited to a maximum of 10,000 synonym rules per set.
Synonym sets with more than 10,000 synonym rules will provide inconsistent search results.
NOTE: Synonyms sets are limited to a maximum of 100,000 synonym rules per set.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect this increase to allowed rules to have any adverse side-effect in terms of performance?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will have an impact on heap when retrieving them - I need to perform some testing in order to check that we're not having too much memory usage for small clusters.

On non-API related operations, this should be no different than using file-based synonyms - and there's no limit set there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a note that we recommend less than 10,000?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense - I'll be adding that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this again - I guess we're mainly limited by heap size on this. It's hard to put a limit on this for example and not on file-based synonyms, which would have the same problem.

I'll try and do some memory usage tests and come back later with a proposal for the users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be we should avoid any recommendation in terms of sizing, just put a note that for a large synonyms sets put extra demand on memory.

@carlosdelest
Copy link
Member Author

@elasticmachine update branch

@Override
public void onResponse(PagedResult<SynonymRule> synonymRulePagedResult) {
// TODO This fails in CI but passes locally. Why?
assertEquals(rulesNumber, synonymRulePagedResult.totalResults());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this fails in CI. Results should be updated correctly as we're doing a refresh when we update in the API side 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timing?

Copy link
Member

@kderusso kderusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for doing this @carlosdelest !

.execute(l1.delegateFailureAndWrap((searchListener, searchResponse) -> {
long synonymsSetSize = searchResponse.getHits().getTotalHits().value;
if (synonymsSetSize >= MAX_SYNONYMS_SETS) {
// We could potentially update a synonym rule when we're at max capacity, but we're keeping this simple
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could consider supporting updates here by adding a must_not clause to the query on the synonym ID? Then it would be below the max if it existed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good way of doing this! Again, I feel it a bit complicated for such an edge case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think it's worth taking this edge case into consideration, because you know someone is going to try it and file a bug later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it's not really an edge case to be at capacity and to update the existing rules. I think we need a more robust approach here like executing all updates on the master node and executing the updates sequentially.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not really an edge case to be at capacity and to update the existing rules

This edge case happens only if we're updating an individual synonym rule when we're at max capacity. Updating rules in batch means that first we remove all the rules and apply the updates in a bulk request.

It sounds weird to me to update individual rules when we have 100k synonyms in a synonym set- at that scale, I would think that users do batch updates of rules, which effectively replace the existing ones.

But if y'all think this needs to be dealt with, I will! I'll update the code and ping when ready 👍

I think we need a more robust approach here like executing all updates on the master node and executing the updates sequentially

So do you think that this should be a TransportMasterNodeAction? As this updates an index and not the cluster state, what would be the advantages of applying the action on the master node?

As we're doing a bulk request under the hood, doesn't that mean that we're applying the updates sequentially?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updating rules in batch means that first we remove all the rules and apply the updates in a bulk request.

This is another case of non-consistent updates, we need to ensure some ordering here.

So do you think that this should be a TransportMasterNodeAction? As this updates an index and not the cluster state, what would be the advantages of applying the action on the master node?

Having a single place where updates can occur but as you noticed that won't be enough. We also need to ensure that all updates are done sequentially, not in a single bulk request but globally when multiple synonyms updates are done in parallel. See the MetadataMappingService for an example of service that applies updates sequentially.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jimczi , I will take a look into that.

I think we should create a separate issue for ensuring updates are applied sequentially, as this is not directly related to this change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's related, we cannot limit the number of synonyms without ensuring that updates are applied sequentially.

@@ -55,6 +55,38 @@ setup:
- synonyms: "bye => goodbye"
id: "test-id-2"

---
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth adding a test here to test that you can't insert a new synonym after reaching the max?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that as part of the integration tests - it would be very inconvenient to create 100k docs in a YAML test, and we can use randomization to increase coverage as well

@@ -7,8 +7,7 @@

Creates or updates a synonyms set.

NOTE: Synonyms sets are limited to a maximum of 10,000 synonym rules per set.
Synonym sets with more than 10,000 synonym rules will provide inconsistent search results.
NOTE: Synonyms sets are limited to a maximum of 100,000 synonym rules per set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a note that we recommend less than 10,000?

@@ -18,8 +18,7 @@ This provides an alternative to:
Synonyms sets can be used to configure <<analysis-synonym-graph-tokenfilter,synonym graph token filters>> and <<analysis-synonym-tokenfilter,synonym token filters>>.
These filters are applied as part of the <<analysis,analysis>> process by the <<search-analyzer,search analyzer>>.

NOTE: Synonyms sets are limited to a maximum of 10,000 synonym rules per set.
Synonym sets with more than 10,000 synonym rules will provide inconsistent search results.
NOTE: Synonyms sets are limited to a maximum of 100,000 synonym rules per set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a note that we recommend less than 10,000?

@Override
public void onResponse(PagedResult<SynonymRule> synonymRulePagedResult) {
// TODO This fails in CI but passes locally. Why?
assertEquals(rulesNumber, synonymRulePagedResult.totalResults());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timing?

@@ -351,7 +351,7 @@ public static Reader getReaderFromFile(Environment env, String filePath, String

public static Reader getReaderFromIndex(String synonymsSet, SynonymsManagementAPIService synonymsManagementAPIService) {
final PlainActionFuture<PagedResult<SynonymRule>> synonymsLoadingFuture = new PlainActionFuture<>();
synonymsManagementAPIService.getSynonymSetRules(synonymsSet, 0, 10_000, synonymsLoadingFuture);
synonymsManagementAPIService.getSynonymSetRules(synonymsSet, synonymsLoadingFuture);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retrieve all synonyms sets. It does not specify the maximum to allow the service to set that up depending on the index setting for bwc reasons.

}

// Used for testing, so we don't need to test for MAX_SYNONYMS_SETS and put unnecessary memory pressure on the test cluster
SynonymsManagementAPIService(Client client, int maxSynonymsSets) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of testing using a lot of synonyms, which made tests slow and consumed a lot of heap, I decided to artificially limit the max on tests

@carlosdelest carlosdelest marked this pull request as ready for review June 12, 2024 14:11
@carlosdelest carlosdelest requested a review from a team as a code owner June 12, 2024 14:11
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@carlosdelest
Copy link
Member Author

I've marked this ready for review after dealing with some testing issues, which were caused by the sheer number of synonyms being tested AFAICT.

I know @jimczi was keen on adding a cluster setting for limiting the number of max synonyms if needed. I can add that as a separate PR if this one looks good.

@elasticsearchmachine
Copy link
Collaborator

Hi @carlosdelest, I've created a changelog YAML for you.

@@ -82,7 +82,8 @@ dependencies {
internalClusterTestImplementation(project(":test:framework")) {
exclude group: 'org.elasticsearch', module: 'server'
}

internalClusterTestImplementation(project(path: ':modules:reindex'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW project(path: ':modules:reindex') can be simplified to project(':modules:reindex').

Copy link
Member

@kderusso kderusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really good! Just one comment/discussion point.

.execute(l1.delegateFailureAndWrap((searchListener, searchResponse) -> {
long synonymsSetSize = searchResponse.getHits().getTotalHits().value;
if (synonymsSetSize >= MAX_SYNONYMS_SETS) {
// We could potentially update a synonym rule when we're at max capacity, but we're keeping this simple
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think it's worth taking this edge case into consideration, because you know someone is going to try it and file a bug later.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach doesn't seem very robust and could suffer from the same issue we're seeing today (silent ignoring of indexed synonym rules). I am also concerned that a synonym set at capacity will be difficult to manage.
I am in favour of a more robust approach here even if that increases the complexity.

reloadAnalyzers(synonymsSetId, false, l2, updateStatus);
checkSynonymSetExists(synonymsSetId, listener.delegateFailureAndWrap((l1, obj) -> {
// Count synonym rules to check if we're at maximum
client.prepareSearch(SYNONYMS_ALIAS_NAME)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still possible to add synonyms in parallel and to end up with more rules than the max allowed. It's probably an edge case but I don't like the fact that we'll ignore these synonyms silently when creating the analyzer for the index.

.execute(l1.delegateFailureAndWrap((searchListener, searchResponse) -> {
long synonymsSetSize = searchResponse.getHits().getTotalHits().value;
if (synonymsSetSize >= MAX_SYNONYMS_SETS) {
// We could potentially update a synonym rule when we're at max capacity, but we're keeping this simple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it's not really an edge case to be at capacity and to update the existing rules. I think we need a more robust approach here like executing all updates on the master node and executing the updates sequentially.

@carlosdelest
Copy link
Member Author

Will address API limits first in this PR: #109981 . Then I'll address enforcing the correct order via master node and raising the synonyms limit

@elasticsearchmachine elasticsearchmachine added v8.16.0 Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch and removed v8.15.0 labels Jul 4, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine
Copy link
Collaborator

Hi @carlosdelest, I've updated the changelog YAML for you.

@elasticsearchmachine elasticsearchmachine removed the Team:Search Meta label for search team label Jul 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v8.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Large synonyms sets inconsistently return synonym results
8 participants