Limit number of synonyms that can be created #109390
base: main
Conversation
Documentation preview:
@@ -82,7 +82,8 @@ dependencies {
     internalClusterTestImplementation(project(":test:framework")) {
         exclude group: 'org.elasticsearch', module: 'server'
     }
+    internalClusterTestImplementation(project(path: ':modules:reindex'))
Needed to add the update-by-query and match_only_text plugins for the new IT to work.
FWIW, `project(path: ':modules:reindex')` can be simplified to `project(':modules:reindex')`.
import static org.elasticsearch.action.synonyms.SynonymsTestUtils.randomSynonymsSet;

public class SynonymsManagementAPIServiceIT extends ESSingleNodeTestCase {
I created an IT to test edge cases that had to do with max synonyms sets - YAML seemed impractical for this 😉
reloadAnalyzers(synonymsSetId, false, l2, updateStatus);
checkSynonymSetExists(synonymsSetId, listener.delegateFailureAndWrap((l1, obj) -> {
    // Count synonym rules to check if we're at maximum
    client.prepareSearch(SYNONYMS_ALIAS_NAME)
First count the number of synonyms, to ensure we're not at the limit.
We could be more fine-grained here and allow updating a single rule when we're already at the limit - but I don't think the added complexity for this edge case is worth it. Happy to reconsider or add this later.
It's still possible to add synonyms in parallel and to end up with more rules than the max allowed. It's probably an edge case but I don't like the fact that we'll ignore these synonyms silently when creating the analyzer for the index.
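The parallel-writes concern can be illustrated outside Elasticsearch with a minimal sketch: a separate "count, then insert" step can be interleaved by two writers and overshoot the limit, while an atomic reserve step cannot. All class and method names below are hypothetical, not part of this PR.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch of an atomic capacity guard. A plain
 * "read the count, compare, then write" sequence is racy; reserving
 * the slot with compareAndSet makes check-and-insert atomic.
 */
class SynonymCapacityGuard {
    private final int maxRules;
    private final AtomicInteger ruleCount = new AtomicInteger();

    SynonymCapacityGuard(int maxRules) {
        this.maxRules = maxRules;
    }

    /** Atomically reserves a slot; returns false when at capacity. */
    boolean tryReserve() {
        while (true) {
            int current = ruleCount.get();
            if (current >= maxRules) {
                return false; // at capacity: reject the new rule
            }
            if (ruleCount.compareAndSet(current, current + 1)) {
                return true; // slot reserved: safe to index the rule
            }
            // another writer raced us; re-read the count and retry
        }
    }
}
```

The real fix would need the equivalent guarantee at the index level (or a serialized writer, as discussed further down in this thread), since the count here lives in a search response, not in shared memory.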
@@ -7,8 +7,7 @@

 Creates or updates a synonyms set.

-NOTE: Synonyms sets are limited to a maximum of 10,000 synonym rules per set.
-Synonym sets with more than 10,000 synonym rules will provide inconsistent search results.
+NOTE: Synonyms sets are limited to a maximum of 100,000 synonym rules per set.
Do we expect this increase in allowed rules to have any adverse side effects in terms of performance?
It will have an impact on heap when retrieving them - I need to perform some testing to check that memory usage isn't excessive on small clusters.
On non-API-related operations, this should be no different from using file-based synonyms - and there's no limit set there.
Should we add a note that we recommend fewer than 10,000?
That makes sense - I'll be adding that.
Thinking about this again - I guess we're mainly limited by heap size here. It's hard to justify putting a limit on this, for example, and not on file-based synonyms, which would have the same problem.
I'll try to do some memory usage tests and come back later with a proposal for users.
Maybe we should avoid any recommendation in terms of sizing, and just add a note that large synonyms sets put extra demand on memory.
@elasticmachine update branch
@Override
public void onResponse(PagedResult<SynonymRule> synonymRulePagedResult) {
    // TODO This fails in CI but passes locally. Why?
    assertEquals(rulesNumber, synonymRulePagedResult.totalResults());
I don't understand why this fails in CI. Results should be updated correctly, as we're doing a refresh when we update on the API side 🤔
Timing?
Thanks so much for doing this @carlosdelest !
.execute(l1.delegateFailureAndWrap((searchListener, searchResponse) -> {
    long synonymsSetSize = searchResponse.getHits().getTotalHits().value;
    if (synonymsSetSize >= MAX_SYNONYMS_SETS) {
        // We could potentially update a synonym rule when we're at max capacity, but we're keeping this simple
We could consider supporting updates here by adding a must_not clause on the synonym rule ID to the query - then the count would be below the max if the rule already existed.
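This suggestion can be sketched with an in-memory stand-in for the rule index: excluding the upserted rule's own ID from the count (what the must_not clause would do in the real bool query) allows updating an existing rule at capacity while still rejecting brand-new rules. Names are illustrative, not from the PR.

```java
import java.util.Set;

/**
 * Hypothetical sketch of the capacity check with self-exclusion.
 * existingRuleIds stands in for the documents in the synonyms index.
 */
class UpsertCapacityCheck {
    static boolean canUpsert(Set<String> existingRuleIds, String ruleId, int maxRules) {
        // Count rules other than the one being upserted; this mirrors a
        // bool query with a must_not clause on the rule ID.
        long others = existingRuleIds.stream().filter(id -> !id.equals(ruleId)).count();
        return others < maxRules;
    }
}
```

With two existing rules and a limit of two, updating one of them is allowed while adding a third is rejected.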
That's a good way of doing this! Again, I feel it's a bit complicated for such an edge case.
I do think it's worth taking this edge case into consideration, because you know someone is going to try it and file a bug later.
Agreed, it's not really an edge case to be at capacity and to update the existing rules. I think we need a more robust approach here like executing all updates on the master node and executing the updates sequentially.
> it's not really an edge case to be at capacity and to update the existing rules
This edge case happens only if we're updating an individual synonym rule when we're at max capacity. Updating rules in batch means that we first remove all the rules and then apply the updates in a bulk request.
It sounds weird to me to update individual rules when we have 100k synonyms in a synonym set - at that scale, I would think that users do batch updates of rules, which effectively replace the existing ones.
But if y'all think this needs to be dealt with, I will! I'll update the code and ping when ready 👍
> I think we need a more robust approach here like executing all updates on the master node and executing the updates sequentially
So do you think that this should be a TransportMasterNodeAction? As this updates an index and not the cluster state, what would be the advantages of applying the action on the master node?
As we're doing a bulk request under the hood, doesn't that mean that we're applying the updates sequentially?
> Updating rules in batch means that first we remove all the rules and apply the updates in a bulk request.
This is another case of inconsistent updates; we need to ensure some ordering here.
> So do you think that this should be a TransportMasterNodeAction? As this updates an index and not the cluster state, what would be the advantages of applying the action on the master node?
Having a single place where updates can occur - but as you noticed, that won't be enough. We also need to ensure that all updates are applied sequentially: not just within a single bulk request, but globally, when multiple synonyms updates run in parallel. See MetadataMappingService for an example of a service that applies updates sequentially.
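As a rough illustration of the sequential-update idea (hypothetical names, not the MetadataMappingService API): funnel every update through a single-threaded queue, so the capacity check and the write can never interleave with another update.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Hypothetical sketch: serialize all synonym updates through one
 * thread so check-then-write is effectively atomic across callers,
 * in the spirit of services that apply updates one at a time.
 */
class SequentialSynonymUpdater {
    private final ExecutorService queue = Executors.newSingleThreadExecutor();
    private final int maxRules;
    private int ruleCount = 0; // only touched on the queue thread

    SequentialSynonymUpdater(int maxRules) {
        this.maxRules = maxRules;
    }

    /** Submits an add; the capacity check and the write run serially. */
    Future<Boolean> addRule(String rule) {
        return queue.submit(() -> {
            if (ruleCount >= maxRules) {
                return false; // reject: set is at capacity
            }
            ruleCount++; // the real service would index the rule here
            return true;
        });
    }

    void shutdown() {
        queue.shutdown();
    }
}
```

Because all mutations go through one executor thread, two concurrent callers can never both observe "below the limit" and both write.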
Thanks @jimczi , I will take a look into that.
I think we should create a separate issue for ensuring updates are applied sequentially, as this is not directly related to this change?
I think it's related, we cannot limit the number of synonyms without ensuring that updates are applied sequentially.
@@ -55,6 +55,38 @@ setup:
       - synonyms: "bye => goodbye"
         id: "test-id-2"

+---
Is it worth adding a test here to test that you can't insert a new synonym after reaching the max?
I did that as part of the integration tests - it would be very inconvenient to create 100k docs in a YAML test, and we can use randomization to increase coverage as well.
@@ -18,8 +18,7 @@ This provides an alternative to:
 Synonyms sets can be used to configure <<analysis-synonym-graph-tokenfilter,synonym graph token filters>> and <<analysis-synonym-tokenfilter,synonym token filters>>.
 These filters are applied as part of the <<analysis,analysis>> process by the <<search-analyzer,search analyzer>>.

-NOTE: Synonyms sets are limited to a maximum of 10,000 synonym rules per set.
-Synonym sets with more than 10,000 synonym rules will provide inconsistent search results.
+NOTE: Synonyms sets are limited to a maximum of 100,000 synonym rules per set.
Should we add a note that we recommend less than 10,000?
@@ -351,7 +351,7 @@ public static Reader getReaderFromFile(Environment env, String filePath, String

 public static Reader getReaderFromIndex(String synonymsSet, SynonymsManagementAPIService synonymsManagementAPIService) {
     final PlainActionFuture<PagedResult<SynonymRule>> synonymsLoadingFuture = new PlainActionFuture<>();
-    synonymsManagementAPIService.getSynonymSetRules(synonymsSet, 0, 10_000, synonymsLoadingFuture);
+    synonymsManagementAPIService.getSynonymSetRules(synonymsSet, synonymsLoadingFuture);
Retrieves all the rules in a synonyms set. The maximum is not specified here so that the service can determine it from the index setting, for bwc reasons.
}

// Used for testing, so we don't need to test for MAX_SYNONYMS_SETS and put unnecessary memory pressure on the test cluster
SynonymsManagementAPIService(Client client, int maxSynonymsSets) {
Instead of testing with a lot of synonyms, which made tests slow and consumed a lot of heap, I decided to artificially lower the limit in tests.
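The pattern described here - a production constructor that pins the real limit plus a package-private overload that lets tests inject a small one - looks roughly like this (illustrative names, not the actual Elasticsearch class):

```java
/**
 * Hypothetical sketch of the test-only constructor pattern: production
 * code always uses the default limit, while tests can lower it so
 * limit behaviour is exercised without creating 100k rules.
 */
class SynonymsService {
    static final int DEFAULT_MAX_SYNONYMS_SETS = 100_000;

    private final int maxSynonymsSets;

    // Production entry point: always uses the default limit.
    SynonymsService() {
        this(DEFAULT_MAX_SYNONYMS_SETS);
    }

    // Package-private, for tests: inject a small limit to keep tests
    // fast and light on heap.
    SynonymsService(int maxSynonymsSets) {
        this.maxSynonymsSets = maxSynonymsSets;
    }

    int maxSynonymsSets() {
        return maxSynonymsSets;
    }
}
```

Keeping the overload package-private confines the relaxed limit to the test sources in the same package.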
Pinging @elastic/es-search (Team:Search)
I've marked this ready for review after dealing with some testing issues, which AFAICT were caused by the sheer number of synonyms being tested. I know @jimczi was keen on adding a cluster setting for limiting the max number of synonyms if needed. I can add that as a separate PR if this one looks good.
Hi @carlosdelest, I've created a changelog YAML for you.
Looking really good! Just one comment/discussion point.
The approach doesn't seem very robust and could suffer from the same issue we're seeing today (silent ignoring of indexed synonym rules). I am also concerned that a synonym set at capacity will be difficult to manage.
I am in favour of a more robust approach here even if that increases the complexity.
Will address API limits first in this PR: #109981. Then I'll address enforcing the correct order via the master node and raising the synonyms limit.
Pinging @elastic/es-search-foundations (Team:Search Foundations)
Hi @carlosdelest, I've updated the changelog YAML for you.
Closes #108785
Updates the maximum number of synonyms to 100,000, versus the current limit of 10,000.
Also includes checks at the API level to avoid creating synonyms sets with more rules than the maximum.