Make await security migrations more robust #109854

jfreden · 2024-06-18T12:45:00Z

This is a potential fix for #109845, #109894, #110015 and #109538. I think that there is a race condition where the persistent task that's responsible for the security migration hasn't started yet when awaitSecurityMigration is executed, but right before the teardown executes, it starts and therefore the search context is open.

This PR takes a different approach: it makes sure that the latest migration version has been written to cluster state if the security index exists. Since the tests depend on the security index, the index must have been created by the time the teardown happens, which makes the wait more robust.

elasticsearchmachine · 2024-06-18T12:57:47Z

Pinging @elastic/es-security (Team:Security)

jfreden · 2024-06-19T08:14:11Z

Looks like this was hiding a real issue. The lang-painless module isn't available in some tests when they run in a single classloader and can't be added due to a transitive dependency versioning conflict. This results in: java.lang.IllegalArgumentException: script_lang not supported [painless]. Luckily this is only a test issue, since painless is always available in the default distribution.

To fix this there are some options:

Make sure the security migrations do not run for these tests.
Remove the dependency on painless.
Catch script_lang not supported [painless] and ignore it in the migration code.

jfreden · 2024-06-19T10:30:24Z

The failing CI is a known issue. #109903

albertzaharovits · 2024-06-19T12:25:55Z

...ugin/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityMigrations.java

+ if (exception instanceof IllegalArgumentException
+ && exception.getMessage() != null
+ && exception.getMessage().contains("script_lang not supported [painless]")) {


Why do we need this? The update by query with the script should be executed only when there are really roles that need migrating, no? Are there such cases, also with the painless plugin not installed?

In any case, this doesn't look like the right approach to me. It's adding main code for test purposes, and it's not clear what the state of the migration is in actuality when this is encountered.
I think we should either properly handle the case where the migration is not applicable (because the painless module is missing) or simply install the painless plugin where necessary in the ITs encountering it.

> Why do we need this? The update by query with the script should be executed only when there are really roles that need migrating, no? Are there such cases, also with the painless plugin not installed?

Because these tests create roles before/in parallel with the migration and therefore the search for roles to migrate finds them. The only indicator we have that a role needs to be migrated or not is the metadata field, but since it's not indexed we can't run exists on it (so we can't improve the search query as far as I can see).

> In any case, this doesn't look like the right approach to me. It's adding main code for test purposes, and it's not clear what the state of the migration is in actuality when this is encountered.

Yes, this only happens for these tests, but I agree, it's not ideal at all.

> or simply install the painless plugin where necessary in the ITs encountering it.

We can't per my understanding. This happens because of the same issue you and Ryan are discussing here: https://elastic.slack.com/archives/C8UUBNASY/p1702482296432399?thread_ts=1701962741.577089&cid=C8UUBNASY

The options I've been working on so far:

Make sure the security migrations do not run for these tests (write to cluster state in @Before, not great).

Remove the dependency on painless (a custom script service, feels like that's overkill).

Catch script_lang not supported [painless] and ignore it in the migration code (current implementation).

Another option might be to try to figure out if an index manager state change is the result of the index being created, but I couldn't get that to work reliably before.

> Because these tests create roles before/in parallel with the migration and therefore the search for roles to migrate finds them. The only indicator we have that a role needs to be migrated or not is the metadata field, but since it's not indexed we can't run exists on it (so we can't improve the search query as far as I can see).

I see, thanks for explaining.
Can we wait for the migration in a @BeforeClass method of the test suite class? Can we add a setting to disable migrations altogether (for test suites that we know migrations will be noops anyways)?

Yes, I can investigate that. The problem is that the .security index is created by the role creation that's part of the test. The creation of the index in turn triggers the migration.

I ended up adding code to detect whether the index is new; if it is, no migrations are applied. This should cover all the test cases we currently have.

albertzaharovits

Handling of the missing painless script plugin doesn't look right to me.

albertzaharovits · 2024-06-20T07:02:21Z

The test failure in #109905 is also caused by this one.
I think it's worth going through the recently raised CI failures and taking a quick look at whether this fix addresses them. If that's the case, take care to also unmute the tests here.

jfreden · 2024-06-24T07:38:25Z

x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/Security.java

@@ -788,8 +789,8 @@ Collection<Object> createComponents(
 this.persistentTasksService.set(persistentTasksService);

 systemIndices.getMainIndexManager().addStateListener((oldState, newState) -> {
- if (clusterService.state().nodes().isLocalNodeElectedMaster()) {
- applyPendingSecurityMigrations(newState);
+ if (clusterService.state().nodes().isLocalNodeElectedMaster() && oldState != UNRECOVERED_STATE) {


This piece is what was missing when I previously tried to get this working.

The addition of this guarantees that we can now trust both the old and new state and therefore use it to determine if the security index was just created or not.

jfreden · 2024-06-24T07:39:25Z

x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/Security.java

+ if (oldState.creationTime == null) {
+ // Bypass migrations for when the security index is new
+ submitPersistentMigrationTask(SecurityMigrations.MIGRATIONS_BY_VERSION.lastKey(), false);
+ return;


When the index is new, we tell the migration executor to not apply any of the migrations, just update cluster state with the latest version since there is no data to migrate.

jfreden · 2024-06-24T07:42:39Z

The failing test is a known issue: #109890

albertzaharovits

Except the change that considers the migration completed if the .security index has the latest version metadata (if the index exists), I don't follow the reasoning behind the other changes.
I think I'll need more context.

albertzaharovits · 2024-06-25T10:32:27Z

...security/src/internalClusterTest/java/org/elasticsearch/test/SecuritySingleNodeTestCase.java

+ IndexMetadata indexMetadata = state.metadata().index(TestRestrictedIndices.INTERNAL_SECURITY_MAIN_INDEX_7);
+ if (indexMetadata == null) {
+ // If the security index doesn't exist, no migrations to apply
+ return true;
+ }
+ Map<String, String> customMetadata = indexMetadata.getCustomData(MIGRATION_VERSION_CUSTOM_KEY);
+ if (customMetadata == null) {
+ return false;
+ }
+ String version = customMetadata.get(MIGRATION_VERSION_CUSTOM_DATA_KEY);
+ return Integer.parseInt(version) == SecurityMigrations.MIGRATIONS_BY_VERSION.lastKey();


Can you make this a static method in SecurityIndexManager?
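
For illustration, the check in the hunk above could be factored into a static helper along these lines. This is only a sketch: `MigrationVersionCheck`, `isUpToDate`, and its parameters are hypothetical names, plain `Map` stands in for the index's custom metadata, and the extra null guard on the version string is an assumption that is not in the hunk.

```java
import java.util.Map;

// Hypothetical helper; names are illustrative, not the real SecurityIndexManager API.
final class MigrationVersionCheck {
    private MigrationVersionCheck() {}

    /**
     * @param indexExists    whether the security index is present in cluster state
     * @param customMetadata migration metadata stored on the index, or null if absent
     * @param versionKey     key under which the applied migration version is stored
     * @param latestVersion  highest known migration version
     * @return true if there is nothing left to wait for
     */
    static boolean isUpToDate(boolean indexExists, Map<String, String> customMetadata, String versionKey, int latestVersion) {
        if (indexExists == false) {
            return true; // no security index, so no migrations to apply
        }
        if (customMetadata == null) {
            return false; // index exists but no migration version has been written yet
        }
        String version = customMetadata.get(versionKey);
        if (version == null) {
            return false; // assumption: treat a missing entry as "not migrated" instead of failing in parseInt
        }
        return Integer.parseInt(version) == latestVersion;
    }
}
```

A test base class could then poll `isUpToDate(...)` in its await loop instead of carrying its own copy of the logic.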

albertzaharovits · 2024-06-25T11:02:44Z

x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/Security.java

+ if (oldState.creationTime == null) {
+ // Bypass migrations for when the security index is new
+ submitPersistentMigrationTask(SecurityMigrations.MIGRATIONS_BY_VERSION.lastKey(), false);
+ return;


What if the .security index does not exist AND is also not created soon? What would this update do in that case?

albertzaharovits · 2024-06-25T11:04:04Z

x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/Security.java

+ if (oldState.creationTime == null) {
+ // Bypass migrations for when the security index is new
+ submitPersistentMigrationTask(SecurityMigrations.MIGRATIONS_BY_VERSION.lastKey(), false);
+ return;


Why cannot we let a regular migration be a noop when the index is just created and instead we create this new code path that we have to test and maintain?

albertzaharovits · 2024-06-25T11:08:49Z

x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/Security.java

@@ -788,8 +789,8 @@ Collection<Object> createComponents(
 this.persistentTasksService.set(persistentTasksService);

 systemIndices.getMainIndexManager().addStateListener((oldState, newState) -> {
- if (clusterService.state().nodes().isLocalNodeElectedMaster()) {
- applyPendingSecurityMigrations(newState);
+ if (clusterService.state().nodes().isLocalNodeElectedMaster() && oldState != UNRECOVERED_STATE) {


I don't understand why this is needed. I think this amounts to skipping the very first notification for the security index state update (which has the oldState in its default value), but why?

jfreden · 2024-06-25T13:36:07Z

@albertzaharovits added the code we talked about offline to make this more clear. We now have the condition:

If we went from an old state where the index didn't exist to a state where the index exists and the old state was recovered (valid), we can assume it's a newly created index.
If the old state was not recovered and the index exists in the new state we can assume the recovered state contained the index, so it's not new.
If the old state is unrecovered and the new state doesn't have the index (creationTime == null), we don't want to migrate anyway (since index metadata doesn't exist), so do nothing.

Pending response from the question to the distributed team. Even if this is not always exactly true, if it works in the test cases it might be enough. We should then add a comment explaining that.
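
The three conditions above can be sketched as a small decision function. This is a simplified model, not the code in this PR: `IndexState`, `Decision`, and `decide` are hypothetical stand-ins, `recovered == false` models `UNRECOVERED_STATE`, and the handling of a recovered old state where the index already existed (apply pending migrations as before) is an assumption.

```java
import java.time.Instant;

// Simplified stand-ins; these are NOT the real SecurityIndexManager types.
final class NewIndexCheck {
    private NewIndexCheck() {}

    static final class IndexState {
        final boolean recovered;    // false models UNRECOVERED_STATE
        final Instant creationTime; // null when the index does not exist

        IndexState(boolean recovered, Instant creationTime) {
            this.recovered = recovered;
            this.creationTime = creationTime;
        }
    }

    enum Decision { BYPASS_NEW_INDEX, APPLY_PENDING, DO_NOTHING }

    static Decision decide(IndexState oldState, IndexState newState) {
        if (newState.creationTime == null) {
            // No index metadata in the new state: nothing to migrate.
            return Decision.DO_NOTHING;
        }
        if (oldState.recovered && oldState.creationTime == null) {
            // Valid old state without the index, new state with it: newly created index.
            return Decision.BYPASS_NEW_INDEX;
        }
        // Old state unrecovered (index came from recovered cluster state) or index already existed.
        return Decision.APPLY_PENDING;
    }
}
```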

jfreden · 2024-06-26T07:12:28Z

To add some more context around the work in this PR:

We're trying to avoid running the migration code for some scenarios in test, since it's not needed (nothing to migrate), it's not what we're testing (we don't need to run the migration for all tests that create the security index) and it might break because the test environment doesn't reflect what will actually run in production (we don't have painless in some test environments).

To prevent the migration from running when not needed, this PR adds code to check if an index is new. Checking if an index is new is a little challenging and there is (at least) one scenario where it won't work.

The security index is created.
The new index is spread to all other nodes in cluster state and written to disk (or S3 if serverless).
One of the nodes crashes.
The node recovers and reads the newly created index from disk, so the full migration would be triggered since the transition STATE_NOT_RECOVERED -> .security::creationTime != null happens (the index is considered "already existing" when it's actually new).

For our purposes this is acceptable. In the tests we won't have this scenario and in production it doesn't matter if we run the migration for a new security index, even though it's preferred not to.

jfreden · 2024-06-26T11:46:46Z

I've added code to:

Bump the current index version. This means that when a new security index is created it will be created with a new version.
In the security index manager I get the index version; if it is >= the latest version (I don't explicitly check that it's my version, just that it's the latest), I assume it's a new index that doesn't need a migration.

In the future this could be improved by associating certain migrations with index versions instead of using the node features, however that's out of scope for this PR where we're just trying to avoid running the migrations on a new security index.
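
A minimal sketch of the version comparison described above, assuming index versions can be compared as plain integers. `IndexVersionCheck` and `isAssumedNew` are hypothetical names, and the real code works with Elasticsearch's own index version type rather than `Integer`.

```java
// Hypothetical illustration; not the actual SecurityIndexManager code.
final class IndexVersionCheck {
    private IndexVersionCheck() {}

    /**
     * @param indexVersionCreated version the index was created with, or null if the index does not exist
     * @param latestIndexVersion  latest index version (hypothetical int encoding)
     * @return true if the index is assumed to be new, so that migrations can be bypassed
     */
    static boolean isAssumedNew(Integer indexVersionCreated, int latestIndexVersion) {
        // Only "at least the latest" is checked, not equality with one specific version.
        return indexVersionCreated != null && indexVersionCreated >= latestIndexVersion;
    }
}
```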

albertzaharovits

I left a few more comments.

albertzaharovits · 2024-06-27T06:29:25Z

x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/Security.java

+ if (newState.migrationsVersion == 0
+ && isMigrationNeededForIndexVersion(newState.indexVersionCreated, maxDataNodeCompatibleIndexVersion)) {


I think it would be cleaner if we'd move this check from here (with clusterService.state().nodes().getMaxDataNodeCompatibleIndexVersion()) inside the SecurityIndexManager.
I mean that the SecurityIndexManager#State would expose a method that says if migrations are required by: indexExists() && migrationVersion == 0 && indexVersionCreated.onOrAfter(event.state().nodes().getMaxDataNodeCompatibleIndexVersion()).

I instead moved everything into a boolean in the state. I think that fits better with how the rest of the security index manager works. Let me know what you think.

albertzaharovits · 2024-06-27T06:32:04Z

x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/Security.java

 // Check if next migration that has not been applied is eligible to run on the current cluster
- if (systemIndices.getMainIndexManager().isEligibleSecurityMigration(nextMigration.getValue()) == false) {
+ if (nextMigration == null || systemIndices.getMainIndexManager().isEligibleSecurityMigration(nextMigration.getValue()) == false) {


👍 Ideally, we should be worried that this has not been covered by tests.
Exceptionally, I think we can merge without, but I'll raise a GH issue for it.

Yes, I agree. Would be nice to try to come up with a way to test that.

albertzaharovits · 2024-06-27T06:44:01Z

...src/main/java/org/elasticsearch/xpack/core/security/support/SecurityMigrationTaskParams.java

 }

 @Override
 public void writeTo(StreamOutput out) throws IOException {
 out.writeInt(migrationVersion);
+ out.writeBoolean(migrationNeeded);


This change technically requires adding a new TransportVersion and guarding the serialization and deserialization, to cover the case where the task is started on a node that doesn't know about this new task parameter; see TransformTaskParams as an example.

Good catch! I've added a new transport version + check in the serialization/deserialization of the transport payload. Thanks!
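
To illustrate the kind of guard discussed here, the sketch below uses plain `java.io` streams and an integer wire version in place of Elasticsearch's `StreamOutput`, `StreamInput`, and `TransportVersion`. The class name, the constant `MIGRATION_NEEDED_ADDED`, and the `roundTrip` helper are all hypothetical; defaulting to `true` when the field is absent is an assumption based on the review comments.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical model of version-guarded serialization; not the Elasticsearch API.
final class VersionedParams {
    // Hypothetical wire version that introduced the migrationNeeded field.
    static final int MIGRATION_NEEDED_ADDED = 2;

    final int migrationVersion;
    final boolean migrationNeeded;

    VersionedParams(int migrationVersion, boolean migrationNeeded) {
        this.migrationVersion = migrationVersion;
        this.migrationNeeded = migrationNeeded;
    }

    void writeTo(DataOutputStream out, int peerVersion) throws IOException {
        out.writeInt(migrationVersion);
        if (peerVersion >= MIGRATION_NEEDED_ADDED) {
            out.writeBoolean(migrationNeeded); // older peers would not know how to read this
        }
    }

    static VersionedParams readFrom(DataInputStream in, int peerVersion) throws IOException {
        int migrationVersion = in.readInt();
        // Assumption: an older peer never sent the flag, so fall back to "migration needed".
        boolean migrationNeeded = peerVersion >= MIGRATION_NEEDED_ADDED ? in.readBoolean() : true;
        return new VersionedParams(migrationVersion, migrationNeeded);
    }

    // Serializes and deserializes with the same peer version, for demonstration only.
    static VersionedParams roundTrip(VersionedParams params, int peerVersion) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                params.writeTo(out, peerVersion);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return readFrom(in, peerVersion);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With an old peer version the flag is neither written nor read, so both sides agree on the byte layout and the reader falls back to the default.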

albertzaharovits · 2024-06-27T07:01:03Z

...src/main/java/org/elasticsearch/xpack/core/security/support/SecurityMigrationTaskParams.java

 );

 static {
 PARSER.declareInt(constructorArg(), new ParseField("migration_version"));
+ PARSER.declareBoolean(constructorArg(), new ParseField("migration_needed"));


I think this needs to be an optionalConstructorArg that works with Boolean (and defaults to true).
This is a thing that goes into the cluster state so it could be that we deserialize and serialize from different versions.

Good catch! I've updated this to be an optional + added a null check. Thanks!
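
A minimal sketch of the null-defaulting idea, assuming the parser hands the constructor a nullable `Boolean` when the field is missing. `TaskParamsDefaults` is a hypothetical class, not the actual SecurityMigrationTaskParams.

```java
// Hypothetical illustration of an optional boolean constructor argument.
final class TaskParamsDefaults {
    final int migrationVersion;
    final boolean migrationNeeded;

    TaskParamsDefaults(int migrationVersion, Boolean migrationNeeded) {
        this.migrationVersion = migrationVersion;
        // null means the field was absent (e.g. written by an older version): default to true.
        this.migrationNeeded = migrationNeeded == null || migrationNeeded;
    }
}
```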

albertzaharovits · 2024-06-27T07:02:46Z

...curity/src/main/java/org/elasticsearch/xpack/security/support/SecurityMigrationExecutor.java

- }));
+ });
+
+ if (params.isMigrationNeeded() == false) {


Can you also please log that the migration version has been bumped to the latest without actually running the migrations?

Yes, added a log message. Thanks!

jfreden · 2024-06-27T09:04:49Z

CI failure is a known issue: #106426

albertzaharovits

LGTM

Make await security migrations more robust

e648c71

elasticsearchmachine added the v8.15.0 label Jun 18, 2024

jfreden marked this pull request as ready for review June 18, 2024 12:56

jfreden requested a review from albertzaharovits June 18, 2024 12:56

elasticsearchmachine added the needs:triage Requires assignment of a team area label label Jun 18, 2024

jfreden added >non-issue :Security/Security Security issues without another label >test Issues or PRs that are addressing/adding tests and removed needs:triage Requires assignment of a team area label labels Jun 18, 2024

elasticsearchmachine added the Team:Security Meta label for security team label Jun 18, 2024

jfreden added 2 commits June 18, 2024 15:02

fixup! Unit test

8eae71a

Merge remote-tracking branch 'upstream/main' into make_await_sec_migration_robust

4d23d4a

jfreden added 3 commits June 19, 2024 10:29

fixup! Add hacky fix

0310772

fixup! Condition

647fb4e

fixup! Condition

73ba66f

fixup! add else

f5223bd

albertzaharovits reviewed Jun 19, 2024

jfreden added 5 commits June 20, 2024 11:05

Check if the security index is new and skip migrations if it is

366db26

fixup! CI

1eb8ec9

Merge remote-tracking branch 'upstream/main' into make_await_sec_migration_robust

f313d2f

Merge remote-tracking branch 'upstream/main' into make_await_sec_migration_robust

17b3a43

Merge remote-tracking branch 'upstream/main' into make_await_sec_migration_robust

bc01015

jfreden commented Jun 24, 2024

jfreden added 5 commits June 24, 2024 10:19

fixup! Param in exception test

d98ce27

Unmute another test

4ea4c36

nodeLocalMigrationRetryCount should always be reset when migrations are applied

0673357

Merge remote-tracking branch 'upstream/main' into make_await_sec_migration_robust

de86da5

fixup! Mute

1caa2ed

albertzaharovits reviewed Jun 25, 2024

jfreden added 2 commits June 25, 2024 15:17

Update bypass migration condition

9ff351d

fixup! Code review comment

d86500f

fixup! Test bug

e047dd2

Improve new index check

c87948f

jfreden added 4 commits June 26, 2024 14:01

Merge remote-tracking branch 'upstream/main' into make_await_sec_migration_robust

9474c85

update muted-tests.yml

92acdaa

Merge remote-tracking branch 'upstream/main' into make_await_sec_migration_robust

ca0c97c

fixup! muted-tests again

9804226

albertzaharovits self-requested a review June 26, 2024 12:57

albertzaharovits reviewed Jun 27, 2024

jfreden added 2 commits June 27, 2024 09:46

fixup! Code review - move check to state

0edcc32

fixup! Add transport version check and log message

d658896

jfreden requested a review from albertzaharovits June 27, 2024 08:05

Merge remote-tracking branch 'upstream/main' into make_await_sec_migration_robust

a5b10af

albertzaharovits approved these changes Jun 27, 2024

jfreden merged commit 10ad8a6 into elastic:main Jun 27, 2024
20 checks passed
