Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make await security migrations more robust #109854

Merged
merged 30 commits into from
Jun 27, 2024

Conversation

jfreden
Copy link
Contributor

@jfreden jfreden commented Jun 18, 2024

This is a potential fix for #109845, #109894, #110015 and #109538. I think that there is a race condition where the persistent task that's responsible for the security migration hasn't started yet when awaitSecurityMigration is executed, but right before the teardown executes, it starts and therefore the search context is open.

This takes a different approach, that makes sure that the latest migration version has been written to cluster state if the security index exists. Since the test depends on the security index, it must have been created when the teardown happens and therefore it makes it more robust.

@jfreden jfreden marked this pull request as ready for review June 18, 2024 12:56
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Jun 18, 2024
@jfreden jfreden added >non-issue :Security/Security Security issues without another label >test Issues or PRs that are addressing/adding tests and removed needs:triage Requires assignment of a team area label labels Jun 18, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Security Meta label for security team label Jun 18, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-security (Team:Security)

@jfreden
Copy link
Contributor Author

jfreden commented Jun 19, 2024

Looks like this was hiding a real issue. The lang-painless module isn't available in some tests when they run in a single classloader and can't be added due to a transient dependency versioning conflict. This results in: java.lang.IllegalArgumentException: script_lang not supported [painless]. Luckily this is only a test issue, since painless is always available in the default distribution.

To fix this there are some options:

  1. Make sure the security migrations do not run for these tests.
  2. Remove the dependency on painless.
  3. Catch script_lang not supported [painless] and ignore it in the migration code.

@jfreden
Copy link
Contributor Author

jfreden commented Jun 19, 2024

The failing CI is a known issue. #109903

Comment on lines 99 to 101
if (exception instanceof IllegalArgumentException
&& exception.getMessage() != null
&& exception.getMessage().contains("script_lang not supported [painless]")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? The update by query with the script should be executed only when there are really roles that need migrating, no? Are there such cases, also with the painless plugin not installed?

In any case, this doesn't look like the right approach to me. It's adding main code for test purposes, and it's not clear what the state of the migration is in actuality when this is encountered.
I think we should either properly handle the case where the migration is not applicable (because the painless module is missing) or simply install the painless plugin where necessary in the ITs encountering it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? The update by query with the script should be executed only when there are really roles that need migrating, no? Are there such cases, also with the painless plugin not installed?

Because these tests create roles before/in parallel with the migration and therefore the search for roles to migrate finds them. The only indicator we have that a role needs to be migrated or not is the metadata field, but since it's not indexed we can't run exists on it (so we can't improve the search query as far as I can see).

In any case, this doesn't look like the right approach to me. It's adding main code for test purposes, and it's not clear what the state of the migration is in actuality when this is encountered.

Yes, this only happens for these tests, but I agree, it's not ideal at all.

or simply install the painless plugin where necessary in the ITs encountering it.

We can't per my understanding. This happens because of the same issue you and Ryan are discussing here: https://elastic.slack.com/archives/C8UUBNASY/p1702482296432399?thread_ts=1701962741.577089&cid=C8UUBNASY

The options I've been working on so far:

  1. Make sure the security migrations do not run for these tests (write to cluster state in @Before, not great).
  2. Remove the dependency on painless (a custom script service, feels like that's overkill).
  3. Catch script_lang not supported [painless] and ignore it in the migration code (current implementation).

Another option might be to try to figure out if an index manager state change is the result of the index being created, but couldn't get that reliable before.

Copy link
Contributor

@albertzaharovits albertzaharovits Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because these tests create roles before/in parallel with the migration and therefore the search for roles to migrate finds them. The only indicator we have that a role needs to be migrated or not is the metadata field, but since it's not indexed we can't run exists on it (so we can't improve the search query as far as I can see).

I see, thanks for explaining.
Can we wait for the migration in a @BeforeClass method of the test suite class? Can we add a setting to disable migrations altogether (for test suites that we know migrations will be noops anyways)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can investigate that. The problem is that the .security index is created by the role creation that's part of the test. The creation of the index in turn triggers the migration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up adding code to detect if the index is new or not, if it's new don't apply any migrations. This should cover all the test cases we currently have.

Copy link
Contributor

@albertzaharovits albertzaharovits left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling of the missing painless script plugin doesn't look right to me.

@albertzaharovits
Copy link
Contributor

This test #109905 failure is also caused by this one.
I think it's worth going through the recently raised CI failures and take a quick look if this fix addresses. If that's the case take care to also unmute the test here.

@@ -788,8 +789,8 @@ Collection<Object> createComponents(
this.persistentTasksService.set(persistentTasksService);

systemIndices.getMainIndexManager().addStateListener((oldState, newState) -> {
if (clusterService.state().nodes().isLocalNodeElectedMaster()) {
applyPendingSecurityMigrations(newState);
if (clusterService.state().nodes().isLocalNodeElectedMaster() && oldState != UNRECOVERED_STATE) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This piece is what was missing when I previously tried to get this working.

The addition of this guarantees that we can now trust both the old and new state and therefore use it to determine if the security index was just created or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this is needed. I think this amounts to skipping the very first notification for the security index state update (which has the oldState in its default value), but why?

if (oldState.creationTime == null) {
// Bypass migrations for when the security index is new
submitPersistentMigrationTask(SecurityMigrations.MIGRATIONS_BY_VERSION.lastKey(), false);
return;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the index is new, we tell the migration executor to not apply any of the migrations, just update cluster state with the latest version since there is no data to migrate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the .security index does not exist AND is also not created soon? What would this update do in that case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why cannot we let a regular migration be a noop when the index is just created and instead we create this new code path that we have to test and maintain?

@jfreden
Copy link
Contributor Author

jfreden commented Jun 24, 2024

The failing test is a known issue: #109890

Copy link
Contributor

@albertzaharovits albertzaharovits left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except the change that considers the migration completed if the .security index has the latest version metadata (if the index exists), I don't follow the reasoning behind the other changes.
I think I'll need more context.

Comment on lines 96 to 106
IndexMetadata indexMetadata = state.metadata().index(TestRestrictedIndices.INTERNAL_SECURITY_MAIN_INDEX_7);
if (indexMetadata == null) {
// If the security index doesn't exist, no migrations to apply
return true;
}
Map<String, String> customMetadata = indexMetadata.getCustomData(MIGRATION_VERSION_CUSTOM_KEY);
if (customMetadata == null) {
return false;
}
String version = customMetadata.get(MIGRATION_VERSION_CUSTOM_DATA_KEY);
return Integer.parseInt(version) == SecurityMigrations.MIGRATIONS_BY_VERSION.lastKey();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this a static method in SecurityIndexManager?

if (oldState.creationTime == null) {
// Bypass migrations for when the security index is new
submitPersistentMigrationTask(SecurityMigrations.MIGRATIONS_BY_VERSION.lastKey(), false);
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the .security index does not exist AND is also not created soon? What would this update do in that case?

if (oldState.creationTime == null) {
// Bypass migrations for when the security index is new
submitPersistentMigrationTask(SecurityMigrations.MIGRATIONS_BY_VERSION.lastKey(), false);
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why cannot we let a regular migration be a noop when the index is just created and instead we create this new code path that we have to test and maintain?

@@ -788,8 +789,8 @@ Collection<Object> createComponents(
this.persistentTasksService.set(persistentTasksService);

systemIndices.getMainIndexManager().addStateListener((oldState, newState) -> {
if (clusterService.state().nodes().isLocalNodeElectedMaster()) {
applyPendingSecurityMigrations(newState);
if (clusterService.state().nodes().isLocalNodeElectedMaster() && oldState != UNRECOVERED_STATE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this is needed. I think this amounts to skipping the very first notification for the security index state update (which has the oldState in its default value), but why?

@jfreden
Copy link
Contributor Author

jfreden commented Jun 25, 2024

@albertzaharovits added the code we talked about offline to make this more clear. We now have the condition:

  • If we went from an old state where the index didn't exist to a state where the index exists and the old state was recovered (valid), we can assume it's a newly created index.
  • If the old state was not recovered and the index exists in the new state we can assume the recovered state contained the index, so it's not new.
  • If the old state is unrecovered and the new state doesn't have the index (creationTime == null), we don't want to migrate anyway (since index metadata doesn't exist), so do nothing.

Pending response from the question to the distributed team. Even if this is not always exactly true, if it works in the test cases it might be enough. We should then add a comment explaining that.

@jfreden
Copy link
Contributor Author

jfreden commented Jun 26, 2024

To add some more context around the work in this PR:

We're trying to avoid running the migration code for some scenarios in test, since it's not needed (nothing to migrate), it's not what we're testing (we don't need to run the migration for all tests that create the security index) and it might break because the test environment doesn't reflect what will actually run in production (we don't have painless in some test environments).

To prevent the migration from running when not needed, this PR adds code to check if an index is new. Checking if an index is new is a little challenging and there is (at least) one scenario where it won't work.

  1. The security index is created.
  2. The new index is spread to all other nodes in cluster state and written to disk or s3 if serverless.
  3. One of the nodes crashes.
  4. The node recovers and reads the newly created index from disk, so the full migration would be triggered since the transition STATE_NOT_RECOVERED -> .security::creationTime != null happens (the index is considered "already existing" when it's actually new).

For our purposes this is acceptable. In the tests we won't have this scenario and in production it doesn't matter if we run the migration for a new security index, even though it's preferred not to.

@jfreden
Copy link
Contributor Author

jfreden commented Jun 26, 2024

I've added code to:

  1. Bump the current index version. This means that when a new security index is created it will be created with a new version.
  2. In the security index manager I get the index version, if it is >= latest version (I don't explicitly check that it's my version, just that it's the latest) I assume it's a new index that doesn't need a migration.

In the future this could be improved by associating certain migrations with index versions instead of using the node features, however that's out of scope for this PR where we're just trying to avoid running the migrations on a new security index.

@albertzaharovits albertzaharovits self-requested a review June 26, 2024 12:57
Copy link
Contributor

@albertzaharovits albertzaharovits left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few more comments.

Comment on lines 1217 to 1218
if (newState.migrationsVersion == 0
&& isMigrationNeededForIndexVersion(newState.indexVersionCreated, maxDataNodeCompatibleIndexVersion)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be cleaner if we'd move this check from here (with clusterService.state().nodes().getMaxDataNodeCompatibleIndexVersion()) inside the SecurityIndexManager.
I mean that the SecurityIndexManager#State would expose a method that says if migrations are required by: indexExists() && migrationVersion == 0 && indexVersionCreated.onOrAfter(event.state().nodes().getMaxDataNodeCompatibleIndexVersion().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I instead moved everything into a boolean in the state. I think that fits better with how the rest of the security index manager works. Let me know what you think.

// Check if next migration that has not been applied is eligible to run on the current cluster
if (systemIndices.getMainIndexManager().isEligibleSecurityMigration(nextMigration.getValue()) == false) {
if (nextMigration == null || systemIndices.getMainIndexManager().isEligibleSecurityMigration(nextMigration.getValue()) == false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Ideally, we should be worried that this has not been covered by tests.
Exceptionally, I think we can merge without, but I'll raise a GH issue for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. Would be nice to try to come up with a way to test that.

}

@Override
public void writeTo(StreamOutput out) throws IOException {
out.writeInt(migrationVersion);
out.writeBoolean(migrationNeeded);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change technically requires adding a new TransportVersion, and guarding the serialization and deserialization, to cover for the case where the task is started to a node that doesn't know about this new task parameter, see TransformTaskParams as an example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I've added a new transport version + check in the serialization/deserialization of the transport payload. Thanks!

);

static {
PARSER.declareInt(constructorArg(), new ParseField("migration_version"));
PARSER.declareBoolean(constructorArg(), new ParseField("migration_needed"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be an optionalConstructorArg, work with Boolean (and default it to true).
This is a thing that goes into the cluster state so it could be that we deserialize and serialize from different versions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I've updated this to be an optional + added a null check. Thanks!

}));
});

if (params.isMigrationNeeded() == false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also please log that the migration version has been bumped to latest without running them really?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, added a log message. Thanks!

@jfreden
Copy link
Contributor Author

jfreden commented Jun 27, 2024

CI failure is a know issue: #106426

Copy link
Contributor

@albertzaharovits albertzaharovits left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>non-issue :Security/Security Security issues without another label Team:Security Meta label for security team >test Issues or PRs that are addressing/adding tests v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants