-
Notifications
You must be signed in to change notification settings - Fork 24.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make await security migrations more robust #109854
Make await security migrations more robust #109854
Conversation
Pinging @elastic/es-security (Team:Security) |
Looks like this was hiding a real issue. The To fix this there are some options:
|
The failing CI is a known issue. #109903 |
if (exception instanceof IllegalArgumentException | ||
&& exception.getMessage() != null | ||
&& exception.getMessage().contains("script_lang not supported [painless]")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this? The update by query with the script should be executed only when there are really roles that need migrating, no? Are there such cases, also with the painless plugin not installed?
In any case, this doesn't look like the right approach to me. It's adding main code for test purposes, and it's not clear what the state of the migration is in actuality when this is encountered.
I think we should either properly handle the case where the migration is not applicable (because the painless module is missing) or simply install the painless plugin where necessary in the ITs encountering it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this? The update by query with the script should be executed only when there are really roles that need migrating, no? Are there such cases, also with the painless plugin not installed?
Because these tests create roles before/in parallel with the migration and therefore the search for roles to migrate finds them. The only indicator we have that a role needs to be migrated or not is the metadata
field, but since it's not indexed we can't run exists
on it (so we can't improve the search query as far as I can see).
In any case, this doesn't look like the right approach to me. It's adding main code for test purposes, and it's not clear what the state of the migration is in actuality when this is encountered.
Yes, this only happens for these tests, but I agree, it's not ideal at all.
or simply install the painless plugin where necessary in the ITs encountering it.
We can't per my understanding. This happens because of the same issue you and Ryan are discussing here: https://elastic.slack.com/archives/C8UUBNASY/p1702482296432399?thread_ts=1701962741.577089&cid=C8UUBNASY
The options I've been working on so far:
- Make sure the security migrations do not run for these tests (write to cluster state in
@Before
, not great). - Remove the dependency on painless (a custom script service, feels like that's overkill).
- Catch script_lang not supported [painless] and ignore it in the migration code (current implementation).
Another option might be to try to figure out if an index manager state change is the result of the index being created, but couldn't get that reliable before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because these tests create roles before/in parallel with the migration and therefore the search for roles to migrate finds them. The only indicator we have that a role needs to be migrated or not is the metadata field, but since it's not indexed we can't run exists on it (so we can't improve the search query as far as I can see).
I see, thanks for explaining.
Can we wait for the migration in a @BeforeClass
method of the test suite class? Can we add a setting to disable migrations altogether (for test suites that we know migrations will be noops anyways)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can investigate that. The problem is that the .security
index is created by the role creation that's part of the test. The creation of the index in turn triggers the migration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up adding code to detect if the index is new or not, if it's new don't apply any migrations. This should cover all the test cases we currently have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handling of the missing painless script plugin doesn't look right to me.
This test #109905 failure is also caused by this one. |
@@ -788,8 +789,8 @@ Collection<Object> createComponents( | |||
this.persistentTasksService.set(persistentTasksService); | |||
|
|||
systemIndices.getMainIndexManager().addStateListener((oldState, newState) -> { | |||
if (clusterService.state().nodes().isLocalNodeElectedMaster()) { | |||
applyPendingSecurityMigrations(newState); | |||
if (clusterService.state().nodes().isLocalNodeElectedMaster() && oldState != UNRECOVERED_STATE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This piece is what was missing when I previously tried to get this working.
The addition of this guarantees that we can now trust both the old and new state and therefore use it to determine if the security index was just created or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why this is needed. I think this amounts to skipping the very first notification for the security index state update (which has the oldState in its default value), but why?
if (oldState.creationTime == null) { | ||
// Bypass migrations for when the security index is new | ||
submitPersistentMigrationTask(SecurityMigrations.MIGRATIONS_BY_VERSION.lastKey(), false); | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the index is new, we tell the migration executor to not apply any of the migrations, just update cluster state with the latest version since there is no data to migrate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the .security index does not exist AND is also not created soon? What would this update do in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why cannot we let a regular migration be a noop when the index is just created and instead we create this new code path that we have to test and maintain?
The failing test is a known issue: #109890 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except the change that considers the migration completed if the .security index has the latest version metadata (if the index exists), I don't follow the reasoning behind the other changes.
I think I'll need more context.
IndexMetadata indexMetadata = state.metadata().index(TestRestrictedIndices.INTERNAL_SECURITY_MAIN_INDEX_7); | ||
if (indexMetadata == null) { | ||
// If the security index doesn't exist, no migrations to apply | ||
return true; | ||
} | ||
Map<String, String> customMetadata = indexMetadata.getCustomData(MIGRATION_VERSION_CUSTOM_KEY); | ||
if (customMetadata == null) { | ||
return false; | ||
} | ||
String version = customMetadata.get(MIGRATION_VERSION_CUSTOM_DATA_KEY); | ||
return Integer.parseInt(version) == SecurityMigrations.MIGRATIONS_BY_VERSION.lastKey(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make this a static method in SecurityIndexManager
?
if (oldState.creationTime == null) { | ||
// Bypass migrations for when the security index is new | ||
submitPersistentMigrationTask(SecurityMigrations.MIGRATIONS_BY_VERSION.lastKey(), false); | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the .security index does not exist AND is also not created soon? What would this update do in that case?
if (oldState.creationTime == null) { | ||
// Bypass migrations for when the security index is new | ||
submitPersistentMigrationTask(SecurityMigrations.MIGRATIONS_BY_VERSION.lastKey(), false); | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why cannot we let a regular migration be a noop when the index is just created and instead we create this new code path that we have to test and maintain?
@@ -788,8 +789,8 @@ Collection<Object> createComponents( | |||
this.persistentTasksService.set(persistentTasksService); | |||
|
|||
systemIndices.getMainIndexManager().addStateListener((oldState, newState) -> { | |||
if (clusterService.state().nodes().isLocalNodeElectedMaster()) { | |||
applyPendingSecurityMigrations(newState); | |||
if (clusterService.state().nodes().isLocalNodeElectedMaster() && oldState != UNRECOVERED_STATE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why this is needed. I think this amounts to skipping the very first notification for the security index state update (which has the oldState in its default value), but why?
@albertzaharovits added the code we talked about offline to make this more clear. We now have the condition:
Pending response from the question to the distributed team. Even if this is not always exactly true, if it works in the test cases it might be enough. We should then add a comment explaining that. |
To add some more context around the work in this PR: We're trying to avoid running the migration code for some scenarios in test, since it's not needed (nothing to migrate), it's not what we're testing (we don't need to run the migration for all tests that create the security index) and it might break because the test environment doesn't reflect what will actually run in production (we don't have painless in some test environments). To prevent the migration from running when not needed, this PR adds code to check if an index is new. Checking if an index is new is a little challenging and there is (at least) one scenario where it won't work.
For our purposes this is acceptable. In the tests we won't have this scenario and in production it doesn't matter if we run the migration for a new security index, even though it's preferred not to. |
I've added code to:
In the future this could be improved by associating certain migrations with index versions instead of using the node features, however that's out of scope for this PR where we're just trying to avoid running the migrations on a new security index. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few more comments.
if (newState.migrationsVersion == 0 | ||
&& isMigrationNeededForIndexVersion(newState.indexVersionCreated, maxDataNodeCompatibleIndexVersion)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be cleaner if we'd move this check from here (with clusterService.state().nodes().getMaxDataNodeCompatibleIndexVersion()
) inside the SecurityIndexManager
.
I mean that the SecurityIndexManager#State
would expose a method that says if migrations are required by: indexExists() && migrationVersion == 0 && indexVersionCreated.onOrAfter(event.state().nodes().getMaxDataNodeCompatibleIndexVersion()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I instead moved everything into a boolean in the state. I think that fits better with how the rest of the security index manager works. Let me know what you think.
// Check if next migration that has not been applied is eligible to run on the current cluster | ||
if (systemIndices.getMainIndexManager().isEligibleSecurityMigration(nextMigration.getValue()) == false) { | ||
if (nextMigration == null || systemIndices.getMainIndexManager().isEligibleSecurityMigration(nextMigration.getValue()) == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Ideally, we should be worried that this has not been covered by tests.
Exceptionally, I think we can merge without, but I'll raise a GH issue for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree. Would be nice to try to come up with a way to test that.
} | ||
|
||
@Override | ||
public void writeTo(StreamOutput out) throws IOException { | ||
out.writeInt(migrationVersion); | ||
out.writeBoolean(migrationNeeded); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change technically requires adding a new TransportVersion
, and guarding the serialization and deserialization, to cover for the case where the task is started to a node that doesn't know about this new task parameter, see TransformTaskParams
as an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I've added a new transport version + check in the serialization/deserialization of the transport payload. Thanks!
); | ||
|
||
static { | ||
PARSER.declareInt(constructorArg(), new ParseField("migration_version")); | ||
PARSER.declareBoolean(constructorArg(), new ParseField("migration_needed")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be an optionalConstructorArg
, work with Boolean
(and default it to true
).
This is a thing that goes into the cluster state so it could be that we deserialize and serialize from different versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I've updated this to be an optional + added a null check. Thanks!
})); | ||
}); | ||
|
||
if (params.isMigrationNeeded() == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also please log that the migration version has been bumped to latest without running them really?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, added a log message. Thanks!
CI failure is a know issue: #106426 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This is a potential fix for #109845, #109894, #110015 and #109538. I think that there is a race condition where the persistent task that's responsible for the security migration hasn't started yet when
awaitSecurityMigration
is executed, but right before theteardown
executes, it starts and therefore the search context is open.This takes a different approach, that makes sure that the latest migration version has been written to cluster state if the security index exists. Since the test depends on the security index, it must have been created when the teardown happens and therefore it makes it more robust.