[Transform] latest transform skipping some source documents #106363

Open
przemekwitek opened this issue Mar 14, 2024 · 3 comments
Labels
>bug, :ml/Transform (Transform), Team:ML (Meta label for the ML team)

Comments

@przemekwitek
Contributor

przemekwitek commented Mar 14, 2024

Elasticsearch Version

8.13

Installed Plugins

No response

Java Version

bundled

OS Version

MacOS

Problem Description

A latest transform was reported to skip some source documents.

I identified 2 potential issues:

  1. When there are multiple source documents with the same @timestamp value, the latest transform picks only one of them.
  2. The sync.time.delay field does not seem to influence the range filter queries issued by the latest transform.
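
For context on issue 2, here is a minimal Java sketch of what the delay is expected to do. It assumes the checkpoint's upper time bound is simply the current time minus sync.time.delay. The class and method names are hypothetical, and this is not the Elasticsearch implementation:

```java
import java.time.Duration;
import java.time.Instant;

// Illustration only: hypothetical helper, not code from Elasticsearch.
final class SyncDelaySketch {

    // Assumption: the checkpoint's upper time bound trails the wall clock by the configured delay.
    static Instant checkpointUpperBound(Instant now, Duration delay) {
        return now.minus(delay);
    }

    // True if a document with this timestamp falls below the (exclusive) upper bound
    // of a checkpoint created at `now`.
    static boolean coveredByCheckpoint(Instant docTimestamp, Instant now, Duration delay) {
        return docTimestamp.isBefore(checkpointUpperBound(now, delay));
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-03-14T12:00:00Z");
        Duration delay = Duration.ofSeconds(60);

        // A document stamped 30s ago is newer than the upper bound, so it waits for a later checkpoint.
        System.out.println(coveredByCheckpoint(now.minusSeconds(30), now, delay));
        // A document stamped 90s ago is older than the upper bound, so this checkpoint can include it.
        System.out.println(coveredByCheckpoint(now.minusSeconds(90), now, delay));
    }
}
```

If the delay were honored this way, documents would get up to the delay interval to become searchable before the checkpoint covering their timestamp is created.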

Ad 1.:
This is how we build the range query in the code:

        // We are only interested in documents that were created in the timeline of the current checkpoint.
        // Older documents cannot influence the transform results as we require the sort field values to change monotonically over time.
        return QueryBuilders.rangeQuery(synchronizationField)
            .gte(lastCheckpoint.getTimeUpperBound())
            .lt(nextCheckpoint.getTimeUpperBound())
            .format("epoch_millis");

I suspect that, because of this lt bound, documents that have the same timestamp as a document already processed in the checkpoint will not get processed.
This should be taken care of by sync.time.delay, but apparently the delay does not work in this case (Ad 2.)
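
The gte/lt semantics of the range query can be sketched in plain Java. This is an illustration only, with hypothetical names and made-up epoch-millisecond values. It assumes the second document becomes searchable only after the checkpoint covering its timestamp has already run, which is one possible way a document with a duplicate timestamp could be missed:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical names, not the Elasticsearch implementation.
final class CheckpointWindowSketch {

    // Mirrors gte(lower).lt(upper): lower bound inclusive, upper bound exclusive.
    static boolean inWindow(long timestamp, long lowerInclusive, long upperExclusive) {
        return timestamp >= lowerInclusive && timestamp < upperExclusive;
    }

    // Timestamps (epoch millis) of the searchable documents that one checkpoint would select.
    static List<Long> select(List<Long> searchable, long lowerInclusive, long upperExclusive) {
        List<Long> selected = new ArrayList<>();
        for (long ts : searchable) {
            if (inWindow(ts, lowerInclusive, upperExclusive)) {
                selected.add(ts);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Checkpoint 1 covers [0, 1500): only the first document with timestamp 1000 is searchable.
        System.out.println(select(List.of(1000L), 0L, 1500L));
        // Assumption: a second document with the same timestamp becomes searchable afterwards.
        // Checkpoint 2 covers [1500, 3000): neither document matches, so the second one is never selected.
        System.out.println(select(List.of(1000L, 1000L), 1500L, 3000L));
    }
}
```

Note that in this sketch a document whose timestamp equals the upper bound is not lost by lt alone: it is excluded from the current window but matched by gte in the next one. The loss only appears when a document arrives after its window has closed, which is the case the delay is meant to cover.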

Steps to Reproduce

This has been reproduced by the Kibana team (https://github.com/elastic/security-team/issues/8893).
Now I'm working on reproducing it locally.

Logs (if relevant)

No response

elasticsearchmachine added the needs:triage label (Requires assignment of a team area label) on Mar 14, 2024
przemekwitek self-assigned this on Mar 14, 2024
przemekwitek added the :ml/Transform label (Transform) and removed the needs:triage label (Requires assignment of a team area label) on Mar 14, 2024
przemekwitek changed the title from "[Transform] sync.time.delay field has no effect for the latest transform" to "[Transform] latest transform skipping some source documents" on Mar 14, 2024
elasticsearchmachine added the Team:ML label (Meta label for the ML team) on Mar 14, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@syepes

syepes commented Jun 17, 2024

Are there currently any workarounds or settings that could be adjusted?
In our use case, records between checkpoints/executions must never be skipped.

przemekwitek removed their assignment on Jun 21, 2024
@syepes

syepes commented Sep 5, 2024

Any news or version ETA on this issue?
