Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support for data sourcing and transformation with data template #4958

Merged
merged 58 commits into from
Feb 26, 2021

Conversation

simster7
Copy link
Member

@simster7 simster7 commented Jan 27, 2021

Motivation

Users often source and transform data as part of their workflows. We want to add first-class support for these common operations in a new data template.

Goals

  • Provide a varied set of sources and transformations so that users don't have to code and spin up their own Pods.

  • Be able to "automagically" decide if we are able to do the transformations in-memory in the controller, thereby saving the need to create a new Pod.

  • Provide an experience similar to common bash file manipulations. For example:

    cat file.txt | grep ... | sed ... | sort | uniq 
    

    Here, cat file.txt acts as a "source", but it could also be a curl or find, etc. grep ... | sed ... | sort | uniq act as transformations.

Maybe Goals

Non Goals

  • Provide a substitute for a container. We want this to be a convenience for common use cases with sourcing and transforming data. If the convenience of the data template is not enough for our users, they should at said point define their own containers.

Examples

A data template could look like:

- name: generate-artifacts
  data:
    source:                       # Define a source for the data, only a single "source" is
                                  # permitted
      withArtifactPaths:          # Generate a list of all artifact paths in a given repository
        s3:
          bucket: test
          endpoint: minio:9000
          insecure: true
          accessKeySecret:
            name: my-minio-cred
            key: accesskey
          secretKeySecret:
            name: my-minio-cred
            key: secretkey
            
    transformation:               # The source is then passed to be transformed by
                                  # transformations defined here

      - filter:                   # Filters contents of source
          regex: "PROJECT_NAME"   # Only keep files that match this regex (e.g. contain
                                  # "PROJECT_NAME")
          
      - map:                      # Manipulate the files in some way
          replace:                # Self-explanatory
            old: "PROJECT_NAME"
            new: ""
            
      - group:                    # Group files, this can be very useful when used with withItems
          batch: 10               # Create groups of 10 files each. This can be useful to evenly
                                  # distribute work across pods expanded with withItems

data templates could also receive inputs from previous steps. In this case, since sourcing is not necessary, we can perform the transformations in-memory in the controller, saving the need to create a new pod and continuing immediately with execution.

- name: collect-artifacts
  inputs:
    parameters:
      - name: processed-file-list # Use and input parameter as source
  data:
    transformation:
      - filter:
          exclude: true           # Exclude if match
          regex: "\\.log"         # Don't include all log files generated by the processor
      - group:
          regex: ".+?\\.(.+?)$"   # Group by the result of the regex capture group, in this case
                                  # it's the file extension. This could be useful to post-process
                                  # or zip files differently based on extension

Open Questions

  • How to provide extensive transformation without needing to enumerate them ourselves? Libraries such as expr and Sprig can be useful, but also limited (expr) or too domain specific (Sprig). Provide access to bash tools (e.g. grep, sed, etc.)? This is a balance between providing enough tooling for users to not need to spin up their own pods, but not enough that we're redoing work that users could do in their specific pods.

  • What are the main use cases we should target with this template? What sources are a must-have? What transformations are a must-have?

  • Could data: templates be used for quick in-line transformations? For example

    - name: collect-artifacts
      inputs:
        parameters:
          - name: processed
      data:
        transformation:
          - map:
              expression: "{{item}} + 1"

go.mod Outdated
replace github.com/argoproj/argo/v2 => ./
replace (
github.com/argoproj/argo/v2 => ./
github.com/argoproj/pkg v0.3.0 => ../pkg
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compilation errors will be due to this change, which I have made to test my argoproj/pkg changes locally

Signed-off-by: Simon Behar <[email protected]>
Signed-off-by: Simon Behar <[email protected]>
Signed-off-by: Simon Behar <[email protected]>
Signed-off-by: Simon Behar <[email protected]>
Signed-off-by: Simon Behar <[email protected]>
Signed-off-by: Simon Behar <[email protected]>
Signed-off-by: Simon Behar <[email protected]>
Signed-off-by: Simon Behar <[email protected]>
Signed-off-by: Simon Behar <[email protected]>
Signed-off-by: Simon Behar <[email protected]>
@@ -249,3 +253,7 @@ func uploadObject(client *storage.Client, bucket, key, localPath string) error {
}
return nil
}

func (g *ArtifactDriver) ListObjects(artifact *wfv1.Artifact) ([]string, error) {
panic("implement me")
Copy link
Contributor

@alexec alexec Feb 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think we should return an error here (rather than panic) - and do the implementation later, in a v1.1

cmd/argoexec/commands/data.go Outdated Show resolved Hide resolved
cmd/argoexec/commands/data.go Outdated Show resolved Hide resolved
config/config.go Outdated Show resolved Hide resolved
docs/data-sourcing-and-transformation.md Outdated Show resolved Hide resolved
docs/data-sourcing-and-transformation.md Outdated Show resolved Hide resolved
workflow/data/data.go Show resolved Hide resolved
workflow/data/data.go Outdated Show resolved Hide resolved
workflow/data/data.go Show resolved Hide resolved
workflow/executor/data.go Outdated Show resolved Hide resolved
workflow/executor/executor.go Outdated Show resolved Hide resolved
config/config.go Outdated
@@ -361,7 +361,8 @@ type MetricsConfig struct {
}

type WorkflowRestrictions struct {
TemplateReferencing TemplateReferencing `json:"templateReferencing"`
TemplateReferencing TemplateReferencing `json:"templateReferencing,omitempty"`
DataTemplatePodPolicy DataTemplatePodPolicy `json:"dataTemplatePodPolicy,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can I please ask you to remove this from this PR and we can follow up with >= v1.1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think this is important this is shipped with the MVP as it protects from a possible exploit. Terry's comment describes it well: #4958 (review)

Copy link
Member

@terrytangyuan terrytangyuan Feb 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I ask what v1.1 represents? I've been seeing that a lot in recent PRs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v1.1 of this feature (not of Argo)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's slang:

  • v1.0 - the MVP
  • v1.1/v1.2/v1.3 etc- an improvement to a feature we probably want follow-up soon after v1.0
  • v2.0 - a complex/big improvement we want to hold off on on

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

give (a) this PR will only ship with artifactsPaths and (b) we only provide an expression transformation - I don't think we have a security risk with this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regardless, lets narrow the scope v1.0 and follow up with a v1.1 after discussion

Signed-off-by: Simon Behar <[email protected]>
@alexec
Copy link
Contributor

alexec commented Feb 24, 2021

I've been thinking about other sources we might have. An obvious one is a normal artifact. This made me think about raw. Raw would get templated, which would be more powerful with expression tags (nee expression template, nee enhanced templates), but an expression or script would be more powerful always.

So I suggest:

  • Descope `raw' from the PR (go narrower!!).
  • Follow up with a new PR to introduce expression as a source, similar to raw you should be able to use inputs for this too.

I feel like we may end up with the following sources, each in its own PR based on community suggestions:

  • artifact
  • script
  • raw

Signed-off-by: Simon Behar <[email protected]>
@simster7
Copy link
Member Author

@alexec Done!

@simster7 simster7 requested a review from alexec February 25, 2021 17:05
config/config.go Outdated Show resolved Hide resolved
docs/data-sourcing-and-transformation.md Outdated Show resolved Hide resolved
docs/data-sourcing-and-transformation.md Outdated Show resolved Hide resolved
@@ -0,0 +1,100 @@
# See doc docs/data-sourcing-and-transformation.md
apiVersion: argoproj.io/v1alpha1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump

go.mod Outdated
@@ -2,6 +2,8 @@ module github.com/argoproj/argo-workflows/v3

go 1.15

replace github.com/argoproj/pkg v0.4.0 => github.com/simster7/pkg v0.2.1-0.20210205175442-61f69ff4a52f
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still pending?

pkg/apis/workflow/v1alpha1/data_types.go Outdated Show resolved Hide resolved
test/e2e/functional_test.go Outdated Show resolved Hide resolved
@@ -0,0 +1,100 @@
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could make this a testable example - BUT

a testable example should make a good example, if you need to be more complex, then better to have both examples and test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you need to be more complex, then better to have both examples and test

I think in this case we need to be more complex. See comment: #4958 (comment)

test/e2e/testdata/data-transformation.yaml Show resolved Hide resolved
wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

var inMemoryDataNode = `
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

200 lines of code - anyway we can reduce the maintenance this might need by making it somehow more focused?

Copy link
Member Author

@simster7 simster7 Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's 200 lines of node status, not code :) I have already reduced this as much as possible. (Including removing lines of node status that are not needed, such as Progress)

workflow/executor/data.go Show resolved Hide resolved
Signed-off-by: Simon Behar <[email protected]>
Signed-off-by: Simon Behar <[email protected]>
Signed-off-by: Simon Behar <[email protected]>
Signed-off-by: Simon Behar <[email protected]>
Signed-off-by: Simon Behar <[email protected]>
Signed-off-by: Simon Behar <[email protected]>
@simster7 simster7 requested a review from alexec February 25, 2021 22:39
Copy link
Contributor

@alexec alexec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great - some mino comments which could be left to anther day

@@ -0,0 +1,100 @@
# See doc docs/data-sourcing-and-transformation.md
apiVersion: argoproj.io/v1alpha1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, we can change our mind later

pkg/apis/workflow/v1alpha1/data_types.go Outdated Show resolved Hide resolved
@@ -2630,6 +2650,38 @@ func (woc *wfOperationCtx) executeResource(ctx context.Context, nodeName string,
return node, err
}

func (woc *wfOperationCtx) executeData(ctx context.Context, nodeName string, templateScope string, tmpl *wfv1.Template, orgTmpl wfv1.TemplateReferenceHolder, opts *executeTemplateOpts) (*wfv1.NodeStatus, error) {
Copy link
Contributor

@alexec alexec Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly, but it woc is MASSIVE and you might prefer to find your code in a single focussed file

workflow/controller/operator.go Outdated Show resolved Hide resolved
Signed-off-by: Simon Behar <[email protected]>
Signed-off-by: Simon Behar <[email protected]>
Signed-off-by: Simon Behar <[email protected]>
@simster7 simster7 merged commit 0c226ca into argoproj:master Feb 26, 2021
@alexec alexec added this to the v3.1 milestone Mar 1, 2021
@simster7 simster7 mentioned this pull request Mar 1, 2021
32 tasks
@simster7 simster7 mentioned this pull request Mar 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants