Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undefined behavior when turning coerced optional File? to null + Clarification about where String to File coercion takes place #673

Open
stxue1 opened this issue Jul 2, 2024 · 4 comments

Comments

@stxue1
Copy link
Collaborator

stxue1 commented Jul 2, 2024

I originally encountered these issues at chanzuckerberg/miniwdl#696.

One thing the WDL spec is vague about is how a task should coerce string to file. The spec says that all non-output declarations must run prior to the command section. My implicit understanding is that the output declaration will be running under a different directory than the rest of the task. It sounds like the output declarations are running in the current directory under the host machine, while the output section is running in the current directory inside the container.

For example:

task test {
  input {
    File f_input = "test.txt"
  }
  command <<<printf "hello" > test.txt>>>
  File f_body = "test.txt"
  output {
    File f_output = "test.txt"
  }
}

Assuming all files exist, it's implicitly assumed that f_input and f_body will point to some file on the host machine, but f_output will point to the file inside the container. Maybe this should be clarified in the SPEC, as it is not immediately obvious.

Another issue that arose when testing around with miniwdl is that there can be inconsistent behavior with coerced optional files.

Given the WDL workflow:

version 1.1
workflow testWorkflow {
  input {
  }
  call testTask
  output {
    Array[File?] array_in_output = testTask.array_in_output
    Int len_in_output = testTask.len_in_output
    Array[File?] array_in_body_out = testTask.array_in_body_out
    Int len_in_body_out = testTask.len_in_body_out
    Array[File?] array_in_input_out = testTask.array_in_input_out
    Int len_in_input_out = testTask.len_in_input_out
  }
}

task testTask {
  input {
    Array[File?] array_in_input = ["example1.txt", "example2.txt"]
    Int len_in_input = length(select_all(array_in_input))
  }
  command <<<>>>
  Array[File?] array_in_body = ["example1.txt", "example2.txt"]
  Int len_in_body = length(select_all(array_in_body))
  output {
    Array[File?] array_in_output = ["example1.txt", "example2.txt"]
    Int len_in_output = length(select_all(array_in_output))
    Array[File?] array_in_body_out = array_in_body
    Int len_in_body_out = len_in_body
    Array[File?] array_in_input_out = array_in_input
    Int len_in_input_out = len_in_input
  }
}

The spec says that optional file types at task outputs will be coerced to null.

For one, is there a reason why this scope is limited to just task outputs and not workflow outputs?

Additionally, because the spec says this null coercion is applied at the output step, given that the files example1.txt and example2.txt don't exist, the assumed correct output for the WDL workflow above is:

{
  "dir": "/home/heaucques/Documents/wdl-conformance-tests/20240626_184902_testWorkflow",
  "outputs": {
    "testWorkflow.array_in_body_out": [
      null,
      null
    ],
    "testWorkflow.array_in_input_out": [
      null,
      null
    ],
    "testWorkflow.array_in_output": [
      null,
      null
    ],
    "testWorkflow.len_in_body_out": 2,
    "testWorkflow.len_in_input_out": 2,
    "testWorkflow.len_in_output": 0
  }
}

Because the null coercion happens at the task output, the select_all function calls all will return different values depending on what part of the section it is called in; the body will return ["example1.txt", "example2.txt"], giving a length of 2. However, for the task output declaration, the function select_all will return [null, null], giving a length of 0. Since this can be counterintuitive as one may expect that a nonexistent file will always not be counted in a select_all call, is this the expected behavior, or what should the expected behavior be?

@stxue1 stxue1 changed the title Undefined behavior for coercing String to optional File? to null + Clarification about where String to File coercion takes place Undefined behavior when turning coerced optional File? to null + Clarification about where String to File coercion takes place Jul 2, 2024
@jdidion
Copy link
Collaborator

jdidion commented Jul 24, 2024

To address your first question: paths are relative to the execution environment. So if a container is used, then that means relative to the working directory inside the container. But that means it's generally a bad idea to use relative paths as default values for inputs because the execution engine is free to make the working directory be whatever it wants. For example, it could mount a volume on the host machine to /foo in the container and then make the working directory in the container be /foo/bar. Obviously "test.txt" won't exist in this directory. The spec should probably be updated to make this caveat clear and caution against using relative paths.

@jdidion
Copy link
Collaborator

jdidion commented Jul 24, 2024

You are correct that missing optional workflow outputs should also be None.

@jdidion
Copy link
Collaborator

jdidion commented Jul 24, 2024

For the last part - I'm not sure I fully understand. Could you create a separate issue with a minimal reproducible test case (including any necessary input files)? Thanks

@stxue1
Copy link
Collaborator Author

stxue1 commented Jul 24, 2024

For the last part - I'm not sure I fully understand. Could you create a separate issue with a minimal reproducible test case (including any necessary input files)? Thanks

Sure, I think my main concern in the original included WDL workflow is how the len outputs (len_in_body_out, len_in_input_out, and len_in_output) are all different, with 2, 2, and 0 respectively, even though the declarations they compute should be functionally equivalent.

For example (Much of the WDL evaluation is in tasks instead of the workflow due to a limitation of miniwdl):

version 1.1
workflow testWorkflow_output {
  input {
  }
  call testTask
  output {
    Array[File?] array_of_files = testTask.array_of_files
    Int len = testTask.len
  }
}

task testTask {
  input {
  }
  command <<<>>>
  output {
    Array[File?] array_of_files = ["example1.txt", "example2.txt"]
    Int len = length(select_all(array_of_files))
  }
}

In the task, the files example1.txt and example2.txt do not exist, so I'd expect that array_of_files is equal to [null, null]. Because they do not exist, I'd expect select_all(array_of_files) to equal [], as all elements are null. Thus, len should be 0.

For this workflow, Miniwdl agrees:

{
  "dir": "/home/heaucques/Documents/wdl-conformance-tests/20240724_125731_testWorkflow_output",
  "outputs": {
    "testWorkflow_output.array_of_files": [
      null,
      null
    ],
    "testWorkflow_output.len": 0
  }
}

However, given a workflow like this:

version 1.1
workflow testWorkflow_body {
  input {
  }
  call testTask
  output {
    Array[File?] array_of_files = testTask.array_of_files
    Int len = testTask.len
  }
}

task testTask {
  input {
  }
  command <<<>>>
  Array[File?] array_of_files_0 = ["example1.txt", "example2.txt"]
  Int len_0 = length(select_all(array_in_body))
  output {
    Array[File?] array_of_files = array_of_files_0
    Int len = len_0
  }
}

I've effectively moved these lines from the output of the task to the body:

    Array[File?] array_of_files = ["example1.txt", "example2.txt"]
    Int len = length(select_all(array_of_files))

Thus, the second workflow will compute array_of_files in the body section (computed in array_of_files_0 and copied into array_of_files), while the first workflow computes array_of_files in the output section.

Assuming the files do not exist on my machine, my "intuition" would think that the output of both workflows are the same:

For the first workflow testWorkflow_output: the files don't exist in the container, so the array array_of_files should be empty and the length of the array after a select_all function should be 0.

For the second workflow testWorkflow_body: the files don't exist on my machine, so the array array_of_files_0 (and in extension array_of_files) should be empty as well, and the length should also be 0.

However, the spec specifically says that the coercion of an optional file will become null only in task outputs:

wdl/SPEC.md

Line 3880 in caff59d

All file outputs are required to exist, otherwise the task will fail. However, an output may be declared as optional (e.g., `File?` or `Array[File?]`), in which case the value will be undefined if the file does not exist.

All file outputs are required to exist, otherwise the task will fail... the value will be undefined if the file does not exist

As a result, for the second workflow testWorkflow_body, the nonexistent files in array_of_files_0 are not coerced, so select_all(array_of_files_0) will see the array ["example1.txt", "example2.txt"] instead of [null, null]. Only when the output section is reached, or where array_of_files is declared, will the file elements be coerced to null.

Miniwdl seems to follow this concept, as the output for testWorkflow_body is:

{
  "dir": "/home/heaucques/Documents/wdl-conformance-tests/20240724_131039_testWorkflow_body",
  "outputs": {
    "testWorkflow_body.array_of_files": [
      null,
      null
    ],
    "testWorkflow_body.len": 2
  }
}

Even though array_of_files is [null, null], the length (or the number of nonnull elements in the array) is technically reported as 2.

Rather than applying the coercion of the nonexistent files ["example1.txt", "example2.txt" to become [null, null] when array_of_files is declared in the output section (with the line array_of_files = array_of_files_0), should the coercion also happen in the body? Or since this is defined in the spec, is this behavior expected?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants