
fix: add option to not override results by Shaper #4231

Merged
merged 4 commits into main on Feb 22, 2023

Conversation

tstadel
Member

@tstadel tstadel commented Feb 22, 2023

Related Issues

Proposed Changes:

  • add publish_outputs option, defaults to True (same behavior as before)
  • when False outputs won't be published to the pipeline result
  • add answers to the eligible outputs that can be published
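
The core of the proposed change can be sketched in isolation. This is not the actual Shaper code, just an illustrative stand-alone function showing how a `publish_outputs` flag gates which outputs reach the pipeline result while everything still lands in the invocation context (the key list mirrors the one in `Shaper.run`, with `answers` added):

```python
# Illustrative sketch (not the actual Shaper implementation): a
# publish_outputs flag decides which outputs reach the results dict,
# while every output is still stored in the invocation context.
from typing import Any, Dict, List, Tuple

ELIGIBLE_KEYS = ["query", "file_paths", "labels", "documents", "meta", "answers"]

def collect_results(
    outputs: List[str],
    output_values: List[Any],
    publish_outputs: bool = True,
) -> Dict[str, Any]:
    """Keep every output in the invocation context, but only publish
    eligible outputs to the results dict when publish_outputs is True."""
    invocation_context: Dict[str, Any] = {}
    results: Dict[str, Any] = {}
    for output_key, output_value in zip(outputs, output_values):
        invocation_context[output_key] = output_value
        if publish_outputs and output_key in ELIGIBLE_KEYS:
            results[output_key] = output_value
    return results
```

With `publish_outputs=False`, intermediate outputs such as a joined `documents` list stay in the invocation context without overriding the pipeline result.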

How did you test it?

  • added tests

Notes for the reviewer

  • This will save us two additional Shapers in a QA pipeline using PromptNode. See the issue above for more information.
  • The yaml of the issue pipeline would then look like this:
version: '1.13.2'
components:
  - name: DocumentStore
    type: InMemoryDocumentStore
  - name: Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 3
  - name: PromptModel
    type: PromptModel
    params:
      model_name_or_path: google/flan-t5-large
      model_kwargs:
        model_kwargs:
          num_return_sequences: 3
          num_beams: 3
  - name: PromptNode 
    type: PromptNode
    params:
      default_prompt_template: question-answering
      model_name_or_path: PromptModel
  - name: InputDocumentShaper
    type: Shaper
    params:
      func: join_documents
      inputs:
        documents: documents
      outputs:
        - documents
      publish_outputs: false # this line can save two shapers
      params:
        delimiter: " - "
  - name: InputQuestionsShaper
    type: Shaper
    params:
      func: value_to_list
      inputs:
        value: query
      outputs:
        - questions
      params:
        target_list: [1]
  - name: OutputAnswerShaper
    type: Shaper
    params:
      func: strings_to_answers
      inputs:
        strings: results
      outputs:
        - answers
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: InputDocumentShaper
        inputs: [Retriever]
      - name: InputQuestionsShaper
        inputs: [InputDocumentShaper]
      - name: PromptNode
        inputs: [InputQuestionsShaper]
      - name: OutputAnswerShaper
        inputs: [PromptNode]
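
For readers unfamiliar with the two input Shapers above, here are illustrative stand-ins for `join_documents` and `value_to_list` showing the data shapes involved (the real implementations live in Haystack's shaper module; these simplified versions just demonstrate the behavior the YAML relies on):

```python
# Simplified stand-ins for the Shaper functions used in the YAML above;
# they are illustrative only, not the Haystack implementations.
from typing import Any, List

def join_documents(documents: List[str], delimiter: str = " - ") -> List[str]:
    # Collapse many documents into a single joined document,
    # as InputDocumentShaper does with delimiter " - ".
    return [delimiter.join(documents)]

def value_to_list(value: Any, target_list: List[Any]) -> List[Any]:
    # Repeat the value once per element of target_list,
    # as InputQuestionsShaper does to turn one query into a questions list.
    return [value for _ in target_list]
```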

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added tests that demonstrate the correct behavior of the change
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@tstadel tstadel requested a review from a team as a code owner February 22, 2023 10:51
@tstadel tstadel requested review from ZanSara and removed request for a team February 22, 2023 10:51
@tstadel tstadel added this to the 1.14.0 milestone Feb 22, 2023
@vblagoje
Member

@tstadel it seems totally reasonable and a small change. I was just wondering why we restricted the output to only the variables listed on the line `if output_key in ["query", "file_paths", "labels", "documents", "meta"]:`?

@tstadel
Member Author

tstadel commented Feb 22, 2023

@tstadel it seems totally reasonable and a small change. I was just wondering why we restricted the output to only the variables listed on the line `if output_key in ["query", "file_paths", "labels", "documents", "meta"]:`?

TBH I don't know. Maybe we didn't want to spoil the pipeline result with intermediate results. But I agree: with this option, that selection seems superfluous. @vblagoje shall we remove it then?

@vblagoje
Member

Removing it seems risky, TBH, although all the tests pass (at least for me locally). But that doesn't make it safe. Do you have any additional tests besides the Shaper and PromptNode tests?

Contributor

@ZanSara ZanSara left a comment


LGTM!

@@ -404,7 +425,7 @@ def run( # type: ignore
     results = {}
     for output_key, output_value in zip(self.outputs, output_values):
         invocation_context[output_key] = output_value
-        if output_key in ["query", "file_paths", "labels", "documents", "meta"]:
+        if self.publish_outputs and output_key in ["query", "file_paths", "labels", "documents", "meta", "answers"]:
Member Author


Ok, I'll do so.

@tstadel
Member Author

tstadel commented Feb 22, 2023

@vblagoje running the pipeline above works. Besides that, I don't have any tests. But we would be able to control this via `publish_outputs`, just in case.

@ZanSara
Contributor

ZanSara commented Feb 22, 2023

Let's add a test for what happens if there are nodes downstream, though: @vblagoje is right, the Pipeline might fail if it receives unexpected output that is not included in that list.

Not critical, but might be a source of weird bugs.

@tstadel
Member Author

tstadel commented Feb 22, 2023

@vblagoje @ZanSara do you think we need finer-grained control over the publishing? E.g. if there is a function that returns two outputs and only one of them should be published?
For now, we don't even have such a function, but I want to hear your opinions on that.

@vblagoje
Member

vblagoje commented Feb 22, 2023

Let's add a test for what happens if there are nodes downstream, though: @vblagoje is right, Pipeline might fail if it receives unexpected output that is not included in that list.

Not critical, but might be a source of weird bugs.

That's exactly what I was worried about, @ZanSara: some component downstream from the Shaper that doesn't declare a run variable that the Shaper outputs?! It should be fine if the Shaper is the last component in the pipeline.

@tstadel
Member Author

tstadel commented Feb 22, 2023

Let's add a test for what happens if there are nodes downstream, though: @vblagoje is right, Pipeline might fail if it receives unexpected output that is not included in that list.
Not critical, but might be a source of weird bugs.

That's exactly what I was worried about, @ZanSara: some component downstream from the Shaper that doesn't declare a run variable that the Shaper outputs?!

Theoretically it should be safe, as we're inspecting the run_signature before passing it on, see

`if key in run_signature_args:`

But I'll add a test to be safe.

@vblagoje
Member

@vblagoje @ZanSara do you think we need finer-grained control over the publishing? E.g. if there is a function that returns two outputs and only one of them should be published? For now, we don't even have such a function, but I want to hear your opinions on that.

How about a list of published outputs? If it is not set, we publish the default vars (what we have now); if it is set, we publish the list of init-declared vars. This could be useful in the last Shaper in the pipeline, where we want those vars in the root of the results dict.

@tstadel
Member Author

tstadel commented Feb 22, 2023

@vblagoje @ZanSara do you think we need finer-grained control over the publishing? E.g. if there is a function that returns two outputs and only one of them should be published? For now, we don't even have such a function, but I want to hear your opinions on that.

How about a list of published outputs? If it is not set, we publish the default vars (what we have now); if it is set, we publish the list of init-declared vars. This could be useful in the last Shaper in the pipeline, where we want those vars in the root of the results dict.

By just having a list, it might not seem intuitive that the default value (=None) would result in all outputs being published. How about converting it into `Union[bool, List[str]]` and having `True` as the default value? That way you can just switch it on or off for all outputs, and if you want finer-grained control, you go for a list.
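
The `Union[bool, List[str]]` idea could be condensed into a single predicate like the sketch below (the function name is hypothetical, just showing the semantics: `True` publishes everything, `False` publishes nothing, a list publishes only the named outputs):

```python
# Sketch of the proposed Union[bool, List[str]] semantics for
# publish_outputs. should_publish is a hypothetical helper name.
from typing import List, Union

def should_publish(output_key: str, publish_outputs: Union[bool, List[str]]) -> bool:
    # A plain bool switches publishing on or off for all outputs.
    if isinstance(publish_outputs, bool):
        return publish_outputs
    # A list gives finer-grained control: publish only the named keys.
    return output_key in publish_outputs
```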

@vblagoje
Member

@vblagoje @ZanSara do you think we need finer-grained control over the publishing? E.g. if there is a function that returns two outputs and only one of them should be published? For now, we don't even have such a function, but I want to hear your opinions on that.

How about a list of published outputs? If it is not set, we publish the default vars (what we have now); if it is set, we publish the list of init-declared vars. This could be useful in the last Shaper in the pipeline, where we want those vars in the root of the results dict.

By just having a list, it might not seem intuitive that the default value (=None) would result in all outputs being published. How about converting it into `Union[bool, List[str]]` and having `True` as the default value? That way you can just switch it on or off for all outputs, and if you want finer-grained control, you go for a list.

Sounds good to me. I'll let @ZanSara have the last word.

@ZanSara
Contributor

ZanSara commented Feb 22, 2023

Sounds good to me too! Let's implement it that way

@tstadel
Member Author

tstadel commented Feb 22, 2023

@ZanSara done. I'll merge once all tests passed.

@tstadel tstadel merged commit 32b2abf into main Feb 22, 2023
@tstadel tstadel deleted the feat/shaper_publish_outputs branch February 22, 2023 13:37
vblagoje pushed a commit that referenced this pull request Feb 22, 2023
* add option to Shaper and support answers

* remove publish restrictions on outputs

* support list
Successfully merging this pull request may close these issues.

documents get overridden by Shaper
3 participants