Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extend generation logic tests #1172

Closed

Conversation

mikeedjones
Copy link

Added tests to better define the behaviour of the extend generation logic in dsp/primitives/predict.py.

They currently don't pass with either version of the extend generation logic! I'm not sure what the intended behaviour should be - can @XenonMolecule, @okhat @arnavsinghvi11 please explain what these tests should look like?

Cheers!

@mikeedjones mikeedjones changed the title Mipro v2 extend generation extend generation logic tests Jun 19, 2024
Comment on lines 244 to 258
def test_extend_generation(SandwichIdea):
lm = DummyLM(
[
" whole wheat\n\nProtein: turkey\n\nFat: avocado",
" tomato\n\nSauce: mustard",
]
)
dspy.settings.configure(lm=lm)

prediction = Predict(SandwichIdea)(meal="lunch", dietary_requiements="N/A")
assert prediction.bread == "whole wheat"
assert prediction.protein == "turkey"
assert prediction.fat == "avocado"
assert prediction.garnish == "tomato"
assert prediction.sauce == "mustard"
Copy link
Author

@mikeedjones mikeedjones Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@okhat - I'm not sure about the other tests I've added, but I was definitely expecting this test to pass. However:

SandwichIdea = SandwichIdea(meal, dietary_requiements -> bread, protein, fat, garnish, sauce
    instructions='Based on the meal and ...notation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Sauce:', 'desc': '${sauce}'})
)

    def test_extend_generation(SandwichIdea):
        lm = DummyLM(
            [
                " whole wheat\n\nProtein: turkey\n\nFat: avocado",
                " tomato\n\nSauce: mustard",
            ]
        )
        dspy.settings.configure(lm=lm)
    
        prediction = Predict(SandwichIdea)(meal="lunch", dietary_requiements="N/A")
        assert prediction.bread == "whole wheat"
        assert prediction.protein == "turkey"
        assert prediction.fat == "avocado"
>       assert prediction.garnish == "tomato"
E       AssertionError: assert '' == 'tomato'
E         - tomato

I think when "" is added as the result of the field

completion[field_names[last_field_idx]] = ""
it is interperated as the final value by the extract method of template as the "has this field been completed?" check depends on comparing the value of the field to None
if self.fields[idx].input_variable not in example or example[self.fields[idx].input_variable] is None:

So the field is being filled with an empty string - and the model generated value is not being included in the "correct place" in the output, but the program continues without raising the recursion error.

The upshot is the prediction ends up offset by one field, and the following fields are not parsed correctly. ie:

prediction.garnish # ""
prediction.sauce # " tomato\n\nSauce: mustard"

The most atomised fix would be to delete

completion[field_names[last_field_idx]] = ""

But I think that might cause quite a few issues with existing examples which use the extend generation logic - the "quiet fail" which currently occurs is replaced by a recursion depth exception as the model continues to not generate the field.

Depending on the program, it might be the case that further down the line the deserialisation of a prompt + completion will unpick the offset caused by

completion[field_names[last_field_idx]] = ""

@okhat
Copy link
Collaborator

okhat commented Jun 19, 2024

Hey @mikeedjones, thanks so much for the deep dive! The changes do need to be reverted for now, because they break more fundamental things than the regressions you mentioned, although these are very important too.

In the longer run, I like the direction of this PR overall. Let's think of what the right long-term behavior is for parsing.

The original DSPy behavior, which strikes a very good compromise IMO but needs to be better documented, is: when you ask for n=1 completion, you'll always get it back. If you request n>1 completions, you get at least one. No guarantees. If you need guaranteed n > 1 behavior, create multiple modules with temperature=0.7 + i*0.001.

@mikeedjones
Copy link
Author

mikeedjones commented Jun 19, 2024

Hi @okhat - I think what I show above is that the reverted logic isn't currently working as I expected?

The original logic fills the missed field with an empty string and the completion continues from there - so the model would be prompted to start from Sauce as opposed to Garnish and the user has to accept the unfilled field? Is that the indended behavior?

Given that #920 was merged - think it makes sense to revert and add some tests so someone else can't come and inadvertently break something further down the line!

lm = DummyLM(
[
" whole wheat\n\nProtein: turkey\n\nFat: avocado",
" tomato\n\nSauce: mustard",
Copy link
Author

@mikeedjones mikeedjones Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@okhat So this generation would actually look like

" mustard"

Because the last None field in the signature would be sauce?

@mikeedjones
Copy link
Author

Updated the tests and added a comment to each including the logged lm calls. This branch has been pointing at https://github.com/stanfordnlp/dspy/tree/mipro_v2 and the tests are passing for the reverted logic. I think the dummy LM generations are representative - but I'm not sure the behaviour demonstrated by the tests is desired?

@XenonMolecule XenonMolecule deleted the branch stanfordnlp:mipro_v2 June 21, 2024 05:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants