Dataset destroys Example.input_keys values #898

Closed
jsleight opened this issue Apr 24, 2024 · 2 comments

Comments

@jsleight

Minimal example (on dspy v2.4.0):

import dspy
examples = [dspy.Example(foo=f, bar=b).with_inputs("foo") for f, b in zip("abcd", "1234")]
print(examples)  # [Example({'foo': 'a', 'bar': '1'}) (input_keys={'foo'}), Example({'foo': 'b', 'bar': '2'}) (input_keys={'foo'}), Example({'foo': 'c', 'bar': '3'}) (input_keys={'foo'}), Example({'foo': 'd', 'bar': '4'}) (input_keys={'foo'})]

from dspy.datasets.dataset import Dataset

class MyDataset(Dataset):
    def __init__(self, examples):
        super().__init__(train_size=1, dev_size=1, test_size=1)
        self._train = [examples[0]]
        self._dev = [examples[1]]
        self._test = [examples[2]]

dataset = MyDataset(examples)
print(dataset.train)  # [Example({'foo': 'a', 'bar': '1'}) (input_keys=None)]
print(dataset.dev)    # [Example({'foo': 'b', 'bar': '2'}) (input_keys=None)]
print(dataset.test)   # [Example({'foo': 'c', 'bar': '3'}) (input_keys=None)]

Expected to have the input_keys persist through the Dataset object. This line seems to be the problem.
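
To illustrate how this kind of loss can happen, here is a minimal sketch using a toy stand-in class. `ToyExample` is hypothetical and is not dspy's `Example`. It only assumes that input keys live outside the field mapping, so rebuilding an object from its fields with `**` starts again from `None`.

```python
# Toy stand-in for illustration only; this is NOT dspy's Example class.
class ToyExample:
    def __init__(self, **fields):
        self._store = dict(fields)
        self._input_keys = None  # a freshly built instance has no input keys

    def with_inputs(self, *keys):
        # Return a copy whose input keys are set, leaving self untouched.
        copy = ToyExample(**self._store)
        copy._input_keys = set(keys)
        return copy

    # keys() and __getitem__ are what `**obj` unpacking relies on.
    def keys(self):
        return self._store.keys()

    def __getitem__(self, key):
        return self._store[key]


original = ToyExample(foo="a", bar="1").with_inputs("foo")

# Rebuilding from the fields alone: `**original` only exposes foo and bar,
# so the new object never sees the input keys.
rebuilt = ToyExample(**original)

print(original._input_keys)  # {'foo'}
print(rebuilt._input_keys)   # None
```

If a container copies its items this way, any `with_inputs` call made beforehand is silently dropped, which matches the `input_keys=None` output above.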

@arnavsinghvi11
Collaborator

arnavsinghvi11 commented Apr 27, 2024

Hi @jsleight, thanks for raising this. Currently, the intended workflow is to declare your Dataset first and then set the inputs. Here is an example from intro.ipynb:

from dspy.datasets import HotPotQA

# Load the dataset.
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)

# Tell DSPy that the 'question' field is the input. Any other fields are labels and/or metadata.
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

len(trainset), len(devset)

but it does make sense to me to have input_keys() persist if they exist. Feel free to push a PR for this change!

@jsleight
Author

I might have some time to make a PR. I can envision a couple of approaches, so I'm interested to see which you'd prefer.

  1. Just change the line in Dataset that creates copies of the examples to also do with_inputs.
  2. A more fundamental change to Example so that Example(**example) persists the input_keys. This would make the Dataset class persist the input_keys while adding a bit more functionality to the Example class. But idk if you'd like Example to work this way or not.
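
The two approaches can be sketched roughly as follows, again with a toy stand-in rather than dspy's real classes. `ToyExample`, its `base` parameter, and `copy_reapplying_inputs` are hypothetical names used only for illustration. A plain `**` unpacking cannot carry input keys by itself, so the second approach is shown here with an explicit `base` argument.

```python
# Toy stand-in for illustration only; this is NOT dspy's Example class.
class ToyExample:
    def __init__(self, base=None, **fields):
        self._store = {}
        self._input_keys = None
        if base is not None:
            self._store.update(base._store)
            # Approach 2: the constructor itself carries input keys over.
            if base._input_keys is not None:
                self._input_keys = set(base._input_keys)
        self._store.update(fields)

    def with_inputs(self, *keys):
        copy = ToyExample(base=self)
        copy._input_keys = set(keys)
        return copy

    def keys(self):
        return self._store.keys()

    def __getitem__(self, key):
        return self._store[key]


def copy_reapplying_inputs(example, **extra):
    """Approach 1: copy the fields, then re-apply with_inputs if keys were set."""
    copy = ToyExample(**example, **extra)
    if example._input_keys is not None:
        copy = copy.with_inputs(*example._input_keys)
    return copy


src = ToyExample(foo="a", bar="1").with_inputs("foo")

via_helper = copy_reapplying_inputs(src, split="train")  # approach 1
via_base = ToyExample(base=src, split="train")           # approach 2

print(via_helper._input_keys, via_base._input_keys)
```

Approach 1 keeps the change local to the code that copies examples. Approach 2 changes what copying an example means everywhere, which is the trade-off raised in the comment.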

arnavsinghvi11 added a commit that referenced this issue Jun 17, 2024
Fix the issue of handling input_keys using Dataset class (Issue #898)