Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[7/7] Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) #1158

Open
wants to merge 50 commits into
base: main
Choose a base branch
from

Conversation

RdoubleA
Copy link
Contributor

@RdoubleA RdoubleA commented Jul 10, 2024

Changelog

  • Add two example multimodal dataset builders, each with their own dataset transform: The Cauldron and LLaVA-Instruct-150K. Both require slightly different preprocessing and serve as good examples
  • Add a utility to quickly map raw text with image tags into what's expected for Message content field
  • Upgrade ShareGPTToMessages with support for an image column

Test plan

  • Unit test for The Cauldron, LLaVA Instruct datasets
  • Unit test for split_text_by_image_tag
  • TODO: update unit test for ShareGPTToMessages

Docs

TODO: update docstrings and API ref

Copy link

pytorch-bot bot commented Jul 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1158

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 5 Cancelled Jobs

As of commit 8530958 with merge base 9e65fa9 (image):

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 10, 2024
@RdoubleA RdoubleA marked this pull request as draft July 10, 2024 00:26
@RdoubleA RdoubleA changed the title [WIP] Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) Aug 22, 2024
@RdoubleA RdoubleA marked this pull request as ready for review August 22, 2024 01:13
@RdoubleA RdoubleA changed the title Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) [7/7] Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) Aug 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants