[7/7] Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) #1158

RdoubleA · 2024-07-10T00:26:15Z

Changelog

Add two example multimodal dataset builders, each with their own dataset transform: The Cauldron and LLaVA-Instruct-150K. Both require slightly different preprocessing and serve as good examples
Add a utility to quickly map raw text with image tags into what's expected for Message content field
Upgrade ShareGPTToMessages with support for an image column

Test plan

Unit test for The Cauldron, LLaVA Instruct datasets
Unit test for split_text_by_image_tag
TODO: update unit test for ShareGPTToMessages

Docs

TODO: update docstrings and API ref

…transforms

pytorch-bot · 2024-07-10T00:26:18Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1158

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 5 Cancelled Jobs

As of commit 8530958 with merge base 9e65fa9 ():

NEW FAILURES - The following jobs have failed:

GPU tests / gpu_test (3.10, stable) (gh)
tests/torchtune/datasets/test_the_cauldron_dataset.py::TestTheCauldronDataset::test_label_masking
GPU tests / gpu_test (3.11, stable) (gh)
tests/torchtune/datasets/test_the_cauldron_dataset.py::TestTheCauldronDataset::test_label_masking
Unit Test / unit_tests (3.11) (gh)
tests/torchtune/datasets/test_the_cauldron_dataset.py::TestTheCauldronDataset::test_label_masking

CANCELLED JOBS - The following jobs were cancelled. Please retry:

GPU tests / gpu_test (3.8, stable) (gh)
GPU tests / gpu_test (3.9, stable) (gh)
tests/torchtune/datasets/test_the_cauldron_dataset.py::TestTheCauldronDataset::test_label_masking
Unit Test / unit_tests (3.10) (gh)
Unit Test / unit_tests (3.8) (gh)
Unit Test / unit_tests (3.9) (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

RdoubleA added 30 commits June 11, 2024 23:49

complete tokenizer refactor

75dae87

move tokenizers under data/

0c20ba9

fix all tests

730a2c9

Merge branch 'main' into tokenizer

acf7e81

start to address comments

2ae157c

load in special tokens, move tokenizer directory back, address comments

6a50cd5

fix encode whitespace

61534d0

updates after manual comparisons

1d6e5e3

default special tokens

5712de4

fix docs

d84bbda

fix doc strings

5a8b82b

Merge branch 'main' into tokenizer

52643cb

fix tests

a00c1dc

fix SP test

29273ca

add image support

aa43095

tool support

8afaaf9

update tests

d3d4b66

update tests

d326dca

use images as attachments instead

58e3e9d

update all tests

7fdccae

use list of dicts for MM messages

820d9ac

fix chat formats

7ba4216

add multimodal dataset, test, and the cauldron

42f8c83

multimodal dataset test

7cad2dc

fix rebase

335e85f

Merge branch 'main' into tokenizer

adca77e

update api ref

b204563

Merge branch 'main' into tokenizer

e236916

fix llama3 toeknizer test:

93028cf

add image support

fb12cbb

RdoubleA added 17 commits July 2, 2024 10:55

tool support

b5bf410

update tests

00f266f

update tests

c815069

use images as attachments instead

21b3ea8

update all tests

adbfb20

use list of dicts for MM messages

1e40a9d

fix chat formats

0d3665c

run linter

95edf70

Merge branch 'main' into tokenizer_updates

a3067aa

merge main

d49febf

fix chat formats

7da4189

Merge branch 'tokenizer_updates' into mm_dataset

58babf0

fix merge

7bcdaf8

Merge branch 'main' into mm_dataset

82e1dea

fix merge

ff81c5c

multimodal dataset, unit test, and two example dataset builders with …

258e98f

…transforms

Merge branch 'main' into mm_dataset

1410d70

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 10, 2024

RdoubleA marked this pull request as draft July 10, 2024 00:26

RdoubleA added 2 commits August 21, 2024 14:30

Merge branch 'main' into mm_dataset

5aea048

update with latest APIs

2731a60

RdoubleA changed the title ~~[WIP] Multimodal datasets (The Cauldron, LLaVA-Instruct-150K)~~ Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) Aug 22, 2024

RdoubleA marked this pull request as ready for review August 22, 2024 01:13

RdoubleA changed the title ~~Multimodal datasets (The Cauldron, LLaVA-Instruct-150K)~~ [7/7] Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) Aug 22, 2024

fix lint

8530958

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[7/7] Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) #1158

[7/7] Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) #1158

RdoubleA commented Jul 10, 2024 •

edited

Loading

pytorch-bot bot commented Jul 10, 2024 •

edited

Loading

[7/7] Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) #1158

Are you sure you want to change the base?

[7/7] Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) #1158

Conversation

RdoubleA commented Jul 10, 2024 • edited Loading

Changelog

Test plan

Docs

pytorch-bot bot commented Jul 10, 2024 • edited Loading

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1158

❌ 3 New Failures, 5 Cancelled Jobs

RdoubleA commented Jul 10, 2024 •

edited

Loading

pytorch-bot bot commented Jul 10, 2024 •

edited

Loading