add fine-tuning codebase toward StableLM-JP-Tuned-Alpha #4

Merged: 20 commits merged into Stability-AI:jp-stable on May 30, 2023

Conversation

@fujiki-saij commented May 19, 2023

TODOs

  • refactor the initial codebase according to the codebase used for StableLM Tuned Alpha.
  • put configs in a separate file
  • add inference code

@fujiki-saij self-assigned this on May 19, 2023
@fujiki-saij marked this pull request as draft on May 19, 2023 02:08

@leemengtw (Collaborator) left a comment

@fujiki-saij except that some small changes are needed, generally LGTMeng. Feel free to merge after considering my comments, thanks!

data_path: "fujiki/japanese_alpaca_data"
train_size: 0.98
trainer: "text"
max_text_len": 1024

Collaborator:
let's remove the additional " here.

logging_steps: 100
save_strategy: "steps"
save_steps: 100
save_total_limit": 2

Collaborator:
Currently, passing this key to the training config will raise an error. Could you double-check whether we need to rename or remove it?

Author:
OK, let me double-check these configurations. (I just manually translated the config into this YAML format, so there might be some mistakes.)
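
For reference, here is a corrected sketch of the two config excerpts above, assuming the stray trailing quotes are the only problem (whether save_total_limit needs to be renamed or removed still depends on what the training config actually accepts):

```yaml
# Corrected sketch of the excerpts above (hypothetical; verify against the actual training config schema)
data_path: "fujiki/japanese_alpaca_data"
train_size: 0.98
trainer: "text"
max_text_len: 1024        # was: max_text_len": 1024

logging_steps: 100
save_strategy: "steps"
save_steps: 100
save_total_limit: 2       # was: save_total_limit": 2
```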

Comment on lines 22 to 29
if "rinna" in config["tokenizer_path"]:
tokenizer = AutoTokenizer.from_pretrained(
config["tokenizer_path"], use_fast=False, cache_dir=config.get("cache_dir", None),
)
else:
tokenizer = AutoTokenizer.from_pretrained(
config["tokenizer_path"], cache_dir=config.get("cache_dir", None),
)

Collaborator:
Two changes might be needed here:

  1. change config["tokenizer_path"] to config["tokenizer"]["tokenizer_name_or_path"] to match what you have in sample.yaml
  2. maybe remove the if statement and just use a general-purpose one:

tokenizer = AutoTokenizer.from_pretrained(
    config["tokenizer"]["tokenizer_name_or_path"], use_fast=config.get("use_fast", False), cache_dir=config.get("cache_dir", None),
)
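
sample.yaml itself is not shown in this thread; the suggestion above assumes a nested tokenizer section roughly like the following (a hypothetical sketch, with use_fast and cache_dir at the top level to match the config.get(...) calls):

```yaml
# Hypothetical shape of sample.yaml assumed by the suggested snippet above
use_fast: false                      # e.g. rinna tokenizers need the slow SentencePiece tokenizer
cache_dir: null
tokenizer:
  tokenizer_name_or_path: "<tokenizer name or local path>"
```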

Author:
I thought I had already committed and pushed this change, but I had only staged it and forgot to commit and push it: 0691ffa
Thanks for the review!

Comment on lines +55 to +62
# TODO: clean up comment outed lines
# # Data expected in prompt response pairs
# if os.path.exists(cache_name + "inputids.pt"):
#     print("using cached dataset")
#     self.input_ids = torch.load(cache_name + "inputids.pt")
#     self.attn_masks = torch.load(cache_name + "attnmask.pt")
#     self.labels = torch.load(cache_name + "inputids.pt")
#     return

Collaborator:
Oh yep, maybe clean it up a little and add back the cache feature. Right now the cache_name param is not used, and the cache would be helpful for debugging purposes 👍
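
A minimal sketch of how the cache feature could be restored inside the dataset class, assuming it tokenizes into self.input_ids, self.attn_masks, and self.labels as in the commented-out code (the class and method names here are hypothetical):

```python
import os

import torch


class _CacheMixinSketch:
    # Hypothetical helpers, intended to live on the dataset class itself.

    def _load_cache(self, cache_name: str) -> bool:
        """Return True if cached tensors were loaded and tokenization can be skipped."""
        if cache_name and os.path.exists(cache_name + "inputids.pt"):
            print("using cached dataset")
            self.input_ids = torch.load(cache_name + "inputids.pt")
            self.attn_masks = torch.load(cache_name + "attnmask.pt")
            # labels equal input_ids for causal-LM fine-tuning, so reuse the same file
            self.labels = torch.load(cache_name + "inputids.pt")
            return True
        return False

    def _save_cache(self, cache_name: str) -> None:
        """Persist tokenized tensors so later runs can reuse them (handy for debugging)."""
        if cache_name:
            torch.save(self.input_ids, cache_name + "inputids.pt")
            torch.save(self.attn_masks, cache_name + "attnmask.pt")
```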

# remove parenthesis that might be introduced by some NMTs
new_example = {}
for k, v in example.items():
    if example[k].startswith("「") and example[k].endswith("」"):

Collaborator:
JFYI, when I ran the program, this if statement was never triggered.

Author:
That's right. I've already updated the dataset on HF by applying this processing, so we don't need this functionality here anymore.
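
For the record, the full preprocessing that the truncated snippet above performs is roughly this (a sketch of the step applied to the HF dataset; the function name is made up here):

```python
def strip_nmt_brackets(example: dict) -> dict:
    """Remove corner brackets 「…」 that some machine-translation systems wrap around a field."""
    new_example = {}
    for k, v in example.items():
        if v.startswith("「") and v.endswith("」"):
            new_example[k] = v[1:-1]
        else:
            new_example[k] = v
    return new_example
```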

@fujiki-saij changed the title from [WIP] add fine-tuning codebase toward StableLM-JP-Tuned-Alpha to add fine-tuning codebase toward StableLM-JP-Tuned-Alpha on May 25, 2023
@fujiki-saij marked this pull request as ready for review on May 25, 2023 09:36
### 応答:
{response}"""

NO_INPUT_PROMPT = """以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。

Collaborator:
@fujiki-saij I think we should modify this instruction: since there is no 文脈のある入力 ("input that provides context"), the phrase 文脈のある入力の組み合わせ ("combination with an input that provides context") is incorrect, and we might get better results with an improved instruction.

Author:
OK. Let me add prompt template versioning functionality to experiment with different templates (in this PR or another PR).
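
A minimal sketch of what such prompt-template versioning could look like (the registry, version keys, and the revised v1 wording are hypothetical, not part of this PR):

```python
# Hypothetical template registry keyed by version; a config value such as
# prompt_version: "v1" would select which set of templates to use.
PROMPT_TEMPLATES = {
    "v0": {
        # current wording, which mentions an accompanying contextual input
        "no_input": "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。",
    },
    "v1": {
        # candidate rewording for the no-input case (instruction only)
        "no_input": "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。",
    },
}


def get_prompt_template(version: str, kind: str) -> str:
    """Look up a prompt template by version ("v0", "v1", ...) and kind ("no_input", ...)."""
    return PROMPT_TEMPLATES[version][kind]
```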

@fujiki-saij merged commit 0a471e8 into Stability-AI:jp-stable on May 30, 2023
leemengtw pushed a commit that referenced this pull request on Jun 7, 2023: Get EOS ID from SentencePiece directly