add fine-tuning codebase toward StableLM-JP-Tuned-Alpha #4
Conversation
…base used in StableLM
@fujiki-saij apart from a few small changes that are needed, this generally LGTM. Feel free to merge after addressing my comments, thanks!
finetune/configs/sample.yaml
Outdated
data_path: "fujiki/japanese_alpaca_data"
train_size: 0.98
trainer: "text"
max_text_len": 1024
let's remove the additional " here.
finetune/configs/sample.yaml
Outdated
logging_steps: 100
save_strategy: "steps"
save_steps: 100
save_total_limit": 2
Currently, passing this key to the training config will raise an error. Could you double-check whether we need to rename or remove it?
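For context, this kind of failure can be reproduced with any strict keyword-based config class: a misspelled key is rejected at construction time. A minimal stdlib sketch, where TrainingArgs is a hypothetical stand-in for the real training-arguments class (which also rejects unknown keyword arguments):

```python
from dataclasses import dataclass

@dataclass
class TrainingArgs:
    # hypothetical stand-in for the real training config class
    save_strategy: str = "steps"
    save_steps: int = 100
    save_total_limit: int = 2

# the stray quote from the YAML produces a key no config class recognizes
bad_config = {"save_strategy": "steps", "save_steps": 100, 'save_total_limit"': 2}
try:
    TrainingArgs(**bad_config)
    failed = False
except TypeError as err:
    # unexpected keyword argument 'save_total_limit"'
    failed = True
```

Renaming the key to save_total_limit (no quote) makes the same call succeed.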
OK, let me double-check these configurations. (I just manually translated the config into this YAML format, so there might be some mistakes.)
finetune/finetune_base_ja.py
Outdated
if "rinna" in config["tokenizer_path"]:
    tokenizer = AutoTokenizer.from_pretrained(
        config["tokenizer_path"], use_fast=False, cache_dir=config.get("cache_dir", None),
    )
else:
    tokenizer = AutoTokenizer.from_pretrained(
        config["tokenizer_path"], cache_dir=config.get("cache_dir", None),
    )
Two changes might be needed here:
- change config["tokenizer_path"] to config["tokenizer"]["tokenizer_name_or_path"] to match what you have in sample.yaml
- maybe remove the if statement and just use a general-purpose one:

tokenizer = AutoTokenizer.from_pretrained(
    config["tokenizer"]["tokenizer_name_or_path"], use_fast=config.get("use_fast", False), cache_dir=config.get("cache_dir", None),
)
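As a sketch of how the suggested general-purpose call reads the nested config, the loading logic can be factored over an injected loader function. Here load_tokenizer and stub_loader are hypothetical names, and the stub stands in for transformers' AutoTokenizer.from_pretrained so the example is self-contained; the tokenizer path is an invented placeholder:

```python
def load_tokenizer(config, from_pretrained):
    # single general-purpose path: the rinna special case collapses into a
    # config-driven use_fast flag (defaulting to False, like the old branch)
    return from_pretrained(
        config["tokenizer"]["tokenizer_name_or_path"],
        use_fast=config.get("use_fast", False),
        cache_dir=config.get("cache_dir", None),
    )

def stub_loader(name_or_path, use_fast, cache_dir):
    # stand-in for AutoTokenizer.from_pretrained; just records its arguments
    return {"name": name_or_path, "use_fast": use_fast, "cache_dir": cache_dir}

# hypothetical config mirroring the nested layout of sample.yaml
config = {"tokenizer": {"tokenizer_name_or_path": "rinna/example-tokenizer"}}
tok = load_tokenizer(config, stub_loader)
```

With use_fast left out of the config, the call behaves like the original rinna branch (use_fast=False), so no model-name check is needed.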
I thought I had already committed and pushed this change, but I had actually only staged it and forgot to commit and push.
0691ffa
Thanks for the review!
# TODO: clean up commented-out lines
# # Data expected in prompt response pairs
# if os.path.exists(cache_name + "inputids.pt"):
#     print("using cached dataset")
#     self.input_ids = torch.load(cache_name + "inputids.pt")
#     self.attn_masks = torch.load(cache_name + "attnmask.pt")
#     self.labels = torch.load(cache_name + "inputids.pt")
#     return
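Restoring the cache feature the comment below asks for might look roughly like this. It is a sketch only: CachedDataset is a hypothetical name, pickle stands in for torch.save/torch.load so the example is self-contained, and the file naming mirrors the commented-out code:

```python
import os
import pickle
import tempfile

class CachedDataset:
    def __init__(self, cache_name, build_fn):
        # reuse a cached encoding when present; otherwise build and save it
        path = cache_name + "inputids.pkl"
        if os.path.exists(path):
            # cache hit: skip the (expensive) tokenization step entirely
            with open(path, "rb") as f:
                self.input_ids = pickle.load(f)
        else:
            self.input_ids = build_fn()
            with open(path, "wb") as f:
                pickle.dump(self.input_ids, f)

with tempfile.TemporaryDirectory() as d:
    cache = os.path.join(d, "ja_alpaca_")
    ds1 = CachedDataset(cache, lambda: [1, 2, 3])  # builds and writes cache
    ds2 = CachedDataset(cache, lambda: [])         # hits the cache instead
```

The second construction never calls its build function, which is also what makes cached runs convenient for debugging.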
Oh, yep, maybe clean it up a little and add back the cache feature. Right now the cache_name param is not used, and the cache would be helpful for debugging purposes 👍
finetune/finetune_base_ja.py
Outdated
# remove parenthesis that might be introduced by some NMTs
new_example = {}
for k, v in example.items():
    if example[k].startswith("「") and example[k].endswith("」"):
JFYI, when I ran the program, this if statement was never triggered.
That's right. I've already updated the dataset on HF by applying this processing, so we don't need this functionality here anymore.
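For reference, a complete version of the bracket-stripping step under discussion might look like the sketch below, even though the preprocessing now lives in the HF dataset. The function name strip_nmt_brackets is hypothetical:

```python
def strip_nmt_brackets(example):
    # remove corner brackets 「…」 that some machine-translation systems
    # wrap around a translated field; other fields pass through unchanged
    new_example = {}
    for k, v in example.items():
        if v.startswith("「") and v.endswith("」"):
            new_example[k] = v[1:-1]
        else:
            new_example[k] = v
    return new_example

cleaned = strip_nmt_brackets({"output": "「こんにちは」", "input": "hello"})
```

Since the dataset on HF was regenerated with this applied, the if branch never fires at training time, which matches the observation above.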
### 応答:
{response}"""

NO_INPUT_PROMPT = """以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。
@fujiki-saij I think we should modify this instruction: there is no 文脈のある入力 ("input with context"), so the phrase 文脈のある入力の組み合わせ is incorrect, and we might get better results with an improved instruction.
OK. Let me add prompt template versioning functionality to experiment with different templates (in this PR or another PR).
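One way the versioning idea could be sketched is a small registry keyed by version, so different templates can be compared in experiments. Everything below is hypothetical (the registry name, the version keys, and the "v2" wording, which simply drops the mention of 文脈のある入力 per the review comment):

```python
# hypothetical registry mapping a version key to a no-input prompt template
PROMPT_TEMPLATES = {
    "v1": (
        "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。"
        "要求を適切に満たす応答を書きなさい。\n\n### 指示:\n{instruction}\n\n### 応答:\n"
    ),
    # v2 removes the incorrect mention of 文脈のある入力
    "v2": (
        "以下は、タスクを説明する指示です。"
        "要求を適切に満たす応答を書きなさい。\n\n### 指示:\n{instruction}\n\n### 応答:\n"
    ),
}

def build_prompt(instruction, version="v2"):
    # pick a template by version so runs with different templates are comparable
    return PROMPT_TEMPLATES[version].format(instruction=instruction)
```

Logging the version key alongside each run would then make it easy to attribute result differences to the template.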
force-pushed from 4a6f0bb to bed0b5a
force-pushed from bed0b5a to 7d9df8d
Get EOS ID from SentencePiece directly
TODOs