FIM implementation question #932

Closed
liddalidd opened this issue May 12, 2023 · 3 comments
Labels
feature request New feature or request

Comments

@liddalidd

Hello, maintainers and contributors,

I hope you are all doing well. I am writing this issue to inquire about the FIM implementation in this repository. I noticed that the implementation differs from the one described in the original paper.

The FIM paper mentions that the joint 50-50 SPM+PSM mixture may be more efficient and flexible than SPM mode alone. However, this FIM-clean branch implements only SPM mode, with
<SUF> ◦ Enc(suffix) ◦ <PRE> ◦ Enc(prefix) ◦ <MID> ◦ Enc(middle) ◦ <EOT>
instead of
<PRE> ◦ <SUF> ◦ Enc(suffix) ◦ <MID> ◦ Enc(prefix) ◦ Enc(middle) ◦ <EOT>
Is there a specific reason for these changes, such as improvements in performance or stability?
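
For concreteness, here is a minimal sketch of the two orderings in question, using a placeholder tokenizer `enc` and placeholder sentinel ids (this is illustrative, not code from the branch):

```python
# Minimal sketch of the two orderings being compared (not code from this branch).
# `enc` is assumed to be any tokenizer returning a list of token ids, and
# PRE, MID, SUF, EOT are assumed to be the integer ids of the FIM sentinel tokens.

def spm_as_in_this_branch(enc, prefix, middle, suffix, PRE, MID, SUF, EOT):
    """<SUF> Enc(suffix) <PRE> Enc(prefix) <MID> Enc(middle) <EOT>"""
    return [SUF] + enc(suffix) + [PRE] + enc(prefix) + [MID] + enc(middle) + [EOT]

def spm_as_in_the_paper(enc, prefix, middle, suffix, PRE, MID, SUF, EOT):
    """<PRE> <SUF> Enc(suffix) <MID> Enc(prefix) Enc(middle) <EOT>"""
    return [PRE, SUF] + enc(suffix) + [MID] + enc(prefix) + enc(middle) + [EOT]
```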

I appreciate your time and work on this project, and I'm looking forward to understanding the rationale behind the differences in the implementation.

Thank you in advance for your help and support.

Best regards,

liddalidd added the feature request label on May 12, 2023
@liddalidd
Author

Still looking for your reply, @haileyschoelkopf. Thanks in advance.

@haileyschoelkopf
Contributor

Hi! Sorry for the delay on this.

At the time I did not thoroughly compare the different variants implemented in the paper. My rationale for not implementing the joint PSM+SPM mixture was that at inference time, one's interface to infilling (e.g. when serving a model with FIM capabilities) can be controlled behind the scenes; no reason pops out to me why one would sometimes want PSM mode and sometimes SPM mode at inference time. I suppose one mode might allow more attention to the prefix vs. the suffix than the other, but I am not certain this is likely or desirable. Does this make sense to you?
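
To illustrate that point, here is a hypothetical sketch (not this repository's API) of a serving-side infilling helper that always assembles an SPM-ordered prompt, so the mode never surfaces to the caller; `generate`, `enc`, and `dec` are placeholder names:

```python
# Hypothetical sketch: the server accepts (prefix, suffix) and always builds an
# SPM-ordered prompt behind the scenes; the caller never chooses PSM vs. SPM.
# `generate`, `enc`, and `dec` are placeholders, not functions from this repo.
def infill(generate, enc, dec, prefix, suffix, PRE, MID, SUF, EOT):
    prompt = [SUF] + enc(suffix) + [PRE] + enc(prefix) + [MID]
    middle_ids = generate(prompt, stop_token=EOT)  # assumed generate() signature
    return dec(middle_ids)
```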

Some other information that may interest you:

  • I helped BigCode expand my implementation a bit, and it was used to train SantaCoder (paper: https://arxiv.org/abs/2301.03988, code: https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/megatron/data/gpt_dataset.py). This code should implement the joint SPM + PSM mixture as done in the paper, because BigCode's intent was to maximally de-risk FIM training and confirm that the minor performance hit was the same under settings identical to the FIM paper's recommendation (a rough sketch of such a mixture follows this list).
  • FIM still caused a minor performance hit on HumanEval left-to-right performance in SantaCoder, about the same-sized drop as I observed in my own SPM experiments.
  • I don't know how thoroughly UL2-style FIM training has been evaluated, or whether InCoder's Causal Masking and FIM training have been compared head-to-head. It may be that one of these styles is better downstream or causes less of a performance drop, but I'm not aware of work that fairly compares them, though UL2 was used in the new CodeGen2.
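
For reference, here is a rough sketch of how a joint 50-50 PSM+SPM mixture could be applied at data-construction time. This is assumed behaviour for illustration, not the BigCode implementation; `enc`/`dec`, the sentinel ids, and the rate parameters are placeholders:

```python
import numpy as np

# Rough illustrative sketch of a joint PSM+SPM mixture (assumed behaviour, not
# the BigCode code): each document is FIM-transformed with probability
# `fim_rate`; a transformed document uses SPM ordering with probability
# `spm_rate` (0.5 for the 50-50 mixture) and PSM ordering otherwise.
# <EOT>/document-separator handling is omitted for brevity.
def maybe_fim(tokens, enc, dec, rng, PRE, MID, SUF, fim_rate=0.9, spm_rate=0.5):
    if rng.random() >= fim_rate:
        return tokens  # leave the document in ordinary left-to-right form
    text = dec(tokens)
    i, j = sorted(rng.integers(0, len(text) + 1, size=2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    if rng.random() < spm_rate:
        # SPM ordering: suffix, then prefix, then middle
        return [SUF] + enc(suffix) + [PRE] + enc(prefix) + [MID] + enc(middle)
    # PSM ordering: prefix, then suffix, then middle
    return [PRE] + enc(prefix) + [SUF] + enc(suffix) + [MID] + enc(middle)

# Example usage with a toy byte-level "tokenizer" and placeholder sentinel ids:
rng = np.random.default_rng(0)
enc = lambda s: list(s.encode("utf-8"))
dec = lambda ids: bytes(ids).decode("utf-8")
sample = maybe_fim(enc("def add(a, b):\n    return a + b\n"), enc, dec, rng,
                   PRE=50281, MID=50282, SUF=50283)
```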

@liddalidd
Author

Thank you very much for your detailed and informative response! It has greatly helped me in understanding the issue and resolving my problem.
