FIM implementation question #932

Closed
liddalidd opened this issue May 12, 2023 · 3 comments
Labels
feature request New feature or request

Comments

@liddalidd

Hello, maintainers and contributors,

I hope you are all doing well. I am writing this issue to inquire about the FIM implementation in this repository. I noticed that the implementation differs from the one described in the original paper.

The FIM paper mentions that the joint 50-50 SPM+PSM mixture may be more efficient and flexible than SPM mode alone. However, this FIM-clean branch implements only SPM mode, with
<SUF> ◦ Enc(suffix) ◦ <PRE> ◦ Enc(prefix) ◦ <MID> ◦ Enc(middle) ◦ <EOT>
instead of
<PRE> ◦ <SUF> ◦ Enc(suffix) ◦ <MID> ◦ Enc(prefix) ◦ Enc(middle) ◦ <EOT>
Is there a specific reason for these changes, such as improvements in performance or stability?
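
For concreteness, here is a minimal sketch of the two orderings in question, using a placeholder tokenizer `enc` and placeholder sentinel ids (this is illustrative, not code from the branch):

```python
# Minimal sketch of the two orderings being compared (not code from this branch).
# `enc` is assumed to be any tokenizer returning a list of token ids, and
# PRE, MID, SUF, EOT are assumed to be the integer ids of the FIM sentinel tokens.

def spm_as_in_this_branch(enc, prefix, middle, suffix, PRE, MID, SUF, EOT):
    """<SUF> Enc(suffix) <PRE> Enc(prefix) <MID> Enc(middle) <EOT>"""
    return [SUF] + enc(suffix) + [PRE] + enc(prefix) + [MID] + enc(middle) + [EOT]

def spm_as_in_the_paper(enc, prefix, middle, suffix, PRE, MID, SUF, EOT):
    """<PRE> <SUF> Enc(suffix) <MID> Enc(prefix) Enc(middle) <EOT>"""
    return [PRE, SUF] + enc(suffix) + [MID] + enc(prefix) + enc(middle) + [EOT]
```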

I appreciate your time and work on this project, and I'm looking forward to understanding the rationale behind the differences in the implementation.

Thank you in advance for your help and support.

Best regards,

liddalidd added the feature request label on May 12, 2023
@liddalidd
Author

Still looking for your reply, @haileyschoelkopf. Thanks in advance.

@haileyschoelkopf
Contributor

Hi! Sorry for the delay on this.

At the time I did not thoroughly compare the different variants implemented in the paper. My rationale for not implementing the joint PSM+SPM mixture was that at inference time, one's interface to infilling (e.g. when serving a model with FIM capabilities) can be controlled behind the scenes; no reason pops out to me why one would sometimes want PSM mode and sometimes SPM mode at inference time. I suppose one mode might allow more attention to the prefix vs. the suffix than the other, but I am not certain this is likely or desirable. Does this make sense to you?
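
To illustrate that point, here is a hypothetical sketch (not this repository's API) of a serving-side infilling helper that always assembles an SPM-ordered prompt, so the mode never surfaces to the caller; `generate`, `enc`, and `dec` are placeholder names:

```python
# Hypothetical sketch: the server accepts (prefix, suffix) and always builds an
# SPM-ordered prompt behind the scenes; the caller never chooses PSM vs. SPM.
# `generate`, `enc`, and `dec` are placeholders, not functions from this repo.
def infill(generate, enc, dec, prefix, suffix, PRE, MID, SUF, EOT):
    prompt = [SUF] + enc(suffix) + [PRE] + enc(prefix) + [MID]
    middle_ids = generate(prompt, stop_token=EOT)  # assumed generate() signature
    return dec(middle_ids)
```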

Some other information that may interest you:

  • I helped BigCode expand my implementation a bit, and it was used to train SantaCoder (paper: https://arxiv.org/abs/2301.03988, code: https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/megatron/data/gpt_dataset.py). This code should implement the joint SPM + PSM mixture as done in the paper, because BigCode's intent was to maximally de-risk FIM training and confirm that the minor performance hit was the same under settings identical to the FIM paper's recommendation (a rough sketch of such a mixture follows this list).
  • FIM still caused a minor performance hit on HumanEval left-to-right performance in SantaCoder, about the same-sized drop as I observed in my own SPM experiments.
  • I don't know how thoroughly UL2-style FIM training has been evaluated, or whether InCoder's Causal Masking and FIM training have been compared head-to-head. It may be that one of these styles is better downstream or causes less of a performance drop, but I'm not aware of work that fairly compares them, though UL2 was used in the new CodeGen2.
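
For reference, here is a rough sketch of how a joint 50-50 PSM+SPM mixture could be applied at data-construction time. This is assumed behaviour for illustration, not the BigCode implementation; `enc`/`dec`, the sentinel ids, and the rate parameters are placeholders:

```python
import numpy as np

# Rough illustrative sketch of a joint PSM+SPM mixture (assumed behaviour, not
# the BigCode code): each document is FIM-transformed with probability
# `fim_rate`; a transformed document uses SPM ordering with probability
# `spm_rate` (0.5 for the 50-50 mixture) and PSM ordering otherwise.
# <EOT>/document-separator handling is omitted for brevity.
def maybe_fim(tokens, enc, dec, rng, PRE, MID, SUF, fim_rate=0.9, spm_rate=0.5):
    if rng.random() >= fim_rate:
        return tokens  # leave the document in ordinary left-to-right form
    text = dec(tokens)
    i, j = sorted(rng.integers(0, len(text) + 1, size=2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    if rng.random() < spm_rate:
        # SPM ordering: suffix, then prefix, then middle
        return [SUF] + enc(suffix) + [PRE] + enc(prefix) + [MID] + enc(middle)
    # PSM ordering: prefix, then suffix, then middle
    return [PRE] + enc(prefix) + [SUF] + enc(suffix) + [MID] + enc(middle)

# Example usage with a toy byte-level "tokenizer" and placeholder sentinel ids:
rng = np.random.default_rng(0)
enc = lambda s: list(s.encode("utf-8"))
dec = lambda ids: bytes(ids).decode("utf-8")
sample = maybe_fim(enc("def add(a, b):\n    return a + b\n"), enc, dec, rng,
                   PRE=50281, MID=50282, SUF=50283)
```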

@liddalidd
Author

Thank you very much for your detailed and informative response! It has greatly helped me in understanding the issue and resolving my problem.
