Evolutionary Merging Method #207
https://arxiv.org/abs/2403.13187 I was just about to link this paper in case @cg123 hadn't seen it yet! I'd be interested to know what the norm was for the matrix
Also, if anybody is interested in extending their ideas: for very homogeneous sets of parameters (like theirs), SPSA (https://en.m.wikipedia.org/wiki/Simultaneous_perturbation_stochastic_approximation) is worth a look, and it has already been shown to work quite well with LLMs/transformers: https://arxiv.org/abs/2305.17333 It's sad SPSA was named so badly, as unless you are dealing with very noisy evaluations and/or heterogeneous parameters, it nearly always outperforms CMA-ES and similar algorithms (Cross-Entropy Method, etc.).
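For anyone who wants to see the mechanics, here is a minimal SPSA sketch in plain NumPy. It is not taken from any of the linked papers; `evaluate` is an assumed black-box function returning a scalar loss for a flat vector of merge coefficients, and the gain-schedule constants are just textbook defaults.

```python
import numpy as np

def spsa_minimize(evaluate, theta, iters=100, a=0.1, c=0.1, alpha=0.602, gamma=0.101, seed=0):
    """Hedged sketch of SPSA: two loss evaluations per step estimate the whole gradient."""
    rng = np.random.default_rng(seed)
    for k in range(iters):
        ak = a / (k + 1) ** alpha                           # step-size schedule
        ck = c / (k + 1) ** gamma                           # perturbation-size schedule
        delta = rng.choice([-1.0, 1.0], size=theta.shape)   # Rademacher perturbation
        # (y+ - y-) / (2 * ck * delta_i); since delta_i is +/-1, dividing by it equals multiplying.
        g_hat = (evaluate(theta + ck * delta) - evaluate(theta - ck * delta)) / (2 * ck) * delta
        theta = theta - ak * g_hat
    return theta

# Toy usage: minimise a quadratic in 8 dimensions.
theta_opt = spsa_minimize(lambda t: float(np.sum((t - 0.5) ** 2)), np.zeros(8))
```

The appeal for merge-coefficient search is that each iteration costs two full evaluations regardless of how many parameters are being tuned.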
Hi @jukofyork, I am from the Arcee-ai R&D team. I like your ideas, and I wonder whether we can discuss them further. We are happy to collaborate on this :) |
@mkshing @hardmaru - have you considered open-sourcing your algorithm code in addition to the evals in https://github.com/SakanaAI/evolutionary-model-merge? The community would greatly benefit from iterating on this together.
I'm also interested in reproducing Evolutionary Model Merge, if there is a group forming around this I'd love to be included. |
Hi @shamanez, sorry I missed your message. I'm only really into this out of interest - I lack the compute to try anything serious and am not very motivated to learn Python and PyTorch :) I'm happy to outline some other ideas I've had for anyone who's interested (mostly regarding ensembles):
It's very hard to explain (3) here, but I'm pretty sure it is quite possible to implement, and I'm happy to try to better explain the idea to someone seriously interested, using some online whiteboard or something... If you have ever played online poker and have seen the graphs of "expected profit" that smooth out the all-in randomness: I was the first person to implement this around 15 years ago, if that means anything, so I do have experience dealing with stuff like this. I can't be certain it will work, but from exploring the different tokenizers for some of the common models, it looks like most have a large overlap and often just break up longer words differently ("INSTRUCTION" vs "INST" + ...).

One final thing with regard to (3), also related to poker game trees and Beam Search: when people first tried to roll out expected-value calculations into the next hand(s), they tried to just roll out the maximum-probability successors at each node. This had the effect of creating biased probability estimates where the most likely sequence of action was: fold, fold, fold, etc. The current problem of Beam Search for LLMs creating weird/repeating sequences is likely a manifestation of this same problem! The solution in the case of poker was to roll out the trees probabilistically, so that low-probability successors had a chance of being rolled out near the top of the tree too. So even if the idea of merging tokenizers in (3) doesn't work, there may well be other interesting avenues of research to explore regarding tokenizer trees.

One other idea that I think might be worth exploring (although unrelated to merging or ensembles) would be to look at tensor decomposition of instruction-tuned versions of models, in a similar way to how https://github.com/thomasgauthier/LoRD uses SVD to create LoRAs from existing models. If you search for "randomised tensor decomposition" or "power iteration tensor decomposition" there are lots of recently developed algorithms for this, e.g.: https://arxiv.org/abs/1703.09074 There is a ton of research on this and two main variations on the decomposition: "CP", which is like SVD with the singular values multiplied into U or V (i.e. similar to the matrix decompositions created by SGD, often used in recommendation systems), and "Tucker", which is like SVD with an actual core tensor similar to S in SVD. The latter might be the most interesting if we want to explore the "dimensions" in a similar way to PCA, etc. If it is computationally feasible, it would be really interesting to see if the decomposition could separate off "instruction following" from "alignment" dimensions, and could possibly even improve the performance of the models in the process!
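To make the LoRD-style starting point concrete, here is a minimal sketch of extracting a low-rank LoRA-like pair from the delta between a tuned and a base weight matrix via truncated SVD. This is not the LoRD API; `w_base` and `w_tuned` are assumed to be matching torch weight matrices from the base and instruction-tuned checkpoints.

```python
import torch

def extract_lora_pair(w_base: torch.Tensor, w_tuned: torch.Tensor, rank: int = 16):
    """Hedged sketch: rank-r SVD factorisation of the fine-tuning delta."""
    delta = (w_tuned - w_base).float()
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    lora_a = u[:, :rank] * s[:rank]   # (out_features, rank); singular values folded in
    lora_b = vh[:rank, :]             # (rank, in_features)
    return lora_a, lora_b             # delta ≈ lora_a @ lora_b
```

Extending this from per-matrix SVD to a proper CP or Tucker decomposition across stacked layers is where the randomised/power-iteration algorithms mentioned above would come in.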
(With regard to the OP.) One thing that bothers me about the general idea of merging (and fine-tuning) models is: exactly what are we trying to optimise? Metrics like perplexity make sense if we just want to optimise the base models' auto-regressive ability to predict the next token, but it seems far from clear what we should be optimising for if we want to merge instruction-tuned models, etc. It's quite easy to tell subjectively that you have destroyed a model or lessened its context, but to actually optimise it we really need some kind of fine-grained objective measure to test against... The actual details of the optimization algorithm are really secondary to having a good optimization criterion to work against, IMO.
@jukofyork Thanks for your ideas. These are great. Do you have any lead on their use of an evolutionary algorithm to optimize merging in the weight space and the data flow space? @jukofyork @NolanGC @codelauncher444 I am happy to arrange a meeting. We also have some ideas on how to start this. [email protected]
There are a couple of ideas I can think of related to serial dependency:
http:https://demo.cs.brandeis.edu/papers/icga95.pdf So the basic idea would be to maintain a "beam" of partial solutions (instead of a population or distribution of full solutions), extend each partial solution by a "block" (somehow), evaluate how well the extended solutions allow for continuation (somehow), and then update the beam and move one more level down... This should in theory optimise towards nice flat minima and allow you to better control the exploration compute (as opposed to quickly getting attracted to a sharp/inescapable minimum). I've no idea how well it would work in practice, but it would definitely optimise towards a different set of minima that could potentially be interesting and/or complementary (for ensembles, etc).
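A rough sketch of that beam-over-partial-solutions loop, just to pin the idea down: `candidate_blocks` and `continuation_score` are hypothetical stand-ins (not existing mergekit functions) for "extend by a block, somehow" and "evaluate how well it allows continuation, somehow".

```python
def beam_over_blocks(num_blocks, candidate_blocks, continuation_score, beam_width=8):
    """Hedged sketch: keep the most extensible partial recipes instead of full solutions."""
    beam = [[]]                                     # start from the empty partial recipe
    for _ in range(num_blocks):
        extended = [prefix + [block]
                    for prefix in beam
                    for block in candidate_blocks(prefix)]
        # Rank by estimated continuation quality rather than greedy immediate fitness,
        # so the search is pulled towards flatter, more extensible regions.
        extended.sort(key=continuation_score, reverse=True)
        beam = extended[:beam_width]
    return beam[0]                                  # best full recipe found
```

The beam width is where the control over exploration compute comes from: widening it trades compute for a lower chance of being captured early by a sharp minimum.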
We (at Arcee) are working on getting hold of Sakana to join in on this. @jukofyork there certainly might be something more direct like (1) - I know this is a view also held by @thomasgauthier and @shamanez. It seems to me like the best course might be for @cg123 to suggest how we might mock the methods + SDK interface, and then we should all kick off a branch together off of that.
[Skip to tl;dr to save time] I also have a similar unreleased script that optimizes for the lowest perplexity score vs. the PTB dataset through randomized inter-layer % merging between two models, with light sampling and a reduced context (CTX) for speedier cycle completion. This script requires some refining, as my last archive of it isn't where it had been with the latest build. The idea is to adapt it to allow any custom dataset to be optimized against. Eventually, random search would be hybridized with a less random method: Pearson correlation (pearsonr) can theoretically use the samples of % merges per layer vs. the sampled perplexity scores of every cycle, plot them as data points, and calculate a more nearly optimal, lowest-perplexity layer merge ratio set, treating the original method as calibration sampling. Though I need to double-check that Pearson correlation is the best method for that. tl;dr, SakanaAI
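For what it's worth, here is a minimal sketch of that calibration-sampling idea: random per-layer ratios scored by perplexity, with per-layer Pearson correlations as the diagnostic. `merge_by_ratios` and `perplexity` are hypothetical stand-ins for the unreleased script's internals, not mergekit APIs.

```python
import numpy as np
from scipy.stats import pearsonr

def calibrate_layer_ratios(merge_by_ratios, perplexity, num_layers, samples=32, seed=0):
    """Hedged sketch: random search over per-layer merge ratios plus correlation diagnostics."""
    rng = np.random.default_rng(seed)
    ratios = rng.uniform(0.0, 1.0, size=(samples, num_layers))    # random % merge per layer
    ppls = np.array([perplexity(merge_by_ratios(r)) for r in ratios])
    # Per-layer correlation between ratio and perplexity: negative suggests pushing that
    # layer's ratio up lowers perplexity, positive suggests the opposite.
    corrs = np.array([pearsonr(ratios[:, i], ppls)[0] for i in range(num_layers)])
    best = ratios[np.argmin(ppls)]                                # best purely random draw
    return best, corrs
```

Whether a linear correlation is the right summary (as opposed to, say, fitting a simple regression or surrogate model per layer) is exactly the open question raised above.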
Much appreciation for everyone working together on this! I was excited when I saw SakanaAI's paper and I hoped that the fine people behind the mergekit project would tackle implementing it. Thank you for all your hard work on this project. The open LLM ecosystem wouldn't be nearly as diverse or interesting right now if not for your efforts. |
Just saw this linked on Hacker News (it's from February, but somehow it never showed up when searching for LLMs + ensembles, etc.): https://arxiv.org/abs/2402.05120 This already shows quite large improvements and is using a very crude measure of similarity: https://aclanthology.org/P02-1040.pdf To put this into context: using n-grams like this is like trying to decide which action to take in a game of poker where all you get to see is short snippets of action between 2-3 players, ignoring all that went before and all that went after! Imagine trying to judge the EV of your actions in a poker hand without knowing what the player after you did (it's not 100% like this, as there are likely lots of "semantic transpositions" in the word order; in a similar way to multiple chess moves leading to the same position).

If anybody is seriously interested in exploring tree-rollout based ensembles then I'm definitely interested too! The biggest hurdle to me is that everything is written in Python, which in turn relies so much on opaque libraries underneath. To get this to work, the very first thing is going to be investigating customised code to allow the KV cache to be rolled back/forward/stored, etc. efficiently. I think a more stable (even semi-dead) project that just implements the CUDA kernels in unquantized float16 or float32 would probably be best - it needs to be reasonably efficient but not necessarily targeted at running large quantized models to start with. If that's not possible then I think even a crappy proof-of-concept CPU-based implementation might be worthwhile: if it shows promise using small sub-7b models then it will likely attract more interest anyway and expand from there.
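Just to show the kind of KV-cache checkpointing a tree rollout needs, here is a minimal sketch using the Hugging Face transformers API with a small model. Assumptions: a decoder-only CausalLM, and that deep-copying the returned `past_key_values` is an acceptable (if memory-hungry) way to snapshot a node; a serious implementation would manage the cache tensors directly.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # tiny model, just to illustrate the mechanics
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def step(input_ids, past=None):
    out = model(input_ids=input_ids, past_key_values=past, use_cache=True)
    return out.logits[:, -1, :], out.past_key_values

prompt_ids = tok("The quick brown", return_tensors="pt").input_ids
logits, past = step(prompt_ids)          # fill the cache with the prompt
checkpoint = copy.deepcopy(past)         # snapshot the cache at this tree node

# Roll out the most likely continuation from this node...
branch_a = logits.argmax(-1, keepdim=True)
_, past_a = step(branch_a, past)

# ...then rewind to the checkpoint and roll out the second-most-likely token instead.
branch_b = logits.topk(2, dim=-1).indices[:, 1:2]
_, past_b = step(branch_b, copy.deepcopy(checkpoint))
```

This is obviously far too wasteful for real rollouts (every checkpoint copies the whole cache), which is exactly why a codebase with direct, simple control over the cache tensors would help.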
I'm pretty sure I read a paper a few weeks ago about trying to shuffle the order of the blocks during training to break the serial dependency and encourage them to adapt to be more independent of their surrounding blocks, but I can't seem to find it and don't seem to have saved it anywhere :( I think the paper and/or the architecture they created had "sort" in the title, and it was likely linked to this paper:
This is the paper: https://arxiv.org/abs/2309.08968 I haven't had time to read it again, but I can't see anything about shuffling the block order, so it could actually be a paper it references that mentioned this (I see it references "LayerDrop", etc.).
Hi, this might be a bit of a late reply, but I've been working on something similar in my repository, which you can find here: https://github.com/Aratako/Task-Vector-Merge-Optimzier |
Hello, if anyone is actively working on the problem - I'd be happy to join in |
Hi! Please consider implementing the Evolutionary Merging Method.