Meta MegaByte model experiments.

Based on the MEGABYTE-pytorch repository by lucidrains and nanoGPT by Andrej Karpathy. All training runs are publicly available on Neptune.ai.
- Create training script
- Add Neptune.ai logging
- Add cosine learning rate scheduler (see the sketch below)
- Add weight initialisation based on MegaByte paper
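A minimal sketch of the warmup-plus-cosine schedule, in the style of nanoGPT's scheduler. The step counts and LR bounds below are illustrative values taken from the runs described later, not the exact per-run settings:

```python
import math

def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=2000, total_steps=25000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Applied manually each step:
# for group in optimizer.param_groups:
#     group["lr"] = get_lr(step)
```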
```python
DIM_HEAD = 64
HEADS = 8
NUM_TOKENS = 256
DIM = (768, 512, 256)
DEPTH = (6, 4, 2)
MAX_SEQ_LEN = (512, 4, 4)
FLASH_ATTN = False
```
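These hyperparameters map directly onto the `MEGABYTE` constructor from MEGABYTE-pytorch; a sketch of the setup, following that repository's README (the input shape for the three-stage case is an assumption extrapolated from its two-stage example):

```python
import torch
from MEGABYTE_pytorch import MEGABYTE

model = MEGABYTE(
    num_tokens = 256,           # byte-level vocabulary
    dim = (768, 512, 256),      # model width per stage, global to local
    depth = (6, 4, 2),          # transformer layers per stage
    max_seq_len = (512, 4, 4),  # patch lengths; 512 * 4 * 4 = 8192-byte context
    dim_head = 64,
    heads = 8,
    flash_attn = False
)

x = torch.randint(0, 256, (1, 512, 4, 4))  # one full-context batch of bytes
loss = model(x, return_loss = True)        # autoregressive cross-entropy
loss.backward()
```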
- Baseline: Lucidrains' defaults plus cosine LR (Neptune: MEG-16)
  - More or less the same as with no LR schedule; very slightly worse.
- Double Batch Size from 4 to 8 (Effective 16 to 32)
  - Naive Attempt: Slightly worse convergence (Neptune: MEG-18)
  - LR (6e-4 decaying to 6e-5):
    - (BS := 8, Neptune: MEG-19) Faster and better convergence per training step. However, validation loss vs wall-clock time is worse; a clear trade-off.
    - (BS := 4, Neptune: MEG-20) Learning rate too high for this batch size: training loss plateaus and then worsens, while validation loss stays flat.
- AMP (Neptune: MEG-21). This run fixed the loss scaling and achieves the best validation loss at 1.29, but gradients eventually exploded at around step 7,000. Better to use 3e-4 rather than 6e-4 as the maximum learning rate (see the AMP sketch after this list).
  - Attempted Continuation (Neptune: MEG-22). Unsuccessful; either the best performance had already been reached, or it failed because the optimiser state wasn't restored.
- 6e-4 with Batch Size 20 (Effective 80) on an A10 24GB on Lambda Cloud (Neptune: MEG-32). Appears to converge as well as, if not better than, any other attempt in wall-clock time, and is much better per step. However, training blew up unexpectedly before finishing; MEG-34 details the continuation attempt, which failed because the optimiser state wasn't preserved alongside the weights, so Adam's moving-average moment estimates were effectively reset (see the checkpointing note in the sketch after this list).
- 6e-4 with Batch Size 40 (Effective 160) on an A100 40GB on Lambda Cloud (Neptune: MEG-39). Same behaviour as the A10 run but with even better performance against wall-clock time.
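A minimal sketch of the AMP training step with dynamic loss scaling, gradient clipping, and a checkpoint that saves the optimiser state alongside the weights (the lesson from the MEG-22/MEG-34 continuation failures). The Neptune logging key and clip threshold are illustrative assumptions:

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling for fp16

def train_step(model, optimizer, batch, run=None, clip=1.0):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch, return_loss=True)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale so clipping sees the true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    scaler.step(optimizer)      # skips the update if inf/nan gradients appear
    scaler.update()
    if run is not None:
        run["train/loss"].append(loss.item())  # Neptune.ai logging (key assumed)
    return loss.item()

def save_checkpoint(model, optimizer, scaler, step, path):
    # Restoring only the model weights resets Adam's moment estimates;
    # save the optimiser (and scaler) state so continuations resume correctly.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scaler": scaler.state_dict(),
        "step": step,
    }, path)
```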
```python
DIM_HEAD = 64
HEADS = 16
NUM_TOKENS = 256
DIM = (1024, 512, 256)
DEPTH = (24, 4, 2)  # (12, 4, 2)
MAX_SEQ_LEN = (512, 4, 4)
FLASH_ATTN = False
```
NOTES:
- Trying the SophiaG optimiser instead of Adam (see the sketch after this list)
- Trying RedPajama-Data-1T-Sample instead of enwik8
- Batch Size := 20 on A6000 48GB, LR=3e-4 (Neptune: MEG-67)
  - Training Continuation (lr=3e-4 was too high for the larger 270M model vs the 52M model; continuing from the 1,000-step checkpoint at 2e-4 with 0 warmup on continuation) (Neptune: MEG-72)
  - Model gradients exploded because the 3e-4 learning rate was too high. Should just use the original 2e-4 as suggested in other forks and in the original paper. Possibly SophiaG also dislikes a higher LR here.
- Same as above but switching the LR back to 2e-4, attempting a full 1-epoch run over 12 hours (Neptune: MEG-80). This lasted until around step 2,000 before gradients exploded. Needs a longer warmup; a high starting LR is fine but it needs to decay more.
- Same as above but using the Adam optimiser (Neptune: MEG-83). SophiaG is just better overall, with faster convergence across the board, but it needs some tuning.
- Try 1e-4 LR (Neptune: MEG-87). SophiaG seems to be very sensitive to the LR, or maybe to model size; something is very sensitive, as the gradients always explode. This run plateaued much later in training, suggesting the gradient issue is related either to the LR or to some other aspect of SophiaG.
- Changed rho to 0.05, changed grad clip to 1.0, changed the LR back to 2e-4 (decaying to 2e-5), still with a warmup of 2,000 steps (Neptune: MEG-93). Made no difference; switching back to Adam.
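For reference, a sketch of the SophiaG loop following the reference implementation's README (github.com/Liuhong99/Sophia). The Hessian-update interval, the logits-returning call, and the reuse of `model`/`loader` from the setup above are assumptions:

```python
import torch
import torch.nn.functional as F
from sophia import SophiaG  # reference implementation: github.com/Liuhong99/Sophia

# model and loader come from the training setup above
optimizer = SophiaG(model.parameters(), lr=2e-4, betas=(0.965, 0.99),
                    rho=0.04, weight_decay=1e-1)  # MEG-93 also tried rho=0.05

k = 10  # update the Hessian EMA every k steps
for step, batch in enumerate(loader):
    if step % k != k - 1:
        loss = model(batch, return_loss=True)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step(bs=batch.numel())  # bs = tokens per step, per the README
        optimizer.zero_grad(set_to_none=True)
    else:
        # Gauss-Newton-Bartlett Hessian estimate: backprop a loss against
        # labels sampled from the model's own logits, then fold it into
        # the optimiser's Hessian EMA.
        logits = model(batch)  # assumes logits are returned when return_loss=False
        y_sample = torch.distributions.Categorical(logits=logits).sample()
        loss_sampled = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                       y_sample.view(-1))
        loss_sampled.backward()
        optimizer.update_hessian()
        optimizer.zero_grad(set_to_none=True)
```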
- Initial Training Run (Neptune: MEG-142, MEG-143, MEG-144, MEG-147, MEG-149, MEG-151, MEG-157). Very successful, using a 512 sequence length and the GPT-Neo tokeniser (see the data-loading sketch below).
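A sketch of the data side of this run, using the standard Hugging Face IDs for the GPT-Neo tokeniser and the RedPajama sample; the preprocessing details are assumptions:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")

# ~1B-token sample of the RedPajama corpus
dataset = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train")

def tokenize(example):
    # truncate to the 512-token sequence length used in these runs
    return tokenizer(example["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)
```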
megabyte_25k_1.2836014032363892.pt
MegaByte model trained on enwik8 for 25k batches (~36.44 passes over the dataset). Produces text as seen below. The model is starting to spell words correctly, learn 2-/3-gram sequences, etc., and is still converging. Trained for just over 12 hours on an RTX 3060 Ti (full training on an RTX 3060 Ti would take roughly 50 hours). Final validation loss is ~1.28 with a perplexity of 3.59.
Validation loss: 1.2836014032363892
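For reference, converting that loss (assuming it is the mean cross-entropy in nats per byte) to perplexity and bits-per-byte:

```python
import math

val_loss = 1.2836014032363892
print(f"perplexity:    {math.exp(val_loss):.2f}")      # e**loss
print(f"bits per byte: {val_loss / math.log(2):.3f}")  # nats to bits
```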