Open source AI is the path forward (fb.com)
2341 points by atgctg 4 days ago | 886 comments





Related ongoing thread:

Llama 3.1 - https://news.ycombinator.com/item?id=41046540 - July 2024 (114 comments)


“The Heavy Press Program was a Cold War-era program of the United States Air Force to build the largest forging presses and extrusion presses in the world.” This “program began in 1944 and concluded in 1957 after construction of four forging presses and six extruders, at an overall cost of $279 million. Six of them are still in operation today, manufacturing structural parts for military and commercial aircraft” [1].

$279mm in 1957 dollars is about $3.2bn today [2]. A public cluster of GPUs provided for free to American universities, companies and non-profits might not be a bad idea.

[1] https://en.m.wikipedia.org/wiki/Heavy_Press_Program

[2] https://data.bls.gov/cgi-bin/cpicalc.pl?cost1=279&year1=1957...
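
The inflation adjustment is just a CPI ratio. A rough sketch in Python, using approximate CPI-U values rather than the exact BLS series the calculator uses:

    # Rough CPI adjustment: 1957 dollars -> today's dollars.
    # CPI figures are approximate averages, not the exact BLS series.
    cpi_1957 = 28.1      # approx. CPI-U average for 1957
    cpi_2024 = 314.0     # approx. CPI-U for mid-2024
    cost_1957 = 279e6    # Heavy Press Program cost in 1957 dollars

    cost_today = cost_1957 * (cpi_2024 / cpi_1957)
    print(f"${cost_today / 1e9:.1f}bn")  # ~$3.1bn, in the same ballpark as the $3.2bn above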


The National Science Foundation has been doing this for decades, starting with the supercomputing centers in the 80s. Long before anyone talked about cloud credits, the NSF has had a bunch of different programs to allocate time on supercomputers to researchers at no cost, these days mostly run out of the Office of Advanced Cyberinfrastructure. (The office name is from the early 00s) - https://new.nsf.gov/cise/oac

(To connect universities to the different supercomputing centers, the NSF funded the NSFnet network in the 80s, which was basically the backbone of the Internet in the 80s and early 90s. The supercomputing funding has really, really paid off for the USA)


> NSF has had a bunch of different programs to allocate time on supercomputers to researchers at no cost, these days mostly run out of the Office of Advanced Cyberinfrastructure

This would be the logical place to put such a programme.


The DoE has also been a fairly active purchaser of GPUs for almost two decades now thanks to the Exascale Computing Project [0] and other predecessor projects.

The DoE helped subsidize development of Kepler, Maxwell, Pascal, etc along with the underlying stack like NVLink, NGC, CUDA, etc either via purchases or allowing grants to be commercialized by Nvidia. They also played matchmaker by helping connect private sector research partners with Nvidia.

The DoE also did the same thing for AMD and Intel.

[0] - https://www.exascaleproject.org/


The DoE subsidized the development of GPUs, but so did Bitcoin.

But before that, it was video games, like quake. Nvidia wouldn't be viable if not for games.

But before that, graphics research was subsidized by the DoD, back when visualizing things in 3D cost serious money.

It's funny how technology advances.


It was really Ethereum / Alt coins not Bitcoin that caused the GPU demand in 2021.

Bitcoin moved to FPGAs/ASICs very quickly because dedicated hardware was vastly more efficient; GPUs were only viable from Oct 2010. By 2013, when ASICs came online, GPUs only made sense if someone else was paying for both the hardware and the electricity.


As you've rightly pointed out, we have the mechanism, now let's fund it properly!

I'm in Canada, and our science funding has likewise fallen year after year as a proportion of our GDP. I'm still benefiting from A100 clusters funded by taxpayer dollars, but think of the advantage we'd have over industry if we didn't have to fight over resources.


Where do you get access to those as a member of the general public?

In Australia at least, anyone who is enrolled at or works at a university can use the taxpayer-subsidised "Gadi" HPC which is part of the National Computing Infrastructure (https://nci.org.au/our-systems/hpc-systems). I also do mean anyone, I have an undergraduate student using it right now (for free) to fine-tune several LLMs.

It also says commercial orgs can get access via negotiation, I expect a random member of the public would be able to go that route as well. I expect that there would be some hurdles to cross, it isn't really common for random members of the public to be doing the kinds of research Gadi was created to benefit. I expect it is the same way in this case in Canada. I suppose the argument is if there weren't any gatekeeping at all, you might end up with all kinds of unsuitable stuff on the cluster, e.g. crypto miners and such.

Possibly another way for a true random person to get access would be to get some kind of 0-hour academic affiliation via someone willing to back you up, or one could enrol in a random AI course or something and then talk to the lecturer in charge.

In reality, the (also taxpayer-subsidised) university pays some fee for access, but it doesn't come from any of our budgets.


Australia's peak HPC has a total of: "2 nodes of the NVIDIA DGX A100 system, with 8 A100 GPUs per node".

It's pretty meagre pickings!


Well, one, it has:

> 160 nodes each containing four Nvidia V100 GPUs

and two, well, it's a CPU-based supercomputer.


I get my resources through a combination of servers my lab bought using a government grant and the Digital Research Alliance of Canada (nee Compute Canada)'s cluster.

These resources aren't available to the public, but if I were king for a day we'd increase science funding such that we'd have compute resources available to high-school students and the general public (possibly following training on how to use it).

Making sure folks didn't use it to mine bitcoin would be important, though ;)


I'm going to guess it's Compute Canada, which I don't think we non-academics have access to.

That's correct (they go by the Digital Research Alliance of Canada now... how boring).

I wish that wasn't the case though!


Yeah, the specific AI/ML-focused program is NAIRR.

https://nairrpilot.org/

Terrible name unless they low-key plan to make AI researchers' hair fall out.


The US already pays for 2+ AWS regions for the CIA/DoD. Why not pay for a region that is only available to researchers?

In the Netherlands, for instance, there is "the national supercomputer" Snellius: https://www.surf.nl/en/services/snellius-the-national-superc... I am not sure about its budget, but my impression as a user is that its resources are never fully used. At least, none of my jobs ever had to queue. I doubt that it can compete with the scale of resources that FAANG companies have available, but then again, I also doubt how research would benefit.

Sure, academia could build LLMs, and there is at least one large-scale project for that: https://gpt-nl.com/ On the other hand, this kind of model still needs to demonstrate specific scientific value that goes beyond using a chatbot for generating ideas and summarizing documents.

So I fully agree that the research budget cuts in the past decades have been catastrophic, and probably have contributed to all the disasters the world is currently facing. But I think that funding prestigious super-projects is not the best way to spend funds.


Snellius is a nice resource. A powerful Slurm based HTC cluster with different queues for different workloads (cpu/genomics, gpu/deep learning).

To access the resource I had to go through EuroCC [0], which is a network facilitating access to and exploitation of HPC/HTC infra. It is (or can be) a great competing model to US cloud providers.

As a small business I got 8 hrs of consultancy and 10k compute hours for free. I’m still learning the details but my understanding is that after that the prices are very competitive.

[0] https://www.eurocc-access.eu/


Italy built the Leonardo HPC cluster, one of the largest in the EU, created by a consortium of universities. After just over a year it's already at full capacity, and expansion plans have been brought forward because of this.

Doubtful that GPUs purchased today would be in use for a similar time scale. Govt investment would also drive the cost of GPUs up a great deal.

Not sure why a publicly accessible GPU cluster would be a better solution than the current system of research grants.


Of course they won't. The investment in the Heavy Press Program was the initial build, and just citing one example, the Alcoa 50,000 ton forging press was built in 1955, operated until 2008, and needed ~$100M to get it operational again in 2012.

The investment was made to build the press, which created significant jobs and capital investment. The press, and others like it, were subsequently operated by and then sold to a private operator, which in turn enabled the massive expansion of both military manufacturing, and commercial aviation and other manufacturing.

The Heavy Press Program was a strategic investment that paid dividends by both advancing the state of the art in manufacturing at the time it was built, and improving manufacturing capacity.

A GPU cluster might not be the correct investment, but a strategic investment in increasing, for example, the availability of training data, or interoperability of tools, or ease of use for building, training, and distributing models would probably pay big dividends.


I don't think there's a shortage of capital for AI... probably the opposite

Of all the things to expand the scope of government spending why would they choose AI, or more specifically GPUs?


There may however, be a shortage of capital for open source AI, which is the subject under consideration.

As for the why... because there's no shortage of capital for AI. It sounds like the government would like to encourage redirecting that capital to something that's good for the economy at large, rather than good for the investors of a handful of Silicon Valley firms interested only in their own short term gains.


Look at it from the perspective of an elected official:

If it succeeds, you were ahead of the curve. If it fails, you were prudent enough to fund an investigation early. Either way, bleeding edge tech gives you a W.


Or you wasted a bunch of taxpayer money on some overhyped and overfunded nonsense.

Yeah. There is a lot of overhyped and overfunded nonsense that comes out of NASA. Some of it is hype from the marketing and press teams; other hype comes from misinterpretation of releases.

None of that changes that there have been major technical breakthroughs, and entire classes of products and services that didn't exist before those investments in NASA (see https://en.wikipedia.org/wiki/NASA_spin-off_technologies for a short list). There are 15 departments and dozens of Agencies that comprise the US Federal government, many of whom make investments in science and technology as part of their mandates, and most of that is delivered through some structure of public-private partnerships.

What you see as over-hyped and over-funded nonsense could be the next groundbreaking technology, and that is why we need both elected leaders who (at least in theory) represent the will of the people, and appointed, skilled bureaucrats who provide the elected leaders with the skills, domain expertise, and experience that the winners of the popularity contest probably don't have.

Yep, there will be waste, but at least with public funds there is the appearance of accountability that just doesn't exist with private sector funds.


You'll be long gone before they find out.

Which happens every single day in every government in the world.

how would you determine that without investigation?

If it succeeds the idea gets sold to private corporations or the technology is made public and everyone thinks the corporation with the most popular version created it.

If it fails certain groups ensure everyone knows the government "wasted" taxpayer money.


> A GPU cluster might not be the correct investment, but a strategic investment in increasing, for example, the availability of training data, or interoperability of tools, or ease of use for building, training, and distributing models would probably pay big dividends

Would you mind expanding on these options? Universal training data sounds intriguing.


Sure, just on the training front, building and maintaining a broad corpus of properly managed training data with metadata that provides attribution (for example, content that is known to be human generated instead of model generated, what the source of data is for datasets such as weather data, census data, etc), and that also captures any licensing encumbrance so that consumers of the training data can be confident in their ability to use it without risk of legal challenge.

Much of this is already available to private sector entities, but having a publicly funded organization responsible for curating and publishing this would enable new entrants to quickly and easily get a foundation without having to scrape the internet again, especially given how rapidly model generated content is being published.
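
To make the attribution/licensing metadata concrete, here is a minimal sketch of what one record in such a curated corpus might carry; the field names and values are hypothetical, purely to illustrate the idea:

    from dataclasses import dataclass

    @dataclass
    class CorpusRecord:
        # Hypothetical record format for a publicly curated training corpus.
        doc_id: str            # stable identifier within the corpus
        source: str            # e.g. "census.gov", "noaa.gov", or a named crawl
        license: str           # SPDX-style tag, or "public-domain"
        human_generated: bool  # provenance: known human-written vs. model-generated
        retrieved_at: str      # ISO date the content was collected
        text: str              # the content itself

    record = CorpusRecord(
        doc_id="noaa-000001",
        source="noaa.gov",
        license="public-domain",
        human_generated=True,
        retrieved_at="2023-11-02",
        text="...",
    )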


I think the EPC (energy performance certificate) dataset in the UK is a nice example of this. Anyone can download a full dataset of EPC data from https://epc.opendatacommunities.org/

Admittedly it hasn't been cleaned all that much - you still need to put a bit of effort into that (newer certificates tend to be better quality), but it's very low friction overall. I'd love to see them do this with more datasets


If the public is going to go to all the trouble of doing something, why would that public not make it clear that there is no legal threat to using any data available?

The public is incredibly lazy, though. Don't expect them to do anything until their hand is forced, which doesn't bode well for the action to meet a desirable outcome.


there are many things i think are more capital constrained, if the government is trying to subsidize things.

> Doubtful that GPUs purchased today would be in use for a similar time scale

Totally agree. That doesn't mean it can't generate massive ROI.

> Govt investment would also drive the cost of GPUs up a great deal

Difficult to say this ex ante. On its own, yes. But it would displace some demand. And it could help boost chip production in the long run.

> Not sure why a publicly accessible GPU cluster would be a better solution than the current system of research grants

Those receiving the grants have to pay a private owner of the GPUs. That gatekeeping might be both problematic, if there is a conflict of interests, and inefficient. (Consider why the government runs its own supercomputers versus contracting everything to Oracle and IBM.)


It would be better that the government removes IP on such technology for public use, like drugs got generics.

This way the government pays $2,500 per card, not $40,000 or whatever absurd figure.


> It would be better that the government removes IP on such technology for public use, like drugs got generics.

20-25 year old drugs are a lot more useful than 20-25 year old GPUs, and the manufacturing supply chain is not a bottleneck.

There's no generics for the latest and greatest drugs, and a fancy gene therapy might run a lot more than $40k.


> better that the government removes IP on such technology for public use, like drugs got generics

You want to punish NVIDIA for calling its shots correctly? You don't see the many ways that backfires?


No. But I do want to limit the amount we reward NVIDIA for calling the shots correctly to maximize the benefit to society. For instance by reducing the duration of the government granted monopolies on chip technology that is obsolete well before the default duration of 20 years is over.

That said, it strikes me that the actual limiting factor is fab capacity not nvidia's designs and we probably need to lift the monopolies preventing competition there if we want to reduce prices.


> reducing the duration of the government granted monopolies on chip technology that is obsolete well before the default duration of 20 years is over

Why do you think these private entities are willing to invest the massive capital it takes to keep the frontier advancing at that rate?

> I do want to limit the amount we reward NVIDIA for calling the shots correctly to maximize the benefit to society

Why wouldn't NVIDIA be a solid steward of that capital given their track record?


> Why do you think these private entities are willing to invest the massive capital it takes to keep the frontier advancing at that rate?

Because whether they make 100x or 200x they make a shitload of money.

> Why wouldn't NVIDIA be a solid steward of that capital given their track record?

The problem isn't who is the steward of the capital. The problem is that the economically efficient thing to do for a single company is (given sufficient fab capacity, and a monopoly) to raise prices to extract a greater share of the pie at the expense of shrinking the size of the pie. I'm not worried about who takes the profit, I'm worried about the size of the pie.


> Because whether they make 100x or 200x they make a shitload of money.

It's not a certainty that they 'make a shitload of money'. Reducing the right tail payoffs absolutely reduces the capital allocated to solve problems - many of which are risky bets.

Your solution absolutely decreases capital investment at the margin, this is indisputable and basic economics. Even worse when the taking is not due to some pre-existing law, so companies have to deal with the additional uncertainty of whether & when future people will decide in retrospect that they got too large a payoff and arbitrarily decide to take it from them.


You can't just look at the costs to an action, you also have to look at the benefits.

Of course I agree I'm going to stop marginal investments from occurring in research into patentable technologies by reducing the expected profit. But I'm going to do so very slightly because I'm not shifting the expected value by very much. Meanwhile I'm going to greatly increase the investment into the existing technology we already have, and allow many more people to try to improve upon it, and I'm going to argue the benefits greatly outweigh the costs.

Whether I'm right or wrong about the net benefit, the basic economics here is that there are both costs and benefits to my proposed action.

And yes I'm going to marginally reduce future investments because the same might happen in the future and that reduces expected value. In fact if I was in charge the same would happen in the future. And the trade-off I get for this is that society gets the benefit of the same actually happening in the future and us not being hamstrung by unbreachable monopolies.


> But I'm going to do so very slightly because I'm not shifting the expected value by very much

I think you're shifting it by a lot. If the government can post-hoc decide to invalidate patents because the holder is getting too successful, you are introducing a substantial impact on expectations and uncertainty. Your action is not taken in a vacuum.

> Meanwhile I'm going to greatly increase the investment into the existing technology we already have, and allow many more people to try to improve upon it, and I'm going to argue the benefits greatly outweigh the costs.

I think this is a much more speculative impact. Why will people even fund the improvements if the government might just decide they've gotten too large a slice of the pie later on down the road?

> the trade-off I get for this is that society gets the benefit of the same actually happening in the future and us not being hamstrung by unbreachable monopolies.

No the trade-off is that materially less is produced. These incentive effects are not small. Take for instance, drug price controls - a similar post-facto taking because we feel that the profits from R&D are too high. Introducing proposed price controls leads to hundreds of fewer drugs over the next decade [0] - and likely millions of premature deaths downstream of these incentive effects. And that's with a policy with a clear path towards short-term upside (cheaper drug prices). Discounted GPUs by invalidating nvidia's patents has a much more tenuous upside and clear downside.

[0]: https://bpb-us-w2.wpmucdn.com/voices.uchicago.edu/dist/d/312...


> I'm going to do so very slightly because I'm not shifting the expected value by very much

You're massively increasing uncertainty.

> the same would happen in the future. And the trade-off I get for this is that society gets the benefit

Why would you expect it would ever happen again? What you want is an unrealized capital gains tax. Not to nuke our semiconductor industry.


You have proposed state ownership of all successful IP. That is a massive change and yet you have demonstrated zero understanding of the possible costs.

Your claim that removing a profit motivation will increase investment is flat out wrong. Everything else crumbles from there.


No, I've proposed removing or reducing IP protections, not transferring them to the state. Allowing competitors to enter the market will obviously increase investment in competitors...

This is already happening - it's called China. There's a reason they don't innovate in anything, and they are always playing catch-up, except in the art of copying (stealing) from others.

I do think there are some serious IP issues, as IP rules can be hijacked in the US, but that means you fix those problems, not blow up IP that was rightfully earned


> they don't innovate in anything

They are leaders in solar and EVs.

Remember how Japan leapfrogged the western car industry, and six sigma became required reading for managers in every industry?


Removing IP restrictions transfers them to the state. Grow up.

>Why wouldn't NVIDIA be a solid steward of that capital given their track record?

Past performance is not indicative of future results.


> That said, it strikes me that the actual limiting factor is fab capacity not nvidia's designs and we probably need to lift the monopolies preventing competition there if we want to reduce prices.

Lol it's not "monopolies" limiting fab capacity. Existing fab companies can barely manage to stand up a new fab in different cities. Fabs are impossibly complex and beyond risky to fund.

It's the kind of thing you'd put government money into making, but it's so risky that governments really don't want to spend billions and fail, so they give existing companies billions so that if they fail it's not the government's fault.


So, if a private company is successful, you will nationalize its IP under some guise of maximizing the benefit to society? That form of government was tried once. It failed miserably.

Under your idea, we’ll try a badly broken economic philosophy again. And while we’re at it, we will completely stifle investment in innovation.


There is no such thing as a lump-sum transfer; this will shift expectations and incentives going forward and make future large capital projects an increasingly uphill battle.

There was a post[0] on here recently about how the US went from producing woefully insufficient numbers of aircraft to producing 300k by the end of world war 2.

One of the things that the post mentioned was the meager profit margin that the companies made during this time.

But the thing is that this set the American auto and aviation industry up to rule the world for decades.

A government going to a company and saying 'we need you to produce this product for us at a lower margin than you'd like to' isn't the end of the world.

I don't know if this is one of those scenarios but they exist.

[0] https://www.construction-physics.com/p/how-to-build-300000-a...


In the case of NVIDIA it's even more sneaky.

They are an intellectual property company holding the rights on plans to make graphics cards, not even a company actually making graphics cards.

The government could launch an initiative "OpenGPU" or "OpenAI Accelerator", where the government orders GPUs from TSMC directly, without the middleman.

It may require some tweaking in the law to allow exception to intellectual property for "public interest".


y'all really don't understand how these actions would seriously harm capital markets and make it difficult for private capital formation to produce innovations going forward.

> y'all really don't understand how these actions would seriously harm capital markets and make it difficult for private capital

Reflexively, I count that harm as a feature. I don't like private capital markets because I've been screwed by private capital on multiple occasions.

But you are right: I don't understand how these actions would harm. So please do expand your concerns.


If we have public capital formation, we don’t necessarily need private capital. Private innovation in weather modelling isn’t outpacing government work by leaps and bounds, for instance.

because it is extremely challenging to capture the additional value that is being produced by better weather forecasts and generally the forecasts we have right now are pretty good.

private capital is absolutely the driving force for the vast majority of innovations since the beginning of the 20th century. public capital may be involved, but it is dwarfed by private capital markets.


It’s challenging to capture the additional value and the forecasts are pretty good because of continual large-scale government investment into weather forecasting. NOAA is launching satellites! it’s a big deal!

Private nuclear research is heavily dependent on governmental contracts to function. Solar was subsidized to heck and back for years. Public investment does work, and does make a difference.

I would even say governmental involvement is sometimes even the deciding factor, to determine if research is worth pursuing. Some major capital investors have decided AI models cannot possibly gain enough money to pay for their training costs. So what do we do when we believe something is a net good for society, but isn’t going to be profitable?


They said remove legally-enforced monopolies on what they produce. Many of these big firms made their tech with millions to billions of taxpayer dollars at various points in time. If we’ve given them millions, shouldn’t we at least get to make independent implementations of the tech we already paid for?

To the extent these are incremental units that wouldn't have been sold absent the government program, it's difficult to see how NVIDIA is "harmed".

> Those receiving the grants have to pay a private owner of the GPUs.

Along similar lines, I'm trying to build a developer credits program where I get whoever (AMD/Dell) to purchase credits on my supercomputers, which we then give away to developers to build solutions, which drives more demand for our hardware, and we commit to re-invest those credits back into more hardware. The idea is to create a win-win-win (us, them, you) developer flywheel ecosystem. It isn't a new idea at all, Nvidia and hyperscalers have been doing this for ages.


A much better investment would be to (somehow) revolutionize production of chips for AI so that it's all cheaper, more reliable, and faster to stand up new generations of software and hardware codesign. This is probably much closer to the program mentioned in the top level comment: It wasn't to produce one type of thing, but to allow better production of any large thing from lighter alloys.

> Not sure why a publicly accessible GPU cluster would be a better solution than the current system of research grants.

You mean a better solution than different teams paying AWS over and over, potentially spending 10x on rent rather than using all that cash as a down payment on actually owning hardware? I can't really speak for the total costs of depreciation/hardware maintenance but renting forever isn't usually a great alternative to buying.
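
As a toy illustration of the rent-vs-buy arithmetic (every number below is a made-up placeholder, not a real cloud or hardware price):

    # Toy rent-vs-buy comparison for one GPU node; all figures are illustrative.
    buy_price = 250_000      # hypothetical purchase price of an 8-GPU node
    yearly_opex = 40_000     # hypothetical power, cooling, admin per year
    rent_per_hour = 90       # hypothetical on-demand price for a comparable node
    utilisation = 0.6        # fraction of the year the node is actually busy

    for years in (1, 3, 5):
        own = buy_price + yearly_opex * years
        rent = rent_per_hour * 8760 * utilisation * years
        print(f"{years} yr  own: ${own:,.0f}  rent: ${rent:,.0f}")

Whether owning wins in practice hinges on utilisation and on how quickly the hardware depreciates.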


Do you have some information to share to support your bias against leasing especially with a depreciating asset?

In Canada, all three major AI research centers use clusters created with public money. These clusters receive regular additional hardware as new generations of GPUs become available. Considering how these institutions work, I'm pretty confident they've considered the alternatives (renting, AWS, etc). So that's one data point.

sure, I’ll hand it over after you spend your own time first to show that everything everywhere that’s owned instead of leased is a poor financial decision.

AWS is not only hardware but also software, documentation, support and more.

How about using some of that money to develop CUDA alternatives so everyone is not paying the Nvidia tax?

Or just develop the next wave of chips designed for specifically transformer-based architectures (and ternary computing), and bypass the needs for GPUs and CUDA altogether

That would be betting against other architectures like Mamba, which does not seem like an obviously good bet to make yet. Maybe it is though.

You're right, there are a number of avenues that are viable alternatives to the gpu monopoly.

I like the fact that these can be made with just mass-printed multiplication (and in ternary computing's case - addition) gates which require little more than 10 year old tech which is already widely distributed.


It would probably be cheaper to negate some IP. There are quite a few projects and initiatives to make CUDA code run on AMD, for example, but as far as I know they all stopped at some point, probably for fear of being sued into oblivion.

It is being done already...

https://docs.scale-lang.com/


It seems like ROCm is already fully ready for transformer inference, so you are just referring to training?

ROCm is buggy and largely undocumented. That’s why we don’t use it.

It is actively improving every day.

https://news.ycombinator.com/item?id=41052750


That's the kind of work that can come out of academia and open source communities when societies provide the resources required.

Please start with the Windows Tax first for Linux users buying hardware...and the Apple Tax for Android users...

Either you port Tensorflow (Apple)[1] or PyTorch to your platform, or you allow CUDA to run on your hardware (AMD) [2]. Companies are incentivized not to let NVIDIA have a monopoly, but the thing is that CUDA is a huge moat due to compatibility with all frameworks, and everyone knows it. Also, all of the cloud or on premises providers use NVIDIA regardless.

[1] https://developer.apple.com/metal/tensorflow-plugin/ [2] https://www.xda-developers.com/nvidia-cuda-amd-zluda/


>> Either you port Tensorflow (Apple)[1] or PyTorch to your platform, or you allow CUDA to run on your hardware (AMD) [2]. Companies are incentivized not to let NVIDIA have a monopoly, but the thing is that CUDA is a huge moat due to compatibility with all frameworks, and everyone knows it. Also, all of the cloud or on premises providers use NVIDIA regardless.

This never made sense to me -- Apple could easily hire top talent to write Apple Silicon bindings for these popular libraries. I work at a creative ad agency, we have tons of high end apple devices yet the neural cores sit unused most of the time.


A lot of libraries seem to be working on Apple Silicon GPUs but not on ANE. I found this discussion interesting, seems like the ANE has a lot of limitations, is not well documented, and can only be used indirectly through Core ML. https://github.com/ggerganov/llama.cpp/discussions/336

The problem is that any public cluster would be outdated in 2 years. At the same time, GPUs are massively overpriced. Nvidia's profit margins on the H100 are crazy.

Until we get cheaper cards that stand the test of time, building a public cluster is just a waste of money. There are far better ways to spend $1b in research dollars.


> any public cluster would be outdated in 2 years

The private companies buying hundreds of billions of dollars of GPUs aren't writing them off in 2 years. They won't be cutting edge for long. But that's not the point--they'll still be available.

> Nvidia's profit margins on the H100 are crazy

I don't see how the current practice of giving a researcher a grant so they can rent time on a Google cluster that runs H100s is more efficient. It's just a question of capex or opex. As a state, the U.S. has a structural advantage in the former.

> far better ways to spend $1b in research dollars

One assumes the U.S. government wouldn't be paying list price. In any case, the purpose isn't purely research ROI. Like the heavy presses, it's in making a prohibitively-expensive capital asset generally available.


What about dollar cost averaging your purchases of GPUs? So that you're always buying a bit of the newest stuff every year rather than just a single fixed investment in hardware that will become outdated? Say 100 million a year every year for 20 years instead of 2 billion in a single year?

I just watched this 1950s DoD video on the heavy press program and highly recommend it: https://www.youtube.com/watch?v=iZ50nZU3oG8


Don't these public clusters exist today, and have been around for decades at this point, with varying architectures? In the sense that you submit a proposal, it gets approved, and then you get access for your research?

This is the most recent iteration of a national platform. They have tons of GPUs (and CPUs, and flash storage) hooked up as a Kubernetes cluster, available for teaching and research.

https://nationalresearchplatform.org/


Not--to my knowledge--for the GPUs necessary to train cutting-edge LLMs.

All of the major cloud providers offer grants for public research https://www.amazon.science/research-awards https://edu.google.com/intl/ALL_us/programs/credits/research https://www.microsoft.com/en-us/azure-academic-research/

NVIDIA offers discounts https://developer.nvidia.com/education-pricing

eg. for Australia, the National Computing Infrastructure allows researchers to reserve time on:

- 160 nodes each containing four Nvidia V100 GPUs and two 24-core Intel Xeon Scalable 'Cascade Lake' processors.

- 2 nodes of the NVIDIA DGX A100 system, with 8 A100 GPUs per node.

https://nci.org.au/our-systems/hpc-systems


> A public cluster of GPUs provided for free to American universities, companies and non-profits might not be a bad idea.

The USA and Europe are already doing that on a grand scale, in different forms. Both at national and international scale.

I work at an HPC center which provides servers nationally and collaborates on international level.


Great idea, too bad the DOE and NSF were there first.

Eric Schmidt advocated for this exact thing in an Op-ed piece in the latest MIT Technology Review.

[1] https://www.technologyreview.com/2024/05/13/1092322/why-amer...


A better idea would be to make various open source packages into utilities and fund maintainers everywhere as a public good.

AI is a fad, the brick and mortar of the future is open source tools.


It makes much more sense to invest in a next generation fab for GPUs than to buy GPUs and more closely matches this kind of project.

Does it? You're looking at a gargantuan investment in terms of money that would also require thousands of staff.

That just doesn't seem a good idea.


> gargantuan investment

it's a bigger investment, but it's an investment which will pay dividends for decades. with a compute cluster, the government is taking on an asset in the form of the cluster but also liabilities in the form of operations and administration.

with a fab, the government takes on either a promise of lower taxes for N years or hands over a bag of cash. after that they're clear of it. the company operating the fab will be responsible for the risks and on-going expenses.

on top of that...

> thousand of staff

the company will employ/attract even more top talent, each of whom will pay taxes and eventually go on to found related companies or teach the next generation or what have you. not to mention the risk reduction that comes with on-shoring something as critical to national security and the economy as a fab.

a public-access compute cluster isn't a bad idea, but it probably makes more sense to fund/operate it in a similar PPP model. non-profit consortium of universities and businesses pool resources to plan, build, and operate it, government recognizes it as a public good and chips in a significant amount of money to help.


Now, I have no idea.

How much AI computing capability would $3.2bn provide, including the operational and power costs of the cluster?

Certainly, you could build a "$3.2bn GPU cluster", but it would be dark.

So, how much learning time would $3.2bn provide? 1 year? 10 years?

Just curious about hand wavy guesses. I have no idea the scope of the these clusters.


Very much in this spirit is the NSF-funded National Deep Inference Fabric, which lets researchers run remote experiments on foundation models: https://ndif.us. They just announced a pilot program for Llama405b!

The size of the cluster would have to be massive or else your job will be in the queue for a year. And even then, what are you going to do, downsize the resources requested so you can get in earlier? After a certain point it starts to make more sense to just buy your own Xeons and run your own cluster.

I'd like to see big programs to increase the amount of cheap, clean energy we have. AI compute would be one of many beneficiaries of super cheap energy, especially since you wouldn't need to chase newer, more efficient hardware just to keep costs down.

Yeah, this would be the real equivalent of the program people are talking about above. That, and investing in core networking infrastructure (like cables) instead of just giving huge handouts to certain corporations that then pocket the money.

For the DoE, take a look at:

https://doeleadershipcomputing.org/


What about distributed training on volunteer hardware? Is that feasible?

It is an exciting concept; there's a huge wealth of gaming hardware deployed that is inactive for most hours of the day. And I'm sure people are willing to pay well above the electricity cost for it.

Unfortunately, the dominant LLM architecture makes it relatively infeasible right now.

- Gaming hardware has too limited VRAM for training any kind of near-state-of-the-art model. Nvidia is being annoyingly smart about this to sell enterprise GPUs at exorbitant markups.

- Right now communication between machines seems to be the bottleneck, and this is way worse with limited VRAM. Even with data-centre-grade interconnect (mostly Infiniband, which is also Nvidia, smart-asses), any failed links tend to cause big delays in training.

Nevertheless, it is a good direction to push towards, and the government could indeed help, but it will take time. We need both a more healthy competitive landscape in hardware, and research towards model architectures that are easy to train in a distributed manner (this was also the key to the success of Transformers, but we need to go further).
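
To put rough numbers on the VRAM point: a common rule of thumb for plain mixed-precision Adam training is on the order of 16 bytes per parameter for weights, gradients, and optimizer state, before counting activations. A quick sketch:

    # Back-of-envelope training memory (weights + grads + Adam state only,
    # no activations, no sharding). ~16 bytes/param is a common rule of thumb.
    BYTES_PER_PARAM = 16

    for params_b in (8, 70, 405):      # model sizes in billions of parameters
        gib = params_b * 1e9 * BYTES_PER_PARAM / 2**30
        print(f"{params_b}B params: ~{gib:,.0f} GiB of model/optimizer state")

Compare that with the ~24 GB on a top consumer card, and it's clear why naive volunteer training doesn't work without aggressive sharding across many machines - which is exactly where the interconnect becomes the bottleneck.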


Couldn’t VRAM be subsidised with SSDs on a lower end machine? It would make it slower but maybe useful at least.

Perhaps, the landscape has improved a lot in the last couple of years, there are lots of implementation tricks to improve efficiency on consumer hardware, particularly for inference.

Although it is clear that the computing capacity of the GPU would be very underutilized with the SSD as the bottleneck. Even using RAM instead of VRAM is pretty impractical. It might be a bit better for chips like Apple's where the CPU, RAM and GPU are all tightly connected on the same SoC, and the main RAM is used as the VRAM.
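
Rough bandwidth figures show why the SSD becomes the bottleneck; the numbers below are ballpark tiers, not specific products:

    # Ballpark time to stream a 70B-parameter model's fp16 weights once
    # (roughly what naive inference needs per token) from different tiers.
    model_bytes = 70e9 * 2   # 70B params at 2 bytes each

    bandwidth_gb_s = {
        "HBM VRAM (datacentre GPU)": 2000,  # ~2 TB/s
        "DDR5 system RAM":           60,    # rough dual-channel figure
        "PCIe 4.0 NVMe SSD":         7,
    }

    for tier, gb_s in bandwidth_gb_s.items():
        seconds = model_bytes / (gb_s * 1e9)
        print(f"{tier}: ~{seconds:.2f} s per full pass over the weights")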

Would that performance be still worth more than the electricity cost? Would the earnings be high enough for a wide population to be motivated to go through the hassle of setting up their machine to serve requests?


Ever heard of SETI@home?

https://setiathome.berkeley.edu


Followed the link and got two pieces of news that were new to me: both the project and Drake are dead.

Used to contribute in the early 2000s with my Pentium for a while.

Ever got any results?

Also, for training LLMs, I understand there is a huge bandwidth problem with this approach.


Imagine if they made a data center with 1957 electronics that cost $279 million.

They probably wouldn't be using it now because the phone in your pocket is likely more powerful. Moore's law did end, but data center hardware is still evolving orders of magnitude faster than forging presses.


So we'll have the government bypass markets and force the working class to buy toys for the owning class?

If anything, allocate compute to citizens.


> If anything, allocate compute to citizens.

If something like this were to become a reality, I could see something like "CitizenCloud" where once you prove that you are a US Citizen (or green card holder or some other requirement), you can then be allocated a number of credits every month for running workloads on the "CitizenCloud". Everyone would get a baseline amount, from there if you can prove you are a researcher or own a business related to AI then you can get more credits.


Overall government doing anything is a bad idea. There are cases however where government is the only entity that can do certain things. These are things that involve military, law enforcement etc. Outside of this we should rely on private industry and for-profit industry as much as possible.

The American healthcare industry demonstrates the tremendous benefits of rigidly applying this mindset.

Why couldn’t law enforcement be private too? You call 911, several private security squads rush to solve your immediate crime issue, and the ones who manage to shoot the suspect send you a $20k bill. Seems efficient. If you don’t like the size of the bill, you can always get private crime insurance.


For a further exploration of this particular utopia, see Snow Crash by Neal Stephenson.

Ugh.

Government distorting undeveloped markets that have a lot of room for competition to increase efficiencies is a bad thing.

Government agencies running programs that should not be profitable, or where the only profit to be left comes at the expense of society as a whole, is a good thing.

Lots of basic medicine is the go to example here, treating cancer isn't going to be "profitable" and attempting to make it such just leads to dead people.

On the flip side, one can argue that dentistry has seen amazing strides in affordability and technological progress through the free market, from dental X-rays to improvements in dental procedures that make them less painful for patients.

Eye surgery is another area where competition has led to good consumer outcomes.

But life of death situations where people can't spend time researching? The only profit there comes through exploiting people.


> Overall government doing anything is a bad idea.

That is so bereft of detail as to just be wrong. There are things that government is good for and things that government is bad for, but "anything" is just too broad, and reveals an anti-government bias which just isn't well thought out.


Why are governments a bad idea? Seems the human race has opted for governments doing things since the dawn of civilization. Building roads, providing defense, enforcing rights, provide social safety nets, funding costly scientific endeavors.

To summarise: There are some things where government action is the best solution, however by default see if the private sector can sort it first.

And it was demonstrated long ago that the private market is not the most efficient solution for society as a whole when it comes to handling healthcare insurance.

That’s not correct. The American health care system is an extreme example of where private organisations fail overall society.

"Eventually though, open source Linux gained popularity – initially because it allowed developers to modify its code however they wanted ..."

I find the language around "open source AI" to be confusing. With "open source" there's usually "source" to open, right? As in, there is human legible code that can be read and modified by the user? If so, then how can current ML models be open source? They're very large matrices that are, for the most part, inscrutable to the user. They seem akin to binaries, which, yes, can be modified by the user, but are extremely obscured to the user, and require enormous effort to understand and effectively modify.

"Open source" code is not just code that isn't executed remotely over an API, and it seems like maybe its being conflated with that here?


"Open weights" is a more appropriate term but I'll point out that these weights are also largely inscrutable to the people with the code that trained it. And for licensing reasons, the datasets may not be possible to share.

There is still a lot of modifying you can do with a set of weights, and they make great foundations for new stuff, but yeah we may never see a competitive model that's 100% buildable at home.

Edit: mkolodny points out that the model code is shared (under llama license at least), which is really all you need to run training https://github.com/meta-llama/llama3/blob/main/llama/model.p...


"Open weights" means you can use the weights for free (as in beer). "Open source" means you get the training dataset and the methodology. ~Nobody does open source LLMs.

Indeed, since when is a deliverable that is a jpeg/exe (which is similar to what the model file is) considered the source? It is more like an open result or a freely available VM image, which works, but has its core filesystem scrambled or encrypted.

Zuck knows this very well and it does him no honour to speak like this; from his position it amounts to an attempt to change the present semantics of open source. Of course, others do that too - using the notion of open source to describe something very far from open.

What Meta is doing under his command can better be described as releasing the resulting...build, so that it can be freely poked around and even put to work. But the result cannot be effectively reverse engineered.

What's more ridiculous is that it is precisely because the result is not the source in its whole form that these graphical structures can be made available - only thanks to the fact that they are not traceable to the source, which makes the whole game not only closed but, like... sealed forever. An unfair retelling of humanity's knowledge, tossed around in a very obscure container that nobody can reverse engineer.

how's that even remotely similar to open source?


Even if everything was released how you described, what good would that really do for an individual without access to heaps of compute? Functionally there seems to be no difference between open weights and open compute because nobody could train a facsimile model. Furthermore, all frontier models are inscrutable due to their construction. It’s wild to me seeing people complain about semantics when meta dropped their model for cheap. Now I’m not saying we should suck the zuck for this act of charity, but you have to imagine that other frontier labs are not thrilled that meta has invalidated their compute moats with the release of llama. Whether we like it or not, we’re on this AI rollercoaster and I’m glad that it’s not just oligopolists dictating the direction forward. I’m happy to see meta take this direction, knowing that the alternatives are much worse.

That's not the discussion. We're talking about what open source is, and it's having the weights and the method to recreate the model.

If someone gives me an executable that I can run for free, and then says "eh why do you want the source, it would take you a long time to compile", that doesn't make it open source, it just makes it gratis.


Calling weights an executable is disingenuous and not a serious discussion. You can do a lot more with weights than you could with a binary executable.

You can do a lot more with an executable as well than just execute it. So maybe the analogy is apt, even if not exact.

Actually, executables you can reverse engineer into something that could be compiled back into an executable with the exact same functionality, which is AFAIK impossible to do with "open weights". Still, we don't call free executables "open source".


Its not really an analogy. LLMs are quite literally executables in the same way that jpegs are executables. They both specify machine readable (but not human readable) domain specific instructions executed by the image viewer/inference harness.

And yes, like other executables, they are not literal black boxes. Rather, they provide machine readable specifications which are not human readable without immense effort.

For an LLM to be open source there would need to be source code. Source code, btw, is not just a procedure that can be handed to a machine to produce code that can be executed by the machine. That means the training data and code is not sufficient (or necessary) for an open source model.

What we need for an open source model is a human readable specification of the model's functionality and data structures which allows the user to modify specific arbitrary functionally/structure, and can be used to produce an executable (the model weights).

We simply need much stronger interpretability for that to be possible.


This is debatable; even an executable is a valuable artifact. You can also do a lot with an executable in expert hands.

I'd find knowing what's in the training data hugely valuable - can analyse it to understand and predict capabilities.

Linux is open source and is mostly C code. You cannot run C code directly, you have to compile it and produce binaries. But it's the C code, not binary form, where the collaboration happens.

With LLMs, weights are the binary code: it's how you run the model. But to be able to train the model from scratch, or to collaborate on new approaches, you have to operate at a the level of architecture, methods, and training data sets. They are the source code.


Analogies are always going to fall short. With LLM weights, you can modify them (quant, fine-tuning) to get something different, which is not something you do with compiled binaries. There are ample areas for collaboration even without being able to reproduce from scratch, which takes $X Millions of dollars, also something that a typical binary does not have as a feature.
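
For example, post-hoc quantization is just arithmetic on the released weights. A minimal, purely illustrative sketch of symmetric per-tensor int8 quantization with NumPy (not how any particular toolchain does it):

    import numpy as np

    def quantize_int8(w: np.ndarray):
        """Symmetric per-tensor int8 quantization of one weight matrix."""
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(1024, 1024).astype(np.float32)  # stand-in for one layer's weights
    q, scale = quantize_int8(w)
    print("mean abs error:", float(np.abs(dequantize(q, scale) - w).mean()))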

You can absolutely modify compiled binaries to get something different. That's how lots of video game modding and ROM hacks work.

And we would absolutely do it more often if compiling would cost as much as training of an LLM costs now.

I considered adding "normally" to the binary modifications expecting a response like this. The concepts are still worlds apart

Weights aren't really a binary in the same sense that a compiler produces; they lack instructions and are more just a bunch of floating point values. Nor can you run model weights without separate code to interpret them correctly. In this sense, they are more like a JPEG or 3d model.


JPEGs and 3D models are also executable binaries. They, like model weights, contain domain specific instructions that execute in a domain specific and turing incomplete environment. The model weights are the instructions, and those instructions are interpreted by the inference harness to produce outputs.

>Nobody does open source LLMs.

There are a bunch of independent, fully open source foundation models from companies that share everything (including all data). AMBER and MAP-NEO for example. But we have yet to see one in the 100B+ parameter category.


Sorry, the tilde before "nobody" is my notation for "basically nobody" or "almost nobody". I thought it was more common.

It is more common when it comes to numbers, I guess. There are ~5 ancestors in this comment chain; I would agree roughly 4-6 is acceptable.

It's the literal (figurative) nobody rather than the literal (literal) nobody.

There are plenty of open source LLMs, they just aren’t at the top of the leaderboards yet. Here’s a recent example, I think from Apple: https://huggingface.co/apple/DCLM-7B

Using open data and dclm: https://github.com/mlfoundations/dclm


If weights are not the source, then if they gave you the training data and scripts but not the weights, would that be "open source"?

Yes, but they won't do that. Possibly because of extensive copyright violations in the training data that they're not legally allowed to share.

If somebody leaked the training data, they could deny that it's real, ergo not get sued, and the data would be available.

Edit: typo.


It's not available if you can't use it because you don't have as many lawyers as facebook and can't ignore laws so easily.

This is bending the definition to the other extreme.

Linux doesn't ship you the compiler you need to build the binaries either, that doesn't mean it's closed source.

LLMs are fundamentally different to software and using terms from software just muddies the waters.


And LLMs don't ship with a Python distribution.

Linux sources :: dataset that goes into training

Linux sources' build confs and scripts :: training code + hyperparameters

GCC :: Python + PyTorch or whatever they use in training

Compiled Linux kernel binary :: model weights


Just because you keep saying it doesn't make it true.

LLMs are not software any more than photographs are.


Then what is the "source"? If we are to use the term "source" then what does that mean here, as distinct from it merely being free?

It means nothing because LLMs aren't software.

Do they not run on a computer?

So does a video. Is a video open source if you're given the permissions to edit it? To distribute it? Given the files to generate it? What if the files can only be open in a proprietary program?

Videos aren't software and neither are llms.


If a video doesn't have source code, then it can't be open source. Likewise, if you feel that an LLM doesn't have source code because of some property of what it is -- as you claim it isn't software and somehow that means that it abstractly removes it from consideration for this concept (an idea I think is ridiculous, FWIW: an LLM is clearly software that runs in a particularly interesting virtual machine defined by the model architecture) -- then, somewhat trivially, it also can't be open source. It is, as the person you are responding to says, at best "open weights".

If a video somehow does have source code which can "generate it", then the question of what it means for the source code to the video to be open even if the only program which can read it and generate the video is closed source is equivalent to asking if a program written in Visual Basic can ever be open source given that the Visual Basic compiler is closed source. Personally, I can see arguments either way on this issue, though most people seem to agree that the program is still open source in such a situation.

However, we need not care too much about the answer to that specific conundrum, as the moral equivalent of both the compiler and the runtime virtual machine are almost always open source. What is then important is much easier: if you don't provide the source code to the project, even if the compiler is open source and even if it runs on an open source machine, clearly the project -- whatever it is that we might try to be discussing, including video files -- cannot be open source. The idea that a video can be open source when what you mean is the video is unencrypted and redistributable but was merely intended to be played in an open source video player is absurd.


> Is a video open source if you're given the permissions to edit it? To distribute it? Given the files to generate it?

If you're given the source material and project files to continue editing where the original editors finished, and you're granted the rights to re-distribute - Yes, that would be open source[1].

Much like we have "open source hardware" where the "source" consists of original schematics, PCB layouts, BOM, etc. [2]

[1] https://en.wikipedia.org/wiki/Open-source_film

[2] https://en.wikipedia.org/wiki/Open-source_hardware


Videos and images are software. They are compiled binaries with very domain specific instructions executed in a very non-turing complete context. They are generally not released as open source, and in many cases the source code (the file used to edit the video or image) is lost. They are not seen, colloquially, as software, but that does not mean that they are not software.

If a video lacks a specification file (the source code) which can be used by a human reader to modify specific features in the video, then it is software that is simply incapable of being open sourced.


"LLMs are fundamentally different to software and using terms from software just muddies the waters."

They're still software, they just don't have source code (yet).


There is a comment elsewhere claiming there are a few dozen fully open source models: https://news.ycombinator.com/item?id=41048796

Why is the dataset required for it to be open source?

If I self host a project that is open sourced rather than paying for a hosted version, like Sentry.io for example, I don't expect data to come along with the code. Licensing rights are always up for debate in open source, but I wouldn't expect more than the code to be available and reviewable for anything needed to build and run the project.

In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available. I'm not actually sure if Meta does share all that, but training data is separate from open source IMO.


The open source movement, from which the name derives, was about the freedom to make bespoke alterations to the software you choose to run. Provided you have reasonably widespread proficiency in industry standard tools, you can take something that's open source, modify that source, and rebuild/redeploy/reinterpret/re-whatever to make it behave the way that you want or need it to behave.

This is in contrast to a compiled binary or obfuscated source image, where alteration may be possible with extraordinary skill and effort but is not expected and possibly even specifically discouraged.

In this sense, weights are entirely like those compiled binaries or obfuscated sources rather than the source code usually associated with "open source".

To be "open source" we would want LLM's where one might be able to manipulate the original training data or training algorithm to produce a set of weights more suited to one's own desires and needs.

Facebook isn't giving us that yet, and very probably can't. They're just trading on the weird boundary state of the term "open source" -- it still carries prestige and garners good will from its original techno-populist ideals, but is so diluted by twenty years of naive consumers who just take it to mean "I don't have to pay to use this" that the prestige and good will is now misplaced.


>The open source movement, from which the name derives, was about the freedom to make bespoke alterations to the software you choose to run.

The open source movement was a cash grab to make the free software movement more palatable to big corp by moving away from copy left licenses. The MIT license is perfectly open source and means that you can buy software without ever seeing its code.


If you obtain open source licensed software you can pass it on legally (and freely). With some licenses you also have to provide the source code.

The sticking point is you can’t build the model. To be able to build the model from scratch you need methodology and a complete description of the data set.

They only give you a blob of data you can run.


Got it, that makes sense. I still wouldn't expect them to have to publicly share the data itself, but if you can't take the code they share and run it against your own data to build a model that wouldn't be open source in my understanding of it.

Data is the source code here, though. Training code is effectively a build script. Data that goes into training a model does not function like assets in videogames; you can't swap out the training dataset after release and get substantially the same thing. If anything, you can imagine the weights themselves are the asset - and even if the vendor is granting most users a license to copy and modify it (unlike with videogames), the asset itself isn't open source.

So, the only bit that's actually open-sourced in these models is the inference code. But that's a trivial part that people can procure equivalents of elsewhere or reproduce from published papers. In this sense, even if you think calling the models "open source" is correct, it doesn't really mean much, because the only parts that matter are not open sourced.


Compare/contrast:

DOOM-the-engine is open source (https://github.com/id-Software/DOOM), even though DOOM-the-asset-and-scenario-data is not. While you need a copy of DOOM-the-asset-and-scenario-data to "use DOOM to run DOOM", you are free to build other games using DOOM-the-engine.


I think no one would claim that “Doom” is open source though, if that’s the situation.

That's what op is saying, the engine is GPLv2, but the assets are copyrighted. There's Freedoom though and it's pretty good [0].

[0]: https://freedoom.github.io/


The thing they are pointing at and which is the thing people want is the output of the training engine, not the inputs. This is like someone saying they have an open source kernel, but they only release a compiler and a binary... the kernel code is never released, but the kernel is the only reason anyone even wants the compiler. (For avoidance of anyone being somehow confused: the training code is a compiler which takes training data and outputs model weights.)

The output of the training engine, i.e. the model itself, isn't source code at all though. The best approximation would be considering it obfuscated code, and even then it's a stretch since it is more similar to compressed data.

It sounds like Meta doesn't share source for the training logic. That would be necessary for it to really be open source, you need to be able to recreate and modify the codebase but that has nothing to do with the training data or the trained model.


I didn't claim the output is source code, any more than the kernel is. Are you sure you don't simply agree with me?

> not actually sure if Meta does share all that

Meta shares the code for inference but not for training, so even if we say it can be open-source without the training data, Meta's models are not open-source.

I can appreciate Zuck's enthusiasm for open-source but not his willingness to mislead the larger public about how open they actually are.


https://opensource.org/osd

"The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed."

> In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available

The M in LLM is for "Model".

The code you describe is for an LLM harness, not for an LLM. The code for the LLM is whatever is needed to enable a developer to modify to inputs and then build a modified output LLM (minus standard generally available tools not custom-created for that product).

Training data is one way to provide this. Another way is some sort of semantic model editor for an interpretable model.


I still don't quite follow. If Meta were to provide all the code required to train a model (it sounds like they don't), and they provided the code needed to query the model you train to get answers, how is that not open source?

> Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.

This definition actually makes it impossible for any LLM to be considered open source until the interpretability problem is solved. The trained model is functionally obfuscated code, it can't be read or interpreted by a human.

We may be saying the same thing here, I'm not quite sure if you're saying the model must be available or if what is missing is the code to train your own model.


I'm not the person you replied directly to so I can't speak for them, but I did start this thread, and I just wanted to clarify what I meant in my OP, because I see a lot of people misinterpreting what I meant.

I did not mean that LLM training data needs to be released for the model to be open source. It would be a good thing if creators of models did release their training data, and I wouldn't even be opposed to regulation which encourages or even requires that training data be released when models meet certain specifications. I don't even think the bar needs to be high there- We could require or encourage smaller creators to release their training data too and the result would be a net positive when it comes to public understanding of ML models, control over outputs, safety, and probably even capabilities.

Sure, it's possible that training data is being used illegally, but I don't think the solution to that is to just have everyone hide that and treat it as an open secret. We should either change the law, or apply it equally.

But that being said, I don't think it has anything to do with whether the model is "open source". Training data simply isn't source code.

I also don't mean that the license that these models are released under is too restrictive to be open source. Though that is also true, and if these models had source code, that would also prevent them from being open source. (Rather, they would be "source available" models)

What I mean is "The trained model is functionally obfuscated code, it can't be read or interpreted by a human." As you point out, it is definitionally impossible for any contemporary LLM to be considered open source. (Except for maybe some very, very small research models?) There's no source code (yet) so there is no source to open.

I think it is okay to acknowledge when something is technically infeasible, and then proceed to not claim to have done that technically infeasible thing. I don't think the best response to that situation is to, instead, use that as justification for muddying the language to such a degree that it's no longer useful. And I don't think the distinction is trivial or purely semantic. Using the language of open source in this way is dangerous for two reasons.

The first is that it could conceivably make it more challenging for copyleft licenses such as the GPL to protect the works licensed with them. If the "public" no longer treats software with public binaries and without public source code as closed source, then who's to say you can't fork the linux kernel, release the binary, and keep the code behind closed doors? Wouldn't that also be open source?

The second is that I think convincing a significant portion of the open source community that releasing a model's weights is sufficient to open source a model will cause the community to put more focus on distributing and tuning weights, and spend less time actually figuring out how to construct source code for these models. I suspect that solving interpretability and generating something resembling source code may be necessary to get these models to actually do what we want them to do. As ML models become increasingly integrated into our lives and production processes, and become increasingly sophisticated, the danger created by having models optimized towards something other than what we would actually like them optimized towards increases.


Data is to models what code is to software.

I don't quite agree there. Based on other comments it sounds like Meta doesn't open source the code used to train the model, that would make it not open source in my book.

The trained model doesn't need to be open source though, and frankly I'm not sure what the value there is specifically with regards to OSS. I'm not aware of a solution to the interpretability problem; even if the model is shared we can't understand what's in it.

Microsoft ships obfuscated code with Windows builds, but that doesn't make it open source.


Wouldn't the "source code" of the model be closer to the source code of a compiler or the runtime library?

IMO a pre-trained model given with the source code used to train/run it is analogous to a company shipping a compiler and a compiled binary without any of the source, which is why I don't think it's "open source" without the training data.


You really should be able to train a model on whatever data you choose to use though.

Training data isn't source code at all; it's content fed into the ingestion side to train a model. As long as the source for ingesting and training a model is available, which it sounds like isn't the case for Meta, that would be open source as best I understand it.

Said a little differently, I would need to be able to review all code used to generate a model and all code used to query the model for it to be OSS. I don't need Meta's training data or their actual model at all, I can train my own with code that I can fully audit and modify if I choose to.


But surely you wouldn't call it open source if Sentry just gave you a binary - and the source code wasn't available.

I suspect that even if you allowed people to take the data, nobody but a FAANG like organisation could even store it?

My impression is the training data for foundation models isn't that large. It won't fit on your laptop drive, but it will fit comfortably in a few racks of high-density SSDs.

Yeah, according to the article [0] about the release of Llama 3.1 405B, it was trained on 15 trillion tokens using 16,000 Nvidia H100s. Even if they did release the training data, I don't think many people would have the number of GPUs required to actually do any real training to create the model....

[0] https://ai.meta.com/blog/meta-llama-3-1/


And a token is just an index into a restricted dictionary. GPT-2 was said to have 50k distinct tokens, so I think it's safe to assume even the latest models are well below 4M tokens, i.e. at most 4 bytes per token. 15 trillion tokens at 4 bytes/token means a training input of at most ~60 TB, which doesn't sound that large.

It's the computation that is costly.
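
As a quick back-of-the-envelope check (assuming at most 4 bytes per token ID, which is generous):

    tokens = 15e12          # 15 trillion training tokens
    bytes_per_token = 4     # upper bound for a vocabulary well under 4M entries
    print(tokens * bytes_per_token / 1e12)   # -> 60.0 (TB)

The raw text behind those tokens averages a few bytes per token as well, so either way it's the same order of magnitude.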


Llama is an open-weights model. I like this term, let's use that instead of open source.

Can a human programmer edit the weights according to some semantics?

It is possible to merge two fine-tunes of models from the same family by... wait for it... averaging or combining their weights[0].

I am still amazed that we can do that.

[0]: https://arxiv.org/abs/2212.09849
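
A minimal sketch of what such a merge can look like in practice (assuming two fine-tunes that share the exact same architecture and state dict keys; the checkpoint paths are illustrative):

    import torch

    # Naive "model soup"-style merge: element-wise average of two checkpoints.
    sd_a = torch.load("finetune_a.pt", map_location="cpu")
    sd_b = torch.load("finetune_b.pt", map_location="cpu")

    merged = {k: (sd_a[k] + sd_b[k]) / 2 for k in sd_a}
    torch.save(merged, "merged.pt")

Real merging tools do weighted and per-layer variants of this, but the core operation really is that simple.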


This is absolutely wild.

Yes. Using fine tuning.

Yes, there is the concept of a "frankenmerge", and folks have also bolted vision and audio models onto LLMs.

If you can’t share the dataset, under what twisted reality are you fine to share the derivative models based on those unsharable datasets?

In a better world, there would be no “I ran some algos on it and now it’s mine” defense.


Yeah was gonna say exactly the same thing. Weird how the legislation allows releasing LLMs trained on data that is not allowed to be shared otherwise.

Meta might possibly have a license to use (some of) that data, but not a license to distribute it. Legislation has little to do with it, I imagine.

The latest Llama 3.1 is in a different repo, https://github.com/meta-llama/llama-models/blob/main/models/... , but yes, the code is shared. It's astonishing that in the software 2.0 era, powerful applications like Llama have only hundreds of lines of code, with most of the work hidden in the training data. Source code alone is no longer as informative as it was in Software 1.0.

For models of this size, the code used to train them is going to be very custom to the architecture/cluster they are built on. It would be almost useless to anybody outside of Meta. The dataset would be a lot more interesting, as it would at the very least show everybody how they got it to behave in certain ways.

Open training data would be great too.

If you have open data and open source code you can reproduce the weights


Not easily for these large scale models, but theoretically maybe

Really? I have to check out the training code again. Last time I looked the training and inference code were just example toys that were barely usable.

Has that changed?


Open Source Initiative (kind of a de-facto authority on what's open source and what not) is spending a whole lot of time figuring out what it means for an AI system to be open source. In other words, they're basically trying to come up with a new license because the existing ones can't easily apply.

I believe this is the current draft: https://opensource.org/deepdive/drafts/the-open-source-ai-de...


OSI made themselves the authority because they hated Richard Stallman and his Free Software movement. It's just marketing.

RMS has no interest in governing Open Source, so your comment bears no particular relevance.

RMS is an advocate for Free Software. Free Software generally implies Open Source, but not the converse.

RMS considers openness of source to be a separate category from the freeness of software. "Free software is a political movement; open source is a development model."

https://www.gnu.org/licenses/license-list.en.html


Are you really pretending that OSI and the open source label itself wasn’t a reactionary movement that vilified free software principles in hopes of gaining corporate traction?

Most of us who were there remember it differently. True open source advocates will find little to refute in what I’ve said.


> True open source advocates will find little to refute in what I’ve said.

No true Scotsman https://en.wikipedia.org/wiki/No_true_Scotsman

OSI helped popularize the open source movement. They not only made it palatable to businesses, but got them excited about it. I think that FSF/Stallman alone would not have been very successful on this front with GPL/AGPL.


Like I said, honest open source advocates won’t take issue to how I framed their position.

Here’s a more important point: how far would the open source people have gotten without GCC and glibc?

Much less far than they will ever admit, in my experience.


> Most of us who were there remember it differently. True open source advocates will find little to refute in what I’ve said.

> Like I said, honest open source advocates won’t take issue to how I framed their position.

Yet you've failed to provide even a single point of evidence to back up your claim.

> "honest open source advocates"

You've literally just made this term up. It's meaningless.


It’s not a term, it’s a phrase. It means “open source advocates who are being honest about their advocacy”, in case you really need such a degree of clarification.

I’ve met honest open source advocates before and, once again, they would be unlikely to refute the fact that “open source” was invented in explicit contrast to “free software” to achieve corporate palatability.

The comment you are responding to was literally responding to a comment which validated this exact sentiment.

As to providing evidence, those of us who were there at the time don’t need any and those of you who weren’t ought to seek some. It’s not my job to link to the nearly infinite number of conversations where this obvious dynamic played out.


For some advocates, sure. I was there, too — although at the beginning of my career and not deeply involved in most licensing discussions until the founding of Mozilla (where I argued against the GNU GPL and was generally pleased with the result of the MPL). However, from ~1990, I remember sharing some code where I "more or less" made my code public domain but recommended people consider the GNU GPL as part of the README (I don't have the source code available, so I don't recall).

Your characterization is quite easily refutable, because at the time that OSI was founded, there was already an explosion of possible licenses and RMS and other GNUnatics were making lots of noise about GNU/Linux and trying to be as maximalist as possible while presenting any choice other than the GNU GPL as "against freedom".

This certainly would not have sat well with people who were using the MIT Licence or BSD licences (created around the same time as the GNU GPL v1), who believed (and continue to believe) that there were options other than a restrictive viral licence‡. Yes, some of the people involved vilified the "free software principles", but there were also GNU "advocates" who were making RMS look tame with their wording (I recall someone telling me to enjoy "software slavery" because I preferred licences other than the GNU GPL).

The "Free Software" advocates were pretending that the goals of their licence were the only goals that should matter for all authors and consumers of software. That is not and never has been the case, so it is unsurprising that there was a bit of reaction to such extremism.

OSI and the open source label were a move to make things easier for corporations to accept and understand by providing (a) a clear unifying definition, and (b) a set of licences and guidelines for knowing what licenses did what and the risks and obligations they presented to people who used software under those licences.

‡ Don't @ me on this, because both the virality and restrictiveness are features of the GNU GPL. If it weren't for the nonsense in the preamble, it would be a good licence. As it is, it is an effective if rampantly misrepresented licence.


Didn't the Open Source Definition start as the DFSG? You telling me Debian hates the Free Software movement? Unless you define "hating Free Software" as "not banning the BSD license", then I'll have to disagree.

Training code is only useful to people in academia, and the closest thing to "code you can modify" is the open weights.

People are framing this as if it was an open-source hierarchy, with "actual" open-source requiring all training code to be shared. This is not obvious to me, as I'm not asking people that share open-source libraries to also share the tools they used to develop them. I'm also not asking them to share all the design documents/architecture discussion behind this software. It's sufficient that I can take the end result and reshape it in any way I desire.

This is coming from an LLM practitioner that finetunes models for a living; and this constant debate about open-source vs open-weights seems like a huge distraction vs the impact open-sourcing something like Llama has... this is truly a Linux-like moment. (at a much smaller scale of course, for now at least)


I dunno — if an open source project required, say, a proprietary compiler, that would diminish its open source-ness. But I agree it's not totally comparable, since the weights are not particularly analogous to machine code. We probably need a new term. Open Weights.

There are many "compilers", you can download The Pile yourself.

> If so, then how can current ML models be open source?

The source of a language model is the text it was trained on. Llama models are not open source (contrary to their claims), they are open weight.


You can find the entire Llama 3.0 pretraining set here: https://huggingface.co/datasets/HuggingFaceFW/fineweb

15T tokens, 45 terabytes. Seems fairly open source to me.
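
For anyone who wants to poke at it without downloading 45 terabytes, the dataset can be streamed. A rough sketch using the Hugging Face datasets library (assuming the default config and a "text" column, which FineWeb exposes):

    from datasets import load_dataset

    # Stream records one at a time instead of fetching the whole dump.
    fw = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
    for i, row in enumerate(fw):
        print(row["text"][:200])
        if i == 2:
            break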


Where has Facebook linked that? I can't find anywhere that they actually published that.

Many companies stopped publishing their data sets after people published evidence that they were mass copyright infringement. They dropped the specifics of pretraining data from the model cards.

Aside from licensing content, the fact that content creators don't like redistribution means a lawful model would probably only use Gutenberg's collection and permissively licensed code. Anything else, including Wikipedia, usually has licensing requirements they might violate.


Yeah I don't think I've seen it linked officially, but Meta does this sort of semi-official stuff all the time, leaking models ahead of time for PR, they even have a dedicated Reddit account for releasing unofficial info.

Regardless, it fits the compute used and the claim that they trained from public web data, and was suspiciously published by HF staff shortly after L3 released. It's about as official as the Mistral 7B v0.2 base model. I.e. mostly, but not entirely, probably for some weird legal reasons.


It says it is ~94 TB, with >130k downloads, implying more than 12 exabytes of copying. That seems a bit off; I wonder how they are calculating downloads.

No. The text is an asset used by the source to train the model. The source can process arbitrary text. Text is just text, it was written for communication purposes, software (defined by source code) processes that text in a particular way to train a model.

In programming, "source" and "asset" have specific meanings that conflict with how you used them.

Source is the input to some built artifact. It is the source of that artifact. As in: where the artifact comes from. Textual input is absolutely the source of the ML model. What you are using "source" as is analogous to the source of the compiler in traditional programming.

Asset is an artifact used as input, that is revered verbatim by the output. For example, a logo baked into an application to be rendered in the UI. The compilation of the program doesn't make a new logo, it just moves the asset into the built artifact.


I hadn't had my morning coffee yet when I wrote this and I have no idea what I meant instead of "revered", but you get the idea :D

I think it would also include the code used to train it

That would be more analogous to the build toolchain than the source code, but yes

Surely traditional “open source” also needs some notion of a reproducible build toolchain, otherwise the source code itself is approximately useless.

Imagine if the source code was in a programming language of which the basic syntax and semantics were known to no one but the original developers.

Or more realistically, I think it’s a major problem if an open source project can only be built by an esoteric process that only the original developers have access to.


Source code in a vacuum is still valuable as a way to deal with missing/inaccurate documentation and diagnose faults and their causes.

Raw training datasets similarly has some value as you can analyze it for different characteristics to understand why the trained model is under/over-representing different concepts.

But yes real FOSS should be "open-build" and allow anyone to build a test-passing artifact from raw source material.


Of course you are right, I'd put it less carefully: The quoted Linux line is deceptive marketing.

- If we start with the closed training set, that is closed and stolen, so call it Stolen Source.

- What is distributed is a bunch of float arrays. The Llama architecture is published, but not the training or inference code. Without code there is no open source. You might as well call a compiler textbook open source because it tells you how to build a compiler.

Pure marketing, but predictably many people follow their corporate overlords and eagerly adopt the co-opted terms.

Reminder again that FB is not releasing this out of altruism, but because they have an existing profitable business model that does not depend on generated chats. They probably do use it internally for tracking and building profiles, but that is the same as using Linux internally, so they release the weights to destroy the competition.

Isn't price dumping an antitrust issue?


I like the term "open weights". Open source would be the dataset and code that generates these weights.

There is still a lot you can do with weights, like fine tuning, and it is arguably more useful as retraining the entire model would cost millions in compute.



No, it's not. The Llama 3 Community License Agreement is not an open source license. Open source licenses need to meet the criteria of the only widely accepted definition of "open source", and that's the one formulated by the OSI [0]. This license has multiple restrictions on use and distribution which make it not open source. I know Facebook keeps calling this stuff open source, maybe in order to get all the good will that open source branding gets you, but that doesn't make it true. It's like a company calling their candy vegan while listing one of its ingredients as pork-based gelatin. No matter how many times the company advertises that their product is vegan, it's not, because it doesn't meet the definition of vegan.

[0] - https://opensource.org/osd


Isn't the MIT license the generally accepted "open source" license? It's a community owned term, not OSI owned

MIT is a permissive open source license, not the open source license.

There are more licenses than just MIT that are "open source". GPL, BSD, MIT, Apache, some of the Creative Commons licenses, etc. MIT has become the defacto default though

https://opensource.org/license (linking to OSI for the list because it's convenient, not because they get to decide)


These discussions (ie, everything that follows here) would be much easier if the crowd insisting on the OSI definition of open source would capitalize Open Source.

In English, proper nouns are capitalized.

"Open" and "source" are both very normal English words. English speakers have "the right" to use them according to their own perspective and with personal context. It's the difference between referring to a blue tooth, and Bluetooth, or to an apple store or an Apple store.


Open source licenses need to meet the criteria of the only widely accepted definition of "open source", and that's the one formulated by the OSI [0]

Who died and made OSI God?


This isn't helpful. The community defers to the OSI's definition because it captures what they care about.

We've seen people try to deceptively describe non-OSS projects as open source, and no doubt we will continue to see it. Thankfully the community (including Hacker News) is quick to call it out, and to insist on not cheapening the term.

This is one the topics that just keeps turning up:

* https://news.ycombinator.com/item?id=24483168

* https://news.ycombinator.com/item?id=31203209

* https://news.ycombinator.com/item?id=36591820


This isn't helpful. The community...

Speak for yourself, please. The term is much older than 1998, with one easily-Googled example being https://www.cia.gov/readingroom/docs/DOC_0000639879.pdf , and an explicit case of IT-related usage being https://i.imgur.com/Nw4is6s.png from https://www.google.com/books/edition/InfoWarCon/09X3Ove9uKgC... .

Unless a registered trademark is involved (spoiler: it's not) no one, whether part of a so-called "community" or not, has any authority to gatekeep or dictate the terms under which a generic phrase like "open source" can be used.


Neither of those usages relates to IT; they are both about sources of intelligence (espionage). Even if they were, the OSI definition won: nobody is using the definitions from the 1995 CIA or the 1996 InfoWarCon book in the realm of IT, not even Facebook.

The community has the authority to complain about companies mis-labelling their pork products as vegan, even if nobody has a registered trademark on the term vegan. Would you tell people to shut up about that case because they don't have a registered trademark? Likewise, the community has authority to complain about Meta/Facebook mis-labelling code as open source even when they put restrictions on usage. It's not gate-keeping or dictatorship to complain about being misled or being lied to.


Would you tell people to shut up about that case because they don't have a registered trademark?

I especially like how I'm the one telling people to "shut up" all of a sudden.

As for the rest, see my other reply.


You're right, I and those who agree with me were the first to ask people to "shut up", in this case, to ask Meta to stop misusing the term open source. And I was the first to say "shut up", and I know that can be inflammatory and disrespectful, so I shouldn't have used it. I'm sorry. We're here in a discussion forum, I want you to express your opinion even it is to complain about my complaints. For what it's worth, your counter-arguments have been stronger and better referenced than any other I have read (for the case of accepting a looser definition of the term open source in the realm of IT).

All good, and I also apologize if my objection came across as disrespectful.

This whole 'Open Source' thing is a bigger pet peeve than it should be, because I've received criticism for using the term on a page where I literally just posted a .zip file full of source code. The smart thing to do would have been to ignore and forget the criticism, which I will now work harder at doing.

In the case of a pork producer who labels their products as 'vegan', that's different because there is some authority behind the usage of 'vegan'. It's a standard English-language word that according to Merriam-Webster goes back to 1944. So that would amount to an open-and-shut case of false advertising, which I don't think applies here at all.


> In the case of a pork producer who labels their products as 'vegan', that's different because there is some authority behind the usage of 'vegan'.

I don't see the difference. Open source software is a term of art with a specific meaning accepted by its community. When people misuse the term, invariably in such a way as to broaden it to include whatever it is they're pushing, it's right that the community responds harshly.


Terms of art do not require licenses. A given term is either an ordinary dictionary word that everyone including the courts will readily recognize ("Vegan"), a trademark ("Microsoft® Office 365™"), or a fragment of language that everyone can feel free to use for their own purposes without asking permission. "Open Source" falls into the latter category.

This kind of argument is literally why trademark law exists. OSI did not elect to go down that path. Maybe they should have, but I respect their decision not to, and perhaps you should, too.


> Terms of art do not require licenses.

Agreed. There is no trademark on aileron or carburetor or context-free grammar. A couple of years ago I made this same point myself. [0]

> A given term is either an ordinary dictionary word that everyone including the courts will readily recognize ("Vegan"), a trademark ("Microsoft® Office 365™"), or a fragment of language that everyone can feel free to use for their own purposes without asking permission. "Open Source" falls into the latter category.

This taxonomy doesn't hold up.

Again, it's a term of art with a clear meaning accepted by its community. We've seen numerous instances of cynical and deceptive misuse of the term, which the community rightly calls out because it's not fair play, it's deliberate deception.

> This kind of argument is literally why trademark law exists

It is not. Trademark law exists to protect brands, not to clarify terminology.

You seem to be contradicting your earlier point that terms of art do not require licenses.

> OSI did not elect to go down that path. Maybe they should have, but I respect their decision not to, and perhaps you should, too.

I haven't expressed any opinion on that topic, and I don't see a need to.

[0] https://news.ycombinator.com/item?id=31203209


If the OSI members wanted to "clarify the terminology" in a way that permitted them (and you) to exclude others, trademark law would have absolutely been the correct way to do that. It's too late, however. The ship has sailed.

Come up with a new term and trademark that, and heck, I'll help you out with a legal fund donation when Facebook and friends inevitably try to appropriate it. Apart from that, you've fought the good fight and done what you could. Let it go.


The OSI was created over 25 years ago and defined and popularized the term open source. Their definition has been widely accepted over that period.

Recently, companies are trying to market things as open source when in reality, they fail to adhere to the definition.

I think we should not let these companies change the meaning of the term, which means it's important to explain every time they try to seem more open than they are.

I'm afraid the battle is being lost though.


>The OSI was created about 20 years ago and defined and popularized the term open source. Their definition has been widely accepted over that period.

It was defined and accepted by the community well before OSI came around though.


Citation? Wikipedia would appreciate your contribution.

https://en.wikipedia.org/wiki/Open_source

> Linus Torvalds, Larry Wall, Brian Behlendorf, Eric Allman, Guido van Rossum, Michael Tiemann, Paul Vixie, Jamie Zawinski, and Eric Raymond [...] > At that meeting, alternatives to the term "free software" were discussed. [...] Raymond argued for "open source". The assembled developers took a vote, and the winner was announced at a press conference the same evening

The original "Open source Definition" was derived from Debian's Social Contract, which did not use the term "open source"

https://web.archive.org/web/20140328095107/http://www.debian...


Citation? Wikipedia would appreciate your contribution.

It's not hard to find earlier examples where the phrase is used to describe enabling and (yes) leveraging community contributions to accomplish things that otherwise wouldn't be practical; see my other post for a couple of those.

But then people will rightfully object that the term "Open Source", when used in a capacity related to journalistic or intelligence-gathering activities, doesn't have anything to do with software licensing. Even if OSI had trademarked the phrase, which they didn't, that shouldn't constrain its use in another context.

To which I'd counter that this statement is equally true when discussing AI models. We are going to have to completely rewire copyright law from the ground up to deal with this. Flame wars over what "Open Source" means or who has the right to use the phrase are going to look completely inconsequential by the time the dust settles.


I'll concede that "open source" may mean other things in other contexts. For example, an open source river may mean something in particular to those who study rivers. This thread was not talking about a new context, it was not even talking about the weights of a machine learning model or the licensing of training data, it was talking about the licensing of the code in a particular GitHub repository, llama3.

AI may make copyright obsolete, or it may make copyright more important than ever, but my prediction is that the IT community will lose something of great value if the term "open source" is diluted to include licenses that restrict usage, restrict distribution, and restrict modification. I can understand why people may want to choose somewhat restrictive licenses, just like I can understand why a product may contain gelatin, but I don't like it when the product is mis-labelled as vegan. There are plenty of other terms that could be used, for example, "open" by itself. I'm honestly curious if you would defend a pork product labelled as vegan, or do you just feel that the analogy doesn't apply?


This is like saying any python program is open source because the python runtime is open source.

Inference code is the runtime; the code that runs the model. Not the model itself.


I disagree. The file I linked to, model.py, contains the Llama 3 model itself.

You can use that model with open data to train it from scratch yourself. Or you can load Meta’s open weights and have a working LLM.


Yeah a lot of people here seem to not understand that PyTorch really does make model definitions that simple, and that has everything you need to resume back-propagation. Not to mention PyTorch itself being open-sourced by Meta.

That said, the Llama license doesn't meet strict definitions of open source, and I bet they have internal tooling for datacenter-scale training that's not represented here.
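
To make the point concrete, here is a self-contained toy of the same pattern (names are illustrative, not Meta's actual API): an architecture definition plus a released state dict is everything PyTorch needs to resume back-propagation.

    import torch
    import torch.nn as nn

    class TinyLM(nn.Module):
        def __init__(self, dim=64, vocab=1000):
            super().__init__()
            self.emb = nn.Embedding(vocab, dim)   # "the code": the architecture
            self.out = nn.Linear(dim, vocab)

        def forward(self, ids):
            return self.out(self.emb(ids))

    model = TinyLM()
    # A weights release amounts to: torch.save(model.state_dict(), "weights.pth")
    # A user resumes with:          model.load_state_dict(torch.load("weights.pth"))
    assert all(p.requires_grad for p in model.parameters())  # nothing is frozen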


> The file I linked to, model.py, contains the Llama 3 model itself.

That makes it source available ( https://en.wikipedia.org/wiki/Source-available_software ), not open source


Source available means you can see the source, but not modify it. This is kinda the opposite, you can modify the model, but you don't see all the details of its creation.

> Source available means you can see the source, but not modify it.

No, it doesn't mean that. To quote the page I linked, emphasis mine,

> Source-available software is software released through a source code distribution model that includes arrangements where the source can be viewed, and in some cases modified, but without necessarily meeting the criteria to be called open-source. The licenses associated with the offerings range from allowing code to be viewed for reference to allowing code to be modified and redistributed for both commercial and non-commercial purposes.

> This is kinda the opposite, you can modify the model, but you don't see all the details of its creation.

Per https://github.com/meta-llama/llama3/blob/main/LICENSE there's also a laundry list of ways you're not allowed to use it, including restrictions on commercial use. So not Open Source.


That's not the training code, just the inference code. The training code, running on thousands of high-end H100 servers, is surely much more complex. They also don't open-source the dataset, or the code they used for data scraping/filtering/etc.

"just the inference code"

It's not the "inference code", its the code that specifies the architecture of the model and loads the model. The "inference code" is mostly the model, and the model is not legible to a human reader.

Maybe someday open source models will be possible, but we will need much better interpretability tools so we can generate the source code from the model. In most software projects you write the source as a specification that is then used by the computer to implement the software, but in this case the process is reversed.


That is just the inference code. Not training code or evaluation code or whatever pre/post processing they do.

Is there an LLM with actual open source training code and dataset? Besides BLOOM https://huggingface.co/bigscience/bloom


Yes, there are a few dozen full open source models (license, code, data, models)

What are some of the other ones? I am aware mainly of OLMo (https://blog.allenai.org/olmo-open-language-model-87ccfc95f5...)

The term “source code” can mean many things. In a legal context it’s often just defined as the preferred format for modification. It can be argued that for artificial neural networks that’s the weights (along with code and preferably training data).

Can’t you do fine tuning on those binaries? That’s a modification.

You can fine tune the models, and you can modify binaries. However, there is no human readable "source" to open in either case. The act of "fine tuning" is essentially brute forcing the system to gradually alter the weights such that loss is reduced against a new training set. This limits what you can actually do with the model vs an actual open source system where you can understand how the system is working and modify specific functionality.

Additionally, models can be (and are) fine tuned via APIs, so if that is the threshold required for a system to be "open source", then that would also make the GPT4 family and other such API only models which allow finetuning open source.


I don't find this argument super convincing.

There's a pretty clear difference between the 'finetuning' offered via API by GPT4 and the ability to do whatever sort of finetuning you want and get the weights at the end that you can do with open weights models.

"Brute forcing" is not the correct language to use for describing fine-tuning. It is not as if you are trying weights randomly and seeing which ones work on your dataset - you are following a gradient.


"There's a pretty clear difference between the 'finetuning' offered via API by GPT4 and the ability to do whatever sort of finetuning you want and get the weights at the end that you can do with open weights models."

Yes, the difference is that one is provided over a remote API, and the provider of the API can restrict how you interact with it, while the other is performed directly by the user. One is a SaaS solution, the other is a compiled solution, and neither are open source.

""Brute forcing" is not the correct language to use for describing fine-tuning. It is not as if you are trying weights randomly and seeing which ones work on your dataset - you are following a gradient."

Whatever you want to call it, this doesn't sound like modifying functionality in source code. When I modify source code, I might make a change, check what that does, change the same functionality again, check the new change, etc... up to maybe a couple dozen times. What I don't do is have a very simple routine make very small modifications to all of the system's functionality, then check the result of that small change across the broad spectrum of functionality, and repeat millions of times.


The gap between fine-tuning API and weights-available is much more significant than you give it credit for.

You can take the weights and train LoRAs (which is close to fine-tuning), but you can also build custom adapters on top (classification heads). You can mix models from different fine-tunes or perform model surgery (adding additional layers, attention heads, MoE).

You can perform model decomposition and amplify some of its characteristics. You can also train multi-modal adapters for the model. Prompt tuning requires weights as well.

I would even say that having the model is more potent in the hands of individual users than having the dataset.
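
As a rough sketch of the "custom adapter" case (names are illustrative; `base` is assumed to be any module returning hidden states of size `hidden_dim`), this is the sort of thing a hosted fine-tuning API simply cannot offer:

    import torch.nn as nn

    class WithClassifierHead(nn.Module):
        def __init__(self, base, hidden_dim, num_labels):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                 # keep the released weights frozen
            self.head = nn.Linear(hidden_dim, num_labels)  # only this part is trained

        def forward(self, input_ids):
            hidden = self.base(input_ids)               # assumed shape: (batch, seq, hidden_dim)
            return self.head(hidden[:, -1, :])          # classify from the final position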


That still doesn't make it open source.

There is a massive difference between a compiled binary that you are allowed to do anything you want with, including modifying it, building something else on top or even pulling parts of it out and using in something else, and a SaaS offering where you can't modify the software at all. But that doesn't make the compiled binary open source.


> When I modify source code, I might make a change, check what that does, change the same functionality again, check the new change, etc... up to maybe a couple dozen times.

You can modify individual neurons if you are so inclined. That's what Anthropic have done with the Claude family of models [1]. You cannot do that using any closed model. So "Open Weights" looks very much like "Open Source".

Techniques for introspection of weights are very primitive, but I do think new techniques will be developed, or even new architectures which will make it much easier.

[1] https://www.anthropic.com/news/mapping-mind-language-model


"You can modify individual neurons if you are so inclined."

You can also modify a binary, but that doesn't mean that binaries are open source.

"That's what Anthropic have done with the Claude family of models [1]. ... Techniques for introspection of weights are very primitive, but i do think new techniques will be developed"

Yeah, I don't think what we have now is robust enough interpretability to be capable of generating something comparable to "source code", but I would like to see us get there at some point. It might sound crazy, but a few years ago the degree of interpretability we have today (thanks in no small part to Anthropic's work) would have sounded crazy.

I think getting to open-sourceable models is probably pretty important for producing models that actually do what we want them to do, and as these models become more powerful and integrated into our lives and production processes, the inability to make them do what we actually want them to do may become increasingly dangerous. Muddling the meaning of open source today to market your product, then, can have troubling downstream effects, as focus in the open source community may be taken away from interpretability and put on distributing and tuning public weights.


> a few years ago the degree of interpretability we have today (thanks in no small part to Anthropic's work) would have sounded crazy

My understanding is that a few years ago, if we knew the degree of interpretability we have today (compared to capability) it would have been devastatingly disappointing.

We are climbing out of the trough of disillusionment maybe, but to say that we have reached mind-blowing heights with interpretability seems a bit of a hyperbole, unless I've missed some enormous breakthrough.


"My understanding is that a few years ago, if we knew the degree of interpretability we have today (compared to capability) it would have been devastatingly disappointing."

I think this is a situation where both things are true. Much more progress has been made in capabilities research than interpretability and the interpretability tools we have now (at least, in regards to specific models) would have been seen as impossible or at least infeasible a few years back.


You make a good point but those are also just limitations of the technology (or at least our current understanding of it)

Maybe an analogy would help. A family spent generations breeding the perfect apple tree and they decided to “open source” it. What would open sourcing look like?


Your hypothetical apple-grower family would simply share a handbook which meticulously shared the initial species of apple used, the breeding protocol, the hybridization method, and any other factors used to breed this perfect apple.

Having the handbook and materials available would make it possible for others to reproduce the resulting apple, or to obtain similar apples with different properties by modifying the protocols.

The handbook is the source code.

On the other hand, what we have here is Monsanto saying: "we've got those Terminator-lineage apples, and we're open-sourcing them by giving you the actual apples as an end product for free. Feel free to breed them into new varieties at will as long as you're not a Big Farm company."

Not open source.


What would enable someone to reproduce the tree from scratch, and continue developing that line of trees, using tools common to apple tree breeders? I’m not an apple tree breeder, but I suspect that’s the seeds. Maybe the genetic sequence is like source code in some analogical sense, but unless you can use that information to produce an actual seed, it doesn’t qualify in a practical sense. Trees don’t have a “compilation phase” to my knowledge, so any use of “open source” would be a stretch.

"You make a good point but those are also just limitations of the technology (or at least our current understanding of it)"

Yeah, that is my point. Things that don't have source code can't be open source.

"Maybe an analogy would help. A family spent generations breeding the perfect apple tree and they decided to “open source” it. What would open sourcing look like?"

I think we need to be wary of dilemmas without solutions here. For example, let's think about another analogy: I was in a car accident last week. How can I open source my car accident?

I don't think all, or even most things, are actually "open sourcable". ML models could be open sourced, but it would require a lot of work to interpret the models and generate the source code from them.


Be charitable and intellectually curious. What would "open" look like?

GNU says "The GNU GPL can be used for general data which is not software, as long as one can determine what the definition of “source code” refers to in the particular case. As it turns out, the DSL (see below) also requires that you determine what the “source code” is, using approximately the same definition that the GPL uses."

and offers these categories, for example:

https://www.gnu.org/licenses/license-list.en.html#NonFreeSof...

* Software Licenses

* * GPL-Compatible Free Software Licenses

* * GPL-Incompatible Free Software Licenses

* Licenses For Documentation

* * Free Documentation Licenses

* Licenses for Other Works

* * Licenses for Works of Practical Use besides Software and Documentation

* * Licenses for Fonts

* * Licenses for Works stating a Viewpoint (e.g., Opinion or Testimony)

* * Licenses for Designs for Physical Objects


"Be charitable and intellectually curious. What would "open" look like?"

To really be intellectually curious we need to be open to the idea that there is not (yet) a solution to this problem. Or in the analogy you laid out, that it is simply not possible for the system to be "open source".

Note that most of the licenses listed under the "Licenses for Other Works" section say "It is incompatible with the GNU GPL. Please don't use it for software or documentation, since it is incompatible with the GNU GPL and with the GNU FDL." This is because these are not free software/open source licenses. They are licenses that the FSF endorses because they encourage openness and copyleft in non-software mediums, and play nicely with the GPL when used appropriately (i.e. not for software).

The GPL is appropriate for many works that we wouldn't conventionally view as software, but in those contexts the analogy is usually so close to the literal nature of software that it stops being an analogy. The major difference is public perception. For example, we don't generally view jpegs as software. However, jpegs, at their heart, are executable binaries with very domain specific instructions that are executed in a very much non-Turing complete context. The source code for the jpeg is the XCF or similar (if it exists) which contains a specification (code) for building the binary. The code becomes human readable once loaded into an IDE, such as GIMP, designed to display and interact with the specification. This is code that is most easily interacted with using a visual IDE, but that doesn't change the fact that it is code.

There are some scenarios where you could identify a "source code" but not a "software". For example, a cake can be open sourced by releasing the recipe. In such a context, though, there is literally source code. It's just that the code never produces a binary, and is compiled by a human and kitchen instead of a computer. There is open source hardware, where the source code is a human readable hardware specification which can be easily modified, and the hardware is compiled by a human or machine using that specification.

The scenario where someone has bred a specific plant, however, can not be open source, unless they have also deobfuscated the genome, released the genome publicly, and there is also some feasible way to convert the deobfuscated genome, or a modification of it, into a seed.


> vs an actual open source system where you can understand how the system is working and modify specific functionality.

No one on the planet understands how the model weights work exactly, nor can they modify them specifically (i.e. hand modifying the weights to get the result they want). This is an impossible standard.

The source code is open (sorta, it does have some restrictions). The weights are open. The training data is closed.


> No one on the planet understands how the model weights work exactly

Which is my point. These models aren't open source because there is no source code to open. Maybe one day we will have strong enough interpretability to generate source from these models, and then we could have open source models. But today it's not possible, and changing the meaning of open source such that it is possible probably isn't a great idea.


It's no secret that implementing AI usually involves far more investment into training and teaching than actual code. You can know how a neural net or other ML model works. You can have all the code before you. It's still a huge job (and investment) to do anything practical with that. If Meta shares the code their AI runs on with you, you're not going to be able to do much with it unless you make the same investment in gathering data and teaching to train that AI. That would probably require data Meta won't share. You'd effectively need your own Facebook.

If everyone open sources their AI code, Meta can snatch the bits that help them without much fear of helping their direct competitors.


I think you're misunderstanding what I'm saying. I don't think it's technically feasible for current models to be open source, because there is no source code to open. Yes, there is a harness that runs the model, but the vast, vast majority of the instructions are contained in the model weights, which are akin to a compiled binary.

If we make large strides in interpretability we may have something resembling source code, but we're certainly not there yet. I don't think the solution to that problem should be to change the definition of open source and pretend the problem has been solved.


You release all the technology and the training data. Everything that was used to create the model, including instructions.

I'm not sure if Facebook has done that.


I agree; there's a lot of muddiness in the term "open source AI". Earlier this year there was a talk[1] at FOSDEM, titled "Moving a step closer to defining Open Source AI". It is from someone at the Open Source Initiative. The video and slides are available in the link below[1]. From the abstract:

"Finding an agreement on what constitutes Open Source AI is the most important challenge facing the free software (also known as open source) movement. European regulation already started referring to "free and open source AI", large economic actors like Meta are calling their systems "open source" despite the fact that their license contain restrictions on fields-of-use (among other things) and the landscape is evolving so quickly that if we don't keep up, we'll be irrelevant."

[1] https://fosdem.org/2024/schedule/event/fosdem-2024-2805-movi... defining-open-source-ai/


Open source = reproducible binaries (weights) by you on your computer, IMO.

Strategy of FB is that they are good to be a user only, and fine with ruining competitors' businesses with good-enough free alternatives while collecting awards as saviors of whatever.


If that were the definition then any software you can install on your computer would be open source. It makes open source lose nearly all meaning.

Just say "open weights", not "open source".


Not sure what you mean by "they are good to be a user only." Whatever their strategy is, this is great for the community.

Coming up with the words and concepts to describe the models is a challenge.

Does the training data require permission from the copyright holder to use? Are the weights really open source or more like compiled assembly?


Open training dataset + open steps sufficient to train exactly the same model.

This isn't what Meta releases with their models, though I would like to see more public training data. However, I still don't think that would qualify as "open source". Something isn't open source just because it's reproducible out of composable parts. If one very critical, system-defining part is a binary (or similar) without publicly available source code, then I don't think it can be said to be "open source". That would be like saying that Windows 11 is open source because Windows Calculator is open source and it's a component of Windows.

Here’s one list of what is needed to be actually open source:

https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e73...


That's what I meant by "open steps", I guess I wasn't clear enough.

Is that what you meant? I don't think releasing the sequence of steps required to produce the model satisfies "open source", which is how I interpreted you, because there is still no source code for the model.

They can't release the training dataset if it was illegally scraped from all over the web without permission :) (taps head)

I also think that something like Chromium is a better analogy for corporate open source models than a grassroots project like Linux. Chromium is technically open source, but Google has absolute control over the direction of its development, and realistically it's far too complex to maintain a fork without Google's resources. Likewise, Meta has complete control over what goes into their open models, and even if they did release all the training data and code (which they don't), us mere plebs could never afford to train a fork from scratch anyway.

I think you’re right from the perspective of an individual developer. You and I are not about to fork Chromium any time soon. If you presume that forking is impractical then sure, the right to fork isn’t worth much.

But just because a single developer couldn’t do it doesn’t mean it couldn’t be done. It means nobody has organized a large enough effort yet.

For something like a browser, which is critical for security, you need both the organization and the trust. Despite frequent criticism, Mozilla (for example) is still considered pretty trustworthy in a way that an unknown developer can’t be.


If Microsoft can't do it, then we can reasonably conclude that it can't be done for any practical purpose. Discussing infinitesimal possibilities is better left to philosophers.

Doesn’t Microsoft maintain its own fork of Chromium?

yes - their browser is chromium-based

No, open source means that the sources are open, typically for inspection, modification, etc. Here it can arguably be considered the case too. To claim "true open source", they would likely have to share the dataset? But even that might not be enough for a truly open source model, since the dataset is just another artifact. To show how they arrived at that dataset, they would also have to share pipelines and infra...

.. the thing is, we haven't dealt with LLMs for long, so it's hard to say what can be considered an open source LLM just yet; for now we use the term as a metaphor


If you think about LLMs as a new kind of programming runtime, the matrices are the source.

Ok call it Open Weights then if the dictionary definitions matter so much to you.

The actual point that matters is that these models are available for most people to use for a lot of stuff, and this is way way better than what competitors like OpenAI offer.


They don't "[allow] developers to modify its code however they want", which is a critical component of "open source", and one that Meta is clearly trying to leverage in branding around its products. I would like them to start calling these "public weight models", because what they're doing now is muddying the waters so much that "open source" now just means providing an enormous binary and an open source harness to run it in, rather than serving access to the same binary via an API.

Feels a bit like you are splitting hairs for the pleasure of a semantic argument, to be honest. Yes, there is no source in ML, so if we want to be pedantic it shouldn't be called open source. But what really matters in the open source movement is that we are able to take a program built by someone and modify it to do whatever we want with it, without having to ask for permission, get scrutinized, or pay someone.

The same applies here, you can take those models and modify them to do whatever you want (provided you know how to train ML models), without having to ask for permission, get scrutinized or pay someone.

I personally think using the term open source is fine, as it conveys the intent correctly, even if, yes, weights are not sources you can read with your eyes.


Calling that “open source” renders the word “source” meaningless. By your definition, I can release a binary executable freely and call it “open source” because you can modify it to do whatever you want.

Model weights are like a binary that nobody has the source for. We need another term.


No, it’s not the same as releasing a binary; it feels like we can’t get out of the pedantry. I can in theory modify a binary to do whatever I want. In practice it is intractably hard to make any significant modification to a binary, and even if you could, you would then not be legally allowed to e.g. redistribute it.

Here, modifying that model is not harder than doing regular ML, and I can redistribute.

Meta doesn’t have some magic higher-level abstraction for that model, left unreleased, that would make working with it easier.

The sources in ML are the architecture, the training and inference code, and a paper describing the training procedure. It’s all there.


"In practice it is intractably hard to make any significant modification to a binary, and even if you could, you would then not be legally allowed to e.g. redistribute."

It depends on the binary and the license the binary is released under. If the binary is released to the public domain, for example, you are free to make whatever modifications you wish. And there are plenty of licenses like this, that allow closed source software to be used as the user wishes. That doesn't make it open source.

Likewise, there are plenty of closed source projects whose binaries we can poke and prod with a much better understanding of what our changes are actually doing than we get when we poke and prod LLMs. If you want to make a Pokemon Red/Blue or Minecraft mod you have a lot of tools at your disposal.

A project that only exists as a binary which the copyright holder has relinquished rights to, or has released under some similarly permissive closed source license, but which people have poked around enough to figure out how to modify certain parts of with some degree of predictability, is a more apt analogy. Especially if the original author has lost the source code, as there is no source code to speak of when discussing these models.

I would not call that binary "open source", because the source would, in fact, not be open.


Can you change the tokenizer? No, because all you have is the weights trained with the current tokenizer. Therefore, by any normal definition, you don’t have the source. You have a giant black box of numbers with no ability to reproduce it.

> Can you change the tokenizer?

Yes.

You can change it however you like, then look at the paper [1] under section 3.2. to know which hyperparameters were used during training and finetune the model to work with your new tokenizer using e.g. FineWeb [2] dataset.

You'll need to do only a fraction of the training you would have needed to do if you were to start a training from scratch for your tokenizer of choice. The weights released by Meta give you a massive head start and cost saving.

The fact that it's not trivial to do and out of reach of most consumers is not a matter of openness. That's just how ML is today.

[1]: https://scontent-sjc3-1.xx.fbcdn.net/v/t39.2365-6/452387774_...

[2]: https://huggingface.co/datasets/HuggingFaceFW/fineweb
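
A rough sketch of what that could look like with the Hugging Face stack (the tokenizer repo name is a placeholder I made up, and the hyperparameters are illustrative, not the values from the paper):

    # Hedged sketch: swap in a new tokenizer and continue training the released
    # Llama 3.1 weights on a FineWeb sample. Assumes transformers/datasets/torch;
    # "my-org/my-new-tokenizer" is a placeholder, not a real repo.
    import torch
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16
    )
    new_tokenizer = AutoTokenizer.from_pretrained("my-org/my-new-tokenizer")

    # The embedding and output layers must be resized to the new vocabulary;
    # those parameters are effectively reinitialised and are what the finetune relearns.
    model.resize_token_embeddings(len(new_tokenizer))

    # Small FineWeb sample for the sketch; a real run would use far more data.
    raw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train[:1%]")

    def tokenize(batch):
        return new_tokenizer(batch["text"], truncation=True, max_length=2048)

    train_data = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="llama31-new-tokenizer",
            per_device_train_batch_size=1,
            gradient_accumulation_steps=64,
            learning_rate=1e-5,   # placeholder; the paper documents the real schedule
            bf16=True,
            num_train_epochs=1,
        ),
        train_dataset=train_data,
        data_collator=DataCollatorForLanguageModeling(new_tokenizer, mlm=False),
    )
    trainer.train()

Still a lot of compute, but a small fraction of a from-scratch pretraining run.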


You can change the tokenizer and build another model, if you can come up with your own version of the rest of the source (e.g., the training set, RLHF, etc.). You can’t change the tokenizer for this model, because you don’t have all of its source.

There is nothing that requires you to train with the same training set, or to re-do RLHF. You can train on fineweb, and llama 3.1 will learn to use your new tokenizer just fine.

There is 0 doubt that you are better off finetuning that model to use your tokenizer than training from scratch. So what Meta gives you for free massively helps you build your model; that's OSS to me.


You have to write all the code needed to make the modifications you are interested in. That is, there is no source code provided that can be used to make the modifications of interest. One also has to come up with suitable datasets from scratch. Training setup and data are completely non-trivial for a large language model. To replicate Llama would take hundreds of hours of engineering, at least.

> You have to write all the code needed to do the modifications you are interested in. That is, there is no source code provided that can be used to make the modifications of interest.

Just like open source?

> Training setup and data is completely non trivial for a large language model. To replicate Llama would take hundreds of hours of engineering, at least.

The entire point of having the pre-trained weight released is to *not* have to do this. You just need to finetune, which can be done with very little data, depending on the task, and many open source toolkits, that work with those weights, exist to make this trivial.
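
For instance, a minimal sketch with the peft library, one of those toolkits (the module names follow the Llama attention projections, and the exact LoRA settings here are illustrative, not a recommendation):

    # Hedged sketch: attach LoRA adapters to the released weights so that only a
    # tiny fraction of parameters needs training. Assumes transformers + peft.
    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16
    )

    lora = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)

    # Typically well under 1% of the parameters end up trainable, which is what
    # makes small-data finetuning on a single GPU plausible.
    model.print_trainable_parameters()
    # ...then train the adapters on your own data with a normal training loop
    # or the transformers Trainer.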


I think maybe we’re talking past each other because it seems obvious to me and others that the weights are the output of the compilation process, whereas you seem to think they’re the input. Whether you can fine tune the weights is irrelevant to whether you got all the materials needed to make them in the first place (i.e., the source).

I can do all sorts of things by “fine tuning” Excel with formulas, but I certainly don’t have the source for Excel.


> The same applies here, you can take those models and modify them to do whatever you want without having to ask for permission, get scrutinized or pay someone.

The "Additional Commercial Terms" section of the license includes restrictions that would not meet the OSI definition of open source. You must ask for permission if you have too many users.


"Public weight models" sounds about right, thanks for coming up with a good term! Hope it catches.

My central point is this:

"are available for most people to use for a lot of stuff, and this is way way better than what competitors like OpenAI offer."

I presume you agree with it.

> rather than serving access

It's not the same access though.

I am sure that you are creative enough to think of many questions that you could ask llama3, that would instead get you kicked off of OpenAI.

> They don't "[allow] developers to modify its code however they want"

Actually, the fact that the model weights are available means that you can even ignore any limitations that you think are on it, and you'll probably just get away with it. You are also ignoring the fact that the limitations are minimal to most people.

That's a huge deal!

And it is dishonest to compare a situation where limitations are both minimal and almost unenforceable (except maybe against Google) to a situation where it's physically not possible to get access to the model weights to do what you want with them.


> Actually, the fact that the model weights are available means that you can even ignore any limitations that you think are on it, and you'll probably just get away with it. You are also ignoring the fact that the limitations are minimal to most people.

The limitations here are technical, not legal. (Though I am aware of the legal restrictions as well, and I think it's worth noting that no other project would get away with calling itself open source while imposing a restriction which prevents competitors from using the system to build their competing systems.) There isn't any source code to read and modify. Yes, you can fine tune a model just like you can modify a binary, but this isn't source code. Source code is a human-readable specification that a computer can transform into executable code. That allows a human to directly modify functionality in the specification. We simply don't have that, and it won't be possible unless we make a lot of strides in interpretability research.

> Its not the same access though.

> I am sure that you are creative enough to think of many questions that you could ask llama3, that would instead get you kicked off of OpenAI.

I'm not saying that systems provided as SaaS don't tend to be more restrictive in terms of what they let you do through the API they expose vs what is possible if you run the same system locally. That may not always be true, but sure, as a general rule it is. I mean, it can't be less restrictive. However, being able to run code on your own machine doesn't make that code open source. I wouldn't consider Windows open source, for example. Why? Because they haven't released the source code for Windows. Likewise, I wouldn't consider these models open source because their creators haven't released source code for them. Releasing source code being technically infeasible doesn't mean the definition changes so that it's no longer infeasible. It is simply infeasible, and if we want to change that, we need to do work in interpretability, not pretend the problem is already solved.


So then yes you agree with this:

"are available for most people to use for a lot of stuff, and this is way way better than what competitors like OpenAI offer." And that this is very significant.


One counterpoint is that major publications (eg New York Times) would have you believe that AI is a mildly lossy compression algorithm capable of reconstructing the original source material.

I believe it is able to reconstruct parts of the original source material—if the interrogator already knows the original source material to prompt the model appropriately.

It's not?

Unfortunately open source really just means an open API these days. The API is heavily intertwined with closed source.

Weights are the new code.

I think saying it's the new binary is closer to the truth. You can't reproduce it, but you can use it. In this new version, you can even nudge it a bit to do something a little different.

New stuff, so probably not good to force old words, with known meanings, onto new stuff.


The model is more akin to a Python script than a compiled C binary. This is how I see it:

Training code and dataset are analogous to the developer who wrote the script.

Model and weights are the end product that is then released.

Inference code is the runtime that executes it. That would be e.g. PyTorch, which can import the weights and run inference.
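
For example, a minimal sketch of that runtime side, assuming the Hugging Face transformers/PyTorch stack (prompt and generation settings are just illustrative):

    # Hedged sketch of the "runtime" part: the inference code loads the released
    # weights and executes them, much like an interpreter running a script.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    prompt = "Explain the difference between open weights and open source in one sentence."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))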


> The model is more akin to a python script than a compiled C binary.

No, I completely disagree. Python is close to pseudocode. Source exists for the specific purpose of being easily and completely understood, by humans, because it's for and from humans. You can turn a Python calculator into a web server, because it can be split and separated at any point, because it can be completely understood at any point, and it's deterministic at every point.

A model cannot be understood by a human. It isn't meant to be. It's meant to be used, very close to as is. You can't fundamentally change the model, or dissect it, you can only nudge it in a direction, with the force of that nudge being proportional to the money you can burn, along with hope that it turns out how you want.

That's why I say it's closer to a binary: more of a black box you can use. You can't easily make a binary do something fundamentally different without changing the source. You can't easily see into that black box, or even know what it will do without trying. You can only nudge it to act a little differently, or use it as part of a workflow. (decompilation tools aside ;))


None of Meta's models are "open source" in the FOSS sense, even the latest Llama 3.1. The license is restrictive. And no one has bothered to release their training data either.

This post is an ad and trying to paint these things as something they aren't.


> no one has bothered to release their training data

If the FOSS community sets this as the benchmark for open source in respect of AI, they're going to lose control of the term. In most jurisdictions it would be illegal for the likes of Meta to release training data.


Regardless of the training data, the license even heavily restricts how you can use the model.

Please read through their "acceptable use" policy before you decide whether this is really in line with open source.


> Please read through their "acceptable use" policy before you decide whether this is really in line with open source

I'm not taking a specific position on this license. I haven't read it closely. My broad point is simply that open source AI, as a term, cannot practically require the training data be made available.


> In most jurisdictions it would be illegal for the likes of Meta to release training data.

How come releasing an LLM trained on that data is not illegal then? I think it should be.


the training data is the source.

I don’t think it’s that simple. The source is “the preferred form of the work for making modifications to it” (to use the GPL’s wording).

For an LLM, that’s not the training data. That’s the model itself. You don’t make changes to an LLM by going back to the training data and making changes to it, then re-running the training. You update the model itself with more training data.

You can’t even use the training code and original training data to reproduce the existing model. A lot of it is non-deterministic, so you’ll get different results each time anyway.

Another complication is that the object code for normal software is a clear derivative work of the source code. It’s a direct translation from one form to another. This isn’t the case with LLMs and their training data. The models learn from it, but they aren’t simply an alternative form of it. I don’t think you can describe an LLM as a derivative work of its training data. It learns from it, it isn’t a copy of it. This is mostly the reason why distributing training data is infeasible – the model’s creator may not have the license to do so.

Would it be extremely useful to have the original training data? Definitely. Is distributing it the same as distributing source code for normal software? I don’t think so.

I think new terminology is needed for open AI models. We can’t simply re-use what works for human-editable code because it’s a fundamentally different type of thing with different technical and legal constraints.


No, the preferred way to make modifications is using the training code. One may also input snapshot weights to start from, but the training code is definitely what you would modify to make a change.

how do you train it in a different language by changing the training code?

By selecting a different dataset. Of course that dataset does need to exist. In practice, building and curating datasets also involves a lot of code.

sounds like you need the data to train the model.

Given a well-behaved training setup, you will get an equivalently powerful model from the same dataset, training scripts, and training settings. At least if you are willing to run it several times and pick the best one - a process that is commonly used for large models.

> the training data is the source

Sure. But that's not going to be released. The term open source AI cannot be expected to cover it because it's not practical.


Meta can call it something else other than open source.

Synthetic part of the training data could be released.


Of course it could be practical - provide the data. The fact that society is a dystopian nightmare controlled by a few megacorporations that don't want free information does not justify outright changing the meaning of the language.

> provide the data

Who? It's not their data.


why are they using it?

And why legislation allows them to use the data to train their LLM and release that, but not release the data?

So because it's really hard to do proper Open Source with these LLMs, means we need to change the meaning of Open Source so it fits with these PR releases?

> because it's really hard to do proper Open Source with these LLMs, means we need to change the meaning of Open Source so it fits with these PR releases?

Open training data is hard to the point of impracticality. It requires excluding private and proprietary data.

Meanwhile, the term "open source" is massively popular. So it will get used. The question is how.

Meta et al would love for the choice to be between, on one hand, open weights only, and, on the other hand, open training data, because the latter is impractical. That dichotomy guarantees that when someone says open source AI they'll mean open weights. (The way open source software, today, generally means source available, not FOSS.)


>Meanwhile, the term "open source" is massively popular. So it will get used. The question is how.

Here's the source of the disagreement. You're justifying the use of the term "open source" by saying it's logical for Meta to want to use it for its popularity and layman (incorrect) understanding.

Other person is saying it doesn't matter how convenient it is or how much Meta wants to use it, that the term "open source" is misleading for a product where the "source" is the training data, and the final product has onerous restrictions on use.

This would be like Adobe giving Photoshop away for free, but for personal use only and not for making ads for Adobe's competitors. Sure, Adobe likes it and most users may be fine with it, but it isn't open source.

>The way open source software, today, generally means source available, not FOSS.

I don't agree with that. When a company says "open source" but it's not free, the tech community is quick to call it "source available" or "open core".


> You're justifying the use of the term "open source" by saying it's logical for Meta to want to use it for its popularity and layman (incorrect) understanding

I'm actually not a fan of Meta's definition. I'm arguing specifically against an unrealistic definition, because for practical purposes that cedes the term to Meta.

> the term "open source" is misleading for a product where the "source" is the training data, and the final product has onerous restrictions on use

Agree. I think the focus should be on the use restrictions.

> When a company says "open source" but it's not free, the tech community is quick to call it "source available" or "open core"

This isn't consistently applied. It's why we have the free vs open vs FOSS fracture.


> Open training data is hard to the point of impracticality. It requires excluding private and proprietary data.

Right, so the onus is on Facebook/Meta to get that right, then they could call something Open Source, until then, find another name that already doesn't have a specific meaning.

> (The way open source software, today, generally means source available, not FOSS.)

No, but it's going that way. Open Source, today, still means that the things you need to build a project are publicly available for you to download and run on your own machine, granted you have the means to do so. What you're thinking of is literally called "Source Available", which is very different from "Open Source".

The intent of Open Source is for people to be able to reproduce the work themselves, with modifications if they want to. Is that something you can do today with the various Llama models? No, because one core part of the project's "source code" (what you need to reproduce it from scratch), the training data, is being held back and kept private.


source available is absolutely not the same as open source

you are playing very loosely with terms that have specific, widely accepted definitions (e.g. https://opensource.org/osd )

I don't get why you think it would be useful to call LLMs with published weights "open source"


> terms that have specific, widely accepted definitions

The OSI's definition is far from the only one [1]. Switzerland is currently implementing CH Open's definition, the EU another one, et cetera.

> I don't get why you think it would be useful to call LLMs with published weights "open source"

I don't. I'm saying that if the choice is between open weights or open weights + open training data, open weights will win because the useful definition will outcompete the pristine one in a public context.

[1] https://en.wikipedia.org/wiki/Open-source_software#Definitio...


For the EU, I'm guessing you're talking about the EUPL, which is FSF/OSI approved and GPL compatible, generally considered copyleft.

For the CH Open, I'm not finding anything specific, even from Swiss websites, could you help me understand what you're referring to here?

I'm guessing that all these definitions have at least some points in common, which involves (another guess) at least being able to produce the output artifacts/binaries by yourself, something that you cannot do with Llama, just as an example.


> For the CH Open, I'm not finding anything specific, even from Swiss websites, could you help me understand what you're referring to here

Was on the HN front page earlier [1][2]. The definition comes strikingly close to source on request with no use restrictions.

> all these definitions have at least some points in common

Agreed. But they're all different. There isn't an accepted definition of open source even when it comes to software; there is an accepted set of broad principles.

[1] https://news.ycombinator.com/item?id=41047172

[2] https://joinup.ec.europa.eu/collection/open-source-observato...


> Agreed. But they're all different. There isn't an accepted definition of open source even when it comes to software; there is an accepted set of broad principles.

Agreed, but are we splitting hairs here and is it relevant to the claim made earlier?

> (The way open source software, today, generally means source available, not FOSS.)

Do any of these principles or definitions from these orgs agree/disagree with that?

My hypothesis is that they generally would go against that belief and instead argue that open source is different from source available. But I haven't looked specifically to confirm if that's true or not, just a guess.


> are we splitting hairs here and is it relevant to the claim made earlier?

I don't think so. Take the Swiss definition. Source on request, not even available. Yet being branded and accepted as open source.

(To be clear, the Swiss example favours FOSS. But it also permits source on request and bundles them together under the same label.)


diluting open source into a marketing term meaning "you can download something" would be a sad result

> specific, widely accepted definitions

Realistically, nobody outside of Hacker News commenters have ever cared about the OSD. It's just not how the term is used colloquially.


who says open source colloquially? ime anyone who doesn't care about software licenses will just say free (per free beer)

and (strong personal opinion) any software developer should have a firm grip on the terminology and details for legal reasons


> who says open source colloquially?

There is a large span of people between gray beard programmer and lay person, and many in that span have some concept of open-source. It's often used synonymously with visible source, free software, or in this case, open weights.

It seems unfortunate - though expected - that over half of the comments in this thread are debating the OSD for the umpteenth time instead of discussing the actual model release or accompanying news posts. Meanwhile communities like /r/LocalLlama are going hog wild with this release and already seeing what it can do.

> any software developer should have a firm grip on the terminology and details for legal reasons

They'd simply need to review the terms of the license to see if it fits their usage. It doesn't really matter if the license satisfies the OSD or not.


No, we need to adapt an existing term into the new context that it is being deployed in.

We've had a similar debate before, but last time it was about whether Linux device drivers based on non-public datasheets under NDA were actually open source. The debate occurred again over drivers that interact with binary blobs.

I disagree with the purists - if you can legally change the source or weights - even without having access to the data used by the upstream authors - it's open enough for me. YMMV.


No. It's an asset used in the training process; the source code can process arbitrary training data.

I don’t think even that is true. I conjecture that Facebook couldn’t reproduce the model weights if they started over with the same training data, because I doubt such a huge training run is a reproducible deterministic process. I don’t think anyone has “the” source.

numpy.random.seed(1234)
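
Spelled out a bit (a sketch of the usual knobs in a PyTorch training script; even with all of these set, large multi-GPU runs aren't guaranteed to be bit-identical across hardware or library versions):

    # Seed every RNG source the training code touches.
    import random

    import numpy as np
    import torch

    SEED = 1234
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)

    # Prefer deterministic kernels where they exist (may cost performance, and
    # some ops will raise if no deterministic implementation is available).
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False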

AI2 has released training data in their OLMo model: https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e73...
