Around the time of TensorFlow’s rise, foreshadowing what was yet to come in open source AI, enterprise software went through an open source licensing crisis. Largely thanks to AWS, which had mastered the craft of taking open source infrastructure projects and building commercial services around them, many open source projects exchanged their permissive licenses for “Copyleft” or “ShareAlike” (SA) alternatives.
Not all open source is created equal. Permissive licenses (like Apache 2.0 or MIT) allow anyone to take an open source project and build a commercial service around it. “Copyleft” licenses (like the GPL), similar to Creative Commons’ “ShareAlike” terms, are one way to protect against this. They are sometimes referred to as a “poison pill”, because they require any derivative product to be licensed the same way. If AWS launched a service based on an open source project with a “Copyleft” license, the AWS service itself would have to be open sourced under the same license.
So, partially in response to competitive cloud services, the corporate creators and maintainers of open source projects like MongoDB and Redis switched their licenses to less permissive alternatives. This led to a painful but entertaining back-and-forth between AWS and those companies on the principles and merits of open source, which has since calmed down a bit.
Note that this change in licensing has had a deceptive effect on perceptions of the open source ecosystem: there are still plenty of new open source projects being announced, but what can and cannot legally be done with those projects is more complicated than most people realize.
At this point you should be asking yourself: If the corporate maintainers of open source infrastructure projects realized that others were reaping more of the commercial benefits than themselves, shouldn’t the same be happening with AI? Isn’t this an even bigger deal for open source AI models, which hold the aggregate value of compute and data that went into creating them? The answers are: Yes and yes.
Although there seems to be a Robin Hood-esque movement around open source AI, the data points in a different direction. Large corporations like Microsoft are changing the licensing of some of their most popular models from permissive to non-commercial (NC) licenses, and Meta has started to use non-commercial or otherwise restrictive licenses for its recent open source releases (MMS, ImageBind, and DINOv2 are all CC-BY-NC 4.0, and LLaMA’s code is GPL 3.0 while its weights are under a non-commercial research license). Even popular projects from universities, like Stanford’s Alpaca, are licensed only for non-commercial use, a restriction inherited from the data they were trained on. Entire companies have changed their business models in order to protect their IP and rid themselves of the obligation to open source as part of their mission. Remember when a small non-profit called OpenAI transformed itself into a capped-profit? Notice that GPT-2 was open sourced, but GPT-3.5 and GPT-4 were not?
More generally speaking, the trend towards less permissive licenses in AI, although opaque, is noticeable. Below is an analysis of model licenses on Hugging Face. The share of permissive licenses (like Apache, MIT, or BSD) has been in persistent decline since mid-2022, while non-permissive licenses (like GPL) and restrictive licenses (like OpenRAIL) are becoming more common.
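For readers who want to reproduce a rough version of this analysis, here is a minimal sketch of how license tags (as exposed on the Hugging Face Hub, e.g. “apache-2.0”, “gpl-3.0”, “openrail”, “cc-by-nc-4.0”) might be bucketed into the categories above. The category assignments are my own illustrative assumptions, not legal advice, and fetching the actual tags from the Hub (e.g. via `huggingface_hub.list_models`) is left out.

```python
from collections import Counter

# Illustrative buckets; real Hub tags vary and edge cases abound.
PERMISSIVE = {"apache-2.0", "mit", "bsd-2-clause", "bsd-3-clause"}
NON_PERMISSIVE = {"gpl-3.0", "agpl-3.0", "lgpl-3.0"}
RESTRICTIVE = {"openrail", "creativeml-openrail-m", "cc-by-nc-4.0"}

def classify_license(tag: str) -> str:
    """Map a license tag to a coarse category."""
    tag = tag.lower()
    if tag in PERMISSIVE:
        return "permissive"
    if tag in NON_PERMISSIVE:
        return "non-permissive"
    # Treat any "-nc-" variant (non-commercial) as restrictive too.
    if tag in RESTRICTIVE or "nc" in tag.split("-"):
        return "restrictive"
    return "other"

def license_shares(tags: list[str]) -> dict[str, float]:
    """Compute the fraction of each category in a list of license tags."""
    counts = Counter(classify_license(t) for t in tags)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}
```

Running `license_shares` over the license tags of all models on the Hub, grouped by upload month, would yield the trend lines described above.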
To make things worse, the recent frenzy around large language models (LLMs) has further muddied the waters. Hugging Face maintains an “Open LLM Leaderboard” which aims to highlight “the genuine progress that is being made by the open-source community”. To be fair, all of the models on the board are indeed open source. However, a closer look reveals that almost none are licensed for commercial use*.
*Between the writing of this post and its publication, the license for Falcon models changed to the permissive Apache 2.0 license. The overall observation is still valid.
If anything, the Open LLM Leaderboard highlights that innovation from big tech (LLaMA was open sourced by Meta with a non-commercial license) dominates all other open source efforts. The bigger problem is that these derivative models are not as forthcoming about their licenses. Almost none declare their license explicitly, and you have to do your own research to find out that the models and data they are based on don’t allow for commercial use.
There is a lot of virtue-signaling in the community, mostly by well-meaning entrepreneurs and VCs who hope that there is a future that is not dominated by OpenAI, Google, and a handful of others. It is not obvious why AI models should be open sourced — they represent hard-earned intellectual property that companies develop over years, spending billions on compute, data acquisition, and talent. Companies would be defrauding their shareholders if they just gave everything away for free.
The trend towards non-permissive licenses in open source AI seems clear. Yet, the overwhelming volume of news fails to point out that the cumulative benefit of this work accrues almost entirely to academics and hobbyists. Investors and executives alike should be more aware of the implications and exercise more care. I have a strong feeling that most startups in the emerging LLM cottage industry are building on top of non-commercially licensed technology. If I could invest in an ETF for IP lawyers I would.
My prediction is that the value capture for AI (specifically for the latest generation of large generative models) will look similar to other innovations that require significant capital investment and accumulation of specialized talent, like cloud computing platforms or operating systems. A few major players will emerge that provide the AI foundation to the rest of the ecosystem. There will still be ample room for a layer of startups on top of that foundation, but just as there are no open source projects dethroning AWS, I consider it very unlikely that the open source community will produce a serious competitor to OpenAI’s GPT and whatever comes next.