The Emerging Empires of AI & Their Hunger for Data
More is more
Give us your datasets. If you give us your datasets, we'll be very happy.
-Sam Altman, answering a question on what Indonesians can do for OpenAI, at a live event in Jakarta, 2023
The race to build the largest and most powerful model is already underway, with tech giants competing aggressively against each other to seize monopoly power.
Unfortunately for us, the AI sector is one in which bigger tends to be better, and in which giants tend to have an advantage. Bigger companies can afford more compute power and create larger models. However, the more parameters a large language model has, the more data it requires for effective training.
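To put rough numbers on that relationship, here is a back-of-the-envelope sketch using the widely cited "Chinchilla" compute-optimal heuristic of roughly 20 training tokens per parameter; the exact ratio varies by model and training regime, so treat these figures as illustrative only.

```python
# Rough sense of scale: the "Chinchilla" heuristic of ~20 training tokens
# per model parameter (an approximation, not a rule).
TOKENS_PER_PARAM = 20

for params_billion in (7, 70, 400):
    tokens_trillion = params_billion * 1e9 * TOKENS_PER_PARAM / 1e12
    print(f"{params_billion:>3}B parameters -> ~{tokens_trillion:.2f}T training tokens")
```

Even a mid-sized 70B-parameter model wants on the order of a trillion-plus tokens under this heuristic, which is why the supply of fresh training data has become the binding constraint.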
Over the past months, model developers have reached the limits not just of publicly available web-scraped data (such as Common Crawl) but also of datasets known to be compiled from pirated material (such as Books3, a component of The Pile used in the training of models such as Llama).
Efforts to train models on their own outputs have met with only mixed success: research indicates that this leads, over time, to “model collapse” as the original data distribution degrades with each iteration. We have reached a point at which the next generation of models is simply too large to be trained effectively on the data to which producers currently have legal access.
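A toy way to see the mechanism (a hypothetical simulation, not taken from the research it paraphrases): fit a simple model to some data, generate the next generation's "training data" purely from that fit, refit, and repeat. With only a finite sample at each step, the learned distribution steadily loses information about the original.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: samples from the "real" data distribution.
data = rng.normal(loc=0.0, scale=1.0, size=50)

for generation in range(1, 501):
    # Fit a simple model (here, a Gaussian) to whatever data we currently have...
    mu, sigma = data.mean(), data.std()
    # ...then produce the next generation's training data purely from the
    # fitted model's own outputs.
    data = rng.normal(loc=mu, scale=sigma, size=50)
    if generation % 100 == 0:
        print(f"generation {generation:3d}: mean={mu:+.3f}  std={sigma:.3f}")

# Because each fit is made from a finite sample of the previous fit's outputs,
# the estimated standard deviation follows a multiplicative random walk that
# drifts towards zero: the original distribution is progressively forgotten.
```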
In order to obtain more data, these companies are encouraging users to upload their own and others’ private information – with some going so far as to announce that not only do they intend to continue stealing copyrighted data to train their models, but that they will encourage their users to do the same, and even cover the legal fees resulting from any dispute.
This escalation from 'mere' solo misbehaviour to the aiding and abetting of others in doing the same is clearly bullying in nature, and unprecedented even by the standards of big tech's past misbehaviour. But it is clear proof of the all-consuming hunger for data and knowledge that big tech must satisfy to feed its models, and of the disadvantage at which small players sit when transacting in the AI market.
When Elephants Fight...
Many have long foreseen a new arms race among the largest companies to gain control over proprietary data. Indeed, it has already begun, but it is in the heavy reliance on knowledge and data described above that we find the greatest opportunity to start turning the situation around.
IP-owners, whose data is generally too small to be valuable in isolation and is spread across multiple platforms, lack the bargaining power to protect their rights and are at the mercy of data security policies set by secondary owners (the publishing platforms). With the public internet having already largely been scraped, AI companies desperate for more data have begun striking deals with centralised content platforms, once again cutting out the data producers themselves.

However, individual data-owners are not the only ones being hit. AI as it is experienced by the majority of users is composed of three elements: the model that carries out the calculations, the data that feeds it, and the app via which the outputs are presented to users. In each case, the smaller providers struggle in the face of the oligopoly power held by the large tech companies.
With the gains from scaling stalling in the absence of new data sources, the chance for data owners, innovative niche model developers and prompt-tuners to leverage their collective power is now or never.
A New Arms Race over Domination of Our AI Future
The battle is to bring all levers of AI value creation under the control of single corporate entities.
It is not just data producers who are losing out at the hands of the large AI companies. As these few giants raise ever larger sums to pay for the inference compute required to continue pushing out improved versions without scaling up training, smaller and potentially more innovative companies are being starved of the funds that they need to grow. When tech giants are raising rounds in the billions to expand the capacities of their foundation models, it becomes extremely difficult for mid-size companies, startups and lone developers to find investment to pursue frontier or niche products and scale naturally. While the current wave of AI innovation was largely driven by maverick independents, the ladder is now being pulled up, with tech giants working to cut off the funding that potential future competitors would need.
KIP Protocol was designed not just to let small data-producers monetise, but to provide equal facilities to model- and app-makers. While someone wishing to train a new model will need hundreds of thousands of dollars just to begin, KIP allows them to work up to that point rather than starting with a single massive fund-raise. With KIP it is possible to monetise a small-scale fine-tuned model or a LoRA and start earning immediately, turning an innovation into a business whose profits can be reinvested. Simultaneously, it allows developers to crowdfund their work: a model- or app-maker can use the protocol to pre-sell tokens that give buyers a share in their future revenue, letting them crowdfund development and share the benefits with their earliest backers.
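As a loose illustration of the revenue-share mechanic (the class and its behaviour below are hypothetical, not drawn from KIP Protocol's actual contracts or APIs), a pre-sold pool of tokens can be settled pro rata each time the model or app earns revenue:

```python
from dataclasses import dataclass, field

@dataclass
class RevenueShareOffering:
    """Hypothetical sketch of a pre-sold revenue-share token pool."""
    total_tokens: int                             # tokens offered to back development
    holders: dict = field(default_factory=dict)   # backer -> tokens held

    def buy(self, backer: str, tokens: int) -> None:
        """Record a backer purchasing part of the offering."""
        sold = sum(self.holders.values())
        if sold + tokens > self.total_tokens:
            raise ValueError("offering oversubscribed")
        self.holders[backer] = self.holders.get(backer, 0) + tokens

    def distribute(self, revenue: float) -> dict:
        """Split a revenue payment pro rata across current token holders."""
        return {
            backer: revenue * tokens / self.total_tokens
            for backer, tokens in self.holders.items()
        }

# A fine-tuned-model developer pre-sells 1,000 tokens, then shares revenue.
offering = RevenueShareOffering(total_tokens=1_000)
offering.buy("alice", 600)
offering.buy("bob", 250)
print(offering.distribute(revenue=500.0))  # {'alice': 300.0, 'bob': 125.0}
```

In this sketch the unsold portion of the pool simply earns nothing; a real offering would also define how unsold tokens, refunds and on-chain settlement are handled.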