AI Needs Us More Than We Need It
As magical apps keep appearing that can do everything from composing music to writing news stories and legal briefs, many human “content creators” sense the end of a way of life and looming financial doom. How will artists, journalists, or, for that matter, lawyers and analysts of all kinds be able to make a living in the future if a chatbot can easily replicate their creativity and expertise?
But here’s some inside information about artificial intelligence that is at once scary and promising. It turns out that bots need people more than people need bots.
Simply put, that’s because to make bots smart you need to feed them high-quality data created by humans. Indeed, for bots to approach anything like human intelligence, they need both massive quantities of data and quality data produced by actual humans. And as it happens, we are running low on such data and will run out all the faster if AI puts more human content creators out of business.
This means that the more than $930 billion investors have so far poured into AI companies could ultimately turn out to be just inflating another bubble. But there is still a chance to get AI right. It involves using government policy to make sure that humans receive the compensation they deserve for creating the content that makes continued advancements in AI financially and intellectually sustainable.
Large-scale AI models that are capable of a wide array of uses or tasks are known as foundation models. They are trained on vast troves of data scraped off the internet by web crawlers. But to make chatbots, like ChatGPT, or Google’s generative search summaries work, these static models need to be updated with more relevant, timely, and accurate data. The best data comes from online publications, databases, and other digital content repositories that reflect reality and authentic human knowledge and expertise. Without using such sources, the AI applications that average users are familiar with can get trained on information about, say, restaurants that closed years ago or wild conspiracy theories that fact-checkers and scientists have debunked. If you tried to train a bot just using the misinformation and digital pollution that infects so much of the internet, you’d create a different kind of AI: artificial idiocy.
But here’s the rub. The AI industry is running short of the kind of data it needs to make bots smart. It’s estimated that within the next couple of years the demand for human-generated data could outstrip its supply.
Even in the early days, before quality training data became so scarce, AI models were beset by inherent challenges. Since AI outputs are created based on statistical correlations of previously created content and data, they tend toward the generic, emblematic, and stereotypical. They reflect what has done well commercially or gone viral in the past; they appeal to universalist values and tastes (for example, symmetry in art or facial replication and standard chord progressions in music); they bolster the middle while marginalizing extremes and outliers.
For the same reason, even the AI models trained on the best data tend to overestimate the probable, favor the average, and underestimate the improbable or rare, making them both less congruent with reality and more likely to introduce errors and amplify bias. Similarly, even the best AI models end up forgetting information that is mentioned less frequently in their data sets, and outputs become more homogeneous. This is why, for example, OpenAI says its image generator, DALL-E 3, displays “a tendency toward a Western point-of-view,” with images that “disproportionately represent individuals who appear White, female, and youthful.” Or why the Stable Diffusion tool generated pictures of toys in Iraq that look like American troops, reflecting the American, English-language association of Iraq with war.
Like a giant autocomplete, generative AI regurgitates the most likely response based on the data it has been trained on or reinforced with and the values it has been told to align with. As a result, the systems and their outputs embed, reinforce, and regurgitate dominant values and ideas and replicate and reinforce biases, some obvious and others not.
But now these inherent problems with AI are being made much worse by an acute shortage of quality training data—particularly of the kind that AI companies have been routinely appropriating for free.
One reason for the shortfall is that more and more of the best and most accurate information on the internet is now behind paywalls or fenced off from web crawlers. The Data Provenance Initiative recently audited 14,000 domains commonly used for AI training and found “a clear and systematic rise in restrictions to crawl and train on data” over the past year, as most top web domains—like news, social media, and encyclopedias—have restricted AI crawlers. In another running tally, more than half of 1,165 news publishers surveyed had instructed at least one of the three leading crawlers to exclude their sites, while 76 percent of U.S. publishers have implemented paywalls, another signal that their content is not intended to be available for free. Yet unless AI training includes access to quality news outlets and periodicals, including local newspapers, it is likely to be based on out-of-date information or on data that’s false or distorted, such as inaccurate voting information or false reports of illegal immigrants devouring pets.
Some AI companies, to be sure, are finding ways to scrape or steal data from news and other quality publications despite the technical and legal obstacles. Over the summer, journalists at Wired documented how Perplexity.ai, a free AI-powered search engine, surreptitiously scraped their content and that of other publishers like Forbes despite explicit instructions in both code and their terms of service prohibiting its crawlers and unauthorized uses by third parties.
“Perplexity had taken our work, without our permission, and republished it across multiple platforms—web, video, mobile—as though it were itself a media outlet,” lamented Forbes’s chief content officer and editor, Randall Lane. The search engine had apparently plagiarized a major scoop by the company, not just spinning up an article that regurgitated much of the same prose as the Forbes article but also generating an accompanying podcast and YouTube video that outperformed the original on search.
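The “explicit instructions in code” that publishers use to turn away crawlers typically take the form of a robots.txt file. The sketch below is illustrative only: the file contents, the example.com URL, and the helper name `allowed_agents` are assumptions made for this example, not any real publisher's configuration. It uses Python's standard `urllib.robotparser` to show how such a file singles out AI crawlers while leaving the site open to everyone else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher. GPTBot and CCBot are
# used here as examples of crawler user-agent tokens; the rules themselves
# are invented for illustration.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Return {agent: True/False} for whether each agent may fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    story = "https://example.com/news/story.html"
    for agent, ok in allowed_agents(ROBOTS_TXT, ["GPTBot", "CCBot", "Googlebot"], story).items():
        print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

A file like this is only a request. Nothing in it technically stops a crawler that chooses to ignore it, which is why publishers also rely on terms of service and, increasingly, on the courts.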
Yet AI companies still find it harder and harder to get quality training data, especially for free. And that leaves them more and more dependent on data scraped from the open internet, where mighty rivers of propaganda and misinformation flow. A NewsGuard investigation recently found, for example, that the top 10 chatbots have a propensity to repeat false narratives on topics in the news and to mimic Russian propaganda, reflecting the scale and scope of Russia’s historic and ongoing state-sponsored information operations.
Then there’s the data problem caused by AI itself. Every day, people and governments around the world are adding a Niagara Falls of nonsense to the internet using increasingly accessible AI apps. Amazon is inundated with AI-generated books and product reviews. Facebook is overrun by zombie accounts. Google Search and Google News are infested with AI-generated fake news and junk content.
As the spread of AI makes it harder and harder to find quality data for training AI bots, the industry has responded by increasingly relying on what some researchers call “synthetic data.” This refers to content created by AI bots for the purpose of training other AI systems. But there are real limits to this approach. It’s like trying to advance human knowledge using photocopies of photocopies ad infinitum. Even if the original data has some truth quotient, the resulting models become distorted and less and less faithful to reality. Eventually they tend to malfunction, degrade, and potentially even collapse, rendering AI useless, if not downright harmful. When such degraded content spreads, the resulting “enshittification” of the internet poses an existential threat to the very foundation of the AI paradigm.
Recently, a team of AI researchers used the term model collapse to describe what happens when you use AI-generated data to train AI models. Model collapse, the authors explain, “refers to a degenerative learning process where models start forgetting improbable events over time, as the model becomes poisoned with its own projection of reality.” To avoid model collapse, they warn, “access to genuine human-generated content is essential.”
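The “forgetting improbable events” dynamic can be seen in a toy simulation. This is a deliberately simplified sketch, not the researchers' experiment: the “model” is nothing more than a table of word frequencies, and the vocabulary, probabilities, sample size, and function names are all assumptions chosen for illustration. Each generation is trained only on text produced by the generation before it.

```python
import random
from collections import Counter


def fit(samples, vocab):
    """'Train' a toy model: estimate each token's probability from its frequency."""
    counts = Counter(samples)
    total = len(samples)
    return {tok: counts[tok] / total for tok in vocab}


def generate(model, n, rng):
    """Sample n tokens from the model's estimated distribution."""
    tokens = list(model)
    weights = [model[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=n)


def simulate(generations=30, sample_size=200, seed=0):
    """Return the number of tokens with nonzero probability after each generation."""
    rng = random.Random(seed)
    # Hypothetical "human" data: one common token and twenty rare ones.
    vocab = ["common"] + [f"rare_{i}" for i in range(20)]
    model = {"common": 0.80, **{f"rare_{i}": 0.01 for i in range(20)}}
    surviving = []
    for _ in range(generations):
        data = generate(model, sample_size, rng)  # synthetic data from the last model
        model = fit(data, vocab)                  # next model sees only that data
        surviving.append(sum(1 for p in model.values() if p > 0))
    return surviving


if __name__ == "__main__":
    print(simulate())
```

Once a rare token fails to appear in one generation's sample, its estimated probability drops to zero and it can never be generated again, so the count of surviving tokens can only hold steady or fall. Mixing fresh human-written samples back in at each step is the only way, in this toy setup, for a lost token to return.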
The AI industry knows this. And they know that when it comes to chatbots, information retrieval, and fine-tuning, data quality counts more than quantity. “In a nutshell, what has been learned over the last few years is that working with a smaller amount of high quality data with a larger model, often expressed in parameters, is a better way to go,” writes the data architect Dennis Layton, who also notes that the human capacity to learn depends on more than the raw number of examples we are given. “This seems to hold true for machine learning as well.”
The AI industry has taken up several different strategies for trying to overcome its increasing difficulties in appropriating the human-generated content it needs to survive.
One tack has simply been to use its lobbying power to try to erode copyright laws and other protections for authors, publishers, musicians, and other content creators. AI companies have repeatedly claimed that requiring licensing would stall “progress” or make it impossible to develop some AI systems. “If licenses were required to train [large language models] on copyrighted content, today’s general-purpose AI tools simply could not exist,” according to Anthropic, the Amazon- and Google-backed generative AI firm.
Similarly, a representative of the Silicon Valley venture capital firm Andreessen Horowitz told the U.S. Copyright Office that the “only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data.” According to OpenAI, limiting model training to content in the public domain would not meet the needs of their models.
Having flouted copyright, digital rights management, and contractual constraints with impunity, Big Tech is working as hard as possible to convince lawmakers to limit the privacy and contract rights for content creators while expanding the legal zone of “fair use.” As Bill Gross, who is best known for coming up with the pay-per-click model for digital advertising, recently told Wired, “It’s stealing. They’re shoplifting and laundering the world’s knowledge to their benefit.”
Authors, publishers, photo and music agencies, entertainers, and ordinary users have tried to fight back, filing more than two dozen lawsuits against the AI companies at the forefront of what one plaintiff characterized as “systematic theft on a mass scale.” But these will take years to resolve, if ever. Meanwhile, entire professions that have evolved in part due to the protections and revenue provided by copyright and the enforcement of contracts become more precarious—think journalism, publishing, and entertainment, to name just three.
Another tack being tried by the biggest players in AI has been to strike deals with those most likely to sue or to pay off the most vocal opponents. This is why the biggest players are signing deals and “partnerships” with publishers, record labels, social media platforms, and other sources of content that can be “datafied” and used to train their models.
Microsoft-backed OpenAI has made more than a dozen licensing deals and is in discussion with another dozen of the most prominent publishers in the U.S. and Europe, including a $250 million deal with News Corp that dwarfs the estimated $1 million to $5 million that several other publishers reportedly received. OpenAI’s deals with AP and Time include access to their archives as well as newsroom integrations likely to provide useful training and alignment, while a slew of other deals include newsroom integration and API credits, ensuring a supply of human-centered data.
“We are starting to get really concerned that the platforms are going to talk to 15 or 20 of the big media owners around the world, make a lot of nice deals, then they’ll tell the governments and regulators that they’re playing nice with the publishers,” says Alastair Lewis, head of the industry association FIPP, which represents magazine publishers around the world. “But the vast majority of us are going to be left behind again, as we felt we were 15 years ago.”
Both OpenAI and Google made $60 million deals with Reddit that will provide access to a regular supply of real-time, fresh data created by the social media platform’s 73 million daily active users. But those users won’t see a dime. Meanwhile, Google, which was recently found guilty in federal court of using business practices to maintain an illegal monopoly in search engines, is the only search engine able to return results from Reddit because it’s currently the only AI company allowed to scrape its site for training data. Google’s YouTube is also in talks with the biggest labels in the record industry about licensing their music to train its AI, reportedly offering a lump sum, though in this case the musicians appear to have a say, whereas journalists and Reddit users do not.
Yet these deals don’t really solve AI’s long-term sustainability problem, while also creating many other deep threats to the quality of the information environment. For one, if AI models are trained only on data that has been licensed through deals with a handful of the largest, most dominant media and entertainment conglomerates, this in and of itself will exacerbate the distortions to which AI is already inherently prone, such as undercounting outliers and magnifying conventional thoughts and opinions, while reinforcing existing dominance. For another, such deals help to hasten the decline of smaller publishers, artists, and independent content producers, while also leading to increasing monopolization of AI itself. As the AI Now Institute observed, those with the “widest and deepest” data advantages will be able to “embed themselves as core infrastructure.” Everyone else just winds up as vassals or frozen out.
The threat to smaller content creators goes beyond simple theft of their intellectual property. Not only have AI companies grown large and powerful by purloining other people’s work and data, they are now creating products that directly cost content creators their customers as well. For example, many news publications depend on traffic referred to them by Google searches. But now the search monopolist is using AI to create summaries of the news rather than providing links to original reporting. “Google’s new product will further diminish the limited traffic publishers rely on to invest in journalists, uncover and report on critical issues, and to fuel these AI summaries in the first place. It is offensive and potentially unlawful to accept this fate from a dominant monopoly that makes up the rules as they go,” says Danielle Coffey, CEO of the News Media Alliance, which represents more than 2,000 predominantly U.S. publishers.
And while some U.S. writers and actors got some limited protections about how their work can be used by the publishers and studios they work for following union strikes last year, deep concerns remain about how and if they will be compensated for the use of their content by AI companies and whether those AI products will eclipse the need for human authors and actors entirely.
So what’s to be done? Clearly, advances in AI depend critically on humans continuing to create a high volume of new fact-based and creative knowledge work that is not the product of AI. And humans will not create that volume if monopolistic corporations are allowed to use AI technology in ways that strip out all the value created by artists, writers, journalists, and human experts of all kinds, many of whom are already struggling to make a living due to the monopoly power of Big Tech firms. This relationship suggests that both sides need a grand bargain that redresses the imbalance of power between human creators and the corporations exploiting their work.
Some of the inequities can be settled through civil litigation, but that will take years and pit deep-pocketed monopolies against struggling artists, writers, musicians, and small publications. Better enforcement of existing law can also be part of the answer. That means prosecuting AI firms when they violate licensing requirements or violate privacy law by instructing their crawlers to ingest people’s personal information and private data. In France, competition authorities recently fined Google for using news publisher content without permission and for not providing them with sufficient opt-out options.
Competition authorities should also investigate whether data partnerships violate antitrust law. Deals struck by dominant AI firms often contain provisions that could be illegal under long-standing antitrust statutes because they magnify monopoly power. This includes tying the use of one product, like access to content, to exclusive use of another product. An example is Microsoft’s deal with the publisher Axel Springer, which gives Microsoft access to the global publisher’s content but also requires that Axel Springer use Microsoft’s cloud services. Authorities in the United States and Europe are already scrutinizing the anti-competitive effects of this kind of deal when it involves cloud computing; the same scrutiny should be extended to data partnerships.
Some legislation currently being considered could also help. For example, a raft of bills would impose measures that make it easier to identify content created by AI and trace the provenance of data used in generative AI systems. Often the motive behind such bills is to attack the problem of deepfakes and other AI safety concerns. Some of these same measures could also make it easier to track the use of copyright-protected work, thus making it easier for a small newspaper, for example, to spot, and prove, when its content is being ripped off. The European Union’s AI Act already requires that AI companies provide information about where they get the content to train their models, allowing “publishers and creators worldwide to peer under the hood of the foundation models.”
But more fundamentally, lawmakers need to look for ways to compel tech companies to pay for the externalities involved in the production of AI. These include the enormous environmental costs involved in producing the huge amounts of electricity and water used by AI companies to crunch other people’s data. And they include the huge and growing societal cost of letting AI companies steal that content from its rightful owners and strip-mine society’s creative ecosphere.
There are many mechanisms by which government policy could achieve that end as part of grand bargains. Taxes that target AI production could make a lot of sense, especially if the resulting revenue went to shore up the economic foundations of journalism and to support the creative output of humans and institutions that are essential to the long-term viability of AI. Get this one right, and we could be on the cusp of a golden age in which knowledge and creativity flourish amid broad prosperity. But it will only work if we use smart policies to ensure an equitable partnership of human and artificial intelligence.
The post AI Needs Us More Than We Need It appeared first on Washington Monthly.