Even the most advanced automated systems can’t catch every bit of extreme content.
Updated at 12:18 p.m. ET on November 7, 2022
There’s been a lot of chatter, in recent days, about the fate of a certain platform that deals mostly in text posts no longer than 280 characters. With a chaos agent now at the helm of Twitter, many people are understandably fretting about whether it could possibly control a rising tide of abuse, hate speech, pornography, spam, and other junk. But in a sense, these worries miss the point: In 2022, Twitter is small fry.
A far grander and more terrifying saga is unfolding on the endless video feeds that have become the dominant mode of social media today, drawing not millions but billions of monthly users on TikTok, Instagram, Facebook, and YouTube. The Era of Video has definitively and irreversibly arrived.
Doubters might look to a recent clash between social-media royalty and Instagram leadership. Back in July, two Kardashians and a Jenner shared an Instagram image from the creator Tati Bruening calling on the platform to “stop trying to be tiktok.” In the history of social media, this was the 21st-century equivalent of nailing a pamphlet to a cathedral door. The protest referred in no uncertain terms to the company’s turn toward video, and away from its origins as a vehicle for still images. TikTok, of course, is almost entirely video—a relentlessly addictive scroll of auto-playing content, much of which comes from accounts you do not follow.
But not even some of the biggest influencers of all time could turn the tide. The day after their posts went up, Instagram’s CEO, Adam Mosseri, doubled down. “I need to be honest,” he said (in a video). “I do believe that more and more of Instagram is going to become video over time.”
No matter where you swipe or tap, video is there—a torrent of pixels, fury, and sound that is, if not literally infinite, effectively endless. The quality of our online lives now hinges on how these feeds are ordered and mediated, powers that are largely automated. In line with its push to embrace video, Meta has said that by the end of 2023, it will more than double the proportion of material on Instagram and Facebook users’ feeds that is “recommended by our AI.” The likelihood that we’ll find ourselves hurtling down one of those black holes of content—where time seems to dilate and lose all meaning—will depend less on whom we follow and more on what the machine decides to serve us next.
The shape of our politics, our ideology, and even our fundamental grasp of how the world works is, in some substantial way, up to the algorithms. According to a recent survey from the Pew Research Center, a quarter of people under 30 in the U.S. regularly get their news from TikTok clips. That number is growing. People are even turning to social-media video as a replacement for Google search.
Whether the results of such swipes and searches lead us to enlightenment or drag our worldviews further down toward their least reconciliatory, most conspiratorial depths depends in part on AI. In an experiment from September, the fact-checking company NewsGuard found that the top results on TikTok for a range of terms often included misleading, hateful, and in some cases extremely dangerous videos. Thirteen of the top 20 results for does mugwort induce abortion, for example, advocated unproven herbal abortifacients such as papaya seeds. In the same experiment, a search for hydroxychloroquine yielded a tutorial on how to fabricate the malaria drug—and bogus COVID cure—at home using grapefruit. Needless to say, this is not how to make hydroxychloroquine.
All of this should lead social-media companies to redouble their efforts to keep the muck—violence, illegal pornographic material, disinformation—off their platforms. As it is, only about 40 percent of the videos that TikTok pulls down are culled with automated systems, leaving millions of videos to be reviewed each month by workers who will have to spend the rest of their lives under the pall of what they’ve seen.
Even the most cutting-edge AI is not always as smart or all-seeing as it’s chalked up to be. These shortcomings could become more painfully evident in the years ahead. And if there were a way for AI to execute moderation tasks faithfully and accurately on the endless feed, it could come at a heavy price, drawn directly from our scant remaining balance of privacy and autonomy.
How does a machine even “understand” a video in the first place? Thanks to advances in computer vision over recent years, it’s become routine for AI to examine still images and, for example, connect the attributes of a face to someone’s identity, or to conclude that a gun is indeed a gun—or that a grapefruit is indeed a grapefruit.
Every minute of video is really just thousands of static pictures arranged in succession. And AI-based computer vision can certainly find still frames containing signals of problematic content with impressive tenacity. But that’s only going to get you so far. YouTube arguably has the most extensive experience with automated video moderation, but violative videos are viewed millions of times every day.
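Stripped to its essentials, that frame-by-frame logic is simple enough to sketch. What follows is a minimal, hypothetical illustration in Python (not any platform’s actual pipeline), in which a stand-in image classifier scores sampled frames one at a time:

```python
# A minimal sketch of per-frame moderation. The classifier here is hypothetical:
# a stand-in for the computer-vision models platforms actually train.
from typing import Callable, List

def moderate_video(frames: List["Image"],
                   classify_frame: Callable[["Image"], float],
                   threshold: float = 0.9,
                   stride: int = 30) -> bool:
    """Flag a video if any sampled frame looks violative on its own."""
    for frame in frames[::stride]:        # roughly one frame per second of 30-fps footage
        if classify_frame(frame) >= threshold:
            return True                   # one confident frame is enough to flag the clip
    return False
```

Note what the sketch never does: it never considers two frames together, let alone the audio underneath them.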
Part of the issue is that video contains a lot of data. In sequence, the thousands of still frames that make up a video create narrative. The accompanying audio adds further layers of meaning. This morass can be difficult to sort through—the density of data in any given video is a “two-sided coin,” as Dhanaraj Thakur, a research director at the Washington, D.C.–based Center for Democracy and Technology, told me. You can’t have the system analyze just one layer, because it would pick up a bycatch of content that, at a glance, would appear to break terms of service, despite being innocuous—while still letting other illicit material slip through the gaps.
For example, a system that flags any video with a gun would flag a clip from Pawn Stars with two people discussing the value of an antique rifle. Meanwhile, it would miss a clip of a person being shot with a firearm that’s out of frame. Visually, and perhaps even auditorily, a clip of someone attempting to make hydroxychloroquine might be hard to distinguish from one of someone making grapefruit juice, especially if that person takes care not to say anything about COVID.
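One way around that, at least in principle, is to fuse signals from several layers before deciding anything. Here is a toy sketch of that idea, with invented weights and hypothetical per-modality scores rather than any company’s real system, that would treat the antique-rifle appraisal and the off-camera shooting very differently:

```python
# A toy "late fusion" sketch: combine hypothetical scores from the visual,
# audio, and transcript layers instead of letting any single signal decide.
from dataclasses import dataclass

@dataclass
class ClipSignals:
    weapon_visible: float   # from a frame-level object detector, 0..1
    gunshot_audio: float    # from an audio-event classifier, 0..1
    violent_speech: float   # from a text classifier run on the transcript, 0..1

def violence_score(s: ClipSignals) -> float:
    # A weighted blend rather than a single trigger (the weights are made up).
    return 0.3 * s.weapon_visible + 0.4 * s.gunshot_audio + 0.3 * s.violent_speech

# Pawn Stars-style clip: weapon on screen, calm audio, appraisal talk.
print(violence_score(ClipSignals(0.95, 0.05, 0.10)))  # ~0.34, likely below a flagging threshold
# A shooting filmed with the gun out of frame: nothing visible, gunshot heard.
print(violence_score(ClipSignals(0.05, 0.90, 0.60)))  # ~0.56, likely above it
```

Even this depends, of course, on each of the underlying classifiers being right in the first place.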
Read: Of God and machines
Social-media firms couldn’t abide such overzealous AI-based policing, because it would get in the way of the kind of content that’s often best for engagement and ad impressions—the shocking, the outrageous, and, of course, the risqué. AI-based skin detection, which continues to be the basis for nudity filtering, might block any clip of swimwear-clad beachgoers, spelling doom for an entire industry of fitness influencers. This has pushed the research community toward some unorthodox solutions over the years. Researchers have, for example, sought to improve the accuracy of these kinds of classifiers with AI that recognizes porn’s unique patterns of motion, as well as its singular soundtrack—in the words of one team: “moans, deep breathing, panting, groans, screams, and whimpers,” as well as “bed creaking, sheets rustling.” But although such tools may work fine in the lab, they can struggle in the infinitude of the real world. A similar effort from 2006 that detected porn on the basis of the audio’s “periodicity”—that is, its repetitiveness—ended up catching lots of footage of tennis matches.
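That 2006 approach is easy to caricature in code. A crude periodicity score, sketched below as a toy rather than the researchers’ actual method, makes the tennis problem plain: anything sufficiently rhythmic scores high.

```python
# A toy periodicity score for an audio clip, assuming a mono signal as a NumPy
# array of samples. This is a caricature of the idea, not the 2006 system.
import numpy as np

def periodicity_score(samples: np.ndarray, frame: int = 1024) -> float:
    """Roughly: how repetitive is the loudness envelope of this clip?"""
    samples = samples.astype(float)
    n = len(samples) // frame
    if n < 2:
        return 0.0
    # Crude loudness envelope: RMS energy of each frame.
    env = np.array([np.sqrt(np.mean(samples[i * frame:(i + 1) * frame] ** 2)) for i in range(n)])
    env = env - env.mean()
    if not env.any():
        return 0.0
    # Autocorrelate the envelope; a strong peak at a nonzero lag means repetition.
    ac = np.correlate(env, env, mode="full")[n - 1:]
    ac = ac / ac[0]
    return float(ac[1:].max())  # near 1 for rhythmic audio, whether a moan or a tennis rally
```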
Ultimately, a video’s true meaning may be gleaned only at the “confluence” of its various layers of data, as Becca Ricks, a senior researcher at the Mozilla Foundation, put it to me.
Footage of a desolate landscape with a tumbleweed blowing across it would raise no red flags for either a human or a machine, nor would an audio clip of someone saying, “Look how many people love you.” But if you combine the two and send the result to someone you’re cyberbullying—No one loves you!—the computer would be none the wiser.
Much hate speech, it turns out, fits snugly in this gap of computer comprehension. The Hateful Memes Challenge, organized by Facebook in 2020, created a set of more “subtle” combinations of images and text that AI struggles to understand: an image of a tumbleweed with the exact phrase above, for example, or a picture of an alligator with the text “Your wrinkle cream is working great.” Different AI models had varying levels of success, but none of them came close to human accuracy. Effective moderation can require “real-life context and common sense,” researchers wrote—in other words, a human’s sensibilities. Achieving better results on a single TikTok, which the researcher Abbie Richards describes as a “three dimensional meme composed of video, audio, and text,” might require a quantum leap in technology.
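The systems entered in that challenge varied, but most shared one structural idea: encode the image and the text separately, then classify the combination. A bare-bones sketch, with hypothetical encoders standing in for real pretrained models and a linear classifier standing in for a trained fusion head, looks like this:

```python
# A bare-bones sketch of multimodal fusion for meme classification.
# embed_image, embed_text, weights, and bias are all hypothetical stand-ins.
import numpy as np
from typing import Callable

def hateful_meme_score(image: object, caption: str,
                       embed_image: Callable[[object], np.ndarray],
                       embed_text: Callable[[str], np.ndarray],
                       weights: np.ndarray, bias: float) -> float:
    """Probability that the image and caption *together* cross the line."""
    joint = np.concatenate([embed_image(image), embed_text(caption)])  # fuse the two modalities
    logit = float(joint @ weights + bias)                              # a linear head keeps the sketch simple
    return 1.0 / (1.0 + np.exp(-logit))
```

The tumbleweed image and the taunting caption are each harmless on their own; everything hinges on how well that joint representation captures what they mean side by side.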
AI also doesn’t do well on things that it hasn’t encountered before—“edge cases,” in the poetics of the academy. This gives those posting harmful content a perpetual upper hand. “Humans are good at adaptation,” Hany Farid, a professor at UC Berkeley who was one of the creators of PhotoDNA, a widely used system for detecting child porn, told me. “AI is slower.”
YouTube has said that prior to 2020, its AI had gotten reasonably good at detecting “the few main narratives” that dominated “the misinformation landscape online” (“9/11 truthers, moon landing conspiracy theorists, and flat earthers”). But when a wave of new pandemic mis- and disinformation began to flood the web, it took time for the company to retrain its algorithms and catch up.
Maybe someday AI will reliably grasp the difference between an exercise video and a porno, or between banter and hate speech. But an AI that can preempt the next trick up the misinformation community’s sleeve? That could be beyond science fiction.
Even if AI could filter out illicit video with superhuman acuity, there is still the task of ordering the endless feed. With extreme political content on the rise across the internet, the manner in which the algorithm chooses what to surface and recommend is a fraught technological question with very high stakes. Here, too, video is different.
When you engage with a video online, you generate far more data than you give up by just staring at a photo, according to Spandana Singh, a former policy analyst at New America’s Open Technology Institute. Companies can track things like how many times you rewatched a video or how far through you made it before skipping to something else. “All that interaction data, I have to assume, goes into determining how videos are ranked,” Ricks told me. (TikTok and Instagram did not respond to a request for comment about how they use interaction data to serve content; both have information pages explaining that they use interactions to sort through and serve content.)
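However each platform actually weighs those signals, the general shape of engagement-based ranking is not hard to sketch. The following is a deliberately simplified, hypothetical scoring function (no platform’s real formula), built from the kinds of interaction features described above:

```python
# A hypothetical engagement score built from interaction data: completion
# rate, rewatches, shares. An illustration, not any platform's formula.
from dataclasses import dataclass
from typing import List

@dataclass
class VideoStats:
    completion_rate: float   # how far through the clip the viewer got, 0..1
    rewatches: int           # how many times they replayed it
    shared: bool

def predicted_watch_value(v: VideoStats) -> float:
    # Crude, made-up weights: more finished, more replayed, more shared
    # all translate into a higher score.
    return v.completion_rate + 0.5 * min(v.rewatches, 4) + (1.0 if v.shared else 0.0)

def rank_feed(candidates: List[VideoStats]) -> List[VideoStats]:
    return sorted(candidates, key=predicted_watch_value, reverse=True)
```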
The AI that churns these data has proved astonishingly adept at keeping us on our screens as long as possible—sometimes with perilous results.
An internal Facebook document from 2020, which was released as part of the Facebook Papers and recently analyzed by a team at Amnesty International, describes how a video of the leader of the anti-Rohingya extremist group 969 circulated on Facebook two full years after the company faced widespread condemnation for its role in the Rohingya genocide. As it turned out, the clip’s numbers had not been driven by some coordinated campaign. Seventy percent of the video’s views had come through Facebook’s algorithmically fueled “Up Next” feature. The same team also noted that algorithmically recommended content already accounted for no less than half of the total time that Burmese users spent on the platform.
The issue is that an AI optimized for engagement can’t tell the difference between a clip that you enjoyed watching and one that you hate-watched, or watched passively. If you watched a clip multiple times, the AI won’t be able to discern whether it was because it gave you joy or because it boiled your blood. (Even if it could, a company might end up promoting infuriating content anyway because it’s so compelling—Facebook supposedly did exactly that after introducing emoji-based reactions a few years ago.) As Singh put it, “How do you train a recommender system to optimize for happiness? Like, what does that mean?”
Read: Artificial intelligence is misreading human emotion
Some people are trying to find out. Jochen Hartmann, an assistant professor at the University of Groningen, in the Netherlands, is part of an enthusiastic technical community building AI that hunts for the “unstructured” data within social-media video. Last year, Hartmann and three other researchers built a tool that analyzed video for a wide range of qualities, including whether any given face in the clip expresses anger, happiness, disgust, surprise, or neutrality. If such a system could, say, rank videos by how happy they are, it might help elevate more positive content and subdue the darker stuff: less vitriol, more virtue. Researchers are also exploring the use of such techniques to detect hate speech more effectively.
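A pipeline in that spirit can be sketched in a few lines, here with hypothetical face-detection and emotion-classification models standing in for the real ones (Hartmann’s actual tool is considerably more involved):

```python
# A minimal sketch: average the predicted "happiness" of every detected face
# across sampled frames. detect_faces and classify_emotion are hypothetical.
from typing import Callable, Dict, List

def happiness_score(frames: List["Image"],
                    detect_faces: Callable[["Image"], List["Face"]],
                    classify_emotion: Callable[["Face"], Dict[str, float]],
                    stride: int = 30) -> float:
    scores = []
    for frame in frames[::stride]:
        for face in detect_faces(frame):
            probs = classify_emotion(face)        # e.g. {"anger": 0.1, "happiness": 0.7, ...}
            scores.append(probs.get("happiness", 0.0))
    return sum(scores) / len(scores) if scores else 0.0

# A feed could then be reranked by this score, elevating clips with more smiling faces.
```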
Of course, there’s a caveat. “All of this information,” Hartmann told me, “can be used to recommend new products.” It might be used, in combination with your personal data, for marketing.
And another: “Emotion recognition” is widely seen to be based on physiognomic theories that were revealed, decades ago, to be both thoroughly unscientific and startlingly racist.
Every expert I spoke with said that it’s hard to measure what we know to be true of AI in general against what’s actually going on inside the locked labs and servers of Silicon Valley. The algorithms behind products like Facebook and TikTok are “frustratingly and impossibly opaque,” as the Microsoft researcher Tarleton Gillespie has written.
Technology is also evolving rapidly. The forces governing video feeds today may not have much in common with whatever users will be subjected to a few years from now. “Here’s the one thing I’ve learned in my 30-year career,” Farid told me. “Don’t make predictions about the future when it comes to technology.”
If there is just one certainty in all of this, it is that video holds an embarrassment of data—and although these data might prove to be a stubborn obstacle for AI moderation and an impish enabler for recommendation engines, Farid is sure of this much: “Companies will figure out how to mine this massive amount of data to monetize it.”
In particular, video could be unimaginably useful for building the next generation of artificial intelligence. In 2016, the internet was briefly gripped by the Mannequin Challenge, in which people held poses as a camera moved around the scene. Two years later, Google revealed that it had used 2,000 mannequin-challenge videos to develop an AI capable of depth perception, a skill that will be essential for the kind of embodied robots that Silicon Valley hopes to bring to market in the coming years.
Other, similar experiments are surely in the works. Both Meta and Google recently unveiled prototype AI systems capable of turning any text prompt into video. It’s like DALL-E, but for the moving image—a cataclysmic prospect for keeping disinformation off the internet, according to Farid and Thakur, the Center for Democracy and Technology research director.
Maybe elsewhere in Silicon Valley, a team of engineers is using one of the mostly very boring videos I post about food to train up another AI. What for? Who knows.
I asked Farid if he thought those as-yet-shrouded future AIs will be built from the ground up with an eye to ethics. On this, he was also willing to break his rule against predictions. “No,” he said. “I’m not that naive.”
This article originally identified the researcher Abbie Richards by the wrong last name.