Post #3872 by Goofyhoofy on the Macroeconomic Trends & Risks board

No. of Recommendations: 34

I have never done this before, but I am going to post a long screed I lifted from Facebook about AI tarpits, something I did not know existed, and (obviously) about which I know nothing. It’s interesting, at least. Comments, confirmations, or refutations are welcome.

They Built a Roach Motel for AI Scrapers. The Industry Has No One to Blame but Itself.

Somewhere on the internet right now, a bot is lost. It entered a website looking for training data. It followed a link. Then another. Then another. It has been following links for weeks. Every page it finds leads to more pages. None of them lead out.

The bot works for one of the largest AI companies in the world. The website owner is one person with a server bill. This is what winning looks like when you have nothing left to lose.

The tools doing this are called tarpits, and they are exactly what they sound like. A developer who goes by Aaron built the first major one and named it Nepenthes, after a carnivorous plant that drowns whatever crawls inside it. He published it in January 2025, explicitly labeling it as malware, explicitly warning that it is "not for the faint of heart."

Nepenthes traps AI crawlers in what Aaron calls an "infinite maze of static files with no exit links," then feeds them Markov babble. Random gibberish. Designed to poison the training data that becomes the AI models that then go looking for more data to train on. The cycle is almost poetic.

A second tool called Iocaine appeared within days. Its creator, a software developer named Gergely Nagy, reported that it immediately killed 94 percent of bot traffic to his site, which had been running so hot from AI scraping that it was consuming nearly all his available bandwidth. He named the tool after "one of the deadliest poisons known to man," a reference that anyone who sat through the 1980s will recognize. He was not joking.

Here is what drove these people to this. Last summer, Anthropic's ClaudeBot crawler hit the repair database iFixit nearly a million times in a single day, despite iFixit's robots.txt file explicitly asking it not to. Freelancer.com got 3.5 million hits from ClaudeBot in four hours. Their sysadmins were getting woken up at 3 a.m. by alarms.
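
For anyone who has not seen a robots.txt request in practice, here is a minimal Python sketch of the check a well-behaved crawler is expected to run before fetching a page. The robots.txt text and the example.com URL are made up for illustration (this is not iFixit's actual file), and `may_fetch` is a hypothetical helper name; only `urllib.robotparser` is the real standard-library module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration; not any real site's file.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/guides/1"  # placeholder URL
    print("ClaudeBot allowed:", may_fetch(ROBOTS_TXT, "ClaudeBot", page))
    print("OtherBot allowed:", may_fetch(ROBOTS_TXT, "OtherBot", page))
```

Note that the check runs entirely on the crawler's side. Nothing on the server enforces the answer, so a crawler that skips the call or ignores a `False` result meets no technical barrier at all.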
Reddit CEO Steve Huffman called out Microsoft, Anthropic, and Perplexity by name, saying it had been "a real pain in the ass" to block crawlers that kept coming back regardless of what the robots.txt file said. He wasn't wrong. He was just late to a problem that every independent web publisher had been living with for months.

Robots.txt is a plain text file. It is how the web has said "please don't scrape this" since 1994. It is not a law. It is not encrypted. It is a convention, a handshake between site owners and crawlers, built on the assumption that the other party is operating in good faith. Good faith has been vaporized.

According to TollBit's State of the Bots report for Q2 2025, 13.26 percent of AI bot requests ignored robots.txt directives outright. That's up from 3.3 percent in Q4 2024. The rate is quadrupling in under a year. This is not a rounding error in corporate behavior. It is a policy.

The industry's response to being called out has been a masterpiece of motivated reasoning. Companies pointed to configuration bugs. They promised fixes. They hired spokespeople to say things like "we design our systems to be resilient while respecting robots.txt and standard web practices." And then the crawlers kept crawling.

I have been inside rooms where decisions like this get made. The calculation is simple and it is ruthless. If the data is available and the legal risk is manageable, you take it. You do not ask. You scrape first and negotiate under legal pressure later, if at all. You assume the small operators won't have the resources to fight you, and you are usually right.

What changed is that a handful of people decided to fight anyway, without lawyers, without venture capital, and without asking permission.

The thing that makes Nepenthes and Iocaine interesting is not whether they will "burn AI to the ground." They won't. The researchers at Carnegie Mellon who study this stuff are probably right that AI companies can detect and filter garbage data. OpenAI has already announced it's working on tarpit countermeasures.

The thing that makes these tools interesting is that they exist at all. Someone with a Raspberry Pi, a server bill, and a deep and justified fury built a weapon that can trap the crawler of a multi-billion dollar company in an infinite loop for months. One person. And within days of it going public, dozens of site owners deployed it. Within weeks, new versions appeared, each one more sophisticated than the last.

That is not a technical story. That is a political one.

The developer who built Nepenthes put it plainly: the internet he grew up on is gone. Not because the technology changed. Because a small number of companies decided to consume it whole, turn it into product, and sell it back to the people who built it.

That sounds familiar. I have watched this happen to local journalism. I watched classified ad revenue die, digital advertising get captured by two companies in California, and editorial desks fold into silence across the Front Range while the people in Denver and Boulder who used to be informed voters quietly stopped being. The pipe got captured. The content was extracted. The communities that needed it were told to adapt.

The mechanism is the same. The web is infrastructure. For thirty years, it ran on an informal social contract: publish openly, get indexed, get traffic, get found. AI broke that contract in one direction. Companies scraped the entire open web to train models worth billions of dollars, then deployed those models to answer questions that used to bring traffic back to the publishers. The content went in. The traffic did not come out.

Aaron, the guy who built Nepenthes, has a financial theory to go with his fury. Every compute cycle his tarpit wastes is cash an AI company spent without getting anything back. None of these companies are profitable. They are burning investor money to outrun each other to a business model that still doesn't fully exist. Raised costs. Slowed timelines. Strained the investor patience that funds the whole race. That is not nothing.

The tarpits won't win this fight alone. The fight needs policy, needs legislation, needs courts to finally decide whether scraping the open web without permission is theft or just business as usual. The EU is making moves. The U.S. is not. But "be indigestible, grow spikes" is not a bad principle while we wait for the law to catch up.

The carnivorous plant eats whatever falls in. That's what it was built to do. And right now, a lot of things are falling in.

---------------------------------------------------------------------

CLAIMS AND SOURCES

1. Anthropic's ClaudeBot crawler hit iFixit's website nearly a million times in a single day in summer 2024. Source: https://pivot-to-ai.com/2024/07/29/anthropic-is-sc...

2. Freelancer.com received 3.5 million hits from ClaudeBot in four hours, described as "the most aggressive" bot they had seen, waking sysadmins at 3 a.m. Source: https://pivot-to-ai.com/2024/07/29/anthropic-is-sc...

3. Reddit CEO Steve Huffman called out Microsoft, Anthropic, and Perplexity by name for scraping data without permission, calling it "a real pain in the ass to block these crawlers." Sources: https://www.windowscentral.com/microsoft/reddit-ce..., https://www.businessinsider.com/reddit-ceo-microso...

4. Nepenthes was released in mid-January 2025, described by its developer as malicious software that traps AI crawlers in an "infinite maze of static files with no exit links" and feeds them Markov babble to poison training data. Source: https://arstechnica.com/tech-policy/2025/01/ai-hat...

5. Nepenthes is available as an open-source project. Source: https://github.com/NEPENTHESWEB/nepenthes-py

6. Developer Gergely Nagy (handle: algernon) created Iocaine as a second tarpit tool, reporting it immediately killed approximately 94 percent of bot traffic to his site. Source: https://arstechnica.com/tech-policy/2025/01/ai-hat...

7. According to TollBit's State of the Bots report, 13.26 percent of AI bot requests ignored robots.txt directives in Q2 2025, up from 3.3 percent in Q4 2024. Sources: https://almcorp.com/blog/google-ai-overviews-opt-o..., https://www.theregister.com/2025/12/08/publishers_... [Verified during research; Register source URL may require subscription at final check]

8. OpenAI confirmed it is developing countermeasures against tarpit attacks, stating it "designs systems to be resilient while respecting robots.txt and standard web practices." Source: https://arstechnica.com/tech-policy/2025/01/ai-hat...

9. Microsoft's director of partner technology published a report in May 2024 concluding that data poisoning is "a serious threat to machine learning models." Source: https://arstechnica.com/tech-policy/2025/01/ai-hat...
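
To make the mechanism in claim 4 concrete, here is a rough Python sketch of the two ideas the essay describes: Markov babble and a maze of pages whose links only lead deeper in. This is not code from Nepenthes or Iocaine. The corpus, the `/maze/` URL scheme, and every function name are assumptions made up for illustration, and seeding each page from its own path (so a revisit returns identical text, like a static file would) is a choice of this sketch, not a documented detail of either tool.

```python
import hashlib
import random
from collections import defaultdict

# Tiny stand-in corpus (an assumption); any body of text would do.
CORPUS = (
    "the crawler followed a link and the link led to a page and "
    "the page held more links and every link led to another page"
)


def build_chain(text: str) -> dict[str, list[str]]:
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def babble(chain: dict[str, list[str]], rng: random.Random, length: int = 40) -> str:
    """Random-walk the chain to emit word-salad that is locally plausible."""
    starts = sorted(chain)
    word = rng.choice(starts)
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # A word with no recorded follower is a dead end: restart anywhere.
        word = rng.choice(followers) if followers else rng.choice(starts)
        out.append(word)
    return " ".join(out)


def render_page(path: str, chain: dict[str, list[str]], links: int = 5) -> str:
    """Build one maze page; the same path always yields the same page."""
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    # Every link points back into the maze, so no link leads out.
    hrefs = [f"/maze/{rng.getrandbits(48):012x}" for _ in range(links)]
    anchors = "\n".join(f'<a href="{h}">{h}</a>' for h in hrefs)
    return f"<html><body><p>{babble(chain, rng)}</p>\n{anchors}</body></html>"


if __name__ == "__main__":
    print(render_page("/maze/start", build_chain(CORPUS)))
```

Because each page is computed from its path on demand, nothing has to be stored: the maze has no last page, and the site owner pays only for the requests a crawler actually makes.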
