‘AI’ could dox your anonymous posts

Large language models aren’t good at many things, like counting fingers or suggesting pizza recipes. But one thing “AI” is pretty good at is analyzing huge amounts of data and finding connections that aren’t immediately obvious. That makes it well suited to unmasking anonymous posts on the Internet, according to a new research paper.
Researchers from ETH Zurich and the Berkeley-based MATS research program conducted a study [PDF], collecting data from sources where usernames are generally anonymous, like Reddit. By gathering users’ posts on related but distinct movie subreddits and then feeding an LLM data from a Netflix data leak, they could match specific leaked accounts to those Reddit accounts and thus link them to real names.
With just one movie recommendation shared on Reddit, 3.1 percent of anonymous users could be matched to a specific Netflix account with 90 percent accuracy. With five to nine movie recommendations shared, that number rose to 23.2 percent. With more than 10, it climbed to an astonishing 48.1 percent, with 17 percent of the total identified with near-total confidence.
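To give a sense of how such matching can work in principle – this is a simplified toy sketch with invented data, not the researchers’ actual method – the core idea is scoring how much of an anonymous user’s publicly mentioned titles overlap with each leaked viewing history:

```python
# Toy illustration of overlap-based matching. All data below is invented;
# the paper's actual pipeline uses an LLM, not this simple scoring.

def match_score(mentioned: set[str], history: set[str]) -> float:
    """Fraction of publicly mentioned titles found in a leaked history."""
    if not mentioned:
        return 0.0
    return len(mentioned & history) / len(mentioned)

# Hypothetical leaked viewing histories, keyed by leaked account ID
leaked_histories = {
    "account_1001": {"Heat", "Alien", "Se7en", "Ronin"},
    "account_1002": {"Amelie", "Alien", "Clue"},
}

# Titles a hypothetical anonymous Reddit user recommended
reddit_mentions = {"Alien", "Se7en", "Ronin"}

best = max(leaked_histories,
           key=lambda acct: match_score(reddit_mentions, leaked_histories[acct]))
print(best, match_score(reddit_mentions, leaked_histories[best]))
# account_1001 1.0
```

Each additional shared title rules out more of the leaked histories, which is why the match rate in the study climbs so steeply with the number of recommendations.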
Another experiment connected anonymous accounts on Hacker News (a forum, not an actual malicious site) with publicly confirmed identities on LinkedIn. Users who shared seemingly general details in short posts over time, like their age, city of residence, or employer, could have their real identities exposed with a high degree of certainty. This wouldn’t work for every account, and it’s nothing a private investigator (or even a dedicated layman) couldn’t do…but the automation and scale are staggering.
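As a rough sketch of why this scales so well – the profile data and attribute names below are invented for illustration – each disclosed detail acts as a filter over a pool of public profiles, and a handful of filters can leave exactly one candidate:

```python
# Hypothetical sketch: each self-disclosed attribute narrows the pool.
profiles = [
    {"name": "A. Smith", "city": "Austin", "employer": "Acme", "age": 34},
    {"name": "B. Jones", "city": "Austin", "employer": "Acme", "age": 51},
    {"name": "C. Lee",   "city": "Boston", "employer": "Acme", "age": 34},
]

# Details gleaned from an anonymous account's posts over time
clues = {"city": "Austin", "employer": "Acme", "age": 34}

candidates = [p for p in profiles if all(p[k] == v for k, v in clues.items())]
print(candidates)  # [{'name': 'A. Smith', ...}] -- a unique match
```

Three ordinary-sounding facts are enough here to single out one person; a human can do this for one account, but an automated pipeline can do it for millions.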

A particularly damning example was a 10-minute anonymous quiz given by a researcher on the team. Seven percent of the 125 participants could be individually identified based on their verbatim responses to the questionnaire, using inferred details such as their job (“I work in biology, in research”), their education, the specific tools they mentioned, and even the variety of English they used in their responses (like the British spelling “analyse”).
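Even a single stylistic cue like that is trivially automatable. As a hypothetical sketch (the word list and function are invented for illustration, not taken from the paper):

```python
import re

# A few British spellings; a real stylometric model would use far more
# signals than this illustrative list.
BRITISH = {"analyse", "colour", "organise", "behaviour"}

def british_spelling_hits(text: str) -> set[str]:
    """Return British spellings found in a piece of text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words & BRITISH

print(british_spelling_hits("I analyse lab data and study animal behaviour."))
# {'analyse', 'behaviour'}
```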
The research doesn’t claim that anyone on any site can be traced based on their anonymous activity. The more personal information you disclose, even if it seems general, the more vulnerable you are – and that’s nothing new. Users have been “doxxing” each other since the early days of the web and before, as have law enforcement investigators and other snoops.
But automating the process – building systems that can crawl the web and find reliable associations between anonymous and non-anonymous posts – could present new dangers for those who want to keep their online activity private. The age of social media has largely supplanted the days of “pseudonyms,” but anonymous communities on sites like Reddit remain important, particularly for those in vulnerable or targeted groups. As the paper says, “deanonymization is one of the many ways that LLMs empower both criminals and state actors.”
As Ars Technica reports, the researchers offer some suggestions for mitigating your personal risk. Platforms like Reddit could impose stricter limits on LLM access to user data through their APIs, and “AI” providers could monitor activity to detect anyone attempting a mass deanonymization campaign.
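What “stricter limits” might look like in practice – this is a minimal sliding-window rate limiter sketch with hypothetical policy values, not Reddit’s actual implementation – is a per-client request budget that makes bulk scraping of user histories slow and expensive:

```python
import time

# Hypothetical policy values for illustration only
WINDOW_SECONDS = 60
MAX_REQUESTS = 100

_requests: dict[str, list[float]] = {}

def allow(client_id: str) -> bool:
    """Permit a request only if the client is under its per-minute budget."""
    now = time.monotonic()
    recent = [t for t in _requests.get(client_id, []) if now - t < WINDOW_SECONDS]
    if len(recent) >= MAX_REQUESTS:
        _requests[client_id] = recent
        return False
    recent.append(now)
    _requests[client_id] = recent
    return True

print(allow("scraper-bot"))  # True until the budget is spent
```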
But the simplest and most reliable way to keep your personal data from being linked to an anonymous account is, of course, to never post that data online in the first place.



