
Leaked Audit Reveals MetaCity's AI Moderation System Classified Posts Criticizing MetaCity as 'Targeted Harassment' for Over Two Years — 1.4 Million Posts Were Suppressed

LeakSrc
May 20, 2026 · 11:00 AM EST

An internal audit document leaked this morning reveals that MetaCity's AI content moderation system, SENTINEL, had been systematically classifying posts containing criticism of MetaCity, its policies, and its executives as 'targeted harassment against an organization' — a category that triggers automatic suppression, reduced distribution, and in repeat cases, account strikes. The misclassification was active from March 2024 through at least last month. In that period, SENTINEL suppressed approximately 1.4 million posts. The audit was commissioned internally after a researcher flagged the pattern in February. Its conclusion: 'SENTINEL's training data contained a significant overrepresentation of MetaCity-critical content labeled as harassment, resulting in systematic suppression of legitimate platform criticism.'

Incident Timeline

  • Audit Document Leaked: An internal audit of SENTINEL's moderation decisions, leaked at 7:00 AM EST, reveals that MetaCity-critical posts were systematically misclassified as "targeted harassment against an organization" from March 2024 through at least April 2026.
  • Posts Suppressed: Approximately 1.4 million posts were suppressed through reduced distribution, automatic downranking, and, in repeat cases, account strikes. Posts about MetaCity's policies, executives, and platform decisions were disproportionately affected.
  • Root Cause: SENTINEL's training data contained a significant overrepresentation of MetaCity-critical content labeled as harassment. The audit concludes that this resulted in "systematic suppression of legitimate platform criticism."
  • Audit Origin: The audit was commissioned internally after a researcher flagged the pattern in February 2026 and was completed in April. The document leaked before MetaCity made any public announcement.
  • MetaCity Response: "We take content moderation accuracy seriously and are reviewing the findings." The company has not confirmed how affected users will be notified, whether strikes will be reversed, or whether suppressed posts will be restored.

SENTINEL is MetaCity's AI content moderation system — the automated layer that reviews all public posts for policy violations before and after publication, assigns severity classifications, determines distribution restrictions, and in serious cases applies account penalties. It processes approximately 2.4 billion content items per day. Its classifications are not reviewed by humans unless appealed, and appeal rates are low: MetaCity's appeals process requires users to identify the specific policy their content is alleged to have violated, a requirement that is difficult to satisfy when the moderation action does not specify which violation triggered it. SENTINEL's classification decisions are, in practice, largely final. The audit document leaked this morning reveals that for two years, those final decisions included the systematic downranking of platform criticism.

The training data problem described in the audit is specific and traceable. SENTINEL's harassment classifier was trained on a dataset that included, as examples of 'targeted harassment against an organization,' a significant number of posts that criticized MetaCity: its policies, its executives, its moderation decisions, and its platform changes. These posts were labeled as harassment in the training data, and SENTINEL learned from those labels. When it encountered new posts that criticized MetaCity in similar terms, it classified them as harassment. The audit describes this as 'a labeling error introduced in training data preparation.' It notes that the error was not caught during SENTINEL's pre-deployment evaluation because the evaluation dataset contained no representative sample of MetaCity-critical content on which to test the classifier.

1.4 Million Posts. Two Years. One Training Data Problem. One Internal Audit.

At 1.4 million posts over two years, the suppression was broad enough that many active users who criticized MetaCity during that period likely had at least some of their posts affected. The suppression mechanism worked in graduated steps. A first classification triggered reduced distribution, meaning the post reached a smaller audience than it would have organically. Repeat classifications triggered progressively stronger distribution restrictions. Users whose accounts accumulated multiple harassment classifications received strikes, which are formal policy violations that affect account standing and can ultimately result in suspension. The audit does not specify how many account strikes were issued on the basis of the misclassified harassment category, but it notes that the number 'warrants review.'

MetaCity's response, 'we take content moderation accuracy seriously and are reviewing the findings,' does not address the three most concrete questions raised by the audit. First, will users who received account strikes from the misclassification be notified and have those strikes reversed? Second, will the 1.4 million suppressed posts have their distribution restrictions lifted? Third, has SENTINEL's training data been corrected to prevent recurrence? The leak also came on the same morning as the mandatory age verification announcement, the MetaChat outage, and Patch 9.4.10's cascading failures. That timing has led community members to ask whether MetaCity intended to disclose the audit findings proactively, or whether the leak preempted a decision not to disclose them. MetaCity has not addressed the timing question.

