MediaWiki Spam, Crawler, and Bot Prevention

Running a MediaWiki site? Whether you’re managing a public wiki or a collaborative knowledge base, spam and bots can quickly become a headache. This guide walks you through proven tools and techniques—from AbuseFilter and ConfirmEdit to Cloudflare and zip bombs—to help you secure your site. Learn how to block spam, manage crawlers, and keep your content clean and accessible.


MediaWiki is a powerful platform for collaborative knowledge sharing, but its open and flexible nature also makes it a common target for spam, bots, and malicious crawlers. If you're running a public wiki, you've likely encountered some form of these intrusions—from link spam buried in obscure pages to aggressive web crawlers that overload your server.

In this article, we’ll walk you through how to effectively combat these nuisances using proven MediaWiki extensions, smart configuration strategies, and protective services like Cloudflare. Whether you're running a community wiki or an internal documentation hub, these methods will help you reduce unwanted activity and protect your content.

Fighting Spam in MediaWiki

Spam can come in many forms—links to shady websites, auto-generated nonsense, or even human-generated posts designed to manipulate search rankings. Thankfully, MediaWiki has several powerful extensions that can significantly reduce your exposure to spam.

HoneyPot

The HoneyPot extension introduces a clever trick to catch spambots during account registration. It adds a hidden field to the account creation form. Since humans won’t see or interact with this field, any submission that fills it in is likely from a bot—and the account creation is blocked.

This tool is quick to install, requires no third-party service, and works silently in the background. For installation details, visit the HoneyPot extension page.
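
If your wiki runs a version that supports extension registration, enabling it is typically a single line in LocalSettings.php (a minimal sketch; check the extension page for version requirements):

wfLoadExtension( 'HoneyPot' );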

SpamRegex

SpamRegex lets administrators define specific patterns—like certain words or links—that should be blocked from being added to pages. It's especially useful if your wiki is being targeted by spam that follows consistent patterns.

You can set these rules through regular expressions (regex), but even without technical expertise, simple lists of keywords or suspicious domains can make a big difference. Learn how to use it on the SpamRegex documentation page.
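
Assuming a current version that uses extension registration, a minimal setup is just a load line in LocalSettings.php; the blocked phrases themselves are then managed through the Special:SpamRegex page by users with the appropriate right (sketch only; see the extension page for the exact permissions it defines):

wfLoadExtension( 'SpamRegex' );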

AbuseFilter

AbuseFilter is a more advanced tool that allows you to create custom rules for identifying and preventing unwanted edits. For example, you could block new users from adding too many links in one edit or prevent blanking of entire pages.

It includes logging and testing tools so you can experiment with filters before enforcing them. Visit the AbuseFilter extension page to learn more about writing effective filters.
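
Enabling it is again a LocalSettings.php matter, and the filters themselves are written through the Special:AbuseFilter interface. A minimal sketch, with the permission name as documented for the extension (verify against your MediaWiki version):

wfLoadExtension( 'AbuseFilter' );
$wgGroupPermissions['sysop']['abusefilter-modify'] = true;  // let administrators create and edit filters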

TitleBlacklist

The TitleBlacklist extension prevents users from creating pages with suspicious or spammy titles. This is helpful for blocking repeated spam attempts like pages titled “Buy-Cheap-Drugs-Online” or similar.

Patterns can be based on keywords or phrases and can be applied specifically to new accounts. You can find configuration guidance on the TitleBlacklist documentation page.
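
By default the extension reads its patterns from your wiki's MediaWiki:Titleblacklist page; $wgTitleBlacklistSources can pull in additional lists, such as the shared one on Meta-Wiki. A minimal LocalSettings.php sketch (structure and URL as documented for the extension, so double-check for your version):

wfLoadExtension( 'TitleBlacklist' );
$wgTitleBlacklistSources = [
    [
        'type' => 'url',
        'src'  => 'https://meta.wikimedia.org/w/index.php?title=Title_blacklist&action=raw',
    ],
];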

SpamBlacklist (and a Simpler Alternative)

The SpamBlacklist extension focuses on blocking edits that include links to specific blacklisted domains. This is especially helpful for preventing link spam to common spam or phishing sites. It supports both local blacklists and shared, community-maintained ones—like those hosted on Meta-Wiki.
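
A typical setup loads the extension and points it at one or more lists; the local MediaWiki:Spam-blacklist page is consulted automatically, and the sketch below additionally pulls in the shared Meta-Wiki list (structure and URL taken from the extension's documentation, so verify them for your version):

wfLoadExtension( 'SpamBlacklist' );
$wgBlacklistSettings = [
    'spam' => [
        'files' => [
            'https://meta.wikimedia.org/w/index.php?title=Spam_blacklist&action=raw&sb_ver=1',
        ],
    ],
];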

SpamBlacklist is powerful, but it relies on regular expressions (regex) to define which links should be blocked. That gives you a lot of flexibility, but let’s be honest: regex can be tricky to get right. As one programmer famously joked, the plural of regex is regrets.

A Simpler Alternative: BlockedExternalDomains

If you’re looking for a more straightforward way to block links without diving into regex, MediaWiki now offers a helpful alternative: BlockedExternalDomains, a feature of AbuseFilter introduced in recent versions.

Instead of writing complex patterns, you can simply list domains that should never appear in edits on the Special:BlockedExternalDomains page, with no regex syntax required. For example, you could block all edits that include links to:

  • spamdomain.com

  • clickbait.io

  • cheap-pills.biz

This feature covers a large part of the same use cases as SpamBlacklist, but in a much easier-to-manage format—perfect for admins who don’t want to deal with regex syntax errors or maintenance headaches.

When to Use Which:

  • Use BlockedExternalDomains if you want a quick, readable way to block specific sites.

  • Use SpamBlacklist if you need advanced pattern matching or want to tap into global community-maintained blacklists.

You can even use both together to create a layered defense: BlockedExternalDomains for easy wins, and SpamBlacklist for more nuanced or large-scale filtering.

MediaWiki Crawler and Bot Prevention


While spam fills your content with junk, aggressive bots and crawlers can overwhelm your server, slow down your site, and cause spikes in bandwidth usage. MediaWiki provides several ways to reduce the impact of bots—both through internal configuration and external services.

Understanding Crawlers

Crawlers (or “bots”) scan your site to index pages, scrape data, or—if misbehaving—probe for weaknesses. Not all bots are bad. For example, Googlebot is essential for SEO, but rogue or poorly designed bots can be problematic.

The key is to distinguish between helpful bots and harmful ones, then guide or restrict the latter.

Built-in MediaWiki Tools

The Handling web crawlers page offers a detailed breakdown of MediaWiki’s bot-handling features.

Robots.txt

A well-configured robots.txt file can guide crawlers to ignore specific paths. Place it in the root directory of your web server:

User-agent: *
Disallow: /wiki/Special:
Disallow: /w/
Disallow: /index.php

This tells well-behaved crawlers to skip special pages and internal script paths. Keep in mind that robots.txt is only advisory, so abusive bots may ignore it entirely.

Meta Tags

MediaWiki supports the use of noindex meta tags to exclude individual pages or namespaces from indexing. You can use configuration variables like:

$wgNamespaceRobotPolicies[NS_USER] = 'noindex, nofollow';

This prevents search engines from indexing user profile pages.

Crawl-Delay

Some bots respect the Crawl-delay directive in robots.txt, which can reduce the rate at which they access your site:

User-agent: *
Crawl-delay: 10

Note: Googlebot does not respect Crawl-delay, so consider server-side rate limiting for aggressive bots.
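
For throttling at the application layer, MediaWiki's built-in $wgRateLimits can help, though it covers actions such as edits and account creation rather than plain page views, so it complements rather than replaces web-server or CDN limits. A minimal LocalSettings.php sketch:

$wgRateLimits['edit']['ip'] = [ 8, 60 ];       // at most 8 edits per 60 seconds per IP address
$wgRateLimits['edit']['newbie'] = [ 30, 60 ];  // at most 30 edits per 60 seconds for newly registered accounts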

Protecting Your Site with Cloudflare

Cloudflare is a reverse proxy service that sits between your MediaWiki server and the rest of the internet. It offers a suite of tools that help defend your site from malicious traffic and abusive bots, while also improving performance through caching.

Here’s how Cloudflare helps:

  • Rate Limiting: You can define thresholds to block or throttle IPs that make too many requests in a short time.

  • Bot Management: Cloudflare uses machine learning, traffic behavior, and IP reputation to identify bad bots. Depending on the severity, it can challenge, delay, or outright block them.

  • Challenge Pages (JS/CAPTCHA): Suspicious clients may be required to complete a JavaScript or CAPTCHA challenge before accessing your site.

  • IP Reputation & Firewall Rules: Cloudflare lets you create rules based on user agent, request patterns, geography, and more to filter traffic before it hits your server.

What About Serving Cached Pages to Bots?

By default, Cloudflare caches static content (like images and JavaScript), but it doesn’t cache full HTML pages unless you explicitly tell it to.

To serve cached versions of full pages (including to bots), you need to:

  1. Set up a Cache Everything rule using Cloudflare's Page Rules (or an equivalent Cache Rule in the newer Rules engine).

  2. Optionally use Edge Cache TTL to control how long pages are cached.

  3. Make sure that login/edit pages and dynamic content are excluded, to avoid serving stale or private content.

This approach can reduce load on your origin server, especially if anonymous visitors and known bots are all served pre-cached content. However, it should be tested carefully to avoid caching pages with user-specific content.
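
As a rough illustration of those steps using the legacy Page Rules model (rules are evaluated top to bottom and only the first match applies; the domain and TTL are placeholders):

example.com/w/index.php*      ->  Cache Level: Bypass
example.com/wiki/Special:*    ->  Cache Level: Bypass
example.com/wiki/*            ->  Cache Level: Cache Everything, Edge Cache TTL: 2 hours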

Cloudflare doesn’t treat bots differently out of the box when it comes to page caching, but once full pages are cached at the edge, requests from bots (including good ones like Googlebot) are answered by Cloudflare rather than your origin server, which indirectly improves performance for them as well.

Learn more about this from Cloudflare’s official docs on:

  • Cache Everything Rules

  • Bot Management

  • Using Page Rules


Using a “Zip Bomb” to Confuse Bad Bots

Some bots are smarter than others—but many are just basic, brute-force crawlers that don't know when to stop. One unconventional yet clever trick to stop these unsophisticated bots is to serve them what's called a zip bomb.

A zip bomb is a small file that, when downloaded and opened, expands into a massive amount of data—sometimes gigabytes of useless content. The idea is to waste the bot's time and resources by feeding it a file that looks tiny but takes forever to process or even crashes it entirely.

This method works especially well on bots that:

  • Download everything on your site without discrimination

  • Don't respect robots.txt rules or rate limits

  • Aren't smart enough to detect that they’re being trapped

Here’s the general idea:

  • You detect when a suspicious bot visits your site—maybe it’s making requests too quickly or using a known fake user-agent.

  • Instead of serving it a real page, you serve it this compressed “zip bomb” file.

  • The bot, trying to be thorough, downloads and tries to decompress it.

  • Suddenly it’s dealing with a huge amount of junk data, slowing it down or overwhelming it completely.

From a human visitor’s perspective, nothing changes—because this file is never shown to regular users. It’s a trap specifically for bad actors.

Why it works:
Many bots don’t have the safeguards that web browsers or search engine crawlers have. They blindly process everything, and this makes them vulnerable to tricks like zip bombs. Once they hit your decoy file and choke on it, they often give up and move on.

⚠️ Important Warning

A zip bomb is not something you want real users—or even yourself—to encounter. If a regular visitor, admin, or search engine bot accidentally downloads and unpacks the file, it could overwhelm their device or browser, possibly leading to crashes or data loss. Even the original author of the method acknowledged the danger, noting that he accidentally crashed his own system while testing it.

Because of this, you should only deploy a zip bomb if:

  • Other bot mitigation strategies (Cloudflare, rate limits, AbuseFilter) have already failed

  • You are absolutely certain that the client you're targeting is a malicious bot

  • You configure your system to serve the file only to that bot and under very specific conditions

This technique is more of a digital "booby trap" than a defense mechanism. It’s not meant for regular traffic—and it should be handled with extreme caution.

When might this be used?

If your site is facing targeted abuse from a specific scraper or spam bot that isn’t deterred by other measures, a zip bomb might get them to back off. But due to the risks, it should be used sparingly, carefully, and with proper safeguards in place.

For more context on this technique and how to set it up, see this article by Ibrahim Diallo.
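
If you do decide to experiment, the mechanics are simple enough to sketch. The snippet below is a hypothetical, heavily simplified PHP fragment: it assumes you have generated a gzip file (bomb.gz) yourself and that you can identify the bad bot by a user-agent fragment; both names are placeholders, not recommendations, and the warnings above still apply in full.

<?php
// Hypothetical sketch only: serve a pre-generated gzip "bomb" to a known-bad client.
// bomb.gz can be created ahead of time, e.g. ~10 GB of zeros gzipped down to roughly 10 MB:
//   dd if=/dev/zero bs=1M count=10240 | gzip > bomb.gz
$badAgents = [ 'EvilScraper', 'FakeBot' ];  // placeholder user-agent fragments
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';

foreach ( $badAgents as $fragment ) {
    if ( stripos( $ua, $fragment ) !== false ) {
        header( 'Content-Encoding: gzip' );  // the client will try to decompress what follows
        header( 'Content-Type: text/html; charset=UTF-8' );
        header( 'Content-Length: ' . filesize( __DIR__ . '/bomb.gz' ) );
        readfile( __DIR__ . '/bomb.gz' );
        exit;
    }
}
// ...otherwise fall through and serve the normal page.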

Conclusion

Spam, bots, and crawlers are ongoing challenges for any public-facing wiki, but with the right combination of tools, extensions, and thoughtful configuration, you can significantly reduce unwanted activity.

Start with easy wins like AbuseFilter and HoneyPot, consider stronger account moderation tools like ConfirmAccount, and use Cloudflare for extra protection against bots. For link spam, explore both SpamBlacklist and the simpler BlockedExternalDomains feature.

And while CAPTCHA tools like ConfirmEdit offer value, remember to balance security with accessibility—especially for a global user base.

If you're running a private internal knowledge base—one that requires authentication and isn't publicly accessible—many of these measures may not apply or be necessary. In those cases, basic access controls and account approval workflows are usually sufficient to prevent abuse.

No single tool is perfect, but together, these methods form a layered defense that keeps your MediaWiki instance healthy, clean, and contributor-friendly.

Need help implementing any of these solutions or not sure where to start?

WikiTeq offers free, no-obligation consultations to help you assess your setup and recommend the right tools to secure and streamline your MediaWiki site.


