Today, I checked the website's access logs and found that a user agent, Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html), had been crawling the site very frequently. Judging by the log file of over twenty megabytes, the bot had clearly been at it for quite a while.
Judging by the pages it requests, the bot seems to randomly recombine parameters from previously crawled pages into long URL strings, most of which return 404. On top of that, it sends a request every few seconds, so the normal access records in the logs get drowned out by tens of thousands of garbage bot entries.
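To get a sense of the scale, you can count the bot's hits directly in the access log. A quick sketch, assuming the standard combined log format and Baota's default log path (adjust both for your setup):
grep -c 'SemrushBot' /www/wwwlogs/access.log
# Top user agents by request count ($6 is the UA field when splitting on double quotes)
awk -F'"' '{print $6}' /www/wwwlogs/access.log | sort | uniq -c | sort -rn | head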
At first, I thought that since it is a bot, it should comply with robots.txt, so I considered adding rules to robots.txt:
User-agent: SemrushBot
Disallow: /
However, after checking online, I found that this bot does not actually seem to comply with robots.txt 😅 (the official page claims the bot strictly adheres to robots.txt, but feedback online suggests otherwise), and it's not just SemrushBot; many marketing bots ignore robots.txt. I had no choice but to block it with Nginx. In Baota's Nginx Free Firewall, I clicked User-Agent Filtering under Global Configuration and added the following regular expression (compiled from online sources; I didn't expect there to be so many, and some of these UAs are probably useless, so check whether any UA you actually need is in the list before using it):
(nmap|NMAP|HTTrack|sqlmap|Java|zgrab|Go-http-client|CensysInspect|leiki|webmeup|Python|python|curl|Curl|wget|Wget|toutiao|Barkrowler|AhrefsBot|a Palo Alto|ltx71|censys|DotBot|MauiBot|MegaIndex.ru|BLEXBot|ZoominfoBot|ExtLinksBot|hubspot|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|CrawlDaddy|CoolpadWebkit|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|Bytespider|Ezooms|JikeSpider|SemrushBot)
After a few seconds, I could see the intercepted data.
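For reference, if you aren't using Baota, the same filtering can be done directly in the Nginx site config. A minimal sketch with an abbreviated UA list (extend the regex with the full list above; 444 makes Nginx close the connection without sending any response):
# inside the site's server { } block
if ($http_user_agent ~* "(SemrushBot|AhrefsBot|MJ12bot|DotBot|Bytespider)") {
    return 444;
}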
Later, watching the interception count keep climbing, it occurred to me that even rejected requests, whether they return 404 or 444, still consume server resources, and being hit every few seconds is no long-term solution. Checking the IPs, I found these bots all come from foreign nodes, and my site's overseas traffic already resolves to Cloudflare, so I could have Cloudflare block them before they ever reach the server.
In the Cloudflare console for the domain, I went to the Security section, clicked the WAF item, then clicked Add Rule. In the Expression Preview, I added the following expression (again, check whether any UA you need is in the list before using it):
(http.user_agent contains "SemrushBot") or (http.user_agent contains "FeedDemon") or (http.user_agent contains "Indy Library") or (http.user_agent contains "Alexa Toolbar") or (http.user_agent contains "AskTbFXTV") or (http.user_agent contains "AhrefsBot") or (http.user_agent contains "CrawlDaddy") or (http.user_agent contains "CoolpadWebkit") or (http.user_agent contains "Java") or (http.user_agent contains "Feedly") or (http.user_agent contains "UniversalFeedParser") or (http.user_agent contains "ApacheBench") or (http.user_agent contains "Microsoft URL Control") or (http.user_agent contains "Swiftbot") or (http.user_agent contains "ZmEu") or (http.user_agent contains "jaunty") or (http.user_agent contains "Python-urllib") or (http.user_agent contains "lightDeckReports Bot") or (http.user_agent contains "YYSpider") or (http.user_agent contains "DigExt") or (http.user_agent contains "HttpClient") or (http.user_agent contains "MJ12bot") or (http.user_agent contains "heritrix") or (http.user_agent contains "Bytespider") or (http.user_agent contains "Ezooms") or (http.user_agent contains "JikeSpider") or (http.user_agent contains "HTTrack") or (http.user_agent contains "Apache-HttpClient") or (http.user_agent contains "harvest") or (http.user_agent contains "audit") or (http.user_agent contains "dirbuster") or (http.user_agent contains "pangolin") or (http.user_agent contains "nmap") or (http.user_agent contains "sqln") or (http.user_agent contains "hydra") or (http.user_agent contains "libwww") or (http.user_agent contains "BBBike") or (http.user_agent contains "sqlmap") or (http.user_agent contains "w3af") or (http.user_agent contains "owasp") or (http.user_agent contains "Nikto") or (http.user_agent contains "fimap") or (http.user_agent contains "havij") or (http.user_agent contains "BabyKrokodil") or (http.user_agent contains "netsparker") or (http.user_agent contains "httperf") or (http.user_agent contains " SF/") or (http.user_agent contains "zgrab") or (http.user_agent contains "NMAP") or (http.user_agent contains "Go-http-client") or (http.user_agent contains "CensysInspect") or (http.user_agent contains "leiki") or (http.user_agent contains "webmeup") or (http.user_agent contains "Python") or (http.user_agent contains "python") or (http.user_agent contains "wget") or (http.user_agent contains "Wget") or (http.user_agent contains "toutiao") or (http.user_agent contains "Barkrowler") or (http.user_agent contains "a Palo Alto") or (http.user_agent contains "ltx71") or (http.user_agent contains "censys") or (http.user_agent contains "DotBot") or (http.user_agent contains "MauiBot") or (http.user_agent contains "MegaIndex.ru") or (http.user_agent contains "BLEXBot") or (http.user_agent contains "ZoominfoBot") or (http.user_agent contains "ExtLinksBot") or (http.user_agent contains "hubspot")
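As an aside, the case duplicates in the list (Python/python, Wget/wget) could be collapsed with the Rules language's lower() function, assuming your plan supports it in custom rules, e.g.:
(lower(http.user_agent) contains "python") or (lower(http.user_agent) contains "wget") or (lower(http.user_agent) contains "semrushbot")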
Then I selected Block as the Action and saved the rule.
From this point on, if the interception count in Baota's Nginx Free Firewall stops rising while the hit count on the Cloudflare firewall rule skyrockets, it means Cloudflare is successfully intercepting these garbage bots before they reach the server.
Wake Up Update#
After waking up, I found the rule had already blocked over two thousand requests; this thing is really persistent 😅.
Custom Interception Page#
I recently took a proper look at Cloudflare's default interception page and found it too ugly... so I wanted to replace it with a custom one.
It turns out custom pages require upgrading to the Pro plan. As someone freeloading on the free tier, I can't justify upgrading just for a custom page, so I had to take a more roundabout route: instead of blocking outright, redirect every visitor matching the User-Agent expression above to a custom page.
Note
🔔 Please note: pages hosted on Vercel are not reachable from mainland China. Since it is this site's overseas traffic that goes through Cloudflare, that doesn't matter here; if you need the page reachable from mainland China as well, use a different hosting platform.
We need to write a static interception page ourselves and host it on Vercel.
Detailed Steps for Vercel Deployment#
First, go to Vercel and create a new project.
Then you can either import your custom page from an existing repository on the left or clone a template on the right. Since I don't have a custom page yet, I clicked the Browse All Templates link on the right to pick one.
Since I am writing the page in React, I used the Create React App template; click the Deploy button to use it.
Next, click GitHub under Create Git Repository to connect your GitHub account, enter a repository name, check the Create private Git Repository option to make the repository private, and click Create to deploy automatically.
When you see the Congratulations! prompt, it means it has been successfully deployed.
Finally, commit your code to the newly created GitHub repository; once the push completes, Vercel will automatically rebuild and update the page.
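The push itself is just standard Git (the repository name is a placeholder):
git clone https://github.com/<your-username>/<your-repo>.git
cd <your-repo>
# edit src/App.js, src/App.css and public/index.html, then:
git add .
git commit -m "Add custom interception page"
git push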
Below is the custom page I wrote using React, which you can use directly.
src/App.js file:
import React, { Component } from 'react'
import './App.css';

export default class App extends Component {
  // Default (English) copy; the `symbols` key is the footer signature
  // and must match the name destructured in render()
  state = {
    tran: {
      lang: 'en',
      title: 'This request has been blocked',
      contents_1: 'Some of your characteristics exist in the blacklist and this request has been blocked.',
      contents_2: 'If you think this is a false alarm, please contact me promptly.',
      symbols: '@Vinking Security Center',
      tips: 'Details have been saved to continuously optimize the Security Center'
    }
  }

  // Toggle the copy between English and Chinese and update the tab title
  handleTranslations = () => {
    const { lang } = this.state.tran
    const newState = (lang === 'en') ? {
      lang: 'zh',
      title: '请求已被拦截',
      contents_1: '您的一些特征存在于黑名单中,此次请求已被拦截。',
      contents_2: '如果您觉得这是误报,请及时联系我。',
      symbols: '@ Vinking 安全中心',
      tips: '详细信息已保存以持续优化安全中心'
    } : {
      lang: 'en',
      title: 'This request has been blocked',
      contents_1: 'Some of your characteristics exist in the blacklist and this request has been blocked.',
      contents_2: 'If you think this is a false alarm, please contact me promptly.',
      symbols: '@Vinking Security Center',
      tips: 'Details have been saved to continuously optimize the Security Center'
    }
    document.title = newState.title
    this.setState({ tran: newState })
  }

  render() {
    const { title, contents_1, contents_2, symbols, tips } = this.state.tran
    return (
      <div className="content">
        <div className="card">
          <div className="cardHeader">
            <div>{title}</div>
            {/* Language toggle button */}
            <div className="translation" onClick={this.handleTranslations}>
              <svg xmlns="http://www.w3.org/2000/svg" width="15" height="15" viewBox="0 0 1024 1024"><path fill="#f8f9fa" d="M608 416h288c35.36 0 64 28.48 64 64v416c0 35.36-28.48 64-64 64H480c-35.36 0-64-28.48-64-64V608H128c-35.36 0-64-28.48-64-64V128c0-35.36 28.48-64 64-64h416c35.36 0 64 28.48 64 64v288zm0 64v64c0 35.36-28.48 64-64 64h-64v256.032C480 881.696 494.304 896 511.968 896H864a31.968 31.968 0 0 0 31.968-31.968V512A31.968 31.968 0 0 0 864 480.032H608zM128 159.968V512c0 17.664 14.304 31.968 31.968 31.968H512A31.968 31.968 0 0 0 543.968 512V160a31.968 31.968 0 0 0-31.936-32H160a31.968 31.968 0 0 0-32 31.968zm64 244.288V243.36h112.736V176h46.752c6.4.928 9.632 1.824 9.632 2.752a10.56 10.56 0 0 1-1.376 4.128c-2.752 7.328-4.128 16.032-4.128 26.112v34.368h119.648v156.768h-50.88v-20.64h-68.768V497.76h-49.504V379.488h-67.36v24.768H192zm46.72-122.368v60.48h67.392V281.92h-67.36zm185.664 60.48V281.92h-68.768v60.48h68.768zm203.84 488H576L668.128 576h64.64l89.344 254.4h-54.976l-19.264-53.664H647.488l-19.232 53.632zm33.024-96.256h72.864l-34.368-108.608h-1.376l-37.12 108.608zM896 320h-64a128 128 0 0 0-128-128v-64a192 192 0 0 1 192 192zM128 704h64a128 128 0 0 0 128 128v64a192 192 0 0 1-192-192z" /></svg>
            </div>
          </div>
          <div className="cardDesc">
            <span className="red">{contents_1}</span>
            <br />
            {contents_2}
          </div>
          <div className="cardSymbols">
            <div>{symbols}</div>
          </div>
        </div>
        <div className="tips">{tips}</div>
      </div>
    )
  }
}
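Note that the component imports ./App.css, which is not included in this post. Below is a hypothetical minimal stylesheet matching the class names above, just so the page renders reasonably; replace it with your own design.
src/App.css file (placeholder):
/* Minimal placeholder styles; not the original stylesheet */
body { margin: 0; background: #07092f; }
.content { min-height: 100vh; display: flex; flex-direction: column; align-items: center; justify-content: center; color: #f8f9fa; font-family: sans-serif; }
.card { max-width: 480px; padding: 24px; border-radius: 8px; background: rgba(255, 255, 255, 0.06); }
.cardHeader { display: flex; justify-content: space-between; align-items: center; font-size: 1.25rem; }
.translation { cursor: pointer; }
.cardDesc { margin-top: 16px; line-height: 1.6; }
.red { color: #ff6b6b; }
.cardSymbols { margin-top: 16px; font-size: 0.85rem; opacity: 0.8; }
.tips { margin-top: 16px; font-size: 0.8rem; opacity: 0.6; }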
public/index.html file:
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport"
content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no, shrink-to-fit=no">
<meta name="theme-color" content="#07092f">
<title>This request has been blocked</title>
<meta name="description" content="This request has been blocked.">
</head>
<body>
<div id="root"></div>
</body>
</html>
Back in Cloudflare, click Redirect Rules under the Rules section and create a rule. Paste the marketing-bot expression from above into the expression editor, select Static as the Type, fill in the URL of the Vercel-hosted page, set the Status Code to 301, and uncheck Preserve Query String. That completes the custom interception page.
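For clarity, the finished redirect rule ends up looking roughly like this (the Vercel URL is a placeholder for your own deployment):
When incoming requests match: (http.user_agent contains "SemrushBot") or ... (the full expression above)
Then:
Type: Static
URL: https://your-block-page.vercel.app
Status code: 301
Preserve query string: unchecked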
This article is synchronized and updated to xLog by Mix Space. The original link is https://www.vinking.top/posts/codes/blocking-bots-with-cloudflare