About Crawler Activity
Introduction
The Crawler Activity web application provides a user interface for exploring the activity of web crawlers at my two websites, sphaerula.com and conradhalling.com. Since I suspected that the bulk of traffic at both of these sites was from web crawlers, I decided to analyze the access log files produced by the Apache httpd web server to identify the crawlers that were accessing my web pages.
As presented on the Summary page, more than 98% of the traffic on my websites comes from web crawlers.
Definition of a Request
Crawler Activity counts the number of requests issued from each IP address and, where possible, categorizes each request by the crawler that made it. So, what is a request?
First, a little background. My websites use my variant of the Jekyll Chirpy theme v7.2.4. This theme requires many supporting files to render a page correctly. For example, a single request for the page https://conradhalling.com/blog/ actually causes 21 HTTP requests in total. These are the files retrieved by a request for my blog's home page:
- /blog/index.html
- /blog/assets/lib/fonts/main.css
- /blog/assets/lib/fontawesome-free/css/all.min.css
- /blog/assets/css/jekyll-theme-chirpy.css
- /blog/assets/lib/loading-attribute-polyfill/loading-attribute-polyfill.min.css
- /blog/assets/js/dist/theme.min.js
- /blog/assets/img/avatar/avatar.jpg
- /blog/assets/lib/loading-attribute-polyfill/loading-attribute-polyfill.umd.min.js
- /blog/assets/lib/simple-jekyll-search/simple-jekyll-search.min.js
- /blog/assets/lib/dayjs/dayjs.min.js
- /blog/assets/lib/dayjs/locale/en.js
- /blog/assets/lib/dayjs/plugin/relativeTime.js
- /blog/assets/lib/dayjs/plugin/localizedFormat.js
- /blog/assets/js/dist/home.min.js
- /blog/assets/lib/fontawesome-free/webfonts/fa-brands-400.woff2
- /blog/assets/lib/fontawesome-free/webfonts/fa-solid-900.woff2
- /blog/assets/lib/fontawesome-free/webfonts/fa-regular-400.woff2
- /blog/assets/lib/fonts/Lato/Lato-Regular.ttf
- /blog/assets/js/data/search.json
- /blog/assets/img/favicons/favicon-32x32.png
- /blog/assets/img/favicons/apple-touch-icon.png
Consequently, my code ignores all of the accessory files, so that a request for a page is counted once and not 21 times.
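As an illustration, here is a minimal sketch of how that filtering might work. The function name, the extension list, and the thresholds are hypothetical, not my production code; the sketch assumes the Apache combined log format and treats any request whose path ends in a static-asset extension as an accessory file.

```python
import re

# Hypothetical list of asset extensions that should not count as page requests.
ASSET_EXTENSIONS = (
    ".css", ".js", ".json", ".jpg", ".png", ".woff2", ".ttf", ".ico", ".svg",
)

# Regular expression for the request portion of an Apache combined log line,
# e.g. "GET /blog/ HTTP/1.1".
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+"')

def is_page_request(log_line: str) -> bool:
    """Return True when the log line looks like a page request rather than
    a request for an accessory file (stylesheet, script, font, image)."""
    match = REQUEST_RE.search(log_line)
    if match is None:
        return False
    path = match.group("path").split("?", 1)[0].lower()
    return not path.endswith(ASSET_EXTENSIONS)

line = '127.0.0.1 - - [01/Jan/2025:00:00:00 +0000] "GET /blog/ HTTP/1.1" 200 5123 "-" "GPTBot/1.2"'
print(is_page_request(line))  # True: /blog/ is a page, not an accessory file
```

With a filter like this, the 21 requests shown above collapse to the single countable request for /blog/index.html.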
Identified Crawlers
My websites are visited by crawlers of different types, many of which identify themselves and many others that do not.
Many crawlers identify themselves in their user agent strings. For example, OpenAI's crawler, GPTBot/1.2, has the following user agent string:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)
Crawler Activity has a database field that categorizes such crawlers as “identified”.
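The sketch below shows one way such self-identifying user agent strings could be matched. The pattern and the token list are illustrative assumptions, not the application's actual matching logic; the idea is simply to extract a name and version from a token such as GPTBot/1.2.

```python
import re

# Hypothetical token list; a real table would cover many more crawlers.
CRAWLER_TOKEN_RE = re.compile(r"\b(GPTBot|Googlebot|bingbot|ClaudeBot)/([\d.]+)")

def identify_crawler(user_agent: str) -> tuple[str, str] | None:
    """Return (name, version) when the user agent self-identifies,
    otherwise None."""
    match = CRAWLER_TOKEN_RE.search(user_agent)
    if match is None:
        return None
    return match.group(1), match.group(2)

ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
      "compatible; GPTBot/1.2; +https://openai.com/gptbot)")
print(identify_crawler(ua))  # ('GPTBot', '1.2')
```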
Unidentified Crawlers
Unidentified crawlers have user agent strings that contain no information about the crawler. Sometimes the user agent string is as short as “-”. This website identifies these crawlers based on patterns of behavior, using:

- manual entry of crawler identifier strings
- pattern matching of data from the access log files
- manual entry of new crawlers into the database as I monitor the data
| Identification Status | Crawler Count | Request Count |
| --- | --- | --- |
| Identified | 103 | 71,933 |
| Unidentified | 11 | 51,197 |
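As a rough illustration of behavior-based identification, the sketch below groups page requests by client IP address when the user agent string carries no crawler information, so that unusually active addresses can be reviewed and entered into the database by hand. The log parsing and the threshold are hypothetical assumptions, not the application's actual rules.

```python
from collections import Counter
import re

# Hypothetical parser for the Apache combined log format: capture the
# client IP address and the final quoted field, the user agent string.
LOG_RE = re.compile(r'^(?P<ip>\S+) .*?"(?P<agent>[^"]*)"$')

def candidate_crawler_ips(log_lines, min_requests=100):
    """Count requests per IP address for uninformative user agents
    (such as "-") and return addresses busy enough to review by hand."""
    counts = Counter()
    for line in log_lines:
        match = LOG_RE.match(line)
        if match is None:
            continue
        if match.group("agent") in ("", "-"):
            counts[match.group("ip")] += 1
    return [(ip, n) for ip, n in counts.most_common() if n >= min_requests]
```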
Types of Unidentified Crawlers
The data are extracted from the Apache httpd access logs and filtered to retain only page requests. The application's pages then report on the crawlers that made those requests.
Code Details
Out of necessity, I wrote this application using Python CGI, because my shared hosting service supports PHP or Python CGI but not a newer interface such as WSGI (used by frameworks like Flask). Although CGI is dismissed these days as obsolete, CGI applications are fairly easy to write and can be responsive on good hardware.
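For readers unfamiliar with the protocol, here is a minimal CGI sketch, not the Crawler Activity code itself: the web server passes request data to the script through environment variables, and the script writes an HTTP header block, a blank line, and the response body to standard output.

```python
#!/usr/bin/env python3
"""Minimal CGI sketch: echo the query string back to the client."""

import html
import os

# Under CGI, the server supplies the request's query string in an
# environment variable rather than as a function argument.
query = os.environ.get("QUERY_STRING", "")

# A CGI response is a header block, a blank line, then the body.
print("Content-Type: text/html; charset=utf-8")
print()
print("<!DOCTYPE html>")
print("<title>CGI sketch</title>")
print(f"<p>Query string: {html.escape(query)}</p>")
```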
For data containers, out of the many choices Python provides, I decided to use ordinary dictionaries. I experimented with the dataclasses module, but for my purposes here it offered no clear advantage.
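For example, a per-crawler record might travel through the code as a plain dictionary; the field names below are illustrative assumptions, not the application's actual schema.

```python
# Hypothetical record for one crawler; the keys are illustrative only.
crawler = {
    "name": "GPTBot",
    "version": "1.2",
    "identified": True,
    "request_count": 0,
}

# Plain dictionaries need no class definitions and serialize naturally,
# at the cost of attribute access and static type checking.
crawler["request_count"] += 1
print(crawler["name"], crawler["request_count"])
```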
Source Code
The source code for this application will be made available soon at my GitHub repository.