- A summary of my bot defence systems
- Butlerian Jihad - Blog posts on the topic of fighting off spam bots, search engine spiders and other non-humans wasting the precious resources we have on Earth
- EmacsWiki's robots.txt
VirtualTam's bookmarks
2025-04-09 Web page archive formats:
Tools for crawling, scraping and archiving Web pages:
- internetarchive/heritrix3 - Extensible, web-scale, archival-quality web crawler project (Java)
- internetarchive/Zeno - State-of-the-art web crawler (Go)
- internetarchive/gowarc - Read and write WARC files in Go
- webrecorder/pywb - Web Archiving Toolkit for replay and recording of web archives (Python)
Self-hosted solutions:
- ArchiveBox - A self-hosted app that lets you preserve content from websites in a variety of formats
- Wallabag - Save and classify articles, read them later