What is robots.txt? It is a simple text file placed in the root directory of your website. It provides instructions to web crawlers about which parts of your site they're allowed to access and index.
How does it work?
1. User-agent: Identifies the crawler being addressed
2. Disallow: Specifies which parts of the site the crawler shouldn't access
User-agent: *
Disallow: /
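If you want to see what those two lines actually do, Python's built-in urllib.robotparser module can evaluate them against any URL. This is just a sketch, with example.com standing in for a real domain:

from urllib.robotparser import RobotFileParser

# Feed the rules above straight to the parser (no network access needed)
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])
# Every crawler is blocked from every path
print(rp.can_fetch("Googlebot", "https://example.com/"))          # False
print(rp.can_fetch("Bingbot", "https://example.com/about.html"))  # False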
The downside of this approach is that Google will not index your website, and therefore it won't be found in any Google searches. Probably not what we want. An alternative is to block a specific folder on your website (if you have folders). In this case my images are all stored in the subdirectory /images:
User-agent: *
Disallow: /images/
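Running the same kind of check shows the narrower effect: only paths under /images/ are blocked, and the rest of the site stays crawlable (again a sketch, with example.com as a placeholder):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /images/",
])
print(rp.can_fetch("Googlebot", "https://example.com/images/cat.jpg"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/blog/my-post"))    # True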
It is easy to make your robots.txt file much more specific. For example, I don't want ChatGPT's web scraper to go through my website, so I need to know the agent name that OpenAI is using. I got this information from the 20i article How to stop AI scraping your website (see video below):
User-agent: GPTBot
Disallow: /
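Note that these rules can all live in one robots.txt file. Each User-agent group applies only to the crawlers it names, and a crawler follows the most specific group that matches it, so a combined file like this hides /images/ from everyone else while blocking GPTBot from the whole site:

User-agent: *
Disallow: /images/

User-agent: GPTBot
Disallow: /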
At this point in time I'm not sure this approach is reliable, because I suspect that a lot of the bots scraping the Internet are simply ignoring robots.txt; it's clearly a voluntary exercise. However, it is one step that demonstrates you want to be opted out of the datasets collected by a specific AI developer, which may come in handy in future.
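To make the voluntary part concrete: nothing on the server enforces robots.txt. Any compliance happens inside the crawler's own code, roughly along these lines (a hypothetical sketch; polite_fetch, ExampleBot and the example.com URL are all made up for illustration):

import urllib.request
from urllib.robotparser import RobotFileParser

def polite_fetch(url, agent="ExampleBot"):
    # A well-behaved crawler downloads and checks robots.txt first...
    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()
    if not rp.can_fetch(agent, url):
        return None  # ...and voluntarily skips disallowed URLs.
    # A rude scraper simply deletes the check above and fetches anyway.
    req = urllib.request.Request(url, headers={"User-Agent": agent})
    return urllib.request.urlopen(req).read()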
A more regulated way to opt out of such indiscriminate content scraping would be nice.