Tuesday, August 20, 2024

Using robots.txt

For decades, website owners have been able to control how search engine crawlers and other software bots interact with their sites. The tool for this, known as robots.txt, has been part of the web since its early days. Let's dive into what robots.txt is and how you can use it effectively.

What is robots.txt? It is a simple text file placed in the root directory of your website. It provides instructions to web crawlers about which parts of your site they're allowed to crawl.

How does it work?

 The file typically contains at least two types of instructions:

1. User-agent: Identifies the crawler being addressed
2. Disallow: Specifies which parts of the site the crawler shouldn't access

 A basic example might look like this:

User-agent: *                                    
Disallow: /                                        

The downside of this approach is that Google's crawler will not crawl your website, and therefore it is unlikely to be found in Google searches. Probably not what we want. An alternative is to block only a particular folder on your website (if you have folders). In this case my images are all stored in the subdirectory /images.

Disallow: /images/                           
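If you want to check what a rule like this does before uploading it, Python's standard library can read robots.txt rules. This is only a rough sketch: the helper name is_allowed, the bot name "ExampleBot" and the example.com addresses are made up for illustration. Note that a Disallow line needs to sit under a User-agent line, so the sketch assumes the rule applies to all crawlers (User-agent: *).

```python
from urllib.robotparser import RobotFileParser

# Rules as they might appear in robots.txt: every crawler is asked
# to stay out of /images/ but may fetch everything else.
# (Assumption: the rule is meant for all crawlers, hence "*".)
RULES = """\
User-agent: *
Disallow: /images/
"""


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# "ExampleBot" and example.com are placeholders, not real names.
print(is_allowed(RULES, "ExampleBot", "https://example.com/images/photo.jpg"))
print(is_allowed(RULES, "ExampleBot", "https://example.com/about.html"))
```

The first check prints False (the image folder is blocked) and the second prints True (the rest of the site is still open).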

It is easy to make your robots.txt file much more specific. For example, I don't want ChatGPT's web scraper to go through my website. To stop it, I need to know the user-agent name that OpenAI uses, and I got this information from the 20i article How to stop AI scraping your website (see video below).

User-agent: GPTBot                    
Disallow: /                                    
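The same kind of check works for a rule aimed at a single crawler. Again this is just a sketch under assumptions: the helper is_allowed and the example.com address are invented for illustration, and "Googlebot" is used only as an example of a different crawler name. It shows how a parser that follows the rules would read them, not what any real bot actually does.

```python
from urllib.robotparser import RobotFileParser

# Only GPTBot is addressed here; no other crawler is mentioned.
GPTBOT_RULES = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# example.com is a placeholder address.
print(is_allowed(GPTBOT_RULES, "GPTBot", "https://example.com/"))
print(is_allowed(GPTBOT_RULES, "Googlebot", "https://example.com/"))
```

The first line prints False and the second prints True, so a crawler identifying itself as GPTBot is asked to stay out of the whole site while crawlers with other names are unaffected.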

At this point in time I'm not sure this approach is reliable, because I suspect that a lot of bots scraping the Internet are just ignoring robots.txt. It's clearly a voluntary exercise. However, it is one step that demonstrates you want to be opted out of the datasets collected by specific AI developers, which may come in handy in future.

A more regulated way to opt out of such indiscriminate content scraping would be nice.
