Tuesday, August 20, 2024

Using robots.txt

For decades, website owners have been able to control how search engine crawlers and other software bots interact with their sites. This tool, known as robots.txt, has been part of the web since the Robots Exclusion Protocol was introduced in 1994. Let's dive into what robots.txt is and how you can use it effectively.

What is robots.txt? It is a simple text file placed in the root directory of your website. It provides instructions to web crawlers about which parts of your site they're allowed to access and index.

How does it work?

The file typically contains at least two types of instructions:

1. User-agent: Identifies the crawler being addressed
2. Disallow: Specifies which parts of the site the crawler shouldn't access

A basic example might look like this:

User-agent: *
Disallow: /

The downside of this approach is that Google will not index your website, and therefore it won't be found in any Google searches. Probably not what we want. An alternative is to disallow a specific folder on your website (if you have folders). In my case, my images are all stored in the subdirectory /images.

Disallow: /images/
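As a quick sanity check, you can test how a crawler would interpret a rule like this using Python's standard-library urllib.robotparser (the example.com URLs below are just placeholders):

```python
from urllib.robotparser import RobotFileParser

# Parse rules equivalent to the robots.txt above:
# block every crawler from the /images/ folder only.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /images/",
])

# Paths under /images/ are blocked; everything else stays crawlable.
print(rp.can_fetch("Googlebot", "https://example.com/images/photo.jpg"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/index.html"))        # True
```

This is the same parsing logic well-behaved crawlers apply, so it's a handy way to spot a typo in a rule before deploying it.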

It is easy to make your robots.txt file much more specific. For example, I don't want ChatGPT's web scraper to go through my website, so I need to know the user-agent name that OpenAI uses. I got this information from the 20i article "How to stop AI scraping your website" (see video below).

User-agent: GPTBot
Disallow: /
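The same standard-library parser can confirm that this rule shuts out only GPTBot and leaves other crawlers unaffected (again, example.com is just a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Parse rules equivalent to the robots.txt above:
# block GPTBot from the whole site, say nothing about anyone else.
rp = RobotFileParser()
rp.parse([
    "User-agent: GPTBot",
    "Disallow: /",
])

# GPTBot is blocked everywhere; crawlers with no matching
# User-agent entry fall back to "allowed".
print(rp.can_fetch("GPTBot", "https://example.com/"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/"))  # True
```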

At this point in time I'm not sure this approach is reliable, because I suspect that a lot of the bots scraping the Internet simply ignore robots.txt; compliance is clearly a voluntary exercise. However, it is one step that demonstrates you want to be opted out of the datasets collected by a specific AI developer, which may come in handy in future.

A more regulated way to opt out of such indiscriminate content scraping would be nice.

Friday, August 16, 2024

Is AI getting “good enough”?

In two minds about using AI for art works

This is an AI image I created with the prompt above, using Google DeepMind's latest incarnation of its generative model, Imagen 3. It really looks better than most other generative AI: less overcrowding with intricate but irrelevant detail, more consistent lighting, and a better capture of the emotional direction of my prompt. It was also my only creation with this prompt.

Google claims 

"We’ve significantly improved Imagen 3’s ability to understand prompts, which helps the models generate a wide range of visual styles and capture small details from longer prompts.

To be even more useful, Imagen 3 will be available in multiple versions, each optimized for different types of tasks, from generating quick sketches to high-resolution images."

OK, that's nice wording, Google, but what does it mean, and why do I see such a difference from other prompt-generated images?

Well perhaps there is a hint right at the end of their hype.

"Imagen 3 was built with our latest safety and responsibility innovations, from data and model development to production.

We used extensive filtering and data labeling to minimize harmful content in datasets and reduced the likelihood of harmful outputs. We also conducted red teaming and evaluations on topics including fairness, bias and content safety."

Recent developments in AI technology have raised some intriguing questions about data handling, the promotion of fake news, and bias. It appears that Google, at least, is implementing input-checking mechanisms on the information used to train its models. This likely extends to assessing image quality as well, ensuring that the data fed into these systems meets certain standards. Looks “good enough”.

However, this observation leads to a more pressing personal concern: 

Does responsible AI development truly encompass ethical practices across the board? The issue of data collection methods remains a significant point of contention. Are these companies indiscriminately scraping data from various sources without proper consent or consideration?

This brings me to a personal worry many of us share: the privacy of our own data, particularly our photos. With the prevalence of cloud-based photo storage services like Google Photos, and the vast number of images captured and uploaded from mobile devices daily, it's natural to wonder about the security and usage of this data. Are some companies "hoovering up" these personal images en masse? If so, what are the implications for our privacy and the control we have over our own digital footprint?

As consumers and digital citizens, it's important that we stay informed about these practices and advocate for ethical standards in AI development. The balance between technological advancement and personal privacy is delicate, and it's a conversation we need to keep having as AI continues to evolve and integrate into our daily lives.