When Copies Aren't Perfect: A Visual Warning About AI LLM Training
We tend to trust that digital copies are exact replicas of the original. But what happens when we copy a copy, then copy that copy again?
Using my cartoon alter ego Alvin, I want to demonstrate how the JPEG image format, with its "lossy" compression method, degrades with each generation of copying. While JPEG saves disk space and speeds up downloads, it achieves this by discarding data each time you save; the discarded detail can never be recovered, and every re-save rounds the image off afresh.
Starting with a clear image of Alvin shouting "gibberish" at a screen, created in CorelDRAW, I repeatedly saved and resaved it as a JPEG (a short script sketching this re-saving loop follows the list):
• At 80% quality (the common standard), the first 5 generations showed minor deterioration at high-contrast edges.
• At 70% quality (used by many social media platforms), problems became obvious by generation 10.
• At 50% quality, the image was almost unrecognisable by generation 20. Even the misspelt word "gibberish" Alvin was shouting became illegible.
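If you want to reproduce the experiment, the loop below is a minimal sketch in Python using the Pillow library. The filenames and generation count are my assumptions for illustration; the essential detail is that each save starts from the previously compressed file, not the clean original.

```python
# A minimal sketch of the generational re-saving experiment.
# Assumes Pillow is installed and "alvin.png" (a hypothetical name)
# is the clean original exported from CorelDRAW.
from PIL import Image

QUALITY = 50        # try 80, 70 and 50 to mirror the three runs above
GENERATIONS = 20

img = Image.open("alvin.png").convert("RGB")
for gen in range(1, GENERATIONS + 1):
    name = f"alvin_q{QUALITY}_gen{gen:02d}.jpg"
    img.save(name, format="JPEG", quality=QUALITY)
    # Re-open the file just written, so the next generation starts
    # from the degraded copy rather than the clean original.
    img = Image.open(name).convert("RGB")
```

Re-opening the saved file each time is what makes the loss compound; saving the same in-memory image twenty times over would only discard data once.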
Why This Matters
This isn't just about image quality. It's a powerful analogy for what's happening with large language models trained on "synthetic data", a dodgy term LLM enthusiasts use for AI-generated content fed back into AI systems.
Just as each JPEG generation compounds tiny adjustments until the image becomes gibberish, AI systems trained on their own output accumulate biases and inaccuracies. The feedback loop doesn't make things more accurate; it amplifies what's wrong.
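To see why the loop amplifies error rather than cancelling it, here is a toy sketch, a statistical caricature and nothing like a real LLM training run, that fits a simple normal distribution to data, samples "synthetic data" from the fit, and refits on that. Over generations the estimates drift and the spread typically collapses:

```python
# A toy illustration of recursive training on synthetic data.
# Assumes only numpy; the numbers of generations and samples are
# arbitrary choices for demonstration.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)   # the "real" data

for gen in range(1, 11):
    mu, sigma = data.mean(), data.std()   # "train" a model: fit a Gaussian
    print(f"gen {gen:2d}: mean={mu:+.3f}  std={sigma:.3f}")
    # The next generation sees only synthetic samples drawn from the fit,
    # so every sampling error and fitting bias is carried forward.
    data = rng.normal(loc=mu, scale=sigma, size=200)
```

Each generation's estimation error becomes the next generation's ground truth, which is exactly the compounding the JPEG experiment shows visually.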
When we assume digital processes are perfectly reliable, we miss how errors compound through iteration. Each cycle reinterprets the previous one, carrying forward and magnifying small mistakes, biases and fake stuff. Eventually, we're left with output that bears little resemblance to the original truth.
The lesson? Whether it's image compression or AI training, recursive copying without fresh input leads to degradation. Garbage in, garbage out; feed that garbage back in and the output only gets worse.
Whilst I'm noticing significant progress towards better vision after corneal graft issues, I can't help but notice that what was a rapid advance in LLMs, or Large Language Models, has definitely decelerated. Perhaps I've been busy elsewhere, looking at other interesting things to follow up, like human vision and how our eyes work, particularly in terms of colour perception. But that's a different story.
When appropriate, I am still experimenting with a few of the common Generative AI systems around today, both for evaluating my writing and for producing images or creative ideas.
My approach hasn't changed: whenever my text is sent to an AI large language model for summarising or for fixing spelling and grammar, I check the result. I print out the AI's output and, using highlighter pens, mark each sentence into five groups (see Why You Can't Trust AI Writing Without Human Oversight). This tells me that the quality of the output has become significantly less usable under my original observational classification.
Two things in particular stand out. The system has now decided to be much more friendly and complimentary to me. This sycophantic encouragement makes me very suspicious of what it's going to tell me. Is it something I'm expected to lap up and not question? I'd be happier if the chatbot's conversation could be a bit more adversarial. I like a conversation with a similarly experienced colleague, perhaps with different views or a new idea. The other area that worries me is how confidently made-up rubbish is presented, as if it were well-sourced information.
The idea that these things can be improved by making large language models even larger is not a sound one, particularly when I hear mention of using synthetic data: output from the model itself or from other LLMs, reloaded into an even bigger training run. We have already seen that the AI will often make things up (hallucinations). Isn't this all a bit dangerous? I'm not the only one who sees this as a risk. Above is a recent interview with Michael Wooldridge, a prominent UK AI researcher, where he explores what it will take for machines to achieve higher-order reasoning.
This is a simple recycling tip to help keep your brushes happy. Keep watching for the Extra Bonus Tip about cleaning watercolour brushes at the end.
Still experimenting with new techniques and tools to create these videos. I just updated my very old version of VideoStudio to the last ever (2023) version. Despite the "end of the road" vibes, it does what I need: faster rendering and higher resolution, not to mention several new features to learn. This video for YouTube was filmed on my phone and edited with Corel VideoStudio.
So I still have a lot to learn about filming, presenting and more professional editing, BUT I am learning and enjoying myself. Maybe these tips will help or interest you.