The collective store of knowledge has been generated by humans in forums, Reddit, Stack Overflow, websites, wikis, and search engines. We have used this to train ChatGPT. Since GPT-4, the generation of knowledge has been moving from the public sphere into direct private chats. This is a problem.
So the generation and capture of knowledge are now privately sourced and pooled into a Machine, bypassing the human public domain.
Where will the next AI model get its training data? GPT-Next will be trained on legacy data, because the fresh human-generated knowledge is being captured in private chats rather than on the public web.
This raises multiple questions about ethics, the sanctity of data access, and the growing need to legislate for public data.
AI will become a dominant source of knowledge simply by virtue of its growth. Yet it depends on training data that could become unavailable, monopolised, or accessible only under a chatbot vendor's licence.
Trusting the output of AI is just as alarming. If we lose access to the training data and propagate AI outputs, we will lose the collective memory of human-generated thinking, and with it the ability to validate AI. Just as with search, humanity will be told what to think and how to think, programmed by AI.
So to avoid a dangerous feedback loop, we need, individually and collectively, to support and extend the public domain.