Given the popularity of ChatGPT, we can expect that in the near future many people will use it to produce large amounts of text on the internet, such as blog and forum posts. Productivity will improve significantly.

However, I worry that this generated text will someday be consumed by ChatGPT itself as part of its training corpus. That raises a problem: the AI will learn from large amounts of its own output as if it were new material.

So, is there a mechanism for ChatGPT or any other chatbot to avoid this vicious circle?

zzzgoo
    If it reads what it writes and so on, at least that doesn't burn our time. What worries me more is humans having to produce what it consumes, or us having to read what it writes. – Jaume Oliver Lafont Mar 22 '23 at 17:40

1 Answer

Make an effort to distinguish content from good content.

One of the major complaints about ChatGPT (you'll certainly see this all over Stack Exchange sites) is that it tends to deliver low-quality answers. Which makes sense. It's just a language model. Its answers are, therefore, definitely language. Just not necessarily accurate and detailed responses to a given query. As such, I think you would very rarely see ChatGPT-generated content being very popular or upvoted (even if it weren't outright banned).

I think that's the sort of thing that a training process should try to take into account: not "what is the source of this" but "is the source popular, well-liked, or notable".

If you were training ChatGPT on Stack Exchange input, then an obvious first step would be to have it skip anything with a low number of views or upvotes. SE is already telling you which questions and answers are low quality, so avoiding those is a good idea. I believe this would also, incidentally, keep a chatbot from ingesting other chatbot output (because the typical rambling, vaguely off-topic chatbot output would rarely make it into the "high quality" category).
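
As a rough illustration of that first step, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Post` record, its `score` and `view_count` fields, and the threshold values are hypothetical, not a real Stack Exchange API or anything OpenAI is known to do.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Post:
    """Hypothetical record for one question or answer."""
    post_id: int
    body: str
    score: int       # net upvotes (assumed field)
    view_count: int  # views of the post or its parent question (assumed field)


# Illustrative thresholds only; sensible values would need tuning per site.
MIN_SCORE = 3
MIN_VIEWS = 100


def is_high_quality(post: Post,
                    min_score: int = MIN_SCORE,
                    min_views: int = MIN_VIEWS) -> bool:
    """Keep a post only if both its score and its views clear the bar."""
    return post.score >= min_score and post.view_count >= min_views


def select_training_posts(posts: Iterable[Post],
                          min_score: int = MIN_SCORE,
                          min_views: int = MIN_VIEWS) -> List[Post]:
    """Drop anything with a low score or a low view count."""
    return [p for p in posts if is_high_quality(p, min_score, min_views)]


if __name__ == "__main__":
    sample = [
        Post(1, "Detailed, well-received answer", score=12, view_count=4000),
        Post(2, "Rambling, vaguely off-topic answer", score=0, view_count=900),
        Post(3, "Decent answer nobody has seen", score=5, view_count=20),
    ]
    kept = select_training_posts(sample)
    print([p.post_id for p in kept])
```

The filter never asks whether a chatbot wrote the post. It relies only on the community's own quality signals, which is the point of the argument above.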

Similarly, you don't know if a website was created by a chatbot, but if you can gauge its notability or appeal, you can at least try to make a judgement about whether it's worth using as a training source. Not notable? Not by a known author? Low page view count? Unknown website? Skip it.
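
The checklist above could be sketched as a simple rule-based gate. This is one possible (strict) reading, in which any single failed signal is enough to skip a page. The `PageSignals` fields and the view threshold are hypothetical, and how you would actually measure notability or page views is left open.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageSignals:
    """Hypothetical notability signals gathered for one web page."""
    url: str
    known_author: bool
    known_site: bool
    monthly_views: Optional[int] = None  # None when no estimate is available


def skip_reasons(page: PageSignals, min_monthly_views: int = 1000) -> List[str]:
    """Return every reason to skip the page; an empty list means keep it."""
    reasons = []
    if not page.known_author:
        reasons.append("unknown author")
    if not page.known_site:
        reasons.append("unknown website")
    # A missing view estimate is treated the same as a low one (an assumption).
    if page.monthly_views is None or page.monthly_views < min_monthly_views:
        reasons.append("low or unknown page views")
    return reasons


def worth_training_on(page: PageSignals, min_monthly_views: int = 1000) -> bool:
    """Strict policy: a page is used only if no skip reason applies."""
    return not skip_reasons(page, min_monthly_views)


if __name__ == "__main__":
    page = PageSignals("https://example.org/post", known_author=False,
                       known_site=True, monthly_views=None)
    print(worth_training_on(page), skip_reasons(page))
```

Returning the reasons rather than a bare yes/no makes it easier to audit what the curation step is discarding, and to loosen the policy later (for example, skipping only when two or more signals fail).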

Basically, just don't train a chatbot by having it willy-nilly go through the entire internet A-Z. The inputs need some curation. The more you can steer it away from low-quality inputs, the better its training will be and, incidentally, the more likely it is to avoid ingesting previous rounds of low-quality chatbot output.

JamieB