Detecting ChatGPT-generated content is a contentious topic at the moment. For example, I've been on Reddit's r/ChatGPT, and there's a constant stream of users claiming they've been unfairly accused of plagiarism.

One thing I'm curious about is character frequency, i.e., how frequently each English character (a, b, ..., z) occurs in ChatGPT-generated text vs. human-generated text. I'm not sure whether a statistically significant difference exists, and maybe there's research into this. My guess would be that any difference is almost impossible to detect.

Question: Does (English) ChatGPT-generated content have a statistically significantly different character-frequency distribution from human-generated content?

(For this question, I want to ask about English only.)
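
To make the question concrete, here is a minimal sketch of the kind of test I have in mind: count letter frequencies in a sample from each source and run a chi-squared test of homogeneity. The corpus file names are placeholders; a real comparison would need large, comparable samples from each source.

    # Minimal sketch: compare letter frequencies of two text samples
    # with a chi-squared test of homogeneity.
    from collections import Counter
    from string import ascii_lowercase

    from scipy.stats import chi2_contingency

    def letter_counts(text):
        """Count occurrences of each letter a-z, case-insensitively."""
        counts = Counter(c for c in text.lower() if c in ascii_lowercase)
        return [counts[letter] for letter in ascii_lowercase]

    human_text = open("human_sample.txt").read()      # placeholder corpus
    chatgpt_text = open("chatgpt_sample.txt").read()  # placeholder corpus

    # Build a 2 x 26 contingency table: rows are sources, columns are
    # letters. Drop letters absent from both samples so no expected
    # count is zero.
    pairs = [p for p in zip(letter_counts(human_text),
                            letter_counts(chatgpt_text)) if sum(p) > 0]
    table = list(zip(*pairs))

    # Null hypothesis: both sources draw letters from the same distribution.
    chi2, p_value, dof, _ = chi2_contingency(table)
    print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.4g}")

One caveat I'm aware of: with large enough samples, even tiny frequency differences will come out "significant", so the effect size would matter more than the p-value here.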

I asked ChatGPT this question and it said:

... Overall, it is possible that ChatGPT-generated content has statistically significant differences in character frequency compared to human-generated content, but this would depend on the specific training data and settings used to train the model. Further research would be needed to determine the extent of these differences.

So maybe there's some subtle difference. I didn't find any related posts when I searched Bing for the question.


I stumbled upon this paper, which points out that scientists use more capital letters than ChatGPT does:

Scientists also use more proper nouns and/or acronyms, both of which are captured in the frequency of capital letters, and scientists use more numbers.
Desaire et al., Distinguishing academic science writing from humans or ChatGPT with over 99% accuracy using off-the-shelf machine learning tools, Cell Reports Physical Science 4, 101426, 2023.
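
For what it's worth, the specific surface features the paper mentions (capital letters, numbers) are easy to compute. Here's a rough illustration; the example sentence is made up, and this is not the paper's actual pipeline:

    # Rough illustration of the surface features Desaire et al. mention:
    # the proportion of capital letters and of digits in a text.
    def surface_features(text):
        n = len(text) or 1  # avoid division by zero on empty input
        return {
            "capital_ratio": sum(c.isupper() for c in text) / n,
            "digit_ratio": sum(c.isdigit() for c in text) / n,
        }

    # Example: proper nouns, acronyms, and numbers push both ratios up.
    print(surface_features("NASA reported 3 anomalies in the 2023 data."))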

  • See this related post on Linguistics: https://linguistics.stackexchange.com/questions/46703/linguistic-analysis-of-chatgpts-default-style-of-writing – hmltn Jun 27 '23 at 13:49
  • And fingerprinting: https://www.beren.io/2023-02-26-Fingerprinting-LLMs-with-unconditioned-distribution/ – hmltn Jun 27 '23 at 13:52
