
In his recent short pamphlet on GPT, Stephen Wolfram says

... the 'size of the network' that seems to work well is ... comparable to the 'size of the training data'. ... in this representation it seems there’s in the end rather little 'compression' of the training data; it seems on average to basically take only a bit less than one neural net weight to carry the 'information content' of a word of training data.

But this isn't true, is it, even by his own account? He says that the size of the network is 175 billion parameters, and that the size of the training data is at least "a trillion words of text" (possibly 100 times that). That works out to at most about 0.175 weights per word, which is a lot less than "one neural net weight to carry the 'information content' of a word".
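To make the arithmetic explicit, here is a back-of-envelope sketch using only the figures quoted above (the true training-set size isn't public, so these are just the numbers from the pamphlet):

```python
# Rough ratio of GPT-3 weights to training words, using the figures
# quoted above (175 billion parameters; at least a trillion words).

params = 175e9  # ~175 billion weights (biases omitted, as Wolfram does)
estimates = {
    "1 trillion words": 1e12,      # "a trillion words of text"
    "100 trillion words": 100e12,  # the "possibly 100 times that" case
}

for label, words in estimates.items():
    print(f"{label}: {params / words:.5f} weights per word")

# 1 trillion words: 0.17500 weights per word
# 100 trillion words: 0.00175 weights per word
```

Either way, that is well under one weight per word, not "a bit less than one".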

Does GPT not need to compress its training data? Is it true that the number of parameters (Wolfram omits biases) is roughly the same as the size of the training data in words?

orome
  • I thought that compression was the whole reason networks were forced to generalize at all, broadly speaking? – orome Apr 13 '23 at 17:52

0 Answers