How can CAPTCHAs be used for both user verification and ML training?

Question

CAPTCHAs (e.g. requiring a site visitor to click all the images of traffic lights in a grid of images) are often used throughout the Internet to verify that a site visitor is a human rather than a bot.

Many people claim that the providers of the CAPTCHA service (often Google) also use these CAPTCHAs to train their machine-learning algorithms by getting the site visitors to (unknowingly) label training data for supervised learning of image-recognition algorithms.

But I don't see how the (alleged) overt and covert functions of CAPTCHAs - verifying the user's humanity and labeling future training data - can be done at the same time. It seems to me that, in order to distinguish humans from bots, the images need to already be reliably labeled by the CAPTCHA provider so that the CAPTCHA knows whom to let through. But in order for users to provide useful input to future ML training runs, the images can't already be reliably labeled.

So what exactly is the claim here? Are the images in online CAPTCHAs already labeled or not? Is the claim that some CAPTCHAs are being used to block bots, and others are being used to label future training data, but no individual CAPTCHA is performing both tasks? If so, will CAPTCHAs in the second category let you through no matter which images you click? Is there any public information or estimates about the relative proportion of these two categories of CAPTCHA?

A quick search will turn up dozens of sites that claim that Google uses its CAPTCHAs to label future training data, but I haven't found any that address this point.

score 1 · Answer 1 · answered Aug 01 '23 at 10:14

What about Google giving you some room for error every now and then? Say that out of 9 pictures, 4 are trucks and Google knows that they are trucks, and of the remaining 5, it knows that 4 are not trucks.

That leaves one test image for which Google doesn't know whether it shows a truck, so Google won't penalize you whether you click it or not. However, it is going to show that image to thousands of users, and if (for example) more than 80% of them click on that unlabeled image, then it's reasonable to conclude that humans see a truck in it.
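To make the idea concrete, here is a minimal sketch of how such a scheme could work. It is not Google's actual implementation, which is not public. The names `Tile`, `passes_challenge` and `VoteTally` are hypothetical, as is the `min_votes` parameter. The choice to count votes only from users who got the known tiles right is my own assumption.

```python
from typing import Dict, List, Optional, Set, Tuple


class Tile:
    """One image in the grid (hypothetical structure)."""

    def __init__(self, image_id: str, known_label: Optional[bool]):
        self.image_id = image_id
        # True = known truck, False = known non-truck, None = not yet labeled
        self.known_label = known_label


def passes_challenge(tiles: List[Tile], clicked: Set[str]) -> bool:
    """Grade only the tiles with a known label; unlabeled tiles never count against the user."""
    for tile in tiles:
        if tile.known_label is None:
            continue  # no ground truth, so either answer is accepted
        if (tile.image_id in clicked) != tile.known_label:
            return False
    return True


class VoteTally:
    """Collects clicks on unlabeled tiles and infers a label once enough users agree."""

    def __init__(self, min_votes: int = 1000, threshold: float = 0.8):
        self.min_votes = min_votes  # assumed minimum sample size
        self.threshold = threshold  # the "more than 80%" from the example
        self.votes: Dict[str, Tuple[int, int]] = {}  # image_id -> (clicks, total)

    def record(self, tiles: List[Tile], clicked: Set[str]) -> None:
        # Assumption: ignore votes from users who failed the known tiles.
        if not passes_challenge(tiles, clicked):
            return
        for tile in tiles:
            if tile.known_label is None:
                clicks, total = self.votes.get(tile.image_id, (0, 0))
                self.votes[tile.image_id] = (clicks + int(tile.image_id in clicked), total + 1)

    def inferred_label(self, image_id: str) -> Optional[bool]:
        """Return True once enough users clicked the image, otherwise None (undecided)."""
        clicks, total = self.votes.get(image_id, (0, 0))
        if total < self.min_votes:
            return None
        return True if clicks / total > self.threshold else None


# The 9-picture grid from the example: 4 known trucks, 4 known non-trucks, 1 unknown.
grid = (
    [Tile(f"truck{i}", True) for i in range(4)]
    + [Tile(f"other{i}", False) for i in range(4)]
    + [Tile("mystery", None)]
)
```

With this setup, a single CAPTCHA does both jobs: the 8 known tiles decide whether you get through, and your click (or non-click) on the ninth tile is just one vote toward its future label.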
