CAPTCHAs (e.g. requiring a site visitor to click all the images of traffic lights in a grid of images) are often used throughout the Internet to verify that a site visitor is a human rather than a bot.
Many people claim that the providers of the CAPTCHA service (often Google) also use these CAPTCHAs to train their machine-learning algorithms by getting the site visitors to (unknowingly) label training data for supervised learning of image-recognition algorithms.
But I don't see how the (alleged) overt and covert functions of CAPTCHAs - verifying the user's humanity and labeling future training data - can be done at the same time. It seems to me that in order to distinguish humans from bots, the images need to already be reliably labeled by the CAPTCHA provider, so that the CAPTCHA knows whom to let through. But in order for users to provide useful input to future ML training runs, the images can't already be reliably labeled.
So what exactly is the claim here? Are the images in online CAPTCHAs already labeled or not? Is the claim that some CAPTCHAs are being used to block bots, and others are being used to label future training data, but no individual CAPTCHA is performing both tasks? If so, will CAPTCHAs in the second category let you through no matter which images you click? Are there any public figures or estimates for the relative proportion of these two categories of CAPTCHA?
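To make the two-category scenario I'm asking about concrete, here is a minimal Python sketch of how I imagine it would have to work. This is purely hypothetical: `Challenge`, `known_answer`, and `evaluate` are names I made up for illustration, and I am not claiming that any real CAPTCHA service is implemented this way.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Tuple


@dataclass
class Challenge:
    # Hypothetical: the set of tile indices the provider already trusts as
    # containing the target object, or None if the grid is still unlabeled.
    known_answer: Optional[FrozenSet[int]]


def evaluate(
    challenge: Challenge, clicked: Iterable[int]
) -> Tuple[bool, Optional[FrozenSet[int]]]:
    """Return (passed, collected_label) for one hypothetical challenge."""
    clicked_tiles = frozenset(clicked)
    if challenge.known_answer is not None:
        # Category 1: the answer is already known, so the challenge can
        # verify the visitor, but nothing new is learned from the clicks.
        return clicked_tiles == challenge.known_answer, None
    # Category 2: the answer is unknown, so the clicks can be recorded as a
    # label, but there is nothing to check them against and anyone passes.
    return True, clicked_tiles
```

In this sketch each challenge does exactly one of the two jobs, which is the dilemma I described above: `evaluate(Challenge(None), clicks)` passes for any `clicks` at all.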
A quick search will turn up dozens of sites that claim that Google uses its CAPTCHAs to label future training data, but I haven't found any that address this point.