Unfortunately, the answer here is "it depends". People have taken different approaches to this problem, and I'll describe a few here, though none of them is the "right" answer.
Labeling
When generating benchmark datasets, we really do run into this problem. Most of the time the labeling is simply done to the best of the human's ability. Sometimes ambiguous or difficult cases are separated out and cleaned, but usually labelers are given a set of concrete guidelines for deciding whether or not something is a cat. When a human is unsure, that data is typically thrown out or moved over to a "difficult" pile. Unfortunately, that difficult pile is rarely published alongside the dataset. Even so, if you look at most public datasets, ambiguous cases still slip through despite significant cleaning.
Bayesian Deep Learning
One common theme I've seen is that people add proper probabilistic uncertainties to their models. This is different from the output of a softmax at the end of an object detection network. The output of a softmax is just some number $\in [0, 1]$ that the classification head of the model regresses. For example, in SSD the softmax is just the classification score for a specific anchor box. There is no real "certainty" information associated with it, and in most standalone models (and without some pretty hard assumptions) it doesn't have any rigorous probabilistic meaning.
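To make that concrete, here's a quick NumPy sketch (the logits and class names are made up) of what a softmax actually computes: it just squashes raw scores into numbers in $[0, 1]$ that sum to one, which is not the same thing as a calibrated confidence.

```python
import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability, then normalize.
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

# Hypothetical raw scores for one anchor box over ["cat", "dog", "background"].
logits = np.array([2.1, 0.3, -1.0])
print(softmax(logits))  # ~[0.83, 0.14, 0.04] -- sums to 1, but says nothing about certainty
```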
So how would we add a probability? How can we say that "there is an 80% chance that this image is a cat"? The paper "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?" does a pretty good job of looking at this problem.
Basically, what they do is build Bayesian models that explicitly output two types of uncertainty.
Aleatoric uncertainty captures noise inherent in the observations. On the other hand, epistemic uncertainty accounts for uncertainty in the model – uncertainty which can be explained away given enough data.
You can go through the paper for a better understanding of what's going on, but, basically, they fit models that regress both the uncertainty associated with the model (epistemic) and the uncertainty associated with the data (aleatoric).
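To give a rough idea of what that looks like in code, here's a small PyTorch sketch of the general recipe (the layer sizes and dropout rate are my own toy choices, not the architecture from the paper): the network regresses a mean and a log-variance, the loss lets it explain away noisy observations (aleatoric), and Monte Carlo dropout at test time gives a spread over predictions (epistemic).

```python
import torch
import torch.nn as nn

class HeteroscedasticHead(nn.Module):
    """Toy regression model that predicts both a value and its aleatoric noise."""
    def __init__(self, in_dim=16, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p=0.2),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p=0.2),
        )
        self.mean = nn.Linear(hidden, 1)      # predicted value
        self.log_var = nn.Linear(hidden, 1)   # predicted aleatoric variance (log scale)

    def forward(self, x):
        h = self.body(x)
        return self.mean(h), self.log_var(h)

def heteroscedastic_loss(y_pred, log_var, y_true):
    # Gaussian negative log-likelihood with a learned, input-dependent variance:
    # the model can "explain away" noisy labels by predicting a large variance.
    return (0.5 * torch.exp(-log_var) * (y_true - y_pred) ** 2 + 0.5 * log_var).mean()

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=20):
    model.train()  # keep dropout active so each pass samples a different sub-network
    means, log_vars = zip(*(model(x) for _ in range(n_samples)))
    means = torch.stack(means)
    epistemic = means.var(dim=0)                               # spread across dropout samples
    aleatoric = torch.exp(torch.stack(log_vars)).mean(dim=0)   # predicted observation noise
    return means.mean(dim=0), epistemic, aleatoric

# Usage on random inputs (untrained, just to show the shapes of the outputs):
model = HeteroscedasticHead()
mean, epistemic, aleatoric = mc_dropout_predict(model, torch.randn(4, 16))
```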
Zero-Shot Learning
What about the case where you have a well-defined object, but you've never seen it before? Let's say you've seen plenty of horses but never a zebra. As a human, you would look at it for a while and basically conclude that it's a horse with black and white stripes. There is an entire field of machine learning, zero-shot learning, dedicated to this problem. I'm not an expert in it personally, but there are plenty of resources online if you're interested.
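Just to give a flavor of one classic formulation, here's a toy attribute-based sketch (the attribute vectors and the stubbed-out attribute predictor are purely illustrative): a model trained only on seen classes predicts visual attributes, and an unseen class like "zebra" can still be picked by matching those attributes.

```python
import numpy as np

# Invented attribute vectors: [has_stripes, has_mane, domesticated]
class_attributes = {
    "horse": np.array([0.0, 1.0, 1.0]),
    "zebra": np.array([1.0, 1.0, 0.0]),   # never seen in the training images
}

def predict_attributes(image):
    # Stand-in for a model trained on seen classes (e.g. horses) that predicts
    # visual attributes rather than class labels directly.
    return np.array([0.9, 0.8, 0.2])  # "striped, has a mane, probably wild"

def zero_shot_classify(image):
    attrs = predict_attributes(image)
    # Pick the class whose attribute vector is closest to the predicted attributes,
    # even if we never had labeled images of that class.
    return min(class_attributes, key=lambda c: np.linalg.norm(class_attributes[c] - attrs))

print(zero_shot_classify(None))  # -> "zebra"
```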
A Practical Note
In industry, we usually try to scope the problem as best we can so that we don't have to deal with this as much. There are times, though, when it's clearly unavoidable. What I've seen is that if an object isn't clearly detectable, the algorithm might fall back to just saying that there's "something" there. Consider self-driving cars: it's good to detect whether there are pedestrians in the road, but even if you don't know that something is a pedestrian, it's still useful to know that there's something in the road. For this, you can fall back to unsupervised methods to help distinguish objects. From a labeling perspective, you could imagine an ontology of objects for this purpose. The root node of this ontology would be just "something in the road", branching off into "car", "pedestrian", or "bike", for example. If a labeler is not sure whether something is a pedestrian, but it's definitely something in the road that shouldn't be hit, then it would be labeled as "something in the road". Again though, this is highly dependent on the application.
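As a sketch of what that kind of fallback ontology could look like in practice (the class names and structure here are just an example, not a real labeling spec):

```python
# Toy label ontology: a parent class that child classes fall back to when unsure.
ONTOLOGY = {
    "something_in_the_road": ["car", "pedestrian", "bike"],
}

# Invert the ontology so we can look up a label's parent quickly.
PARENT = {child: parent for parent, children in ONTOLOGY.items() for child in children}

def resolve_label(label, labeler_is_sure):
    """Fall back to the parent class when the labeler (or model) is unsure."""
    if labeler_is_sure or label not in PARENT:
        return label
    return PARENT[label]

print(resolve_label("pedestrian", labeler_is_sure=False))  # -> "something_in_the_road"
print(resolve_label("pedestrian", labeler_is_sure=True))   # -> "pedestrian"
```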