What should we do when we have equal observations with different labels?

Question

Suppose we have a labeled data set with columns $A$, $B$, and $C$ and a binary outcome variable $X$. Suppose we have rows as follows:

 row  A B C X
  1   1 2 3 1
  2   4 2 3 0
  3   6 5 1 1
  4   1 2 3 0

Should we throw away either row 1 or row 4 because they have different values of the outcome variable X? Or keep both of them?

I think your question is quite naive. If you can share your motivation for the question, then the question would attract more apt answers. — naive, Aug 30 '19 at 12:37

score 4 · Answer 1 · answered Aug 24 '19 at 00:13

The problem you are describing looks like a modified XOR problem. You can't throw away the rows with a label of 1, because then the model won't be able to learn this class.

CaucM


score 1 · Answer 2 · answered Aug 24 '19 at 15:37

This is perfectly acceptable in a stochastic environment. Generally your loss is the negative log-likelihood $-\log p(Y|X)$, or equivalently $-\sum_i \log p(y_i|x_i)$. Up to a constant factor of $1/n$, this is the empirical version of $-\mathbb{E}\log p(y|x)$. In other words, for the two conflicting rows, which share the same features $x_0$, you are minimizing:

$$ \begin{align*} L &= -\log p(1|x_0) - \log p(0|x_0) \\ &= -\log [p(1|x_0) \cdot p(0|x_0)] \\ &= -\log [p(1|x_0) \cdot (1 - p(1|x_0))] \end{align*} $$
or, since log is monotonically increasing, equivalently minimizing
$$ \hat L = -p(1|x_0) \cdot (1 - p(1|x_0)) $$
Setting the derivative with respect to $p(1|x_0)$ to zero gives $2p(1|x_0) - 1 = 0$, so the optimal result we want the system to learn is
$$ p(1|x_0) = 0.5 $$

Note that if you had more evidence, the result would just be that you want it to learn that the label is $1$ with probability $\mathbb{E}[y \mid x]$, i.e. the fraction of rows with features $x$ whose label is $1$.
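
As a rough numerical check of this argument, here is a minimal sketch that grid-searches the constant prediction $p(1|x_0)$ minimizing the negative log-likelihood. The function names `nll` and `best_p` and the second label list are my own assumptions for illustration, not part of the question.

```python
import math

def nll(p, labels):
    """Negative log-likelihood of binary labels under a constant prediction p = p(1|x_0)."""
    return -sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in labels)

def best_p(labels, steps=999):
    """Grid-search p over 0.001, 0.002, ..., 0.999 for the minimizer of the NLL."""
    grid = [(k + 1) / (steps + 1) for k in range(steps)]
    return min(grid, key=lambda p: nll(p, labels))

# Rows 1 and 4 from the question: same features, labels 1 and 0.
p_conflict = best_p([1, 0])

# Hypothetical extra evidence (not in the question): three 1s and one 0.
p_more = best_p([1, 1, 1, 0])

print(p_conflict, p_more)
```

With the two conflicting rows the minimizer lands on 0.5, and with the hypothetical three-to-one split it lands on the fraction of 1-labels, which is what the expectation above says.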

mshlis


So throw away or keep both rows? – NebulousReveal Aug 26 '19 at 03:03
Keep both. Do **not** throw away data unless you have good reason to. In this case you want your model to output 0.5 (not 0 or 1). – mshlis Aug 26 '19 at 03:04
What happens if the outcome variable for both rows (or $n$ rows) is the same (i.e., we have duplicate rows)? Should we throw one of them out? Or keep them both? Does it really matter? – NebulousReveal Aug 26 '19 at 03:09
@Prime In that case it depends. If this is due to the sampling scheme, then do not, because usually that will mean that data point is twice as important. On the other hand, if someone accidentally copied and pasted a row, then yes, delete it, because it will be giving additional importance to a point that doesn't deserve it. – mshlis Aug 26 '19 at 11:08
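
To make the two options for exact duplicates concrete, here is a small sketch using only the standard library. The fifth row is a hypothetical exact duplicate that I added for illustration; it is not in the question's data. Which option is appropriate still depends on how the duplicates arose, as discussed in the comments.

```python
from collections import Counter

# (features, label) pairs; the first four rows are from the question.
rows = [
    ((1, 2, 3), 1),
    ((4, 2, 3), 0),
    ((6, 5, 1), 1),
    ((1, 2, 3), 0),
    ((4, 2, 3), 0),  # hypothetical exact duplicate of row 2, added for illustration
]

counts = Counter(rows)

# Option A: duplicates come from the sampling scheme.
# Keep them, here expressed as one row plus a weight (its multiplicity).
weighted = [(x, y, n) for (x, y), n in counts.items()]

# Option B: duplicates are accidental copies.
# Keep one copy of each distinct (features, label) pair.
deduplicated = list(counts)

print(weighted)
print(deduplicated)
```

Note that neither option touches rows 1 and 4: they have the same features but different labels, so they are not exact duplicates and both survive.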

score 0 · Answer 3 · answered Sep 04 '19 at 01:25

I might consider two models (one that throws away row 1 and one that throws away row 4), plus one more that keeps both, and see which generalises better to a test set.
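
A minimal sketch of that comparison is below. Everything beyond the four training rows is an assumption made for illustration: the held-out rows are invented, and the "model" is a toy smoothed frequency table standing in for whatever classifier you actually use. The scores depend entirely on the made-up test rows, so treat this as a template for the procedure, not as evidence for any option.

```python
import math
from collections import defaultdict

# Training rows from the question as (features, label).
train = [((1, 2, 3), 1), ((4, 2, 3), 0), ((6, 5, 1), 1), ((1, 2, 3), 0)]

# Hypothetical held-out rows, invented purely for illustration.
test = [((1, 2, 3), 1), ((1, 2, 3), 0), ((4, 2, 3), 0)]

def drop_row(rows, row_number):
    """Return the rows without the given 1-based row number."""
    return [r for i, r in enumerate(rows, start=1) if i != row_number]

def fit_frequencies(rows, smoothing=1.0):
    """Toy stand-in model: smoothed fraction of 1-labels per feature vector."""
    ones, total = defaultdict(float), defaultdict(float)
    for x, y in rows:
        ones[x] += y
        total[x] += 1
    return {x: (ones[x] + smoothing) / (total[x] + 2 * smoothing) for x in total}

def log_loss(model, rows, default=0.5):
    """Mean negative log-likelihood; unseen feature vectors get the default."""
    loss = 0.0
    for x, y in rows:
        p = model.get(x, default)
        loss -= math.log(p if y == 1 else 1.0 - p)
    return loss / len(rows)

variants = {
    "drop row 1": drop_row(train, 1),
    "drop row 4": drop_row(train, 4),
    "keep both": list(train),
}
scores = {name: log_loss(fit_frequencies(rows), test) for name, rows in variants.items()}
print(scores)
```

In practice you would replace `fit_frequencies` with your real model and `test` with a real held-out set, ideally large enough that the comparison is not dominated by noise.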

joek47

