5

Suppose we have a labeled data set with columns $A$, $B$, and $C$ and a binary outcome variable $X$. Suppose we have rows as follows:

 row  A B C X
  1   1 2 3 1
  2   4 2 3 0
  3   6 5 1 1
  4   1 2 3 0

Rows 1 and 4 have identical values of $A$, $B$, and $C$ but different values of the outcome variable $X$. Should we throw away one of them, or keep both?

naive
  • 699
  • 6
  • 13
guest_guy
  • 51
  • 1
  • I think your question is quite naive. If you can share your motivation for the question, then the question would attract more apt answers. – naive Aug 30 '19 at 12:37

3 Answers

4

The problem you are portraying looks like a modified XOR problem. You can't throw away the rows with a label of 1, because then the model won't be able to learn this class.

CaucM
  • 141
  • 2
1

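This is perfectly acceptable in a stochastic environment. Generally the loss to minimize is $-\log p(Y|X)$, or equivalently $-\sum_i \log p(y_i|x_i)$ (here $x$ denotes the features and $y$ the label, unlike in the question). Up to a constant factor, this is the same as minimizing the empirical expectation $-\mathbb{E}_i \log p(y_i|x_i)$. In other words, writing $x_0$ for the feature values shared by the two conflicting rows, you are minimizing in this case: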

$$ \begin{align*} L &= -\log p(1|x_0) - \log p(0|x_0) \\ &= -\log [p(1|x_0) \cdot p(0|x_0)] \\ &= -\log [p(1|x_0) \cdot (1 - p(1|x_0))] \end{align*} $$
or, since $\log$ is monotonically increasing, equivalently minimizing
$$ \hat L = -p(1|x_0) \cdot (1 - p(1|x_0)) $$ After some basic calculus (set the derivative with respect to $p(1|x_0)$ to zero), we see the optimal result we want the system to learn is that
$$ p(1|x_0) = .5$$

Note that if you had more evidence, the result would just be that you want it to learn that the label is $1$ with probability $\mathbb{E}[y \mid x]$, i.e. the fraction of rows with those feature values whose label is $1$.
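
To make the derivation above concrete, here is a minimal numerical sketch. It assumes the model outputs a single constant probability $p = p(1|x_0)$ for the shared feature values and simply grid-searches for the $p$ that minimizes the negative log-likelihood. The helper names `nll` and `best_p` and the `more_evidence` label list are hypothetical, made up for illustration only.

```python
import math

def nll(p, labels):
    """Negative log-likelihood of binary labels under a constant prediction p = p(1|x_0)."""
    return -sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in labels)

def best_p(labels, steps=999):
    """Grid-search p over 0.001, 0.002, ..., 0.999 for the value minimizing the NLL."""
    grid = [(k + 1) / (steps + 1) for k in range(steps)]
    return min(grid, key=lambda p: nll(p, labels))

# Rows 1 and 4 from the question: same features, labels 1 and 0.
conflicting = [1, 0]
# Hypothetical extra evidence: three rows labeled 1 and one labeled 0 for the same features.
more_evidence = [1, 1, 1, 0]

p_two = best_p(conflicting)     # minimizer for the two conflicting rows
p_four = best_p(more_evidence)  # minimizer when 3 of 4 labels are 1

print(p_two, p_four)
```

Under these assumptions the minimizer is the fraction of 1 labels among the rows sharing those features, which is why keeping both rows pushes the model towards .5 rather than 0 or 1.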

mshlis
  • 2,349
  • 7
  • 23
  • So throw away or keep both rows? – NebulousReveal Aug 26 '19 at 03:03
  • keep both. Do **not** throw away data unless you have good reason to. In this case you want your model to output .5 (not 0 or 1) – mshlis Aug 26 '19 at 03:04
  • What happens if the outcome variable for both rows (or $n$ rows) is the same (i.e. we have duplicate rows)? Should we throw one of them out? Or keep them both? Does it really matter? – NebulousReveal Aug 26 '19 at 03:09
  • @Prime In that case it depends. If this is due to the sampling scheme, then do not, because usually that'll mean that data point is twice as important. On the other hand, if someone accidentally copied and pasted a row, then yes, delete it, because it'll be giving additional importance to a point that doesn't deserve it – mshlis Aug 26 '19 at 11:08
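
The two cases discussed in the comments (same features with conflicting labels versus exact duplicate rows) can be told apart mechanically before deciding what to do with them. The following is only a sketch using the four rows from the question; the variable names are hypothetical, and whether a duplicate is a sampling artifact or a copy-paste mistake still has to be judged by hand.

```python
from collections import defaultdict

# Rows from the question as (A, B, C, X) tuples.
rows = [(1, 2, 3, 1), (4, 2, 3, 0), (6, 5, 1, 1), (1, 2, 3, 0)]

# Group the outcome labels by their feature tuple (A, B, C).
labels_by_features = defaultdict(list)
for *features, label in rows:
    labels_by_features[tuple(features)].append(label)

# Same features, different outcomes: keep these, the model should learn a probability.
conflicting = {f: ys for f, ys in labels_by_features.items() if len(set(ys)) > 1}

# Same features, same outcome: exact duplicates, to be inspected manually.
duplicated = {
    f: ys for f, ys in labels_by_features.items()
    if len(ys) > 1 and len(set(ys)) == 1
}

print(conflicting)
print(duplicated)
```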
0

I might consider two models (one that throws away row 1 and one that throws away row 4), and one more that keeps both, and see which generalises better to the test set.

joek47
  • 11
  • 1