2

Assume we have a large $m \times n$ input dataset with an $m \times 1$ output vector. It is a binary classification problem: each output is either $1$ or $0$.

Now, the problem is that almost all elements of the output vector are $0$s, with only a few $1$s (i.e. it is a sparse vector). If the neural network "learned" to always output $0$, it would achieve high accuracy, but I am specifically interested in learning when the $1$s occur.

One possible approach I thought of is to write a custom loss function that gives more weight to the $1$s, but I'm not sure whether this would be a good solution.

What kind of strategy can be applied to detect such outliers?

nbro
Marvin
    I like this discussion about accuracy as a scoring rule: https://stats.stackexchange.com/q/312780/247274. – Dave Oct 28 '20 at 17:08

2 Answers

1

From your description, it seems you want your algorithm to classify both the 1s and the 0s with high accuracy. To increase the number of 1s to a level comparable with the 0s, you could generate new examples of the 1 class by tweaking some of their features or adding random noise (i.e. oversample/augment the minority class).
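
As a minimal sketch of what such minority-class augmentation could look like (assuming a NumPy feature matrix `X` and binary label vector `y`; the function name, noise level, and target ratio are just placeholders to be tuned):

```python
import numpy as np

def oversample_minority(X, y, target_ratio=1.0, noise_std=0.01, seed=0):
    """Duplicate minority-class (label 1) rows, adding small Gaussian noise,
    until the number of 1s is roughly target_ratio times the number of 0s."""
    rng = np.random.default_rng(seed)
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    n_new = int(target_ratio * len(neg_idx)) - len(pos_idx)
    if n_new <= 0 or len(pos_idx) == 0:
        return X, y  # already balanced enough, or nothing to oversample
    picks = rng.choice(pos_idx, size=n_new, replace=True)
    X_new = X[picks] + rng.normal(0.0, noise_std, size=(n_new, X.shape[1]))
    y_new = np.ones(n_new, dtype=y.dtype)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])
```

Libraries such as imbalanced-learn offer more principled variants of the same idea (e.g. SMOTE), which may be preferable to naive duplication with noise.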

If you don't care about classifying the 0s and only care about the 1s (which doesn't seem to be exactly what you want, but putting it out there), you can use a surrogate loss function that assigns more weight to the 1s than to the 0s (e.g. a weight of 1000 for 1s and 1 for 0s).
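
For illustration only, here is one way such a class-weighted loss could be expressed in PyTorch, using the `pos_weight` argument of `BCEWithLogitsLoss` to scale the loss contribution of the positive class; the 1000:1 ratio just mirrors the example above and would need tuning on validation data:

```python
import torch
import torch.nn as nn

# pos_weight multiplies the loss term of the positive (label 1) examples,
# so the rare 1s contribute to the gradient comparably to the many 0s.
pos_weight = torch.tensor([1000.0])  # example ratio only; tune on validation data
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 1)   # raw model outputs (before the sigmoid)
targets = torch.zeros(8, 1)
targets[-1, 0] = 1.0         # one rare positive among many negatives
loss = criterion(logits, targets)
```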

0

As described in this post, this is known as the "unbalanced dataset" (class imbalance) problem, and it can be approached in different ways. With supervised learning, augmentation approaches (such as oversampling the minority class) can help. With unsupervised approaches, you need a proper distance measure to detect the outliers.
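
As one illustration of the unsupervised route (not part of the original answer; just a sketch using scikit-learn's distance-based Local Outlier Factor on synthetic data):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))  # mostly "normal" points
X[:5] += 6.0                    # a handful of far-away rows playing the rare cases

# LOF compares each point's local density with that of its neighbours;
# fit_predict returns -1 for points flagged as outliers and 1 for inliers.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)
print(np.where(labels == -1)[0])  # indices of the detected outliers
```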

Marvin