Sorry if this is too basic a question; I'm just a beginner.

I have a data set with information about companies. There are two kinds of features: financial (revenue and so on) and general information (such as the number of employees and the date of registration).

I have to predict the probability of default. The data has gaps: about half of the companies have no financial data at all, while the general features are 100% filled.

What is the best practice for such a situation?

It would be great if you could give some example links to read.

nbro
Denis Ka
  • This may help: https://stats.stackexchange.com/questions/98953/why-doesnt-random-forest-handle-missing-values-in-predictors – caveman Sep 13 '20 at 02:31

1 Answer

You should look into "missing values". This is an entire research field in itself.

First, you need to identify the type of missing values:

  1. They can be missing purely at random.
  2. Whether they are missing or not is itself a useful feature, and should be treated as a class of its own.

(Those two are the best-case scenarios.)

  3. Whether they are missing or not depends on the underlying (unknown) value. For example, a thermometer might fail occasionally if the temperatures get too high. In your case, certain types of companies might be less likely to share their information.
  4. Information might be missing specifically to mislead you, the data analyst. This is the worst possible scenario, and there is not much you can do.
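
A quick way to get a first impression of which case you are in is to compare the rows with and without financial data. The snippet below is only a rough sketch using pandas; the column names (`employees`, `revenue`, `default`) and the toy values are made up for illustration and are not taken from your data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names and values are invented for illustration.
df = pd.DataFrame({
    "employees": [5, 12, 40, 7, 150, 3, 60, 9],
    "revenue":   [np.nan, 1.2, 8.5, np.nan, 30.0, np.nan, 11.0, np.nan],
    "default":   [1, 0, 0, 1, 0, 1, 0, 0],
})

# Indicator column: 1 where the financial field is absent, 0 otherwise.
df["revenue_missing"] = df["revenue"].isna().astype(int)

# Default rate among companies with (0) and without (1) financial data.
default_rate = df.groupby("revenue_missing")["default"].mean()

# Does a fully observed feature differ between the two groups?
employees_by_group = df.groupby("revenue_missing")["employees"].median()

print(default_rate)
print(employees_by_group)
```

If the two groups look alike on the observed features and on the target, that is consistent with values missing at random. A clear difference suggests that missingness itself carries information. Neither check can rule out the third case, because the hidden values are by definition not available for comparison.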

So, what do you do about it? A few typical options:

  1. Throw out all the rows with missing data: we do not have enough information about these companies.
  2. Throw out all the columns with missing data: this field is not reliably measurable and we shouldn't use it.
  3. Try to guess (impute) the missing values. This works best when the amount of missing data is small. You can train a predictive model on the non-missing data, fill in the median for that type of row, or copy the value from the "closest" matching row. This can be dangerous: whatever model you train afterwards treats the guessed values as if they were real observations.
  4. Some algorithms are OK with missing data. Check the documentation for your models and algorithms to see how they deal with missing values.
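
As a rough illustration of the first three options, here is a small pandas sketch. The data frame, the `sector` column used as "type of row", and all values are hypothetical; the indicator column is an optional extra that keeps track of which values were guessed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "sector" stands in for "type of row".
X = pd.DataFrame({
    "sector":    ["retail", "retail", "retail", "tech", "tech", "tech"],
    "employees": [5, 12, 40, 7, 150, 60],
    "revenue":   [np.nan, 1.0, 3.0, np.nan, 30.0, 10.0],
})

# Option 1: throw out every row that has a missing value.
rows_kept = X.dropna(axis=0)

# Option 2: throw out every column that has a missing value.
cols_kept = X.dropna(axis=1)

# Option 3: fill each gap with the median revenue of the same sector,
# and record which values were imputed in a separate 0/1 column.
imputed = X.copy()
imputed["revenue_missing"] = imputed["revenue"].isna().astype(int)
sector_median = imputed.groupby("sector")["revenue"].transform("median")
imputed["revenue"] = imputed["revenue"].fillna(sector_median)

print(rows_kept.shape, cols_kept.shape)
print(imputed)
```

For option 4, one example is scikit-learn's `HistGradientBoostingClassifier`, which is documented as accepting `NaN` inputs directly, so no imputation step is needed before fitting.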
  • Thanks for the answer! I got this data from a competition, so it is possible that the missing values were selected randomly (but on purpose) by the organizers. It looks like a real situation: some companies are already clients, and some are new, so we don't know much about them. I think that in my case throwing out all the columns is bad: we'd lose too much data. And throwing out all these rows is also not an option, because the test data has the same structure: half of the records have no financial info. – Denis Ka Sep 12 '20 at 23:20
  • I think it would be good to find out whether the fact that the financial data is missing is itself a feature. But can I make more than 2 classes from the data as presented, and how? How can I categorise several features into a new feature? %) – Denis Ka Sep 12 '20 at 23:20
  • I'd introduce a new boolean feature: "is_missing". Then you can see if it correlates with any of the known features. To see if it can be predicted, take the part of the data set where it is available, and train a predictor on that particular feature. Depending on how well it does (validation score) you can try to predict the missing ones. – Robby Goetschalckx Sep 13 '20 at 01:15
  • Essentially you're treating the possibly missing value as the target of a prediction problem. – Robby Goetschalckx Sep 13 '20 at 01:16
  • Oh! I've got it! Thanks again – Denis Ka Sep 13 '20 at 19:32