
Are there (complex) tabular datasets where deep neural networks (e.g. more than 3 layers) outperform traditional methods such as XGBoost by a large margin?

I'd prefer tabular datasets rather than image datasets, since most image datasets are either so simple that even XGBoost can perform well (e.g. MNIST) or so difficult for XGBoost that its performance is far too low (e.g. almost any dataset more complex than CIFAR-10; please correct me if I'm wrong).

nbro
Clara
  • I just remembered a good example... http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it-merck-1st-place-interview/ but that was in 2012 and didn't outperform "by a large margin" and was before the xgboost mania. – user1269942 Dec 12 '19 at 20:27
  • I'll try out the dataset used in this post: it's great that the columns are uncleaned, since I'd expect the advantage of DNN to be better pronounced when the features are less structured. Thank you @user1269942! – Clara Dec 14 '19 at 20:54

1 Answer


In my opinion, no. Images can be interpreted as tabular datasets as well, where certain columns represent the RGB values of individual pixels. If you want to use neural networks, opt for image datasets with a large sample size. Neural networks generally require large sample sizes to perform well, and high-dimensional inputs in order not to be outperformed by boosting.
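The "columnified" layout described above can be sketched as follows. This is only an illustration of the idea, assuming a NumPy array of images; the function name is mine, not from any library:

```python
import numpy as np

def images_to_table(images):
    """Flatten a batch of images (n, h, w, c) into a 2-D table (n, h*w*c),
    so each row is one sample and each column one pixel-channel value."""
    n = images.shape[0]
    return images.reshape(n, -1)

# Example: a batch of 4 tiny 2x2 RGB "images"
batch = np.arange(4 * 2 * 2 * 3).reshape(4, 2, 2, 3)
table = images_to_table(batch)
print(table.shape)  # (4, 12): 4 rows, one column per pixel-channel
```

A table like this can be fed to XGBoost or an MLP directly, though, as the comments below point out, flattening discards the spatial neighborhood structure that convolutional networks exploit.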

  • That makes a lot of sense, thank you Matthew! May I ask if you know any complex / large tabular dataset (e.g. perhaps with low-level features as columns) that could be difficult for boosting? Thank you! – Clara Dec 11 '19 at 15:59
  • Unfortunately, I don't. But you can always make one for yourself, out of image data sets, – MatthewTuby Dec 12 '19 at 07:25
  • I can confirm this anecdotally. I have tried many times to apply NNs to tabular and/or non-complex data and I cannot recall a NN ever doing better than more "traditional" models. Ha ha, of course, we all know that the plural of anecdote is *not* data! So exceptions may exist. I do, however, disagree with trying to represent an image as column data. – user1269942 Dec 12 '19 at 20:18
  • Disagree on what ground? Possibility or usefulness? – MatthewTuby Dec 13 '19 at 07:38
  • @MatthewTuby on the grounds that it will *likely* make the typical convolution operators ineffective, as "columnified" image data will likely lose the property that pixels nearby in the actual image are nearby in the new representation (column data), as they are in the standard RGB layered-matrix representation. A pedant could assert that the normal matrix representation is constructed from columns... but that is not what is being implied here. One example of image data in column format, used in GIS, is the .csv format, which encodes a single-channel image into "x,y,value" rows. – user1269942 Dec 15 '19 at 21:34
  • @MatthewTuby a quick note to say that current image modeling success is highly dependent on convolution operations. – user1269942 Dec 15 '19 at 21:36
  • @user1269942 Do we require convolution operator on tabular data set? Since the question originally focused on tabular data, I somehow implied that the "deep neural network" term refers to multilayer perceptrons and not convolutional networks. – MatthewTuby Dec 16 '19 at 14:25
  • @MatthewTuby we're going to get shut down for too many comments! "Are convolution operators required?" – no. But in some cases they work (some tabular time-series data, for example). – user1269942 Dec 17 '19 at 16:12
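The GIS-style "x,y,value" CSV layout mentioned in the comments can be sketched as follows. This is a minimal illustration, assuming a 2-D single-channel image; the function name is mine:

```python
import numpy as np

def to_xyv_rows(channel):
    """Turn a 2-D single-channel image into (x, y, value) triples,
    one row per pixel, as in the GIS .csv layout described above."""
    rows = []
    for y in range(channel.shape[0]):
        for x in range(channel.shape[1]):
            rows.append((x, y, int(channel[y, x])))
    return rows

img = np.array([[0, 255], [128, 64]])
rows = to_xyv_rows(img)
print(rows[1])  # (1, 0, 255): the pixel at x=1, y=0 has value 255
```

Note that in this row-per-pixel form, spatial adjacency survives only implicitly through the x and y columns, which is exactly why standard convolution operators cannot be applied to it directly.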