I have structured data and image data to solve a regression problem. One sample of structured data can be related to N images.
If I use only structured data, I get decent performance, but not enough to properly solve the problem. I want to use the images related to each structured sample to improve performance.
My approach was to create three neural networks: the first for the image input, the second for the structured input, and the third to combine the image and structured networks and output the final result.
The main problem is how to properly combine one sample of structured data with N images. All the images are already saved as bottleneck features from one of the Keras applications. I first combined the structured data with each corresponding image, duplicating the structured sample for each of its images, and got a very good result. But investigation showed that the validation dataset contained structured samples that were also in the training dataset, only paired with different images. So the network simply memorized the dataset (110k samples), giving great results on the leaky validation set and bad generalization on real-world data. After I fixed the split so that no structured sample appears in both the training and validation datasets, the neural net showed its real performance, which is bad.
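To make the setup concrete, here is a minimal sketch of the duplication step and of the corrected split, in plain Python. The names `build_pairs`, `split_by_sample_id`, `val_fraction`, and the toy data are hypothetical and only illustrate the idea; the real pipeline works on Keras bottleneck features and is not shown here.

```python
import random


def build_pairs(structured, images_by_id):
    """Duplicate each structured sample once per related image."""
    pairs = []
    for sample_id, features in structured.items():
        for image_features in images_by_id.get(sample_id, []):
            pairs.append((sample_id, features, image_features))
    return pairs


def split_by_sample_id(pairs, val_fraction=0.2, seed=0):
    """Split pairs so each structured sample lands in exactly one subset."""
    # Shuffle the structured sample IDs, not the individual pairs.
    sample_ids = sorted({sample_id for sample_id, _, _ in pairs})
    random.Random(seed).shuffle(sample_ids)
    n_val = max(1, int(len(sample_ids) * val_fraction))
    val_ids = set(sample_ids[:n_val])
    train = [p for p in pairs if p[0] not in val_ids]
    val = [p for p in pairs if p[0] in val_ids]
    return train, val


# Toy data (assumption): 5 structured samples, sample i has i + 1 images,
# each image represented by a short stand-in feature vector.
structured = {i: [float(i), float(i) * 2.0] for i in range(5)}
images_by_id = {i: [[0.1 * i + 0.01 * k] * 4 for k in range(i + 1)] for i in range(5)}

pairs = build_pairs(structured, images_by_id)
train_pairs, val_pairs = split_by_sample_id(pairs, val_fraction=0.4)

train_ids = {p[0] for p in train_pairs}
val_ids = {p[0] for p in val_pairs}
print(len(pairs), len(train_ids), len(val_ids), train_ids & val_ids)
```

Splitting on the structured sample ID rather than on individual (structured, image) pairs is what keeps a structured sample from showing up in validation with a different image.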
So my question is: What is the state-of-the-art approach to combining one sample of structured data with N images? The structured data and the images are, of course, logically connected. Should I train the two networks separately and then combine their outputs in a third network? Or train all three networks at once? Or maybe train a CNN on the images and then combine the CNN output with the structured data using a gradient boosting algorithm?