A more efficient way would be to create a multi-input model, with an architecture like this:
 ___________        _____________
|___Image___|      |_Other_input_|
      |                   |
  ____|____          _____|_____
 |___CNN___|        |___Dense___|
      |                   |
 _____|______        _____|______
|_Features1__|      |_Features2__|
      |___________________|
                |
            ____|____
           |__Merge__|
                |
            ____|____
           |__Dense__|
                |
            ____|_____
           |__Output__|
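The diagram above can be sketched with the Keras functional API (assuming TensorFlow is installed; the 64x64 RGB input shape, the 10-dimensional second input, and all layer sizes are illustrative assumptions, not prescriptions):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Image branch: a small CNN producing Features1
image_in = keras.Input(shape=(64, 64, 3), name="image")
x = layers.Conv2D(16, 3, activation="relu")(image_in)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
features1 = layers.Dense(32, activation="relu")(x)

# Other-input branch: a Dense layer producing Features2
other_in = keras.Input(shape=(10,), name="other")
features2 = layers.Dense(32, activation="relu")(other_in)

# Merge the two feature vectors, then Dense -> Output
merged = layers.concatenate([features1, features2])
h = layers.Dense(16, activation="relu")(merged)
output = layers.Dense(1, activation="sigmoid")(h)

model = keras.Model(inputs=[image_in, other_in], outputs=output)

# One forward pass on dummy data to check the wiring
pred = model.predict([np.zeros((2, 64, 64, 3)), np.zeros((2, 10))], verbose=0)
print(pred.shape)  # (2, 1)
```

Each branch processes its own input type, and only the learned feature vectors are merged, so the dense part of the network never has to see raw pixels.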
However, you could also combine the unstructured data with the image itself, as stated in the Quora answer:
The out-of-the-box method:
If you want to just take your CNN library and use it without much
thought, there's an easy way to do it.
Your image has “channels”: red, blue, and green channels, for example.
Just add another channel for each unstructured feature. Those channels
will just be 2D-arrays whose entries are all the same value: the value
of your outside feature.
It means more memory and more parameters though. If you have a lot of
unstructured data, this can become prohibitively expensive.
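The "extra channel" trick above is easy to sketch in plain numpy: each scalar feature becomes a constant-valued 2D plane stacked onto the image's existing channels (the image size and feature values below are made up for illustration):

```python
import numpy as np

image = np.random.rand(64, 64, 3)   # H x W x 3 (RGB)
extra_features = [0.7, 42.0]        # two hypothetical non-pixel features

# Broadcast each scalar into a full H x W plane holding that value
planes = [np.full(image.shape[:2], v) for v in extra_features]

# Concatenate the planes onto the image along the channel dimension
augmented = np.concatenate([image, np.stack(planes, axis=-1)], axis=-1)

print(augmented.shape)  # (64, 64, 5): 3 RGB channels + 2 feature channels
```

This also makes the memory cost concrete: every extra feature adds a full H x W plane, which is why many features quickly become expensive.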
The more efficient method (and still not hard):
You use one or more deconvolutional filters to bring the unstructured
data up to the size of the structured data, concatenate them along the
channel dimension, and keep going as if nothing happened.
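That upsample-and-concatenate step can be sketched in Keras with a transposed convolution (often loosely called a deconvolution); the 32x32 image size, 8-dimensional feature vector, and channel counts are illustrative assumptions:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

image_in = keras.Input(shape=(32, 32, 3), name="image")
feat_in = keras.Input(shape=(8,), name="features")

# Reshape the feature vector to a 1x1 spatial map, then use a
# transposed convolution to bring it up to the image's 32x32 size
f = layers.Reshape((1, 1, 8))(feat_in)
f = layers.Conv2DTranspose(4, kernel_size=32, strides=1)(f)  # -> 32x32x4

# Concatenate along the channel dimension and keep going as usual
merged = layers.Concatenate(axis=-1)([image_in, f])          # -> 32x32x7
out = layers.Conv2D(8, 3, padding="same", activation="relu")(merged)

model = keras.Model([image_in, feat_in], out)
y = model.predict([np.zeros((1, 32, 32, 3)), np.zeros((1, 8))], verbose=0)
print(y.shape)  # (1, 32, 32, 8)
```

Unlike the constant-channel trick, the upsampled planes here are learned, so the extra channels cost parameters in the transposed convolution rather than one full plane per raw feature.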
Source: How can l train a CNN with extra features other than the pixels? (Quora)