
At work, there is an idea to solve a problem with machine learning. I was assigned the task of looking into it, since I'm quite good at both mathematics and programming, but I'm new to machine learning.

In the problem a box would be discretized into smaller boxes (e.g. $100 \times 100 \times 100$ or even more), which I will call 'cells'. Input data would then be a boolean for each cell, and output data would be a float for each cell. Thus both input and output have dimensions of order $10^6$ to $10^9$.

Do you have any recommendations about how to do this? I guess that it should be done with a ConvNet since the output depends on relations between close cells.

I have concerns about the huge dimensions, especially as our training data is not at all that large, but contains at most a few thousand samples.
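To make the setup above concrete: a single 3D convolution already has the two properties that matter for this problem. Each output cell depends only on a local neighbourhood of input cells, and the output grid has the same shape as the input, which is why a fully-convolutional network fits the grid-in, grid-out formulation. A minimal NumPy sketch (a fixed smoothing kernel on a toy grid, not a trained network; the names and grid size are just for illustration):

```python
import numpy as np

def conv3d_same(grid, kernel):
    """Naive 3D convolution with 'same' zero-padding of a boolean
    occupancy grid. Each output cell is a weighted sum over a local
    k x k x k neighbourhood of input cells."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(grid.astype(float), pad)
    out = np.zeros(grid.shape)
    for i in range(grid.shape[0]):
        for j in range(grid.shape[1]):
            for l in range(grid.shape[2]):
                out[i, j, l] = np.sum(padded[i:i+k, j:j+k, l:l+k] * kernel)
    return out

# Tiny 8x8x8 tunnel with a 2x2x2 'model' in the centre.
occupancy = np.zeros((8, 8, 8), dtype=bool)
occupancy[3:5, 3:5, 3:5] = True

kernel = np.ones((3, 3, 3)) / 27.0   # plain averaging kernel
refinement = conv3d_same(occupancy, kernel)

print(refinement.shape)   # (8, 8, 8): same grid shape as the input
```

A real ConvNet would stack such layers with learned kernels (e.g. `torch.nn.Conv3d` in PyTorch), but the locality and shape-preservation shown here are the reasons the architecture is a natural first choice.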


Motivation

It can be a bit sensitive to reveal information from a company, but since this is a common problem in computational fluid dynamics (CFD) and we already have a good solution, it might not be that sensitive.

The big boxes are virtual wind tunnels, the small boxes ('cells' or voxels) are a discretization of the tunnel. The input tells where a model is located and the output would give information about where the cells of a volume mesh need to be smaller.

md2perpe
    Hello. Welcome to Artificial Intelligence Stack Exchange. It might be a good idea to describe your problem at a higher level. What are these boxes? What do the cells represent? What do the floating-point values associated with each cell (I suppose these are your "labels") represent? – nbro Mar 25 '21 at 10:49
  • Thanks, @nbro. It can be a bit sensitive to reveal information from a company, but since this is a common problem in [computational fluid dynamics](https://en.wikipedia.org/wiki/Computational_fluid_dynamics) and we already have a good solution, it might not be that sensitive. The big boxes are virtual wind tunnels, the small boxes ('cells' or [voxels](https://en.wikipedia.org/wiki/Voxel)) are a discretization of the tunnel. The input tells where a model is located and the output would give information about where the cells of a [volume mesh](https://imgur.com/a/iB6ypZj) need to be smaller. – md2perpe Mar 25 '21 at 12:05
  • @md2perpe if the model exists, is it confined exclusively to one cell? If not, maybe you could do some generalisation of the inputs/outputs based on whether the model exists within an area of space. It's approximately what the CNN attempts to do (smoothing + sub-sampling), but by doing it manually and imposing your external knowledge of the system, it could help reduce dimensionality exponentially while also speeding up learning – quest ions Mar 28 '21 at 12:49
  • @questions. The model will exist; without it the task is meaningless. The model will span a lot of cells, but we're looking for numbers in the cells outside of the model. The model will probably be placed in the middle of the space, with some margin around (like [this](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQU3_58Ov3qzPO0C3HILD0voVGp4dS3cRfQOQ&usqp=CAU) but in 3 dimensions and with higher resolution). – md2perpe Mar 28 '21 at 15:42
  • @md2perpe by "numbers in the cells outside of the model" i assume you mean outputs where the input is 0. If the number of cells the model occupies is deterministic (w/ the margin) maybe there exists some parameterisation that can approximately describe the number of cells which are 1. A motivating idea is how one describes 1D gaussians by their mean and standard deviation instead of trying to find how each individual input in $\mathbb{R}$ is mapped. – quest ions Mar 28 '21 at 22:14
  • @questions. Correct; the interesting output is where the input equals 0. The number of cells the model occupies is quite deterministic for a given model. It should be approximately proportional to the volume of the model. How do you mean that this could be used? The output is known to depend on the form of the model, but to just scale with the size of the model. – md2perpe Mar 29 '21 at 08:53
  • @md2perpe How you use that information to define input features is the crux of the problem. For your case one could first try implementing a 3D coordinate system associating each cell with a coordinate and then, for instance, passing the coordinates of the cells that are on the perimeter of the model as the input (this won't work if the perimeter is constantly changing, as that causes the size of the input to change). Without much further context it's hard to suggest a truly viable input, but this hopefully provides you with an idea of how one could attempt this. – quest ions Mar 29 '21 at 11:28
  • The purpose of these features is to, in a succinct way, be able to uniquely define the "true inputs". There may be various ways to pick feature sets, and the results during learning can vary a lot based on this. So you should try to come up with different input features and compare the performances – quest ions Mar 29 '21 at 11:39
  • @questions. So your idea is to reduce the number of dimensions of the input by using some representation of the boundary of the model? – md2perpe Mar 30 '21 at 09:05
  • yes that was my suggestion, although it may not work in practice because NNs require a fixed-size input. But hopefully that provides you with a new perspective on how to reduce the dimensionality of your problem – if you feel satisfied with my response, I can write this up as an answer – quest ions Apr 01 '21 at 09:28
  • @questions. The number of dimensions of the output is still the same, though. And because of that, I'm not sure if reducing the dimensions of input helps much. But perhaps there is some way to also reduce the number of dimensions of the output. – md2perpe Apr 01 '21 at 15:01
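The boundary-coordinate idea discussed in the comments above could be sketched as follows. This is a minimal illustration, not a vetted pipeline: the helper names are made up, the "surface" test is a crude 6-neighbour check, and the fixed-size-input problem is side-stepped by random subsampling to a fixed number of boundary points.

```python
import numpy as np

def boundary_voxels(occupancy):
    """Coordinates of occupied cells that touch at least one empty
    (or out-of-grid) neighbour: a crude surface extraction."""
    coords = []
    nx, ny, nz = occupancy.shape
    neighbours = [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]
    for i, j, k in zip(*np.nonzero(occupancy)):
        for di, dj, dk in neighbours:
            a, b, c = i + di, j + dj, k + dk
            if not (0 <= a < nx and 0 <= b < ny and 0 <= c < nz) \
                    or not occupancy[a, b, c]:
                coords.append((i, j, k))
                break
    return np.array(coords)

def fixed_size_features(coords, n_points=32, seed=0):
    """Subsample (with replacement) to a fixed number of boundary
    points, since a plain feed-forward net needs a fixed-size input."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(coords), size=n_points)
    return coords[idx].ravel()   # shape: (3 * n_points,)

# 8x8x8 tunnel with a 4x4x4 block 'model'.
occupancy = np.zeros((8, 8, 8), dtype=bool)
occupancy[2:6, 2:6, 2:6] = True

surface = boundary_voxels(occupancy)      # 56 of the 64 block cells
features = fixed_size_features(surface)   # flat vector of length 96
print(surface.shape, features.shape)
```

This only compresses the input side; as noted in the last comment, the output still has one float per cell, so a similar reduction (or a decoder that maps the compact features back to the grid) would be needed there as well.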

0 Answers