
It seems that stacking LSTM layers can be beneficial for some problem settings in order to learn higher levels of abstraction of temporal relationships in the data. There is already some discussion on selecting the number of hidden layers and number of cells per layer.

My question: Is there any guidance for the relative number of cells from one LSTM layer to a subsequent LSTM layer in the stack? I am specifically interested in problems involving timeseries forecasting (given a stretch of temporal data, predict the trend of that data over some time window into the future), but I'd also be curious to know for other problem settings.

For example, say I am stacking 3 LSTM layers on top of each other: LSTM1, LSTM2, LSTM3, where LSTM1 is closer to the input and LSTM3 is closer to the output. Are any of the following relationships expected to improve performance?

  1. num_cells(LSTM1) > num_cells(LSTM2) > num_cells(LSTM3) [Sizes decrease input to output]
  2. num_cells(LSTM1) < num_cells(LSTM2) < num_cells(LSTM3) [Sizes increase input to output]
  3. num_cells(LSTM1) < num_cells(LSTM2) > num_cells(LSTM3) [Middle layer is largest]

Obviously there are other combinations, but those seem to me to be the salient patterns. I know the answer is probably "it depends on your problem; there is no general guidance", but I'm looking for some indication of what kind of behavior I could expect from these different configurations.
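For concreteness, here is a minimal sketch of how one of these stacks could be defined, assuming Keras. It implements pattern 1 (sizes decreasing from input to output); the cell counts (64, 32, 16) and the window/horizon lengths are arbitrary placeholders, not a recommendation. Swapping the tuple order gives patterns 2 or 3.

```python
# Sketch of a 3-layer stacked LSTM for timeseries forecasting, assuming Keras.
# Pattern 1: num_cells(LSTM1) > num_cells(LSTM2) > num_cells(LSTM3).
# All sizes below are placeholders for illustration only.
import tensorflow as tf


def build_stacked_lstm(num_cells=(64, 32, 16), timesteps=30, features=1, horizon=10):
    """Build an LSTM1 -> LSTM2 -> LSTM3 stack that forecasts `horizon` steps."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(timesteps, features)),
        # LSTM1 and LSTM2 must return full sequences so the next LSTM
        # layer receives a (timesteps, cells) tensor rather than a vector.
        tf.keras.layers.LSTM(num_cells[0], return_sequences=True),
        tf.keras.layers.LSTM(num_cells[1], return_sequences=True),
        # LSTM3 returns only its final hidden state.
        tf.keras.layers.LSTM(num_cells[2]),
        # One linear output per future step in the forecast window.
        tf.keras.layers.Dense(horizon),
    ])
    return model


model = build_stacked_lstm()
```

Note that the only structural constraint is `return_sequences=True` on every LSTM layer except the last; the per-layer cell counts themselves are otherwise free, which is exactly why the question of how to choose their relative sizes arises.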

adamconkey
