LSTM exploding? - multiple parallel time series with multiple variables

Question

I have the following situation:

Stock	Time_Stamps	Feature_1	Feature_2	Feature_n	Price
Stock_1	2019	0.5	1.0	1.0	100
Stock_1	2020	0.7	1.3	0.9	90
Stock_2	2019	0.3	0.9	1.1	110
Stock_2	2020	0.2	0.8	1.1	120
Stock_n	year_n	value_n	value_n	value_n	price_n

This is how my data table is structured. My original df has 100+ features and 70000k observations across 2000+ stocks, so this is only a simplification.

I want to train an LSTM on this data table and look for correlations between the features and the price. It's a common idea, nothing new, so please save your time on "this will not work" replies.

I am generally interested in how you would approach this problem. We have multiple inputs (features) for our time series forecast, with 8 time stamps (8 years) per stock. However, as I understand it, I'd have to train my model for every stock separately, which is inconvenient.

How would you pre-process my data, so that I can train a decent model?

score 0 · Answer 1 · answered Jul 15 '22 at 15:18

Maybe you can try encoding the "Stock" column with one-hot encoding: https://stackoverflow.com/questions/37292872/how-can-i-one-hot-encode-in-python

So, for example, if you have "AAPL" and "MSFT" stocks, then AAPL would be encoded as [0, 1] and MSFT would be encoded as [1, 0].

It may also be worth doing the same for "Year", like 2022 -> [0, 0, 1], 2021 -> [0, 1, 0], 2020 -> [1, 0, 0].
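As a rough sketch of that encoding, assuming the data sits in a pandas DataFrame with the column names from the question's table (the toy values are copied from it, and only one feature column is kept for brevity), `pd.get_dummies` can build the indicator columns for both "Stock" and "Time_Stamps" in one call. Which position holds the 1 for a given stock is arbitrary and depends on the sort order of the categories.

```python
import pandas as pd

# Toy frame mirroring the question's table (assumption: real data has the same column names).
df = pd.DataFrame({
    "Stock": ["Stock_1", "Stock_1", "Stock_2", "Stock_2"],
    "Time_Stamps": [2019, 2020, 2019, 2020],
    "Feature_1": [0.5, 0.7, 0.3, 0.2],
    "Price": [100, 90, 110, 120],
})

# One 0/1 indicator column per distinct stock and per distinct year.
# The original "Stock" and "Time_Stamps" columns are dropped and replaced.
encoded = pd.get_dummies(df, columns=["Stock", "Time_Stamps"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

With 2000+ stocks this adds one column per stock, so the width of the frame grows with the number of distinct values being encoded.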

I also think the price should be normalized: instead of absolute values, you can use min-max normalization to map the price into the 0-1 range: https://stackoverflow.com/questions/48178884/min-max-normalisation-of-a-numpy-array
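A minimal sketch of min-max normalization with NumPy, using the four prices from the question's table. The helper name `min_max` is made up for this illustration, and returning zeros for a constant input is an assumption chosen here to avoid dividing by zero.

```python
import numpy as np

def min_max(x):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        # Constant input: no spread to rescale, so return zeros (assumption).
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)

prices = np.array([100, 90, 110, 120])  # Price column from the question's table
scaled = min_max(prices)
print(scaled)
```

When a train/test split is involved, the minimum and maximum would normally be computed on the training part only and reused for the test part, so that no information from the test data leaks into the scaling.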

thanks for the ideas. In a prior approach, I have categorised the price (-1,0,1), however, if I would encode my 2000+ stocks, wouldn't that create a horribly huge dataframe for my model? — Sphenoidale, Jul 17 '22 at 14:06

score 0 · Answer 2 · answered Dec 12 '22 at 20:15

No idea if this would result in a "decent" model, but can you not take the disjoint union of all features?

Roughly that means you concatenate the $n$ features for all 8 stocks into a single "megastock" that has $8n$ features and $8$ outputs ($1$ price per stock), as opposed to 8 models with $n$ features and $1$ output each.

In other words, for each stock you have $n$ input features and 1 output (price), so taking the disjoint union would result in $8n$ input features and 8 outputs (price for stock $1$, price for stock $2$, etc.).
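One possible way to build such a "megastock" table, sketched with pandas under the assumption that the data is in the long format shown in the question (two stocks, two years, and two features here, with values copied from the question's table). Pivoting on the time stamp gives one row per year, with one feature column per (feature, stock) pair and one price column per stock.

```python
import pandas as pd

# Long-format toy data with the question's column names (assumption).
df = pd.DataFrame({
    "Stock": ["Stock_1", "Stock_1", "Stock_2", "Stock_2"],
    "Time_Stamps": [2019, 2020, 2019, 2020],
    "Feature_1": [0.5, 0.7, 0.3, 0.2],
    "Feature_2": [1.0, 1.3, 0.9, 0.8],
    "Price": [100, 90, 110, 120],
})

feature_cols = ["Feature_1", "Feature_2"]

# Inputs: one row per year, one column per (feature, stock) pair.
wide_X = df.pivot(index="Time_Stamps", columns="Stock", values=feature_cols)
# Outputs: one row per year, one price column per stock.
wide_y = df.pivot(index="Time_Stamps", columns="Stock", values="Price")

# Add a leading batch axis to get the (samples, time steps, features)
# layout that LSTM layers commonly expect.
X = wide_X.to_numpy()[None, :, :]
y = wide_y.to_numpy()[None, :, :]

print(X.shape, y.shape)
```

Here `X` has shape (1, 2, 4) and `y` has shape (1, 2, 2): a single sequence whose feature axis is the number of features times the number of stocks, and whose output axis has one price per stock.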
