2

I have the following situation:

Stock    Time_Stamps  Feature_1  Feature_2  Feature_n  Price
Stock_1  2019         0.5        1.0        1.0        100
Stock_1  2020         0.7        1.3        0.9        90
Stock_2  2019         0.3        0.9        1.1        110
Stock_2  2020         0.2        0.8        1.1        120
Stock_n  year_n       value_n    value_n    value_n    price_n

So this is how my data table is structured. My original df has 100+ features and 70000k observations across 2000+ stocks, so this is only a simplification.

I want to train an LSTM on this data table and look for feature correlations with the price. Common idea, nothing new, so please save your time telling me "this will not work".

I am generally interested in how you would approach this problem. We have multiple inputs (features) for our time series forecast, with 8 time stamps (8 years) per stock. However, in my understanding, I'd have to train my model for every stock separately, which is inconvenient.

How would you pre-process my data, so that I can train a decent model?

2 Answers

0

Maybe you can try encoding the "Stock" column via one-hot encoding: https://stackoverflow.com/questions/37292872/how-can-i-one-hot-encode-in-python

So, for example, if you have "AAPL" and "MSFT" stocks, then AAPL would be encoded as [0, 1] and MSFT would be encoded as [1, 0].

It may also be worth doing the same for "Year", e.g. 2022 -> [0, 0, 1], 2021 -> [0, 1, 0], 2020 -> [1, 0, 0].
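As a rough illustration of the encoding described above, here is a minimal sketch using pandas' `get_dummies`. The toy rows are copied from the question's table (with the features trimmed), and the "Year" column is assumed to be the question's `Time_Stamps` column. Which position in the vector ends up as the 1 depends on column order, so the exact vectors may differ from the ones written above.

```python
import pandas as pd

# Toy rows copied from the question's table (features trimmed for brevity).
df = pd.DataFrame({
    "Stock": ["Stock_1", "Stock_1", "Stock_2", "Stock_2"],
    "Time_Stamps": [2019, 2020, 2019, 2020],
    "Feature_1": [0.5, 0.7, 0.3, 0.2],
    "Price": [100, 90, 110, 120],
})

# Replace each categorical column by one 0/1 indicator column per distinct value.
encoded = pd.get_dummies(df, columns=["Stock", "Time_Stamps"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each distinct stock and each distinct year becomes its own column, so the width of the frame grows with the number of distinct values.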

I also think the price should be normalized: instead of absolute values you can use min-max normalization to map the price into the 0-1 range: https://stackoverflow.com/questions/48178884/min-max-normalisation-of-a-numpy-array
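A minimal sketch of min-max normalization on the question's toy prices follows. The helper `min_max` and the new column names are made up for illustration. Scaling per stock rather than over the whole table is an assumption on my part; it may be preferable when price levels differ a lot between stocks.

```python
import pandas as pd

# Toy prices copied from the question's table.
df = pd.DataFrame({
    "Stock": ["Stock_1", "Stock_1", "Stock_2", "Stock_2"],
    "Price": [100.0, 90.0, 110.0, 120.0],
})

def min_max(series):
    """Scale linearly so the minimum maps to 0 and the maximum to 1."""
    span = series.max() - series.min()
    if span == 0:
        # A constant series has no range; map it to all zeros.
        return series * 0.0
    return (series - series.min()) / span

# Option 1: one global scale over all stocks.
df["Price_scaled"] = min_max(df["Price"])

# Option 2 (assumption): a separate scale per stock.
df["Price_scaled_per_stock"] = df.groupby("Stock")["Price"].transform(min_max)

print(df)
```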

  • Thanks for the ideas. In a prior approach I categorised the price (-1, 0, 1). However, if I encoded my 2000+ stocks, wouldn't that create a horribly huge dataframe for my model? – Sphenoidale Jul 17 '22 at 14:06

0

No idea if this would result in a "decent" model, but can you not take the disjoint union of all features?

Roughly that means you concatenate the $n$ features for all 8 stocks into a single "megastock" that has $8n$ features and $8$ outputs ($1$ price per stock), as opposed to 8 models with $n$ features and $1$ output each.

In other words, for each stock you have $n$ input features and 1 output (price), so taking the disjoint union would result in $8n$ input features and 8 outputs (price for stock $1$, price for stock $2$, etc.).
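To make the "megastock" idea concrete, here is a minimal sketch that pivots the long table into one row per time stamp, with one block of feature columns and one price column per stock. It uses the two stocks and two of the features from the question's toy table, so it yields $2 \cdot 2 = 4$ inputs and $2$ outputs. The flattened column names are made up for illustration, and the sketch assumes every stock has a row for every time stamp (otherwise the pivot leaves missing values).

```python
import pandas as pd

# Toy rows copied from the question's table (two features kept).
df = pd.DataFrame({
    "Stock": ["Stock_1", "Stock_1", "Stock_2", "Stock_2"],
    "Time_Stamps": [2019, 2020, 2019, 2020],
    "Feature_1": [0.5, 0.7, 0.3, 0.2],
    "Feature_2": [1.0, 1.3, 0.9, 0.8],
    "Price": [100, 90, 110, 120],
})

# Long -> wide: one row per time stamp, columns indexed by (variable, stock).
wide = df.pivot(index="Time_Stamps", columns="Stock")

# Flatten the two-level column index into names like "Stock_1_Feature_1".
wide.columns = [f"{stock}_{name}" for name, stock in wide.columns]

# Inputs: all feature columns of all stocks. Outputs: one price per stock.
price_cols = [c for c in wide.columns if c.endswith("_Price")]
X = wide.drop(columns=price_cols)
y = wide[price_cols]

print(X)
print(y)
```

Note that the number of rows shrinks to the number of time stamps while the number of columns grows with the number of stocks.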