As I understand from the literature, the final activation in the actor (policy) network in TD3 and SAC is normally a Tanh function, scaled by the action limit.
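For reference, here is a minimal sketch of the actor head I mean (PyTorch, TD3-style deterministic policy; `state_dim`, `action_dim`, and `max_action` are placeholder names I chose for illustration):

```python
import torch
import torch.nn as nn

class TanhActor(nn.Module):
    """TD3-style actor: Tanh output scaled to [-max_action, max_action]."""
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )
        self.max_action = max_action

    def forward(self, state):
        # Tanh bounds each component to (-1, 1); scaling maps it to the action limits.
        return self.max_action * torch.tanh(self.net(state))
```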
My action is a vector whose entries all lie between 0 and 1 and sum to 1, which is exactly what a Softmax produces. But these values are not probabilities of discrete actions: each entry of the action vector is the fraction of the whole portfolio to be invested in a particular stock.
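So what I have in mind is replacing the Tanh head with a Softmax, so that the output is always a valid allocation. A sketch of what I mean, using the same hypothetical network body as above:

```python
import torch
import torch.nn as nn

class SoftmaxActor(nn.Module):
    """Actor whose output is a portfolio-weight vector: entries in (0, 1), summing to 1."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, state):
        # Softmax maps the logits onto the probability simplex, but here each
        # entry is read as a portfolio fraction, not a class probability.
        return torch.softmax(self.net(state), dim=-1)
```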
But I cannot figure out whether it would be mathematically sound to use a Softmax as the output activation in TD3 or SAC.