If you use RNNs, then I think the standard solution is zero padding up to the maximum sequence length (that is, the maximum number of words in a text), combined with masking so that your model skips the zeros. That way, the model learns a fixed-size representation of each input. If you do not know a sensible maximum length, one option is to grid-search it as a hyperparameter.
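For instance, in Keras this masking behaviour comes almost for free via `mask_zero=True` on the `Embedding` layer. A minimal sketch, assuming a binary classification task; `MAX_LEN` and `VOCAB_SIZE` are placeholder values:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

MAX_LEN = 50        # assumed maximum number of words per text
VOCAB_SIZE = 10000  # assumed vocabulary size

# Toy batch of variable-length word-index sequences, zero-padded to MAX_LEN.
sequences = [[12, 7, 256], [3, 98, 41, 5, 77]]
padded = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=MAX_LEN, padding="post")

model = models.Sequential([
    # mask_zero=True makes downstream layers skip the padded zeros
    layers.Embedding(VOCAB_SIZE, 64, mask_zero=True),
    layers.LSTM(32),                        # fixed-size representation
    layers.Dense(1, activation="sigmoid"),  # e.g. a classification head
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.predict(padded, verbose=0)  # forward pass on the padded batch
```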
If you still want to exploit the differences in input size, maybe you can train several models, each with a fixed representation size chosen according to the size of its inputs: for example, one model for small, one for medium and one for large inputs (see the bucketing sketch below). This will likely require a large and fairly balanced initial dataset, though.
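For example, a simple length-bucketing helper could look like this; the bucket boundaries and the `train_model` helper are hypothetical and only meant to illustrate the idea:

```python
def bucket_by_length(texts, labels, boundaries=(20, 100)):
    """Split a dataset into small/medium/large buckets by word count.
    The boundaries are illustrative; tune them to your length distribution."""
    buckets = {"small": [], "medium": [], "large": []}
    for text, label in zip(texts, labels):
        n = len(text.split())
        if n <= boundaries[0]:
            buckets["small"].append((text, label))
        elif n <= boundaries[1]:
            buckets["medium"].append((text, label))
        else:
            buckets["large"].append((text, label))
    return buckets

# One model per bucket, each with its own representation size, e.g.:
# rep_size = {"small": 16, "medium": 64, "large": 128}
# for name, data in bucket_by_length(texts, labels).items():
#     train_model(rep_size[name], data)  # train_model is a hypothetical helper
```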
Another idea could be to use an autoencoder with a fixed latent dimension, then cluster your samples on their latent representations, on the assumption that similar representations have similar dimensionality requirements. After that, you could train k models on your initial dataset, one per cluster, so that there are k different latent spaces, and the goal becomes matching each instance to the correct model. At first you would train all k models on every instance, but as training progresses you could perhaps switch to a binary search over the models for each instance, assuming the dimensionality requirements are totally ordered. Of course, this is just an idea; I don't know if it would really help at all. A rough sketch of the first two steps follows.
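Here is one way the autoencoder-plus-clustering part might look (the routing/binary-search step is omitted). It assumes the texts have already been vectorized to a common feature dimension, e.g. bag-of-words; `LATENT_DIM` and `K` are arbitrary choices:

```python
import numpy as np
from sklearn.cluster import KMeans
from tensorflow.keras import layers, models

LATENT_DIM = 8  # fixed latent dimension (arbitrary choice)
K = 3           # number of clusters / per-cluster models (arbitrary choice)

def build_autoencoder(input_dim):
    inp = layers.Input(shape=(input_dim,))
    z = layers.Dense(LATENT_DIM, activation="relu")(inp)   # latent code
    out = layers.Dense(input_dim, activation="linear")(z)  # reconstruction
    auto = models.Model(inp, out)
    encoder = models.Model(inp, z)
    auto.compile(optimizer="adam", loss="mse")
    return auto, encoder

# Stand-in for texts already vectorized to a common feature dimension.
X = np.random.rand(500, 300).astype("float32")
auto, encoder = build_autoencoder(X.shape[1])
auto.fit(X, X, epochs=10, batch_size=32, verbose=0)

# Cluster the latent codes; each cluster then gets its own model.
codes = encoder.predict(X, verbose=0)
assignment = KMeans(n_clusters=K, n_init=10).fit_predict(codes)
# e.g. instance i would subsequently be routed to models[assignment[i]]
```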