
I am using the model from this Colab notebook https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/text_generation.ipynb#scrollTo=AM2Uma_-yVIq for Shakespeare-like text generation.

It looks like this:

class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x

I am looking for strategies to improve the model, under the assumption that there is a reliable way to assess its quality. How could adding one or more GRU layers improve the model, and how should I choose the hyperparameters (e.g., number of rnn_units, number of layers) for such a stacked model?

For instance, this gives extremely bad results:

class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True)
    self.gru_2 = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    if states is None:
      states = self.gru_2.get_initial_state(x)
    x, states = self.gru_2(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x
  • Please limit posts to one question, as it stands there are three here: "How can GRU improve my approach?", "What guidance exists for RNN hyperparameter tuning", and "How do I assess the performance of my model?" – Andy Sep 21 '21 at 19:48
  • Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. – Community Sep 21 '21 at 19:48
  • Thanks, edited the message – kiriloff Sep 21 '21 at 20:09

1 Answer


In the original example, the single recurrent layer $l=1$ carries its own state $S^{\mathrm{out}(l=1)}_{t-1} \to S^{\mathrm{in}(l=1)}_{t}$ from one minibatch to the next along the sequence's discrete time[like] (w.l.o.g.) coordinate¹ $(t-1) \gets t$, which advances equidistantly² from one invocation to the next, via the states argument of MyModel.call().

THE INVARIANT $(\mathcal{I})$: Each recurrent layer $l$ has its own carry-over state $$S^{\mathrm{out}(l)}_{t-1} \to S^{\mathrm{in}(l)}_{t} \tag{$\mathcal{I}$}$$
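The invariant can be checked concretely with a single GRU layer, in a standalone sketch (the sizes here are illustrative, not from the notebook): processing a sequence in chunks while carrying the state over reproduces the single-pass result.

```python
import tensorflow as tf

tf.random.set_seed(0)
gru = tf.keras.layers.GRU(8, return_sequences=True, return_state=True)

x = tf.random.normal((1, 6, 4))  # one sequence of 6 time steps

# Process the whole sequence in one invocation...
_, state_full = gru(x)

# ...or in two chunks, carrying the state between invocations
# exactly as the `states` argument of MyModel.call() does.
_, state_a = gru(x[:, :3])
_, state_b = gru(x[:, 3:], initial_state=state_a)

# The carried-over state reproduces the single-pass state.
print(float(tf.reduce_max(tf.abs(state_full - state_b))))  # ~0.0
```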

This is not what happens in the second code example, and that is the root cause of the poor results. The code takes $S^{\mathrm{out}(1)}_{t-1}$ and correctly passes it as the input state of GRU layer $l=1$, receiving a new state $S^{\mathrm{out}(1)}_{t}$ as part of the layer's return tuple. But then it passes this state to the other layer $l'=2$ as if it were $S^{\mathrm{in}(2)}_{t}$, which it is not: the two states are unrelated, because each belongs to the internal state space of its own layer. $S^{*(l)}$ lives in an entirely different space than $S^{*(l')}$ for $l \ne l'$. (Note also that the second `if states is None` check can never fire, since `states` was just reassigned by the first GRU.)
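A quick way to see that the two state spaces are distinct is to give the layers different widths, in which case the states are not even shape-compatible (a hypothetical illustration, sizes made up for the purpose):

```python
import tensorflow as tf

batch, seq_len, emb = 4, 10, 8
gru_1 = tf.keras.layers.GRU(16, return_sequences=True, return_state=True)
gru_2 = tf.keras.layers.GRU(32, return_sequences=True, return_state=True)

x = tf.random.normal((batch, seq_len, emb))
y1, s1 = gru_1(x)   # s1 has shape (batch, 16): layer 1's state space
y2, s2 = gru_2(y1)  # s2 has shape (batch, 32): layer 2's state space

print(s1.shape, s2.shape)  # (4, 16) (4, 32)
# Feeding s1 as gru_2's initial_state would fail outright here; with
# equal widths it would run silently, yet still mix unrelated spaces.
```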

What you should do instead is collect every $S^{\mathrm{out}(l)}_{t}$ separately for each layer $l$, and on the next invocation pass them back to their respective layers as $S^{\mathrm{in}(l)}_{t'}$, in the same states variable, after the discrete coordinate has advanced $t' \gets t$; this conserves the invariant $(\mathcal{I})$.

The code for a model with two recurrent layers, corrected for the inconsistency noted above and including the initialization of the states $S^{\mathrm{in}(l)}_{0}$ at the initial time step $t=0$ for both layers $l \in \{1, 2\}$, may look like the following:

class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units_1, rnn_units_2):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru1 = tf.keras.layers.GRU(rnn_units_1,
                                    return_sequences=True,
                                    return_state=True)
    self.gru2 = tf.keras.layers.GRU(rnn_units_2,
                                    return_sequences=True,
                                    return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = self.embedding(inputs, training=training)
    # One independent state per recurrent layer.
    (s1, s2) = states or (self.gru1.get_initial_state(x),
                          self.gru2.get_initial_state(x))
    x, s1 = self.gru1(x, initial_state=s1, training=training)
    x, s2 = self.gru2(x, initial_state=s2, training=training)
    x = self.dense(x, training=training)
    return (x, (s1, s2)) if return_state else x

The falsiness of the None literal together with the or construct is conveniently used here to make the code arguably³ more readable.
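The same or idiom in isolation, with hypothetical placeholder values standing in for the state tensors:

```python
# First call: states is None, so the defaults are used.
states = None
s1, s2 = states or ("init-1", "init-2")
assert (s1, s2) == ("init-1", "init-2")

# Subsequent calls: a non-empty tuple is truthy, so the
# carried-over per-layer states are used instead.
states = ("carried-1", "carried-2")
s1, s2 = states or ("init-1", "init-2")
assert (s1, s2) == ("carried-1", "carried-2")
```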

If you pack more recurrent layers into the model, a tuple may become awkward to work with; whether to store the references to all RNN layers in, and pass their respective internal states via, a list, a tuple, or another data structure is your code-readability judgment call.
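One way to generalize beyond two layers is to keep the layers and their states in parallel lists. This is a sketch rather than part of the original answer; StackedGRUModel and its parameter names are made up, and passing initial_state=None on the first call lets each layer build its own zero state:

```python
import tensorflow as tf

class StackedGRUModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units_per_layer):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.grus = [tf.keras.layers.GRU(units,
                                     return_sequences=True,
                                     return_state=True)
                 for units in rnn_units_per_layer]
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = self.embedding(inputs, training=training)
    if states is None:
      # One slot per layer; None lets each layer zero-initialize itself.
      states = [None] * len(self.grus)
    new_states = []
    for gru, state in zip(self.grus, states):
      # Each layer receives only its OWN previous state: invariant (I).
      x, state = gru(x, initial_state=state, training=training)
      new_states.append(state)
    x = self.dense(x, training=training)
    return (x, new_states) if return_state else x

model = StackedGRUModel(vocab_size=65, embedding_dim=16,
                        rnn_units_per_layer=[32, 64])
logits, states = model(tf.constant([[1, 2, 3]]), return_state=True)
# logits: (1, 3, 65); states: shapes (1, 32) and (1, 64)
```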

____________________________
¹ Indices, like all contravariant tensor bases, transform covariantly; we indicate this by the reversed arrow.
² We're using the increment of $1$ w.l.o.g.
³ But please don't: this is far beyond the scope of the question. There's CodeReview.SE for that! :)