
Inverse reinforcement learning based on GAIL and GAN-Guided Cost Learning (GAN-GCL) uses a discriminator to classify between expert demonstrations and policy-generated samples. Adversarial IRL (AIRL), built upon GAN-GCL, has its discriminator $D_{\theta, \phi}$ defined in terms of a reward approximator $f_{\theta, \phi}$:

$$ D_{\theta, \phi}\left(s, a, s^{\prime}\right)=\frac{\exp \left\{f_{\theta, \phi}\left(s, a, s^{\prime}\right)\right\}}{\exp \left\{f_{\theta, \phi}\left(s, a, s^{\prime}\right)\right\}+\pi(a \mid s)}, $$

where $f_{\theta,\phi}$ is expressed as:

$$f_{\theta,\phi}\left(s, a, s^{\prime}\right) = g_{\theta}(s) + \gamma h_{\phi}\left(s^{\prime}\right) - h_{\phi}(s).$$

The optimal $g^*(s)$ recovers the optimal reward function $r^*(s)$, while the optimal $h^*(s)$ recovers the optimal value function $V^*(s)$, which makes $f_{\theta,\phi}$ interpretable as the advantage.
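
Concretely, the two equations together read as something like the following sketch (the names `g`, `h` and `pi_log_prob` are placeholders for illustration, not the authors' code):

    import torch

    def discriminator(g, h, s, s_next, pi_log_prob, gamma=0.99):
        """Sketch of D computed from f and the policy likelihood.

        g, h        -- modules mapping a batch of states to one scalar per state
        pi_log_prob -- log pi(a | s) for the sampled actions
        """
        # f(s, a, s') = g(s) + gamma * h(s') - h(s)
        f = g(s) + gamma * h(s_next) - h(s)
        # exp(f) / (exp(f) + pi(a|s)) == sigmoid(f - log pi(a|s))
        return torch.sigmoid(f - pi_log_prob)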

My question comes from the network architecture used for $h(s)$ in the original paper.

> ... we use a **2-layer ReLU network** for the shaping term h. For the policy, we use a two-layer (32 units) ReLU gaussian policy.

What is meant by the text in bold? My interpretation of it (shown below) doesn't seem viable:

    h = nn.Sequential(nn.ReLU(), nn.ReLU())
I've seen this [short example from pytorch docs](https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_nn.html). They define a fully-connected ReLU network with one hidden layer as ```nn.Sequential(nn.Linear(D_in, H), nn.ReLU(), nn.Linear(H, D_out))``` – mugoh Sep 09 '20 at 14:44

1 Answer


The PyTorch docs define a fully-connected ReLU network with one hidden layer as:

torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

Neural networks are made of layers of neurons with learnable weights. Activation functions only help determine which of these neurons fire, meaning they have no learnable parameters themselves through which we can back-propagate gradients. A module with no learnable parameters is therefore not a neural network, so a neural network can't be composed of activation functions only.

> What is meant by the text in bold? My interpretation of it (shown below) doesn't seem viable

Yes, what's given there is not a network that can approximate the shaping function $h(s)$. A two-layer ReLU network would instead resemble:

    h = nn.Sequential(
        nn.Linear(d_in, H), nn.ReLU(),   # first hidden layer
        nn.Linear(H, H), nn.ReLU(),      # second hidden layer
        nn.Linear(H, d_out),             # output layer
    )

Another way to see it is that a network must have an input and output layer, and optional hidden layers. It's not possible to use an activation function as an input layer, because then you'd have no way of configuring the number of features that represent your input data. In this context, a ReLU on its own can't represent the features of the observation input $s$.
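
For instance, a minimal usage sketch of such an $h$ network (the observation dimension and hidden size below are hypothetical, not taken from the paper):

    import torch
    import torch.nn as nn

    obs_dim, H = 11, 32              # hypothetical observation dim and hidden size
    h = nn.Sequential(
        nn.Linear(obs_dim, H), nn.ReLU(),
        nn.Linear(H, H), nn.ReLU(),
        nn.Linear(H, 1),             # scalar shaping term h(s)
    )
    print(h(torch.randn(8, obs_dim)).shape)  # torch.Size([8, 1])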

To show that activation functions have no learnable parameters and that this interpretation

    h = nn.Sequential(nn.ReLU(), nn.ReLU())

is not what the authors are driving at, here is a script that counts the number of parameters in a network.


import torch.nn as nn
import numpy as np

activation = nn.ReLU


def count_params(module):
    return np.sum([np.prod(x.shape) for x in module.parameters()])


one_linear = nn.Sequential(nn.Linear(32, 10), nn.Linear(10, 1))                 # linear layers only
linear_act = nn.Sequential(nn.Linear(32, 10), activation(), nn.Linear(10, 1))   # linear layers + ReLU
act_only = nn.Sequential(activation(), activation())                            # the questioned interpretation

t_lin = count_params(linear_act)
lin = count_params(one_linear)
act = count_params(act_only)

print(f'Linear only: {lin}, Linear + Activation: {t_lin}, ' +
      f'Activation only: {act}')


[Out]: Linear only: 341, Linear + Activation: 341, Activation only: 0.0

The activation-function-only module has zero learnable parameters. Likewise, an activation function adds no parameters to the fully connected layers.
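
For completeness, the "two-layer (32 units) ReLU gaussian policy" from the same quote is built the same way, from `Linear` layers with `ReLU` activations. Here is a hedged sketch (class and attribute names are made up for illustration, not the authors' implementation):

    import torch
    import torch.nn as nn
    from torch.distributions import Normal

    class GaussianPolicy(nn.Module):
        """Hypothetical two-layer (32 units) ReLU Gaussian policy."""

        def __init__(self, obs_dim, act_dim, hidden=32):
            super().__init__()
            self.mean_net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, act_dim),                    # mean of the Gaussian
            )
            self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent std

        def forward(self, obs):
            return Normal(self.mean_net(obs), self.log_std.exp())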

Update: Links to implementations

To confirm this answer's interpretation is correct, here are links to GAIL and GAN-GCL example implementations:

  1. GAIL : discriminator prediction (Forward call), discriminator architecture (The ReLU net):
  2. GAN-GCL : discriminator prediction, discriminator architecture: