The PyTorch docs define a fully connected ReLU network as:
torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
Neural networks are, as the name suggests, made of neurons. Activation functions only determine which of these neurons fire; they have no learnable parameters of their own through which we could back-propagate gradients. A module with no learnable parameters is therefore not a neural net, so a neural net can't be composed of activation functions only.
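A quick way to see this directly in PyTorch is to ask an activation module for its parameters (a minimal sketch; nn.Tanh is included only as a second example of an activation):

import torch.nn as nn

# Activation modules expose no learnable parameters, so there is nothing
# for an optimizer to update via back-propagation.
print(list(nn.ReLU().parameters()))   # []
print(list(nn.Tanh().parameters()))   # []

# A Linear layer, by contrast, carries a weight matrix and a bias vector.
print([p.shape for p in nn.Linear(3, 2).parameters()])
# [torch.Size([2, 3]), torch.Size([2])]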
What is meant by the quoted text in bold? My interpretation of that text (shown below) doesn't seem viable.
Yes, what's given here is not a network that can approximate the function $h(s)$ in question. A two-layer ReLU network would look more like:
net = nn.Sequential(
    nn.Linear(d_IN, H), nn.ReLU(),
    nn.Linear(H, H), nn.ReLU(),
    nn.Linear(H, d_OUT),
)
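As a sanity check, here is a minimal self-contained sketch of pushing a batch of observations through such a network (the concrete sizes d_IN = 8, H = 64, d_OUT = 1 are hypothetical, chosen only for illustration):

import torch
import torch.nn as nn

d_IN, H, d_OUT = 8, 64, 1            # hypothetical sizes for illustration
net = nn.Sequential(
    nn.Linear(d_IN, H), nn.ReLU(),
    nn.Linear(H, H), nn.ReLU(),
    nn.Linear(H, d_OUT),
)

s = torch.randn(4, d_IN)             # a batch of 4 observations
h_s = net(s)                         # the network's estimate of h(s)
print(h_s.shape)                     # torch.Size([4, 1])

All of the learnable parameters live in the Linear layers; the ReLU modules in between add nothing to the parameter count, as the script further down confirms.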
Another way to see it is that a network must have an input and an output layer, plus optional hidden layers. An activation function can't serve as the input layer, because it gives you no way to configure the number of features that represent your input data. In this context, a ReLU can't represent the features of the observation input s.
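To make that concrete, here is a small sketch (obs_dim = 4 and the hidden width 64 are hypothetical values): a Linear input layer is explicitly parameterised by the observation size, while a ReLU accepts no size argument at all and simply passes through whatever shape it is given.

import torch
import torch.nn as nn

obs_dim = 4                           # hypothetical observation size
s = torch.randn(1, obs_dim)           # a single observation

input_layer = nn.Linear(obs_dim, 64)  # the input size is baked into the layer
print(input_layer(s).shape)           # torch.Size([1, 64])

relu = nn.ReLU()                      # has no notion of input features
print(relu(s).shape)                  # torch.Size([1, 4]) -- shape passes through unchanged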
To show that activation functions have no learnable parameters, and that this interpretation

h = nn.Sequential(nn.ReLU(), nn.ReLU())

is not what the authors are driving at, here is a script that counts the number of parameters in a network.
import torch.nn as nn
import numpy as np

activation = nn.ReLU

def count_params(module):
    # Total number of elements across all parameter tensors of the module.
    return np.sum([np.prod(x.shape) for x in module.parameters()])

# Two stacked fully connected layers, with and without an activation
# in between, plus a module made of activations only.
one_linear = nn.Sequential(nn.Linear(32, 10), nn.Linear(10, 1))
linear_act = nn.Sequential(nn.Linear(32, 10), activation(), nn.Linear(10, 1))
act_only = nn.Sequential(activation(), activation())

lin = count_params(one_linear)
t_lin = count_params(linear_act)
act = count_params(act_only)

print(f'Linear only: {lin}, Linear + Activation: {t_lin}, ' +
      f'Activation only: {act}')
[Out]: Linear only: 341, Linear + Activation: 341, Activation only: 0.0
The activation-function-only module has zero learnable parameters. Likewise, an activation function adds no parameters to the fully connected layers.
Update: Links to implementations
To confirm that this answer's interpretation is correct, here are links to example GAIL and GAN-GCL implementations:
- GAIL: discriminator prediction (the forward call), discriminator architecture (the ReLU net):
- GAN-GCL: discriminator prediction, discriminator architecture: