Consider the following problem: given a vector x of size dim with values between 0 and 1 (exclusive), determine if max(0.05 / x) > 1.
Obviously, this is a trivial problem to solve with conventional means.
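(To be concrete, here is roughly what I mean by conventional means. This is just an illustrative NumPy sketch, with the helper name label chosen for illustration; it uses the fact that every entry of x is strictly positive, so the condition is the same as min(x) < 0.05.)
import numpy as np
def label(x):
    # x: 1-D array with entries strictly in (0, 1)
    # max(0.05 / x) > 1 holds exactly when at least one entry of x is below 0.05
    return np.max(0.05 / x) > 1.0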
However, what strikes me as interesting is that I have not been able to get good results training a neural network to solve this problem. Even with essentially unlimited training data, I struggle to achieve an accuracy approaching 100%. I would have expected that I could get 100% accuracy (or very close to it) for a problem like this, and in fact for my real-world application, accuracy of close to 100% is very important.
Here's an example implementation of an MLP to solve the problem, with dim=20. Note that the network has a very large number of parameters (~150k), but increasing the parameter count to 1M or higher doesn't seem to improve things very much. For example, increasing the number of units per layer from 200 to 500 brings the accuracy up to ~98%, but using a 1M-parameter network for this problem feels like incredible overkill, and the performance is still far below what I need. I need accuracy in the range of 99.9999% or better, which doesn't seem like too much to ask for a problem this simple.
import tensorflow as tf
# 1000000 data points is arbitrary; you could increase this to as large a number as you like,
# or imagine a generator that gives a new set of data points for every training batch.
n, dim, validation_split = 1000000, 20, 0.1
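# minval is strictly positive so that 0.05 / x in the label computation never divides by zero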
x = tf.random.uniform(shape=(n, dim), minval=0.00000001, maxval=1.0)
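# label is True iff max(0.05 / x) > 1, i.e. iff at least one component of x is below 0.05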
y = tf.reduce_max(0.05 / x, axis=-1) > 1.0
initial_learning_rate = 0.001
act, units, depth = 'relu', 200, 5
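# 5 hidden Dense layers of 200 ReLU units each, giving roughly the ~150k parameters mentioned above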
inp = tf.keras.layers.Input((dim,))
net = inp
for _ in range(depth):
    net = tf.keras.layers.Dense(units, activation=act)(net)
    net = tf.keras.layers.BatchNormalization()(net)
net = tf.keras.layers.Dense(1, activation="sigmoid")(net)
model = tf.keras.models.Model(inp, net)
model.summary()
model.compile(
    optimizer=tf.keras.optimizers.Adam(initial_learning_rate),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.BinaryAccuracy()]
)
stopping_callback = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)
reduce_lr_callback = tf.keras.callbacks.ReduceLROnPlateau(monitor="loss", factor=0.1**0.5, patience=5,
                                                          verbose=1, min_lr=1e-4)
h = model.fit(x, y, verbose=1,
              batch_size=1024, epochs=10000, validation_split=validation_split,
              callbacks=[stopping_callback, reduce_lr_callback])
n_val = int(n * validation_split)
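# x[:n_val] was seen during training; x[-n_val:] is the held-out validation split (Keras takes validation data from the end)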
print([model.evaluate(x[:n_val], y[:n_val]),
       model.evaluate(x[-n_val:], y[-n_val:])])
Important note: I have a real application in mind - this problem is a simple distillation of the kind of problem I need to solve with a neural network. So a bespoke solution which does well on this one problem isn't so interesting. What I'm interested in is insight into why this kind of problem is hard for this kind of NN, and some general strategies that I could use to drastically improve the performance.
It's worth mentioning that I have tried many permutations of learning rate, batch size, etc.