The vector functions for true positives, false positives etc. all make use of the "magic" numbers $0$ and $1$ to represent Boolean values. They are convenience functions that work well in a numerical library, but you do need to stay aware of the fundamentally Boolean nature of the data. The $0$ and $1$ values make the arithmetic for calculating TP et al. possible, but they are not fundamental to it; they are just a representation.
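For concreteness, here is a minimal sketch of the kind of arithmetic that the $0$/$1$ representation enables (NumPy; the vectors and the names `pred`/`truth` are purely illustrative, not taken from your code):

```python
import numpy as np

# Illustrative 0/1 vectors: "pred" is the thresholded classifier output,
# "truth" is the ground-truth label vector.
pred  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
truth = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# The 0/1 representation lets plain arithmetic count the four outcomes:
tp = np.sum(pred * truth)              # predicted 1, actually 1
fp = np.sum(pred * (1 - truth))        # predicted 1, actually 0
fn = np.sum((1 - pred) * truth)        # predicted 0, actually 1
tn = np.sum((1 - pred) * (1 - truth))  # predicted 0, actually 0

print(tp, fp, fn, tn)  # 3 1 1 3
```

The multiplications only behave as logical AND/NOT because the data is Boolean underneath; the numbers themselves are just an encoding.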
Your derivations of the gradients for the functions you give look correct, barring the odd typo. However, the gradient does not really apply to the value of $\mathbf{y}$, because every component of $\mathbf{y}$ is either $0$ or $1$. The idea that you could increase a component of $\mathbf{y}$ slightly where $\mathbf{\hat{y}}$ is $1$ in order to increase the TP metric has no basis. The only change that genuinely improves the metric is to flip a false-negative component of $\mathbf{y}$ from exactly $0$ to exactly $1$.
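To make that concrete, assuming your derivation takes the usual vectorised form (adjust if yours differs):

$$\mathrm{TP}(\mathbf{y}, \mathbf{\hat{y}}) = \sum_i y_i \hat{y}_i \quad\Rightarrow\quad \frac{\partial\, \mathrm{TP}}{\partial y_i} = \hat{y}_i \in \{0, 1\}$$

The "gradient" is just a mask over the positive labels: it tells you *which* components of $\mathbf{y}$ matter, but since those components can only be flipped discretely it carries no information about how far to move them.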
You could probably still use your derivations as a gradient for optimisation (it would not be the only time in machine learning that something does not quite hold theoretically yet is still used in practice). However, you then immediately hit the problem that the values of $\mathbf{y}$ have been discretised to $0$ or $1$: that thresholding step is not differentiable, and it will prevent you from back-propagating your gradients to the neural network weights that you actually want to change. If you fix that follow-on problem using a smoother function (e.g. a sigmoid), then you are likely to end up with something close to either cross-entropy loss or the perceptron update step.
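To illustrate the "smoother function" route, here is a sketch (my own, not from your question) of a sigmoid-relaxed surrogate, sometimes described as a soft confusion matrix or soft-F1 loss; the function names, logits and labels are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_counts(logits, truth):
    """Differentiable ("soft") confusion-matrix counts.

    The hard 0/1 thresholding of the prediction is replaced by the
    sigmoid probability p, so TP/FP/FN become smooth functions of the
    network output and gradients can flow back to the weights.
    """
    p = sigmoid(logits)                  # in (0, 1) instead of {0, 1}
    soft_tp = np.sum(p * truth)
    soft_fp = np.sum(p * (1.0 - truth))
    soft_fn = np.sum((1.0 - p) * truth)
    return soft_tp, soft_fp, soft_fn

def soft_f1_loss(logits, truth, eps=1e-8):
    """1 - soft F1: a differentiable surrogate built from the soft counts."""
    tp, fp, fn = soft_counts(logits, truth)
    f1 = 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - f1

# Illustrative usage with made-up logits and labels
logits = np.array([2.0, -1.5, 0.3, -0.2])
truth  = np.array([1.0, 0.0, 1.0, 0.0])
print(soft_f1_loss(logits, truth))
```

Once the predictions are relaxed this way, the gradients push the sigmoid outputs towards the labels in broadly the same direction as cross-entropy does, which is the point above.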
In other words, although what you have been told is an over-simplification, you will not find a way to improve the performance of your classifier by adding cost functions based directly on TP, FP etc. That is, in effect, what binary cross-entropy loss is already doing. There are other, perhaps more fruitful avenues of investigation: hyperparameter searches, regularisation, ensembling, and, if you have an unbalanced data set, weighting the costs of the positive and negative classes.
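As an example of the last point, here is a minimal sketch of cost weighting in a binary cross-entropy loss (the function name, the `pos_weight` value and the data are illustrative; most frameworks expose an equivalent option directly):

```python
import numpy as np

def weighted_bce(p, truth, pos_weight=5.0, eps=1e-12):
    """Binary cross-entropy with a heavier penalty on the positive class.

    pos_weight > 1 makes errors on positive examples more costly, a common
    remedy for an unbalanced data set. The value 5.0 is arbitrary; it is
    often set near the negative-to-positive class ratio, or tuned as a
    hyperparameter.
    """
    p = np.clip(p, eps, 1.0 - eps)   # avoid log(0)
    per_example = -(pos_weight * truth * np.log(p)
                    + (1.0 - truth) * np.log(1.0 - p))
    return per_example.mean()

# Illustrative predicted probabilities and labels
p     = np.array([0.9, 0.2, 0.4, 0.1])
truth = np.array([1.0, 0.0, 1.0, 0.0])
print(weighted_bce(p, truth))
```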