Let's look at the definition of the gradient:

In vector calculus, the gradient of a scalar-valued differentiable function $f$ of several variables is the vector field (or vector-valued function) $\nabla f$ whose value at a point $p$ is the vector whose components are the partial derivatives of $f$ at $p$. That is, for $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$, its gradient $\nabla f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{n}$ is defined at the point $p=\left(x_{1}, \ldots, x_{n}\right)$ in $n$-dimensional space as the vector:
$$
\nabla f(p)=\left[\begin{array}{c}
\frac{\partial f}{\partial x_{1}}(p) \\
\vdots \\
\frac{\partial f}{\partial x_{n}}(p)
\end{array}\right]
$$
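As a concrete instance of this definition (my own example, not part of the quoted text), take $f(x, y)=x^{2}+3 y$:
$$
\nabla f(x, y)=\left[\begin{array}{c}
\frac{\partial f}{\partial x}(x, y) \\
\frac{\partial f}{\partial y}(x, y)
\end{array}\right]=\left[\begin{array}{c}
2 x \\
3
\end{array}\right]
$$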
First of all, the gradient is not a single value or a single vector: it's an operator that, given a function, returns another function (note that in the definition $\nabla f$ maps from $\mathbb{R}^n$ back to $\mathbb{R}^n$), which can be used to compute a vector for each point of a field. So it doesn't really make sense to talk of a gradient direction per se, since the direction actually belongs to the individual vectors associated with each point of the field. How many components do these vectors have? Well, it depends on the field. A plane has 2 dimensions, hence 2 independent directions in which you can move; in the same way, $\mathbb{R}^n$ has $n$ dimensions, hence $n$ independent directions in which you can move.
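To make the operator view concrete, here is a minimal Python sketch (my own illustration: the names `gradient` and `grad_f` are invented, and finite differences stand in for the analytic partial derivatives):

```python
import numpy as np

# A minimal sketch of "gradient as an operator": it takes a function
# f: R^n -> R and returns another function grad_f: R^n -> R^n.
# Finite differences are used here purely for illustration.
def gradient(f, h=1e-6):
    def grad_f(p):
        p = np.asarray(p, dtype=float)
        g = np.zeros_like(p)
        for i in range(p.size):
            step = np.zeros_like(p)
            step[i] = h
            # central-difference approximation of the i-th partial derivative
            g[i] = (f(p + step) - f(p - step)) / (2 * h)
        return g
    return grad_f

f = lambda p: p[0] ** 2 + 3 * p[1]   # f: R^2 -> R

grad_f = gradient(f)                 # grad_f: R^2 -> R^2, i.e. a vector field
print(grad_f([1.0, 2.0]))            # approximately [2. 3.]
```

Note that `gradient(f)` by itself has no direction; only its value at a specific point, such as `grad_f([1.0, 2.0])`, is a vector with a direction and a magnitude.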
Note also that in gradient descent the gradient of the cost (loss) function $C$ is computed with respect to the weights:
$$w_{t+1} = w_t - \alpha \frac{\partial C}{\partial w}$$
Mathematically this means that:
- we have a set of weights $w$
- we use these weights to produce an output given a specific input
- we then compute an error (or cost, or loss) between the produced output and the desired output, i.e. we find at which point of the field of the cost function we end up due to our current weights
- we then compute the gradient of the cost function at that point, i.e. we evaluate the vector that the gradient operator associates with our current weights; this vector points in the direction in which the cost increases fastest, and its magnitude tells us how steep that increase is, and finally
- we update the weights by taking a small step (scaled by $\alpha$) in the direction opposite to that vector, so that the next pass with the updated weights lands at a point of the cost field with a lower cost (a minimal sketch of this loop follows the list).
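Here is a minimal numerical sketch of that loop, implementing the update rule $w_{t+1} = w_t - \alpha \frac{\partial C}{\partial w}$ on an assumed toy one-weight model with squared-error cost; every name and number below is my own illustration, not part of any particular framework:

```python
# Sketch of the update rule w_{t+1} = w_t - alpha * dC/dw on a toy problem:
# one weight w, input x = 1.0, target y = 2.0, cost C(w) = (w * x - y)**2.
x, y = 1.0, 2.0
w = 0.0        # initial weight
alpha = 0.1    # learning rate

for t in range(50):
    output = w * x                  # forward pass with the current weight
    grad = 2 * (output - y) * x     # dC/dw evaluated at the current weight
    w = w - alpha * grad            # step against the gradient direction

print(w)  # approaches 2.0, the weight that minimizes the cost
```

The minus sign in the update is what makes this a *descent*: the gradient at the current point shows the direction of steepest increase of the cost, and the weights are moved the opposite way.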