Assume $\mathbf{X} \in R^{N, C}$ is the input of the softmax and $\mathbf{P} \in R^{N, C}$ is its output, where $N$ is the number of examples and $C$ is the number of classes:
$$\mathbf{p}_i = \left[ \frac{e^{x_{ik}}}{\sum_{j=1}^C e^{x_{ij}}}\right]_{k=1,2,...C} \in R^{C} \mbox{ is a row vector of } \mathbf{P}$$
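For concreteness, here is a minimal NumPy sketch of this row-wise softmax (the max-subtraction is only for numerical stability and cancels in the ratio):

```python
import numpy as np

def softmax(X):
    """Row-wise softmax: X has shape (N, C), one row p_i per example."""
    Z = np.exp(X - X.max(axis=1, keepdims=True))  # shift by row max for stability
    return Z / Z.sum(axis=1, keepdims=True)
```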
Consider the $i$-th example. Because the softmax function $\mathbf{p}:R^C \mapsto R^C$ is a vector-to-vector mapping (the subscript $i$ is dropped to ease notation), its derivative is the Jacobian matrix $\mathbf{J}$:
$$\mathbf{J}_{\mathbf{p}}(\mathbf{x}) = \left[ \frac{\partial \mathbf{p}}{\partial x_1}, \frac{\partial \mathbf{p}}{\partial x_2}, ..., \frac{\partial \mathbf{p}}{\partial x_C} \right] = \begin{bmatrix} \frac{\partial p_1}{\partial x_1} & \frac{\partial p_1}{\partial x_2} & \dots & \frac{\partial p_1}{\partial x_C} \\ \frac{\partial p_2}{\partial x_1} & \frac{\partial p_2}{\partial x_2} & \dots & \frac{\partial p_2}{\partial x_C} \\ \dots & \dots & \dots & \dots \\ \frac{\partial p_C}{\partial x_1} & \frac{\partial p_C}{\partial x_2} & \dots & \frac{\partial p_C}{\partial x_C} \end{bmatrix} \in R^{C, C} $$
$\mathbf{J}_{\mathbf{p}}(\mathbf{x})$ is called the derivative of the vector ${\mathbf{p}}$ with respect to the vector $\mathbf{x}$.
$$\mbox{1) Derivative on the diagonal: }\frac{\partial p_{k}}{\partial x_{k}} = \frac{\partial}{\partial x_{k}}\left( \frac{e^{x_k}}{\sum_{j=1}^C e^{x_j}} \right)$$
$$ = \frac{\left( \frac{\partial e^{x_k}}{\partial x_k} \right)\sum_{j=1}^C e^{x_j} - e^{x_k}\left(\frac{\partial \sum_{j=1}^C e^{x_j}}{\partial x_k} \right)}{\left(\sum_{j=1}^C e^{x_j}\right)^2} \mbox{ (Quotient rule) }$$
$$ = \frac{e^{x_k}\sum_{j=1}^C e^{x_j} - e^{x_k} e^{x_k}}{\left(\sum_{j=1}^C e^{x_j}\right)^2} = \frac{e^{x_k}(\sum_{j=1}^C e^{x_j} - e^{x_k})}{\left(\sum_{j=1}^C e^{x_j}\right)^2}$$
$$ = \frac{e^{x_k}}{\sum_{j=1}^C e^{x_j}} \left(\frac{\sum_{j=1}^C e^{x_j} - e^{x_k}}{\sum_{j=1}^C e^{x_j}}\right) = \frac{e^{x_k}}{\sum_{j=1}^C e^{x_j}} \left(1 - \frac{e^{x_k}}{\sum_{j=1}^C e^{x_j}}\right)$$
$$\rightarrow \frac{\partial p_{k}}{\partial x_{k}} = p_{k}(1-p_{k})$$
$$\mbox{2) Derivative off the diagonal, } k \neq c \mbox{: } \frac{\partial p_{k}}{\partial x_{c}} = \frac{\partial}{\partial x_{c}}\left( \frac{e^{x_k}}{\sum_{j=1}^C e^{x_j}} \right)$$
$$ = \frac{\left( \frac{\partial e^{x_k}}{\partial x_c} \right)\sum_{j=1}^C e^{x_j} - e^{x_k}\left(\frac{\partial \sum_{j=1}^C e^{x_j}}{\partial x_c} \right)}{\left(\sum_{j=1}^C e^{x_j}\right)^2} \mbox{ (Quotient rule) }$$
$$ = \frac{0 - e^{x_k} e^{x_c}}{\left(\sum_{j=1}^C e^{x_j}\right)^2} = -\frac{e^{x_k}}{\sum_{j=1}^C e^{x_j}}\frac{e^{x_c}}{\sum_{j=1}^C e^{x_j}}$$
$$\rightarrow \frac{\partial p_{k}}{\partial x_{c}} = -p_{k}p_{c}$$
$$\rightarrow \mathbf{J}_{\mathbf{p}}(\mathbf{x}) = \begin{bmatrix} p_1(1-p_1) & -p_1p_2 & \dots & -p_1p_C \\ -p_2p_1 & p_2(1-p_2) & \dots & -p_2p_C \\ \vdots & \vdots & \ddots & \vdots \\ -p_Cp_1 & -p_Cp_2 & \dots & p_C(1-p_C) \end{bmatrix}$$
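In matrix form this is $\mathbf{J}_{\mathbf{p}}(\mathbf{x}) = \operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T$. A quick finite-difference check of this Jacobian (a sketch reusing the `softmax` helper above; the test data are arbitrary):

```python
def softmax_jacobian(x):
    """Jacobian J[k, c] = dp_k / dx_c for one example x of shape (C,): diag(p) - p p^T."""
    p = softmax(x[None, :])[0]
    return np.diag(p) - np.outer(p, p)

# Compare against central finite differences, column by column.
rng = np.random.default_rng(0)
C = 5
x = rng.normal(size=C)
eps = 1e-6
J_fd = np.stack([(softmax((x + eps * e)[None, :])[0]
                  - softmax((x - eps * e)[None, :])[0]) / (2 * eps)
                 for e in np.eye(C)], axis=1)             # column c approximates dp/dx_c
print(np.allclose(softmax_jacobian(x), J_fd, atol=1e-6))  # True
```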
- Focal Loss: $\displaystyle FL = -\sum_{k=1}^C y_{k} \alpha_{k}(1-p_k)^\gamma \log (p_k)$
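A direct per-example implementation of this loss (a sketch; `y` is assumed one-hot and `alpha` a per-class weight vector, both illustrative names):

```python
def focal_loss(x, y, alpha, gamma=2.0):
    """FL = -sum_k y_k * alpha_k * (1 - p_k)^gamma * log(p_k) for one example x."""
    p = softmax(x[None, :])[0]
    return -np.sum(y * alpha * (1.0 - p) ** gamma * np.log(p))
```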
By the chain rule:
$$\nabla_{\mathbf{x}} FL = (\nabla_{\mathbf{p}}FL)^T \, \mathbf{J}_{\mathbf{p}}(\mathbf{x}) $$
$$\nabla_{\mathbf{p}} FL = \begin{bmatrix} \frac{\partial FL}{\partial p_1}\\ \frac{\partial FL}{\partial p_2}\\ \vdots \\ \frac{\partial FL}{\partial p_C} \end{bmatrix} \mbox{ where } \frac{\partial FL}{\partial p_k} = - y_k\alpha_k \left(-\gamma(1-p_k)^{\gamma-1} \log(p_k) + \frac{(1-p_k)^\gamma}{p_k} \right) = y_k \alpha_k\gamma(1-p_k)^{\gamma-1}\log(p_k) - y_k\alpha_k\frac{(1-p_k)^\gamma}{p_k}$$
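This derivative can be coded directly, treating $\mathbf{p}$ as free variables before the chain rule is applied (a sketch with the same illustrative names):

```python
def dFL_dp(p, y, alpha, gamma=2.0):
    """Elementwise dFL/dp_k = y_k*alpha_k*(gamma*(1-p_k)^(gamma-1)*log(p_k) - (1-p_k)^gamma/p_k)."""
    return y * alpha * (gamma * (1.0 - p) ** (gamma - 1) * np.log(p)
                        - (1.0 - p) ** gamma / p)
```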
$$\nabla_{\mathbf{x}} FL = \begin{bmatrix} \frac{\partial FL}{\partial p_1}\\ \frac{\partial FL}{\partial p_2}\\ \vdots \\ \frac{\partial FL}{\partial p_C} \end{bmatrix}^T \begin{bmatrix} \frac{\partial p_1}{\partial x_1} & \frac{\partial p_1}{\partial x_2} & \dots & \frac{\partial p_1}{\partial x_C} \\ \frac{\partial p_2}{\partial x_1} & \frac{\partial p_2}{\partial x_2} & \dots & \frac{\partial p_2}{\partial x_C} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial p_C}{\partial x_1} & \frac{\partial p_C}{\partial x_2} & \dots & \frac{\partial p_C}{\partial x_C} \end{bmatrix} = \begin{bmatrix} \sum_{k=1}^C \left(\frac{\partial FL}{\partial p_k}\frac{\partial p_k}{\partial x_1}\right)\\ \sum_{k=1}^C \left(\frac{\partial FL}{\partial p_k}\frac{\partial p_k}{\partial x_2}\right)\\ \vdots \\ \sum_{k=1}^C \left(\frac{\partial FL}{\partial p_k}\frac{\partial p_k}{\partial x_C}\right) \end{bmatrix}^T \in R^C $$
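Before looking at the individual terms, the chain rule itself can be verified numerically: multiplying $\nabla_{\mathbf{p}} FL$ by the Jacobian matches a finite-difference gradient of $FL$ with respect to $\mathbf{x}$ (a sketch reusing the helpers above; the example data are arbitrary):

```python
rng = np.random.default_rng(1)
C = 5
x = rng.normal(size=C)
y = np.eye(C)[2]                      # one-hot label for class 2 (arbitrary choice)
alpha = np.full(C, 0.25)
gamma = 2.0

p = softmax(x[None, :])[0]
grad_chain = softmax_jacobian(x).T @ dFL_dp(p, y, alpha, gamma)   # (J_p(x))^T dFL/dp

eps = 1e-6
grad_fd = np.array([(focal_loss(x + eps * e, y, alpha, gamma)
                     - focal_loss(x - eps * e, y, alpha, gamma)) / (2 * eps)
                    for e in np.eye(C)])
print(np.allclose(grad_chain, grad_fd, atol=1e-6))   # True: the chain-rule setup is fine
```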
$$\mbox{Case 1: }\displaystyle \frac{\partial FL}{\partial p_k}\frac{\partial p_k}{\partial x_k} \forall k=1,2,...,C$$
$$\frac{\partial FL}{\partial p_k} \frac{\partial p_k}{\partial x_k} = y_k \alpha_k\gamma(1-p_k)^{\gamma-1}\log(p_k)p_k(1-p_k) - y_k\alpha_k\frac{(1-p_k)^\gamma}{p_k}p_k(1-p_k)$$
$$ = y_k \alpha_k (1-p_k)^{\gamma}(\gamma p_k \log(p_k) - 1 + p_k) $$
$$\mbox{Case 2: } (k \neq c)\displaystyle \frac{\partial FL}{\partial p_k}\frac{\partial p_k}{\partial x_c}$$
$$\frac{\partial FL}{\partial p_k} \frac{\partial p_k}{\partial x_c} = - y_k\alpha_k\gamma(1-p_k)^{\gamma-1}\log(p_k)p_kp_c + y_k\alpha_k\frac{(1-p_k)^\gamma}{p_k}p_kp_c$$
$$ = - y_k\alpha_k (1-p_k)^{\gamma-1}p_c(\gamma p_k \log(p_k) - 1 + p_k) $$
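Both closed forms can be spot-checked against the raw products $\frac{\partial FL}{\partial p_k}\frac{\partial p_k}{\partial x_k}$ and $\frac{\partial FL}{\partial p_k}\frac{\partial p_k}{\partial x_c}$ (a sketch reusing the variables from the check above):

```python
g = dFL_dp(p, y, alpha, gamma)
J = softmax_jacobian(x)

# Case 1: diagonal terms.
case1 = y * alpha * (1.0 - p) ** gamma * (gamma * p * np.log(p) - 1.0 + p)
print(np.allclose(g * np.diag(J), case1))            # True

# Case 2: one off-diagonal term, k = true class, c != k.
k, c = 2, 0
case2 = -y[k] * alpha[k] * (1.0 - p[k]) ** (gamma - 1) * p[c] \
        * (gamma * p[k] * np.log(p[k]) - 1.0 + p[k])
print(np.isclose(g[k] * J[k, c], case2))             # True
```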
$$\mbox{For each } d=1,2,...,C \mbox{ : }\sum_{k=1}^C \left(\frac{\partial FL}{\partial p_k} \frac{\partial p_k}{\partial x_d}\right) = y_d \alpha_d (1-p_d)^{\gamma}(\gamma p_d \log(p_d) - 1 + p_d) + \sum_{c \neq d}^C \left( - y_d\alpha_d (1-p_d)^{\gamma-1}p_c(\gamma p_d \log(p_d) - 1 + p_d) \right) = y_d\alpha_d(1-p_d)^{\gamma-1}(\gamma p_d \log(p_d) - 1 + p_d)\left(1-p_d -\sum_{c \neq d}^C(p_c)\right) $$
$$\rightarrow \nabla_{\mathbf{x}} FL = \left[ y_d\alpha_d(1-p_d)^{\gamma-1}(\gamma p_d \log(p_d) - 1 + p_d)\left(1-p_d -\sum_{c \neq d}^C(p_c)\right) \right]_{d=1,2,...,C}$$
However, the problem is that $\left(1-p_d -\sum_{c \neq d}^C p_c\right) = 0$ (because the probabilities sum to $1$), so the whole expression collapses to $0$.
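Numerically, the derived closed form is indeed identically zero while the finite-difference gradient of the loss is not, which is exactly the contradiction (a sketch reusing the variables above):

```python
bracket = 1.0 - p - (p.sum() - p)          # 1 - p_d - sum_{c != d} p_c, zero for every d
formula = y * alpha * (1.0 - p) ** (gamma - 1) \
          * (gamma * p * np.log(p) - 1.0 + p) * bracket
print(np.allclose(formula, 0.0))           # True: the derived expression collapses to 0
print(np.allclose(grad_fd, 0.0))           # False: the true gradient is clearly nonzero
```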
Is there anything wrong with my focal loss derivation?