I won't go into whether the ELBO equation you wrote is correct for now. Nevertheless, it's true that the ELBO can be rewritten in different ways (e.g. if you expand the KL divergence below by applying its definition, you will end up with a different but equivalent version of the ELBO). I will use the most common (and, at least to me, most intuitive) form, which you can find in the paper that originally introduced the VAE, and which you should use as a reference when you are confused (although that paper may require at least two readings before you fully understand it).
Here's the most common form of the ELBO (for the VAE), which immediately explains the common VAE implementations (including yours):
$$\mathcal{L}\left(\boldsymbol{\theta}, \boldsymbol{\phi} ; \mathbf{x}^{(i)}\right)=
\underbrace{
\color{red}{
-D_{K L}\left(q_{\boldsymbol{\phi}}\left(\mathbf{z} \mid \mathbf{x}^{(i)}\right) \| p_{\boldsymbol{\theta}}(\mathbf{z})\right)
}
}_{\text{KL divergence}}
+
\underbrace{
\color{green}{
\mathbb{E}_{q_{\boldsymbol{\phi}}\left(\mathbf{z} \mid \mathbf{x}^{(i)}\right)}\left[\color{blue}{ \log p_{\boldsymbol{\theta}}\left(\mathbf{x}^{(i)} \mid \mathbf{z}\right)}\right]
}
}_{\text{Expected log-likelihood}},
$$
where
$p_{\boldsymbol{\theta}}(\mathbf{z})$ is the prior over $\mathbf{z}$ (the latent variable); in practice, the prior is not actually parametrized by $\boldsymbol{\theta}$ (it's just a Gaussian with zero mean and unit variance, or whatever else your assumptions about $\mathbf{z}$ suggest), but in the paper they assume that $\mathbf{z}$ depends on $\boldsymbol{\theta}$ (see figure 1);
$q_{\boldsymbol{\phi}}\left(\mathbf{z} \mid \mathbf{x}^{(i)}\right)$ is the encoder, parametrized by $\boldsymbol{\phi}$;
$p_{\boldsymbol{\theta}}\left(\mathbf{x}^{(i)} \mid \mathbf{z}\right)$ is the decoder, parametrized by $\boldsymbol{\theta}$.
If you assume that $p_{\boldsymbol{\theta}}(\mathbf{z})$ and $q_{\boldsymbol{\phi}}\left(\mathbf{z} \mid \mathbf{x}^{(i)}\right)$ are Gaussian distributions, then it turns out that the KL divergence has an analytical form, which is also derived in appendix B of the VAE paper (here is a detailed derivation):
$$
\color{red}{
-D_{K L}\left(q_{\boldsymbol{\phi}}\left(\mathbf{z} \mid \mathbf{x}^{(i)}\right) \| p_{\boldsymbol{\theta}}(\mathbf{z})\right)}
=
\color{red}{
\frac{1}{2} \sum_{j=1}^{J}\left(1+\log \left(\left(\sigma_{j}\right)^{2}\right)-\left(\mu_{j}\right)^{2}-\left(\sigma_{j}\right)^{2}\right)
}
$$
Hence this implementation, which should be equivalent to your term $0.5*(1+Z_{\sigma}-Z_{\mu}^2-\exp(Z_\sigma))$, provided that your $Z_\sigma$ is the log-variance, i.e. $Z_\sigma = \log\left(\sigma^2\right)$ (which is what most implementations output, for numerical stability), so that $\exp(Z_\sigma) = \sigma^2$.
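For concreteness, here is a minimal sketch of that analytic KL term in plain NumPy (the variable names, and the assumption that the encoder outputs the mean and the log-variance of a diagonal Gaussian, are mine):

```python
import numpy as np

def neg_kl_term(mu, log_var):
    """-D_KL(q(z|x) || p(z)) for a diagonal Gaussian encoder with
    mean `mu` and log-variance `log_var`, and a standard normal prior
    (the closed form from Appendix B of the VAE paper)."""
    return 0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# Example for a single input x^(i) with a 2-dimensional latent space
mu = np.array([0.1, -0.3])        # Z_mu in your notation (an assumption)
log_var = np.array([-0.2, 0.05])  # Z_sigma, assuming it is log(sigma^2)
print(neg_kl_term(mu, log_var))   # the (red) KL term of the ELBO
```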
The expected log-likelihood is actually an expectation, so, in general, you cannot compute it exactly. Hence, you can approximate it with Monte Carlo sampling (i.e. sample averages: remember the law of large numbers?). More concretely, if you assume a Bernoulli likelihood, i.e. that $p_{\boldsymbol{\theta}}\left(\mathbf{x}^{(i)} \mid \mathbf{z}\right)$ is a Bernoulli distribution, then its log-density is (again from the VAE paper, Appendix C.1)
$$
\color{blue}{ \log p(\mathbf{x} \mid \mathbf{z})}= \color{blue}{ \sum_{i=1}^{D}\left( x_{i} \log y_{i}+\left(1-x_{i}\right) \log \left(1-y_{i}\right) \right) }\tag{1}\label{1},
$$
where $\mathbf{y}$ is the output of the decoder (i.e. the reconstruction/generation of the original input).
This formula should look very familiar if you know the cross-entropy. In fact, minimizing the cross-entropy is equivalent to maximizing the log-likelihood (this may still be confusing because of the flipped signs in the ELBO above, but just remember that maximizing a function is equivalent to minimizing its negative!). Hence this loss.
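As an illustration, here is a hedged sketch of equation \ref{1} in NumPy (the variable names and the small constant `eps`, which I add to avoid $\log(0)$, are mine, not the paper's):

```python
import numpy as np

def bernoulli_log_likelihood(x, y, eps=1e-8):
    """log p(x | z) for a Bernoulli decoder (equation (1) above):
    x is the (binarized) input, y is the decoder's output (probabilities)."""
    return np.sum(x * np.log(y + eps) + (1.0 - x) * np.log(1.0 - y + eps))

x = np.array([1.0, 0.0, 1.0, 1.0])     # original input x^(i)
y = np.array([0.9, 0.2, 0.8, 0.7])     # reconstruction from the decoder
print(bernoulli_log_likelihood(x, y))  # maximizing this = minimizing the BCE
```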
To answer/address some of your questions/doubts directly:
> how $(0.5*(1+Z_{\sigma}-Z_{\mu}^2-exp(Z_\sigma)))$ equates to the entropy?
I answered this above. That's just the analytical expression for the KL divergence (by the way, the KL divergence is also known as relative entropy). See this answer for a derivation.
> It still seems a bit off because you're getting the cross entropy of $Y$ with it's reconstruction, not the Gaussian distribution for learning latents $Z$.
As you can see from the definition of the ELBO above (from the VAE paper), the expected log-likelihood is, as the name suggests, an expectation with respect to the encoder (i.e. the Gaussian, in case you choose a Gaussian encoder). However, the equivalence with the cross-entropy concerns the log-likelihood, which is the term inside the expectation: once you have sampled from the encoder, you just need to compute that inner term (i.e. the cross-entropy). Your term $\text{CrossEntropy}(\hat{Y}, Y)$ represents the cross-entropy after you have sampled a latent variable from the encoder (i.e. the Gaussian); otherwise, you could not have obtained the reconstruction $\hat{Y}$ (the reconstruction depends on this latent variable, see figure 1).
Note that there is no expectation in equation \ref{1}. In fact, in the implementations, you often just sample once from the encoder and then immediately compute the ELBO. I have seen this also in implementations of Bayesian neural networks (basically, normal neural networks trained with the same principles as the VAE). However, in principle, you could sample multiple times from the Gaussian encoder, compute \ref{1} for each sample, and then average the results, to obtain a better approximation of the expected log-likelihood (see the sketch below).
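Here is what that could look like, again as a plain-NumPy sketch: the `decoder` below is a hypothetical placeholder for the actual decoder network, and the sampling uses the reparameterization trick $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder(z):
    """Hypothetical stand-in for the decoder network: it just maps a
    latent sample z to a vector of probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z.sum() * np.ones(4)))

def expected_log_likelihood(x, mu, log_var, num_samples=1):
    """Monte Carlo estimate of E_{q(z|x)}[log p(x|z)], using the
    reparameterization trick z = mu + sigma * eps, eps ~ N(0, I).
    num_samples=1 is what most implementations use."""
    sigma = np.exp(0.5 * log_var)
    total = 0.0
    for _ in range(num_samples):
        eps = rng.standard_normal(mu.shape)
        z = mu + sigma * eps             # one sample from the encoder
        y = decoder(z)                   # reconstruction for this sample
        total += np.sum(x * np.log(y) + (1.0 - x) * np.log(1.0 - y))
    return total / num_samples           # average of the single-sample estimates

x = np.array([1.0, 0.0, 1.0, 1.0])
mu, log_var = np.array([0.1, -0.3]), np.array([-0.2, 0.05])
print(expected_log_likelihood(x, mu, log_var, num_samples=10))
```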
Hopefully, at least some of the information in this answer is clear. Honestly, I don't have much time to write a better answer now (maybe I will come back later). In any case, I think you can find the answers to all of your questions/doubts in the VAE paper (although, as I said, you may need to read it at least twice to fully understand it).
By the way, the simplest/cleanest implementation of the VAE that I have found so far is this one.
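If it helps, here is how the two (colored) terms above combine into the loss that such implementations typically minimize, i.e. the negative ELBO, after a single sample has been taken from the encoder (a sketch under the same Gaussian-encoder/Bernoulli-decoder assumptions, with my own variable names):

```python
import numpy as np

def negative_elbo(x, y, mu, log_var, eps=1e-8):
    """The quantity minimized in practice: -ELBO = KL + BCE, i.e. the
    negative of the expression at the top of this answer, after a single
    Monte Carlo sample has been taken (y = decoder(z))."""
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))             # D_KL(q || p)
    bce = -np.sum(x * np.log(y + eps) + (1.0 - x) * np.log(1.0 - y + eps))  # -log p(x|z)
    return kl + bce
```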