Is this the only big difference? Or in other words, can I do the same implementation as in a simple MCTS with the 4 stages: selection, expansion, simulation and backpropagation, where the result of the simulation is the accumulated reward instead of a value between 0 and 1? How would the UCT selection be adjusted in this case?
No, this is not the only difference. The important differences between the "original UCT" (as described in the paper you linked), and the "standard UCT" (as typically implemented in game AI research) are nicely summarised in Subsection 2.4 of "On Monte Carlo Tree Search and Reinforcement Learning". I'd recommend reading the entire paper, or large parts of it, if you have the time, but to summarise the key differences:
- The original stores as many new nodes per iteration as memory limitations allow; the standard implementation only expands one node per iteration.
- The original uses transpositions; the standard implementation does not.
- The original uses a discount factor $\gamma$; the standard implementation does not (equivalent to picking $\gamma = 1$).
- The original takes into account at which point in time rewards were observed; the standard implementation does not.
That last point is the one your question mainly seems to be about. Your idea of using a standard implementation, but accumulating any intermediate rewards and treating them as one large terminal reward, would not be the same. Actually taking into account the times at which rewards were observed, as done in the original UCT, can lead to more accurate value estimates for nodes in the middle of the tree (or at least to finding them more quickly). Your idea would be equivalent to learning from Monte-Carlo backups in the Reinforcement Learning literature (e.g. Sutton and Barto's book), whereas the original UCT implementation would be more similar to learning from TD($0$) backups.
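To make that difference concrete, here is a rough Python sketch (function names are my own, not taken from either paper) contrasting the two backup styles for a single simulated episode: backing up one accumulated "terminal" result to every node on the path, versus giving each node the discounted return of only those rewards observed after it. Note that the time-aware variant below simply uses discounted suffix returns; it does not bootstrap from stored value estimates.

```python
def backup_accumulated(path, rewards):
    """Asker's idea: sum all intermediate rewards into one 'terminal'
    result and back up that same value to every node on the path."""
    total = sum(rewards)
    return {node: total for node in path}

def backup_time_aware(path, rewards, gamma=0.99):
    """Closer to the original UCT: each node is updated with the
    (discounted) return of only the rewards observed after it,
    so nodes at different depths receive different targets."""
    targets, ret = {}, 0.0
    # Walk backwards so each node only sees the rewards that followed it.
    for node, r in zip(reversed(path), reversed(rewards)):
        ret = r + gamma * ret
        targets[node] = ret
    return targets

# Example: three (state, action) nodes visited, one reward observed after each.
path = [("s0", "a0"), ("s1", "a1"), ("s2", "a2")]
rewards = [1.0, 0.0, 5.0]
print(backup_accumulated(path, rewards))  # every node gets 6.0
print(backup_time_aware(path, rewards))   # deeper nodes only see later rewards
```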
Note that when people talk about the "standard" implementation of UCT, it is very often in papers on two-player zero-sum games, in which all intermediate rewards have a value of $0$, discounting is considered unimportant (i.e. $\gamma = 1$), and only terminal game states have a non-zero reward. In that special case, the two ideas do become equivalent.
What does the UpdateValue function in line 12 do exactly? In the text it says it is used to adjust the state-action pair value at a given depth; will this be used for the selection? How is this calculated exactly?
This would be done in the "standard" way. The visit count is incremented, and $q$ could be added to a sum of all $q$ values ever observed during the search for that state-action pair, such that the average score can be computed as the sum of $q$ scores divided by visit count. This average score is the $\bar{X}$ used in Selection.
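As a small illustration, here is a minimal Python sketch (the naming is my own, not the paper's) of that bookkeeping: the update simply increments the visit count and accumulates a sum of observed $q$ values, and Selection uses the resulting average together with the usual UCB1 exploration term.

```python
import math

class EdgeStats:
    """Statistics stored for one state-action pair."""
    def __init__(self):
        self.visit_count = 0
        self.total_q = 0.0

    def update_value(self, q):
        """Record one observed return q for this state-action pair."""
        self.visit_count += 1
        self.total_q += q

    def average(self):
        """The X-bar used during Selection."""
        return self.total_q / self.visit_count

def ucb1_score(edge, parent_visits, c=math.sqrt(2)):
    """UCB1 value used to pick among the children of a node."""
    if edge.visit_count == 0:
        return float("inf")  # always try unvisited actions first
    return edge.average() + c * math.sqrt(math.log(parent_visits) / edge.visit_count)
```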
What is the depth parameter needed for? Is it related to the UpdateValue?
I think it's usually not really needed. In the paper, they describe that it can be used to cut episodes off at a certain depth. From the paper:
"This can be the reach of a terminal state, or episodes can be cut at a certain depth (line 8)."