To model the problem with RL,
- it is possible to discretize the time horizon into very short time intervals (for example, 5 minutes per stage), such that in each interval at most one customer enters our system.
- On the other hand, stages could be defined as the times at which a customer enters our system.
1) Is the second approach an SMDP, and if I want to solve it with RL, should I use Hierarchical RL?
The second approach as you described it does not sound like an SMDP to me... it sounds more like a regular MDP in which the discrete time steps no longer have any meaningful connection to wall-clock time in the problem being modelled. This is the approach I would recommend if long periods of "inactivity" in between customers are irrelevant and/or if you do not expect your actions to have any influence on the duration of those periods of inactivity.
If you do expect your actions to have a meaningful influence on the duration of inactivity, I would not recommend this approach. For example, your actions may make a customer angry and reduce their likelihood of returning quickly. Or your actions may indirectly affect the likelihood of other customers visiting, due to interactions within your population of customers: they may or may not recommend your bank to their friends based on your actions.
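To make the second approach concrete, here is a minimal tabular Q-learning sketch in which one MDP step corresponds to one customer arrival. The environment interface (`reset()`, `step(action)`, `num_actions`) is hypothetical and stands in for whatever simulator or data pipeline you actually have; states are assumed to be hashable (e.g. tuples of customer features).

```python
import random
from collections import defaultdict

def q_learning_per_customer(env, num_episodes=1000, alpha=0.1,
                            gamma=0.99, epsilon=0.1):
    """Tabular Q-learning where one MDP step = one customer arrival.

    `env` is a hypothetical environment whose step(action) internally
    fast-forwards real time to the next customer arrival, so idle periods
    never show up as separate time steps.
    """
    Q = defaultdict(lambda: [0.0] * env.num_actions)

    for _ in range(num_episodes):
        state = env.reset()              # state of the first arriving customer
        done = False
        while not done:
            # epsilon-greedy action selection for the current customer
            if random.random() < epsilon:
                action = random.randrange(env.num_actions)
            else:
                action = max(range(env.num_actions),
                             key=lambda a: Q[state][a])

            # step() only returns once the *next* customer has arrived
            next_state, reward, done = env.step(action)

            # standard one-step Q-learning update
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state

    return Q
```

Note that a fixed $\gamma$ here discounts per customer rather than per unit of real time, which is exactly the sense in which the MDP's time steps lose their connection to wall-clock time.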
2) In the first approach, if a customer enters during a time interval, it is easy to update the $Q$-values. However, what should we do if we are in state $S$ and take action $A$, but no customer enters our system, so we receive no reward for the pair $(S, A)$? There would be no difference whether we took action $A_1$, $A_2$, and so on. This can happen for several consecutive time intervals. I think it becomes even more challenging when we consider eligibility traces.
I would actually argue that your first approach can be treated as an SMDP, and that it may be a good idea to do so (personally I prefer the Options framework, which is a bit more general than SMDPs, a bit more flexible, and a bit closer to "standard" RL terminology; plenty of free versions of the original Options paper by Sutton, Precup, and Singh are available through Google too).
Whenever you encounter a customer, you could select an option (or "macro-action") in the SMDP, which executes the one "real" action for your current customer once, and is subsequently forced to automatically select no-op actions (which do nothing) until the next customer arrives; that is the point in time where your previous option ends, and a new option can be selected.
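A minimal sketch of such an option, assuming a state object with a `customer_present` flag and a designated no-op action index (all hypothetical names, not taken from any particular library):

```python
from dataclasses import dataclass

NO_OP = 0  # hypothetical index of the "do nothing" primitive action

@dataclass(frozen=True)
class ServeCustomerOption:
    """Macro-action: apply one real action to the customer that just
    arrived, then take forced no-ops until the next customer arrives."""
    real_action: int

    def can_start(self, state) -> bool:
        # The option is only available when a customer is present.
        return state.customer_present

    def act(self, state, steps_elapsed: int) -> int:
        # First primitive step: the chosen real action; afterwards, no-ops.
        return self.real_action if steps_elapsed == 0 else NO_OP

    def terminates(self, state, steps_elapsed: int) -> bool:
        # The option ends exactly when the next customer shows up.
        return steps_elapsed > 0 and state.customer_present
```

Making the dataclass frozen keeps it hashable, so an option can later be used directly as part of a $Q$-table key.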
An RL algorithm that operates at the SMDP / hierarchical level, i.e. one that only learns to optimize the selection of options / macro-actions and skips over all the forced no-op actions entirely, would suffice. This is basically the approach described in *Concurrent Reinforcement Learning from Customer Interactions* (Silver et al., 2013).
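Concretely, such an SMDP-level agent executes an option to completion, accumulates the $\gamma$-discounted rewards of all primitive 5-minute steps the option spanned, and then performs a single update. Below is a hedged sketch of that update, reusing the hypothetical interfaces from the earlier snippets (note that `env.step` here advances one primitive 5-minute interval, unlike the event-driven sketch above):

```python
def run_option_and_update(Q, env, state, option, options,
                          alpha=0.1, gamma=0.99):
    """Run one option to completion and apply the SMDP Q-learning update

        Q(s, o) <- Q(s, o) + alpha * (R + gamma^tau * max_o' Q(s', o') - Q(s, o))

    where R is the gamma-discounted sum of all primitive rewards collected
    while the option ran (the one real action plus the forced no-ops), and
    tau is the number of primitive 5-minute steps it lasted.

    Q is a plain dict mapping (state, option) pairs to values; states and
    options are assumed hashable.
    """
    R, discount, tau = 0.0, 1.0, 0
    s, done = state, False

    while True:
        a = option.act(s, tau)            # real action first, then no-ops
        s, r, done = env.step(a)          # one primitive 5-minute step
        R += discount * r                 # accumulate discounted reward
        discount *= gamma
        tau += 1
        if done or option.terminates(s, tau):
            break

    # Bootstrap from the best option in the state where the next customer
    # arrived; gamma^tau equals `discount` at this point.
    best_next = 0.0 if done else max(Q.get((s, o), 0.0) for o in options)
    old = Q.get((state, option), 0.0)
    Q[(state, option)] = old + alpha * (R + discount * best_next - old)
    return s, done
```

Because this update accumulates the discounted rewards of the empty intervals and bootstraps with $\gamma^\tau$, the consecutive customer-free intervals from question 2 never require $Q$-value updates of their own, and no eligibility traces over the forced no-ops are needed.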