MDPs are strict generalisations of contextual bandits, adding time steps and state transitions, plus the concept of return as a measure of agent performance.
Therefore, methods used in RL to solve MDPs will also work to solve contextual bandits. You can either treat a contextual bandit as a series of 1-step episodes (with each start state chosen randomly), or as a continuing problem with discount factor zero.
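As a concrete illustration of the first option, here is a minimal sketch of a contextual bandit wrapped as a series of 1-step episodes behind a Gym-style `reset`/`step` interface. The `ContextualBanditEnv` class, its reward table and its noise model are hypothetical, just to make the structure explicit:

```python
import numpy as np

class ContextualBanditEnv:
    """Hypothetical contextual bandit exposed as a 1-step episodic environment.

    Each episode: sample a random context (the start state), take one action,
    receive a reward, and terminate immediately.
    """

    def __init__(self, n_contexts=10, n_actions=4, seed=0):
        self.rng = np.random.default_rng(seed)
        # Mean reward for each (context, action) pair, unknown to the agent
        self.mean_rewards = self.rng.uniform(0.0, 1.0, size=(n_contexts, n_actions))
        self.n_contexts = n_contexts
        self.n_actions = n_actions
        self.context = None

    def reset(self):
        # Start state (context) chosen randomly, as described above
        self.context = int(self.rng.integers(self.n_contexts))
        return self.context

    def step(self, action):
        reward = self.mean_rewards[self.context, action] + self.rng.normal(0.0, 0.1)
        done = True  # 1-step episode: always terminal after a single action
        return self.context, float(reward), done, {}
```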
Can I use the DQN algorithm with the TD target being only the observed reward, instead of the reward plus the discounted predicted value of the next state?
Yes. That's identical mathematically to having a discount of zero, or having 1-step episodes.
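To make that concrete, here is a minimal sketch of the two target calculations side by side, assuming a PyTorch-style `target_net` that maps a batch of states to per-action Q values (the names here are illustrative, not from any particular library):

```python
import torch

def dqn_targets(rewards, next_states, dones, target_net, gamma=0.99):
    # Standard DQN target: r + gamma * max_a' Q_target(s', a') for non-terminal steps
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q

def bandit_targets(rewards):
    # Contextual bandit case: every step is terminal (equivalently, gamma = 0),
    # so the target collapses to the observed reward alone
    return rewards
```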
Can I use a policy gradient algorithm, like REINFORCE or A2C? If yes, should I use a baseline, and what should this baseline be?
Yes. Once converted to an MDP, you can use the same baselines in these algorithms as normal (A2C already uses a baseline: subtracting the state value from the return to form the advantage). Generally, adding a baseline helps reduce variance, so it may still help when applying RL to contextual bandit problems.
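As a sketch of what that looks like in the 1-step-episode view, here is tabular REINFORCE with a learned per-context baseline, written against the hypothetical wrapper sketched above; the learning rates and the running-mean baseline are illustrative choices, not a prescription:

```python
import numpy as np

def reinforce_bandit(env, n_steps=5000, lr=0.1, baseline_lr=0.01, seed=0):
    """Tabular REINFORCE with baseline on a 1-step-episode contextual bandit.

    `env` is assumed to follow the hypothetical wrapper above:
    reset() -> context index, step(a) -> (context, reward, done, info).
    """
    rng = np.random.default_rng(seed)
    # Softmax policy: one preference value per (context, action) pair
    prefs = np.zeros((env.n_contexts, env.n_actions))
    # Baseline: running estimate of the mean reward in each context
    baseline = np.zeros(env.n_contexts)

    for _ in range(n_steps):
        s = env.reset()
        probs = np.exp(prefs[s] - prefs[s].max())
        probs /= probs.sum()
        a = rng.choice(env.n_actions, p=probs)
        _, r, _, _ = env.step(a)

        # In a 1-step episode the return is just r; subtracting the baseline
        # reduces variance without biasing the gradient estimate.
        advantage = r - baseline[s]
        grad_log_pi = -probs      # d log pi(a|s) / d prefs[s, b] = 1[a=b] - pi(b|s)
        grad_log_pi[a] += 1.0
        prefs[s] += lr * advantage * grad_log_pi

        # Move the baseline towards the observed reward
        baseline[s] += baseline_lr * (r - baseline[s])

    return prefs, baseline
```

With 1-step episodes and an average-reward baseline, this is essentially the gradient bandit update run separately per context, which is the point made in the list further down.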
I have seen in the literature that there are algorithms designed specifically for contextual bandits, such as LinUCB, LinRel and NeuralBandit, and I am wondering why DQN, A2C and REINFORCE, which seem to work well in the MDP setting, are not used for contextual bandits.
There are a few reasons why contextual bandit problems are not solved using general RL techniques more often:
- Work on contextual bandits is commonly focused on creating a highly efficient online learner that minimises regret. Regret is the cumulative difference in total reward between always exploiting the best action choice and the actions actually taken during the exploration necessary to find it (see the formula just after this list). Some RL solvers - e.g. DQN - are poor by this metric.
- The absence of timesteps and state transitions can be exploited in algorithm design to make learning more efficient.
- Improvements to RL methods designed to help with sparse rewards and the credit assignment problem in MDPs are pointless for contextual bandits, and may be wasteful or even counter-productive.
- Some RL algorithms do reduce to near-identical versions of their contextual bandit counterparts, and have the same performance characteristics - e.g. REINFORCE with baseline on 1-step episodes is essentially the Contextual Gradient Bandit algorithm, as sketched above.
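For concreteness, the regret mentioned in the first point can be written as

$$\text{Regret}(T) = \sum_{t=1}^{T} \Big( \max_{a} r(x_t, a) - r(x_t, a_t) \Big)$$

where $x_t$ is the context presented at round $t$, $a_t$ is the action the learner chose, and $r(x, a)$ is the unknown expected reward of action $a$ in context $x$ (this notation is mine, not taken from any specific paper).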
It is also worth noting that many problem domains where contextual bandit algorithms do well - e.g. website recommendations and advertising - have research showing that a more sophisticated MDP model and an RL-like approach can do even better. Although that is not quite the same as your question, it typically means extending the model so that timesteps and state transitions are meaningful.