Thompson sampling with Bernoulli prior and non-binary reward update

Question

I am working on a problem in which I have to select the best server (level 1) to send a given request to. Each level 1 server in turn calls other servers (level 2) to complete the request. All level 1 servers are integrated with the same set of level 2 servers. For each request, I get either success or failure as a response.

For this, I am using Thompson Sampling with a Bernoulli reward model. On success, I use a reward of 1 and, on failure, a reward of 0. However, failures also come with an error. For some errors, it is evident that the cause is an issue at the level 1 server, so a reward of 0 makes sense. Other errors result from bad request data or from an issue at the level 2 servers. For these kinds of errors, we can neither penalize the level 1 server with a reward of 0 nor reward it with a value of 1.

Currently, I am using 0.5 as a reward for such cases.
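
To make the current setup concrete, here is a minimal sketch of it. It assumes a Beta(1, 1) prior on each level 1 server's success rate and a fractional update in which a reward r in [0, 1] adds r to the success count and 1 - r to the failure count. The class name, method names, and the uniform prior are illustrative assumptions, not taken from any library.

```python
import random


class BetaBernoulliThompson:
    """Thompson sampling over level 1 servers, one Beta posterior per server.

    Assumption: a uniform Beta(1, 1) prior for every server.
    """

    def __init__(self, n_servers, seed=None):
        self.alpha = [1.0] * n_servers  # prior + (fractional) success counts
        self.beta = [1.0] * n_servers   # prior + (fractional) failure counts
        self.rng = random.Random(seed)

    def select(self):
        # Draw one success-rate sample per server and pick the largest.
        samples = [
            self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, server, reward):
        # Assumed fractional update: reward 1 is a success, 0 a failure,
        # and 0.5 splits one observation evenly between the two counts.
        if not 0.0 <= reward <= 1.0:
            raise ValueError("reward must lie in [0, 1]")
        self.alpha[server] += reward
        self.beta[server] += 1.0 - reward

    def mean(self, server):
        # Posterior mean of the server's success rate.
        return self.alpha[server] / (self.alpha[server] + self.beta[server])


bandit = BetaBernoulliThompson(n_servers=3, seed=0)
chosen = bandit.select()
bandit.update(0, 1.0)  # success
bandit.update(0, 0.0)  # failure caused by the level 1 server
bandit.update(0, 0.5)  # failure not attributable to the level 1 server
```

Under this update, a reward of 0.5 still counts as one full observation: it pulls the server's posterior mean toward 0.5 and narrows the posterior, even though the outcome says nothing about the level 1 server.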

Searching online, I couldn't find any method or algorithm for calculating the reward in such cases in a principled (informed) way.

What would be a suitable way to calculate the reward in such cases?

Let me try to clarify. Is this a bandit problem, i.e. do you have only 1 state or multiple ones? From your description, it seems that you have to select actions only once at the beginning, i.e. you have to select only 1 server (from a pool of servers), and this selection represents an action. However, it's not fully clear what you mean by "These servers (level 1) in turn hit some other servers (level 2) to complete the request.". How do they "hit" other servers? Do they do it automatically, stochastically or what? — nbro, Nov 17 '20 at 10:53
Exactly, I have a collection of servers at level 1 from which I have to select one. I will send a request to the selected server. The level 1 server will do its processing and generate another request, which will be sent to a server at level 2. So the level 2 servers are the ultimate destination. The problem is to identify the best servers at level 1, i.e. the ones that will optimize my success rate. I want to reward the level 1 servers for outcomes they are responsible for (0 for failure and 1 for success), but there are cases where the level 1 server is not responsible for the failure, and I want to know how to calculate the reward in such cases. — PUNEET AGARWAL, Nov 18 '20 at 11:12
