
Let's take an ad recommendation problem with a single slot. The feedback is click/no-click. I can solve this with contextual bandits. But I can also introduce exploration into supervised learning: I retrain my model from the collected data every k hours.

What can contextual bandits give me in this example that supervised learning + exploration cannot?
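For concreteness, here is a rough sketch of the loop I have in mind (this is only illustrative: the helper names and the choice of a logistic-regression click model are my own assumptions, not part of any existing system):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def softmax(scores, temperature=1.0):
    z = np.asarray(scores, dtype=float) / temperature
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def choose_ad(model, context, candidates, temperature=1.0):
    """Score each candidate ad with the supervised click model, then sample
    from a softmax so that non-greedy ads still get impressions (exploration)."""
    feats = np.stack([np.concatenate([context, ad]) for ad in candidates])
    scores = model.predict_proba(feats)[:, 1]   # estimated P(click | context, ad)
    probs = softmax(scores, temperature)
    idx = np.random.choice(len(candidates), p=probs)
    return idx, probs[idx]                      # log the propensity with the impression

def retrain(logged_contexts, logged_ads, logged_clicks):
    """Every k hours: plain supervised learning on whatever the logging policy collected."""
    X = np.stack([np.concatenate([c, a]) for c, a in zip(logged_contexts, logged_ads)])
    y = np.asarray(logged_clicks)
    return LogisticRegression().fit(X, y)
```

The propensity returned by choose_ad is logged so that later retraining can reweight the data if needed.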

dksahuji
  • Hello. Can you please clarify what you mean by "I can also introduce exploration in supervised learning"? What's your idea more specifically? – nbro Jun 29 '21 at 12:37
  • I will learn a supervised model from the logging policy/model's data and use it in a stochastic way that promotes exploration. If I do this iteratively (i.e. learn a model with an IPW supervised loss and collect data with exploration), I hope I can achieve what a batch-learning-with-bandit-feedback setup would achieve. I will use inverse propensity weights both in the bandit and in the supervised learning loss (a rough sketch of that loss appears below the comments). My question is: for this specific problem, will the bandit give me anything I cannot achieve with an IPW supervised loss? – dksahuji Jun 30 '21 at 02:52
  • The whole point is working in a practical setup and learning from partial observations, i.e. in recommendation you don't have data for what you don't recommend. I learned this from http://www.cs.cornell.edu/courses/cs7792/2018fa/ . Batch learning from bandit feedback seems exciting to me. But how do I convince myself that, for a problem as simple as the one in the question, exploration plus an inverse-propensity-weighted supervised loss is equivalent to a batch-learned bandit? The IPW supervised loss will obviously require fewer iterations over the data. – dksahuji Jun 30 '21 at 02:59
  • In the example in the question description, I can have a softmax over the candidates for that slot. Exploration will let me try different candidates and hence collect data. – dksahuji Jun 30 '21 at 03:05
  • I'm wondering the same thing. A bandit is not "an algorithm"; it is a modeling setup. Your bandit will follow a policy π. You say your policy π is based on a supervised model (disregard exploration for now). I believe your supervision signal (click) translates directly to the reward you get at time t. In non-bandit agents, RL would optimize for long-term future rewards. A bandit episode is a single step, so I think supervised learning is equivalent to RL here. The benefit you get from posing the problem as a bandit is that you can reason about and control your exploration. – islijepcevic Oct 20 '22 at 12:21
  • No. If I am using an IPW-weighted supervised loss, then the cross-entropy is effectively my negative reward and the click is consumed in the loss. – dksahuji Apr 07 '23 at 09:45
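
For reference, a minimal sketch of the IPW-weighted supervised loss discussed in the comments above, assuming the propensities logged by the softmax policy are available. The function name, the clipping constant, and the use of logistic regression are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ipw_model(X, clicks, propensities, clip=10.0):
    """Supervised click model trained with an inverse-propensity-weighted log-loss:
    each logged (context, ad, click) row is weighted by 1 / P(logging policy showed that ad),
    with the weights clipped to keep the variance under control."""
    weights = np.minimum(1.0 / np.asarray(propensities, dtype=float), clip)
    model = LogisticRegression()
    model.fit(np.asarray(X), np.asarray(clicks), sample_weight=weights)
    return model
```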

0 Answers