I'm working with the Online Logistic Regression Algorithm (Algorithm 3) of Chapelle and Li in their paper, "An Empirical Evaluation of Thompson Sampling" (https://papers.nips.cc/paper/2011/file/e53a0a2978c28872a4505bdb51db06dc-Paper.pdf). It is a contextual bandit algorithm that fits a regularized logistic regression online, maintains a posterior over the weights via a Laplace approximation, and selects ads by Thompson sampling.
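For reference, here is a minimal sketch of how I understand Algorithm 3 (plain NumPy; the class name, step sizes, and toy data are my own choices, not from the paper): a diagonal-Gaussian posterior N(m_i, 1/q_i) over the weights, a MAP refit on each batch, and a Laplace update of the precisions.

```python
import numpy as np

class OnlineLogisticRegression:
    """Sketch of Chapelle & Li's Algorithm 3 with a diagonal posterior."""

    def __init__(self, dim, lam=1.0):
        self.m = np.zeros(dim)        # posterior means m_i
        self.q = lam * np.ones(dim)   # posterior precisions q_i (prior = lam)

    def sample_weights(self, rng):
        # Thompson sampling: draw w_i ~ N(m_i, 1/q_i)
        return rng.normal(self.m, 1.0 / np.sqrt(self.q))

    def fit_batch(self, X, y, n_steps=1000, lr=0.02):
        # MAP weights w* for the batch (labels y in {-1, +1}) by gradient
        # descent on
        #   0.5 * sum_i q_i (w_i - m_i)^2 + sum_j log(1 + exp(-y_j w.x_j))
        # (the paper leaves the inner optimizer unspecified; lr is a crude
        # fixed choice here).
        w = self.m.copy()
        for _ in range(n_steps):
            margins = y * (X @ w)
            grad = self.q * (w - self.m) - X.T @ (y / (1.0 + np.exp(margins)))
            w -= lr * grad
        p = 1.0 / (1.0 + np.exp(-(X @ w)))             # p_j at the MAP solution
        self.m = w
        self.q = self.q + (X ** 2).T @ (p * (1 - p))   # Laplace precision update
        return w

# Toy usage: one batch of contexts with a known weight vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=200))
olr = OnlineLogisticRegression(dim=3)
olr.fit_batch(X, y)
w_sampled = olr.sample_weights(rng)  # one Thompson draw
```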
My question is about the effect of data imbalance. Say the advertiser's log files showed that, after implementing the algorithm, Ad#1 accounted for 40% of the impressions and achieved 60% of the rewards (clicks). Ad#1 has the third-highest conversion rate among the 30 ads. (The two ads with higher conversion rates accounted for <1% of weekly impressions/rewards, though the total number of impressions is still high, N > 300K.)
Question: What would be the effect of downsampling, in the training set, the examples for which Ad#1 received a positive reward? The paper does not mention data imbalance. Is balancing the data needed or advisable?
Option A: The model will serve Ad#1 less and collect fewer weekly aggregate rewards (less exploitation of Ad#1, hence fewer rewards). The concern is that the model has chosen to exploit Ad#1; if we downsample its positive examples in the training set, it will exploit Ad#1 less and allocate more impressions to ads with lower conversion rates.
Option B: The model will serve Ad#1 less, but it will learn better when to serve it in a targeted way, so that weekly aggregate rewards stay the same or increase.
The idea is that the imbalanced dataset causes Ad#1 to be shown in contexts where it shouldn't be. By downsampling to a balanced dataset, we would model more accurately when Ad#1 should actually be shown, which would lead to higher rewards for Ad#1.
Option C: Something else.
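To make one part of the concern concrete, here is a toy sketch (plain NumPy; the simulated data, the keep rate r, and the helper fit_logreg are all hypothetical) of the one effect of downsampling positives that I'm fairly sure of: it shifts the fitted log-odds down by roughly log(r), so every predicted CTR is biased low unless the intercept is corrected. If only Ad#1's positives are downsampled and Ad#1 has its own indicator feature, the analogous bias would land on Ad#1's coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_logreg(X, y, n_iter=25):
    """Plain logistic regression MLE via Newton/IRLS; labels y in {0, 1}."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        W = p * (1.0 - p)                       # per-example Hessian weights
        H = Xb.T @ (Xb * W[:, None]) + 1e-9 * np.eye(Xb.shape[1])
        w += np.linalg.solve(H, Xb.T @ (y - p))
    return w

# Simulate a low-CTR world: true intercept -3 gives a base CTR of a few percent.
n = 50_000
X = rng.normal(size=(n, 1))
logits = 0.8 * X[:, 0] - 3.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(float)

w_full = fit_logreg(X, y)

# Downsample positives: keep each click with probability r.
r = 0.25
keep = (y == 0) | (rng.random(n) < r)
w_down = fit_logreg(X[keep], y[keep])

# In the downsampled population the odds of y=1 are exactly r times the true
# odds, so the fitted intercept shifts by about log(r); subtracting log(r)
# restores it (the standard prior-correction for case-control sampling).
w_corrected = w_down.copy()
w_corrected[-1] -= np.log(r)
```

The point of the sketch: downsampling alone does not just "rebalance" the data, it systematically deflates the model's CTR estimates, which interacts with how Thompson sampling allocates impressions across ads.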