
I am trying to find research papers addressing a problem that, in my opinion, deserves significant attention. However, I am having difficulty locating relevant information.

To illustrate the problem at hand, consider a multivariate Bayesian regression model represented as $y_t = b_1 \cdot x_{1,t}^{c_1} + b_2 \cdot x_{2,t}^{c_2}$. The purpose of this model is to estimate rewards $y_t$ based on inputs ($x_{1,t}$, $x_{2,t}$) at each time step $t$. The parameters $b_1$ and $b_2$ are assumed to follow a half-normal distribution, while $c_1$ and $c_2$ follow a gamma distribution. I aim to plan inputs over a given horizon $H$ under a specified budget $\left(\sum_t x_{1,t} + \sum_{t} x_{2,t} \le \mbox{budget}\right)$, with the main objective of maximizing cumulative rewards. In support of this objective, it is also important to gain knowledge about the model parameters, since better parameter estimates allow better allocation of inputs for maximizing cumulative rewards in the future.
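For concreteness, here is a minimal sketch of how this reward model could be specified in PyMC. The Gaussian observation-noise term, the prior hyperparameters, and the synthetic data are illustrative assumptions on my part rather than part of the problem statement:

```python
# Minimal sketch of the reward model in PyMC. Hyperparameters and the synthetic
# data are illustrative placeholders, not part of the original problem statement.
import numpy as np
import pymc as pm

# Historical daily data: inputs x1, x2 and observed sales y (faked here for illustration).
rng = np.random.default_rng(0)
x1 = rng.uniform(1.0, 10.0, size=100)
x2 = rng.uniform(1.0, 10.0, size=100)
y_obs = 2.0 * x1**0.5 + 1.0 * x2**0.3 + rng.normal(0.0, 0.2, size=100)

with pm.Model() as model:
    b1 = pm.HalfNormal("b1", sigma=5.0)
    b2 = pm.HalfNormal("b2", sigma=5.0)
    c1 = pm.Gamma("c1", alpha=2.0, beta=2.0)
    c2 = pm.Gamma("c2", alpha=2.0, beta=2.0)
    sigma = pm.HalfNormal("sigma", sigma=1.0)

    mu = b1 * x1**c1 + b2 * x2**c2
    pm.Normal("y", mu=mu, sigma=sigma, observed=y_obs)

    # Posterior over (b1, b2, c1, c2) given the 30-300 historical observations.
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)
```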

Every day we decide how to allocate a portion of the total budget, and every day we receive a response ($y=\mbox{sales}$). We typically have access to 30-300 daily data points to establish the initial posterior, and the planning horizon is around 30 days.

This problem exhibits a trade-off between exploration and exploitation, which is commonly encountered in communities such as reinforcement learning and dual control. However, the constraint that planned inputs can only be evaluated through the assumed functional form of the model, rather than through simulations of the real environment, makes it challenging to apply standard reinforcement learning techniques. While elements of this problem appear in other communities, such as Bayesian experimental design and active learning, those approaches primarily address the exploration aspect.

An intuitive strategy could combine aspects of active learning, such as maximum information gain, with an $\epsilon$-greedy approach: during exploitation, optimize the inputs against the mean of the posterior predictive distribution, while maximum information gain guides exploration by proposing new inputs. I can of course think of more approaches. Thompson sampling is one of them: draw a single sample of parameters from the posterior and treat it as ground truth in the downstream optimization task. Another is to take multiple samples and optimize against the mean of the posterior predictive distribution. One could also use UCB with some tweaks. Yet another is to take a Thompson sample, solve the downstream optimization task, and repeat this iteratively to get a sense of the variance of possible solutions, perhaps taking the mean or median in the end.
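As a rough illustration of the Thompson-sampling variant, the sketch below treats a single parameter draw as ground truth and solves the budget-constrained allocation with SciPy's SLSQP; the function names, horizon, budget value, and hard-coded draw are placeholders of mine, not an established recipe:

```python
# Sketch of one Thompson-sampling step: treat a single posterior draw as ground
# truth and solve the budget-constrained allocation for the planning horizon.
import numpy as np
from scipy.optimize import minimize

H = 30          # planning horizon (days)
budget = 300.0  # total budget over the horizon

def plan_with_sample(b1, b2, c1, c2):
    # Decision variable: [x1_1..x1_H, x2_1..x2_H]
    def neg_cum_reward(x):
        x1, x2 = x[:H], x[H:]
        return -np.sum(b1 * x1**c1 + b2 * x2**c2)

    cons = [{"type": "ineq", "fun": lambda x: budget - np.sum(x)}]  # budget constraint
    bounds = [(0.0, budget)] * (2 * H)
    x0 = np.full(2 * H, budget / (2 * H))  # start from an even split
    res = minimize(neg_cum_reward, x0, bounds=bounds, constraints=cons, method="SLSQP")
    return res.x[:H], res.x[H:]

# One Thompson draw (hard-coded here; in practice sampled from the posterior).
x1_plan, x2_plan = plan_with_sample(b1=2.0, b2=1.0, c1=0.5, c2=0.3)
```

Repeating this over many posterior draws gives the iterated variant described above, where the spread of the resulting allocations indicates how sensitive the plan is to parameter uncertainty.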


Although I believe this problem falls within the realm of sequential decision optimization and should have received extensive research attention, it is challenging to find specific resources that tackle this type of problem.

Question: In which communities can I find more information about my problem type?

Crossposted: https://or.stackexchange.com/questions/10611/bayesian-regression-model-as-reward-function-in-exploration-vs-exploitation-sett


EDIT

After reading the papers provided in the answer below and examining various model-based Bayesian reinforcement learning techniques, several concerns arose, together with potential solution strategies. To address these concerns and showcase some techniques applicable to resolving them, I started a new thread. The problem formulation in that thread closely resembles the one above, but with additional contextual information that more closely matches the real problem I am working on: "Methods for sequential decision optimization problem with nonlinear bayesian reward function". In the current question, I am solely interested in hearing which communities have been trying to attack this problem type.

paul
  • Hi @paul and welcome to AI Stack Exchange! If possible, please outline your specific question in the body of the post. Right now, I'm unsure of what question to answer. I did my PhD in some very similar topic, and I'd like to help out if possible. Thanks, and we look forward to more of your questions in the future! – DeepQZero Jun 28 '23 at 17:04
  • Hi @DeepQZero, just edited: my main question is, where can I find more information about my problem type, in which communities, and has this problem type even been tackled before? – paul Jun 29 '23 at 07:14
  • I would love to discuss this problem with someone who actually worked on something similar, @DeepQZero, because everywhere I look it is extremely difficult to find relevant info on this. Bits and pieces can be taken from all the sequential decision communities, but it seems so strange that no one has tackled this exact problem type before. – paul Jun 29 '23 at 07:16
  • Also added a couple more strategies (policies) I've been thinking of; it still seems strange, however, that this problem has not been tackled extensively in the literature before. – paul Jun 29 '23 at 08:24
  • Thank you for the update. Another question - is gaining info about the model parameters rewarded somehow, or is it only helpful in maximizing the total return? I'm asking this question since I usually have seen maximizing total rewards as the sole goal in similar problems, whereas gaining info about the model parameters is only a byproduct of trying to maximize the total rewards. If you need clarification, just let me know. – DeepQZero Jun 29 '23 at 14:14
  • @DeepQZero it is solely helpful in maximizing the total return. No extra reward added for gaining info about them. – paul Jun 29 '23 at 16:01
  • I would say that certain bandit problems, active learning for regression, and MPC are the closest I have got to finding research that resembles this problem, but they all have parts that are very different from this setting. – paul Jun 29 '23 at 17:59
  • I'll try to give an answer by the end of next week. If I don't give an answer or don't give an update here, please send me another message in this comment chain to get my attention again. – DeepQZero Jun 30 '23 at 19:04
  • @DeepQZero Lovely, thank you very much. Are there any specific research papers or Google searches that you recommend right away? – paul Jun 30 '23 at 19:09
  • Try this arxiv paper: Bayesian Sequential Optimal Experimental Design for Nonlinear Models Using Policy Gradient Reinforcement Learning – DeepQZero Jun 30 '23 at 19:15
  • @DeepQZero Wow, finally I found reports tackling something similar, awesome reports, not just that one, but all the other sOED reports. Will definitely reference them in my final report about my real problem (the one presented here is just a toy one showcasing the main considerations). A wide range of questions popped up in my mind as I am digesting the reports and trying to bridge them to my problem. I aim to create separate threads showcasing these questions by approximately Friday and then link those threads in this one. Hold back until I create those threads; I think you will ace all of them. – paul Jul 05 '23 at 11:29
  • Glad to hear it! I'm busy until the weekend (at the very earliest) anyways, so I can't look at any of this until at least then. This sOED problem is very rich, so it doesn't surprise me that it's prompting more questions in your mind. – DeepQZero Jul 06 '23 at 13:56
  • @DeepQZero No worries, I am just extremely happy someone can actually answer some of my questions, no matter how long it takes :). I created a separate thread/question (read the EDIT part in this question) that sums up my main concerns that popped up while digesting the sOED reports. I would be so happy if you could shed some light on my thoughts in that thread (it is basically a bridge between an algorithm stemming from the problem above and the sOED reports): https://ai.stackexchange.com/questions/41198/is-this-deterministic-lookahead-policy-greedy-myopic-and-suboptimal-if-so-can – paul Jul 07 '23 at 18:22
  • @DeepQZero Hi, hope you are doing well; just wanted to send a little reminder in case you had forgotten. I appreciate your help immensely. Kind regards, Paul – paul Jul 12 '23 at 15:40
  • I finally got some time to look at this question again. I'll try to give an answer soon, perhaps today. – DeepQZero Jul 22 '23 at 17:27

1 Answer


One community that has very recently been attacking problems of the type posed by your question is the Bayesian sequential optimal experimental design (Bayesian sOED) community. The Bayesian sOED setting generally assumes that there is a current belief state $x_k \in \mathcal{X}$ over an underlying system with some unknown quantity (in your case, unknown parameters) and possible design choices $d_k \in \mathcal{D}$ that may be conditioned on the current belief state. Performing experiment $k \in \{0, 1, 2, \ldots, N-1\}$ is equivalent to choosing a design $d_k$, receiving an observation $y_k \in \mathcal{Y}$ from the underlying system, and earning a utility $g(x_k, d_k, y_k) \in \mathbb{R}$ based on the current belief state $x_k$, design choice $d_k$, and observation $y_k$. The goal is to choose a sequence of designs $(d_0, d_1, d_2, \ldots, d_{N-1})$ that maximizes the expected sum of utility across all experiments. Learning more information about the underlying system (e.g. the unknown parameters) is directly helpful in achieving that goal by producing more informative belief states $x_{k+1}$ after the conclusion of each experiment.
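To make the setting concrete, here is a schematic sketch of this experiment loop in that notation; the four helper functions are problem-specific placeholders, not part of any sOED library:

```python
# Schematic Bayesian sOED loop in the notation above. All four helper functions
# are placeholders the practitioner must supply; nothing here is a library API.
def run_soed(x0, N, choose_design, observe, utility, update_belief):
    """Run N sequential experiments starting from belief state x0."""
    x_k = x0
    total_utility = 0.0
    for k in range(N):
        d_k = choose_design(x_k, k)         # pick design d_k, e.g. via a learned policy
        y_k = observe(d_k)                  # run experiment k on the underlying system
        total_utility += utility(x_k, d_k, y_k)
        x_k = update_belief(x_k, d_k, y_k)  # posterior update -> next belief state x_{k+1}
    return total_utility
```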

The above experimental design setting is:

  • Bayesian due to incorporating a belief state over the underlying system,
  • sequential because experiments are performed sequentially to use all information about past experiments as opposed to performing all experiments concurrently (batch design),
  • optimal due to maximizing the expected sum of utility across all experiments as opposed to only performing the maximization over the next experiment (greedy design).

Since your exact question is simply asking about relevant research communities, I won't make any specific remarks in this answer regarding how to solve your illustrative problem using Bayesian sOED (feel free to make a follow-up question on that topic, if desired). Instead, I will list some references that I personally have found useful for solving problems in this field, with those most directly related to the Bayesian sOED problem listed first:

DeepQZero
  • Marked it as solved. Thanks. Very useful links, some of which I had not seen. I will formulate a subsequent question on how to cast my problem as an sOED formulation; I'll make sure to tag you here when I have the question up and running. Thanks a lot. Kind regards / paul – paul Jul 23 '23 at 12:10
  • Thanks, sounds good! If you liked the answer, you can also consider giving it an upvote to help me rise up the leaderboards, but it isn't necessary. I'll keep viewing your other questions throughout the week. Feel free to keep sending me notifications if you'd like me to look at questions, but I can't always promise fast answers. – DeepQZero Jul 23 '23 at 20:49
  • I just gave the answer an upvote! Thanks a lot. I removed the previous secondary thread I created and instead made another thread which really sums up my problem and addresses my primary concerns when it comes to "solving" this problem. Find that thread in the edit part or here: https://ai.stackexchange.com/questions/41460/methods-for-sequential-decision-optimization-problem-with-nonlinear-bayesian-rew I would really appreciate you giving it a try; it is incredibly helpful to me to have someone who knows this stuff looking at my problem. Thanks in advance, kind regards. – paul Jul 24 '23 at 16:43