I am trying to find out whether there are methods I am not aware of that could be beneficial in my problem context. Drawing inspiration from the optimal experimental design and RL communities, I have a sense that there are.
To illustrate the problem at hand, consider a multivariate Bayesian regression model $\hat{y}_t = b_{1, t} \cdot x_{1,t}^{c_1} + b_{2, t} \cdot x_{2,t}^{c_2}$, where $\hat{y}_t$ denotes predicted values and $y_t$ observations. The purpose of this model is to estimate rewards $\hat{y}_t$ from inputs ($x_{1,t}$, $x_{2,t}$) at each time step $t$. The parameters $b_{1, t}$ and $b_{2, t}$ are time-varying, i.e. they may change over the horizon rather than remaining constant. For instance, we could use a Gaussian random walk prior indexed by the day of the month, allowing the model to capture and generalize a day-of-the-month effect observed across multiple months (assume we have daily historical data). $c_1$ and $c_2$ have gamma priors. Let's assume a normally distributed likelihood, $y_t \mid b_{1,t}, b_{2,t}, c_1, c_2, x_{1, t}, x_{2, t} \sim \mathcal{N}(\hat{y}_t, \sigma^2)$. I aim to plan inputs over a given horizon $H$ under a specified budget $\left(\sum_{t \in H} (x_{1,t} + x_{2,t}) \le \mbox{budget}\right)$, with the main objective of maximizing cumulative reward. To serve this goal, it is also important to gain knowledge about the model parameters, so that future inputs can be allocated better for maximizing cumulative reward.
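For concreteness, this is roughly how I specify the model; a minimal sketch in NumPyro (the choice of PPL, the hyperparameters, and the names `x1_hist`, `x2_hist`, `y_hist` for the historical data are placeholders, and the random walk is indexed directly by time step rather than by day of month to keep it short):

```python
import jax.random as random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def model(x1, x2, y=None):
    T = x1.shape[0]
    # time-varying coefficients: Gaussian random walk priors
    tau = numpyro.sample("tau", dist.HalfNormal(1.0))
    b1 = numpyro.sample("b1", dist.GaussianRandomWalk(scale=tau, num_steps=T))
    b2 = numpyro.sample("b2", dist.GaussianRandomWalk(scale=tau, num_steps=T))
    # saturation exponents with gamma priors (placeholder hyperparameters)
    c1 = numpyro.sample("c1", dist.Gamma(2.0, 2.0))
    c2 = numpyro.sample("c2", dist.Gamma(2.0, 2.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    mu = b1 * x1 ** c1 + b2 * x2 ** c2
    numpyro.sample("y", dist.Normal(mu, sigma), obs=y)

# posterior over (b1, b2, c1, c2, sigma) given the historical data
mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(0), x1_hist, x2_hist, y_hist)
posterior = mcmc.get_samples()
```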
Every day we make a decision on how to allocate a portion of the total budget, and every day we receive a response ($y_t = $ sales at time step $t$).
Note that every day we have the ability to adjust our planning in accordance with the newly acquired information. As a result, we can recompute our decisions daily and implement them, essentially operating "online".
Typically we have access to 30-300 daily data points to establish the initial posterior, and the planning horizon is around 30 days.
I am aware of essentially two, maybe three, different strategies for "solving" this problem. Consider the following algorithm, which showcases the different strategies:
1. Estimate the posterior (e.g. with MCMC), factoring in our current dataset.
2. Sample from the posterior:
   - Strategy 1: Sample a single set of parameters from the posterior (Thompson sampling).
   - Strategy 2: Sample many sets of parameters from the posterior (sample average approximation, SAA).
   - Strategy 3: Sample many sets of parameters from the posterior, or with some probability $\epsilon$ maximize information gain instead (see step 3 for clarification).
3. Construct a deterministic optimization problem over the whole horizon with the previously sampled parameters:
   - Strategy 1: Objective function $\sum_{t \in H}\hat{y}_t$ with the single sample as parameters.
   - Strategy 2: Objective function $\sum_{t \in H}\frac{1}{M}\sum_{m=1}^{M} \hat{y}_t^{(m)}$, where $M$ denotes the number of samples.
   - Strategy 3: Determine the time steps at which to maximize information gain by sampling from a binomial distribution governed by $\epsilon$ and $H$, maximize expected information gain at those time steps, and insert the resulting allocations as constants into the objective function of Strategy 2 (a sketch of the information-gain estimate appears further down).
4. Solve the deterministic optimization problem subject to the budget constraint (see the sketch right after this list).
5. Set the allocation for today.
6. Retrieve the reward.
7. Add the allocation and reward to the current dataset.
8. Iterate from step 1 the next day.
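To make steps 2-4 concrete for Strategies 1 and 2, here is a minimal sketch of the deterministic planning problem I solve, using `scipy.optimize` (the helper name `plan` and the arrays `b1_s`, `b2_s`, `c1_s`, `c2_s` holding posterior draws propagated over the planning horizon are my own placeholders; Thompson sampling is the special case $M = 1$):

```python
import numpy as np
from scipy.optimize import minimize

def plan(b1_s, b2_s, c1_s, c2_s, budget, H):
    """b1_s, b2_s: (M, H) posterior draws of the coefficients over the horizon;
    c1_s, c2_s: (M,) posterior draws of the exponents."""
    def neg_expected_reward(z):
        x1, x2 = z[:H], z[H:]
        # SAA objective: average the cumulative reward over the M samples
        y_hat = b1_s * x1 ** c1_s[:, None] + b2_s * x2 ** c2_s[:, None]
        return -y_hat.sum(axis=1).mean()

    budget_con = {"type": "ineq", "fun": lambda z: budget - z.sum()}  # sum_t (x1 + x2) <= budget
    bounds = [(1e-6, budget)] * (2 * H)    # small lower bound avoids x = 0, where x**c is non-smooth
    z0 = np.full(2 * H, budget / (2 * H))  # start from a uniform split of the budget
    res = minimize(neg_expected_reward, z0, bounds=bounds,
                   constraints=[budget_con], method="SLSQP")
    return res.x[:H], res.x[H:]            # planned x1 and x2 over the horizon

# Only today's allocation (first entry of each returned plan) is implemented;
# the rest is re-planned tomorrow with the updated posterior (steps 5-8).
```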
Notice that the decisions for future days coming out of the optimization problem are only generated so that we make a better decision today: since the coefficients are time-varying, we might, for example, want to hold back money today in order to spend more tomorrow. Planning over the whole horizon also lets us respect the budget constraint across the horizon.
One might hypothesize that Strategy 1 would yield the most unstable controller/agent and consequently result in the highest level of exploration.
Likewise, one would suspect that Strategy 2 would lead to the most stable controller/agent, favoring a higher degree of exploitation. The balance between exploration and exploitation is determined by the number of samples we choose to take. Notably, in the extreme case of just one sample, we revert to Strategy 1.
Additionally, one would expect Strategy 3 to represent a blend of the two other strategies, with the parameter $\epsilon$ governing the level of exploration.
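For the information-gain piece of Strategy 3, what I have in mind is a plain nested Monte Carlo estimate of the expected information gain (EIG) of a candidate allocation $(x_1, x_2)$ at one of the $\epsilon$-selected time steps, which I would then maximize over a grid of candidate allocations. A rough sketch (the function name and argument shapes are my own, and reusing the same posterior draws for the inner average biases the estimate slightly):

```python
import numpy as np
from scipy.stats import norm

def eig(x1, x2, b1_t, b2_t, c1, c2, sigma, rng):
    """b1_t, b2_t, c1, c2, sigma: (M,) posterior draws, with b1_t, b2_t taken
    at the selected time step; x1, x2: candidate allocation (scalars)."""
    mu = b1_t * x1 ** c1 + b2_t * x2 ** c2                 # (M,) predicted means
    y_sim = rng.normal(mu, sigma)                          # one simulated outcome per draw
    log_lik = norm.logpdf(y_sim, mu, sigma)                # log p(y_m | theta_m, x)
    # log marginal log p(y_m | x): average the likelihood over all posterior draws
    log_marg = np.log(norm.pdf(y_sim[:, None], mu[None, :], sigma[None, :]).mean(axis=1))
    return float((log_lik - log_marg).mean())
```

Maximizing this over candidate allocations at the selected time steps gives the values that are then fixed as constants in the Strategy 2 objective.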
A concern with all of these approaches is that they do not consider how our decisions influence our future belief states and, consequently, subsequent decisions. The proposed exploration techniques seem somewhat ad hoc to me.
While researching optimal experimental design, a community that addresses problems similar to mine, and model-based Bayesian reinforcement learning, I sense that further strategies may exist, some of which come with theoretical guarantees of optimality, even if their practical implementation is challenging or impossible.
Upon reading about optimal experimental design and comparing batch OED and sequential OED, I notice the underlying assumption that decisions or policies cannot be recomputed online (unlike in my setting); see, e.g., this great thesis: Numerical approaches for sequential Bayesian optimal experimental design.
Batch OED computes all decisions over the entire horizon in a single shot, making it unable to adjust to changing states (belief states).
On the other hand, Sequential OED calculates policies for decisions that are dependent on the state, thus allowing it to adapt to changing states (belief states).
In my scenario, however, I am free to recompute decisions or policies at each time step (each experiment, in OED terms). Clearly, one could simply recompute a batch OED or sequential OED strategy at every time step, and methods from these communities may well provide better solutions than the ones proposed above. Still, I find it hard to see how the batch and sequential OED solutions would be superior to the ones I proposed, given the context and the fact that we can recompute at every time step. Nevertheless, it does seem easier to include the budget constraint in batch OED strategies.
Furthermore, while studying Bayesian model-based reinforcement learning, I encountered the notion of a Bayes-optimal policy, which theoretically guarantees an optimal exploration-exploitation tradeoff; see page 45 of Bayesian Reinforcement Learning: A Survey. However, I find it challenging to see how I can cast my problem in such a setting and solve it approximately.
Consequently, this leads to my question:
Question: Considering the context, are there any methods beyond the ones I proposed in the algorithm above that could be superior for addressing this problem? If so, how would I go about implementing them?