I am trying to find out whether there are methods I am not aware of that could be beneficial in my problem context. Drawing inspiration from the optimal experimental design and RL communities, I have a sense that there are.
To illustrate the problem at hand, consider a multivariate Bayesian regression model $\hat{y}_t = b_{1, t} \cdot x_{1,t}^{c_1} + b_{2, t} \cdot x_{2,t}^{c_2}$, where $\hat{y}_t$ denotes predicted values and $y_t$ observations. The purpose of this model is to estimate rewards $\hat{y}_t$ from inputs ($x_{1,t}$, $x_{2,t}$) at each time step $t$. The parameters $b_{1, t}$ and $b_{2, t}$ are time-varying, i.e. they may change over the horizon rather than remaining constant. For instance, we could use a Gaussian random walk prior indexed by the day of the month, allowing the model to capture and generalize a day-of-the-month effect observed across multiple months (assume we have daily historical data). $c_1$ and $c_2$ have gamma priors. Let's assume a normally distributed likelihood, $y_t \mid b_{1,t}, b_{2,t}, c_1, c_2, x_{1, t}, x_{2, t} \sim \mathcal{N}(\hat{y}_t, \sigma^2)$. I aim to plan inputs over a given horizon $H$ under a specified budget $\left(\sum_{t \in H} (x_{1,t} + x_{2,t}) \le \mbox{budget}\right)$, with the main objective of maximizing cumulative reward. To serve this goal, it is also important to gain knowledge about the model parameters, so that future inputs can be allocated better for maximizing cumulative reward.
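For concreteness, this is roughly how I specify the model; a minimal sketch in NumPyro (the choice of PPL, the hyperparameters, and the names `x1_hist`, `x2_hist`, `y_hist` for the historical data are placeholders, and the random walk is indexed directly by time step rather than by day of month to keep it short):

```python
import jax.random as random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def model(x1, x2, y=None):
    T = x1.shape[0]
    # time-varying coefficients: Gaussian random walk priors
    tau = numpyro.sample("tau", dist.HalfNormal(1.0))
    b1 = numpyro.sample("b1", dist.GaussianRandomWalk(scale=tau, num_steps=T))
    b2 = numpyro.sample("b2", dist.GaussianRandomWalk(scale=tau, num_steps=T))
    # saturation exponents with gamma priors (placeholder hyperparameters)
    c1 = numpyro.sample("c1", dist.Gamma(2.0, 2.0))
    c2 = numpyro.sample("c2", dist.Gamma(2.0, 2.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    mu = b1 * x1 ** c1 + b2 * x2 ** c2
    numpyro.sample("y", dist.Normal(mu, sigma), obs=y)

# posterior over (b1, b2, c1, c2, sigma) given the historical data
mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(0), x1_hist, x2_hist, y_hist)
posterior = mcmc.get_samples()
```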
Every day we make a decision on how to allocate a portion of the total budget, and every day we receive a response ($y_t = $ sales at time step $t$).
Note that every day we have the ability to adjust our planning in accordance with the newly acquired information. As a result, we can recompute our decisions daily and implement them, essentially operating "online".
Typically we have access to 30-300 daily data points to establish the initial posterior, and the planning horizon is around 30 days.
I am aware of essentially two, maybe three, different strategies for "solving" this problem. Consider the following algorithm, which showcases the different strategies:
1. Estimate the posterior (e.g. with MCMC), factoring in our current dataset.
2. Sample from the posterior:
   - Strategy 1: Sample a single set of parameters from the posterior (Thompson sampling).
   - Strategy 2: Sample many sets of parameters from the posterior (sample average approximation, SAA).
   - Strategy 3: Sample many sets of parameters from the posterior, or with some probability $\epsilon$ maximize information gain instead (see step 3 for clarification).
3. Construct a deterministic optimization problem over the whole horizon with the previously sampled parameters:
   - Strategy 1: Objective function $\sum_{t \in H}\hat{y}_t$ with the single sample as parameters.
   - Strategy 2: Objective function $\sum_{t \in H}\frac{1}{M}\sum_{m=1}^{M} \hat{y}_t^{(m)}$, where $M$ denotes the number of samples.
   - Strategy 3: Determine the time steps at which to maximize information gain by sampling from a binomial distribution governed by $\epsilon$ and $H$, maximize expected information gain at those time steps, and insert the resulting allocations as constants into the objective function of Strategy 2 (a sketch of the information-gain estimate appears further down).
4. Solve the deterministic optimization problem subject to the budget constraint (see the sketch right after this list).
5. Set the allocation for today.
6. Retrieve the reward.
7. Add the allocation and reward to the current dataset.
8. Iterate from step 1 the next day.
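To make steps 2-4 concrete for Strategies 1 and 2, here is a minimal sketch of the deterministic planning problem I solve, using `scipy.optimize` (the helper name `plan` and the arrays `b1_s`, `b2_s`, `c1_s`, `c2_s` holding posterior draws propagated over the planning horizon are my own placeholders; Thompson sampling is the special case $M = 1$):

```python
import numpy as np
from scipy.optimize import minimize

def plan(b1_s, b2_s, c1_s, c2_s, budget, H):
    """b1_s, b2_s: (M, H) posterior draws of the coefficients over the horizon;
    c1_s, c2_s: (M,) posterior draws of the exponents."""
    def neg_expected_reward(z):
        x1, x2 = z[:H], z[H:]
        # SAA objective: average the cumulative reward over the M samples
        y_hat = b1_s * x1 ** c1_s[:, None] + b2_s * x2 ** c2_s[:, None]
        return -y_hat.sum(axis=1).mean()

    budget_con = {"type": "ineq", "fun": lambda z: budget - z.sum()}  # sum_t (x1 + x2) <= budget
    bounds = [(1e-6, budget)] * (2 * H)    # small lower bound avoids x = 0, where x**c is non-smooth
    z0 = np.full(2 * H, budget / (2 * H))  # start from a uniform split of the budget
    res = minimize(neg_expected_reward, z0, bounds=bounds,
                   constraints=[budget_con], method="SLSQP")
    return res.x[:H], res.x[H:]            # planned x1 and x2 over the horizon

# Only today's allocation (first entry of each returned plan) is implemented;
# the rest is re-planned tomorrow with the updated posterior (steps 5-8).
```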
Notice that the decisions for future days coming out of the optimization problem are only generated so that we make a better decision today: since the coefficients are time-varying, we might, for example, want to hold back money today in order to spend more tomorrow. Planning over the whole horizon also lets us respect the budget constraint across the horizon.
One might hypothesize that Strategy 1 would yield the most unstable controller/agent and consequently result in the highest level of exploration.
Likewise, one would suspect that Strategy 2 would lead to the most stable controller/agent, favoring a higher degree of exploitation. The balance between exploration and exploitation is determined by the number of samples we choose to take. Notably, in the extreme case of just one sample, we revert to Strategy 1.
Additionally, one would expect Strategy 3 to represent a blend of the two other strategies, with the parameter $\epsilon$ governing the level of exploration.
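For the information-gain piece of Strategy 3, what I have in mind is a plain nested Monte Carlo estimate of the expected information gain (EIG) of a candidate allocation $(x_1, x_2)$ at one of the $\epsilon$-selected time steps, which I would then maximize over a grid of candidate allocations. A rough sketch (the function name and argument shapes are my own, and reusing the same posterior draws for the inner average biases the estimate slightly):

```python
import numpy as np
from scipy.stats import norm

def eig(x1, x2, b1_t, b2_t, c1, c2, sigma, rng):
    """b1_t, b2_t, c1, c2, sigma: (M,) posterior draws, with b1_t, b2_t taken
    at the selected time step; x1, x2: candidate allocation (scalars)."""
    mu = b1_t * x1 ** c1 + b2_t * x2 ** c2                 # (M,) predicted means
    y_sim = rng.normal(mu, sigma)                          # one simulated outcome per draw
    log_lik = norm.logpdf(y_sim, mu, sigma)                # log p(y_m | theta_m, x)
    # log marginal log p(y_m | x): average the likelihood over all posterior draws
    log_marg = np.log(norm.pdf(y_sim[:, None], mu[None, :], sigma[None, :]).mean(axis=1))
    return float((log_lik - log_marg).mean())
```

Maximizing this over candidate allocations at the selected time steps gives the values that are then fixed as constants in the Strategy 2 objective.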
A concern with all of these approaches is that they do not consider how our decisions influence our future belief states and, consequently, subsequent decisions. The proposed exploration techniques seem somewhat ad hoc to me.
While researching optimal experimental design, a community that addresses problems similar to mine, and model-based Bayesian reinforcement learning, I sense that further strategies may exist, some of which come with theoretical guarantees of optimality, even if their practical implementation is challenging or impossible.
Upon reading about optimal experimental design and comparing batch OED and sequential OED, I notice the underlying assumption that decisions or policies cannot be recomputed online (unlike in my setting); see, e.g., this great thesis: Numerical approaches for sequential Bayesian optimal experimental design.
Batch OED computes all decisions over the entire horizon in a single shot, making it unable to adjust to changing states (belief states).
On the other hand, Sequential OED calculates policies for decisions that are dependent on the state, thus allowing it to adapt to changing states (belief states).
In my scenario, however, I am free to recompute decisions or policies at each time step (each experiment, in OED terms). Clearly, one could simply recompute a batch OED or sequential OED strategy at every time step, and methods from these communities may well provide better solutions than the ones proposed above. Still, I find it hard to see how the batch and sequential OED solutions would be superior to the ones I proposed, given the context and the fact that we can recompute at every time step. Nevertheless, it does seem easier to include the budget constraint in batch OED strategies.
Furthermore, while studying Bayesian model-based reinforcement learning, I encountered the notion of a Bayes-optimal policy, which theoretically guarantees an optimal exploration-exploitation tradeoff; see page 45 of Bayesian Reinforcement Learning: A Survey. However, I find it challenging to see how I can cast my problem in such a setting and solve it approximately.
Consequently, this leads to my question:
Question: Considering the context, are there any methods beyond the ones I proposed in the algorithm above that could be superior for addressing this problem? If so, how would I go about implementing them?