I'm trying to optimize the expected return from a game of chance, but I've quickly realized the problem outclasses the introductory AI course I took in college years ago. I would appreciate any guidance on which algorithms might be relevant to this type of game.
Game Description
The player has 31 tiles that they can flip over to reveal a hidden color. There are 6 blue tiles, 4 red tiles, and possibly one gold tile (the rest are blank). The player is granted only 11 attempts to turn over tiles. Turning over all tiles of a single color yields a reward (a different reward for each color); uncovering no complete set of tiles yields 0 reward. Due to illustrations on the tiles, revealing a single red or blue tile also tells the player the locations of the other tiles of that color (i.e. if I flip over one red tile, I know which three other tiles would be red if I were to flip them over too).
The distribution of colors across these 31 tiles is not uniform (but can be assumed known). Further, there are only ~100 possible tile color configurations, and these configurations have all been discovered by the players of this game.
(Concretely, the game is Faux Hollows from the video game FFXIV; the above description merely removes the flavortext.)
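To make the setup concrete, here is a rough sketch of how I'm picturing a single board configuration; the tile indices and reward values below are placeholders I made up, not actual game data:

```python
from dataclasses import dataclass

# Illustrative only: tile indices and reward values are made up,
# not actual Faux Hollows data.
@dataclass(frozen=True)
class Configuration:
    blue: frozenset   # indices (0-30) of the 6 blue tiles
    red: frozenset    # indices of the 4 red tiles
    gold: frozenset   # index of the gold tile, or empty if this board has none

REWARDS = {"blue": 10, "red": 30, "gold": 100}  # placeholder reward values

example = Configuration(
    blue=frozenset({0, 1, 2, 3, 4, 5}),
    red=frozenset({10, 11, 12, 13}),
    gold=frozenset({20}),
)
```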
My objective is to maximize the expected reward over multiple plays of the game; I want a program that, given a board state, tells me the optimal tile to flip.
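(Roughly, I'd like a function with an interface like the following; the name and signature are just illustrative:)

```python
def best_flip(revealed: frozenset, flips_left: int) -> int:
    """Return the index (0-30) of the tile to flip next so as to maximize
    expected reward. Placeholder signature for the program I'm after."""
    raise NotImplementedError
```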
What I've considered
My initial thought was to treat this as a Markov Decision Process, since we change state, receive some value/reward based on the state change, and I want a policy that maximizes the expected reward. However, the only way I could think to model the state space (i.e. state = which tiles have been flipped and what color they are) yields over a hundred million states from the flip-sets alone ($\sum_{1\leq k\leq 11}\binom{31}{k}\approx 1.6\times 10^8$, before even distinguishing the revealed colors), which is (AFAIK) intractable for classical MDP solvers.
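(For reference, that sum works out as follows, even before multiplying by the possible color reveals:)

```python
from math import comb

# Number of distinct "which tiles have been flipped" sets after 1 to 11 flips.
flip_sets = sum(comb(31, k) for k in range(1, 12))
print(flip_sets)  # 160645503, i.e. ~1.6e8
```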
I thought maybe Reinforcement Learning could help, but I'm not deeply familiar with it. My cursory glance at Q-learning suggested that high-cardinality state spaces are handled by approximating the value function, but --- because very few of the game's states give any reward --- I worry about how well that approximation will work. (Also, I'm not sure how I would incorporate the 11 tile-flip limit.)
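(My current guess for the flip limit is that it could simply be folded into the state, something like the sketch below, so that running out of flips becomes a terminal state --- but I don't know whether that's the standard approach.)

```python
from typing import NamedTuple

class State(NamedTuple):
    revealed: frozenset   # set of (tile_index, color) pairs observed so far
    flips_left: int       # starts at 11

def is_terminal(state: State) -> bool:
    # Once the flip budget is exhausted, no further actions are possible.
    return state.flips_left == 0
```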
I also considered using a decision tree to identify which of the ~100 board configurations is currently in play, but this doesn't optimize for reward at all; it merely finds the fewest moves needed to determine which color configuration I have.
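(For what it's worth, narrowing down which configuration I'm facing seems straightforward with a consistency filter over the known layouts, as sketched below, reusing the `Configuration` sketch above; `KNOWN_CONFIGURATIONS` stands in for the community-sourced list of ~100 boards. The part I can't see is how to choose flips that maximize expected reward rather than just information.)

```python
def color_at(config: Configuration, tile: int) -> str:
    """Color that a given configuration assigns to a tile (hypothetical helper)."""
    if tile in config.blue:
        return "blue"
    if tile in config.red:
        return "red"
    if tile in config.gold:
        return "gold"
    return "blank"

def consistent(config: Configuration, revealed: frozenset) -> bool:
    """True if a known configuration matches every (tile, color) revealed so far."""
    return all(color_at(config, tile) == color for tile, color in revealed)

# candidates = [c for c in KNOWN_CONFIGURATIONS if consistent(c, revealed)]
```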
Could I get some recommendations for algorithms I should learn or consider for finding an optimal strategy for this game?