
I am building a reinforcement learning agent with DQN. The agent places buy and sell orders for day trading. I am facing a problem with this project. The question is: "how do I tell the agent to maximize profit and avoid any transaction whose profit is less than $100?"

I want to maximize profit within a trading day and avoid placing the pair (limit buy order, limit sell order) if the profit on that transaction is less than $100. The idea is to avoid small, noisy movements; instead, I prefer long, profitable movements. Note that I was planning to use the Profit & Loss (P&L) as the reward.

"I want the minimal profit per transaction to be 100$" ==> It seems this is not something that is enforceable. I can train the agent to maximize profit per transaction, but how that profit is cannot be ensured.

Initially, I wanted to tell the agent: if the profit of a transaction is $50, I will subtract $100, so it becomes a penalty of $50 for the agent. I thought this was a good way to tell the agent not to place a limit buy order unless it is confident the transaction will yield a minimum profit of $100. But it seems all I would be doing there is shifting the value of the reward: the agent only cares about maximizing the sum of rewards, not about individual transactions.

How do I tell the agent to maximize profit while avoiding transactions whose profit is less than $100? With that strategy, what guarantees that the agent will never make a buy/sell decision that results in less than $100 profit? Could (sum of rewards) − (number of transactions × 100) be a solution?
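
To make the idea concrete, here is a minimal sketch of the shifted reward (the per-transaction structure and the numbers are illustrative assumptions only):

```python
# Sketch of the proposed reward shaping, assuming one reward per completed
# (buy, sell) transaction. All values are illustrative.

def shaped_reward(pnl, penalty=100.0):
    """Shift each transaction's P&L down by a fixed penalty."""
    return pnl - penalty

transactions = [50.0, 230.0, -20.0, 150.0]  # hypothetical per-transaction P&L

total_pnl = sum(transactions)
total_shaped = sum(shaped_reward(p) for p in transactions)

# The shaped return equals the raw P&L minus (number of transactions * 100),
# so maximizing it trades total profit off against the number of transactions.
assert total_shaped == total_pnl - 100.0 * len(transactions)
print(total_pnl, total_shaped)  # 410.0 10.0
```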

fgauth
  • Does every transaction correspond to one action, or are multiple actions required for a single transaction, or can a single action lead to multiple transactions? If every action results in exactly one transaction with exactly one reward (that transaction's profit)... then the most straightforward solution would be to simply replace any reward $R_t < 100$ with $R_t = 0$, but keep it unmodified if $R_t \geq 100$ – Dennis Soemers Jan 18 '19 at 17:55
  • or shift all rewards by $-100$ I suppose, such that a hypothetical "do-nothing" action with a default reward of $0$ would be preferred over transactions with rewards $< 100$. It really depends on exactly how your problem is formulated though, what are states, what are actions, etc. – Dennis Soemers Jan 18 '19 at 18:18
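
A minimal sketch of the thresholded reward suggested in the first comment above, assuming each action yields exactly one transaction whose profit is the reward (values are illustrative):

```python
# Thresholded reward: keep a transaction's profit only if it reaches the
# $100 minimum; otherwise the reward is zero. Note that, taken literally,
# this rule also maps losses to zero.

def thresholded_reward(pnl, min_profit=100.0):
    return pnl if pnl >= min_profit else 0.0

print(thresholded_reward(50.0))   # 0.0
print(thresholded_reward(230.0))  # 230.0
print(thresholded_reward(-20.0))  # 0.0
```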

2 Answers


I want to maximize profit within a trading day and avoid placing the pair (limit buy order, limit sell order) if the profit on that transaction is less than $100. Note that I was planning to use the Profit & Loss (P&L) as the reward.

To me this implies that your profit per transaction is not the true reward function that you should be using. You don't say directly in the question, but presumably there is some per-transaction cost, tax or other issue which means that these low gain transactions are not desirable.

The answer is to find a more accurate reward function. You have suggested a quick fix of subtracting a fixed offset from the reward. This should have the desirable effect of limiting borderline buy/sell decisions and meeting your constraint, so I would suggest trying it. The main issue I see is that you may have changed the reward function so that it no longer truly reflects your goals.

A better approach is to look more carefully at the problem and your goals at a higher level. There must be a reason for wanting to avoid these smaller-profit transactions. What is it, and can it itself be expressed as a reward? For instance, if there will be a transaction fee, include that. If the fee structure or reasoning is complicated, or appears delayed or aggregated over many actions, then this makes optimisation harder, but RL is designed to cope with exactly that. You could, for instance, only reward the agent with the profit/loss after a group of actions covering a whole day's trading. You would then rely on the learning algorithm to figure out which states and actions combined to generate the observed reward.
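
As a rough sketch of the first suggestion, a per-transaction reward reflecting the money actually gained after fees might look like this (the fee structure and numbers are assumptions, not something given in the question):

```python
# Reward per (buy, sell) transaction, net of an assumed fixed commission per
# order and a proportional fee on the traded value. Values are illustrative.

FIXED_FEE = 5.0           # hypothetical commission per order
PROPORTIONAL_FEE = 0.001  # hypothetical 0.1% fee on the traded value

def transaction_reward(buy_price, sell_price, shares):
    gross = (sell_price - buy_price) * shares
    fees = 2 * FIXED_FEE + PROPORTIONAL_FEE * shares * (buy_price + sell_price)
    return gross - fees  # money actually in your pocket

print(transaction_reward(buy_price=100.0, sell_price=101.0, shares=50))  # 29.95
```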

If there is an unknown or random difference between predicted and actual financial gain, then no amount of juggling the reward function gets around that. For RL, learning about and predicting expected gains is "built in", but that does not mean the task becomes easier - in fact, if prediction in your problem is hard, you may be better off focusing on that and setting RL aside, at least initially.

Your question is not clear on this, but if you want to avoid risk whilst learning, bear in mind that a simple hard rule would likely interfere with RL's ability to work. At its heart, RL is a trial-and-error learning system. Errors should be expected, and are required for the system to learn where the best balance point lies between risk and reward. Of course, that doesn't help you if the learning system could bankrupt you while exploring what happens when it sells things at a loss - there are likely ways to avoid that and still achieve your goals, but you would need to explain more about your system in a different question.

Neil Slater
  • The purpose of wanting a minimum $100 profit per transaction is simply to avoid small-movement profits, and it gives me a sort of protection. Yeah, implicitly I want to cover at least the commission. – fgauth Jan 19 '19 at 20:56
  • @fgauth: What prevents you from including the commission in the reward calculation? Just base the reward on the actual money in your pocket after all handling costs... add in a cost for wasting your time too, if you like. – Neil Slater Jan 19 '19 at 21:06
  • The commission is neglected for the moment as it is a very small amount. The idea here is to avoid the little noisy movements; instead, I prefer long, beautiful, profitable movements. That's why I want a $100 minimum profit. – fgauth Jan 19 '19 at 21:33
  • @fgauth: Then you need to put a price/reward on your preference, and you should note that forcing the agent to follow your preference could lead it to perform less well by other measurements. – Neil Slater Jan 19 '19 at 22:00
  • You seem to know what you're talking about. Do you have time to discuss this in a private room, e.g. Hangouts or whatever? – fgauth Jan 19 '19 at 22:04
  • @fgauth: No, sorry, I don't want to schedule a meeting or chat. However, if you can phrase your problem as a question on this site, I might be able to answer. It takes some effort to ask a good question here, but it is good practice if some of your thinking is still unclear - sometimes merely thinking through and phrasing your question properly will help you find your own answer. – Neil Slater Jan 20 '19 at 08:51
  • @fgauth: I have seen your second question, and it seems clearer. Although I won't be able to answer it as you have left the financial side unexplained (you will need a trading expert to understand what it is you actually want), I think it is a good fit for the site and upvoted it. – Neil Slater Jan 20 '19 at 08:56
  • @fgauth "commission is neglected for the moment as it is a very little amount", I do not agree with you. Studies have shown that when transaction fees are included machine learning fails to out perform a buy and hold strategy. Trading fees eat away profit, especially the compounded negative effect of trading fees. – Jason Mar 03 '19 at 07:45

I would use the fitness function proposed in your other Stack Exchange question. I would then inflate the buying price used in the equation by $100, and leave the selling price as the price the shares were sold for. This would only reward a trading rule when the profit is greater than $100.
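
As a rough sketch of that adjustment, assuming the fitness of a (buy, sell) pair is simply its profit (the actual fitness function is defined in the other question and not reproduced here):

```python
# Inflate the total buying cost by $100 so the adjusted fitness is positive
# only when the transaction's profit exceeds $100. Applying the inflation per
# transaction (rather than per share) is an assumption.

MIN_PROFIT = 100.0

def adjusted_fitness(buy_price, sell_price, shares):
    inflated_cost = buy_price * shares + MIN_PROFIT
    return sell_price * shares - inflated_cost

print(adjusted_fitness(100.0, 101.0, 50))  # -50.0  (profit of $50: penalised)
print(adjusted_fitness(100.0, 104.0, 50))  # 100.0  (profit of $200: rewarded)
```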

If you want to minimise spikes, it might be best to normalise the data going into your learning algorithm, for example by using an x-day moving average instead of the raw share price.
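
For example, smoothing the input prices with a moving average might look like this (the window length and prices are illustrative assumptions):

```python
import pandas as pd

# Replace the raw share price with an x-period moving average (here x = 3)
# before feeding it to the learning algorithm.

prices = pd.Series([100.0, 101.5, 99.8, 102.3, 103.1, 101.9, 104.0, 105.2])
smoothed = prices.rolling(window=3).mean()

print(smoothed.round(2).tolist())  # first two entries are NaN until the window fills
```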

Jason