It seems that you assume that the state space has `8 * 8 * 2` states. We have a `2 * 4 = 8` grid, so one might think that there are 8 ways to place A, 8 ways to place B, and 2 ways to place the ball. However, this approach assumes that A and B can be placed in the same cell. In Littman's original paper (minimax Q-learning), this is not the case: A and B must always be in different cells, and the ball is always held by one of them. In that case, the correct number of states is `8 * 7 * 2 = 112`.
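As a quick sanity check of that count, here is a minimal sketch that enumerates the states by brute force. The names `N_CELLS` and `BALL_OWNERS`, and the encoding of a state as `(pos_a, pos_b, ball_owner)`, are my own assumptions and are not taken from your code.

```python
from itertools import product

N_CELLS = 2 * 4            # cells in the 2 x 4 grid
BALL_OWNERS = ("A", "B")   # assumed: the ball is held by one of the two players

# Every (pos_a, pos_b, ball_owner) triple, including those where A and B overlap.
all_triples = list(product(range(N_CELLS), range(N_CELLS), BALL_OWNERS))

# Keep only the triples in which A and B occupy different cells.
valid_states = [(a, b, owner) for a, b, owner in all_triples if a != b]

print(len(all_triples))   # 128
print(len(valid_states))  # 112
```

The difference of 16 corresponds to the 8 cells in which both players would overlap, times the 2 possible ball owners.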
I'm trying to understand why you defined the state space as `self.state_space = (8, 8, 2)`, which seems to suggest that
- you calculated the state space wrongly, or
- you allow players to be in the same cell, or
- maybe I am misinterpreting what the variable `self.state_space` is supposed to represent.
You write in the comments `self.state_space: <num of variable1, num of variable2, num of variable3>`, but this is unclear to me. You use `state_space` to define the Q-functions in the agent. These are naturally represented as multi-dimensional arrays in which each entry corresponds to a tuple `(a1, a2, state)`, so that part makes sense to me.
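To make the question concrete, here is a rough sketch of the two layouts I can imagine for the Q-function. It is not based on your implementation: `N_ACTIONS = 5`, the index order, and the names `dense_shape`, `state_index` and `compact_shape` are all assumptions of mine.

```python
from math import prod

N_ACTIONS = 5  # assumed: N, S, E, W, stand, as in Littman's soccer game

# Layout 1: index Q directly by (pos_a, pos_b, ball_owner, a1, a2).
# This matches a state_space of (8, 8, 2) but includes overlapping positions.
dense_shape = (8, 8, 2, N_ACTIONS, N_ACTIONS)

# Layout 2: map each valid state (A and B in different cells) to one flat index.
state_index = {}
for pos_a in range(8):
    for pos_b in range(8):
        if pos_a == pos_b:
            continue  # skip states where both players share a cell
        for ball_owner in range(2):
            state_index[(pos_a, pos_b, ball_owner)] = len(state_index)
compact_shape = (len(state_index), N_ACTIONS, N_ACTIONS)

print(prod(dense_shape))    # 3200 entries
print(prod(compact_shape))  # 2800 entries
```

If you use something like the first layout, the extra 400 entries would simply never be visited, which would be harmless but would explain the `(8, 8, 2)` shape.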
Could you please clarify how you define the state space, and how that affects, e.g., the definition of the Q-function and its shape?