This problem was originally introduced in *Solving POMDPs by Searching the Space of Finite Policies*.

Quoting the problem description from the original paper:

> The load/unload problem with 8 locations: the agent starts in the “Unload” location (U) and receives a reward each time it returns to this place after passing through the “Load” location (L). The problem is partially observable because the agent cannot distinguish the different locations in between Load and Unload, and because it cannot perceive if it is loaded or not ($$|S| = 14$$, $$|O| = 3$$ and $$|A| = 2$$).
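As a back-of-the-envelope check of $$|S| = 14$$, one common accounting (an assumption here, not stated in the paper) is that the loaded flag is determined at the two end locations, removing two of the $$8 \times 2$$ location/load combinations:

```python
# Enumerate the load/unload state space: 8 locations, each paired with a
# loaded flag, minus the two combinations assumed impossible at the ends
# (unloaded at the Load end, loaded at the Unload end).
locations = list(range(8))                     # 0 = Unload (U), 7 = Load (L)
states = [(loc, loaded)
          for loc in locations
          for loaded in (False, True)
          if not (loc == 0 and loaded)         # agent unloads on arriving at U
          and not (loc == 7 and not loaded)]   # agent loads on arriving at L
print(len(states))  # 14
```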


Run the example with:

```shell
python -m pomdp_py -r load_unload
```


## Submodules

- States: the location of the agent and whether or not it is loaded
- Actions: “move-left”, “move-right”
- Rewards: a reward each time the agent returns to the Unload location after passing through the Load location
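The state and action types above might be sketched as follows (these names are illustrative, not the actual classes defined in this module):

```python
from dataclasses import dataclass

# Illustrative types only -- the actual pomdp_py classes differ.
@dataclass(frozen=True)
class LUState:
    location: int   # 0 = Unload (U), 7 = Load (L), 1..6 in between
    loaded: bool    # hidden from the agent

MOVE_LEFT, MOVE_RIGHT = "move-left", "move-right"

start = LUState(location=0, loaded=False)   # the agent starts at Unload, empty
print(start)
```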

Bases: State

Bases: Action

Bases: Observation

This problem is small enough that the observation probabilities can be given directly.

probability(self, observation, next_state, action)[source]

Returns the probability of $$\Pr(o|s',a)$$.

Parameters:
• observation (Observation) – the observation $$o$$

• next_state (State) – the next state $$s'$$

• action (Action) – the action $$a$$

Returns:

the probability $$\Pr(o|s',a)$$

Return type:

float
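Since the agent can only perceive which of the three location types it is in, $$\Pr(o|s',a)$$ can be sketched as a deterministic lookup (the `(location, loaded)` tuple and the observation names are assumptions for illustration):

```python
# A sketch of Pr(o | s', a), assuming deterministic observations: the agent
# sees only the location type (Unload end, Load end, or in between), and
# never whether it is loaded.
def observation_probability(observation, next_state, action):
    location, _loaded = next_state   # hypothetical (location, loaded) tuple
    if location == 0:
        expected = "unload"
    elif location == 7:
        expected = "load"
    else:
        expected = "middle"
    return 1.0 if observation == expected else 0.0

print(observation_probability("middle", (3, True), "move-left"))  # 1.0
```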

sample(self, next_state, action)[source]

Returns observation randomly sampled according to the distribution of this observation model.

Parameters:
• next_state (State) – the next state $$s'$$

• action (Action) – the action $$a$$

Returns:

the observation $$o$$

Return type:

Observation

argmax(next_state, action, normalized=False, **kwargs)[source]

Returns the most likely observation

This problem is small enough that the transition probabilities can be given directly.

probability(self, next_state, state, action)[source]

Returns the probability of $$\Pr(s'|s,a)$$.

Parameters:
• state (State) – the state $$s$$

• next_state (State) – the next state $$s'$$

• action (Action) – the action $$a$$

Returns:

the probability $$\Pr(s'|s,a)$$

Return type:

float

sample(self, state, action)[source]

Returns next state randomly sampled according to the distribution of this transition model.

Parameters:
• state (State) – the state $$s$$

• action (Action) – the action $$a$$

Returns:

the next state $$s'$$

Return type:

State

argmax(state, action, normalized=False, **kwargs)[source]

Returns the most likely next state
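The transition dynamics can be sketched as deterministic one-step movement along the corridor, with the loaded flag flipping at the two ends (the `(location, loaded)` representation is an assumption for illustration):

```python
# A sketch of the assumed deterministic load/unload transition: the agent
# moves one cell left or right, becomes loaded on reaching the Load end,
# and unloaded on reaching the Unload end.
def transition(state, action):
    location, loaded = state         # hypothetical (location, loaded) tuple
    if action == "move-right":
        location = min(location + 1, 7)
    elif action == "move-left":
        location = max(location - 1, 0)
    if location == 7:
        loaded = True                # pick up a load at L
    elif location == 0:
        loaded = False               # drop the load at U
    return (location, loaded)

print(transition((6, False), "move-right"))  # (7, True)
```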

Bases: RewardModel

probability(self, reward, state, action, next_state)[source]

Returns the probability of $$\Pr(r|s,a,s')$$.

Parameters:
• reward (float) – the reward $$r$$

• state (State) – the state $$s$$

• action (Action) – the action $$a$$

• next_state (State) – the next state $$s'$$

Returns:

the probability $$\Pr(r|s,a,s')$$

Return type:

float

sample(self, state, action, next_state)[source]

Returns reward randomly sampled according to the distribution of this reward model. This method is required, i.e., every reward model is assumed to implement it.

Parameters:
• state (State) – the state $$s$$

• action (Action) – the action $$a$$

• next_state (State) – the next state $$s'$$

Returns:

the reward $$r$$

Return type:

float

argmax(state, action, next_state, normalized=False, **kwargs)[source]

Returns the most likely reward
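Following the problem description, the reward fires when a loaded agent arrives back at the Unload location; a sketch (the reward magnitude of 1 and the `(location, loaded)` tuples are assumptions for illustration):

```python
# A sketch of r(s, a, s'): a positive reward (value assumed to be 1) whenever
# the agent was loaded and its next location is the Unload end, 0 otherwise.
def reward(state, action, next_state):
    _loc, loaded = state
    next_loc, _next_loaded = next_state
    return 1 if loaded and next_loc == 0 else 0

print(reward((1, True), "move-left", (0, False)))  # 1
```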

Bases: RandomRollout

This is a trivial policy model, provided to stay consistent with the framework.

probability(self, action, state)[source]

Returns the probability of $$\pi(a|s)$$.

Parameters:
• action (Action) – the action $$a$$

• state (State) – the state $$s$$

Returns:

the probability $$\pi(a|s)$$

Return type:

float

sample(self, state)[source]

Returns action randomly sampled according to the distribution of this policy model.

Parameters:

state (State) – the state $$s$$

Returns:

the action $$a$$

Return type:

Action

argmax(state, normalized=False, **kwargs)[source]

Returns the most likely action

get_all_actions(self, *args)[source]

Returns a set of all possible actions, if feasible.
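Since the policy model is a trivial random rollout over the two actions, $$\pi(a|s)$$ is uniform; a minimal sketch (function names are illustrative, not the module's API):

```python
import random

# The two actions of the load/unload problem.
ACTIONS = ("move-left", "move-right")

# A sketch of the trivial rollout policy: uniform over both actions,
# independent of the state.
def policy_probability(action, state):
    return 1.0 / len(ACTIONS) if action in ACTIONS else 0.0

def policy_sample(state, rng=random):
    return rng.choice(ACTIONS)

print(policy_probability("move-left", (3, False)))  # 0.5
```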

Bases: POMDP