Tiger Problem

This is a classic POMDP problem, introduced in [7]. The description of the Tiger problem is as follows (quoted from POMDP: Introduction to Partially Observable Markov Decision Processes by Kamalzadeh and Hahsler):

A tiger is put with equal probability behind one of two doors, while treasure is put behind the other one. You are standing in front of the two closed doors and need to decide which one to open. If you open the door with the tiger, you will get hurt (negative reward). But if you open the door with treasure, you receive a positive reward. Instead of opening a door right away, you also have the option to wait and listen for tiger noises. But listening is neither free nor entirely accurate. You might hear the tiger behind the left door while it is actually behind the right door and vice versa.

Tiger is a simple POMDP with only 2 states, 3 actions, and 2 observations. The transition and observation probabilities can be easily specified by a table (or a dictionary in Python). To define this POMDP:

  1. Define the domain

  2. Define the models

  3. Instantiate the POMDP

  4. Solve the POMDP instance

Note

For a simple POMDP like Tiger, it is encouraged to place the code for all components (e.g. state, action, observation and models) under the same Python module (i.e. the same .py file).

Define the domain

We start by defining the domain (\(S, A, O\)). In pomdp_py, this amounts to defining three classes that inherit State, Action, and Observation (see basics).

import pomdp_py

class State(pomdp_py.State):
    def __init__(self, name):
        if name != "tiger-left" and name != "tiger-right":
            raise ValueError("Invalid state: %s" % name)
        self.name = name
    # ... __hash__, __eq__ should be implemented

class Action(pomdp_py.Action):
    def __init__(self, name):
        if name != "open-left" and name != "open-right"\
           and name != "listen":
            raise ValueError("Invalid action: %s" % name)
        self.name = name
    # ... __hash__, __eq__ should be implemented

class Observation(pomdp_py.Observation):
    def __init__(self, name):
        if name != "tiger-left" and name != "tiger-right":
            raise ValueError("Invalid observation: %s" % name)
        self.name = name
    # ... __hash__, __eq__ should be implemented
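
The __hash__ and __eq__ methods are omitted above. A minimal sketch of how they might be completed for State is shown below (our own completion, based only on the name attribute); Action and Observation can follow the same pattern:

# A possible completion of the omitted methods for State;
# Action and Observation can be completed the same way.
class State(pomdp_py.State):
    def __init__(self, name):
        if name != "tiger-left" and name != "tiger-right":
            raise ValueError("Invalid state: %s" % name)
        self.name = name

    def __hash__(self):
        return hash(self.name)

    def __eq__(self, other):
        return isinstance(other, State) and self.name == other.name

    def __repr__(self):
        return "State(%s)" % self.name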


Define the models

Next, we define the models (\(T, O, R, \pi\)). In pomdp_py, this amounts to defining classes that inherit ObservationModel, TransitionModel, RewardModel, and PolicyModel (see basics).

Note

pomdp_py also provides an interface for BlackboxModel.

As mentioned before, the uncertainty in the models for the Tiger problem can be specified by Python dictionaries. Let obs_probs and trans_probs be these dictionaries for \(O\) and \(T\), respectively. For example, we can set the probabilities according to the paper [7]:

obs_probs = {  # next_state -> action -> observation
    "tiger-left": {
        "open-left": {"tiger-left": 0.5, "tiger-right": 0.5},
        "open-right": {"tiger-left": 0.5, "tiger-right": 0.5},
        "listen": {"tiger-left": 0.85, "tiger-right": 0.15}
    },
    "tiger-right": {
        "open-left": {"tiger-left": 0.5, "tiger-right": 0.5},
        "open-right": {"tiger-left": 0.5, "tiger-right": 0.5},
        "listen": {"tiger-left": 0.15, "tiger-right": 0.85}
    }
}

trans_probs = {  # state -> action -> next_state
    "tiger-left": {
        "open-left": {"tiger-left": 0.5, "tiger-right": 0.5},
        "open-right": {"tiger-left": 0.5, "tiger-right": 0.5},
        "listen": {"tiger-left": 1.0, "tiger-right": 0.0}
    },
    "tiger-right": {
        "open-left": {"tiger-left": 0.5, "tiger-right": 0.5},
        "open-right": {"tiger-left": 0.5, "tiger-right": 0.5},
        "listen": {"tiger-left": 0.0, "tiger-right": 1.0}
    }
}

These dictionaries can be processed so that each string key is replaced with the corresponding State, Action or Observation object.
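
For example, a small helper (hypothetical; not part of pomdp_py) could perform this conversion:

# Hypothetical helper that wraps the string keys of the nested
# probability dictionaries into the given domain classes.
def to_object_probs(str_probs, OuterCls, MidCls, InnerCls):
    return {OuterCls(s): {MidCls(a): {InnerCls(x): p
                                      for x, p in dist.items()}
                          for a, dist in adist.items()}
            for s, adist in str_probs.items()}

# e.g.
# obs_probs = to_object_probs(obs_probs, State, Action, Observation)
# trans_probs = to_object_probs(trans_probs, State, Action, State)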

Then, we define classes that inherit TransitionModel and ObservationModel.

class TransitionModel(pomdp_py.TransitionModel):
    """This problem is small enough for the probabilities to be directly given
    externally"""
    def __init__(self, probs):
        self._probs = probs

    def probability(self, next_state, state, action, normalized=False, **kwargs):
        return self._probs[state][action][next_state]

    def sample(self, state, action, normalized=False, **kwargs):
        return self.get_distribution(state, action).random()

    def argmax(self, state, action, normalized=False, **kwargs):
        """Returns the most likely next state"""
        return max(self._probs[state][action], key=self._probs[state][action].get)

    def get_distribution(self, state, action, **kwargs):
        """Returns the underlying distribution of the model"""
        return pomdp_py.Histogram(self._probs[state][action])

    def get_all_states(self):
        return TigerProblem.STATES

class ObservationModel(pomdp_py.ObservationModel):
    """This problem is small enough for the probabilities to be directly given
    externally"""
    def __init__(self, probs):
        self._probs = probs

    def probability(self, observation, next_state, action, normalized=False, **kwargs):
        return self._probs[next_state][action][observation]

    def sample(self, next_state, action, normalized=False, **kwargs):
        return self.get_distribution(next_state, action).random()

    def argmax(self, next_state, action, normalized=False, **kwargs):
        """Returns the most likely observation"""
        return max(self._probs[next_state][action], key=self._probs[next_state][action].get)

    def get_distribution(self, next_state, action, **kwargs):
        """Returns the underlying distribution of the model; In this case, it's just a histogram"""
        return pomdp_py.Histogram(self._probs[next_state][action])

    def get_all_observations(self):
        return TigerProblem.OBSERVATIONS
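
As a quick sanity check (assuming __hash__ and __eq__ are implemented as sketched earlier, and that obs_probs and trans_probs have been converted to object keys), the models can be queried directly:

# Listening does not move the tiger ...
transition_model = TransitionModel(trans_probs)
assert transition_model.probability(State("tiger-left"),
                                    State("tiger-left"),
                                    Action("listen")) == 1.0
# ... and the "listen" observation matches the true state with probability 0.85.
observation_model = ObservationModel(obs_probs)
assert observation_model.probability(Observation("tiger-left"),
                                     State("tiger-left"),
                                     Action("listen")) == 0.85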


Next, we define the PolicyModel. The job of a PolicyModel is to (1) determine the set of actions that the robot can take at a given state (and/or history), and (2) sample an action from this set according to some probability distribution. This allows extensions to policy models that have a prior over actions. The idea of a preference over actions has been used in several existing works [1] [3] [4]. Without prior knowledge of action preference, the PolicyModel can simply sample actions from the set uniformly. Typically, we would like to start without (usually human-engineered) prior knowledge over actions, because such knowledge guides the planner and we cannot be sure that heuristic guidance is actually optimal; so caution must be used.

In the Tiger problem, we just define a simple PolicyModel as follows. We choose not to implement the probability and argmax functions because we don't really use them for planning; the PolicyModel in this case can do (1) and (2) without those two functions. In general, however, the PolicyModel could be learned, or the action space could be so large that a probability distribution over it becomes important.

class PolicyModel(pomdp_py.RandomRollout):
    """A simple random policy model; defined to keep consistent
    with the framework."""
    def probability(self, action, state, normalized=False, **kwargs):
        raise NotImplementedError  # Never used

    def sample(self, state, normalized=False, **kwargs):
        return self.get_all_actions().random()

    def argmax(self, state, normalized=False, **kwargs):
        """Returns the most likely action"""
        raise NotImplementedError

    def get_all_actions(self, **kwargs):
        return TigerProblem.ACTIONS
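
As a contrast, the following is a hypothetical sketch (not part of the tutorial code) of a policy model that prefers the listen action, illustrating how a prior over actions could be injected through sample(); as discussed above, such hand-engineered priors should be used with caution:

import random

class PreferredPolicyModel(pomdp_py.RandomRollout):
    """Hypothetical: samples "listen" more often than opening a door."""
    def sample(self, state, normalized=False, **kwargs):
        if random.random() < 0.7:  # prefer information gathering
            return Action("listen")
        return random.choice([Action("open-left"), Action("open-right")])

    def get_all_actions(self, **kwargs):
        return TigerProblem.ACTIONS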


Finally, we define the RewardModel. It is straightforward to write down from the problem description. In this case (and very commonly), the reward function is deterministic.

class RewardModel(pomdp_py.RewardModel):
    def __init__(self, scale=1):
        self._scale = scale

    def _reward_func(self, state, action):
        reward = 0
        if action.name == "open-left":
            if state.name == "tiger-right":
                reward += 10 * self._scale
            else:
                reward -= 100 * self._scale
        elif action.name == "open-right":
            if state.name == "tiger-left":
                reward += 10 * self._scale
            else:
                reward -= 100 * self._scale
        elif action.name == "listen":
            reward -= 1 * self._scale
        return reward

    def probability(self, reward, state, action, next_state, normalized=False, **kwargs):
        if reward == self._reward_func(state, action):
            return 1.0
        else:
            return 0.0

    def sample(self, state, action, next_state, normalized=False, **kwargs):
        # deterministic
        return self._reward_func(state, action)

    def argmax(self, state, action, next_state, normalized=False, **kwargs):
        """Returns the most likely reward"""
        return self._reward_func(state, action)
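
A few quick checks of the reward values above (a sketch, using the domain classes defined earlier; the expected values follow directly from _reward_func):

reward_model = RewardModel()
assert reward_model.sample(State("tiger-left"), Action("listen"), None) == -1
assert reward_model.sample(State("tiger-left"), Action("open-right"), None) == 10
assert reward_model.sample(State("tiger-left"), Action("open-left"), None) == -100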


Define the POMDP

With the models that we have defined, it is simple to define a POMDP for the Tiger problem; to do this, we need to define the Agent and the Environment.

class TigerProblem(pomdp_py.POMDP):

    # build_states, build_actions, build_observations are small helpers
    # (not shown here) that wrap each string into the corresponding
    # State, Action, or Observation object.
    STATES = build_states({"tiger-left", "tiger-right"})
    ACTIONS = build_actions({"open-left", "open-right", "listen"})
    OBSERVATIONS = build_observations({"tiger-left", "tiger-right"})

    def __init__(self, obs_probs, trans_probs, init_true_state, init_belief):
        """init_belief is a Distribution."""
        self._obs_probs = obs_probs
        self._trans_probs = trans_probs

        agent = pomdp_py.Agent(init_belief,
                               PolicyModel(),
                               TransitionModel(self._trans_probs),
                               ObservationModel(self._obs_probs),
                               RewardModel())
        env = pomdp_py.Environment(init_true_state,
                                   TransitionModel(self._trans_probs),
                                   RewardModel())
        super().__init__(agent, env, name="TigerProblem")


Notice that init_true_state and init_belief need to be provided. The process of creating them is described in more detail in the next section.

Note

It is entirely optional to define a Problem class (like TigerProblem) that extends the pomdp_py.framework.basics.POMDP class in order to use a pomdp_py.framework.planner.Planner to solve a POMDP; only the Agent and the Environment are needed. The POMDP class can sometimes help organize the parameters that need to be passed into the constructors of Agent and Environment. For complicated problems, specific Agent and Environment classes are written that inherit pomdp_py.framework.basics.Agent and pomdp_py.framework.basics.Environment.
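
For example, the following sketch constructs the Agent and the Environment directly, without a TigerProblem class (reusing the models and probability dictionaries defined above):

# Sketch: working with Agent and Environment directly.
init_true_state = State("tiger-left")
init_belief = pomdp_py.Histogram({State("tiger-left"): 0.5,
                                  State("tiger-right"): 0.5})
agent = pomdp_py.Agent(init_belief,
                       PolicyModel(),
                       TransitionModel(trans_probs),
                       ObservationModel(obs_probs),
                       RewardModel())
env = pomdp_py.Environment(init_true_state,
                           TransitionModel(trans_probs),
                           RewardModel())
# A Planner plans with `agent`; `env` maintains the true state and rewards.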

Instantiating the POMDP

Now that we have a definition of the Tiger problem, we need to instantiate a problem by providing parameters for the models, the initial state of the environment, and the initial belief of the agent.

In Tiger, the model parameters are basically the probabilities for \(T\) and \(O\), which have been described above (see Define the models).

We can create a random initial state and a uniform belief as follows:

import random

init_true_state = random.choice(list(TigerProblem.STATES))
init_belief = pomdp_py.Histogram({State("tiger-left"): 0.5,
                                  State("tiger-right"): 0.5})

Then, we can create an instance of the Tiger problem:

tiger_problem = TigerProblem(obs_probs,
                             trans_probs,
                             init_true_state, init_belief)


Solving the POMDP instance

To solve a POMDP with pomdp_py, here are the basic steps:

  1. Create a planner (Planner)

  2. Agent plans an action \(a_t\).

  3. Environment state transitions \(s_t \rightarrow s_{t+1}\) according to its transition model.

  4. Agent receives an observation \(o_t\) and reward \(r_t\) from the environment.

  5. Agent updates history and belief \(h_t,b_t \rightarrow h_{t+1},b_{t+1}\) where \(h_{t+1} = h_t \cup (a_t, o_t)\).

    • This can be done either by updating the agent's belief directly, or through an update of the planner. More specifically, if the planner is POMCP, updating the planner updates the agent's belief as well; but for POUCT or ValueIteration, the agent's belief needs to be updated explicitly.

  6. Unless a termination condition is reached, repeat steps 2-5.

For the Tiger problem, we implemented this procedure as follows:

# Step 1; in main()
# creating planners
vi = pomdp_py.ValueIteration(horizon=2, discount_factor=0.99)
pouct = pomdp_py.POUCT(max_depth=10, discount_factor=0.95,
                       planning_time=.5, exploration_const=110,
                       rollout_policy=tiger_problem.agent.policy_model)
pomcp = pomdp_py.POMCP(max_depth=10, discount_factor=0.95,
                       planning_time=.5, exploration_const=110,
                       rollout_policy=tiger_problem.agent.policy_model)
...  # call test_planner() for steps 2-6.

# Steps 2-6; called in main()
def test_planner(tiger_problem, planner, nsteps=3):
   """Runs the action-feedback loop of Tiger problem POMDP"""
    for i in range(nsteps):  # Step 6
        # Step 2
        action = planner.plan(tiger_problem.agent)
        print("==== Step %d ====" % (i+1))
        print("True state: %s" % tiger_problem.env.state)
        print("Belief: %s" % str(tiger_problem.agent.cur_belief))
        print("Action: %s" % str(action))
        # Step 3; no transition happens here, since actions in the Tiger
        # problem do not change the environment state (i.e. the tiger location).
        print("Reward: %s" % str(tiger_problem.env.reward_model.sample(tiger_problem.env.state, action, None)))

        # Step 4
        # Let's create a simulated "real" observation and update the belief.
        # Here we use the true observation to sanity-check the solver's behavior;
        # in general, this observation should be sampled from the agent's observation model.
        real_observation = Observation(tiger_problem.env.state.name)
        print(">> Observation: %s" % real_observation)

        # Step 5
        tiger_problem.agent.update_history(action, real_observation)
        planner.update(tiger_problem.agent, action, real_observation)
        if isinstance(planner, pomdp_py.POUCT):
            print("Num sims: %d" % planner.last_num_sims)
        if isinstance(tiger_problem.agent.cur_belief, pomdp_py.Histogram):
            new_belief = pomdp_py.update_histogram_belief(tiger_problem.agent.cur_belief,
                                                          action, real_observation,
                                                          tiger_problem.agent.observation_model,
                                                          tiger_problem.agent.transition_model)
            tiger_problem.agent.set_belief(new_belief)

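For example, the loop above can then be run with each planner (a sketch; the nsteps values are arbitrary):

test_planner(tiger_problem, vi, nsteps=3)
test_planner(tiger_problem, pouct, nsteps=10)
# POMCP additionally requires the agent's belief to be represented by
# particles (rather than a Histogram) before calling:
# test_planner(tiger_problem, pomcp, nsteps=10)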

Summary

In short, to use pomdp_py to define a POMDP problem and solve an instance of the problem,

  1. Define the domain

  2. Define the models

  3. Instantiate the POMDP

  4. Solve the POMDP instance

Best of luck!

[1] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, 2164–2172. 2010.

[7] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.

[3] David Abel, David Ellis Hershkowitz, Gabriel Barth-Maron, Stephen Brawner, Kevin O'Farrell, James MacGlashan, and Stefanie Tellex. Goal-based action priors. In Twenty-Fifth International Conference on Automated Planning and Scheduling. 2015.

[4] Yuchen Xiao, Sammie Katt, Andreas ten Pas, Shengjian Chen, and Christopher Amato. Online planning for target object search in clutter under partial observability. In Proceedings of the International Conference on Robotics and Automation. 2019.