Using External Solvers

pomdp_py provides function calls to use external solvers, given a POMDP defined using pomdp_py interfaces. Currently, we interface with:

  • pomdp-solve (https://www.pomdp.org/code/)

  • sarsop (https://github.com/AdaCompNUS/sarsop)

We hope to interface with more solvers in the future.


Converting a pomdp_py Agent to a POMDP File

Many existing libraries take as input a POMDP model written in a text file. There are two common file formats: .POMDP (http://www.pomdp.org/code/pomdp-file-spec.html) and .POMDPX (https://bigbird.comp.nus.edu.sg/pmwiki/farm/appl/index.php?n=Main.PomdpXDocumentation). A .POMDP file can be converted into a .POMDPX file using the pomdpconvert program that is part of the SARSOP toolkit.
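
For reference, a minimal hand-written .POMDP file for the Tiger domain might look as follows (a sketch following the .POMDP spec linked above; the file generated by to_pomdp_file below may differ in layout):

# Tiger domain (sketch), 0.15 observation noise
discount: 0.95
values: reward
states: tiger-left tiger-right
actions: open-left open-right listen
observations: tiger-left tiger-right

T: listen
identity
T: open-left
uniform
T: open-right
uniform

O: listen : tiger-left : tiger-left 0.85
O: listen : tiger-left : tiger-right 0.15
O: listen : tiger-right : tiger-right 0.85
O: listen : tiger-right : tiger-left 0.15
O: open-left
uniform
O: open-right
uniform

R: listen : * : * : * -1
R: open-left : tiger-left : * : * -100
R: open-left : tiger-right : * : * 10
R: open-right : tiger-left : * : * 10
R: open-right : tiger-right : * : * -100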

If a pomdp_py Agent has an enumerable state space \(S\), action space \(A\), and observation space \(\Omega\), with explicitly defined probabilities for its models (\(T, O, R\)), then it can be converted to either the .POMDP file format (to_pomdp_file) or the .POMDPX file format (to_pomdpx_file).
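
As a quick sanity check before converting, the spaces can be enumerated directly (a minimal sketch; it assumes, as for standard pomdp_py agents, that the Agent exposes its spaces through the all_states, all_actions, and all_observations properties, and that tiger is the Tiger problem instance created in the Example below):

# Each of these must be a finite, explicit enumeration for
# to_pomdp_file / to_pomdpx_file to work.
print(list(tiger.agent.all_states))        # e.g., [TigerState("tiger-left"), ...]
print(list(tiger.agent.all_actions))
print(list(tiger.agent.all_observations))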

pomdp_py.utils.interfaces.conversion.to_pomdp_file(agent, output_path=None, discount_factor=0.95, float_precision=9)[source]

Pass in an Agent, and its components will be used to generate a .pomdp file at output_path.

The .pomdp file format is specified at: http://www.pomdp.org/code/pomdp-file-spec.html

Note:

  • It is assumed that the reward is independent of the observation.

  • The states, actions, and observations of the agent must be explicitly enumerable.

  • Each state, action, and observation must be convertible to a string that contains no blank space.

Parameters:
  • agent (Agent) – The agent

  • output_path (str) – The path of the output file to write to. Optional. Default None.

  • discount_factor (float) – The discount factor

  • float_precision (int) – Number of decimals for float-to-string conversion. Default 9.

Returns:

The lists of states, actions, and observations, ordered in the same way as they appear in the generated .pomdp file.

Return type:

(list, list, list)
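
The returned orderings are worth keeping around, since solver output files index states, actions, and observations by position (a minimal sketch; it assumes tiger is the Tiger problem instance created in the Example below):

from pomdp_py import to_pomdp_file

# The returned lists record the order in which states, actions, and
# observations were written into the .pomdp file; PolicyGraph.construct
# and AlphaVectorPolicy.construct (documented below) expect this order.
states, actions, observations = to_pomdp_file(tiger.agent,
                                              "./test_tiger.POMDP",
                                              discount_factor=0.95)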

pomdp_py.utils.interfaces.conversion.to_pomdpx_file(agent, pomdpconvert_path, output_path=None, discount_factor=0.95)[source]

Converts an agent to a .pomdpx file. This works by first converting the agent into a .pomdp file and then using the pomdpconvert utility program to convert that file to a .pomdpx file. Check out pomdpconvert at https://github.com/AdaCompNUS/sarsop

Follow the instructions at https://github.com/AdaCompNUS/sarsop to download and build SARSOP (tested on Ubuntu 18.04, gcc 7.5.0).

See documentation for pomdpx at: https://bigbird.comp.nus.edu.sg/pmwiki/farm/appl/index.php?n=Main.PomdpXDocumentation


Parameters:
  • agent (Agent) – The agent

  • pomdpconvert_path (str) – Path to the pomdpconvert binary

  • output_path (str) – The path of the output file to write to. Optional. Default None.

  • discount_factor (float) – The discount factor

Example

Let’s use the existing Tiger problem as an example. First we create an instance of the Tiger problem:

import pomdp_py
from pomdp_py.problems.tiger.tiger_problem import TigerProblem, TigerState

init_state = "tiger-left"
tiger = TigerProblem(0.15, TigerState(init_state),
                     pomdp_py.Histogram({TigerState("tiger-left"): 0.5,
                                         TigerState("tiger-right"): 0.5}))

Convert to a .POMDP file:

from pomdp_py import to_pomdp_file
filename = "./test_tiger.POMDP"
to_pomdp_file(tiger.agent, filename, discount_factor=0.95)

Convert to a .POMDPX file:

from pomdp_py import to_pomdpx_file
filename = "./test_tiger.POMDPX"
pomdpconvert_path = "~/software/sarsop/src/pomdpconvert"
to_pomdpx_file(tiger.agent, pomdpconvert_path,
               output_path=filename,
               discount_factor=0.95)

Using pomdp-solve

Pass the agent to the vi_pruning function, and it will run the pomdp-solve binary (at the specified path):

pomdp_py.utils.interfaces.solvers.vi_pruning(agent, pomdp_solve_path, discount_factor=0.95, options=[], pomdp_name='temp-pomdp', remove_generated_files=False, return_policy_graph=False)[source]

Value iteration with pruning, using the pomdp-solve software (https://www.pomdp.org/code/) developed by Anthony R. Cassandra.

Parameters:
  • agent (pomdp_py.Agent) – The agent that contains the POMDP definition

  • pomdp_solve_path (str) – Path to the pomdp-solve binary generated after compiling the pomdp-solve library.

  • options (list) –

    Additional options to pass to the command-line interface. The options should be a list of strings, such as ["-stop_criteria", "weak", ...]. Some useful options are:

    -horizon <int>
    -time_limit <int>

  • pomdp_name (str) – The name used to create the .pomdp file.

  • remove_generated_files (bool) – If True, the generated .pomdp, .alpha, and .pg files are removed after the policy is computed. Default is False.

  • return_policy_graph (bool) – If True, the policy is returned as a PolicyGraph. Default is False, in which case an AlphaVectorPolicy is returned.

Returns:

The policy returned by the solver.

Return type:

PolicyGraph or AlphaVectorPolicy

Example

Setting the path. After downloading and installing pomdp-solve, a binary executable called pomdp-solve should appear under path/to/pomdp-solve-<version>/src/. We create a variable in Python to point to this path:

pomdp_solve_path = "path/to/pomdp-solve-<version>/src/pomdp-solve"

Computing a policy. We recommend using the AlphaVectorPolicy; that means leaving return_policy_graph as False (the default).

from pomdp_py import vi_pruning
policy = vi_pruning(tiger.agent, pomdp_solve_path, discount_factor=0.95,
                    options=["-horizon", "100"],
                    remove_generated_files=False,
                    return_policy_graph=False)

Using the policy. Here the code checks whether the policy is an AlphaVectorPolicy or a PolicyGraph:

for step in range(10):
    action = policy.plan(tiger.agent)
    reward = tiger.env.state_transition(action, execute=True)
    observation = tiger.agent.observation_model.sample(tiger.env.state, action)

    if isinstance(policy, pomdp_py.PolicyGraph):
        # No belief update needed; just update the policy graph
        policy.update(tiger.agent, action, observation)
    else:
        # AlphaVectorPolicy: perform belief update on the agent
        new_belief = pomdp_py.update_histogram_belief(
            tiger.agent.cur_belief, action, observation,
            tiger.agent.observation_model, tiger.agent.transition_model)
        tiger.agent.set_belief(new_belief)

Complete example code:
import pomdp_py
from pomdp_py import vi_pruning
from pomdp_py.problems.tiger.tiger_problem import TigerProblem, TigerState

# Initialize problem
init_state = "tiger-left"
tiger = TigerProblem(0.15, TigerState(init_state),
                     pomdp_py.Histogram({TigerState("tiger-left"): 0.5,
                                         TigerState("tiger-right"): 0.5}))

# Compute policy
pomdp_solve_path = "path/to/pomdp-solve-<version>/src/pomdp-solve"
policy = vi_pruning(tiger.agent, pomdp_solve_path, discount_factor=0.95,
                    options=["-horizon", "100"],
                    remove_generated_files=False,
                    return_policy_graph=False)

# Simulate the POMDP using the policy
for step in range(10):
    action = policy.plan(tiger.agent)
    reward = tiger.env.state_transition(action, execute=True)
    observation = tiger.agent.observation_model.sample(tiger.env.state, action)
    print(tiger.agent.cur_belief, action, observation, reward)

    if isinstance(policy, pomdp_py.PolicyGraph):
        # No belief update needed. Just update the policy graph
        policy.update(tiger.agent, action, observation)
    else:
        # belief update is needed for AlphaVectorPolicy
        new_belief = pomdp_py.update_histogram_belief(tiger.agent.cur_belief,
                                                      action, observation,
                                                      tiger.agent.observation_model,
                                                      tiger.agent.transition_model)
        tiger.agent.set_belief(new_belief)
Expected output (or similar):
 //****************\\
||   pomdp-solve    ||
||     v. 5.4       ||
 \\****************//
      PID=8239
- - - - - - - - - - - - - - - - - - - -
time_limit = 0
mcgs_prune_freq = 100
verbose = context
...
horizon = 100
...
- - - - - - - - - - - - - - - - - - - -
[Initializing POMDP ... done.]
[Initial policy has 1 vectors.]
++++++++++++++++++++++++++++++++++++++++
Epoch: 1...3 vectors in 0.00 secs. (0.00 total) (err=inf)
Epoch: 2...5 vectors in 0.00 secs. (0.00 total) (err=inf)
Epoch: 3...9 vectors in 0.00 secs. (0.00 total) (err=inf)
...
Epoch: 95...9 vectors in 0.00 secs. (2.39 total) (err=inf)
Epoch: 96...9 vectors in 0.00 secs. (2.39 total) (err=inf)
Epoch: 97...9 vectors in 0.00 secs. (2.39 total) (err=inf)
Epoch: 98...9 vectors in 0.00 secs. (2.39 total) (err=inf)
Epoch: 99...9 vectors in 0.00 secs. (2.39 total) (err=inf)
Epoch: 100...9 vectors in 0.01 secs. (2.40 total) (err=inf)
++++++++++++++++++++++++++++++++++++++++
Solution found.  See file:
        temp-pomdp.alpha
        temp-pomdp.pg
++++++++++++++++++++++++++++++++++++++++
User time = 0 hrs., 0 mins, 2.40 secs. (= 2.40 secs)
System time = 0 hrs., 0 mins, 0.00 secs. (= 0.00 secs)
Total execution time = 0 hrs., 0 mins, 2.40 secs. (= 2.40 secs)

** Warning **
        lp_solve reported 2 LPs with numerical instability.
[(TigerState(tiger-right), 0.5), (TigerState(tiger-left), 0.5)] listen tiger-left -1.0
[(TigerState(tiger-left), 0.85), (TigerState(tiger-right), 0.15)] listen tiger-left -1.0
[(TigerState(tiger-left), 0.9697986575573173), (TigerState(tiger-right), 0.03020134244268276)] open-right tiger-left 10.0
...

Using sarsop

pomdp_py.utils.interfaces.solvers.sarsop(agent, pomdpsol_path, discount_factor=0.95, timeout=30, memory=100, precision=0.5, pomdp_name='temp-pomdp', remove_generated_files=False, logfile=None)[source]

SARSOP, using the pomdpsol binary from https://github.com/AdaCompNUS/sarsop. SARSOP is an anytime POMDP planning algorithm.

Parameters:
  • agent (pomdp_py.Agent) – The agent that defines the POMDP models

  • pomdpsol_path (str) – Path to the pomdpsol binary

  • timeout (int) – The time limit (in seconds) for running the algorithm before termination

  • memory (int) – The memory limit (in MB) for running the algorithm before termination

  • precision (float) – The solver runs until the regret is less than this precision

  • pomdp_name (str) – Name of the .pomdp file that will be created when solving

  • remove_generated_files (bool) – If True, remove the files generated during solving after it finishes.

  • logfile (str) – Path to a file where both stdout and stderr of the solver are written

Returns:

The policy returned by the solver.

Return type:

AlphaVectorPolicy

Example

Setting the path. After building SARSOP (https://github.com/AdaCompNUS/sarsop), a binary executable called pomdpsol appears under path/to/sarsop/src. We create a variable in Python to point to this path:

pomdpsol_path = "path/to/sarsop/src/pomdpsol"

Computing a policy.

from pomdp_py import sarsop
policy = sarsop(tiger.agent, pomdpsol_path, discount_factor=0.95,
                timeout=10, memory=20, precision=0.000001,
                remove_generated_files=True)

Using the policy. (Same as above; only the AlphaVectorPolicy case applies, since sarsop returns an AlphaVectorPolicy.)

for step in range(10):
    action = policy.plan(tiger.agent)
    reward = tiger.env.state_transition(action, execute=True)
    observation = tiger.agent.observation_model.sample(tiger.env.state, action)
    # perform belief update on the agent (see the complete example below)
    new_belief = pomdp_py.update_histogram_belief(
        tiger.agent.cur_belief, action, observation,
        tiger.agent.observation_model, tiger.agent.transition_model)
    tiger.agent.set_belief(new_belief)

Complete example code:
import pomdp_py
from pomdp_py import sarsop
from pomdp_py.problems.tiger.tiger_problem import TigerProblem, TigerState

# Initialize problem
init_state = "tiger-left"
tiger = TigerProblem(0.15, TigerState(init_state),
                     pomdp_py.Histogram({TigerState("tiger-left"): 0.5,
                                         TigerState("tiger-right"): 0.5}))

# Compute policy
pomdpsol_path = "path/to/sarsop/src/pomdpsol"
policy = sarsop(tiger.agent, pomdpsol_path, discount_factor=0.95,
                timeout=10, memory=20, precision=0.000001,
                remove_generated_files=True)

# Simulate the POMDP using the policy
for step in range(10):
    action = policy.plan(tiger.agent)
    reward = tiger.env.state_transition(action, execute=True)
    observation = tiger.agent.observation_model.sample(tiger.env.state, action)
    print(tiger.agent.cur_belief, action, observation, reward)

    # belief update is needed for AlphaVectorPolicy
    new_belief = pomdp_py.update_histogram_belief(tiger.agent.cur_belief,
                                                  action, observation,
                                                  tiger.agent.observation_model,
                                                  tiger.agent.transition_model)
    tiger.agent.set_belief(new_belief)
Expected output (or similar):
Loading the model ...
  input file   : ./temp-pomdp.pomdp
  loading time : 0.00s

SARSOP initializing ...
  initialization time : 0.00s

-------------------------------------------------------------------------------
 Time   |#Trial |#Backup |LBound    |UBound    |Precision  |#Alphas |#Beliefs
-------------------------------------------------------------------------------
 0       0       0        -20        92.8205    112.821     4        1
 0       2       51       -6.2981    63.7547    70.0528     7        15
 0       4       103      2.35722    52.3746    50.0174     5        19
 0       6       155      6.44093    45.1431    38.7021     5        20
 0       8       205      12.1184    36.4409    24.3225     5        20
 ...
 0       40      1255     19.3714    19.3714    7.13808e-06 5        21
 0       41      1300     19.3714    19.3714    3.76277e-06 6        21
 0       42      1350     19.3714    19.3714    1.75044e-06 12       21
 0       43      1393     19.3714    19.3714    9.22729e-07 11       21
-------------------------------------------------------------------------------

SARSOP finishing ...
  target precision reached
  target precision  : 0.000001
  precision reached : 0.000001

-------------------------------------------------------------------------------
 Time   |#Trial |#Backup |LBound    |UBound    |Precision  |#Alphas |#Beliefs
-------------------------------------------------------------------------------
 0       43      1393     19.3714    19.3714    9.22729e-07 5        21
-------------------------------------------------------------------------------

Writing out policy ...
  output file : temp-pomdp.policy

[(TigerState(tiger-right), 0.5), (TigerState(tiger-left), 0.5)] listen tiger-left -1.0
[(TigerState(tiger-left), 0.85), (TigerState(tiger-right), 0.15)] listen tiger-left -1.0
[(TigerState(tiger-left), 0.9697986575573173), (TigerState(tiger-right), 0.03020134244268276)] open-right tiger-right 10.0
...

PolicyGraph and AlphaVectorPolicy

PolicyGraph and AlphaVectorPolicy extend the Planner interface, which means they have a plan function that outputs an action given an agent (using the agent's belief).

class pomdp_py.utils.interfaces.conversion.PolicyGraph(nodes, edges, states)[source]

A PolicyGraph encodes a POMDP plan. It can be constructed from the alpha vectors and policy-graph output of Cassandra's pomdp-solve.

classmethod construct(alpha_path, pg_path, states, actions, observations)[source]

See parse_pomdp_solve_output for detailed definitions of alphas and pg.

Parameters:
  • alpha_path (str) – Path to .alpha file

  • pg_path (str) – Path to .pg file

  • states (list) – List of states, ordered as in .pomdp file

  • actions (list) – List of actions, ordered as in .pomdp file

  • observations (list) – List of observations, ordered as in .pomdp file
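
For instance (a minimal sketch; it assumes the temp-pomdp.alpha and temp-pomdp.pg files left behind by the vi_pruning run above, plus the (states, actions, observations) orderings returned by to_pomdp_file):

from pomdp_py.utils.interfaces.conversion import PolicyGraph

# states, actions, observations: orderings returned by to_pomdp_file
pg = PolicyGraph.construct("temp-pomdp.alpha", "temp-pomdp.pg",
                           states, actions, observations)
action = pg.plan(tiger.agent)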

plan(agent)[source]

Returns the action mapped from the agent's current belief under this policy.

update(agent, action, observation)[source]

Updates the planner based on the real action and observation. Sets the current node pointer by following the edge labeled with the incoming observation.
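
Conceptually, the mechanics are those of a finite state machine (a toy sketch; the nodes and edges below are hand-written for illustration, whereas a real PolicyGraph parses them from the .alpha and .pg files):

# Toy policy-graph mechanics (illustrative only)
nodes = {0: "listen", 1: "open-right", 2: "open-left"}      # node -> action
edges = {(0, "tiger-left"): 1, (0, "tiger-right"): 2,
         (1, "tiger-left"): 0, (1, "tiger-right"): 0,
         (2, "tiger-left"): 0, (2, "tiger-right"): 0}       # (node, obs) -> node

cur = 0                          # current node pointer
action = nodes[cur]              # "plan": take the action at the current node
observation = "tiger-left"       # observed after executing the action
cur = edges[(cur, observation)]  # "update": follow the observation edge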

class pomdp_py.utils.interfaces.conversion.AlphaVectorPolicy(alphas, states)[source]

An offline POMDP policy is specified by a collection of alpha vectors, each associated with an action. When planning, the dot product of each alpha vector with the agent's belief vector is computed; the alpha vector attaining the maximum is the 'dominating' alpha vector, and its corresponding action is returned.

An offline policy can optionally be represented as a policy graph. In that case, the agent can plan without actively maintaining a belief, because the policy graph is a finite state machine that transitions on observations.

This can be constructed using the .policy file created by sarsop.

plan(agent)[source]

Returns the action mapped from the agent's current belief under this policy.

value(belief)[source]

Returns the value V(b) under this alpha vector policy.

\(V(b) = \max_{\alpha \in \Gamma} \alpha \cdot b\)
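
In other words, evaluating the policy reduces to dot products followed by a max (a standalone sketch; the (action, vector) pairing is hypothetical and for illustration only, as AlphaVectorPolicy stores its alphas internally):

import numpy as np

def alpha_value(alphas, belief):
    # V(b) = max over alpha vectors of alpha . b
    return max(np.dot(vec, belief) for _, vec in alphas)

def alpha_action(alphas, belief):
    # Action associated with the dominating alpha vector at b
    return max(alphas, key=lambda av: np.dot(av[1], belief))[0]

# Two toy alpha vectors over the Tiger states (tiger-left, tiger-right)
alphas = [("open-right", np.array([10.0, -100.0])),
          ("listen",     np.array([-1.0, -1.0]))]
print(alpha_value(alphas, np.array([0.5, 0.5])))     # -1.0 (listen dominates)
print(alpha_action(alphas, np.array([0.97, 0.03])))  # open-right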

classmethod construct(policy_path, states, actions, solver='pomdpsol')[source]

Returns an AlphaVectorPolicy constructed from the given policy file; the alphas are parsed by parse_appl_policy_file.

Parameters:
  • policy_path (str) – Path to the generated .policy file (for sarsop) or .alpha file (for pomdp-solve)

  • states (list) – A list of States, in the same order as in the .pomdp file

  • actions (list) – A list of Actions, in the same order as in the .pomdp file

Returns:

The policy stored in the given policy file.

Return type:

AlphaVectorPolicy
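
For example, a previously computed sarsop policy can be reloaded without re-running the solver (a minimal sketch; it assumes a temp-pomdp.policy file produced by an earlier sarsop run and the state/action orderings returned by to_pomdp_file):

from pomdp_py.utils.interfaces.conversion import AlphaVectorPolicy

# states, actions: orderings returned by to_pomdp_file
policy = AlphaVectorPolicy.construct("temp-pomdp.policy",
                                     states, actions, solver="pomdpsol")
print(policy.value(tiger.agent.cur_belief))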