Using External Solvers¶
pomdp_py provides function calls to use external solvers, given a POMDP defined using pomdp_py interfaces. Currently, we interface with:
- pomdp-solve by Anthony R. Cassandra
- SARSOP by NUS
We hope to interface with:
- more? Help us if you can!
Converting a pomdp_py Agent to a POMDP File¶
Many existing libraries take as input a POMDP model written in a text file. There are two file formats: .POMDP and .POMDPX. A .POMDP file can be converted into a .POMDPX file using the pomdpconvert program that is part of the SARSOP toolkit.
If a pomdp_py Agent has an enumerable state space \(S\), action space \(A\), and observation space \(\Omega\), with explicitly defined probabilities for its models (\(T, O, R\)), then it can be converted to either the POMDP file format (to_pomdp_file) or the POMDPX file format (to_pomdpx_file).
- pomdp_py.utils.interfaces.conversion.to_pomdp_file(agent, output_path=None, discount_factor=0.95, float_precision=9)[source]¶
Pass in an Agent, and use its components to generate a .pomdp file to output_path.
The .pomdp file format is specified at: http://www.pomdp.org/code/pomdp-file-spec.html
Note:
- It is assumed that the reward is independent of the observation.
- The states, actions, and observations of the agent must be explicitly enumerable.
- The states, actions, and observations of the agent must be convertible to strings that do not contain any whitespace.
- Parameters:
agent (Agent) – The agent
output_path (str) – The path of the output file to write to. Optional. Default None.
discount_factor (float) – The discount factor. Default 0.95.
float_precision (int) – Number of decimals for float-to-string conversion. Default 9.
- Returns:
The lists of states, actions, and observations, ordered in the same way as they are in the .pomdp file.
- Return type:
(list, list, list)
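For example, the returned lists can be captured so that their ordering can be reused later (e.g., when constructing a policy from solver output files). A minimal sketch, assuming agent is a pomdp_py Agent with enumerable spaces; the output file name is only illustrative:
from pomdp_py import to_pomdp_file

# The returned states, actions, and observations are ordered exactly as
# they appear in the generated .pomdp file.
all_states, all_actions, all_observations = to_pomdp_file(
    agent, "problem.pomdp", discount_factor=0.95)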
- pomdp_py.utils.interfaces.conversion.to_pomdpx_file(agent, pomdpconvert_path, output_path=None, discount_factor=0.95)[source]¶
Converts an agent to a .pomdpx file. This works by first converting the agent into a .pomdp file and then using the pomdpconvert utility program to convert that file to a .pomdpx file. pomdpconvert is part of SARSOP (https://github.com/AdaCompNUS/sarsop); follow the instructions there to download and build it (tested on Ubuntu 18.04, gcc 7.5.0).
See the documentation for POMDPX at: https://bigbird.comp.nus.edu.sg/pmwiki/farm/appl/index.php?n=Main.PomdpXDocumentation
- Parameters:
agent (Agent) – The agent
pomdpconvert_path (str) – Path to the pomdpconvert binary
output_path (str) – The path of the output file to write to. Optional. Default None.
discount_factor (float) – The discount factor
Example¶
Let’s use the existing Tiger problem as an example. First, we create an instance of the Tiger problem:
import pomdp_py
from pomdp_py.problems.tiger.tiger_problem import TigerProblem, TigerState

init_state = "tiger-left"
tiger = TigerProblem(0.15, TigerState(init_state),
                     pomdp_py.Histogram({TigerState("tiger-left"): 0.5,
                                         TigerState("tiger-right"): 0.5}))
Convert to a .POMDP file:
from pomdp_py import to_pomdp_file
filename = "./test_tiger.POMDP"
to_pomdp_file(tiger.agent, filename, discount_factor=0.95)
Convert to a .POMDPX file:
from pomdp_py import to_pomdpx_file
filename = "./test_tiger.POMDPX"
pomdpconvert_path = "~/software/sarsop/src/pomdpconvert"
to_pomdpx_file(tiger.agent, pomdpconvert_path,
               output_path=filename,
               discount_factor=0.95)
Using pomdp-solve¶
Pass the agent to the vi_pruning function, and it will run the pomdp-solve binary (at the specified path).
- pomdp_py.utils.interfaces.solvers.vi_pruning(agent, pomdp_solve_path, discount_factor=0.95, options=[], pomdp_name='temp-pomdp', remove_generated_files=False, return_policy_graph=False)[source]¶
Value Iteration with pruning, using the software pomdp-solve https://www.pomdp.org/code/ developed by Anthony R. Cassandra.
- Parameters:
agent (pomdp_py.Agent) – The agent that contains the POMDP definition
pomdp_solve_path (str) – Path to the pomdp-solve binary generated after compiling the pomdp-solve library.
options (list) –
Additional options to pass to the command line interface. The options should be a list of strings, such as ["-stop_criteria", "weak", ...]. Some useful options are:
-horizon <int>
-time_limit <int>
pomdp_name (str) – The name used to create the .pomdp file.
remove_generated_files (bool) – If True, the generated .pomdp, .alpha, and .pg files are removed after the policy is computed. Default is False.
return_policy_graph (bool) – If True, return the policy as a PolicyGraph. Default is False, in which case an AlphaVectorPolicy is returned.
- Returns:
The policy returned by the solver.
- Return type:
PolicyGraph or AlphaVectorPolicy
Example¶
Setting the path. After downloading and installing pomdp-solve, a binary executable called pomdp-solve should appear under path/to/pomdp-solve-<version>/src/. We create a variable in Python to point to this path:
pomdp_solve_path = "path/to/pomdp-solve-<version>/src/pomdp-solve"
Computing a policy. We recommend using the AlphaVectorPolicy; that means setting return_policy_graph to False (optional, since False is the default).
from pomdp_py import vi_pruning
policy = vi_pruning(tiger.agent, pomdp_solve_path, discount_factor=0.95,
                    options=["-horizon", "100"],
                    remove_generated_files=False,
                    return_policy_graph=False)
Using the policy. Here the code checks whether the policy is an AlphaVectorPolicy or a PolicyGraph:
for step in range(10):
    action = policy.plan(tiger.agent)
    reward = tiger.env.state_transition(action, execute=True)
    observation = tiger.agent.observation_model.sample(tiger.env.state, action)
    if isinstance(policy, PolicyGraph):
        policy.update(tiger.agent, action, observation)
    else:
        # AlphaVectorPolicy
        # ... perform belief update on agent
The complete example:
import pomdp_py
from pomdp_py import vi_pruning
from pomdp_py.problems.tiger.tiger_problem import TigerProblem, TigerState

# Initialize problem
init_state = "tiger-left"
tiger = TigerProblem(0.15, TigerState(init_state),
                     pomdp_py.Histogram({TigerState("tiger-left"): 0.5,
                                         TigerState("tiger-right"): 0.5}))

# Compute policy
pomdp_solve_path = "path/to/pomdp-solve-<version>/src/pomdp-solve"
policy = vi_pruning(tiger.agent, pomdp_solve_path, discount_factor=0.95,
                    options=["-horizon", "100"],
                    remove_generated_files=False,
                    return_policy_graph=False)

# Simulate the POMDP using the policy
for step in range(10):
    action = policy.plan(tiger.agent)
    reward = tiger.env.state_transition(action, execute=True)
    observation = tiger.agent.observation_model.sample(tiger.env.state, action)
    print(tiger.agent.cur_belief, action, observation, reward)

    if isinstance(policy, pomdp_py.PolicyGraph):
        # No belief update needed. Just update the policy graph
        policy.update(tiger.agent, action, observation)
    else:
        # Belief update is needed for AlphaVectorPolicy
        new_belief = pomdp_py.update_histogram_belief(tiger.agent.cur_belief,
                                                      action, observation,
                                                      tiger.agent.observation_model,
                                                      tiger.agent.transition_model)
        tiger.agent.set_belief(new_belief)
Expected output:
//****************\\
|| pomdp-solve ||
|| v. 5.4 ||
\\****************//
PID=8239
- - - - - - - - - - - - - - - - - - - -
time_limit = 0
mcgs_prune_freq = 100
verbose = context
...
horizon = 100
...
- - - - - - - - - - - - - - - - - - - -
[Initializing POMDP ... done.]
[Initial policy has 1 vectors.]
++++++++++++++++++++++++++++++++++++++++
Epoch: 1...3 vectors in 0.00 secs. (0.00 total) (err=inf)
Epoch: 2...5 vectors in 0.00 secs. (0.00 total) (err=inf)
Epoch: 3...9 vectors in 0.00 secs. (0.00 total) (err=inf)
...
Epoch: 95...9 vectors in 0.00 secs. (2.39 total) (err=inf)
Epoch: 96...9 vectors in 0.00 secs. (2.39 total) (err=inf)
Epoch: 97...9 vectors in 0.00 secs. (2.39 total) (err=inf)
Epoch: 98...9 vectors in 0.00 secs. (2.39 total) (err=inf)
Epoch: 99...9 vectors in 0.00 secs. (2.39 total) (err=inf)
Epoch: 100...9 vectors in 0.01 secs. (2.40 total) (err=inf)
++++++++++++++++++++++++++++++++++++++++
Solution found. See file:
temp-pomdp.alpha
temp-pomdp.pg
++++++++++++++++++++++++++++++++++++++++
User time = 0 hrs., 0 mins, 2.40 secs. (= 2.40 secs)
System time = 0 hrs., 0 mins, 0.00 secs. (= 0.00 secs)
Total execution time = 0 hrs., 0 mins, 2.40 secs. (= 2.40 secs)
** Warning **
lp_solve reported 2 LPs with numerical instability.
[(TigerState(tiger-right), 0.5), (TigerState(tiger-left), 0.5)] listen tiger-left -1.0
[(TigerState(tiger-left), 0.85), (TigerState(tiger-right), 0.15)] listen tiger-left -1.0
[(TigerState(tiger-left), 0.9697986575573173), (TigerState(tiger-right), 0.03020134244268276)] open-right tiger-left 10.0
...
Using sarsop¶
- pomdp_py.utils.interfaces.solvers.sarsop(agent, pomdpsol_path, discount_factor=0.95, timeout=30, memory=100, precision=0.5, pomdp_name='temp-pomdp', remove_generated_files=False, logfile=None)[source]¶
SARSOP, using the binary from https://github.com/AdaCompNUS/sarsop. This is an anytime POMDP planning algorithm.
- Parameters:
agent (pomdp_py.Agent) – The agent that defines the POMDP models
pomdpsol_path (str) – Path to the pomdpsol binary
timeout (int) – The time limit (seconds) to run the algorithm until termination
memory (int) – The memory limit (MB) for running the algorithm until termination
precision (float) – The solver runs until the regret is less than this precision
pomdp_name (str) – Name of the .pomdp file that will be created when solving
remove_generated_files (bool) – Remove the files generated during solving once it finishes.
logfile (str) – Path to file to write the log of both stdout and stderr
- Returns:
The policy returned by the solver.
- Return type:
AlphaVectorPolicy
Example¶
Setting the path. After building SARSOP (https://github.com/AdaCompNUS/sarsop), there is a binary file pomdpsol under path/to/sarsop/src. We create a variable in Python to point to this path:
pomdpsol_path = "path/to/sarsop/src/pomdpsol"
Computing a policy.
from pomdp_py import sarsop
policy = sarsop(tiger.agent, pomdpsol_path, discount_factor=0.95,
                timeout=10, memory=20, precision=0.000001,
                remove_generated_files=True)
Using the policy. (Same as above, for the AlphaVectorPolicy case.)
for step in range(10):
    action = policy.plan(tiger.agent)
    reward = tiger.env.state_transition(action, execute=True)
    observation = tiger.agent.observation_model.sample(tiger.env.state, action)
    # ... perform belief update on agent
The complete example:
import pomdp_py
from pomdp_py import sarsop
from pomdp_py.problems.tiger.tiger_problem import TigerProblem, TigerState

# Initialize problem
init_state = "tiger-left"
tiger = TigerProblem(0.15, TigerState(init_state),
                     pomdp_py.Histogram({TigerState("tiger-left"): 0.5,
                                         TigerState("tiger-right"): 0.5}))

# Compute policy
pomdpsol_path = "path/to/sarsop/src/pomdpsol"
policy = sarsop(tiger.agent, pomdpsol_path, discount_factor=0.95,
                timeout=10, memory=20, precision=0.000001,
                remove_generated_files=True)

# Simulate the POMDP using the policy
for step in range(10):
    action = policy.plan(tiger.agent)
    reward = tiger.env.state_transition(action, execute=True)
    observation = tiger.agent.observation_model.sample(tiger.env.state, action)
    print(tiger.agent.cur_belief, action, observation, reward)

    # Belief update is needed for AlphaVectorPolicy
    new_belief = pomdp_py.update_histogram_belief(tiger.agent.cur_belief,
                                                  action, observation,
                                                  tiger.agent.observation_model,
                                                  tiger.agent.transition_model)
    tiger.agent.set_belief(new_belief)
Expected output:
Loading the model ...
input file : ./temp-pomdp.pomdp
loading time : 0.00s
SARSOP initializing ...
initialization time : 0.00s
-------------------------------------------------------------------------------
Time |#Trial |#Backup |LBound |UBound |Precision |#Alphas |#Beliefs
-------------------------------------------------------------------------------
0 0 0 -20 92.8205 112.821 4 1
0 2 51 -6.2981 63.7547 70.0528 7 15
0 4 103 2.35722 52.3746 50.0174 5 19
0 6 155 6.44093 45.1431 38.7021 5 20
0 8 205 12.1184 36.4409 24.3225 5 20
...
0 40 1255 19.3714 19.3714 7.13808e-06 5 21
0 41 1300 19.3714 19.3714 3.76277e-06 6 21
0 42 1350 19.3714 19.3714 1.75044e-06 12 21
0 43 1393 19.3714 19.3714 9.22729e-07 11 21
-------------------------------------------------------------------------------
SARSOP finishing ...
target precision reached
target precision : 0.000001
precision reached : 0.000001
-------------------------------------------------------------------------------
Time |#Trial |#Backup |LBound |UBound |Precision |#Alphas |#Beliefs
-------------------------------------------------------------------------------
0 43 1393 19.3714 19.3714 9.22729e-07 5 21
-------------------------------------------------------------------------------
Writing out policy ...
output file : temp-pomdp.policy
[(TigerState(tiger-right), 0.5), (TigerState(tiger-left), 0.5)] listen tiger-left -1.0
[(TigerState(tiger-left), 0.85), (TigerState(tiger-right), 0.15)] listen tiger-left -1.0
[(TigerState(tiger-left), 0.9697986575573173), (TigerState(tiger-right), 0.03020134244268276)] open-right tiger-right 10.0
...
PolicyGraph and AlphaVectorPolicy¶
PolicyGraph and AlphaVectorPolicy extend the Planner interface, which means they have a plan function that can be used to output an action given an agent (using the agent’s belief).
- class pomdp_py.utils.interfaces.conversion.PolicyGraph(nodes, edges, states)[source]¶
A PolicyGraph encodes a POMDP plan. It can be constructed from the alphas and policy graph output by Cassandra’s pomdp-solve.
- classmethod construct(alpha_path, pg_path, states, actions, observations)[source]¶
See parse_pomdp_solve_output for detailed definitions of alphas and pg.
- Parameters:
alpha_path (str) – Path to .alpha file
pg_path (str) – Path to .pg file
states (list) – List of states, ordered as in .pomdp file
actions (list) – List of actions, ordered as in .pomdp file
observations (list) – List of observations, ordered as in .pomdp file
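As a minimal sketch (not from the library’s own examples; the file names follow the temp-pomdp defaults used above, and the tiger agent comes from the earlier examples), a PolicyGraph can be built from pomdp-solve’s .alpha and .pg output together with the ordered lists returned by to_pomdp_file:
from pomdp_py import to_pomdp_file
from pomdp_py.utils.interfaces.conversion import PolicyGraph

# Write the .pomdp file and record the state/action/observation ordering.
all_states, all_actions, all_observations = to_pomdp_file(
    tiger.agent, "temp-pomdp.pomdp", discount_factor=0.95)

# Construct the policy graph from the .alpha and .pg files produced by pomdp-solve.
policy = PolicyGraph.construct("temp-pomdp.alpha", "temp-pomdp.pg",
                               all_states, all_actions, all_observations)
action = policy.plan(tiger.agent)  # PolicyGraph implements the Planner interface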
- class pomdp_py.utils.interfaces.conversion.AlphaVectorPolicy(alphas, states)[source]¶
An offline POMDP policy is specified by a collection of alpha vectors, each associated with an action. When planning is needed, the dot product of each alpha vector with the agent’s belief vector is computed; the alpha vector with the maximum dot product is the ‘dominating’ alpha vector, and its corresponding action is returned.
An offline policy can be optionally represented as a policy graph. In this case, the agent can plan without actively maintaining a belief because the policy graph is a finite state machine that transitions by observations.
This can be constructed using the .policy file created by SARSOP.
- value(belief)[source]¶
Returns the value V(b) under this alpha vector policy.
\(V(b) = \max_{\alpha\in\Gamma} \alpha \cdot b\)
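To make this concrete, the following is an illustrative sketch only (not the library’s implementation) of how the dominating alpha vector determines both the value and the action:
# Illustrative only: each alpha vector is a list of per-state values paired with an action.
def dominating_action_and_value(alphas, belief_vector):
    # alphas: list of (action, value_vector); belief_vector: list of probabilities.
    best_action, best_value = None, float("-inf")
    for action, vector in alphas:
        value = sum(v * b for v, b in zip(vector, belief_vector))  # alpha . b
        if value > best_value:
            best_action, best_value = action, value
    return best_action, best_value  # best_value equals V(b)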
- classmethod construct(policy_path, states, actions, solver='pomdpsol')[source]¶
Returns an AlphaVectorPolicy, given alphas, which are the output of parse_appl_policy_file.
- Parameters:
policy_path (str) – Path to the generated .policy file (for sarsop) or .alpha file (for pomdp-solve)
states (list) – A list of States, in the same order as in the .pomdp file
actions (list) – A list of Actions, in the same order as in the .pomdp file
- Returns:
The policy stored in the given policy file.
- Return type:
AlphaVectorPolicy
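For instance, a policy written out by SARSOP can be reloaded later without re-solving. A minimal sketch, assuming a SARSOP run that kept its generated files (remove_generated_files=False) and wrote temp-pomdp.policy as shown in the output above; the tiger agent is the one from the earlier examples:
from pomdp_py import to_pomdp_file
from pomdp_py.utils.interfaces.conversion import AlphaVectorPolicy

# Record the state/action ordering used in the .pomdp file.
all_states, all_actions, _ = to_pomdp_file(tiger.agent, "temp-pomdp.pomdp")

# Load the .policy file written by SARSOP.
policy = AlphaVectorPolicy.construct("temp-pomdp.policy",
                                     all_states, all_actions, solver="pomdpsol")
print(policy.value(tiger.agent.cur_belief))  # V(b) for the current belief
action = policy.plan(tiger.agent)            # action of the dominating alpha vector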