pomdp_py.algorithms package¶
Existing POMDP Solvers:
POrollout: Baseline algorithm in the POMCP paper 

POUCT (Partially Observable UCT) [1] is presented in the POMCP paper as an extension of the UCT algorithm to partiallyobservable domains that combines MCTS and UCB1 for action selection. 

This POMCP version only works for problems with action space that can be enumerated. 

This algorithm is only feasible for small problems where states, actions, and observations can be explicitly enumerated. 
Solvers under development (contribution wanted):
Implementation of BLQR algorithm described in “Belief space planning assuming maximum likelihood observations” [2] 
pomdp_py.algorithms.po_rollout module¶
POrollout: Baseline algorithm in the POMCP paper [1].
Quote from the POMCP paper:
To provide a performance benchmark in these cases, we evaluated the performance of simple MonteCarlo simulation without any tree. The POrollout algorithm used MonteCarlo belief state updates, as described in section 3.2. It then simulated \(n/A\) rollouts for each legal action, and selected the action with highest average return.
We don’t require MonteCarlo belief update (it’s an option). But it will do the rollouts and action selection as described.

class
pomdp_py.algorithms.po_rollout.
PORollout
¶ Bases:
pomdp_py.framework.planner.Planner
POrollout: Baseline algorithm in the POMCP paper

clear_agent
(self)¶

last_best_reward
¶

plan
(self, Agent agent)¶ The agent carries the information: Bt, ht, O,T,R/G, pi, necessary for planning

update
(self, Agent agent, Action real_action, Observation real_observation)¶ Updates the planner based on real action and observation. Updates the agent accordingly if necessary. If the agent’s belief is also updated here, the update_agent_belief attribute should be set to True. By default, does nothing.

update_agent_belief
¶ True if planner’s update function also updates agent’s belief.

pomdp_py.algorithms.po_uct module¶
This algorithm is POUCT (Partially Observable UCT). It is presented in the POMCP paper [1] as an extension to the UCT algorithm [3] that combines MCTS and UCB1 for action selection.
In other words, this is just POMCP without particle belief, but arbitrary belief representation.
This planning strategy, based on MCTS with belief sampling may be referred to as “belief sparse sampling”; Partially Observable Sparse Sampling (POSS) is introduced in a recent paper [4] as an extension of sparse sampling for MDP by [5]; It proposes an extension of POSS called POWSS (partially observable weighted sparse sampling). However, this line of work (POSS, POWSS) is based solely on particle belief representation. POSS still requires comparing observations exactly for particle belief update, while POWSS uses weighted particles depending on the observation distribution.
A note on the exploration constant. Quote from [6]:
“This constant should reflect the agent’s prior knowledge regarding the amount of exploration required.”
In the POMCP paper, they set this constant following:
“The exploration constant for POMCP was set to c = Rhi  Rlo, where Rhi was the highest return achieved during sample runs of POMCP with c = 0, and Rlo was the lowest return achieved during sample rollouts”
It is then clear that the POMCP paper is indeed setting this constant based on prior knowledge. Note the difference between sample runs and sample rollouts. But, this is certainly not the only way to obtainx1 the prior knowledge.

class
pomdp_py.algorithms.po_uct.
ActionPrior
¶ Bases:
object
A problemspecific object

get_preferred_actions
(self, vnode, state=None, history=None, belief=None, **kwargs)¶ This is to mimic the behavior of Simulator.Prior and GenerateLegal/GeneratePreferred in David Silver’s POMCP code.
Returns a set of preferred actions and associated num_visits_init and value_init, given arguments. In POMCP, this acts as a historybased prior policy, and in DESPOT, this acts as a beliefbased prior policy. For example, given certain state or history, only it is possible that only a subset of all actions is legal; This is particularly true in the RockSample problem.


class
pomdp_py.algorithms.po_uct.
POUCT
¶ Bases:
pomdp_py.framework.planner.Planner
POUCT (Partially Observable UCT) [1] is presented in the POMCP paper as an extension of the UCT algorithm to partiallyobservable domains that combines MCTS and UCB1 for action selection.
POUCT only works for problems with action space that can be enumerated.

clear_agent
()¶

last_num_sims
¶ Returns the number of simulations ran for the last plan call.

plan
(self, Agent agent)¶ The agent carries the information: Bt, ht, O,T,R/G, pi, necessary for planning

update
(self, Agent agent, Action real_action, Observation real_observation)¶ Assume that the agent’s history has been updated after taking real_action and receiving real_observation.

updates_agent_belief
¶ True if planner’s update function also updates agent’s belief.


class
pomdp_py.algorithms.po_uct.
QNode
¶

class
pomdp_py.algorithms.po_uct.
RandomRollout
¶ Bases:
pomdp_py.algorithms.po_uct.RolloutPolicy

rollout
(self, State state, tuple history=None)¶


class
pomdp_py.algorithms.po_uct.
RolloutPolicy
¶ Bases:
pomdp_py.framework.basics.PolicyModel

rollout
(self, State state, tuple history=None)¶


class
pomdp_py.algorithms.po_uct.
RootVNode
¶ Bases:
pomdp_py.algorithms.po_uct.VNode

from_vnode
(cls, vnode, history)¶

history
¶


class
pomdp_py.algorithms.po_uct.
VNode
¶ Bases:
pomdp_py.algorithms.po_uct.TreeNode

print_children_value
()¶


pomdp_py.algorithms.po_uct.
print_preferred_actions
()¶

pomdp_py.algorithms.po_uct.
print_preferred_actions_helper
()¶

pomdp_py.algorithms.po_uct.
print_tree
()¶

pomdp_py.algorithms.po_uct.
print_tree_helper
()¶

pomdp_py.algorithms.po_uct.
tree_stats
()¶

pomdp_py.algorithms.po_uct.
tree_stats_helper
()¶
pomdp_py.algorithms.pomcp module¶

class
pomdp_py.algorithms.pomcp.
POMCP
¶ Bases:
pomdp_py.algorithms.po_uct.POUCT
This POMCP version only works for problems with action space that can be enumerated.

plan
(self, Agent agent)¶ The agent carries the information: Bt, ht, O,T,R/G, pi, necessary for planning

update
()¶ Assume that the agent’s history has been updated after taking real_action and receiving real_observation.
 state_transform_func: Used to add artificial transform to states during
particle reinvigoration. Signature: s > s_transformed

update_agent_belief
¶ True if planner’s update function also updates agent’s belief.


class
pomdp_py.algorithms.pomcp.
RootVNodeParticles
¶ Bases:
pomdp_py.algorithms.po_uct.RootVNode

belief
¶

from_vnode
(cls, vnode, history)¶


class
pomdp_py.algorithms.pomcp.
VNodeParticles
¶ Bases:
pomdp_py.algorithms.po_uct.VNode
POMCP’s VNode maintains particle belief

belief
¶

pomdp_py.algorithms.value_iteration module¶
Implementation of the basic policy tree based value iteration as explained in section 4.1 of Planning and acting in partially observable stochastic domains [7]
Warning: No pruning  the number of policy trees explodes very fast.

class
pomdp_py.algorithms.value_iteration.
ValueIteration
¶ Bases:
pomdp_py.framework.planner.Planner
This algorithm is only feasible for small problems where states, actions, and observations can be explicitly enumerated.
__init__(self, horizon=float(‘inf’), discount_factor=0.9, epsilon=1e6)

plan
(self, agent)¶ Plans an action.

pomdp_py.algorithms.bsp.blqr module¶
Implementation of BLQR algorithm described in “Belief space planning assuming maximum likelihood observations” [2]

class
pomdp_py.algorithms.bsp.blqr.
BLQR
(func_sysd, func_obs, jac_sysd, jac_obs, jac_sysd_u, noise_obs, noise_sysd, Qlarge, L, Q, R, planning_horizon=15)[source]¶ Bases:
pomdp_py.framework.planner.Planner

ekf_update_mlo
(bt, ut)[source]¶ Performs the ekf belief update assuming maximum likelihood observation. This follows equations 12, 13 in the paper. It’s the function \(F\).
 Parameters
bt (tuple) – a belief point bt = (mt, Cov_t) representing the belief state. where mt is np.array of shape (d,) and Cov_t is np.array of shape (d,d). The cost function equation needs to turn Cov_t into a long vector consist of the columns of Cov_t stiched together.
ut (np.array) – control vector
noise_t (pomdp_py.Gaussian) – The Gaussian noise with “possibly statedependent” covariance matrix Wt.
noise_sysd (np.array) – A noise term (e.g. Gaussian noise) added to the system dynamics (\(v\) in Eq.12). This array should have the sam dimensionality as mt.
noise_obs_cov (np.array) – The covariance matrix of the Gaussian noise of the observation function. This corresponds to Wt in equation 13.

integrate_belief_segment
(b_i, u_i, num_segments)[source]¶ This is to represent equation 18.
 phi(b_i’, u_i’) = F(b_i’, u_i’) + sum F(b_{t+1}, u_i)  F(b_t, u_i)
t in seg
This essentially returns b_{i+1}’

segmented_cost_function
(bu_traj, b_des, u_des, num_segments)[source]¶ The cost function in eq 17.
 Parameters
b_des (tuple) – The desired belief (mT, CovT). This is needed to compute the cost function.
u_des (list) – A list of desired controls at the beginning of each segment. If this information is not available, pass in an empty list.

create_plan
(b_0, b_des, u_init, num_segments=5, control_bounds=None, opt_options={}, opt_callback=None)[source]¶ Solves the SQP problem using direct transcription, and produce belief points and controls at segments. Reference: https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html

 1(1,2,3,4)
David Silver and Joel Veness. Montecarlo planning in large pomdps. In Advances in neural information processing systems, 2164–2172. 2010.
 2(1,2)
R. Platt, R. Tedrake, L. Kaelbling, and T. LozanoPerez. Belief space planning assuming maximum likelihood observations. In Proceedings of Robotics: Science and Systems. Zaragoza, Spain, June 2010. doi:10.15607/RSS.2010.VI.037.
 3
Levente Kocsis and Csaba Szepesvári. Bandit based montecarlo planning. In European conference on machine learning, 282–293. Springer, 2006.
 4
Michael H Lim, Claire J Tomlin, and Zachary N Sunberg. Sparse tree search optimality guarantees in pomdps with continuous observation spaces. arXiv preprint arXiv:1910.04332, 2019.
 5
Michael Kearns, Yishay Mansour, and Andrew Y Ng. A sparse sampling algorithm for nearoptimal planning in large markov decision processes. Machine learning, 49(23):193–208, 2002.
 6
António Gusmao and Tapani Raiko. Towards generalizing the success of montecarlo tree search beyond the game of go. In ECAI, 384–389. 2012.
 7
Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(12):99–134, 1998.