# pomdp_py.algorithms package¶

Existing POMDP Solvers:

 PORollout PO-rollout: Baseline algorithm in the POMCP paper POUCT POUCT (Partially Observable UCT) [1] is presented in the POMCP paper as an extension of the UCT algorithm to partially-observable domains that combines MCTS and UCB1 for action selection. POMCP This POMCP version only works for problems with action space that can be enumerated. ValueIteration This algorithm is only feasible for small problems where states, actions, and observations can be explicitly enumerated.

Solvers under development (contribution wanted):

 blqr Implementation of B-LQR algorithm described in “Belief space planning assuming maximum likelihood observations” [2]

## pomdp_py.algorithms.po_rollout module¶

PO-rollout: Baseline algorithm in the POMCP paper [1].

Quote from the POMCP paper:

To provide a performance benchmark in these cases, we evaluated the performance of simple Monte-Carlo simulation without any tree. The PO-rollout algorithm used Monte-Carlo belief state updates, as described in section 3.2. It then simulated $$n/|A|$$ rollouts for each legal action, and selected the action with highest average return.

We don’t require Monte-Carlo belief update (it’s an option). But it will do the rollouts and action selection as described.

class pomdp_py.algorithms.po_rollout.PORollout

PO-rollout: Baseline algorithm in the POMCP paper

clear_agent(self)
last_best_reward
plan(self, Agent agent)

The agent carries the information: Bt, ht, O,T,R/G, pi, necessary for planning

update(self, Agent agent, Action real_action, Observation real_observation)

Updates the planner based on real action and observation. Updates the agent accordingly if necessary. If the agent’s belief is also updated here, the update_agent_belief attribute should be set to True. By default, does nothing.

update_agent_belief

True if planner’s update function also updates agent’s belief.

## pomdp_py.algorithms.po_uct module¶

This algorithm is PO-UCT (Partially Observable UCT). It is presented in the POMCP paper [1] as an extension to the UCT algorithm [3] that combines MCTS and UCB1 for action selection.

In other words, this is just POMCP without particle belief, but arbitrary belief representation.

This planning strategy, based on MCTS with belief sampling may be referred to as “belief sparse sampling”; Partially Observable Sparse Sampling (POSS) is introduced in a recent paper [4] as an extension of sparse sampling for MDP by [5]; It proposes an extension of POSS called POWSS (partially observable weighted sparse sampling). However, this line of work (POSS, POWSS) is based solely on particle belief representation. POSS still requires comparing observations exactly for particle belief update, while POWSS uses weighted particles depending on the observation distribution.

A note on the exploration constant. Quote from [6]:

“This constant should reflect the agent’s prior knowledge regarding the amount of exploration required.”

In the POMCP paper, they set this constant following:

“The exploration constant for POMCP was set to c = Rhi - Rlo, where Rhi was the highest return achieved during sample runs of POMCP with c = 0, and Rlo was the lowest return achieved during sample rollouts”

It is then clear that the POMCP paper is indeed setting this constant based on prior knowledge. Note the difference between sample runs and sample rollouts. But, this is certainly not the only way to obtainx1 the prior knowledge.

class pomdp_py.algorithms.po_uct.ActionPrior

Bases: object

A problem-specific object

get_preferred_actions(self, vnode, state=None, history=None, belief=None, **kwargs)

This is to mimic the behavior of Simulator.Prior and GenerateLegal/GeneratePreferred in David Silver’s POMCP code.

Returns a set of preferred actions and associated num_visits_init and value_init, given arguments. In POMCP, this acts as a history-based prior policy, and in DESPOT, this acts as a belief-based prior policy. For example, given certain state or history, only it is possible that only a subset of all actions is legal; This is particularly true in the RockSample problem.

class pomdp_py.algorithms.po_uct.POUCT

POUCT (Partially Observable UCT) [1] is presented in the POMCP paper as an extension of the UCT algorithm to partially-observable domains that combines MCTS and UCB1 for action selection.

POUCT only works for problems with action space that can be enumerated.

clear_agent()
last_num_sims

Returns the number of simulations ran for the last plan call.

plan(self, Agent agent)

The agent carries the information: Bt, ht, O,T,R/G, pi, necessary for planning

update(self, Agent agent, Action real_action, Observation real_observation)

Assume that the agent’s history has been updated after taking real_action and receiving real_observation.

updates_agent_belief

True if planner’s update function also updates agent’s belief.

class pomdp_py.algorithms.po_uct.QNode
class pomdp_py.algorithms.po_uct.RandomRollout
rollout(self, State state, tuple history=None)
class pomdp_py.algorithms.po_uct.RolloutPolicy
rollout(self, State state, tuple history=None)
class pomdp_py.algorithms.po_uct.RootVNode
from_vnode(cls, vnode, history)
history
class pomdp_py.algorithms.po_uct.TreeNode

Bases: object

children
num_visits
value
class pomdp_py.algorithms.po_uct.VNode
print_children_value()
pomdp_py.algorithms.po_uct.print_preferred_actions()
pomdp_py.algorithms.po_uct.print_preferred_actions_helper()
pomdp_py.algorithms.po_uct.print_tree()
pomdp_py.algorithms.po_uct.print_tree_helper()
pomdp_py.algorithms.po_uct.tree_stats()
pomdp_py.algorithms.po_uct.tree_stats_helper()

## pomdp_py.algorithms.pomcp module¶

class pomdp_py.algorithms.pomcp.POMCP

This POMCP version only works for problems with action space that can be enumerated.

plan(self, Agent agent)

The agent carries the information: Bt, ht, O,T,R/G, pi, necessary for planning

update()

Assume that the agent’s history has been updated after taking real_action and receiving real_observation.

state_transform_func: Used to add artificial transform to states during

particle reinvigoration. Signature: s -> s_transformed

update_agent_belief

True if planner’s update function also updates agent’s belief.

class pomdp_py.algorithms.pomcp.RootVNodeParticles
belief
from_vnode(cls, vnode, history)
class pomdp_py.algorithms.pomcp.VNodeParticles

POMCP’s VNode maintains particle belief

belief

## pomdp_py.algorithms.value_iteration module¶

Implementation of the basic policy tree based value iteration as explained in section 4.1 of Planning and acting in partially observable stochastic domains [7]

Warning: No pruning - the number of policy trees explodes very fast.

class pomdp_py.algorithms.value_iteration.ValueIteration

This algorithm is only feasible for small problems where states, actions, and observations can be explicitly enumerated.

__init__(self, horizon=float(‘inf’), discount_factor=0.9, epsilon=1e-6)

plan(self, agent)

Plans an action.

## pomdp_py.algorithms.bsp.blqr module¶

Implementation of B-LQR algorithm described in “Belief space planning assuming maximum likelihood observations” [2]

class pomdp_py.algorithms.bsp.blqr.BLQR(func_sysd, func_obs, jac_sysd, jac_obs, jac_sysd_u, noise_obs, noise_sysd, Qlarge, L, Q, R, planning_horizon=15)[source]
ekf_update_mlo(bt, ut)[source]

Performs the ekf belief update assuming maximum likelihood observation. This follows equations 12, 13 in the paper. It’s the function $$F$$.

Parameters
• bt (tuple) – a belief point bt = (mt, Cov_t) representing the belief state. where mt is np.array of shape (d,) and Cov_t is np.array of shape (d,d). The cost function equation needs to turn Cov_t into a long vector consist of the columns of Cov_t stiched together.

• ut (np.array) – control vector

• noise_t (pomdp_py.Gaussian) – The Gaussian noise with “possibly state-dependent” covariance matrix Wt.

• noise_sysd (np.array) – A noise term (e.g. Gaussian noise) added to the system dynamics ($$v$$ in Eq.12). This array should have the sam dimensionality as mt.

• noise_obs_cov (np.array) – The covariance matrix of the Gaussian noise of the observation function. This corresponds to Wt in equation 13.

integrate_belief_segment(b_i, u_i, num_segments)[source]

This is to represent equation 18.

phi(b_i’, u_i’) = F(b_i’, u_i’) + sum F(b_{t+1}, u_i) - F(b_t, u_i)

t in seg

This essentially returns b_{i+1}’

segmented_cost_function(bu_traj, b_des, u_des, num_segments)[source]

The cost function in eq 17.

Parameters
• b_des (tuple) – The desired belief (mT, CovT). This is needed to compute the cost function.

• u_des (list) – A list of desired controls at the beginning of each segment. If this information is not available, pass in an empty list.

create_plan(b_0, b_des, u_init, num_segments=5, control_bounds=None, opt_options={}, opt_callback=None)[source]

Solves the SQP problem using direct transcription, and produce belief points and controls at segments. Reference: https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html

interpret_sqp_plan(opt_res, num_segments)[source]

1(1,2,3,4)

David Silver and Joel Veness. Monte-carlo planning in large pomdps. In Advances in neural information processing systems, 2164–2172. 2010.

2(1,2)

R. Platt, R. Tedrake, L. Kaelbling, and T. Lozano-Perez. Belief space planning assuming maximum likelihood observations. In Proceedings of Robotics: Science and Systems. Zaragoza, Spain, June 2010. doi:10.15607/RSS.2010.VI.037.

3

Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In European conference on machine learning, 282–293. Springer, 2006.

4

Michael H Lim, Claire J Tomlin, and Zachary N Sunberg. Sparse tree search optimality guarantees in pomdps with continuous observation spaces. arXiv preprint arXiv:1910.04332, 2019.

5

Michael Kearns, Yishay Mansour, and Andrew Y Ng. A sparse sampling algorithm for near-optimal planning in large markov decision processes. Machine learning, 49(2-3):193–208, 2002.

6

António Gusmao and Tapani Raiko. Towards generalizing the success of monte-carlo tree search beyond the game of go. In ECAI, 384–389. 2012.

7

Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.