pomdp_py.algorithms package

Existing POMDP Solvers:

PORollout

PO-rollout: Baseline algorithm in the POMCP paper

POUCT

POUCT (Partially Observable UCT) [2] is presented in the POMCP paper as an extension of the UCT algorithm to partially-observable domains that combines MCTS and UCB1 for action selection.

POMCP

POMCP is POUCT + particle belief representation.

ValueIteration

This algorithm is only feasible for small problems where states, actions, and observations can be explicitly enumerated.

Solvers under development (contribution wanted):

blqr

Implementation of B-LQR algorithm described in "Belief space planning assuming maximum likelihood observations" [4]

pomdp_py.algorithms.po_rollout module

PO-rollout: Baseline algorithm in the POMCP paper [2].

Quote from the POMCP paper:

To provide a performance benchmark in these cases, we evaluated the performance of simple Monte-Carlo simulation without any tree. The PO-rollout algorithm used Monte-Carlo belief state updates, as described in section 3.2. It then simulated \(n/|A|\) rollouts for each legal action, and selected the action with highest average return.

We do not require the Monte-Carlo belief update (it is optional), but the planner performs the rollouts and action selection as described.
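
For illustration, the selection rule described above can be sketched as follows. This is a minimal sketch, not the library's implementation; sample_state (drawing a state from the current belief) and rollout_return (simulating a rollout after taking an action in a state) are hypothetical helpers supplied by the problem.

    def po_rollout_select(legal_actions, n, sample_state, rollout_return):
        """Sketch of PO-rollout action selection: simulate n/|A| rollouts
        per legal action and pick the action with the highest average return."""
        n_per_action = max(1, n // len(legal_actions))
        best_action, best_avg = None, float("-inf")
        for action in legal_actions:
            total = 0.0
            for _ in range(n_per_action):
                state = sample_state()                   # sample from the current belief
                total += rollout_return(state, action)   # simulated discounted return
            avg = total / n_per_action
            if avg > best_avg:
                best_action, best_avg = action, avg
        return best_action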

class pomdp_py.algorithms.po_rollout.PORollout

Bases: Planner

PO-rollout: Baseline algorithm in the POMCP paper

clear_agent(self)
last_best_reward
plan(self, Agent agent)

The agent carries the information necessary for planning: the belief \(B_t\), the history \(h_t\), the models \(O, T, R/G\), and the policy \(\pi\).

set_rollout_policy(self, RolloutPolicy rollout_policy)

Updates the rollout policy to the given one

update(self, Agent agent, Action real_action, Observation real_observation)

Updates the planner based on real action and observation. Updates the agent accordingly if necessary. If the agent’s belief is also updated here, the update_agent_belief attribute should be set to True. By default, does nothing.

update_agent_belief

True if planner’s update function also updates agent’s belief.

pomdp_py.algorithms.po_uct module

This algorithm is PO-UCT (Partially Observable UCT). It is presented in the POMCP paper [2] as an extension to the UCT algorithm [3] that combines MCTS and UCB1 for action selection.

In other words, this is POMCP without the particle belief representation; it works with an arbitrary belief representation.

This planning strategy, based on MCTS with belief sampling, may be referred to as “belief sparse sampling”. Partially Observable Sparse Sampling (POSS) was introduced in a recent paper [5] as an extension of the sparse sampling algorithm for MDPs by [6]; that paper also proposes an extension of POSS called POWSS (Partially Observable Weighted Sparse Sampling). However, this line of work (POSS, POWSS) is based solely on a particle belief representation: POSS requires comparing observations exactly for the particle belief update, while POWSS weights particles according to the observation distribution.

A note on the exploration constant. Quote from [7]:

“This constant should reflect the agent’s prior knowledge regarding the amount of exploration required.”

In the POMCP paper, they set this constant following:

“The exploration constant for POMCP was set to \(c = R_{hi} - R_{lo}\), where Rhi was the highest return achieved during sample runs of POMCP with \(c = 0\), and Rlo was the lowest return achieved during sample rollouts”

It is then clear that the POMCP paper indeed sets this constant based on prior knowledge. Note the difference between sample runs and sample rollouts. This is certainly not the only way to obtain such prior knowledge.
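
As one concrete (but not the only) recipe, the constant can be computed from returns collected in preliminary runs with \(c = 0\), following the quoted heuristic. The function below is a sketch, not part of the library API.

    def exploration_const_from_samples(sample_run_returns, sample_rollout_returns):
        """c = R_hi - R_lo, where R_hi is the highest return achieved during
        sample runs with c = 0 and R_lo is the lowest return achieved during
        sample rollouts (the POMCP paper's heuristic)."""
        return max(sample_run_returns) - min(sample_rollout_returns)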

class pomdp_py.algorithms.po_uct.ActionPrior

Bases: object

A problem-specific object

get_preferred_actions(cls, state, history, kwargs)

Intended as a classmethod. This is to mimic the behavior of Simulator.Prior and GenerateLegal/GeneratePreferred in David Silver’s POMCP code.

Returns a set of tuples of the form (action, num_visits_init, value_init) that represent the preferred actions. In POMCP, this acts as a history-based prior policy, and in DESPOT, this acts as a belief-based prior policy. For example, given a certain state or history, it may be that only a subset of all actions is legal; this is useful when there is domain knowledge that can be used as a heuristic for planning.
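
A minimal subclass sketch following the documented signature and return format; MoveForward and the bump check are hypothetical stand-ins for domain knowledge.

    from pomdp_py.algorithms.po_uct import ActionPrior

    class NavigationActionPrior(ActionPrior):
        """Hypothetical prior: prefer moving forward unless the most recent
        observation reported a bump (purely illustrative)."""
        @classmethod
        def get_preferred_actions(cls, state, history, **kwargs):
            num_visits_init, value_init = 10, 100   # optimistic initialization
            if history and getattr(history[-1][1], "bump", False):
                return set()   # no preferred actions in this situation
            # MoveForward() is a hypothetical Action used only for illustration
            return {(MoveForward(), num_visits_init, value_init)}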

class pomdp_py.algorithms.po_uct.POUCT

Bases: Planner

POUCT (Partially Observable UCT) [2] is presented in the POMCP paper as an extension of the UCT algorithm to partially-observable domains that combines MCTS and UCB1 for action selection.

POUCT only works for problems with an action space that can be enumerated.

__init__(self, max_depth=5, planning_time=-1., num_sims=-1, discount_factor=0.9, exploration_const=math.sqrt(2), num_visits_init=1, value_init=0, rollout_policy=RandomRollout(), action_prior=None, show_progress=False, pbar_update_interval=5)

Parameters:
  • max_depth (int) – Depth of the MCTS tree. Default: 5.

  • planning_time (float) – Amount of time given to each planning step (seconds). Default: -1. If negative, planning terminates when the number of simulations num_sims is reached. If both num_sims and planning_time are negative, the planner runs for 1 second.

  • num_sims (int) – Number of simulations for each planning step. If negative, planning terminates when planning_time is reached. If both num_sims and planning_time are negative, the planner runs for 1 second.

  • rollout_policy (RolloutPolicy) – rollout policy. Default: RandomRollout.

  • action_prior (ActionPrior) – a prior over preferred actions given state and history.

  • show_progress (bool) – True to print a progress bar for simulations.

  • pbar_update_interval (int) – The number of simulations to run between each update of the progress bar. Only used if show_progress is True. You can set this parameter even if your stopping criterion is time.
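
A minimal usage sketch, assuming an agent (with belief, models, and policy) constructed elsewhere with the pomdp_py framework; the environment execution and the observation below are placeholders for your problem.

    import pomdp_py

    planner = pomdp_py.POUCT(max_depth=10,
                             num_sims=2000,            # stop after 2000 simulations
                             discount_factor=0.95,
                             exploration_const=50,     # problem-dependent (see the note above)
                             rollout_policy=agent.policy_model)  # assumes a RolloutPolicy

    action = planner.plan(agent)                 # plan from the agent's current belief/history
    # ... execute `action`, receive `observation` from the environment ...
    agent.update_history(action, observation)    # assumed framework call
    planner.update(agent, action, observation)   # then let the planner update its tree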

action_prior
clear_agent()
discount_factor
last_num_sims

Returns the number of simulations run during the last plan call.

last_planning_time

Returns the amount of time (seconds) taken by the last plan call.

max_depth
num_visits_init
plan(self, Agent agent)

The agent carries the information necessary for planning: the belief \(B_t\), the history \(h_t\), the models \(O, T, R/G\), and the policy \(\pi\).

rollout_policy
set_rollout_policy(self, RolloutPolicy rollout_policy)

Updates the rollout policy to the given one

update(self, Agent agent, Action real_action, Observation real_observation)

Assume that the agent’s history has been updated after taking real_action and receiving real_observation.

updates_agent_belief
value_init
class pomdp_py.algorithms.po_uct.QNode

Bases: TreeNode

class pomdp_py.algorithms.po_uct.RandomRollout

Bases: RolloutPolicy

A rollout policy that chooses actions uniformly at random from the set of possible actions.

rollout(self, State state, tuple history=None)
class pomdp_py.algorithms.po_uct.RolloutPolicy

Bases: PolicyModel

rollout(self, State state, tuple history=None)
class pomdp_py.algorithms.po_uct.RootVNode

Bases: VNode

classmethod from_vnode(cls, vnode, history)
history
class pomdp_py.algorithms.po_uct.TreeNode

Bases: object

children
num_visits
value
class pomdp_py.algorithms.po_uct.VNode

Bases: TreeNode

argmax(VNode self)

Returns the action of the child with highest value

print_children_value()
value

pomdp_py.algorithms.pomcp module

We implement POMCP as described in the original paper Monte-Carlo Planning in Large POMDPs https://papers.nips.cc/paper/4031-monte-carlo-planning-in-large-pomdps

One thing to note is that, in this algorithm, the belief update happens as the simulation progresses. The new belief is stored in the VNodes at the level reached after executing the next action. These particles are reinvigorated if there are not enough of them.

However, it is possible to separate MCTS completely from the belief update. In that case the belief nodes no longer keep track of particles, and the belief update and particle reinvigoration happen once after MCTS is completed. I have previously implemented this version; it is also the version implemented in BasicPOMCP.jl (https://github.com/JuliaPOMDP/BasicPOMCP.jl). The two should be EQUIVALENT. In general, it does not hurt to do the belief update during MCTS; this is a feature of using the particle representation.
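
A minimal usage sketch, assuming the agent's belief is represented by particles so that POMCP can reuse simulation particles; the agent, the environment execution, and the observation are placeholders for your problem definition.

    import pomdp_py

    pomcp = pomdp_py.POMCP(max_depth=20, num_sims=4096,
                           discount_factor=0.95, exploration_const=110,
                           rollout_policy=agent.policy_model)   # agent.belief must be a particle belief

    action = pomcp.plan(agent)
    # ... execute `action` in the environment and receive `observation` ...
    agent.update_history(action, observation)   # assumed framework call
    pomcp.update(agent, action, observation)    # updates belief; reinvigorates particles if needed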

class pomdp_py.algorithms.pomcp.POMCP

Bases: POUCT

POMCP is POUCT + particle belief representation. This POMCP version only works for problems with an action space that can be enumerated.

plan(self, Agent agent)

The agent carries the information necessary for planning: the belief \(B_t\), the history \(h_t\), the models \(O, T, R/G\), and the policy \(\pi\).

update(agent, real_action, real_observation, state_transform_func=None)

Assume that the agent’s history has been updated after taking real_action and receiving real_observation.

state_transform_func: Used to add an artificial transform to states during particle reinvigoration. Signature: s -> s_transformed.
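
For illustration, a state transform might perturb each state slightly so that reinvigorated particles are not exact copies. The state attribute and noise scale below are hypothetical; any callable with signature s -> s_transformed works.

    import copy
    import random

    def jitter_state(s):
        """Hypothetical transform: copy the state and nudge a continuous
        `position` attribute to add diversity to reinvigorated particles."""
        s2 = copy.deepcopy(s)
        s2.position = tuple(x + random.gauss(0, 0.05) for x in s2.position)
        return s2

    # pomcp.update(agent, real_action, real_observation,
    #              state_transform_func=jitter_state)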

update_agent_belief

True if planner’s update function also updates agent’s belief.

class pomdp_py.algorithms.pomcp.RootVNodeParticles

Bases: RootVNode

belief
classmethod from_vnode(cls, vnode, history)
class pomdp_py.algorithms.pomcp.VNodeParticles

Bases: VNode

POMCP’s VNode maintains particle belief

belief

pomdp_py.algorithms.value_iteration module

Implementation of the basic policy-tree-based value iteration as explained in Section 4.1 of Planning and acting in partially observable stochastic domains [1].

Warning: No pruning - the number of policy trees explodes very fast.

class pomdp_py.algorithms.value_iteration.ValueIteration

Bases: Planner

This algorithm is only feasible for small problems where states, actions, and observations can be explicitly enumerated.

__init__(self, horizon=float('inf'), discount_factor=0.9, epsilon=1e-6)

plan(self, agent)

Plans an action.
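
A minimal usage sketch; the agent is a placeholder and must define small, fully enumerable state, action, and observation spaces (see the warning above about the number of policy trees).

    import pomdp_py

    vi = pomdp_py.ValueIteration(horizon=3, discount_factor=0.95)
    action = vi.plan(agent)   # `agent` is your small, fully enumerable POMDP agent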

pomdp_py.algorithms.bsp.blqr module

Implementation of B-LQR algorithm described in “Belief space planning assuming maximum likelihood observations” [4]

class pomdp_py.algorithms.bsp.blqr.BLQR(func_sysd, func_obs, jac_sysd, jac_obs, jac_sysd_u, noise_obs, noise_sysd, Qlarge, L, Q, R, planning_horizon=15)

Bases: Planner

ekf_update_mlo(bt, ut)

Performs the EKF belief update assuming maximum likelihood observation. This follows equations 12 and 13 in the paper. It is the function \(F\).

Parameters:
  • bt (tuple) – a belief point bt = (mt, Cov_t) representing the belief state, where mt is an np.array of shape (d,) and Cov_t is an np.array of shape (d,d). The cost function needs to turn Cov_t into a long vector consisting of the columns of Cov_t stitched together.

  • ut (np.array) – control vector

  • noise_t (pomdp_py.Gaussian) – The Gaussian noise with “possibly state-dependent” covariance matrix Wt.

  • noise_sysd (np.array) – A noise term (e.g. Gaussian noise) added to the system dynamics (\(v\) in Eq. 12). This array should have the same dimensionality as mt.

  • noise_obs_cov (np.array) – The covariance matrix of the Gaussian noise of the observation function. This corresponds to Wt in equation 13.
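
For concreteness, a belief point in this representation might look like the sketch below (dimension d=2 chosen arbitrarily); flattening the covariance column by column matches the "columns stitched together" description above.

    import numpy as np

    d = 2
    mt = np.zeros(d)                     # mean of the belief
    Cov_t = 0.1 * np.eye(d)              # covariance of the belief
    bt = (mt, Cov_t)                     # a belief point b_t = (m_t, Cov_t)
    cov_vec = Cov_t.flatten(order="F")   # columns of Cov_t stitched into one long vector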

integrate_belief_segment(b_i, u_i, num_segments)

This represents equation 18:

\(\phi(b_i', u_i') = F(b_i', u_i') + \sum_{t \in \text{seg}} \big( F(b_{t+1}, u_i) - F(b_t, u_i) \big)\)

This essentially returns \(b_{i+1}'\).
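
Since \(b_{t+1} = F(b_t, u_i)\) within a segment, a plausible reading is that integrating a segment amounts to applying \(F\) repeatedly with the same control; the sketch below assumes exactly that, with F standing for ekf_update_mlo and the names chosen here only for illustration.

    def integrate_segment_sketch(F, b_i, u_i, steps_per_segment):
        """Sketch: apply the EKF-MLO update F repeatedly with the same
        control u_i over one segment; the result plays the role of b_{i+1}'.
        (steps_per_segment is an illustrative parameter, not the library's
        signature.)"""
        b = b_i
        for _ in range(steps_per_segment):
            b = F(b, u_i)
        return b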

segmented_cost_function(bu_traj, b_des, u_des, num_segments)

The cost function in eq 17.

Parameters:
  • b_des (tuple) – The desired belief (mT, CovT). This is needed to compute the cost function.

  • u_des (list) – A list of desired controls at the beginning of each segment. If this information is not available, pass in an empty list.

create_plan(b_0, b_des, u_init, num_segments=5, control_bounds=None, opt_options={}, opt_callback=None)

Solves the SQP problem using direct transcription, and produces belief points and controls at the segments. Reference: https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html

interpret_sqp_plan(opt_res, num_segments)

pomdp_py.algorithms.visual.visual module

[1]

Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.

[2]

David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, 2164–2172. 2010.

[3]

Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, 282–293. Springer, 2006.

[4]

R. Platt, R. Tedrake, L. Kaelbling, and T. Lozano-Perez. Belief space planning assuming maximum likelihood observations. In Proceedings of Robotics: Science and Systems. Zaragoza, Spain, June 2010. doi:10.15607/RSS.2010.VI.037.

[5]

Michael H. Lim, Claire J. Tomlin, and Zachary N. Sunberg. Sparse tree search optimality guarantees in POMDPs with continuous observation spaces. arXiv preprint arXiv:1910.04332, 2019.

[6]

Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49(2-3):193–208, 2002.

[7]

António Gusmao and Tapani Raiko. Towards generalizing the success of Monte-Carlo tree search beyond the game of Go. In ECAI, 384–389. 2012.