pomdp_py.algorithms package¶
Existing POMDP Solvers:
- PORollout: PO-rollout, the baseline algorithm in the POMCP paper [2].
- POUCT: Partially Observable UCT [2], presented in the POMCP paper as an extension of the UCT algorithm to partially observable domains that combines MCTS and UCB1 for action selection.
- POMCP: POUCT + particle belief representation.
- ValueIteration: only feasible for small problems where states, actions, and observations can be explicitly enumerated.
Solvers under development (contribution wanted):
- BLQR: Implementation of the B-LQR algorithm described in "Belief space planning assuming maximum likelihood observations" [4].
pomdp_py.algorithms.po_rollout module¶
PO-rollout: Baseline algorithm in the POMCP paper [2].
Quote from the POMCP paper:
To provide a performance benchmark in these cases, we evaluated the performance of simple Monte-Carlo simulation without any tree. The PO-rollout algorithm used Monte-Carlo belief state updates, as described in section 3.2. It then simulated \(n/|A|\) rollouts for each legal action, and selected the action with highest average return.
We do not require the Monte-Carlo belief update (it is optional), but the planner performs the rollouts and action selection as described.
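As a rough, hedged sketch of this procedure (not the PORollout implementation itself), the rollout-and-select loop could look like the following; belief.random(), legal_actions, and simulate (a hypothetical generative model returning next state, observation, and reward) are assumed stand-ins for a concrete problem.

    import random

    def po_rollout_action(belief, legal_actions, simulate, num_sims,
                          discount=0.95, max_depth=20):
        # Sketch of PO-rollout action selection: split the simulation budget
        # evenly across legal actions and select the highest average return.
        # `simulate(state, action)` is a hypothetical generative model that
        # returns (next_state, observation, reward).
        budget_per_action = max(1, num_sims // len(legal_actions))
        best_action, best_value = None, float("-inf")
        for action in legal_actions:
            total = 0.0
            for _ in range(budget_per_action):
                state = belief.random()        # sample a state from the belief
                a, ret, gamma = action, 0.0, 1.0
                for _ in range(max_depth):     # random rollout after the first action
                    state, _, reward = simulate(state, a)
                    ret += gamma * reward
                    gamma *= discount
                    a = random.choice(legal_actions)
                total += ret
            avg = total / budget_per_action
            if avg > best_value:
                best_action, best_value = action, avg
        return best_action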
- class pomdp_py.algorithms.po_rollout.PORollout¶
Bases:
Planner
PO-rollout: Baseline algorithm in the POMCP paper
- clear_agent(self)¶
- last_best_reward¶
- plan(self, Agent agent)¶
The agent carries the information needed for planning: the belief \(B_t\), the history \(h_t\), the models \(O, T, R/G\), and the policy \(\pi\).
- set_rollout_policy(self, RolloutPolicy rollout_policy)¶
Updates the rollout policy to the given one
- update(self, Agent agent, Action real_action, Observation real_observation)¶
Updates the planner based on real action and observation. Updates the agent accordingly if necessary. If the agent’s belief is also updated here, the update_agent_belief attribute should be set to True. By default, does nothing.
- update_agent_belief¶
True if planner’s update function also updates agent’s belief.
pomdp_py.algorithms.po_uct module¶
This algorithm is PO-UCT (Partially Observable UCT). It is presented in the POMCP paper [2] as an extension to the UCT algorithm [3] that combines MCTS and UCB1 for action selection.
In other words, this is just POMCP without the particle belief representation; it works with an arbitrary belief representation.
This planning strategy, based on MCTS with belief sampling, may be referred to as “belief sparse sampling”. Partially Observable Sparse Sampling (POSS) was introduced in a recent paper [5] as an extension of sparse sampling for MDPs [6]; that paper also proposes an extension of POSS called POWSS (Partially Observable Weighted Sparse Sampling). However, this line of work (POSS, POWSS) is based solely on the particle belief representation: POSS still requires comparing observations exactly for the particle belief update, while POWSS weights particles according to the observation distribution.
A note on the exploration constant. Quote from [7]:
“This constant should reflect the agent’s prior knowledge regarding the amount of exploration required.”
In the POMCP paper, this constant is set as follows:
“The exploration constant for POMCP was set to \(c = R_{hi} - R_{lo}\), where \(R_{hi}\) was the highest return achieved during sample runs of POMCP with \(c = 0\), and \(R_{lo}\) was the lowest return achieved during sample rollouts”
It is then clear that the POMCP paper indeed sets this constant based on prior knowledge. Note the difference between sample runs and sample rollouts. This is certainly not the only way to obtain the prior knowledge.
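For instance, one might calibrate the constant from trial runs with \(c = 0\) and then construct the planner with \(c = R_{hi} - R_{lo}\). The sketch below assumes an existing agent whose policy_model is a RolloutPolicy, and a hypothetical run_trial function that executes one episode with a given planner and returns its total discounted return.

    import pomdp_py

    # Calibration runs with no exploration bonus (run_trial is assumed,
    # not part of pomdp_py): record the returns achieved with c = 0.
    greedy = pomdp_py.POUCT(max_depth=20, num_sims=1000, discount_factor=0.95,
                            exploration_const=0.0,
                            rollout_policy=agent.policy_model)
    returns = [run_trial(agent, greedy) for _ in range(20)]

    # Set c = R_hi - R_lo, as in the POMCP paper, and plan with it.
    c = max(returns) - min(returns)
    planner = pomdp_py.POUCT(max_depth=20, num_sims=1000, discount_factor=0.95,
                             exploration_const=c,
                             rollout_policy=agent.policy_model)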
- class pomdp_py.algorithms.po_uct.ActionPrior¶
Bases:
object
A problem-specific object
- get_preferred_actions(cls, state, history, kwargs)¶
Intended as a classmethod. This is to mimic the behavior of Simulator.Prior and GenerateLegal/GeneratePreferred in David Silver’s POMCP code.
Returns a set of tuples of the form (action, num_visits_init, value_init) that represent the preferred actions. In POMCP, this acts as a history-based prior policy, and in DESPOT, this acts as a belief-based prior policy. For example, given a certain state or history, it may be that only a subset of all actions is legal; this is useful when there is domain knowledge that can be used as a heuristic for planning.
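A minimal sketch of supplying such a prior is given below; MoveForward is a hypothetical domain action, and the method is written as an instance method with **kwargs, matching the documented signature only loosely.

    from pomdp_py.algorithms.po_uct import ActionPrior, POUCT

    class MyActionPrior(ActionPrior):
        # Sketch: prefer a (hypothetical) domain action early on, seeding it
        # with an initial visit count and value estimate.
        def get_preferred_actions(self, state, history, **kwargs):
            if not history:
                # tuples of (action, num_visits_init, value_init), as documented above
                return {(MoveForward(), 10, 100.0)}
            return set()

    planner = POUCT(num_sims=1000, action_prior=MyActionPrior(),
                    rollout_policy=agent.policy_model)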
- class pomdp_py.algorithms.po_uct.POUCT¶
Bases:
Planner
POUCT (Partially Observable UCT) [2] is presented in the POMCP paper as an extension of the UCT algorithm to partially-observable domains that combines MCTS and UCB1 for action selection.
POUCT only works for problems with action space that can be enumerated.
- __init__(self, max_depth=5, planning_time=1., num_sims=-1, discount_factor=0.9, exploration_const=math.sqrt(2), num_visits_init=1, value_init=0, rollout_policy=RandomRollout(), action_prior=None, show_progress=False, pbar_update_interval=5)
- Parameters:
max_depth (int) – Depth of the MCTS tree. Default: 5.
planning_time (float) – Amount of time given to each planning step (seconds). Default: -1. If negative, planning terminates when the number of simulations num_sims is reached. If both num_sims and planning_time are negative, the planner will run for 1 second.
num_sims (int) – Number of simulations for each planning step. If negative, planning terminates when planning_time is reached. If both num_sims and planning_time are negative, the planner will run for 1 second.
rollout_policy (RolloutPolicy) – rollout policy. Default: RandomRollout.
action_prior (ActionPrior) – a prior over preferred actions given state and history.
show_progress (bool) – True to print a progress bar for simulations.
pbar_update_interval (int) – The number of simulations to run between updates of the progress bar. Only used if show_progress is True; you can set this parameter even if your stopping criterion is time-based.
- action_prior¶
- clear_agent()¶
- discount_factor¶
- last_num_sims¶
Returns the number of simulations run in the last plan call.
- last_planning_time¶
Returns the amount of time (seconds) taken by the last plan call.
- max_depth¶
- num_visits_init¶
- plan(self, Agent agent)¶
The agent carries the information needed for planning: the belief \(B_t\), the history \(h_t\), the models \(O, T, R/G\), and the policy \(\pi\).
- rollout_policy¶
- set_rollout_policy(self, RolloutPolicy rollout_policy)¶
Updates the rollout policy to the given one
- update(self, Agent agent, Action real_action, Observation real_observation)¶
Assume that the agent’s history has been updated after taking real_action and receiving real_observation.
- updates_agent_belief¶
- value_init¶
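Putting this together, a typical (hedged) usage pattern is sketched below; agent and env are assumed to come from a concrete pomdp_py problem definition, and agent.policy_model is assumed to be a RolloutPolicy. Note that POUCT itself does not update the agent's belief; that is done separately with whatever belief representation the agent uses.

    import pomdp_py

    planner = pomdp_py.POUCT(max_depth=20, num_sims=2000, discount_factor=0.95,
                             exploration_const=50,
                             rollout_policy=agent.policy_model)

    for step in range(10):
        action = planner.plan(agent)                  # MCTS from the current belief
        reward = env.state_transition(action, execute=True)
        observation = env.provide_observation(agent.observation_model, action)
        agent.update_history(action, observation)     # update the history first
        planner.update(agent, action, observation)    # then inform the planner
        # Update agent.belief here with your chosen belief representation,
        # e.g. an exact histogram belief update.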
- class pomdp_py.algorithms.po_uct.RandomRollout¶
Bases:
RolloutPolicy
A rollout policy that chooses actions uniformly at random from the set of possible actions.
- rollout(self, State state, tuple history=None)¶
- class pomdp_py.algorithms.po_uct.RolloutPolicy¶
Bases:
PolicyModel
- rollout(self, State state, tuple history=None)¶
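A custom rollout policy can encode domain heuristics. A minimal sketch follows; the heuristic callable and the action set are hypothetical, and depending on usage other PolicyModel methods (e.g. sample) may also need to be implemented.

    import random
    from pomdp_py.algorithms.po_uct import RolloutPolicy

    class HeuristicRollout(RolloutPolicy):
        # Sketch: pick the action a (hypothetical) heuristic scores highest,
        # falling back to a uniformly random choice.
        def __init__(self, all_actions, heuristic=None):
            self.all_actions = list(all_actions)
            self.heuristic = heuristic   # callable (state, action) -> score, or None

        def rollout(self, state, history=None):
            if self.heuristic is not None:
                return max(self.all_actions, key=lambda a: self.heuristic(state, a))
            return random.choice(self.all_actions)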
pomdp_py.algorithms.pomcp module¶
We implement POMCP as described in the original paper Monte-Carlo Planning in Large POMDPs https://papers.nips.cc/paper/4031-monte-carlo-planning-in-large-pomdps
One thing to note is that, in this algorithm, the belief update happens as the simulation progresses. The new belief is stored in the VNodes at the level after executing the next action. These particles will be reinvigorated if there are not enough of them.
However, it is possible to separate MCTS completely from the belief update. This means the belief nodes no longer keep track of particles, and the belief update and particle reinvigoration happen once after MCTS is completed. I have previously implemented this version. This version is also implemented in BasicPOMCP.jl (https://github.com/JuliaPOMDP/BasicPOMCP.jl). The two should be EQUIVALENT. In general, it doesn’t hurt to do the belief update during MCTS, a feature enabled by the particle representation.
- class pomdp_py.algorithms.pomcp.POMCP¶
Bases:
POUCT
POMCP is POUCT + particle belief representation. This POMCP version only works for problems with action space that can be enumerated.
- plan(self, Agent agent)¶
The agent carries the information needed for planning: the belief \(B_t\), the history \(h_t\), the models \(O, T, R/G\), and the policy \(\pi\).
- update(agent, real_action, real_observation, state_transform_func=None)¶
Assume that the agent’s history has been updated after taking real_action and receiving real_observation.
- state_transform_func: Used to add an artificial transform to states during
particle reinvigoration. Signature: s -> s_transformed
- update_agent_belief¶
True if planner’s update function also updates agent’s belief.
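A hedged sketch of the POMCP planning loop follows; agent and env are assumed to come from a concrete problem, sample_states is a hypothetical helper that draws initial states, and the agent's prior belief is set to a particle distribution as POMCP requires.

    import pomdp_py

    # POMCP requires a particle belief representation.
    agent.set_belief(pomdp_py.Particles(sample_states(num_particles=1000)),
                     prior=True)

    planner = pomdp_py.POMCP(max_depth=20, num_sims=2000, discount_factor=0.95,
                             exploration_const=50,
                             rollout_policy=agent.policy_model)

    for step in range(10):
        action = planner.plan(agent)
        reward = env.state_transition(action, execute=True)
        observation = env.provide_observation(agent.observation_model, action)
        agent.update_history(action, observation)
        # Unlike POUCT, this update also updates the agent's particle belief and
        # reinvigorates particles if needed; state_transform_func can perturb
        # states during reinvigoration.
        planner.update(agent, action, observation)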
pomdp_py.algorithms.value_iteration module¶
Implementation of the basic policy-tree-based value iteration as explained in section 4.1 of “Planning and acting in partially observable stochastic domains” [1].
Warning: No pruning - the number of policy trees explodes very fast.
- class pomdp_py.algorithms.value_iteration.ValueIteration¶
Bases:
Planner
This algorithm is only feasible for small problems where states, actions, and observations can be explicitly enumerated.
__init__(self, horizon=float('inf'), discount_factor=0.9, epsilon=1e-6)
- plan(self, agent)¶
Plans an action.
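Because the number of policy trees grows exponentially, this planner is typically used with a very small horizon on toy problems. A minimal (hedged) sketch, assuming an existing agent with enumerable state, action, and observation spaces:

    import pomdp_py

    # Exact value iteration over policy trees; feasible only for tiny problems.
    vi = pomdp_py.ValueIteration(horizon=3, discount_factor=0.95)
    action = vi.plan(agent)   # enumerates policy trees up to the horizon and returns
                              # the root action of the best one for the agent's belief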
pomdp_py.algorithms.bsp.blqr module¶
Implementation of B-LQR algorithm described in “Belief space planning assuming maximum likelihood observations” [4]
- class pomdp_py.algorithms.bsp.blqr.BLQR(func_sysd, func_obs, jac_sysd, jac_obs, jac_sysd_u, noise_obs, noise_sysd, Qlarge, L, Q, R, planning_horizon=15)[source]¶
Bases:
Planner
- ekf_update_mlo(bt, ut)[source]¶
Performs the EKF belief update assuming the maximum likelihood observation. This follows equations 12 and 13 in the paper. It’s the function \(F\).
- Parameters:
bt (tuple) – a belief point bt = (mt, Cov_t) representing the belief state, where mt is an np.array of shape (d,) and Cov_t is an np.array of shape (d,d). The cost function equation needs to turn Cov_t into a long vector consisting of the columns of Cov_t stitched together.
ut (np.array) – control vector
noise_t (pomdp_py.Gaussian) – The Gaussian noise with “possibly state-dependent” covariance matrix Wt.
noise_sysd (np.array) – A noise term (e.g. Gaussian noise) added to the system dynamics (\(v\) in Eq. 12). This array should have the same dimensionality as mt.
noise_obs_cov (np.array) – The covariance matrix of the Gaussian noise of the observation function. This corresponds to Wt in equation 13.
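Under the maximum likelihood observation assumption, the innovation term vanishes: the mean simply propagates through the dynamics, while the covariance still contracts through the Kalman gain. The generic numpy sketch below illustrates that standard EKF step; the names and signature here are illustrative and do not match BLQR.ekf_update_mlo.

    import numpy as np

    def ekf_mlo_step(m, Cov, u, f, A, C, V, W):
        # m, Cov: belief mean and covariance; u: control input.
        # f: system dynamics; A, C: Jacobians of the dynamics and observation
        # functions at the current belief; V, W: process/observation noise covariances.
        m_pred = f(m, u)                        # mean propagates through the dynamics
        Gamma = A @ Cov @ A.T + V               # predicted covariance
        K = Gamma @ C.T @ np.linalg.inv(C @ Gamma @ C.T + W)   # Kalman gain
        # With the maximum likelihood observation, the innovation is zero,
        # so the corrected mean equals the predicted mean.
        Cov_new = (np.eye(len(m)) - K @ C) @ Gamma
        return m_pred, Cov_new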
- integrate_belief_segment(b_i, u_i, num_segments)[source]¶
This is to represent equation 18.
- \(\phi(b_i', u_i') = F(b_i', u_i') + \sum_{t \in \text{seg}} \big(F(b_{t+1}, u_i) - F(b_t, u_i)\big)\)
This essentially returns \(b_{i+1}'\).
- segmented_cost_function(bu_traj, b_des, u_des, num_segments)[source]¶
The cost function in eq 17.
- Parameters:
b_des (tuple) – The desired belief (mT, CovT). This is needed to compute the cost function.
u_des (list) – A list of desired controls at the beginning of each segment. If this information is not available, pass in an empty list.
- create_plan(b_0, b_des, u_init, num_segments=5, control_bounds=None, opt_options={}, opt_callback=None)[source]¶
Solves the SQP problem using direct transcription, and produces belief points and controls at the segments. Reference: https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
pomdp_py.algorithms.visual.visual module¶
[1] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.
[2] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, 2164–2172. 2010.
[3] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, 282–293. Springer, 2006.
[4] R. Platt, R. Tedrake, L. Kaelbling, and T. Lozano-Perez. Belief space planning assuming maximum likelihood observations. In Proceedings of Robotics: Science and Systems. Zaragoza, Spain, June 2010. doi:10.15607/RSS.2010.VI.037.
[5] Michael H. Lim, Claire J. Tomlin, and Zachary N. Sunberg. Sparse tree search optimality guarantees in POMDPs with continuous observation spaces. arXiv preprint arXiv:1910.04332, 2019.
[6] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49(2-3):193–208, 2002.
[7] António Gusmao and Tapani Raiko. Towards generalizing the success of Monte-Carlo tree search beyond the game of Go. In ECAI, 384–389. 2012.