pomdp_py.framework package

Important classes in pomdp_py.framework.basics:

Distribution

A Distribution is a probability function that maps a variable value to a real value, that is, \(\Pr(X=x)\).

GenerativeDistribution

A GenerativeDistribution is a Distribution that additionally exhibits generative properties.

TransitionModel

A TransitionModel models the distribution \(T(s,a,s')=\Pr(s'|s,a)\).

ObservationModel

An ObservationModel models the distribution \(O(s',a,o)=\Pr(o|s',a)\).

BlackboxModel

A BlackboxModel is the generative distribution \(G(s,a)\) which can generate samples where each is a tuple \((s',o,r)\).

RewardModel

A RewardModel models the distribution \(\Pr(r|s,a,s')\) where \(r\in\mathbb{R}\), with the argmax denoted as \(R(s,a,s')\).

PolicyModel

PolicyModel models the distribution \(\pi(a|s)\).

Agent

An Agent operates in an environment by taking actions, receiving observations, and updating its belief.

Environment

An Environment maintains the true state of the world.

POMDP

A POMDP instance = agent (Agent) + env (Environment).

State

The State class.

Action

The Action class.

Observation

The Observation class.

Option

An option is a temporally abstracted action defined by \((I, \pi, B)\), where \(I\) is an initiation set, \(\pi\) is a policy, and \(B\) is a termination condition.

pomdp_py.framework.basics module

class pomdp_py.framework.basics.Action

Bases: object

The Action class. Action must be hashable.

class pomdp_py.framework.basics.Agent

Bases: object

An Agent operates in an environment by taking actions, receiving observations, and updating its belief. Deciding what action to take is the job of a planner (Planner), and the belief update is usually done outside of the agent, taken care of e.g. by the belief representation, or by the planner. The Agent supplies its own version of the TransitionModel, ObservationModel, RewardModel, OR BlackboxModel to the planner or the belief update algorithm.

__init__(self, init_belief, policy_model=None, transition_model=None, observation_model=None, reward_model=None, blackbox_model=None, name=None)

add_attr(self, attr_name, attr_value)

A function that allows adding attributes to the agent. Sometimes useful for planners to store agent-specific information.

all_actions

Only available if the policy model implements get_all_actions.

all_observations

Only available if the observation model implements get_all_observations.

all_states

Only available if the transition model implements get_all_states.

belief

Current belief distribution.

history

Current history.

init_belief

Initial belief distribution.

sample_belief(self)

Returns a state (State) sampled from the belief.

set_belief(self, belief, prior=False)

set_models(transition_model=None, observation_model=None, reward_model=None, blackbox_model=None, policy_model=None)

Re-assign the models to be the ones given.

set_name(self, str name)

Gives this agent a name.

update(self, real_action, real_observation)

Updates the history and performs the belief update.

update_history(self, real_action, real_observation)

class pomdp_py.framework.basics.BlackboxModel

Bases: object

A BlackboxModel is the generative distribution \(G(s,a)\) which can generate samples where each is a tuple \((s',o,r)\).

argmax(self, state, action)

Returns the most likely \((s', o, r)\).

sample(self, state, action)

Samples \((s', o, r) \sim G(s, a)\).
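
For instance, a BlackboxModel can wrap a simulator whose dynamics are not written down as explicit distributions. Below is a minimal, hypothetical sketch (all names are made up for illustration); it assumes only the sample interface listed above.

   import random
   import pomdp_py
   from pomdp_py.framework.basics import BlackboxModel

   class MachineState(pomdp_py.State):
       """Hypothetical state: how many coins are left in a machine."""
       def __init__(self, coins):
           self.coins = coins
       def __hash__(self):
           return hash(self.coins)
       def __eq__(self, other):
           return isinstance(other, MachineState) and self.coins == other.coins

   class Press(pomdp_py.Action):
       def __hash__(self):
           return hash("press")
       def __eq__(self, other):
           return isinstance(other, Press)

   class ClickSound(pomdp_py.Observation):
       def __init__(self, loud):
           self.loud = loud
       def __hash__(self):
           return hash(self.loud)
       def __eq__(self, other):
           return isinstance(other, ClickSound) and self.loud == other.loud

   class MachineSimulator(BlackboxModel):
       """Generates (s', o, r) by simulating one step, without explicit T/O/R models."""
       def sample(self, state, action):
           next_state = MachineState(max(0, state.coins - 1))    # pressing consumes a coin
           observation = ClickSound(loud=random.random() < 0.8)  # noisy click
           reward = 1.0 if state.coins > 0 else -1.0
           return next_state, observation, reward

   next_state, observation, reward = MachineSimulator().sample(MachineState(3), Press())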

class pomdp_py.framework.basics.Distribution

Bases: object

A Distribution is a probability function that maps a variable value to a real value, that is, \(\Pr(X=x)\).

__getitem__(self, varval)

Probability evaluation. Returns the probability \(\Pr(X=varval)\).

__setitem__(self, varval, value)

Sets the probability of \(X=varval\) to be value.

class pomdp_py.framework.basics.Environment

Bases: object

An Environment maintains the true state of the world. For example, it could be a 2D gridworld rendered by pygame, or a 3D simulated world rendered by OpenGL. Therefore, when coding up an Environment, the developer should have in mind how to represent the state so that it can be used by a POMDP or OOPOMDP.

The Environment is passive. It never observes nor acts.

__init__(self, init_state, transition_model=None, reward_model=None, blackbox_model=None)

apply_transition(self, next_state)

Apply the transition, that is, assign current state to be the next_state.

blackbox_model

The BlackboxModel underlying the environment

cur_state

Current state of the environment

provide_observation(self, observation_model, action)

Returns an observation sampled according to \(\Pr(o|s',a)\), where \(s'\) is the current environment state, \(a\) is the given action, and \(\Pr(o|s',a)\) is given by the observation_model.

Parameters:
  • observation_model (ObservationModel) – the observation model \(\Pr(o|s',a)\) to sample from

  • action (Action) – the action \(a\)

Returns:

an observation sampled from \(\Pr(o|s',a)\).

Return type:

Observation

reward_model

The RewardModel underlying the environment

set_models(transition_model=None, reward_model=None, blackbox_model=None)

Re-assign the models to be the ones given.

state

Synonym for cur_state.

state_transition(self, action, execute=True)

Simulates a state transition given action. If execute is set to True, then the resulting state will be the new current state of the environment.

Parameters:
  • action (Action) – action that triggers the state transition

  • execute (bool) – If True, the resulting state of the transition will become the current state.

  • discount_factor (float) – Only necessary if action is an Option. It is the discount factor when executing actions following an option’s policy until reaching terminal condition.

Returns:

the reward as a result of the state transition, if execute is True; (next_state, reward) if execute is False.

Return type:

float or tuple
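
As a concrete illustration of state_transition and its execute flag, the sketch below builds an Environment for a hypothetical one-door world (all class names are made up); it assumes only the interfaces documented in this module.

   import pomdp_py

   class DoorState(pomdp_py.State):
       def __init__(self, is_open):
           self.is_open = is_open
       def __hash__(self):
           return hash(self.is_open)
       def __eq__(self, other):
           return isinstance(other, DoorState) and self.is_open == other.is_open

   class OpenDoor(pomdp_py.Action):
       def __hash__(self):
           return hash("open")
       def __eq__(self, other):
           return isinstance(other, OpenDoor)

   class DoorTransitionModel(pomdp_py.TransitionModel):
       def sample(self, state, action):
           # opening always succeeds; other actions leave the door unchanged
           return DoorState(True) if isinstance(action, OpenDoor) else DoorState(state.is_open)

   class DoorRewardModel(pomdp_py.RewardModel):
       def sample(self, state, action, next_state):
           return 10.0 if next_state.is_open else 0.0

   env = pomdp_py.Environment(DoorState(False),
                              transition_model=DoorTransitionModel(),
                              reward_model=DoorRewardModel())

   # execute=False only simulates: the environment state is unchanged
   next_state, reward = env.state_transition(OpenDoor(), execute=False)
   assert env.cur_state == DoorState(False)

   # execute=True commits the transition and returns the reward
   reward = env.state_transition(OpenDoor(), execute=True)
   assert env.cur_state == DoorState(True)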

transition_model

The TransitionModel underlying the environment

class pomdp_py.framework.basics.GenerativeDistribution

Bases: Distribution

A GenerativeDistribution is a Distribution that additionally exhibits generative properties. That is, it supports argmax() (or mpe()) and random() functions.

argmax(self)

Synonym for mpe().

get_histogram(self)

Returns a dictionary from state to probability

mpe(self)

Returns the value of the variable that has the highest probability.
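
pomdp_py ships concrete GenerativeDistribution subclasses in its representations package (e.g. Histogram and Particles). The following is a minimal hand-rolled sketch with hypothetical names, implementing only the interface documented above.

   import random
   import pomdp_py

   class CoinState(pomdp_py.State):
       def __init__(self, side):
           self.side = side
       def __hash__(self):
           return hash(self.side)
       def __eq__(self, other):
           return isinstance(other, CoinState) and self.side == other.side

   class CoinBelief(pomdp_py.GenerativeDistribution):
       """A tiny histogram-style belief over CoinState values."""
       def __init__(self, probs):
           self._probs = dict(probs)
       def __getitem__(self, state):
           return self._probs.get(state, 0.0)
       def __setitem__(self, state, value):
           self._probs[state] = value
       def mpe(self):
           # value with the highest probability
           return max(self._probs, key=self._probs.get)
       def random(self):
           states = list(self._probs)
           weights = [self._probs[s] for s in states]
           return random.choices(states, weights=weights, k=1)[0]
       def get_histogram(self):
           return dict(self._probs)

   belief = CoinBelief({CoinState("heads"): 0.7, CoinState("tails"): 0.3})
   assert belief.mpe() == CoinState("heads")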

class pomdp_py.framework.basics.Observation

Bases: object

The Observation class. Observation must be hashable.

class pomdp_py.framework.basics.ObservationModel

Bases: object

An ObservationModel models the distribution \(O(s',a,o)=\Pr(o|s',a)\).

argmax(self, next_state, action)

Returns the most likely observation

get_all_observations(self)

Returns a set of all possible observations, if feasible.

get_distribution(self, next_state, action)

Returns the underlying distribution of the model

probability(self, observation, next_state, action)

Returns the probability of \(\Pr(o|s',a)\).

Parameters:
  • observation (Observation) – the observation \(o\)

  • next_state (State) – the next state \(s'\)

  • action (Action) – the action \(a\)

Returns:

the probability \(\Pr(o|s',a)\)

Return type:

float

sample(self, next_state, action)

Returns observation randomly sampled according to the distribution of this observation model.

Parameters:
  • next_state (State) – the next state \(s'\)

  • action (Action) – the action \(a\)

Returns:

the observation \(o\)

Return type:

Observation
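
For example, an ObservationModel for a noisy door sensor might look like the following sketch (all class names are hypothetical); it implements the probability, sample, and get_all_observations interface documented above.

   import random
   import pomdp_py

   class DoorState(pomdp_py.State):
       def __init__(self, is_open):
           self.is_open = is_open
       def __hash__(self):
           return hash(self.is_open)
       def __eq__(self, other):
           return isinstance(other, DoorState) and self.is_open == other.is_open

   class DoorObservation(pomdp_py.Observation):
       """Whether the door looks open to the sensor."""
       def __init__(self, looks_open):
           self.looks_open = looks_open
       def __hash__(self):
           return hash(self.looks_open)
       def __eq__(self, other):
           return isinstance(other, DoorObservation) and self.looks_open == other.looks_open

   class NoisyDoorSensor(pomdp_py.ObservationModel):
       """Reports the true door status with probability 0.85."""
       NOISE = 0.15
       def probability(self, observation, next_state, action):
           correct = (observation.looks_open == next_state.is_open)
           return 1.0 - self.NOISE if correct else self.NOISE
       def sample(self, next_state, action):
           if random.random() < self.NOISE:
               return DoorObservation(not next_state.is_open)
           return DoorObservation(next_state.is_open)
       def get_all_observations(self):
           return {DoorObservation(True), DoorObservation(False)}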

class pomdp_py.framework.basics.Option

Bases: Action

An option is a temporally abstracted action defined by \((I, \pi, B)\), where \(I\) is an initiation set, \(\pi\) is a policy, and \(B\) is a termination condition.

Described in Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning

initiate(state)

Returns True if the given state satisfies the initiation set.

policy

Returns the policy model (PolicyModel) of this option.

sample(self, state)

Samples an action from this option’s policy. Convenience function; can be overridden if you do not want to write a PolicyModel class.

terminate(state)

Returns True if the given state satisfies the termination condition. Technically, returning a float between 0 and 1 (a termination probability) is also allowed.
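
For illustration, here is a hedged sketch of an Option for a 1-D corridor (WalkToGoal and the other names are hypothetical); it assumes the initiate/terminate/sample interface listed above.

   import pomdp_py
   from pomdp_py.framework.basics import Option

   class GridState(pomdp_py.State):
       def __init__(self, x):
           self.x = x
       def __hash__(self):
           return hash(self.x)
       def __eq__(self, other):
           return isinstance(other, GridState) and self.x == other.x

   class Move(pomdp_py.Action):
       def __init__(self, dx):
           self.dx = dx
       def __hash__(self):
           return hash(self.dx)
       def __eq__(self, other):
           return isinstance(other, Move) and self.dx == other.dx

   class WalkToGoal(Option):
       """Temporally abstracted action: keep stepping right until the goal is reached."""
       def __init__(self, goal_x):
           self.goal_x = goal_x
       def initiate(self, state):
           return state.x < self.goal_x     # I: only start to the left of the goal
       def terminate(self, state):
           return state.x >= self.goal_x    # B: stop once the goal is reached
       def sample(self, state):
           return Move(+1)                  # pi: always step right
       def __hash__(self):
           return hash(("walk_to", self.goal_x))
       def __eq__(self, other):
           return isinstance(other, WalkToGoal) and self.goal_x == other.goal_x

   option = WalkToGoal(5)
   assert option.initiate(GridState(2)) and not option.terminate(GridState(2))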

class pomdp_py.framework.basics.POMDP

Bases: object

A POMDP instance = agent (Agent) + env (Environment).

__init__(self, agent, env, name="POMDP")
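
To make this composition concrete, here is a small self-contained sketch of a toy "lamp" POMDP. All domain names (LampState, Toggle, and so on) are made up for illustration, and pomdp_py.Histogram is assumed for the initial belief; otherwise the sketch only relies on interfaces documented in this module.

   import random
   import pomdp_py

   class LampState(pomdp_py.State):
       def __init__(self, on):
           self.on = on
       def __hash__(self):
           return hash(self.on)
       def __eq__(self, other):
           return isinstance(other, LampState) and self.on == other.on

   class Toggle(pomdp_py.Action):
       def __hash__(self):
           return hash("toggle")
       def __eq__(self, other):
           return isinstance(other, Toggle)

   class LampObservation(pomdp_py.Observation):
       def __init__(self, looks_on):
           self.looks_on = looks_on
       def __hash__(self):
           return hash(self.looks_on)
       def __eq__(self, other):
           return isinstance(other, LampObservation) and self.looks_on == other.looks_on

   class LampTransitionModel(pomdp_py.TransitionModel):
       def sample(self, state, action):
           return LampState(not state.on)   # toggling always flips the lamp

   class LampObservationModel(pomdp_py.ObservationModel):
       def sample(self, next_state, action):
           correct = random.random() < 0.9  # noisy sensor, right 90% of the time
           return LampObservation(next_state.on if correct else not next_state.on)

   class LampRewardModel(pomdp_py.RewardModel):
       def sample(self, state, action, next_state):
           return 1.0 if next_state.on else 0.0

   class LampPolicyModel(pomdp_py.PolicyModel):
       def sample(self, state):
           return Toggle()
       def get_all_actions(self, *args):
           return {Toggle()}

   init_belief = pomdp_py.Histogram({LampState(True): 0.5, LampState(False): 0.5})
   agent = pomdp_py.Agent(init_belief,
                          LampPolicyModel(),
                          transition_model=LampTransitionModel(),
                          observation_model=LampObservationModel(),
                          reward_model=LampRewardModel())
   env = pomdp_py.Environment(LampState(False),
                              transition_model=LampTransitionModel(),
                              reward_model=LampRewardModel())
   problem = pomdp_py.POMDP(agent, env, name="Lamp")

   # one interaction step: act, transition the environment, observe, record history
   action = agent.policy_model.sample(agent.sample_belief())
   reward = env.state_transition(action, execute=True)
   observation = env.provide_observation(agent.observation_model, action)
   agent.update_history(action, observation)
   print(env.state.on, reward, observation.looks_on)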

class pomdp_py.framework.basics.PolicyModel

Bases: object

PolicyModel models the distribution \(\pi(a|s)\). It can also be treated as modeling \(\pi(a|h_t)\) by regarding the state parameter as the history \(h_t\).

The reason to have a policy model is to accommodate problems with very large action spaces, where the available actions may vary depending on the state (that is, certain actions have probability 0).

argmax(self, state)

Returns the most likely action.

get_all_actions(self, *args)

Returns a set of all possible actions, if feasible.

get_distribution(self, state)

Returns the underlying distribution of the model

probability(self, action, state)

Returns the probability of \(\pi(a|s)\).

Parameters:
  • action (Action) – the action \(a\)

  • state (State) – the state \(s\)

Returns:

the probability \(\pi(a|s)\)

Return type:

float

sample(self, state)

Returns action randomly sampled according to the distribution of this policy model.

Parameters:

state (State) – the state \(s\)

Returns:

the action \(a\)

Return type:

Action

update(self, state, next_state, action)

Policy model may be updated given a (s,a,s’) pair.
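
As a sketch, a PolicyModel for a 1-D corridor in which the available moves depend on the agent's position might look like the following (all names are hypothetical):

   import random
   import pomdp_py

   class GridState(pomdp_py.State):
       def __init__(self, x):
           self.x = x
       def __hash__(self):
           return hash(self.x)
       def __eq__(self, other):
           return isinstance(other, GridState) and self.x == other.x

   class Move(pomdp_py.Action):
       def __init__(self, dx):
           self.dx = dx
       def __hash__(self):
           return hash(self.dx)
       def __eq__(self, other):
           return isinstance(other, Move) and self.dx == other.dx

   class CorridorPolicyModel(pomdp_py.PolicyModel):
       """Uniform random policy over the actions available in a corridor [0, 9]."""
       def _available(self, state):
           actions = set()
           if state.x > 0:
               actions.add(Move(-1))
           if state.x < 9:
               actions.add(Move(+1))
           return actions
       def sample(self, state):
           return random.choice(list(self._available(state)))
       def probability(self, action, state):
           available = self._available(state)
           return 1.0 / len(available) if action in available else 0.0
       def get_all_actions(self, *args):
           return {Move(-1), Move(+1)}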

class pomdp_py.framework.basics.RewardModel

Bases: object

A RewardModel models the distribution \(\Pr(r|s,a,s')\) where \(r\in\mathbb{R}\), with the argmax denoted as \(R(s,a,s')\).

argmax(self, state, action, next_state)

Returns the most likely reward. This is optional.

get_distribution(self, state, action, next_state)

Returns the underlying distribution of the model

probability(self, reward, state, action, next_state)

Returns the probability of \(\Pr(r|s,a,s')\).

Parameters:
  • reward (float) – the reward \(r\)

  • state (State) – the state \(s\)

  • action (Action) – the action \(a\)

  • next_state (State) – the next state \(s'\)

Returns:

the probability \(\Pr(r|s,a,s')\)

Return type:

float

sample(self, state, action, next_state)

Returns reward randomly sampled according to the distribution of this reward model. This is required, i.e. assumed to be implemented for a reward model.

Parameters:
  • state (State) – the state \(s\)

  • action (Action) – the action \(a\)

  • next_state (State) – the next state \(s'\)

Returns:

the reward \(r\)

Return type:

float
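
A common case is a deterministic reward function, where the distribution \(\Pr(r|s,a,s')\) puts all its mass on a single value. A hypothetical sketch:

   import pomdp_py

   class DoorState(pomdp_py.State):
       def __init__(self, is_open):
           self.is_open = is_open
       def __hash__(self):
           return hash(self.is_open)
       def __eq__(self, other):
           return isinstance(other, DoorState) and self.is_open == other.is_open

   class DoorRewardModel(pomdp_py.RewardModel):
       """Deterministic reward: +10 for newly opening the door, -1 otherwise."""
       def _reward(self, state, action, next_state):
           return 10.0 if (next_state.is_open and not state.is_open) else -1.0
       def sample(self, state, action, next_state):
           return self._reward(state, action, next_state)     # required
       def argmax(self, state, action, next_state):
           return self._reward(state, action, next_state)     # optional: the mode R(s,a,s')
       def probability(self, reward, state, action, next_state):
           return 1.0 if reward == self._reward(state, action, next_state) else 0.0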

class pomdp_py.framework.basics.State

Bases: object

The State class. State must be hashable.

class pomdp_py.framework.basics.TransitionModel

Bases: object

A TransitionModel models the distribution \(T(s,a,s')=\Pr(s'|s,a)\).

argmax(self, state, action)

Returns the most likely next state

get_all_states(self)

Returns a set of all possible states, if feasible.

get_distribution(self, state, action)

Returns the underlying distribution of the model

probability(self, next_state, state, action)

Returns the probability of \(\Pr(s'|s,a)\).

Parameters:
  • state (State) – the state \(s\)

  • next_state (State) – the next state \(s'\)

  • action (Action) – the action \(a\)

Returns:

the probability \(\Pr(s'|s,a)\)

Return type:

float

sample(self, state, action)

Returns next state randomly sampled according to the distribution of this transition model.

Parameters:
  • state (State) – the state \(s\)

  • action (Action) – the action \(a\)

Returns:

the next state \(s'\)

Return type:

State
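
For example, a TransitionModel for a door that opens successfully with probability 0.8 could be sketched as follows (hypothetical names):

   import random
   import pomdp_py

   class DoorState(pomdp_py.State):
       def __init__(self, is_open):
           self.is_open = is_open
       def __hash__(self):
           return hash(self.is_open)
       def __eq__(self, other):
           return isinstance(other, DoorState) and self.is_open == other.is_open

   class OpenDoor(pomdp_py.Action):
       def __hash__(self):
           return hash("open")
       def __eq__(self, other):
           return isinstance(other, OpenDoor)

   class DoorTransitionModel(pomdp_py.TransitionModel):
       """Opening a closed door succeeds with probability 0.8; anything else keeps the state."""
       SUCCESS = 0.8
       def probability(self, next_state, state, action):
           if isinstance(action, OpenDoor) and not state.is_open:
               return self.SUCCESS if next_state.is_open else 1.0 - self.SUCCESS
           return 1.0 if next_state == state else 0.0
       def sample(self, state, action):
           if isinstance(action, OpenDoor) and not state.is_open:
               return DoorState(random.random() < self.SUCCESS)
           return DoorState(state.is_open)
       def get_all_states(self):
           return {DoorState(True), DoorState(False)}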

pomdp_py.framework.basics.sample_explict_models(TransitionModel T, ObservationModel O, RewardModel R, State state, Action action, float discount_factor=1.0)

pomdp_py.framework.basics.sample_generative_model(Agent agent, State state, Action action, float discount_factor=1.0)

\((s', o, r) \sim G(s, a)\)

If the agent has transition/observation models, a black box will be created based on these models (i.e. \(s'\) and \(o\) will be sampled according to these models).

Parameters:
  • agent (Agent) – agent that supplies all the models

  • state (State) –

  • action (Action) –

  • discount_factor (float) – Defaults to 1.0; Only used when action is an Option.

Returns:

\((s', o, r, n_steps)\)

Return type:

tuple

pomdp_py.framework.oopomdp module

This module describes components of the OO-POMDP interface in pomdp_py.

An OO-POMDP is a specific type of POMDP where the state and observation spaces are factored by objects. As a result, the transition, observation, and belief distributions are all factored by objects. A main benefit of an OO-POMDP is that the object factoring reduces the scaling of the belief space from exponential to linear as the number of objects increases. See [1].

class pomdp_py.framework.oopomdp.DictState

Bases: ObjectState

This is synonymous with ObjectState, but does not convey the ‘objectness’ of the information being described.

class pomdp_py.framework.oopomdp.OOBelief

Bases: GenerativeDistribution

Belief factored by objects.

__getitem__(self, state)

Returns belief probability of given state

__setitem__(self, oostate, value)

Sets the probability of a given oostate to value. Note: this is not always feasible.

b(objid)

Convenient alias for object_belief(objid).

mpe(self, return_oostate=False, **kwargs)

Returns most likely state.

object_belief(self, objid)

Returns the belief (GenerativeDistribution) for the given object.

object_beliefs

random(self, return_oostate=False, **kwargs)

Returns a random state

set_object_belief(self, objid, belief)

Sets the belief of object to be the given belief (GenerativeDistribution)
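
For illustration, the sketch below builds an OOBelief over two objects, using pomdp_py.Histogram (assumed available at the package top level) for the per-object beliefs; the object ids and attribute names are made up.

   import pomdp_py
   from pomdp_py.framework.oopomdp import OOBelief, ObjectState

   robot_belief = pomdp_py.Histogram({ObjectState("robot", {"pose": (0, 0)}): 1.0})
   target_belief = pomdp_py.Histogram({
       ObjectState("target", {"pose": (3, 4)}): 0.6,
       ObjectState("target", {"pose": (5, 1)}): 0.4,
   })

   belief = OOBelief({0: robot_belief, 1: target_belief})

   print(belief.object_belief(1).mpe())    # most likely state of object 1
   print(belief.b(1).mpe())                # same, via the b(objid) alias
   print(belief.mpe(return_oostate=True))  # most likely joint state, as an OOState

   # replacing one object's belief leaves the others untouched
   belief.set_object_belief(0, pomdp_py.Histogram({ObjectState("robot", {"pose": (1, 0)}): 1.0}))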

class pomdp_py.framework.oopomdp.OOObservation

Bases: Observation

factor(self, next_state, action, **kwargs)

Factors the observation by objects. That is, \(z\mapsto z_1,\cdots,z_n\)

Parameters:
  • next_state (OOState) – given state

  • action (Action) – given action

Returns:

map from object id to a pomdp_py.Observation.

Return type:

dict

classmethod merge(cls, object_observations, next_state, action, **kwargs)

Merges the factored object_observations into a single OOObservation.

Parameters:
  • object_observations (dict) – map from object id to a pomdp_py.Observation.

  • next_state (OOState) – given state

  • action (Action) – given action

Returns:

the merged observation.

Return type:

OOObservation

class pomdp_py.framework.oopomdp.OOObservationModel

Bases: ObservationModel

\(O(z | s', a) = \prod_i O(z_i' | s', a)\)

__init__(self, observation_models)

Parameters:

observation_models (dict) – a map from object id to the corresponding ObservationModel

__getitem__(self, objid)

Returns observation model for given object

argmax(self, next_state, action, **kwargs)

Returns most likely observation

observation_models

probability(self, observation, next_state, action, **kwargs)

Returns \(O(z | s', a)\).

sample(self, next_state, action, argmax=False, **kwargs)

Returns random observation

class pomdp_py.framework.oopomdp.OOPOMDP

Bases: POMDP

An OO-POMDP is again defined by an agent and an environment.

__init__(self, agent, env, name="OOPOMDP")

class pomdp_py.framework.oopomdp.OOState

Bases: State

State that can be factored by objects, that is, to ObjectState(s).

Note: to change the state of an object, use set_object_state. Do not assign the object state directly, e.g. oostate.object_states[objid] = object_state, because that would make the oostate’s hash code incorrect after the change.

__init__(self, object_states)

__getitem__(key, /)

Return self[key].

copy(self)

Copies the state.

get_object_attribute(self, objid, attr)

Returns the attributes of requested object

get_object_class(self, objid)

Returns the class of requested object

get_object_state(self, objid)

Returns the ObjectState for given object.

s(objid)

Convenient alias for get_object_state(objid).

set_object_state(self, objid, object_state)

Sets the state of the given object to be the given object state (ObjectState)

situation

This is a frozenset that can be used to identify the situation of this state, since it supports hashing.
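
A small sketch of constructing and manipulating an OOState (the object classes and attributes are hypothetical):

   from pomdp_py.framework.oopomdp import OOState, ObjectState

   robot = ObjectState("robot", {"pose": (0, 0)})
   box = ObjectState("box", {"pose": (2, 3), "color": "red"})
   state = OOState({0: robot, 1: box})

   print(state.get_object_class(1))              # "box"
   print(state.get_object_attribute(1, "pose"))  # (2, 3)

   # update an object's state through set_object_state (not by direct assignment),
   # so the OOState's hash stays consistent
   state.set_object_state(1, ObjectState("box", {"pose": (2, 4), "color": "red"}))
   print(state.s(1)["pose"])                     # (2, 4), via the s(objid) alias and ObjectState.__getitem__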

class pomdp_py.framework.oopomdp.OOTransitionModel

Bases: TransitionModel

\(T(s' | s, a) = \prod_i T(s_i' | s, a)\)

__init__(self, transition_models)

Parameters:

transition_models (dict) – a map from object id to the corresponding TransitionModel

__getitem__(self, objid)

Returns transition model for given object

argmax(self, state, action, **kwargs)

Returns the most likely next state

probability(self, next_state, state, action, **kwargs)

Returns \(T(s'|s,a)\).

sample(self, state, action, argmax=False, **kwargs)

Returns random next_state

transition_models

class pomdp_py.framework.oopomdp.ObjectState

Bases: State

This is the result of OOState factoring; a state in an OO-POMDP is made up of ObjectState(s), each with an object class (str) and a set of attributes (dict).

__getitem__(self, attr)

Returns the attribute

__setitem__(self, attr, value)

Sets the attribute attr to the given value.

copy(self)

Copies this ObjectState.

pomdp_py.framework.planner module

class pomdp_py.framework.planner.Planner

Bases: object

A Planner can plan() the next action to take for a given agent (online planning). The planner can be updated (which may also update the agent belief) once an action is executed and observation received.

A Planner may be a pure planning algorithm, or it could use a learned model for planning under the hood. Its job is to output an action to take for a given agent.

You can implement a Planner that is specific to an agent, or not. If specific, then when calling plan() the agent passed in is expected to always be the same one.

plan(self, Agent agent)

The agent carries the information necessary for planning: \(B_t\), \(h_t\), \(O\), \(T\), \(R\) (or \(G\)), \(\pi\).

update(self, Agent agent, Action real_action, Observation real_observation)

Updates the planner based on the real action and observation. Updates the agent accordingly if necessary. If the agent’s belief is also updated here, the updates_agent_belief attribute should be set to True. By default, does nothing.

updates_agent_belief()

True if the planner’s update function also updates the agent’s belief.
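
A minimal Planner subclass sketch, assuming only the plan/update interface above and an agent whose policy model implements get_all_actions (so that agent.all_actions is available):

   import random
   from pomdp_py.framework.planner import Planner

   class RandomPlanner(Planner):
       """Trivial planner: plans by picking a uniformly random action."""
       def plan(self, agent):
           return random.choice(list(agent.all_actions))
       def update(self, agent, real_action, real_observation):
           # nothing to learn from the executed action/observation for random planning
           pass
       @property
       def updates_agent_belief(self):
           # this planner never touches the agent's belief
           return False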