ramo.learner package

Submodules

ramo.learner.execute_learner module

ramo.learner.execute_learner.execute_learner(monfg, u_tpl, experiment='indep_ac', runs=100, episodes=5000, rollouts=100, alpha_q=0.01, alpha_theta=0.01, alpha_q_decay=1, alpha_theta_decay=1, epsilon=1, epsilon_decay=0.995, min_epsilon=0.1, seed=None)

Execute a learning experiment.

Parameters:
  • monfg (MONFG) – An MONFG object.

  • u_tpl (Tuple[callable]) – A tuple of utility functions.

  • experiment (str, optional) – The type of experiment to execute. (Default value = ‘indep_ac’)

  • runs (int, optional) – The number of times to repeat the experiment. (Default value = 100)

  • episodes (int, optional) – The number of episodes in one run of the experiment. (Default value = 5000)

  • rollouts (int, optional) – The number of Monte-Carlo simulations at each episode. (Default value = 100)

  • alpha_q (float, optional) – The learning rate for Q-values. (Default value = 0.01)

  • alpha_theta (float, optional) – The learning rate for policy parameters. (Default value = 0.01)

  • alpha_q_decay (float, optional) – The decay of the learning rate for Q-values. (Default value = 1)

  • alpha_theta_decay (float, optional) – The decay for the learning rate of policy parameters. (Default value = 1)

  • epsilon (float, optional) – The exploration rate for a Q-learner agent. (Default value = 1)

  • epsilon_decay (float, optional) – The decay for the exploration rate. (Default value = 0.995)

  • min_epsilon (float, optional) – The minimum value for the exploration rate. (Default value = 0.1)

  • seed (int, optional) – The seed for random number generation. (Default value = None)

Returns:

A log of payoffs, a log of action probabilities for both agents, a log of the state distribution and a log of the commitment probabilities.

Return type:

Tuple[Dict, Dict, ndarray, Dict]

Raises:

Exception – When the number of players does not equal two.
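
To get a feel for the exploration defaults above, the following sketch traces the exploration rate over the episodes of one run. It assumes that epsilon is multiplied by epsilon_decay once per episode and never drops below min_epsilon; the documentation does not spell out the exact update rule, and the helper name epsilon_schedule is hypothetical.

```python
def epsilon_schedule(epsilon=1.0, epsilon_decay=0.995, min_epsilon=0.1, episodes=5000):
    """Return the exploration rate used in each episode.

    Assumption: multiplicative decay applied once per episode, with a floor.
    """
    schedule = []
    for _ in range(episodes):
        schedule.append(epsilon)
        epsilon = max(min_epsilon, epsilon * epsilon_decay)
    return schedule


schedule = epsilon_schedule()
# Index of the first episode that runs at the minimum exploration rate.
first_floor = next(i for i, eps in enumerate(schedule) if eps == 0.1)
print(first_floor)
```

Under these assumptions the floor is reached well before the default 5000 episodes are over, so most of a run would be spent at min_epsilon.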

ramo.learner.execute_learner.select_actions(agents)

Select an action from each agent’s policy.

Parameters:

agents (List[Agent]) – A list of agents.

Returns:

A list of selected actions.

Return type:

List[int]
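
As an illustration, selecting actions can be sketched as one call to select_action() per agent. The FixedPolicyAgent class below is a hypothetical stand-in that only mimics the select_action() method documented for the agents in this package; it is not one of the library's classes.

```python
import numpy as np


class FixedPolicyAgent:
    """Hypothetical stand-in for an agent; only select_action() is assumed."""

    def __init__(self, policy, rng):
        self.policy = np.asarray(policy, dtype=float)
        self.rng = rng

    def select_action(self):
        # Sample an action index according to the agent's policy.
        return int(self.rng.choice(len(self.policy), p=self.policy))


def select_actions(agents):
    """Collect one action per agent, in agent order."""
    return [agent.select_action() for agent in agents]


rng = np.random.default_rng(0)
agents = [FixedPolicyAgent([1.0, 0.0], rng), FixedPolicyAgent([0.0, 1.0], rng)]
actions = select_actions(agents)
```

The returned list is indexed by agent, which matches how the actions argument of update is described.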

ramo.learner.execute_learner.update(agents, actions, payoffs, experiment)

Perform an update for a list of agents.

Parameters:
  • agents (List[Agent]) – A list of agents.

  • actions (List[int]) – A list of each action that was chosen, indexed by agent.

  • payoffs (List[ndarray]) – A list of each payoff that was received, indexed by agent.

  • experiment (str) – The type of experiment that is being executed.

ramo.learner.indep_actor_critic module

class ramo.learner.indep_actor_critic.IndependentActorCriticAgent(u, num_actions, num_objectives, alpha_q=0.01, alpha_theta=0.01, alpha_q_decay=1, alpha_theta_decay=1, rng=None)

Bases: object

An independent learner using the multi-objective actor-critic algorithm for the SER criterion.

This implementation is based on the multi-objective actor-critic algorithm proposed in [1].

References

objective_function(theta, q_values)

The objective function for the agent. This is the SER criterion.

Parameters:
  • theta (ndarray) – The policy parameters.

  • q_values (ndarray) – The expected returns for the actions.

Returns:

The utility from the current policy and Q-values.

Return type:

float
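
A minimal sketch of the SER (scalarised expected returns) criterion is given below. It assumes a softmax policy over theta and a Q-table with one row per action and one column per objective; the function names and the example utility function are illustrative and not taken from the library.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax."""
    shifted = np.exp(theta - np.max(theta))
    return shifted / np.sum(shifted)


def ser_objective(theta, q_values, u):
    """Utility of the expected return vector (SER).

    Assumptions: theta has shape (num_actions,), q_values has shape
    (num_actions, num_objectives), and the policy is a softmax over theta.
    """
    policy = softmax(theta)
    expected_returns = policy @ q_values  # shape (num_objectives,)
    return float(u(expected_returns))


theta = np.zeros(2)
q_values = np.array([[1.0, 0.0], [0.0, 1.0]])
utility = ser_objective(theta, q_values, lambda v: v[0] * v[1])
```

Under SER the utility function is applied after taking the expectation. In this example the uniform policy is scored on the expected vector [0.5, 0.5], which gives 0.25, whereas each action on its own has a utility of 0.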

select_action()

Select an action according to the agent’s policy.

Returns:

The selected action.

Return type:

int

update(action, reward)

Perform an update for the agent.

Parameters:
  • action (int) – The action that was taken by the agent.

  • reward (float) – The reward that was obtained by the agent.

update_parameters()

Update the hyperparameters. This decays the learning rates for the Q-values and the policy parameters.

update_q_table(action, reward)

Update the Q-table based on the chosen action and the obtained reward.

Parameters:
  • action (int) – The action chosen by this agent.

  • reward (float) – The reward obtained by this agent.
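
A Q-table update of this kind can be sketched as an incremental move of the chosen action's estimate towards the observed reward. The table shape (num_actions, num_objectives), the vector-valued reward and the exact update rule are assumptions made for illustration; a scalar reward also works here through broadcasting.

```python
import numpy as np


def update_q_table(q_table, action, reward, alpha_q=0.01):
    """Move the estimate for the chosen action towards the observed reward.

    Assumption: a stateless table of shape (num_actions, num_objectives)
    and a standard incremental update with learning rate alpha_q.
    """
    q_table = q_table.copy()
    target = np.asarray(reward, dtype=float)
    q_table[action] += alpha_q * (target - q_table[action])
    return q_table


q_table = np.zeros((2, 2))
q_table = update_q_table(q_table, action=0, reward=[1.0, 2.0], alpha_q=0.1)
```

Only the row of the chosen action changes; the estimates for all other actions are left untouched.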

ramo.learner.indep_q module

class ramo.learner.indep_q.IndependentQAgent(u, num_actions, num_objectives, alpha_q=0.01, alpha_q_decay=1, epsilon=0.01, epsilon_decay=1, min_epsilon=0.01, rng=None)

Bases: object

An independent learner using Q-learning for the SER criterion.

This implementation is based on the multi-objective Q-learning algorithm proposed in [1].

References

select_action()

Select an action according to the agent’s policy.

Returns:

The selected action.

Return type:

int
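
One plausible way to select an action with a Q-learner under SER is an epsilon-greedy rule over the utilities of the per-action estimates, sketched below. Whether the library applies the utility function in exactly this way is an assumption; the function signature and the example values are hypothetical.

```python
import numpy as np


def select_action(q_table, u, epsilon, rng):
    """Epsilon-greedy choice over the utilities of the per-action estimates."""
    num_actions = q_table.shape[0]
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return int(rng.integers(num_actions))
    # Exploit: pick the action whose estimated return vector has the highest utility.
    utilities = np.array([u(q) for q in q_table])
    return int(np.argmax(utilities))


rng = np.random.default_rng(42)
q_table = np.array([[1.0, 1.0], [3.0, 0.0]])
greedy = select_action(q_table, lambda v: v[0] * v[1], epsilon=0.0, rng=rng)
```

With epsilon set to 0 the choice is purely greedy, and the first action wins here because the product utility rewards balanced returns over a large return in a single objective.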

update(action, reward)

Perform an update for the agent.

Parameters:
  • action (int) – The action that was taken by the agent.

  • reward (float) – The reward that was obtained by the agent.

update_parameters()

Update the hyperparameters. This decays the learning rate for the Q-values and the exploration rate.

update_q_table(action, reward)

Update the Q-table based on the chosen action and the obtained reward.

Parameters:
  • action (int) – The action chosen by this agent.

  • reward (float) – The reward obtained by this agent.

ramo.learner.ja_actor_critic module

class ramo.learner.ja_actor_critic.JointActionActorCriticAgent(id, u, num_actions, num_objectives, player_actions, alpha_q=0.01, alpha_theta=0.01, alpha_q_decay=1, alpha_theta_decay=1, rng=None)

Bases: object

A joint-action learner using the multi-objective actor-critic algorithm for the SER criterion.

objective_function(theta, q_values)

The objective function for the agent. This is the SER criterion.

Parameters:
  • theta (ndarray) – The policy parameters.

  • q_values (ndarray) – The expected returns for the actions.

Returns:

The utility from the current policy and Q-values.

Return type:

float

select_action()

Select an action according to the agent’s policy.

Returns:

The selected action.

Return type:

int

update(actions, reward)

Perform an update for the agent.

Parameters:
  • actions (List[int]) – The actions taken by all players.

  • reward (float) – The reward that was obtained by the agent.

update_counts(actions)

Update the joint-action counts.

Parameters:

actions (List[int]) – The actions chosen by the agents.

update_parameters()

Update the hyperparameters.

update_policies()

Update the joint policies from the empirical action distribution.
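
The link between the joint-action counts and the empirical action distribution can be sketched as follows. The array layout (one axis per player), the uniform fallback before any observation and the function name empirical_joint_policy are assumptions made for illustration.

```python
import numpy as np


def update_counts(counts, actions):
    """Increment the counter of the observed joint action."""
    counts = counts.copy()
    counts[tuple(actions)] += 1
    return counts


def empirical_joint_policy(counts):
    """Normalise joint-action counts into a probability distribution."""
    total = counts.sum()
    if total == 0:
        # No observations yet: fall back to a uniform distribution.
        return np.full(counts.shape, 1.0 / counts.size)
    return counts / total


counts = np.zeros((2, 2))
for joint_action in [(0, 0), (0, 1), (0, 1), (1, 1)]:
    counts = update_counts(counts, joint_action)
joint_policy = empirical_joint_policy(counts)
```

Each entry of joint_policy is the observed frequency of one joint action, so the entries sum to one.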

update_q_table(actions, reward)

Update the joint-action Q-table.

Parameters:
  • actions (List[int]) – The actions chosen by the agents.

  • reward (float) – The reward obtained by this agent.

ramo.learner.ja_q module

class ramo.learner.ja_q.JointActionQAgent(id, u, num_actions, num_objectives, player_actions, alpha_q=0.01, alpha_q_decay=1, epsilon=0.01, epsilon_decay=1, min_epsilon=0.01, rng=None)

Bases: object

A joint-action learner using Q-learning for the SER criterion.

select_action()

Select an action according to the agent’s policy.

Returns:

The selected action.

Return type:

int

update(actions, reward)

Perform an update for the agent.

Parameters:
  • actions (List[int]) – The actions that were taken by the agents.

  • reward (float) – The reward that was obtained by the agent.

update_counts(actions)

Update the joint-action counts.

Parameters:

actions (List[int]) – The actions chosen by the agents.

update_parameters()

Update the hyperparameters. This decays the learning rate for the Q-values and the exploration rate.

update_policies()

Update the joint policies from the empirical action distribution.

update_q_table(actions, reward)

Update the joint-action Q-table.

Parameters:
  • actions (List[int]) – The actions chosen by the agents.

  • reward (float) – The reward obtained by this agent.

Module contents