ramo.commitment package

Submodules

ramo.commitment.best_response module

class ramo.commitment.best_response.BestResponseAgent(id, u, num_actions, num_objectives, alpha_q=0.01, alpha_theta=0.01, alpha_q_decay=1, alpha_theta_decay=1, epsilon=1, epsilon_decay=0.995, min_epsilon=0.1, rng=None)

Bases: object

A learner for two-player Stackelberg games. The leader uses a multi-objective actor-critic, and the follower computes a best response by optimising the scalarised expected returns (SER).

get_commitment()

Get the commitment from the leader.

Returns:: The current policy of the leader.
Return type:: ndarray

make_follower(): Make this agent the follower.

make_leader(): Make this agent the leader.

objective_function(theta, q_values)

The objective function for the leader. This is the SER criterion.

Parameters:

theta (ndarray) – The policy parameters.
q_values (ndarray) – The expected returns for the actions.

Returns:

The utility from the current policy and leader Q-values.

Return type:

float
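As a rough illustration of the SER criterion, the sketch below first computes the expected vector return under the policy and only then applies the utility function. The softmax parametrisation, the helper names (`softmax`, `ser_objective`) and the example utility `u` are assumptions made for illustration and are not taken from the ramo source.

```python
import math


def softmax(theta):
    """Turn unconstrained policy parameters into a probability vector."""
    shift = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - shift) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def ser_objective(theta, q_values, u):
    """Utility of the expected vector return (scalarised expected returns).

    theta    -- one parameter per action (assumed softmax policy)
    q_values -- q_values[a] is the expected vector return of action a
    u        -- utility function mapping a vector return to a float
    """
    policy = softmax(theta)
    num_objectives = len(q_values[0])
    expected = [
        sum(policy[a] * q_values[a][o] for a in range(len(policy)))
        for o in range(num_objectives)
    ]
    return u(expected)


# Hypothetical two-action, two-objective example with a non-linear utility.
u = lambda v: v[0] * v[1]
q_values = [[4.0, 0.0], [0.0, 4.0]]
uniform = ser_objective([0.0, 0.0], q_values, u)
near_pure = ser_objective([50.0, 0.0], q_values, u)
```

In this example the uniform policy scores higher than the near-deterministic one, which shows why the utility is applied to the expected return of the whole policy rather than to individual actions.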

select_action(commitment)

This method will select an action based on the commitment from the leader.

Parameters:: commitment (ndarray) – The commitment from the leader.
Returns:: The selected action.
Return type:: int

select_committed()

Play an action according to the committed policy.

Returns:: The selected action.
Return type:: int

select_counter_action(commitment, optimistic=False)

Calculate a best-response policy and sample an action from this policy as response to the commitment.

Parameters:

commitment (ndarray) – The commitment from the leader.
optimistic (bool, optional) – Whether the agent is optimistic or pessimistic. A pessimistic agent will minimise the leader’s utility. An optimistic agent will maximise their own utility. (Default value = False)

Returns:

The selected action.

Return type:

int
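The difference between an optimistic and a pessimistic follower can be sketched as a tie-breaking rule. The example below is a simplification under stated assumptions: it only considers pure responses (a full SER best response may be mixed), it assumes both players receive the same vector payoff, and the names `expected_return` and `pure_best_response` are hypothetical.

```python
def expected_return(commitment, payoffs, response):
    """Expected vector payoff of a pure follower response against the
    leader's mixed strategy; payoffs[a_l][a_f] is a vector."""
    num_objectives = len(payoffs[0][0])
    return [
        sum(commitment[a_l] * payoffs[a_l][response][o]
            for a_l in range(len(commitment)))
        for o in range(num_objectives)
    ]


def pure_best_response(commitment, payoffs, u, leader_u=None,
                       optimistic=False, tol=1e-9):
    """Pick a pure response that maximises the follower's utility u."""
    num_responses = len(payoffs[0])
    returns = [expected_return(commitment, payoffs, f)
               for f in range(num_responses)]
    own = [u(r) for r in returns]
    best = max(own)
    ties = [f for f in range(num_responses) if best - own[f] <= tol]
    if optimistic or leader_u is None:
        return ties[0]  # any maximiser of the follower's own utility
    # Pessimistic: among the follower's maximisers, minimise the leader's utility.
    return min(ties, key=lambda f: leader_u(returns[f]))


# Hypothetical game in which the follower is indifferent between its actions.
payoffs = [[[2.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, 2.0]]]
commitment = [0.5, 0.5]
u = lambda v: v[0] + v[1]
leader_u = lambda v: v[0]
pessimistic = pure_best_response(commitment, payoffs, u, leader_u)
optimistic = pure_best_response(commitment, payoffs, u, leader_u, optimistic=True)
```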

set_leader_utility(leader_u)

Set the leader’s utility function. This is used by a pessimistic follower.

Parameters:: leader_u (callable) – The utility function used by the leader.

Returns:

update(commitment, actions, reward)

Perform an update of the agent. Specifically updates the Q-tables, policies and hyperparameters.

Parameters:

commitment (ndarray) – The opponent’s committed policy. Currently unused; accepted so that the signature matches the other commitment agents.
actions (List[int]) – The actions selected in an episode.
reward (float) – The reward that was obtained by the agent in that episode.

Returns:

update_leader_q_table(own_action, reward)

Update the leader’s Q-table based on their own action and the obtained reward.

Parameters:

own_action (int) – The action taken by the leader.
reward (float) – The reward obtained by this agent.

Returns:

update_parameters(): Update the internal parameters of the agent.

update_payoffs_table(actions, reward)

Update the joint-action payoffs table.

Parameters:

actions (List[int]) – The actions that were taken in an episode.
reward (float) – The reward obtained by this joint action.

Returns:

ramo.commitment.comp_action module

class ramo.commitment.comp_action.CompActionAgent(id, u, num_actions, num_objectives, alpha_lq=0.01, alpha_ltheta=0.01, alpha_fq=0.01, alpha_ftheta=0.01, alpha_q_decay=1, alpha_theta_decay=1, rng=None)

Bases: object

An agent that learns a best-response policy to each pure-strategy commitment from the leader [1].

References

get_commitment()

Get the commitment from the leader.

Returns:: A pure strategy commitment of the leader.
Return type:: int

make_follower(): Make this agent the follower.

make_leader(): Make this agent the leader.

objective_function(theta, q_values)

The objective function.

Parameters:

theta (ndarray) – The policy parameters.
q_values (ndarray) – Learned Q-values used to calculate the SER from these parameters.

Returns:

The utility from the current parameters theta and Q-values.

Return type:

float

select_action(commitment)

Select an action based on the commitment of the leader.

Parameters:: commitment (int) – The pure-strategy commitment sent by the leader.
Returns:: The selected action.
Return type:: int

select_committed(leader_action)

Play the pure strategy that was committed.

Parameters:: leader_action (int) – The pure strategy (action) the leader published.
Returns:: The committed action.
Return type:: int

select_counter_action(leader_action)

Select the correct counter policy and sample an action using this policy.

Parameters:: leader_action (int) – The committed pure strategy from the leader.
Returns:: The selected action.
Return type:: int
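One way to picture "a best-response policy to each pure-strategy commitment" is a table with one set of policy parameters per leader action. The sketch below assumes softmax policies; `counter_thetas` and `sample_counter_action` are hypothetical names.

```python
import math
import random


def softmax(theta):
    """Turn unconstrained policy parameters into a probability vector."""
    shift = max(theta)
    exps = [math.exp(t - shift) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def sample_counter_action(counter_thetas, leader_action, rng):
    """Sample from the counter policy stored for the committed leader action."""
    policy = softmax(counter_thetas[leader_action])
    return rng.choices(range(len(policy)), weights=policy, k=1)[0]


# Hypothetical follower with two actions facing a leader with two actions.
# Row i holds the parameters of the policy used when the leader commits to i.
counter_thetas = [[50.0, 0.0], [0.0, 50.0]]
rng = random.Random(0)
responses = [sample_counter_action(counter_thetas, a, rng) for a in (0, 1)]
```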

update(commitment, actions, reward)

Perform an update of the agent. Specifically updates the Q-tables, policies and hyperparameters.

Parameters:

commitment (int) – The leader’s committed action.
actions (List[int]) – The actions selected in an episode.
reward (float) – The reward that was obtained by the agent in that episode.

Returns:

update_leader_q_table(action, reward)

Update the leader’s Q-table based on their own action and the obtained reward.

Parameters:

action (int) – The action taken by the leader.
reward (float) – The reward obtained by this agent.

Returns:

update_parameters(): Update the internal parameters of the agent.

update_payoffs_table(actions, reward)

Update the joint-action payoffs table.

Parameters:

actions (List[int]) – The actions that were taken in an episode.
reward (float) – The reward obtained by this joint action.

Returns:

ramo.commitment.coop_action module

class ramo.commitment.coop_action.CoopActionAgent(id, u, num_actions, num_objectives, alpha_q=0.01, alpha_theta=0.01, alpha_q_decay=1, alpha_theta_decay=1, rng=None)

Bases: object

An agent that optimises a single optimal policy given a pure-strategy commitment from the leader [1].

This is mostly intended to be used in an alternating Stackelberg setting, such that players optimise a single optimal joint policy.

References

get_commitment()

Get the commitment from the leader.

Returns:: A pure strategy commitment of the leader.
Return type:: int

make_follower(): Make this agent the follower.

make_leader(): Make this agent the leader.

objective_function(theta, q_values)

The objective function.

Parameters:

theta (ndarray) – The policy parameters.
q_values (ndarray) – Learned Q-values used to calculate the SER from these parameters.

Returns:

The utility from the current parameters theta and Q-values.

Return type:

float

pre_update_policies(): Perform a pre update of all policies depending on what commitment is received.

select_action(commitment)

Select an action based on the commitment of the leader.

Parameters:: commitment (int) – The pure-strategy commitment sent by the leader.
Returns:: The selected action.
Return type:: int

select_committed(leader_action)

Play the pure strategy that was committed.

Parameters:: leader_action (int) – The pure strategy (action) the leader committed to.
Returns:: The committed action.
Return type:: int

select_counter_action(leader_action)

Select the correct counter policy and sample an action using this policy.

Parameters:: leader_action (int) – The committed pure strategy from the leader.
Returns:: The selected action.
Return type:: int

update(commitment, actions, reward)

Perform an update of the agent. Specifically updates the Q-tables, policies and hyperparameters.

Parameters:

commitment (int) – The opponent’s committed action.
actions (List[int]) – The actions selected in an episode.
reward (float) – The reward that was obtained by the agent in that episode.

Returns:

update_parameters(): Update the internal parameters of the agent.

update_q_table(actions, reward)

Update the joint-action Q-table.

Parameters:

actions (List[int]) – The actions taken by the agents.
reward (float) – The reward obtained by this agent.

Returns:

ramo.commitment.coop_policy module

class ramo.commitment.coop_policy.CoopPolicyAgent(id, u, num_actions, num_objectives, alpha_q=0.01, alpha_theta=0.01, alpha_q_decay=1, alpha_theta_decay=1, rng=None)

Bases: object

An agent that optimises a single optimal policy given a mixed-strategy commitment from the leader [1].

References

get_commitment()

Get the commitment from the leader.

Returns:: The current strategy of the leader.
Return type:: ndarray

make_follower(): Make this agent the follower.

make_leader(): Make this agent the leader.

objective_function(theta, q_values)

The objective function.

Parameters:

theta (ndarray) – The policy parameters.
q_values (ndarray) – Learned Q-values used to calculate the SER from these parameters.

Returns:

The utility from the current parameters theta and Q-values.

Return type:

float

select_action(commitment)

Select an action based on the commitment of the leader.

Parameters:: commitment (ndarray) – The mixed-strategy commitment sent by the leader.
Returns:: The selected action.
Return type:: int

select_committed(leader_strategy)

Sample an action from the committed strategy.

Parameters:: leader_strategy (ndarray) – The mixed strategy the leader committed to.
Returns:: The committed action.
Return type:: int
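Sampling an action from a committed mixed strategy amounts to a weighted draw over action indices. The sketch below uses only the standard library; the name `sample_committed` and the example strategy are illustrative assumptions.

```python
import random


def sample_committed(leader_strategy, rng):
    """Draw one action index with probability given by the mixed strategy."""
    actions = range(len(leader_strategy))
    return rng.choices(actions, weights=leader_strategy, k=1)[0]


rng = random.Random(42)
strategy = [0.2, 0.5, 0.3]
samples = [sample_committed(strategy, rng) for _ in range(10_000)]
# Empirical frequencies should be close to the committed probabilities.
frequencies = [samples.count(a) / len(samples) for a in range(len(strategy))]
```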

select_counter_action(leader_strategy)

Perform an update to learn a counter policy and sample an action using this policy.

Parameters:: leader_strategy (ndarray) – The committed mixed strategy from the leader.
Returns:: The selected action.
Return type:: int

update(commitment, actions, reward)

Perform an update of the agent. Specifically updates the Q-tables, policies and hyperparameters.

Parameters:

commitment (ndarray) – The opponent’s committed strategy.
actions (List[int]) – The actions selected in an episode.
reward (float) – The reward that was obtained by the agent in that episode.

Returns:

update_parameters(): Update the internal parameters of the agent.

update_q_table(actions, reward)

Update the joint-action Q-table.

Parameters:

actions (List[int]) – The actions taken by the agents.
reward (float) – The reward obtained by this agent.

Returns:

ramo.commitment.execute_commitment module

ramo.commitment.execute_commitment.execute_commitment(monfg, u_tpl, experiment='coop_action', runs=100, episodes=5000, rollouts=100, alternate=False, alpha_lq=0.01, alpha_ltheta=0.01, alpha_fq=0.01, alpha_ftheta=0.01, alpha_cq=0.01, alpha_ctheta=0.01, alpha_q_decay=1, alpha_theta_decay=1, alpha_com_decay=1, seed=None)

Execute a commitment experiment.

Parameters:

monfg (MONFG) – An MONFG object.
u_tpl (Tuple[callable]) – A tuple of utility functions.
experiment (str, optional) – The type of commitment experiment to execute. (Default value = ‘coop_action’)
runs (int, optional) – The number of times to repeat the experiment. (Default value = 100)
episodes (int, optional) – The number of episodes in one run of the experiment. (Default value = 5000)
rollouts (int, optional) – The number of Monte-Carlo simulations at each episode. (Default value = 100)
alternate (bool, optional) – Whether to alternate the players between leader and follower. (Default value = False)
alpha_lq (float, optional) – The learning rate for leader Q-values. (Default value = 0.01)
alpha_ltheta (float, optional) – The learning rate for leader policy parameters. (Default value = 0.01)
alpha_fq (float, optional) – The learning rate for follower Q-values. (Default value = 0.01)
alpha_ftheta (float, optional) – The learning rate for follower policy parameters. (Default value = 0.01)
alpha_cq (float, optional) – The learning rate for optional commitment Q-values. (Default value = 0.01)
alpha_ctheta (float, optional) – The learning rate for optional commitment policy parameters. (Default value = 0.01)
alpha_q_decay (float, optional) – The decay for the Q-values learning rate. (Default value = 1)
alpha_theta_decay (float, optional) – The decay for the policy parameters learning rate. (Default value = 1)
alpha_com_decay (float, optional) – The decay for the commitment strategy learning rate when using optional commitment. (Default value = 1)
seed (int, optional) – The seed for random number generation. (Default value = None)

Returns:

A list of trained agents.

Return type:

List[Agent]

Raises:

Exception – When the number of players does not equal two.

ramo.commitment.execute_commitment.get_leader(agents, episode, alternate=False)

Select the leader in the current episode.

Parameters:

agents (List[Agent]) – A list of agents.
episode (int) – The current episode.
alternate (bool, optional) – Whether to alternate the leader between players or not. (Default value = False)

Returns:

The id of the leader and the leader agent itself.

Return type:

Tuple[int, Agent]
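The leader selection could be sketched as follows. This is an assumption about the behaviour, not the library's code: without alternation player 0 is taken to lead every episode, and with alternation the role rotates each episode. The name `pick_leader` is hypothetical.

```python
def pick_leader(agents, episode, alternate=False):
    """Return (leader_id, leader_agent) for this episode.

    Assumption: player 0 always leads unless alternate is set, in which
    case the leader role rotates through the players every episode.
    """
    leader_id = episode % len(agents) if alternate else 0
    return leader_id, agents[leader_id]


agents = ["player_0", "player_1"]  # stand-ins for real agent objects
fixed = [pick_leader(agents, ep)[0] for ep in range(4)]
alternating = [pick_leader(agents, ep, alternate=True)[0] for ep in range(4)]
```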

ramo.commitment.execute_commitment.select_actions(agents, commitment)

Select an action from each agent’s policy.

Parameters:

agents (List[Agent]) – A list of agents.
commitment (None | int | ndarray) – The commitment from the leader.

Returns:

A list of selected actions.

Return type:

List[int]
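Because every agent class documented here exposes `select_action(commitment)`, collecting the joint action can be sketched as a single pass over the agents. `StubAgent` and `collect_actions` below are hypothetical stand-ins used only to make the example runnable.

```python
class StubAgent:
    """Hypothetical agent exposing only select_action(commitment)."""

    def __init__(self, leader, counter=None):
        self.leader = leader
        self.counter = counter  # maps a leader action to a follower action

    def select_action(self, commitment):
        if self.leader:
            return commitment  # play the committed pure strategy
        return self.counter[commitment]  # respond to the commitment


def collect_actions(agents, commitment):
    """Ask each agent for an action given the leader's commitment."""
    return [agent.select_action(commitment) for agent in agents]


agents = [StubAgent(leader=True), StubAgent(leader=False, counter={0: 1, 1: 0})]
actions = collect_actions(agents, commitment=1)
```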

ramo.commitment.execute_commitment.update(agents, commitment, actions, payoffs)

Perform an update for a list of agents.

Parameters:

agents (List[Agent]) – A list of agents.
commitment (int | ndarray) – The commitment from the leader.
actions (List[int]) – A list of each action that was chosen, indexed by agent.
payoffs (List[ndarray]) – A list of each payoff that was received, indexed by agent.

Returns:

ramo.commitment.non_stationary module

class ramo.commitment.non_stationary.NonStationaryAgent(id, u, num_actions, num_objectives, alpha_q=0.01, alpha_theta=0.01, alpha_q_decay=1, alpha_theta_decay=1, buffer_size=20, rng=None)

Bases: object

An agent that learns a non-stationary policy to each pure-strategy commitment from the leader.

get_commitment()

Get the commitment from the leader.

Returns:: A pure strategy commitment of the leader.
Return type:: int

make_follower(): Make this agent the follower.

make_leader(): Make this agent the leader.

objective_function_follower(thetas, q_values, leader_policy)

The objective function for the follower.

Parameters:

thetas (ndarray) – A matrix of thetas.
q_values (ndarray) – Learned Q-values for the joint-actions.
leader_policy (ndarray) – The committed non-stationary strategy from the leader.

Returns:

The utility from these parameters.

Return type:

float
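A possible reading of this objective, assuming the follower keeps one softmax policy per leader action (one row of `thetas` each) and that the utility is applied to the expected vector return. The function name `follower_objective` and the utility `u` are hypothetical.

```python
import math


def softmax(theta):
    """Turn unconstrained policy parameters into a probability vector."""
    shift = max(theta)
    exps = [math.exp(t - shift) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def follower_objective(thetas, q_values, leader_policy, u):
    """SER of a follower that conditions its policy on the leader's action.

    thetas[a_l]        -- follower parameters used after leader action a_l
    q_values[a_l][a_f] -- vector return of the joint action (a_l, a_f)
    leader_policy[a_l] -- probability that the leader commits to a_l
    """
    num_objectives = len(q_values[0][0])
    expected = [0.0] * num_objectives
    for a_l, p_leader in enumerate(leader_policy):
        response = softmax(thetas[a_l])
        for a_f, p_follower in enumerate(response):
            for o in range(num_objectives):
                expected[o] += p_leader * p_follower * q_values[a_l][a_f][o]
    return u(expected)


# Hypothetical 2x2 game with two objectives and a product utility.
q_values = [[[4.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 4.0]]]
value = follower_objective([[0.0, 0.0], [0.0, 0.0]], q_values, [0.5, 0.5],
                           lambda v: v[0] * v[1])
```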

objective_function_leader(theta, q_values)

The objective function for the leader.

Parameters:

theta (ndarray) – The parameters for the commitment policy.
q_values (ndarray) – The Q-values for committing to actions.

Returns:

The utility from the commitment strategy.

Return type:

float

select_action(commitment)

Select an action based on the commitment of the leader.

Parameters:: commitment (int) – The pure-strategy commitment sent by the leader.
Returns:: The selected action.
Return type:: int

select_committed(leader_action)

Play the pure strategy that was committed.

Parameters:: leader_action (int) – The pure strategy (action) the leader published.
Returns:: The committed action.
Return type:: int

select_counter_action(leader_action)

Select the correct counter policy and sample an action using this policy.

Parameters:: leader_action (int) – The committed pure strategy from the leader.
Returns:: The selected action.
Return type:: int

set_opponent_actions(num_actions)

Set the number of actions that the opponent can play.

Parameters:: num_actions (int) – The number of actions for the opponent.

Returns:

update(commitment, actions, reward)

Perform an update of the agent. Specifically updates the Q-tables, policies and hyperparameters.

Parameters:

commitment (int) – The leader’s non-stationary commitment strategy.
actions (List[int]) – The actions selected in an episode.
reward (float) – The reward that was obtained by the agent in that episode.

Returns:

update_leader_q_table(action, reward)

Update the leader’s Q-table based on their own action and the obtained reward.

Parameters:

action (int) – The action taken by the leader.
reward (float) – The reward obtained by this agent.

Returns:

update_parameters(): Update the internal parameters of the agent.

update_payoffs_table(actions, reward)

Update the joint-action payoffs table.

Parameters:

actions (List[int]) – The actions that were taken in an episode.
reward (float) – The reward obtained by this joint action.

Returns:

ramo.commitment.optional_com module

class ramo.commitment.optional_com.OptionalComAgent(no_com_agent, com_agent, id, u, num_actions, num_objectives, alpha_q=0.01, alpha_theta=0.01, alpha_q_decay=1, alpha_theta_decay=1, rng=None)

Bases: object

An agent that learns when to commit through a two-layer system [1].

This is implemented through learning two additional agents, one used when committing and another used when not committing.

References

get_commitment()

Get the commitment from the leader.

Returns:: A commitment from the leader.
Return type:: int | None

make_follower(): Make this agent the follower.

make_leader(): Make this agent the leader.

objective_function(theta, q_values)

The objective function.

Parameters:

theta (ndarray) – The policy parameters.
q_values (ndarray) – Learned Q-values used to calculate the SER from these parameters.

Returns:

The utility from the current parameters theta and Q-values.

Return type:

float

select_action(commitment)

Select an action based on the commitment of the leader. Pass the commitment to the correct layer.

Parameters:: commitment (int | ndarray) – The commitment from the leader.
Returns:: The selected action.
Return type:: int
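Passing the commitment to the correct layer can be pictured as a simple dispatch between the two wrapped agents. The sketch assumes that a commitment of `None` means the leader chose not to commit, which matches the `int | None` return type of `get_commitment()`. `Layer` and `dispatch_action` are hypothetical names.

```python
class Layer:
    """Hypothetical lower-layer agent that always plays one fixed action."""

    def __init__(self, action):
        self.action = action

    def select_action(self, commitment):
        return self.action


def dispatch_action(no_com_agent, com_agent, commitment):
    """Forward the commitment to the layer that handles it."""
    layer = no_com_agent if commitment is None else com_agent
    return layer.select_action(commitment)


no_com_agent, com_agent = Layer(action=0), Layer(action=1)
without_commitment = dispatch_action(no_com_agent, com_agent, None)
with_commitment = dispatch_action(no_com_agent, com_agent, 1)
```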

update(commitment, actions, reward)

Perform an update of the agent and cascade an update to the lower layer agent.

Parameters:

commitment (int | ndarray | None) – The commitment from the leader.
actions (List[int]) – The actions selected in an episode.
reward (float) – The reward that was obtained by the agent in that episode.

Returns:

update_parameters(): Update the internal parameters of the agent.

update_q_table(commitment, reward)

Update the vector-valued Q-table.

Parameters:

commitment (int | ndarray | None) – The commitment from the leader.
reward (float) – The reward obtained by this agent.

Returns:

Module contents