ramo.commitment package
Submodules
ramo.commitment.best_response module
- class ramo.commitment.best_response.BestResponseAgent(id, u, num_actions, num_objectives, alpha_q=0.01, alpha_theta=0.01, alpha_q_decay=1, alpha_theta_decay=1, epsilon=1, epsilon_decay=0.995, min_epsilon=0.1, rng=None)
Bases: object
A learner used in two-player Stackelberg games. The leader uses multi-objective actor-critic and the follower computes a best response by optimising for the scalarised expected returns (SER).
- get_commitment()
Get the commitment from the leader.
- Returns:
The current policy of the leader.
- Return type:
ndarray
- make_follower()
Make this agent the follower.
- make_leader()
Make this agent the leader.
- objective_function(theta, q_values)
The objective function for the leader. This is the scalarised expected returns (SER) criterion.
- Parameters:
theta (ndarray) – The policy parameters.
q_values (ndarray) – The expected returns for the actions.
- Returns:
The utility from the current policy and leader Q-values.
- Return type:
float
- select_action(commitment)
This method will select an action based on the commitment from the leader.
- Parameters:
commitment (ndarray) – The commitment from the leader.
- Returns:
The selected action.
- Return type:
int
- select_committed()
Play an action according to the committed policy.
- Returns:
The selected action.
- Return type:
int
- select_counter_action(commitment, optimistic=False)
Calculate a best-response policy and sample an action from this policy in response to the commitment.
- Parameters:
commitment (ndarray) – The commitment from the leader.
optimistic (bool, optional) – Whether the agent is optimistic or pessimistic. A pessimistic agent will minimise the leader’s utility. An optimistic agent will maximise their own utility. (Default value = False)
- Returns:
The selected action.
- Return type:
int
- set_leader_utility(leader_u)
Set the leader’s utility function. This is used by a pessimistic follower.
- Parameters:
leader_u (callable) – The utility function used by the leader.
- update(commitment, actions, reward)
Perform an update of the agent. Specifically updates the Q-tables, policies and hyperparameters.
- Parameters:
commitment (ndarray) – The opponent’s committed policy. Currently unused; provided for compatibility with the other commitment agents.
actions (List[int]) – The actions selected in an episode.
reward (float) – The reward that was obtained by the agent in that episode.
- update_leader_q_table(own_action, reward)
Update the leader’s Q-table based on their own action and the obtained reward.
- Parameters:
own_action (int) – The action taken by the leader.
reward (float) – The reward obtained by this agent.
- update_parameters()
Update the internal parameters of the agent.
- update_payoffs_table(actions, reward)
Update the joint-action payoffs table.
- Parameters:
actions (List[int]) – The actions that were taken in an episode.
reward (float) – The reward obtained by this joint action.
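The SER criterion used by objective_function can be illustrated with a minimal sketch. It assumes a softmax policy over theta and a Q-table with one row of expected returns per action; the names softmax and ser_objective, and the product utility in the example, are hypothetical and not taken from the package.

```python
import numpy as np


def softmax(theta):
    """Map policy parameters to a probability distribution over actions."""
    shifted = theta - np.max(theta)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)


def ser_objective(theta, q_values, u):
    """Utility of the expected vectorial return under the policy softmax(theta).

    theta:    shape (num_actions,)
    q_values: shape (num_actions, num_objectives)
    u:        utility function mapping a return vector to a scalar
    """
    policy = softmax(theta)
    expected_returns = policy @ q_values  # shape (num_objectives,)
    return u(expected_returns)


# Hypothetical example: two actions, two objectives, product utility.
theta = np.zeros(2)  # equal parameters give a uniform policy
q_values = np.array([[4.0, 0.0],
                     [0.0, 4.0]])
ser_value = ser_objective(theta, q_values, lambda v: v[0] * v[1])
```

Because the utility is applied after taking the expectation, the uniform policy in this example scores higher than either pure action, each of which yields a product of zero.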
ramo.commitment.comp_action module
- class ramo.commitment.comp_action.CompActionAgent(id, u, num_actions, num_objectives, alpha_lq=0.01, alpha_ltheta=0.01, alpha_fq=0.01, alpha_ftheta=0.01, alpha_q_decay=1, alpha_theta_decay=1, rng=None)
Bases: object
An agent that learns a best-response policy to each pure-strategy commitment from the leader [1].
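References
[1] Willem Röpke, Diederik M. Roijers, Ann Nowé, & Roxana Rădulescu. (2021). Preference Communication in Multi-Objective Normal-Form Games.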
- get_commitment()
Get the commitment from the leader.
- Returns:
A pure strategy commitment of the leader.
- Return type:
int
- make_follower()
Make this agent the follower.
- make_leader()
Make this agent the leader.
- objective_function(theta, q_values)
The objective function.
- Parameters:
theta (ndarray) – The policy parameters.
q_values (ndarray) – Learned Q-values used to calculate the SER from these parameters.
- Returns:
The utility from the current parameters theta and Q-values.
- Return type:
float
- select_action(commitment)
Select an action based on the commitment of the leader.
- Parameters:
commitment (int) – The message that was sent.
- Returns:
The selected action.
- Return type:
int
- select_committed(leader_action)
Play the pure strategy that was committed.
- Parameters:
leader_action (int) – The pure strategy (action) the leader published.
- Returns:
The committed action.
- Return type:
int
- select_counter_action(leader_action)
Select the correct counter policy and sample an action using this policy.
- Parameters:
leader_action (int) – The committed pure strategy from the leader.
- Returns:
The selected action.
- Return type:
int
- update(commitment, actions, reward)
Perform an update of the agent. Specifically updates the Q-tables, policies and hyperparameters.
- Parameters:
commitment (int) – The leader’s committed action.
actions (List[int]) – The actions selected in an episode.
reward (float) – The reward that was obtained by the agent in that episode.
- update_leader_q_table(action, reward)
Update the leader’s Q-table based on their own action and the obtained reward.
- Parameters:
action (int) – The action taken by the leader.
reward (float) – The reward obtained by this agent.
- update_parameters()
Update the internal parameters of the agent.
- update_payoffs_table(actions, reward)
Update the joint-action payoffs table.
- Parameters:
actions (List[int]) – The actions that were taken in an episode.
reward (float) – The reward obtained by this joint action.
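The idea behind select_counter_action, one counter policy per possible commitment, might look roughly like the sketch below. It assumes the follower stores one row of softmax parameters per leader action; counter_thetas and sample_counter_action are hypothetical names, not part of the documented API.

```python
import numpy as np


def softmax(theta):
    """Map policy parameters to a probability distribution over actions."""
    exp = np.exp(theta - np.max(theta))
    return exp / np.sum(exp)


def sample_counter_action(counter_thetas, leader_action, rng):
    """Pick the counter policy for the committed action and sample from it.

    counter_thetas: shape (num_leader_actions, num_follower_actions),
                    one row of policy parameters per possible commitment.
    """
    policy = softmax(counter_thetas[leader_action])
    return int(rng.choice(len(policy), p=policy))


rng = np.random.default_rng(0)
# Hypothetical parameters: answer leader action 0 with follower action 1,
# and leader action 1 with follower action 0 (near-deterministic policies).
counter_thetas = np.array([[-50.0, 50.0],
                           [50.0, -50.0]])
response_to_0 = sample_counter_action(counter_thetas, 0, rng)
response_to_1 = sample_counter_action(counter_thetas, 1, rng)
```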
ramo.commitment.coop_action module
- class ramo.commitment.coop_action.CoopActionAgent(id, u, num_actions, num_objectives, alpha_q=0.01, alpha_theta=0.01, alpha_q_decay=1, alpha_theta_decay=1, rng=None)
Bases: object
An agent that optimises a single optimal policy from a pure-strategy commitment [1].
This is mostly intended to be used in an alternating Stackelberg setting, such that players optimise a single optimal joint policy.
References
[1] Willem Röpke, Diederik M. Roijers, Ann Nowé, & Roxana Rădulescu. (2021). Preference Communication in Multi-Objective Normal-Form Games.
- get_commitment()
Get the commitment from the leader.
- Returns:
A pure strategy commitment of the leader.
- Return type:
int
- make_follower()
Make this agent the follower.
- make_leader()
Make this agent the leader.
- objective_function(theta, q_values)
The objective function.
- Parameters:
theta (ndarray) – The policy parameters.
q_values (ndarray) – Learned Q-values used to calculate the SER from these parameters.
- Returns:
The utility from the current parameters theta and Q-values.
- Return type:
float
- pre_update_policies()
Perform a pre update of all policies depending on what commitment is received.
- select_action(commitment)
Select an action based on the commitment of the leader.
- Parameters:
commitment (int) – The message that was sent.
- Returns:
The selected action.
- Return type:
int
- select_committed(leader_action)
Play the pure strategy that was committed.
- Parameters:
leader_action (int) – The pure strategy (action) the leader committed to.
- Returns:
The committed action.
- Return type:
int
- select_counter_action(leader_action)
Select the correct counter policy and sample an action using this policy.
- Parameters:
leader_action (int) – The committed pure strategy from the leader.
- Returns:
The selected action.
- Return type:
int
- update(commitment, actions, reward)
Perform an update of the agent. Specifically updates the Q-tables, policies and hyperparameters.
- Parameters:
commitment (int) – The opponent’s committed action.
actions (List[int]) – The actions selected in an episode.
reward (float) – The reward that was obtained by the agent in that episode.
- update_parameters()
Update the internal parameters of the agent.
- update_q_table(actions, reward)
Update the joint-action Q-table.
- Parameters:
actions (List[int]) – The actions taken by the agents.
reward (float) – The reward obtained by this agent.
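A joint-action Q-table update of the kind update_q_table describes could be sketched as follows. This assumes a vector-valued reward and a simple running-average rule with step size alpha_q; the function name update_joint_q and the table layout are assumptions made for illustration.

```python
import numpy as np


def update_joint_q(q_table, actions, reward, alpha_q):
    """Move the Q-vector of the played joint action towards the observed reward.

    q_table: shape (num_actions, num_actions, num_objectives)
    actions: the joint action, one entry per player
    reward:  vector-valued payoff, shape (num_objectives,)
    """
    idx = tuple(actions)
    q_table[idx] += alpha_q * (np.asarray(reward) - q_table[idx])
    return q_table


q_table = np.zeros((2, 2, 2))
update_joint_q(q_table, [0, 1], [4.0, 2.0], alpha_q=0.5)
```

Only the entry for the joint action that was actually played changes; all other entries keep their previous estimates.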
ramo.commitment.coop_policy module
- class ramo.commitment.coop_policy.CoopPolicyAgent(id, u, num_actions, num_objectives, alpha_q=0.01, alpha_theta=0.01, alpha_q_decay=1, alpha_theta_decay=1, rng=None)
Bases: object
An agent that optimises a single optimal policy from a mixed-strategy commitment [1].
References
[1] Willem Röpke, Diederik M. Roijers, Ann Nowé, & Roxana Rădulescu. (2021). Preference Communication in Multi-Objective Normal-Form Games.
- get_commitment()
Get the commitment from the leader.
- Returns:
The current strategy of the leader.
- Return type:
ndarray
- make_follower()
Make this agent the follower.
- make_leader()
Make this agent the leader.
- objective_function(theta, q_values)
The objective function.
- Parameters:
theta (ndarray) – The policy parameters.
q_values (ndarray) – Learned Q-values used to calculate the SER from these parameters.
- Returns:
The utility from the current parameters theta and Q-values.
- Return type:
float
- select_action(commitment)
Select an action based on the commitment of the leader.
- Parameters:
commitment (ndarray) – The message that was sent.
- Returns:
The selected action.
- Return type:
int
- select_committed(leader_strategy)
Sample an action from the committed strategy.
- Parameters:
leader_strategy (ndarray) – The mixed strategy the leader committed to.
- Returns:
The committed action.
- Return type:
int
- select_counter_action(leader_strategy)
Perform an update to learn a counter policy and sample an action using this policy.
- Parameters:
leader_strategy (ndarray) – The committed mixed strategy from the leader.
- Returns:
The selected action.
- Return type:
int
- update(commitment, actions, reward)
Perform an update of the agent. Specifically updates the Q-tables, policies and hyperparameters.
- Parameters:
commitment (ndarray) – The opponent’s committed strategy.
actions (List[int]) – The actions selected in an episode.
reward (float) – The reward that was obtained by the agent in that episode.
- update_parameters()
Update the internal parameters of the agent.
- update_q_table(actions, reward)
Update the joint-action Q-table.
- Parameters:
actions (List[int]) – The actions taken by the agents.
reward (float) – The reward obtained by this agent.
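With a mixed-strategy commitment, the leader samples from the strategy it announced, and the follower can weigh its own actions against that strategy. The sketch below illustrates both steps under stated assumptions: the joint Q-table is indexed leader action first, and follower_expected_q is a hypothetical helper, not a documented method.

```python
import numpy as np


def select_committed(leader_strategy, rng):
    """Sample the leader's action from the mixed strategy it committed to."""
    return int(rng.choice(len(leader_strategy), p=leader_strategy))


def follower_expected_q(joint_q, leader_strategy):
    """Marginalise a joint-action Q-table over the leader's committed strategy.

    joint_q: shape (num_leader_actions, num_follower_actions, num_objectives)
    Returns an array of shape (num_follower_actions, num_objectives).
    """
    return np.einsum("l,lfo->fo", leader_strategy, joint_q)


rng = np.random.default_rng(0)
committed_action = select_committed(np.array([0.0, 1.0]), rng)

# Hypothetical single-objective joint Q-table.
joint_q = np.array([[[4.0], [0.0]],
                    [[0.0], [2.0]]])
expected_q = follower_expected_q(joint_q, np.array([0.25, 0.75]))
```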
ramo.commitment.execute_commitment module
- ramo.commitment.execute_commitment.execute_commitment(monfg, u_tpl, experiment='coop_action', runs=100, episodes=5000, rollouts=100, alternate=False, alpha_lq=0.01, alpha_ltheta=0.01, alpha_fq=0.01, alpha_ftheta=0.01, alpha_cq=0.01, alpha_ctheta=0.01, alpha_q_decay=1, alpha_theta_decay=1, alpha_com_decay=1, seed=None)
Execute a commitment experiment.
- Parameters:
monfg (MONFG) – An MONFG object.
u_tpl (Tuple[callable]) – A tuple of utility functions.
experiment (str, optional) – The type of commitment experiment to execute. (Default value = ‘coop_action’)
runs (int, optional) – The number of times to repeat the experiment. (Default value = 100)
episodes (int, optional) – The number of episodes in one run of the experiment. (Default value = 5000)
rollouts (int, optional) – The number of Monte-Carlo simulations at each episode. (Default value = 100)
alternate (bool, optional) – Whether to alternate the players between leader and follower. (Default value = False)
alpha_lq (float, optional) – The learning rate for leader Q-values. (Default value = 0.01)
alpha_ltheta (float, optional) – The learning rate for leader policy parameters. (Default value = 0.01)
alpha_fq (float, optional) – The learning rate for follower Q-values. (Default value = 0.01)
alpha_ftheta (float, optional) – The learning rate for follower policy parameters. (Default value = 0.01)
alpha_cq (float, optional) – The learning rate for optional commitment Q-values. (Default value = 0.01)
alpha_ctheta (float, optional) – The learning rate for optional commitment policy parameters. (Default value = 0.01)
alpha_q_decay (float, optional) – The decay for the Q-values learning rate. (Default value = 1)
alpha_theta_decay (float, optional) – The decay for the policy parameters learning rate. (Default value = 1)
alpha_com_decay (float, optional) – The decay for the commitment strategy learning rate when using optional commitment. (Default value = 1)
seed (int, optional) – The seed for random number generation. (Default value = None)
- Returns:
A list of trained agents.
- Return type:
List[Agent]
- Raises:
Exception – When the number of players does not equal two.
- ramo.commitment.execute_commitment.get_leader(agents, episode, alternate=False)
Select the leader in the current episode.
- Parameters:
agents (List[Agent]) – A list of agents.
episode (int) – The current episode.
alternate (bool, optional) – Whether to alternate the leader between players or not. (Default value = False)
- Returns:
The id of the leader and the leader agent itself.
- Return type:
Tuple[int, Agent]
- ramo.commitment.execute_commitment.select_actions(agents, commitment)
Select an action from each agent’s policy.
- Parameters:
agents (List[Agent]) – A list of agents.
commitment (None | int | ndarray) – The commitment from the leader.
- Returns:
A list of selected actions.
- Return type:
List[int]
- ramo.commitment.execute_commitment.update(agents, commitment, actions, payoffs)
Perform an update for a list of agents.
- Parameters:
agents (List[Agent]) – A list of agents.
commitment (int | ndarray) – The commitment from the leader.
actions (List[int]) – A list of each action that was chosen, indexed by agent.
payoffs (List[ndarray]) – A list of each payoff that was received, indexed by agent.
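The leader selection performed by get_leader can be pictured with a minimal sketch. It assumes that player 0 always leads when alternate is False and that leadership rotates round-robin by episode otherwise; neither rule is confirmed by the text above, and get_leader_sketch is a hypothetical name.

```python
def get_leader_sketch(agents, episode, alternate=False):
    """Return (leader_id, leader_agent) for the given episode.

    Assumption: without alternation the first agent leads; with alternation
    the leader role rotates through the agents, one episode at a time.
    """
    leader_id = episode % len(agents) if alternate else 0
    return leader_id, agents[leader_id]


agents = ["agent_0", "agent_1"]  # placeholders standing in for agent objects
fixed_leader = get_leader_sketch(agents, episode=3)
rotating_leader = get_leader_sketch(agents, episode=3, alternate=True)
```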
ramo.commitment.non_stationary module
- class ramo.commitment.non_stationary.NonStationaryAgent(id, u, num_actions, num_objectives, alpha_q=0.01, alpha_theta=0.01, alpha_q_decay=1, alpha_theta_decay=1, buffer_size=20, rng=None)
Bases: object
An agent that learns a non-stationary policy to each pure-strategy commitment from the leader.
- get_commitment()
Get the commitment from the leader.
- Returns:
A pure strategy commitment of the leader.
- Return type:
int
- make_follower()
Make this agent the follower.
- make_leader()
Make this agent the leader.
- objective_function_follower(thetas, q_values, leader_policy)
The objective function for the follower.
- Parameters:
thetas (ndarray) – A matrix of thetas.
q_values (ndarray) – Learned Q-values for the joint-actions.
leader_policy (ndarray) – The committed non-stationary strategy from the leader.
- Returns:
The utility from these parameters.
- Return type:
float
- objective_function_leader(theta, q_values)
The objective function for the leader.
- Parameters:
theta (ndarray) – The parameters for the commitment policy.
q_values (ndarray) – The Q-values for committing to actions.
- Returns:
The utility from the commitment strategy.
- Return type:
float
- select_action(commitment)
Select an action based on the commitment of the leader.
- Parameters:
commitment (int) – The message that was sent.
- Returns:
The selected action.
- Return type:
int
- select_committed(leader_action)
Play the pure strategy that was committed.
- Parameters:
leader_action (int) – The pure strategy (action) the leader published.
- Returns:
The committed action.
- Return type:
int
- select_counter_action(leader_action)
Select the correct counter policy and sample an action using this policy.
- Parameters:
leader_action (int) – The committed pure strategy from the leader.
- Returns:
The selected action.
- Return type:
int
- set_opponent_actions(num_actions)
Set the number of actions that the opponent can play.
- Parameters:
num_actions (int) – The number of actions for the opponent.
- update(commitment, actions, reward)
Perform an update of the agent. Specifically updates the Q-tables, policies and hyperparameters.
- Parameters:
commitment (int) – The leader’s non-stationary commitment strategy.
actions (List[int]) – The actions selected in an episode.
reward (float) – The reward that was obtained by the agent in that episode.
- update_leader_q_table(action, reward)
Update the leader’s Q-table based on their own action and the obtained reward.
- Parameters:
action (int) – The action taken by the leader.
reward (float) – The reward obtained by this agent.
- update_parameters()
Update the internal parameters of the agent.
- update_payoffs_table(actions, reward)
Update the joint-action payoffs table.
- Parameters:
actions (List[int]) – The actions that were taken in an episode.
reward (float) – The reward obtained by this joint action.
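One plausible reading of objective_function_follower is sketched below: each row of thetas parameterises the follower's counter policy to one leader action, and the utility is applied to the return expected under the leader's commitment strategy. The softmax parameterisation, the Q-table layout, and the explicit utility argument u are all assumptions made for illustration.

```python
import numpy as np


def row_softmax(thetas):
    """Apply a softmax to every row of a parameter matrix."""
    exp = np.exp(thetas - thetas.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


def follower_objective(thetas, q_values, leader_policy, u):
    """Utility of the expected return given one counter policy per commitment.

    thetas:        shape (num_leader_actions, num_follower_actions)
    q_values:      shape (num_leader_actions, num_follower_actions, num_objectives)
    leader_policy: shape (num_leader_actions,)
    """
    counter_policies = row_softmax(thetas)
    # Probability of each joint action under commitment plus response.
    joint_probs = leader_policy[:, None] * counter_policies
    expected_returns = np.einsum("lf,lfo->o", joint_probs, q_values)
    return u(expected_returns)


# Hypothetical example: uniform policies and a product utility.
thetas = np.zeros((2, 2))
q_values = np.array([[[4.0, 0.0], [0.0, 0.0]],
                     [[0.0, 0.0], [0.0, 4.0]]])
follower_value = follower_objective(
    thetas, q_values, np.array([0.5, 0.5]), lambda v: v[0] * v[1]
)
```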
ramo.commitment.optional_com module
- class ramo.commitment.optional_com.OptionalComAgent(no_com_agent, com_agent, id, u, num_actions, num_objectives, alpha_q=0.01, alpha_theta=0.01, alpha_q_decay=1, alpha_theta_decay=1, rng=None)
Bases: object
An agent that learns when to commit through a two-layer system [1].
This is implemented through learning two additional agents, one used when committing and another used when not committing.
References
[1] Willem Röpke, Diederik M. Roijers, Ann Nowé, & Roxana Rădulescu. (2021). Preference Communication in Multi-Objective Normal-Form Games.
- get_commitment()
Get the commitment from the leader.
- Returns:
A commitment from the leader.
- Return type:
int | None
- make_follower()
Make this agent the follower.
- make_leader()
Make this agent the leader.
- objective_function(theta, q_values)
The objective function.
- Parameters:
theta (ndarray) – The policy parameters.
q_values (ndarray) – Learned Q-values used to calculate the SER from these parameters.
- Returns:
The utility from the current parameters theta and Q-values.
- Return type:
float
- select_action(commitment)
Select an action based on the commitment of the leader. Pass the commitment to the correct layer.
- Parameters:
commitment (int | ndarray) – The commitment from the leader.
- Returns:
The selected action.
- Return type:
int
- update(commitment, actions, reward)
Perform an update of the agent and cascade an update to the lower layer agent.
- Parameters:
commitment (int | ndarray | None) – The commitment from the leader.
actions (List[int]) – The actions selected in an episode.
reward (float) – The reward that was obtained by the agent in that episode.
- update_parameters()
Update the internal parameters of the agent.
- update_q_table(commitment, reward)
Update the vector-valued Q-table.
- Parameters:
commitment (int | ndarray | None) – The commitment from the leader.
reward (float) – The reward obtained by this agent.
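The two-layer design, where select_action passes the commitment to the correct layer, might be sketched as below. It assumes that a commitment of None means the leader chose not to commit; StubAgent and dispatch_action are hypothetical stand-ins, not classes or functions from the package.

```python
class StubAgent:
    """Placeholder lower-layer agent that always plays a fixed action."""

    def __init__(self, action):
        self.action = action

    def select_action(self, commitment):
        return self.action


def dispatch_action(no_com_agent, com_agent, commitment):
    """Route the commitment to the lower-layer agent that handles it."""
    if commitment is None:
        # No commitment was made, so the non-committing layer acts.
        return no_com_agent.select_action(commitment)
    return com_agent.select_action(commitment)


no_com_agent = StubAgent(action=0)
com_agent = StubAgent(action=1)
action_without_commitment = dispatch_action(no_com_agent, com_agent, None)
action_with_commitment = dispatch_action(no_com_agent, com_agent, 1)
```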