ramo.learner package
Submodules
ramo.learner.execute_learner module
- ramo.learner.execute_learner.execute_learner(monfg, u_tpl, experiment='indep_ac', runs=100, episodes=5000, rollouts=100, alpha_q=0.01, alpha_theta=0.01, alpha_q_decay=1, alpha_theta_decay=1, epsilon=1, epsilon_decay=0.995, min_epsilon=0.1, seed=None)
Execute a commitment experiment.
- Parameters:
monfg (MONFG) – An MONFG object.
u_tpl (Tuple[callable]) – A tuple of utility functions.
experiment (str, optional) – The type of commitment experiment to execute. (Default value = ‘indep_ac’)
runs (int, optional) – The number of times to repeat the experiment. (Default value = 100)
episodes (int, optional) – The number of episodes in one run of the experiment. (Default value = 5000)
rollouts (int, optional) – The number of Monte-Carlo simulations at each episode. (Default value = 100)
alpha_q (float, optional) – The learning rate for Q-values. (Default value = 0.01)
alpha_theta (float, optional) – The learning rate for policy parameters. (Default value = 0.01)
alpha_q_decay (float, optional) – The decay of the learning rate for Q-values. (Default value = 1)
alpha_theta_decay (float, optional) – The decay for the learning rate of policy parameters. (Default value = 1)
epsilon (float, optional) – The exploration rate for a Q-learner agent. (Default value = 1)
epsilon_decay (float, optional) – The decay for the exploration rate. (Default value = 0.995)
min_epsilon (float, optional) – The minimum value for the exploration rate. (Default value = 0.1)
seed (int, optional) – The seed for random number generation. (Default value = None)
- Returns:
- A log of payoffs, a log of action probabilities for both agents, a log of the state distribution, and a log of the commitment probabilities.
- Return type:
Tuple[Dict, Dict, ndarray, Dict]
- Raises:
Exception – When the number of players does not equal two.
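The exploration parameters above can be read as a per-episode schedule. The sketch below assumes that epsilon is multiplied by epsilon_decay once per episode and clipped at min_epsilon; the reference does not spell out the update rule, and `epsilon_schedule` is a hypothetical helper, not part of ramo.

```python
def epsilon_schedule(episodes, epsilon=1.0, epsilon_decay=0.995, min_epsilon=0.1):
    """Return the exploration rate used in each episode.

    Assumption: epsilon is multiplied by epsilon_decay after every episode
    and never drops below min_epsilon.
    """
    schedule = []
    for _ in range(episodes):
        schedule.append(epsilon)
        epsilon = max(min_epsilon, epsilon * epsilon_decay)
    return schedule


schedule = epsilon_schedule(episodes=5000)
first_floor = schedule.index(0.1)  # first episode that explores at the minimum rate
print(schedule[0], schedule[-1], first_floor)
```

Under this assumption and the default values, exploration reaches its floor after roughly 460 of the 5000 episodes, so most of a run is spent at min_epsilon.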
- ramo.learner.execute_learner.select_actions(agents)
Select an action from each agent’s policy.
- Parameters:
agents (List[Agent]) – A list of agents.
- Returns:
A list of selected actions.
- Return type:
List[int]
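As an illustration of what this function describes, the sketch below collects one action per agent, in agent order. `UniformAgent` is a hypothetical stand-in that only provides select_action(); it is not a ramo class, and the function body is an assumption about the behaviour, not the library's source.

```python
import random


class UniformAgent:
    """Hypothetical stand-in for an agent; it only provides select_action()."""

    def __init__(self, num_actions, rng):
        self.num_actions = num_actions
        self.rng = rng

    def select_action(self):
        # Pick uniformly at random among the available actions.
        return self.rng.randrange(self.num_actions)


def select_actions(agents):
    """Ask every agent for one action and return them in agent order."""
    return [agent.select_action() for agent in agents]


rng = random.Random(0)
agents = [UniformAgent(2, rng), UniformAgent(3, rng)]
actions = select_actions(agents)
print(actions)
```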
- ramo.learner.execute_learner.update(agents, actions, payoffs, experiment)
Perform an update for a list of agents.
- Parameters:
agents (List[Agent]) – A list of agents.
actions (List[int]) – A list of each action that was chosen, indexed by agent.
payoffs (List[ndarray]) – A list of each payoff that was received, indexed by agent.
experiment (str) – The type of experiment that is being executed.
ramo.learner.indep_actor_critic module
- class ramo.learner.indep_actor_critic.IndependentActorCriticAgent(u, num_actions, num_objectives, alpha_q=0.01, alpha_theta=0.01, alpha_q_decay=1, alpha_theta_decay=1, rng=None)
Bases: object
An independent learner using the multi-objective actor-critic algorithm for the SER criterion.
This implementation is based on the multi-objective actor-critic algorithm proposed in [1].
References
- objective_function(theta, q_values)
The objective function for the agent. This is the SER criterion.
- Parameters:
theta (ndarray) – The policy parameters.
q_values (ndarray) – The expected returns for the actions.
- Returns:
The utility from the current policy and Q-values.
- Return type:
float
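Under the SER (scalarised expected returns) criterion, the utility function is applied to the expected payoff vector rather than to individual payoffs. The sketch below assumes a softmax policy over theta and Q-values stored as one vector per action; `softmax`, `ser_objective`, and the product utility `u` are hypothetical names chosen for illustration and may differ from the library's implementation.

```python
import math


def softmax(theta):
    """Turn policy parameters into action probabilities."""
    shift = max(theta)  # subtract the maximum for numerical stability
    exps = [math.exp(t - shift) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def ser_objective(theta, q_values, u):
    """Utility of the expected payoff vector (SER).

    q_values[a] is the vector of expected returns for action a.
    """
    policy = softmax(theta)
    num_objectives = len(q_values[0])
    expected = [
        sum(policy[a] * q_values[a][o] for a in range(len(policy)))
        for o in range(num_objectives)
    ]
    return u(expected)


def u(vec):
    # Hypothetical nonlinear utility: the product of both objectives.
    return vec[0] * vec[1]


theta = [0.0, 0.0]                   # uniform policy over two actions
q_values = [[4.0, 0.0], [0.0, 4.0]]
ser = ser_objective(theta, q_values, u)
# For contrast: expected utility of the individual payoff vectors (ESR).
esr = sum(p * u(q) for p, q in zip(softmax(theta), q_values))
print(ser, esr)
```

With a nonlinear utility the two criteria can disagree: here the expected payoff vector is balanced and has positive utility, while each individual payoff vector has utility zero.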
- select_action()
Select an action according to the agent’s policy.
- Returns:
The selected action.
- Return type:
int
- update(action, reward)
Perform an update for the agent.
- Parameters:
action (int) – The action that was taken by the agent.
reward (float) – The reward that was obtained by the agent.
- update_parameters()
Update the hyperparameters. Decays the learning rate for the Q-values and policy parameters.
- update_q_table(action, reward)
Update the Q-table based on the chosen action and the obtained reward.
- Parameters:
action (int) – The action chosen by this agent.
reward (float) – The reward obtained by this agent.
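A Q-table update of this kind is commonly written as a step from the stored estimate toward the observed reward, scaled by alpha_q. The sketch below assumes that rule and a reward with one entry per objective (the reference lists the type as float); it is a standalone illustration, not the library's code.

```python
def update_q_table(q_table, action, reward, alpha_q):
    """Move the Q-vector of the chosen action toward the observed reward vector."""
    q_table[action] = [
        q + alpha_q * (r - q) for q, r in zip(q_table[action], reward)
    ]
    return q_table


q_table = [[0.0, 0.0], [0.0, 0.0]]   # 2 actions x 2 objectives
update_q_table(q_table, action=1, reward=[1.0, 2.0], alpha_q=0.5)
print(q_table)
```

Only the row of the chosen action changes; the other actions keep their previous estimates.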
ramo.learner.indep_q module
- class ramo.learner.indep_q.IndependentQAgent(u, num_actions, num_objectives, alpha_q=0.01, alpha_q_decay=1, epsilon=0.01, epsilon_decay=1, min_epsilon=0.01, rng=None)
Bases: object
An independent learner using Q-learning for the SER criterion.
This implementation is based on the multi-objective Q-learning algorithm proposed in [1].
References
[1] Rădulescu, R., Mannion, P., Zhang, Y., Roijers, D., & Nowé, A. (2020). A utility-based analysis of equilibria in multi-objective normal-form games. The Knowledge Engineering Review, 35, e32.
- select_action()
Select an action according to the agent’s policy.
- Returns:
The selected action.
- Return type:
int
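The epsilon, epsilon_decay, and min_epsilon parameters suggest an epsilon-greedy policy. The sketch below is a simplified version that, when exploiting, picks the action whose Q-vector has the highest utility; this is an assumption, and the library may instead derive a mixed strategy from the Q-values. The function and the product utility `u` are hypothetical.

```python
import random


def select_action(q_table, u, epsilon, rng):
    """Epsilon-greedy choice over the utility of each action's Q-vector."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_table))   # explore: uniform random action
    utilities = [u(q) for q in q_table]
    # exploit: index of the action with the highest utility
    return max(range(len(q_table)), key=utilities.__getitem__)


def u(vec):
    # Hypothetical utility function: the product of the two objectives.
    return vec[0] * vec[1]


rng = random.Random(42)
q_table = [[1.0, 1.0], [2.0, 2.0], [3.0, 0.0]]
greedy = select_action(q_table, u, epsilon=0.0, rng=rng)
explored = select_action(q_table, u, epsilon=1.0, rng=rng)
print(greedy, explored)
```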
- update(action, reward)
Perform an update for the agent.
- Parameters:
action (int) – The action that was taken by the agent.
reward (float) – The reward that was obtained by the agent.
- update_parameters()
Update the hyperparameters. This decays the learning rate for the Q-values and the exploration rate.
- update_q_table(action, reward)
Update the Q-table based on the chosen action and the obtained reward.
- Parameters:
action (int) – The action chosen by this agent.
reward (float) – The reward obtained by this agent.
ramo.learner.ja_actor_critic module
- class ramo.learner.ja_actor_critic.JointActionActorCriticAgent(id, u, num_actions, num_objectives, player_actions, alpha_q=0.01, alpha_theta=0.01, alpha_q_decay=1, alpha_theta_decay=1, rng=None)
Bases: object
A joint-action learner using the multi-objective actor-critic algorithm for the SER criterion.
- objective_function(theta, q_values)
The objective function for the agent. This is the SER criterion.
- Parameters:
theta (ndarray) – The policy parameters.
q_values (ndarray) – The expected returns for the actions.
- Returns:
The utility from the current policy and Q-values.
- Return type:
float
- select_action()
Select an action according to the agent’s policy.
- Returns:
The selected action.
- Return type:
int
- update(actions, reward)
Perform an update for the agent.
- Parameters:
actions (List[int]) – The actions taken by all players.
reward (float) – The reward that was obtained by the agent.
- update_counts(actions)
Update the joint-action counts.
- Parameters:
actions (List[int]) – The actions chosen by the agents.
- update_parameters()
Update the hyperparameters.
- update_policies()
Update the joint policies from the empirical action distribution.
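update_counts and update_policies together describe estimating another player's policy from observed play. The sketch below shows one way this could work for a single opponent: counts are incremented per observed action and then normalised. The data layout, the helper names, and the uniform fallback for empty counts are assumptions made for illustration.

```python
def update_counts(counts, opponent_action):
    """Record one more observation of the opponent's action."""
    counts[opponent_action] += 1
    return counts


def update_policy(counts):
    """Normalise counts into an empirical action distribution."""
    total = sum(counts)
    if total == 0:
        return [1.0 / len(counts)] * len(counts)   # no data yet: uniform
    return [c / total for c in counts]


counts = [0, 0]
for observed in [0, 0, 1, 0]:
    update_counts(counts, observed)
policy = update_policy(counts)
print(counts, policy)
```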
- update_q_table(actions, reward)
Update the joint-action Q-table.
- Parameters:
actions (List[int]) – The actions chosen by the agents.
reward (float) – The reward obtained by this agent.
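A joint-action Q-table stores one estimate per combination of actions, which can then be averaged over the other player's estimated policy to rate each of the agent's own actions. The sketch below assumes two players, with the learning agent listed first in `actions`, and a vector-valued reward; `update_joint_q` and `marginal_q` are hypothetical helpers, not ramo functions.

```python
def update_joint_q(q_table, actions, reward, alpha_q):
    """Update the Q-vector stored for one joint action (two players assumed)."""
    own, other = actions  # assumption: this agent is the first player
    q_table[own][other] = [
        q + alpha_q * (r - q) for q, r in zip(q_table[own][other], reward)
    ]


def marginal_q(q_table, opponent_policy):
    """Expected Q-vector per own action under the opponent's estimated policy."""
    return [
        [
            sum(p * q_vec[o] for p, q_vec in zip(opponent_policy, row))
            for o in range(len(row[0]))
        ]
        for row in q_table
    ]


# 2 own actions x 2 opponent actions x 2 objectives, all zeros initially.
q_table = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
update_joint_q(q_table, actions=[0, 1], reward=[2.0, 4.0], alpha_q=0.5)
expected = marginal_q(q_table, opponent_policy=[0.5, 0.5])
print(expected)
```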
ramo.learner.ja_q module
- class ramo.learner.ja_q.JointActionQAgent(id, u, num_actions, num_objectives, player_actions, alpha_q=0.01, alpha_q_decay=1, epsilon=0.01, epsilon_decay=1, min_epsilon=0.01, rng=None)
Bases: object
A joint-action learner using Q-learning for the SER criterion.
- select_action()
Select an action according to the agent’s policy.
- Returns:
The selected action.
- Return type:
int
- update(actions, reward)
Perform an update for the agent.
- Parameters:
actions (List[int]) – The actions that were taken by the agents.
reward (float) – The reward that was obtained by the agent.
- update_counts(actions)
Update the joint-action counts.
- Parameters:
actions (List[int]) – The actions chosen by the agents.
- update_parameters()
Update the hyperparameters. This decays the learning rate for the Q-values and the exploration rate.
- update_policies()
Update the joint policies from the empirical action distribution.
- update_q_table(actions, reward)
Update the joint-action Q-table.
- Parameters:
actions (List[int]) – The actions chosen by the agents.
reward (float) – The reward obtained by this agent.