I am trying to build a custom reinforcement learning environment for a project, with multiple agents that each have their own policy network, and I am stuck on the training part (I am trying to follow an approach similar to this example).
My policy network accepts an array of size 21 as input and outputs a single element from [-1, 0, 1].
I have the following code (multiple-file code shortened into a single file; sorry for the mess):
clear
close all
%% Model parameters
T_init = 0;
T_final = 100;
dt = 1;
rng("shuffle")
baseEnv = baseEnvironment();
p1_pos = randi(baseEnv.L,1);
p2_pos = randi(baseEnv.L,1);
while p1_pos == p2_pos
    p2_pos = randi(baseEnv.L,1);
end
rng("shuffle")
baseEnv = baseEnvironment();
% validateEnvironment(baseEnv)
p1_pos = randi(baseEnv.L,1);
p2_pos = randi(baseEnv.L,1);
while p1_pos == p2_pos
    p2_pos = randi(baseEnv.L,1);
end
agent1 = IMAgent(baseEnv, p1_pos, 1, 'o');
agent2 = IMAgent(baseEnv, p2_pos, 2, 'x');
listOfAgents = [agent1; agent2];
multiAgentEnv = multiAgentEnvironment(listOfAgents);
%
actInfo = getActionInfo(baseEnv);
obsInfo = getObservationInfo(baseEnv);
%%build the agent1
actorNetwork = [
    imageInputLayer([obsInfo.Dimension(1) 1 1],'Normalization','none','Name','state')
    fullyConnectedLayer(24,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(24,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(numel(actInfo.Elements),'Name','output')
    softmaxLayer('Name','actionProb')];
actorOpts = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
actor = rlStochasticActorRepresentation(actorNetwork,...
    obsInfo,actInfo,'Observation','state',actorOpts);
actor = setLoss(actor, @actorLossFunction);
%obj.brain = rlPGAgent(actor,baseline,agentOpts);
agentOpts = rlPGAgentOptions('UseBaseline',false, 'DiscountFactor', 0.99);
agent1.brain = rlPGAgent(actor,agentOpts);
%%build the agent2
actorNetwork = [
    imageInputLayer([obsInfo.Dimension(1) 1 1],'Normalization','none','Name','state')
    fullyConnectedLayer(24,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(24,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(numel(actInfo.Elements),'Name','output')
    softmaxLayer('Name','actionProb')];
actorOpts = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
actor = rlStochasticActorRepresentation(actorNetwork,...
    obsInfo,actInfo,'Observation','state',actorOpts);
actor = setLoss(actor, @actorLossFunction);
%obj.brain = rlPGAgent(actor,baseline,agentOpts);
agentOpts = rlPGAgentOptions('UseBaseline',false, 'DiscountFactor', 0.99);
agent2.brain = rlPGAgent(actor,agentOpts);
%%
averageGrad = [];
averageSqGrad = [];
learnRate = 0.05;
gradDecay = 0.75;
sqGradDecay = 0.95;
numOfEpochs = 1;
numEpisodes = 5000;
maxStepsPerEpisode = 250;
discountFactor = 0.995;
aveWindowSize = 100;
trainingTerminationValue = 220;
loss_history = [];
for i = 1:numOfEpochs
    action_hist = [];
    reward_hist = [];
    observation_hist = [multiAgentEnv.baseEnv.state];
    for t = T_init:1:T_final
        actionList = multiAgentEnv.act();
        [observation, reward, multiAgentEnv.isDone, ~] = multiAgentEnv.step(actionList);
        if t == T_final
            multiAgentEnv.isDone = true;
        end
        action_hist = cat(3, action_hist, actionList);
        reward_hist = cat(3, reward_hist, reward);
        if multiAgentEnv.isDone == true
            break
        else
            observation_hist = cat(3, observation_hist, observation);
        end
    end
    if size(observation_hist,3) ~= size(action_hist,3)
        print("gi")
    end
    clear observation reward
    actor = getActor(agent1.brain);
    batchSize = min(t,maxStepsPerEpisode);
    observations = observation_hist;
    actions = action_hist(1,:,:);
    rewards = reward_hist(1,:,:);
    observationBatch = permute(observations(:,:,1:batchSize), [2,1,3]);
    actionBatch = actions(:,:,1:batchSize);
    rewardBatch = rewards(:,1:batchSize);
    discountedReturn = zeros(1,int32(batchSize));
    for t = 1:batchSize
        G = 0;
        for k = t:batchSize
            G = G + discountFactor ^ (k-t) * rewardBatch(k);
        end
        discountedReturn(t) = G;
    end
    lossData.batchSize = batchSize;
    lossData.actInfo = actInfo;
    lossData.actionBatch = actionBatch;
    lossData.discountedReturn = discountedReturn;
    % 6. Compute the gradient of the loss with respect to the policy
    % parameters.
    actorGradient = gradient(actor,'loss-parameters', {observationBatch},lossData);
    p1_pos = randi(baseEnv.L,1);
    p2_pos = randi(baseEnv.L,1);
    while p1_pos == p2_pos
        p2_pos = randi(baseEnv.L,1);
    end
    multiAgentEnv.reset([p1_pos; p2_pos]);
end

function loss = actorLossFunction(policy, lossData)
    % Create the action indication matrix.
    batchSize = lossData.batchSize;
    Z = repmat(lossData.actInfo.Elements',1,batchSize);
    actionIndicationMatrix = lossData.actionBatch(:,:) == Z;
    % Resize the discounted return to the size of policy.
    G = actionIndicationMatrix .* lossData.discountedReturn;
    G = reshape(G,size(policy));
    % Round any policy values less than eps to eps.
    policy(policy < eps) = eps;
    % Compute the loss.
    loss = -sum(G .* log(policy),'all');
end
When I run the code, I am getting the following error:
Error using rl.representation.rlAbstractRepresentation/gradient (line 181)
Unable to compute gradient from representation.

Error in main1 (line 154)
actorGradient = gradient(actor,'loss-parameters', {observationBatch},lossData);

Caused by:
    Unable to evaluate the loss function. Check the loss function and ensure it runs successfully.
        Reference to non-existent field 'Advantage'.
I also tried running the example in the link; it works, but my code does not. I put a breakpoint in the loss function, but it is never hit during the gradient calculation. From the error message, I suspect this is the problem, yet the same approach works when I run the example code from the MathWorks website.
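Since the error message suggests checking that the loss function runs successfully, one way to rule out the function itself is to call it directly with synthetic inputs, outside of gradient. The following is a minimal sketch in base MATLAB, not part of the original code. It assumes that actInfo.Elements is a row vector (here a hypothetical plain struct stands in for the real action spec), that the policy arrives as a numeric matrix with one column of action probabilities per step, and that the batch values are made up for illustration.

```matlab
% Hypothetical stand-in for the action spec; assumes Elements is a row vector.
actInfo.Elements = [-1 0 1];
batchSize = 4;

% Synthetic batch (made-up values): one action and one return per step.
lossData.batchSize = batchSize;
lossData.actInfo = actInfo;
lossData.actionBatch = reshape([-1 0 1 0], 1, 1, batchSize);
lossData.discountedReturn = [1.0 0.5 0.25 0.125];

% Uniform policy: each column holds the probabilities of the three actions.
numActions = numel(actInfo.Elements);
policy = ones(numActions, batchSize) / numActions;

% Call the loss function directly, without going through gradient.
loss = actorLossFunction(policy, lossData);
disp(loss)

function loss = actorLossFunction(policy, lossData)
    % Create the action indication matrix (numActions-by-batchSize).
    batchSize = lossData.batchSize;
    Z = repmat(lossData.actInfo.Elements',1,batchSize);
    actionIndicationMatrix = lossData.actionBatch(:,:) == Z;
    % Keep the discounted return only at the entry of the action taken.
    G = actionIndicationMatrix .* lossData.discountedReturn;
    G = reshape(G,size(policy));
    % Round any policy values less than eps to eps.
    policy(policy < eps) = eps;
    % Compute the loss.
    loss = -sum(G .* log(policy),'all');
end
```

If this runs and prints a finite scalar, the function body is probably not at fault, and the fact that the breakpoint is never hit would point toward a different loss function being evaluated by gradient.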
Best Answer