Reward Calculation and Feedback Process for Action Nodes in the Intelligent Consortium (Backpropagation and Gradient Descent)
As mentioned earlier, the intelligent structure of the Intelligent Consortium aims to construct all action nodes as independent reinforcement learning networks, each containing neural network structures (Figure 1). In a reinforcement learning framework, an intelligent agent compares the expected reward computed by the network with the actual reward received, calculates the reward difference, and adjusts the neurons' weights and thus the network's behavior (preferences) through backpropagation and gradient descent. A comparable concept of expected reward exists in the behavior of the Intelligent Consortium's action nodes. All nodes in the network behind an action node (including a certain number of driving nodes) assign an expected reward to the action node's impending action. The expectations and opinions of the driving nodes influence the action node directly through their driving behaviors; combined with the action node's own expectations and ideas, these influences form the node's final action plan and the corresponding expected action reward. Since driving nodes are themselves action nodes in other networks, observing the entire network from the perspective of one action node reveals an action plan generated by that network and executed by its action node, along with the corresponding expected reward. After receiving the actual reward, the action node compares it with the expected reward and backpropagates the resulting difference to all driving nodes in the network behind it. The driving nodes then adjust their perspectives, modifying their subsequent driving behavior and thereby altering the action node's future action plans. This process resembles the backpropagation and gradient descent used in deep learning neural networks.
However, the human-constructed network of the Intelligent Consortium involves numerous non-quantitative factors, so it resembles computer neural networks only in form and procedural steps, not in the mechanical application of mathematical algorithms.
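Although the Consortium's version of this loop is qualitative rather than mathematical, the formal procedure it mirrors can be sketched in code. The sketch below is purely illustrative and not from the source: the class name `ActionNode`, the weighted-sum expectation, and the update rule are assumptions standing in for the human judgments the text describes, following a standard reward-difference (temporal-difference-style) update.

```python
class ActionNode:
    """Illustrative sketch of the reward-feedback loop described above.

    Each driving node contributes an "opinion" (its expected reward for the
    impending action), weighted by how much influence it has on this node.
    All names and the update rule are assumptions, not the source's method.
    """

    def __init__(self, driving_weights, learning_rate=0.1):
        self.w = list(driving_weights)  # influence of each driving node
        self.lr = learning_rate

    def expected_reward(self, opinions):
        # The action plan's expected reward: driving-node opinions,
        # weighted by each node's influence.
        return sum(w * o for w, o in zip(self.w, opinions))

    def feedback(self, opinions, actual_reward):
        # Reward difference (the analogue of a TD error or loss gradient).
        delta = actual_reward - self.expected_reward(opinions)
        # "Backpropagate": each driving node's influence shifts in
        # proportion to its contribution -- the gradient-descent analogue
        # of driving nodes adjusting their perspectives.
        self.w = [w + self.lr * delta * o for w, o in zip(self.w, opinions)]
        return delta
```

For example, a node with influences `[0.5, 0.3]` receiving opinions `[1.0, 0.5]` expects a reward of 0.65; if the actual reward is 1.0, the difference 0.35 is fed back and both weights increase, so the next plan weighs those driving nodes more heavily.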
Before the reward difference for action nodes can be calculated, and indeed before the network is constructed, the network must reach a consensus on an information feedback (publication) system. This system can take any form conceived by the organization's members, and it can be continuously optimized during the development of the Intelligent Consortium to achieve faster and more accurate information feedback.