AI4DI x Mindtrek: Reinforcement Learning

03.02.20

Recently, Vaisto Solutions was invited to speak at the Smart City Mindtrek Conference held in Tampere, Finland. Our presentation focused on reinforcement learning and the benefits it offers to those who apply it.

Reinforcement learning is an area of machine learning in which so-called agents learn to find optimal ways to complete a given task. An agent tries to behave as a human would on a single task. It receives an input (observations about the environment, e.g. pictures from a camera) and produces an output (an action, e.g. “turn right”). To steer the agent towards the best action for each input, it is given a reward function (e.g. complete the task as quickly as possible and receive a reward). Once the environment and the rewards have been specified, the agent must be trained to learn the task. In contrast to supervised learning, reinforcement learning does not use labeled data; the agent learns by interacting with its environment.
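
The observe, act and reward loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the setup used in the project: the `Corridor` environment, its reward values and the choice of tabular Q-learning are all assumptions made for the example.

```python
import random


class Corridor:
    """Hypothetical toy environment: start in cell 0, reach the last cell."""

    def __init__(self, length=6):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the observation is simply the current cell

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the position.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        # Reward function: a penalty per step ("finish as soon as possible")
        # and a bonus when the task is completed.
        reward = 10.0 if done else -1.0
        return self.position, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning: learn action values purely from interaction."""
    rng = random.Random(seed)
    env = Corridor()
    q = [[0.0, 0.0] for _ in range(env.length)]  # q[state][action]
    for _ in range(episodes):
        obs = env.reset()
        done = False
        steps = 0
        while not done and steps < 100:
            # Epsilon-greedy: mostly exploit, occasionally explore.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[obs][0] > q[obs][1] else 1
            next_obs, reward, done = env.step(action)
            # Move the estimate towards reward plus discounted future value.
            target = reward if done else reward + gamma * max(q[next_obs])
            q[obs][action] += alpha * (target - q[obs][action])
            obs = next_obs
            steps += 1
    return q


q_table = train()
# Read out the greedy action for every non-terminal cell.
policy = ["right" if right > left else "left" for left, right in q_table[:-1]]
print(policy)
```

No labeled examples are involved: the agent only ever sees observations and rewards, and the step penalty is what pushes it towards the shortest route.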

Simulation environments are becoming more realistic, and it is increasingly evident that the prototyping, development and automation of work machines can be taken to a very mature state in simulation. This shortens product development lead time and reduces prototyping costs. The following two videos showcase our contribution to the AI4DI project.


In the digital twin work cycle automation, two agents are used: one for picking up logs and one for driving along the forest road. Each agent receives the state of the machine and some visual observations as input and produces an action. When picking up a log, the agent can freely control any joint of the grabber; when driving, it outputs the throttle and the steering angle.
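
As a rough sketch of what the two action outputs could look like in code, the Python below defines one structure per agent. The class names, the number of grabber joints and the value ranges are hypothetical and chosen only for illustration; the actual interface of the digital twin is not described here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DriveAction:
    throttle: float      # assumed range 0.0 to 1.0
    steer_angle: float   # assumed normalised range -1.0 to 1.0


@dataclass
class GrabberAction:
    joint_commands: List[float]  # one command per grabber joint


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def to_drive_action(raw: Sequence[float]) -> DriveAction:
    """Map a raw two-element policy output to a bounded driving command."""
    return DriveAction(
        throttle=clamp(raw[0], 0.0, 1.0),
        steer_angle=clamp(raw[1], -1.0, 1.0),
    )


def to_grabber_action(raw: Sequence[float], num_joints: int = 4) -> GrabberAction:
    """Map a raw policy output to one bounded command per joint.

    num_joints=4 is a placeholder, not the real machine's joint count.
    """
    if len(raw) != num_joints:
        raise ValueError(f"expected {num_joints} joint commands, got {len(raw)}")
    return GrabberAction(joint_commands=[clamp(v, -1.0, 1.0) for v in raw])


drive = to_drive_action([1.5, -2.0])
grab = to_grabber_action([0.2, -0.4, 3.0, 0.0])
print(drive)
print(grab)
```

Keeping the two action types separate mirrors the split into a loading agent and a driving agent: each policy only has to learn the outputs relevant to its own part of the work cycle.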


In many cases, agents require a lot of training. The training time depends on the hardware used and on the complexity of the environment and the task. Several techniques can speed up training, including running the simulation faster than real time and running multiple environments in parallel. With the right reward function and enough computing power, agents can learn complex behaviour, such as helping and avoiding each other, in order to complete the task as reliably and efficiently as possible.
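
The idea behind parallel environments can be illustrated with a small Python sketch. `ToyEnv`, its reward and the fixed rule used to pick actions are hypothetical stand-ins for a real simulator and a learned policy. For simplicity the copies are stepped one after another in a single process, whereas a real setup would typically run them in separate processes or inside the simulator itself.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for one simulator instance."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.state = 0.0

    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        # The state drifts by the action plus a little noise.
        self.state += action + self.rng.uniform(-0.1, 0.1)
        reward = -abs(self.state - 5.0)  # assumed goal: stay close to 5.0
        done = abs(self.state) > 10.0
        return self.state, reward, done


class ParallelEnvs:
    """Steps several environment copies with one call."""

    def __init__(self, num_envs):
        self.envs = [ToyEnv(seed=i) for i in range(num_envs)]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        results = []
        for env, action in zip(self.envs, actions):
            obs, reward, done = env.step(action)
            if done:
                obs = env.reset()  # restart finished copies immediately
            results.append((obs, reward, done))
        return results


def collect(num_envs, num_steps):
    """Gather (obs, action, reward, next_obs, done) tuples from all copies."""
    envs = ParallelEnvs(num_envs)
    observations = envs.reset()
    transitions = []
    for _ in range(num_steps):
        # Placeholder rule instead of a learned policy: move towards 5.0.
        actions = [1.0 if obs < 5.0 else -1.0 for obs in observations]
        results = envs.step(actions)
        for obs, action, (next_obs, reward, done) in zip(observations, actions, results):
            transitions.append((obs, action, reward, next_obs, done))
        observations = [result[0] for result in results]
    return transitions


single = collect(num_envs=1, num_steps=100)
batch = collect(num_envs=8, num_steps=100)
print(len(single), len(batch))
```

For the same number of policy steps, eight copies yield eight times as many transitions to learn from, which is where the speed-up comes from once the copies really do run concurrently.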