Here is a quick, easy-to-read article about AWS DeepRacer from Amazon Web Services, a car that learns to race with Reinforcement Learning. This article will be followed by one diving deeper into Reinforcement Learning, and a second concluding with the training our DeepRacer did and its results on the real-life AWS DeepRacer track.
AWS DeepRacer is a 1/18th-scale autonomous racing car. Launched at AWS re:Invent 2018, it is a new gateway into Machine Learning and especially Reinforcement Learning. It is driven by several AWS services, mainly Amazon SageMaker and AWS RoboMaker.
The car is composed of two main parts: the chassis/servos and the onboard computer. Here we are going to focus on creating our first model.
Let’s start!
The challenge
The challenge is to run this car on a track, all by itself. Hard-coding that behavior would be a nightmare, but thankfully we have Reinforcement Learning! The car is autonomous: there is no remote controller, and the only things it has are its sensors, such as the camera, the accelerometer, the gyroscope and so on. We need to train a model which, given a state, takes an action with the goal of completing the fastest lap possible. This is done with Reinforcement Learning, using AWS products and their dedicated AWS DeepRacer service.
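To make this concrete, here is a minimal sketch of the reinforcement-learning loop described above. The `environment` and `model` objects are hypothetical placeholders, not actual DeepRacer APIs.

```python
# Minimal sketch of the reinforcement-learning loop (placeholders, not DeepRacer APIs):
# the agent observes a state, picks an action, and receives a reward that guides learning.
def run_episode(environment, model):
    state = environment.reset()                      # e.g. camera frame, speed, heading
    total_reward = 0.0
    done = False
    while not done:
        action = model.choose_action(state)          # steering angle + speed
        state, reward, done = environment.step(action)
        total_reward += reward                       # feedback used to improve the model
    return total_reward
```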
This is the video of the Championship Cup at AWS re:Invent 2019. Have a look!
https://www.youtube.com/watch?v=JGMNk_HfruY

The workspace
First, we have to train the algorithm. To do so, AWS provides a console that makes the work easier: a single, well-defined interface for everyone working on DeepRacer.
Here we are going to go through the steps necessary to launch your first Reinforcement Learning model.
- Setting an Environment:
The track you choose will greatly influence further training sessions; an algorithm can perform better or worse in different environments.
- Defining Action Space:
This defines which actions DeepRacer is able to choose from. The action parameters are:
- The maximum steering angle,
- Steering angle granularity,
- Maximum speed,
- Speed granularity.
A larger action space means more precise actions, but also more training time because there are more outcomes to explore. The action list is fixed once training starts, so future performance is at stake; choose wisely. To see how these parameters translate into a concrete list of actions, have a look at the sketch below.
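Here is a small sketch of how these four parameters turn into a concrete list of actions. The function name and default values are illustrative, not part of the DeepRacer API; the console builds this list for you.

```python
# Illustrative sketch: how steering/speed granularity defines a discrete action list.
# Names and defaults are made up for the example.
def build_action_space(max_steering=30.0, steering_granularity=5,
                       max_speed=3.0, speed_granularity=2):
    # Steering angles are spread symmetrically from -max to +max.
    step = 2 * max_steering / (steering_granularity - 1)
    steering_angles = [-max_steering + i * step for i in range(steering_granularity)]
    # Speeds are evenly spaced fractions of the maximum speed.
    speeds = [max_speed * (i + 1) / speed_granularity for i in range(speed_granularity)]
    return [(angle, speed) for angle in steering_angles for speed in speeds]

# 5 steering angles x 2 speeds = 10 possible actions
print(len(build_action_space()))  # -> 10
```

Doubling either granularity doubles the number of actions, which is exactly why a finer action space costs more training time.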
- The Reward Function:
The reward function is the most important aspect of Reinforcement Learning. A function whose reward is greater for slowing down is counter-intuitive when you want the fastest car; that is a simple example, but more intricate pitfalls are easy to find. Speed is important, but hitting the apex is also important, and it may be impossible at a given speed on a given track. We therefore have to think about all the possible states, or at least try to, and check whether a reward function is working or not.
It is written in Python and each user writes their own. It tells the algorithm whether an action is good or bad by giving it a positive or negative reward.
Here is the reward function I made:
```python
def reward_function(params):
    import math

    # Read input parameters
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']
    all_wheels_on_track = params['all_wheels_on_track']
    speed = params['speed']
    steering = abs(params['steering_angle'])
    x = params['x']                      # not used below
    y = params['y']                      # not used below
    waypoints = params['waypoints']
    closest_waypoints = params['closest_waypoints']
    heading = params['heading']
    progress = params['progress']

    # Progress-based reward (note: overwritten by the reset to 1 just below)
    if 5 <= progress <= 10:
        reward = 1
    elif progress <= 20:
        reward = 2.7
    elif progress <= 30:
        reward = 7.3
    elif progress <= 40:
        reward = 20
    elif progress <= 50:
        reward = 54
    elif progress <= 60:
        reward = 148
    elif progress <= 70:
        reward = 403
    elif progress <= 80:
        reward = 1096
    elif progress <= 90:
        reward = 2981

    reward = 1

    # Calculate the direction of the center line based on the closest waypoints
    next_point = waypoints[closest_waypoints[1]]
    prev_point = waypoints[closest_waypoints[0]]

    # Calculate the track direction with arctan2(dy, dx); the result is in (-pi, pi) radians
    track_direction = math.atan2(next_point[1] - prev_point[1],
                                 next_point[0] - prev_point[0])
    # Convert to degrees
    track_direction = math.degrees(track_direction)

    # Calculate the difference between the track direction and the heading of the car
    direction_diff = abs(track_direction - heading)

    # Penalize the reward if the difference is too large
    DIRECTION_THRESHOLD = 10.0
    if direction_diff > DIRECTION_THRESHOLD:
        reward *= 0.5

    # Reward the car for being fast on straight lines (based on steering)
    max_speed = 6
    speed_1 = 0.8 * max_speed
    speed_2 = 0.5 * max_speed
    speed_3 = 0.25 * max_speed
    low_speed = 0.15 * max_speed

    if steering <= 7:
        if speed_1 <= speed:
            reward = 1
        elif speed_2 <= speed:
            reward = 0.8
        elif speed_3 <= speed:
            reward = 0.5
        elif speed <= low_speed:
            reward = -0.3
    elif 7 < steering <= 21:
        if speed_1 <= speed:
            reward = 0.6
        elif speed_2 <= speed:
            reward = 0.7
        elif speed_3 <= speed:
            reward = 0.6
        elif speed <= low_speed:
            reward = -0.2
    elif 21 < steering:
        if speed_1 <= speed:
            reward = 0.2
        elif speed_2 <= speed:
            reward = 0.3
        elif speed_3 <= speed:
            reward = 0.5
        elif speed <= low_speed:
            reward = -0.1

    # Calculate 9 markers at varying distances away from the center line
    marker_1 = 1 / 10 * track_width
    marker_2 = 2 / 10 * track_width
    marker_3 = 3 / 10 * track_width
    marker_4 = 4 / 10 * track_width
    marker_5 = 5 / 10 * track_width
    marker_6 = 6 / 10 * track_width
    marker_7 = 7 / 10 * track_width
    marker_8 = 8 / 10 * track_width
    marker_9 = 9 / 10 * track_width

    # Give a higher reward if the car is closer to the center line, and vice versa
    if steering < 5:
        if distance_from_center <= marker_1:
            reward = 3.0
        elif distance_from_center <= marker_2:
            reward = 2.5
        elif distance_from_center <= marker_3:
            reward = 2
        elif distance_from_center <= marker_4:
            reward = 1.7
        elif distance_from_center <= marker_5:
            reward = 1.2
        elif distance_from_center <= marker_6:
            reward = 0.8
        elif distance_from_center <= marker_7:
            reward = 0.6
        elif distance_from_center <= marker_8:
            reward = 0.4
        elif distance_from_center <= marker_9:
            reward = 0.01
        else:
            reward = 1e-3  # likely crashed / close to off track
    elif steering < 15:
        if distance_from_center <= marker_1:
            reward = 1
        elif distance_from_center <= marker_2:
            reward = 1.5
        elif distance_from_center <= marker_3:
            reward = 1
        elif distance_from_center <= marker_4:
            reward = 0.9
        elif distance_from_center <= marker_5:
            reward = 0.7
        elif distance_from_center <= marker_6:
            reward = 0.5
        elif distance_from_center <= marker_7:
            reward = 0.3
        elif distance_from_center <= marker_8:
            reward = 0.1
        elif distance_from_center <= marker_9:
            reward = 0.05
        else:
            reward = 1e-3  # likely crashed / close to off track
    elif steering < 25:
        if distance_from_center <= marker_1:
            reward = 0.3
        elif distance_from_center <= marker_2:
            reward = 0.5
        elif distance_from_center <= marker_3:
            reward = 0.7
        elif distance_from_center <= marker_4:
            reward = 0.9
        elif distance_from_center <= marker_5:
            reward = 1
        elif distance_from_center <= marker_6:
            reward = 1.2
        elif distance_from_center <= marker_7:
            reward = 1.4
        elif distance_from_center <= marker_8:
            reward = 1.5
        elif distance_from_center <= marker_9:
            reward = 1.3
        else:
            reward = 1e-3  # likely crashed / close to off track

    # Give a reward if the car is fully on track, penalize it otherwise
    if all_wheels_on_track and (0.5 * track_width - distance_from_center) >= 0.05:
        reward = 2.0
    else:
        reward = -1

    if steering > 23:
        reward *= 0.90

    return float(reward)
```
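Before uploading a reward function, it can be handy to sanity-check it locally by calling it with a hand-made `params` dictionary. The values below are invented for illustration; during training the simulator fills in this dictionary at every step.

```python
# Quick local sanity check of the reward function defined above, with made-up inputs.
sample_params = {
    'track_width': 0.76,
    'distance_from_center': 0.05,
    'all_wheels_on_track': True,
    'speed': 2.5,
    'steering_angle': 3.0,
    'x': 1.2,
    'y': 0.4,
    'waypoints': [(0.0, 0.0), (1.0, 0.0), (2.0, 0.1)],
    'closest_waypoints': [1, 2],
    'heading': 4.0,
    'progress': 42.0,
}

print(reward_function(sample_params))
```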
- Hyperparameters:
They are external to the car itself: they control the way we let the car learn, and they heavily change its training behavior.
Hyperparameters are the Achilles' heel of Reinforcement Learning because there is no easy way to make them adapt as training goes on. They determine, for example, whether the model prefers to maximize short-term or long-term reward, and whether it explores many paths or sticks to a steady one. Keep in mind that tuning them requires trial and error. Here are the hyperparameters I used for my model (a sketch of what they mean in practice follows the table):
Hyperparameter | Description | Default value |
---|---|---|
Gradient descent batch size | Affects computation speed: a smaller batch means less computation per update but more noise in the calculation (less precision in learning), and vice versa. | 64 |
Entropy | How much your model explores the environment randomly. | 0.01 |
Discount factor | A high value means your model looks for long-term rewards, and vice versa. | 0.999 |
Loss type | Switching from one loss type to the other can help when dealing with convergence problems. | Huber |
Learning rate | How much your model learns from the experience it gathered while training. | 0.0003 |
Number of experience episodes between each policy-updating iteration | Every x episodes, your model learns from what it did. | 20 |
Number of epochs | The number of times the whole training data is shown to the network while training. | 10 |
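As a rough sketch of what these values mean in practice, the snippet below writes them as a plain Python dictionary (the key names are descriptive labels for this article, not the exact identifiers used by the service) and illustrates what the discount factor does to future rewards.

```python
# Hyperparameters used for this model, as a plain Python dict.
# Key names are descriptive labels, not the service's internal identifiers.
hyperparameters = {
    'gradient_descent_batch_size': 64,
    'entropy': 0.01,
    'discount_factor': 0.999,
    'loss_type': 'huber',
    'learning_rate': 0.0003,
    'episodes_between_policy_updates': 20,
    'number_of_epochs': 10,
}

# What the discount factor does: future rewards are weighted by gamma**t,
# so a gamma close to 1 makes the agent care about long-term rewards.
def discounted_return(rewards, gamma):
    return sum(r * gamma**t for t, r in enumerate(rewards))

rewards = [1.0] * 100                      # 100 steps, reward 1 at each step
print(discounted_return(rewards, 0.999))   # ~95.2: almost the full sum counts
print(discounted_return(rewards, 0.5))     # ~2.0: only the next few steps count
```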
The training
Once all the parameters and the reward function are entered, we can begin training. Training happens in an instance where AWS RoboMaker creates the environment for the virtual car and Amazon SageMaker takes care of all the computing, turning the car's inputs (video frame by frame, speed, position, steering, etc.) into outputs from the action space described previously. This goes on until the car leaves the track; we call that an episode. Each episode is broken into smaller pieces, which are then sent to the model to analyze. The model updates itself and learns little by little, based on the reward it received in a given state.
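Concretely, the "smaller pieces" of an episode are experience tuples: what the car saw, what it did, what reward it got, and what it saw next. The structure below is an illustration of that idea, not the exact internal format used by SageMaker.

```python
from collections import namedtuple

# One "small piece" of an episode (illustrative structure).
Experience = namedtuple('Experience', ['state', 'action', 'reward', 'next_state'])

def split_episode(states, actions, rewards):
    """Turn one episode (len(states) == len(actions) + 1) into experience tuples."""
    return [
        Experience(states[t], actions[t], rewards[t], states[t + 1])
        for t in range(len(actions))
    ]

# Simplified demo: real states are camera frames, speed, heading, and so on.
states = ['s0', 's1', 's2']
actions = ['a0', 'a1']
rewards = [0.5, 1.0]
print(split_episode(states, actions, rewards))
```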
The results
And this is it! You have a working virtual car that is trained for racing, or at least supposed to be. With around 3 to 4 hours of training on this model, I got 230th place out of 650 participants 😯! That isn't too bad for a beginner. Here are the results for this season:
Running the car
An article will be entirely dedicated to this step.
The AWS Backend
Let us be a bit more precise about how data flows between the AWS products.
First, Amazon SageMaker creates an initial model; this model tells the car what to do.
The model is then sent to Amazon S3, a storage service, which passes a copy to AWS RoboMaker, where Gazebo runs the simulation. AWS RoboMaker simulates the physics involved, from the wheels and the camera's field of view to friction and acceleration. The experience data collected there is stored in Redis, in the form of episodes broken into smaller pieces, each associating a state, an action, a reward and a new state. Amazon SageMaker uses this experience to improve the model and uploads the updated version to Amazon S3. This iterative process goes on until the training is stopped.
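The whole loop can be summarized as pseudocode. Every argument below is a placeholder standing in for an AWS component, just to make the order of operations explicit.

```python
# Pseudocode outline of the iterative training loop described above.
# Each argument is a placeholder standing in for a real AWS component.
def training_loop(sagemaker, s3, robomaker, redis, training_stopped):
    model = sagemaker.create_initial_model()          # Amazon SageMaker
    while not training_stopped():
        s3.store(model)                               # Amazon S3 holds the current model
        episodes = robomaker.simulate(model)          # AWS RoboMaker + Gazebo drive the car
        redis.store(episodes)                         # experience buffered in Redis
        experience = redis.read()
        model = sagemaker.update(model, experience)   # policy improved with PPO
    return model                                      # final trained model
```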
Next, I will dive deeper into the flow of data, the architecture and the maths behind PPO, the algorithm used by DeepRacer.
Eventually, I will conclude with an article about real-life racing.
Sources:
- Course on AWS Training and Certification: AWS DeepRacer: Driven by Reinforcement Learning
- What are Hyperparameters? and How to Tune the Hyperparameters in a Deep Neural Network?