Pedestrian Intention Prediction
Brief project description:
A collaborative research project between Volvo Cars
and student teams from UC Berkeley and Chalmers University.
The project aimed to predict whether a pedestrian intends to cross
the road or not. It was a successful attempt to emulate how a human driver
guesses the intentions of fellow road users such as pedestrians. Our team
based the initial approach on a research paper that had made some progress
in this line of work. After replicating the paper, our teams decided to
experiment with additional approaches and eventually arrived at Fusion-based
Intention Prediction Networks.
The first model we built was based on the research paper.
It consisted of three components: an object detector, an object tracker,
and a classifier. The model indicated each pedestrian's predicted intention by
drawing a bounding box around the pedestrian: red for crossing, green for not crossing.
For the ego vehicle, only the intentions of pedestrians in its immediate
vicinity or in its path matter; other pedestrians in the scene are not
important (hence their bounding boxes should be green for most of the scene
duration). The model's input is the live video feed from the front camera.
The feed is split into frames, and the object detector (YOLOv3) localizes the
objects of interest (pedestrians, in our case) in each frame by drawing bounding
boxes around them. The detected pedestrians are then tracked across frames with
an object tracker (SORT), which assigns a unique ID to each pedestrian in the
video. Finally, the detected and tracked pedestrians from each frame are gathered
as features for the 3D scene classifier (DenseNet). The classifier uses the
information about each pedestrian in the video to predict their intention to
cross, ultimately producing a green bounding box for each non-crossing pedestrian
and a red bounding box for each crossing pedestrian.
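The detector, tracker, and classifier stages described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the class IntentionPipeline, the callables detect, track, and classify, and the sequence length SEQ_LEN are hypothetical stand-ins for YOLOv3, SORT, and the DenseNet classifier. Showing a green box until enough frames have been collected for a track is also an assumption.

```python
from collections import defaultdict, deque

# Assumed number of consecutive crops the classifier looks at per pedestrian.
SEQ_LEN = 16

# Bounding-box colors in BGR order (the convention used by OpenCV).
RED, GREEN = (0, 0, 255), (0, 255, 0)


def box_color(will_cross):
    """Red for a pedestrian predicted to cross, green otherwise."""
    return RED if will_cross else GREEN


def crop(frame, box):
    """Cut the box (x1, y1, x2, y2) out of a frame stored as a list of rows."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in frame[y1:y2]]


class IntentionPipeline:
    """Chains detection, tracking, and per-track classification.

    detect(frame)   -> list of boxes              (stand-in for YOLOv3)
    track(boxes)    -> list of (track_id, box)    (stand-in for SORT)
    classify(crops) -> True if crossing           (stand-in for DenseNet)
    """

    def __init__(self, detect, track, classify, seq_len=SEQ_LEN):
        self.detect = detect
        self.track = track
        self.classify = classify
        self.seq_len = seq_len
        # One fixed-length buffer of recent crops per tracked pedestrian.
        self.buffers = defaultdict(lambda: deque(maxlen=seq_len))

    def step(self, frame):
        """Process one frame; return (track_id, box, color) per pedestrian."""
        results = []
        for track_id, box in self.track(self.detect(frame)):
            buf = self.buffers[track_id]
            buf.append(crop(frame, box))
            # Assumption: stay green until a full sequence is available.
            ready = len(buf) == self.seq_len
            will_cross = bool(ready and self.classify(list(buf)))
            results.append((track_id, box, box_color(will_cross)))
        return results
```

Keeping one buffer per track ID is what lets the classifier see the same pedestrian over time rather than unrelated crops from a single frame.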
The challenge here was to predict the intention at least
0.5 seconds before the intended action occurred, a lead time that, from
a safety perspective, the car needs in order to make any necessary decision.
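To make the 0.5-second requirement concrete, the lead time can be converted into a number of video frames. The sketch below assumes a camera frame rate of 30 fps purely for illustration (the frame rate is not stated here), and the function names are hypothetical.

```python
import math

LEAD_TIME_S = 0.5  # required lead time, taken from the text


def lead_frames(fps, lead_time_s=LEAD_TIME_S):
    """Number of whole frames that cover the required lead time."""
    return math.ceil(fps * lead_time_s)


def latest_prediction_frame(event_frame, fps):
    """Last frame index at which a prediction still meets the lead time."""
    return event_frame - lead_frames(fps)


def is_early_enough(prediction_frame, event_frame, fps):
    """True if the prediction came at least LEAD_TIME_S before the event."""
    return prediction_frame <= latest_prediction_frame(event_frame, fps)


# At an assumed 30 fps, 0.5 s corresponds to 15 frames, so a crossing that
# starts at frame 120 must be predicted by frame 105 at the latest.
print(lead_frames(30))                   # 15
print(latest_prediction_frame(120, 30))  # 105
```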
Our teams took this basic framework and experimented with
alternatives for each component of the model. For detection and tracking,
we tried SORT, DeepSORT, and pose-estimation-based detection and tracking.
For the classifier, we tried recurrent neural networks, random forests,
and DenseNets. Finally, we tried feature engineering for the DenseNet
classifier, which initially used only the cropped pedestrian images as
features. In this feature engineering step we added skeletons for the pedestrians
using a pose estimator and retrained the classifier, which improved the average
precision (AP) score to 0.89, exceeding the result of the research paper that we
started with. Thus, with this new classifier and the additional feature engineering,
we arrived at a better model for intent prediction, one that predicts the intention
0.5 seconds before the intended action occurs.
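One possible reading of the skeleton feature is that the pose estimator's keypoints are drawn onto each cropped pedestrian image before it is passed to the classifier. The sketch below illustrates that idea with NumPy only. The function draw_skeleton, the five-joint LIMBS layout, and the white limb color are all hypothetical; the actual pose estimator, joint set, and rendering used in the project are not specified here.

```python
import numpy as np

# Hypothetical limb list: pairs of keypoint indices to connect.
# Real pose estimators typically output more joints than this.
LIMBS = [(0, 1), (1, 2), (1, 3), (1, 4)]


def draw_skeleton(crop, keypoints, limbs=LIMBS, color=(255, 255, 255)):
    """Return a copy of `crop` with skeleton limbs drawn on it.

    crop:      H x W x 3 uint8 image of one pedestrian
    keypoints: K x 2 array of (x, y) positions in crop pixel coordinates
    """
    out = crop.copy()
    h, w = out.shape[:2]
    for a, b in limbs:
        (xa, ya), (xb, yb) = keypoints[a], keypoints[b]
        # Sample enough points along the limb to leave no gaps.
        n = int(max(abs(xb - xa), abs(yb - ya))) + 1
        xs = np.rint(np.linspace(xa, xb, n)).astype(int)
        ys = np.rint(np.linspace(ya, yb, n)).astype(int)
        # Drop points that fall outside the crop.
        keep = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
        out[ys[keep], xs[keep]] = color
    return out


# Usage: an empty 8 x 8 crop with a three-joint, two-limb skeleton.
crop = np.zeros((8, 8, 3), dtype=np.uint8)
joints = np.array([[1.0, 1.0], [1.0, 5.0], [5.0, 5.0]])
with_skeleton = draw_skeleton(crop, joints, limbs=[(0, 1), (1, 2)])
```

Drawing the skeleton into the image keeps the classifier's input shape unchanged, which is one reason this variant is convenient for retraining an existing image classifier.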