Pedestrian Intention Prediction
Brief project description:
This was a collaborative research project between Volvo Cars and student teams from UC Berkeley and Chalmers University. Its aim was to predict whether a pedestrian intends to cross the road. The project was a successful attempt to emulate how a human driver guesses the intentions of fellow road users such as pedestrians. Our team based its approach on a research paper that had made some progress in this line of work. After replicating the paper, our teams experimented with additional approaches and eventually arrived at Fusion-based Intention Prediction Networks.

The first model we built followed the research paper. It consisted of three components: an object detector, an object tracker, and a classifier. The model indicated each pedestrian's predicted intention by drawing a colored bounding box around them: red for crossing, green for not crossing. For the ego vehicle, only the intentions of pedestrians in its immediate vicinity or in its path matter. Other pedestrians in the scene are not important, so their bounding boxes should stay green for most of the scene's duration.

The model's input is the live video feed from the front camera. The video is split into frames, and the object detector (YOLOv3) localizes the objects of interest (pedestrians, in our case) in each frame by placing bounding boxes around them. The detected pedestrians are then tracked by an object tracker (SORT), which assigns a unique ID to each pedestrian in the video. Finally, the detected and tracked pedestrians from each frame are gathered as features for the 3D scene classifier (DenseNet). The classifier uses the per-frame information for each pedestrian to predict their intention to cross, resulting in a green bounding box for non-crossing pedestrians and a red bounding box for crossing pedestrians.
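The detect, track, classify flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name IntentPipeline, the callable signatures, the window length, and the choice to show green until enough frames have been collected are all assumptions made for the example. In a real system the three callables would wrap YOLOv3, SORT, and the DenseNet classifier.

```python
from collections import defaultdict, deque
from typing import Any, Callable, Deque, Dict, List, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels
Track = Tuple[int, Box]           # (track_id, box)

RED = (0, 0, 255)     # BGR color for "crossing"
GREEN = (0, 255, 0)   # BGR color for "not crossing"


class IntentPipeline:
    """Hypothetical glue code: detector -> tracker -> per-pedestrian classifier."""

    def __init__(
        self,
        detect: Callable[[Any], List[Box]],                    # stand-in for YOLOv3
        track: Callable[[List[Box]], List[Track]],             # stand-in for SORT
        classify: Callable[[List[Tuple[Any, Box]]], bool],     # stand-in for DenseNet
        window: int = 16,                                      # assumed clip length in frames
    ) -> None:
        self.detect = detect
        self.track = track
        self.classify = classify
        self.window = window
        # One fixed-length history of (frame, box) pairs per tracked pedestrian.
        self.history: Dict[int, Deque[Tuple[Any, Box]]] = defaultdict(
            lambda: deque(maxlen=window)
        )

    def step(self, frame: Any) -> List[Tuple[int, Box, Tuple[int, int, int]]]:
        """Process one frame and return (track_id, box, color) per pedestrian."""
        boxes = self.detect(frame)
        results = []
        for track_id, box in self.track(boxes):
            clip = self.history[track_id]
            clip.append((frame, box))
            if len(clip) < self.window:
                # Assumption: default to green until a full window is available.
                color = GREEN
            else:
                color = RED if self.classify(list(clip)) else GREEN
            results.append((track_id, box, color))
        return results
```

Calling step once per incoming frame yields the box color for each tracked pedestrian. Keeping a separate history per track ID is what lets the classifier see a short clip of one pedestrian rather than a single frame.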
The challenge was to predict the intention at least 0.5 seconds before the intended action occurred, a margin that was required from a safety perspective for the car to make any necessary decision. Our teams took this basic framework and experimented with alternatives for each component of the model. For detection and tracking, these included SORT, DeepSORT, and pose-estimation-based detection and tracking. For the classifier, they included recurrent neural networks, random forests, and DenseNets.

Finally, we tried feature engineering for the DenseNet classifier, which initially used only the cropped pedestrian images as features. We applied skeletons to the pedestrians using a pose estimator and retrained the classifier, which improved the AP score to 0.89, higher than that of the research paper we started from. With this new classifier and the additional feature engineering, we arrived at a better intent prediction model, one that predicts 0.5 seconds before the intended action occurs.
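To make the 0.5-second requirement concrete, the lead time can be converted into a number of frames, which fixes the last frame the classifier is allowed to see before a crossing event. The sketch below is illustrative only: the function names, the 30 fps frame rate used in the example, and the 16-frame window are assumptions, since the camera frame rate and clip length are not stated here.

```python
import math
from typing import Tuple


def lead_frames(fps: float, lead_seconds: float = 0.5) -> int:
    """Number of frames covering the required lead time, rounded up."""
    return math.ceil(fps * lead_seconds)


def observation_window(
    event_frame: int, fps: float, window: int, lead_seconds: float = 0.5
) -> Tuple[int, int]:
    """Return (first, last) frame indices the classifier may observe.

    The window ends at least `lead_seconds` before `event_frame`, the frame
    at which the pedestrian starts the action (for example, begins crossing).
    """
    last = event_frame - lead_frames(fps, lead_seconds)
    first = last - window + 1
    if first < 0:
        raise ValueError("not enough frames before the event for this window")
    return first, last


# Example with assumed values: 30 fps video, 16-frame clips.
# A crossing at frame 100 must be predicted from frames 70 to 85.
first, last = observation_window(event_frame=100, fps=30, window=16)
```

Rounding the lead time up rather than down keeps the margin at or above 0.5 seconds when the frame rate does not divide evenly, for instance 13 frames at 25 fps.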