Driving Behavior Analysis — A Driver’s ‘DNA’

Jun 18, 2021

Often, driving down the highway, you will notice a vehicle zooming past you, disregarding traffic rules and cutting lanes. Commercial trucks constitute a large part of the problem and, ironically, these even carry a sign saying, “How am I driving?”. It’s almost comical when you consider that the companies running these trucks don't even know how their drivers perform on the road. Wouldn't it be better if these companies could keep track of their drivers' behavior? Or if you could keep track of your own driver’s driving? Or if you knew how well your children were driving?

To gauge public opinion, we carried out a survey of simple questions about driving behaviors to establish the relevance of this project. We had over 83 responses, with ages ranging from 18 to 61. Before moving on, here are some of our findings:

The results motivated us to pursue our project. My partner and I decided to take MIT’s research on driving behavior analysis as a template and pursue this as our final year project. We wanted to work on a project capable of saving lives, and what better way to save lives than to improve people's driving and minimize accidents? The need to observe and analyze an individual’s driving behavior is not limited to logistics companies. With the vast increase in population and the substantial development of road networks in recent years, the number of vehicles on the road has increased dramatically. This also translates into an escalation of road safety issues, including car accidents. Most of these accidents are caused by human error, including but not limited to reckless driving, driving under the influence, and sleep deprivation. Characterizing driving behavior would not only be useful for accident prevention but would also allow for better fleet management by companies and inform the issuance and renewal of driving licenses.

Modern vehicles are equipped with several electronic control units (ECUs), which communicate with one another using the controller area network (CAN) protocol. These ECUs provide real-time data on several variables, including but not limited to engine revolutions per minute (RPM), fuel consumption, and distance covered. This data is accessible through the on-board diagnostics (OBD) port. By recording this real-time data from the different ECUs and integrating it with other information we collect, including but not limited to Global Positioning System (GPS) position and linear acceleration, we propose to predict and understand human driving behavior. With the help of machine learning algorithms, we can describe the complexity of driving behavior through a set of individual, easy-to-understand quantities, thereby mapping the ‘Driver’s DNA’. These quantities enable almost real-time analysis of a person's driving. Representing this analysis through a user-friendly graphical interface enables users to evaluate their own driving behavior.

To start off, we made a block diagram of what we wanted to achieve: a simple map that we could use as a template over the following months to attain our goal.

Before moving on with our solution, we set out the following goals:

The system should be budget-friendly
Use unsupervised machine learning algorithms to minimize any bias
User-friendly graphical interface
Ease of installation and use
Display a ‘Driver DNA’ profile per trip
Open-source format for future
Monitor trip details via a route map

The diagram represents our system-level design

Data Collection:

To obtain comprehensive, real-time data, we had several choices of sensors to collect and store data. To collect engine data directly from the vehicle's CAN bus, we decided to use an OBD module that decodes signals from various sensors and transmits live data to our receiving device over Bluetooth.

We also had the option of using a wired OBD module; however, to maximize driver comfort, the Bluetooth module was preferred. During the initial stages of our project, we decided to use a smartphone not only to access and store this vehicular data but also to collect other types of driving data, such as GPS position, accelerometer data, and a video of the vehicle's trip. This was done by mounting our smartphone on the dashboard and syncing it with our OBD-II module. To make data collection consistent and automate the process, we also have the option of integrating a Raspberry Pi into our data collection ecosystem later on. Data collected for each trip is stored locally and then transferred to our processing system as required.

Initially, we intended to use various hardware components, including Arduino and Raspberry Pi; however, after rigorous testing, we agreed to make the data collection process as minimal and easy to use as possible. For this reason, we dropped the use of extensive hardware. Our data collection setup needs only two components:

OBD-II Bluetooth Module
Android Mobile Device

To collect the data for a single trip, the OBD module is connected to the port under the steering wheel of the vehicle. The module powers on automatically, drawing 5–12 V depending on the car type.

The second part of our hardware setup is the Android smartphone. The mobile device is connected to the OBD module via Bluetooth and is then placed at the front of the vehicle's dashboard so that it receives optimal GPS signals.

With this setup done, our data collection protocol can be engaged. For this project we decided to use third-party applications installed on our Android smartphone, as developing an Android application was out of the scope of our project. The applications we used for data collection are:

After installation of these Android applications, our hardware setup for data collection is complete. The remaining part of our project implementation is done in software, involving various algorithms ranging from preprocessing to unsupervised machine learning.

At the start of the trip, the hardware is connected, synced, and triggered. Throughout the trip, live vehicular data is recorded at a frequency of 1 Hz. Third-party applications continuously log this data and store it locally on the smartphone. After the ride ends and the applications are stopped, CSV files of the recorded data are transferred to our server on Google Drive.

The overall process for seamless data collection

Two CSV files are collected for each trip, consisting of OBD data from the vehicle and GPS data recorded on the smartphone. The data collected from OBD is inconsistent and contains various errors, including issues with sampling and logging the live data. Thus, preprocessing this raw data is of utmost importance to build our dataset and streamline our results. For this reason, the raw CSV files are downloaded to our workstation, where they are sorted per trip according to the driver. The file sizes range from 100 kB to 5 MB, making them easy to handle and upload to our online drive for further processing.

The following image roughly depicts the raw data recorded on these files:

The scoring algorithms we ran were trained on a dataset that we collected ourselves. Through the data collection process defined above, we managed to collect over 42 different trips, amounting to over 26,140 seconds of data. Although the pandemic limited our data collection prospects, we still managed to collect 430+ minutes of data from 6 different drivers, with 4–6 unique data signals per trip. Since the local industry does not follow uniform sensor standards, a few of our trips with certain vehicles lack certain signals, such as throttle position and fuel used. For this reason, we further divided our dataset into two parts: one containing trips with all of the sensors we use, and the other containing trips that only have the compulsory signals (RPM, speed, acceleration).

Pre-processing:

A Python script that we developed then reads the raw CSV data and stores it in a data frame. We used several Python libraries for this task, including pandas, NumPy, and scikit-learn. The important columns of data are then copied to another temporary data frame. The columns include:

Vehicle RPM
Speed (km/h)
Acceleration (m/s²)
Throttle position (%)
Fuel used (L)

This data frame is then checked for any NULL values and cleaned.

To make all the data consistent, the program then determines the point from which the driver started their trip. All rows containing data before this point are discarded. The instantaneous fuel rate is then calculated by a custom function that compares the amount of fuel used between consecutive data points and computes the rate to the nearest second. Other custom functions are then used to calculate the exact time of each row and the time at which that row was logged relative to the start of the trip. Using the calculated time, a resampling and reshaping function reshapes the data frame and resamples the entire data at 1 Hz, consistent with our requirements. Note that not all vehicles contain the same sensors; therefore, the program contains exceptions for trip data that lacks the throttle position and fuel used parameters. The following figure shows the overall process:

The final data frame is then stored as a CSV file containing all the required signals for our next step, scoring. Similarly, GPS data is stored as a separate CSV file. The final data frame is shown below:
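As a rough illustration of the pre-processing steps described in this section, the sketch below cleans one raw log and resamples it to 1 Hz with pandas. It is not the project's actual script: the column names (`timestamp`, `speed_kmh`, `fuel_used_l`) are hypothetical, and treating the first non-zero speed sample as the start of the trip is an assumption.

```python
import pandas as pd


def preprocess_trip(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean one raw OBD log and resample it to 1 Hz.

    Column names (timestamp, speed_kmh, fuel_used_l) are hypothetical,
    and all signal columns are assumed to be numeric.
    """
    df = raw.dropna().copy()  # drop rows containing NULL values
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df = df.sort_values("timestamp").reset_index(drop=True)

    # Assumption: the trip starts at the first sample with non-zero speed.
    moving = df["speed_kmh"] > 0
    if not moving.any():
        raise ValueError("no movement found in this log")
    df = df.loc[moving.idxmax():]

    # Put every signal on a regular 1-second grid, filling gaps linearly.
    df = (
        df.set_index("timestamp")
        .resample(pd.Timedelta(seconds=1))
        .mean()
        .interpolate()
        .reset_index()
    )

    # Seconds elapsed since the start of the trip.
    df["time_s"] = (df["timestamp"] - df["timestamp"].iloc[0]).dt.total_seconds()

    # Instantaneous fuel rate (L/s) from the cumulative fuel-used signal;
    # skipped for vehicles that do not report fuel used.
    if "fuel_used_l" in df.columns:
        df["fuel_rate_lps"] = df["fuel_used_l"].diff().fillna(0.0)
    return df
```

Because the grid is exactly one second wide, the difference between consecutive cumulative fuel readings is already a rate in litres per second.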

Data Processing:

Training the model requires making clusters of each of the features extracted from each signal, separately. To do this, we first have to get the signals from each trip’s processed CSV file. We do this with the help of two Python libraries, os and pandas. Using the os library, we get a list of all the files present in the train directory. Then, using a for loop, we access each CSV file separately and store it as a data frame using pandas. Each column in the data frame represents a separate signal for the trip. We extract the columns (signals) from the data frame separately and append them vertically to a list, making a 2-dimensional array where each row corresponds to a different trip. Doing this gives us 7 2-D lists, one for each signal, where each row contains data for a single trip in the train directory. The signals extracted from each trip are:

Speed (km/h)
RPM
Acceleration
Deceleration
Throttle Position (%)
Instantaneous Fuel Consumption (L/s)
Time (s)
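The loading loop described above could look roughly like the sketch below. The column names in `SIGNALS` and the function name `load_trips` are hypothetical, chosen only for illustration; trips from vehicles that lack a sensor simply get an empty row for that signal.

```python
import os

import pandas as pd

# Hypothetical column names for the 7 signals listed above.
SIGNALS = [
    "speed_kmh", "rpm", "acceleration", "deceleration",
    "throttle_pct", "fuel_rate_lps", "time_s",
]


def load_trips(train_dir):
    """Return {signal name: 2-D list}, one row per trip CSV in train_dir."""
    signals = {name: [] for name in SIGNALS}
    for filename in sorted(os.listdir(train_dir)):
        if not filename.endswith(".csv"):
            continue
        trip = pd.read_csv(os.path.join(train_dir, filename))
        for name in SIGNALS:
            # Vehicles without a given sensor contribute an empty row.
            values = trip[name].tolist() if name in trip.columns else []
            signals[name].append(values)
    return signals
```

Sorting the file names keeps the row order stable between runs, so row `i` of every signal list always refers to the same trip.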

Once we have the 2-D lists for all signals, we extract 7 features from each signal except time, which is instead used in extracting these 7 features from the other signals. The 7 features extracted from each signal are:

Before heading forward with the machine learning algorithm, we keep only the data between the 2nd and 98th percentiles in each of the 7 feature 2-D lists. Doing this further reduces any inconsistencies introduced during data acquisition.

We have now extracted 7 features from each of the 6 signals of each trip, and have 7 2-D lists per signal, one for every feature, where every row corresponds to a different trip in the train dataset.
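A percentile filter of this kind can be sketched with NumPy as follows. Whether the percentiles are computed per trip or over all trips pooled together is not stated here, so the helper below works on whatever array it is given; the function name is hypothetical.

```python
import numpy as np


def keep_central_percentiles(values, low=2.0, high=98.0):
    """Keep only the values lying between the low and high percentiles."""
    arr = np.asarray(values, dtype=float)
    lo, hi = np.percentile(arr, [low, high])
    return arr[(arr >= lo) & (arr <= hi)]
```

For example, applied to the integers 0 to 100, the filter drops 0, 1, 99, and 100 and keeps the 97 values from 2 to 98.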

We now move to machine learning, for which we opted for an unsupervised approach. Driving habits vary from person to person, and there is no set benchmark against which to compare the driving habits of one individual with another, except in extreme cases; we also did not want any biases creeping into our final results. We were also not given any prior information about the drivers or the aptitude with which they normally drive. For these reasons we went for unsupervised machine learning, using K-Means clustering.

The number of clusters was justified using the v-measure score on the train and validation datasets, which was done as part of SPROJ-1. A v-measure score of 1 tells us that the train clusters and validation clusters are highly similar, and a score of 0 tells us they are highly dissimilar.

We proposed that 5 clusters should be made out of every feature, and the v-measure score we obtained lets us justify that this number of clusters will not adversely affect our dataset. The K-Means clustering algorithm takes the number of clusters as input, so we first had to justify this number.
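One way such a check could be set up with scikit-learn is sketched below. The exact procedure used in SPROJ-1 is not described here, so this is only an assumption: one K-Means model is fitted on the train values and another on the validation values, both label the same validation points, and `v_measure_score` compares the two partitions. The function name `cluster_agreement` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score


def cluster_agreement(train_values, val_values, n_clusters=5, seed=0):
    """V-measure between train-fitted and validation-fitted clusterings."""
    train = np.asarray(train_values, dtype=float).reshape(-1, 1)
    val = np.asarray(val_values, dtype=float).reshape(-1, 1)
    km_train = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(train)
    km_val = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(val)
    # Both models label the same validation points; the v-measure is
    # invariant to how the cluster ids happen to be numbered.
    return v_measure_score(km_train.predict(val), km_val.predict(val))
```

Repeating the call for several candidate values of `n_clusters` would show how stable each choice is between the two splits.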

The K-Means clustering algorithm also works very well with a dataset that is not well separated, which is the case for ours. It works by first taking the number of clusters as input and then initializing the center of each cluster randomly. Each data point is then assigned to the cluster whose center is closest. After assignment, the cluster centers are recomputed, and the whole procedure repeats until the centers stop changing.
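The loop described above can be written out in a few lines of NumPy for the one-dimensional case. This is a teaching sketch of the standard procedure (Lloyd's algorithm), not the implementation used in the project; the function name and parameters are hypothetical.

```python
import numpy as np


def kmeans_1d(values, k, n_iter=100, seed=0):
    """Plain K-Means on 1-D data; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float)
    # Random initialization: pick k distinct data points as centers.
    centers = rng.choice(x, size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            x[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    return centers, labels
```

The procedure converges to a local optimum that depends on the random start, which is why library implementations usually repeat it from several initializations and keep the best result.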

In our work we set the number of clusters to 5, assigning each cluster an ordinal number from 1–5 representing that cluster's synthetic score. A higher score is assigned to clusters with extreme values, while a lower score is assigned to clusters with low values.

We performed K-Means clustering on each of the 7 features of each signal separately; in total, 42 clustering algorithms were run simultaneously, each in one dimension (ℝ¹). After clustering, we are left with 42 different models: for example, a model for the first derivative of the RPM signal, a model for the first derivative of the speed signal, and so on.
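A single one of these 42 models, together with its 1–5 score mapping, might be built as in the sketch below. It assumes the feature values of all training trips are pooled into one array, and that clusters are ranked by their center so that the lowest center gets score 1 and the highest gets score 5; the function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def fit_scoring_model(feature_values, n_clusters=5, seed=0):
    """Fit 1-D K-Means and map each cluster id to an ordinal score."""
    x = np.asarray(feature_values, dtype=float).reshape(-1, 1)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(x)
    # Rank clusters by center: lowest center -> 1, highest -> n_clusters.
    order = np.argsort(model.cluster_centers_.ravel())
    score_of_cluster = np.empty(n_clusters, dtype=int)
    score_of_cluster[order] = np.arange(1, n_clusters + 1)
    return model, score_of_cluster


def score_values(model, score_of_cluster, values):
    """Score new feature values (for example from a test trip) from 1 to 5."""
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    return score_of_cluster[model.predict(x)]
```

Calling `fit_scoring_model` once per (signal, feature) pair and storing the results in a dictionary keyed by that pair would give the full set of 42 models.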

After the models are created, we extract the same 7 features from the test trip, in the same manner as for the training dataset. The difference is that, for the test trip, each feature is only a 1-D array, as only one trip is tested at a time. This gives us a total of 42 test feature arrays. After the test trip features are extracted, we move on to scoring. We first predict which cluster each element in each of the 7 feature arrays belongs to, using each feature's respective model; i.e., we use the speed maxima model to predict the speed maxima feature clusters. Doing this gives us 7 arrays per signal, where each element represents the score, ranging from 1–5, of the corresponding element in the feature array.

Doing this for each signal gives us a total of 42 arrays of scores, one for each of the 42 test features. We then take the mean of each array to get an average score for that feature over the trip, giving us 7 single values for each signal. Taking a weighted average of these values gives us a final signal score using the formula:

where w1 to w7 represent the weights for each feature's final score. Determining these weights was out of the scope of our project, so we set the weights for all of the features to 1. To get the optimal weights, we would have to employ a supervised learning approach along with another dataset and different testing environments. However, we also use other statistical approaches to reach a final score, such as the median and mode, to give a more comprehensive look into an individual's driving habits.
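The step from per-element scores to one signal score can be sketched as below: each feature's scores are averaged over the trip, and the 7 feature averages are then combined as a weighted mean (the sum of weight times score, divided by the sum of the weights), or alternatively by median or mode. The function names are hypothetical, and rounding to whole scores before taking the mode is an assumption made for this example.

```python
import numpy as np
from statistics import mode


def combine_scores(scores, weights=None, method="mean"):
    """Combine several scores into one by weighted mean, median, or mode."""
    s = np.asarray(scores, dtype=float)
    if method == "mean":
        # All weights default to 1, as in the project.
        w = np.ones_like(s) if weights is None else np.asarray(weights, dtype=float)
        return float(np.sum(w * s) / np.sum(w))
    if method == "median":
        return float(np.median(s))
    if method == "mode":
        # Assumption: round to whole scores so that a mode exists.
        return float(mode(np.rint(s).astype(int).tolist()))
    raise ValueError(f"unknown method: {method}")


def signal_score(element_scores_per_feature, weights=None, method="mean"):
    """element_scores_per_feature: 7 arrays of 1-5 scores, one per feature."""
    feature_means = [float(np.mean(a)) for a in element_scores_per_feature]
    return combine_scores(feature_means, weights, method)
```

With all weights equal to 1 the weighted mean reduces to the ordinary mean of the 7 feature scores.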

Finally, we give a score to each of the 4 proposed driving DNA dimensions, namely: aggressiveness, cautiousness, safeness, and energy efficiency.

We do this using the final signal scores. For aggressiveness, we use the acceleration, throttle, and RPM scores. For safeness, we use the speed and acceleration scores. For cautiousness, we use the deceleration score, and finally, for energy efficiency, we use the instantaneous fuel consumption score. Note that if multiple scores are used for a DNA dimension, we combine them using the same statistical approach selected for the final signal score. We are now left with the final DNA dimension score for all 4 dimensions, ranging from 1–5.
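The mapping from signal scores to DNA dimensions can be expressed as a small lookup table, as in the sketch below. The signal keys and the function name are hypothetical; skipping signals that a vehicle does not report is an assumption based on the missing-sensor handling described earlier.

```python
from statistics import mean

# Which signal scores feed each DNA dimension, as described above.
DIMENSION_SIGNALS = {
    "aggressiveness": ["acceleration", "throttle", "rpm"],
    "safeness": ["speed", "acceleration"],
    "cautiousness": ["deceleration"],
    "energy_efficiency": ["fuel_rate"],
}


def dna_profile(signal_scores, combine=mean):
    """Return {dimension: score} from a {signal: score} dictionary."""
    profile = {}
    for dimension, signals in DIMENSION_SIGNALS.items():
        available = [signal_scores[s] for s in signals if s in signal_scores]
        if available:  # skip dimensions whose signals are all missing
            profile[dimension] = combine(available)
    return profile
```

Passing `statistics.median` or `statistics.mode` as `combine` switches the statistical approach without touching the mapping.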

Data Labeling:

Using the data collected through various means, we now have the option of labeling/scoring our data for specific time intervals or events, thereby establishing a ground truth for comparison with our synthetic scores later. We use an unsupervised machine learning algorithm to analyze the driving behavior of different individuals per trip, so our goal in labeling the driving data is to compare the labels with our calculated scores and judge the performance of our algorithms. A camera (in our testing phase, another smartphone) is mounted on the dashboard facing the road to record a video throughout the trip. Using this video along with the data for speed, RPM, and acceleration, we have written a script that displays the video and plots of the data for a specific interval and asks the user to label the data from 1–5 according to the level of aggressiveness, where 5 is the most aggressive and 1 is the least. Labeling is done by various individuals of different age groups, and then a mode is calculated for each trip. This is done to reduce the bias that may come with the process of data labeling. The following figure shows the output of our script.

Since data collection was a time-consuming matter, we initially started off with an online dataset acquired from Kaggle. Our final scoring, however, was done after training our models on the dataset that we collected. The results were closer to what we aimed to achieve, as they were based on a dataset that pertained to our local environment and driving conditions. As defined earlier, our scoring is aimed to be unbiased and unsupervised. Therefore, there are no true results against which to compare our scores for analysis.

As a proof of concept, we performed data labeling on a sample trip. To label the data with as little bias as possible, we contacted our fellow students from LUMS and made a registration portal for the labeling. Eventually, over 11 students labeled our sample trip. As described earlier, people were briefed on the different driving dimensions and what to expect from certain behaviors. Each user was shown a video with live data from the OBD, and the script asked them to label every 15 seconds; it was then left to them to label the data accordingly.

Data from each user was then joined, and a mean was taken. This is considered the labeled score, in which human labelers scored our trip based on their own judgment.

The results from the trip were then compared to the labeled scores as shown here. As seen through the percentage errors, our scores are not only comparable to the labeled scores but also consistent with our previous results. A proper data labeling survey can further compile such scores and potentially verify the accuracy of our scoring methods.
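The comparison between labeled and synthetic scores can be sketched as follows. The layout of the label table (one row per labeler, one column per 15-second interval) and measuring the percentage error relative to the labeled score are assumptions made for this example; the function names are hypothetical.

```python
import numpy as np


def labeled_score(labels_by_person):
    """Mean label over all labelers and all 15-second intervals."""
    labels = np.asarray(labels_by_person, dtype=float)
    per_interval = labels.mean(axis=0)  # average across labelers
    return float(per_interval.mean())   # average across the trip


def percentage_error(synthetic, labeled):
    """Absolute error of the synthetic score, in percent of the labeled score."""
    return abs(synthetic - labeled) / labeled * 100.0
```

For instance, a synthetic aggressiveness score of 3.3 against a labeled score of 3.0 corresponds to an error of about 10%.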

Representation of Analysis:

Through our implementation we were able to calculate driving scores for each individual trip, regardless of the driver. To mimic the ‘Driving DNA’ aspect of our project, our synthetic scores are plotted on a radar graph, allowing users to gauge their performance on a single trip. The script used to draw the radar map uses libraries such as plotly and pandas and is coded dynamically so that further dimensions can be included. To present an extensive ‘DNA’ map to the end user, we classify the scores into different dimensions: Aggressiveness, Safeness, Energy Efficiency, and Cautiousness.
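A radar plot of this kind can be described to plotly as a `scatterpolar` trace. The sketch below builds the figure specification as a plain dictionary, so any number of dimensions can be passed in, and only imports plotly when the chart is actually shown. The function names and styling choices are hypothetical, not taken from the project's script.

```python
def radar_figure(profile, max_score=5):
    """Build a plotly figure specification for a {dimension: score} profile."""
    dims = list(profile)
    r = [profile[d] for d in dims]
    return {
        "data": [{
            "type": "scatterpolar",
            # Repeat the first point so the polygon closes.
            "r": r + r[:1],
            "theta": dims + dims[:1],
            "fill": "toself",
            "name": "Driver DNA",
        }],
        "layout": {
            "polar": {"radialaxis": {"visible": True, "range": [0, max_score]}},
            "showlegend": False,
        },
    }


def show_radar(profile):
    """Render the radar chart (requires plotly to be installed)."""
    import plotly.graph_objects as go  # imported lazily on purpose
    go.Figure(radar_figure(profile)).show()
```

Calling `show_radar({"Aggressiveness": 3, "Safeness": 4, "Energy Efficiency": 2, "Cautiousness": 5})` would open the chart in a browser or notebook.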

Another aspect of our representation is mapping the route of our driver’s trip.

Along with extracting the signal values from the CAN bus, we also used a smartphone to log the location data for each trip at 1 Hz. Location data included longitude and latitude values. Using this data and the Google Maps API, we were able to plot the route taken by the driver for each trip. The image is displayed on our interactive graphical user interface when users select a specific trip.
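Which Google Maps product was used is not specified here. As one possibility, the sketch below assumes the Maps Static API, which returns a route image from a URL containing a `path` parameter. The function name, the downsampling limit, and the line styling are hypothetical, and a valid API key is required for the URL to work.

```python
from urllib.parse import urlencode

STATIC_MAPS_URL = "https://maps.googleapis.com/maps/api/staticmap"


def route_map_url(latitudes, longitudes, api_key, max_points=100, size="640x400"):
    """Build a Maps Static API URL that draws the trip as a path."""
    points = list(zip(latitudes, longitudes))
    if not points:
        raise ValueError("no GPS points to plot")
    # 1 Hz logs are long; thin them out so the URL stays short.
    step = max(1, -(-len(points) // max_points))  # ceiling division
    sampled = points[::step]
    if sampled[-1] != points[-1]:
        sampled.append(points[-1])  # always keep the end of the trip
    path = "color:0x0000ff|weight:4|" + "|".join(
        f"{lat:.6f},{lon:.6f}" for lat, lon in sampled
    )
    return STATIC_MAPS_URL + "?" + urlencode({"size": size, "path": path, "key": api_key})
```

Downsampling matters here because request URLs have a length limit, and a full 1 Hz log would exceed it within a few minutes of driving.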

Our solution placed emphasis on the representation of the scores and data. We wanted to make the representation as fluid and readable as possible. For this reason, we designed and developed a futuristic web application that allows a user to effortlessly access all of their trip data and calculated scores. We developed the entire front end of our web application with test trip data and test cases as a proof of concept. The entire application is coded using Flutter in Visual Studio Code. Our web application comes with a dynamic floating navigation bar and a floating button, and is capable of running on both mobile devices and desktops. The web app contains several pages, the details of which are provided below.

Initial login/welcome screen of the web application, which displays the navigation bar and floating button and takes user input to display that specific driver.

Floating button, which stays at the bottom right corner of the screen to make it easier to go back, check the About Us corner, and look at the project roadmap.

Upon clicking the project roadmap, a pop-up appears containing a flowchart of our project.

As soon as correct details are entered, the user is greeted with another page that contains the data for all their trips along with their respective times. The user can then click on any trip to get more information about it.

When a trip is selected, users can see their route map, plotted from the logged GPS data, along with the option to check their driving scores in three different modes: mean, median, and mode.

Selecting any mode takes users to a page where their trip scores are plotted. They can clearly see how they drove throughout the trip. They also have the option to select the button that says how these scores were calculated.

Detailed image of the pop-up graphs for the user's data during the trip.

This project can be taken as a proof of concept that an individual’s driving behavior can be characterized using only the data acquired through the CAN bus. In the future, multiple steps can be taken to improve and optimize the algorithms and the final results.

Together with my partner Aaqib Rasool, and under the guidance of our instructor Sir Muhammad Tahir, I was able to successfully demonstrate our proposed solution for Driving Behavior Analysis. Not only were we able to complete the entire project in time, but we also exceeded our scope of work.