Detailed Analysis of Chicago Crime Dataset

M. Afnan Mansoor
11 min read · Nov 20, 2020

Introduction:

Urban populations have grown substantially over the past few decades, driven by factors such as greater job opportunities, better facilities, and better education. With this growth, however, crime rates have also risen, and it has become increasingly difficult for law enforcement agencies to control criminal activity; the main reason is the thin spread, or outright lack, of resources. Overcoming this requires intelligent policing: it is no longer only about improving response times, but about predicting crime before it occurs, acting accordingly, and staying prepared.

Data scientists have a big role to play in this process. By drawing inferences from past data, we can help police predict crimes: identifying historical crime patterns and tracking each district's progress (in terms of crime) on a yearly or monthly basis, so that action can be taken in the relevant districts in a timely manner. Such analysis points out the districts where police need to invest more effort, and therefore more resources, and the districts where they need less. For an individual moving to a new place, the same analysis can identify which areas are less safe than others. Altogether, this can help law enforcement agencies maintain order and keep the city free from crime by:

1) Allocating the city's resources where they are needed the most.

2) Predicting the time and place where a crime is most likely to occur.

3) Evaluating policy changes, by comparing crime data from after a change with data from before it.

The Chicago crime dataset is publicly available through the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) initiative. The dataset is particularly intriguing because of its transparency and social relevance. Given that Chicago has one of the highest crime rates in the United States, the dataset is a natural choice for a benchmark analysis of criminal activity and policing.

More about the dataset:

The dataset is regularly updated by the Chicago PD; it was last updated in October 2020. It covers crimes and police performance from 2001 to 2020, and contains 22 columns and approximately 7.2 million rows. Each row is identified by the unique ID and case number of the reported crime.

What each column represents:
  • Beat: the smallest police geographic area; each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district.
  • Community Area: Chicago has 77 community areas.

We now have extensive, detailed data at our disposal to study and analyze. Using EDA and visualization tools, we will find patterns and identify the relevant data columns that we can later feed into different models, including machine learning algorithms. Before that, however, we must move on to data sorting and cleaning.

Data Cleaning:

Let us first focus on cleaning the data. The following code shows the data type of each column in our dataset.
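A minimal sketch of this step (the file name chicago_crimes.csv is a placeholder for the portal export):

```python
import pandas as pd

# Load the portal export (file name is a placeholder)
df = pd.read_csv("chicago_crimes.csv")

# Show the data type pandas inferred for each of the 22 columns
print(df.dtypes)
```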

Using the following code, we can reveal the number of NULL values in the different columns of our dataset.
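A sketch of the null count, continuing from the df loaded above:

```python
# Number of NULL (NaN) values per column, largest first
print(df.isnull().sum().sort_values(ascending=False))
```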

Several columns contain a lot of null data. Ward and Community Area have the highest number of null values, and since we would not use these columns in our analysis anyway, we can remove them. The remaining null rows are few compared to the 7.2 million total rows, so we can discard them to avoid further discrepancies.

The dataset also contains a ‘Date’ column stored as strings. Using the following code, we converted it to a date-time type, which we then manipulated to extract the exact day, month, and year of each incident for use later in the analysis.
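A sketch of the conversion; the format string assumes the portal's usual "MM/DD/YYYY HH:MM:SS AM/PM" timestamps:

```python
# Parse the 'Date' strings into proper datetimes
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y %I:%M:%S %p")

# Derive the parts used for grouping later on
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["Hour"] = df["Date"].dt.hour
```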

EDA:

Exploratory Data Analysis (EDA), sometimes called the art part of Data Science, means understanding the data better and more clearly through visualization. To see how the columns in our dataset are related, and what trends and patterns they hold, we plotted several graphs focusing on the features we plan to use for machine learning.

Before diving into a specific analysis of our data, we can visualize the overall crime trend. We did this by plotting a line graph of the total number of crimes per year from 2001 to 2020. The following code and the resulting graph show the number of cases and its yearly trend.
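A sketch of the yearly trend plot, using the Year column derived above:

```python
import matplotlib.pyplot as plt

# Count reported cases per year and draw the trend line
cases_per_year = df.groupby("Year").size()
cases_per_year.plot(kind="line", marker="o", figsize=(10, 5))
plt.xlabel("Year")
plt.ylabel("Number of reported cases")
plt.title("Total number of cases per year")
plt.show()
```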

Total number of cases per year

Moving on from this generalized visualization, we drew a heat map comparing the number of completed arrests with the number of failed ones. The following code snippet shows how the heatmap was plotted and how the ratio was calculated.
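A sketch of the heatmap, assuming the Arrest column loads as booleans (if it loads as 'true'/'false' strings, map it to booleans first):

```python
import seaborn as sns

# Cross-tabulate arrest outcome (True/False) against year
arrests = pd.crosstab(df["Year"], df["Arrest"])

# Heatmap of completed vs. failed arrests per year
sns.heatmap(arrests, annot=True, fmt="d", cmap="Blues")
plt.title("True vs. False arrests per year")
plt.show()
```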

To better understand the trend in True versus False arrests, we plotted the yearly True/False ratio and made a pie chart showing the overall percentage of True arrests against False arrests over the last 20 years. The following code snippet plots the pie chart.
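A sketch of the pie chart:

```python
# Overall share of True vs. False arrests across the whole dataset
counts = df["Arrest"].value_counts()
counts.rename({True: "True Arrest", False: "False Arrest"}).plot(
    kind="pie", autopct="%1.1f%%", figsize=(6, 6)
)
plt.ylabel("")
plt.title("True vs. False arrests, 2001-2020")
plt.show()
```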

The following code snippet plots the line graph:
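A sketch of the yearly ratio plot, reusing the crosstab from the heatmap step:

```python
# Ratio of True to False arrests for every year
ratio = arrests[True] / arrests[False]
ratio.plot(kind="line", marker="o", figsize=(10, 4))
plt.xlabel("Year")
plt.ylabel("True/False arrest ratio")
plt.show()
```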

From these plots it is clear that there is a large gap between False and True arrests, and while the trend fluctuates, an overall decrease in the ratio can be observed.

As District is one of the key features in our machine learning model, we plotted the size of each district to see whether a district's size is related to the number of crimes reported in it. The following code plots the map showing each district:
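A sketch of the district map using geopandas, assuming a local GeoJSON of police district boundaries (available from the Chicago data portal); the file and column names here are assumptions to check against the actual file:

```python
import geopandas as gpd

# District boundaries (file name assumed)
districts = gpd.read_file("police_districts.geojson")

ax = districts.plot(figsize=(8, 10), color="lightgrey", edgecolor="black")

# Label each district at its centroid ('dist_num' column name assumed)
for _, row in districts.iterrows():
    ax.annotate(row["dist_num"],
                xy=(row.geometry.centroid.x, row.geometry.centroid.y))
plt.title("Chicago police districts")
plt.show()
```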

We can then plot the total cases reported in each district over the last 20 years. The following snippet helps us plot the graph:
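A sketch of the per-district yearly counts:

```python
# One line per district: reported cases per year
per_district = df.groupby(["Year", "District"]).size().unstack("District")
per_district.plot(figsize=(12, 6), legend=False)
plt.xlabel("Year")
plt.ylabel("Number of reported cases")
plt.title("Cases reported per district, 2001-2020")
plt.show()
```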

After plotting this, we observed that Districts 8 and 11 had the most cases reported over the 20 years, with reported cases peaking around 2003. After mid-2011, District 11 overtook District 8 and had the most reported cases through 2020. By contrast, Districts 31 and 20 have remained the safest over the 20 years, showing a huge gap from the rest of the districts.

Comparing the two graphs, District 8's size does appear related to the crimes reported in it, but on closer inspection District 11 is relatively small and District 31 relatively large compared to the rest. So we cannot draw a general conclusion linking the size of a district to the crimes reported in it.

We can now shift our focus to a district-wise analysis of the crime data. First, we plotted a scatterplot of the different types of cases in each district, which gives an idea of which crimes are most prevalent in each district relative to the others. Here is the code we used, followed by the output scatter plot:
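A sketch of the scatterplot, sizing each marker by how often that crime type occurs in that district:

```python
# Count of each primary crime type within each district
type_counts = (
    df.groupby(["District", "Primary Type"]).size().reset_index(name="Count")
)

plt.figure(figsize=(12, 8))
plt.scatter(type_counts["District"], type_counts["Primary Type"],
            s=type_counts["Count"] / 500)  # marker size ~ case count
plt.xlabel("District")
plt.ylabel("Primary Type")
plt.show()
```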

District wise representation for different crimes

Building on this, we can now focus on the most common crime reported in each district, before moving on to visualizing how efficiently the Chicago PD solves different types of crime. The following code and scatter plot show the most prevalent type of crime in each district.
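A sketch that keeps only the single most frequent primary type per district:

```python
# Most frequently reported primary type in each district
most_common = (
    df.groupby(["District", "Primary Type"]).size()
      .reset_index(name="Count")
      .sort_values("Count", ascending=False)
      .drop_duplicates("District")          # keep the top type per district
)

plt.figure(figsize=(12, 5))
plt.scatter(most_common["District"], most_common["Primary Type"])
plt.xlabel("District")
plt.ylabel("Most common Primary Type")
plt.show()
```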

A quick glance shows that although the districts do not all share the same most common crime type, ‘Theft’ leads in most districts, followed by ‘Battery’ and ‘Narcotics’ in some.

Lastly, to further gauge the efficiency of the Chicago PD, it is essential to see which types of crime the police force handles best. This will help us figure out which crimes the Chicago PD should focus more of its resources on. To do this, we plot three bar graphs: the total cases of each primary type, the cases the police excelled at solving, and the cases the police could not solve.

The following code snippets and graphs depict the answer to these questions:
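A sketch of the three bar graphs, taking 'solved' to mean cases that ended in an arrest:

```python
# Case counts overall, for arrests made, and for no arrest made
totals   = df["Primary Type"].value_counts()
solved   = df[df["Arrest"] == True]["Primary Type"].value_counts()
unsolved = df[df["Arrest"] == False]["Primary Type"].value_counts()

for series, title in [
    (totals,   "Total cases according to primary type"),
    (solved,   "Most cases solved per primary type"),
    (unsolved, "Least cases solved per primary type"),
]:
    series.head(15).plot(kind="bar", figsize=(12, 4), title=title)
    plt.ylabel("Number of cases")
    plt.tight_layout()
    plt.show()
```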

Total cases according to primary type
Most cases solved per primary type
Least cases solved per primary type

Looking at the graphs plotted above, we can conclude that ‘Theft’ is the most prevalent primary crime type across almost all districts and is also the least solved. Beyond that, the police have generally been most successful at solving crimes related to ‘Narcotics’. It is therefore evident that more resources and manpower should be focused on theft.

Lastly, we move to a district-wise analysis of the most and least solved cases, shown in the following code snippets and scatterplots. In both plots we can ignore the lower-lying primary types and focus on those lying higher on the y-axis, which represent the most and least solved cases respectively.
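A sketch of the district-wise plots; reading 'most solved' as the type with the most True arrests and 'least solved' as the type with the most False arrests in a district is an assumption:

```python
def top_type_per_district(frame):
    """Primary type with the highest count in each district."""
    counts = frame.groupby(["District", "Primary Type"]).size().reset_index(name="Count")
    return counts.sort_values("Count", ascending=False).drop_duplicates("District")

most_solved  = top_type_per_district(df[df["Arrest"] == True])
least_solved = top_type_per_district(df[df["Arrest"] == False])

for data, title in [(most_solved, "Most solved cases per district"),
                    (least_solved, "Least solved cases per district")]:
    plt.figure(figsize=(12, 5))
    plt.scatter(data["District"], data["Primary Type"], s=data["Count"] / 100)
    plt.xlabel("District")
    plt.ylabel("Primary Type")
    plt.title(title)
    plt.show()
```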

Most solved cases per district
Least solved cases per district

Again, matching the trend for the Chicago PD overall, the district-wise analysis suggests that ‘Theft’ is the least solved crime across most districts, while ‘Narcotics’ is comparatively the most solved.

Machine Learning and Statistical Inference:

Before predicting the trend of solved and unsolved cases, we built an ML model to predict the probability of a case ending in a True or False arrest given the district and crime type. The model supports our trend predictions, increases the accuracy of our study by going into detail about individual crimes, and can help a police department find out which crimes in its district are hurting its performance; in short, it checks a district's ability to solve a specific crime. We first separated our data into training and testing parts (an 80-20 split), then calculated the True/False arrest probabilities for five random crime-district pairs, shown in the output below.

Following is the code we used and the results we got:
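A sketch of the model, using the Logistic Regression mentioned in the conclusion with one-hot encoded District and Primary Type features; the exact feature set is an assumption:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Features: District and Primary Type; target: whether an arrest was made
X = df[["District", "Primary Type"]].dropna()
y = df.loc[X.index, "Arrest"]

# 80-20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),  # encode categorical features
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

print("Accuracy:", model.score(X_test, y_test))

# Arrest probabilities for a couple of example crime-district pairs
examples = pd.DataFrame(
    {"District": [8, 11], "Primary Type": ["THEFT", "NARCOTICS"]}
)
print(model.predict_proba(examples))  # columns follow model.classes_
```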

The result:

We achieved an accuracy of over 86% when predicting the probabilities of True and False arrests for different crime types.

Moving on to the statistical analysis part of our project, we divided the data into True and False arrests. For time-series analysis, different forecasting models suit different kinds of data. Plotting the daily number of crime cases across the entire dataset reveals a seasonal, repetitive pattern over time, so we used the Holt-Winters model.

We trained the model on the initial 80 percent of the data (the training data), then tested it against the remaining 20 percent (the test data) to see how well it fits. Using the trained model, we were able to predict crime numbers a year ahead.

Finally, we extended this to predict crime numbers in each district. Below are the graphs and codes for our results:
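A sketch of the Holt-Winters forecast with statsmodels, shown here for True arrests (the same code with Arrest == False gives the False-arrest forecast); 365-day additive seasonality is an assumption based on the yearly pattern described above:

```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Daily counts of cases that ended in a True arrest
daily = df[df["Arrest"] == True].set_index("Date").resample("D").size()

# Chronological 80/20 train/test split
split = int(len(daily) * 0.8)
train, test = daily[:split], daily[split:]

# Additive trend with yearly additive seasonality
hw = ExponentialSmoothing(
    train, trend="add", seasonal="add", seasonal_periods=365
).fit()

# Forecast over the test period plus one extra year
forecast = hw.forecast(len(test) + 365)

plt.figure(figsize=(12, 5))
plt.plot(train, label="train")
plt.plot(test, label="test")
plt.plot(forecast, label="Holt-Winters forecast")
plt.legend()
plt.show()
```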

True Arrest Forecast:

False Arrest Forecast:

Moving on to district-wise analysis and predictions: the following graphs are for specific districts, but the same code can be used to forecast the data for the other districts as well.

District 1:

True

False

District 2:

True

False

Conclusion:

We have successfully used machine learning and time-series forecasting to address our problem statement and identify the districts where more attention is needed. We showed how to apply various machine learning techniques to a real-world dataset, explored models such as Logistic Regression, and performed time-series analysis. These models predict future crime trends in specific districts and how each district is likely to perform on specific crimes.

Written by: Muhammad Taimoor, Muhammad Afnan, and Aaqib Rasool
