Predict Chicago Arrest Rate Based on the Criminal Records From 2012 to 2017. TIANZHI ZHU (JULIAN)
Background: Violent crime has long been a serious problem in many cities around the world, and it has been getting more serious recently. Chicago, one of the biggest cities in the US, led as the deadliest city in 2017. This problem should draw governments' attention.
Database - Chicago Crimes 2012 to 2017: The source data contains criminal incidents that happened in Chicago from 2001 until last week. I chose the Chicago Crimes 2012 to 2017 subset as my dataset. It contains 1,048,575 criminal records from between 2012 and 2017. According to the records, only 307,777 of those 1,048,575 cases led to an arrest, which is only about 30%. Each record holds case information such as the case ID, the time of the crime, the crime type, and the crime location.
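The "about 30%" figure can be checked directly from the two counts quoted on this slide. A minimal sketch; the function name `arrest_rate` is chosen here for illustration only:

```python
# Counts quoted on the slide above.
total_cases = 1_048_575
arrests = 307_777


def arrest_rate(arrested: int, total: int) -> float:
    """Return the fraction of cases that ended in an arrest."""
    if total <= 0:
        raise ValueError("total must be positive")
    return arrested / total


rate = arrest_rate(arrests, total_cases)
print(f"Arrest rate: {rate:.1%}")  # prints "Arrest rate: 29.4%"
```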
Goal: Data information; prediction; arrest rate vs. other factors. Using the results, I want to predict the future arrest rate in Chicago based on specific factors. With the model, local police officers could act in advance to increase the arrest rate, or even prevent crimes from happening.
Software -- Weka: Waikato Environment for Knowledge Analysis. Open-source software for data mining, developed at the University of Waikato.
Proposed Solution - Preprocessing: Data cleaning. The database I chose has about 1 million records and 24 columns containing detailed criminal information from the last 5 years. Chose 500 records as a sample for testing purposes. Reformat the features; remove irrelevant fields (ID, case number…); merge or separate fields, e.g. latitude and longitude, or fields that show the same trend (district and community).
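The sampling and field-removal steps can be sketched in plain Python. This is only an illustration, not the Weka workflow used in the slides: the column names (`ID`, `Case Number`, `Primary Type`, `Arrest`) are assumptions about the export, and the records are generated toy data, not rows from the dataset.

```python
import random

# Assumed names of the irrelevant columns; the real export may differ.
IRRELEVANT_FIELDS = {"ID", "Case Number"}


def drop_fields(record: dict, fields_to_drop: set) -> dict:
    """Return a copy of the record without the listed fields."""
    return {k: v for k, v in record.items() if k not in fields_to_drop}


def sample_records(records: list, n: int, seed: int = 0) -> list:
    """Draw a reproducible random sample of at most n records."""
    if n >= len(records):
        return list(records)
    return random.Random(seed).sample(records, n)


# Toy records standing in for the real data (values are made up).
records = [
    {
        "ID": i,
        "Case Number": f"X{i:04d}",
        "Primary Type": "THEFT" if i % 2 else "BATTERY",
        "Arrest": i % 3 == 0,
    }
    for i in range(2000)
]

# 500-record sample with the irrelevant fields removed.
sample = [drop_fields(r, IRRELEVANT_FIELDS) for r in sample_records(records, 500)]
print(len(sample), sorted(sample[0]))
```

A fixed seed keeps the 500-record sample reproducible between runs, which helps when comparing classifiers on the same subset.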
Proposed Solution - Data Cleaning: Too many instances (289); no clear classifier; no trend. Plot legend: Blue: arrested; Red:
PS - Discrete vs. Continuous: Discrete (nominal): classification problem. May 3rd and May 4th, 2016: did anything happen on those two days?
PS - Discrete vs. Continuous: Continuous (numeric): regression problem. Narrow the research area.
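The difference between the two framings comes down to the shape of the target. A minimal sketch, assuming each record carries a day label and an arrest flag; the records below are toy values, not taken from the dataset:

```python
from collections import defaultdict

# Toy (day, arrested) records; values are made up for illustration.
records = [
    ("day-1", True), ("day-1", False), ("day-1", False), ("day-1", False),
    ("day-2", True), ("day-2", True),
]

# Nominal target: one yes/no label per record -> classification problem.
nominal_targets = ["yes" if arrested else "no" for _, arrested in records]


def daily_arrest_rate(rows):
    """Numeric target: one arrest rate per day -> regression problem."""
    counts = defaultdict(lambda: [0, 0])  # day -> [arrests, total cases]
    for day, arrested in rows:
        counts[day][0] += int(arrested)
        counts[day][1] += 1
    return {day: a / t for day, (a, t) in counts.items()}


numeric_targets = daily_arrest_rate(records)
print(nominal_targets)
print(numeric_targets)
```

Aggregating per day is one way a single unusual date can stand out, since its rate can be compared against the rates of the other days.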
PS – Classifier: Bayes Network: a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph. J48 ~ C4.5: an algorithm used to generate a decision tree; the tree generated by C4.5 can be used for classification, so it is often referred to as a statistical classifier. Logistic regression: a statistical method for analyzing a dataset in which one or more independent variables determine an outcome.
PS – Classifier J48: Arrest Rate vs. Primary Crime Type
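To illustrate how a C4.5-style tree would score primary crime type as a split for predicting arrest, the sketch below computes the gain ratio, the split criterion C4.5 uses. It is not Weka's J48 implementation, and the crime types and labels are toy values, not results from the dataset.

```python
import math
from collections import Counter, defaultdict


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by the split information."""
    total = len(labels)
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    # Weighted entropy of the labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data: 3 of 4 NARCOTICS cases and 1 of 4 THEFT cases end in arrest.
crime_types = ["NARCOTICS"] * 4 + ["THEFT"] * 4
arrested = [True, True, True, False, False, False, False, True]

ratio = gain_ratio(crime_types, arrested)
print(f"Gain ratio for primary type: {ratio:.4f}")
```

A higher gain ratio means the attribute separates arrested from non-arrested cases more cleanly, so the tree would prefer to split on it.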
PS – Classifier Logistic: Arrest Rate vs. Primary Crime Type
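A minimal sketch of logistic regression with primary crime type as the only predictor, fitted by batch gradient descent on the log-loss. It assumes a one-hot encoding with one weight per crime type and no intercept, and it is not Weka's Logistic implementation; the data are toy values.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(categories, labels, epochs=2000, lr=0.5):
    """Fit one weight per category (one-hot, no intercept) by gradient descent."""
    weights = {level: 0.0 for level in set(categories)}
    n = len(labels)
    for _ in range(epochs):
        grads = {level: 0.0 for level in weights}
        for category, y in zip(categories, labels):
            # Gradient of the log-loss with respect to the active weight.
            grads[category] += sigmoid(weights[category]) - y
        for level in weights:
            weights[level] -= lr * grads[level] / n
    return weights


# Toy data: 3 of 4 NARCOTICS cases and 1 of 4 THEFT cases end in arrest.
crime_types = ["NARCOTICS"] * 4 + ["THEFT"] * 4
arrested = [1, 1, 1, 0, 0, 0, 0, 1]

weights = fit_logistic(crime_types, arrested)
predicted = {level: sigmoid(w) for level, w in weights.items()}
print(predicted)
```

With a single categorical predictor, the fitted probabilities converge to the per-type arrest rates, which makes this a convenient sanity check before adding more attributes.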
PS – Classifier Bayes Network: Prediction of arrest against each feature
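For a single feature, the prediction reduces to Bayes' rule on a two-node network (arrest and one feature). The sketch below assumes that simplest structure, not the network Weka's BayesNet would learn, and uses toy location values rather than real records.

```python
from collections import Counter


def posterior_arrest(feature_values, arrested, value):
    """P(arrest | feature == value) from counts, via Bayes' rule."""
    n = len(arrested)
    class_counts = Counter(arrested)
    joint_counts = Counter(zip(feature_values, arrested))
    scores = {}
    for cls in (True, False):
        if class_counts[cls] == 0:
            scores[cls] = 0.0
            continue
        prior = class_counts[cls] / n                                # P(class)
        likelihood = joint_counts[(value, cls)] / class_counts[cls]  # P(value | class)
        scores[cls] = likelihood * prior
    evidence = scores[True] + scores[False]                          # P(value)
    if evidence == 0:
        raise ValueError(f"feature value {value!r} not seen in the data")
    return scores[True] / evidence


# Toy data: 2 of 4 STREET cases and 1 of 4 APARTMENT cases end in arrest.
locations = ["STREET"] * 4 + ["APARTMENT"] * 4
arrested = [True, True, False, False, True, False, False, False]

p_street = posterior_arrest(locations, arrested, "STREET")
p_apartment = posterior_arrest(locations, arrested, "APARTMENT")
print(p_street, p_apartment)
```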
PS – Classifier J48, multi-attribute (next step): Features: primary type, time, location (street, apartment), latitude, longitude.
PS – Classifier J48, multi-attribute (next step): Diagonal: what we want to c!!!
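Assuming "diagonal" refers to the diagonal of a confusion matrix (the slide text is truncated, so this reading is an assumption), the diagonal cells count the correctly classified cases. A minimal sketch on toy labels:

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of cases on the diagonal, i.e. classified correctly."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Toy labels: "yes" = arrested, "no" = not arrested (made-up values).
actual = ["yes", "yes", "no", "no", "no", "yes"]
predicted = ["yes", "no", "no", "no", "yes", "yes"]

matrix = confusion_matrix(actual, predicted, labels=["yes", "no"])
print(matrix)            # prints [[2, 1], [1, 2]]
print(accuracy(matrix))
```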
Next Step: Python. Test the models I chose for the dataset. Design the GUI for some models. Visualization. Clean data, delete outliers. Check and handle the classification error (using the filter Weka provides).
Challenge: The database I chose is too large (1 million records). Important features are missing. Weak correlation. The criminal's mind.
Questions?
- Slides: 18