About KNN Algorithm 

The KNN algorithm is a supervised machine learning algorithm that is used for both classification and regression tasks. It works by finding the K most similar data points to a new data point and then predicting the class or value of the new data point based on the classes or values of the K most similar data points.

To predict the value of a new data point using the KNN algorithm, the following steps are taken:

KNN algorithm working for Regression:

  • Find the K most similar data points to the new data point.
  • Calculate the average value of the K most similar data points.
  • Assign the new data point the average value of the K most similar data points.
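The regression steps above, together with the majority-vote classification mentioned in the opening paragraph, can be sketched in a few lines of plain Python. This is a minimal illustration and not a tuned implementation: the function names (`k_nearest`, `knn_regress`, `knn_classify`) and the toy data are hypothetical, and Euclidean distance is assumed as the similarity measure.

```python
import math
from collections import Counter


def k_nearest(train_points, query, k):
    """Return the indices of the k training points closest to query (Euclidean distance)."""
    distances = sorted((math.dist(p, query), i) for i, p in enumerate(train_points))
    return [i for _, i in distances[:k]]


def knn_regress(train_points, train_values, query, k=3):
    """Predict a value as the mean of the k nearest neighbours' values."""
    neighbours = k_nearest(train_points, query, k)
    return sum(train_values[i] for i in neighbours) / len(neighbours)


def knn_classify(train_points, train_labels, query, k=3):
    """Predict a class as the most common label among the k nearest neighbours."""
    neighbours = k_nearest(train_points, query, k)
    votes = Counter(train_labels[i] for i in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (x, y) points, each with a numeric value and a class label.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
values = [10.0, 12.0, 14.0, 50.0, 52.0, 54.0]
labels = ["low", "low", "low", "high", "high", "high"]

predicted_value = knn_regress(points, values, (1.2, 1.4), k=3)
predicted_label = knn_classify(points, labels, (8.4, 8.6), k=3)
print(predicted_value, predicted_label)
```

Both predictions reuse the same neighbour search. Only the final step differs: averaging for regression, voting for classification.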

KNN algorithm working for Classification:

  • Find the K most similar data points to the new data point using a distance metric such as Euclidean distance or Manhattan distance.
  • Assign the new data point the most common class among those K data points.

Advantages and disadvantages of the KNN algorithm

  • The KNN algorithm is a simple and intuitive algorithm, and it is easy to implement.
  • It is also a very versatile algorithm, as it can be used for both classification and regression tasks.

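The KNN algorithm also has some disadvantages:

  • It can be computationally expensive, especially for large datasets, because each prediction compares the new data point with the stored training data.
  • It is sensitive to outliers, especially when K is small, as a single outlier can have a big impact on the prediction.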

ANOVA Test

The ANOVA test is a statistical test that is used to compare the means of three or more groups. It is a powerful tool that can be used to determine whether there is a significant difference between the groups.

The ANOVA test works by comparing the variance between the groups to the variance within the groups. If the variance between the groups is significantly larger than the variance within the groups, then the ANOVA test will reject the null hypothesis, which is that there is no difference between the groups.

The ANOVA test is a relatively complex test, but it can be understood in simple terms. Imagine that we are comparing the average height of students in three different schools. We would collect the heights of a sample of students from each school. We would then use the ANOVA test to compare the average height of the students in each school. If the ANOVA test rejects the null hypothesis, then we can conclude that the average height in at least one of the three schools differs significantly from the others.
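To make the comparison of between-group and within-group variance concrete, here is a minimal sketch that computes the one-way ANOVA F statistic by hand. The height samples are invented for illustration, and the function name `one_way_anova_f` is hypothetical.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of samples (one list per group)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: how far each value is from its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)        # degrees of freedom: k - 1
    ms_within = ss_within / (n_total - k)    # degrees of freedom: N - k
    return ms_between / ms_within


# Hypothetical heights (cm) sampled from three schools.
school_a = [150, 152, 154]
school_b = [160, 162, 164]
school_c = [170, 172, 174]

f_stat = one_way_anova_f([school_a, school_b, school_c])
print(f"F = {f_stat:.2f}")
```

A large F value means the group means are spread out relative to the variation inside each group. Turning F into a p-value requires the F distribution with (k - 1, N - k) degrees of freedom. If SciPy is available, `scipy.stats.f_oneway` computes both the statistic and the p-value.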

The ANOVA test is a valuable tool that can be used in a variety of fields, such as education, psychology, and medicine.

Top days shooting

The histogram shows the distribution of incident times in the dataset, with the x-axis showing the hour of the day and the y-axis showing the count. The most common time of day for incidents is the hour with the tallest bar.

This visualization shows that most incidents happen between the hours of 6pm and midnight. This is likely because people are more likely to be out and about during these hours.

It is also possible that some victims were intoxicated at night, which can cause situations to escalate, whether intentionally or not.
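The hourly counts behind a histogram like the one described above could be computed as follows. This is only a sketch: the sample times are made up, and the "HH:MM" format is an assumption about how incident times might be stored.

```python
from collections import Counter
from datetime import datetime

# Hypothetical incident times in 24-hour "HH:MM" format (not taken from the dataset).
incident_times = ["18:45", "22:10", "03:30", "22:55", "19:05", "22:20", "11:15"]

# Count how many incidents fall into each hour of the day.
hour_counts = Counter(datetime.strptime(t, "%H:%M").hour for t in incident_times)

# Print a simple text histogram, one row per hour.
for hour in range(24):
    print(f"{hour:02d} {'#' * hour_counts[hour]}")

# The tallest bar corresponds to the most common hour.
busiest_hour, busiest_count = hour_counts.most_common(1)[0]
```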

K-Means and DBSCAN Clustering

K-Means and DBSCAN are two widely used clustering algorithms in machine learning. These algorithms are used to group similar data points together, but they work in different ways.

K-Means Clustering

K-Means clustering is a centroid-based clustering algorithm. This means that it assigns each data point to a cluster based on its distance to the cluster centroid. The cluster centroid is the average of all the data points in the cluster.

K-Means clustering is an efficient algorithm, but it has some limitations:

    • The number of clusters (k) must be specified in advance.
    • K-Means clustering is sensitive to outliers. Outliers can skew the cluster centroids, which can lead to inaccurate clustering results.
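The centroid-based idea can be sketched with the standard iterative procedure (often called Lloyd's algorithm). This is a minimal illustration under stated assumptions: the function name `k_means`, the toy points, and the starting centroids are hypothetical, and the initial centroids are passed in explicitly so that the result is reproducible.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving.

    Returns the final centroids and the point groups from the last assignment step.
    """
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)

        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its old centroid

        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2D points forming two well-separated groups; k = 2 is chosen in advance.
points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
final_centroids, final_clusters = k_means(points, [(0, 0), (10, 10)])
print(final_centroids)
```

Because each centroid is a plain mean, adding one far-away point to `points` would pull the centroid of its cluster toward it, which is the outlier sensitivity noted above.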

DBSCAN Clustering

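DBSCAN clustering is a density-based clustering algorithm. This means that it groups data points together based on their density. Density is defined as the number of data points within a given radius (epsilon) of a data point. DBSCAN is a powerful algorithm that can handle outliers by labeling them as noise, can find clusters of arbitrary shape, and does not need the number of clusters to be specified in advance. However, because it uses a single epsilon for the whole dataset, it can struggle when clusters have very different densities, and it can be computationally expensive for large datasets.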
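The density-based idea can be illustrated with a compact, unoptimized sketch. The names `dbscan`, `eps`, `min_pts`, and `NOISE` are chosen here for illustration, the toy points are hypothetical, and the neighbour search is a brute-force scan, not the indexed search a library would use.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region(i)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # not dense enough; may later become a border point
            continue

        # Point i is a core point, so it starts a new cluster.
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also a core point, keep expanding
    return labels


# Hypothetical points: two dense groups and one isolated outlier.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

In this toy example the isolated point is labeled as noise and does not distort either cluster.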

Age Distribution of People killed

This code will produce a histogram of the age column of the Washington Post fatal police shootings dataset. The histogram will show that the age distribution is skewed to the right: most of the people killed by police are relatively young, with a long tail extending toward older ages. The code will also add a vertical line at the median age, which is 29 years old.

The skewness is a measure of how asymmetrical a distribution is. A skewness of zero indicates a symmetrical distribution, while a positive skewness indicates a distribution that is skewed to the right and a negative skewness indicates a distribution that is skewed to the left.
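As a rough sketch of how such a value can be computed, the function below uses the moment-based definition of skewness (third central moment divided by the second central moment raised to the power 3/2). The function name and the sample ages are hypothetical. Some libraries apply a sample-size correction, so their results can differ slightly from this uncorrected formula.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5, without sample-size correction."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment (variance)
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical ages, for illustration only.
symmetric_ages = [20, 30, 40, 50, 60]
right_skewed_ages = [20, 22, 25, 27, 30, 45, 70]

print(skewness(symmetric_ages))     # symmetric sample, skewness of zero
print(skewness(right_skewed_ages))  # long tail of older ages, positive skewness
```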

In this case, the skewness of the age distribution is 0.52, which indicates that the distribution is moderately skewed to the right. In other words, the people killed by police are concentrated at younger ages, with fewer at older ages.

This visualization can be used to raise awareness of the problem of police violence against young people. It can also be used to advocate for policies that can help to reduce police violence, such as de-escalation training and crisis intervention training.

Trend in shooting

This will produce a line plot and an area plot of the trend in fatal police shootings over time in the United States, using the Washington Post Fatal Police Shootings dataset. The line plot will show a line connecting the data points, and the area plot will fill the area between the line plot and the x-axis. Both plots will be colored in blue.

The title of the plot will be “Trend in Fatal Police Shootings Over Time”, and the x-axis label will be “Year”. The y-axis label will be “Change in Number of Fatal Police Shootings (%)”. A legend will be added to the plot, showing that the blue line represents the trend.

Overall trend: The trend in fatal police shootings in the United States has been declining in recent years.

The visualization shows the following:

  • The number of fatal police shootings decreased by 6.8% in 2020 compared to 2019.
  • The number of fatal police shootings decreased by 17.4% in 2019 compared to 2018.
  • The number of fatal police shootings decreased by 7.9% in 2018 compared to 2017.

Conclusion: The trend in fatal police shootings in the United States has been declining in recent years.
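Year-over-year percentages like the ones above are computed as the change from the previous year divided by the previous year's count. The sketch below shows the arithmetic. The yearly counts are placeholders and not figures from the dataset, and the function name `percent_change` is hypothetical.

```python
# Placeholder yearly counts, for illustration only (not taken from the dataset).
yearly_counts = {2017: 1000, 2018: 900, 2019: 810}


def percent_change(counts):
    """Return {year: percent change relative to the previous year in the data}."""
    years = sorted(counts)
    return {
        year: 100 * (counts[year] - counts[prev]) / counts[prev]
        for prev, year in zip(years, years[1:])
    }


changes = percent_change(yearly_counts)
for year, change in changes.items():
    print(f"{year}: {change:+.1f}% compared to {year - 1}")
```

A negative value means fewer shootings than in the previous year. The first year has no entry because there is no earlier year to compare against.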

Mental illness by Age group

We can see that the number of mental illness cases varies considerably by age group. This is likely due to a number of factors, such as the age of the victim and their family circumstances, for example the ages of the victim’s spouse and children.

For example, younger people may be more likely to experience mental illness because they are still developing and may be more susceptible to stress and anxiety. Additionally, people with mental illness may be more likely to be victims of crime, which can further exacerbate their mental health problems.

The visualization also shows that the number of mental illness cases decreases after the age of 50. This may be because older people are more likely to have developed coping mechanisms for dealing with mental illness. Additionally, older people may be less likely to experience stressful life events, such as job loss or divorce, which can trigger mental illness.

Overall, from the visualization we can say that mental illness is a significant problem that affects people of all ages. However, the risk of mental illness appears to be highest among younger people.

Whether person was fleeing?

The visualization above shows the distribution of victims by fleeing status. About 60% of the victims were fleeing, while about 20% were not fleeing. This means that victims who were fleeing outnumber victims who were not.

This visualization suggests that a majority of the victims of police shootings were fleeing at the time of the shooting. This could be due to a number of factors, such as fear of the police, an attempt to escape arrest, or a mental health crisis.

Analysis on Age Groups of the Victims

From the visualization we can say that young adults are more likely to be involved in fatal encounters with police than people aged 40 and over. Young adults may be perceived as more threatening by police. They may also be less able to de-escalate encounters with police, and they tend to have fewer resources to defend themselves, such as financial resources or social support networks, than adults in their 30s and 40s. People aged 40 and over are also at risk of being killed by police, but to a lesser extent than young adults. Some people in that age group may be involved in criminal activities, and a few may be experiencing mental illness that makes them vulnerable to police violence.

Police Officer Wearing Body Cam

Most of the shootings were carried out by police officers who were not wearing body cameras. From the graph we can say that about 75-80% of officers were not wearing body cameras during the shooting and about 20-25% were. Many of the shootings may have happened in emergency situations, but we cannot conclude that the officers simply had no time to switch on their body cameras to record footage.

Analysis on Armed Victims

From the graph we can see that the majority of the victims were armed with a gun or a knife. This suggests that the type of weapon carried by the victim probably played a role in most of the shootings.

Mode of death of the Victims

From the graphs we can see that most of the victims were shot, and that the least common manner of death was being shot and tasered.

Analysis on Dataset

Victims in each state

The data covers 51 jurisdictions (the 50 US states and the District of Columbia), and I believe the number of shootings also depends on the population of each state. In the bar graph of victim count by US state, the number of victims in each state is divided by the population of that state to create a per capita victim rate.

The top 5 states with most victims are CA, TX, FL, AZ, GA.

The majority of victims in each state are white, followed by Black and Hispanic victims.

California has the most victims overall, with Texas and Florida close behind.

Arizona and Georgia have the next highest victim counts.
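Raw counts and per capita rates can rank states very differently, which is why the per capita normalization mentioned above matters. The sketch below illustrates this with made-up numbers. The state codes, victim counts, and populations are all hypothetical, and the rate is expressed per million residents as one possible choice of scale.

```python
# Hypothetical victim counts and populations (not real states or real figures).
victims = {"AA": 500, "BB": 300, "CC": 50}
population = {"AA": 40_000_000, "BB": 10_000_000, "CC": 1_000_000}

# Victims per one million residents in each state.
rate_per_million = {
    state: victims[state] * 1_000_000 / population[state] for state in victims
}

# Rank states by raw count and by per capita rate, highest first.
ranked_by_count = sorted(victims, key=victims.get, reverse=True)
ranked_by_rate = sorted(rate_per_million, key=rate_per_million.get, reverse=True)

print("By count:", ranked_by_count)
print("By rate: ", ranked_by_rate)
```

In this toy example the most populous state has the most victims but the lowest rate, so the two rankings are reversed.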

Mental Illness Victims

When I first looked briefly at the data, I assumed that most of the victims would be mentally unstable. However, the graph shows that most of the victims were mentally stable and only a small percentage were mentally unstable.

Project-2 Fatal Police Shooting

Today I examined the fatal police shooting dataset, which contains 17 attributes and 8002 instances. I saw many inconsistencies and inaccuracies, and there were missing values in almost all the attributes. However, I found a few interesting attributes that can be explored further, such as age, gender, city/state, and mental illness. By analyzing the age attribute, we may be able to determine whether a notable proportion of shootings is associated with distinct age groups. With the gender attribute, we can check whether men are more likely to be killed. By analyzing the city or state, we can see whether shootings are associated with specific geographic locations, for example whether certain cities or states have higher rates of fatal shootings. Finally, the mental illness attribute can help determine whether crimes may be associated with an individual's mental illness.

I believe that age, gender, and city/state are the most promising attributes for further analysis.