Mason Meeks
Since the mid-2010s, it feels as though our world and society have changed drastically, especially in how we engage politically. Using data from the Crowd Counting Consortium, I will be investigating the ebb and flow of protests from 2020 to 2022. Specifically, I will be using protest data from July of each of these years. The primary reason for using a single month from each year is that July falls in the middle of the part of the year in which the majority of protests happen, so one month per year keeps the analysis simpler while still providing plenty of data.
I will also be using hate crime data from the FBI's Crime Data Explorer, which will allow me to find correlations and build models to investigate the connection between a particular state's protest-based political affiliation, protest activity, and hate crime frequency. Lastly, I will be using U.S. Census Bureau data to control for population.
The overall goal of this project is two-fold: 1) to determine which states see the most per capita hate crimes and protests, and whether those factors are connected with a state's political affiliation, and 2) to assess whether the increased media coverage of protests and violent hate crimes accurately reflects reality.
The GitHub page for this notebook can be found here
# clone the course repository, change to right directory, and import libraries.
%cd /content
!git clone https://github.com/mmeekstu/data_science.git
%cd /content/data_science/'Project Data'/'July Data'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)
/content Cloning into 'data_science'... remote: Enumerating objects: 212, done. remote: Counting objects: 100% (78/78), done. remote: Compressing objects: 100% (54/54), done. remote: Total 212 (delta 30), reused 66 (delta 24), pack-reused 134 Receiving objects: 100% (212/212), 18.75 MiB | 18.34 MiB/s, done. Resolving deltas: 100% (89/89), done. /content/data_science/Project Data/July Data
This data comes from the Crowd Counting Consortium (found here), which tracks protests, demonstrations, rallies, vigils, and similar events throughout the U.S. The creators of this data have coded many aspects of each event, including its location, size, claim, groups, interactions among protesters, interactions with police, and many others. I chose this data because it should give me an abundance of information that will allow me to ask and potentially answer many types of questions. The initial question I hope to investigate with this data is how the frequency of protests and the number of protesters have changed since 2020. With major elections, COVID, and massive public responses to policies and governmental actions, I expect to find a lot of interesting statistics that will shape the overall goal of this project.
# Reading in csv files
df_2020 = pd.read_csv('Crowd Estimates July 2020 - Tally.csv')
df_2021 = pd.read_csv('Crowd Estimates July 2021 - Tally.csv')
df_2022 = pd.read_csv('Crowd Estimates July 2022 - Tally.csv')
After reading the csv files in, I am going to add a year column to each with their corresponding year. This will make grouping them quick and simple later on.
# Adding a year column to each
df_2020['Year'] = 2020
df_2021['Year'] = 2021
df_2022['Year'] = 2022
# Creating a list for easy access
df_list = [df_2020, df_2021, df_2022]
Now for a quick display of the first row of each file. This will show how messy or neat the data is, and allow me to understand where I need to start organizing and tidying.
for df in df_list:
display(df.head(1))
CityTown | Location | County | StateTerritory | Country | Date | EstimateText | EstimateLow | BestGuess | EstimateHigh | AdjustedLow | AdjustedHigh | Actor | Claim | Pro(2)/Anti(1) | EventType | ReportedArrests | ReportedParticipantInjuries | ReportedPoliceInjuries | ReportedPropertyDamage | TownsCities | Events | MacroEvent | Misc. | Source1 | Source2 | Source3 | Source4 | Source5 | Source6 | Unnamed: 30 | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ashland | Memorial Park; streets of Ashland; Highway 13 ... | NaN | WI | US | 2020-07-01 | NaN | NaN | NaN | NaN | NaN | NaN | water protectors | against recloation of Enbridge Line 5 Pipeline... | 1 | protest | 0 | 0 | 0.0 | 0.0 | 1 | 1 | NaN | NaN | https://kbjr6.com/2020/07/01/protesters-march-... | NaN | NaN | NaN | NaN | NaN | NaN | 2020 |
date | locality | state | location | title | size_text | size_low | size_high | organizations | participants | claims | valence | event_type | police_measures | participant_measures | police_injuries | participant_injuries | arrests | property_damage | police_deaths | participant_deaths | macroevent | notes | coder | source1 | source2 | source3 | source4 | source5 | source6 | source7 | source8 | source9 | source10 | source11 | source12 | source13 | source14 | source15 | source16 | source17 | source18 | source19 | source20 | source21 | source22 | source23 | source24 | source25 | source26 | source27 | source28 | source29 | source30 | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2021-07-01 | Addison | IL | Portillo's Food Service | NaN | 17 | 17.0 | 17.0 | Arise Chicago | workers | for better working conditions and higher pay f... | 0.0 | strike; picket | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | https://chicago.suntimes.com/2021/7/2/22561028... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2021 |
date | locality | state | location | title | size_text | size_low | size_high | organizations | participants | claims | valence | event_type | police_measures | participant_measures | police_injuries | participant_injuries | arrests | property_damage | police_deaths | participant_deaths | macroevent | notes | coder | source1 | source2 | source3 | source4 | source5 | source6 | source7 | source8 | source9 | source10 | source11 | source12 | source13 | source14 | source15 | source16 | source17 | source18 | source19 | source20 | source21 | source22 | source23 | source24 | source25 | source26 | source27 | source28 | source29 | source30 | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-07-01 | Akron | OH | Harold K. Stubbs Justice Center | NaN | about 20 | 20.0 | 20.0 | The Freedom BLOC | NaN | for justice for Jayland Walker, against police... | 1.0 | protest | NaN | conversation with counter-protester | NaN | NaN | NaN | NaN | NaN | NaN | 20220701-akron-jaylandwalker | NaN | NaN | https://twitter.com/sahrasulaiman/status/15431... | https://fox8.com/news/elected-officials-callin... | https://www.wkyc.com/article/news/local/akron/... | https://www.beaconjournal.com/story/news/2022/... | https://www.instagram.com/p/Cfcmqv6JPtD/ | https://www.instagram.com/p/CfepN98pt0G/ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2022 |
There are a lot of useless columns that need to be dropped. Also, this data is quite messy and is not consistently titled across the years. I am going to iterate through the columns and drop the ones I don't want.
# Dropping a lot of data that will not be needed for the project
drop_keys = ['source', 'Source', 'macroevent', 'MacroEvent', 'Misc.', 'Unnamed', 'notes',
             'coder', 'TownsCities', 'Events', 'BestGuess', 'AdjustedLow', 'AdjustedHigh',
             'EstimateText', 'County', 'size_text', 'title']
df_2020 = df_2020.drop(columns=[c for c in df_2020.columns if any(k in c for k in drop_keys)])
df_2021 = df_2021.drop(columns=[c for c in df_2021.columns if any(k in c for k in drop_keys)])
df_2022 = df_2022.drop(columns=[c for c in df_2022.columns if any(k in c for k in drop_keys)])
# Re-establishing the list
df_list = [df_2020, df_2021, df_2022]
# Let's see what we have now
for df in df_list:
display(df.head(1))
CityTown | Location | StateTerritory | Country | Date | EstimateLow | EstimateHigh | Actor | Claim | Pro(2)/Anti(1) | EventType | ReportedArrests | ReportedParticipantInjuries | ReportedPoliceInjuries | ReportedPropertyDamage | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ashland | Memorial Park; streets of Ashland; Highway 13 ... | WI | US | 2020-07-01 | NaN | NaN | water protectors | against recloation of Enbridge Line 5 Pipeline... | 1 | protest | 0 | 0 | 0.0 | 0.0 | 2020 |
date | locality | state | location | size_low | size_high | organizations | participants | claims | valence | event_type | police_measures | participant_measures | police_injuries | participant_injuries | arrests | property_damage | police_deaths | participant_deaths | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2021-07-01 | Addison | IL | Portillo's Food Service | 17.0 | 17.0 | Arise Chicago | workers | for better working conditions and higher pay f... | 0.0 | strike; picket | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2021 |
date | locality | state | location | size_low | size_high | organizations | participants | claims | valence | event_type | police_measures | participant_measures | police_injuries | participant_injuries | arrests | property_damage | police_deaths | participant_deaths | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-07-01 | Akron | OH | Harold K. Stubbs Justice Center | 20.0 | 20.0 | The Freedom BLOC | NaN | for justice for Jayland Walker, against police... | 1.0 | protest | NaN | conversation with counter-protester | NaN | NaN | NaN | NaN | NaN | NaN | 2022 |
It is definitely already more manageable and organized. However, there are a lot of inconsistencies in the naming. There seems to have been a large change in 2021 in how the investigators coded the data. I will go ahead and make the names more consistent.
# Renaming inconsistent columns
df_2020 = df_2020.rename(columns={"CityTown": "City", "StateTerritory": "State", 'Actor': 'Actors', 'Pro(2)/Anti(1)': 'Political Affiliation'})
df_2021 = df_2021.rename(columns={'date': 'Date', 'locality': 'City', 'location': 'Location', 'state': 'State', 'size_low': 'EstimateLow', 'size_high': 'EstimateHigh', 'claims': 'Claim', 'event_type': 'EventType', 'participants': 'Actors', 'valence': 'Political Affiliation', 'police_injuries': 'ReportedPoliceInjuries', 'participant_injuries': 'ReportedParticipantInjuries', 'arrests': 'ReportedArrests', 'property_damage': 'ReportedPropertyDamage'})
df_2022 = df_2022.rename(columns={'date': 'Date', 'locality': 'City', 'location': 'Location', 'state': 'State', 'size_low': 'EstimateLow', 'size_high': 'EstimateHigh', 'claims': 'Claim', 'event_type': 'EventType', 'participants': 'Actors', 'valence': 'Political Affiliation', 'police_injuries': 'ReportedPoliceInjuries', 'participant_injuries': 'ReportedParticipantInjuries', 'arrests': 'ReportedArrests', 'property_damage': 'ReportedPropertyDamage'})
# Re-establishing the list
df_list = [df_2020, df_2021, df_2022]
# Much more consistent
for df in df_list:
display(df.head(1))
City | Location | State | Country | Date | EstimateLow | EstimateHigh | Actors | Claim | Political Affiliation | EventType | ReportedArrests | ReportedParticipantInjuries | ReportedPoliceInjuries | ReportedPropertyDamage | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ashland | Memorial Park; streets of Ashland; Highway 13 ... | WI | US | 2020-07-01 | NaN | NaN | water protectors | against recloation of Enbridge Line 5 Pipeline... | 1 | protest | 0 | 0 | 0.0 | 0.0 | 2020 |
Date | City | State | Location | EstimateLow | EstimateHigh | organizations | Actors | Claim | Political Affiliation | EventType | police_measures | participant_measures | ReportedPoliceInjuries | ReportedParticipantInjuries | ReportedArrests | ReportedPropertyDamage | police_deaths | participant_deaths | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2021-07-01 | Addison | IL | Portillo's Food Service | 17.0 | 17.0 | Arise Chicago | workers | for better working conditions and higher pay f... | 0.0 | strike; picket | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2021 |
Date | City | State | Location | EstimateLow | EstimateHigh | organizations | Actors | Claim | Political Affiliation | EventType | police_measures | participant_measures | ReportedPoliceInjuries | ReportedParticipantInjuries | ReportedArrests | ReportedPropertyDamage | police_deaths | participant_deaths | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-07-01 | Akron | OH | Harold K. Stubbs Justice Center | 20.0 | 20.0 | The Freedom BLOC | NaN | for justice for Jayland Walker, against police... | 1.0 | protest | NaN | conversation with counter-protester | NaN | NaN | NaN | NaN | NaN | NaN | 2022 |
I am going to go ahead and concatenate the tables now that naming is more consistent. This will allow me to more easily manage, transform, and analyze the data.
df_full = pd.concat([df_2020, df_2021, df_2022])
# Success. Here is what the new table looks like
df_full
City | Location | State | Country | Date | EstimateLow | EstimateHigh | Actors | Claim | Political Affiliation | EventType | ReportedArrests | ReportedParticipantInjuries | ReportedPoliceInjuries | ReportedPropertyDamage | Year | organizations | police_measures | participant_measures | police_deaths | participant_deaths | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ashland | Memorial Park; streets of Ashland; Highway 13 ... | WI | US | 2020-07-01 | NaN | NaN | water protectors | against recloation of Enbridge Line 5 Pipeline... | 1.0 | protest | 0 | 0 | 0.0 | 0.0 | 2020 | NaN | NaN | NaN | NaN | NaN |
1 | Boulevard | NaN | CA | US | 2020-07-01 | 100.0 | 100.0 | members of the Kumeyaay Tribe | against construction of border wall, against d... | 1.0 | march | NaN | NaN | NaN | NaN | 2020 | NaN | NaN | NaN | NaN | NaN |
2 | Brentwood | Bank of America Financial Center | CA | US | 2020-07-01 | NaN | NaN | fired employees of Terranea Resort | against firing of staff and salary cuts at Ter... | 0.0 | caravan; protest | 0 | 0 | 0.0 | 0.0 | 2020 | NaN | NaN | NaN | NaN | NaN |
3 | Eugene | Lane County Jail | OR | US | 2020-07-01 | 150.0 | 150.0 | general protesters | show solidarity with inmates; seek release of ... | 0.0 | protest | 1 | 0 | 1.0 | 1.0 | 2020 | NaN | NaN | NaN | NaN | NaN |
4 | Holyoke | Providence Behavioral Health Hospital | MA | US | 2020-07-01 | NaN | NaN | Nurses Association | against closure of inpatient beds in psychiatr... | 0.0 | memorial; protest | 0 | 0 | 0.0 | 0.0 | 2020 | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2518 | Washington | National Mall | DC | NaN | 2022-07-31 | NaN | NaN | NaN | against 1776 Restoration Movement | 0.0 | counter-protest | NaN | NaN | NaN | NaN | 2022 | NaN | NaN | NaN | NaN | NaN |
2519 | Washington | U.S. Capitol | DC | NaN | 2022-07-31 | NaN | NaN | veterans | for the PACT Act, for passage of federal legis... | 0.0 | demonstration; rally | NaN | NaN | NaN | NaN | 2022 | Burn Pits 360 | NaN | round-the-clock sit-in on Capitol steps | NaN | NaN |
2520 | Washington | Union Station | DC | NaN | 2022-07-31 | NaN | NaN | NaN | for Medicare for all | 1.0 | demonstration | NaN | NaN | NaN | NaN | 2022 | March for Medicare for All | NaN | NaN | NaN | NaN |
2521 | West Hartford | Flatbush Ave and New Park Ave | CT | NaN | 2022-07-31 | NaN | NaN | NaN | against circumcision of male infants | 0.0 | protest | NaN | NaN | NaN | NaN | 2022 | Bloodstained Men | NaN | NaN | NaN | NaN |
2522 | Wooster | Wooster Square | OH | NaN | 2022-07-31 | NaN | NaN | NaN | against racism, against police brutality | 1.0 | demonstration | NaN | NaN | NaN | NaN | 2022 | Wayne County Racial Justice Coalition | NaN | NaN | NaN | NaN |
5254 rows × 21 columns
Since the scope of the project only involves the number of protests, the number of protesters, and their political affiliation, let's further tidy the table down to only the columns I will need: state, the low and high estimates, political affiliation, and year.
# Creating a new dataframe
df_new = pd.DataFrame(df_full[['State', 'EstimateLow', 'EstimateHigh', 'Political Affiliation', 'Year']])
df_new
Next, I am going to drop all rows with NA values, since they cannot be used in the prediction model.
# Finding and getting rid of NAs
df_new = df_new.dropna()
df_new
State | EstimateLow | EstimateHigh | Political Affiliation | Year | |
---|---|---|---|---|---|
1 | CA | 100.0 | 100.0 | 1.0 | 2020 |
3 | OR | 150.0 | 150.0 | 0.0 | 2020 |
5 | CA | 150.0 | 150.0 | 0.0 | 2020 |
6 | CA | 50.0 | 50.0 | 0.0 | 2020 |
7 | CA | 7.0 | 7.0 | 0.0 | 2020 |
... | ... | ... | ... | ... | ... |
2503 | CA | 75.0 | 100.0 | 0.0 | 2022 |
2506 | CA | 36.0 | 36.0 | 0.0 | 2022 |
2507 | CA | 2.0 | 2.0 | 0.0 | 2022 |
2508 | PR | 15.0 | 20.0 | 0.0 | 2022 |
2514 | WI | 600.0 | 600.0 | 0.0 | 2022 |
2198 rows × 5 columns
The Political Affiliation column is stored as a numeric code, so I am going to remap the values to more intuitive labels.
# Remapping Political Affiliation to be more intuitive
df_new['Political Affiliation'] = df_new['Political Affiliation'].map({
0.0: 'Neither',
1.0: 'Left/Anti-Trump',
2.0: 'Right/Pro-Trump'
})
<ipython-input-16-d7bce5010979>:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_new['Political Affiliation'] = df_new['Political Affiliation'].map({
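The warning above appears because df_new was derived from a slice of df_full, so pandas cannot tell whether the assignment might affect the original frame. If desired, it can be avoided by taking an explicit copy when the subset is created; a minimal sketch of that alternative (same columns as above):
# Taking an explicit copy breaks the link back to df_full, so later column
# assignments modify df_new alone and raise no SettingWithCopyWarning.
df_new = df_full[['State', 'EstimateLow', 'EstimateHigh', 'Political Affiliation', 'Year']].copy()
df_new = df_new.dropna()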
# Great
df_new
State | EstimateLow | EstimateHigh | Political Affiliation | Year | |
---|---|---|---|---|---|
1 | CA | 100.0 | 100.0 | Left/Anti-Trump | 2020 |
3 | OR | 150.0 | 150.0 | Neither | 2020 |
5 | CA | 150.0 | 150.0 | Neither | 2020 |
6 | CA | 50.0 | 50.0 | Neither | 2020 |
7 | CA | 7.0 | 7.0 | Neither | 2020 |
... | ... | ... | ... | ... | ... |
2503 | CA | 75.0 | 100.0 | Neither | 2022 |
2506 | CA | 36.0 | 36.0 | Neither | 2022 |
2507 | CA | 2.0 | 2.0 | Neither | 2022 |
2508 | PR | 15.0 | 20.0 | Neither | 2022 |
2514 | WI | 600.0 | 600.0 | Neither | 2022 |
2198 rows × 5 columns
I also want to remap the state abbreviations to full state names so that the column is consistent with the second data set and easier to interpret.
# Creating dict
states_dict = {
'AK': 'Alaska',
'AL': 'Alabama',
'AR': 'Arkansas',
'AS': 'American Samoa',
'AZ': 'Arizona',
'CA': 'California',
'CO': 'Colorado',
'CT': 'Connecticut',
'DC': 'District of Columbia',
'DE': 'Delaware',
'FL': 'Florida',
'GA': 'Georgia',
'GU': 'Guam',
'HI': 'Hawaii',
'IA': 'Iowa',
'ID': 'Idaho',
'IL': 'Illinois',
'IN': 'Indiana',
'KS': 'Kansas',
'KY': 'Kentucky',
'LA': 'Louisiana',
'MA': 'Massachusetts',
'MD': 'Maryland',
'ME': 'Maine',
'MI': 'Michigan',
'MN': 'Minnesota',
'MO': 'Missouri',
'MP': 'Northern Mariana Islands',
'MS': 'Mississippi',
'MT': 'Montana',
'NA': 'National',
'NC': 'North Carolina',
'ND': 'North Dakota',
'NE': 'Nebraska',
'NH': 'New Hampshire',
'NJ': 'New Jersey',
'NM': 'New Mexico',
'NV': 'Nevada',
'NY': 'New York',
'OH': 'Ohio',
'OK': 'Oklahoma',
'OR': 'Oregon',
'PA': 'Pennsylvania',
'PR': 'Puerto Rico',
'RI': 'Rhode Island',
'SC': 'South Carolina',
'SD': 'South Dakota',
'TN': 'Tennessee',
'TX': 'Texas',
'UT': 'Utah',
'VA': 'Virginia',
'VI': 'Virgin Islands',
'VT': 'Vermont',
'WA': 'Washington',
'WI': 'Wisconsin',
'WV': 'West Virginia',
'WY': 'Wyoming'
}
# Remapping
df_new['State'] = df_new['State'].map(states_dict)
<ipython-input-19-cd63c67008f0>:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_new['State'] = df_new['State'].map(states_dict)
# Done
df_new
State | EstimateLow | EstimateHigh | Political Affiliation | Year | |
---|---|---|---|---|---|
1 | California | 100.0 | 100.0 | Left/Anti-Trump | 2020 |
3 | Oregon | 150.0 | 150.0 | Neither | 2020 |
5 | California | 150.0 | 150.0 | Neither | 2020 |
6 | California | 50.0 | 50.0 | Neither | 2020 |
7 | California | 7.0 | 7.0 | Neither | 2020 |
... | ... | ... | ... | ... | ... |
2503 | California | 75.0 | 100.0 | Neither | 2022 |
2506 | California | 36.0 | 36.0 | Neither | 2022 |
2507 | California | 2.0 | 2.0 | Neither | 2022 |
2508 | Puerto Rico | 15.0 | 20.0 | Neither | 2022 |
2514 | Wisconsin | 600.0 | 600.0 | Neither | 2022 |
2198 rows × 5 columns
Now that there is one, organized table, let's take a look at the dtypes to see what pandas thinks of the data.
df_new.dtypes
State object EstimateLow float64 EstimateHigh float64 Political Affiliation object Year int64 dtype: object
It's not too bad. Since counting people in decimals does not make much sense, I will change those columns to integers.
df_new = df_new.astype({'EstimateLow': 'Int64', 'EstimateHigh': 'Int64'})
# Better
df_new.dtypes
State object EstimateLow Int64 EstimateHigh Int64 Political Affiliation object Year int64 dtype: object
Now let's take a look at some summary statistics and graphs. I am going to group by the year column that I made earlier, tally the total number of observations, and take the sum and mean of the estimated low and high columns to get a more accurate representation of the data. This will give me some baseline information to begin forming questions and directing analyses.
# Grouping Years to find total number of observations
df_Counts = df_new.groupby('Year')['Year'].count()
display(df_Counts)
df_Counts.plot.bar(title='Total Protests', ylabel='Number of Protests')
Year 2020 437 2021 782 2022 979 Name: Year, dtype: int64
<Axes: title={'center': 'Total Protests'}, xlabel='Year', ylabel='Number of Protests'>
This is certainly the trend I was expecting to see. Since lockdown was in 2020, it makes sense that 2020 would have the fewest protests. However, the escalation into 2021 and 2022 was much quicker than I anticipated.
# Grouping 'EstimateLow' and 'EstimateHigh' and summing
df_estimate_sum = df_new.groupby('Year')[['EstimateLow', 'EstimateHigh']].sum()
display(df_estimate_sum)
df_estimate_sum.plot.bar(title='Total Attendance', ylabel='Number of Participants')
EstimateLow | EstimateHigh | |
---|---|---|
Year | ||
2020 | 60227 | 67773 |
2021 | 70483 | 126632 |
2022 | 145882 | 173236 |
<Axes: title={'center': 'Total Attendance'}, xlabel='Year', ylabel='Number of Participants'>
This trend is also not too interesting. People were very cautious of large gatherings in 2020 and were not quick to forget in 2021.
# Grouping 'EstimateLow' and 'EstimateHigh' and averaging
df_estimate_avg = df_new.groupby('Year')[['EstimateLow', 'EstimateHigh']].mean()
display(df_estimate_avg)
df_estimate_avg.plot.bar(title='Average Attendance', ylabel='Number of Participants')
EstimateLow | EstimateHigh | |
---|---|---|
Year | ||
2020 | 137.819222 | 155.086957 |
2021 | 90.131714 | 161.933504 |
2022 | 149.011236 | 176.951992 |
<Axes: title={'center': 'Average Attendance'}, xlabel='Year', ylabel='Number of Participants'>
This is actually really interesting. The average attendance per protest in 2020 is roughly equal to the attendance in the other years, if not higher than in 2021. Though lockdown seemed to affect the total number of protests and protesters, it did not seem to affect the average attendance.
Now I will be more closely investigating the relationship with overall protest numbers and their political affiliation over the years.
The first graph is a presentation of all of the protests within the full data table. That is, it shows the total number of protests by political affiliation across the three years without taking the state into consideration. This seems like a good place to start.
# Creating bar graph of total protests throughout the 3 years
df_new['Political Affiliation'].value_counts(sort=False).plot.bar(title='Protests by Political Affiliation',
ylabel='Number of Protests', xlabel='Political Affiliation')
<Axes: title={'center': 'Protests by Political Affiliation'}, xlabel='Political Affiliation', ylabel='Number of Protests'>
The number of right-leaning protests appears to be far lower than that of left-leaning or non-affiliated protests. I wonder if that trend will remain consistent across individual states and how it stacks up against the hate crime statistics.
To investigate that trend during each year, I will create a cross-tabulation of State and Political Affiliation for 2020-2022 and display the results in a heatmap.
# Creating three tables based on year
df_2020 = df_new.loc[df_new['Year'] == 2020]
df_2021 = df_new.loc[df_new['Year'] == 2021]
df_2022 = df_new.loc[df_new['Year'] == 2022]
# Creating three crosstab joint distributions
ct_2020 = pd.crosstab(df_2020.State, df_2020['Political Affiliation'], normalize=True)
ct_2021 = pd.crosstab(df_2021.State, df_2021['Political Affiliation'], normalize=True)
ct_2022 = pd.crosstab(df_2022.State, df_2022['Political Affiliation'], normalize=True)
# Displaying cross-tabulations
df_list = [ct_2020, ct_2021, ct_2022]
for df in df_list:
display(df)
Political Affiliation | Left/Anti-Trump | Neither | Right/Pro-Trump |
---|---|---|---|
State | |||
Alabama | 0.000000 | 0.000000 | 0.002294 |
Alaska | 0.000000 | 0.002294 | 0.002294 |
Arizona | 0.000000 | 0.009174 | 0.009174 |
Arkansas | 0.000000 | 0.000000 | 0.004587 |
California | 0.018349 | 0.064220 | 0.057339 |
Colorado | 0.000000 | 0.004587 | 0.018349 |
Connecticut | 0.002294 | 0.011468 | 0.013761 |
District of Columbia | 0.002294 | 0.000000 | 0.000000 |
Florida | 0.004587 | 0.027523 | 0.029817 |
Georgia | 0.000000 | 0.011468 | 0.002294 |
Hawaii | 0.002294 | 0.000000 | 0.000000 |
Idaho | 0.000000 | 0.000000 | 0.009174 |
Illinois | 0.000000 | 0.020642 | 0.011468 |
Indiana | 0.000000 | 0.004587 | 0.018349 |
Iowa | 0.000000 | 0.004587 | 0.004587 |
Kansas | 0.000000 | 0.006881 | 0.004587 |
Kentucky | 0.000000 | 0.000000 | 0.006881 |
Louisiana | 0.000000 | 0.011468 | 0.002294 |
Maine | 0.002294 | 0.006881 | 0.000000 |
Maryland | 0.004587 | 0.016055 | 0.006881 |
Massachusetts | 0.002294 | 0.016055 | 0.013761 |
Michigan | 0.002294 | 0.013761 | 0.013761 |
Minnesota | 0.000000 | 0.006881 | 0.006881 |
Mississippi | 0.000000 | 0.002294 | 0.000000 |
Missouri | 0.000000 | 0.016055 | 0.009174 |
Montana | 0.000000 | 0.000000 | 0.004587 |
Nebraska | 0.000000 | 0.000000 | 0.002294 |
Nevada | 0.000000 | 0.002294 | 0.004587 |
New Hampshire | 0.000000 | 0.002294 | 0.004587 |
New Jersey | 0.000000 | 0.009174 | 0.018349 |
New Mexico | 0.000000 | 0.000000 | 0.013761 |
New York | 0.000000 | 0.029817 | 0.043578 |
North Carolina | 0.004587 | 0.004587 | 0.011468 |
Ohio | 0.000000 | 0.004587 | 0.016055 |
Oklahoma | 0.002294 | 0.000000 | 0.000000 |
Oregon | 0.000000 | 0.011468 | 0.018349 |
Pennsylvania | 0.004587 | 0.011468 | 0.027523 |
Rhode Island | 0.000000 | 0.000000 | 0.002294 |
South Carolina | 0.000000 | 0.011468 | 0.000000 |
South Dakota | 0.000000 | 0.000000 | 0.002294 |
Tennessee | 0.000000 | 0.006881 | 0.006881 |
Texas | 0.004587 | 0.034404 | 0.027523 |
Utah | 0.000000 | 0.011468 | 0.009174 |
Vermont | 0.000000 | 0.000000 | 0.009174 |
Virginia | 0.004587 | 0.016055 | 0.013761 |
Washington | 0.002294 | 0.004587 | 0.018349 |
West Virginia | 0.000000 | 0.002294 | 0.004587 |
Wisconsin | 0.000000 | 0.009174 | 0.000000 |
Political Affiliation | Left/Anti-Trump | Neither | Right/Pro-Trump |
---|---|---|---|
State | |||
Alabama | 0.005115 | 0.003836 | 0.001279 |
Alaska | 0.000000 | 0.001279 | 0.000000 |
Arizona | 0.003836 | 0.002558 | 0.002558 |
Arkansas | 0.002558 | 0.001279 | 0.000000 |
California | 0.052430 | 0.040921 | 0.038363 |
Colorado | 0.007673 | 0.000000 | 0.005115 |
Connecticut | 0.005115 | 0.006394 | 0.006394 |
Delaware | 0.001279 | 0.001279 | 0.000000 |
District of Columbia | 0.015345 | 0.015345 | 0.002558 |
Florida | 0.010230 | 0.089514 | 0.010230 |
Georgia | 0.005115 | 0.005115 | 0.005115 |
Hawaii | 0.006394 | 0.006394 | 0.003836 |
Idaho | 0.002558 | 0.002558 | 0.005115 |
Illinois | 0.015345 | 0.016624 | 0.003836 |
Indiana | 0.010230 | 0.003836 | 0.002558 |
Iowa | 0.002558 | 0.000000 | 0.001279 |
Kansas | 0.001279 | 0.001279 | 0.000000 |
Kentucky | 0.005115 | 0.003836 | 0.000000 |
Louisiana | 0.001279 | 0.006394 | 0.001279 |
Maine | 0.007673 | 0.001279 | 0.002558 |
Maryland | 0.008951 | 0.007673 | 0.000000 |
Massachusetts | 0.017903 | 0.015345 | 0.003836 |
Michigan | 0.003836 | 0.006394 | 0.008951 |
Minnesota | 0.023018 | 0.003836 | 0.005115 |
Mississippi | 0.001279 | 0.001279 | 0.000000 |
Missouri | 0.003836 | 0.005115 | 0.003836 |
Montana | 0.001279 | 0.001279 | 0.003836 |
Nebraska | 0.001279 | 0.003836 | 0.000000 |
Nevada | 0.001279 | 0.002558 | 0.005115 |
New Hampshire | 0.005115 | 0.000000 | 0.001279 |
New Jersey | 0.008951 | 0.001279 | 0.001279 |
New Mexico | 0.000000 | 0.000000 | 0.005115 |
New York | 0.040921 | 0.052430 | 0.008951 |
North Carolina | 0.014066 | 0.006394 | 0.015345 |
North Dakota | 0.000000 | 0.000000 | 0.001279 |
Ohio | 0.014066 | 0.007673 | 0.005115 |
Oklahoma | 0.001279 | 0.002558 | 0.000000 |
Oregon | 0.005115 | 0.002558 | 0.007673 |
Pennsylvania | 0.024297 | 0.012788 | 0.010230 |
Puerto Rico | 0.001279 | 0.000000 | 0.000000 |
Rhode Island | 0.002558 | 0.001279 | 0.001279 |
South Carolina | 0.002558 | 0.002558 | 0.000000 |
South Dakota | 0.001279 | 0.001279 | 0.002558 |
Tennessee | 0.003836 | 0.007673 | 0.001279 |
Texas | 0.019182 | 0.026854 | 0.003836 |
Utah | 0.000000 | 0.002558 | 0.000000 |
Virginia | 0.014066 | 0.007673 | 0.010230 |
Washington | 0.010230 | 0.001279 | 0.002558 |
West Virginia | 0.002558 | 0.000000 | 0.001279 |
Wisconsin | 0.003836 | 0.005115 | 0.000000 |
Political Affiliation | Left/Anti-Trump | Neither | Right/Pro-Trump |
---|---|---|---|
State | |||
Alabama | 0.010215 | 0.000000 | 0.000000 |
Alaska | 0.006129 | 0.000000 | 0.002043 |
Arizona | 0.004086 | 0.001021 | 0.005107 |
Arkansas | 0.004086 | 0.002043 | 0.001021 |
California | 0.044944 | 0.081716 | 0.009193 |
Colorado | 0.009193 | 0.001021 | 0.001021 |
Connecticut | 0.004086 | 0.002043 | 0.000000 |
Delaware | 0.002043 | 0.000000 | 0.000000 |
District of Columbia | 0.025536 | 0.001021 | 0.006129 |
Florida | 0.022472 | 0.003064 | 0.004086 |
Georgia | 0.016343 | 0.003064 | 0.001021 |
Guam | 0.001021 | 0.000000 | 0.000000 |
Hawaii | 0.000000 | 0.003064 | 0.000000 |
Idaho | 0.006129 | 0.000000 | 0.001021 |
Illinois | 0.015322 | 0.043922 | 0.003064 |
Indiana | 0.011236 | 0.001021 | 0.004086 |
Iowa | 0.008172 | 0.031665 | 0.002043 |
Kansas | 0.004086 | 0.023493 | 0.001021 |
Kentucky | 0.006129 | 0.001021 | 0.000000 |
Louisiana | 0.006129 | 0.000000 | 0.000000 |
Maine | 0.004086 | 0.000000 | 0.000000 |
Maryland | 0.010215 | 0.032686 | 0.001021 |
Massachusetts | 0.013279 | 0.009193 | 0.004086 |
Michigan | 0.012257 | 0.004086 | 0.001021 |
Minnesota | 0.008172 | 0.003064 | 0.000000 |
Mississippi | 0.001021 | 0.000000 | 0.000000 |
Missouri | 0.011236 | 0.001021 | 0.000000 |
Montana | 0.006129 | 0.000000 | 0.004086 |
Nebraska | 0.004086 | 0.002043 | 0.002043 |
Nevada | 0.002043 | 0.000000 | 0.001021 |
New Hampshire | 0.001021 | 0.000000 | 0.002043 |
New Jersey | 0.005107 | 0.002043 | 0.001021 |
New Mexico | 0.006129 | 0.000000 | 0.001021 |
New York | 0.037794 | 0.023493 | 0.009193 |
North Carolina | 0.014300 | 0.005107 | 0.003064 |
North Dakota | 0.002043 | 0.000000 | 0.001021 |
Ohio | 0.034729 | 0.002043 | 0.005107 |
Oklahoma | 0.006129 | 0.000000 | 0.000000 |
Oregon | 0.006129 | 0.013279 | 0.001021 |
Pennsylvania | 0.025536 | 0.008172 | 0.002043 |
Puerto Rico | 0.001021 | 0.004086 | 0.000000 |
Rhode Island | 0.000000 | 0.032686 | 0.000000 |
South Carolina | 0.006129 | 0.001021 | 0.002043 |
South Dakota | 0.003064 | 0.000000 | 0.000000 |
Tennessee | 0.004086 | 0.002043 | 0.000000 |
Texas | 0.019408 | 0.006129 | 0.005107 |
Utah | 0.009193 | 0.000000 | 0.002043 |
Vermont | 0.004086 | 0.000000 | 0.000000 |
Virginia | 0.019408 | 0.003064 | 0.003064 |
Washington | 0.004086 | 0.002043 | 0.000000 |
West Virginia | 0.009193 | 0.000000 | 0.001021 |
Wisconsin | 0.014300 | 0.031665 | 0.001021 |
Wyoming | 0.003064 | 0.000000 | 0.002043 |
# Creating the heatmap for 2020
plt.subplots(figsize=(10,10))
sns.heatmap(ct_2020, xticklabels=True, yticklabels=True)
<Axes: xlabel='Political Affiliation', ylabel='State'>
For 2020, protest political affiliation actually seems to be focused on the right-leaning/pro-Trump side of the spectrum. This is rather strange to see, as the overall distribution was heavily weighted to the left. Let's see how this distribution changes.
# Creating the heatmap for 2021
plt.subplots(figsize=(10,10))
sns.heatmap(ct_2021, xticklabels=True, yticklabels=True)
<Axes: xlabel='Political Affiliation', ylabel='State'>
There certainly seems to be a more even spread across political affiliation, possibly with the left-leaning/anti-Trump side being more present. It seems the shift I was anticipating has already begun.
# Creating the heatmap for 2022
plt.subplots(figsize=(10,10))
sns.heatmap(ct_2022, xticklabels=True, yticklabels=True)
<Axes: xlabel='Political Affiliation', ylabel='State'>
And by 2022, protest political affiliation seems to be a mirror image of that in 2020. However, it is difficult to determine the intensity of the shift from a heatmap, so next I will create a conditional distribution of political affiliation given each state so that we can see which side is more dominant.
# Creating full cross-tabulation
ct_full = pd.crosstab(df_new.State, df_new['Political Affiliation'], normalize=True)
# Creating conditional distribution
state_counts = ct_full.sum(axis=1)
pa_given_state = ct_full.divide(state_counts, axis=0)
# Displaying distribution
pa_given_state
Political Affiliation | Left/Anti-Trump | Neither | Right/Pro-Trump |
---|---|---|---|
State | |||
Alabama | 0.736842 | 0.157895 | 0.105263 |
Alaska | 0.545455 | 0.181818 | 0.272727 |
Arizona | 0.280000 | 0.280000 | 0.440000 |
Arkansas | 0.500000 | 0.250000 | 0.250000 |
California | 0.313131 | 0.471380 | 0.215488 |
Colorado | 0.483871 | 0.096774 | 0.419355 |
Connecticut | 0.281250 | 0.375000 | 0.343750 |
Delaware | 0.750000 | 0.250000 | 0.000000 |
District of Columbia | 0.644068 | 0.220339 | 0.135593 |
Florida | 0.225352 | 0.598592 | 0.176056 |
Georgia | 0.526316 | 0.315789 | 0.157895 |
Guam | 1.000000 | 0.000000 | 0.000000 |
Hawaii | 0.352941 | 0.470588 | 0.176471 |
Idaho | 0.421053 | 0.105263 | 0.473684 |
Illinois | 0.262136 | 0.631068 | 0.106796 |
Indiana | 0.487179 | 0.153846 | 0.358974 |
Iowa | 0.208333 | 0.687500 | 0.104167 |
Kansas | 0.142857 | 0.771429 | 0.085714 |
Kentucky | 0.588235 | 0.235294 | 0.176471 |
Louisiana | 0.368421 | 0.526316 | 0.105263 |
Maine | 0.647059 | 0.235294 | 0.117647 |
Maryland | 0.279412 | 0.661765 | 0.058824 |
Massachusetts | 0.405797 | 0.405797 | 0.188406 |
Michigan | 0.355556 | 0.333333 | 0.311111 |
Minnesota | 0.619048 | 0.214286 | 0.166667 |
Mississippi | 0.500000 | 0.500000 | 0.000000 |
Missouri | 0.424242 | 0.363636 | 0.212121 |
Montana | 0.411765 | 0.058824 | 0.529412 |
Nebraska | 0.384615 | 0.384615 | 0.230769 |
Nevada | 0.230769 | 0.230769 | 0.538462 |
New Hampshire | 0.454545 | 0.090909 | 0.454545 |
New Jersey | 0.413793 | 0.241379 | 0.344828 |
New Mexico | 0.352941 | 0.000000 | 0.647059 |
New York | 0.381215 | 0.425414 | 0.193370 |
North Carolina | 0.457627 | 0.203390 | 0.338983 |
North Dakota | 0.500000 | 0.000000 | 0.500000 |
Ohio | 0.633803 | 0.140845 | 0.225352 |
Oklahoma | 0.800000 | 0.200000 | 0.000000 |
Oregon | 0.222222 | 0.444444 | 0.333333 |
Pennsylvania | 0.505495 | 0.252747 | 0.241758 |
Puerto Rico | 0.333333 | 0.666667 | 0.000000 |
Rhode Island | 0.054054 | 0.891892 | 0.054054 |
South Carolina | 0.444444 | 0.444444 | 0.111111 |
South Dakota | 0.500000 | 0.125000 | 0.375000 |
Tennessee | 0.318182 | 0.500000 | 0.181818 |
Texas | 0.367347 | 0.428571 | 0.204082 |
Utah | 0.409091 | 0.318182 | 0.272727 |
Vermont | 0.500000 | 0.000000 | 0.500000 |
Virginia | 0.492308 | 0.246154 | 0.261538 |
Washington | 0.464286 | 0.178571 | 0.357143 |
West Virginia | 0.687500 | 0.062500 | 0.250000 |
Wisconsin | 0.298246 | 0.684211 | 0.017544 |
Wyoming | 0.600000 | 0.000000 | 0.400000 |
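As a side note, the same conditional distribution can be built in one step by asking pd.crosstab to normalize within each row; a minimal sketch:
# normalize='index' divides each row by its total, giving P(affiliation | state) directly.
pa_given_state_alt = pd.crosstab(df_new.State, df_new['Political Affiliation'], normalize='index')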
To represent this data, I will use a stacked bar graph since it will display the proportions of political affiliation more clearly.
# Creating stacked bar graph
pa_given_state.plot.bar(stacked=True, figsize=(10,10), legend=True)
<Axes: xlabel='State'>
With this and the first bar graph of the political distribution over all three years, it is easy to see that there was a large political shift in protests over the years. I am now wondering whether such a shift is consistent with an increase in hate crimes, seeing as left-leaning protests tend to be connected to social justice movements. Therefore, I am currently hypothesizing that, since there has been a great increase in left-leaning/anti-Trump protests, there will also be a great increase in hate crimes across the U.S.
This data comes from the FBI Crime Data Explorer and depicts each U.S. state's reported hate crime incidents. The data I am using are the reports from 2020, 2021, and 2022. I would like to clarify that I am not reading in the original data found on the website, as that data was poorly formatted for reading into Pandas. Instead, I copied the data into a new Excel file and formatted it in a way that Pandas can read. I also did not copy over extraneous data that I will not be using in this project.
The data can be found under the "Documents and Downloads" section of the front page. Link to data: https://cde.ucr.cjis.gov/LATEST/webapp/#/pages/downloads
# Changing directory
%cd ../'Hate Crimes'
/content/data_science/Project Data/Hate Crimes
As before, the first step is to read in the files and see what we have to work with. From there, I can tidy and format the data.
# Reading in files
hc_2020 = pd.read_csv('hate crimes by state 2020.csv')
hc_2021 = pd.read_csv('hate crimes by state 2021.csv')
hc_2022 = pd.read_csv('hate crimes by state 2022.csv')
# Let's see what we have
display(hc_2020.head(5))
display(hc_2021.head(5))
display(hc_2022.head(5))
Participating State | Total offenses | |
---|---|---|
0 | Alabama | 33 |
1 | Alaska | 10 |
2 | Arizona | 332 |
3 | Arkansas | 22 |
4 | California | 1537 |
Participating State | Total offenses | |
---|---|---|
0 | Alabama | 263 |
1 | Alaska | 14 |
2 | Arizona | 106 |
3 | Arkansas | 47 |
4 | California | 81 |
Participating State | Total offenses | |
---|---|---|
0 | Alabama | 254 |
1 | Alaska | 5 |
2 | Arizona | 225 |
3 | Arkansas | 38 |
4 | California | 2261 |
Everything looks good so far. Let's go ahead and add a year column for clarity and for future merging.
# Creating a year column with appropriate years
hc_2020['Year'] = 2020
hc_2021['Year'] = 2021
hc_2022['Year'] = 2022
# Displaying
display(hc_2020.head(5))
display(hc_2021.head(5))
display(hc_2022.head(5))
Participating State | Total offenses | Year | |
---|---|---|---|
0 | Alabama | 33 | 2020 |
1 | Alaska | 10 | 2020 |
2 | Arizona | 332 | 2020 |
3 | Arkansas | 22 | 2020 |
4 | California | 1537 | 2020 |
Participating State | Total offenses | Year | |
---|---|---|---|
0 | Alabama | 263 | 2021 |
1 | Alaska | 14 | 2021 |
2 | Arizona | 106 | 2021 |
3 | Arkansas | 47 | 2021 |
4 | California | 81 | 2021 |
Participating State | Total offenses | Year | |
---|---|---|---|
0 | Alabama | 254 | 2022 |
1 | Alaska | 5 | 2022 |
2 | Arizona | 225 | 2022 |
3 | Arkansas | 38 | 2022 |
4 | California | 2261 | 2022 |
I just want to change the name of the 'Participating State' column to 'State' for clarity and brevity.
# Creating list of dfs
df_list = [hc_2020, hc_2021, hc_2022]
# Renaming columns
for df in df_list:
df.rename(columns={'Participating State': 'State'}, inplace=True)
# Displaying dfs
for df in df_list:
display(df.head(1))
State | Total offenses | Year | |
---|---|---|---|
0 | Alabama | 33 | 2020 |
State | Total offenses | Year | |
---|---|---|---|
0 | Alabama | 263 | 2021 |
State | Total offenses | Year | |
---|---|---|---|
0 | Alabama | 254 | 2022 |
Now that we have the three tables, I will go ahead and concatenate them.
# Now we concatenate the data
hc_full = pd.concat([hc_2020, hc_2021, hc_2022])
hc_full
State | Total offenses | Year | |
---|---|---|---|
0 | Alabama | 33 | 2020 |
1 | Alaska | 10 | 2020 |
2 | Arizona | 332 | 2020 |
3 | Arkansas | 22 | 2020 |
4 | California | 1537 | 2020 |
... | ... | ... | ... |
46 | Virginia | 211 | 2022 |
47 | Washington | 652 | 2022 |
48 | West Virginia | 59 | 2022 |
49 | Wisconsin | 163 | 2022 |
50 | Wyoming | 29 | 2022 |
153 rows × 3 columns
Now that we have a full dataframe, let's see what the dtypes are.
# Dtypes look good
hc_full.dtypes
State object Total offenses int64 Year int64 dtype: object
Everything looks good.
Now that we have our full table, let's look at some initial descriptive statistics.
hc_full['Total offenses'].describe()
count 153.000000 mean 209.320261 std 285.948943 min 1.000000 25% 47.000000 50% 117.000000 75% 252.000000 max 2261.000000 Name: Total offenses, dtype: float64
That is a rather large jump from the 75th percentile to the max. I wonder if the data is messed up somehow.
# Investigating the issue
hc_full.loc[hc_full['Total offenses'] == 2261]
State | Total offenses | Year | |
---|---|---|---|
4 | California | 2261 | 2022 |
Even after looking at the original data, California just seems to be quite the outlier. I will need to keep this in mind for the analysis.
The next step is to look more closely at these offenses by state.
# Grouping data by state and year
hc_group = hc_full.groupby(['State', 'Year'])['Total offenses'].sum()
hc_group
State Year Alabama 2020 33 2021 263 2022 254 Alaska 2020 10 2021 14 ... Wisconsin 2021 118 2022 163 Wyoming 2020 21 2021 21 2022 29 Name: Total offenses, Length: 153, dtype: int64
That presentation doesn't help too much, so let's create a bar chart to help better understand what we are looking at.
# Creating numpy array
y_indexes = np.arange(51)
height = .25
# Setting figure size for the rather large plot
plt.figure(figsize=(8,13))
# Creating the plot
plt.barh(y_indexes - height, hc_2020['Total offenses'], height=height, label='2020')
plt.barh(y_indexes, hc_2021['Total offenses'], height=height, label='2021')
plt.barh(y_indexes + height, hc_2022['Total offenses'], height=height, label='2022')
# Matching the numpy array to correct labels
plt.yticks(ticks=y_indexes, labels=hc_2020['State'], fontsize=8)
# Adding title, axis labels, and legend
plt.title('Hate Crime Offenses per State', fontsize=15)
plt.xlabel('Total Offenses', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.legend(prop={'size': 10})
plt.grid(axis='x')
# Displaying plot
plt.show()
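As a side note, a similar grouped chart could be drawn with less manual bar positioning by pivoting the combined table and letting pandas place the bars; a minimal sketch using the hc_full frame from above:
# Pivot to a State x Year table of offense counts, then let pandas draw grouped horizontal bars.
hc_wide = hc_full.pivot(index='State', columns='Year', values='Total offenses')
hc_wide.plot.barh(figsize=(8, 13), title='Hate Crime Offenses per State')
plt.xlabel('Total Offenses')
plt.show()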
Interesting. It doesn't seem like there is much of a pattern in the hate crime distribution over the three years. Some states see increases, others decreases, and others sporadic jumps. This is not at all what I was expecting after seeing the distribution of protest political affiliation. This might make the construction of a model much more challenging.
The first step in creating a model will be to put everything into one table. To do so, I will simply merge the hate crime data and protest data together on State and Year.
# Merging data
df_merged = df_new.merge(hc_full, on=['State', 'Year'])
df_merged
State | EstimateLow | EstimateHigh | Political Affiliation | Year | Total offenses | |
---|---|---|---|---|---|---|
0 | California | 100 | 100 | Left/Anti-Trump | 2020 | 1537 |
1 | California | 150 | 150 | Neither | 2020 | 1537 |
2 | California | 50 | 50 | Neither | 2020 | 1537 |
3 | California | 7 | 7 | Neither | 2020 | 1537 |
4 | California | 250 | 250 | Neither | 2020 | 1537 |
... | ... | ... | ... | ... | ... | ... |
2185 | North Dakota | 20 | 20 | Right/Pro-Trump | 2022 | 29 |
2186 | North Dakota | 50 | 50 | Left/Anti-Trump | 2022 | 29 |
2187 | Delaware | 30 | 50 | Left/Anti-Trump | 2022 | 16 |
2188 | Delaware | 50 | 50 | Left/Anti-Trump | 2022 | 16 |
2189 | Mississippi | 20 | 30 | Left/Anti-Trump | 2022 | 32 |
2190 rows × 6 columns
Next, to add a controlling variable, I will be using state population data from 2020, 2021, and 2022. This data comes from the U.S. Census and, much like the hate crime data, is poorly formatted for pandas. Therefore, I will not be using the originally formatted data, but a modified set. And again, the data itself was not modified, only the formatting.
The original data can be found here.
# Changing directory
%cd ..
# Reading in population csv file
pop_df = pd.read_csv('Population Data.csv')
pop_df.head(5)
/content/data_science/Project Data
State | 2020 | 2021 | 2022 | |
---|---|---|---|---|
0 | Alabama | 5,031,362 | 5,049,846 | 5,074,296 |
1 | Alaska | 732,923 | 734,182 | 733,583 |
2 | Arizona | 7,179,943 | 7,264,877 | 7,359,197 |
3 | Arkansas | 3,014,195 | 3,028,122 | 3,045,637 |
4 | California | 39,501,653 | 39,142,991 | 39,029,342 |
Currently, the population for each year sits under the column titled with that respective year. The first thing to do will be to create three separate tables from this one, change the column name to Population, add a year column to each, and concatenate the tables back into one so that I can merge it with the table above (a more concise pd.melt alternative is sketched after the output below).
# Creating three separate tables by year
pop_df_2020 = pop_df[['State', '2020']]
pop_df_2021 = pop_df[['State', '2021']]
pop_df_2022 = pop_df[['State', '2022']]
# Adding a year column to each
pop_df_2020['Year'] = 2020
pop_df_2021['Year'] = 2021
pop_df_2022['Year'] = 2022
# Changing column name to Population
pop_df_2020.rename(columns={'2020': 'Population'}, inplace=True)
pop_df_2021.rename(columns={'2021': 'Population'}, inplace=True)
pop_df_2022.rename(columns={'2022': 'Population'}, inplace=True)
# Concatenating the dfs
pop_half = pd.concat([pop_df_2020, pop_df_2021])
pop_df = pd.concat([pop_half, pop_df_2022])
pop_df
<ipython-input-46-ad03bc6c17c5>:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy pop_df_2020['Year'] = 2020 <ipython-input-46-ad03bc6c17c5>:8: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy pop_df_2021['Year'] = 2021 <ipython-input-46-ad03bc6c17c5>:9: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy pop_df_2022['Year'] = 2022 <ipython-input-46-ad03bc6c17c5>:12: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy pop_df_2020.rename(columns={'2020': 'Population'}, inplace=True) <ipython-input-46-ad03bc6c17c5>:13: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy pop_df_2021.rename(columns={'2021': 'Population'}, inplace=True) <ipython-input-46-ad03bc6c17c5>:14: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy pop_df_2022.rename(columns={'2022': 'Population'}, inplace=True)
State | Population | Year | |
---|---|---|---|
0 | Alabama | 5,031,362 | 2020 |
1 | Alaska | 732,923 | 2020 |
2 | Arizona | 7,179,943 | 2020 |
3 | Arkansas | 3,014,195 | 2020 |
4 | California | 39,501,653 | 2020 |
... | ... | ... | ... |
46 | Virginia | 8,683,619 | 2022 |
47 | Washington | 7,785,786 | 2022 |
48 | West Virginia | 1,775,156 | 2022 |
49 | Wisconsin | 5,892,539 | 2022 |
50 | Wyoming | 581,381 | 2022 |
153 rows × 3 columns
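For reference, the same reshaping can be done in one step with pd.melt; a minimal sketch, assuming pop_wide holds the table exactly as read from 'Population Data.csv':
# Hypothetical alternative: melt the wide year columns into rows so the result matches the other tables.
pop_wide = pd.read_csv('Population Data.csv')
pop_long = pop_wide.melt(id_vars='State', var_name='Year', value_name='Population')
pop_long['Year'] = pop_long['Year'].astype(int)  # the year column headers arrive as strings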
Now that I have created my new table, I can merge this with the previous full df.
# Merging tables
final_df = df_merged.merge(pop_df, on=['State', 'Year'])
final_df
State | EstimateLow | EstimateHigh | Political Affiliation | Year | Total offenses | Population | |
---|---|---|---|---|---|---|---|
0 | California | 100 | 100 | Left/Anti-Trump | 2020 | 1537 | 39,501,653 |
1 | California | 150 | 150 | Neither | 2020 | 1537 | 39,501,653 |
2 | California | 50 | 50 | Neither | 2020 | 1537 | 39,501,653 |
3 | California | 7 | 7 | Neither | 2020 | 1537 | 39,501,653 |
4 | California | 250 | 250 | Neither | 2020 | 1537 | 39,501,653 |
... | ... | ... | ... | ... | ... | ... | ... |
2185 | North Dakota | 20 | 20 | Right/Pro-Trump | 2022 | 29 | 779,261 |
2186 | North Dakota | 50 | 50 | Left/Anti-Trump | 2022 | 29 | 779,261 |
2187 | Delaware | 30 | 50 | Left/Anti-Trump | 2022 | 16 | 1,018,396 |
2188 | Delaware | 50 | 50 | Left/Anti-Trump | 2022 | 16 | 1,018,396 |
2189 | Mississippi | 20 | 30 | Left/Anti-Trump | 2022 | 32 | 2,940,057 |
2190 rows × 7 columns
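One caveat before modeling: the Population values above still contain commas, so pandas stores them as strings, and DictVectorizer will treat each distinct population figure as its own category rather than as a number. The results below were produced with the column left as-is; if a numeric population feature is wanted instead, a minimal conversion sketch would be:
# Strip the thousands separators and cast to integers so Population can act as a numeric feature.
final_df['Population'] = final_df['Population'].str.replace(',', '').astype(int)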
With this final table, I can now begin creating my model. The first thing I will do is to simply train the model and test it on the training data alone. This should give me a baseline of how the model will look and perform.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
# Creating features to fit model
features = ['State', 'EstimateLow', 'EstimateHigh', 'Political Affiliation',
'Year', 'Population']
# Setting the training data
X_train_dict = final_df[features].to_dict(orient="records")
y_train = final_df['Total offenses']
# Dummy encoding
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
# Scaling data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
# K-Nearest Neighbors Model
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train_sc, y_train)
y_train_pred = model.predict(X_train_sc)
# Calculating prediction accuracy
mse = ((y_train - y_train_pred) ** 2).mean()
rmse = np.sqrt(mse)
rmse
147.59384928591015
It seems our current prediction is off by about 148 hate crime incidents in a year. Let's see what the average number of incidents is in a year.
final_df['Total offenses'].mean()
411.8073059360731
So the model's error is currently more than a third of the average number of incidents. Not a great starting point. Let's run some cross-validation tests to see if we can find a better K-value and tune the model a bit more.
# Creating a function to test the model based on features
def test_model(features):
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
# Setting the training data
X_dict = final_df[features].to_dict(orient="records")
y = final_df['Total offenses']
# Creating the pipeline for cv
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
def get_cv_error(k):
model = KNeighborsRegressor(n_neighbors=k)
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
mse = np.mean(-cross_val_score(
pipeline, X_dict, y,
cv=10, scoring="neg_mean_squared_error"
))
# Finding and returning RMSE
rmse = np.sqrt(mse)
return rmse
# Applying the function to a series of K-values to find the best match
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
test_errs = ks.apply(get_cv_error)
display(test_errs.sort_values())
First, I will just see what I get from running the model as-is.
# Creating features to fit model
features = ['State', 'EstimateLow', 'EstimateHigh', 'Political Affiliation',
'Year', 'Population']
test_model(features)
2 1308.478804 3 1309.808228 1 1311.358962 4 1313.637205 16 1313.680531 15 1313.694672 14 1314.657191 17 1315.000239 13 1317.226181 18 1317.403031 5 1318.106265 19 1319.245065 12 1320.801347 20 1321.432688 6 1323.723434 21 1325.386166 11 1326.215236 22 1329.176557 7 1330.737379 23 1331.920860 10 1333.047922 24 1334.661910 8 1336.983182 25 1337.892473 26 1341.183565 9 1342.392255 27 1344.632541 28 1348.806691 29 1352.362345 50 1354.907282 30 1356.499465 49 1358.947974 31 1359.422038 32 1362.550583 48 1363.353416 33 1364.619850 34 1366.992803 47 1367.710584 35 1368.563191 36 1369.582802 37 1370.782396 46 1372.144758 38 1372.210104 39 1373.925720 40 1376.124108 45 1376.956504 41 1378.418426 42 1380.637737 44 1382.330132 43 1383.112618 dtype: float64
So the model is currently off by around 1,300 hate crime offenses per prediction. That is certainly not good. However, I do remember California being quite the outlier in the data set. Though I thought I controlled for that by including both State and Population in my model, it must not have had as significant an effect as I hoped. Therefore, I will take California out of my data set to see if the error improves.
# Setting State to index and dropping California
final_df = final_df.set_index('State').drop('California')
# Resetting index
final_df = final_df.reset_index()
final_df
State | EstimateLow | EstimateHigh | Political Affiliation | Year | Total offenses | Population | |
---|---|---|---|---|---|---|---|
0 | Oregon | 150 | 150 | Neither | 2020 | 324 | 4,244,795 |
1 | Oregon | 36 | 36 | Right/Pro-Trump | 2020 | 324 | 4,244,795 |
2 | Oregon | 100 | 100 | Neither | 2020 | 324 | 4,244,795 |
3 | Oregon | 200 | 300 | Right/Pro-Trump | 2020 | 324 | 4,244,795 |
4 | Oregon | 36 | 36 | Right/Pro-Trump | 2020 | 324 | 4,244,795 |
... | ... | ... | ... | ... | ... | ... | ... |
1888 | North Dakota | 20 | 20 | Right/Pro-Trump | 2022 | 29 | 779,261 |
1889 | North Dakota | 50 | 50 | Left/Anti-Trump | 2022 | 29 | 779,261 |
1890 | Delaware | 30 | 50 | Left/Anti-Trump | 2022 | 16 | 1,018,396 |
1891 | Delaware | 50 | 50 | Left/Anti-Trump | 2022 | 16 | 1,018,396 |
1892 | Mississippi | 20 | 30 | Left/Anti-Trump | 2022 | 32 | 2,940,057 |
1893 rows × 7 columns
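For reference, the same rows could have been removed without touching the index; a minimal sketch:
# Boolean filter that keeps every state except California, then renumbers the rows.
final_df = final_df[final_df['State'] != 'California'].reset_index(drop=True)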
With California out of the data, let's see what the error is now.
# Creating features to fit model
features = ['State', 'EstimateLow', 'EstimateHigh', 'Political Affiliation',
'Year', 'Population']
test_model(features)
1 202.281462 2 220.490351 3 233.627711 4 240.701775 5 245.742813 6 249.179314 7 252.304116 8 253.506376 10 254.003275 9 254.087586 11 254.187763 12 255.005329 17 255.689985 18 255.772137 16 256.126104 13 256.130441 15 256.276410 19 256.411711 14 257.117632 20 257.460189 21 258.802064 22 260.284278 26 260.489019 27 260.544855 28 260.581936 25 260.611264 29 260.878950 24 260.953020 30 261.308943 23 261.535952 31 261.817612 32 262.278236 33 262.605370 34 263.095780 35 263.617776 36 264.196391 37 264.939978 40 265.088988 41 265.098131 38 265.213977 39 265.358719 42 265.380857 43 265.857152 44 266.078680 45 266.572152 46 267.195899 47 267.937241 48 268.831120 49 269.769102 50 270.919691 dtype: float64
# Finding new mean without California
final_df['Total offenses'].mean()
263.62651875330164
That seems to be significantly better, but still not great, as the prediction is still off by about 75% of the average number of offenses. Now I will start dropping some features to see if those are an issue.
# Dropping Estimate numbers
features = ['State', 'Political Affiliation',
'Year', 'Population']
test_model(features)
1 201.712462 2 219.637604 3 232.525221 4 239.885433 5 244.961057 6 248.779659 7 251.867976 10 253.138863 11 253.150450 8 253.254247 9 253.425539 12 253.941817 18 254.876959 13 255.005153 19 255.014155 17 255.341412 20 255.645817 14 256.184700 16 256.483802 21 256.650172 15 257.521982 22 257.907742 23 259.368045 26 259.579711 27 259.598468 25 259.784336 28 259.805735 24 260.098430 29 260.212482 30 260.718169 31 261.256101 32 261.907101 33 262.351444 34 262.969576 35 263.701343 36 264.356208 37 265.061600 38 265.412848 39 265.858757 42 265.901274 43 266.039150 41 266.073954 40 266.469497 44 266.513160 45 267.298049 46 267.645166 47 268.142926 48 268.859665 49 269.701741 50 270.749653 dtype: float64
That really did nothing. Let's try again.
# Dropping Political Affiliation
features = ['State', 'Year', 'Population']
test_model(features)
1 201.746776 2 218.804648 3 231.204409 4 238.419069 5 243.045434 6 246.303633 7 248.750718 8 250.660916 9 252.199084 10 253.468783 11 254.581780 12 255.592058 13 256.510428 14 257.343493 15 258.099415 16 258.786466 17 259.412386 18 259.984150 19 260.507925 20 260.989803 21 261.490521 22 262.033778 23 262.592530 24 263.162021 25 263.734628 26 264.269616 27 264.776092 28 265.353492 29 266.060545 30 266.825053 31 267.649860 32 268.531874 33 269.443800 34 270.391615 35 271.363956 36 272.260132 37 273.174829 38 274.155234 39 275.140242 40 276.153568 41 277.173154 42 278.214307 43 279.257721 44 280.343149 45 281.427668 46 282.507821 47 283.508811 48 284.508914 49 285.505341 50 286.495771 dtype: float64
That didn't seem to help either.
# Dropping Population
features = ['State', 'Political Affiliation',
'Year']
test_model(features)
1 176.719099 2 186.955735 3 194.149532 4 199.610703 5 205.435103 6 211.485052 7 217.205319 8 220.823619 9 225.671020 10 230.686885 11 236.434390 12 242.400573 13 248.186382 14 253.774471 15 259.098702 16 264.234499 17 268.586566 18 272.472398 19 276.302538 20 280.123386 21 284.046577 22 287.316948 23 290.572948 24 293.914099 25 297.149925 26 299.745101 27 302.163013 28 304.744103 50 306.586332 29 307.223699 49 308.718416 30 309.744480 31 310.653447 48 310.686657 32 311.755431 47 312.662292 33 312.909353 34 314.167509 46 314.707015 35 315.552996 45 316.012364 36 316.974352 44 317.465697 37 318.481275 43 318.768775 38 319.050999 39 319.606796 42 319.715064 40 319.888028 41 320.126783 dtype: float64
That might be the best I can get out of this model. I am not sure if I began by overfitting the model or if something was input incorrectly, but controlling for population seems to have an adverse effect on the model. One possibility is the comma-formatted Population column noted earlier, which would be encoded as a set of categories rather than as a single numeric control.
From the above descriptive statistics and the models, there are a few conclusions we can confidently draw. First, there was a very large increase in protests overall from 2020 to 2022. Second, not only did the number of protests increase, but their political affiliation also shifted quite drastically towards the left. Third, hate crimes did not seem to uniformly increase across the nation the way protest numbers and affiliation did. This is quite surprising, as it would make sense for an increase in social justice movements to accompany an increase in socially unacceptable crimes. And lastly, at least for this model, there seems to be no connection between hate crimes and protest numbers or affiliation.
From these findings, two major conclusions follow, in line with our two original questions: 1) There is no apparent connection between hate crime and protest data; something else is driving the increase in left-wing movements. 2) Since there is no apparent increase in the real number of hate crimes over the past few years, the media is not correctly portraying reality. In fact, these two conclusions might be linked: though violent crime overall might be decreasing (as seen in many studies over the past few years), the media portrays reality as though violent crime is increasing. With the perception of rising crime, left-wing movements might be increasing in number to combat this fictitious rise. Future research might find more promising results by delving deeper into the connections between protest data, media coverage, and citizens' perception of violent crime.