In the Heat of Protest¶

Mason Meeks

Since the mid-2010s, it feels as though our world and society have changed drastically, especially in how we engage politically. Using data from the Crowd Counting Consortium, I will be investigating the ebb and flow of protests from 2020 to 2022. Specifically, I will be using protest data from July of each of these years. The primary reason for using a single month from each year is that July sits in the middle of the part of the year in which the majority of protests occur, so one month from each year keeps the analysis simpler while still providing plenty of data.

I will also be using hate crime data from the FBI's Crime Data Explorer, which will allow me to find correlations and create models to investigate the connection between a particular state's protest-based political affiliation, protest activity, and hate crime frequency. Lastly, I will be using U.S. Census Bureau data to control for population.

The overall goal of this project is two-fold: 1) to determine which states see the most per capita hate crimes and protests, and to understand whether those factors are connected with that state's political affiliation, and 2) to assess whether the increase in media coverage of protests and violent hate crimes accurately reflects reality.

The GitHub page for this notebook can be found here

In [1]:
# Clone the course repository, change to the right directory, and import libraries.
%cd /content
!git clone https://github.com/mmeekstu/data_science.git
%cd /content/data_science/'Project Data'/'July Data'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)
/content
Cloning into 'data_science'...
remote: Enumerating objects: 212, done.
remote: Counting objects: 100% (78/78), done.
remote: Compressing objects: 100% (54/54), done.
remote: Total 212 (delta 30), reused 66 (delta 24), pack-reused 134
Receiving objects: 100% (212/212), 18.75 MiB | 18.34 MiB/s, done.
Resolving deltas: 100% (89/89), done.
/content/data_science/Project Data/July Data

Protest Data¶

This data comes from the Crowd Counting Consortium (found here), which records every protest, demonstration, rally, vigil, etc. throughout the U.S. The creators of this data have coded many aspects of each protest, including its location, size, claim, groups involved, interactions among protesters, interactions with police, and many others. I chose this data because it should give me an abundance of information that will allow me to ask and potentially answer many types of questions. The initial question I hope to investigate with this data is how the frequency of protests and the number of protesters have changed since 2020. With major elections, COVID, and massive public responses to policies and governmental actions, I expect to find a lot of interesting statistics that will shape the overall goal of this project.

ETL¶

In [2]:
# Reading in csv files
df_2020 = pd.read_csv('Crowd Estimates July 2020 - Tally.csv')
df_2021 = pd.read_csv('Crowd Estimates July 2021 - Tally.csv')
df_2022 = pd.read_csv('Crowd Estimates July 2022 - Tally.csv')

After reading in the csv files, I am going to add a year column to each with its corresponding year. This will make grouping them quick and simple later on.

In [3]:
# Adding a year column to each
df_2020['Year'] = 2020
df_2021['Year'] = 2021
df_2022['Year'] = 2022
In [4]:
# Creating a list for easy access
df_list = list([df_2020, df_2021, df_2022])

Now for a quick display of the first row of each file. This will show how messy or neat the data is, and allow me to understand where I need to start organizing and tidying.

In [5]:
for df in df_list:
  display(df.head(1))
CityTown Location County StateTerritory Country Date EstimateText EstimateLow BestGuess EstimateHigh AdjustedLow AdjustedHigh Actor Claim Pro(2)/Anti(1) EventType ReportedArrests ReportedParticipantInjuries ReportedPoliceInjuries ReportedPropertyDamage TownsCities Events MacroEvent Misc. Source1 Source2 Source3 Source4 Source5 Source6 Unnamed: 30 Year
0 Ashland Memorial Park; streets of Ashland; Highway 13 ... NaN WI US 2020-07-01 NaN NaN NaN NaN NaN NaN water protectors against recloation of Enbridge Line 5 Pipeline... 1 protest 0 0 0.0 0.0 1 1 NaN NaN https://kbjr6.com/2020/07/01/protesters-march-... NaN NaN NaN NaN NaN NaN 2020
date locality state location title size_text size_low size_high organizations participants claims valence event_type police_measures participant_measures police_injuries participant_injuries arrests property_damage police_deaths participant_deaths macroevent notes coder source1 source2 source3 source4 source5 source6 source7 source8 source9 source10 source11 source12 source13 source14 source15 source16 source17 source18 source19 source20 source21 source22 source23 source24 source25 source26 source27 source28 source29 source30 Year
0 2021-07-01 Addison IL Portillo's Food Service NaN 17 17.0 17.0 Arise Chicago workers for better working conditions and higher pay f... 0.0 strike; picket NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN https://chicago.suntimes.com/2021/7/2/22561028... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2021
date locality state location title size_text size_low size_high organizations participants claims valence event_type police_measures participant_measures police_injuries participant_injuries arrests property_damage police_deaths participant_deaths macroevent notes coder source1 source2 source3 source4 source5 source6 source7 source8 source9 source10 source11 source12 source13 source14 source15 source16 source17 source18 source19 source20 source21 source22 source23 source24 source25 source26 source27 source28 source29 source30 Year
0 2022-07-01 Akron OH Harold K. Stubbs Justice Center NaN about 20 20.0 20.0 The Freedom BLOC NaN for justice for Jayland Walker, against police... 1.0 protest NaN conversation with counter-protester NaN NaN NaN NaN NaN NaN 20220701-akron-jaylandwalker NaN NaN https://twitter.com/sahrasulaiman/status/15431... https://fox8.com/news/elected-officials-callin... https://www.wkyc.com/article/news/local/akron/... https://www.beaconjournal.com/story/news/2022/... https://www.instagram.com/p/Cfcmqv6JPtD/ https://www.instagram.com/p/CfepN98pt0G/ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2022

There are a lot of useless columns that need to be dropped. Also, this data is quite messy, and the column names are not consistent across the years. I am just going to run some for loops over the columns to drop what I don't want.

In [6]:
# Dropping a lot of data that will not be needed for the project
for (cName, cData) in df_2020.items():
    if ("source" in cName) | ('Source' in cName) | ('macroevent' in cName) | ('MacroEvent' in cName) | ('Misc.' in cName) | ('Unnamed' in cName) | ('notes' in cName) | ('coder' in cName) | ('TownsCities' in cName) | ('Events' in cName) | ('BestGuess' in cName) | ('AdjustedLow' in cName) | ('AdjustedHigh' in cName) | ('EstimateText' in cName) | ('County' in cName) | ('size_text' in cName) | ('title' in cName):
      df_2020 = df_2020.drop(columns = cName)
for (cName, cData) in df_2021.items():
    if ("source" in cName) | ('Source' in cName) | ('macroevent' in cName) | ('MacroEvent' in cName) | ('Misc.' in cName) | ('Unnamed' in cName) | ('notes' in cName) | ('coder' in cName) | ('TownsCities' in cName) | ('Events' in cName) | ('BestGuess' in cName) | ('AdjustedLow' in cName) | ('AdjustedHigh' in cName) | ('EstimateText' in cName) | ('County' in cName) | ('size_text' in cName) | ('title' in cName):
      df_2021 = df_2021.drop(columns = cName)
for (cName, cData) in df_2022.items():
    if ("source" in cName) | ('Source' in cName) | ('macroevent' in cName) | ('MacroEvent' in cName) | ('Misc.' in cName) | ('Unnamed' in cName) | ('notes' in cName) | ('coder' in cName) | ('TownsCities' in cName) | ('Events' in cName) | ('BestGuess' in cName) | ('AdjustedLow' in cName) | ('AdjustedHigh' in cName) | ('EstimateText' in cName) | ('County' in cName) | ('size_text' in cName) | ('title' in cName):
      df_2022 = df_2022.drop(columns = cName)
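As an aside, the same filtering can be written once as a small helper and applied to all three data frames; a minimal sketch that I believe is equivalent to the three loops above:

# Keywords identifying columns to drop (same list used in the loops above)
drop_keywords = ['source', 'Source', 'macroevent', 'MacroEvent', 'Misc.', 'Unnamed',
                 'notes', 'coder', 'TownsCities', 'Events', 'BestGuess', 'AdjustedLow',
                 'AdjustedHigh', 'EstimateText', 'County', 'size_text', 'title']

def drop_unwanted(df):
    # Drop any column whose name contains one of the keywords
    unwanted = [c for c in df.columns if any(key in c for key in drop_keywords)]
    return df.drop(columns=unwanted)

df_2020, df_2021, df_2022 = [drop_unwanted(df) for df in (df_2020, df_2021, df_2022)]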
In [7]:
# Re-establishing the list
df_list = list([df_2020, df_2021, df_2022])
In [8]:
# Let's see what we have now
for df in df_list:
  display(df.head(1))
CityTown Location StateTerritory Country Date EstimateLow EstimateHigh Actor Claim Pro(2)/Anti(1) EventType ReportedArrests ReportedParticipantInjuries ReportedPoliceInjuries ReportedPropertyDamage Year
0 Ashland Memorial Park; streets of Ashland; Highway 13 ... WI US 2020-07-01 NaN NaN water protectors against recloation of Enbridge Line 5 Pipeline... 1 protest 0 0 0.0 0.0 2020
date locality state location size_low size_high organizations participants claims valence event_type police_measures participant_measures police_injuries participant_injuries arrests property_damage police_deaths participant_deaths Year
0 2021-07-01 Addison IL Portillo's Food Service 17.0 17.0 Arise Chicago workers for better working conditions and higher pay f... 0.0 strike; picket NaN NaN NaN NaN NaN NaN NaN NaN 2021
date locality state location size_low size_high organizations participants claims valence event_type police_measures participant_measures police_injuries participant_injuries arrests property_damage police_deaths participant_deaths Year
0 2022-07-01 Akron OH Harold K. Stubbs Justice Center 20.0 20.0 The Freedom BLOC NaN for justice for Jayland Walker, against police... 1.0 protest NaN conversation with counter-protester NaN NaN NaN NaN NaN NaN 2022

It is definitely already more manageable and organized. However, there are a lot of inconsistencies in the naming. There seems to have been a large change in 2021 in how the investigators coded the data. I will go ahead and make the names more consistent.

In [9]:
# Renaming inconsistent columns
df_2020 = df_2020.rename(columns={"CityTown": "City", "StateTerritory": "State", 'Actor': 'Actors', 'Pro(2)/Anti(1)': 'Political Affiliation'})
df_2021 = df_2021.rename(columns={'date': 'Date', 'locality': 'City', 'location': 'Location', 'state': 'State', 'size_low': 'EstimateLow', 'size_high': 'EstimateHigh', 'claims': 'Claim', 'event_type': 'EventType', 'participants': 'Actors', 'valence': 'Political Affiliation', 'police_injuries': 'ReportedPoliceInjuries', 'participant_injuries': 'ReportedParticipantInjuries', 'arrests': 'ReportedArrests', 'property_damage': 'ReportedPropertyDamage'})
df_2022 = df_2022.rename(columns={'date': 'Date', 'locality': 'City', 'location': 'Location', 'state': 'State', 'size_low': 'EstimateLow', 'size_high': 'EstimateHigh', 'claims': 'Claim', 'event_type': 'EventType', 'participants': 'Actors', 'valence': 'Political Affiliation', 'police_injuries': 'ReportedPoliceInjuries', 'participant_injuries': 'ReportedParticipantInjuries', 'arrests': 'ReportedArrests', 'property_damage': 'ReportedPropertyDamage'})
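Since the 2021 and 2022 files share the same column names, the rename mapping only needs to be written once and can be reused; a small, equivalent sketch:

rename_map = {'date': 'Date', 'locality': 'City', 'location': 'Location', 'state': 'State',
              'size_low': 'EstimateLow', 'size_high': 'EstimateHigh', 'claims': 'Claim',
              'event_type': 'EventType', 'participants': 'Actors', 'valence': 'Political Affiliation',
              'police_injuries': 'ReportedPoliceInjuries', 'participant_injuries': 'ReportedParticipantInjuries',
              'arrests': 'ReportedArrests', 'property_damage': 'ReportedPropertyDamage'}
df_2021 = df_2021.rename(columns=rename_map)
df_2022 = df_2022.rename(columns=rename_map)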
In [10]:
# Re-establishing the list
df_list = list([df_2020, df_2021, df_2022])
In [11]:
# Much more consistent
for df in df_list:
  display(df.head(1))
City Location State Country Date EstimateLow EstimateHigh Actors Claim Political Affiliation EventType ReportedArrests ReportedParticipantInjuries ReportedPoliceInjuries ReportedPropertyDamage Year
0 Ashland Memorial Park; streets of Ashland; Highway 13 ... WI US 2020-07-01 NaN NaN water protectors against recloation of Enbridge Line 5 Pipeline... 1 protest 0 0 0.0 0.0 2020
Date City State Location EstimateLow EstimateHigh organizations Actors Claim Political Affiliation EventType police_measures participant_measures ReportedPoliceInjuries ReportedParticipantInjuries ReportedArrests ReportedPropertyDamage police_deaths participant_deaths Year
0 2021-07-01 Addison IL Portillo's Food Service 17.0 17.0 Arise Chicago workers for better working conditions and higher pay f... 0.0 strike; picket NaN NaN NaN NaN NaN NaN NaN NaN 2021
Date City State Location EstimateLow EstimateHigh organizations Actors Claim Political Affiliation EventType police_measures participant_measures ReportedPoliceInjuries ReportedParticipantInjuries ReportedArrests ReportedPropertyDamage police_deaths participant_deaths Year
0 2022-07-01 Akron OH Harold K. Stubbs Justice Center 20.0 20.0 The Freedom BLOC NaN for justice for Jayland Walker, against police... 1.0 protest NaN conversation with counter-protester NaN NaN NaN NaN NaN NaN 2022

I am going to go ahead and concatenate the tables now that naming is more consistent. This will allow me to more easily manage, transform, and analyze the data.

In [12]:
df_full = pd.concat([df_2020, df_2021, df_2022])
In [13]:
# Success. Here is what the new table looks like
df_full
Out[13]:
City Location State Country Date EstimateLow EstimateHigh Actors Claim Political Affiliation EventType ReportedArrests ReportedParticipantInjuries ReportedPoliceInjuries ReportedPropertyDamage Year organizations police_measures participant_measures police_deaths participant_deaths
0 Ashland Memorial Park; streets of Ashland; Highway 13 ... WI US 2020-07-01 NaN NaN water protectors against recloation of Enbridge Line 5 Pipeline... 1.0 protest 0 0 0.0 0.0 2020 NaN NaN NaN NaN NaN
1 Boulevard NaN CA US 2020-07-01 100.0 100.0 members of the Kumeyaay Tribe against construction of border wall, against d... 1.0 march NaN NaN NaN NaN 2020 NaN NaN NaN NaN NaN
2 Brentwood Bank of America Financial Center CA US 2020-07-01 NaN NaN fired employees of Terranea Resort against firing of staff and salary cuts at Ter... 0.0 caravan; protest 0 0 0.0 0.0 2020 NaN NaN NaN NaN NaN
3 Eugene Lane County Jail OR US 2020-07-01 150.0 150.0 general protesters show solidarity with inmates; seek release of ... 0.0 protest 1 0 1.0 1.0 2020 NaN NaN NaN NaN NaN
4 Holyoke Providence Behavioral Health Hospital MA US 2020-07-01 NaN NaN Nurses Association against closure of inpatient beds in psychiatr... 0.0 memorial; protest 0 0 0.0 0.0 2020 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2518 Washington National Mall DC NaN 2022-07-31 NaN NaN NaN against 1776 Restoration Movement 0.0 counter-protest NaN NaN NaN NaN 2022 NaN NaN NaN NaN NaN
2519 Washington U.S. Capitol DC NaN 2022-07-31 NaN NaN veterans for the PACT Act, for passage of federal legis... 0.0 demonstration; rally NaN NaN NaN NaN 2022 Burn Pits 360 NaN round-the-clock sit-in on Capitol steps NaN NaN
2520 Washington Union Station DC NaN 2022-07-31 NaN NaN NaN for Medicare for all 1.0 demonstration NaN NaN NaN NaN 2022 March for Medicare for All NaN NaN NaN NaN
2521 West Hartford Flatbush Ave and New Park Ave CT NaN 2022-07-31 NaN NaN NaN against circumcision of male infants 0.0 protest NaN NaN NaN NaN 2022 Bloodstained Men NaN NaN NaN NaN
2522 Wooster Wooster Square OH NaN 2022-07-31 NaN NaN NaN against racism, against police brutality 1.0 demonstration NaN NaN NaN NaN 2022 Wayne County Racial Justice Coalition NaN NaN NaN NaN

5254 rows × 21 columns

Since the scope of the project only involves the number of protests, the number of protesters, and their political affiliation, let's now trim the table down to only the columns I will need. I am going to keep state, the low and high attendance estimates, political affiliation, and year.

In [ ]:
# Creating a new dataframe
df_new = pd.DataFrame(df_full[['State', 'EstimateLow', 'EstimateHigh', 'Political Affiliation', 'Year']])
df_new

Next, I am going to drop all the rows with NA values, since they cannot be used in the prediction model.

In [15]:
# Finding and getting rid of NAs
df_new = df_new.dropna()
df_new
Out[15]:
State EstimateLow EstimateHigh Political Affiliation Year
1 CA 100.0 100.0 1.0 2020
3 OR 150.0 150.0 0.0 2020
5 CA 150.0 150.0 0.0 2020
6 CA 50.0 50.0 0.0 2020
7 CA 7.0 7.0 0.0 2020
... ... ... ... ... ...
2503 CA 75.0 100.0 0.0 2022
2506 CA 36.0 36.0 0.0 2022
2507 CA 2.0 2.0 0.0 2022
2508 PR 15.0 20.0 0.0 2022
2514 WI 600.0 600.0 0.0 2022

2198 rows × 5 columns

I do not like that Political Affiliation is coded numerically, so I am going to remap the values to labels that make more intuitive sense.

In [16]:
# Remapping Political Affiliation to be more intuitive
df_new['Political Affiliation'] = df_new['Political Affiliation'].map({
    0.0: 'Neither',
    1.0: 'Left/Anti-Trump',
    2.0: 'Right/Pro-Trump'
})
<ipython-input-16-d7bce5010979>:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_new['Political Affiliation'] = df_new['Political Affiliation'].map({
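The SettingWithCopyWarning above appears because df_new was built from a slice of df_full; the remapping still went through (as the table below shows), but the warning can be avoided by taking an explicit copy when the subset is first created. A minimal sketch:

# Taking an explicit copy avoids SettingWithCopyWarning on later assignments
df_new = df_full[['State', 'EstimateLow', 'EstimateHigh', 'Political Affiliation', 'Year']].copy()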
In [17]:
# Great
df_new
Out[17]:
State EstimateLow EstimateHigh Political Affiliation Year
1 CA 100.0 100.0 Left/Anti-Trump 2020
3 OR 150.0 150.0 Neither 2020
5 CA 150.0 150.0 Neither 2020
6 CA 50.0 50.0 Neither 2020
7 CA 7.0 7.0 Neither 2020
... ... ... ... ... ...
2503 CA 75.0 100.0 Neither 2022
2506 CA 36.0 36.0 Neither 2022
2507 CA 2.0 2.0 Neither 2022
2508 PR 15.0 20.0 Neither 2022
2514 WI 600.0 600.0 Neither 2022

2198 rows × 5 columns

I also want to remap the state column to full state names so that it is more consistent with the second data set and easier to interpret.

In [18]:
# Creating a dict mapping state abbreviations to full names
states_dict = {
        'AK': 'Alaska',
        'AL': 'Alabama',
        'AR': 'Arkansas',
        'AS': 'American Samoa',
        'AZ': 'Arizona',
        'CA': 'California',
        'CO': 'Colorado',
        'CT': 'Connecticut',
        'DC': 'District of Columbia',
        'DE': 'Delaware',
        'FL': 'Florida',
        'GA': 'Georgia',
        'GU': 'Guam',
        'HI': 'Hawaii',
        'IA': 'Iowa',
        'ID': 'Idaho',
        'IL': 'Illinois',
        'IN': 'Indiana',
        'KS': 'Kansas',
        'KY': 'Kentucky',
        'LA': 'Louisiana',
        'MA': 'Massachusetts',
        'MD': 'Maryland',
        'ME': 'Maine',
        'MI': 'Michigan',
        'MN': 'Minnesota',
        'MO': 'Missouri',
        'MP': 'Northern Mariana Islands',
        'MS': 'Mississippi',
        'MT': 'Montana',
        'NA': 'National',
        'NC': 'North Carolina',
        'ND': 'North Dakota',
        'NE': 'Nebraska',
        'NH': 'New Hampshire',
        'NJ': 'New Jersey',
        'NM': 'New Mexico',
        'NV': 'Nevada',
        'NY': 'New York',
        'OH': 'Ohio',
        'OK': 'Oklahoma',
        'OR': 'Oregon',
        'PA': 'Pennsylvania',
        'PR': 'Puerto Rico',
        'RI': 'Rhode Island',
        'SC': 'South Carolina',
        'SD': 'South Dakota',
        'TN': 'Tennessee',
        'TX': 'Texas',
        'UT': 'Utah',
        'VA': 'Virginia',
        'VI': 'Virgin Islands',
        'VT': 'Vermont',
        'WA': 'Washington',
        'WI': 'Wisconsin',
        'WV': 'West Virginia',
        'WY': 'Wyoming'
}
In [19]:
# Remapping
df_new['State'] = df_new['State'].map(states_dict)
<ipython-input-19-cd63c67008f0>:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_new['State'] = df_new['State'].map(states_dict)
In [20]:
# Done
df_new
Out[20]:
State EstimateLow EstimateHigh Political Affiliation Year
1 California 100.0 100.0 Left/Anti-Trump 2020
3 Oregon 150.0 150.0 Neither 2020
5 California 150.0 150.0 Neither 2020
6 California 50.0 50.0 Neither 2020
7 California 7.0 7.0 Neither 2020
... ... ... ... ... ...
2503 California 75.0 100.0 Neither 2022
2506 California 36.0 36.0 Neither 2022
2507 California 2.0 2.0 Neither 2022
2508 Puerto Rico 15.0 20.0 Neither 2022
2514 Wisconsin 600.0 600.0 Neither 2022

2198 rows × 5 columns

Now that there is one organized table, let's take a look at the dtypes to see how pandas has interpreted the data.

In [21]:
df_new.dtypes
Out[21]:
State                     object
EstimateLow              float64
EstimateHigh             float64
Political Affiliation     object
Year                       int64
dtype: object

It's not too bad. Since counting people in decimals does not make much sense, I will convert those columns to integers.

In [22]:
df_new = df_new.astype({'EstimateLow': 'Int64', 'EstimateHigh': 'Int64'})
In [23]:
# Better
df_new.dtypes
Out[23]:
State                    object
EstimateLow               Int64
EstimateHigh              Int64
Political Affiliation    object
Year                      int64
dtype: object

Protest Data EDA¶

Now let's take a look at some summary statistics and graphs. I am going to group by the year column that I made earlier, tally the total number of observations, and take the sum and mean of the estimated low and high columns to get a more accurate representation of the data. This will give me some baseline information to begin forming questions and directing analyses.
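As a compact alternative to the three cells below, the same summaries could be computed in a single pass with named aggregation; a minimal sketch:

# One groupby producing the count, sums, and means used below
summary = df_new.groupby('Year').agg(
    Protests=('Year', 'count'),
    TotalLow=('EstimateLow', 'sum'),
    TotalHigh=('EstimateHigh', 'sum'),
    MeanLow=('EstimateLow', 'mean'),
    MeanHigh=('EstimateHigh', 'mean'),
)
summary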

In [24]:
# Grouping Years to find total number of observations
df_Counts = df_new.groupby('Year')['Year'].count()
display(df_Counts)
df_Counts.plot.bar(title='Total Protests', ylabel='Number of Protests')
Year
2020    437
2021    782
2022    979
Name: Year, dtype: int64
Out[24]:
<Axes: title={'center': 'Total Protests'}, xlabel='Year', ylabel='Number of Protests'>

This is certainly the trend I expected to see. Since lockdown was in 2020, it makes sense that 2020 would have the fewest protests. However, the escalation into 2021 and 2022 was much quicker than I anticipated.

In [25]:
# Grouping 'EstimateLow' and 'EstimateHigh' and summing
df_estimate_sum = df_new.groupby('Year')[['EstimateLow', 'EstimateHigh']].sum()
display(df_estimate_sum)
df_estimate_sum.plot.bar(title='Total Attendance', ylabel='Number of Participants')
EstimateLow EstimateHigh
Year
2020 60227 67773
2021 70483 126632
2022 145882 173236
Out[25]:
<Axes: title={'center': 'Total Attendance'}, xlabel='Year', ylabel='Number of Participants'>

This trend is also not too surprising. People were very cautious of large gatherings in 2020 and were not quick to forget that caution in 2021.

In [26]:
# Grouping 'EstimateLow' and 'EstimateHigh' and averaging
df_estimate_avg = df_new.groupby('Year')[['EstimateLow', 'EstimateHigh']].mean()
display(df_estimate_avg)
df_estimate_avg.plot.bar(title='Average Attendance', ylabel='Number of Participants')
EstimateLow EstimateHigh
Year
2020 137.819222 155.086957
2021 90.131714 161.933504
2022 149.011236 176.951992
Out[26]:
<Axes: title={'center': 'Average Attendance'}, xlabel='Year', ylabel='Number of Participants'>

This is actually really interesting. The average attendance per protest in 2020 is roughly equal to the attendance in the other years, and is even higher than in 2021. Though lockdown seemed to affect the total number of protests and the total number of protesters, it did not seem to affect average attendance.

Now I will more closely investigate the relationship between overall protest numbers and their political affiliation over the years.

The first graph is a presentation of all of the protests within the full data table. That is, it shows the total count of protests by political affiliation across all three years, without taking state into consideration. This seems like a good place to start.

In [27]:
# Creating bar graph of total protests throughout the 3 years
df_new['Political Affiliation'].value_counts(sort=False).plot.bar(title='Protests by Political Affiliation',
                                                                  ylabel='Number of Protests', xlabel='Political Affiliation')
Out[27]:
<Axes: title={'center': 'Protests by Political Affiliation'}, xlabel='Political Affiliation', ylabel='Number of Protests'>

The number of right-leaning protests appears to be far smaller than the number of left-leaning or non-affiliated protests. I wonder if that trend will remain consistent throughout individual states and how it stacks up against the hate crime statistics.

To investigate that trend during each year, I will create a cross-tabulation of State and Political Affiliation for 2020-2022 and display the results in heatmaps.

In [28]:
# Creating three tables based on year
df_2020 = df_new.loc[df_new['Year'] == 2020]
df_2021 = df_new.loc[df_new['Year'] == 2021]
df_2022 = df_new.loc[df_new['Year'] == 2022]

# Creating three crosstab joint distributions
ct_2020 = pd.crosstab(df_2020.State, df_2020['Political Affiliation'], normalize=True)
ct_2021 = pd.crosstab(df_2021.State, df_2021['Political Affiliation'], normalize=True)
ct_2022 = pd.crosstab(df_2022.State, df_2022['Political Affiliation'], normalize=True)

# Displaying cross-tabulations
df_list = [ct_2020, ct_2021, ct_2022]
for df in df_list:
  display(df)
Political Affiliation Left/Anti-Trump Neither Right/Pro-Trump
State
Alabama 0.000000 0.000000 0.002294
Alaska 0.000000 0.002294 0.002294
Arizona 0.000000 0.009174 0.009174
Arkansas 0.000000 0.000000 0.004587
California 0.018349 0.064220 0.057339
Colorado 0.000000 0.004587 0.018349
Connecticut 0.002294 0.011468 0.013761
District of Columbia 0.002294 0.000000 0.000000
Florida 0.004587 0.027523 0.029817
Georgia 0.000000 0.011468 0.002294
Hawaii 0.002294 0.000000 0.000000
Idaho 0.000000 0.000000 0.009174
Illinois 0.000000 0.020642 0.011468
Indiana 0.000000 0.004587 0.018349
Iowa 0.000000 0.004587 0.004587
Kansas 0.000000 0.006881 0.004587
Kentucky 0.000000 0.000000 0.006881
Louisiana 0.000000 0.011468 0.002294
Maine 0.002294 0.006881 0.000000
Maryland 0.004587 0.016055 0.006881
Massachusetts 0.002294 0.016055 0.013761
Michigan 0.002294 0.013761 0.013761
Minnesota 0.000000 0.006881 0.006881
Mississippi 0.000000 0.002294 0.000000
Missouri 0.000000 0.016055 0.009174
Montana 0.000000 0.000000 0.004587
Nebraska 0.000000 0.000000 0.002294
Nevada 0.000000 0.002294 0.004587
New Hampshire 0.000000 0.002294 0.004587
New Jersey 0.000000 0.009174 0.018349
New Mexico 0.000000 0.000000 0.013761
New York 0.000000 0.029817 0.043578
North Carolina 0.004587 0.004587 0.011468
Ohio 0.000000 0.004587 0.016055
Oklahoma 0.002294 0.000000 0.000000
Oregon 0.000000 0.011468 0.018349
Pennsylvania 0.004587 0.011468 0.027523
Rhode Island 0.000000 0.000000 0.002294
South Carolina 0.000000 0.011468 0.000000
South Dakota 0.000000 0.000000 0.002294
Tennessee 0.000000 0.006881 0.006881
Texas 0.004587 0.034404 0.027523
Utah 0.000000 0.011468 0.009174
Vermont 0.000000 0.000000 0.009174
Virginia 0.004587 0.016055 0.013761
Washington 0.002294 0.004587 0.018349
West Virginia 0.000000 0.002294 0.004587
Wisconsin 0.000000 0.009174 0.000000
Political Affiliation Left/Anti-Trump Neither Right/Pro-Trump
State
Alabama 0.005115 0.003836 0.001279
Alaska 0.000000 0.001279 0.000000
Arizona 0.003836 0.002558 0.002558
Arkansas 0.002558 0.001279 0.000000
California 0.052430 0.040921 0.038363
Colorado 0.007673 0.000000 0.005115
Connecticut 0.005115 0.006394 0.006394
Delaware 0.001279 0.001279 0.000000
District of Columbia 0.015345 0.015345 0.002558
Florida 0.010230 0.089514 0.010230
Georgia 0.005115 0.005115 0.005115
Hawaii 0.006394 0.006394 0.003836
Idaho 0.002558 0.002558 0.005115
Illinois 0.015345 0.016624 0.003836
Indiana 0.010230 0.003836 0.002558
Iowa 0.002558 0.000000 0.001279
Kansas 0.001279 0.001279 0.000000
Kentucky 0.005115 0.003836 0.000000
Louisiana 0.001279 0.006394 0.001279
Maine 0.007673 0.001279 0.002558
Maryland 0.008951 0.007673 0.000000
Massachusetts 0.017903 0.015345 0.003836
Michigan 0.003836 0.006394 0.008951
Minnesota 0.023018 0.003836 0.005115
Mississippi 0.001279 0.001279 0.000000
Missouri 0.003836 0.005115 0.003836
Montana 0.001279 0.001279 0.003836
Nebraska 0.001279 0.003836 0.000000
Nevada 0.001279 0.002558 0.005115
New Hampshire 0.005115 0.000000 0.001279
New Jersey 0.008951 0.001279 0.001279
New Mexico 0.000000 0.000000 0.005115
New York 0.040921 0.052430 0.008951
North Carolina 0.014066 0.006394 0.015345
North Dakota 0.000000 0.000000 0.001279
Ohio 0.014066 0.007673 0.005115
Oklahoma 0.001279 0.002558 0.000000
Oregon 0.005115 0.002558 0.007673
Pennsylvania 0.024297 0.012788 0.010230
Puerto Rico 0.001279 0.000000 0.000000
Rhode Island 0.002558 0.001279 0.001279
South Carolina 0.002558 0.002558 0.000000
South Dakota 0.001279 0.001279 0.002558
Tennessee 0.003836 0.007673 0.001279
Texas 0.019182 0.026854 0.003836
Utah 0.000000 0.002558 0.000000
Virginia 0.014066 0.007673 0.010230
Washington 0.010230 0.001279 0.002558
West Virginia 0.002558 0.000000 0.001279
Wisconsin 0.003836 0.005115 0.000000
Political Affiliation Left/Anti-Trump Neither Right/Pro-Trump
State
Alabama 0.010215 0.000000 0.000000
Alaska 0.006129 0.000000 0.002043
Arizona 0.004086 0.001021 0.005107
Arkansas 0.004086 0.002043 0.001021
California 0.044944 0.081716 0.009193
Colorado 0.009193 0.001021 0.001021
Connecticut 0.004086 0.002043 0.000000
Delaware 0.002043 0.000000 0.000000
District of Columbia 0.025536 0.001021 0.006129
Florida 0.022472 0.003064 0.004086
Georgia 0.016343 0.003064 0.001021
Guam 0.001021 0.000000 0.000000
Hawaii 0.000000 0.003064 0.000000
Idaho 0.006129 0.000000 0.001021
Illinois 0.015322 0.043922 0.003064
Indiana 0.011236 0.001021 0.004086
Iowa 0.008172 0.031665 0.002043
Kansas 0.004086 0.023493 0.001021
Kentucky 0.006129 0.001021 0.000000
Louisiana 0.006129 0.000000 0.000000
Maine 0.004086 0.000000 0.000000
Maryland 0.010215 0.032686 0.001021
Massachusetts 0.013279 0.009193 0.004086
Michigan 0.012257 0.004086 0.001021
Minnesota 0.008172 0.003064 0.000000
Mississippi 0.001021 0.000000 0.000000
Missouri 0.011236 0.001021 0.000000
Montana 0.006129 0.000000 0.004086
Nebraska 0.004086 0.002043 0.002043
Nevada 0.002043 0.000000 0.001021
New Hampshire 0.001021 0.000000 0.002043
New Jersey 0.005107 0.002043 0.001021
New Mexico 0.006129 0.000000 0.001021
New York 0.037794 0.023493 0.009193
North Carolina 0.014300 0.005107 0.003064
North Dakota 0.002043 0.000000 0.001021
Ohio 0.034729 0.002043 0.005107
Oklahoma 0.006129 0.000000 0.000000
Oregon 0.006129 0.013279 0.001021
Pennsylvania 0.025536 0.008172 0.002043
Puerto Rico 0.001021 0.004086 0.000000
Rhode Island 0.000000 0.032686 0.000000
South Carolina 0.006129 0.001021 0.002043
South Dakota 0.003064 0.000000 0.000000
Tennessee 0.004086 0.002043 0.000000
Texas 0.019408 0.006129 0.005107
Utah 0.009193 0.000000 0.002043
Vermont 0.004086 0.000000 0.000000
Virginia 0.019408 0.003064 0.003064
Washington 0.004086 0.002043 0.000000
West Virginia 0.009193 0.000000 0.001021
Wisconsin 0.014300 0.031665 0.001021
Wyoming 0.003064 0.000000 0.002043
In [57]:
# Creating the heatmap for 2020
plt.subplots(figsize=(10,10))
sns.heatmap(ct_2020, xticklabels=True, yticklabels=True)
Out[57]:
<Axes: xlabel='Political Affiliation', ylabel='State'>

For 2020, protest political affiliation actually seems to be concentrated on the right-leaning/pro-Trump side of the spectrum. This is rather strange to see, as the overall distribution was heavily weighted to the left. Let's see how this distribution changes.

In [58]:
# Creating the heatmap for 2021
plt.subplots(figsize=(10,10))
sns.heatmap(ct_2021, xticklabels=True, yticklabels=True)
Out[58]:
<Axes: xlabel='Political Affiliation', ylabel='State'>

There certainly seems to be a more even spread across political affiliation, possibly with the left-leaning/anti-Trump side being more present. It seems the shift I was anticipating has already begun.

In [59]:
# Creating the heatmap for 2022
plt.subplots(figsize=(10,10))
sns.heatmap(ct_2022, xticklabels=True, yticklabels=True)
Out[59]:
<Axes: xlabel='Political Affiliation', ylabel='State'>

And by 2022, protest political affiliation seems to be a mirror image of that in 2020. However, it is difficult to determine the intensity of the shift from a heatmap, so next I will create a conditional distribution of political affiliation given each state so that we can see which side is more dominant.

In [32]:
# Creating full cross-tabulation
ct_full = pd.crosstab(df_new.State, df_new['Political Affiliation'], normalize=True)

# Creating conditional distribution
state_counts = ct_full.sum(axis=1)
pa_given_state = ct_full.divide(state_counts, axis=0)

# Displaying distribution
pa_given_state
Out[32]:
Political Affiliation Left/Anti-Trump Neither Right/Pro-Trump
State
Alabama 0.736842 0.157895 0.105263
Alaska 0.545455 0.181818 0.272727
Arizona 0.280000 0.280000 0.440000
Arkansas 0.500000 0.250000 0.250000
California 0.313131 0.471380 0.215488
Colorado 0.483871 0.096774 0.419355
Connecticut 0.281250 0.375000 0.343750
Delaware 0.750000 0.250000 0.000000
District of Columbia 0.644068 0.220339 0.135593
Florida 0.225352 0.598592 0.176056
Georgia 0.526316 0.315789 0.157895
Guam 1.000000 0.000000 0.000000
Hawaii 0.352941 0.470588 0.176471
Idaho 0.421053 0.105263 0.473684
Illinois 0.262136 0.631068 0.106796
Indiana 0.487179 0.153846 0.358974
Iowa 0.208333 0.687500 0.104167
Kansas 0.142857 0.771429 0.085714
Kentucky 0.588235 0.235294 0.176471
Louisiana 0.368421 0.526316 0.105263
Maine 0.647059 0.235294 0.117647
Maryland 0.279412 0.661765 0.058824
Massachusetts 0.405797 0.405797 0.188406
Michigan 0.355556 0.333333 0.311111
Minnesota 0.619048 0.214286 0.166667
Mississippi 0.500000 0.500000 0.000000
Missouri 0.424242 0.363636 0.212121
Montana 0.411765 0.058824 0.529412
Nebraska 0.384615 0.384615 0.230769
Nevada 0.230769 0.230769 0.538462
New Hampshire 0.454545 0.090909 0.454545
New Jersey 0.413793 0.241379 0.344828
New Mexico 0.352941 0.000000 0.647059
New York 0.381215 0.425414 0.193370
North Carolina 0.457627 0.203390 0.338983
North Dakota 0.500000 0.000000 0.500000
Ohio 0.633803 0.140845 0.225352
Oklahoma 0.800000 0.200000 0.000000
Oregon 0.222222 0.444444 0.333333
Pennsylvania 0.505495 0.252747 0.241758
Puerto Rico 0.333333 0.666667 0.000000
Rhode Island 0.054054 0.891892 0.054054
South Carolina 0.444444 0.444444 0.111111
South Dakota 0.500000 0.125000 0.375000
Tennessee 0.318182 0.500000 0.181818
Texas 0.367347 0.428571 0.204082
Utah 0.409091 0.318182 0.272727
Vermont 0.500000 0.000000 0.500000
Virginia 0.492308 0.246154 0.261538
Washington 0.464286 0.178571 0.357143
West Virginia 0.687500 0.062500 0.250000
Wisconsin 0.298246 0.684211 0.017544
Wyoming 0.600000 0.000000 0.400000
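As an aside, pandas can produce this conditional distribution directly: normalize='index' divides each row of the cross-tabulation by its row total, which should match the divide-by-the-marginal computation above.

# Conditional distribution of Political Affiliation given State in one call
pa_given_state = pd.crosstab(df_new.State, df_new['Political Affiliation'], normalize='index')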

To represent this data, I will use a stacked bar graph since it will display the proportions of political affiliation more clearly.

In [33]:
# Creating stacked bar graph
pa_given_state.plot.bar(stacked=True, figsize=(10,10), legend=True)
Out[33]:
<Axes: xlabel='State'>

With this and the first bar graph of political distribution over all three years, it is easy to see that there was a large political shift in protests over the years. I am now wondering whether such a shift will coincide with an increase in hate crimes, given that left-leaning protests tend to be associated with social justice movements. Therefore, I am currently hypothesizing that since there has been a great increase in left-leaning/anti-Trump protests, there will also be a great increase in hate crimes across the U.S.

Hate Crime Data¶

This data comes from the FBI Crime Data Explorer and is a depiction of each U.S. state's reported hate crime incidents. The data I am using are the reports from 2020, 2021, and 2022. I would like to clarify that I am not reading in the original data that you would find on the website, as that data was poorly formatted for reading into pandas. Therefore, I copied the data into a new Excel file and formatted it in a way that could be read into pandas. I also did not copy over extraneous data that I would not be using in this project.

The data can be found under the "Documents and Downloads" section of the front page. Link to data: https://cde.ucr.cjis.gov/LATEST/webapp/#/pages/downloads

ETL¶

In [34]:
# Changing directory
%cd ../'Hate Crimes'
/content/data_science/Project Data/Hate Crimes

As before, the first step is to read in the files and see what we have to work with. From there, I can tidy and format the data.

In [35]:
# Reading in files
hc_2020 = pd.read_csv('hate crimes by state 2020.csv')
hc_2021 = pd.read_csv('hate crimes by state 2021.csv')
hc_2022 = pd.read_csv('hate crimes by state 2022.csv')

# Let's see what we have
display(hc_2020.head(5))
display(hc_2021.head(5))
display(hc_2022.head(5))
Participating State Total offenses
0 Alabama 33
1 Alaska 10
2 Arizona 332
3 Arkansas 22
4 California 1537
Participating State Total offenses
0 Alabama 263
1 Alaska 14
2 Arizona 106
3 Arkansas 47
4 California 81
Participating State Total offenses
0 Alabama 254
1 Alaska 5
2 Arizona 225
3 Arkansas 38
4 California 2261

Everything looks good so far. Let's go ahead and add a year column for clarity and for future merging.

In [36]:
# Creating a year column with appropriate years
hc_2020['Year'] = 2020
hc_2021['Year'] = 2021
hc_2022['Year'] = 2022

# Displaying
display(hc_2020.head(5))
display(hc_2021.head(5))
display(hc_2022.head(5))
Participating State Total offenses Year
0 Alabama 33 2020
1 Alaska 10 2020
2 Arizona 332 2020
3 Arkansas 22 2020
4 California 1537 2020
Participating State Total offenses Year
0 Alabama 263 2021
1 Alaska 14 2021
2 Arizona 106 2021
3 Arkansas 47 2021
4 California 81 2021
Participating State Total offenses Year
0 Alabama 254 2022
1 Alaska 5 2022
2 Arizona 225 2022
3 Arkansas 38 2022
4 California 2261 2022

I just want to change the name of the 'Participating State' column to 'State' for clarity and brevity.

In [37]:
# Creating list of dfs
df_list = [hc_2020, hc_2021, hc_2022]

# Renaming columns
for df in df_list:
  df.rename(columns={'Participating State': 'State'}, inplace=True)

# Displaying dfs
for df in df_list:
  display(df.head(1))
State Total offenses Year
0 Alabama 33 2020
State Total offenses Year
0 Alabama 263 2021
State Total offenses Year
0 Alabama 254 2022

Now that we have the three tables, I will go ahead and concatenate them.

In [38]:
# Now we concatenate the data
hc_full = pd.concat([hc_2020, hc_2021, hc_2022])
hc_full
Out[38]:
State Total offenses Year
0 Alabama 33 2020
1 Alaska 10 2020
2 Arizona 332 2020
3 Arkansas 22 2020
4 California 1537 2020
... ... ... ...
46 Virginia 211 2022
47 Washington 652 2022
48 West Virginia 59 2022
49 Wisconsin 163 2022
50 Wyoming 29 2022

153 rows × 3 columns

Now that we have a full dataframe, let's see what the dtypes are.

In [39]:
# Dtypes look good
hc_full.dtypes
Out[39]:
State             object
Total offenses     int64
Year               int64
dtype: object

Everything looks good.

EDA¶

Now that we have our full table, let's look at some initial descriptive statistics.

In [40]:
hc_full['Total offenses'].describe()
Out[40]:
count     153.000000
mean      209.320261
std       285.948943
min         1.000000
25%        47.000000
50%       117.000000
75%       252.000000
max      2261.000000
Name: Total offenses, dtype: float64

That is a rather large jump from the 75th percentile to the max. I wonder if the data is messed up somehow.

In [41]:
# Investigating the issue
hc_full.loc[hc_full['Total offenses'] == 2261]
Out[41]:
State Total offenses Year
4 California 2261 2022

Even after looking at the original data, California just seems to be quite the outlier. I will need to keep this in mind for the analysis.

The next step is to look more closely at these offenses by state.

In [42]:
# Grouping data by state and year
hc_group = hc_full.groupby(['State', 'Year'])['Total offenses'].sum()
hc_group
Out[42]:
State      Year
Alabama    2020     33
           2021    263
           2022    254
Alaska     2020     10
           2021     14
                  ... 
Wisconsin  2021    118
           2022    163
Wyoming    2020     21
           2021     21
           2022     29
Name: Total offenses, Length: 153, dtype: int64

That presentation doesn't help too much, so let's create a bar chart to help better understand what we are looking at.

In [43]:
# Creating an array of y positions for the 51 states
y_indexes = np.arange(51)
height = .25

# Setting figure size for the rather large plot
plt.figure(figsize=(8,13))

# Creating the plot
plt.barh(y_indexes - height, hc_2020['Total offenses'], height=height, label='2020')

plt.barh(y_indexes, hc_2021['Total offenses'], height=height, label='2021')

plt.barh(y_indexes + height, hc_2022['Total offenses'], height=height, label='2022')

# Matching the numpy array to correct labels
plt.yticks(ticks=y_indexes, labels=hc_2020['State'], fontsize=8)

# Adding title, axis labels, and legend
plt.title('Hate Crime Offenses per State', fontsize=15)
plt.xlabel('Total Offenses', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.legend(prop={'size': 10})
plt.grid(axis='x')

# Displaying plot
plt.show()
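For reference, the manual plot above relies on hc_2020, hc_2021, and hc_2022 listing the same 51 states in the same row order. The same chart could also be built from hc_full with a pivot, which aligns the years by state name explicitly; a minimal sketch:

# Rows become states, columns become years, values are offense counts
hc_pivot = hc_full.pivot(index='State', columns='Year', values='Total offenses')
hc_pivot.plot.barh(figsize=(8, 13), title='Hate Crime Offenses per State')
plt.xlabel('Total Offenses')
plt.show()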

Interesting; there doesn't seem to be much of a pattern in the hate crime distribution over the three years. Some states see increases, others decreases, and others sporadic jumps. This is not at all what I was expecting after seeing the distribution of protest political affiliation. This might make the construction of a model much more challenging.

Modeling and Testing¶

The first step in creating a model will be to put everything into one table. To do so, I will simply merge the hate crime data and protest data together on State and Year.

In [44]:
# Merging data
df_merged = df_new.merge(hc_full, on=['State', 'Year'])
df_merged
Out[44]:
State EstimateLow EstimateHigh Political Affiliation Year Total offenses
0 California 100 100 Left/Anti-Trump 2020 1537
1 California 150 150 Neither 2020 1537
2 California 50 50 Neither 2020 1537
3 California 7 7 Neither 2020 1537
4 California 250 250 Neither 2020 1537
... ... ... ... ... ... ...
2185 North Dakota 20 20 Right/Pro-Trump 2022 29
2186 North Dakota 50 50 Left/Anti-Trump 2022 29
2187 Delaware 30 50 Left/Anti-Trump 2022 16
2188 Delaware 50 50 Left/Anti-Trump 2022 16
2189 Mississippi 20 30 Left/Anti-Trump 2022 32

2190 rows × 6 columns

Next, to add a controlling variable, I will be using state population data from 2020, 2021, and 2022. This data comes from the U.S. Census and, much like the hate crime data, is poorly formatted for pandas. Therefore, I will not be using the originally formatted data, but a modified set. And again, the data itself was not modified, only the formatting.

The original data can be found here.

In [45]:
# Changing directory
%cd ..

# Reading in population csv file
pop_df = pd.read_csv('Population Data.csv')
pop_df.head(5)
/content/data_science/Project Data
Out[45]:
State 2020 2021 2022
0 Alabama 5,031,362 5,049,846 5,074,296
1 Alaska 732,923 734,182 733,583
2 Arizona 7,179,943 7,264,877 7,359,197
3 Arkansas 3,014,195 3,028,122 3,045,637
4 California 39,501,653 39,142,991 39,029,342

Currently, the population for each year sits under a column titled with that year. The first thing to do will be to create three separate tables from this one, rename the year column in each to Population, add a Year column to each, and concatenate the tables back into one so that I can merge it with the table above.

In [46]:
# Creating three separate tables by year
pop_df_2020 = pop_df[['State', '2020']]
pop_df_2021 = pop_df[['State', '2021']]
pop_df_2022 = pop_df[['State', '2022']]

# Adding a year column to each
pop_df_2020['Year'] = 2020
pop_df_2021['Year'] = 2021
pop_df_2022['Year'] = 2022

# Changing column name to Population
pop_df_2020.rename(columns={'2020': 'Population'}, inplace=True)
pop_df_2021.rename(columns={'2021': 'Population'}, inplace=True)
pop_df_2022.rename(columns={'2022': 'Population'}, inplace=True)

# Concatenating the dfs
pop_half = pd.concat([pop_df_2020, pop_df_2021])
pop_df = pd.concat([pop_half, pop_df_2022])
pop_df
<ipython-input-46-ad03bc6c17c5>:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pop_df_2020['Year'] = 2020
<ipython-input-46-ad03bc6c17c5>:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pop_df_2021['Year'] = 2021
<ipython-input-46-ad03bc6c17c5>:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pop_df_2022['Year'] = 2022
<ipython-input-46-ad03bc6c17c5>:12: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pop_df_2020.rename(columns={'2020': 'Population'}, inplace=True)
<ipython-input-46-ad03bc6c17c5>:13: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pop_df_2021.rename(columns={'2021': 'Population'}, inplace=True)
<ipython-input-46-ad03bc6c17c5>:14: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pop_df_2022.rename(columns={'2022': 'Population'}, inplace=True)
Out[46]:
State Population Year
0 Alabama 5,031,362 2020
1 Alaska 732,923 2020
2 Arizona 7,179,943 2020
3 Arkansas 3,014,195 2020
4 California 39,501,653 2020
... ... ... ...
46 Virginia 8,683,619 2022
47 Washington 7,785,786 2022
48 West Virginia 1,775,156 2022
49 Wisconsin 5,892,539 2022
50 Wyoming 581,381 2022

153 rows × 3 columns
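For reference, the same wide-to-long reshape can be done in one step with melt on the original wide table (as read from the CSV), which also sidesteps the SettingWithCopyWarning above; a minimal sketch:

# melt turns the 2020/2021/2022 columns into a single Year/Population pair
pop_wide = pd.read_csv('Population Data.csv')
pop_long = pop_wide.melt(id_vars='State', var_name='Year', value_name='Population')
pop_long['Year'] = pop_long['Year'].astype(int)  # melt leaves the year labels as strings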

Now that I have created my new table, I can merge this with the previous full df.

In [47]:
# Merging tables
final_df = df_merged.merge(pop_df, on=['State', 'Year'])
final_df
Out[47]:
State EstimateLow EstimateHigh Political Affiliation Year Total offenses Population
0 California 100 100 Left/Anti-Trump 2020 1537 39,501,653
1 California 150 150 Neither 2020 1537 39,501,653
2 California 50 50 Neither 2020 1537 39,501,653
3 California 7 7 Neither 2020 1537 39,501,653
4 California 250 250 Neither 2020 1537 39,501,653
... ... ... ... ... ... ... ...
2185 North Dakota 20 20 Right/Pro-Trump 2022 29 779,261
2186 North Dakota 50 50 Left/Anti-Trump 2022 29 779,261
2187 Delaware 30 50 Left/Anti-Trump 2022 16 1,018,396
2188 Delaware 50 50 Left/Anti-Trump 2022 16 1,018,396
2189 Mississippi 20 30 Left/Anti-Trump 2022 32 2,940,057

2190 rows × 7 columns
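One caveat before modeling: the Census figures were read in as comma-formatted strings (e.g. '39,501,653'), so Population is an object column. When it is passed to DictVectorizer below, each distinct string will be one-hot encoded as its own category (essentially duplicating the State feature) rather than treated as a number. A minimal sketch of the conversion, plus the per-capita rate mentioned in the project goals, is shown here as an illustrative assumption; it is not applied in the runs below:

# Convert comma-formatted population strings to integers (not applied in the runs below)
final_df['Population'] = final_df['Population'].str.replace(',', '').astype(int)
# Hate crime offenses per 100,000 residents, per the per-capita framing in the intro
final_df['Offenses per 100k'] = final_df['Total offenses'] / final_df['Population'] * 100000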

With this final table, I can now begin creating my model. The first thing I will do is to simply train the model and test it on the training data alone. This should give me a baseline of how the model will look and perform.

In [48]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler

# Creating features to fit model
features = ['State', 'EstimateLow', 'EstimateHigh', 'Political Affiliation',
            'Year', 'Population']

# Setting the training data
X_train_dict = final_df[features].to_dict(orient="records")
y_train = final_df['Total offenses']

# Dummy encoding
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)

# Scaling data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)

# K-Nearest Neighbors Model
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train_sc, y_train)
y_train_pred = model.predict(X_train_sc)

# Calculating the prediction error (RMSE) on the training data
mse = ((y_train - y_train_pred) ** 2).mean()
rmse = np.sqrt(mse)
rmse
Out[48]:
147.59384928591015

It seems our current predictions are off by about 148 hate crime incidents on the training data. Let's compare that to the average number of incidents in the data.

In [49]:
final_df['Total offenses'].mean()
Out[49]:
411.8073059360731

So it seems that the model is currently off by over 25% of the mean. Not a great starting point. Let's run some cross-validation tests to see if we can find a better K-value and tune the model a bit more.
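Before tuning, it helps to have a yardstick: a dummy model that always predicts the mean number of offenses. If the tuned KNN cannot beat this under the same 10-fold cross-validation, the features carry little predictive signal. A minimal sketch (my addition, not part of the original run):

from sklearn.dummy import DummyRegressor
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

X_dict = final_df[features].to_dict(orient="records")
y = final_df['Total offenses']

# Same vectorization as the KNN pipeline, but the regressor always predicts the training-fold mean
baseline = Pipeline([("vectorizer", DictVectorizer(sparse=False)),
                     ("dummy", DummyRegressor(strategy="mean"))])
baseline_rmse = np.sqrt(-cross_val_score(baseline, X_dict, y, cv=10,
                                         scoring="neg_mean_squared_error").mean())
baseline_rmse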

In [50]:
# Creating a function to test the model based on features
def test_model(features):

  from sklearn.feature_extraction import DictVectorizer
  from sklearn.preprocessing import StandardScaler
  from sklearn.neighbors import KNeighborsRegressor
  from sklearn.pipeline import Pipeline
  from sklearn.model_selection import cross_val_score

  # Setting the training data
  X_dict = final_df[features].to_dict(orient="records")
  y = final_df['Total offenses']

  # Creating the pipeline for cv
  vec = DictVectorizer(sparse=False)
  scaler = StandardScaler()
  def get_cv_error(k):
      model = KNeighborsRegressor(n_neighbors=k)
      pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
      mse = np.mean(-cross_val_score(
          pipeline, X_dict, y,
          cv=10, scoring="neg_mean_squared_error"
      ))
      # Finding and returning RMSE
      rmse = np.sqrt(mse)
      return rmse

  # Applying the function to a series of K-values to find the best match
  ks = pd.Series(range(1, 51))
  ks.index = range(1, 51)
  test_errs = ks.apply(get_cv_error)

  display(test_errs.sort_values())

First, I will just see what I get from running the model as-is.

In [51]:
# Creating features to fit model
features = ['State', 'EstimateLow', 'EstimateHigh', 'Political Affiliation',
            'Year', 'Population']
test_model(features)
2     1308.478804
3     1309.808228
1     1311.358962
4     1313.637205
16    1313.680531
15    1313.694672
14    1314.657191
17    1315.000239
13    1317.226181
18    1317.403031
5     1318.106265
19    1319.245065
12    1320.801347
20    1321.432688
6     1323.723434
21    1325.386166
11    1326.215236
22    1329.176557
7     1330.737379
23    1331.920860
10    1333.047922
24    1334.661910
8     1336.983182
25    1337.892473
26    1341.183565
9     1342.392255
27    1344.632541
28    1348.806691
29    1352.362345
50    1354.907282
30    1356.499465
49    1358.947974
31    1359.422038
32    1362.550583
48    1363.353416
33    1364.619850
34    1366.992803
47    1367.710584
35    1368.563191
36    1369.582802
37    1370.782396
46    1372.144758
38    1372.210104
39    1373.925720
40    1376.124108
45    1376.956504
41    1378.418426
42    1380.637737
44    1382.330132
43    1383.112618
dtype: float64

So under cross-validation the model is off by around 1,300 hate crime offenses per prediction. That is certainly not good. However, I do remember California being quite the outlier in the data set. Though I thought I controlled for that by including both State and Population in my model, it must not have had as significant an effect as I hoped. Therefore, I will take California out of the data set to see if the error improves.

In [52]:
# Setting State to index and dropping California
final_df = final_df.set_index('State').drop('California')

# Resetting index
final_df = final_df.reset_index()
final_df
Out[52]:
State EstimateLow EstimateHigh Political Affiliation Year Total offenses Population
0 Oregon 150 150 Neither 2020 324 4,244,795
1 Oregon 36 36 Right/Pro-Trump 2020 324 4,244,795
2 Oregon 100 100 Neither 2020 324 4,244,795
3 Oregon 200 300 Right/Pro-Trump 2020 324 4,244,795
4 Oregon 36 36 Right/Pro-Trump 2020 324 4,244,795
... ... ... ... ... ... ... ...
1888 North Dakota 20 20 Right/Pro-Trump 2022 29 779,261
1889 North Dakota 50 50 Left/Anti-Trump 2022 29 779,261
1890 Delaware 30 50 Left/Anti-Trump 2022 16 1,018,396
1891 Delaware 50 50 Left/Anti-Trump 2022 16 1,018,396
1892 Mississippi 20 30 Left/Anti-Trump 2022 32 2,940,057

1893 rows × 7 columns
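As an aside, the same rows can be removed with a boolean mask, which avoids the set_index/drop/reset_index round-trip; a one-line sketch that should be equivalent:

final_df = final_df[final_df['State'] != 'California'].reset_index(drop=True)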

With California out of the data, let's see what the error is now.

In [60]:
# Creating features to fit model
features = ['State', 'EstimateLow', 'EstimateHigh', 'Political Affiliation',
            'Year', 'Population']

test_model(features)
1     202.281462
2     220.490351
3     233.627711
4     240.701775
5     245.742813
6     249.179314
7     252.304116
8     253.506376
10    254.003275
9     254.087586
11    254.187763
12    255.005329
17    255.689985
18    255.772137
16    256.126104
13    256.130441
15    256.276410
19    256.411711
14    257.117632
20    257.460189
21    258.802064
22    260.284278
26    260.489019
27    260.544855
28    260.581936
25    260.611264
29    260.878950
24    260.953020
30    261.308943
23    261.535952
31    261.817612
32    262.278236
33    262.605370
34    263.095780
35    263.617776
36    264.196391
37    264.939978
40    265.088988
41    265.098131
38    265.213977
39    265.358719
42    265.380857
43    265.857152
44    266.078680
45    266.572152
46    267.195899
47    267.937241
48    268.831120
49    269.769102
50    270.919691
dtype: float64
In [61]:
# Finding new mean without California
final_df['Total offenses'].mean()
Out[61]:
263.62651875330164

That is significantly better, but still not great, as the prediction error is still about 75% of the mean. Now I will start dropping some features to see if any of them are an issue.

In [62]:
# Dropping Estimate numbers
features = ['State', 'Political Affiliation',
            'Year', 'Population']

test_model(features)
1     201.712462
2     219.637604
3     232.525221
4     239.885433
5     244.961057
6     248.779659
7     251.867976
10    253.138863
11    253.150450
8     253.254247
9     253.425539
12    253.941817
18    254.876959
13    255.005153
19    255.014155
17    255.341412
20    255.645817
14    256.184700
16    256.483802
21    256.650172
15    257.521982
22    257.907742
23    259.368045
26    259.579711
27    259.598468
25    259.784336
28    259.805735
24    260.098430
29    260.212482
30    260.718169
31    261.256101
32    261.907101
33    262.351444
34    262.969576
35    263.701343
36    264.356208
37    265.061600
38    265.412848
39    265.858757
42    265.901274
43    266.039150
41    266.073954
40    266.469497
44    266.513160
45    267.298049
46    267.645166
47    268.142926
48    268.859665
49    269.701741
50    270.749653
dtype: float64

That really did nothing. Let's try again.

In [63]:
# Dropping Political Affiliation
features = ['State', 'Year', 'Population']

test_model(features)
1     201.746776
2     218.804648
3     231.204409
4     238.419069
5     243.045434
6     246.303633
7     248.750718
8     250.660916
9     252.199084
10    253.468783
11    254.581780
12    255.592058
13    256.510428
14    257.343493
15    258.099415
16    258.786466
17    259.412386
18    259.984150
19    260.507925
20    260.989803
21    261.490521
22    262.033778
23    262.592530
24    263.162021
25    263.734628
26    264.269616
27    264.776092
28    265.353492
29    266.060545
30    266.825053
31    267.649860
32    268.531874
33    269.443800
34    270.391615
35    271.363956
36    272.260132
37    273.174829
38    274.155234
39    275.140242
40    276.153568
41    277.173154
42    278.214307
43    279.257721
44    280.343149
45    281.427668
46    282.507821
47    283.508811
48    284.508914
49    285.505341
50    286.495771
dtype: float64

That didn't seem to help either.

In [64]:
# Dropping Population
features = ['State', 'Political Affiliation',
            'Year']

test_model(features)
1     176.719099
2     186.955735
3     194.149532
4     199.610703
5     205.435103
6     211.485052
7     217.205319
8     220.823619
9     225.671020
10    230.686885
11    236.434390
12    242.400573
13    248.186382
14    253.774471
15    259.098702
16    264.234499
17    268.586566
18    272.472398
19    276.302538
20    280.123386
21    284.046577
22    287.316948
23    290.572948
24    293.914099
25    297.149925
26    299.745101
27    302.163013
28    304.744103
50    306.586332
29    307.223699
49    308.718416
30    309.744480
31    310.653447
48    310.686657
32    311.755431
47    312.662292
33    312.909353
34    314.167509
46    314.707015
35    315.552996
45    316.012364
36    316.974352
44    317.465697
37    318.481275
43    318.768775
38    319.050999
39    319.606796
42    319.715064
40    319.888028
41    320.126783
dtype: float64

That might be the best I can get out of this model. I am not sure if I began by overfitting the model or if something was input incorrectly, but controlling for population seems to have an adverse effect on the model. One possible culprit, as noted above, is that Population was left as a comma-formatted string, so the vectorizer treats each distinct value as its own category (essentially duplicating State) rather than as a number.

Conclusions¶

From the above descriptive statistics and models, there are a few conclusions we can confidently draw. First, there was a very large increase in protests overall from 2020 to 2022. Second, not only did the number of protests increase, but their political affiliation also shifted quite drastically towards the left. Third, hate crimes did not seem to increase uniformly across the nation the way protest numbers and affiliation did. This is quite surprising, as it would make sense for an increase in social justice movements to follow an increase in socially unacceptable crimes. And lastly, at least for this model, it seems as though there is no connection between hate crimes and protest numbers or affiliation.

There are two major takeaways from these conclusions, corresponding to our two original questions: 1) There is no apparent connection between the hate crime and protest data; something else is driving the increase in left-wing movements. 2) Since there is no apparent increase in the real number of hate crimes over the past few years, the media is not correctly portraying reality. In fact, these two conclusions might be linked: though violent crime overall might be decreasing (as seen in many studies over the past few years), the media portrays reality as though violent crime is increasing. Therefore, with the perception of an increase in crime, left-wing movements might be increasing in number to combat this fictitious rise. Future research might find more promising results by delving deeper into the connections between protest data, media coverage, and citizens' perception of violent crime.