Suicide is the 10th leading cause of death in the United States and, far from being a marginal problem, the suicide rate increased by about 33% from 1999 to 2017, according to CBS Baltimore.
These tragedies affect not only the victims' families but everyone around them. Given how pressing the issue is, we should try to understand where the problem comes from, and whether there is anything in the data that can help us understand it better.
Lives are being lost right now, and any insight we can gain could potentially help.
We're essentially trying to find which factors affect the number of suicides that occur within a community of people. Rather than looking at individual reasons for this phenomenon, we're looking at factors that operate at the group level, to see whether there's any particular way that society as a whole contributes to it. In essence, we're asking: given a sample of people to observe, what factors would contribute to the phenomenon of suicide? Notice that we say people, not person. Unlike many other studies, we're not looking at factors that could contribute to an individual suicide.
We found our dataset online, on Kaggle.com.
The dataset records country, year, sex, age, number of suicides, GDP per year and per capita, population, and generation. What's impressive about this dataset is that it groups people by gender, country, age, etc., and gives us separate statistics for each group.
The main libraries we're using are:
Pandas - used to manipulate and analyze data
NumPy - used for numerical and mathematical operations
Folium - used for geographic visualizations such as maps
Matplotlib - used for plotting graphs
HeatMap (a Folium plugin) - used to show gradual changes over data via color shading
Seaborn - used to plot data, including heatmaps
sklearn - used for machine learning and making predictions
!pip install folium
import pandas as pd
import numpy as np
import folium
from folium.plugins import HeatMap
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model, model_selection
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
import warnings
warnings.filterwarnings('ignore')
The data is loaded from the csv file and fit to a pandas dataframe.
# Importing the data and looking at it.
df = pd.read_csv("master.csv", sep=',')
display(df.head(10))
The dataset is relatively clean; however, there are a few minor cases where some data has been recorded as '-' or '--'. We will replace these with NaN to ensure they don't corrupt any calculations that take place later.
df = df.replace(['-', '--'], np.NaN)
Now that we have the dataset on suicide rates, we can explore some of the possible factors that contribute to the suicide rate. We run the pandas describe function to get a general sense of what numerical data the dataframe contains and what ranges of values each column takes.
display(df.describe())
The description gives us some preliminary insight into the suicide rate data. It shows the range of years the data covers, 1985-2016, along with the population of each country involved, the suicides per 100 thousand of the population, and the GDP per capita. We can explore these factors further to see their impact on the number of suicides and, eventually, which factor affects the number of suicides the most. We'll also explore the effect of the year itself on suicide numbers, and make use of the categorical data: the country the suicide is reported in, the sex of the person, and the generation they belong to.
This report, built from data provided by the World Health Organization, suggests that suicide rates have been on the rise all over the world. To corroborate this, we've decided to look at the total number of (reported) suicides around the world and plot them against the years they were reported in.
We first group all the data by the year column.
a =df.groupby(['year']).sum().reset_index()
display(a.head())
We then plot the suicide numbers against the years they occurred in.
plt.plot(a['year'],a['suicides_no'])
plt.xlabel("Years",size = 15)
plt.ylabel("Number of Suicides",size = 15)
plt.title("Years vs. Number of Suicides", size = 15)
From the graph, there is an extreme dip after the year 2015, for the year 2016. To dig deeper, let's see a count of entries for each column using the pandas groupby.
a =df.groupby(['year']).count().reset_index()
display(a.tail())
There is clearly some data missing for 2016: there are only 170 entries for it, compared to the volumes for the years before. The report cited earlier contradicts a genuine drop of this size, so data must be missing. For that reason, we will omit 2016 from our observations and look at suicide data from 1985-2015 instead.
Note that the HDI column also shows inconsistent count values, so we will not consider it when later analyzing the factors behind suicide numbers. Moreover, by definition, HDI is calculated from life expectancy; since suicide numbers feed into life expectancy, analyzing HDI as a factor would produce spurious results.
Now we'll remove the year 2016 from our data frame, then replot the graph.
is_not_2016 = df['year']!=2016
df = df[is_not_2016]
a =df.groupby(['year']).sum().reset_index()
plt.plot(a['year'],a['suicides_no'])
plt.xlabel("Years",size = 15)
plt.ylabel("Number of Suicides",size = 15)
plt.title("Years vs. Number of Suicides", size = 15)
The graph corroborates the aforementioned report, suggesting that suicide numbers climbed steadily over the period we're analyzing. Interestingly, there's a steep rise around the 1990s, which could be a future direction of inquiry.
There are several studies suggesting that suicide rates and numbers might be influenced by economic factors. This tutorial finds a relation between the economic conditions in an area and its suicide numbers. We seek to do the same by plotting suicide rates against Gross Domestic Product per capita, a measure of a country's economic output that controls for its number of people. We also control for population by using the suicides per 100 thousand statistic, to ensure that countries with high populations are not unfairly represented.
plt.scatter(df['gdp_per_capita ($)'],df['suicides/100k pop'],c='black')
plt.xlabel("GDP Per Capita($)", size = 12)
plt.ylabel("Suicides/100k Population", size=12)
plt.title("GDP vs. Suicides")
The scatter plot confirms what we already suspected: there is an overwhelming tendency for lower GDP per capita to correspond to higher suicide rates. This raises further questions about how much of an impact the country itself has on suicide numbers, which we explore next.
Based on the observation that a country's GDP affects its suicide rate, it makes sense to ask how much of a difference the country a group of people resides in makes to suicide rates. The country a population lives in determines many factors that apply to them, such as social customs, government rules, general standard of living, and education and health levels. We seek to gain some insight into how the geographical area a person lives in affects the number of suicides.
We can first break down the ten countries with the highest total suicide numbers and compare them with a simple bar graph. We achieve that by grouping the data by country and then sorting it by suicide numbers.
groupByCountry = df.groupby(['country']).sum().reset_index()
groupByCountry = groupByCountry.sort_values(by=['suicides_no'])
We can then take the last ten countries in the sorted dataframe and plot them as a bar graph.
groupByCountry = groupByCountry.tail(10)
ax = groupByCountry.plot.bar(x='country',y='suicides_no', rot=0, figsize=(15,5))
plt.xlabel('Country', size=13)
plt.ylabel('Number of Suicides', size=13)
plt.title("Country vs. Number of Suicides")
This is rather restrictive though, as we're only able to view the top 10 countries. To get a broader sense of the data and the geographical trends underneath it, we can use the Folium library to plot a map that shows all those areas of the world with high suicide rates.
To start with, we group our existing data by country and drop the extraneous information.
totalSuicidesByCountry = df.groupby(['country']).sum().reset_index()
totalSuicidesByCountry = totalSuicidesByCountry.drop(columns = ['year','population','suicides/100k pop','HDI for year','gdp_per_capita ($)'])
We need coordinate information for each country, so we use a dataset from https://www.kaggle.com/eidanch/counties-geographic-coordinates that has a central coordinate for each country. This coordinate will represent each country on our map. We load it into a dataframe and clean up the data to match our data's format.
countryLocation = pd.read_csv("countries.csv", sep=',')
countryLocation = countryLocation.drop(columns= ['country'])
countryLocation = countryLocation.rename(columns={"name":"country"})
display(countryLocation.head())
Before we can plot the map based on suicide numbers, we'll need to join the two dataframes into one. We can use an inner join on the country column of both dataframes. However, we'll first need to standardize the country names so that the join covers all the countries. Looking at our data and the data from the link above, there are two notable differences. Our data uses the term Russian Federation, while the coordinate dataset uses Russia. Our data also uses the term Republic of Korea (that is, South Korea), while the coordinate dataset lists North Korea and South Korea separately. We can rename Russia and South Korea in the coordinate dataset for the sake of uniformity; we leave North Korea alone, since renaming it as well would wrongly merge two different countries under one name.
countryLocation = countryLocation.replace('Russia','Russian Federation')
# Republic of Korea is South Korea; we deliberately leave North Korea unchanged.
countryLocation = countryLocation.replace('South Korea','Republic of Korea')
display(countryLocation.head())
Now, we use the merge function to perform an inner join operation between the two dataframes.
totalSuicidesmerged =(pd.merge(totalSuicidesByCountry,countryLocation, on='country', how='inner'))
display(totalSuicidesmerged.head())
We can now plot the map using the Folium library. We mark each country with a circle whose radius depends on the number of suicides there: the bigger the circle, the more suicides.
map2 = folium.Map(location=[0, 0], zoom_start=2)
totalSuicidesmerged.apply(lambda row: folium.Circle(
    radius=row["suicides_no"],           # radius is a number, not a list
    location=[row["latitude"], row["longitude"]],
    popup=row["country"],                # show the country name on click
    color='blue',
    fill=True,
).add_to(map2), axis=1)
display(map2)
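The HeatMap plugin we imported earlier offers an alternative view of the same information, shading regions by intensity rather than drawing circles. Below is a minimal sketch of that alternative (the map3 variable and the choice to scale weights to [0, 1] are our own, not part of the analysis above):
heat_data = totalSuicidesmerged[["latitude", "longitude", "suicides_no"]].copy()
# Scale the weights to [0, 1] so the heat layer renders sensibly.
heat_data["suicides_no"] = heat_data["suicides_no"] / heat_data["suicides_no"].max()
map3 = folium.Map(location=[0, 0], zoom_start=2)
HeatMap(heat_data.values.tolist()).add_to(map3)
display(map3)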
There are multiple articles, like this one titled "More Millennials Are Dying 'Deaths of Despair'", suggesting that some generations of people have higher tendencies that make them more likely to commit suicide. It's definitely an important factor to consider when trying to identify possible influences on suicide numbers for a group of people.
We first look at the different generations present in the data.
print(df['generation'].unique())
The different generations and their birth-year ranges are defined at https://genhq.com/faq-info-about-generations/.
We can then group the data by generation and plot the suicides per generation.
groupByGen = df.groupby(['generation']).sum().reset_index()
ax = groupByGen.plot.bar(x='generation', y='suicides_no', rot=0, figsize=(15,5))
plt.xlabel('Generation', size=15)
plt.ylabel('Number of Suicides', size=15)
plt.title('Generation vs. Number of Suicides', size=15)
The graph shows that the Boomer generation has the highest recorded number of suicides in this time frame, which intuitively makes sense, as the 1985-2015 window corresponds to the years when Boomers made up a large share of the adult population.
We can now look at how these generations have trended relative to each other through the time period. Plotting a line graph with one line per generation helps us visualize that. Note that we have to clean the data, as not every generation has suicide data for every year; the graph would otherwise have dips corresponding to those missing periods.
To make the data simple to graph, we delete the unnecessary columns and use groupby to find the total number of suicides per year for each generation.
# Plot based on year (x-axis), suicide numbers (y-axis), one line per generation
genDf = df.groupby(['generation','year']).sum().reset_index()
genDf = genDf.drop(columns=['population', 'suicides/100k pop', 'HDI for year','gdp_per_capita ($)'])
We notice that several generations' lines drop to 0.0 suicides; this is an artifact of the time frames of those generations, not a real absence of suicides. These data points were originally NaN (no data recorded for that generation in that year), but to prevent plotting errors we convert them to 0.0.
We also pivot the data so that the columns are the different generations, which makes it easy to graph. The last few rows are shown below.
genDf = genDf.pivot(index = 'year',columns='generation')
genDf = genDf.replace(np.NaN,0.0)
display(genDf.tail(5))
At this point, we have reached a line graph that represents the number of suicides happening throughout these years. Each line represents a different generation.
# Line graph
genDf.plot.line()
fig = plt.gcf()
fig.set_size_inches(15,10)
plt.xlabel("Years", size=15)
plt.ylabel("Number of Suicides", size=15)
plt.title("Year vs. Number of Suicides based on Generations", size=15)
The graph shows that most generations don't exhibit a single overall direction, but that suicide numbers are clearly affected by what generation a person is in.
We can also analyze the proportion of suicides between the genders over the dataset. We first look at what kind of genders our data has:
print(df['sex'].unique())
Having confirmed that there are two genders tracked by this data, we generate a heatmap to follow the data through the years.
To start, we pivot our existing data so that the years become the columns. Each number in the resulting table is the average number of suicides for a given year and gender (pivot_table aggregates with the mean by default).
# Dataframe for year vs. gender
genderDf = pd.pivot_table(df, values='suicides_no',
index=['sex'],
columns='year')
display(genderDf)
This heatmap shows how suicide numbers have increased and decreased over the years. As shown below, the totals for women around the world are much lower than those for men. The maximum for both men and women occurs around 1994. The darker the color, the more suicides occurred in that year for that gender.
fig, ax = plt.subplots(figsize=(20,3))
ax.set_ylabel('Sex', fontsize=20)
ax.set_xlabel('Years', fontsize=20)
sns.heatmap(genderDf, annot=False, cmap='GnBu', ax=ax)
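Rather than reading the peak year off the shading, we can also confirm it numerically; a quick check using pandas' idxmax on the same genderDf table:
# For each sex, the year with the highest average number of suicides.
print(genderDf.idxmax(axis=1))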
Using all the trends we've seen so far, we can try to use these factors to predict a numerical value for the number of suicides that could take place in a given subsection of people. This is a more difficult task than most, as we'll be trying to use non-numerical values to predict a value that is numerical in nature. There is an inherent uncertainty in this, as the categorical data we're analyzing doesn't map directly to numbers. We'll handle it as best we can and compare several different prediction models.
First, we'll pick out the features we'll use as independent variables, and our dependent variable, which is the suicide number.
factors = ['country','sex','generation','gdp_per_capita ($)','year']
features = df[factors]
suicide = df[['suicides_no']]
A popular way to map categorical data to numerical data is the one-hot encoding technique. Each categorical column of a dataframe is split into one column per possible value, and 1s and 0s in each row indicate which value that row takes. An excellent example can be found here.
A simple example can be seen using the pandas get_dummies function.
tmp=df.copy()
tmp = pd.get_dummies(tmp)
display(tmp.head())
As you can see, each categorical variable has been stretched out into all of its possible values. Now, we can use the OneHotEncoder function from sklearn to create a one-hot encoding of our data. For more information, the sklearn docs are helpful.
enc = OneHotEncoder(sparse=False)
X_hot_encode = enc.fit_transform(features)
display(X_hot_encode)
X_hot_encode is now a feature matrix generated by the one-hot encoder.
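If you want to see which original column and category each encoded position corresponds to, the encoder can report its generated feature names. A small sketch (the method is named get_feature_names_out on recent sklearn releases; older versions call it get_feature_names):
# Show the first few generated column names, e.g. the country and sex categories.
print(enc.get_feature_names_out(factors)[:10])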
Another approach to using categorical values to predict numerical ones is to convert each categorical value to an independent number. This is explained better here, but the basic idea is to give each category of a variable a unique identifier and map every instance to that identifier. The code below does exactly that, using a function that assigns a unique integer to every non-numerical value it encounters in a column.
def handle_non_numerical_data(df):
    columns = df.columns.values
    for column in columns:
        # Maps each unique non-numeric value in this column to an integer id.
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]
        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            column_contents = df[column].values.tolist()
            unique_elements = set(column_contents)
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x += 1
            # Replace every value in the column with its integer id.
            df[column] = list(map(convert_to_int, df[column]))
    return df
newdf = df.copy()
newdf = handle_non_numerical_data(newdf)
display(newdf.head())
features2 = newdf[factors]
suicide2 = newdf[['suicides_no']]
As can be seen, the above dataframe now has numbers for all of its categorical variables.
We'll first use random forests to try to predict the number of suicides given all the other factors for a certain set of people. Random forest regression is based on the idea of generating multiple decision trees and combining their outputs into a single prediction. For more information, read here.
We'll start with a random forest regression using the categorical-to-number conversion method. We fit the model to the features and output from the newdf dataframe we just defined, then compute the model's R^2 score.
X = features2
y = suicide2
regr = RandomForestRegressor(max_depth=2, random_state=0)
regr.fit(X,y)
print('R^2 Score:')
print(regr.score(X,y))
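Before interpreting this number, recall what R^2 measures: it compares the model's squared error against always predicting the mean, so 1 is a perfect fit, 0 is no better than the mean, and negative values are worse than the mean. A toy illustration with made-up numbers, not from our dataset:
from sklearn.metrics import r2_score
y_true = [2.0, 4.0, 6.0]
print(r2_score(y_true, [2.0, 4.0, 6.0]))  # 1.0: perfect predictions
print(r2_score(y_true, [4.0, 4.0, 4.0]))  # 0.0: always predicting the mean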
Unfortunately, this model isn't performing very well. To confirm that, we can plot the model's predicted values against the actual values.
We first split our data into training and test sets.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, suicide, test_size=0.2)
We then create the model and train it.
regr = RandomForestRegressor(max_depth=2, random_state=0)
model = regr.fit(X_train, y_train)
We then generate predictions for the test set, which we'll compare against the actual values.
predictions = regr.predict(X_test)
Finally, we plot the predictions against the actual values.
plt.figure(figsize=(8,8))
plt.title("Predicted vs. Actual Values for Suicide numbers", fontsize=16)
plt.scatter(y_test,predictions)
plt.plot(y_test, y_test, color="Red")
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.show()
We then carry out the exact same process, but this time our feature set will be X_hot_encode, the set we created using one-hot encoding.
regr = RandomForestRegressor(max_depth=2, random_state=0)
regr.fit(X_hot_encode,y)
print('R^2 Score:')
print(regr.score(X_hot_encode,y))
This performs better than the model above, suggesting that one-hot encoding is the better approach. We can now plot the predicted vs. actual values, as we did before.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X_hot_encode, suicide, test_size=0.2)
regr = RandomForestRegressor(max_depth=2, random_state=0)
model = regr.fit(X_train, y_train)
# generate predictions for suicide numbers to compare with y_test data
predictions = regr.predict(X_test)
plt.figure(figsize=(8,8))
plt.title("Predicted vs. Actual Values for Suicide number", fontsize=16)
plt.scatter(y_test,predictions)
plt.plot(y_test, y_test, color="Red") # identity line y=x
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.show()
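Note that the R^2 scores above are computed on the same data the models were fit on, which can make them look optimistic. As a sanity check, one could cross-validate instead; a minimal sketch (the 5-fold choice is arbitrary, and exact numbers will vary):
from sklearn.model_selection import cross_val_score
# Average held-out R^2 across 5 folds for the one-hot encoded random forest.
cv_scores = cross_val_score(RandomForestRegressor(max_depth=2, random_state=0),
                            X_hot_encode, suicide.values.ravel(), cv=5, scoring='r2')
print(cv_scores.mean())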
We'll now use linear regression to try to predict the number of suicides given all the other factors for a certain set of people. Linear regression fits a linear equation to the training data, weighting the different factors to predict a value. For more information, read here.
The process parallels what we just did with random forests: we first run the regression with the categorical-to-number conversion method, fitting the model to the features and output from the newdf dataframe and computing the model's R^2 score. We'll then plot the predicted vs. actual suicide numbers.
X = features2
y = suicide2
lm = linear_model.LinearRegression()
model = lm.fit(X,y)
print('R^2 Score:')
print(lm.score(X,y))
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, suicide, test_size=0.2)
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
predictions = lm.predict(X_test)
predictions[0:10]
plt.figure(figsize=(8,8))
plt.title("Predicted vs. Actual Values for Suicide Numbers", fontsize=16)
plt.scatter(y_test,predictions)
plt.plot(y_test, y_test, color="Red") # identity line y=x
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.show()
The R^2 score and the plot suggest that this isn't an ideal method to use for prediction. We'll now repeat the process using the sklearn one-hot encoded features with linear regression.
lm = linear_model.LinearRegression()
model = lm.fit(X_hot_encode,y)
print('R^2 Score:')
print(lm.score(X_hot_encode,y))
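Since linear regression is interpretable, we can also peek at the fitted coefficients; a quick sketch (the column order matches the encoder's generated feature order from earlier):
# y was a one-column dataframe, so coef_ has shape (1, n_features).
print(lm.coef_[0][:5])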
X_train, X_test, y_train, y_test = model_selection.train_test_split(X_hot_encode, suicide, test_size=0.2)
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
predictions = lm.predict(X_test)
predictions[0:10]
plt.figure(figsize=(8,8))
plt.title("Predicted vs. Actual Values for Suicide Numbers", fontsize=16)
plt.scatter(y_test,predictions)
plt.plot(y_test, y_test, color="Red") # identity line y=x
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.show()
While far from usable, this is definitely the best model we've seen so far, with an R^2 score approaching 0.5.
To conclude, there are definitely factors that affect the number of suicides taking place in communities. The exploratory data analysis supported that, and helped quantify what many people would consider intuition or common sense. However, using these general trends to predict concrete values seems unfeasible in practice. Being able to predict suicide rates in given situations would obviously be very beneficial, as it would help make people aware of where help is needed and which areas are most at risk.
When fitting mathematical models with the categorical data, the approach of assigning each category an arbitrary value wasn't effective at all. This is probably because the numbers imply weights the categories don't deserve: the tenth unique value is arbitrarily given ten times the weight of the first. The one-hot encoding approach yielded better results, and perhaps more work can be done to tailor it to this dataset.
Hopefully, even more progress can be made that builds off of these approaches to narrow down the exact factors that drive one of the biggest problems our society is facing today.