Dataism: Poverty & Murder
This notebook was authored by Alexandre Puttick and modified by me. The original is on GitHub.
Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%pip install xlrd
Data
Data that correlates poverty, unemployment, and the murder rate. For a given percentage of families with incomes below $5,000, what is the number of murders per 1,000,000 inhabitants per annum? If more people live in poverty, are there more murders? (source)
POVERTY_MURDER_DATA = pd.read_excel('Income vs. Murder.xlsx')
X = POVERTY_MURDER_DATA['Percent with income below $ 5000']
Y = POVERTY_MURDER_DATA['Number of murders per 1,000,000 inhabitants']
POVERTY_MURDER_DATA.head(5)
|   | Percent with income below $ 5000 | Number of murders per 1,000,000 inhabitants |
|---|---|---|
| 0 | 16.5 | 11.2 |
| 1 | 20.5 | 13.4 |
| 2 | 26.3 | 40.7 |
| 3 | 16.5 | 5.3 |
| 4 | 19.2 | 24.8 |
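A quick way to quantify the relationship before fitting anything (my addition, not in the original notebook) is the Pearson correlation between the two columns:

print("Pearson correlation:", X.corr(Y))  # pandas Series.corr, Pearson by default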
plt.plot(X, Y, 'bo')
# label the axes
plt.xlabel('Percent with income below $5000')
plt.ylabel('No. murders per 10^6 residents')
# start the plot at the point (0,0)
plt.xlim(left=0)
plt.ylim(bottom=0)
plt.gcf()
Find the Best Fit Line
Use linear regression to generate a model that finds the best fit line. The best fit line offers two insights:

1. It tests whether or not $x$ has an influence on $y$, and gives that claim a level of confidence.
2. Its slope can be used to predict trends.

If the best fit line has a non-zero slope, then we can show that murders and poverty are correlated with some level of confidence (e.g. 95% confidence that $x$ has an influence on $y$).
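Concretely, the model is a line with slope $w$ and intercept $b$, matching the code below:

$$\hat{y} = w x + b$$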
Choose Initial Weights
Set reasonable initial weights for $w$ and $b$. The first guess may or may not appear on the graph, depending on where the random choice starts from.
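For reproducible runs, you can seed NumPy's random number generator before guessing (my addition, not in the original; any fixed seed works):

np.random.seed(42)  # fixed seed makes the random initial guess repeatable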
def guess_weight(value):
    # Uniform random sample in [0, 1), scaled by value.
    return np.random.random_sample() * value

def init_guess():
    # w starts in [0, 10); b starts in (-100, 0].
    return {'w': guess_weight(10), 'b': guess_weight(-100)}
def update_guess_data(weight_guess):
    min_line_x = np.amin(X)
    max_line_x = np.amax(X)
    # Draw a line spanning the observed range of x.
    x_fit = np.linspace(min_line_x, max_line_x, 100)
    y_fit = weight_guess['w'] * x_fit + weight_guess['b']
    return {'x': x_fit, 'y': y_fit,
            'w': weight_guess['w'], 'b': weight_guess['b']}
guess = update_guess_data(init_guess())
print("w:", guess['w'], " b:", guess['b'])
plt.plot(guess['x'], guess['y'], '-r')
plt.gcf()
Define the Loss Function
Build the array of squared differences in three steps:

1. $\hat{y} = wX + b$: the list of predictions based on the (randomly chosen) starting points of $w$ and $b$.
2. $\hat{y} - y$: subtract the actual values from the predictions.
3. $(\hat{y} - y)^2$: square each value in the list.

Then sum the array and multiply by the inverse of its length (step 4), which gives the mean squared error:

$$L(w, b) = \frac{1}{N} \sum_{i=1}^{N} \big(\hat{y}_i - y_i\big)^2$$
def loss(guess_data):
    y_predict = guess_data['w'] * X + guess_data['b']  # 1
    diff = y_predict - Y                               # 2
    sq_diff = diff**2                                  # 3
    loss = 1 / len(X) * np.sum(sq_diff)                # 4
    return loss
loss(guess)
Gradient Descent
Adjust the estimate to decrease the loss and get closer to a best fit line. Recalculate both $w$ and $b$.

Take a step towards the valley. The size of the step is determined by the setting for $\alpha$ (also known as the learning rate).
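The step uses the partial derivatives of the loss with respect to each weight; these are exactly the expressions dw and db in the code below:

$$\frac{\partial L}{\partial w} = \frac{2}{N} \sum_{i=1}^{N} \big((w x_i + b) - y_i\big)\, x_i \qquad \frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} \big((w x_i + b) - y_i\big)$$

$$w \leftarrow w - \alpha \frac{\partial L}{\partial w} \qquad b \leftarrow b - \alpha \frac{\partial L}{\partial b}$$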
def gradient_descent(w, b):
    alpha = 0.0001  # learning rate
    # Partial derivatives of the loss with respect to w and b.
    dw = 2 / len(X) * np.sum(((w * X + b) - Y) * X)
    db = 2 / len(X) * np.sum((w * X + b) - Y)
    # Move both weights a small step downhill.
    w_step = w - alpha * dw
    b_step = b - alpha * db
    return {'w': w_step, 'b': b_step}
Plot the single step in yellow.
def take_step(guess):
    step = gradient_descent(guess['w'], guess['b'])
    guess['w'] = step['w']
    guess['b'] = step['b']
    return guess
guess = update_guess_data(take_step(guess))
print("Loss:", loss(guess))
plt.plot(guess['x'], guess['y'], '-y')
plt.gcf()
Take another step in blue.
guess = update_guess_data(take_step(guess))
print("Loss:", loss(guess))
plt.plot(guess['x'], guess['y'], '-b')
plt.gcf()
Find the Best Fit
Stop looking for the best fit once the loss drops below a chosen threshold (here, 46); note that the loop only ends if the loss actually reaches that value. Plot the best fit in green. Keep a record of loss values in loss_hist for plotting.
loss_hist = []

def find_best_fit(best_guess):
    # Keep stepping until the loss falls below the threshold.
    while loss(best_guess) >= 46:
        best_guess = update_guess_data(take_step(best_guess))
        loss_hist.append(loss(best_guess))
    return best_guess
best_fit = find_best_fit(guess)
plt.plot(best_fit['x'], best_fit['y'], '-g')
plt.gcf()
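As a sanity check (my addition, not part of the original walkthrough), the gradient-descent result can be compared against NumPy's closed-form least-squares fit; the two should roughly agree:

# np.polyfit returns coefficients from highest degree down,
# so degree 1 yields (slope, intercept).
w_exact, b_exact = np.polyfit(X, Y, 1)
print("np.polyfit       w:", w_exact, " b:", b_exact)
print("gradient descent w:", best_fit['w'], " b:", best_fit['b'])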
Plot the Loss Values
plt.clf()
plt.plot(loss_hist)
plt.gcf()