Dataism: Poverty & Murder

This notebook was authored by Alexandre Puttick and modified by me. The original is on GitHub.

Setup

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pip install xlrd

Data

Data correlating poverty, unemployment, and the murder rate. For a given percentage of families with incomes below $5,000, how many murders occur per 1,000,000 inhabitants per annum? If more people live in poverty, are there more murders? (source)

POVERTY_MURDER_DATA = pd.read_excel(
  'Income vs. Murder.xlsx'
)
X = POVERTY_MURDER_DATA['Percent with income below $ 5000']
Y = POVERTY_MURDER_DATA['Number of murders per 1,000,000 inhabitants']
POVERTY_MURDER_DATA.head(5)
   Percent with income below $ 5000  Number of murders per 1,000,000 inhabitants
0                              16.5                                         11.2
1                              20.5                                         13.4
2                              26.3                                         40.7
3                              16.5                                          5.3
4                              19.2                                         24.8
plt.plot(X, Y, 'bo')
# label the axes
plt.xlabel('Percent with income below $5000')
plt.ylabel('No. murders per 10^6 residents')
# start the plot at the origin (0, 0)
plt.xlim(left=0)
plt.ylim(bottom=0)
plt.gcf()

Find the Best Fit Line

Use linear regression to generate a model that finds the best fit line. The best fit line offers two insights:

  1. It tests whether or not x has an influence on y, at a stated level of confidence.

  2. Its slope can be used to predict trends.

If the best fit line has a non-zero slope, then we can show that murders and poverty are correlated at some level of confidence (e.g. 95% confidence that x has an influence on y).
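
As a quick sanity check (an addition, not part of the original notebook), SciPy's stats.linregress fits the same line in closed form and reports a p-value for the null hypothesis that the slope is zero. This sketch assumes SciPy is installed and that X and Y are the columns loaded above.

from scipy import stats

# Closed-form fit plus a p-value for the hypothesis "slope is zero".
result = stats.linregress(X, Y)
print("slope:", result.slope, " intercept:", result.intercept)
print("p-value for zero slope:", result.pvalue)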

Choose Initial Weights

Set reasonable initial weights for w and b. The first guess may or may not appear on the graph, depending on where the random choice lands.

def guess_weight(value):
  return np.random.random_sample() * value

def init_guess():
  # Random starting weights: w in [0, 10), b in (-100, 0].
  return {'w': guess_weight(10), 'b': guess_weight(-100)}

def update_guess_data(weight_guess):
  min_line_x = np.amin(X)
  max_line_x = np.amax(X)

  # Draw the line implied by the current weights across the data range.
  x_fit = np.linspace(min_line_x, max_line_x, 100)
  y_fit = weight_guess['w'] * x_fit + weight_guess['b']

  return {'x': x_fit, 'y': y_fit,
          'w': weight_guess['w'], 'b': weight_guess['b']}
guess = update_guess_data(init_guess())
print("w:", guess['w'], " b:", guess['b'])
plt.plot(guess['x'], guess['y'], '-r')
plt.gcf()

Define the Loss Function

Build the array for $i = 1, \dots, n$ in three steps:

  1. $f(x_i) = wx_i + b$ is the list of predictions based on the (randomly chosen) starting values of $w$ and $b$.

  2. $f(x_i) - y_i$: subtract the actual values from the predictions.

  3. $(f(x_i) - y_i)^2$: square each value in the list.

Sum the array and divide by its length (step 4):

$$L(w, b) = \frac{1}{n}\sum_{i=1}^{n}\left(f(x_i) - y_i\right)^2$$

def loss(guess_data):
  y_predict = guess_data['w']*X + guess_data['b']  # 1. predictions f(x_i)
  diff = y_predict - Y                             # 2. f(x_i) - y_i
  sq_diff = diff**2                                # 3. squared differences
  return np.mean(sq_diff)                          # 4. average over n
loss(guess)
1988.250003490549

Gradient Descent

Adjust the estimate to decrease the loss and get closer to a best fit line. Recalculate both w and b.

Take a step towards the valley of the loss surface. The size of the step is determined by $\alpha$, also known as the learning rate.
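
Differentiating the loss $L(w, b)$ above gives the gradient components used in the code below:

$$\frac{\partial L}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}\left(f(x_i) - y_i\right)x_i \qquad \frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}\left(f(x_i) - y_i\right)$$

Each step updates $w \leftarrow w - \alpha \, \partial L/\partial w$ and $b \leftarrow b - \alpha \, \partial L/\partial b$; these appear as dw and db in the code.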

def gradient_descent(w, b):
    alpha = 0.0001  # learning rate

    # Partial derivatives of the mean squared error loss.
    dw = 2/len(X) * np.sum(((w*X + b) - Y) * X)
    db = 2/len(X) * np.sum((w*X + b) - Y)

    # Step downhill, scaled by the learning rate.
    w_step = w - alpha * dw
    b_step = b - alpha * db

    return {'w': w_step, 'b': b_step}

Take a single gradient descent step and plot the new line in yellow.

def take_step(guess):
  step = gradient_descent(guess['w'], guess['b'])
  guess['w'] = step['w']
  guess['b'] = step['b']
  return guess
  
guess = update_guess_data(take_step(guess))
print("Loss:", loss(guess))
plt.plot(guess['x'], guess['y'], '-y')
plt.gcf()

Take another step in blue.

guess = update_guess_data(take_step(guess))
print("Loss:", loss(guess))
plt.plot(guess['x'], guess['y'], '-b')
plt.gcf()

Find the Best Fit

Stop searching once the loss drops below a fixed threshold (46 here). Plot the best fit in green. Keep a record of the loss values in loss_hist for plotting later.

loss_hist = []

def find_best_fit(best_guess):
  # Keep stepping until the loss drops below the threshold.
  while loss(best_guess) >= 46:
    best_guess = update_guess_data(take_step(best_guess))
    loss_hist.append(loss(best_guess))
  return best_guess

best_fit = find_best_fit(guess)
plt.plot(best_fit['x'], best_fit['y'], '-g')
plt.gcf()
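
As a cross-check (again an addition, not part of the original notebook), np.polyfit solves the same least squares problem directly, so the learned weights should land close to its coefficients. They will not match exactly, since gradient descent stops as soon as the loss dips below 46.

# Exact least squares coefficients for a degree-1 polynomial.
w_exact, b_exact = np.polyfit(X, Y, deg=1)
print("gradient descent: w =", best_fit['w'], " b =", best_fit['b'])
print("np.polyfit:       w =", w_exact, " b =", b_exact)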

Plot the Loss Values

plt.clf()
plt.plot(loss_hist)
plt.gcf()