Dataism: Poverty & Murder
This notebook was authored by Alexandre Puttick and modified by me. The original is on GitHub.
Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%pip install xlrd
Data
Data that correlates poverty, unemployment, and the murder rate. For a given percentage of families with incomes below $5,000, what is the number of murders per 1,000,000 inhabitants per annum? If more people live in poverty, are there more murders? (source)
POVERTY_MURDER_DATA = pd.read_excel('Income vs. Murder.xlsx')
X = POVERTY_MURDER_DATA['Percent with income below $ 5000']
Y = POVERTY_MURDER_DATA['Number of murders per 1,000,000 inhabitants']
POVERTY_MURDER_DATA.head(5)
|   | Percent with income below $ 5000 | Number of murders per 1,000,000 inhabitants |
|---|---|---|
| 0 | 16.5 | 11.2 |
| 1 | 20.5 | 13.4 |
| 2 | 26.3 | 40.7 |
| 3 | 16.5 | 5.3 |
| 4 | 19.2 | 24.8 |
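A quick way to quantify the relationship before fitting anything (my addition, not in the original notebook) is the Pearson correlation between the two columns:

print("Pearson correlation:", X.corr(Y))  # pandas Series.corr, Pearson by default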
plt.plot(X, Y, 'bo')
# label the axes
plt.xlabel('Percent with income below $5000')
plt.ylabel('No. murders per 10^6 residents')
# start the plot at the point (0,0)
plt.xlim(left=0)
plt.ylim(bottom=0)
plt.gcf()
Find the Best Fit Line
Use linear regression to generate a model that finds the best fit line. The best fit line offers two insights:

1. It tests whether or not $x$ has an influence on $y$, and gives that claim a level of confidence.
2. Its slope can be used to predict trends.

If the best fit line has a non-zero slope, then we can show that murders and poverty are correlated with some level of confidence (e.g. 95% confidence that $x$ has an influence on $y$).
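Concretely, the model is a line with slope $w$ and intercept $b$, matching the code below:

$$\hat{y} = w x + b$$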
Choose Initial Weights
Set reasonable initial weights for $w$ and $b$. The first guess may or may not appear on the graph, depending on where the random choice starts from.
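For reproducible runs, you can seed NumPy's random number generator before guessing (my addition, not in the original; any fixed seed works):

np.random.seed(42)  # fixed seed makes the random initial guess repeatable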
def guess_weight(value):
    # Uniform random sample in [0, 1), scaled by value.
    return np.random.random_sample() * value

def init_guess():
    # w starts in [0, 10); b starts in (-100, 0].
    return {'w': guess_weight(10), 'b': guess_weight(-100)}
def update_guess_data(weight_guess):
    min_line_x = np.amin(X)
    max_line_x = np.amax(X)
    # Draw a line spanning the observed range of x.
    x_fit = np.linspace(min_line_x, max_line_x, 100)
    y_fit = weight_guess['w'] * x_fit + weight_guess['b']
    return {'x': x_fit, 'y': y_fit,
            'w': weight_guess['w'], 'b': weight_guess['b']}
guess = update_guess_data(init_guess())
print("w:", guess['w'], " b:", guess['b'])
plt.plot(guess['x'], guess['y'], '-r')
plt.gcf()
Define the Loss Function
Build the array of squared differences in three steps:

1. $\hat{y} = wX + b$: the list of predictions based on the (randomly chosen) starting points of $w$ and $b$.
2. $\hat{y} - y$: subtract the actual values from the predictions.
3. $(\hat{y} - y)^2$: square each value in the list.

Then sum the array and multiply by the inverse of its length (step 4), which gives the mean squared error:

$$L(w, b) = \frac{1}{N} \sum_{i=1}^{N} \big(\hat{y}_i - y_i\big)^2$$
def loss(guess_data):
    y_predict = guess_data['w'] * X + guess_data['b']  # 1
    diff = y_predict - Y                               # 2
    sq_diff = diff**2                                  # 3
    loss = 1 / len(X) * np.sum(sq_diff)                # 4
    return loss
loss(guess)
Gradient Descent
Adjust the estimate to decrease the loss and get closer to a best fit line. Recalculate both $w$ and $b$.

Take a step towards the valley. The size of the step is determined by the setting for $\alpha$ (also known as the learning rate).
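The step uses the partial derivatives of the loss with respect to each weight; these are exactly the expressions dw and db in the code below:

$$\frac{\partial L}{\partial w} = \frac{2}{N} \sum_{i=1}^{N} \big((w x_i + b) - y_i\big)\, x_i \qquad \frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} \big((w x_i + b) - y_i\big)$$

$$w \leftarrow w - \alpha \frac{\partial L}{\partial w} \qquad b \leftarrow b - \alpha \frac{\partial L}{\partial b}$$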
def gradient_descent(w, b):
    alpha = 0.0001  # learning rate
    # Partial derivatives of the loss with respect to w and b.
    dw = 2 / len(X) * np.sum(((w * X + b) - Y) * X)
    db = 2 / len(X) * np.sum((w * X + b) - Y)
    # Move both weights a small step downhill.
    w_step = w - alpha * dw
    b_step = b - alpha * db
    return {'w': w_step, 'b': b_step}
Plot the single step in yellow.
def take_step(guess):
    step = gradient_descent(guess['w'], guess['b'])
    guess['w'] = step['w']
    guess['b'] = step['b']
    return guess
guess = update_guess_data(take_step(guess))
print("Loss:", loss(guess))
plt.plot(guess['x'], guess['y'], '-y')
plt.gcf()
Take another step in blue.
guess = update_guess_data(take_step(guess))
print("Loss:", loss(guess))
plt.plot(guess['x'], guess['y'], '-b')
plt.gcf()
Find the Best Fit
Stop looking for the best fit once the loss drops below a chosen threshold (here, 46); note that the loop only ends if the loss actually reaches that value. Plot the best fit in green. Keep a record of loss values in loss_hist for plotting.
loss_hist = []

def find_best_fit(best_guess):
    # Keep stepping until the loss falls below the threshold.
    while loss(best_guess) >= 46:
        best_guess = update_guess_data(take_step(best_guess))
        loss_hist.append(loss(best_guess))
    return best_guess
best_fit = find_best_fit(guess)
plt.plot(best_fit['x'], best_fit['y'], '-g')
plt.gcf()
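As a sanity check (my addition, not part of the original walkthrough), the gradient-descent result can be compared against NumPy's closed-form least-squares fit; the two should roughly agree:

# np.polyfit returns coefficients from highest degree down,
# so degree 1 yields (slope, intercept).
w_exact, b_exact = np.polyfit(X, Y, 1)
print("np.polyfit       w:", w_exact, " b:", b_exact)
print("gradient descent w:", best_fit['w'], " b:", best_fit['b'])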
Plot the Loss Values
plt.clf()
plt.plot(loss_hist)
plt.gcf()