# Dataism: **Supervised Learning**

This notebook was authored by Alexandre Puttick and modified by me. The original is on [GitHub](https://github.com/arputtick/dataism/blob/master/Week_3_Supervised_learning%2C_linear_regression.ipynb).

Recall the "five steps of supervised learning" from class:

1. **Start with labeled data**: $(x_1,y_1), (x_2, y_2), \ldots , (x_n, y_n).$ Here $n$ is the number of training samples, the $x_i$ are the samples, and the $y_i$ are the labels.
2. **Begin with an initial random model:** A model is a map $f$ from inputs to outputs that takes a new sample $x$ and gives it a label $y = f(x).$
3. **Define a "loss function":** A number measuring how good or bad the predictions are.
4. **Gradually adjust the model $f$ to decrease the loss:** That is, iteratively adjust $f$ so that its predictions become better and better. Do this with *gradient descent*.
5. **The best model is the one for which the loss is minimized.**

# **Linear Regression**

Let's go through the five steps in the context of *linear regression*, i.e., finding the line that best describes a 2D set of data points.

```python id=fe653f6f-a925-4137-9a52-f42445583048
# Import the necessary libraries:
import numpy as np
import matplotlib.pyplot as plt # matplotlib.pyplot contains a large variety of
                                # visualization tools.
```

# 1. Start with labeled data:

We start with the following data, consisting of CO2-emissions taxes and the corresponding amount of CO2-emissions:

![Screenshot-20200616162119-349x162.png][nextjournal#file#111ef809-b5bc-42ec-a777-59951e1ba149]

The following code saves the first column into a numpy array named 'X.' These are our samples $x_1 = 5,\, x_2 = 10,\, x_3 = 15,\, x_4 = 25.$

```python id=949a3457-a19c-432d-ac3d-a09c93d98964
X = np.array([5, 10, 15, 25])
print(X)
```

We save the second column into a numpy array named 'Y.' These are the labels $y_1 = 5,\, y_2 = 4.3,\, \ldots$

```python id=6f86a167-9435-44fd-84c9-b16fdaa8b738
Y = np.array([5, 4.3, 3.7, 2.9])
print(Y)
```

```python id=b8720263-e77b-43e8-8158-addf30b3c694
# Plot the labeled data
plt.clf()

# Specify that we're plotting X against Y. 'bo' stands for blue dots and
# describes the style of the plot.
plt.plot(X, Y, 'bo')

# label the axes (X holds the tax levels, Y the corresponding emissions)
plt.xlabel('Tax')
plt.ylabel('CO2-Emissions')

# start the plot at the point (0,0)
plt.xlim(xmin = 0)
plt.ylim(ymin = 0)

# display the plot
plt.gcf()
```

![result][nextjournal#output#b8720263-e77b-43e8-8158-addf30b3c694#result]

# 2. Begin with an initial random model:

We start with a randomly chosen line $f(x) = wx + b.$ Random means that we choose $w$ and $b$ randomly. In class I didn't choose completely randomly: I made sure to start with a negative slope $(w < 0)$ and chose $b = 5.$ Why?

Remember that our goal is to minimize the loss, or to "reach the valley from up in the mountains." Actually, there might be multiple valleys. Depending on the initial random model you start with, you might end up in a different valley (see image).

![alt text][nextjournal#file#247e1fc3-9d2b-4fc5-b123-8a0b0038a978]

We want the loss to be as small as possible, so we have to take care to avoid the shallower valley. Therefore, it always makes sense to use some knowledge to pick a starting point closer to the correct valley. From the picture, we can see that the best line should have negative slope $(w < 0)$ and $b$ should be close to 5.

*Exercise*: Try choosing other values of $w$ and $b$ and see if you can make sense of what happens when you do gradient descent.
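As a side remark before we initialize anything: the "model" in this example is nothing more than a small function mapping a tax level to a predicted emissions level. Here is a minimal sketch; the helper name `predict` and the particular numbers are my own illustrative choices, not part of the original notebook:

```python
def predict(x, w, b):
    """The linear model f(x) = w*x + b: tax level in, predicted emissions out."""
    return w * x + b

# With an arbitrary slope and intercept, predict emissions for a tax level of 20.
print(predict(20, -0.1, 5.0))  # -> 3.0
```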
```python id=60518a2e-c988-4067-ae47-11bce07d5bb8
# randomly initialize the slope
w = -10 * np.random.random_sample() # random number between -10 and 0

# not so randomly initialize the y-intercept
b = 5

print(w, b) # Check values of w and b
```

Let's also visualize our initial model:

```python id=276371f4-9e41-451c-95b4-03e9668d8c18
plt.clf()
plt.plot(X, Y, 'bo')

# make a list of one hundred evenly spaced X-values between 0 and 30
x = np.linspace(0, 30, 100)

# the corresponding list of predictions
y = w*x + b

# connect the points from these lists with a red line
plt.plot(x, y, '-r') # -r = 'red line'

plt.ylim(ymin = 0)
plt.xlim(xmin = 0)

plt.gcf()
```

![result][nextjournal#output#276371f4-9e41-451c-95b4-03e9668d8c18#result]

## Initial weights

The general machine learning term for the `w` and `b` that we start with is *initial weights.* Our choice of initial weights determines where in the mountains we start our descent. In situations where there are many valleys (local minima), it is customary to choose several different starting points and compare the models we end up with. One of the least understood phenomena of deep neural networks is that, even though there are many, many local minima, almost all of them seem to produce good models.

## Note on models vs. algorithms

The words 'model' and 'algorithm' are often used interchangeably, but there is a standardized use in the context of machine learning. Usually we say that we use machine learning *algorithms* (like linear regression) to obtain a trained *model* (like the best-fit line). The confusion is that the trained model is itself an algorithm: it takes in a certain input, performs a series of steps that were "learned" during training, and produces a desired output. Hence the idea that machine learning is about writing "algorithms that learn algorithms."

In our example, the input of the model is an arbitrary emissions tax level and the output is the predicted level of CO2-emissions at that tax level.

In other data science contexts, 'model' can have a completely different meaning. For instance, 'modeling the data' might mean creating some sort of visualization.

# 3. Define a Loss Function

For linear regression (and other 'regression problems' in machine learning), a typical choice of loss function is the **mean-squared error:**

$$
Loss = \frac{1}{n}\sum_{i=1}^{n}(f(x_i)-y_i)^2,
$$

where $n$ is the number of training samples (in our example $n = 4$) and $f(x_i) = wx_i + b.$ Remember, the loss should be interpreted as the average of the squared (vertical) distances between our data points and the current line. (The square function is just a way to keep the numbers positive when computing the loss; we could also use the absolute value.)

*Exercise:* Think about why it makes sense that the 'best line' would minimize this average.
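Below, the notebook computes this loss step by step. As a compact alternative, here is a sketch that packages the same formula into a reusable helper; the function name `mse_loss` is my own choice, and the code assumes the arrays `X`, `Y` and the current `w`, `b` defined above:

```python
def mse_loss(w, b, X, Y):
    """Mean-squared error of the line f(x) = w*x + b on the data (X, Y)."""
    Y_predict = w * X + b                  # predictions f(x_i) for every sample
    return np.mean((Y_predict - Y) ** 2)   # average of the squared differences

# Example: loss of the current initial line on our four data points.
print(mse_loss(w, b, X, Y))
```

You can check that it returns the same number as the step-by-step computation that follows.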
First we define an array called 'Y_predict:'

$$
\mbox{Y\_predict} = [f(x_1),\, f(x_2),\, f(x_3),\, f(x_4)]
$$

```python id=2ea37516-164c-4e20-852a-79bf39045bf6
Y_predict = w*X + b
```

```python id=e4be0687-5b72-4f24-aef6-6c957becdeca
Y_predict
```

Now we compare the predictions and the labels:

$$
\mbox{diff} = [f(x_1)-y_1,\, f(x_2)-y_2,\, f(x_3)-y_3,\, f(x_4)-y_4]
$$

```python id=0fede677-5606-4d6f-8d23-f9551f41d0f9
# difference between predictions and labels
diff = Y_predict - Y
diff
```

Then we square them:

$$
\mbox{sq\_diff} = [(f(x_1)-y_1)^2,\, (f(x_2)-y_2)^2,\, (f(x_3)-y_3)^2,\, (f(x_4)-y_4)^2]
$$

```python id=8583a59c-7e7f-4a39-99aa-2ad3557f4724
# square the differences
sq_diff = diff**2
sq_diff
```

Finally, we average the entries in `sq_diff` to obtain the loss:

```python id=e60398b2-9cd3-4776-ae90-9b7f279470b0
loss = 1/len(X) * np.sum(sq_diff)

# len(X) = 4. The len() function returns the length of an array.
# We write len(X) instead of 4 so that our expression for the loss
# works for any number of data points.

loss
```

## Different loss functions for different situations

It's important to know that the choice of loss function in general depends on the situation. There are standard choices for regression or (image) classification, but you can also modify the loss function and get interesting results! For example, [neural style transfer](https://www.tensorflow.org/tutorials/generative/style_transfer) is based on supervised learning in which the label for an image is the image itself, and the loss function includes two parts: 1. the ordinary mean-squared error and 2. a term that measures the distance between a given image and the painting whose style you want to transfer to that image. If you try to make both terms small at once, you end up with a mixed image.

In a more activist direction, there is [research](http://jmlr.org/papers/v20/18-262.html) concerning adding constraints to the loss function to combat unfairness in the resulting model.

# 4. Gradually Adjust the Model to Decrease Loss (Gradient Descent)

Now we begin descending from our initial random model into the valley where the loss is minimal. First we set the step size or 'learning rate,' which we called $\alpha$ in the slides.

```python id=88328afd-da86-4aaf-945c-95f12a463413
learning_rate = .0001 # this is alpha, the step size
```

*Taking a step in the direction of steepest descent* means updating `w` and `b` using the following formulae:

$$
w_{new} = w_{old} - \alpha\cdot \frac{2}{n}\sum_{i=1}^{n}(f(x_i)-y_i)\cdot x_i
$$

$$
b_{new} = b_{old} - \alpha\cdot \frac{2}{n}\sum_{i=1}^{n}(f(x_i)-y_i).
$$

In case you know calculus and are curious where this came from, you can also write this as:

$$
w_{new} = w_{old} - \alpha\cdot \frac{\partial Loss}{\partial w}
$$

$$
b_{new} = b_{old} - \alpha\cdot \frac{\partial Loss}{\partial b}.
$$

First we have to compute the terms that get multiplied by $\alpha$ in the above formulae. We name these `dw` and `db`:

$$
dw = \frac{2}{n}\sum_{i=1}^{n}(f(x_i)-y_i)\cdot x_i
$$

$$
db = \frac{2}{n}\sum_{i=1}^{n}(f(x_i)-y_i).
$$

```python id=d0248a0d-8e65-4861-a314-7f37d75bc90c
dw = 2/len(X) * np.sum((Y_predict - Y) * X)
db = 2/len(X) * np.sum(Y_predict - Y)
```

**Note:** The 'step size' isn't exactly $\alpha$, but $\alpha\cdot dw$ and $\alpha\cdot db.$ Both $dw$ and $db$ become smaller and smaller as we near the bottom of the valley, resulting in smaller and smaller steps, as David pointed out in class.
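If you know a little calculus and want to convince yourself that `dw` and `db` really are the slopes of the loss with respect to `w` and `b`, a quick numerical sanity check is to nudge each parameter by a tiny amount and see how the loss changes. This sketch is my own addition (the helper `loss_at` and the nudge size `eps` are arbitrary choices); it assumes the `X`, `Y`, `w`, `b`, `dw`, and `db` defined above:

```python
def loss_at(w_val, b_val):
    """Mean-squared error of the line w_val*x + b_val on the data (X, Y)."""
    return np.mean((w_val * X + b_val - Y) ** 2)

eps = 1e-6  # a tiny nudge

# finite-difference approximations of the partial derivatives of the loss
dw_approx = (loss_at(w + eps, b) - loss_at(w - eps, b)) / (2 * eps)
db_approx = (loss_at(w, b + eps) - loss_at(w, b - eps)) / (2 * eps)

# these should be very close to the dw and db computed above
print(dw, dw_approx)
print(db, db_approx)
```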
Now we 'take a step,' i.e., update `w` and `b` as in the above formulae:

```python id=d94f9fbe-9677-4b6f-a6a6-3c6e0e2f8a5c
w = w - learning_rate * dw
w
```

```python id=d00749c0-d363-4bef-aa2b-d938bf3e369e
b = b - learning_rate * db
b
```

**That was just a single step.** The following code repeats the whole process over and over until we reach the bottom of the valley:

```python id=f96ef1bc-1a47-4b72-82dc-6df6ac9cb46e
## First specify the number of steps you want to take
num_steps = 500

## We want to keep track of the loss (how high up in the mountains we are)
## after every step. We do this so we can visualize how the loss changed as
## we descended.

## To do this, start with an empty list, which we call 'loss_hist' for
## 'loss history.' We will add the loss at our current location with every step.
loss_hist = [] # empty list
```

```python id=e35aff6f-a1c3-4aa2-a31a-276a12f2ec7b
## The following for-loop repeats the process of taking a step in the
## direction of steepest descent. Each time the loop runs, it uses the
## w and b at our current location.

for step in range(num_steps): # range(num_steps) runs through 0, 1, ..., 499

    Y_predict = w*X + b

    # Compute the loss at our location in the mountains
    diff = Y_predict - Y
    sq_diff = diff**2
    loss = 1/len(X) * np.sum(sq_diff)

    # Compute the 'gradient'
    dw = 2/len(X) * np.sum((Y_predict - Y) * X)
    db = 2/len(X) * np.sum(Y_predict - Y)

    # Take the step:
    w = w - learning_rate * dw
    b = b - learning_rate * db

    # record the loss at every step by adding it to our loss_hist list.
    loss_hist.append(loss)
```

Let's see which values of w and b we ended up with:

```python id=f962c1d5-fdc6-4217-ac9b-37457d798f73
w
```

```python id=c8545a0b-5cf1-4887-8bf0-a691ecdf76c2
b
```

We can also plot the line we ended up with:

```python id=c6019e3e-b462-42c1-949d-ee4b04412f4a
plt.clf()

# include the original data points in our plot
plt.plot(X, Y, 'bo')

# take a long list of 100 values on the x-axis between 0 and 30
x = np.linspace(0, 30, 100)

# compute the corresponding y-values for that long list
y = w*x + b # predictions for values in that long list

# connect all of these points with a red line
plt.plot(x, y, '-r') # -r = red line

# start the plot at the point (0,0)
plt.ylim(ymin = 0)
plt.xlim(xmin = 0)

# display the plot
plt.gcf()
```

![result][nextjournal#output#c6019e3e-b462-42c1-949d-ee4b04412f4a#result]

And visualize how the loss changed over time:

```python id=48396a37-2a42-4f4b-96cc-efca369c1cfb
plt.clf()
plt.plot(loss_hist)
plt.gcf()
```

![result][nextjournal#output#48396a37-2a42-4f4b-96cc-efca369c1cfb#result]

As was noted during class, there's a trade-off between the size of the steps we take and the number of steps it takes us to reach the valley. Smaller steps mean we need more of them to get there. But **our steps can also be too big**. Imagine you have immensely long legs and step so far that you land on the other side of the valley at a point that's even higher up than where you started! If you do this over and over, you'll end up climbing to infinity. If you really want to optimize and reach the valley in as few steps as possible, you should make $\alpha$ as big as you can without ending up in this climbing-to-infinity situation.

*Exercise:* Play around with different values of the 'learning rate' $\alpha$ and the number of steps `num_steps`. Try to find the best values, i.e., the largest $\alpha$ that works and the smallest number of steps. A small experiment along these lines is sketched below.
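Here is a minimal sketch of such an experiment, assuming the arrays `X` and `Y` from above; the helper name `train`, the starting point `w = -1, b = 5`, and the particular learning rates are my own illustrative choices, not part of the original notebook. It reruns gradient descent from the same starting point with several values of $\alpha$ and prints the final loss, so you can see where the climbing-to-infinity behaviour sets in (NumPy may print overflow warnings for learning rates that are too large):

```python
def train(w0, b0, learning_rate, num_steps):
    """Run gradient descent from (w0, b0) and return the final w, b, and loss."""
    w, b = w0, b0
    for step in range(num_steps):
        Y_predict = w*X + b
        dw = 2/len(X) * np.sum((Y_predict - Y) * X)
        db = 2/len(X) * np.sum(Y_predict - Y)
        w = w - learning_rate * dw
        b = b - learning_rate * db
    final_loss = np.mean((w*X + b - Y)**2)
    return w, b, final_loss

# Try a few learning rates from the same starting point.
# Too small: the loss barely decreases. Too large: the loss blows up.
for lr in [0.0001, 0.001, 0.005, 0.01]:
    w_final, b_final, final_loss = train(-1, 5, lr, 500)
    print(lr, final_loss)
```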
**Note:** As Aleks pointed out, the best thing to do would be to take huge steps in the beginning, while you're still high up, and gradually decrease $\alpha$ as you get close to the bottom. There is a lot of research on how to do this; you can google "adaptive learning rates."

[nextjournal#file#111ef809-b5bc-42ec-a777-59951e1ba149]:
[nextjournal#output#b8720263-e77b-43e8-8158-addf30b3c694#result]:
[nextjournal#file#247e1fc3-9d2b-4fc5-b123-8a0b0038a978]:
[nextjournal#output#276371f4-9e41-451c-95b4-03e9668d8c18#result]:
[nextjournal#output#c6019e3e-b462-42c1-949d-ee4b04412f4a#result]:
[nextjournal#output#48396a37-2a42-4f4b-96cc-efca369c1cfb#result]:
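A very simple version of this idea is sketched below, under assumptions of my own: the decay schedule $\alpha_t = \alpha_0 / (1 + \text{decay}\cdot t)$ and all numerical values are illustrative choices, and the code reuses the arrays `X` and `Y` from above. It starts with a larger step size and shrinks it a little on every step:

```python
alpha_0 = 0.003   # initial learning rate (illustrative choice)
decay = 0.01      # how quickly the learning rate shrinks (illustrative choice)

w, b = -1, 5      # start from a fresh initial model
loss_hist_decay = []

for step in range(500):
    Y_predict = w*X + b
    loss_hist_decay.append(np.mean((Y_predict - Y)**2))

    dw = 2/len(X) * np.sum((Y_predict - Y) * X)
    db = 2/len(X) * np.sum(Y_predict - Y)

    # the learning rate gets smaller with every step
    alpha = alpha_0 / (1 + decay * step)
    w = w - alpha * dw
    b = b - alpha * db
```

Plotting `loss_hist_decay` the same way as `loss_hist` above lets you compare the two descents.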
This notebook was exported from https://nextjournal.com/a/Mf6drTaCDmS8Ri2KHWrto?change-id=CpUKh69452LoCNCyU8ARJs