If you need a refresher on NumPy, the Scipy Lectures are a good resource.
import numpy as np
We’ll run our algorithms on a simple two-dimensional nonlinear dataset: low-dimensional enough to visualize easily, yet complex enough that different algorithms show visibly different behaviour.
# See Github for the full code - hidden for brevity
Xs = ...
Ys = ...
test_Xs = ...
test_Ys = ...
def accuracy(model,X,Y):
    pass
def plot(model,X,Y):
    pass
We’ll train each algorithm on a dataset of 1000 samples and plot the decision boundary it learns. For example, here’s the “optimal” classifier: the color of each point is its true class, and the background color is the class the classifier predicts.
import matplotlib.pyplot as plt
%matplotlib inline
Xs = np.random.rand(1000,2)*10   # 1000 points drawn uniformly from [0,10]^2
Ys = np.array([x[1] > x[0]+np.sin(x[0]*2) for x in Xs])   # True iff the point lies above the curve y = x + sin(2x)
test_Xs = np.random.rand(1000,2)*10
test_Ys = np.array([x[1] > x[0]+np.sin(x[0]*2) for x in test_Xs])
def accuracy(model,X,Y):
    # Fraction of points whose predicted class matches the true label
    return np.count_nonzero([model.predict(x)==y for (x,y) in zip(X,Y)])/len(X)
def plot(model,X,Y,shades=False,nPoints=100,**kwargs):
    plt.figure(figsize=(8,8))
    h = 100  # Grid resolution along each axis
    x_min, y_min = np.min(X,axis=0)
    x_max, y_max = np.max(X,axis=0)
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, h),
                         np.linspace(y_min, y_max, h))
    # Here "model" is your model's prediction (classification) function
    Z = np.array([model.predict(x) for x in np.c_[xx.ravel(), yy.ravel()]])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Pastel2,**kwargs)
    # Also plot a random subsample of the points themselves
    subpoints = np.random.choice(len(X),nPoints)
    plt.scatter(X[subpoints, 0], X[subpoints, 1], c=Y[subpoints], cmap=plt.cm.Dark2)
    plt.axis('off')
class Dummy:
    # The "optimal" classifier: it applies the exact rule that generated the labels
    @staticmethod
    def predict(x):
        return x[1] > x[0]+np.sin(x[0]*2)
plot(Dummy,Xs,Ys)
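Since Dummy implements the exact rule that generated the labels, it should classify the noiseless test set perfectly, which makes it a useful ceiling for the algorithms below. A quick check with the accuracy helper defined above:
print("Accuracy is ", accuracy(Dummy, test_Xs, test_Ys))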
$k$-Nearest Neighbors is a non-parametric machine learning model that predicts the class of a point $x$ by taking a majority vote among the $k$ training points closest to $x$.
from collections import Counter
class kNN:
    def __init__(self,k=1):
        self.k = k
    def fit(self,xs,ys):
        """
        Since kNN is nonparametric, we need to keep the training data around at test time.
        k-NN doesn't do any additional preprocessing.
        """
        self.xs = xs
        self.ys = ys
    def predict(self,x):
        # Sort training points by distance to the new point and keep the k closest
        closest_points = np.argsort(np.linalg.norm(self.xs-x,axis=1))[:self.k]
        # Get the labels of these close points
        closest_labels = self.ys[closest_points]
        # Return the most common label
        return Counter(closest_labels).most_common(1)[0][0]
kNN_classifier = kNN(1)
kNN_classifier.fit(Xs,Ys)
plot(kNN_classifier,test_Xs,test_Ys)
print("Accuracy is ",accuracy(kNN_classifier,test_Xs,test_Ys))
Accuracy is 0.978
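The 1-NN decision boundary is quite jagged, since a single training point decides each prediction. A quick sketch of how one might probe the effect of $k$, reusing the same class and helpers (the particular values of $k$ are arbitrary):
for k in (1, 5, 25):
    clf = kNN(k)
    clf.fit(Xs, Ys)
    print("k =", k, " accuracy:", accuracy(clf, test_Xs, test_Ys))
Larger $k$ averages over more neighbors, which smooths the boundary at the cost of blurring fine structure near the sine curve.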
Linear regression attempts to find the best linear model $f(x) = w^Tx$ for the data. Training amounts to solving the following optimization problem:
$$w^* = \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^n (w^Tx_i - y_i)^2 $$
We can write this in matrix form, where $X$ is an $n \times d$ matrix whose $i$-th row is the training point $x_i$:
$$w^* = \arg\min_{w \in \mathbb{R}^d} \|Xw-y\|_2^2$$
Using matrix calculus, the gradient is $\nabla_w \|Xw-y\|_2^2 = 2X^T(Xw-y)$; setting it to zero, the optimal solution $w^*$ satisfies the normal equations
$$(X^TX)w^* = X^Ty$$
class LinReg:
    def fit(self,xs,ys):
        """
        Solving (X^TX)w^* = X^Ty
        """
        self.w = np.linalg.solve(xs.T.dot(xs),xs.T.dot(ys))
    def predict(self,x):
        return self.w.dot(x) > 0.5 # If predict > 0.5, then say class 1, else class 0
linear_classifier = LinReg()
linear_classifier.fit(Xs,Ys)
print("Accuracy is ",accuracy(linear_classifier,test_Xs,test_Ys))
plot(linear_classifier,test_Xs,test_Ys)
Accuracy is 0.856
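Forming $X^TX$ explicitly squares the condition number of the problem, which can be numerically fragile. A least-squares solver works on $X$ directly; here is a minimal variant of fit (a sketch using np.linalg.lstsq, which minimizes the same objective and should recover essentially the same weights on this well-conditioned data):
class LinRegLstsq(LinReg):
    def fit(self,xs,ys):
        # Minimize ||Xw - y||^2 directly, without forming X^T X
        self.w, *_ = np.linalg.lstsq(xs, ys.astype(float), rcond=None)
lstsq_classifier = LinRegLstsq()
lstsq_classifier.fit(Xs,Ys)
print("Accuracy is ",accuracy(lstsq_classifier,test_Xs,test_Ys))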
Logistic regression is similar to linear regression, but is specialized for classification: rather than predicting labels directly, it predicts probabilities $p(y_i=1 \mid x_i)$.
If we naively applied linear regression to predict probabilities, we would be modeling $p(y=1\mid x) = w^Tx$. However, since $w^Tx$ can take any value in $(-\infty, \infty)$, it matches the desired output range $(0,1)$ poorly.
Estimating the odds $\cfrac{p(y=1\mid x)}{p(y=0\mid x)}$ as a linear function is better, but the odds still take values only in $[0,\infty)$. To exploit the full range of values a linear function affords, we instead model the log-odds as a linear function of $x$:
$$\log \cfrac{p(y=1\mid x)}{p(y=0\mid x)} = w^Tx$$
Solving for the probability, we see that it is given by the sigmoid function
$$P(Y=1|x) = \cfrac{e^{w^Tx}}{1+e^{w^Tx}} = \frac{1}{1+e^{-w^Tx}}$$
We now have a model which predicts probabilities:
$$ \begin{align*} P(Y=y ~~\vert~~X=x,w) &= \begin{cases} p & y=1 \\ 1-p & y=0 \end{cases}\\~\\ &=p^y(1-p)^{1-y} \end{align*} $$
where $p = P(Y=1 \mid X=x, w)$ is the sigmoid probability above. Now, how do we find the “best” weights $w$? The standard method is Maximum Likelihood Estimation (MLE), which maximizes the probability of the training data under our model. Intuitively, we would like to find the model that was most likely to have generated the data in the first place.
In the following calculations, we’ll use the abbreviation $p_i = \cfrac{e^{w^Tx_i}}{1+e^{w^Tx_i}}$, our predicted probability that $y_i=1$ given $x_i$. Maximizing the likelihood of the training data is then equivalent to maximizing the log-likelihood
$$\ell(w) = \sum_{i=1}^n \Big(y_i\log p_i + (1-y_i)\log(1-p_i)\Big) = \sum_{i=1}^n \Big(y_i\, w^Tx_i - \log\big(1+e^{w^Tx_i}\big)\Big)$$
Unlike linear regression above, this has no closed-form solution. Instead, we’ll solve it in two ways: with gradient descent and with convex optimization.
The gradient of the log-likelihood is given by
$$\nabla_w \ell(w) = \sum_{i=1}^n \left(y_i x_i - p_i x_i\right) = \sum_{i=1}^n (y_i-p_i)x_i$$
class LogRegGD:
    def __init__(self,a=.001,nIter=1000):
        self.a = a          # Learning rate
        self.nIter = nIter  # Number of gradient updates
    def probabilities(self,xs):
        # Sigmoid of w^T x for every row of xs
        return 1/(1+np.exp(-1*xs.dot(self.w)))
    def fit(self,xs,ys):
        """
        Take gradient steps on the log-likelihood
        """
        xs = np.hstack([xs,np.ones((len(xs),1))]) # Add bias term
        self.w = np.random.rand(xs.shape[1])
        for i in range(self.nIter):
            ps = self.probabilities(xs)
            self.w += self.a*(ys-ps).dot(xs) # Gradient step: w += a * sum_i (y_i - p_i) x_i
    def predict(self,x):
        x = np.concatenate((x,[1])) # Add bias term
        return self.w.dot(x) > 0 # If w^Tx > 0, then p > .5
logistic_classifier = LogRegGD()
logistic_classifier.fit(Xs,Ys)
print("Accuracy is ",accuracy(logistic_classifier,test_Xs,test_Ys))
plot(logistic_classifier,test_Xs,test_Ys)
Accuracy is 0.941
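Since gradient descent’s answer depends on the learning rate and the number of iterations, it helps to be able to evaluate the objective itself. Here is a small sketch of such a check; log_likelihood and xs_bias are illustrative names introduced here, not part of the class above. It computes $\ell(w)$ using the numerically stable identity $\log(1+e^z) = \mathrm{logaddexp}(0, z)$:
def log_likelihood(w, xs, ys):
    # Illustrative helper: l(w) = sum_i [ y_i w^T x_i - log(1 + exp(w^T x_i)) ]
    z = xs.dot(w)
    return np.sum(ys*z - np.logaddexp(0, z))
xs_bias = np.hstack([Xs, np.ones((len(Xs),1))])  # same bias column the classifier appends in fit
print("Train log-likelihood:", log_likelihood(logistic_classifier.w, xs_bias, Ys))
If this value keeps improving as nIter is increased, the optimizer hasn’t converged yet.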
The negative log-likelihood is convex, so we can use a convex optimization library to minimize it. We’ll use cvxpy (install guide here). This requires an extra library, but it has a significant advantage over gradient descent: it returns the exact solution, with no hyperparameters to tune and no divergence concerns.
Recall from above that the objective is given by
$$\arg \max_{w} \sum_{i=1}^n y_i w^Tx_i - \log\left(1+e^{w^Tx_i}\right)$$
We can write this in matrix notation as
$$y^TXw - \mathbf{1}^T\text{Logistic}(Xw)$$
where $\mathbf{1}$ is the ones-vector in $\mathbb{R}^n$, and $\text{Logistic}(\cdot)$ applies the function $f(x) = \log(1 + e^x)$ (cvxpy’s logistic atom) to each entry.
import cvxpy as cvx
class LogRegCVX:
    def fit(self,xs,ys):
        # Adding a bias term
        xs = np.hstack([xs,np.ones((len(xs),1))])
        # Formulating the objective: the negated log-likelihood, so that we can minimize it
        # (note: this uses the older cvxpy 0.x API, where * is matrix multiplication
        #  and sum_entries sums the entries of a vector)
        w = cvx.Variable(xs.shape[1])
        objective = -1*(ys.dot(xs)*w-cvx.sum_entries(cvx.logistic(xs*w)))
        # Solving and storing the optimal weights
        cvx.Problem(cvx.Minimize(objective)).solve()
        self.w = np.array(w.value).flatten()
    def predict(self,x):
        x = np.concatenate((x,[1])) # Add bias term
        return self.w.dot(x) > 0 # If w^Tx > 0, then p > .5
cvx_logistic_classifier = LogRegCVX()
cvx_logistic_classifier.fit(Xs,Ys)
print("Accuracy is ",accuracy(cvx_logistic_classifier,test_Xs,test_Ys))
plot(cvx_logistic_classifier,test_Xs,test_Ys)
Accuracy is 0.935
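Both classifiers maximize the same log-likelihood, so if gradient descent has converged their weight vectors should be close, and the cvxpy solution’s log-likelihood should be at least as high up to solver tolerance. A quick comparison, reusing the illustrative log_likelihood helper and xs_bias from earlier:
print("GD weights:  ", logistic_classifier.w)
print("CVX weights: ", cvx_logistic_classifier.w)
print("GD  log-likelihood:", log_likelihood(logistic_classifier.w, xs_bias, Ys))
print("CVX log-likelihood:", log_likelihood(cvx_logistic_classifier.w, xs_bias, Ys))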
Find the source code for this post here.