Machine Learning Portfolio
1 About
This document is written in Emacs with org-mode, for export to HTML.
The inline code blocks in this document are executed sequentially in a single Python session by org-babel. Usage here is meant to be representative of use during an interactive session. portfolio.visualize.Plot.embed() produces a representation of the plot for embedding in an HTML document. For normal use, use save or show instead.
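For instance, a hypothetical minimal usage (the figure here is purely illustrative):
import matplotlib.pyplot as plt
from portfolio.visualize import Plot
plt.figure()
plt.plot([1, 2, 3], [1, 4, 9])
Plot.embed(plt)  # embeddable representation, as used throughout this document
# In an interactive session, showing or saving the figure would be used instead.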
The source of this document and the project are available on GitHub here. If you are viewing this on GitHub, an exported version with output and graphs is available here.
2 Setup
Install the project dependencies via Poetry
poetry install --no-interaction --ansi
Activate the venv. In Emacs, you can activate a venv located at ./.venv via
(pyvenv-activate ".venv")
Load the iris dataset
import numpy as np
from portfolio.visualize import Plot
from sklearn.datasets import load_iris
iris = load_iris()
Get the data for only types 1 and 2 for the binary classifiers, and label them as -1 or 1
x = iris.data[iris.target > 0]
y = np.where(iris.target[iris.target > 0] == 1, -1, 1)
3 Models
3.1 Perceptron
3.1.1 Example runs from project description
This section contains code equivalent to that in the example run in the project documentation, to show that it meets the specifications.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from portfolio.perceptron import Perceptron, plot_decision_regions, IRIS_OPTIONS
from portfolio.visualize import Plot
Download and parse the dataset
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', **IRIS_OPTIONS)
Extract the first 100 labels (which are the first two types)
y = df.iloc[0:100, 4].values
y
Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-setosa | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor | Iris-versicolor |
Since we only selected the first two, the classes are either Iris-setosa or Iris-versicolor. Label Iris-setosa as -1 and everything else as 1.
y = np.where(y == 'Iris-setosa', -1, 1)
y
-1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
The first and third features are separable, so select only those columns for X
X = df.iloc[0:100, [0, 2]].values
X
sepal length | petal length |
---|---|
5.1 | 1.4 |
4.9 | 1.4 |
4.7 | 1.3 |
4.6 | 1.5 |
5 | 1.4 |
5.4 | 1.7 |
4.6 | 1.4 |
5 | 1.5 |
4.4 | 1.4 |
4.9 | 1.5 |
5.4 | 1.5 |
4.8 | 1.6 |
4.8 | 1.4 |
4.3 | 1.1 |
5.8 | 1.2 |
5.7 | 1.5 |
5.4 | 1.3 |
5.1 | 1.4 |
5.7 | 1.7 |
5.1 | 1.5 |
5.4 | 1.7 |
5.1 | 1.5 |
4.6 | 1 |
5.1 | 1.7 |
4.8 | 1.9 |
5 | 1.6 |
5 | 1.6 |
5.2 | 1.5 |
5.2 | 1.4 |
4.7 | 1.6 |
4.8 | 1.6 |
5.4 | 1.5 |
5.2 | 1.5 |
5.5 | 1.4 |
4.9 | 1.5 |
5 | 1.2 |
5.5 | 1.3 |
4.9 | 1.5 |
4.4 | 1.3 |
5.1 | 1.5 |
5 | 1.3 |
4.5 | 1.3 |
4.4 | 1.3 |
5 | 1.6 |
5.1 | 1.9 |
4.8 | 1.4 |
5.1 | 1.6 |
4.6 | 1.4 |
5.3 | 1.5 |
5 | 1.4 |
7 | 4.7 |
6.4 | 4.5 |
6.9 | 4.9 |
5.5 | 4 |
6.5 | 4.6 |
5.7 | 4.5 |
6.3 | 4.7 |
4.9 | 3.3 |
6.6 | 4.6 |
5.2 | 3.9 |
5 | 3.5 |
5.9 | 4.2 |
6 | 4 |
6.1 | 4.7 |
5.6 | 3.6 |
6.7 | 4.4 |
5.6 | 4.5 |
5.8 | 4.1 |
6.2 | 4.5 |
5.6 | 3.9 |
5.9 | 4.8 |
6.1 | 4 |
6.3 | 4.9 |
6.1 | 4.7 |
6.4 | 4.3 |
6.6 | 4.4 |
6.8 | 4.8 |
6.7 | 5 |
6 | 4.5 |
5.7 | 3.5 |
5.5 | 3.8 |
5.5 | 3.7 |
5.8 | 3.9 |
6 | 5.1 |
5.4 | 4.5 |
6 | 4.5 |
6.7 | 4.7 |
6.3 | 4.4 |
5.6 | 4.1 |
5.5 | 4 |
5.5 | 4.4 |
6.1 | 4.6 |
5.8 | 4 |
5 | 3.3 |
5.6 | 4.2 |
5.7 | 4.2 |
5.7 | 4.2 |
6.2 | 4.3 |
5.1 | 3 |
5.7 | 4.1 |
Plotting the data, we can clearly see that these two features are separable.
plt.figure()
plt.scatter(X[:50, 0], X[:50, 1], color='red', marker='o', label='setosa')
plt.scatter(X[50:100, 0], X[50:100, 1], color='blue', marker='x', label='versicolor')
plt.xlabel('sepal length')
plt.ylabel('petal length')
plt.legend(loc='upper left')
Plot.embed(plt)
Initialize the perceptron with a learning rate of 0.1 and a maximum of 1000 iterations.
pn = Perceptron(0.1, 1000)
fit runs the perceptron algorithm on the given data. The number of errors per iteration is stored in errors. Since this only took 6 iterations to converge, it stopped early instead of running all 1000.
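For reference, here is a minimal sketch of the kind of training loop fit presumably runs (the function name and the exact Rosenblatt-style update are assumptions; the project's implementation may differ in detail):
import numpy as np

def perceptron_fit_sketch(X, y, eta=0.1, max_iter=1000):
    # Weights are stored as [bias, w_1, ..., w_d]; training stops after an error-free pass.
    w = np.zeros(1 + X.shape[1])
    errors_per_pass = []
    for _ in range(max_iter):
        errors = 0
        for xi, target in zip(X, y):
            prediction = 1 if (xi @ w[1:] + w[0]) >= 0.0 else -1
            update = eta * (target - prediction)  # zero when the sample is classified correctly
            w[1:] += update * xi
            w[0] += update
            errors += int(update != 0.0)
        errors_per_pass.append(errors)
        if errors == 0:  # converged: drop out before hitting max_iter
            break
    return w, errors_per_pass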
pn.fit(X, y)
pn.errors
2 | 2 | 3 | 2 | 1 | 0 |
This data can be seen plotted below.
plt.figure()
plt.plot(range(1, len(pn.errors) + 1), pn.errors, marker='o')
plt.xlabel('Iteration')
plt.ylabel('# of misclassifications')
Plot.embed(plt)
net_input gives the raw value of the decision function for each sample, before it is thresholded into a class label:
pn.net_input(X)
-1.32 | -1.184 | -1.23 | -0.798 | -1.252 | -0.978 | -0.98 | -1.07 | -0.844 | -1.002 | -1.342 | -0.752 | -1.116 | -1.322 | -2.16 | -1.546 | -1.706 | -1.32 | -1.182 | -1.138 | -0.978 | -1.138 | -1.708 | -0.774 | -0.206 | -0.888 | -0.888 | -1.206 | -1.388 | -0.684 | -0.752 | -1.342 | -1.206 | -1.592 | -1.002 | -1.616 | -1.774 | -1.002 | -1.026 | -1.138 | -1.434 | -1.094 | -1.026 | -0.888 | -0.41 | -1.116 | -0.956 | -0.98 | -1.274 | -1.252 | 3.394 | 3.438 | 3.826 | 3.14 | 3.552 | 3.914 | 3.87 | 2.274 | 3.484 | 3.162 | 2.57 | 3.232 | 2.8 | 4.006 | 2.344 | 3.052 | 3.982 | 3.118 | 3.574 | 2.89 | 4.324 | 2.732 | 4.234 | 4.006 | 3.074 | 3.12 | 3.712 | 4.144 | 3.71 | 2.094 | 2.776 | 2.594 | 2.754 | 4.802 | 4.118 | 3.71 | 3.598 | 3.324 | 3.254 | 3.14 | 3.868 | 3.824 | 2.936 | 2.206 | 3.436 | 3.368 | 3.368 | 3.21 | 1.592 | 3.186 |
Running it on the original data in order, we can see that it correctly classifies all of the samples.
pn.predict(X)
-1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
The learned weight vector:
pn.weight
-0.4 | -0.68 | 1.82 |
Here is a visualization of the computed decision boundary:
plot_decision_regions(X, y, pn)
Plot.embed(plt)
The other data I tested against:
First, data that is only barely separated. 701 iterations seems like a really big number, but the classes are close together, so I'm assuming it could be right.
X1 = df.iloc[0:100, [0, 1]].values
pn.fit(X1, y)
pn.errors.shape
701 |
plot_decision_regions(X1, y, pn)
Plot.embed(plt)
And data with a very wide separation. Here we can see it dropping out after only 3 iterations.
X2 = df.iloc[0:100, [2, 3]].values
pn.fit(X2, y)
pn.errors
2 | 2 | 0 |
plot_decision_regions(X2, y, pn)
Plot.embed(plt)
Close all the figures we opened:
plt.close('all')
3.2 Linear Regression
For single-dimensional data, linear regression is defined as \(w = A^{-1}b\), where \(A\) is \(xx^\mathsf{T}\) and \(b\) is \(yx\). Since the matrix may not be invertible, we take the pseudoinverse. Since I added a column of ones to the \(A\) matrix, \(w_0\) holds the y-intercept, and since the data is only a single dimension, \(w_1\) holds the slope of the line.
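As a minimal sketch of that closed form (the function name is illustrative and not the project's API; it assumes a 1-D feature array x and a target array y):
import numpy as np

def linear_regression_sketch(x, y):
    # Prepend a column of ones so w[0] is the intercept and w[1] is the slope.
    X = np.column_stack([np.ones_like(x), x])
    A = X.T @ X                   # A = sum of x_i x_i^T over the samples
    b = X.T @ y                   # b = sum of y_i x_i over the samples
    return np.linalg.pinv(A) @ b  # pseudoinverse, in case A is not invertible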
The linear regression code may be run on the iris dataset with the following:
from portfolio.linear_regression import LinearRegression
x_sepal_width = x[:, 1]
regression_plot = Plot(x_sepal_width, y)
regression_plot.add_view(LinearRegression)
regression_plot.embed()
3.3 Decision Stumps
Decision stumps are a type of weak learner. They consist of a decision tree with only a single node. My implementation only accepts a binary split of classes labeled -1 and 1, although in theory a decision stump can split based on threshold into any number of classes.
I couldn't figure out how to translate the pseudocode for the efficient \(O(dm)\) implementation, so this is the naïve \(O(dm^2)\) version. I also have a somewhat fuzzy understanding of the book's explanation, but this is equivalent as far as I can tell.
Since the decision stump needs a threshold \(\theta\) in order to classify everything on each side as a different class, we need to define a set of candidate thresholds. Given that two points may be arbitrarily close, I used the \(x\) values from the data themselves to ensure that if there was a single optimal \(\theta\) it would be selected.
The other part of the decision stump is the dimension of the data against which it will classify. This implementation considers all given dimensions and selects the most suitable one.
The last part is the error function. The error function I used is a count of the differences between the predicted class and the expected class.
For each of the combinations of dimensions and possible thresholds, the decision stump checks the error function. The one with the lowest error is selected as the dimension to classify against and the \(\theta\) to use as a threshold.
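A minimal sketch of that naïve \(O(dm^2)\) search (illustrative only: the function name, the sign handling, and the tie-breaking are assumptions, not the project's implementation):
import numpy as np

def decision_stump_sketch(X, y):
    # Try every (dimension, threshold) pair, using the data values themselves as thresholds.
    best_error, best_dim, best_theta, best_sign = np.inf, 0, 0.0, 1
    for dim in range(X.shape[1]):
        for theta in X[:, dim]:
            for sign in (1, -1):  # allow either class to sit above the threshold
                pred = np.where(X[:, dim] > theta, sign, -sign)
                error = np.count_nonzero(pred != y)  # count of misclassified samples
                if error < best_error:
                    best_error, best_dim, best_theta, best_sign = error, dim, theta, sign
    return best_dim, best_theta, best_sign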
Shown here is the decision boundary, along with the values plotted for the chosen dimension, with the label on the y-axis. As you can see, these are not separable, but the optimal boundary was still found.
from portfolio.decision_stumps import DecisionStumps
x_sepal_width = x[:, 1]
regression_plot = Plot(x, y)
regression_plot.add_view(DecisionStumps)
regression_plot.embed()
3.4 Support Vector Machine
3.4.1 Soft SVM
SVMs are linear classification models. Unlike the perceptron, SVMs optimize for the widest margin between the classes.
In addition, since this is a soft-margin SVM, it uses a hinge loss function. This allows it to accept data that is not linearly separable and still produce a reasonable classification boundary.
Optimization of this loss function is implemented via stochastic gradient descent (SGD).
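A minimal sketch of that kind of hinge-loss SGD (a generic Pegasos-style update; the function name, the regularization parameter lam, and the step-size schedule are assumptions and may differ from the project's svm.Soft):
import numpy as np

def soft_svm_sgd_sketch(X, y, lam=0.01, epochs=1000, seed=0):
    # Minimize lam/2 * ||w||^2 plus the hinge loss max(0, 1 - y_i (w.x_i + b)) by SGD.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for t in range(1, epochs + 1):
        eta = 1.0 / (lam * t)          # decreasing step size
        i = rng.integers(len(y))       # one randomly chosen sample per step
        if y[i] * (X[i] @ w + b) < 1:  # sample violates the margin: hinge gradient is active
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
            b += eta * y[i]
        else:                          # sample is outside the margin: only the regularizer acts
            w = (1 - eta * lam) * w
    return w, b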
For example, this data from iris is not separable.
from portfolio import svm
y = df.iloc[50:, 4].values
y = np.where(y == 'Iris-versicolor', -1, 1)
X = df.iloc[50:, [0, 3]].values
np.c_[y, X]
label | sepal length | petal width |
---|---|---|
-1 | 7 | 1.4 |
-1 | 6.4 | 1.5 |
-1 | 6.9 | 1.5 |
-1 | 5.5 | 1.3 |
-1 | 6.5 | 1.5 |
-1 | 5.7 | 1.3 |
-1 | 6.3 | 1.6 |
-1 | 4.9 | 1 |
-1 | 6.6 | 1.3 |
-1 | 5.2 | 1.4 |
-1 | 5 | 1 |
-1 | 5.9 | 1.5 |
-1 | 6 | 1 |
-1 | 6.1 | 1.4 |
-1 | 5.6 | 1.3 |
-1 | 6.7 | 1.4 |
-1 | 5.6 | 1.5 |
-1 | 5.8 | 1 |
-1 | 6.2 | 1.5 |
-1 | 5.6 | 1.1 |
-1 | 5.9 | 1.8 |
-1 | 6.1 | 1.3 |
-1 | 6.3 | 1.5 |
-1 | 6.1 | 1.2 |
-1 | 6.4 | 1.3 |
-1 | 6.6 | 1.4 |
-1 | 6.8 | 1.4 |
-1 | 6.7 | 1.7 |
-1 | 6 | 1.5 |
-1 | 5.7 | 1 |
-1 | 5.5 | 1.1 |
-1 | 5.5 | 1 |
-1 | 5.8 | 1.2 |
-1 | 6 | 1.6 |
-1 | 5.4 | 1.5 |
-1 | 6 | 1.6 |
-1 | 6.7 | 1.5 |
-1 | 6.3 | 1.3 |
-1 | 5.6 | 1.3 |
-1 | 5.5 | 1.3 |
-1 | 5.5 | 1.2 |
-1 | 6.1 | 1.4 |
-1 | 5.8 | 1.2 |
-1 | 5 | 1 |
-1 | 5.6 | 1.3 |
-1 | 5.7 | 1.2 |
-1 | 5.7 | 1.3 |
-1 | 6.2 | 1.3 |
-1 | 5.1 | 1.1 |
-1 | 5.7 | 1.3 |
1 | 6.3 | 2.5 |
1 | 5.8 | 1.9 |
1 | 7.1 | 2.1 |
1 | 6.3 | 1.8 |
1 | 6.5 | 2.2 |
1 | 7.6 | 2.1 |
1 | 4.9 | 1.7 |
1 | 7.3 | 1.8 |
1 | 6.7 | 1.8 |
1 | 7.2 | 2.5 |
1 | 6.5 | 2 |
1 | 6.4 | 1.9 |
1 | 6.8 | 2.1 |
1 | 5.7 | 2 |
1 | 5.8 | 2.4 |
1 | 6.4 | 2.3 |
1 | 6.5 | 1.8 |
1 | 7.7 | 2.2 |
1 | 7.7 | 2.3 |
1 | 6 | 1.5 |
1 | 6.9 | 2.3 |
1 | 5.6 | 2 |
1 | 7.7 | 2 |
1 | 6.3 | 1.8 |
1 | 6.7 | 2.1 |
1 | 7.2 | 1.8 |
1 | 6.2 | 1.8 |
1 | 6.1 | 1.8 |
1 | 6.4 | 2.1 |
1 | 7.2 | 1.6 |
1 | 7.4 | 1.9 |
1 | 7.9 | 2 |
1 | 6.4 | 2.2 |
1 | 6.3 | 1.5 |
1 | 6.1 | 1.4 |
1 | 7.7 | 2.3 |
1 | 6.3 | 2.4 |
1 | 6.4 | 1.8 |
1 | 6 | 1.8 |
1 | 6.9 | 2.1 |
1 | 6.7 | 2.4 |
1 | 6.9 | 2.3 |
1 | 5.8 | 1.9 |
1 | 6.8 | 2.3 |
1 | 6.7 | 2.5 |
1 | 6.7 | 2.3 |
1 | 6.3 | 1.9 |
1 | 6.5 | 2 |
1 | 6.2 | 2.3 |
1 | 5.9 | 1.8 |
Running the soft SVM classifier on it, we get a reasonably accurate classification of the data.
plot = Plot(X, y)
plot.add_view(svm.Soft)
plot.embed()
3.5 K-Nearest Neighbor
The k-nearest neighbors classifier simply classifies a sample based on the closest training samples.
At least in this implementation, though also in general to a lesser extent, KNN tends to be very computationally intensive for large datasets, since it must compute the distance to every training point in order to find the nearest ones.
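A minimal sketch of that brute-force approach (Euclidean distance and a majority vote over k neighbors are assumptions here; the project's KNN class may choose k and break ties differently):
import numpy as np

def knn_predict_sketch(X_train, y_train, X_query, k=3):
    # Classify each query point by majority vote among its k nearest training points.
    predictions = []
    for q in np.atleast_2d(X_query):
        distances = np.linalg.norm(X_train - q, axis=1)   # distance to every training point
        nearest = np.argsort(distances)[:k]               # indices of the k closest points
        votes = np.asarray(y_train)[nearest]
        predictions.append(1 if votes.sum() > 0 else -1)  # majority vote for labels in {-1, 1}
    return np.array(predictions)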
from portfolio.knn import KNN
y = df.iloc[50:, 4].values
y = np.where(y == 'Iris-versicolor', -1, 1)
X = df.iloc[50:, [0, 3]].values
np.c_[y, X]
label | sepal length | petal width |
---|---|---|
-1 | 7 | 1.4 |
-1 | 6.4 | 1.5 |
-1 | 6.9 | 1.5 |
-1 | 5.5 | 1.3 |
-1 | 6.5 | 1.5 |
-1 | 5.7 | 1.3 |
-1 | 6.3 | 1.6 |
-1 | 4.9 | 1 |
-1 | 6.6 | 1.3 |
-1 | 5.2 | 1.4 |
-1 | 5 | 1 |
-1 | 5.9 | 1.5 |
-1 | 6 | 1 |
-1 | 6.1 | 1.4 |
-1 | 5.6 | 1.3 |
-1 | 6.7 | 1.4 |
-1 | 5.6 | 1.5 |
-1 | 5.8 | 1 |
-1 | 6.2 | 1.5 |
-1 | 5.6 | 1.1 |
-1 | 5.9 | 1.8 |
-1 | 6.1 | 1.3 |
-1 | 6.3 | 1.5 |
-1 | 6.1 | 1.2 |
-1 | 6.4 | 1.3 |
-1 | 6.6 | 1.4 |
-1 | 6.8 | 1.4 |
-1 | 6.7 | 1.7 |
-1 | 6 | 1.5 |
-1 | 5.7 | 1 |
-1 | 5.5 | 1.1 |
-1 | 5.5 | 1 |
-1 | 5.8 | 1.2 |
-1 | 6 | 1.6 |
-1 | 5.4 | 1.5 |
-1 | 6 | 1.6 |
-1 | 6.7 | 1.5 |
-1 | 6.3 | 1.3 |
-1 | 5.6 | 1.3 |
-1 | 5.5 | 1.3 |
-1 | 5.5 | 1.2 |
-1 | 6.1 | 1.4 |
-1 | 5.8 | 1.2 |
-1 | 5 | 1 |
-1 | 5.6 | 1.3 |
-1 | 5.7 | 1.2 |
-1 | 5.7 | 1.3 |
-1 | 6.2 | 1.3 |
-1 | 5.1 | 1.1 |
-1 | 5.7 | 1.3 |
1 | 6.3 | 2.5 |
1 | 5.8 | 1.9 |
1 | 7.1 | 2.1 |
1 | 6.3 | 1.8 |
1 | 6.5 | 2.2 |
1 | 7.6 | 2.1 |
1 | 4.9 | 1.7 |
1 | 7.3 | 1.8 |
1 | 6.7 | 1.8 |
1 | 7.2 | 2.5 |
1 | 6.5 | 2 |
1 | 6.4 | 1.9 |
1 | 6.8 | 2.1 |
1 | 5.7 | 2 |
1 | 5.8 | 2.4 |
1 | 6.4 | 2.3 |
1 | 6.5 | 1.8 |
1 | 7.7 | 2.2 |
1 | 7.7 | 2.3 |
1 | 6 | 1.5 |
1 | 6.9 | 2.3 |
1 | 5.6 | 2 |
1 | 7.7 | 2 |
1 | 6.3 | 1.8 |
1 | 6.7 | 2.1 |
1 | 7.2 | 1.8 |
1 | 6.2 | 1.8 |
1 | 6.1 | 1.8 |
1 | 6.4 | 2.1 |
1 | 7.2 | 1.6 |
1 | 7.4 | 1.9 |
1 | 7.9 | 2 |
1 | 6.4 | 2.2 |
1 | 6.3 | 1.5 |
1 | 6.1 | 1.4 |
1 | 7.7 | 2.3 |
1 | 6.3 | 2.4 |
1 | 6.4 | 1.8 |
1 | 6 | 1.8 |
1 | 6.9 | 2.1 |
1 | 6.7 | 2.4 |
1 | 6.9 | 2.3 |
1 | 5.8 | 1.9 |
1 | 6.8 | 2.3 |
1 | 6.7 | 2.5 |
1 | 6.7 | 2.3 |
1 | 6.3 | 1.9 |
1 | 6.5 | 2 |
1 | 6.2 | 2.3 |
1 | 5.9 | 1.8 |
Construct the classifier from the training data, then classify a couple of example points:
knn = KNN(X, y)
knn.predict([[0, 0]]).item()
-1 |
knn.predict([[3, 8]]).item()
1 |
Given the same data as the SVM example, we can see that KNN is able to classify this data entirely, as opposed to the lossy linear classifier. This isn't necessarily an argument in its favor, since it doesn't take into account the possibility of overfitting, which KNN is always susceptible to. For certain datasets, including this one, the performance is very good.
from portfolio import visualize
fig = plt.figure()
visualize.plot_decision_regions(fig.add_subplot(), X, y, knn)
Plot.embed(fig)