**What is Machine Learning?**

Herbert Simon:

“Learning is any process by which a system improves performance from experience.”

Fundamentally, in machine learning:

- Input: a training set of N data points
- Learning: use the training set to learn what each class looks like, i.e. to characterize and differentiate the classes
- Evaluation: predict labels for a test set (disjoint from the training set) and compare the true labels (ground truth) against those predicted by the classifier

Machine learning is useful for the following analytical tasks:

- Making predictions for new data points
  - For data with labels
  - Usually through supervised machine learning, e.g. kNN, SVM, decision trees, random forests, bagging, boosting, etc.
- Finding patterns in the data
  - For data without labels
  - Usually through unsupervised machine learning, e.g. PCA, MDS, clustering

Machine learning is a powerful technique for solving complex problems with little or no human intervention. However, the learning needs to be practical, cost-effective, and in line with the business objectives. For example, the winner of the Netflix machine learning competition derived a very complex algorithm for the company's user recommendations, but it was never put into commercial use because the engineering costs were too high.

### Machine Learning Components

To apply machine learning thoroughly, analysts need to consider the following components:

#### 1. Representation

A machine learning classifier must be represented in a form that a computer can handle. The set of classifiers that the computer can possibly learn is called the hypothesis space of the learner.

Sample machine learning representations include SVM, K-nearest neighbor, Bayesian networks, etc.

A related question is which features of data to use and how to represent them.

**Feature Selection**

Using more features does not necessarily improve model performance. In high dimensions, even the nearest neighbor of a point tends to be far away, which makes it hard to interpolate between data points efficiently. The statistical term the "curse of dimensionality" refers to the difficulty of fitting a model when many possible predictors are available: a statistically sound result would require the sample size N to grow exponentially with the number of dimensions d.
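
A quick NumPy simulation (not from the original text; the sample size of 500 uniform points is an illustrative choice) shows the distance concentration behind the curse of dimensionality: as d grows, the nearest and farthest neighbors of a query point become almost equally far away.

```python
import numpy as np

rng = np.random.default_rng(0)

# For points drawn uniformly in the unit hypercube, compare the nearest-neighbor
# distance to the farthest-neighbor distance from a random query point as d grows
ratios = {}
for d in [2, 10, 100]:
    points = rng.random((500, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    ratios[d] = dists.min() / dists.max()
    print(f"d={d:4d}  nearest/farthest distance ratio: {ratios[d]:.3f}")
```

A ratio near 0 means the nearest neighbor is meaningfully closer than the rest; a ratio near 1 means "nearest" carries little information.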

#### 2. Evaluation

An evaluation function (or objective function or scoring function) is needed to distinguish good classifiers from bad ones. There are different evaluation functions, each of which is suitable for certain scenarios, depending on the objectives we want to achieve in machine learning.

Sample evaluation functions are precision and recall, squared error, K-L divergence, etc.
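
As a small worked example (the labels and values below are made up for illustration), precision, recall, and squared error can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical classifier output on 8 examples
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
print(precision, recall)    # 0.75 0.75

# Mean squared error for a regression-style evaluation
y = np.array([1.0, 2.0, 3.0])
yhat = np.array([1.1, 1.9, 3.2])
mse = np.mean((y - yhat) ** 2)
print(round(mse, 3))        # 0.02
```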

*Test Validation*

Basically, test validation is performed using the training and test data sets, and there are different ways to validate the analysis results:

- Holdout validation
  - The simplest kind of cross-validation
  - Train the classifier on the training data set and measure performance on the test data set
  - Advantage: short computation time
  - Disadvantage: high variance; depends heavily on how the data is divided into training and test sets
- K-fold cross-validation
  - Divide the data set into k subsets and perform k holdout validations
  - Each time, one of the k subsets is used as the test set and the other (k − 1) subsets are collectively used as the training set
  - The average error across all k trials is computed
  - Advantage: it matters less how the data is divided; the variance of the estimate is reduced as k increases
  - Disadvantage: takes longer to run the k validations

It is noted that **cross-validation error is an unbiased estimate of the out-of-sample error**.

For example, the figure below illustrates the 4-fold cross-validation process:

### Python: k-fold Cross-validation for Polynomial Regression Model using `sklearn`


```python
## Assuming a sample dataset containing x (independent variable) and y (dependent variable)
## with a total of 30 sample data points
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame(dict(x=x, y=y))
datasize = df.shape[0]
# split the dataset using the index
itrain, itest = train_test_split(range(30), train_size=24, test_size=6)
xtrain = df.x[itrain].values
ytrain = df.y[itrain].values
xtest = df.x[itest].values
ytest = df.y[itest].values

## expand the original x into polynomial features of degree d
## x --> x^2, x^3, x^4, etc.
def make_features(train_set, test_set, degrees):
    traintestlist = []
    for d in degrees:
        traintestdict = {}
        traintestdict['train'] = PolynomialFeatures(d).fit_transform(train_set.reshape(-1, 1))
        traintestdict['test'] = PolynomialFeatures(d).fit_transform(test_set.reshape(-1, 1))
        traintestlist.append(traintestdict)
    return traintestlist
```

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

n_folds = 4
degrees = range(21)
results = []
for d in degrees:
    hypothesisresults = []
    for train, test in KFold(n_splits=n_folds).split(xtrain):  # split data into train/test groups, 4 times
        tvlist = make_features(xtrain[train], xtrain[test], degrees)
        clf = LinearRegression()
        clf.fit(tvlist[d]['train'], ytrain[train])  # fit
        hypothesisresults.append(mean_squared_error(ytrain[test], clf.predict(tvlist[d]['test'])))  # evaluate score function on held-out data
    results.append((np.mean(hypothesisresults), np.min(hypothesisresults), np.max(hypothesisresults)))  # average
mindeg = np.argmin([r[0] for r in results])
ttlist = make_features(xtrain, xtest, degrees)
# fit on the whole training set now
clf = LinearRegression()
clf.fit(ttlist[mindeg]['train'], ytrain)  # fit
pred = clf.predict(ttlist[mindeg]['test'])
err = mean_squared_error(ytest, pred)
errtr = mean_squared_error(ytrain, clf.predict(ttlist[mindeg]['train']))
errout = 0.8 * errtr + 0.2 * err
```

#### 3. Optimization

Last but not least, optimization is required to search the hypothesis space efficiently and to determine whether the evaluation function has more than one optimum.

Many optimization methods have been studied such as greedy search, gradient descent, branch-and-bound, etc.
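
As a minimal sketch of one of these methods, gradient descent repeatedly steps against the gradient of the objective; the toy objective f(w) = (w − 3)² and learning rate below are illustrative choices, not from the original text.

```python
# Minimize f(w) = (w - 3)^2 with plain gradient descent
# gradient: f'(w) = 2 * (w - 3)
w = 0.0    # starting point
lr = 0.1   # learning rate (step size)
for _ in range(100):
    w -= lr * 2 * (w - 3)

print(round(w, 4))  # converges toward the optimum w = 3
```

The same update rule, applied to a model's loss over the training data, is how many learners fit their parameters.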

## Machine Learning Checklists

Analysts should consider the following noteworthy points when doing machine learning:

**Data alone is not enough**

- Every machine learner needs some knowledge or assumptions beyond the data to be able to generalize correctly
- Fortunately, generic assumptions such as smoothness, similar examples having similar classes, limited dependencies, etc. are often enough to do well

**Over-fitting is common**

- This happens when the learner achieves very high accuracy on the training data but performs poorly on the test data
- The root cause is usually the properties of the selected sample (bias and variance). Data noise aggravates over-fitting, but over-fitting can happen whether or not there is noise
- Cross-validation and regularization can help to combat this issue
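
As a sketch of the regularization remedy (the synthetic sine data and degree-15 polynomial below are illustrative assumptions), ridge regression penalizes large coefficients, taming the wild oscillations an unregularized high-degree fit produces:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

# High-degree polynomial features invite over-fitting on only 20 points
X = PolynomialFeatures(degree=15).fit_transform(x.reshape(-1, 1))

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha controls the penalty strength

# Ridge shrinks the coefficients dramatically compared to the plain fit
print(f"max |coef| OLS:   {np.abs(plain.coef_).max():.1f}")
print(f"max |coef| ridge: {np.abs(ridge.coef_).max():.1f}")
```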

**Escape the curse of dimensionality**

- Machine learners may become ineffective on high-dimensional data, since a fixed-size training set covers a dwindling fraction of the enormous input space
- The saving grace, named the 'blessing of non-uniformity', is that in most applications samples are not spread uniformly throughout the instance space but are concentrated on a lower-dimensional manifold

**Feature engineering is important**

- A key step in machine learning is selecting the right data features to study (possibly the most time-consuming step), as this differentiates a good learner from a bad one
- Feature engineering involves gathering data, integrating it, cleaning it, pre-processing it, and performing trial and error to design the best features
- This can be very tricky, as features are domain-specific, and using too many features can be very time-consuming and, again, cause over-fitting

**More data beats a cleverer algorithm**

- To improve the performance of a learner, analysts usually try to design a better learning algorithm rather than gather more data to learn from
- However, a rule of thumb is that a simple algorithm with more data beats a clever one with less data
- This, however, raises the question of the scalability of machine learning and is subject to the curse of dimensionality

**The more models, the better**

- Machine learning is not just about selecting a single best learner, but about combining different learners to further improve the results
- For example, in the Netflix machine learning competition, the winner stacked ensembles of over 100 learners and combined them to achieve the best results
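
A minimal sketch of combining learners, using scikit-learn's `VotingClassifier` on the bundled iris dataset (both choices are illustrative assumptions, not the Netflix winner's method):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Combine three different learners by majority vote
ensemble = VotingClassifier(estimators=[
    ('lr', LogisticRegression(max_iter=1000)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('tree', DecisionTreeClassifier(random_state=0)),
])

score = cross_val_score(ensemble, X, y, cv=5).mean()
print(round(score, 3))
```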

**Correlation does not imply causation**

- A common misconception in machine learning is to treat correlation as causation; this is valid only in very restricted settings
- Correlation, at its core, is a sign of a potential causal connection, which we can use as a pointer for further investigation

## Machine Learning Algorithms

**Nearest Neighbor Classification (kNN)**

A new data point is assigned to the class that wins the majority vote among its k nearest neighbors. Some properties of kNN:

- Simple and good for low-dimensional data
- Increasing k minimizes the number of 'island' data points
- If k is too large, the decision boundary becomes too smooth
- Larger k gives lower variance but increases the bias of the predictions
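
A minimal sketch of kNN classification, using scikit-learn's `KNeighborsClassifier` on the bundled iris dataset (the dataset, k=5, and the 70/30 split are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# Each test point takes the majority label of its k nearest training neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xtr, ytr)
acc = knn.score(Xte, yte)
print(round(acc, 3))
```

Varying `n_neighbors` here is a direct way to see the bias/variance trade-off noted above.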

**Support Vector Machine (SVM)**

SVM is a popular machine learning algorithm that is widely used for many types of classification problems. SVM is basically a discriminative classifier defined by a separating hyperplane. Given a labelled training data set, the SVM algorithm outputs an optimal hyperplane that categorizes new examples.

The SVM algorithm can be explained with an example of 2-D points and a separating straight line. The problem then becomes finding the best straight line to separate the points into two classes.

*Source: http://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html*

A good separating line is one that passes as far as possible from all points; if the line is too close to the points, it is noise-sensitive and will not generalize correctly. Likewise, the SVM algorithm finds the hyperplane that gives the largest minimum distance to the training examples (i.e. it maximizes the margin of the training data); the margin equals twice the minimum distance to the training examples. SVM can be extended to patterns that are not linearly separable by transforming the original data into a new space (via a kernel function).

Support vectors are the data points that lie closest to the SVM hyperplane. They have a direct impact on the optimal location of the hyperplane.
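
A minimal sketch of the 2-D example above, using scikit-learn's `SVC` with a linear kernel on two synthetic point clouds (the cluster means and sizes are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable 2-D point clouds centered at (-2, -2) and (2, 2)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel='linear')
clf.fit(X, y)

# The support vectors are the training points closest to the separating line
preds = clf.predict([[-2, -2], [2, 2]])
print(len(clf.support_vectors_), preds)
```

Swapping `kernel='linear'` for `'rbf'` applies the kernel-function extension mentioned above to non-linearly-separable data.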

**Import the following packages for the code samples to run:**

```python
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
```