Analytics, Journals, Programming

Journal: Regression

What is Regression?

Regression is a fundamental analytical method that assists analyst how two variables are related. In marketing analytics, regression is widely use to understand how customer behaviours change in response with marketing tactics, such as change of dollar spent by customer against increase of number of promotions.

Linear Regression

Regression linear model, aka ordinary least squares (OLS),  is defined as:


Where y is n by 1, X is an n by k matrix of covariates, β is a k by 1 vector of parameters, and the errors ϵ are uncorrelated with equal variance


The errors do not need to be assumed to be normally distributed.

It follows that


is BLUE (the Best Linear Unbiased Estimator).

Explained variance of y is defined as


And the r-squared measures goodness of fit as


 Multivariate Regression

In multivariate model, we can use more than one predictor variables. However, we should avoid having predictor variables that are highly correlated with each other (i.e. collinearity). Collinearity results in instability, high  variance in estimates, and misleading interpretation.

Python: Linear Regression by using statsmodels*

import statsmodels.api as sm
from statsmodels.formula.api import ols

# statsmodels works nicely with pandas dataframes

# for multivariate OLS: Outcome ~ X1 + X2 + X3
# for other complex formula: 
m = ols('PRICE ~ RM',bos).fit()
print (m.summary())

This outputs information about the regression model, including the intercept, coefficient, p-value, R-squared, etc.


Python: Linear Regression by using sklearn*

from sklearn.linear_model import LinearRegression

X = bos.drop('PRICE', axis = 1)
lm = LinearRegression()

# Use all 13 predictors from the dataset to fit linear regression model, bos.PRICE)

# calculate the squared errors
se = np.sum((bos.PRICE - lm.predict(X)) ** 2)
# calculate the mean squared errors 
mseFull = np.mean((bos.PRICE - lm.predict(X)) ** 2)

An advantage of the sklearn is its built-in function for cross-validation to select test and training data sets. For example, the following randomly select the test and training dataset such that around one third of the data is used for test and the rest for training data set. We then can compare the MSE of the regression model when applied on the testing and training sets.

X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(
    X, bos.PRICE, test_size=0.33, random_state = 5)
lm = LinearRegression(), Y_train)
print("Fit a model X_train, and calculate MSE with Y_train:", 
     np.mean((Y_train - lm.predict(X_train)) ** 2))
print("Fit a model X_train, and calculate MSE with X_test, Y_test:", 
     np.mean((Y_test - lm.predict(X_test)) ** 2))

Odds Ratio

If someone’s probability of experiencing an outcome is p, then that person’s odds of the outcome are p/(1-p).

The odds ratio is the ratio of two different people’s odds of some outcome.


Estimating β: Least Squares

 Least Squares is a method that can estimate the coefficients of a linear model by minimizing the difference:


The solution can be written in compact matrix notion as


Logistic Regression

Logistics regression is a probabilistic model that links observed binary data to a set of features. The logistic regression model assumes that there is a set of weights, coefficients, and parameters, for each feature so that the data was generated with the probability distribution as:


Python: Logistic Regression by using statsmodels*

# fit logistic regression model
import statsmodels.api as sm
from statsmodels.formula.api import logit, glm, ols

## assume a dataset contains Temperature and Failure column
## we try to identify the relationship between temperature and the outcomes (1 if failure, and 0 if success)
dat = pd.DataFrame(data, columns = ['Temperature', 'Failure'])
logit_model = logit('Failure ~ Temperature',dat).fit()

This outputs the information about the logistic regression model


We then can calculate the probabilities of the outcome (failure in this case) for different temperatures.

# calculate predicted failure probabilities for new termperatures
x = np.linspace(50, 85, 1000)
p = logit_model.params
eta = p['Intercept'] + x*p['Temperature']
y = np.exp(eta)/(1 + np.exp(eta))

Polynomial Regression

A polynomial is a function that combines multiple powers of x linearly. The conceptual notions of polynomial regression is similar to linear regression where we have a sample of data points (i.e. sample or training examples D), and we are interested in using this sample to estimate a function g to approximate the function f. The function f can be used for prediction at new data points, or on the entire population (i.e. out-of-sample prediction).

Polynomial regression learns and chooses the function g which minimizes the cost measure of risk function R among all the functions in hypothesis set H, and give a final hypothesis g to estimate f.





Where N is the number of points in D.

Python: Polynomial Regression by using sklearn*

The following codes take a dataset consisting of x (independent variable) and y (dependent variable) and construct a polynomial regression at different orders from 0 to 20th. The codes also perform 1-fold cross-validation in the training dataset, run the regression, and calculate the MSE for each order of the polynomial model.

## Assuming the sample dataset containing x (independent variable) and y (dependent variable)
## total of 30 sample data points

from sklearn.cross_validation import train_test_split

#split dataset using the index
itrain,itest = train_test_split(range(30),train_size=24, test_size=6)
xtrain= df.x[itrain].values
ytrain = df.y[itrain].values
xtest= df.x[itest].values
ytest = df.y[itest].values
from sklearn.preprocessing import PolynomialFeatures

## simulate the x dataset by transforming the original x to some power d (degrees)
## x --> x^2, X^3, x^4, etc. 
def make_features(train_set, test_set, degrees):
    for d in degrees:
        traintestdict['train'] = PolynomialFeatures(d).fit_transform(train_set.reshape(-1,1))
        traintestdict['test'] = PolynomialFeatures(d).fit_transform(test_set.reshape(-1,1))
    return traintestlist
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## assuming we are modelling polynomial regression models from 0 to 20th order

traintestlists=make_features(xtrain, xtest, degrees)

for d in degrees:#for increasing polynomial degrees 0,1,2...
    Xtrain = traintestlists[d]['train']
    Xtest = traintestlists[d]['test']
    #set up model
    est = LinearRegression()
    #fit, ytrain)
    prediction_on_training = est.predict(Xtrain)
    prediction_on_test = est.predict(Xtest)
    #calculate mean squared error for train dataset
    error_train[d] = mean_squared_error(ytrain, prediction_on_training)
    # calculate mean squared error for test dataset
    error_test[d] = mean_squared_error(ytest, prediction_on_test)

Ridge Regression

In a linear regression model, in place of minimizing the sum of squared residuals, ridge regression says to minimize:


This new risk takes the empirical risk (R-squared) and adds a “penalty term” to it to minimize overfitting. The term is proportional to the sum of the squares of the coefficients and is positive, so it will keep their values down.

This technique is called regularization or shrinkage as it takes the coefficient towards smaller sizes and hence, help to minimize overfitting problem.

Import the following libraries and dataset for the code to run

%matplotlib inline 

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
import statsmodels.api as sm

import seaborn as sns

# special matplotlib argument for improved plots
from matplotlib import rcParams
from sklearn.datasets import load_boston
boston = load_boston()
bos = pd.DataFrame(