
Logistic Regression

  • Linear regression is used to model continuous-valued functions
  • Can it also be used to predict categorical labels?
  • Logistic regression models the probability of some event occurring as a linear function of a set of predictor variables
  • Suppose the numerical values 0 and 1 are assigned to the two outcomes of a binary variable
  • Often 0 represents a negative response and 1 a positive response
  • The mean of this variable is then the proportion of positive responses
  • If p is the proportion of observations with an outcome of 1, then 1 − p is the probability of an outcome of 0
  • Odds: the ratio p/(1 − p), i.e., the probability of the positive outcome (“winning”) over the probability of the negative one (“losing”)
  • Logit: the logarithm of the odds, or simply the log odds. Mathematically, the logit transformation is written as

\[l = \text{logit}(p) = \ln\left(\frac{p}{1 - p}\right)\]
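As a concrete illustration, here is a minimal scikit-learn sketch (the hours-studied data and all names are invented for the example) that fits a logistic model and recovers the probability, odds, and logit for a new observation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented binary data: hours studied (predictor) vs. pass/fail outcome
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = negative, 1 = positive response

model = LogisticRegression().fit(X, y)

p = model.predict_proba([[2.2]])[0, 1]  # estimated probability of outcome 1
odds = p / (1 - p)                      # odds: p / (1 - p)
logit = np.log(odds)                    # log odds: ln(p / (1 - p))
print(f"p = {p:.3f}, odds = {odds:.3f}, logit = {logit:.3f}")
```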

Support Vector Machine

  • A relatively new classification method for both linear and nonlinear data
  • It uses a nonlinear mapping to transform the original training data into a higher dimension (a concrete sketch of this idea follows this list)
  • With the new dimension, it searches for the linear optimal separating hyperplane (i.e., “decision boundary”)
  • With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
  • SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors)
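To make the idea of the nonlinear mapping concrete, the sketch below (with made-up one-dimensional data) adds a squared feature; in the resulting two-dimensional space a linear SVM separates the classes perfectly:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 1-D data: class 1 sits between two groups of class 0,
# so no single threshold (a hyperplane in 1-D) can separate them
X1 = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0]).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 1, 0, 0])

# Map each point x to (x, x^2): in this 2-D space the classes
# become separable by a horizontal line (a threshold on x^2)
X2 = np.hstack([X1, X1 ** 2])

clf = SVC(kernel="linear").fit(X2, y)
print(clf.score(X2, y))  # 1.0 -- perfectly separated after the mapping
```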

SVM-History and Applications

  • Vapnik and colleagues (1992)—groundwork from Vapnik & Chervonenkis’ statistical learning theory in the 1960s
  • Features: training can be slow but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)
  • Used for: classification and numeric prediction
  • Applications: handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests

SVM—General Philosophy

SVM—Margins and Support Vectors

SVM—When Data Is Linearly Separable

  • Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple and yi is its associated class label
  • There are infinitely many lines (hyperplanes) that can separate the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
  • SVM searches for the hyperplane with the largest margin, i.e., maximum marginal hyperplane (MMH)

SVM—Linearly Separable

  • A separating hyperplane can be written as
    • W · X + b = 0
    • where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)
  • For 2-D data it can be written as
    • w0 + w1x1 + w2x2 = 0
  • The hyperplanes defining the sides of the margin:
    • H1: w0 + w1x1 + w2x2 ≥ +1 for yi = +1, and
    • H2: w0 + w1x1 + w2x2 ≤ −1 for yi = −1
  • Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
  • This becomes a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints → Quadratic Programming (QP) → Lagrangian multipliers (a code sketch follows)
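The QP itself is handled internally by standard libraries. A minimal sketch with scikit-learn’s SVC on invented, linearly separable points shows the learned hyperplane, the support vectors, and the margin width 2/‖W‖ (a large C approximates the hard-margin case):

```python
import numpy as np
from sklearn.svm import SVC

# Two invented, linearly separable clusters
X = np.array([[1, 1], [2, 1], [1, 2],    # class -1
              [4, 4], [5, 4], [4, 5]])   # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

# SVC solves the underlying quadratic program (via Lagrange multipliers)
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print(f"Hyperplane: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
print("Support vectors:\n", clf.support_vectors_)
print(f"Margin width: {2 / np.linalg.norm(w):.3f}")
```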

Why Is SVM Effective on High Dimensional Data?

The complexity of the trained classifier is characterized by the number of support vectors rather than by the dimensionality of the data

SVM—Different Kernel Functions

Instead of computing the dot product on the transformed data, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj)

Typical Kernel Functions
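Three kernels commonly listed in the literature are the polynomial kernel of degree h, the Gaussian radial basis function (RBF) kernel, and the sigmoid kernel:

\[\text{Polynomial of degree } h:\ K(X_i, X_j) = (X_i \cdot X_j + 1)^h\] \[\text{Gaussian RBF}:\ K(X_i, X_j) = e^{-\|X_i - X_j\|^2 / 2\sigma^2}\] \[\text{Sigmoid}:\ K(X_i, X_j) = \tanh(\kappa \, X_i \cdot X_j - \delta)\]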

SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters)

Working with Iris Dataset

Refer to the Jupyter notebook for the implementation (to be discussed with Module 5)
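The notebook is the reference implementation; as a rough preview, a minimal RBF-kernel SVM on the three-class Iris data might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the 3-class Iris dataset and hold out a stratified test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# RBF kernel; scikit-learn handles the three classes internally
clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```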

Accuracy Prediction

Accuracy Measures

Accuracy:

  • The accuracy of a classifier on a given test set is the percentage of test-set tuples that are correctly classified by the classifier
  • Also known as the recognition rate

Error Rate (or Misclassification Rate)

  • Represented as 1 − Acc(M)
  • where Acc(M) is the accuracy of classifier M

Confusion Matrix

Given m classes, a confusion matrix is a table of at least size m by m

              Predicted (0)         Predicted (1)
Actual (0)    True Negative (TN)    False Positive (FP)
Actual (1)    False Negative (FN)   True Positive (TP)
  • N = number of observations
  • Accuracy = (TN + TP) / N = (TN + TP) / (TN + FP + FN + TP)
  • Sensitivity = TP / (TP + FN)
  • Specificity = TN / (TN + FP)
  • Overall error rate = (FP + FN) / N
  • False Negative error rate = FN / (TP + FN)
  • False Positive error rate = FP / (TN + FP)
  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1 score = (2 × precision × recall) / (precision + recall)
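A short sketch (with invented labels) showing how these measures fall out of a confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented actual vs. predicted labels
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
n = tn + fp + fn + tp

precision = tp / (tp + fp)
recall = tp / (tp + fn)                # same as sensitivity
print("Accuracy   :", (tn + tp) / n)
print("Sensitivity:", recall)          # true positive rate
print("Specificity:", tn / (tn + fp))  # true negative rate
print("Precision  :", precision)
print("F1 score   :", 2 * precision * recall / (precision + recall))
```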

Alternate Accuracy Measures

                Predicted C0                               Predicted C1
Actual C0       N0,0 = number of C0 cases classified       N0,1 = number of C0 cases classified
                correctly                                  incorrectly as C1
Actual C1       N1,0 = number of C1 cases classified       N1,1 = number of C1 cases classified
                incorrectly as C0                          correctly

If “C1” is the important class

  • Sensitivity = % of “C1” class correctly classified
    • Sensitivity = N1,1 / (N1,0 + N1,1)
  • Specificity = % of “C0” class correctly classified
    • Specificity = N0,0 / (N0,0 + N0,1)
  • False Positive rate = % of predicted “C1”s that were not actually “C1”s
  • False Negative rate = % of predicted “C0”s that were actually “C1”s


True Positive Rate is known as Sensitivity or Recall

True Negative Rate is known as Specificity

ROC Curves

  • ROC – Receiver Operating Characteristic
  • Can be used to compare tests / procedures
  • ROC Curves: Simplest case
    • Consider diagnostic test for a disease
    • Test has 2 possible outcomes:
      • Positive = suggesting presence of disease
      • Negative = suggesting absence of disease
    • An individual can test either positive or negative for the disease

ROC Analysis

  • True Positives = Test states you have the disease when you do have the disease
  • True Negatives = Test states you do not have the disease when you do not have the disease
  • False Positives = Test states you have the disease when you do not have the disease
  • False Negatives = Test states you do not have the disease when you actually do have the disease

Receiver Operating Characteristic

  • True Positive Rate (Sensitivity) on the y-axis
    • Proportion of positives correctly labelled as positive
  • False Positive Rate (1 − Specificity) on the x-axis
    • Proportion of negatives incorrectly labelled as positive
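A minimal sketch (on synthetic data standing in for the diagnostic-test example) of computing and plotting an ROC curve with scikit-learn:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Score each test case with the predicted probability of class 1
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, scores)  # x-axis: FPR, y-axis: TPR
print("AUC:", roc_auc_score(y_te, scores))

plt.plot(fpr, tpr, label="ROC curve")
plt.plot([0, 1], [0, 1], "--", label="chance")
plt.xlabel("False Positive Rate (1 - specificity)")
plt.ylabel("True Positive Rate (sensitivity)")
plt.legend()
plt.show()
```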

Predictor Error Measures

Loss functions measure the error between the actual value yi and the predicted value ŷi:

\[\text{Absolute error: } |y_i - \hat{y}_i|\] \[\text{Squared error: } (y_i - \hat{y}_i)^2\] \[\text{Mean absolute error: } \frac{\sum_{i=1}^{d} |y_i - \hat{y}_i|}{d}\] \[\text{Mean squared error: } \frac{\sum_{i=1}^{d} (y_i - \hat{y}_i)^2}{d}\] \[\text{Relative absolute error: } \frac{\sum_{i=1}^{d} |y_i - \hat{y}_i|}{\sum_{i=1}^{d} |y_i - \bar{y}|}\] \[\text{Relative squared error: } \frac{\sum_{i=1}^{d} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}\]
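These measures are straightforward to compute directly; a small sketch with invented predictions:

```python
import numpy as np

def error_measures(y, y_hat):
    """Compute the predictor error measures defined above."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    abs_err = np.abs(y - y_hat)
    sq_err = (y - y_hat) ** 2
    return {
        "mean absolute error": abs_err.mean(),
        "mean squared error": sq_err.mean(),
        "relative absolute error": abs_err.sum() / np.abs(y - y.mean()).sum(),
        "relative squared error": sq_err.sum() / ((y - y.mean()) ** 2).sum(),
    }

print(error_measures([3.0, 5.0, 7.5], [2.5, 5.5, 7.0]))
```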

Evaluating the Accuracy of a Classifier or Predictor
