
**Logistic Regression**

- Linear regression is used to model continuous-valued functions
- Can it also be used to predict categorical labels?
- Logistic regression models the probability of some event occurring as a function of a set of predictor variables; the log odds of the event is a linear function of the predictors

- Suppose the numerical values of 0 and 1 are assigned to the two outcomes of a binary variable
- Often the 0 represents a negative response and 1 represents a positive response
- The mean of this variable will be the proportion of the positive responses
- If p is the proportion of observations with an outcome of 1, then 1 − p is the proportion with an outcome of 0
- **Odds:** the ratio p / (1 − p), i.e., the probability of a positive response over that of a negative response
- **Odds ratio:** the ratio of the odds of an event in one group to the odds of the event in another group
- **Logit:** the logarithm of the odds, or just **log odds**; mathematically, the logit transformation is written as

`l = logit(p) = ln(p / (1 - p))`
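As a minimal sketch (the proportions below are illustrative values), the odds and logit of a proportion p can be computed directly:

```python
import math

def logit(p):
    """Log odds of a proportion p, with 0 < p < 1."""
    return math.log(p / (1 - p))

# p = 0.5 gives even odds p/(1-p) = 1, so the logit is 0;
# p > 0.5 gives a positive logit, p < 0.5 a negative one.
print(logit(0.5))            # 0.0
print(round(logit(0.8), 4))  # ln(4) ≈ 1.3863
```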

**Support Vector Machine**

- A relatively new classification method for both linear and nonlinear data
- It uses a nonlinear mapping to transform the original training data into a higher dimension
- In the new dimension, it searches for the linear optimal separating **hyperplane** (i.e., "decision boundary")
- With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
- SVM finds this hyperplane using **support vectors** ("essential" training tuples) and **margins** (defined by the support vectors)

**SVM-History and Applications**

- Vapnik and colleagues (1992)—groundwork from Vapnik & Chervonenkis’ statistical learning theory in 1960s
- **Features:** training can be slow, but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)
- **Used for:** classification and numeric prediction
- **Applications:** handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests

**SVM—General Philosophy**

**SVM—Margins and Support Vectors**

**SVM—When Data Is Linearly Separable**

- Let the data D be (**X**_{1}, y_{1}), …, (**X**_{|D|}, y_{|D|}), where **X**_{i} is a training tuple and y_{i} its associated class label
- There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
- *SVM searches for the hyperplane with the largest margin*, i.e., the **maximum marginal hyperplane** (MMH)

**SVM—Linearly Separable**

- A separating hyperplane can be written as **W** ● **X** + b = 0
  - where **W** = {w_{1}, w_{2}, …, w_{n}} is a weight vector and b a scalar (bias)
- For 2-D data it can be written as
  - w_{0} + w_{1}x_{1} + w_{2}x_{2} = 0
- The hyperplanes defining the sides of the margin:
  - H_{1}: w_{0} + w_{1}x_{1} + w_{2}x_{2} ≥ 1 for y_{i} = +1, and
  - H_{2}: w_{0} + w_{1}x_{1} + w_{2}x_{2} ≤ −1 for y_{i} = −1
- Any training tuples that fall on hyperplanes H_{1} or H_{2} (i.e., the sides defining the margin) are **support vectors**
- This becomes a **constrained (convex) quadratic optimization** problem: quadratic objective function and linear constraints → *Quadratic Programming (QP)* → Lagrangian multipliers
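To make the margin constraints concrete, here is a small pure-Python sketch with hypothetical 2-D data and a hand-chosen (not learned) hyperplane: tuples satisfying y_{i}(W ● X_{i} + b) = 1 lie on H_{1} or H_{2} and are the support vectors, and the width of the margin is 2/||W||.

```python
import math

# Hypothetical 2-D data with a hand-chosen separating hyperplane
# W·X + b = 0, here W = (1, 1) and b = -3 (illustrative, not learned).
W = (1.0, 1.0)
b = -3.0
data = [((1.0, 1.0), -1), ((0.0, 2.0), -1), ((2.0, 2.0), +1), ((3.0, 2.0), +1)]

def functional_margin(x, y):
    """y * (W·x + b); this is >= 1 for every tuple if the margin constraints hold."""
    return y * (sum(wi * xi for wi, xi in zip(W, x)) + b)

# Every training tuple must satisfy the constraint; tuples meeting it with
# equality lie on H1 or H2 and are the support vectors.
assert all(functional_margin(x, y) >= 1 for x, y in data)
support_vectors = [x for x, y in data if abs(functional_margin(x, y) - 1.0) < 1e-9]
geometric_margin = 2 / math.sqrt(sum(wi * wi for wi in W))  # width of the margin

print(support_vectors)             # three tuples lie exactly on H1/H2
print(round(geometric_margin, 4))  # 2/sqrt(2) = sqrt(2) ≈ 1.4142
```

A QP solver would search over W and b to maximize this margin; here the hyperplane is fixed so the constraint and support-vector definitions can be checked directly.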

**Why Is SVM Effective on High Dimensional Data?**

The **complexity** of the trained classifier is characterized by the number of support vectors rather than by the dimensionality of the data

**SVM: Different Kernel functions**

Instead of computing the dot product on the transformed data, it is mathematically equivalent to apply a kernel function K(**X**_{i}, **X**_{j}) to the original data, i.e., K(**X**_{i}, **X**_{j}) = Φ(**X**_{i}) ● Φ(**X**_{j})

**Typical Kernel Functions**
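As one illustration of the kernel trick, the degree-2 polynomial kernel K(**X**_{i}, **X**_{j}) = (**X**_{i} ● **X**_{j})² on 2-D data corresponds to an explicit mapping Φ into a 3-D feature space; the sketch below (with hypothetical points) checks the equality numerically:

```python
import math

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x . z)^2 on the original 2-D data."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit mapping into 3-D feature space; the kernel never computes this."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

# Hypothetical points: the kernel applied to the original data equals the
# dot product of the explicitly transformed data (up to float rounding).
x, z = (1.0, 2.0), (3.0, 0.5)
lhs = poly2_kernel(x, z)
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))
print(lhs, abs(lhs - rhs) < 1e-9)
```

This is why SVMs can work in very high-dimensional (even infinite-dimensional, for the Gaussian kernel) feature spaces without ever materializing Φ(**X**).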

SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters)

**Working with Iris Dataset**

Refer to the Jupyter notebook for the implementation (to be discussed with Module 5)

**Accuracy Prediction**

**Accuracy Measures**

**Accuracy:**

- The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier
- Also known as Recognition rate

**Error Rate (or Misclassification Rate)**

- Represented as 1 − Acc(M)
- where Acc(M) is the accuracy of classifier M

**Confusion Matrix**

Given m classes, a confusion matrix is a table of at least size m by m

|            | Predicted (0)       | Predicted (1)       |
|------------|---------------------|---------------------|
| Actual (0) | True Negative (TN)  | False Positive (FP) |
| Actual (1) | False Negative (FN) | True Positive (TP)  |

- N = number of observations
- Accuracy = (TN + TP) / N = (TN + TP) / (TN + FP + FN + TP)

- Sensitivity = TP / (TP + FN)
- Specificity = TN / (TN + FP)
- Overall error rate = (FP + FN) / N

- False Negative Error Rate = FN / (TP+FN)
- False Positive Error Rate = FP / (TN+FP)

- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1 Score = (2 * precision * recall) / (precision + recall)
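A minimal sketch computing these measures from hypothetical confusion-matrix counts:

```python
# Hypothetical counts taken from a 2x2 confusion matrix
TN, FP, FN, TP = 50, 10, 5, 35
N = TN + FP + FN + TP  # number of observations

accuracy = (TN + TP) / N
error_rate = (FP + FN) / N    # = 1 - accuracy
sensitivity = TP / (TP + FN)  # recall / true positive rate
specificity = TN / (TN + FP)  # true negative rate
precision = TP / (TP + FP)
f1 = (2 * precision * sensitivity) / (precision + sensitivity)

print(accuracy, sensitivity)  # 0.85 0.875
print(round(f1, 4))           # 0.8235
```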


**Alternate Accuracy Measures**

| Actual Class | Predicted C_{0} | Predicted C_{1} |
|--------------|-----------------|-----------------|
| C_{0} | N_{0,0} = number of C_{0} cases classified correctly | N_{0,1} = number of C_{0} cases classified incorrectly as C_{1} |
| C_{1} | N_{1,0} = number of C_{1} cases classified incorrectly as C_{0} | N_{1,1} = number of C_{1} cases classified correctly |

If "C_{1}" is the important class:

- **Sensitivity** = % of the "C_{1}" class correctly classified = N_{1,1} / (N_{1,0} + N_{1,1})
- **Specificity** = % of the "C_{0}" class correctly classified = N_{0,0} / (N_{0,0} + N_{0,1})
- **False positive rate** = % of predicted "C_{1}"s that were not actually "C_{1}"s
- **False negative rate** = % of predicted "C_{0}"s that were actually "C_{1}"s


**True Positive Rate is known as Sensitivity or Recall**

**True Negative Rate is known as Specificity**

**ROC Curves**

- ROC – Receiver Operating Characteristic
- Can be used to compare tests / Procedures
- ROC Curves: Simplest case
- Consider diagnostic test for a disease
- Test has 2 possible outcomes:
  - Positive = suggesting presence of the disease
  - Negative = suggesting absence of the disease

- An individual can test either positive or negative for the disease

**ROC Analysis**

- **True Positives** = test states you have the disease when you do have the disease
- **True Negatives** = test states you do not have the disease when you do not have the disease
- **False Positives** = test states you have the disease when you do not have the disease
- **False Negatives** = test states you do not have the disease when you do have the disease

**Receiver Operating Characteristic**

- True Positive Rate (Sensitivity) on the y-axis
  - Proportion of positives correctly labelled as positive
- False Positive Rate (1 − Specificity) on the x-axis
  - Proportion of negatives labelled as positive
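Each point on an ROC curve is the (FPR, TPR) pair obtained at one score threshold. A pure-Python sketch with hypothetical classifier scores:

```python
# Hypothetical classifier scores and true labels (1 = positive, 0 = negative)
scores = [0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,    0,   1,   0,   0]
P = sum(labels)        # number of actual positives
Neg = len(labels) - P  # number of actual negatives

def roc_point(threshold):
    """Classify score >= threshold as positive; return (FPR, TPR)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return fp / Neg, tp / P

# Sweeping the threshold from above the top score down to 0 traces the
# curve from (0, 0) up to (1, 1).
curve = [roc_point(t) for t in (1.01, 0.9, 0.7, 0.5, 0.3, 0.0)]
print(curve)
```

A perfect test reaches (0, 1); a test no better than chance hugs the diagonal from (0, 0) to (1, 1).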

**Predictor Error Measures**

Loss functions measure the error between the actual value y_{i} and the predicted value **ŷ**_{i}
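Two common loss-based error measures, mean absolute error and mean squared error, sketched with hypothetical values:

```python
# Hypothetical actual values y and predictions y_hat
y = [3.0, 5.0, 2.0, 8.0]
y_hat = [2.5, 5.0, 3.0, 7.0]
n = len(y)

# Absolute loss |y_i - y_hat_i| and squared loss (y_i - y_hat_i)^2,
# each averaged over the test set
mae = sum(abs(a - p) for a, p in zip(y, y_hat)) / n
mse = sum((a - p) ** 2 for a, p in zip(y, y_hat)) / n
print(mae, mse)  # 0.625 0.5625
```

Squared loss penalizes large errors more heavily, so MSE is more sensitive to outliers than MAE.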

**Evaluating the Accuracy of a Classifier or Predictor**
