Machine Learning Programming Workshop

2.2E Logistic Regression in Machine Learning

Prepared By: Cheong Shiu Hong (FTFNCE)



In [1]:
import numpy as np # Linear Algebra
import pandas as pd # Data Frames
import matplotlib.pyplot as plt # Visualization
from mpl_toolkits.mplot3d import axes3d # 3D Visualization
import ipywidgets as widgets # Interactivity
from IPython.display import display # Display Widgets
import time # To Track Time
In [2]:
%matplotlib notebook


In [3]:
def sigmoid(x, grad=False):
    if grad:
        # Derivative of the sigmoid: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
        return sigmoid(x) * (1-sigmoid(x))
    return 1/(1+np.exp(-x)) # Squashes any real number into the range (0, 1)
In [4]:
epsilon = 1e-10 # Small constant to avoid taking log(0) in the cross-entropy loss


5) Logistic Regression with Breast Cancer Dataset


Import Cancer Dataset from Sci-Kit Learn Library

In [5]:
import sklearn.datasets as datasets
In [6]:
cancer = datasets.load_breast_cancer()


Let's Check Out the Dataset

In [7]:
cancer.keys()
Out[7]:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
In [8]:
print("Feature Names:\n", cancer['feature_names'], "\n\nLabel Names:\n", cancer['target_names'])
Feature Names:
 ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension'] 

Label Names:
 ['malignant' 'benign']

Shape of X and Y

In [9]:
cancer['data'].shape, cancer['target'].shape
Out[9]:
((569, 30), (569,))
In [10]:
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
df.head()
Out[10]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 30 columns


Split the Data into Training and Validation Sets

In [11]:
X = cancer['data'][:480]      # First 480 samples for training
Y = cancer['target'][:480]
X_val = cancer['data'][480:]  # Remaining 89 samples held out for validation
Y_val = cancer['target'][480:]
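
The slicing above does not shuffle the data. A shuffled split is also possible; here is a minimal sketch using sklearn's train_test_split, shown only as an alternative (the rest of this notebook keeps the slice-based split above):

from sklearn.model_selection import train_test_split

# Randomly hold out 89 samples for validation instead of slicing off the tail
X_alt, X_val_alt, Y_alt, Y_val_alt = train_test_split(
    cancer['data'], cancer['target'], test_size=89, random_state=0)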


Define the Model

In [12]:
def model(theta, X):
    return sigmoid(np.dot(theta, X)) # Vectorized: theta . x for every sample at once, mapped to probabilities


Define the Training Algorithm

In [13]:
def train(x, y, learning_rate=3e-6, iterations=1, first=False):
    global theta, prev_theta
    prev_theta = theta

    # Prepend a row of ones so theta[0] acts as the intercept
    X = np.vstack([np.ones(y.shape[0]), x])

    for _ in range(iterations):
        # Forward pass: predicted probabilities for every sample
        pred = model(theta, X)

        # Cross-entropy cost (epsilon guards against log(0))
        error = np.mean((y * np.log(pred + epsilon)) + ((1-y) * np.log(1-pred + epsilon)), -1)
        cost = - error

        # Gradient of the summed cross-entropy w.r.t. theta, then one gradient-descent step
        dcost_dtheta = np.dot(X, pred-y)
        theta = theta - (dcost_dtheta * learning_rate)

    # Threshold the probabilities at 0.5 to get class predictions and the training accuracy
    class_pred = np.round(pred)
    acc = np.sum(class_pred == y)/len(y)

    return cost, dcost_dtheta, acc
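
For reference, a sketch of the quantities computed in the loop above, with $m$ training examples, $X$ the $(31, m)$ matrix whose first row is all ones, and $y$ the 0/1 labels:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y_i \log \sigma(\theta^\top x_i) + (1 - y_i)\log\big(1 - \sigma(\theta^\top x_i)\big)\Big]$$

$$\frac{\partial}{\partial \theta}\big(m\,J(\theta)\big) = X\big(\sigma(X^\top \theta) - y\big)$$

Note that dcost_dtheta is the gradient of the summed (not averaged) loss; the missing 1/m factor is simply absorbed into the very small learning rate.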


Initialise Random Parameters 'Theta' to be Trained

Since there are 30 Features, we will need 31 Parameters (30 Feature Weights + 1 Intercept) to be Trained

In [14]:
theta = np.random.randn(X.shape[1]+1)
print(theta.shape)
(31,)


Training the Parameters for 20 Epochs of 20,000 Iterations Each

In [15]:
epochs = 20

total_time = time.time()
start = time.time()

for i in range(1, epochs+1):
    lr = 4e-8 if i <= 15 else 2e-8
    cost, dcost_dtheta, acc = train(X.T, Y, learning_rate=lr, iterations=20000)
    print('Epoch {} - Cost: {:.3f} | Accuracy: {:.2f}%\nTime: {:.2f}s\n'.format(i, cost, acc*100, time.time()-start))
    start = time.time()

print('Total Time Taken: {:.2f}s'.format(time.time()-total_time))
Epoch 1 - Cost: 1.903 | Accuracy: 83.12%
Time: 1.06s

Epoch 2 - Cost: 1.636 | Accuracy: 85.21%
Time: 1.04s

Epoch 3 - Cost: 1.429 | Accuracy: 85.62%
Time: 1.03s

Epoch 4 - Cost: 1.228 | Accuracy: 86.25%
Time: 1.03s

Epoch 5 - Cost: 1.034 | Accuracy: 86.67%
Time: 1.03s

Epoch 6 - Cost: 0.827 | Accuracy: 87.50%
Time: 1.03s

Epoch 7 - Cost: 0.631 | Accuracy: 88.12%
Time: 1.07s

Epoch 8 - Cost: 0.477 | Accuracy: 89.38%
Time: 1.04s

Epoch 9 - Cost: 0.395 | Accuracy: 90.21%
Time: 1.03s

Epoch 10 - Cost: 0.358 | Accuracy: 89.58%
Time: 1.03s

Epoch 11 - Cost: 0.336 | Accuracy: 90.42%
Time: 1.03s

Epoch 12 - Cost: 0.320 | Accuracy: 90.62%
Time: 1.03s

Epoch 13 - Cost: 0.307 | Accuracy: 90.83%
Time: 1.03s

Epoch 14 - Cost: 0.294 | Accuracy: 91.04%
Time: 1.04s

Epoch 15 - Cost: 0.283 | Accuracy: 91.46%
Time: 1.05s

Epoch 16 - Cost: 0.277 | Accuracy: 91.67%
Time: 1.03s

Epoch 17 - Cost: 0.272 | Accuracy: 91.67%
Time: 1.03s

Epoch 18 - Cost: 0.266 | Accuracy: 91.67%
Time: 1.04s

Epoch 19 - Cost: 0.261 | Accuracy: 91.88%
Time: 1.01s

Epoch 20 - Cost: 0.256 | Accuracy: 91.88%
Time: 1.02s

Total Time Taken: 20.72s


Evaluate the Performance of the Model

In [16]:
Xs = np.vstack([np.ones(Y.shape[0]), X.T])
modelpred = model(theta, Xs)
print(- np.mean((Y * np.log(modelpred + epsilon)) + ((1-Y) * np.log(1-modelpred + epsilon)))) # Cross Entropy Loss
print((np.round(modelpred) == Y).sum() / Y.shape[0]) # Accuracy  
0.25612675616137903
0.91875
In [17]:
Xs = np.vstack([np.ones(Y_val.shape[0]), X_val.T])
modelpred = model(theta, Xs)
print(- np.mean((Y_val * np.log(modelpred + epsilon)) + ((1-Y_val) * np.log(1-modelpred + epsilon)))) # Cross Entropy Loss
print((np.round(modelpred) == Y_val).sum() / Y_val.shape[0]) # Accuracy  
0.2952229490737121
0.8876404494382022


6) Logistic Regression with Sci-Kit Learn


Sci-Kit Learn is a Powerful Python Library that has Many Built-In Machine Learning Algorithms

Import Sklearn's Logistic Regression Object from sklearn.linear_model

In [18]:
from sklearn.linear_model import LogisticRegression


Instantiate the Logistic Regression Object

In [19]:
model = LogisticRegression(solver='liblinear') # Note: this rebinds the name 'model' previously used for our own model function


Fit Model to Data

In [20]:
model.fit(X, Y)
Out[20]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)


Evaluate Score of Fitted Model

In [21]:
model.score(X, Y)
Out[21]:
0.9541666666666667
In [22]:
model.score(X_val, Y_val)
Out[22]:
0.9662921348314607


Evaluate Cross Entropy Loss of Fitted Model

In [23]:
skpred = model.predict(X)
- np.mean((Y * np.log(skpred + epsilon)) + ((1-Y) * np.log(1-skpred + epsilon))) # Cross Entropy Loss
Out[23]:
1.055351500860188
In [24]:
skpred = model.predict(X_val)
- np.mean((Y_val * np.log(skpred + epsilon)) + ((1-Y_val) * np.log(1-skpred + epsilon))) # Cross Entropy Loss
Out[24]:
0.77615227844069
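
Note that model.predict returns hard 0/1 class labels, so the losses above are computed on class labels rather than on probabilities. A minimal sketch of the probability-based loss, using predict_proba (column 1 is the probability of class 1, 'benign'):

prob_train = model.predict_proba(X)[:, 1]
prob_val = model.predict_proba(X_val)[:, 1]
print(- np.mean((Y * np.log(prob_train + epsilon)) + ((1-Y) * np.log(1-prob_train + epsilon)))) # Train Cross Entropy Loss
print(- np.mean((Y_val * np.log(prob_val + epsilon)) + ((1-Y_val) * np.log(1-prob_val + epsilon)))) # Validation Cross Entropy Loss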


Bootstrap Aggregating (Bagging)

Train many models, each on a bootstrap sample (a random sample of the training set drawn with replacement), then average their predictions.

In [25]:
num_bags = 250
bag_size = 250
In [26]:
bags = []
for i in range(num_bags):
    idx = np.random.choice(np.arange(X.shape[0]), bag_size) # Sample bag_size indices with replacement (bootstrap sample)
    bags.append([X[idx], Y[idx]])
In [27]:
models = []
for bag in bags:
    models.append(LogisticRegression(solver='liblinear'))
    models[-1].fit(bag[0], bag[1])
In [28]:
skpreds = []
for model in models:
    skpreds.append(model.predict(X))
avg_preds = np.array(skpreds).mean(0) # Average the 0/1 votes of all 250 models into an ensemble probability
print(- np.mean((Y * np.log(avg_preds + epsilon)) + ((1-Y) * np.log(1-avg_preds + epsilon)))) # Cross Entropy Loss
print((np.round(avg_preds) == Y).sum() / Y.shape[0])
0.14976474846046414
0.9520833333333333
In [29]:
skpreds = []
for model in models:
    skpreds.append(model.predict(X_val))
avg_preds = np.array(skpreds).mean(0)
print(- np.mean((Y_val * np.log(avg_preds + epsilon)) + ((1-Y_val) * np.log(1-avg_preds + epsilon)))) # Cross Entropy Loss
print((np.round(avg_preds) == Y_val).sum() / Y_val.shape[0])
0.10659285951853358
0.9662921348314607
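
Sci-Kit Learn also provides a built-in version of this idea. A minimal sketch (not part of the original workshop) using sklearn.ensemble.BaggingClassifier with the same bag settings; the base_estimator argument name assumes the older sklearn release used here (newer releases rename it to estimator):

from sklearn.ensemble import BaggingClassifier

bagged = BaggingClassifier(base_estimator=LogisticRegression(solver='liblinear'),
                           n_estimators=num_bags, max_samples=bag_size)
bagged.fit(X, Y)
print(bagged.score(X, Y), bagged.score(X_val, Y_val)) # Train and validation accuracy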


7) What other algorithms can we use?


Naive Bayes

In [30]:
from sklearn.naive_bayes import GaussianNB
In [31]:
NB = GaussianNB()
In [32]:
NB.fit(X, Y)
Out[32]:
GaussianNB(priors=None, var_smoothing=1e-09)
In [33]:
NB.score(X, Y)
Out[33]:
0.9395833333333333

Decision Trees

In [34]:
from sklearn.tree import DecisionTreeClassifier
In [35]:
dec_tree = DecisionTreeClassifier()
In [36]:
dec_tree.fit(X, Y)
Out[36]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [37]:
dec_tree.score(X, Y)
Out[37]:
1.0

Support Vector Machines

In [38]:
from sklearn.svm import SVC
In [39]:
SVM1 = SVC(kernel='linear')
SVM2 = SVC()
In [40]:
SVM1.fit(X, Y)
SVM2.fit(X, Y)
C:\Users\cheon\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Out[40]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [41]:
SVM1.score(X, Y), SVM2.score(X, Y)
Out[41]:
(0.96875, 1.0)

Ensemble Algorithms

In [42]:
from sklearn.ensemble import RandomForestClassifier
In [43]:
RFC = RandomForestClassifier()
In [44]:
RFC.fit(X, Y)
C:\Users\cheon\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
Out[44]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [45]:
RFC.score(X, Y)
Out[45]:
1.0
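
The scores above are all computed on the training split. As a quick sanity check, a minimal sketch that scores the same fitted models on the held-out X_val/Y_val split defined earlier (not part of the original workshop output):

for name, clf in [('Naive Bayes', NB), ('Decision Tree', dec_tree),
                  ('Linear SVM', SVM1), ('RBF SVM', SVM2), ('Random Forest', RFC)]:
    print('{}: {:.4f}'.format(name, clf.score(X_val, Y_val)))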