Machine Learning Programming Workshop

2.2E Logistic Regression in Machine Learning

Prepared By: Cheong Shiu Hong (FTFNCE)



In [1]:
import numpy as np # Linear Algebra
import pandas as pd # Data Frames
import matplotlib.pyplot as plt # Visualization
from mpl_toolkits.mplot3d import axes3d # 3D Visualization
import ipywidgets as widgets # Interactivity
from IPython.display import display # Display Widgets
import time # To Track Time
In [2]:
%matplotlib notebook


In [3]:
def sigmoid(x, grad=False):
    if grad:
        # Derivative of the sigmoid: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
        return sigmoid(x) * (1-sigmoid(x))
    return 1/(1+np.exp(-x)) # Squashes any real number into the range (0, 1)
In [4]:
epsilon = 1e-10 # Small constant to avoid taking log(0) in the cross-entropy loss


5) Logistic Regression with Breast Cancer Dataset


Import Cancer Dataset from Sci-Kit Learn Library

In [5]:
import sklearn.datasets as datasets
In [6]:
cancer = datasets.load_breast_cancer()


Let's Check Out the Dataset

In [7]:
cancer.keys()
Out[7]:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
In [8]:
print("Feature Names:\n", cancer['feature_names'], "\n\nLabel Names:\n", cancer['target_names'])
Feature Names:
 ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension'] 

Label Names:
 ['malignant' 'benign']

Shape of X and Y

In [9]:
cancer['data'].shape, cancer['target'].shape
Out[9]:
((569, 30), (569,))
In [10]:
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
df.head()
Out[10]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 30 columns


Split the Data into Training and Validation Sets

In [11]:
X = cancer['data'][:480]      # First 480 samples for training
Y = cancer['target'][:480]
X_val = cancer['data'][480:]  # Remaining 89 samples held out for validation
Y_val = cancer['target'][480:]
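
The slicing above does not shuffle the data. A shuffled split is also possible; here is a minimal sketch using sklearn's train_test_split, shown only as an alternative (the rest of this notebook keeps the slice-based split above):

from sklearn.model_selection import train_test_split

# Randomly hold out 89 samples for validation instead of slicing off the tail
X_alt, X_val_alt, Y_alt, Y_val_alt = train_test_split(
    cancer['data'], cancer['target'], test_size=89, random_state=0)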


Define the Model

In [12]:
def model(theta, X):
    return sigmoid(np.dot(theta, X)) # Vectorized: theta . x for every sample at once, mapped to probabilities


Define the Training Algorithm

In [13]:
def train(x, y, learning_rate=3e-6, iterations=1, first=False):
    global theta, prev_theta
    prev_theta = theta

    # Prepend a row of ones so theta[0] acts as the intercept
    X = np.vstack([np.ones(y.shape[0]), x])

    for _ in range(iterations):
        # Forward pass: predicted probabilities for every sample
        pred = model(theta, X)

        # Cross-entropy cost (epsilon guards against log(0))
        error = np.mean((y * np.log(pred + epsilon)) + ((1-y) * np.log(1-pred + epsilon)), -1)
        cost = - error

        # Gradient of the summed cross-entropy w.r.t. theta, then one gradient-descent step
        dcost_dtheta = np.dot(X, pred-y)
        theta = theta - (dcost_dtheta * learning_rate)

    # Threshold the probabilities at 0.5 to get class predictions and the training accuracy
    class_pred = np.round(pred)
    acc = np.sum(class_pred == y)/len(y)

    return cost, dcost_dtheta, acc
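
For reference, a sketch of the quantities computed in the loop above, with $m$ training examples, $X$ the $(31, m)$ matrix whose first row is all ones, and $y$ the 0/1 labels:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y_i \log \sigma(\theta^\top x_i) + (1 - y_i)\log\big(1 - \sigma(\theta^\top x_i)\big)\Big]$$

$$\frac{\partial}{\partial \theta}\big(m\,J(\theta)\big) = X\big(\sigma(X^\top \theta) - y\big)$$

Note that dcost_dtheta is the gradient of the summed (not averaged) loss; the missing 1/m factor is simply absorbed into the very small learning rate.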


Initialise Random Parameters 'Theta' to be Trained

Since there are 30 Features, we will need 31 Parameters (30 Feature Weights + 1 Intercept) to be Trained

In [14]:
theta = np.random.randn(X.shape[1]+1)
print(theta.shape)
(31,)


Training the Parameters for 20 Epochs of 20,000 Iterations Each

In [15]:
epochs = 20

total_time = time.time()
start = time.time()

for i in range(1, epochs+1):
    lr = 4e-8 if i <= 15 else 2e-8
    cost, dcost_dtheta, acc = train(X.T, Y, learning_rate=lr, iterations=20000)
    print('Epoch {} - Cost: {:.3f} | Accuracy: {:.2f}%\nTime: {:.2f}s\n'.format(i, cost, acc*100, time.time()-start))
    start = time.time()

print('Total Time Taken: {:.2f}s'.format(time.time()-total_time))
Epoch 1 - Cost: 1.903 | Accuracy: 83.12%
Time: 1.06s

Epoch 2 - Cost: 1.636 | Accuracy: 85.21%
Time: 1.04s

Epoch 3 - Cost: 1.429 | Accuracy: 85.62%
Time: 1.03s

Epoch 4 - Cost: 1.228 | Accuracy: 86.25%
Time: 1.03s

Epoch 5 - Cost: 1.034 | Accuracy: 86.67%
Time: 1.03s

Epoch 6 - Cost: 0.827 | Accuracy: 87.50%
Time: 1.03s

Epoch 7 - Cost: 0.631 | Accuracy: 88.12%
Time: 1.07s

Epoch 8 - Cost: 0.477 | Accuracy: 89.38%
Time: 1.04s

Epoch 9 - Cost: 0.395 | Accuracy: 90.21%
Time: 1.03s

Epoch 10 - Cost: 0.358 | Accuracy: 89.58%
Time: 1.03s

Epoch 11 - Cost: 0.336 | Accuracy: 90.42%
Time: 1.03s

Epoch 12 - Cost: 0.320 | Accuracy: 90.62%
Time: 1.03s

Epoch 13 - Cost: 0.307 | Accuracy: 90.83%
Time: 1.03s

Epoch 14 - Cost: 0.294 | Accuracy: 91.04%
Time: 1.04s

Epoch 15 - Cost: 0.283 | Accuracy: 91.46%
Time: 1.05s

Epoch 16 - Cost: 0.277 | Accuracy: 91.67%
Time: 1.03s

Epoch 17 - Cost: 0.272 | Accuracy: 91.67%
Time: 1.03s

Epoch 18 - Cost: 0.266 | Accuracy: 91.67%
Time: 1.04s

Epoch 19 - Cost: 0.261 | Accuracy: 91.88%
Time: 1.01s

Epoch 20 - Cost: 0.256 | Accuracy: 91.88%
Time: 1.02s

Total Time Taken: 20.72s


Evaluate the Performance of the Model

In [16]:
Xs = np.vstack([np.ones(Y.shape[0]), X.T])
modelpred = model(theta, Xs)
print(- np.mean((Y * np.log(modelpred + epsilon)) + ((1-Y) * np.log(1-modelpred + epsilon)))) # Cross Entropy Loss
print((np.round(modelpred) == Y).sum() / Y.shape[0]) # Accuracy  
0.25612675616137903
0.91875
In [17]:
Xs = np.vstack([np.ones(Y_val.shape[0]), X_val.T])
modelpred = model(theta, Xs)
print(- np.mean((Y_val * np.log(modelpred + epsilon)) + ((1-Y_val) * np.log(1-modelpred + epsilon)))) # Cross Entropy Loss
print((np.round(modelpred) == Y_val).sum() / Y_val.shape[0]) # Accuracy  
0.2952229490737121
0.8876404494382022


6) Logistic Regression with Sci-Kit Learn


Sci-Kit Learn is a Powerful Python Library that has Many Built-In Machine Learning Algorithms

Import Sklearn's Logistic Regression Object from sklearn.linear_model

In [18]:
from sklearn.linear_model import LogisticRegression


Instantiate the Logistic Regression Object

In [19]:
model = LogisticRegression(solver='liblinear') # Note: this rebinds the name 'model' previously used for our own model function


Fit Model to Data

In [20]:
model.fit(X, Y)
Out[20]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)


Evaluate Score of Fitted Model

In [21]:
model.score(X, Y)
Out[21]:
0.9541666666666667
In [22]:
model.score(X_val, Y_val)
Out[22]:
0.9662921348314607


Evaluate Cross Entropy Loss of Fitted Model

In [23]:
skpred = model.predict(X)
- np.mean((Y * np.log(skpred + epsilon)) + ((1-Y) * np.log(1-skpred + epsilon))) # Cross Entropy Loss
Out[23]:
1.055351500860188
In [24]:
skpred = model.predict(X_val)
- np.mean((Y_val * np.log(skpred + epsilon)) + ((1-Y_val) * np.log(1-skpred + epsilon))) # Cross Entropy Loss
Out[24]:
0.77615227844069
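
Note that model.predict returns hard 0/1 class labels, so the losses above are computed on class labels rather than on probabilities. A minimal sketch of the probability-based loss, using predict_proba (column 1 is the probability of class 1, 'benign'):

prob_train = model.predict_proba(X)[:, 1]
prob_val = model.predict_proba(X_val)[:, 1]
print(- np.mean((Y * np.log(prob_train + epsilon)) + ((1-Y) * np.log(1-prob_train + epsilon)))) # Train Cross Entropy Loss
print(- np.mean((Y_val * np.log(prob_val + epsilon)) + ((1-Y_val) * np.log(1-prob_val + epsilon)))) # Validation Cross Entropy Loss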


Bootstrap Aggregating (Bagging)

Train many models, each on a bootstrap sample (a random sample of the training set drawn with replacement), then average their predictions.

In [25]:
num_bags = 250
bag_size = 250
In [26]:
bags = []
for i in range(num_bags):
    idx = np.random.choice(np.arange(X.shape[0]), bag_size) # Sample bag_size indices with replacement (bootstrap sample)
    bags.append([X[idx], Y[idx]])
In [27]:
models = []
for bag in bags:
    models.append(LogisticRegression(solver='liblinear'))
    models[-1].fit(bag[0], bag[1])
In [28]:
skpreds = []
for model in models:
    skpreds.append(model.predict(X))
avg_preds = np.array(skpreds).mean(0) # Average the 0/1 votes of all 250 models into an ensemble probability
print(- np.mean((Y * np.log(avg_preds + epsilon)) + ((1-Y) * np.log(1-avg_preds + epsilon)))) # Cross Entropy Loss
print((np.round(avg_preds) == Y).sum() / Y.shape[0])
0.14976474846046414
0.9520833333333333
In [29]:
skpreds = []
for model in models:
    skpreds.append(model.predict(X_val))
avg_preds = np.array(skpreds).mean(0)
print(- np.mean((Y_val * np.log(avg_preds + epsilon)) + ((1-Y_val) * np.log(1-avg_preds + epsilon)))) # Cross Entropy Loss
print((np.round(avg_preds) == Y_val).sum() / Y_val.shape[0])
0.10659285951853358
0.9662921348314607
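
Sci-Kit Learn also provides a built-in version of this idea. A minimal sketch (not part of the original workshop) using sklearn.ensemble.BaggingClassifier with the same bag settings; the base_estimator argument name assumes the older sklearn release used here (newer releases rename it to estimator):

from sklearn.ensemble import BaggingClassifier

bagged = BaggingClassifier(base_estimator=LogisticRegression(solver='liblinear'),
                           n_estimators=num_bags, max_samples=bag_size)
bagged.fit(X, Y)
print(bagged.score(X, Y), bagged.score(X_val, Y_val)) # Train and validation accuracy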


7) What other algorithms can we use?


Naive Bayes

In [30]:
from sklearn.naive_bayes import GaussianNB
In [31]:
NB = GaussianNB()
In [32]:
NB.fit(X, Y)
Out[32]:
GaussianNB(priors=None, var_smoothing=1e-09)
In [33]:
NB.score(X, Y)
Out[33]:
0.9395833333333333

Decision Trees

In [34]:
from sklearn.tree import DecisionTreeClassifier
In [35]:
dec_tree = DecisionTreeClassifier()
In [36]:
dec_tree.fit(X, Y)
Out[36]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [37]:
dec_tree.score(X, Y)
Out[37]:
1.0

Support Vector Machines

In [38]:
from sklearn.svm import SVC
In [39]:
SVM1 = SVC(kernel='linear')
SVM2 = SVC()
In [40]:
SVM1.fit(X, Y)
SVM2.fit(X, Y)
C:\Users\cheon\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Out[40]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [41]:
SVM1.score(X, Y), SVM2.score(X, Y)
Out[41]:
(0.96875, 1.0)

Ensemble Algorithms

In [42]:
from sklearn.ensemble import RandomForestClassifier
In [43]:
RFC = RandomForestClassifier()
In [44]:
RFC.fit(X, Y)
C:\Users\cheon\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
Out[44]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [45]:
RFC.score(X, Y)
Out[45]:
1.0
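
The scores above are all computed on the training split. As a quick sanity check, a minimal sketch that scores the same fitted models on the held-out X_val/Y_val split defined earlier (not part of the original workshop output):

for name, clf in [('Naive Bayes', NB), ('Decision Tree', dec_tree),
                  ('Linear SVM', SVM1), ('RBF SVM', SVM2), ('Random Forest', RFC)]:
    print('{}: {:.4f}'.format(name, clf.score(X_val, Y_val)))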