Breast Cancer Analysis Using Machine Learning

Monalisa Panda
Sep 14, 2020

Breast cancer analysis and prediction is one of the most popular problems in machine learning, and one of the best for ML practitioners to work through. In this project, all the columns present in the dataset are discussed in detail.

DATA DESCRIPTION:

Radius: Mean distance from the center of the cell nucleus to points on its perimeter.
Perimeter: The length of the nucleus boundary, obtained by summing the distances between consecutive points on it.
Area: The area enclosed by the boundary.
Smoothness: The local variation in radius lengths, given by the difference between the length of a radial line and the mean length of the radial lines around it.
Compactness: A measure that combines perimeter and area, given by perimeter^2 / area - 1.0
Concavity: The severity of the concave portions of the contour. Smaller chords capture small concavities better, so this feature is affected by the chord length.
Concave points: Whereas concavity measures the magnitude of the contour concavities, concave points counts how many concave portions the contour has.
Symmetry: The longest chord is taken as the major axis. Symmetry is the difference in length between the lines perpendicular to the major axis on either side of it.
Fractal dimension: A measure of nonlinear growth. As the ruler used to measure the perimeter gets longer, the precision decreases, and hence the measured perimeter decreases. When this data is plotted on a log scale, the downward slope gives an approximation of the fractal dimension.
Texture: Standard deviation of the grayscale values, which is helpful for capturing the variation in intensity.
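
As a quick illustration of the compactness formula above, the sketch below evaluates perimeter^2 / area - 1.0 for two idealized shapes. The function name `compactness` and the example shapes are assumptions made for illustration; they are not part of the dataset or its tooling.

```python
import math

def compactness(perimeter, area):
    """Compactness as defined above: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# A circle gives the smallest possible value of this measure.
r = 5.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)   # equals 4*pi - 1

# A 10 x 2 rectangle is less compact than a circle.
rectangle = compactness(2 * (10 + 2), 10 * 2)             # 24^2 / 20 - 1

print(f"circle: {circle:.3f}, rectangle: {rectangle:.3f}")
```

The more irregular or elongated the outline, the larger the value, which is why the measure can help separate the two classes.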

Now let's go through the steps to build a model on the breast cancer data.

Step:1 Importing the Libraries:

The very first step is to import the libraries. Here we import analytical libraries such as NumPy and pandas, and visualization libraries such as matplotlib, seaborn, and plotly.

# Python libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import itertools
from itertools import chain
from sklearn.preprocessing import StandardScaler
import warnings
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.figure_factory as ff

warnings.filterwarnings('ignore') #ignore warning messages

Step:2 Loading the Data:

In this step, the data is loaded. This dataset is already present in sklearn library so we can directly import the dataset from there.

#loading the dataset and converting it into dataframe as dataframes are easier to manipulate and analyse the data
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
a=np.c_[data.data, data.target]
columns = np.append(data.feature_names, ["target"])
df_cancer=pd.DataFrame(a,columns=columns)

Step:3 Analyzing the dataset:

In this step, let's look at the first 5 rows, check for NaN values (if any), and examine some important features and the label.

#The head() shows top 5 rows of our dataframe, using this to see if the data has been correctly converted into dataframe and different column names and getting the sense of our data
df_cancer.head()
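
The NaN check mentioned in this step is not shown in the article's code. A minimal sketch of one way to do it follows, assuming a scikit-learn version that supports `as_frame=True` in `load_breast_cancer`; the variable names `frame`, `missing_per_column`, and `total_missing` are illustrative.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns the features and the target together as a pandas DataFrame
frame = load_breast_cancer(as_frame=True).frame

missing_per_column = frame.isnull().sum()        # NaN count for each column
total_missing = int(missing_per_column.sum())    # NaN count for the whole table

print(frame.shape)
print("total missing values:", total_missing)
```

The same check can be run on the article's own DataFrame with `df_cancer.isnull().sum()`.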

df_cancer['target'].value_counts()

O/p: 1.0 357 0.0 212

In this dataset, the target column is our label and the other columns are the features. The target contains 357 ones and 212 zeros in total. 0 indicates Malignant and 1 indicates Benign.

# Dividing the data into two classes; in this dataset Benign has target value 1 and Malignant has target value 0
Malignant=df_cancer[df_cancer['target'] ==0]
Benign=df_cancer[df_cancer['target'] ==1]

In the code above, we have created two DataFrames: Malignant (the rows where the target value equals 0) and Benign (the rows where the target value equals 1).
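
Because the 0/1 encoding is easy to mix up, it can help to confirm it directly from the dataset object. The sketch below reads the class names from `target_names` and counts the samples per class; the name `class_counts` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names[i] is the class name that is encoded as the integer i
counts = np.bincount(data.target)
class_counts = {str(name): int(n) for name, n in zip(data.target_names, counts)}

print(class_counts)
```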

Step:4 Visualization:

Let's Visualize the Target column.

CountChart:

#------------COUNT-----------------------
trace = go.Bar(x = (len(Malignant), len(Benign)), y = ['Malignant', 'Benign'], orientation = 'h', opacity = 0.8, marker=dict(
        color=[ 'gold', 'lightskyblue'],
        line=dict(color='#000000',width=1.5)))

layout = dict(title =  'Count of diagnosis variable')
                    
fig = dict(data = [trace], layout=layout)
py.iplot(fig)

PiePlot:

The code below plots the Malignant and Benign classes of the target variable as a pie chart showing the percentage of each.

#------------PERCENTAGE-------------------
trace = go.Pie(labels = ['benign','malignant'], values = [len(Benign), len(Malignant)], 
               textfont=dict(size=15), opacity = 0.8,
               marker=dict(colors=['lightskyblue', 'gold'], 
                           line=dict(color='#000000', width=1.5)))


layout = dict(title =  'Distribution of diagnosis variable')
           
fig = dict(data = [trace], layout=layout)
py.iplot(fig)

# Creating lists for the names of features by dividing them into three categories
mean_features= ['mean radius','mean texture','mean perimeter','mean area','mean smoothness','mean compactness', 'mean concavity','mean concave points','mean symmetry', 'mean fractal dimension']
error_features=['radius error',
 'texture error',
 'perimeter error',
 'area error',
 'smoothness error',
 'compactness error',
 'concavity error',
 'concave points error',
 'symmetry error',
 'fractal dimension error']
worst_features=['worst radius',
 'worst texture',
 'worst perimeter',
 'worst area',
 'worst smoothness',
 'worst compactness',
 'worst concavity',
 'worst concave points',
 'worst symmetry',
 'worst fractal dimension']

The code above creates three lists of feature names: mean_features, error_features, and worst_features.

Next, a function is created for the histogram plots.

# A function to plot density histograms in a 5x2 grid of subplots; wrapping the plotting in a function avoids repeating the same code for each feature group
bins = 20 #Number of bins is set to 20, bins are specified to divide the range of values into intervals
def histogram(features):
  plt.figure(figsize=(10,15))
  for i, feature in enumerate(features):
      plt.subplot(5, 2, i+1)  #subplot function the number of rows are given as 5 and number of columns as 2, the value i+1 gives the subplot number, subplot numbers start with 1
      # sns.distplot is deprecated in recent seaborn releases; histplot with kde=True and stat='density' draws a comparable density histogram
      sns.histplot(Malignant[feature], bins=bins, color='red', label='Malignant', kde=True, stat='density')
      sns.histplot(Benign[feature], bins=bins, color='green', label='Benign', kde=True, stat='density')
      plt.title('Density Plot of: ' + str(feature))
      plt.xlabel('X variable')
      plt.ylabel('Density Function')
      plt.legend(loc='upper right')
  plt.tight_layout()
  plt.show()

Once the function has been created, each feature group is plotted by calling this histogram function.

1. Mean_features:

#Calling the function with the parameter mean features
histogram(mean_features)

2. Error_features:

histogram(error_features)

3. Worst_features:

histogram(worst_features)

Next, we can define a function that plots the ROC curve.

# Imports needed by this function (they are not in the import list of Step 1)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc

def ROC_curve(X,Y,string):
  X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4)  # Splitting the data for training and testing in 60/40 ratio
  model=LogisticRegression(solver='liblinear') #Using logistic regression model
  model.fit(X_train,y_train)
  probability=model.predict_proba(X_test) #Predicting probability
  fpr, tpr, thresholds = roc_curve(y_test, probability[:,1]) #False positive rate, True Positive Rate and Threshold is returned using this function
  roc_auc = auc(fpr, tpr) #The area under the curve is given by this function
  plt.figure()
  plt.plot(fpr, tpr, lw=1, color='green', label=f'AUC = {roc_auc:.3f}')
  plt.plot([0,1],[0,1],linestyle='--',label='Baseline') #Plotting the baseline
  plt.title(string)
  plt.xlabel('False Positive Rate')
  plt.ylabel('True Positive Rate')
  plt.legend()
  plt.show()

# Example call (assumed usage; the article does not show how the function was called)
ROC_curve(df_cancer[mean_features], df_cancer['target'], 'ROC curve: mean features')

The area under the ROC curve (AUC) measures how well a set of features separates the two classes: the larger the area, the better the classifier ranks one class above the other. Here we can see that the worst and mean feature groups both give a very high AUC. When studying the histograms, we look for the features with the least overlapping area between the two classes, because classification is easier when the X variable values are clearly separated. A large difference in density between the two classes also helps, since it shows that most samples at a given X value belong to one class. The top 5 features according to this are:

Worst area
Worst perimeter
Worst radius
Mean Concave Points
Mean Concavity
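
To make the link between class overlap and AUC concrete, the sketch below computes AUC by hand as the probability that a randomly chosen sample of one class has a higher feature value than a randomly chosen sample of the other class, which is equivalent to the area under the ROC curve. The function name `auc_from_scores` and all the feature values are invented for illustration; they are not taken from the dataset.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as P(random positive scores higher than random negative); ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical feature values for two classes, invented for illustration only.
separated_pos = [20, 22, 25, 27]
separated_neg = [10, 12, 14, 15]
overlapping_pos = [14, 16, 18, 20]
overlapping_neg = [13, 15, 17, 19]

auc_separated = auc_from_scores(separated_pos, separated_neg)
auc_overlapping = auc_from_scores(overlapping_pos, overlapping_neg)
print(auc_separated, auc_overlapping)
```

A feature whose class histograms do not intersect reaches an AUC of 1.0, while heavily interleaved values pull the AUC toward the 0.5 baseline.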

Then we print the mean of every feature over all instances, separately for the Benign and Malignant classes. The truncated output below shows the first five values for one class.

mean radius 17.462830
mean texture 21.604906
mean perimeter 115.365377
mean area 978.376415
mean smoothness 0.102898
dtype: float64

Then we can create features and targets.

#Creating X and Y, where X has all the features and Y contains target
X=df_cancer.drop(['target'],axis=1)
Y=df_cancer['target']

As a rule of thumb, I kept max_depth at or below the square root of the number of instances (sqrt(569) is about 23.9), hence the candidate depths 1 to 23 produced by range(1,24). If the depth is too large the tree overfits, and if it is too low it underfits. min_samples_leaf is the minimum number of samples required at a leaf node. Too low a value leads to overfitting and too large a value leads to underfitting, so I searched the values 1 to 19 produced by range(1,20).
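
The size of the resulting search can be worked out before running it. The sketch below is plain arithmetic on the ranges described above; the variable names are illustrative, and the 5 folds match the `cv=5` used in the grid search.

```python
import math

n_samples = 569                      # rows in the scikit-learn breast cancer dataset
depth_cap = math.isqrt(n_samples)    # floor of sqrt(569)

max_depth = list(range(1, depth_cap + 1))   # same values as range(1, 24)
min_leaf = list(range(1, 20))               # 1 to 19 inclusive

n_candidates = len(max_depth) * len(min_leaf)   # parameter combinations in the grid
n_fits = n_candidates * 5                       # one fit per combination per fold

print(depth_cap, n_candidates, n_fits)
```

Knowing the number of fits up front helps decide whether the grid should be thinned out, for example by stepping through the depths in twos.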

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
max_depth = list(range(1,24))
min_leaf=list(range(1,20))
params = [{'classifier__max_depth':max_depth,'classifier__min_samples_leaf':min_leaf}] #Defining parameters for the grid search
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4) 
pipe=Pipeline([('sc',StandardScaler()),('smt',SMOTE()),('classifier',DecisionTreeClassifier(random_state=0,min_samples_split=6,max_features=10))]) #Creating a pipeline
grid_search_cv = GridSearchCV(pipe,params,scoring='accuracy',refit=True, verbose=1,cv=5) #Grid Search function which will put different combinations of the parameters
grid_search_cv.fit(X_train,y_train)

O/p:

GridSearchCV(cv=5, error_score=nan,estimator=Pipeline(memory=None,steps=[('sc', StandardScaler(copy=True,with_mean=True,with_std=True)),

model=grid_search_cv.best_estimator_  #Finding the best model from grid search

from sklearn.metrics import accuracy_score 
model.fit(X_train,y_train) #Fitting the model 
test_pred = model.predict(X_test)
print(accuracy_score(y_test, test_pred)) #accuracy score function, to print the accuracy of the model
y_test.value_counts()

O/p: 0.9429824561403509

Then let's build the confusion matrix.

from sklearn.metrics import confusion_matrix 
from sklearn.metrics import classification_report 
matrix=np.array(confusion_matrix(y_test,test_pred,labels=[0,1])) #Creating confusion matrix
pd.DataFrame(matrix,index=['Cancer','No Cancer'],columns=['Predicted_Cancer','Predicted_No_Cancer']) #Labelling the matrix
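
The four cells of a confusion matrix are enough to recover accuracy, precision, and recall. The sketch below shows the arithmetic with 'Cancer' treated as the positive class; the function name `metrics_from_confusion` and the counts are hypothetical and are not the article's results.

```python
def metrics_from_confusion(tp, fn, fp, tn):
    """Accuracy, precision and recall, treating 'Cancer' as the positive class."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # of the predicted cancers, how many were real
    recall = tp / (tp + fn)      # of the real cancers, how many were found
    return accuracy, precision, recall

# Hypothetical counts, invented for illustration.
# Layout matches the table above: rows = actual, columns = predicted.
#               Predicted_Cancer  Predicted_No_Cancer
# Cancer               40                 10
# No Cancer             5                 45
accuracy, precision, recall = metrics_from_confusion(tp=40, fn=10, fp=5, tn=45)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

For a cancer screen, recall on the 'Cancer' row is usually the number to watch, since a missed malignant case costs more than a false alarm.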

The fitted model can be drawn as a tree flowchart in which each observation is routed according to the value of some feature. Each node has two branches: if its condition is true the observation goes one way, and if false it goes the other. The first line of a node (here X7) names the feature and the value it is compared with. The second line gives the Gini index at that node, which is computed from the class proportions; a Gini index of 0 means the node is pure and we get a definite class. The samples line gives the number of samples being considered at the node, and the value line gives the number of samples in each class. At every node the candidate features are evaluated (a random subset of 10 here, because max_features=10), and the split that gives the best Gini index is chosen.
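
The Gini index described above can be computed by hand as 1 minus the sum of the squared class proportions. The sketch below shows this for a single node and for a candidate split, where the two child nodes are weighted by their size; the function names and the sample counts are invented for illustration.

```python
def gini(class_counts):
    """Gini index of a node: 1 - sum(p_k^2) over the class proportions p_k."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def weighted_gini(left_counts, right_counts):
    """Gini index of a split: child values weighted by the share of samples in each child."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)

pure_node = gini([50, 0])      # every sample in one class
mixed_node = gini([25, 25])    # evenly mixed, the worst case for two classes

# Hypothetical split of a 100-sample node with 50 samples per class.
parent = gini([50, 50])
split_score = weighted_gini([40, 5], [10, 45])
print(pure_node, mixed_node, round(split_score, 4))
```

A split is worth making when its weighted value is lower than the parent's, and the tree picks the candidate split with the lowest weighted value.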

from sklearn import tree
plt.figure(figsize=(40,40))
tree.plot_tree(model['classifier']) #function used to plot decision tree

To plot the most important features, we first extract the feature importances from the classifier.

feat_importances = pd.Series(model['classifier'].feature_importances_, index=X.columns) #function to save the most important features
feat_importances = feat_importances.nlargest(5) #as we need only 5 features nlargest() is used
feat_importances.plot(kind='barh',figsize=(12,8),title='Most Important Features') #plotting bar graph

imp_features=list(feat_importances.index)
print(feat_importances)

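A support vector machine finds the hyperplane that maximizes the margin between the two classes. The support vectors are the training points closest to that hyperplane, and they determine where the margin lies. In the plots below, the support vectors are outlined in black and the shaded bands show the margin on either side of the decision boundary.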
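
For a linear SVM, the decision function is f(x) = w . x + b: the hyperplane is where f(x) = 0, the margin lines are where f(x) = -1 and f(x) = +1 (the same three contour levels used in the plotting code), and the margin width is 2 / ||w||. The sketch below works through this with hypothetical weights; `w`, `b`, and the sample points are invented for illustration and do not come from a fitted model.

```python
import math

# Hypothetical weights and bias of a linear SVM, invented for illustration.
w = (3.0, 4.0)
b = -10.0

def decision_value(x):
    """f(x) = w . x + b; the sign gives the side of the hyperplane."""
    return w[0] * x[0] + w[1] * x[1] + b

def describe(x):
    """Locate a point relative to the decision boundary and the margin lines."""
    f = decision_value(x)
    if math.isclose(f, 0.0, abs_tol=1e-12):
        return "on the decision boundary"
    if math.isclose(abs(f), 1.0):
        return "on a margin line"
    return "inside the margin" if abs(f) < 1.0 else "outside the margin"

# Distance between the lines f(x) = -1 and f(x) = +1
margin_width = 2.0 / math.hypot(w[0], w[1])

for x in [(2.0, 1.0), (1.0, 2.0), (3.0, 0.0), (3.0, 3.0)]:
    print(x, decision_value(x), describe(x))
print("margin width:", margin_width)
```

In a fitted model, the points that land on a margin line are support vectors.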

# Assumption: `svc` is not defined anywhere in the article; a one-step pipeline wrapping a linear-kernel SVC is assumed here so that svc['classifier'] works
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline as SkPipeline
svc = SkPipeline([('classifier', SVC(kernel='linear'))])

k=1
plt.figure(figsize=(20,40))
for i in range(0,4): 
  for j in range(1,5):
    inp=pd.concat([X[imp_features[i]],X[imp_features[j]]],axis=1).values #Two-column array holding the pair of features
    s=svc['classifier'].fit(inp,Y) #Fitting the classifier on this pair of features
    plt.subplot(4, 4, k)
    k=k+1
    plt.scatter(X[imp_features[i]], X[imp_features[j]], c=Y, s=30, cmap=plt.cm.Paired)
    ax = plt.gca()

    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 50),np.linspace(ylim[0], ylim[1], 50)) #Grid covering the plotted area
    xy = np.vstack([xx.ravel(), yy.ravel()]).T
    Z = svc['classifier'].decision_function(xy).reshape(xx.shape) #Decision function evaluated at every grid point
    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, levels=[-1, 0, 1], alpha=0.5,linestyles=['--', '-', '--'])
    ax.scatter(s.support_vectors_[:, 0], s.support_vectors_[:, 1], s=10,linewidth=1, facecolors='none', edgecolors='k') #Showing support vectors
    plt.title(str(imp_features[i])+' & '+str(imp_features[j]))
plt.show()