Breast cancer analysis and prediction is one of the most popular problems in machine learning, and a good exercise for ML practitioners. In this project, all the columns present in the dataset are discussed in detail.

## DATA DESCRIPTION:

**Radius:** The distance from the center to the points on the perimeter.

**Perimeter:** The size of the core tumor. The total distance between the boundary points gives the perimeter.

**Area:** The area of the cancer cells.

**Smoothness:** The local variation in the radius lengths. It is given by the difference between a radial length and the mean length of the lines around it.

**Compactness:** A value estimated from the perimeter and the area, given by perimeter^2 / area - 1.0.

**Concavity:** The severity of the concave portions of the contour. Smaller chords capture small concavities better, so this feature is affected by the chord length.

**Concave points:** Whereas concavity measures the magnitude of the contour concavities, concave points measures how many of them there are.

**Symmetry:** The longest chord is taken as the major axis. The symmetry is the difference in length between the lines perpendicular to the major axis, measured in both directions.

**Fractal dimension:** A measure of nonlinear growth. As the ruler used to measure the perimeter gets longer, the precision decreases, and hence the measured perimeter decreases. This data is plotted on a log scale, and the downward slope gives an approximation of the fractal dimension.

**Texture:** The standard deviation of the grayscale values. This helps capture the variation.
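
As a quick sanity check on the compactness formula above, the sketch below evaluates perimeter^2 / area - 1.0 for a perfect circle and for an elongated rectangle. The helper name `compactness` is made up for this illustration. For a circle the value is 4π - 1 whatever the radius, and less regular outlines score higher.

```python
import math

def compactness(perimeter, area):
    """Compactness as defined in the data description: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# A circle of radius 3 (the radius is arbitrary and cancels out)
r = 3.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# A 1 x 9 rectangle evaluated with the same formula, for comparison
rectangle = compactness(2 * (1 + 9), 1 * 9)

print(round(circle, 3))     # equals 4*pi - 1
print(round(rectangle, 3))  # larger than the circle's value
```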

Now let's go through the steps to build a model on the breast cancer data.

*Step:1 Importing the Libraries:*

The very first step is to import the libraries. Here we import the analytical libraries NumPy and pandas and the visualization libraries matplotlib, seaborn, and plotly.

    # Python libraries
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    %matplotlib inline
    import itertools
    from itertools import chain
    from sklearn.preprocessing import StandardScaler
    import warnings
    import plotly.offline as py
    py.init_notebook_mode(connected=True)
    import plotly.graph_objs as go
    import plotly.tools as tls
    import plotly.figure_factory as ff
    warnings.filterwarnings('ignore')  # ignore warning messages

*Step:2 Loading the Data:*

In this step, the data is loaded. The dataset ships with the scikit-learn library, so we can import it directly from there.

    # Load the dataset and convert it into a DataFrame, as DataFrames are easier to manipulate and analyse
    from sklearn.datasets import load_breast_cancer
    data = load_breast_cancer()
    a = np.c_[data.data, data.target]
    columns = np.append(data.feature_names, ["target"])
    df_cancer = pd.DataFrame(a, columns=columns)
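
As an alternative sketch, scikit-learn 0.23 and later can return the same data as a DataFrame directly through the `as_frame=True` option, which avoids the manual concatenation. The variable names `bunch` and `df_alt` are only for illustration. Note that the target column then stays integer-typed, whereas the `np.c_` approach above turns it into floats.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True makes the returned Bunch carry pandas objects
bunch = load_breast_cancer(as_frame=True)

# .frame holds the 30 feature columns plus a final 'target' column
df_alt = bunch.frame

print(df_alt.shape)
print(df_alt.columns[-1])
```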

*Step:3 Analyzing the dataset:*

In this step, let's look at the first 5 rows, any NaN values (if present), and some important features and labels.

    # head() shows the top 5 rows of the DataFrame; we use it to check that the data was converted correctly, to see the column names, and to get a sense of the data
    df_cancer.head()
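
The NaN check mentioned above is not shown in the code. One common way to do it, sketched here with the standard pandas `isna()` method, is to count the missing values per column. The frame is rebuilt inside the block only so that the sketch runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Rebuild the DataFrame the same way as in the loading step
data = load_breast_cancer()
df_cancer = pd.DataFrame(np.c_[data.data, data.target],
                         columns=np.append(data.feature_names, ["target"]))

# Number of missing values in each column
missing_per_column = df_cancer.isna().sum()

# Total across the whole frame; 0 means there is nothing to impute or drop
total_missing = int(missing_per_column.sum())
print(total_missing)
```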

`df_cancer['target'].value_counts()`

**O/p:** 1.0 357
0.0 212

In this dataset, the label is the target column and the other columns are the features. The target column contains 357 ones and 212 zeros in total. 0 indicates malignant and 1 indicates benign.

    # Divide the data into two classes; in this dataset benign has target value 1 and malignant has target value 0
    Malignant = df_cancer[df_cancer['target'] == 0]
    Benign = df_cancer[df_cancer['target'] == 1]

The code above creates two DataFrames: Malignant (the rows where the target value equals 0) and Benign (the rows where the target value equals 1).
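
Because the class encoding is easy to mix up, it can be verified from the dataset itself. The sketch below relies on the `target_names` attribute of the scikit-learn Bunch, where the position of each name is its numeric code. The dict name `class_counts` is only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# The index of a name in target_names is its numeric code in data.target
for code, name in enumerate(data.target_names):
    print(code, name, int(np.sum(data.target == code)))

# The same information collected in a dict for later lookups
class_counts = {str(name): int(np.sum(data.target == code))
                for code, name in enumerate(data.target_names)}
```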

*Step:4 Visualization:*

Let's Visualize the Target column.

**CountChart:**

    # ------ COUNT ------
    trace = go.Bar(x=(len(Malignant), len(Benign)), y=['Malignant', 'Benign'], orientation='h', opacity=0.8,
                   marker=dict(color=['gold', 'lightskyblue'],
                               line=dict(color='#000000', width=1.5)))
    layout = dict(title='Count of diagnosis variable')
    fig = dict(data=[trace], layout=layout)
    py.iplot(fig)

**PiePlot:**

The code below plots a pie chart showing the percentages of malignant and benign cases in the target variable.

    # ------ PERCENTAGE ------
    # data is the scikit-learn Bunch and has no 'diagnosis' key, so the class sizes are taken from the two DataFrames
    trace = go.Pie(labels=['benign', 'malignant'], values=[len(Benign), len(Malignant)],
                   textfont=dict(size=15), opacity=0.8,
                   marker=dict(colors=['lightskyblue', 'gold'],
                               line=dict(color='#000000', width=1.5)))
    layout = dict(title='Distribution of diagnosis variable')
    fig = dict(data=[trace], layout=layout)
    py.iplot(fig)

    # Create lists of feature names, divided into three categories
    mean_features = ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness',
                     'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry',
                     'mean fractal dimension']
    error_features = ['radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error',
                      'compactness error', 'concavity error', 'concave points error', 'symmetry error',
                      'fractal dimension error']
    worst_features = ['worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness',
                      'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry',
                      'worst fractal dimension']

The code above creates three lists of feature names: mean_features, error_features, and worst_features.

After this, a function is created for the histogram plots.

    # A function that plots histograms in 10 subplots; wrapping the task in a function avoids repeating the plotting code
    bins = 20  # number of bins; bins divide the range of values into intervals
    def histogram(features):
        plt.figure(figsize=(10, 15))
        for i, feature in enumerate(features):
            # 5 rows and 2 columns of subplots; i+1 is the subplot number, as subplot numbers start with 1
            plt.subplot(5, 2, i + 1)
            # sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay draws a comparable density plot
            sns.histplot(Malignant[feature], bins=bins, color='red', label='Malignant', kde=True, stat='density')
            sns.histplot(Benign[feature], bins=bins, color='green', label='Benign', kde=True, stat='density')
            plt.title('Density Plot of: ' + str(feature))
            plt.xlabel('X variable')
            plt.ylabel('Density Function')
            plt.legend(loc='upper right')
        plt.tight_layout()
        plt.show()

After the function has been created, each group of features is plotted by calling this histogram function.

1. **Mean_features:**

`histogram(mean_features)` *#Calling the function with the mean features*

2. **Error_features:**

`histogram(error_features)`

3. **Worst_features:**

`histogram(worst_features)`
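
To put a number on how much the two class histograms overlap, instead of judging it by eye, one option is to sum the shared area of the two density histograms. The sketch below is an illustration with NumPy only; `histogram_overlap` is a hypothetical helper and the toy arrays are invented.

```python
import numpy as np

def histogram_overlap(a, b, bins=20):
    """Shared area of two density histograms (0 = disjoint, 1 = identical)."""
    lo = min(np.min(a), np.min(b))
    hi = max(np.max(a), np.max(b))
    # Use the same bin edges for both samples so the bars are comparable
    edges = np.linspace(lo, hi, bins + 1)
    pa, _ = np.histogram(a, bins=edges, density=True)
    pb, _ = np.histogram(b, bins=edges, density=True)
    width = edges[1] - edges[0]
    return float(np.sum(np.minimum(pa, pb)) * width)

# Two samples that never share a bin, and a sample compared with itself
separated = histogram_overlap([0, 1, 2], [10, 11, 12])
identical = histogram_overlap([1, 2, 3, 4], [1, 2, 3, 4])
print(separated, identical)
```

With the DataFrames from this article, a call such as `histogram_overlap(Malignant['worst area'], Benign['worst area'])` would give one score per feature, so the features can be ranked by how well they separate the classes.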

Next, we can plot the ROC curve.

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, auc

    def ROC_curve(X, Y, string):
        # Split the data for training and testing in a 60/40 ratio
        X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4)
        model = LogisticRegression(solver='liblinear')  # logistic regression model
        model.fit(X_train, y_train)
        probability = model.predict_proba(X_test)  # predicted class probabilities
        # roc_curve returns the false positive rate, the true positive rate and the thresholds
        fpr, tpr, thresholds = roc_curve(y_test, probability[:, 1])
        roc_auc = auc(fpr, tpr)  # area under the curve
        plt.figure()
        plt.plot(fpr, tpr, lw=1, color='green', label=f'AUC = {roc_auc:.3f}')
        plt.plot([0, 1], [0, 1], linestyle='--', label='Baseline')  # plot the baseline
        plt.title(string)
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.legend()
        plt.show()

According to the ROC curves, the feature groups with the largest area under the ROC curve separate the two classes best. Here we can see that the worst and mean features give a very high area under the curve. When studying the histograms, we need to find the features with the least overlapping area between the two classes, because classification is easier when the classes occupy clearly distinct ranges of the X variable. A large density difference between the two classes also helps, since it shows that most of the objects with a given X value lie in one class. The top 5 features according to this are:

Worst area

Worst perimeter

Worst radius

Mean Concave Points

Mean Concavity
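
To make the area under the ROC curve more concrete: it can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The toy labels and scores below are invented for illustration, and `pairwise_auc` is a hypothetical helper, not a scikit-learn function.

```python
def pairwise_auc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: one positive (0.35) is scored below one negative (0.4)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(labels, scores)
print(toy_auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

A value of 0.5 corresponds to the dashed baseline in the ROC plot, and 1.0 to perfect separation of the two classes.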

Then we print the mean of all the instances of each feature for both the benign and malignant classes.

    mean radius         17.462830
    mean texture        21.604906
    mean perimeter     115.365377
    mean area          978.376415
    mean smoothness      0.102898
    dtype: float64
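
The code that produces these means is not shown. A minimal sketch, assuming a pandas `groupby` on the target column, could look like the following. The frame is rebuilt so that the block runs on its own, and only the first five feature columns are selected to match the excerpt above; the names `first_five` and `class_means` are only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df_cancer = pd.DataFrame(np.c_[data.data, data.target],
                         columns=np.append(data.feature_names, ["target"]))

# 'mean radius' through 'mean smoothness'
first_five = list(data.feature_names[:5])

# One row per class (0.0 = malignant, 1.0 = benign), one column per feature
class_means = df_cancer.groupby('target')[first_five].mean()
print(class_means)
```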

Then we can create features and targets.

    # Create X and Y, where X has all the features and Y contains the target
    X = df_cancer.drop(['target'], axis=1)
    Y = df_cancer['target']

As a rule of thumb, the max_depth of a decision tree should be equal to or less than the square root of the number of instances, hence I chose a range that stops just below 24 (the square root of 569 is about 23.9). If the depth is too large we see overfitting, and if it is too low we see underfitting. The min_samples_leaf parameter gives the minimum number of samples required at a leaf node. Too low a value leads to overfitting and too large a value leads to underfitting, hence I took a range that stops just below 20.
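
The numbers behind these ranges can be checked with a few lines of arithmetic. The sketch below assumes the 569 instances of the scikit-learn dataset and the `range(1, 24)` and `range(1, 20)` grids with 5-fold cross-validation used in the grid search. Note that Python's `range` excludes its upper bound.

```python
import math

n_instances = 569  # rows in the scikit-learn breast cancer dataset

# Square-root rule of thumb for the largest depth worth trying
depth_limit = math.isqrt(n_instances)  # integer part of sqrt(569)

max_depth = list(range(1, 24))   # 1, 2, ..., 23
min_leaf = list(range(1, 20))    # 1, 2, ..., 19

n_candidates = len(max_depth) * len(min_leaf)
n_fits = n_candidates * 5        # one fit per candidate per cross-validation fold

print(depth_limit, n_candidates, n_fits)
```

The count excludes the final refit of the best candidate on the whole training set, which adds one more fit.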

    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.tree import DecisionTreeClassifier
    max_depth = list(range(1, 24))
    min_leaf = list(range(1, 20))
    # Parameters for the grid search
    params = [{'classifier__max_depth': max_depth, 'classifier__min_samples_leaf': min_leaf}]
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4)
    # Pipeline: scale the features, oversample the minority class with SMOTE, then fit the decision tree
    pipe = Pipeline([('sc', StandardScaler()), ('smt', SMOTE()),
                     ('classifier', DecisionTreeClassifier(random_state=0, min_samples_split=6, max_features=10))])
    # Grid search tries the different combinations of the parameters with 5-fold cross-validation
    grid_search_cv = GridSearchCV(pipe, params, scoring='accuracy', refit=True, verbose=1, cv=5)
    grid_search_cv.fit(X_train, y_train)

O/p:

GridSearchCV(cv=5, error_score=nan,estimator=Pipeline(memory=None,steps=[('sc', StandardScaler(copy=True,with_mean=True,with_std=True)),

    model = grid_search_cv.best_estimator_  # the best model found by the grid search

    from sklearn.metrics import accuracy_score
    model.fit(X_train, y_train)  # fit the model
    test_pred = model.predict(X_test)
    print(accuracy_score(y_test, test_pred))  # print the accuracy of the model on the test set
    y_test.value_counts()

O/p: 0.9429824561403509

Then let's compute the confusion matrix.

    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import classification_report
    matrix = np.array(confusion_matrix(y_test, test_pred, labels=[0, 1]))  # create the confusion matrix
    # Label the matrix; class 0 is malignant (cancer) and class 1 is benign (no cancer)
    pd.DataFrame(matrix, index=['Cancer', 'No Cancer'], columns=['Predicted_Cancer', 'Predicted_No_Cancer'])
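
To show how such a matrix is read, the sketch below derives accuracy, recall and precision for the cancer class from a 2x2 confusion matrix laid out as above (rows are the true classes, columns are the predictions). The counts are hypothetical and are not the results of this model.

```python
# Hypothetical counts, laid out like the labelled matrix:
#              Predicted_Cancer  Predicted_No_Cancer
# Cancer              40                  5
# No Cancer            3                 52
matrix = [[40, 5],
          [3, 52]]

true_cancer, missed_cancer = matrix[0]
false_alarm, true_no_cancer = matrix[1]
total = true_cancer + missed_cancer + false_alarm + true_no_cancer

accuracy = (true_cancer + true_no_cancer) / total
recall = true_cancer / (true_cancer + missed_cancer)    # share of real cancers that were caught
precision = true_cancer / (true_cancer + false_alarm)   # share of cancer predictions that were right

print(round(accuracy, 3), round(recall, 3), round(precision, 3))
```

For a screening task like this one, recall on the cancer class is usually watched closely, because a missed cancer is the costlier error.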

A decision tree is a flowchart in which each observation is split according to some feature. There are two ways to go from each node: if the condition is true the observation goes one way, and if it is false it goes the other way. The first line in a node (here X7) gives the feature and the value it is compared to. The second line gives the value of the Gini index at that node, which is computed from the class proportions. A Gini index of 0 means the node is pure and we get a definite class. The samples line gives the number of samples being considered, and the value line in each node gives the number of samples in each class. At every node the candidate features are evaluated, and the feature that gives the best Gini index is chosen.
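
The Gini index of a node can be computed directly from its class counts, as one minus the sum of the squared class proportions. The sketch below is a plain-Python illustration with made-up counts; `gini` is a hypothetical helper, not the scikit-learn implementation.

```python
def gini(class_counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

pure_node = gini([0, 40])     # every sample belongs to one class
mixed_node = gini([20, 20])   # an even split, the worst case for two classes
skewed_node = gini([5, 35])   # mostly one class

print(pure_node, mixed_node, skewed_node)
```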

    from sklearn import tree
    plt.figure(figsize=(40, 40))
    tree.plot_tree(model['classifier'])  # plot the decision tree

To plot the most important features, we have to extract the feature importances from the classifier.

    feat_importances = pd.Series(model['classifier'].feature_importances_, index=X.columns)  # importance of each feature
    feat_importances = feat_importances.nlargest(5)  # as we need only 5 features, nlargest() is used
    feat_importances.plot(kind='barh', figsize=(12, 8), title='Most Important Features')  # plot a bar graph
    imp_features = list(feat_importances.index)
    print(feat_importances)

A support vector machine finds the hyperplane that maximizes the margin between the two classes. In the diagrams, the support vectors are the points that define the margin around the hyperplane.
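
A tiny example makes the role of the support vectors visible. The sketch below fits scikit-learn's `SVC` with a linear kernel on four invented, well-separated points. Only the two points closest to the boundary end up as support vectors, and the decision function is close to -1 and +1 on them, which is where the margin lies. The linear kernel is chosen here for readability and is an assumption, not necessarily the kernel used in the notebook.

```python
import numpy as np
from sklearn.svm import SVC

# Four points on a line, two per class, invented for illustration
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y_toy = np.array([0, 0, 1, 1])

clf = SVC(kernel='linear', C=1.0).fit(X_toy, y_toy)

# The points that pin down the margin
support = clf.support_vectors_
# Signed score of each support vector; the margin sits at -1 and +1
margin_values = clf.decision_function(support)

print(support)
print(np.round(margin_values, 2))
```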

```
from imblearn.pipeline import Pipeline
from sklearn.svm import SVC

# Assumption: svc is not defined anywhere in this article; a pipeline with an SVC step
# named 'classifier' is assumed here so that svc['classifier'] resolves
svc = Pipeline([('classifier', SVC())])

k = 1
plt.figure(figsize=(20, 40))
for i in range(0, 4):
    for j in range(1, 5):
        # Fit the SVM on one pair of the most important features
        inp = pd.concat([X[imp_features[i]], X[imp_features[j]]], axis=1)
        s = svc['classifier'].fit(inp, Y)
        plt.subplot(4, 4, k)
        k = k + 1
        plt.scatter(X[imp_features[i]], X[imp_features[j]], c=Y, s=30, cmap=plt.cm.Paired)
        ax = plt.gca()
        xlim = ax.get_xlim()
        ylim = ax.get_ylim()
        # Evaluate the decision function on a 50x50 grid covering the plot
        xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 50), np.linspace(ylim[0], ylim[1], 50))
        xy = np.vstack([xx.ravel(), yy.ravel()]).T
        Z = svc['classifier'].decision_function(xy).reshape(xx.shape)
        plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
        # Show the support vectors
        ax.scatter(s.support_vectors_[:, 0], s.support_vectors_[:, 1], s=10, linewidth=1, facecolors='none', edgecolors='k')
        plt.title(str(imp_features[i]) + ' & ' + str(imp_features[j]))
```

So, in this way, we can build a breast cancer prediction model.

For code: https://github.com/CodersArts2017/Jupyter-Notebooks/blob/master/Breast_Cancer.ipynb

**Thank You!**

**Happy Coding ;)**