
Cervical Cancer Risk Factor Data Set - classification and clustering

Updated: Nov 3, 2021



Description :


The dataset was collected at the Hospital Universitario de Caracas in Caracas, Venezuela. It contains demographic information, habits, and historical medical records for 858 patients. Of these, 55 patients were diagnosed with cervical cancer and 803 were healthy. Cervical cancer is a leading gynecological malignancy worldwide and, according to the WHO, it is most common among women in developing countries.


Recommended Model :


Suitable algorithms include decision trees, logistic regression, support vector machines, KNN, etc.


Recommended Projects :

Predict the likelihood of a cervical cancer diagnosis.

Dataset link



Overview of data

Detailed overview of dataset

  • Records in the dataset = 858 rows

  • Columns in the dataset = 36 columns (the 32 attributes listed below plus 4 target variables)


  1. Age : Age of patients

  2. Number of sexual partners (numerical)

  3. First sexual intercourse (age)

  4. Number of pregnancies

  5. Smokes (0-no, 1-yes)

  6. Smokes (years)

  7. Smokes (packs/year)

  8. Hormonal Contraceptives

  9. Hormonal Contraceptives (years)

  10. IUD

  11. IUD (years)

  12. STDs

  13. STDs (number) (integer)

  14. STDs:condylomatosis

  15. STDs: cervical condylomatosis

  16. STDs:vaginal condylomatosis

  17. STDs:vulvo-perineal condylomatosis

  18. STDs:syphilis

  19. STDs:pelvic inflammatory disease

  20. STDs:genital herpes

  21. STDs:molluscum contagiosum

  22. STDs:AIDS

  23. STDs:HIV

  24. STDs:Hepatitis B

  25. STDs:HPV

  26. STDs: Number of diagnosis

  27. STDs: Time since first diagnosis

  28. STDs: Time since last diagnosis

  29. Dx:Cancer

  30. Dx:CIN

  31. Dx:HPV

  32. Dx

Target Variables - There are four target variables in this dataset

Hinselmann: target variable

Schiller: target variable

Cytology: target variable

Biopsy: class or target variable
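Because the file mixes 32 attributes with 4 target columns, it can help to separate them before any modelling. The following is a minimal sketch, assuming the target columns are named exactly as in the list above; the helper `split_features_targets` and the small `demo` frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

# The four target columns named in the list above.
TARGET_COLUMNS = ["Hinselmann", "Schiller", "Cytology", "Biopsy"]


def split_features_targets(df, target_columns=TARGET_COLUMNS):
    """Return (features, targets) as two separate DataFrames."""
    missing = [c for c in target_columns if c not in df.columns]
    if missing:
        raise KeyError(f"Target columns not found: {missing}")
    features = df.drop(columns=list(target_columns))
    targets = df[list(target_columns)]
    return features, targets


# Tiny made-up frame, for illustration only (not real patient data).
demo = pd.DataFrame({
    "Age": [18, 34, 52],
    "Smokes": [0, 1, 0],
    "Hinselmann": [0, 0, 1],
    "Schiller": [0, 1, 1],
    "Cytology": [0, 0, 1],
    "Biopsy": [0, 0, 1],
})
features, targets = split_features_targets(demo)
print(list(features.columns))
print(list(targets.columns))
```

For a single-label classification task, one option is to keep only `targets["Biopsy"]`, which the list above marks as the class variable.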


EDA[Code]


Dataset


import pandas as pd
# Load the data (a forward-slash path works on Windows, macOS, and Linux)
file_loc = "data/risk_factors_cervical_cancer.csv"
cervical_cancer_data = pd.read_csv(file_loc)
# Preview the first five rows
cervical_cancer_data.head()



Total number of rows and columns in the dataset

# Number of rows and columns
rows_col = cervical_cancer_data.shape
print("Total number of Rows in the dataset : {}".format(rows_col[0]))
print("Total number of columns in the dataset : {}".format(rows_col[1]))


Dataset information

# Data information
cervical_cancer_data.info()


Check the number of missing values in the dataset



# Missing Values
cervical_cancer_data.isna().sum()
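One caveat: the Smokes count plot further down shows a "?" category, which suggests that unknown values are stored as the string "?" rather than as empty cells. If so, isna() will not count them. The sketch below shows one way to count such placeholders and turn them into NaN; the helper names and the `demo` frame are hypothetical, and the "?" convention is an assumption based on that plot.

```python
import numpy as np
import pandas as pd


def count_placeholders(df, placeholder="?"):
    """Count the cells equal to the placeholder string in each column."""
    return df.eq(placeholder).sum()


def placeholders_to_nan(df, placeholder="?"):
    """Return a copy in which placeholder cells are replaced by NaN."""
    return df.replace(placeholder, np.nan)


# Tiny made-up frame, for illustration only (not real patient data).
demo = pd.DataFrame({
    "Age": [18, 34, 52],
    "Smokes": ["0.0", "?", "1.0"],
    "Smokes (years)": ["0.0", "?", "?"],
})
counts = count_placeholders(demo)
cleaned = placeholders_to_nan(demo)
print(counts)
print(cleaned.isna().sum())
```

An alternative is to pass `na_values="?"` to `pd.read_csv` when loading, at the cost of the "?" category no longer appearing in count plots.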



Statistical information about the dataset



# Statistical information
cervical_cancer_data.describe()


Data Visualization


The number of patients diagnosed with cervical cancer



import matplotlib.pyplot as plt
import seaborn as sns 
sns.set_style("whitegrid")
plt.figure(figsize=(8,5))
sns.countplot(x= "Biopsy",data=cervical_cancer_data)
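The count plot makes the class imbalance visible; the same information can be expressed as numbers. The sketch below rebuilds the labels from the totals given in the description (803 healthy, 55 diagnosed) instead of reading the CSV, and the helper `class_balance` is a hypothetical name.

```python
import pandas as pd


def class_balance(labels):
    """Return the count and percentage share of each class label."""
    counts = labels.value_counts().sort_index()
    share = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": share})


# Labels rebuilt from the totals in the description
# (803 negative, 55 positive); not read from the CSV.
biopsy = pd.Series([0] * 803 + [1] * 55, name="Biopsy")
summary = class_balance(biopsy)
print(summary)
```

With the loaded data, the equivalent call would be `class_balance(cervical_cancer_data["Biopsy"])`. A positive share of roughly 6% means plain accuracy can be misleading when evaluating a classifier on this dataset.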


Age of Patients


# Histogram 
plt.figure(figsize=(8,5))
sns.histplot(x="Age",data=cervical_cancer_data)



plt.figure(figsize=(8,5))
sns.countplot(x= "Smokes",data=cervical_cancer_data)

The number of patients who smoke: 0 = no, 1 = yes, ? = unknown

Smoke years


# Histogram 
plt.figure(figsize=(15,5))
sns.histplot(x="Smokes (years)",data=cervical_cancer_data)
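If "Smokes (years)" is stored as text because of "?" entries (an assumption, based on the "?" category seen in the Smokes plot), the histogram treats each distinct string as its own category. A minimal sketch of converting the column to numbers first follows; `to_numeric_column` and the `demo` frame are hypothetical.

```python
import pandas as pd


def to_numeric_column(df, column):
    """Return the column as floats, with unparseable entries set to NaN."""
    return pd.to_numeric(df[column], errors="coerce")


# Tiny made-up frame, for illustration only (not real patient data).
demo = pd.DataFrame({"Smokes (years)": ["0.0", "1.5", "?", "10.0"]})
smoke_years = to_numeric_column(demo, "Smokes (years)")
print(smoke_years.dropna().describe())
```

With the loaded data, the converted values could then be plotted with `sns.histplot(x=to_numeric_column(cervical_cancer_data, "Smokes (years)").dropna())`.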


Other related data




