Oct 28, 2021

Census income data set - category classification

Updated: Nov 3, 2021

Description :

The data was extracted by Barry Becker from the 1994 census database. The dataset contains 14 attributes, 8 categorical and 6 continuous, with information about age, education, nationality, marital status, relationship status, occupation, work classification, gender, race, working hours per week, capital loss and capital gain. The target variable is income level: whether a person earns more than 50 thousand dollars per year, which is to be predicted from the given attributes.

Recommended Model :

Suitable algorithms include decision tree classifiers, random forests, SVMs and logistic regression.

Recommended Projects :

Determine whether a person makes over 50,000 a year (income prediction).
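As a rough illustration of the recommended project, the sketch below trains a logistic regression on a few made-up rows. It assumes scikit-learn is installed; the `toy` frame, its values and the choice of three columns are invented for the example and are not taken from the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration; real training would use the full dataset.
toy = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 60, 41, 26],
    "hours-per-week": [20, 40, 50, 45, 35, 60, 55, 30],
    "education": ["HS-grad", "Bachelors", "Masters", "Doctorate",
                  "HS-grad", "Masters", "Bachelors", "Some-college"],
    "income": ["<=50K", "<=50K", ">50K", ">50K",
               "<=50K", ">50K", ">50K", "<=50K"],
})
X, y = toy.drop(columns="income"), toy["income"]

# Scale the continuous columns and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "hours-per-week"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["education"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)

# Predicting on the training rows only shows the mechanics, not real accuracy.
predictions = model.predict(X)
print(predictions)
```

The `classifier` step could be swapped for `DecisionTreeClassifier`, `RandomForestClassifier` or `SVC` from scikit-learn to try the other recommended models. A real evaluation would also need a held-out test split.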

Dataset link

Dataset link: https://archive.ics.uci.edu/ml/datasets/census+income

Overview of data

Detailed overview of dataset

  • Records in the dataset = 32561 rows

  • Columns in the dataset = 15 columns (14 attributes plus the target variable)

  1. age: Age of person (continuous).

  2. workclass: The working sector of a person (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.)

  3. fnlwgt: Final weight (continuous). The weights on the Current Population Survey (CPS) files are controlled to independent estimates of the civilian noninstitutional population of the US, prepared monthly by the Population Division at the Census Bureau.

  4. education: Qualification (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.)

  5. education-num: Numeric encoding of the education level (continuous).

  6. marital-status: Marital status (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.)

  7. occupation: Occupation of person (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspect, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.)

  8. relationship: (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.)

  9. race: Race of person (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.)

  10. sex: Gender of person (Female, Male.)

  11. capital-gain: Capital gain of a person per year (continuous).

  12. capital-loss: Capital loss of a person per year (continuous).

  13. hours-per-week: Work hours per week (continuous).

  14. native-country: (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.)

  • Target variable:

  • income: Whether the person earns more than 50K per year (>50K, <=50K).
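The column list above can be turned into a loading step. The sketch below assumes the raw UCI file layout (comma-separated, no header row, a space after each comma), which may differ from the CSV used in the EDA code of this post. `COLUMN_NAMES` is a name introduced here, and the two sample rows are made up for illustration.

```python
import io
import pandas as pd

# Column names in the order listed above, plus the target (hypothetical constant).
COLUMN_NAMES = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# Two made-up rows in the assumed raw layout; replace with a real file path.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "52, Private, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, "
    "Husband, White, Male, 0, 0, 45, United-States, >50K\n"
)

# skipinitialspace strips the blank that follows each comma.
df = pd.read_csv(sample, header=None, names=COLUMN_NAMES, skipinitialspace=True)

# Split the attributes by dtype: non-numeric columns are categorical,
# numeric columns are continuous. The target is left out of both lists.
features = df.drop(columns="income")
categorical = features.select_dtypes(exclude="number").columns.tolist()
continuous = features.select_dtypes(include="number").columns.tolist()
print(len(categorical), "categorical,", len(continuous), "continuous")
```

With these names the split comes out as 8 categorical and 6 continuous attributes, matching the description.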

EDA[Code]

Dataset

# load data
import pandas as pd

# forward slashes work on Windows, macOS and Linux
file_loc = "data/adult.csv"
census_data = pd.read_csv(file_loc)

# show the first five rows
census_data.head()

Total Number of Rows and Columns in the dataset

shape = census_data.shape
print("Total records in the dataset :", shape[0])
print("Total columns in the dataset :", shape[1])

Check the details of dataset

# Data information
census_data.info()

Check the missing values in the dataset.

# Check the missing values in each column
census_data.isna().sum()
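One caveat: in the UCI distribution of this dataset, unknown values are written as the string `?` rather than left empty, so `isna()` can report zero missing values even when some are present. The sketch below, on a small made-up frame, converts that placeholder to a real missing value before counting. Whether the CSV loaded above uses the same placeholder is an assumption to verify on the actual file.

```python
import numpy as np
import pandas as pd

# Made-up rows; "?" stands in for an unknown value (assumed placeholder).
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "?"],
    "occupation": ["Sales", "?", "Adm-clerical", "Tech-support"],
    "age": [25, 38, 52, 41],
})

# Before: the placeholder is an ordinary string, so nothing counts as missing.
missing_before = int(toy.isna().sum().sum())
print("missing before:", missing_before)

# Turn the placeholder into a real missing value, then count per column.
cleaned = toy.replace("?", np.nan)
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

When reading from a file, passing `na_values="?"` to `pd.read_csv` has a similar effect at load time.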

Statistical information

# Statistical information about the dataset
census_data.describe()

Data Visualization :

Correlation

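import seaborn as sns
import matplotlib.pyplot as plt

# correlation between the numeric columns only;
# numeric_only=True avoids an error on text columns in recent pandas versions
corr = census_data.corr(numeric_only=True)
corr.style.background_gradient(cmap='coolwarm')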
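The `income` column is text, so it does not appear in the correlation matrix. One option, sketched here on made-up rows, is to encode it as 0/1 first and then read off each numeric column's correlation with it. `income_flag` is a column name introduced for this example.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 60],
    "hours-per-week": [20, 40, 50, 45, 35, 60],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
})

# 1 for ">50K", 0 otherwise (hypothetical helper column).
toy["income_flag"] = (toy["income"].str.strip() == ">50K").astype(int)

# numeric_only=True skips the text column (available in pandas 1.5 and later).
corr_with_target = toy.corr(numeric_only=True)["income_flag"].drop("income_flag")
print(corr_with_target.sort_values(ascending=False))
```

Pearson correlation against a binary flag is only a rough screening tool; it says nothing about the categorical attributes.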

Count plot of income

sns.set_style("whitegrid")
plt.figure(figsize=(8, 5))
sns.countplot(x='income', data=census_data)
plt.show()
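The count plot shows the class balance visually; the same information is available as numbers through `value_counts`. A minimal sketch on made-up labels (the proportions here are not the dataset's):

```python
import pandas as pd

# Made-up labels for illustration only; not the real class distribution.
income = pd.Series(["<=50K", "<=50K", ">50K", "<=50K"], name="income")

counts = income.value_counts()                # absolute count per class
shares = income.value_counts(normalize=True)  # fraction of rows per class
print(counts)
print(shares)
```

On the loaded data the equivalent call would be `census_data['income'].value_counts(normalize=True)`. A strong imbalance between the two classes is worth knowing before choosing an evaluation metric.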

Count plot of gender

sns.countplot(x='sex', data=census_data)
plt.show()

Count plot of Workclass

plt.figure(figsize=(18, 10))
sns.countplot(x='workclass', data=census_data)
plt.show()

Other related data

Occupancy Detection Data Set - Classification

Student performance dataset - Classification and Regression

Wholesale customer - Classification and Clustering

Online retail dataset - Classification, Clustering and Regression

Cervical Cancer Risk Factor Dataset - Classification and Clustering

Blood Transfusion service center dataset - Classification

Divorce Predictor Dataset - Classification

Fire Forest Dataset - Regression

Heart Disease dataset - Classification
