BACKGROUND: INCOME CLASSIFIER
Census data is one of the largest sources of statistical information about a population. It typically includes information related to Age, Gender, Household Composition, Employment Details, Accommodation Details, and so on. Until recent years, collecting census data was a manual process involving field visits and registrations. With advances in technology, the methods of collecting this data have improved to a great extent, and the population itself has grown: with more than 7 billion people, one can imagine the volume of the census data associated with them. This data is collected from a variety of sources, such as manual entries, online surveys, and data from social media and search engines, and it arrives in various formats. Traditional database systems are inefficient at handling such data. This is where Big Data technologies come into the picture.
As per a study by the U.S. Census Bureau, analytics on census data could have helped during the Great Recession in various ways, such as avoiding job losses in supply-chain businesses and reducing housing foreclosure rates.
Big Data Analytics refers to a set of tools and methods used to obtain knowledge from information. Applying Big Data Analytics to census data can facilitate better decision-making in various government and industrial sectors such as Healthcare, Education, Finance, Retail, and Housing. One such application is an Income Classifier. In this project, let us take a sample of world census data and build an Income Classifier using the Big Data techniques described in the subsequent sections.
LEARNING OBJECTIVES
1. HDFS and Hive for Data Storage and Management
2. Data Ingestion using Sqoop
3. Machine Learning using PySpark
This project is divided into three parts to cover the above learning objectives.
DATASET
The dataset named censusdata.csv is provided in your LMS. We will be using the same dataset for all three parts.
Input: The dataset contains 15 columns.
Target Column: income; the income is provided in the form of two values: <=50K or >50K
Number of other columns: 14; these are demographics and other features used for describing a person
List of Attributes:
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
sex: Female, Male
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
income: >50K, <=50K
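To make the target column concrete: before any classifier can use the income attribute, its two string values must be turned into binary labels. Here is a minimal pure-Python sketch; the helper name `encode_income` and the sample record are illustrative, not part of the dataset or any required API:

```python
# Minimal sketch: binarize the "income" target for classification.
# Field names follow the attribute list above; the record values are illustrative.

def encode_income(label: str) -> int:
    """Map the income string to a binary class: 1 for >50K, 0 for <=50K."""
    return 1 if label.strip() == ">50K" else 0

sample = {
    "workclass": "Private",
    "education": "Bachelors",
    "marital-status": "Never-married",
    "occupation": "Adm-clerical",
    "relationship": "Not-in-family",
    "race": "White",
    "sex": "Male",
    "native-country": "United-States",
    "income": ">50K",
}

print(encode_income(sample["income"]))  # prints 1
```

Stripping whitespace matters because, in raw census extracts, categorical values often carry a leading space after the comma delimiter.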
1. HDFS AND HIVE
Problem Statement 1
Census Analytics is a project where you need to collect the data of people along with their incomes. Because census data is usually large in volume, analyzing it with traditional tools is a cumbersome task. To overcome this, we will be using the Hadoop Ecosystem.
As a first step, you need to load the data into HDFS and create a table in Hive that can be used for querying the data. You will create different types of tables, execute the queries mentioned below, and compare the execution times across the different table types.
Steps to be performed:
1. Download the dataset named censusdata.csv that is provided in your LMS
2. Load the downloaded data into HDFS
3. Create an internal table in Hive to store the data
a. Create the table structure
b. Load the data from HDFS into the Hive table
4. Create an internal table in Hive with partitions
a. Create a Partitioned Table in Hive using “workclass” as the Partition Key
b. Load data from the staging table (Table created in Step 3) into this table
5. Create an external table in Hive to hold the same data stored in HDFS
6. Create an external table in Hive with partitions using “workclass” as the Partition Key
7. For each of the four tables created above, perform the following operations:
a. Find the number of adults based on income and gender. Note the time taken to get the result
b. Find the number of adults based on income and workclass. Note the time taken to get the result
Write your observations by comparing the time taken for executing the commands between:
a. Internal & External Tables
b. Partitioned & Non-partitioned Tables
8. Delete the internal as well as the external tables. Comment on the effect on the data and metadata after the deletion is performed for both internal and external tables.
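The steps above can be sketched as follows. The HDFS path, table names, and the full 15-column list (taken from the standard adult census schema) are assumptions; adjust them to match your file and cluster.

```shell
# Step 2: load the downloaded file into HDFS (path is illustrative)
hdfs dfs -mkdir -p /user/hive/census
hdfs dfs -put censusdata.csv /user/hive/census/
```

```sql
-- Step 3: internal (managed) staging table
CREATE TABLE census_internal (
  age INT, workclass STRING, fnlwgt INT, education STRING,
  education_num INT, marital_status STRING, occupation STRING,
  relationship STRING, race STRING, sex STRING, capital_gain INT,
  capital_loss INT, hours_per_week INT, native_country STRING,
  income STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/hive/census/censusdata.csv'
INTO TABLE census_internal;

-- Step 4: internal table partitioned by workclass, filled from
-- the staging table using dynamic partitioning. Note that the
-- partition key is left out of the column list and appears last
-- in the SELECT.
SET hive.exec.dynamic.partition.mode=nonstrict;
CREATE TABLE census_internal_part (
  age INT, fnlwgt INT, education STRING, education_num INT,
  marital_status STRING, occupation STRING, relationship STRING,
  race STRING, sex STRING, capital_gain INT, capital_loss INT,
  hours_per_week INT, native_country STRING, income STRING)
PARTITIONED BY (workclass STRING);

INSERT OVERWRITE TABLE census_internal_part PARTITION (workclass)
SELECT age, fnlwgt, education, education_num, marital_status,
       occupation, relationship, race, sex, capital_gain,
       capital_loss, hours_per_week, native_country, income,
       workclass
FROM census_internal;

-- Step 5: external table over the same HDFS directory. Caution:
-- LOAD DATA INPATH above MOVES the file into the Hive warehouse,
-- so re-upload a copy to /user/hive/census first. Step 6 (external
-- partitioned table) follows the same pattern as Step 4 with the
-- EXTERNAL keyword and a LOCATION clause added.
CREATE EXTERNAL TABLE census_external (
  age INT, workclass STRING, fnlwgt INT, education STRING,
  education_num INT, marital_status STRING, occupation STRING,
  relationship STRING, race STRING, sex STRING, capital_gain INT,
  capital_loss INT, hours_per_week INT, native_country STRING,
  income STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/census';

-- Step 7: the two counting queries (note the execution time of
-- each, on each of the four tables)
SELECT income, sex, COUNT(*) FROM census_internal GROUP BY income, sex;
SELECT income, workclass, COUNT(*) FROM census_internal
GROUP BY income, workclass;

-- Step 8: dropping a managed table removes both data and metadata;
-- dropping an external table removes only the metadata
DROP TABLE census_internal;
DROP TABLE census_external;
```

On the partitioned tables, expect the income-by-workclass query to benefit most, since workclass is the partition key.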
2. DATA INGESTION
Problem Statement 2
In a similar scenario as above, the data is available in a MySQL database. Because RDBMS systems are inefficient at storing and analyzing Big Data, it is recommended that we move the data to the Hadoop Ecosystem.
Ingest the data from the MySQL database into Hive using Sqoop. A data pipeline needs to be created to ingest data from the RDBMS into the Hadoop cluster and then load it into Hive.
To make the analysis faster, use Spark on top of Hive after getting data into the Hadoop cluster. Using Spark, query different tables from Hive to analyze the dataset.
Steps to be performed:
1. Create the necessary structure in a MySQL database using the steps mentioned below:
a. Create a new database in MySQL with the name midproject
b. Create a table in this database with the name census_adult to store the input dataset
c. Load the dataset into the table
d. Verify whether data is loaded properly
e. Check the table for unwanted values such as '?', 'NaN', and 'Null'
f. Get the counts for the columns which contain unwanted data
g. Clean the data by replacing the unwanted values with 'Others'
2. Import the above data from MySQL into a Hive table using Sqoop
3. Connect to PySpark using the web console to access the created Hive table. Perform the following queries and note the execution time of each.
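The pipeline above might look like the following sketch. The connection string, credentials, column types, and example query are assumptions; the Sqoop flags and the PySpark session setup follow their standard usage, but verify them against your environment.

```sql
-- Step 1: MySQL side
CREATE DATABASE midproject;
USE midproject;
CREATE TABLE census_adult (
  age INT, workclass VARCHAR(32), fnlwgt INT, education VARCHAR(32),
  education_num INT, marital_status VARCHAR(32), occupation VARCHAR(32),
  relationship VARCHAR(32), race VARCHAR(32), sex VARCHAR(16),
  capital_gain INT, capital_loss INT, hours_per_week INT,
  native_country VARCHAR(64), income VARCHAR(8));

LOAD DATA LOCAL INFILE 'censusdata.csv' INTO TABLE census_adult
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- Steps 1e/1f: count unwanted values, shown here for workclass;
-- repeat per column
SELECT COUNT(*) FROM census_adult
WHERE TRIM(workclass) IN ('?', 'NaN', 'Null');

-- Step 1g: replace unwanted values (repeat per affected column)
UPDATE census_adult SET workclass = 'Others'
WHERE TRIM(workclass) IN ('?', 'NaN', 'Null');
```

```shell
# Step 2: import from MySQL into Hive with Sqoop
# (-m 1 forces a single mapper since the table has no primary key)
sqoop import \
  --connect jdbc:mysql://localhost:3306/midproject \
  --username root -P \
  --table census_adult \
  --hive-import --hive-table census_adult \
  -m 1
```

```python
# Step 3: query the Hive table from PySpark and time each query
import time
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("CensusAnalytics")
         .enableHiveSupport()   # required to read Hive tables
         .getOrCreate())

start = time.time()
spark.sql("SELECT income, sex, COUNT(*) AS n "
          "FROM census_adult GROUP BY income, sex").show()
print(f"elapsed: {time.time() - start:.2f}s")
```

Wrapping each `spark.sql(...)` call in the same timing pattern lets you compare query times against the Hive CLI runs from Part 1.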