Data Science Assignment Help | What is Data Wrangling? - Codersarts



What is Data Wrangling?


Data wrangling is the process of converting data from its initial format into one that is easier to read and better suited to analysis.


This post uses the following data set:


https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data


Import pandas


Open Jupyter Notebook, or any online Jupyter notebook editor, and import pandas:


import pandas as pd

import matplotlib.pyplot as plt

Read the data and add headers


filename = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv"


headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors", "body-style",
           "drive-wheels", "engine-location", "wheel-base", "length", "width", "height", "curb-weight",
           "engine-type", "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke",
           "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]



Read CSV

df = pd.read_csv(filename, names = headers)


Show data in tabular form


df.head()
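The effect of the names argument can be checked without downloading anything. The sketch below is only an illustration: it reads two made-up rows from an in-memory string (the rows, the shortened header list, and the variable names raw and sample are assumptions, not part of the real data set).

```python
import io

import pandas as pd

# Two made-up, headerless rows standing in for the real file.
raw = io.StringIO("3,?,alfa-romero\n1,164,audi\n")

# Shortened, hypothetical header list for the three columns above.
headers = ["symboling", "normalized-losses", "make"]

# names= supplies the column labels, so the first row is kept as data.
sample = pd.read_csv(raw, names=headers)
print(sample)
```

Because the second column contains "?", pandas reads it as text rather than as numbers, which is why the missing-value and data-type steps are needed.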


The data is displayed in tabular form, and it shows some of the challenges you will face:

  • identify missing data

  • deal with missing data

  • correct data format

Identify and handle missing values


Identify missing values


Convert "?" to NaN


In this data set, missing values are marked with a question mark "?". We replace "?" with NaN (Not a Number), the marker pandas uses for missing values.


Example:


import numpy as np


# replace "?" to NaN

df.replace("?", np.nan, inplace = True)

df.head(5)


This replaces every "?" in the DataFrame with NaN; df.head(5) then displays the first five rows so you can check the result.
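To see the replacement on something small, here is a minimal sketch that uses a made-up three-row table (the values and the names sample and cleaned are assumptions for illustration). It uses the non-inplace form of replace, which returns a new DataFrame.

```python
import numpy as np
import pandas as pd

# Made-up sample in which "?" marks a missing value.
sample = pd.DataFrame({
    "normalized-losses": ["?", "164", "158"],
    "price": ["13495", "?", "17450"],
})

# Replace every "?" in the whole table with NaN.
cleaned = sample.replace("?", np.nan)
print(cleaned)
```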


How to detect missing data:


There are two methods for detecting missing data:


  • .isnull() - returns True where a value is missing and False everywhere else.

  • .notnull() - returns True where a value is present and False where it is missing.


Example:


mis_value = df.isnull()

mis_value.head(5)


Count missing values in each column


Using a for loop:


Example:


Run this loop to see, for each column, how many values are missing (True) and how many are present (False):


for column in mis_value.columns.values.tolist():
    print(column)
    print(mis_value[column].value_counts())
    print("")
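If only the number of missing values per column is needed, a shorter alternative is to sum the boolean table, since True counts as 1. This is a sketch on made-up data (the values and the names sample and missing_counts are assumptions), not the method used in the rest of this post.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing door count and two missing prices.
sample = pd.DataFrame({
    "num-of-doors": ["two", np.nan, "four", "four"],
    "price": [13495.0, 16500.0, np.nan, np.nan],
})

# isnull() gives True/False per cell; sum() adds up the True values per column.
missing_counts = sample.isnull().sum()
print(missing_counts)
```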


How to deal with missing data


Drop data

  • drop the whole row - if a value that the analysis depends on, such as price, is missing in a row, remove that row.

  • drop the whole column - if a column is missing too much of its data to be useful, remove the whole column.


Replace data

  • replace it by mean

  • replace it by frequency - fill the missing entries with the most frequent value. For example, if 84% of the values in a column are "good" and 16% are "bad", the missing entries are replaced by "good".

  • replace it based on other functions

Calculate the average of any column


Example


avg = df["column_name"].astype("float").mean(axis=0)

print("Average of column_name:", avg)


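Replace NaN with the mean value of a column


Example


# assign the result back; calling replace(..., inplace=True) on a selected column may not update df in recent pandas versions
df["column_name"] = df["column_name"].replace(np.nan, avg)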
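The two steps, computing the mean and filling the gaps with it, can be followed on a tiny column. The sketch below assumes a made-up "bore" column stored as text, and it uses fillna as an equivalent way to replace NaN.

```python
import numpy as np
import pandas as pd

# Made-up column: numbers stored as text, with one missing value.
sample = pd.DataFrame({"bore": ["3.0", np.nan, "4.0"]})

# Convert to float first; mean() skips NaN, so avg is (3.0 + 4.0) / 2 = 3.5.
avg = sample["bore"].astype("float").mean()

# Fill the missing entry with the mean and store the result back in the table.
sample["bore"] = sample["bore"].astype("float").fillna(avg)
print(sample)
```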



How to count how often each value occurs in a column


Use value_counts() function


Example:


df['column_name'].value_counts()


The output looks like this. Suppose column_name is qualification; each qualification is listed with the number of times it occurs:


mca 78

bca 45


Find the most common value automatically


df['column_name'].value_counts().idxmax()


Output:


'mca'


Replace NaN with the most frequent value


Example


df["column_name"] = df["column_name"].replace(np.nan, "four")


Every NaN in the column is replaced by the most frequent value, here "four".
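Instead of typing the most frequent value by hand, it can be looked up with value_counts().idxmax() and passed straight to the fill step. A minimal sketch on a made-up "num-of-doors" column (the values and the name most_common are assumptions):

```python
import numpy as np
import pandas as pd

# Made-up column with one missing entry.
sample = pd.DataFrame({"num-of-doors": ["four", "two", np.nan, "four"]})

# value_counts() ignores NaN; idxmax() returns the label with the highest count.
most_common = sample["num-of-doors"].value_counts().idxmax()

# Fill the missing entry with that label.
sample["num-of-doors"] = sample["num-of-doors"].fillna(most_common)
print(sample)
```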


Drop every row that has NaN in a given column


Suppose the column is "price":


df.dropna(subset=["price"], axis=0, inplace=True)


# reset index, because we dropped two rows


df.reset_index(drop=True, inplace=True)
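The reason for resetting the index shows up on a small table: after a row is dropped, the remaining rows keep their old labels and leave a gap. The sketch below uses made-up rows and the non-inplace forms of dropna and reset_index (the name trimmed is an assumption).

```python
import numpy as np
import pandas as pd

# Made-up sample in which the middle row has no price.
sample = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "price": [13950.0, np.nan, 16845.0],
})

# Drop rows without a price; the surviving rows are labelled 0 and 2.
dropped = sample.dropna(subset=["price"], axis=0)

# Renumber the rows from 0 and discard the old labels.
trimmed = dropped.reset_index(drop=True)
print(trimmed)
```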


Correct data format


In pandas, we use

  • .dtypes to check the data types (it is an attribute, so it takes no parentheses)

  • .astype() to change the data type


List the data types:


Use this syntax to list the data type of every column:


syntax:

df.dtypes


How to convert columns to the proper data type


Convert each column to the type that suits its values:


Syntax:


df[["column1", "column2"]] = df[["column1", "column2"]].astype("float")

df[["column3"]] = df[["column3"]].astype("int")

df[["column4"]] = df[["column4"]].astype("float")

df[["column5"]] = df[["column5"]].astype("float")


Check the result again with the following.

It lists the data type of every column, so you can verify that the types have changed.


Syntax:


df.dtypes
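As a hedged alternative to converting columns one group at a time, astype also accepts a dictionary that maps column names to types. The sketch below uses two made-up text columns; the column choice and values are assumptions for illustration.

```python
import pandas as pd

# Made-up sample: numeric values that were read in as text.
sample = pd.DataFrame({
    "bore": ["3.47", "2.68"],
    "normalized-losses": ["164", "158"],
})

# Convert several columns in one call with a column-to-type mapping.
sample = sample.astype({"bore": "float", "normalized-losses": "int"})
print(sample.dtypes)
```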

Data Standardization


What is Standardization?

Standardization is the process of transforming data into a common format so that meaningful comparisons can be made.


Example

Transform mpg to L/100km

The formula for unit conversion is

L/100km = 235 / mpg


First look at the data with this syntax:


Syntax:


df.head()


Example:


Convert mpg to L/100km by mathematical operation


df['city-L/100km'] = 235/df["city-mpg"]


This adds a new column, city-L/100km, computed from city-mpg; the city-mpg column itself is unchanged.


# check your transformed data


df.head()
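The same conversion can be applied to more than one column by wrapping the formula in a small function. This is a sketch only: the helper name mpg_to_l_per_100km and the sample values are assumptions, and 235 is the rounded constant used in the formula above.

```python
import pandas as pd

def mpg_to_l_per_100km(mpg):
    """Convert miles per gallon to litres per 100 km using 235 / mpg."""
    return 235 / mpg

# Made-up sample values.
sample = pd.DataFrame({"city-mpg": [47, 25], "highway-mpg": [50, 40]})

# Add one converted column per mpg column; the originals stay in place.
for column in ["city-mpg", "highway-mpg"]:
    new_name = column.replace("mpg", "L/100km")
    sample[new_name] = mpg_to_l_per_100km(sample[column])

print(sample)
```

For instance, 47 mpg becomes 235 / 47 = 5.0 L/100km.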

Data Normalization


Why normalization?

Normalization is the process of transforming values of several variables into a similar range.


Example:


# replace (original value) by (original value)/(maximum value)


df['length'] = df['length']/df['length'].max()

df['width'] = df['width']/df['width'].max()
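Dividing by the maximum is one common way to normalize. Two other widely used options, which this post does not otherwise cover, are min-max scaling and the z-score. The sketch below compares all three on a made-up "length" column.

```python
import pandas as pd

# Made-up sample values.
sample = pd.DataFrame({"length": [150.0, 175.0, 200.0]})
col = sample["length"]

# Simple feature scaling (as used above): value / max.
simple = col / col.max()

# Min-max scaling: (value - min) / (max - min), giving a range of 0 to 1.
min_max = (col - col.min()) / (col.max() - col.min())

# Z-score: (value - mean) / standard deviation, centred on 0.
z_score = (col - col.mean()) / col.std()

print(pd.DataFrame({"simple": simple, "min_max": min_max, "z_score": z_score}))
```

With these values, simple scaling gives 0.75, 0.875 and 1.0, min-max gives 0.0, 0.5 and 1.0, and the z-score gives -1.0, 0.0 and 1.0.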


Binning


Why binning?

Binning is a process of transforming continuous numerical variables into discrete categorical 'bins', for grouped analysis.
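One possible way to bin a column is to build equal-width bin edges with numpy.linspace and assign labels with pandas.cut. The sketch below is an assumption-laden illustration: the horsepower values, the choice of three bins, and the labels Low, Medium and High are all made up.

```python
import numpy as np
import pandas as pd

# Made-up sample values.
sample = pd.DataFrame({"horsepower": [48, 70, 111, 154, 200, 262]})

# Three equal-width bins need four edges between the minimum and maximum.
bins = np.linspace(sample["horsepower"].min(), sample["horsepower"].max(), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum value inside the first bin.
sample["horsepower-binned"] = pd.cut(
    sample["horsepower"], bins, labels=group_names, include_lowest=True
)
print(sample)
```

The new column can then be summarised with value_counts() to see how many rows fall into each bin.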


Indicator variable (or dummy variable)


What is an indicator variable?

An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning.


Why do we use indicator variables?

So that categorical variables can be used in regression analysis, which needs numerical inputs.
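In pandas, indicator variables can be created with get_dummies. The sketch below assumes a made-up "fuel-type" column with two categories; the prefix and the integer output type are choices made for this illustration.

```python
import pandas as pd

# Made-up sample with two categories.
sample = pd.DataFrame({"fuel-type": ["gas", "diesel", "gas"]})

# One 0/1 column per category; dtype=int gives integers instead of booleans.
dummies = pd.get_dummies(sample["fuel-type"], prefix="fuel-type", dtype=int)

# Attach the indicator columns to the original table.
sample = pd.concat([sample, dummies], axis=1)
print(sample)
```

Each row has a 1 in the column that matches its category and a 0 in the other.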

