
Wholesale data set - classification and clustering

Updated: Nov 3, 2021


Description :


This data set describes the clients of a wholesale distributor operating in different regions of Portugal. It records each client's annual spending, in monetary units (m.u.), on six product categories. The dataset covers 440 clients across three regions (Lisbon, Oporto, Other) and two sales channels (Hotel, Retail).


Recommended Model :


Suggested algorithms include an XGBoost classifier, logistic regression, and k-means clustering.
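As a rough illustration of the clustering option, here is a minimal k-means sketch using scikit-learn. It runs on synthetic, made-up spending values rather than the real file. The column names are assumed to follow the CSV header, and both the log transform and the choice of two clusters are assumptions made for the example, not recommendations from the data set.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six spending columns (values are made up;
# column names are assumed to match the CSV header).
rng = np.random.default_rng(0)
spend_cols = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]
toy = pd.DataFrame(rng.lognormal(mean=8, sigma=1, size=(60, 6)), columns=spend_cols)

# Log-transform the right-skewed toy values, then standardize each column
# so that no single category dominates the distance computation.
features = StandardScaler().fit_transform(np.log1p(toy))

# n_clusters=2 is an arbitrary choice for illustration.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy["cluster"] = kmeans.fit_predict(features)
print(toy["cluster"].value_counts())
```

With the real data, the resulting cluster labels could be compared against the Channel column to see whether the clusters line up with the sales channels.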


Recommended Projects :


Predict which regions and channels spend the most and which spend the least.
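One simple starting point for this question is to total each customer's spending and average it per region and channel. The sketch below uses a tiny made-up table with only two spending columns. The `Total` column and the `toy` frame are hypothetical names introduced for the example.

```python
import pandas as pd

# Toy rows with made-up values; the real data has 440 rows and six spending columns.
toy = pd.DataFrame({
    "Channel": ["Hotel", "Hotel", "Retail", "Retail"],
    "Region": ["Lisbon", "Other", "Lisbon", "Other"],
    "Fresh": [100, 200, 50, 75],
    "Milk": [10, 20, 300, 150],
})
spend_cols = ["Fresh", "Milk"]

# Total spend per customer, then the mean per (Region, Channel) group,
# sorted from the highest-spending group to the lowest.
toy["Total"] = toy[spend_cols].sum(axis=1)
mean_spend = (
    toy.groupby(["Region", "Channel"])["Total"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_spend)
```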

Dataset link



Overview of data


Detailed overview of dataset

  • Records in the dataset = 440 rows

  • Columns in the dataset = 8 columns

  1. FRESH: annual spending (m.u.) on fresh products (continuous)

  2. MILK: annual spending (m.u.) on milk products (continuous)

  3. GROCERY: annual spending (m.u.) on grocery products (continuous)

  4. FROZEN: annual spending (m.u.) on frozen products (continuous)

  5. DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (continuous)

  6. DELICATESSEN: annual spending (m.u.) on delicatessen products (continuous)

  7. CHANNEL: sales channel (Hotel or Retail)

  8. REGION: customer region (Lisbon, Oporto, Other)


EDA[Code]


Data

import pandas as pd
#Load Data

file_loc = "data\\Wholesale customers data.csv"
wholesale_cust_data = pd.read_csv(file_loc)
wholesale_cust_data.head()

[Figure: Dataset]

Total number of Rows and Columns


row_col = wholesale_cust_data.shape
print("Total number of rows in the dataset : {}".format(row_col[0]))
print("Total number of columns in the dataset : {}".format(row_col[1]))

[Figure: Rows and columns]

Details about dataset


# check information 
wholesale_cust_data.info()

[Figure: Data information]

Check the number of missing values in the dataset


# missing values
wholesale_cust_data.isna().sum()

[Figure: Missing values output]

Statistical information


# statistical information 
wholesale_cust_data.describe()

[Figure: Statistical information]

Data Visualization


Correlation

import seaborn as sns
import matplotlib.pyplot as plt
# correlation
corr = wholesale_cust_data.corr()
corr.style.background_gradient(cmap='cubehelix')

[Figure: Correlation]
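A colored correlation table can be hard to scan for the strongest relationships. As a possible complement, the sketch below lists each column pair once, ordered by absolute correlation. It uses made-up numbers, and `pairs` is a name introduced only for this example.

```python
import numpy as np
import pandas as pd

# Made-up numeric columns for illustration only.
toy = pd.DataFrame({
    "Grocery": [10, 20, 30, 40, 50],
    "Detergents_Paper": [1, 2, 3, 4, 6],
    "Fresh": [50, 10, 40, 20, 30],
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the
# diagonal (self-correlation of 1.0) is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)
print(pairs)
```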

Count the values in the Channel and Region columns


# Replace the numeric channel codes with readable labels
wholesale_cust_data["Channel"] = wholesale_cust_data["Channel"].replace({1: "Hotel", 2: "Retail"})

# Replace the numeric region codes with readable labels
wholesale_cust_data["Region"] = wholesale_cust_data["Region"].replace({1: "Lisbon", 2: "Oporto", 3: "Other"})

# Plot the number of customers per channel and per region
sns.set_style("whitegrid")
columns = ['Channel', 'Region']
for col in columns:
    plt.figure(figsize=(8, 5))
    sns.countplot(x=col, data=wholesale_cust_data)
    plt.show()


[Figure: Channel count]

[Figure: Region count]

Channel count by Regions


sns.set_style('whitegrid')
sns.countplot(x="Channel",hue='Region',data=wholesale_cust_data)


[Figure: Channel count by region]
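The counts behind a grouped count plot like this one can also be read as a table. A minimal sketch with `pd.crosstab`, run on a handful of made-up rows rather than the real data:

```python
import pandas as pd

# Made-up rows for illustration; labels match the replaced Channel/Region values.
toy = pd.DataFrame({
    "Channel": ["Hotel", "Hotel", "Retail", "Hotel", "Retail"],
    "Region": ["Lisbon", "Other", "Other", "Other", "Oporto"],
})

# Rows are channels, columns are regions, cells are customer counts.
counts = pd.crosstab(toy["Channel"], toy["Region"])
print(counts)
```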

Box plots of each column


# Select the numeric spending columns (Channel and Region now hold strings)
cols = wholesale_cust_data.select_dtypes(exclude='object').columns

sns.set_theme(style="whitegrid")
for i in cols:
    plt.figure(figsize=(10, 3))
    sns.boxplot(x=wholesale_cust_data[i])
    plt.show()

[Figure: Box plots]
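Box plots show outliers visually. To put a number on them, one option is to count the values that fall outside the 1.5 × IQR fences, the same rule box plot whiskers use by default. The helper `iqr_outlier_counts` below is a hypothetical function written for this sketch, and the data is made up.

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Made-up values: one obvious outlier in Fresh, none in Milk.
toy = pd.DataFrame({
    "Fresh": [10, 12, 11, 13, 12, 200],
    "Milk": [5, 6, 5, 7, 6, 6],
})
outlier_counts = iqr_outlier_counts(toy)
print(outlier_counts)
```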

