
Avocado Prices Dataset - Regression

Updated: Nov 3, 2021

Description:

This dataset provides weekly 2018 retail scan data for national retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers and is based on actual retail sales of Hass avocados. Starting in 2013, the data reflects an expanded, multi-outlet retail data set. Multi-outlet reporting aggregates the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the data reflects a per-unit (per-avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLUs) in the data are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.

Recommended Model:

Algorithms to be used: regression, SVM, RandomForestRegressor, time series analysis, etc.

Recommended Project:

Avocado Price Prediction

Dataset link:

Overview of data

Detailed overview of dataset:

- Rows = 18249

- Columns = 13

  1. Date: The date of the observation

  2. AveragePrice: the average price of a single avocado

  3. type: conventional or organic

  4. year: the year

  5. Region: the city or region of the observation

  6. Total Volume: Total number of avocados sold

  7. 4046: Total number of avocados with PLU 4046 sold

  8. 4225: Total number of avocados with PLU 4225 sold

  9. 4770: Total number of avocados with PLU 4770 sold

This data was downloaded from the Hass Avocado Board website in May 2018 and compiled into a single CSV.


import pandas as pd

# load data
data = pd.read_csv('avocado.csv')
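Since time series analysis is among the recommended approaches, it may help to parse the Date column while reading the file, because pandas reads it as plain strings by default. The sketch below is only an illustration: the three rows in `csv_text` are hypothetical values that mimic part of the column layout, and the names `csv_text`, `sample` and `weekly` are not from the original page.

```python
import io

import pandas as pd

# Hypothetical rows that mimic part of the column layout; NOT real observations.
csv_text = """Date,AveragePrice,Total Volume,type,year
2015-01-11,1.10,1000.0,conventional,2015
2015-01-04,1.20,1200.0,conventional,2015
2015-01-04,1.80,300.0,organic,2015
"""

# parse_dates converts the Date column to datetime64 while reading.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Sort chronologically within each type so each weekly series is in order.
sample = sample.sort_values(["type", "Date"]).reset_index(drop=True)

# Mean price per date and type, with one column per type.
weekly = sample.groupby(["Date", "type"])["AveragePrice"].mean().unstack()
print(weekly)
```

With the real file, the same `parse_dates=["Date"]` argument could be passed to the `pd.read_csv` call, assuming the column is named `Date` as listed in the overview.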

# check details of the dataframe
data.info()

# check the no. of missing values in each column
data.isnull().sum()

# statistical information about the dataset
data.describe()
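Summary statistics such as the mean and quartiles do not directly report how lopsided a column is, so a skewness check can be a useful companion before plotting. The sketch below uses synthetic data (not real avocado values) and assumes, purely for illustration, that a volume column has a long right tail; `demo`, `skew_raw` and `skew_log` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with a long right tail (NOT real avocado values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "AveragePrice": rng.normal(1.4, 0.3, size=1000),
    "Total Volume": rng.lognormal(mean=10, sigma=2, size=1000),
})

# Skewness near 0 suggests a roughly symmetric column; large positive
# values indicate a long right tail.
skew_raw = demo.skew()

# log1p compresses the tail, which often makes histograms easier to read.
skew_log = np.log1p(demo["Total Volume"]).skew()

print(skew_raw)
print(f"Total Volume skew after log1p: {skew_log:.2f}")
```

Whether the real columns behave this way is something to verify on the loaded DataFrame, for example with `data.skew(numeric_only=True)`.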

# data distribution  

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='type', data=data)

sns.countplot(x='year', data=data)

sns.histplot(data['AveragePrice'], kde=False)

sns.histplot(data['Total Volume'], bins=20, kde=False)

sns.histplot(data['4046'], bins=20, kde=False)

sns.histplot(data['4225'], bins=20, kde=False)

sns.histplot(data['4770'], bins=20, kde=False)

sns.histplot(data['Total Bags'], bins=20, kde=False)

sns.histplot(data['Small Bags'], bins=20, kde=False)

sns.histplot(data['Large Bags'], bins=20, kde=False)

sns.histplot(data['XLarge Bags'], bins=20, kde=False)
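For the recommended price-prediction project, a baseline with RandomForestRegressor might look like the sketch below. It is an illustration rather than a reference solution: it assumes scikit-learn is installed, it uses a small synthetic DataFrame in place of the real CSV, and the price rule, feature choice and hyperparameters are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the avocado data (NOT real values); replace it
# with the DataFrame loaded from the CSV.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "type": rng.choice(["conventional", "organic"], size=n),
    "year": rng.choice([2015, 2016, 2017, 2018], size=n),
    "Total Volume": rng.uniform(1e3, 1e6, size=n),
})
# Hypothetical price rule so the model has something to learn.
data["AveragePrice"] = (
    1.0 + 0.5 * (data["type"] == "organic") + rng.normal(0, 0.1, size=n)
)

# One-hot encode the categorical column; numeric columns stay as they are.
X = pd.get_dummies(data[["type", "year", "Total Volume"]], columns=["type"])
y = data["AveragePrice"]

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"MAE on held-out rows: {mae:.3f}")
```

A random split is used here only for brevity. Because the observations are weekly, a chronological split (training on earlier dates, testing on later ones) is probably a fairer way to evaluate a model meant for forecasting.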

Other datasets for regression:

Bike Sharing Dataset,

Medical Cost Personal Dataset,

Concrete Strength Dataset

If you need implementation for any of the topics mentioned above or assignment help on any of its variants, feel free to contact us.
