Introduction
This tutorial demonstrates how to build a basic recommendation system using a dataset of property listings. The goal is to develop functions that recommend neighborhoods based on a user's budget and suggest prices for hosting properties based on proximity to similar listings. These recommendations can be helpful for potential customers seeking properties within a specific price range or hosts looking to price their properties competitively.
Prerequisites
Before starting this tutorial, you should have:
Basic Python Knowledge: Familiarity with Python programming, including working with libraries like Pandas.
Python Environment: A Python environment set up on your machine, such as Jupyter Notebook, Google Colab, or a local Python installation.
Installed Libraries: Ensure that you have the following Python libraries installed:
Pandas
NumPy
Geopandas
Scikit-learn
Matplotlib (for visualization)
What You Will Learn
By going through this code, you will learn:
Loading and Exploring Data:
How to read a CSV file into a Pandas DataFrame.
How to inspect the data using functions like head() and info() to understand its structure and types.
Data Cleaning:
How to clean and preprocess a specific column in the dataset (in this case, the price column).
How to convert text data into a numeric format suitable for analysis.
Building a Recommender System:
How to create a function that recommends neighborhoods based on a budget range. This includes filtering data, grouping, and counting occurrences.
How to extend the recommendation to consider both absolute and relative counts of listings within neighborhoods.
Price Suggestion Algorithm:
How to calculate the suggested price for hosting properties based on nearby listings. This involves computing distances between coordinates and averaging prices.
How to incorporate optional filters, such as room types, to refine the recommendations.
Dataset
The dataset contains Airbnb data for 250,000+ listings in 10 major cities, including information about hosts, pricing, location, and room type, along with over 5 million historical reviews.
Link to the dataset: https://www.kaggle.com/datasets/mysarahmadbhat/airbnb-listings-reviews
NOTE: Prices are in local currency
Loading and Exploring the Data
import pandas as pd
import numpy as np
import re
# Load the dataset
data = pd.read_csv('listings.csv')
data.head()
# Get an overview of the dataset
data.info()
Loading the Dataset: The pd.read_csv() function is used to load the data from a CSV file named listings.csv into a Pandas DataFrame named data.
Previewing the Data: data.head() displays the first five rows of the DataFrame, giving you a quick look at the data's structure and contents.
Data Overview: data.info() provides a concise summary of the DataFrame, including the number of entries (rows), the data types of each column, and the number of non-null values in each column. This step is essential for understanding the dataset and identifying any potential data cleaning or preprocessing needs.
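The loading and inspection steps can be tried without the real file. The snippet below is a minimal sketch that reads a made-up three-row sample from memory; the column names and values are assumptions chosen for illustration, not taken from listings.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for listings.csv
sample_csv = io.StringIO(
    "listing_id,neighbourhood_cleansed,price,room_type\n"
    "1,Centrum-West,\"$1,250.00\",Entire home/apt\n"
    "2,Centrum-Oost,$80.00,Private room\n"
    "3,Centrum-West,$95.50,Private room\n"
)
sample = pd.read_csv(sample_csv)

print(sample.head())  # first rows (all three here)
sample.info()         # column dtypes and non-null counts
# The price column is read as text because of the '$' sign and the comma
print(pd.api.types.is_numeric_dtype(sample["price"]))  # False
```

Seeing a text type for price in the info() output is the cue that the column needs cleaning before any numeric filtering.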
Cleaning the Price Column
# Function to clean the price column and convert it to a float
def clean(price):
    # Remove commas from the price string (str() also guards against non-string entries)
    price = str(price).replace(',', '')
    # Extract numerical values, with or without decimals, using regex
    matches = re.findall(r'\d+(?:\.\d+)?', price)
    # Convert the first match to float; return NaN when there is no number
    return float(matches[0]) if matches else float('nan')
# Apply the cleaning function to the price column
data.price = data.price.apply(clean)
data.price = pd.to_numeric(data.price)
# Check the updated data types and structure
data.info()
Cleaning Function: The clean() function takes a price value as input, removes any commas (using replace()), and extracts the numeric value using a regular expression (re.findall()). The function then converts this value to a float.
Applying the Function: data.price.apply(clean) applies the clean() function to each entry in the price column. This transforms the price data into a clean, numeric format.
Ensuring Numeric Format: pd.to_numeric(data.price) ensures that the entire price column is of numeric type, which is crucial for any numerical analysis or calculations.
Verification: The final data.info() call checks that the price column is now correctly formatted as a float and that the DataFrame is ready for further analysis.
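A quick way to check the cleaning idea is to run it on a few made-up strings. The sketch below uses a hypothetical helper named clean_price (not part of the tutorial) and assumes prices may also appear without decimals or be missing entirely.

```python
import re
import pandas as pd

def clean_price(price):
    """Hypothetical variant of clean() that also accepts whole-number prices."""
    text = str(price).replace(',', '')             # drop thousands separators
    matches = re.findall(r'\d+(?:\.\d+)?', text)   # number with optional decimals
    return float(matches[0]) if matches else float('nan')

# Made-up examples: with decimals, without decimals, and with no number at all
prices = pd.Series(['$1,250.00', '$80.00', '$95', 'n/a'])
cleaned = prices.apply(clean_price)
print(cleaned.tolist())  # [1250.0, 80.0, 95.0, nan]
```

Returning NaN for unparseable entries keeps the column numeric, and Pandas skips NaN values when computing means.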
Building Recommender System 1: Neighborhood Recommendation
def recommend_neighboorhood(df, budget_min, budget_max, relative=False):
    '''
    Recommends neighborhoods to customers based on their budget range.
    Parameters:
    - df: Pandas DataFrame containing listing data.
    - budget_min: Minimum budget (float) specified by the customer.
    - budget_max: Maximum budget (float) specified by the customer.
    - relative: Boolean, if True, recommendations are based on relative counts.
    Returns:
    - The neighborhood name with the most properties in the specified budget range,
      or None if no listing falls in that range.
    '''
    # Filter data based on the budget range (inclusive on both ends)
    filtered_data_on_budget = df[df.price.between(float(budget_min), float(budget_max))]
    # Nothing to recommend if no listing falls within the budget
    if filtered_data_on_budget.empty:
        return None
    # Group by neighborhood and count listings, largest count first
    neighbourhood_counts = filtered_data_on_budget.groupby('neighbourhood_cleansed').size().sort_values(ascending=False)
    # Convert Series to DataFrame with columns 'neighbourhood_cleansed' and 'counts'
    df2 = neighbourhood_counts.rename('counts').reset_index()
    # Recommendation based on absolute numbers
    if not relative:
        neighbourhood = df2.neighbourhood_cleansed.iloc[0]
    else:
        # Calculate relative counts (share of all in-budget listings)
        df2['relative'] = df2.counts / df2.counts.sum()
        df2 = df2.sort_values(by='relative', ascending=False)
        neighbourhood = df2.neighbourhood_cleansed.iloc[0]
    return neighbourhood
Function Purpose: The recommend_neighboorhood() function recommends a neighborhood based on a given budget range. It can consider either absolute counts of listings or relative proportions.
Budget Filtering: The df[df.price.between(float(budget_min), float(budget_max))] line filters the DataFrame to include only those listings whose price falls within the specified range.
Grouping and Counting: The function groups the filtered data by neighborhood and counts the number of listings per neighborhood. This count is sorted in descending order.
Handling Absolute vs. Relative Counts: If the relative parameter is False, the function recommends the neighborhood with the most listings (absolute numbers). If relative is True, it recommends based on the proportion of listings in each neighborhood.
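Dividing every count by the same total, as the relative branch does, leaves the ordering of neighborhoods unchanged. One possible alternative reading of "relative" is the share of each neighborhood's own listings that fall within the budget. The sketch below illustrates that reading under stated assumptions: in_budget_share is a hypothetical helper and the toy data is made up.

```python
import pandas as pd

def in_budget_share(df, budget_min, budget_max):
    """Share of each neighbourhood's own listings that fall in the budget."""
    in_budget = df['price'].between(float(budget_min), float(budget_max))
    # The mean of a boolean column per group is the fraction of True values
    share = in_budget.groupby(df['neighbourhood_cleansed']).mean()
    return share.sort_values(ascending=False)

# Made-up data: A has more in-budget listings (3 of 6), B a higher share (2 of 3)
toy = pd.DataFrame({
    'neighbourhood_cleansed': ['A'] * 6 + ['B'] * 3,
    'price': [40, 45, 48, 300, 400, 450, 30, 35, 500],
})
shares = in_budget_share(toy, 10, 50)
print(shares)
print(shares.index[0])  # B
```

With this definition a small neighborhood where most listings are affordable can outrank a large one that merely has more listings overall.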
Testing Recommender 1
recommend_neighboorhood(data,10,50)
'Oostelijk Havengebied - Indische Buurt'
recommend_neighboorhood(data,10,50,True)
'De Baarsjes - Oud-West'
recommend_neighboorhood(data,7000,8000)
'Oostelijk Havengebied - Indische Buurt'
recommend_neighboorhood(data,7000,8000,True)
'Centrum-West'
recommend_neighboorhood(data,0,10)
'Oostelijk Havengebied - Indische Buurt'
recommend_neighboorhood(data,0,10,True)
'Centrum-Oost'
Building Recommender System 2: Price Recommendation
def recommend_price(df, latitude, longitude, n_neighbours, room_type=None):
    '''
    Suggests a price for hosting based on nearby listings and, optionally, room type.
    Parameters:
    - df: Pandas DataFrame containing listing data.
    - latitude: Latitude of the property location.
    - longitude: Longitude of the property location.
    - n_neighbours: Number of neighboring properties to consider.
    - room_type: Optional; restricts to properties of a specific room type.
    Returns:
    - Suggested price based on nearby listings.
    '''
    # Calculate Euclidean distance (in degrees) between the given coordinates and the listings;
    # assign() returns a copy, so the caller's DataFrame is left unchanged
    df = df.assign(distance=np.sqrt((longitude - df.longitude) ** 2 + (latitude - df.latitude) ** 2))
    # Sort the listings by distance
    df = df.sort_values(by='distance')
    # Select the closest n_neighbours
    top_neighbours = df[:n_neighbours]
    # If no room_type is specified, calculate the mean price of the closest listings
    if not room_type:
        mean_price = top_neighbours.price.mean()
    else:
        # Filter the selected neighbours by room_type if specified
        top_neighbours = top_neighbours[top_neighbours.room_type == room_type]
        # Handle case where no matching room types are found
        if top_neighbours.empty:
            print('Room not available')
            return None
        else:
            mean_price = top_neighbours.price.mean()
    return mean_price
Function Purpose: The recommend_price() function suggests a price for a property based on the prices of nearby listings. The recommendation can be refined by specifying a room type.
Distance Calculation: The Euclidean distance between the input coordinates and each listing's coordinates is computed to find the closest listings.
Sorting and Filtering: Listings are sorted by distance, and the n_neighbours closest listings are selected.
Room Type Filtering: If a specific room type is provided, the function filters the listings accordingly. If no matching listings are found, it returns None and prints a message.
Price Suggestion: The function calculates the mean price of the selected listings and returns it as the suggested price.
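Euclidean distance on raw latitude and longitude treats a degree of longitude as equal in length to a degree of latitude, which is only approximately true near the equator. A common alternative is the haversine (great-circle) distance. The following is a minimal sketch, not part of the tutorial's code: haversine_km is a hypothetical helper, the Earth radius is the usual mean value, and the toy coordinates are made up.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or array-likes in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Made-up listings around one city
toy = pd.DataFrame({
    'latitude': [52.37, 52.36, 52.40],
    'longitude': [4.89, 4.90, 4.80],
    'price': [100.0, 120.0, 200.0],
})
distances = haversine_km(52.37, 4.89, toy['latitude'], toy['longitude'])
# Average the prices of the two nearest listings
nearest = toy.assign(distance_km=distances).nsmallest(2, 'distance_km')
print(nearest['price'].mean())  # 110.0
```

Within a single city the two measures usually rank neighbours similarly, but haversine returns distances in kilometres, which makes a distance cutoff easier to reason about.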
Testing Recommender 2
recommend_price(data, 4,52,10)
321.4
recommend_price(data, 4,52,5)
516.8
recommend_price(data, 40,52,5,'Private room')
99.0
recommend_price(data, 4,52,5, 'Entire home/apt')
814.6666666666666
recommend_price(data, 4,52,5, 'Hotel room')
Room not available
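The "Room not available" result appears whenever none of the n nearest listings has the requested room type, because recommend_price() filters by room type only after selecting the neighbours. If the intent is to find the nearest listings of a given type, the filter can be applied first. The sketch below shows that variant under stated assumptions: recommend_price_by_type is a hypothetical name and the toy data is made up.

```python
import numpy as np
import pandas as pd

def recommend_price_by_type(df, latitude, longitude, n_neighbours, room_type=None):
    """Hypothetical variant: filter by room type first, then take the nearest n."""
    candidates = df if room_type is None else df[df['room_type'] == room_type]
    if candidates.empty:
        return None  # no listing of this type anywhere in the data
    # Same Euclidean distance in degrees as recommend_price()
    distance = np.sqrt((longitude - candidates['longitude']) ** 2
                       + (latitude - candidates['latitude']) ** 2)
    nearest = candidates.assign(distance=distance).nsmallest(n_neighbours, 'distance')
    return nearest['price'].mean()

# Made-up listings; the only hotel room is the farthest one
toy = pd.DataFrame({
    'latitude': [52.370, 52.371, 52.372, 52.400],
    'longitude': [4.890, 4.891, 4.892, 4.950],
    'room_type': ['Private room', 'Private room', 'Entire home/apt', 'Hotel room'],
    'price': [80.0, 90.0, 200.0, 150.0],
})
print(recommend_price_by_type(toy, 52.37, 4.89, 2, 'Hotel room'))  # 150.0
```

The trade-off is that the returned price may come from listings far from the property, so a maximum distance could be worth adding in practice.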
Putting it All Together
# Import libraries
import pandas as pd
import numpy as np
import re
# Read data
data = pd.read_csv('listings.csv')
data.head()
data.info()
# Clean the price column and convert it to float
def clean(price):
    # Remove commas, then extract the first number (with or without decimals)
    price = str(price).replace(',', '')
    matches = re.findall(r'\d+(?:\.\d+)?', price)
    # Return NaN when the entry contains no number
    return float(matches[0]) if matches else float('nan')
data.price = data.price.apply(clean)
data.price = pd.to_numeric(data.price)
data.info()
# Recommender 1
def recommend_neighboorhood(df, budget_min, budget_max, relative=False):
    '''
    A function that recommends a neighbourhood to customers based on the budget range they provide
    parameters:
    df = a pandas dataframe that contains listing data
    budget_min = type: float ; minimum budget set by the customer
    budget_max = type: float ; maximum budget set by the customer
    relative = type: boolean ; True if the customer wants to consider relative numbers
    returns:
    Name of the neighbourhood with max properties in the given budget (None if there are none).
    '''
    # select rows that have price value between the given budget range (inclusive on boundaries)
    filtered_data_on_budget = df[df.price.between(float(budget_min), float(budget_max))]
    # nothing to recommend if no listing falls within the budget
    if filtered_data_on_budget.empty:
        return None
    # group by neighbourhood and get the count, sorted by counts in descending order
    neighbourhood_counts = filtered_data_on_budget.groupby('neighbourhood_cleansed').size().sort_values(ascending=False)
    # convert to a dataframe with columns 'neighbourhood_cleansed' and 'counts'
    df2 = neighbourhood_counts.rename('counts').reset_index()
    # if considering absolute numbers
    if not relative:
        # select the neighbourhood with maximum counts (absolute numbers)
        neighbourhood = df2.neighbourhood_cleansed.iloc[0]
    # if considering relative numbers
    else:
        # get the relative count (share of all in-budget listings)
        df2['relative'] = df2.counts / df2.counts.sum()
        # sort in descending order based on relative numbers
        df2 = df2.sort_values(by='relative', ascending=False)
        # select the neighbourhood with maximum counts (relative numbers)
        neighbourhood = df2.neighbourhood_cleansed.iloc[0]
    return neighbourhood
# Test cases
recommend_neighboorhood(data,10,50)
recommend_neighboorhood(data,10,50,True)
recommend_neighboorhood(data,7000,8000)
recommend_neighboorhood(data,7000,8000,True)
recommend_neighboorhood(data,0,10)
recommend_neighboorhood(data,0,10,True)
# Recommender 2
def recommend_price(df, latitude, longitude, n_neighbours, room_type=None):
    '''
    A function that returns the suggested price for hosting based on the rates of a selected number of listings near the property
    and the room type (if given).
    parameters:
    df = a pandas dataframe that contains listing data
    latitude = type: float ; latitude of location
    longitude = type: float ; longitude of location
    n_neighbours = type: int ; the number of neighbouring properties the user wants to take into account
    room_type = type: str ; if specified, keeps only the selected neighbours of the given room type
    returns:
    suggested price
    '''
    # calculate the euclidean distance between the given coordinates and the coordinates in the dataframe
    # and put it in a column named distance (assign() returns a copy, so the caller's dataframe is unchanged)
    df = df.assign(distance=np.sqrt((longitude - df.longitude) ** 2 + (latitude - df.latitude) ** 2))
    # sort the dataframe by the calculated distance in ascending order
    df = df.sort_values(by='distance')
    # select the top n number of neighbours
    top_neighbours = df[:n_neighbours]
    # if room_type is not given:
    if not room_type:
        # calculate the mean price of all rows to be the suggested price
        mean_price = top_neighbours.price.mean()
    # if room_type is given:
    else:
        # select rows with the given room type
        top_neighbours = top_neighbours[top_neighbours.room_type == room_type]
        # if dataframe is empty, then print ('Room not available'):
        if top_neighbours.empty:
            print('Room not available')
            return None
        # if dataframe is not empty:
        else:
            # calculate the mean price of all rows to be the suggested price
            mean_price = top_neighbours.price.mean()
    return mean_price
# Test cases
recommend_price(data, 4,52,10)
recommend_price(data, 4,52,5)
recommend_price(data, 40,52,5,'Private room')
recommend_price(data, 4,52,5, 'Entire home/apt')
recommend_price(data, 4,52,5, 'Hotel room')