Dataset Description
The dataset consists of 768 samples and 8 features. It was created for the energy analysis of buildings, using different building shapes simulated in Ecotect. The buildings vary in glazing area, glazing area distribution, orientation, and other parameters. The goal is to predict two real-valued responses: heating load (y1) and cooling load (y2). The dataset can also be framed as a multi-class classification problem if the response values are rounded to the nearest integer.
Attribute Information
The dataset contains eight attributes (features), denoted X1 to X8:
a) X1: Relative Compactness - the relative compactness of the building.
b) X2: Surface Area - the total surface area of the building.
c) X3: Wall Area - the total area of the building's walls.
d) X4: Roof Area - the total area of the building's roof.
e) X5: Overall Height - the overall height of the building.
f) X6: Orientation - the orientation of the building.
g) X7: Glazing Area - the total glazing area of the building.
h) X8: Glazing Area Distribution - the distribution of the glazing area across the building.
Prediction Targets
The two response variables to be predicted are y1 and y2. y1: Heating Load - the amount of heating required for the building. y2: Cooling Load - the amount of cooling required for the building.
Possible Analysis Tasks
Regression: the dataset can be used to predict the real-valued heating load and cooling load from the given features. Classification: if the response variables are rounded to the nearest integer, the dataset can be used for multi-class classification of heating-load and cooling-load categories.
Potential Applications
Energy Efficiency: the dataset can be used to analyze the energy performance of buildings based on their characteristics and to predict heating and cooling loads, which can aid in optimizing energy usage and improving energy efficiency in building design and operation.
When it comes to efficient building design, the heating load (HL) and the cooling load (CL) must be computed to determine the specifications of the heating and cooling equipment needed to maintain comfortable indoor air conditions. To estimate the required heating and cooling capacities, architects and building designers need information about the characteristics of the building and of the conditioned space (for example, occupancy and activity level). For this reason, we will investigate the effect of eight input variables: relative compactness (RC), surface area, wall area, roof area, overall height, orientation, glazing area, and glazing area distribution, to determine the output variables HL and CL of residential buildings.
To assess the effectiveness of our model, we will use the R-squared (R2) score as our evaluation metric. R-squared is a statistical measure that quantifies how well the data align with the fitted regression line; the closer it is to 1, the more closely the predictions approximate the actual values. Our aim is therefore to achieve a high R-squared value, since higher values indicate a better fit to the data.
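As a refresher, the short sketch below shows how R-squared is computed from its definition, R2 = 1 - SS_res / SS_tot, and checks the result against scikit-learn's r2_score. The arrays are made-up illustrative values, not the building data.
import numpy as np
from sklearn.metrics import r2_score

# Made-up illustrative values, not the building data
y_true = np.array([15.5, 20.8, 28.3, 12.1, 33.0])
y_pred = np.array([14.9, 21.5, 27.2, 13.0, 31.8])

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)                  # computed from the definition
print(r2_score(y_true, y_pred))   # same value from scikit-learn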
Summary of the Dataset
In our energy analysis, we used Ecotect to simulate 12 distinct building shapes. The buildings vary in glazing area, glazing area distribution, orientation, and other parameters. By simulating various settings of these characteristics, we obtained a total of 768 building configurations. Our dataset therefore consists of 768 samples and 8 features, with the goal of predicting two real-valued responses. Alternatively, the dataset can be used for multi-class classification by rounding the responses to the nearest integer.
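The rounding mentioned above is a one-line transformation. The sketch below is only a rough illustration of how the two loads could be turned into integer class labels; it assumes the CSV file name and column names used later in this post.
import pandas as pd

# Assumes the file name and column names used later in this post
df = pd.read_csv('ENB2012_data.csv')
df.columns = ['relative_compactness', 'surface_area', 'wall_area', 'roof_area', 'overall_height',
              'orientation', 'glazing_area', 'glazing_area_distribution', 'heating_load', 'cooling_load']

# Round the real-valued loads to the nearest integer to obtain class labels
heating_class = df['heating_load'].round().astype(int)
cooling_class = df['cooling_load'].round().astype(int)
print(heating_class.nunique(), 'heating-load classes')
print(cooling_class.nunique(), 'cooling-load classes')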
Import the Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor
%matplotlib inline
Define the Functions to Work With
def dataset_info(data_frame):
    """
    Get dataset information: the number of null values and the data type of each feature.
    """
    data_frame = data_frame.copy()
    null_dataset = data_frame.isnull().sum()
    data_type_dataset = data_frame.dtypes
    # Build a summary frame with one row per feature
    dataset_info_dict = {'features': null_dataset.index,
                         'null_values': null_dataset.values,
                         'data_type': data_type_dataset.values}
    return pd.DataFrame(dataset_info_dict)
Define the evaluation function
def get_r2_score(model, test_features, test_labels):
    """
    Evaluate the performance of a machine learning model using the R2 score.
    Args:
        model: Trained machine learning model.
        test_features: Test features.
        test_labels: Test labels.
    Returns:
        r2: R2 score.
    """
    y_preds = model.predict(test_features)
    r2 = r2_score(test_labels, y_preds)
    print(f'R2 score: {r2:.3f}')
    return r2
Import the Dataset
df = pd.read_csv('ENB2012_data.csv')
df.head()
X1 X2 X3 X4 X5 X6 X7 X8 Y1 Y2
0 0.98 514.5 294.0 110.25 7.0 2 0.0 0 15.55 21.33
1 0.98 514.5 294.0 110.25 7.0 3 0.0 0 15.55 21.33
2 0.98 514.5 294.0 110.25 7.0 4 0.0 0 15.55 21.33
3 0.98 514.5 294.0 110.25 7.0 5 0.0 0 15.55 21.33
4 0.90 563.5 318.5 122.50 7.0 2 0.0 0 20.84 28.28
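As an optional sanity check, the loaded frame should match the description above: 768 samples and 8 features plus the 2 targets.
# Sanity check: 768 rows, 8 features + 2 targets = 10 columns
print(df.shape)  # expected: (768, 10)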
Rename the columns
df.columns = ['relative_compactness', 'surface_area', 'wall_area', 'roof_area', 'overall_height',
'orientation', 'glazing_area', 'glazing_area_distribution', 'heating_load', 'cooling_load']
dataset_info(df)
features null_values data_type
0 relative_compactness 0 float64
1 surface_area 0 float64
2 wall_area 0 float64
3 roof_area 0 float64
4 overall_height 0 float64
5 orientation 0 int64
6 glazing_area 0 float64
7 glazing_area_distribution 0 int64
8 heating_load 0 float64
9 cooling_load 0 float64
Let us see the distribution of the variables
# Define the number of rows and columns for the subplot grid
num_rows = 4
num_cols = 3
# Create a figure and axis objects for the subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(12, 12))
# Flatten the axes object to simplify indexing
axes = axes.flatten()
# Select the columns to plot (here, every column in the DataFrame)
columns_to_plot = df.columns
# Iterate over the selected columns and create a KDE plot for each
for i, column in enumerate(columns_to_plot):
    # Plot the KDE for the current column
    sns.kdeplot(data=df[column], ax=axes[i], legend=False)
    # Set the title for the plot
    axes[i].set_title(column)
# Remove any empty subplots
if len(columns_to_plot) < (num_rows * num_cols):
    for j in range(len(columns_to_plot), (num_rows * num_cols)):
        fig.delaxes(axes[j])
# Adjust the spacing between subplots
fig.tight_layout()
# Display the plots
plt.show()
# Set the color palette (a diverging palette suits a correlation matrix)
color_palette = "coolwarm"
# Create the correlation heatmap
plt.figure(figsize=(12, 12))
sns.heatmap(df.corr().round(3), annot=True, cmap=color_palette, fmt='.3f')
# Display the heatmap
plt.show()
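To read the heatmap more easily, it can also help to rank the features by how strongly they correlate with each target; the short sketch below does that with the same correlation matrix.
# Rank features by absolute correlation with each target
corr = df.corr()
for target in ['heating_load', 'cooling_load']:
    ranked = corr[target].drop(['heating_load', 'cooling_load']).abs().sort_values(ascending=False)
    print(f'\nAbsolute correlation with {target}:')
    print(ranked.round(3))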
Performing Standard Scaling
# Create a StandardScaler object
scaler = StandardScaler()
# Separate the features (X) from the two targets (y)
X = df.drop(['heating_load', 'cooling_load'], axis=1)
y = df[['heating_load', 'cooling_load']]
# Scale all features except 'overall_height' and 'orientation'
columns_to_scale = list(X.columns)
columns_to_scale.remove('overall_height')
columns_to_scale.remove('orientation')
# Fit the scaler to the selected columns and transform the data
X_scaled = scaler.fit_transform(X[columns_to_scale])
# Create a new DataFrame with the scaled data
X_scaled = pd.DataFrame(X_scaled, columns=columns_to_scale)
# Concatenate the scaled columns with the remaining columns from the original DataFrame
X_scaled = pd.concat([X_scaled, X.drop(columns=columns_to_scale)], axis=1)
X_scaled = X_scaled.reindex(columns = X.columns)
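As an optional check that the scaling behaved as expected, the scaled columns should now have a mean close to 0 and a standard deviation close to 1.
# Scaled columns should have mean ~0 and standard deviation ~1
print(X_scaled[columns_to_scale].mean().round(3))
print(X_scaled[columns_to_scale].std().round(3))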
Split the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2, random_state = 42)
Training
Training a decision tree regressor
dt_regressor = DecisionTreeRegressor(random_state=42)
dt_regressor.fit(X_train, y_train)
get_r2_score(dt_regressor, X_test, y_test)
R2 score: 0.969
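The score above comes from a single train/test split. As an optional extra step, not part of the original workflow, a cross-validated R2 gives a rough sense of how stable that estimate is; the sketch below assumes 5 folds on the full scaled dataset.
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated R2 for the same model on the full scaled dataset
cv_scores = cross_val_score(DecisionTreeRegressor(random_state=42), X_scaled, y, cv=5, scoring='r2')
print(f'Cross-validated R2: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}')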