Forest Fires Data Set - Regression

Pushkar Nandgaonkar
Nov 1, 2021

Updated: Nov 3, 2021

Description :

This dataset contains information about forest fires in the Montesinho park and is used to predict the burned area of forest fires from meteorological data. In [Cortez and Morais, 2007], the output 'area' was first transformed with a ln(x+1) function, and several data mining methods were then applied. After the models were fitted, their outputs were post-processed with the inverse of the ln(x+1) transform. Four different input setups were used, and the experiments were conducted using 30 runs of 10-fold cross-validation. Two regression metrics were measured: mean absolute deviation (MAD) and root mean squared error (RMSE). A Gaussian support vector machine (SVM) fed with only 4 direct weather conditions (temp, RH, wind and rain) obtained the best MAD value: 12.71 +- 0.01 (mean and 95% confidence interval using a Student's t-distribution). The best RMSE was attained by the naive mean predictor. An analysis of the regression error characteristic (REC) curve shows that the SVM model predicts more examples within a lower admitted error. In effect, the SVM model is better at predicting small fires, which are the majority.
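The transform and the two metrics described above can be sketched with NumPy. This is a minimal illustration, not the code used in the paper: the function names are hypothetical, and the sample areas are made-up values chosen only to show the round trip.

```python
import numpy as np


def to_log_area(area):
    """Forward transform ln(x + 1), applied to the target before fitting."""
    return np.log1p(area)


def from_log_area(log_area):
    """Inverse transform exp(y) - 1, applied to model outputs after fitting."""
    return np.expm1(log_area)


def mad(y_true, y_pred):
    """Mean absolute deviation between observed and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Illustrative burned areas in hectares (NOT taken from the dataset).
area = np.array([0.0, 0.5, 2.0, 10.0, 100.0])

log_area = to_log_area(area)         # compresses the long right tail
recovered = from_log_area(log_area)  # back to hectares

# The naive mean predictor mentioned above predicts the same value everywhere.
naive = np.full_like(area, area.mean())
print("MAD :", mad(area, naive))
print("RMSE:", rmse(area, naive))
```

Because RMSE squares the errors, a few very large fires dominate it, while MAD weights all fires equally. That is one plausible reading of why the two metrics favour different models in the paper.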

Recommended Model :

Suitable algorithms include regression, random forests, and Support Vector Machines.

Recommended Projects :

Predict the burned area of forest fires using this dataset.
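One way to approach this project is sketched below with scikit-learn. It follows the setup described earlier (an RBF-kernel SVM on the four weather inputs, a ln(x+1) target transform, repeated 10-fold cross-validation), but it is only a sketch under assumptions: the function name `cross_validated_mad`, the feature scaling step, the random seed, and the default SVR hyperparameters are choices made here, not settings taken from the paper.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# The four direct weather inputs named in the description.
WEATHER_FEATURES = ["temp", "RH", "wind", "rain"]


def cross_validated_mad(df: pd.DataFrame, n_repeats: int = 30, seed: int = 0) -> float:
    """Return the mean absolute error (in ha) of an RBF SVR over repeated 10-fold CV.

    Hypothetical helper: hyperparameters are scikit-learn defaults, not tuned.
    """
    X = df[WEATHER_FEATURES]
    y = df["area"]

    # Scale the inputs, then fit an SVM regressor with a Gaussian (RBF) kernel.
    svr = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

    # Fit on ln(area + 1) and map predictions back to hectares automatically.
    model = TransformedTargetRegressor(
        regressor=svr, func=np.log1p, inverse_func=np.expm1
    )

    # 10 folds repeated n_repeats times with different shuffles.
    cv = RepeatedKFold(n_splits=10, n_repeats=n_repeats, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")

    # scikit-learn reports the negated error, so flip the sign.
    return float(-scores.mean())
```

Given a DataFrame with the columns listed in the overview, `cross_validated_mad(df)` returns a single error value in hectares. The result depends on the seed and hyperparameters, so it should not be expected to match the figure reported in the paper.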

Dataset links

UCI Machine Learning Repository : https://archive.ics.uci.edu/ml/datasets/forest+fires

Kaggle : https://www.kaggle.com/elikplim/forest-fires-data-set

Overview of data

Detailed overview of dataset

Records in the dataset = 517 ROWS
Columns in the dataset = 13 COLUMNS

X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
month - month of the year: 'jan' to 'dec'
day - day of the week: 'mon' to 'sun'
FFMC - FFMC index from the FWI system: 18.7 to 96.20
DMC - DMC index from the FWI system: 1.1 to 291.3
DC - DC index from the FWI system: 7.9 to 860.6
ISI - ISI index from the FWI system: 0.0 to 56.10
temp - temperature in Celsius degrees: 2.2 to 33.30
RH - relative humidity in %: 15.0 to 100
wind - wind speed in km/h: 0.40 to 9.40
rain - outside rain in mm/m2 : 0.0 to 6.4
area - the burned area of the forest (in ha): 0.00 to 1090.84 (this output variable is very skewed towards 0.0, thus it may make sense to model with the logarithm transform).
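The month and day columns are stored as lowercase text, so they usually need encoding before modelling or plotting in calendar order. A minimal sketch with pandas follows; the helper name `encode_calendar_columns` and the added `month_num` / `day_num` columns are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Calendar order for the text labels used by the dataset.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def encode_calendar_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with ordered month/day categories and 1-based numeric codes."""
    out = df.copy()

    # Ordered categoricals make sorting and plotting follow calendar order
    # instead of alphabetical order.
    out["month"] = pd.Categorical(out["month"], categories=MONTHS, ordered=True)
    out["day"] = pd.Categorical(out["day"], categories=DAYS, ordered=True)

    # Category codes start at 0, so add 1 to get jan=1 ... dec=12, mon=1 ... sun=7.
    out["month_num"] = out["month"].cat.codes + 1
    out["day_num"] = out["day"].cat.codes + 1
    return out


# Tiny illustrative frame (values are made up, not rows from the dataset).
sample = pd.DataFrame({"month": ["aug", "mar", "dec"], "day": ["fri", "sun", "mon"]})
encoded = encode_calendar_columns(sample)
print(encoded)
```

Plain integer codes imply an ordering that a model may treat as linear. One-hot encoding with `pd.get_dummies` is a common alternative when that is not wanted.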

EDA [Code]

Dataset

import pandas as pd
# Load data (path is relative to the working directory;
# forward slashes work on Windows, macOS and Linux)
file_loc = "data/forestfires.csv"
forest_fire_data = pd.read_csv(file_loc)
forest_fire_data.head()

Total Number of Rows and columns in the dataset

# Number of Rows and columns 
rows_col = forest_fire_data.shape
print("Total number of Rows in the dataset : {}".format(rows_col[0]))
print("Total number of columns in the dataset : {}".format(rows_col[1]))

Check Details

# Data information
forest_fire_data.info()

Check missing values in the dataset

# Missing Values
forest_fire_data.isna().sum()

Statistical information

# Statistical information
forest_fire_data.describe()

Data Visualization

import seaborn as sns
import matplotlib.pyplot as plt
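# Correlation between the numeric columns only; 'month' and 'day' are text,
# and recent pandas versions raise an error if they are included
# (the numeric_only argument requires pandas 1.5 or later)
corr = forest_fire_data.corr(numeric_only=True)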
corr.style.background_gradient(cmap='coolwarm')

Count plot of the month

sns.set_style("whitegrid")
plt.figure(figsize=(8,5))
sns.countplot(x= "month",data=forest_fire_data)

Count plot of Day

sns.set_style("whitegrid")
plt.figure(figsize=(8,5))
sns.countplot(x= "day",data=forest_fire_data)

Histogram plot of rain

# Histogram 
plt.figure(figsize=(8,5))
sns.histplot(x="rain",data=forest_fire_data)

Histogram plot of FFMC

# Histogram 
plt.figure(figsize=(8,5))
sns.histplot(x="FFMC",data=forest_fire_data)

Histogram plot of DMC

# Histogram 
plt.figure(figsize=(8,5))
sns.histplot(x="DMC",data=forest_fire_data)

Histogram plot of Temperature

# Histogram 
plt.figure(figsize=(8,5))
sns.histplot(x="temp",data=forest_fire_data)

Other related data

Occupancy Detection Data Set - Classification

Census income Data Set - Classification

Wholesale customer - Classification and Clustering

Online retail dataset - classification, clustering and regression

Cervical Cancer Risk Factor Dataset - classification and clustering

Blood Transfusion service center dataset - Classification

Divorce Predictor Dataset - Classification

Student performance dataset - Classification and Regression

Heart Disease dataset - Classification
