
Predicting Coffee Preference Score using ML

Introduction


Welcome to our latest blog post! Today, we are excited to share another project requirement with you, titled "Predicting Coffee Preference Score using ML". In this project, the goal is to build a model that can predict how much people will like a coffee based on its chemical makeup. We'll use a dataset that describes the levels of various chemicals, such as caffeine and tannin, in different types of coffee, along with other information such as customer opinions and ratings.


In the Solution Approach section of this blog post, we will discuss our approach to solving this project requirement. We will walk you through our thought process, the methodologies we plan to employ, and the tools we will use. Our goal is to provide a comprehensive solution that is both effective and efficient.


Then, we will showcase the output of our analysis, including visualizations, key findings, and interpretations of the data.


Project Requirements : 


This project aims to analyze a dataset representing the chemicals in coffee. We want to build a model that can predict the Preference_Score of a coffee based on the quantities of the chemicals present. The provided dataset includes the following information: 


Dataset Information : 


  • Caffeine: Amount of caffeine in the coffee in mg 

  • Tannin: Amount of Tannin in the coffee in mg 

  • Thiamin: Amount of Thiamin in the coffee in mg 

  • Xanthine: Amount of Xanthine in the coffee in mg 

  • Spermidine: Amount of Spermidine in the coffee in mg 

  • Guaiacol: Amount of Guaiacol in the coffee in mg 

  • Chlorogenic_acid: Amount of Chlorogenic acid in the coffee in mg

  • Preference_Score: Preference score of the drink given by the consumer. 

  • Drink_Name: Name of the drink. 

  • Cus_opinion: Customer's opinion on whether or not they liked the service. 

  • Rating: Rating given to the employees. 


To build the desired predictive model, complete the following tasks and answer the following questions. 


Questions and Tasks 

  1. Load and explore the dataset 

    a. How many numerical features are there? How many categorical features? 

    b. Verify whether there are missing values in the dataset and handle them. 

    c. Justify the choices you make for handling the missing values. 


  2. Prepare the dataset for a Linear Regression task. 

    a. Verify the value distributions of the numerical variables. 

    b. Is feature transformation necessary for the numerical variables? Take into account that we are preparing the dataset for a Linear Regression task, with the goal of building a "Preference_Score" predictive model. If transformation is necessary, justify your choices and then proceed accordingly.

    c. Verify the presence of outliers and, if necessary, handle them. Justify your choices. 

    d. Is encoding necessary for the categorical variables? If yes, which kind of encoding? Specify your choices, justify them, and perform categorical data encoding if necessary. 

    e. Increase the dimensionality of the dataset by introducing Polynomial Features – degree = 3 (continuous variables). 

    f. Include any other transformation which might be necessary or appropriate and justify your choices. 


3. Features Selection 

    a. Perform a One-Way ANOVA to test the relationship between the variable Drink_Name and Preference_Score. Consider whether the feature should be removed. Justify your choice. 

    b. Perform Feature Selection and visualize the features which have been selected. Select one appropriate methodology for feature selection and justify your choice. 


4. Linear Regression 

    a. Train a Multiple Linear Regression model, using the Sklearn implementation of Linear Regression to find the best θ vector. Use all the transformed features, excluding the derived polynomial features. Evaluate the model with the learned θ on the test set. 

    b. Use all the transformed features, excluding the derived polynomial features, to identify the best values of θ by means of a Batch Gradient Descent procedure. Identify the best value of η (starting with an initial value of η = 0.1). Evaluate the model with the trained θ on the test set. Plot the train and test error for an increasing number of iterations of the Gradient Descent procedure (with the best value of η) and comment on the plot. A minimal sketch of such a procedure is given after this list. 

    c. Use the complete set of features, including the derived polynomial features. Train a Multiple Linear Regression model, using the Sklearn implementation of Linear Regression to find the best θ vector. Evaluate the model with the learned θ on the test set. Plot the train and test error for increasing sizes of the training set and comment on the plot. 

    d. Use the complete set of features, including the derived polynomial features. Train a Ridge Regression model, identifying the value of the regularization parameter α that allows the model to achieve the best generalization performance. Evaluate the model. 

    e. Use the complete set of features, including the derived polynomial features. Train a Linear Regression model with Lasso regularization. Comment on the importance of each feature given the corresponding trained parameter value. Also, verify the number of features selected (coefficients θ different from zero) for different values of α.

    f. Use the subset of features selected in the Feature Selection task (question 3b). Train a Multiple Linear Regression model using the Sklearn implementation of Linear Regression to find the best θ vector. Evaluate the model. 

    g. Create a table with the evaluation results obtained from all the models above on both the train and test sets. 

    h. Compare and discuss the results obtained above.
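
For reference, here is a minimal, hedged sketch of how the Batch Gradient Descent step in task 4.b might be set up. The train/test arrays, the number of iterations, and the RMSE metric are illustrative assumptions, not part of the project specification.

```python
# Minimal Batch Gradient Descent sketch for task 4.b (illustrative only).
# Assumes X_train, X_test, y_train, y_test are already-prepared NumPy arrays
# holding the transformed (non-polynomial) features and Preference_Score.
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iterations=1000):
    """Fit theta by full-batch gradient descent on the MSE cost."""
    m, n = X.shape
    Xb = np.c_[np.ones((m, 1)), X]                      # prepend a bias column
    theta = np.zeros(n + 1)                             # start from the zero vector
    for _ in range(n_iterations):
        gradients = (2 / m) * Xb.T @ (Xb @ theta - y)   # gradient of the MSE cost
        theta -= eta * gradients                        # gradient descent update
    return theta

def predict(X, theta):
    return np.c_[np.ones((X.shape[0], 1)), X] @ theta

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical usage once the arrays exist:
# theta = batch_gradient_descent(X_train, y_train, eta=0.1)
# print("Test RMSE:", rmse(y_test, predict(X_test, theta)))
```

Different values of η can be compared by recording the train and test RMSE at each iteration and plotting the two curves, as task 4.b asks.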


Solution Approach : 

1. Dataset Overview:

  • Dataset: Coffee Sales Dataset

  • Features:

  • Chemical composition of coffee (Caffeine, Tannin, Thiamin, Xanthine, Spermidine, Guaiacol, Chlorogenic acid)

  • Consumer-related variables (Preference Score, Drink Name, Customer Opinion, Rating)


2. Data Exploration and Preprocessing:

  • Loaded and explored the dataset to understand its structure and characteristics.

  • Identified numerical and categorical features.

  • Checked for missing values and handled them appropriately:

  • Filled missing values in chemical composition based on the drink name.

  • Performed data transformation to address skewness:

  • Applied log transformation to skewed numerical features.

  • Removed outliers from the dataset using a robust approach.

  • Encoded categorical variables using Label Encoding and One-Hot Encoding (a minimal sketch of these preprocessing steps follows this list).
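
Below is a minimal sketch, in Python with pandas, of how these preprocessing steps might look. The file name, the per-drink median imputation, and the IQR outlier rule are illustrative assumptions, not the definitive implementation.

```python
# Illustrative preprocessing sketch (assumed file and column names).
import numpy as np
import pandas as pd

df = pd.read_csv("coffee_dataset.csv")                     # hypothetical file name

# Identify numerical and categorical features and inspect missing values
num_cols = df.select_dtypes(include=np.number).columns.tolist()
cat_cols = df.select_dtypes(exclude=np.number).columns.tolist()
print(df.isna().sum())

# Fill missing chemical quantities with the median of the same drink
chem_cols = ["Caffeine", "Tannin", "Thiamin", "Xanthine",
             "Spermidine", "Guaiacol", "Chlorogenic_acid"]
df[chem_cols] = df.groupby("Drink_Name")[chem_cols].transform(
    lambda s: s.fillna(s.median()))

# Log-transform the skewed chemical features (log1p keeps zero values valid)
df[chem_cols] = np.log1p(df[chem_cols])

# Drop outliers with a simple IQR rule (one possible "robust" choice)
q1, q3 = df[chem_cols].quantile(0.25), df[chem_cols].quantile(0.75)
iqr = q3 - q1
inliers = ~((df[chem_cols] < q1 - 1.5 * iqr) | (df[chem_cols] > q3 + 1.5 * iqr)).any(axis=1)
df = df[inliers]

# Encode categoricals: binary opinion via label codes, drink name via one-hot
df["Cus_opinion"] = df["Cus_opinion"].astype("category").cat.codes
df = pd.get_dummies(df, columns=["Drink_Name"], drop_first=True)
```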


3. Feature Engineering:

  • Generated polynomial features of degree 3 to capture non-linear relationships.

  • Conducted feature selection using ANOVA to assess the relationship between drink names and preference scores.

  • Visualized the correlation matrix to identify feature relationships (a sketch of these steps follows this list).
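
A hedged sketch of these feature-engineering steps follows: a one-way ANOVA of Preference_Score across drink names and a degree-3 polynomial expansion of the continuous chemical features. Variable names continue from the preprocessing sketch above and are assumptions rather than the project's exact code; the ANOVA is assumed to run on the dataframe before Drink_Name is one-hot encoded.

```python
# Illustrative feature-engineering sketch (assumed names).
import pandas as pd
from scipy.stats import f_oneway
from sklearn.preprocessing import PolynomialFeatures

chem_cols = ["Caffeine", "Tannin", "Thiamin", "Xanthine",
             "Spermidine", "Guaiacol", "Chlorogenic_acid"]

# One-way ANOVA: does the mean Preference_Score differ across drink names?
# (df here still contains the original Drink_Name column)
groups = [g["Preference_Score"].to_numpy() for _, g in df.groupby("Drink_Name")]
f_stat, p_value = f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.3f}, p = {p_value:.4f}")  # a large p-value suggests Drink_Name adds little signal

# Degree-3 polynomial expansion of the continuous chemical variables
poly = PolynomialFeatures(degree=3, include_bias=False)
poly_df = pd.DataFrame(poly.fit_transform(df[chem_cols]),
                       columns=poly.get_feature_names_out(chem_cols),
                       index=df.index)

# Correlation of the original chemical features with the target
print(df[chem_cols + ["Preference_Score"]].corr()["Preference_Score"])
```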


4. Model Building and Evaluation:

  • Trained Multiple Linear Regression models using Sklearn and Batch Gradient Descent.

  • Evaluated models on test sets and plotted learning curves to assess performance.

  • Applied Ridge and Lasso regularization techniques to mitigate overfitting and improve generalization.

  • Analyzed feature importance and the effect of regularization on model coefficients (a sketch of the model training and evaluation follows this list).
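
The sketch below illustrates how the scikit-learn models might be trained and compared; the Batch Gradient Descent variant is sketched earlier in the post. The feature matrix X, the target y, the 80/20 split, and the α values are illustrative assumptions.

```python
# Illustrative model-training sketch (X = prepared feature matrix, y = Preference_Score).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize so regularized models treat every feature on the same scale
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

def rmse(model, X_part, y_part):
    """RMSE of a fitted model on one data split."""
    return float(np.sqrt(mean_squared_error(y_part, model.predict(X_part))))

models = {
    "LinearRegression": LinearRegression(),
    "Ridge (alpha=1.0)": Ridge(alpha=1.0),                 # alpha tuned over a grid in practice
    "Lasso (alpha=0.01)": Lasso(alpha=0.01, max_iter=10000),
}

for name, model in models.items():
    model.fit(X_train_s, y_train)
    print(f"{name:20s} train RMSE = {rmse(model, X_train_s, y_train):.3f}  "
          f"test RMSE = {rmse(model, X_test_s, y_test):.3f}")

# Lasso feature importance: non-zero coefficients mark the retained features
lasso = models["Lasso (alpha=0.01)"]
print("Features kept by Lasso:", int(np.sum(lasso.coef_ != 0)))
```

Collecting these train and test scores into a single table makes it straightforward to compare the plain, Ridge-regularized, and Lasso-regularized models, as the project tasks require.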


5. Results and Insights:

  • Achieved improved prediction accuracy using transformed features and regularization techniques.

  • Identified key features influencing coffee preference scores.

  • Evaluated various regression models and compared their performance.

  • Provided insights into the impact of different methods on model accuracy and interpretability.


Output : 

Our project delves into the fascinating world of coffee chemistry with the aim of predicting coffee preference scores using machine learning. By analyzing a dataset detailing the chemical composition of different coffee blends, alongside consumer-related variables such as drink names, customer opinions, and ratings, we strive to develop a model that accurately predicts how much people will enjoy a cup of coffee based on its chemical makeup.


Throughout our project, we'll navigate various tasks, from exploring and preprocessing the dataset to building and evaluating our predictive model. We'll tackle challenges like handling missing values, transforming features, and selecting relevant variables for our model. Utilizing techniques like Multiple Linear Regression, Batch Gradient Descent, and feature engineering, we'll develop a robust model that provides valuable insights into coffee preferences and the factors influencing them. Join us as we uncover the secrets behind predicting coffee preferences and share our findings and insights with you!


If you require any assistance with the project discussed in this blog, or if you find yourself in need of similar support for other projects, please don't hesitate to reach out to us. Our team can be contacted at any time via email at contact@codersarts.com.
