This project analyzes a dataset describing the chemical composition of coffee. The goal is to build a model that predicts the Preference_Score of a coffee from the quantities of the chemicals it contains.

The provided dataset includes the following information:

• **Caffeine:** Amount of caffeine in the coffee in mg

• **Tannin:** Amount of Tannin in the coffee in mg

• **Thiamin:** Amount of Thiamin in the coffee in mg

• **Xanthine:** Amount of Xanthine in the coffee in mg

• **Spermidine:** Amount of Spermidine in the coffee in mg

• **Gualacol:** Amount of Gualacol in the coffee in mg

• **Chlorogenic_acid:** Amount of Chlorogenic acid in the coffee in mg

• **Preference_Score:** Preference score of the drink given by the consumer.

• **Drink_Name:** Name of the drink.

• **Cus_opinion:** Customer’s opinion on whether they liked the service.

• **Rating:** Rating given to the employees.

To build the desired predictive model, complete the following tasks and answer the following questions.

**Questions and Tasks**

1. Load and explore the dataset

(a) How many numerical features are there? How many categorical features?

(b) Check whether the dataset contains missing values and handle them.

(c) Justify the choices you make for handling the missing values
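Task 1 could be approached along the following lines. This is only a sketch: the small frame below is made-up data standing in for the real file, and the file name `coffee.csv` in the comment is hypothetical. Median and mode imputation are one defensible choice (the median is robust to outliers, the mode keeps categorical values within the observed set), not the required one.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real dataset (values are illustrative only).
# With the real file this would be e.g. pd.read_csv("coffee.csv")  # hypothetical name
df = pd.DataFrame({
    "Caffeine": [95.0, 80.0, np.nan, 120.0, 60.0],
    "Tannin": [4.1, 3.8, 5.0, np.nan, 4.4],
    "Preference_Score": [7.5, 6.0, 8.1, 5.5, 6.9],
    "Drink_Name": ["Latte", "Mocha", None, "Latte", "Espresso"],
})

# (a) Split columns by type.
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()
print(len(num_cols), "numerical,", len(cat_cols), "categorical")

# (b) Count missing values per column.
print(df.isna().sum())

# (b) Impute: median for numerical columns, most frequent value for categorical ones.
df_clean = df.copy()
df_clean[num_cols] = df_clean[num_cols].fillna(df_clean[num_cols].median())
for col in cat_cols:
    df_clean[col] = df_clean[col].fillna(df_clean[col].mode().iloc[0])
```

If only a handful of rows contain missing values, dropping them with `df.dropna()` is an equally reasonable alternative to justify in (c).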

2. Prepare the dataset for a Linear Regression task.

(a) Examine the distribution of values of the numerical variables.

(b) Is feature transformation necessary for the numerical variables? Keep in mind that we are preparing the dataset for a Linear Regression task, with the goal of building a "Preference_Score" predictive model. If transformation is necessary, justify your choices and then apply it.
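One way to examine the distributions and decide on a transformation is sketched below on synthetic data. The skewness threshold of 1.0 is a rule of thumb assumed here, and `log1p` followed by standardization is one possible choice. Linear Regression does not require normally distributed features, but comparable scales help Gradient Descent converge and keep regularization penalties fair across features.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: one roughly symmetric column, one right-skewed column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Caffeine": rng.normal(90, 10, 500),
    "Tannin": rng.lognormal(1.0, 0.8, 500),
})

# (a) Summary statistics and skewness per column.
print(df.describe())
skew = df.skew()
print(skew)

# (b) Log-transform columns whose absolute skewness exceeds the assumed threshold.
skewed_cols = skew[skew.abs() > 1.0].index.tolist()
df_t = df.copy()
df_t[skewed_cols] = np.log1p(df_t[skewed_cols])

# Standardize to zero mean and unit variance.
# In a real workflow, fit the scaler on the training split only.
scaled = StandardScaler().fit_transform(df_t)
```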

(c) Check for outliers and handle them if necessary. Justify your choices.
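As an illustration of outlier handling, the sketch below applies the common 1.5 × IQR rule to a made-up column and clips flagged values to the fences. Both the rule and the choice of clipping over removal are assumptions for the example; other approaches (z-scores, dropping rows) may suit the real data better.

```python
import pandas as pd

# Made-up values with one obvious outlier.
s = pd.Series([4.0, 4.2, 3.9, 4.1, 4.3, 4.0, 15.0], name="Tannin")

# Fences at 1.5 * IQR beyond the first and third quartiles.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences, then clip them back to the fences.
is_outlier = (s < lower) | (s > upper)
s_clipped = s.clip(lower, upper)
print(int(is_outlier.sum()), "outlier(s) flagged")
```

Clipping keeps the row count unchanged, which matters when the dataset is small.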

(d) Is encoding necessary for the categorical variables? If yes, which kind of encoding? State your choices, justify them, and encode the categorical data if necessary.
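A possible encoding scheme is sketched below. The category values ("Latte", "Yes", and so on) are assumptions, since the brief does not list them. One-hot encoding is used for the nominal Drink_Name, with the first level dropped to avoid perfectly collinear columns in a linear model, and a 0/1 mapping is used for the binary Cus_opinion.

```python
import pandas as pd

# Made-up categorical values for illustration.
df = pd.DataFrame({
    "Drink_Name": ["Latte", "Mocha", "Espresso", "Latte"],
    "Cus_opinion": ["Yes", "No", "Yes", "Yes"],
})

# One-hot encode the nominal variable; drop_first removes the redundant level.
encoded = pd.get_dummies(df, columns=["Drink_Name"], drop_first=True, dtype=int)

# Binary variable mapped to 0/1.
encoded["Cus_opinion"] = encoded["Cus_opinion"].map({"No": 0, "Yes": 1})
print(encoded)
```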

(e) Increase the dimensionality of the dataset by introducing Polynomial Features of degree 3 (continuous variables only).

(f) If needed, apply any other transformation that is necessary or appropriate, and justify your choices.

3. Feature Selection

(a) Perform a one-way ANOVA to test the relationship between the variable Drink_Name and Preference_Score. Depending on the result, consider removing the feature. Justify your choice.
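A one-way ANOVA can be run with `scipy.stats.f_oneway`, as in the sketch below. The scores are made up, and the 0.05 significance level is a conventional assumption. A large p-value would suggest that mean Preference_Score does not differ across drinks, which would support dropping Drink_Name.

```python
import pandas as pd
from scipy.stats import f_oneway

# Made-up scores for three drinks.
df = pd.DataFrame({
    "Drink_Name": ["Latte"] * 4 + ["Mocha"] * 4 + ["Espresso"] * 4,
    "Preference_Score": [7.1, 7.4, 6.9, 7.2,
                         5.0, 5.3, 4.8, 5.1,
                         8.9, 9.1, 8.8, 9.0],
})

# One array of scores per drink.
groups = [g["Preference_Score"].to_numpy() for _, g in df.groupby("Drink_Name")]

# H0: all drinks share the same mean Preference_Score.
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
keep_feature = p_value < 0.05  # assumed significance level
```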

(b) Perform feature selection and visualize the selected features. Choose one appropriate feature selection method and justify your choice.
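One candidate method is univariate selection with `SelectKBest` and `f_regression`, sketched below on synthetic data in which only two features drive the target. The choice of `k=2` is an assumption tied to that synthetic setup. Wrapper or embedded methods such as recursive feature elimination or Lasso are equally valid alternatives.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: the target depends on Caffeine and Xanthine only.
rng = np.random.default_rng(42)
names = np.array(["Caffeine", "Tannin", "Thiamin", "Xanthine", "Spermidine"])
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Keep the k features with the highest univariate F-score.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = names[selector.get_support()].tolist()

# Scores sorted from most to least relevant.
for name, score in sorted(zip(names, selector.scores_), key=lambda t: -t[1]):
    print(f"{name:12s} {score:10.2f}")
print("selected:", selected)
```

The `selector.scores_` array can be passed to a bar chart to visualize which features were kept.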

4. Linear Regression

(a) Train a Multiple Linear Regression model using the Sklearn implementation of Linear Regression to find the best θ vector. Use all the transformed features, excluding the derived polynomial features. Evaluate the model with the learned θ on the test set.
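The Sklearn fit and test-set evaluation could look like the sketch below. The data is synthetic with a known θ so the result can be checked, and MSE and R² are assumed as evaluation metrics because the brief does not name any.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data generated from a known parameter vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 4.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# theta = [intercept, coefficients...]
theta = np.r_[model.intercept_, model.coef_]
y_pred = model.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)
print(theta, test_mse, test_r2)
```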

(b) Use all the transformed features, excluding the derived polynomial features, to identify the best values of θ by means of a Batch Gradient Descent procedure. Identify the best value of the learning rate η (starting from η = 0.1). Evaluate the model with the trained θ on the test set. Plot the train and test errors against the number of Gradient Descent iterations (using the best value of η). Comment on the plot.
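Batch Gradient Descent on the MSE cost can be sketched in plain NumPy as follows. The function name `batch_gd`, the candidate η grid, and the 200-iteration budget are assumptions for the example. Here η is chosen by final training error, which treats it as an optimization setting; a validation split would be another reasonable criterion.

```python
import numpy as np

def batch_gd(X_tr, y_tr, X_te, y_te, eta=0.1, n_iter=200):
    """Minimize MSE with full-batch gradient steps; record errors per iteration."""
    Xb_tr = np.c_[np.ones(len(X_tr)), X_tr]  # prepend bias column
    Xb_te = np.c_[np.ones(len(X_te)), X_te]
    theta = np.zeros(Xb_tr.shape[1])
    m = len(y_tr)
    train_err, test_err = [], []
    for _ in range(n_iter):
        grad = (2.0 / m) * Xb_tr.T @ (Xb_tr @ theta - y_tr)
        theta = theta - eta * grad
        train_err.append(np.mean((Xb_tr @ theta - y_tr) ** 2))
        test_err.append(np.mean((Xb_te @ theta - y_te) ** 2))
    return theta, np.array(train_err), np.array(test_err)

# Synthetic, already standardized features with a known theta.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = 2.0 + X @ np.array([1.0, -3.0, 0.5]) + rng.normal(scale=0.1, size=400)
X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

# Try several learning rates and keep the one with the lowest final train MSE.
results = {eta: batch_gd(X_tr, y_tr, X_te, y_te, eta=eta)
           for eta in (0.001, 0.01, 0.1, 0.5)}
best_eta = min(results, key=lambda e: results[e][1][-1])
theta_best, train_curve, test_curve = results[best_eta]
print(best_eta, theta_best, test_curve[-1])
```

`train_curve` and `test_curve` hold one MSE value per iteration, so plotting both against the iteration index gives the requested figure.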

(c) Use the complete set of features, including the derived polynomial features. Train a Multiple Linear Regression model using the Sklearn implementation of Linear Regression to find the best θ vector. Evaluate the model with the learned θ on the test set. Plot the train and test errors for increasing sizes of the training set. Comment on the plot.
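A learning curve over training-set size can be computed with a simple loop, as sketched below on synthetic data. The list of sizes is an arbitrary assumption. With few samples the degree-3 model typically overfits (low train error, high test error), and the two errors move closer as the training set grows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic dependence on the second feature.
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 3))
y = 1.0 + X[:, 0] - 2.0 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=600)
X_tr, X_te, y_tr, y_te = X[:500], X[500:], y[:500], y[500:]

sizes = [30, 60, 120, 250, 500]
train_mse, test_mse = [], []
for n in sizes:
    # Refit from scratch on the first n training samples.
    model = make_pipeline(
        PolynomialFeatures(degree=3, include_bias=False), LinearRegression()
    )
    model.fit(X_tr[:n], y_tr[:n])
    train_mse.append(mean_squared_error(y_tr[:n], model.predict(X_tr[:n])))
    test_mse.append(mean_squared_error(y_te, model.predict(X_te)))

for n, tr, te in zip(sizes, train_mse, test_mse):
    print(f"n={n:4d}  train MSE={tr:.3f}  test MSE={te:.3f}")
```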

(d) Use the complete set of features, including the derived polynomial features. Train a Ridge Regression model, identifying the value of the regularization strength α that gives the best generalization performance. Evaluate the model.
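The search for α can be sketched with `RidgeCV`, which picks α by cross-validation on the training set. The logarithmic α grid and the 5-fold setting are assumptions, and the data is synthetic. Features are standardized after the polynomial expansion so the penalty treats all terms on a comparable scale.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = 1.0 + X[:, 0] - 2.0 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=300)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

# Candidate regularization strengths (assumed grid).
alphas = np.logspace(-3, 3, 13)

model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    StandardScaler(),
    RidgeCV(alphas=alphas, cv=5),
)
model.fit(X_tr, y_tr)

best_alpha = model[-1].alpha_  # alpha chosen by cross-validation
test_r2 = r2_score(y_te, model.predict(X_te))
print(best_alpha, test_r2)
```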

(e) Use the complete set of features, including the derived polynomial features. Train a Linear Regression model with Lasso regularization. Comment on the importance of each feature based on the value of its trained parameter. Also, check how many features are selected (those whose coefficient θ is different from zero) for different values of α.

(f) Use the subset of features selected in the Feature Selection task (question 3b). Train a Multiple Linear Regression model using the Sklearn implementation of Linear Regression to find the best θ vector. Evaluate the model.

(g) Create a table with the evaluation results obtained from all the models above on both the train and test sets.
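The results table can be assembled by looping over fitted models and collecting metrics into a pandas DataFrame, as in this sketch. The three models, their α values, and the synthetic data are placeholders. In the actual project each row would come from one of the models trained in the tasks above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data.
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 3))
y = 4.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = X[:240], X[240:], y[:240], y[240:]

# Placeholder models; hyperparameters are illustrative only.
models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.01),
}

rows = []
for name, m in models.items():
    m.fit(X_tr, y_tr)
    pred_tr, pred_te = m.predict(X_tr), m.predict(X_te)
    rows.append({
        "model": name,
        "train_MSE": mean_squared_error(y_tr, pred_tr),
        "test_MSE": mean_squared_error(y_te, pred_te),
        "train_R2": r2_score(y_tr, pred_tr),
        "test_R2": r2_score(y_te, pred_te),
    })

table = pd.DataFrame(rows).set_index("model")
print(table.round(4))
```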

(h) Compare and discuss the results obtained above.

This project is suitable as a final-year project, capstone project, personal portfolio piece, or proof of concept.
