
Machine Learning with Spark | Sample Assignment

Question 1

You are provided with a subsample of the Million Song Dataset. There are 90 attributes in total; the first one is the year (ranging from 1922 to 2011), which is also our target variable.


Part 1

Read the comma-separated file and load it into a Spark DataFrame.

- Count the number of data points and print the first 40 instances.
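
A minimal sketch of Part 1 in PySpark, assuming the subsample is a headerless comma-separated file; the file name "millionsong.csv" is an assumption.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MillionSong").getOrCreate()

# Read the comma-separated file; inferSchema parses the numeric columns.
df = spark.read.csv("millionsong.csv", header=False, inferSchema=True)

print("Number of data points:", df.count())
df.show(40)  # first 40 instances
```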


Part 2

- Normalize the features to values between 0 and 1. Normalization helps machine learning algorithms converge faster.


Hint: https://spark.apache.org/docs/latest/ml-features.html#minmaxscaler
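
A possible approach using MinMaxScaler, continuing from the DataFrame above; it assumes the first column is the year and the remaining columns are the raw features.

```python
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

# Assemble the raw feature columns (everything except the year) into a vector.
feature_cols = df.columns[1:]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="rawFeatures")
assembled = assembler.transform(df)

# Rescale each feature to the [0, 1] range.
scaler = MinMaxScaler(inputCol="rawFeatures", outputCol="features")
scaled = scaler.fit(assembled).transform(assembled)
```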


Part 3

In a learning problem, it is natural to shift the labels if they do not start from zero. Find the range of the prediction year and shift the labels if necessary so that the lowest one starts from zero.


Hint: If the year ranges from 1900 to 2000, you can equivalently represent it as 0 to 100 by subtracting the minimum year.
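
One way to shift the labels, continuing from the scaled DataFrame; the year column name is taken from the original schema (the first column).

```python
from pyspark.sql import functions as F

year_col = df.columns[0]                       # the first column is the year
min_year = df.agg(F.min(year_col)).first()[0]  # e.g. 1922 for this subsample
print("Year range:", min_year, "to", df.agg(F.max(year_col)).first()[0])

# Shifted label: the lowest year becomes 0.
labeled = scaled.withColumn("label", F.col(year_col) - F.lit(min_year))
```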


Part 4

- Split the dataset into training, validation, and test sets.

- Create a baseline model that always gives the same prediction irrespective of the input. (Use the training data.)

- Implement a function that computes the root mean squared error (RMSE) given an RDD. (A sketch follows the hints below.)

- Measure the performance of the baseline model with it. (Use the test data.)


Hint 1: Introduction to training, validation, and test sets: https://en.wikipedia.org/wiki/Test_set

Hint 2: Root mean squared error - https://en.wikipedia.org/wiki/Root-mean-square_deviation
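
A sketch of the split, the constant baseline, and an RMSE helper over an RDD of (prediction, label) pairs; the 70/10/20 ratio and the random seed are assumptions.

```python
import math
from pyspark.sql import functions as F

train, val, test = labeled.randomSplit([0.7, 0.1, 0.2], seed=42)

# Baseline: always predict the average training label.
baseline_pred = train.agg(F.avg("label")).first()[0]

def rmse(pred_label_rdd):
    """Root mean squared error of an RDD of (prediction, label) pairs."""
    n = pred_label_rdd.count()
    squared_errors = pred_label_rdd.map(lambda pl: (pl[0] - pl[1]) ** 2)
    return math.sqrt(squared_errors.sum() / n)

test_pairs = test.select("label").rdd.map(lambda row: (baseline_pred, row["label"]))
print("Baseline RMSE on the test set:", rmse(test_pairs))
```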


Part 5

Visualize predicted vs. actual values using a scatter plot.
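
A matplotlib sketch of the plot; it collects the test labels to the driver, which assumes the test set fits in memory.

```python
import matplotlib.pyplot as plt

actual = [row["label"] for row in test.select("label").collect()]
predicted = [baseline_pred] * len(actual)   # the baseline predicts a constant

plt.scatter(actual, predicted, alpha=0.3)
plt.xlabel("Actual (shifted year)")
plt.ylabel("Predicted")
plt.title("Predicted vs. actual")
plt.show()
```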



Question 2

We will extend our work from Question 1, use the same Million Song Dataset, and apply another method to see if it improves upon our baseline prediction.


Part 1

- Load the dataset into a Spark DataFrame.

- Split the dataset into training, validation, and test data (70%-10%-20%).
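
A short sketch of the load and the 70%-10%-20% split, assuming the same headerless CSV as in Question 1.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MillionSongRegression").getOrCreate()
df = spark.read.csv("millionsong.csv", header=False, inferSchema=True)

train, val, test = df.randomSplit([0.7, 0.1, 0.2], seed=42)
```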


Part 2

Train a model on the training data and evaluate it on the validation set.

- You can use any machine learning algorithm, such as linear regression or a random forest regressor.
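
A possible model using Spark ML linear regression; the feature assembly, the choice of algorithm, and treating the first column as the label (as in Question 1) are assumptions.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Assemble features and name the target column "label" on each split.
assembler = VectorAssembler(inputCols=df.columns[1:], outputCol="features")
prepare = lambda d: assembler.transform(d).withColumnRenamed(df.columns[0], "label")
train_p, val_p, test_p = prepare(train), prepare(val), prepare(test)

# Train a linear regression model (50 iterations, see Part 3).
lr = LinearRegression(featuresCol="features", labelCol="label", maxIter=50)
model = lr.fit(train_p)

# Evaluate on the validation set.
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction",
                                metricName="rmse")
print("Validation RMSE:", evaluator.evaluate(model.transform(val_p)))
```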



Part 3

Visualize the log of the training error as a function of the iteration number. The scatter plot should show the logarithm of the training error for all 50 iterations.
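
A sketch of the plot; it uses the objectiveHistory from the linear regression training summary as the per-iteration training error, which is an assumption tied to the model chosen in Part 2.

```python
import math
import matplotlib.pyplot as plt

history = model.summary.objectiveHistory        # one value per iteration
log_error = [math.log(v) for v in history]

plt.scatter(range(len(log_error)), log_error)
plt.xlabel("Iteration")
plt.ylabel("log(training error)")
plt.title("Log of training error over 50 iterations")
plt.show()
```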


Part 4

Use this model to make predictions on the test data and calculate the root mean squared error of the model.
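
A short sketch of the test-set evaluation, reusing the prepared test split and model from Part 2.

```python
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction",
                                metricName="rmse")
print("Test RMSE:", evaluator.evaluate(model.transform(test_p)))
```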



Question 3

Entity resolution is the process of joining one data source to another that describes the same entity. By ER, we mean finding records that refer to the same entity across different data sources such as books, websites, etc.


Part 1

Read each file and create an RDD of its lines. The first column of Google.csv contains URLs and the first column of Amazon.csv contains alphanumeric strings; for simplicity we call both the ID, and we want to parse the ID for each row. Load the data into a pair RDD in which the ID is the key and the remaining information is the value.


The file format of an Amazon line is:

"id","title","description","manufacturer","price"


The file format of a Google line is:

"id","name","description","manufacturer","price"


Part 2

Bag-of-words is a conceptually simple approach to text analysis: each document is treated as an unordered collection of words. We will construct some components for a bag-of-words analysis of the description field of the datasets.


  1. Implement a function that takes a string and returns its non-empty tokens, splitting on a regular expression.

  2. Stopwords are common (English) words that do not contribute much to the content or meaning of a document (e.g., "the", "a", "is", "to", etc.). Stopwords add noise to bag-of-words comparisons, so they are usually excluded. Using the included file "stopwords.txt", implement an improved tokenizer that does not emit stopwords.

  3. Tokenize both the Amazon and Google datasets. For each instance, tokenize the value corresponding to the description.
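
A sketch of the two tokenizers and their application to the descriptions; the regular expression and the position of the description field in the value list from Part 1 are assumptions, while "stopwords.txt" comes from the assignment.

```python
import re

split_regex = r'\W+'   # assumption: split on runs of non-word characters

def simple_tokenize(string):
    """Lowercase the string and return its non-empty tokens."""
    return [t for t in re.split(split_regex, string.lower()) if t]

# sc is the SparkContext created in Part 1.
stopwords = set(sc.textFile("stopwords.txt").collect())

def tokenize(string):
    """Tokenize and drop stopwords."""
    return [t for t in simple_tokenize(string) if t not in stopwords]

# Tokenize the description field (assumed to be index 1 of the value list
# produced in Part 1) for both datasets.
amazon_tokens = amazon.map(lambda kv: (kv[0], tokenize(kv[1][1])))
google_tokens = google.map(lambda kv: (kv[0], tokenize(kv[1][1])))
```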

Part 3

Write a function that takes a list of tokens and returns a dictionary mapping each token to its term frequency.


Note: You can use MLlib for the TF and IDF functions.
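
A pure-Python sketch of the term-frequency function, using counts normalized by the total number of tokens; MLlib's HashingTF is an alternative, as the note says.

```python
def tf(tokens):
    """Map each token to its term frequency (count / total tokens)."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    total = float(len(tokens))
    return {t: c / total for t, c in counts.items()}

print(tf(["one", "fish", "two", "fish"]))   # {'one': 0.25, 'fish': 0.5, 'two': 0.25}
```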


Part 4

Combine the datasets to create a corpus. Each element of the corpus is a <key, value> pair in which the key is the ID and the value is the associated tokens from the two datasets combined.
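
One way to build the corpus, continuing from the tokenized pair RDDs of Part 2.

```python
# Union the two tokenized pair RDDs: each element is (ID, token list).
corpus = amazon_tokens.union(google_tokens)
print(corpus.count(), "documents in the combined corpus")
```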


Part 5

Write an IDF function that returns a pair RDD in which the key is each unique token and the value is the corresponding IDF value. Plot a histogram of the IDF values.


Note: You can use MLlib's HashingTF and IDF functions.
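
A sketch of the IDF computation and histogram. Here IDF(t) is taken as N / n(t), the total number of documents divided by the number of documents containing the token; whether to take a logarithm is an assumption left to the grader's convention.

```python
import matplotlib.pyplot as plt

def idfs(corpus_rdd):
    """Pair RDD of (token, IDF value) over a (doc ID, token list) corpus."""
    n_docs = corpus_rdd.count()
    doc_counts = (corpus_rdd
                  .flatMap(lambda kv: set(kv[1]))       # unique tokens per doc
                  .map(lambda token: (token, 1))
                  .reduceByKey(lambda a, b: a + b))
    return doc_counts.mapValues(lambda c: float(n_docs) / c)

idf_weights = idfs(corpus)

plt.hist(idf_weights.values().collect(), bins=50)
plt.xlabel("IDF value")
plt.ylabel("Number of tokens")
plt.show()
```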


Part 6

Write a function that does the following over the entire corpus:

  1. Calculate the term frequencies of the tokens.

  2. Create a Python dictionary in which each token maps to the token's frequency times the token's IDF weight.

Note: TF x IDF is also known as tf-idf in NLP terminology.
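
A sketch combining the helpers above into a tf-idf function; collecting the IDF weights to a driver-side dictionary assumes the vocabulary is small enough to fit in memory.

```python
def tfidf(tokens, idfs_dict):
    """Return a dict mapping each token in the list to tf * idf."""
    tfs = tf(tokens)
    return {t: tfs[t] * idfs_dict.get(t, 0.0) for t in tfs}

idfs_dict = idf_weights.collectAsMap()
doc_id, doc_tokens = corpus.first()
print(doc_id, tfidf(doc_tokens, idfs_dict))
```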


Question 4

This project explores the relative performance of several machine learning algorithms. The objective is to predict the chance that a customer will default on their credit card payment.


The dataset was obtained from a customer base in Taiwan and can be found in the Files section (source: UCI Machine Learning Repository).


Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:

- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.

- X2: Gender (1 = male; 2 = female).

- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

- X4: Marital status (1 = married; 2 = single; 3 = others).

- X5: Age (years).

- X6-X11: History of past payment. We tracked the past monthly payment records (from April to September 2005) as follows: X6 = the repayment status in September 2005; X7 = the repayment status in August 2005; ...; X11 = the repayment status in April 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

- X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September 2005; X13 = amount of bill statement in August 2005; ...; X17 = amount of bill statement in April 2005.

- X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September 2005; X19 = amount paid in August 2005; ...; X23 = amount paid in April 2005.


This is a supervised learning problem, specifically binary classification. The last column, Y, is our target variable; a value of 1 signifies that the customer will default next month.

Build a model in Spark that correctly predicts whether or not a customer will default next month. You should approximately follow the pipeline specified in the "Machine Learning Pipeline" document.


Specifically, these procedural stages need to be followed (a minimal sketch follows the list):

  1. Data Loading

  2. Data Transformation

  3. Model Learning

  4. Model Evaluation

Details related to each procedure may vary.
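
A minimal end-to-end sketch of the four stages; the file name, the assumption that the first column is an ID and the last column is Y, and the choice of logistic regression are all placeholders, not the required solution.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("CreditDefault").getOrCreate()

# 1. Data loading (file name is a placeholder).
data = spark.read.csv("default_of_credit_card_clients.csv",
                      header=True, inferSchema=True)

# 2. Data transformation: assemble X1..X23 into a feature vector,
#    rename the response column to "label", and split the data.
label_col = data.columns[-1]          # assumption: last column is Y
feature_cols = data.columns[1:-1]     # assumption: first column is an ID
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data = assembler.transform(data).withColumnRenamed(label_col, "label")
train, test = data.randomSplit([0.8, 0.2], seed=42)

# 3. Model learning.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)

# 4. Model evaluation: area under the ROC curve on the held-out split.
evaluator = BinaryClassificationEvaluator(labelCol="label")
print("Test AUC:", evaluator.evaluate(model.transform(test)))
```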


