
Implementing Machine Learning Model Using Apache Kafka and Apache Spark

StopHacking is a start-up incubated at Monash University that develops a cloud service to detect and stop computer hackers. Although they have some rule-based services to identify certain hacks, they would like to add machine learning models which can integrate with their Spark cluster to process large amounts of data and detect any potential hacks. They hired us as the Analytics Engineer to investigate the open data from the Cyber Range Labs of UNSW Canberra and build models based on the data to identify abnormal system behaviour. In addition, they want us to help them integrate the machine learning models into their streaming platform using Apache Kafka and Apache Spark Streaming to detect real-time threats and stop the hacking. In Part A of the assignment, we only need to process the static data and train machine learning models based on it.


What you are provided with

Four data files:

  • Linux_memory_1.csv

  • Linux_memory_2.csv

  • Linux_process_1.csv

  • Linux_process_2.csv

These files are available in Moodle in the Assessment section in the Assignment Data folder.


A Metadata file is included which contains information about the dataset.



Information on Dataset


Four data files recorded from Linux systems are provided, which capture information relating to memory activity and process activity. They are a subset of the Internet of Things dataset collected by Dr. Nour Moustafa from IoT devices, Linux systems, Windows systems, and network devices. For more detailed information on the dataset, please refer to the Metadata file included in the assignment dataset or the website below.


In this assignment, the memory activity and the process activity data are provided separately, without an explicit linkage or computer ID to link up memory and process.

For each dataset, there is a binary label and a multi-class label: one indicating whether the activity is an attack or not, and the other indicating which kind of attack the activity falls under.


What you need to achieve

The StopHacking company requires us to build separate models for the memory activity and the process activity. Each activity needs a model for binary classification predicting whether it is an attack or not, as shown in use cases 1 & 2 below. To build the binary classification model, use the column “attack” as your label in the model (label 0 as a non-attack event, and label 1 as an attack event). You can select any columns as features from each activity's data, except “attack” or “type”.

  • Use case 1: Process activity - attack or not (binary classification)


  • Use case 2: Memory activity - attack or not (binary classification)


Architecture

The overall architecture of the assignment setup is represented by the following figure. Part A of the assignment consists of preparing the data, performing data exploration and extracting features, and building and persisting the machine learning models.


In both parts, for the data pre-processing and the machine learning processes, you are required to implement the solutions using the PySpark SQL / MLlib / ML packages. For the data visualisations, please use the Matplotlib package to prepare the plots; excessive usage of Pandas for data processing is discouraged. Please follow the steps to document the processes and write the code in a Jupyter Notebook.


Getting Started

  • Download the datasets from Moodle.

  • Create an Assignment-2A.ipynb file in Jupyter Notebook to write your solution for the process data.

You will be using Python 3+ and PySpark 3.0 for this assignment.


Data preparation and exploration

  • Create a SparkConf object for using as many local cores as possible, for a proper application name, and for changing the max partition bytes configuration to enable a minimum of 2 partitions when reading each file in Spark SQL (so each dataframe should have at least 4 partitions when reading from the given datafiles). More details on the configuration options can be found at https://spark.apache.org/docs/latest/configuration.html; the partition requirement is a mock scenario considering the data size and the VM setup. A minimal sketch is shown after this list.

  • Then create a SparkSession using the SparkConf object.
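A minimal sketch of this setup is shown below, assuming local execution; the application name and the 16 MB value for spark.sql.files.maxPartitionBytes are placeholders that you should tune to your actual file sizes so that each file yields at least 2 partitions.

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Use all local cores, set a descriptive application name, and lower the
    # max partition bytes so each CSV file is split into at least 2 partitions.
    # The 16 MB value is an assumption; adjust it to roughly half the size of
    # your smallest input file.
    conf = (
        SparkConf()
        .setMaster("local[*]")
        .setAppName("StopHacking Assignment 2A")
        .set("spark.sql.files.maxPartitionBytes", str(16 * 1024 * 1024))
    )

    spark = SparkSession.builder.config(conf=conf).getOrCreate()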


Loading the data

Load each activity's data into a Spark dataframe and cache the data. Then print out the row count of each dataframe.


  • In order to speed up the loading process, please specify the schema before reading the data into dataframes. You may find relevant schema info in the metadata file; however, note that some data may not fully comply with the schema. For columns that do not comply with the schema, import them as StringType and further transform them in step 1.2.2 (see the sketch below).
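As a rough illustration, the loading step could look like the sketch below; the schema fragment and its column names are assumptions, and the full list of columns and types should come from the metadata file.

    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    # Illustrative schema fragment only; the actual columns and types come
    # from the metadata file, and non-compliant columns are read as StringType.
    memory_schema = StructType([
        StructField("ts", IntegerType(), True),
        StructField("PID", IntegerType(), True),
        StructField("MINFLT", StringType(), True),  # cast to a numeric type in 1.2.2
        StructField("attack", IntegerType(), True),
        StructField("type", StringType(), True),
    ])

    df_memory = (
        spark.read.format("csv")
        .option("header", True)
        .schema(memory_schema)
        .load(["Linux_memory_1.csv", "Linux_memory_2.csv"])
        .cache()
    )
    print("Memory activity rows:", df_memory.count())

The process activity files can be loaded the same way into a second dataframe (referred to as df_process below).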

For each column in each dataframe above,

  • Check the null data (if any) and print out the corresponding count in each column

  • Are the columns MINFLT, MAJFLT, VSTEXT, RSIZE, VGROW and RGROW in the memory data following the datatype from the metadata file? If not, please transform them into the proper formats (a sketch of both checks follows below).
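A minimal sketch of both checks is given below; the regular expression used to strip non-numeric characters is only an assumption about how the non-compliant values look, so adapt it to what you actually find in the data.

    from pyspark.sql import functions as F

    # Null count per column.
    df_memory.select(
        [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df_memory.columns]
    ).show()

    # Cast the string-typed columns to the numeric type given in the metadata.
    # The regexp_replace is an assumed clean-up step for malformed values.
    for col_name in ["MINFLT", "MAJFLT", "VSTEXT", "RSIZE", "VGROW", "RGROW"]:
        df_memory = df_memory.withColumn(
            col_name,
            F.regexp_replace(F.col(col_name), "[^0-9.-]", "").cast("double"),
        )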


Exploring the data

1. Show the count of attack and non-attack events in each activity based on the column “attack”, then show the count of each kind of attack in the process activity based on the column “type”. Do you see any class imbalance? Examine and describe what you observe (a code sketch of these exploration steps follows the hints below).




2. For each numeric feature in each activity, show the basic statistics (including count, mean, stddev, min, max); for each non-numeric feature in each activity, display the top-10 values and the corresponding counts.

  • No need to show the labels in the “attack” or “type” column

3. For each activity, present two plots worthy of presenting to the StopHacking company; describe your plots and discuss the findings from the plots.


Hint 1: you can use basic plots (e.g. histograms, line charts, scatter plots) for the relationship between a column and the “attack” label (such as “ts” and “attack”, or “PID” and “attack”), or more advanced plots like correlation plots for the relationship between each pair of columns.


Hint 2: if your data is too large for plotting, consider sampling before plotting.

  • 100 words max for each plot’s description and discussion
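The sketch below illustrates one way of carrying out these exploration steps, assuming df_memory and df_process were loaded as above; the 1% sampling fraction and the choice of plotting “ts” against “attack” are arbitrary examples.

    import matplotlib.pyplot as plt
    from pyspark.sql import functions as F

    # 1. Class balance for the binary and multi-class labels.
    df_memory.groupBy("attack").count().show()
    df_process.groupBy("type").count().show()

    # 2. Basic statistics for numeric features, top-10 values for the rest.
    numeric_cols = [c for c, t in df_memory.dtypes
                    if t in ("int", "bigint", "double") and c != "attack"]
    df_memory.select(numeric_cols).describe().show()

    string_cols = [c for c, t in df_memory.dtypes if t == "string" and c != "type"]
    for c in string_cols:
        df_memory.groupBy(c).count().orderBy(F.desc("count")).show(10)

    # 3. Example plot: sample ~1% of the rows, then compare "ts" for attack
    #    and non-attack events.
    sample_pd = df_memory.select("ts", "attack").sample(fraction=0.01, seed=42).toPandas()
    plt.hist([sample_pd.loc[sample_pd["attack"] == 0, "ts"],
              sample_pd.loc[sample_pd["attack"] == 1, "ts"]],
             bins=50, label=["non-attack", "attack"])
    plt.xlabel("ts")
    plt.ylabel("count")
    plt.legend()
    plt.show()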


Feature extraction and ML training

  • Randomly split the dataset into 80% training data and 20% testing data for each use case

  • With the class imbalance observed from 1.3.1, for binary classification use cases 1 & 2, prepare rebalanced training data with attack events and non-attack events in a 1:2 ratio, while using 20% of the attack events data from the training data from 2.1.1. Cache the rebalanced training data, and display the count of each event's data (see the sketch after this list).

Hint - you can use undersampling to get the rebalanced training data
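One possible way to build the rebalanced training set is sketched below for the memory use case; the seed and variable names are placeholders, and the same steps apply to the process data.

    # 80/20 split for the memory use case.
    train_mem, test_mem = df_memory.randomSplit([0.8, 0.2], seed=42)

    # Keep 20% of the attack rows, then undersample the non-attack rows so
    # that attack : non-attack is roughly 1 : 2.
    attack_rows = train_mem.filter("attack = 1").sample(fraction=0.2, seed=42)
    n_attack = attack_rows.count()
    n_non_attack = train_mem.filter("attack = 0").count()

    non_attack_rows = train_mem.filter("attack = 0").sample(
        fraction=min(1.0, 2.0 * n_attack / n_non_attack), seed=42
    )

    rebalanced_train_mem = attack_rows.union(non_attack_rows).cache()
    rebalanced_train_mem.groupBy("attack").count().show()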


Preparing features, labels and models

Based on the data exploration from 1.3.3, which features would you select? Discuss the reasons for selecting them and how you plan to further transform them.

  • 400 words max for the discussion

  •  Hint - things to consider include whether to scale the numeric data, whether to choose one-hot encoding or string-indexing for a specific model

Create Transformers / Estimators for transforming / assembling the features you selected above in 2.2.1, and build the ML Pipelines for each use case (a sketch follows below).
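As a sketch, assuming you keep a handful of numeric columns plus one hypothetical categorical column named "CMD" (the actual selection comes from your discussion in 2.2.1, and the random forest is only an example classifier), the stages and pipeline could look like:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (
        StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
    )
    from pyspark.ml.classification import RandomForestClassifier

    numeric_features = ["MINFLT", "MAJFLT", "RSIZE", "VGROW", "RGROW"]  # assumed selection
    categorical_feature = "CMD"  # hypothetical categorical column

    indexer = StringIndexer(inputCol=categorical_feature, outputCol="CMD_idx",
                            handleInvalid="keep")
    encoder = OneHotEncoder(inputCols=["CMD_idx"], outputCols=["CMD_vec"])

    assembler = VectorAssembler(inputCols=numeric_features + ["CMD_vec"],
                                outputCol="raw_features")
    scaler = StandardScaler(inputCol="raw_features", outputCol="features")

    rf = RandomForestClassifier(featuresCol="features", labelCol="attack")

    pipeline_mem = Pipeline(stages=[indexer, encoder, assembler, scaler, rf])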


Training and evaluating models

For each use case, use the corresponding ML Pipeline from the previous step to train the models on the rebalanced training data from 2.1.2.

  • Hint - each model training might take from 1min to 40min, depending on the complexity of the pipeline model, the amount of training data and the VM computing power

  • For each use case, test the models on the testing data from 2.1.1 and display the count of each combination of attack label and prediction label (i.e. the confusion-matrix counts).

  • Compute the AUC, accuracy, recall and precision for the attack label from each model testing result using the PySpark MLlib / ML APIs. Discuss which metric is more appropriate for measuring the model performance on identifying attacks.

  • Display the top-5 most important features in each model. Discuss which pipeline model is better, and whether the feature “ts” should be included in the model. Then visualise the ROC curve for the better model you selected for each use case.

  • Using the pipeline model you selected in the previous step, re-train the pipeline model using a bigger set of rebalanced training data, with attack events and non-attack events in a 1:2 ratio, while using all of the attack events data from the full data for both use cases. Then persist the better model for each use case (a sketch of these steps follows this list).
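A condensed sketch of the training, evaluation and persistence steps for the memory use case is shown below, reusing pipeline_mem, rebalanced_train_mem and test_mem from earlier; the use of scikit-learn's roc_curve and the output path are assumptions, not requirements.

    import matplotlib.pyplot as plt
    from pyspark.ml.evaluation import (
        BinaryClassificationEvaluator, MulticlassClassificationEvaluator
    )
    from sklearn.metrics import roc_curve  # any ROC routine would do

    # Train on the rebalanced training data and test on the held-out split.
    model_mem = pipeline_mem.fit(rebalanced_train_mem)
    predictions = model_mem.transform(test_mem)

    # Count of each combination of attack label and prediction label.
    predictions.groupBy("attack", "prediction").count().show()

    # AUC, accuracy, recall and precision for the attack label.
    auc = BinaryClassificationEvaluator(
        labelCol="attack", metricName="areaUnderROC").evaluate(predictions)
    accuracy = MulticlassClassificationEvaluator(
        labelCol="attack", metricName="accuracy").evaluate(predictions)
    recall = MulticlassClassificationEvaluator(
        labelCol="attack", metricName="recallByLabel",
        metricLabel=1.0).evaluate(predictions)
    precision = MulticlassClassificationEvaluator(
        labelCol="attack", metricName="precisionByLabel",
        metricLabel=1.0).evaluate(predictions)
    print("AUC:", auc, "accuracy:", accuracy, "recall:", recall, "precision:", precision)

    # Top-5 most important features (tree-based classifiers expose
    # featureImportances on the final pipeline stage; the indices refer to
    # positions in the assembled feature vector).
    importances = model_mem.stages[-1].featureImportances.toArray()
    print(sorted(enumerate(importances), key=lambda kv: -kv[1])[:5])

    # ROC curve for the better model.
    pdf = predictions.select("attack", "probability").toPandas()
    scores = pdf["probability"].apply(lambda v: float(v[1]))
    fpr, tpr, _ = roc_curve(pdf["attack"], scores)
    plt.plot(fpr, tpr)
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.show()

    # Persist the better model (the path is a placeholder).
    model_mem.write().overwrite().save("memory_attack_pipeline_model")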



Contact us for help with any machine learning project related to Apache Kafka and Apache Spark, at:


contact@codersarts.com
