
Big Data Analytics - Coursework Assignment


Introduction

In this blog post, we will look into the intricacies of a project titled "Big Data Analytics - Coursework." The project revolves around understanding, analyzing, and deriving insights from the UNSW-NB15 dataset, a rich collection of network traffic data designed for cybersecurity analysis.


Project Overview:

The project aims to conduct a comprehensive analysis of the UNSW-NB15 dataset, which encompasses a blend of real-world normal activities and synthetic contemporary attack behaviors. The dataset, generated using the IXIA PerfectStorm tool within the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS), is designed for big data analytics. The tasks involve understanding the dataset, querying and analyzing using Apache Hive, performing advanced analytics using PySpark, and documenting the entire process.


Tasks:

1. Understanding the Dataset: UNSW-NB15

The UNSW-NB15 dataset's raw network packets were generated using the IXIA PerfectStorm tool within the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS). This dataset was designed to combine real modern normal activities with synthetic contemporary attack behaviors. Tcpdump was employed to capture 100 GB of raw traffic, resulting in Pcap files. The dataset encompasses nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. To further analyze the data, the Argus and Bro-IDS tools were utilized, and twelve algorithms were developed, generating a total of 49 features along with their corresponding class labels.


a) The features are outlined in the table below.

b) The number of attacks and their respective sub-categories is delineated here.

c) In this coursework, we use a total of 10 million records stored in a CSV file (available for download). The file is approximately 600 MB in size, which is large enough to warrant the application of big data methodologies for its analysis. As big data specialists, our first step is to understand the dataset's features before applying any modeling techniques. To view a subset of the dataset, you can import it into Hadoop HDFS and run a Hive query to display the first 5-10 records for better comprehension, as sketched below.
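As a quick, hedged illustration of that first peek, the PySpark sketch below reads the CSV from HDFS, assigns the 49 column names listed in the feature table that follows, and shows the first 10 records. The HDFS path is a placeholder, and it assumes (as with the raw UNSW-NB15 distribution) that the file has no header row; adjust both to match your environment.

```python
# Minimal PySpark sketch: load the UNSW-NB15 CSV from HDFS and peek at the first records.
# The HDFS path is a placeholder; the raw CSV is assumed to have no header row,
# so the 49 column names from the feature table below are assigned explicitly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unsw-nb15-peek").getOrCreate()

columns = [
    "srcip", "sport", "dstip", "dsport", "proto", "state", "dur", "sbytes",
    "dbytes", "sttl", "dttl", "sloss", "dloss", "service", "Sload", "Dload",
    "Spkts", "Dpkts", "swin", "dwin", "stcpb", "dtcpb", "smeansz", "dmeansz",
    "trans_depth", "res_bdy_len", "Sjit", "Djit", "Stime", "Ltime", "Sintpkt",
    "Dintpkt", "tcprtt", "synack", "ackdat", "is_sm_ips_ports", "ct_state_ttl",
    "ct_flw_http_mthd", "is_ftp_login", "ct_ftp_cmd", "ct_srv_src", "ct_srv_dst",
    "ct_dst_ltm", "ct_src_ltm", "ct_src_dport_ltm", "ct_dst_sport_ltm",
    "ct_dst_src_ltm", "attack_cat", "Label",
]

# Read the raw CSV (no header) and rename the auto-generated columns.
df = (spark.read.csv("hdfs:///user/hadoop/unsw/UNSW-NB15.csv",
                     header=False, inferSchema=True)
      .toDF(*columns))

df.show(10, truncate=False)   # display the first 10 records
print(df.count(), "rows")     # rough size check against the expected record count
```

The later sketches in this post reuse `spark` and `df` from this snippet.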


Dataset Features:

No. | Name | Type | Description
1 | srcip | nominal | Source IP address
2 | sport | integer | Source port number
3 | dstip | nominal | Destination IP address
4 | dsport | integer | Destination port number
5 | proto | nominal | Transaction protocol
6 | state | nominal | Indicates the state and its dependent protocol
7 | dur | float | Record total duration
8 | sbytes | integer | Source to destination transaction bytes
9 | dbytes | integer | Destination to source transaction bytes
10 | sttl | integer | Source to destination time to live value
11 | dttl | integer | Destination to source time to live value
12 | sloss | integer | Source packets retransmitted or dropped
13 | dloss | integer | Destination packets retransmitted or dropped
14 | service | nominal | Service used
15 | Sload | float | Source bits per second
16 | Dload | float | Destination bits per second
17 | Spkts | integer | Source to destination packet count
18 | Dpkts | integer | Destination to source packet count
19 | swin | integer | Source TCP window advertisement value
20 | dwin | integer | Destination TCP window advertisement value
21 | stcpb | integer | Source TCP base sequence number
22 | dtcpb | integer | Destination TCP base sequence number
23 | smeansz | integer | Mean of the flow packet size transmitted by the source
24 | dmeansz | integer | Mean of the flow packet size transmitted by the destination
25 | trans_depth | integer | Pipelined depth into the connection of the HTTP request/response transaction
26 | res_bdy_len | integer | Actual uncompressed content size of the data transferred from the server's HTTP service
27 | Sjit | float | Source jitter (milliseconds)
28 | Djit | float | Destination jitter (milliseconds)
29 | Stime | timestamp | Record start time
30 | Ltime | timestamp | Record last time
31 | Sintpkt | float | Source interpacket arrival time (milliseconds)
32 | Dintpkt | float | Destination interpacket arrival time (milliseconds)
33 | tcprtt | float | TCP connection setup round-trip time, the sum of 'synack' and 'ackdat'
34 | synack | float | TCP connection setup time, the time between the SYN and the SYN_ACK packets
35 | ackdat | float | TCP connection setup time, the time between the SYN_ACK and the ACK packets
36 | is_sm_ips_ports | binary | Takes value 1 if the source and destination IP addresses are equal and the port numbers are equal
37 | ct_state_ttl | integer | No. for each state according to a specific range of values for source/destination time to live
38 | ct_flw_http_mthd | integer | No. of flows that have methods such as GET and POST in the HTTP service
39 | is_ftp_login | binary | 1 if the FTP session is accessed by user and password, else 0
40 | ct_ftp_cmd | integer | No. of flows that have a command in the FTP session
41 | ct_srv_src | integer | No. of connections that contain the same service and source address in 100 connections
42 | ct_srv_dst | integer | No. of connections that contain the same service and destination address in 100 connections
43 | ct_dst_ltm | integer | No. of connections of the same destination address in 100 connections
44 | ct_src_ltm | integer | No. of connections of the same source address in 100 connections
45 | ct_src_dport_ltm | integer | No. of connections of the same source address and the destination port in 100 connections
46 | ct_dst_sport_ltm | integer | No. of connections of the same destination address and the source port in 100 connections
47 | ct_dst_src_ltm | integer | No. of connections of the same source and destination address in 100 connections
48 | attack_cat | nominal | The name of each attack category
49 | Label | binary | 0 for normal and 1 for attack records


2. Big Data Query & Analysis by Apache Hive

This task involves utilizing Apache Hive to transform large volumes of raw data into actionable insights for end users. The process begins by thoroughly understanding the dataset. Subsequently, at least 4 Hive queries should be formulated (refer to the marking scheme). Suitable visualization tools should be applied to present the findings both numerically and graphically, and a brief interpretation of the findings should be provided.

Finally, screenshots of the outcomes, including tables and plots, along with the scripts/queries, should be included in the report.
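Purely for illustration (the coursework expects the queries to be written and run in Apache Hive itself), the sketch below registers the dataframe from the first snippet as a temporary view and prototypes four HiveQL-compatible queries of the kind the task asks for. The view name and the specific questions asked are assumptions chosen here, not requirements of the brief.

```python
# Illustrative sketch only: the task asks for Apache Hive queries, but the same
# HiveQL-style SQL can be prototyped with PySpark.  `spark` and `df` come from the
# earlier snippet; "unsw" is simply the temporary view name chosen here.
df.createOrReplaceTempView("unsw")

# 1. Records per attack category (normal rows carry an empty attack_cat).
spark.sql("""
    SELECT COALESCE(NULLIF(TRIM(attack_cat), ''), 'Normal') AS category,
           COUNT(*) AS records
    FROM unsw
    GROUP BY COALESCE(NULLIF(TRIM(attack_cat), ''), 'Normal')
    ORDER BY records DESC
""").show()

# 2. Average duration and transaction bytes per protocol.
spark.sql("""
    SELECT proto,
           AVG(dur)    AS avg_duration,
           AVG(sbytes) AS avg_src_bytes,
           AVG(dbytes) AS avg_dst_bytes
    FROM unsw
    GROUP BY proto
    ORDER BY avg_duration DESC
""").show(10)

# 3. Top 10 destination ports targeted by attack traffic.
spark.sql("""
    SELECT dsport, COUNT(*) AS hits
    FROM unsw
    WHERE Label = 1
    GROUP BY dsport
    ORDER BY hits DESC
    LIMIT 10
""").show()

# 4. Share of attack records per service.
spark.sql("""
    SELECT service,
           SUM(Label)            AS attack_records,
           COUNT(*)              AS total_records,
           SUM(Label) / COUNT(*) AS attack_ratio
    FROM unsw
    GROUP BY service
    ORDER BY attack_ratio DESC
""").show(10)
```

Each result can be exported (for example with .toPandas()) and charted with a visualization tool of your choice for the numerical and graphical presentation the task requires.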


3. Advanced Analytics using PySpark 

In this section, you will conduct advanced analytics using PySpark.


3.1. Analyze and Interpret Big Data 

You need to learn about and understand the data through at least 4 analytical methods (descriptive statistics, correlation, hypothesis testing, density estimation, etc.), presenting your work both numerically and graphically. Apply tooltips, legends, titles, X/Y labels, and similar elements as appropriate to help end users gain insights.
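As a hedged starting point (the choice of methods and features is yours to justify), the sketch below reuses `df` from the first snippet and covers descriptive statistics, a Pearson correlation matrix, and a simple normal-versus-attack comparison on an illustrative subset of numeric columns.

```python
# Sketch of some of the suggested analytical methods: descriptive statistics,
# correlation, and a simple group comparison.  `df` comes from the first snippet;
# the column subset chosen here is illustrative, not prescribed by the brief.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

numeric_cols = ["dur", "sbytes", "dbytes", "sttl", "dttl", "Sload", "Dload"]

# Descriptive statistics: count, mean, stddev, min, max for each selected column.
df.select(numeric_cols).describe().show()

# Pearson correlation matrix over the same columns.
vec = VectorAssembler(inputCols=numeric_cols, outputCol="features",
                      handleInvalid="skip")
corr = Correlation.corr(vec.transform(df).select("features"), "features", "pearson")
print(corr.head()[0].toArray())   # 7x7 correlation matrix as a NumPy array

# Simple comparison to motivate a hypothesis test: do normal and attack flows
# differ in mean duration and source bytes?
df.groupBy("Label").agg({"dur": "avg", "sbytes": "avg"}).show()

# Small aggregates like these can be pulled to the driver with .toPandas() and
# plotted (with titles, legends and axis labels) using matplotlib or similar.
```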


3.2. Design and Build a Classifier 

a) Design and build a binary classifier over the dataset. Explain your algorithm and its configuration, and present your findings in both numerical and graphical representations. Evaluate the performance of the model and verify its accuracy and effectiveness (a minimal PySpark sketch follows item b below).

b) Apply a multi-class classifier to classify the data into ten classes (categories): one normal and nine attacks (Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms). Briefly explain your model with supportive statements on its parameters, accuracy, and effectiveness (see the second sketch below).
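For part a), a minimal sketch of one possible binary classifier is given below. Logistic regression, the feature subset, and the 70/30 split are illustrative assumptions, and `df` again comes from the first snippet; your report should justify the algorithm and give a fuller evaluation.

```python
# Hedged sketch of a binary classifier (normal vs. attack, using the Label column).
# Logistic regression and the feature subset are illustrative choices, not the
# required approach.  `df` comes from the first snippet.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

feature_cols = ["dur", "sbytes", "dbytes", "sttl", "dttl", "Sload", "Dload",
                "Spkts", "Dpkts", "tcprtt"]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features",
                            handleInvalid="skip")
lr = LogisticRegression(featuresCol="features", labelCol="Label", maxIter=50)
pipeline = Pipeline(stages=[assembler, lr])

train, test = df.randomSplit([0.7, 0.3], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)

# Area under ROC as one headline metric; accuracy, precision/recall, a confusion
# matrix and an ROC plot would round out the evaluation in the report.
evaluator = BinaryClassificationEvaluator(labelCol="Label",
                                          metricName="areaUnderROC")
print("AUC:", evaluator.evaluate(predictions))
```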
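For part b), the sketch below indexes attack_cat (treating empty values as "Normal") into a ten-class label and fits a random forest; the algorithm, feature subset, and tree count are again illustrative assumptions rather than the prescribed method.

```python
# Hedged sketch of the ten-class case: attack_cat (plus "Normal" for benign rows)
# is indexed into a numeric label and a random forest is fitted.  Algorithm and
# feature choices are illustrative assumptions.  `df` comes from the first snippet.
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Normal records carry an empty/null attack_cat, so give them an explicit class name.
labelled = df.withColumn(
    "category",
    F.when(F.col("attack_cat").isNull() | (F.trim(F.col("attack_cat")) == ""), "Normal")
     .otherwise(F.trim(F.col("attack_cat"))),
)

feature_cols = ["dur", "sbytes", "dbytes", "sttl", "dttl", "Sload", "Dload",
                "Spkts", "Dpkts", "tcprtt"]

indexer = StringIndexer(inputCol="category", outputCol="label", handleInvalid="keep")
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features",
                            handleInvalid="skip")
rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100)
pipeline = Pipeline(stages=[indexer, assembler, rf])

train, test = labelled.randomSplit([0.7, 0.3], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="f1")
print("Weighted F1:", evaluator.evaluate(predictions))
print("Accuracy:  ", evaluator.setMetricName("accuracy").evaluate(predictions))
```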


4. Documentation

Document all your work. Your final report must follow the 5 sections detailed in the "format of final submission" section (refer to the next page). Your work must demonstrate an appropriate understanding of academic writing and integrity.


To summarize, this project walks through the complete big data analytics workflow on the UNSW-NB15 dataset: understanding its 49 features and attack categories, querying it with Apache Hive, performing advanced analytics and building binary and multi-class classifiers with PySpark, and documenting the whole process. It illustrates how big data methodologies apply to network security data, and readers are encouraged to explore further and engage with the provided sample assignment.



 

Codersarts provides tailored assistance for your big data analytics project, following the outlined tasks and objectives in your blog. Our team specializes in guiding you through each stage of the project, from understanding the dataset to implementing advanced analytics techniques.


We offer hands-on support in utilizing Apache Hive and PySpark for data transformation, querying, and analysis. Our experts ensure efficient preprocessing and feature engineering to enhance the accuracy of your models. With a focus on coding best practices, we ensure the quality and reliability of your analytical solutions.


Codersarts facilitates thorough project evaluation, conducting quantitative assessments and offering insightful interpretations of your findings. Additionally, we provide services such as documentation review and problem-solving sessions to enhance the overall quality and success of your big data analytics endeavor.


If you require any assistance with the project discussed in this blog, or if you find yourself in need of similar support for other projects, please don't hesitate to reach out to us. Our team can be contacted at any time via email at contact@codersarts.com.
