
Big Data Analytics - Coursework Assignment


Introduction

In this blog post, we will look into the intricacies of a project titled "Big Data Analytics - Coursework." The project revolves around understanding, analyzing, and deriving insights from the UNSW-NB15 dataset, a rich collection of network traffic data designed for cybersecurity analysis.


Project Overview:

The project aims to conduct a comprehensive analysis of the UNSW-NB15 dataset, which encompasses a blend of real-world normal activities and synthetic contemporary attack behaviors. The dataset, generated using the IXIA PerfectStorm tool within the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS), is designed for big data analytics. The tasks involve understanding the dataset, querying and analyzing using Apache Hive, performing advanced analytics using PySpark, and documenting the entire process.


Tasks:

1. Understanding the Dataset: UNSW-NB15

The UNSW-NB15 dataset's raw network packets were generated using the IXIA PerfectStorm tool within the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS). This dataset was designed to combine real modern normal activities with synthetic contemporary attack behaviors. Tcpdump was employed to capture 100 GB of raw traffic, resulting in Pcap files. The dataset encompasses nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. To further analyze the data, the Argus and Bro-IDS tools were utilized, and twelve algorithms were developed, generating a total of 49 features along with their corresponding class labels.


a) The features are outlined in the table below.

b) The number of attacks and their respective sub-categories is delineated here.

c) In this coursework, we use a total of 10 million records stored in a CSV file (available for download). The file is approximately 600 MB in size, which is large enough to warrant the application of big data methodologies for its analysis. As big data specialists, our first step is to understand the dataset's features before applying any modeling techniques. To view a subset of the dataset, you can import it into Hadoop HDFS and run a Hive query to display the first 5-10 records for better comprehension, as sketched below.
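As a quick, hedged illustration of that first peek, the PySpark sketch below reads the CSV from HDFS, assigns the 49 column names listed in the feature table that follows, and shows the first 10 records. The HDFS path is a placeholder, and it assumes (as with the raw UNSW-NB15 distribution) that the file has no header row; adjust both to match your environment.

```python
# Minimal PySpark sketch: load the UNSW-NB15 CSV from HDFS and peek at the first records.
# The HDFS path is a placeholder; the raw CSV is assumed to have no header row,
# so the 49 column names from the feature table below are assigned explicitly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unsw-nb15-peek").getOrCreate()

columns = [
    "srcip", "sport", "dstip", "dsport", "proto", "state", "dur", "sbytes",
    "dbytes", "sttl", "dttl", "sloss", "dloss", "service", "Sload", "Dload",
    "Spkts", "Dpkts", "swin", "dwin", "stcpb", "dtcpb", "smeansz", "dmeansz",
    "trans_depth", "res_bdy_len", "Sjit", "Djit", "Stime", "Ltime", "Sintpkt",
    "Dintpkt", "tcprtt", "synack", "ackdat", "is_sm_ips_ports", "ct_state_ttl",
    "ct_flw_http_mthd", "is_ftp_login", "ct_ftp_cmd", "ct_srv_src", "ct_srv_dst",
    "ct_dst_ltm", "ct_src_ltm", "ct_src_dport_ltm", "ct_dst_sport_ltm",
    "ct_dst_src_ltm", "attack_cat", "Label",
]

# Read the raw CSV (no header) and rename the auto-generated columns.
df = (spark.read.csv("hdfs:///user/hadoop/unsw/UNSW-NB15.csv",
                     header=False, inferSchema=True)
      .toDF(*columns))

df.show(10, truncate=False)   # display the first 10 records
print(df.count(), "rows")     # rough size check against the expected record count
```

The later sketches in this post reuse `spark` and `df` from this snippet.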


Dataset Features:

No. | Name | Type | Description
1 | srcip | nominal | Source IP address
2 | sport | integer | Source port number
3 | dstip | nominal | Destination IP address
4 | dsport | integer | Destination port number
5 | proto | nominal | Transaction protocol
6 | state | nominal | Indicates the state and its dependent protocol
7 | dur | float | Record total duration
8 | sbytes | integer | Source to destination transaction bytes
9 | dbytes | integer | Destination to source transaction bytes
10 | sttl | integer | Source to destination time to live value
11 | dttl | integer | Destination to source time to live value
12 | sloss | integer | Source packets retransmitted or dropped
13 | dloss | integer | Destination packets retransmitted or dropped
14 | service | nominal | Service used
15 | Sload | float | Source bits per second
16 | Dload | float | Destination bits per second
17 | Spkts | integer | Source to destination packet count
18 | Dpkts | integer | Destination to source packet count
19 | swin | integer | Source TCP window advertisement value
20 | dwin | integer | Destination TCP window advertisement value
21 | stcpb | integer | Source TCP base sequence number
22 | dtcpb | integer | Destination TCP base sequence number
23 | smeansz | integer | Mean of the flow packet size transmitted by the source
24 | dmeansz | integer | Mean of the flow packet size transmitted by the destination
25 | trans_depth | integer | Pipelined depth into the connection of the HTTP request/response transaction
26 | res_bdy_len | integer | Actual uncompressed content size of the data transferred from the server's HTTP service
27 | Sjit | float | Source jitter (milliseconds)
28 | Djit | float | Destination jitter (milliseconds)
29 | Stime | timestamp | Record start time
30 | Ltime | timestamp | Record last time
31 | Sintpkt | float | Source interpacket arrival time (milliseconds)
32 | Dintpkt | float | Destination interpacket arrival time (milliseconds)
33 | tcprtt | float | TCP connection setup round-trip time, the sum of 'synack' and 'ackdat'
34 | synack | float | TCP connection setup time, the time between the SYN and the SYN_ACK packets
35 | ackdat | float | TCP connection setup time, the time between the SYN_ACK and the ACK packets
36 | is_sm_ips_ports | binary | Takes value 1 if the source and destination IP addresses are equal and the port numbers are equal
37 | ct_state_ttl | integer | No. for each state according to a specific range of values for source/destination time to live
38 | ct_flw_http_mthd | integer | No. of flows that have methods such as GET and POST in the HTTP service
39 | is_ftp_login | binary | 1 if the FTP session is accessed by user and password, else 0
40 | ct_ftp_cmd | integer | No. of flows that have a command in the FTP session
41 | ct_srv_src | integer | No. of connections that contain the same service and source address in 100 connections
42 | ct_srv_dst | integer | No. of connections that contain the same service and destination address in 100 connections
43 | ct_dst_ltm | integer | No. of connections of the same destination address in 100 connections
44 | ct_src_ltm | integer | No. of connections of the same source address in 100 connections
45 | ct_src_dport_ltm | integer | No. of connections of the same source address and the destination port in 100 connections
46 | ct_dst_sport_ltm | integer | No. of connections of the same destination address and the source port in 100 connections
47 | ct_dst_src_ltm | integer | No. of connections of the same source and destination address in 100 connections
48 | attack_cat | nominal | The name of each attack category
49 | Label | binary | 0 for normal and 1 for attack records


2. Big Data Query & Analysis by Apache Hive

This task involves utilizing Apache Hive to transform large volumes of raw data into actionable insights for end users. The process begins by thoroughly understanding the dataset. Subsequently, at least 4 Hive queries should be formulated (refer to the marking scheme). Suitable visualization tools should be applied to present the findings both numerically and graphically, and a brief interpretation of the findings should be provided.

Finally, screenshots of the outcomes, including tables and plots, along with the scripts/queries, should be included in the report.
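Purely for illustration (the coursework expects the queries to be written and run in Apache Hive itself), the sketch below registers the dataframe from the first snippet as a temporary view and prototypes four HiveQL-compatible queries of the kind the task asks for. The view name and the specific questions asked are assumptions chosen here, not requirements of the brief.

```python
# Illustrative sketch only: the task asks for Apache Hive queries, but the same
# HiveQL-style SQL can be prototyped with PySpark.  `spark` and `df` come from the
# earlier snippet; "unsw" is simply the temporary view name chosen here.
df.createOrReplaceTempView("unsw")

# 1. Records per attack category (normal rows carry an empty attack_cat).
spark.sql("""
    SELECT COALESCE(NULLIF(TRIM(attack_cat), ''), 'Normal') AS category,
           COUNT(*) AS records
    FROM unsw
    GROUP BY COALESCE(NULLIF(TRIM(attack_cat), ''), 'Normal')
    ORDER BY records DESC
""").show()

# 2. Average duration and transaction bytes per protocol.
spark.sql("""
    SELECT proto,
           AVG(dur)    AS avg_duration,
           AVG(sbytes) AS avg_src_bytes,
           AVG(dbytes) AS avg_dst_bytes
    FROM unsw
    GROUP BY proto
    ORDER BY avg_duration DESC
""").show(10)

# 3. Top 10 destination ports targeted by attack traffic.
spark.sql("""
    SELECT dsport, COUNT(*) AS hits
    FROM unsw
    WHERE Label = 1
    GROUP BY dsport
    ORDER BY hits DESC
    LIMIT 10
""").show()

# 4. Share of attack records per service.
spark.sql("""
    SELECT service,
           SUM(Label)            AS attack_records,
           COUNT(*)              AS total_records,
           SUM(Label) / COUNT(*) AS attack_ratio
    FROM unsw
    GROUP BY service
    ORDER BY attack_ratio DESC
""").show(10)
```

Each result can be exported (for example with .toPandas()) and charted with a visualization tool of your choice for the numerical and graphical presentation the task requires.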


3. Advanced Analytics using PySpark 

In this section, you will conduct advanced analytics using PySpark.


3.1. Analyze and Interpret Big Data 

You need to learn about and understand the data through at least 4 analytical methods (descriptive statistics, correlation, hypothesis testing, density estimation, etc.), presenting your work both numerically and graphically. Apply tooltips, legends, titles, X/Y labels, and similar elements as appropriate to help end users gain insights.
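As a hedged starting point (the choice of methods and features is yours to justify), the sketch below reuses `df` from the first snippet and covers descriptive statistics, a Pearson correlation matrix, and a simple normal-versus-attack comparison on an illustrative subset of numeric columns.

```python
# Sketch of some of the suggested analytical methods: descriptive statistics,
# correlation, and a simple group comparison.  `df` comes from the first snippet;
# the column subset chosen here is illustrative, not prescribed by the brief.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

numeric_cols = ["dur", "sbytes", "dbytes", "sttl", "dttl", "Sload", "Dload"]

# Descriptive statistics: count, mean, stddev, min, max for each selected column.
df.select(numeric_cols).describe().show()

# Pearson correlation matrix over the same columns.
vec = VectorAssembler(inputCols=numeric_cols, outputCol="features",
                      handleInvalid="skip")
corr = Correlation.corr(vec.transform(df).select("features"), "features", "pearson")
print(corr.head()[0].toArray())   # 7x7 correlation matrix as a NumPy array

# Simple comparison to motivate a hypothesis test: do normal and attack flows
# differ in mean duration and source bytes?
df.groupBy("Label").agg({"dur": "avg", "sbytes": "avg"}).show()

# Small aggregates like these can be pulled to the driver with .toPandas() and
# plotted (with titles, legends and axis labels) using matplotlib or similar.
```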


3.2. Design and Build a Classifier 

a) Design and build a binary classifier over the dataset. Explain your algorithm and its configuration, and present your findings in both numerical and graphical representations. Evaluate the performance of the model and verify its accuracy and effectiveness (a minimal PySpark sketch follows item b below).

b) Apply a multi-class classifier to classify the data into ten classes (categories): one normal and nine attacks (Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms). Briefly explain your model with supportive statements on its parameters, accuracy, and effectiveness (see the second sketch below).
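For part a), a minimal sketch of one possible binary classifier is given below. Logistic regression, the feature subset, and the 70/30 split are illustrative assumptions, and `df` again comes from the first snippet; your report should justify the algorithm and give a fuller evaluation.

```python
# Hedged sketch of a binary classifier (normal vs. attack, using the Label column).
# Logistic regression and the feature subset are illustrative choices, not the
# required approach.  `df` comes from the first snippet.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

feature_cols = ["dur", "sbytes", "dbytes", "sttl", "dttl", "Sload", "Dload",
                "Spkts", "Dpkts", "tcprtt"]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features",
                            handleInvalid="skip")
lr = LogisticRegression(featuresCol="features", labelCol="Label", maxIter=50)
pipeline = Pipeline(stages=[assembler, lr])

train, test = df.randomSplit([0.7, 0.3], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)

# Area under ROC as one headline metric; accuracy, precision/recall, a confusion
# matrix and an ROC plot would round out the evaluation in the report.
evaluator = BinaryClassificationEvaluator(labelCol="Label",
                                          metricName="areaUnderROC")
print("AUC:", evaluator.evaluate(predictions))
```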
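For part b), the sketch below indexes attack_cat (treating empty values as "Normal") into a ten-class label and fits a random forest; the algorithm, feature subset, and tree count are again illustrative assumptions rather than the prescribed method.

```python
# Hedged sketch of the ten-class case: attack_cat (plus "Normal" for benign rows)
# is indexed into a numeric label and a random forest is fitted.  Algorithm and
# feature choices are illustrative assumptions.  `df` comes from the first snippet.
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Normal records carry an empty/null attack_cat, so give them an explicit class name.
labelled = df.withColumn(
    "category",
    F.when(F.col("attack_cat").isNull() | (F.trim(F.col("attack_cat")) == ""), "Normal")
     .otherwise(F.trim(F.col("attack_cat"))),
)

feature_cols = ["dur", "sbytes", "dbytes", "sttl", "dttl", "Sload", "Dload",
                "Spkts", "Dpkts", "tcprtt"]

indexer = StringIndexer(inputCol="category", outputCol="label", handleInvalid="keep")
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features",
                            handleInvalid="skip")
rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100)
pipeline = Pipeline(stages=[indexer, assembler, rf])

train, test = labelled.randomSplit([0.7, 0.3], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="f1")
print("Weighted F1:", evaluator.evaluate(predictions))
print("Accuracy:  ", evaluator.setMetricName("accuracy").evaluate(predictions))
```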


4. Documentation

Document all your work. Your final report must follow the 5 sections detailed in the "format of final submission" section (refer to the next page). Your work must demonstrate an appropriate understanding of academic writing and integrity.


To summarize, this project walks through the complete big data analytics workflow on the UNSW-NB15 dataset: understanding its 49 features and attack categories, querying it with Apache Hive, performing advanced analytics and building binary and multi-class classifiers with PySpark, and documenting the whole process. It illustrates how big data methodologies apply to network security data, and readers are encouraged to explore further and engage with the provided sample assignment.



 

Codersarts provides tailored assistance for your big data analytics project, following the outlined tasks and objectives in your blog. Our team specializes in guiding you through each stage of the project, from understanding the dataset to implementing advanced analytics techniques.


We offer hands-on support in utilizing Apache Hive and PySpark for data transformation, querying, and analysis. Our experts ensure efficient preprocessing and feature engineering to enhance the accuracy of your models. With a focus on coding best practices, we ensure the quality and reliability of your analytical solutions.


Codersarts facilitates thorough project evaluation, conducting quantitative assessments and offering insightful interpretations of your findings. Additionally, we provide services such as documentation review and problem-solving sessions to enhance the overall quality and success of your big data analytics endeavor.


If you require any assistance with the project discussed in this blog, or if you find yourself in need of similar support for other projects, please don't hesitate to reach out to us. Our team can be contacted at any time via email at contact@codersarts.com.
