
Big Data Analytics - Coursework Assignment

Introduction

In this blog post, we delve into the intricacies of a significant project titled "Big Data Analytics - Coursework." The project revolves around understanding, analyzing, and deriving insights from the UNSW-NB15 dataset, a rich collection of network traffic data designed for cybersecurity analysis.

Project Overview:

The project aims to conduct a comprehensive analysis of the UNSW-NB15 dataset, which encompasses a blend of real-world normal activities and synthetic contemporary attack behaviors. The dataset, generated using the IXIA PerfectStorm tool within the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS), is designed for big data analytics. The tasks involve understanding the dataset, querying and analyzing using Apache Hive, performing advanced analytics using PySpark, and documenting the entire process.

Tasks:

1. Understanding the Dataset: UNSW-NB15

The UNSW-NB15 dataset's raw network packets were generated using the IXIA PerfectStorm tool within the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS). The dataset was designed to combine real modern normal activities with synthetic contemporary attack behaviors. Tcpdump was employed to capture 100 GB of raw traffic as Pcap files. The dataset encompasses nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. To further analyze the data, the Argus and Bro-IDS tools were used, and twelve algorithms were developed, generating a total of 49 features along with their corresponding class labels.

a). The features are outlined in this section.

b). The number of attacks and their respective sub-categories is delineated here.

c). In this coursework, we work with a total of 10 million records stored in a CSV file (available for download). The file is approximately 600 MB, large enough to warrant big data methodologies for its analysis. As big data specialists, our first step is to understand the dataset's features before applying any modeling techniques. To view a subset of the data, you can import the file into Hadoop HDFS and run a Hive query to display the first 5-10 records, as sketched below under Dataset Features.

Dataset Features:
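The full 49-feature table is described in the dataset's accompanying documentation. As a quick way to inspect the features yourself, here is a minimal PySpark sketch for previewing the data once the CSV has been copied into HDFS. The HDFS path, file name, and the presence of a header row are assumptions; adjust them to your environment.

```python
# Minimal preview sketch. Assumes the CSV was copied into HDFS first, e.g.:
#   hdfs dfs -mkdir -p /data/unsw_nb15
#   hdfs dfs -put UNSW-NB15.csv /data/unsw_nb15/
# The path is an assumption; if your CSV has no header row, supply a schema
# instead of header=True.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unsw-preview").getOrCreate()

df = spark.read.csv("hdfs:///data/unsw_nb15/UNSW-NB15.csv",
                    header=True, inferSchema=True)

df.show(10)        # display the first 10 records
df.printSchema()   # inspect the 49 features and their inferred types
print(df.count())  # should be on the order of 10 million rows
```

The same preview can be produced with a Hive `SELECT ... LIMIT 10` query, as shown in the next task.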

2. Big Data Query & Analysis with Apache Hive

This task involves using Apache Hive to transform large volumes of raw data into actionable insights for end users. The process begins with a thorough understanding of the dataset. At least four Hive queries should then be formulated (refer to the marking scheme), and suitable visualization tools should be applied to present the findings both numerically and graphically, together with a brief interpretation.

Finally, screenshots of the outcomes, including tables and plots, along with the scripts/queries, should be included in the report.
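As an illustration of what such queries might look like, here is a hedged sketch that registers the CSV as an external Hive table and runs two simple aggregations through PySpark's Hive support. The HDFS location and the truncated schema are assumptions made for brevity; because external-table columns map positionally onto the CSV, a real table must declare all 49 columns in their original order.

```python
# Illustrative HiveQL run via a SparkSession with Hive support enabled.
# NOTE: the schema below is truncated for readability; a real external table
# over the UNSW-NB15 CSV needs all 49 columns declared in order.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("unsw-hive")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS unsw_nb15 (
        srcip STRING, sport INT, dstip STRING, dsport INT, proto STRING,
        dur DOUBLE, sbytes BIGINT, dbytes BIGINT,
        attack_cat STRING, label INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'hdfs:///data/unsw_nb15/'
""")

# Query 1: attack vs. normal record counts.
spark.sql("SELECT label, COUNT(*) AS n FROM unsw_nb15 GROUP BY label").show()

# Query 2: top 10 protocols by total bytes transferred.
spark.sql("""
    SELECT proto, SUM(sbytes + dbytes) AS total_bytes
    FROM unsw_nb15
    GROUP BY proto
    ORDER BY total_bytes DESC
    LIMIT 10
""").show()
```

The `.show()` output gives the numeric view; exporting the small result sets with `.toPandas()` into a plotting library covers the graphical side.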

3. Advanced Analytics using PySpark 

In this section, you will conduct advanced analytics using PySpark.

3.1. Analyze and Interpret Big Data 

We need to explore and understand the data through at least four analytical methods (descriptive statistics, correlation, hypothesis testing, density estimation, etc.). Present your work both numerically and graphically, applying tooltip text, legends, titles, and X-Y labels as appropriate to help end users gain insights.
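To make this concrete, here is a sketch of two of the suggested methods, descriptive statistics and correlation, plus a labelled chart. It assumes the DataFrame `df` from the preview sketch above; the chosen columns (`dur`, `sbytes`, `dbytes`, `attack_cat`) follow the published UNSW-NB15 feature list and should be adjusted to match your CSV header.

```python
# Descriptive statistics and Pearson correlation, assuming `df` from the
# preview sketch. Column names are assumptions based on the UNSW-NB15
# feature list; adapt them to your header.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
import matplotlib.pyplot as plt

# 1) Descriptive statistics (count, mean, stddev, min, max).
df.select("dur", "sbytes", "dbytes").describe().show()

# 2) Pearson correlation matrix over the same features.
vec = VectorAssembler(inputCols=["dur", "sbytes", "dbytes"],
                      outputCol="vec",
                      handleInvalid="skip").transform(df)
corr = Correlation.corr(vec, "vec", "pearson").head()[0]
print(corr.toArray())

# 3) Graphical view: labelled bar chart of records per attack category.
counts = (df.groupBy("attack_cat").count()
            .toPandas()
            .sort_values("count", ascending=False))
plt.bar(counts["attack_cat"].astype(str), counts["count"],
        label="record count")
plt.title("Records per attack category")
plt.xlabel("Attack category")
plt.ylabel("Count")
plt.legend()
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```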

3.2. Design and Build a Classifier 

a) Design and build a binary classifier over the dataset. Explain your algorithm and its configuration, and present your findings in both numerical and graphical form. Evaluate the performance of the model and verify its accuracy and effectiveness (see the binary-classifier sketch after part b below).

b) Apply a multi-class classifier to classify the data into ten classes (categories): one normal and nine attacks (Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms). Briefly explain your model with supportive statements on its parameters, accuracy, and effectiveness. A multi-class sketch also follows below.
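For part a), one possible starting point is the sketch below: a logistic regression binary classifier built with Spark ML and evaluated by area under the ROC curve. It assumes `df` from the earlier sketches and a numeric `label` column (0 = normal, 1 = attack); the small feature subset is purely illustrative.

```python
# Binary classifier sketch: logistic regression on a small, illustrative
# subset of numeric features. Assumes `df` with a 0/1 `label` column.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

numeric_cols = ["dur", "sbytes", "dbytes", "sttl", "dttl"]  # illustrative

assembler = VectorAssembler(inputCols=numeric_cols, outputCol="features",
                            handleInvalid="skip")
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)
pipeline = Pipeline(stages=[assembler, lr])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)

evaluator = BinaryClassificationEvaluator(labelCol="label",
                                          metricName="areaUnderROC")
print("AUC:", evaluator.evaluate(predictions))
```

Accuracy, precision/recall, and a confusion matrix can be added by aggregating the `prediction` column or by using a multiclass evaluator on the same predictions.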
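For part b), here is a hedged sketch of one way to obtain ten classes: index the `attack_cat` string column into a label (treating missing values as Normal traffic) and train a random forest. It continues from the binary sketch above, reusing `df` and `numeric_cols`.

```python
# Multi-class sketch: random forest over ten classes. Continues from the
# binary sketch above (`df`, `numeric_cols`). We assume `attack_cat` is
# empty/null for normal records, so we map those to an explicit 'Normal'
# class before indexing.
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

df10 = df.withColumn(
    "attack_cat",
    F.when(F.col("attack_cat").isNull() |
           (F.trim(F.col("attack_cat")) == ""), "Normal")
     .otherwise(F.trim(F.col("attack_cat"))))

indexer = StringIndexer(inputCol="attack_cat", outputCol="label_idx")
assembler = VectorAssembler(inputCols=numeric_cols, outputCol="features",
                            handleInvalid="skip")
rf = RandomForestClassifier(featuresCol="features", labelCol="label_idx",
                            numTrees=50)
pipeline = Pipeline(stages=[indexer, assembler, rf])

train, test = df10.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
preds = model.transform(test)

evaluator = MulticlassClassificationEvaluator(labelCol="label_idx",
                                              predictionCol="prediction",
                                              metricName="f1")
print("Weighted F1:", evaluator.evaluate(preds))
```

The tree count and feature subset are deliberately modest here; on the full 10 million records you would tune these and justify the chosen parameters in the report.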

4. Documentation

Document all your work. Your final report must follow the five sections detailed in the “format of final submission” section (refer to the next page). Your work must demonstrate an appropriate understanding of academic writing and integrity.

Conclusion

In summary, this coursework spans the full big data pipeline on the UNSW-NB15 dataset: understanding its 49 features and nine attack categories, querying it with Apache Hive, performing advanced analytics and classification with PySpark, and documenting the results. Together, these tasks illustrate why big data methodologies matter for cybersecurity analysis at this scale. We encourage readers to explore further and engage with the provided sample assignment.


Codersarts provides tailored assistance for your big data analytics project, following the tasks and objectives outlined in this blog post. Our team specializes in guiding you through each stage of the project, from understanding the dataset to implementing advanced analytics techniques.

We offer hands-on support in utilizing Apache Hive and PySpark for data transformation, querying, and analysis. Our experts ensure efficient preprocessing and feature engineering to enhance the accuracy of your models. With a focus on coding best practices, we ensure the quality and reliability of your analytical solutions.

Codersarts facilitates thorough project evaluation, conducting quantitative assessments and offering insightful interpretations of your findings. Additionally, we provide services such as documentation review and problem-solving sessions to enhance the overall quality and success of your big data analytics endeavor.

If you require any assistance with the project discussed in this blog, or if you find yourself in need of similar support for other projects, please don't hesitate to reach out to us. Our team can be contacted at any time via email at contact@codersarts.com.