"User Identification In Tor Network" In Machine Learning | Machine Learning Assignment Help

Codersarts
May 18, 2020

Updated: May 23, 2020

INTRODUCTION

USER IDENTIFICATION IN TOR NETWORKS

Unlike conventional World Wide Web technologies, the onion routing technology behind the Tor darknet gives users a real chance to remain anonymous. Many users have jumped at this chance: some to protect themselves or out of curiosity, while others developed a false sense of impunity and saw an opportunity to do clandestine business anonymously, such as selling banned goods or distributing illegal content. However, later developments, such as the arrest of the creator of the Silk Road site, demonstrated that these businesses were less anonymous than most assumed. Intelligence services have not disclosed technical details of how they tracked down the cybercriminals who created Tor sites to distribute illegal goods; in particular, they have given no clues about how they identify cybercriminals who act anonymously. This may mean that the implementation of the Tor darknet contains vulnerabilities and/or configuration defects that make it possible to unmask a Tor user.

TOR NETWORK

Tor is software that allows users to browse the Web anonymously. It is developed by the Tor Project, a nonprofit organization that advocates for anonymity on the internet. The name comes from The Onion Router, because Tor uses a technique called onion routing to conceal information about user activity. Tor Browser is among the strongest widely available tools for anonymous web browsing, and researchers continue to work on improving Tor's anonymity properties. At its core, Tor is a network protocol designed to anonymize the data relayed across it. Using Tor's software makes it difficult, though not necessarily impossible, for snoops to see your webmail, search history, social media posts, or other online activity.

How does Tor work?

The Tor network runs through the computer servers of thousands of volunteers spread throughout the world. When data enters the Tor network, it is bundled into an encrypted packet. Then, unlike the case with normal Internet connections, Tor strips away part of the packet's header, which is a part of the addressing information that could be used to learn things about the sender, such as the operating system from which the message was sent. Finally, Tor encrypts the rest of the addressing information, called the packet wrapper. Regular Internet connections don't do this. The modified and encrypted data packet is then routed through several of these servers, called relays, on the way to its final destination. Each relay removes one layer of encryption and learns only where the packet came from and where to send it next. The roundabout way packets travel through the Tor network is akin to a person taking a roundabout path through a city to shake a pursuer.
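
The layering idea can be illustrated with a toy sketch. It is not Tor's actual cryptography: the keystream construction, the key values, and the function names (`wrap`, `route`, `xor_layer`) are all assumptions made for illustration, and a simple XOR stream like this is not secure. The sender wraps the message in one layer per relay, and each relay peels off exactly one layer.

```python
import hashlib


def keystream(key: bytes, length: int) -> bytes:
    """Derive a byte stream from a key (toy construction, NOT secure)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def xor_layer(data: bytes, key: bytes) -> bytes:
    """Add or remove one layer; XOR with the same keystream undoes itself."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


def wrap(message: bytes, relay_keys: list) -> bytes:
    """Sender side: add one layer per relay, innermost layer for the last relay."""
    packet = message
    for key in reversed(relay_keys):
        packet = xor_layer(packet, key)
    return packet


def route(packet: bytes, relay_keys: list) -> bytes:
    """Each relay in turn peels off one layer using only its own key."""
    for key in relay_keys:
        packet = xor_layer(packet, key)
    return packet


# Hypothetical per-relay keys; a real client negotiates a key with each relay.
relay_keys = [b"guard-key", b"middle-key", b"exit-key"]
onion = wrap(b"hello", relay_keys)
after_first_relay = xor_layer(onion, relay_keys[0])
delivered = route(onion, relay_keys)
```

After the first relay removes its layer, the payload is still unreadable; only after the last layer is removed does the original message appear.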

DEEP NEURAL NETWORK-BASED USER IDENTIFICATION SYSTEM

Deep neural networks (DNNs) are neural networks with many hidden layers, composed of multiple levels of nonlinear operations. Convolutional networks are one well-known type of DNN rather than a synonym for the term. Deep learning methods aim to learn feature hierarchies, in which features at higher levels of the hierarchy are built from features at lower levels. In practice, a network is usually called deep when it has more than two layers. Deep neural networks use this layered mathematical modeling to process data in complex ways.
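
To make "multiple levels of nonlinear operations" concrete, here is a minimal forward pass through two hidden layers, written in plain Python. The layer sizes and all weight values are made up for illustration; a trained network would learn them from data.

```python
import math


def relu(z):
    """Nonlinearity used in the hidden layers."""
    return max(0.0, z)


def sigmoid(z):
    """Squashes the final score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), one output per weight row."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical parameters: 2 inputs -> 3 hidden units -> 2 hidden units -> 1 output.
hidden1 = ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1])
hidden2 = ([[0.2, 0.5, 0.3], [0.7, 0.1, -0.4]], [0.05, 0.0])
output = ([[0.6, -0.9]], [0.0])


def forward(x):
    """Each layer consumes the features produced by the layer below it."""
    h1 = dense(x, hidden1[0], hidden1[1], relu)
    h2 = dense(h1, hidden2[0], hidden2[1], relu)
    return dense(h2, output[0], output[1], sigmoid)[0]


score = forward([1.0, 2.0])
```

The second hidden layer never sees the raw input, only the features computed by the first, which is the feature hierarchy described above.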

SOME ADVANTAGES OF USING THIS APPROACH OVER OTHERS

Achieves best-in-class performance, significantly outperforming other solutions on problems in multiple domains.

Reduces the need for feature engineering, one of the most time-consuming parts of machine learning practice.

It is an architecture that can be adapted to new problems relatively easily.

SOME DISADVANTAGES OF THIS SYSTEM

Requires a large amount of data

It is extremely computationally expensive to train. The most complex models take weeks to train using hundreds of machines equipped with expensive GPUs.

Lacks a strong theoretical foundation, which leads to the next disadvantage.

Determining the topology, flavor, training method, and hyperparameters for deep learning is largely a matter of trial and error, with little theory to guide you.

What is learned is not easy to comprehend. Other classifiers (e.g. decision trees, logistic regression, etc.) make it much easier to understand what’s going on.

Requirement specification

3.1.1 KERAS

Keras is an open-source, high-level neural networks API written in Python. It was developed by François Chollet, a Google engineer, with a focus on enabling fast experimentation: being able to go from idea to result with the least possible delay is key to doing good research. Keras is designed to be modular, fast, and easy to use. It does not handle low-level computation itself. Instead, it delegates that work to another library, called the backend, and can run on top of TensorFlow, CNTK, or Theano. By wrapping these efficient numerical computation libraries, Keras allows you to define and train neural network models in just a few lines of code.

3.1.2 PANDAS

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time-series data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Its central data structure, the DataFrame, is a two-dimensional labeled data structure with columns of potentially different types. A pandas DataFrame consists of three main components: the data, the index, and the columns.

First, the DataFrame holds the data itself.
Besides the data, you can also specify the index (row labels) and the column names for your DataFrame.
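
The three components can be seen in a short sketch. The row labels, column names, and values below are invented for illustration and do not come from the project's dataset.

```python
import pandas as pd

# Made-up traffic records: one row per flow, columns of different types.
data = [
    [120, 0.8, "TCP"],
    [1500, 0.1, "TCP"],
    [60, 2.3, "UDP"],
]

df = pd.DataFrame(
    data,                                   # the data
    index=["flow_a", "flow_b", "flow_c"],   # the index (row labels)
    columns=["packet_size", "inter_arrival_s", "protocol"],  # the columns
)

# Label-based access uses the index and column names together.
largest = df.loc["flow_b", "packet_size"]
```

Each column keeps its own type, so numeric and text attributes can sit side by side in one table.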

Steps To Do This

Data Collection

This step can be carried out in several ways. One of them is to use Wireshark to capture live traffic. Wireshark is an open-source network protocol analyzer that records the packets passing through a network interface and lets you inspect and export them.
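
As a rough illustration of how captured traffic might be brought into Python, the sketch below assumes the packet list was exported from Wireshark as CSV with its default column layout (No., Time, Source, Destination, Protocol, Length, Info). The sample rows are invented and use documentation-only IP addresses; `bytes_per_source` is a hypothetical helper, not part of any library.

```python
import csv
import io
from collections import defaultdict

# Invented sample standing in for an exported capture file.
SAMPLE_EXPORT = '''"No.","Time","Source","Destination","Protocol","Length","Info"
"1","0.000000","192.0.2.10","198.51.100.7","TCP","583","example"
"2","0.120000","198.51.100.7","192.0.2.10","TCP","1500","example"
"3","0.450000","192.0.2.10","198.51.100.7","TCP","583","example"
'''


def bytes_per_source(csv_text):
    """Sum the Length column for every distinct Source address."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["Source"]] += int(row["Length"])
    return dict(totals)


totals = bytes_per_source(SAMPLE_EXPORT)
```

For a real capture, the same function could be given the contents of the exported file instead of the inline sample.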

Data Preprocessing

The next stage in our analysis is the preprocessing phase. Data pre-processing is a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data pre-processing is a proven method of resolving such issues, and it prepares raw data for further processing. It is used in database-driven applications such as customer relationship management and in machine learning applications such as neural networks. Data goes through a series of steps during pre-processing:

1. Data Cleaning: Data is cleansed through processes such as filling in missing values, smoothing the noisy data, or resolving the inconsistencies in the data.

2. Data Integration: Data with different representations are put together and conflicts within the data are resolved.

3. Data Transformation: Data is normalized, aggregated, and generalized.

4. Data Reduction: This step aims to present a reduced representation of the data in a data warehouse.

5. Data Discretization: Reduces the number of values of a continuous attribute by dividing the attribute's range into intervals.
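
Three of these steps (cleaning, transformation, and discretization) can be sketched in plain Python. The strategy choices here, mean imputation, min-max scaling, and equal-width bins, are assumptions picked for simplicity, and the input values are made up.

```python
def fill_missing(values):
    """Cleaning: replace None with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Transformation: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, n_bins):
    """Discretization: map values in [0, 1] to equal-width bin numbers."""
    return [min(int(v * n_bins), n_bins - 1) for v in values]


raw = [100, None, 300, 500]              # made-up packet sizes, one missing
cleaned = fill_missing(raw)              # the gap becomes the mean, 300.0
normalized = min_max_normalize(cleaned)  # smallest value -> 0.0, largest -> 1.0
bins = discretize(normalized, 4)         # four intervals numbered 0 to 3
```

The order matters: normalizing before filling the gap would fail on the missing value, and the bin boundaries only make sense once the values share a common scale.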

Once we transformed the Tor data in accordance with these steps, there were still some inconsistencies, which can be described as follows:

Inaccurate data (missing data): There are many reasons for missing data, such as data not being collected continuously, mistakes in data entry, technical problems with biometrics, and more.

Noisy data (erroneous data and outliers): Noisy data can result from a technical problem with the device that gathers the data, a human mistake during data entry, and more.

Inconsistent data: Inconsistencies arise from duplication within the data, human data entry, and mistakes in codes or names, i.e., violations of data constraints.

Therefore, to handle raw data, Data Pre-processing is performed.
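
A minimal sketch of how the three kinds of problem might be detected is given below. The records are invented (packet size, inter-arrival time), and the outlier rule, a distance of more than `k` median absolute deviations from the median, is one common heuristic chosen here as an assumption.

```python
import statistics


def find_missing(records):
    """Indices of records that contain at least one missing (None) field."""
    return [i for i, rec in enumerate(records) if any(v is None for v in rec)]


def find_duplicates(records):
    """Indices of records that repeat an earlier record exactly."""
    seen, duplicates = set(), []
    for i, rec in enumerate(records):
        if rec in seen:
            duplicates.append(i)
        else:
            seen.add(rec)
    return duplicates


def find_outliers(values, k=5.0):
    """Indices of values further than k median absolute deviations from the median."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - median) > k * mad]


# Made-up records: (packet_size, inter_arrival_seconds)
records = [
    (583, 0.12),
    (583, 0.12),     # exact duplicate of the first record
    (610, None),     # missing inter-arrival time
    (590, 0.10),
    (90000, 0.11),   # implausibly large packet size
    (600, 0.13),
]

missing = find_missing(records)
duplicates = find_duplicates(records)
outliers = find_outliers([size for size, _ in records])
```

What to do with flagged rows (drop, repair, or keep) depends on the dataset, so the sketch only reports their positions.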

Feature Selection

Feature selection is one of the core concepts in machine learning, because the features you use to train your models have a huge influence on the performance you can achieve.

Feature Selection Algorithms

There are three general classes of feature selection algorithms: filter methods, wrapper methods, and embedded methods.

1. Filter Methods: Filter feature selection methods apply a statistical measure to assign a score to each feature. The features are ranked by the score and either selected to be kept or removed from the dataset. The methods are often univariate and consider each feature independently, or with regard to the dependent variable.

2. Wrapper Methods: Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy.

3. Embedded Methods: Embedded methods learn which features best contribute to the accuracy of the model while the model is being created. The most common embedded feature selection methods are regularization methods.
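
As an example of the first class, the sketch below implements a simple filter method: each feature is scored by the absolute Pearson correlation with the label, and the features are ranked by that score. The feature names and values are made up, and correlation is only one of several statistical measures that could be used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


def rank_features(columns, labels):
    """Score every feature independently and sort from most to least relevant."""
    scores = {name: abs(pearson(vals, labels)) for name, vals in columns.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Made-up feature columns for four samples and their binary labels.
columns = {
    "mean_packet_size": [100, 120, 900, 950],
    "constant_flag": [1, 1, 1, 1],
    "noise": [5, 1, 4, 3],
}
labels = [0, 0, 1, 1]

ranking, scores = rank_features(columns, labels)
top_features = ranking[:2]  # keep the two highest-scoring features
```

Because each feature is scored on its own, this approach is fast but cannot notice features that are only useful in combination, which is the gap wrapper methods try to fill.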

What are the benefits of performing feature selection before modeling your data?

Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.

Improves Accuracy: Less misleading data means modeling accuracy improves.

Reduces Training Time: Fewer features reduce algorithm complexity, so algorithms train faster.

A short description of the feature set we used is given in the snippet below:

Model Selection

Model selection is a process that can be applied both across different types of models (e.g. logistic regression, SVM, KNN, etc.) and across models of the same type configured with different model hyperparameters (e.g. different kernels in an SVM). There are many approaches for model selection such as:

Model Selection Using Structural Risk Minimization (SRM):

In any ML problem, we specify a hypothesis class H that we believe includes a good predictor for the learning task at hand. In the SRM paradigm, we express H as a union of smaller hypothesis classes and specify a weight function that assigns a weight to each of these classes, such that a higher weight reflects a stronger preference for that class.

So the bottom line is that model selection is the process of selecting one final machine learning model from among a collection of candidate machine learning models for a training dataset. For this project, we built a deep learning model based on neural networks.
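
Selecting among candidates of the same type with different hyperparameters can be sketched with a small k-nearest-neighbours classifier, where the candidates differ only in k and the winner is chosen by accuracy on a held-out validation set. This is a generic illustration, not the procedure used in this project; the data points are invented and deliberately include one mislabeled training sample.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (binary labels)."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0


def validation_accuracy(train, validation, k):
    """Fraction of validation points the k-NN model classifies correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in validation)
    return correct / len(validation)


def select_k(train, validation, candidates):
    """Return the candidate with the best validation accuracy (first one on ties)."""
    return max(candidates, key=lambda k: validation_accuracy(train, validation, k))


# Made-up one-dimensional data: (feature, label). The point (1.1, 1) is label noise.
train = [(0.8, 0), (1.0, 0), (1.1, 1), (1.2, 0), (2.9, 1), (3.0, 1), (3.2, 1)]
validation = [(1.09, 0), (3.1, 1), (0.9, 0), (2.8, 1)]

best_k = select_k(train, validation, [1, 3, 5])
```

With k = 1 the model copies the noisy label of its single nearest neighbour, while larger k averages it away, which is why the validation set prefers a larger value here.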

The main architecture of this model is shown in the snippet below:

Model Evaluation

The model evaluation shows how well the model has performed, based on the accuracy and loss it achieved. The following shows the loss and accuracy of the model for just four epochs.
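
For reference, the two metrics can be computed as in the sketch below. It assumes a binary classification setting with cross-entropy loss for simplicity, and the labels and predicted probabilities are made up; they are not the results of this project.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average negative log-likelihood of the true labels under the predictions."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Made-up labels and predicted probabilities for four samples.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.1]

loss = binary_cross_entropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
```

Accuracy only counts whether each prediction lands on the right side of the threshold, while the loss also penalizes predictions that are correct but unconfident, so the two can move differently from one epoch to the next.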