
Data Engineering and Analysis with Hadoop, Twitter, and DynamoDB


Introduction:


Welcome to our latest blog post! Today we delve into another project requirement in the realm of data engineering and analysis: a project titled "Data Engineering and Analysis with Hadoop, Twitter, and DynamoDB." This project represents the convergence of several technologies, as we explore the dynamic landscapes of Hadoop clusters, Twitter data streams, and DynamoDB databases. Our journey includes configuring a Hadoop cluster on Ubuntu VMs, streaming and preprocessing real-time tweets from Twitter, and executing database operations using AWS DynamoDB and HiveQL queries. Within our solution approach, we outline the methodologies and practical implementations. Finally, in the output section, we showcase some of the project's key findings and outputs.


Project Requirement:


Problem Statement:

The project aims to configure and demonstrate a Hadoop cluster on Ubuntu VMs, stream and preprocess public tweets from Twitter, and perform database operations using AWS DynamoDB and HiveQL queries.


Experiment I

You are required to attempt this task on the Hadoop cluster created using Ubuntu VMs on local machines during the lectures.


  1. Modify the existing master node to meet the following characteristics and hardware specifications:

  • 8 GB of RAM;

  • 1 processor/core/CPU;

  • should run on a Linux version which is consistent across the cluster;

  • rename the VM as 'master-cw';

  • should have two user accounts, one of which should be named 'hadoop-cw'; and

  • any Hadoop communication and functionality on the master-cw node should be carried out only via the hadoop-cw user account.


Your cluster should have 2 nodes, including the modified master node and the existing worker node. You should be able to demonstrate that your newly configured Hadoop cluster is ready to execute a MapReduce job; a minimal word-count sketch for such a demonstration is given below.
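
One common way to demonstrate MapReduce readiness is a small Hadoop Streaming word count. The sketch below is illustrative only and assumes Python 3 is installed on both nodes; the file names mapper.py and reducer.py and the HDFS paths are placeholders, not part of the brief.

#!/usr/bin/env python3
# mapper.py -- emits "word<TAB>1" for every token read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- sums the counts emitted by mapper.py (input arrives sorted by key).
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, count
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The pair can then be submitted from the hadoop-cw account with the hadoop-streaming JAR that ships with your Hadoop installation (its exact path depends on the Hadoop version), for example: hadoop jar <path-to>/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /input -output /wordcount-out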


Please provide the following in your PDF file.


(a) A sequential script that can be executed to replicate your experiment. The script should consist of all the terminal commands from your Hadoop nodes (i.e. the Linux VMs), in the sequence of execution. Include any debugging/troubleshooting commands that you may need to execute as part of the same sequential script, in order of execution.


(b) A copy of the configuration tags for each of the Hadoop configuration files that you may have edited. Paste the configuration tags after the relevant text-editing Linux terminal command in your script;


for example, if you have opened the start-dfs.sh file using the vi editor and made some edits, you should copy and paste those edits after the terminal command that you used to open this shell script.


(c) The modifications made to the bash file(s). (Please don't copy and paste the predefined scripts and variables from the bash file.) Use the same strategy to present your work as suggested in (b).


(d) A copy of any system files that you may have edited on Linux VMs. Use the same strategy to present your work as suggested in (b).


(e) Screenshots of inputs and outputs on the terminal and in any text editors that you may use.


Experiment II


  1. Stream public tweets from Twitter which are authored in the United Kingdom and the United States over a period of two days. The tweets must be in the English language and contain at least one of the following hashtags/words: #CoronaOutbreak, #CoronavirusCoverup, #DeltaVariant, #UnitedKingdom and #UnitedStates. The Twitter stream should span exactly two days, i.e. 48 hours. You can either use and modify the tweet-streaming script provided in the lecture or create your own (a minimal streaming sketch is given below).

  • Note: Please mention the dates with start and end times in comments when you download the tweets.
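
The following is a minimal, illustrative streaming sketch assuming Tweepy 3.x and Twitter's v1.1 filtered-stream endpoint (which has since changed, so treat it as a template rather than a drop-in solution). The credentials, the output file raw_tweets.csv, and the bounding-box coordinates are placeholders/approximations.

# stream_tweets.py -- minimal Tweepy 3.x streaming sketch (illustrative only).
import csv
import tweepy

# Placeholder credentials -- replace with your own Twitter developer keys.
CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_SECRET = "..."

TRACK_TERMS = ["#CoronaOutbreak", "#CoronavirusCoverup", "#DeltaVariant",
               "#UnitedKingdom", "#UnitedStates"]
# Approximate bounding boxes (lon, lat) for the UK and the contiguous US.
LOCATIONS = [-10.85, 49.82, 1.77, 60.85,    # United Kingdom
             -125.0, 24.94, -66.93, 49.59]  # United States


class TweetListener(tweepy.StreamListener):
    def __init__(self, outfile):
        super().__init__()
        self.writer = csv.writer(outfile)
        self.writer.writerow(["created_at", "text", "user_location", "place"])

    def on_status(self, status):
        # The streaming API ORs the track and locations filters, so the
        # language check is repeated here before the tweet is saved.
        if status.lang != "en":
            return
        text = status.extended_tweet["full_text"] if hasattr(status, "extended_tweet") else status.text
        place = status.place.full_name if status.place else ""
        self.writer.writerow([status.created_at, text, status.user.location or "", place])

    def on_error(self, status_code):
        return False  # disconnect on errors such as rate limiting


if __name__ == "__main__":
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    with open("raw_tweets.csv", "w", newline="", encoding="utf-8") as f:
        stream = tweepy.Stream(auth=auth, listener=TweetListener(f))
        # Run (or schedule) this for the 48-hour window and record the start
        # and end date/time in comments, as required by the brief.
        stream.filter(track=TRACK_TERMS, locations=LOCATIONS, languages=["en"])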


  2. Save only the relevant fields from the streaming API tweet response in a CSV file so that you can perform the next task and Experiment III.


You are also required to pre-process the data at this stage, preferably using the Python programming language. The pre-processing should retain the appropriate attributes from the response fields as features/CSV columns and must include the creation of new features as follows (a preprocessing sketch is given after this list):

  • a feature called tweet_text, in which all English-language stop words, any punctuation marks and the hashtags/words mentioned in Task 1 have been removed from the tweet's text; and

  • the creation of five other features, each representing one of the hashtags/words from Task 1; the value of each feature should be the number of times that hashtag/word appears in the tweet.

  • You should save the final set of features in a CSV file.
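
A possible preprocessing sketch is shown below, using pandas and NLTK's English stop-word list (run nltk.download("stopwords") once beforehand). The input file raw_tweets.csv and the column name text mirror the streaming sketch above and should be adapted to your own output.

# preprocess_tweets.py -- illustrative preprocessing sketch.
import re
import string

import pandas as pd
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

HASHTAGS = ["#CoronaOutbreak", "#CoronavirusCoverup", "#DeltaVariant",
            "#UnitedKingdom", "#UnitedStates"]
STOP_WORDS = set(stopwords.words("english"))


def clean_text(text):
    """Remove the tracked hashtags/words, URLs, punctuation and stop words."""
    text = text.lower()
    for tag in HASHTAGS:
        text = text.replace(tag.lower(), " ")
    text = re.sub(r"http\S+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


df = pd.read_csv("raw_tweets.csv")
df["text"] = df["text"].fillna("")

# tweet_text: cleaned tweet body, as required by the brief.
df["tweet_text"] = df["text"].apply(clean_text)

# One frequency feature per tracked hashtag/word (column names drop the '#').
for tag in HASHTAGS:
    df[tag.strip("#")] = df["text"].apply(lambda t, tag=tag: t.lower().count(tag.lower()))

df.to_csv("processed_tweets.csv", index=False)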


  3. Create a database in DynamoDB using the CSV file that you have streamed in Task 1. Your database should be modelled in a way that comparisons between the COVID-19 Delta variant outbreak in the United States and the United Kingdom can be drawn using the results of simple NoSQL-style queries. Note that you are not required to run any queries to do analysis at this stage. You must demonstrate the following in your solution:

(a) importing data from the CSV file to DynamoDB (a minimal boto3 sketch is included below),

(b) a data model to reflect your database design, and

(c) any Python script and code used, as well as screenshots of actions performed in the AWS GUI to perform this task.
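
A minimal boto3 sketch of one possible design is shown below, assuming a table named covid_tweets with country as the partition key and created_at as the sort key, so that UK/US comparisons reduce to simple Query operations. The region, table name and column names are assumptions and should match your own data model.

# load_dynamodb.py -- illustrative CSV-to-DynamoDB import sketch.
import csv

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")  # your region

# Data model: partition key = country (UK/US), sort key = created_at.
table = dynamodb.create_table(
    TableName="covid_tweets",
    KeySchema=[
        {"AttributeName": "country", "KeyType": "HASH"},
        {"AttributeName": "created_at", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "country", "AttributeType": "S"},
        {"AttributeName": "created_at", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()

with open("processed_tweets.csv", newline="", encoding="utf-8") as f, \
        table.batch_writer() as batch:
    for row in csv.DictReader(f):
        # Every row must contain both key attributes; counter columns are
        # stored as numbers and everything else as strings.
        item = {k: (int(v) if v.isdigit() else v) for k, v in row.items() if v}
        batch.put_item(Item=item)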


Experiment III


You are required to run HiveQL queries on the tables that you have created in Experiment II to find the following (an illustrative query sketch is given after the list):

  1. the country which has authored the most tweets,

  2. the most frequent hashtag/word mentioned in Experiment II found in tweets from each country,

  3. the most frequent hashtag/word mentioned in Experiment II found in all tweets,

  4. total number of user mentions in tweets from each country respectively, and

  5. total number of user mentions in all tweets.
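
The sketch below illustrates how such queries might be issued from Python with PyHive, assuming the data is available to Hive as a table named tweets (for example as an external table over the DynamoDB data on EMR, or loaded from the processed CSV) with columns country, mention_count and one counter column per tracked hashtag/word; all of these names are assumptions to be adapted to your own schema.

# hive_queries.py -- illustrative PyHive sketch for Experiment III.
from pyhive import hive

HASHTAG_COLS = ["coronaoutbreak", "coronaviruscoverup", "deltavariant",
                "unitedkingdom", "unitedstates"]

conn = hive.Connection(host="localhost", port=10000, username="hadoop-cw")
cur = conn.cursor()

# 1. Country that authored the most tweets.
cur.execute("SELECT country, COUNT(*) AS n FROM tweets "
            "GROUP BY country ORDER BY n DESC LIMIT 1")
print("Most tweets:", cur.fetchone())

# 2 and 3. Hashtag/word totals per country; the per-country and overall
# maxima are then picked on the Python side.
sums = ", ".join(f"SUM({c}) AS {c}" for c in HASHTAG_COLS)
cur.execute(f"SELECT country, {sums} FROM tweets GROUP BY country")
totals = {c: 0 for c in HASHTAG_COLS}
for row in cur.fetchall():
    counts = {c: (v or 0) for c, v in zip(HASHTAG_COLS, row[1:])}
    print(row[0], "most frequent:", max(counts, key=counts.get))
    for c in HASHTAG_COLS:
        totals[c] += counts[c]
print("Most frequent overall:", max(totals, key=totals.get))

# 4. Total user mentions per country.
cur.execute("SELECT country, SUM(mention_count) FROM tweets GROUP BY country")
print("Mentions per country:", cur.fetchall())

# 5. Total user mentions across all tweets.
cur.execute("SELECT SUM(mention_count) FROM tweets")
print("Total mentions:", cur.fetchone())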


Solution Approach:


  • Twitter API Integration
    • Utilized the Tweepy library to interact with the Twitter API.
    • Authenticated API credentials to access Twitter data.

  • Data Extraction
    • Conducted Twitter searches based on specific hashtags and date ranges.
    • Filtered tweets based on language (English) and geographical location.

  • Data Preprocessing
    • Cleaned and processed tweet text by removing stop words, punctuation, and URLs.
    • Extracted features such as mention counts and hashtag frequencies.

  • Feature Creation
    • Created new features based on the hashtags mentioned in tweets (e.g., CoronaOutbreak, CoronavirusCoverup, etc.).
    • Transformed raw data into structured features for analysis.

  • Location Cleaning
    • Standardized and cleaned location data to categorize tweets into UK or US categories.

  • Data Storage
    • Saved processed data as CSV and JSON files for future analysis and visualization (a location-cleaning and export sketch is given below).
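
As a final illustration, the location-cleaning and storage steps could look like the sketch below. The keyword lists are simple heuristics rather than an exhaustive gazetteer, and the column name user_location and the file names follow the earlier sketches, so adapt them to your own data.

# clean_location.py -- illustrative location-cleaning and export sketch.
import re

import pandas as pd

UK_TERMS = ["united kingdom", "uk", "england", "scotland", "wales",
            "northern ireland", "london"]
US_TERMS = ["united states", "usa", "us", "america", "new york",
            "california", "texas"]


def to_country(location):
    """Map a free-text user location to 'UK', 'US' or 'Unknown'."""
    loc = (location or "").lower()
    if any(re.search(rf"\b{re.escape(t)}\b", loc) for t in UK_TERMS):
        return "UK"
    if any(re.search(rf"\b{re.escape(t)}\b", loc) for t in US_TERMS):
        return "US"
    return "Unknown"


df = pd.read_csv("processed_tweets.csv")
df["country"] = df["user_location"].fillna("").apply(to_country)

# Persist both CSV (for DynamoDB/Hive) and JSON (for ad-hoc inspection).
df.to_csv("processed_tweets.csv", index=False)
df.to_json("processed_tweets.json", orient="records", lines=True)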


Output:

In our project on "Data Engineering and Analysis with Hadoop, Twitter, and DynamoDB," we delve into the realms of big data processing, real-time data streaming, and NoSQL database management. This project embodies the fusion of advanced technologies, as we navigate through the intricacies of configuring Hadoop clusters, streaming and preprocessing tweets from Twitter, and performing database operations using DynamoDB. Through meticulous data processing and analysis, we aim to extract meaningful insights and drive informed decision-making.


Our exploration begins with the configuration of Hadoop clusters on Ubuntu VMs, ensuring seamless integration and scalability. We then leverage the Twitter API to stream real-time tweets, filtering and preprocessing the data to extract relevant information. With DynamoDB, we embark on database operations, modeling our data to enable efficient querying and analysis. Throughout our journey, we'll outline our solution approach, methodologies, and practical implementations, culminating in the presentation of key findings and outputs derived from our data analysis efforts.


If you require any assistance with the project discussed in this blog, or if you find yourself in need of similar support for other projects, please don't hesitate to reach out to us. Our team can be contacted at any time via email at contact@codersarts.com.
