Codersarts

January 6, 2020

Social Media Data Analysis with Spark In Python

What is Data Analysis?

Let's look at the data analytics pipeline:

Step 1: Data Acquisition

Step 2: Data Preparation

Step 3: Data Representation

Step 4: Data Representation

Step 5: Data Analytics

Step 6: Results Interpretation

In this post, we are going to perform data analysis on Twitter data.

What kind of analytics can be done on Twitter data?

Counting. Real-time counting analytics such as how many requests per day, how many sign-ups, how many times a certain word appears, etc.
Correlation. Near-real-time analytics such as desktop vs. mobile users, which devices fail at the same time, etc.
Monitoring. Monitoring the customer opinions about a brand.
Research. More in-depth analytics that run in batch mode on the historical data such as what features get re-tweeted, detecting sentiments, etc.
Network analysis. Ego network analysis, monitoring follower growth, community analysis.
Sentiment analysis. Tracking events, hot topics, and trends; monitoring customer opinions about a product.
Other. Tweet engagement (most popular tweets).
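
As a rough illustration of the counting and engagement items in the list above, here is a minimal plain-Python sketch. The sample tweets, the field names (`text`, `retweets`), and the helper names `word_counts` and `top_tweets` are all made up for this example; a real pipeline would read tweets from an API or a data dump and would likely run the same logic in Spark.

```python
from collections import Counter

# Hypothetical sample tweets; the field names are assumptions for this sketch.
tweets = [
    {"text": "Spark makes big data simple", "retweets": 12},
    {"text": "Learning Spark with Python", "retweets": 30},
    {"text": "Python is great for data analysis", "retweets": 7},
]


def word_counts(records):
    """Count how often each lower-cased word appears across all tweet texts."""
    counts = Counter()
    for record in records:
        counts.update(record["text"].lower().split())
    return counts


def top_tweets(records, n=1):
    """Return the n tweets with the most retweets (a simple engagement measure)."""
    return sorted(records, key=lambda r: r["retweets"], reverse=True)[:n]


counts = word_counts(tweets)
print(counts["spark"])                 # occurrences of the word "spark"
print(top_tweets(tweets)[0]["text"])   # text of the most retweeted tweet
```

Splitting on whitespace is the simplest possible tokenizer; punctuation, hashtags, and mentions would need extra handling on real tweets.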

What is Spark?

Apache Spark is a framework for developing distributed computing applications. It began as a research project at the University of California, Berkeley.

Features of Apache Spark

Speed (in-memory computations)
Supports multiple languages (Java, Scala, Python, R)
Advanced analytics (SQL queries, streaming data, machine learning, and graph algorithms)

System requirements

Java 8+
Scala 2.11.x
Python 2.7+
8 GB RAM and an 8-16 core CPU

Who uses Spark and Why?

Data scientists:

Analyze and model data to gain insights
Transform data into a usable format
Apply statistics, machine learning, and SQL
Run advanced analytics

Engineers:

Develop a data processing system or applications
Monitor, inspect and tune the applications

What is Spark SQL?

Spark SQL allows you to:

load data from .csv, .json, and .parquet files
run relational queries expressed in SQL or Scala
work with a SchemaRDD, an abstract table (renamed DataFrame as of Spark 1.3)

DataFrame, Dataset

A Dataset is a strongly typed collection of domain-specific objects.

Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.

Two types of operations on Datasets:

transformations (e.g., map(), filter(), select(), aggregate(), etc.)
actions (e.g., count(), show(), etc.)


Getting the data: load the data into Spark (SQLContext)

Understanding the data:

calculating basic statistics
making histograms

Cleaning the data:

filtering data
dealing with missing, incomplete data

Feature extraction:

Dealing with categorical data

Clustering (using MLlib)

Saving the data to:

files
MongoDB database

Visualizing the data (Zeppelin or d3.js)
