

Social Media Data Analysis with Spark in Python


What is Data Analysis?


Let's look at the data analytics pipeline:


Step 1: Data Acquisition

Step 2: Data Preparation

Step 3: Data Representation

Step 4: Data Representation

Step 5: Data Analytics

Step 6: Results Interpretation


In this post we are going to do data analysis on Twitter data.


What kind of analytics can be done on Twitter?


  • Counting. Real-time counting analytics such as how many requests per day, how many sign-ups, how many times a certain word appears, etc.

  • Correlation. Near-real-time analytics such as desktop vs. mobile users, which devices fail at the same time, etc.

  • Monitoring. Monitoring the customer opinions about a brand.

  • Research. More in-depth analytics that run in batch mode on the historical data such as what features get re-tweeted, detecting sentiments, etc.

  • Network Analysis: ego network analysis, monitoring follower growth, community analysis

  • Sentiment Analysis: tracking events, hot topics and trends; monitoring customer opinions about a product

  • Other: tweet engagement (top popular tweets)
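
The "Counting" item above (how many times a certain word appears) can be sketched in plain Python before Spark comes into the picture. This is only a minimal illustration: the sample tweets and the `count_words` helper are made up for this post, and a real run would read tweets from the Twitter API or from a file.

```python
import re
from collections import Counter

# Hypothetical sample tweets, used only for illustration.
tweets = [
    "Spark makes big data simple",
    "Learning Spark with Python",
    "Python and Spark work well together",
]


def count_words(texts):
    """Return a Counter of lowercase word frequencies across all texts."""
    counts = Counter()
    for text in texts:
        # Keep letters, digits, hashtags, mentions and apostrophes as word characters.
        counts.update(re.findall(r"[a-z0-9#@']+", text.lower()))
    return counts


word_counts = count_words(tweets)

# Show the three most frequent words.
print(word_counts.most_common(3))
```

The same idea scales to Spark by replacing the loop with a distributed group-and-count over a DataFrame of tweets.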


What is Spark?


Apache Spark is a framework for developing distributed computing applications.

It started as a research project at the University of California, Berkeley.


Features of Apache Spark


  • Speed (in-memory computations)

  • Supports multiple languages (Java, Scala, Python, R)

  • Advanced analytics (SQL queries, streaming data, machine learning and graph algorithms)


System requirements

  • Java 8+

  • Scala 2.11.x

  • Python 2.7+

  • 8 GB RAM and an 8-16 core CPU



Who uses Spark and Why?

Data Scientist:

  • Analyze and model the data to obtain insights

  • Transform the data into a usable format

  • Statistics, machine learning, SQL

  • Advanced analytics

Engineers:

  • Develop a data processing system or applications

  • Monitor, inspect and tune the applications


What is Spark SQL?

Spark SQL allows you to:

  • load data in .csv, .json and .parquet file formats

  • run relational queries expressed in SQL or Scala

  • use SchemaRDD (an abstract table, renamed DataFrame in Spark 1.3)


DataFrame, Dataset

A Dataset is a strongly typed collection of domain-specific objects.

Each Dataset has an untyped view called a DataFrame (a Dataset of Row).


Two types of operations on Datasets:

  • transformations (e.g., map(), filter(), select(), aggregate(), etc.)

  • actions (e.g., count(), show(), etc.)


Getting the data: loading the data into Spark (SQLContext)

Understanding the data:

  • calculating basic statistics

  • making histograms

Cleaning the data:

  • filtering data

  • dealing with missing, incomplete data

Feature extraction:

  • Dealing with categorical data

Clustering (using MLlib)

Saving the data to:

  • files

  • MongoDB database

Visualizing the data (Zeppelin or d3.js)

