
L3 MapReduce, Spark Project Help

Updated: Mar 25, 2021



Table of contents:

Word count

  • Output the number of occurrences of each word in the dataset.


Naïve solution:


def word_count(D):
    H = {}                          # word -> count
    for w in D:
        H[w] = H.get(w, 0) + 1      # default to 0 for words not seen yet
    for w in H:
        print(w, H[w])
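
For example, calling the function on a small, made-up list of words (a hypothetical input, just for illustration):

words = ["bear", "deer", "bear", "river"]
word_count(words)
# bear 2
# deer 1
# river 1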

How to speed up?

Make use of multiple workers


There are some problems…

Data reliability

  • Equal split of data

  • Delay of worker

  • Failure of worker

  • Aggregation of the results

In the traditional approach to parallel and distributed processing, we need to handle all of these ourselves!


MapReduce

MapReduce is a programming framework that:

  • allows us to perform distributed and parallel processing on large datasets in a distributed environment

  • removes the need to worry about issues such as reliability and fault tolerance

  • offers the flexibility to write the code logic without worrying about the design issues of the underlying system

  • consists of two tasks: Map and Reduce


  • Map

    • Reads a block of data

    • Produces key-value pairs as intermediate outputs

  • Reduce

    • Receives key-value pairs from multiple map jobs

    • Aggregates the intermediate data tuples into the final output


A Simple MapReduce Example


Pseudocode of Word Count

def Map(D):
    for w in D:
        emit(w, 1)                 # one (word, 1) pair per word


def Reduce(t, counts):             # e.g., t = "bear", counts = [1, 1]
    total = 0
    for c in counts:
        total = total + c
    emit(t, total)                 # e.g., ("bear", 2)
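
To make the flow concrete, here is a minimal single-machine sketch of the same idea in plain Python (no Hadoop or cluster involved; the framework's emit and shuffle steps are simulated with ordinary lists and a dictionary):

from collections import defaultdict

def map_phase(words):
    # Map: one (word, 1) pair per input word
    return [(w, 1) for w in words]

def shuffle(pairs):
    # what the framework does between Map and Reduce: group values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, counts):
    # Reduce: sum the counts for one key
    return (key, sum(counts))

pairs = map_phase(["bear", "river", "bear"])
for key, counts in shuffle(pairs).items():
    print(reduce_phase(key, counts))   # ('bear', 2) then ('river', 1)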

Advantages of MapReduce

Parallel processing

  • Jobs are divided across multiple nodes

  • Nodes work simultaneously

  • Processing time is reduced

Data locality

  • Moving the processing to the data

    • The opposite of the traditional approach, which moves the data to the processing


Motivation of Spark

  • MapReduce greatly simplified big data analysis on large, unreliable clusters. It is great at one pass computation.


But as soon as it got popular, users wanted more:

  • more complex, multi-pass analytics (e.g. ML, graph)

  • more interactive ad hoc queries

  • more real-time stream processing


Limitations of MapReduce

As a general programming model:

  • more suitable for one-pass computation on a large dataset

  • hard to compose and nest multiple operations

  • no means of expressing iterative operations

As implemented in Hadoop:

  • all datasets are read from disk, then stored back onto disk

  • all data is (usually) triple replicated for reliability


Data Sharing in Hadoop MapReduce


  • Slow due to replication, serialization, and disk IO

  • Complex apps, streaming, and interactive queries all need one thing that MapReduce lacks:

    • Efficient primitives for data sharing


What is Spark?


  • Apache Spark is an open-source cluster computing framework for real-time processing.

  • Spark provides an interface for programming entire clusters with

    • implicit data parallelism

    • fault tolerance

  • Builds on the ideas of Hadoop MapReduce

    • extends the MapReduce model to efficiently support more types of computations


Spark Features

  • Polyglot

  • Speed

  • Multiple formats

  • Lazy evaluation

  • Real-time computation

  • Hadoop integration

  • Machine learning


Spark Ecosystem:


Spark Architecture

  • Master Node

    • takes care of the job execution within the cluster

  • Cluster Manager

    • allocates resources across applications

  • Worker Node

    • executes the tasks
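
A minimal PySpark sketch of how an application attaches to these components (the application name is hypothetical, and "local[*]" stands in for a real cluster manager URL such as a YARN or standalone master):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")        # driver and executors in one local process; use a cluster manager URL in production
         .appName("WordCountDemo")  # hypothetical application name
         .getOrCreate())
sc = spark.sparkContext             # entry point for the RDD API used in the sketches below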


Spark Resilient Distributed Dataset (RDD)

  • RDD is where the data stays

  • RDD is the fundamental data structure of Apache Spark

  • is a collection of elements

    • Dataset

  • can be operated on in parallel

    • Distributed

  • fault-tolerant

    • Resilient


Features of Spark RDD

  • In-memory computation

  • Partitioning

  • Fault tolerance

  • Immutability

  • Persistence

  • Coarse-grained operations

  • Location stickiness
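
As a small illustration of immutability and persistence, here is a sketch that reuses the sc SparkContext from the architecture example above (the sample data is made up):

logs = sc.parallelize(["info: ok", "error: disk", "error: net"])
errors = logs.filter(lambda line: "error" in line)   # filter returns a new RDD; logs itself is never modified
errors.cache()                                        # mark the RDD for in-memory persistence
errors.count()                                        # first action computes and caches the partitions
errors.take(2)                                        # served from the cache, no recomputation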


Create RDDs

Parallelizing an existing collection in your driver program

  • Normally, Spark tries to set the number of partitions automatically based on your cluster

Referencing a dataset in an external storage system

  • HDFS, HBase, or any data source offering a Hadoop InputFormat

  • By default, Spark creates one partition for each block of the file
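
A minimal PySpark sketch of both creation paths, again assuming the sc SparkContext from earlier (the HDFS path is a placeholder):

# 1. parallelize an existing collection in the driver program
nums = sc.parallelize([1, 2, 3, 4, 5], numSlices=4)   # the partition count can also be set explicitly

# 2. reference a dataset in an external storage system
lines = sc.textFile("hdfs:///data/input.txt")         # by default, one partition per file block

print(nums.getNumPartitions(), lines.getNumPartitions())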


RDD Operations


Transformations

  • functions that take an RDD as the input and produce one or many RDDs as the output

  • Narrow Transformation

  • Wide Transformation

Actions

  • RDD operations that produce non-RDD values

  • Return the final result of the RDD computations to the driver program

Narrow and Wide Transformations

A narrow transformation involves no data shuffling

  • map

  • flatMap

  • filter

  • sample

A wide transformation involves data shuffling

  • sortByKey

  • reduceByKey

  • groupByKey

  • join
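
For example, in a PySpark word count over the lines RDD created above, the first two transformations are narrow and the last one is wide:

words  = lines.flatMap(lambda l: l.split())     # flatMap: narrow, no shuffle
pairs  = words.map(lambda w: (w, 1))            # map: narrow, no shuffle
counts = pairs.reduceByKey(lambda a, b: a + b)  # reduceByKey: wide, shuffles values by key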


Action

Actions are operations applied to an RDD that instruct Apache Spark to perform the computation and pass the result back to the driver

  • collect

  • take

  • reduce

  • foreach

  • sample

  • count

  • save
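
A few of these in PySpark, applied to the counts RDD from the transformation sketch above (the output path is a placeholder):

counts.count()        # number of distinct words
counts.take(3)        # first 3 (word, count) pairs returned to the driver
counts.collect()      # the entire result returned to the driver (only safe for small results)
counts.saveAsTextFile("hdfs:///data/word_counts")   # write the result out to the placeholder path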


Lineage

RDD lineage is the graph of all the ancestor RDDs of an RDD

  • Also called RDD operator graph or RDD dependency graph

Nodes: RDDs

Edges: dependencies between RDDs
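
In PySpark, the lineage recorded for an RDD can be inspected directly; for the counts RDD from the earlier sketch:

print(counts.toDebugString().decode())   # prints the chain of parent RDDs and their dependencies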


Fault tolerance of RDD

All the RDDs generated from fault-tolerant data are fault-tolerant.

If a worker node fails and any partition of an RDD is lost:

  • the partition can be recomputed from the original fault-tolerant dataset using the lineage

  • the task will be assigned to another worker


DAG in Spark

  • A DAG is a directed graph with no cycles

    • Nodes: RDDs, results

    • Edges: operations to be applied to the RDDs

  • When an action is called, the created DAG is submitted to the DAG Scheduler, which further splits the graph into stages of tasks

  • The DAG of operations enables better global optimization than systems like MapReduce

DAG, Stages, and Tasks

  • The DAG Scheduler splits the graph into multiple stages

  • Stages are created based on the transformations

    • Narrow transformations are grouped together into a single stage

    • A wide transformation defines the boundary between two stages

  • The DAG scheduler then submits the stages to the task scheduler

    • The number of tasks depends on the number of partitions

    • Stages that are not interdependent may be submitted to the cluster for execution in parallel
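
Putting this together for the word-count job sketched earlier: the narrow flatMap and map are pipelined into one stage, the shuffle required by reduceByKey starts a second stage, and nothing runs until the action is called:

result = (lines.flatMap(lambda l: l.split())     # stage 1 (narrow)
               .map(lambda w: (w, 1))            # stage 1 (narrow, pipelined with flatMap)
               .reduceByKey(lambda a, b: a + b)  # stage 2 begins at the shuffle boundary
               .collect())                       # action: triggers the DAG scheduler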

Lineage vs. DAG in Spark

  • Both are DAGs (as data structures)

  • Different end nodes

  • Different roles in Spark



Contact us to get help from our Big Data experts at:

contact@codersarts.com

