What is Hadoop?



Apache Hadoop is an open-source software framework that

  • Stores big data in a distributed manner

  • Processes big data in parallel

  • Runs on large clusters of commodity hardware

It is based on Google’s papers on the Google File System (2003) and MapReduce (2004)


Hadoop is:

  • Easily scalable to petabytes or more (Volume)

  • Offering parallel data processing (Velocity)

  • Storing all kinds of data (Variety)


Hadoop offers:

  • Redundant, Fault-tolerant data storage (HDFS)

  • Parallel computation framework (MapReduce)

  • Job coordination/scheduling (YARN)


Programmers no longer need to worry about:

  • Where is the file located?

  • How to handle failures & data loss?

  • How to divide computation?

  • How to program for scaling?



Hadoop Ecosystem


Core of Hadoop:

  • Hadoop Distributed File System (HDFS)

  • MapReduce

  • YARN (Yet Another Resource Negotiator) (from Hadoop v2.0)

Additional software packages:

  • Pig

  • Hive

  • Spark

  • HBase

  • and many more


The Master-Slave Architecture of Hadoop




Hadoop Distributed File System (HDFS)

HDFS is a file system that

  • follows a master-slave architecture

  • allows us to store data over multiple nodes (machines)

  • allows multiple users to access data

  • works just like the file system on your PC


HDFS supports

  • distributed storage

  • distributed computation

  • horizontal scalability


Vertical Scaling vs. Horizontal Scaling


HDFS Architecture




NameNode

The NameNode is the master node; it maintains and manages the blocks stored on the DataNodes (slave nodes).


Functions:

  • records the metadata of all files

  • FsImage: an image of the file system namespace since the NameNode was started; it records all changes

  • EditLogs: all recent modifications to the metadata (e.g., for the past hour)

  • records each change to the metadata

  • regularly checks the status of the DataNodes

  • keeps a record of all the blocks in HDFS

  • handles data recovery if a DataNode fails


DataNode

A DataNode is commodity hardware that stores the data

  • Slave node

Functions

  • stores the actual data

  • serves read and write requests from clients

  • reports its health to the NameNode via heartbeats


NameNode vs. DataNode


If the NameNode Fails…

All files on HDFS will be lost

  • there is no way to reconstruct the files from the blocks in the DataNodes without the metadata in the NameNode


To make the NameNode resilient to failure:

  • back up the metadata in the NameNode (e.g., with a remote NFS mount)

  • run a Secondary NameNode



Secondary NameNode

The Secondary NameNode takes checkpoints of the file system metadata present on the NameNode.

  • It is not a backup NameNode!

Functions:

  • stores a copy of the FsImage file and the EditLogs

  • periodically applies the EditLogs to the FsImage and refreshes the EditLogs

  • if the NameNode fails, the file system metadata can be recovered from the last saved FsImage on the Secondary NameNode
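The checkpoint mechanism can be illustrated with a toy model (a sketch only, not Hadoop's actual on-disk format): here the FsImage is modeled as a dict of file metadata and the EditLog as a list of pending operations that a checkpoint merges in and then clears.

```python
# Toy model of Secondary NameNode checkpointing (illustration only;
# real HDFS stores FsImage/EditLog as binary files on disk).

def checkpoint(fsimage, editlog):
    """Apply every logged edit to the FsImage, then clear the EditLog."""
    for op, path, meta in editlog:
        if op == "create":
            fsimage[path] = meta          # record metadata of a new file
        elif op == "delete":
            fsimage.pop(path, None)       # drop metadata of a deleted file
    editlog.clear()                       # EditLog is "refreshed" (emptied)
    return fsimage

fsimage = {"/data/a.txt": {"blocks": 2}}
editlog = [("create", "/data/b.txt", {"blocks": 8}),
           ("delete", "/data/a.txt", None)]
checkpoint(fsimage, editlog)
print(fsimage)   # {'/data/b.txt': {'blocks': 8}}
print(editlog)   # []
```

After a crash, replaying only the (short) EditLog on top of the last checkpointed FsImage is much faster than replaying every edit since the cluster started.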


NameNode vs. Secondary NameNode


Blocks

A block is a sequence of bytes that stores data

  • Data is stored as a set of blocks in HDFS

  • The default block size is 128 megabytes (Hadoop 2.x and 3.x); in Hadoop 1.x the default was 64 megabytes

  • A file is split into multiple blocks


Why Large Block Size?

  • HDFS stores huge datasets

  • If the block size were small (e.g., 4 KB as in Linux), the number of blocks would be large:

  • too much metadata for the NameNode

  • too many seeks slow down reads

  1. read time = seek time + transfer time

  2. transfer time = file size / transfer rate

  • small blocks harm the performance of MapReduce too

  • For similar reasons, HDFS is not recommended for storing many small files.
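The seek-versus-transfer trade-off above can be made concrete with a quick calculation (the 10 ms seek time and 100 MiB/s transfer rate are assumed, illustrative numbers, not HDFS constants):

```python
# Rough read-time comparison for a 1 GiB file under small vs. large blocks.
# Assumed numbers: 10 ms seek per block, 100 MiB/s sequential transfer rate.
import math

SEEK_S = 0.010          # assumed seek time per block (10 ms)
RATE   = 100 * 2**20    # assumed transfer rate: 100 MiB/s
FILE   = 2**30          # 1 GiB file

def read_time(block_size):
    blocks = math.ceil(FILE / block_size)
    # total time = one seek per block + time to stream all the bytes
    return blocks * SEEK_S + FILE / RATE

print(f"4 KiB blocks:   {read_time(4 * 2**10):8.1f} s")   # ~2632 s, dominated by seeks
print(f"128 MiB blocks: {read_time(128 * 2**20):8.1f} s") # ~10.3 s, dominated by transfer
```

With 4 KiB blocks the 1 GiB file needs 262,144 seeks; with 128 MiB blocks it needs only 8, so read time collapses to essentially the raw transfer time.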


If a DataNode Fails…

  • Commodity hardware fails frequently

  • If the NameNode hasn’t heard from a DataNode for 10 minutes, the DataNode is considered dead

  • HDFS guarantees data reliability by keeping multiple replicas of the data

  • each block has 3 replicas by default

  • replicas are stored on different DataNodes

  • if blocks are lost due to the failure of a DataNode, they can be recovered from the other replicas

  • the total consumed space is 3 times the data size

  • Replication also helps maintain data integrity (whether the data stored is correct or not)
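The detection-and-recovery logic can be sketched as a toy model (an illustration under assumed names, not HDFS source code): a node is declared dead after 10 minutes of silence, and any block whose live replica count falls below the replication factor must be re-replicated.

```python
# Toy sketch of NameNode failure detection and re-replication bookkeeping.

DEAD_AFTER = 600          # seconds without a heartbeat => node considered dead
REPLICATION = 3           # default HDFS replication factor

def dead_nodes(last_heartbeat, now):
    """Nodes whose last heartbeat is older than the timeout."""
    return {n for n, t in last_heartbeat.items() if now - t > DEAD_AFTER}

def under_replicated(block_locations, dead):
    """Blocks with fewer live replicas than the replication factor."""
    # block_locations: block id -> set of DataNodes holding a replica
    return {b for b, nodes in block_locations.items()
            if len(nodes - dead) < REPLICATION}

last = {"dn1": 1000.0, "dn2": 1000.0, "dn3": 350.0}   # dn3 silent for 650 s
blocks = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn1", "dn2", "dn4"}}
dead = dead_nodes(last, now=1000.0)
print(dead)                            # {'dn3'}
print(under_replicated(blocks, dead))  # {'blk_1'}
```

Here `blk_1` lost one of its three replicas with `dn3`, so the NameNode would schedule a new copy on a healthy DataNode; `blk_2` still has three live replicas and needs nothing.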


File, Block and Replica

  • A file contains one or more blocks

  • blocks are different from one another

  • the number of blocks depends on the file size and the block size

  • A block has multiple replicas

  • replicas of a block are identical

  • the number of replicas depends on the preset replication factor
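These relationships can be sketched numerically (assuming the 2.x/3.x defaults: 128 MiB blocks and a replication factor of 3):

```python
# How many blocks and replicas a file occupies, and the raw disk it consumes.
import math

BLOCK = 128 * 2**20   # assumed block size: 128 MiB (HDFS 2.x/3.x default)
RF = 3                # assumed replication factor (HDFS default)

def blocks_and_storage(file_bytes):
    blocks = math.ceil(file_bytes / BLOCK)   # last block may be partial
    replicas = blocks * RF
    # HDFS blocks are not padded: a partial last block consumes only its
    # actual size, so raw usage is simply file size times replication factor.
    raw_bytes = file_bytes * RF
    return blocks, replicas, raw_bytes

b, r, raw = blocks_and_storage(1 * 2**30)    # a 1 GiB file
print(b, r, raw / 2**30)   # 8 blocks, 24 replicas, 3.0 GiB on disk
```

So a 1 GiB file becomes 8 distinct blocks, each stored 3 times, consuming 3 GiB of cluster storage in total.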


Replication Management

  • Each block is replicated 3 times and stored on different DataNodes



Why default replication factor 3?

  • With only 1 replica:

  • if a DataNode fails, its blocks are lost

  • Assume

  • number of nodes N = 4,000

  • number of blocks R = 1,000,000

  • node failure rate FPD = 1 per day (you expect one machine to fail per day)

  • If one node fails, then R/N = 250 blocks are lost

  • E[# of blocks lost in one day] = 250

  • If the number of blocks lost in a day follows a Poisson distribution, then

  • Pr[# of blocks lost in one day >= 250] = 0.508
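The quoted probability can be checked numerically. For a Poisson random variable with mean 250, summing the exact probability mass function term by term (built iteratively to avoid overflow in 250**k / k!) recovers the tail probability:

```python
# P[X >= 250] for X ~ Poisson(mean=250), computed from the exact pmf.
import math

def poisson_sf(k_min, lam):
    """Return P[X >= k_min] for X ~ Poisson(lam)."""
    term = math.exp(-lam)     # P[X = 0]
    cdf = term
    for k in range(1, k_min):
        term *= lam / k       # P[X = k] from P[X = k-1], avoids big factorials
        cdf += term           # accumulate P[X <= k_min - 1]
    return 1.0 - cdf          # tail probability P[X >= k_min]

p = poisson_sf(250, 250)
print(f"{p:.3f}")   # 0.508
```

So with a single replica, on more than half of all days you would expect to lose at least 250 blocks; replication factor 3 is what makes such a loss event vanishingly unlikely.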

