== Distributed File Systems ==

Typically, [[MapReduce]] I/O operations are performed on distributed file systems.

* a distributed FS is a file system that manages data across several machines, so it is network-based
* One such file system is HDFS - the [[Hadoop]] Distributed File System

=== Hadoop DFS ===

[[Hadoop MapReduce]] (and other engines) runs on top of an underlying storage layer for reading and writing:
* This storage is typically HDFS
* Large files are distributed in chunks, which are stored on data nodes
* Each chunk is replicated (typically stored on 3 servers) - see the example below
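
A minimal sketch of how a client could inspect the chunking and replication of a stored file, using the Hadoop Java <code>FileSystem</code> API. The path and the namenode URI are made-up examples; this assumes a reachable HDFS cluster and the Hadoop client libraries on the classpath:

<pre>
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InspectFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hypothetical namenode address - replace with your cluster's
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/data/logs/big-file.txt");  // made-up path
        FileStatus status = fs.getFileStatus(file);

        // block size and replication factor are per-file metadata kept by the namenode
        System.out.println("size:        " + status.getLen() + " bytes");
        System.out.println("block size:  " + status.getBlockSize() + " bytes");
        System.out.println("replication: " + status.getReplication() + " copies of each block");
    }
}
</pre>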

== File Storage ==

=== Blocks ===
* On [[Secondary Storage|disks]], the disk block size is the minimal amount of data that a disk can read
* the same holds for a DFS: files are broken into block-sized chunks, which are stored as independent units
* by default a block is ~128 MB (see the sketch below for choosing a different block size)
* as a result, a file can be larger than any single disk on the cluster
* blocks are also a convenient unit of replication
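
The block size (and the replication factor) can be chosen per file when the file is created. A small sketch of this, again with the Java <code>FileSystem</code> API; the 256 MB block size, the replication factor of 3 and the path are illustrative values, not recommendations:

<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // uses fs.defaultFS from the loaded config

        Path file = new Path("/data/events/part-0000");   // made-up path
        long blockSize = 256L * 1024 * 1024;               // 256 MB instead of the ~128 MB default
        short replication = 3;                             // number of copies of each block
        int bufferSize = 4096;                             // client-side write buffer

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeBytes("hello, blocks\n");
        }
    }
}
</pre>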

=== Namenodes & Datanodes ===
* HDFS uses the master-workers pattern
* the master is called the Name node
* the workers are called the Data nodes

Namenode
* ''Name Node'' - the node that orchestrates the process of distributing the data and knows where everything is stored
* the Namenode manages the filesystem namespace: it maintains the file tree and the metadata for all files and directories
* given a file, it knows where its blocks are located - on which datanodes
* the namenode knows how to reconstruct a file from its blocks
* it's also a single point of failure - if it fails, this information is lost


Datanodes
* datanodes store blocks, and retrieve them when asked by clients (see the example below)
* they periodically report the list of stored blocks back to the namenode
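
To see the namenode/datanode split in action, a client can ask for the block locations of a file: the namenode answers with the datanodes that hold each block, and the actual bytes are then fetched from those datanodes. A sketch with the Java client API; the path is a made-up example:

<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/logs/big-file.txt");   // made-up path

        FileStatus status = fs.getFileStatus(file);
        // for every block of the file, the namenode returns the datanodes storing a replica
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
</pre>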
  
  
 
== Maintaining Consistency ==

How to maintain [[Consistency (databases)|consistency]] across all these replicas?


=== Reading ===

When a client needs to read data, it needs to know where this piece of data is:
* a "read" command is issued with an offset - how many bytes the client wants to read
* The '''name node''' knows where every chunk of data is kept, so the client reads the metadata from it.
* After getting the metadata, the client reads the data from the '''data nodes''' (so there's no centralized bottleneck - all reads happen in parallel)
  
 
In case the client fails to read a chunk of data, it asks the '''name node''' where the next replica is - and tries again
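
A minimal read sketch matching the steps above, using the Java client: the library fetches the block metadata from the namenode, reads the bytes from a datanode, and retries against another replica on a read error, much like the retry behaviour described here. The path and offset are made up, and the file is assumed to be large enough:

<pre>
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadAtOffset {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/logs/big-file.txt");   // made-up path

        byte[] buffer = new byte[1024];
        try (FSDataInputStream in = fs.open(file)) {
            // "read" with an offset: position in the file + how many bytes we want
            // (throws EOFException if the file is shorter than offset + buffer length)
            in.readFully(64L * 1024 * 1024, buffer);        // read 1 KB starting at the 64 MB mark
        }
        System.out.println(new String(buffer, StandardCharsets.UTF_8));
    }
}
</pre>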
 
=== Writing ===

We need to make sure that all the replicas contain the same data (i.e. they are consistent):
* One replica is considered the "main" one, and the master knows which one it is.
* The client sends the data to be written to all replicas
* it's written to the main one and propagated to the rest


* So HDFS supports parallel reads and writes from a large number of processors
* The reads are arbitrary and random access, but the writes work best when they are added to the end of the file (i.e. appended) - see the sketch below
* Because the architecture relies on the main replica for deciding the order in which multiple append requests are processed, the data is always consistent
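
A corresponding write sketch with the Java client. The client writes through a single output stream and the copies on the other replicas are kept in sync behind that stream; the path is a made-up example, and appends must be supported on the cluster:

<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteAndAppend {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/logs/events.log");      // made-up path

        // initial write: creates the file (and its replicas)
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("first record\n");
        }

        // later writes are appended to the end of the file - the preferred write pattern
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("second record\n");
            out.hflush();   // make the newly appended data visible to readers
        }
    }
}
</pre>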
  
 
https://raw.github.com/alexeygrigorev/ulb-adb-project-couchbd/master/report/images/DFS.png


=== Failure Handling ===
* the Namenode is a single point of failure
* to prevent losing data, we can run a secondary namenode
* it's not really a "namenode": it only keeps a copy of the namespace image, along with some logs
* but the logs might not be up to date, so you may still lose some data if the namenode fails


== HDFS Federation ==

We can federate several namenodes:
* if there are too many files in HDFS, it's hard for a single namenode to manage all of them
* we can add another namenode, so that each namenode manages only a portion of the namespace


== Pros and Cons ==

=== Cons ===
Not good for:
* low-latency reads and writes (it's not a Database!)
* lots of small files


=== HDFS is not a [[Database]]! ===

HDFS has:
* no indexing
* no random access to files
* no SQL
* if you need DB capabilities on top of HDFS, use [[HBase]]
  
  
 
== Sources ==
* [[Hadoop: The Definitive Guide (book)]]
* [[Web Intelligence and Big Data (coursera)]]


[[Category:Distributed Systems]]
[[Category:Hadoop]]