== Distributed File Systems ==

Typically, [[MapReduce]] I/O operations are performed on distributed file systems.

* a distributed FS is a file system that manages data across several machines, so it is network-based
* One such file system is HDFS - the [[Hadoop]] Distributed File System

=== Hadoop DFS ===

[[Hadoop MapReduce]] (and other engines) runs on top of an underlying storage layer for reading and writing:
* This storage is typically HDFS
* Large files are distributed in chunks, which are stored on data nodes
* Each chunk is replicated (typically stored on 3 servers) - see the example below
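
A minimal sketch of how a client could inspect the chunking and replication of a stored file, using the Hadoop Java <code>FileSystem</code> API. The path and the namenode URI are made-up examples; this assumes a reachable HDFS cluster and the Hadoop client libraries on the classpath:

<pre>
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InspectFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hypothetical namenode address - replace with your cluster's
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/data/logs/big-file.txt");  // made-up path
        FileStatus status = fs.getFileStatus(file);

        // block size and replication factor are per-file metadata kept by the namenode
        System.out.println("size:        " + status.getLen() + " bytes");
        System.out.println("block size:  " + status.getBlockSize() + " bytes");
        System.out.println("replication: " + status.getReplication() + " copies of each block");
    }
}
</pre>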

== File Storage ==

=== Blocks ===
* On [[Secondary Storage|disks]], the disk block size is the minimal amount of data that a disk can read
* the same holds for a DFS: files are broken into block-sized chunks, which are stored as independent units
* by default a block is ~128 MB (see the sketch below for choosing a different block size)
* as a result, a file can be larger than any single disk on the cluster
* blocks are also a convenient unit of replication
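
The block size (and the replication factor) can be chosen per file when the file is created. A small sketch of this, again with the Java <code>FileSystem</code> API; the 256 MB block size, the replication factor of 3 and the path are illustrative values, not recommendations:

<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // uses fs.defaultFS from the loaded config

        Path file = new Path("/data/events/part-0000");   // made-up path
        long blockSize = 256L * 1024 * 1024;               // 256 MB instead of the ~128 MB default
        short replication = 3;                             // number of copies of each block
        int bufferSize = 4096;                             // client-side write buffer

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeBytes("hello, blocks\n");
        }
    }
}
</pre>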

=== Namenodes & Datanodes ===
* HDFS uses the master-workers pattern
* the master is called the Name node
* the workers are called the Data nodes

Namenode
* ''Name Node'' - the node that orchestrates the process of distributing the data and knows where everything is stored
* the Namenode manages the filesystem namespace: it maintains the file tree and the metadata for all files and directories
* given a file, it knows where its blocks are located - on which datanodes
* the namenode knows how to reconstruct a file from its blocks
* it's also a single point of failure - if it fails, this information is lost


Datanodes
* datanodes store blocks, and retrieve them when asked by clients (see the example below)
* they periodically report the list of stored blocks back to the namenode
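
To see the namenode/datanode split in action, a client can ask for the block locations of a file: the namenode answers with the datanodes that hold each block, and the actual bytes are then fetched from those datanodes. A sketch with the Java client API; the path is a made-up example:

<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/logs/big-file.txt");   // made-up path

        FileStatus status = fs.getFileStatus(file);
        // for every block of the file, the namenode returns the datanodes storing a replica
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
</pre>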
  
  
 
== Maintaining Consistency ==

How to maintain [[Consistency (databases)|consistency]] across all these replicas?


=== Reading ===

When a client needs to read data, it needs to know where this piece of data is:
* a "read" command is issued with an offset - how many bytes the client wants to read
* The '''name node''' knows where every chunk of data is kept, so the client reads the metadata from it.
* After getting the metadata, the client reads the data from the '''data nodes''' (so there's no centralized bottleneck - all reads happen in parallel)
  
 
In case the client fails to read a chunk of data, it asks the '''name node''' where the next replica is - and tries again
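
A minimal read sketch matching the steps above, using the Java client: the library fetches the block metadata from the namenode, reads the bytes from a datanode, and retries against another replica on a read error, much like the retry behaviour described here. The path and offset are made up, and the file is assumed to be large enough:

<pre>
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadAtOffset {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/logs/big-file.txt");   // made-up path

        byte[] buffer = new byte[1024];
        try (FSDataInputStream in = fs.open(file)) {
            // "read" with an offset: position in the file + how many bytes we want
            // (throws EOFException if the file is shorter than offset + buffer length)
            in.readFully(64L * 1024 * 1024, buffer);        // read 1 KB starting at the 64 MB mark
        }
        System.out.println(new String(buffer, StandardCharsets.UTF_8));
    }
}
</pre>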
 
=== Writing ===

We need to make sure that all the replicas contain the same data (i.e. they are consistent):
* One replica is considered the "main" one, and the master knows which one it is.
* The client sends the data to be written to all replicas
* it's written to the main one and propagated to the rest


* So HDFS supports parallel reads and writes from a large number of processors
* The reads are arbitrary and random access, but the writes work best when they are added to the end of the file (i.e. appended) - see the sketch below
* Because the architecture relies on the main replica for deciding the order in which multiple append requests are processed, the data is always consistent
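
A corresponding write sketch with the Java client. The client writes through a single output stream and the copies on the other replicas are kept in sync behind that stream; the path is a made-up example, and appends must be supported on the cluster:

<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteAndAppend {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/logs/events.log");      // made-up path

        // initial write: creates the file (and its replicas)
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("first record\n");
        }

        // later writes are appended to the end of the file - the preferred write pattern
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("second record\n");
            out.hflush();   // make the newly appended data visible to readers
        }
    }
}
</pre>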
  
 
https://raw.github.com/alexeygrigorev/ulb-adb-project-couchbd/master/report/images/DFS.png


=== Failure Handling ===
* the Namenode is a single point of failure
* to prevent losing data, we can run a secondary namenode
* it's not really a "namenode": it only keeps a copy of the namespace image, along with some logs
* but the logs might not be up to date, so you may still lose some data if the namenode fails


== HDFS Federation ==

We can federate several namenodes:
* if there are too many files in HDFS, it's hard for a single namenode to manage all of them
* we can add another namenode, so that each namenode manages only a portion of the namespace


== Pros and Cons ==

=== Cons ===
Not good for:
* low-latency reads and writes (it's not a Database!)
* lots of small files


=== HDFS is not a [[Database]]! ===

HDFS has:
* no indexing
* no random access to files
* no SQL
* if you need DB capabilities on top of HDFS, use [[HBase]]
  
  
 
== Sources ==
* [[Hadoop: The Definitive Guide (book)]]
* [[Web Intelligence and Big Data (coursera)]]


[[Category:Distributed Systems]]
[[Category:Hadoop]]