Distributed File Systems
Typically, MapReduce I/O operations are performed on distributed file systems.
- a distributed FS is a file system that manages data across several machines, so it's network based
- One such file system is HDFS - the Hadoop Distributed File System
Hadoop DFS
Hadoop MapReduce (and other engines) rely on an underlying storage layer for reading and writing
- This storage is typically HDFS
- Large files are typically split into chunks, which are distributed across and stored on data nodes.
- Each chunk is replicated (typically stored on 3 servers)
File Storage
Blocks
- on disks, the block size is the minimal amount of data that a disk can read or write
- same for a DFS: files are broken into block-sized chunks, which are stored as independent units
- by default a block is ~128 MB
- as a result, a file can be larger than any single disk in the cluster
- blocks also make replication simpler (see the sketch after this list for inspecting a file's block size and replication)
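A minimal sketch with the Hadoop Java client, assuming a reachable cluster; the namenode address hdfs://namenode:9000 and the file path are placeholders. It reads back the block size and replication factor that were applied when the file was created:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        // "hdfs://namenode:9000" and the file path below are placeholders
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"),
                                       new Configuration());

        FileStatus status = fs.getFileStatus(new Path("/data/large-input.txt"));

        // Block size and replication factor are per-file properties,
        // taken from dfs.blocksize / dfs.replication when the file was created
        System.out.println("block size:  " + status.getBlockSize() + " bytes");
        System.out.println("replication: " + status.getReplication());
    }
}
```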
Namenodes & Datanodes
- HDFS uses the master-worker pattern
- the master is called the namenode
- the workers are called the datanodes
Namenode
- the namenode is the node that orchestrates how data is distributed and knows where everything is stored
- the namenode manages the filesystem namespace: it maintains the file tree and the metadata for all files and directories
- given a file, it knows on which datanodes its blocks are located (see the sketch after this list)
- the namenode knows how to reconstruct a file from its blocks
- it's also a single point of failure - if it fails, this information is lost
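As a concrete illustration, the standard Hadoop Java client exposes this metadata through getFileBlockLocations: the namenode answers with, for every block of the file, the datanodes that hold a replica. The namenode address and file path below are placeholders:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereAreTheBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"),
                                       new Configuration());

        FileStatus status = fs.getFileStatus(new Path("/data/large-input.txt"));

        // This metadata query is answered by the namenode: for each block of
        // the file it returns the datanodes that hold a replica
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
    }
}
```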
Datanodes
- datanodes store blocks and retrieve them when asked by clients
- they periodically report the list of blocks they store to the namenode
Maintaining Consistency
How to maintain consistency across all these replicas?
Reading
When a client needs to read data, it first needs to know where that data is located:
- a "read" request is issued with an offset (the position in the file to start reading from) and the number of bytes to read
- the namenode knows where every chunk of data is kept, so the client reads the metadata from it
- after getting the metadata, the client reads the data directly from the datanodes (so there's no centralized bottleneck - all reads happen in parallel)
If the client fails to read a chunk of data, it asks the namenode where another replica is and tries again (see the sketch below).
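A minimal read sketch with the Hadoop Java client (placeholder namenode address and file path). seek() positions the stream at a byte offset; the client library looks up block locations via the namenode and, if a read fails, transparently retries against another replica:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OffsetRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"),
                                       new Configuration());

        byte[] buf = new byte[4096];
        try (FSDataInputStream in = fs.open(new Path("/data/large-input.txt"))) {
            // seek() positions the stream at a byte offset; the client library
            // asks the namenode which datanodes hold the block covering this
            // offset and then streams the bytes directly from a datanode
            in.seek(1024L * 1024L);              // start 1 MB into the file
            int n = in.read(buf, 0, buf.length); // read up to 4 KB from there
            System.out.println("read " + n + " bytes");
        }
    }
}
```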
Writing
We need to make sure that all the replicas contain the same data (i.e. they are consistent)
- One replica is considered "main", and the master knows which one.
- Client sends the data to be written to all replicas
- it's written to the main one and propagated to the rest
- So it supports parallel reads and writes from a large number of clients
- Reads can be arbitrary and random access, but writes work best when data is added to the end of the file (i.e. appended) - see the sketch after this list
- Because the main replica decides the order in which concurrent append requests are processed, the data stays consistent
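A minimal append sketch with the Hadoop Java client, assuming append is enabled on the cluster; the namenode address and file path are placeholders:

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"),
                                       new Configuration());

        // append() is the write pattern HDFS is optimized for: the new bytes
        // go through the replica pipeline and are added at the end of the file
        try (FSDataOutputStream out = fs.append(new Path("/data/events.log"))) {
            out.write("new event record\n".getBytes(StandardCharsets.UTF_8));
            out.hflush(); // make the appended data visible to new readers
        }
    }
}
```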
Failure Handling
- Namenode is a single point of failure
- to prevent losing data, we can have a secondary namenode
- it's not really a "namenode": it only keeps a copy of the namespace image, along with the edit logs
- but these logs might not be up-to-date, so you may lose some data if the namenode fails
HDFS Federation
Can federate several namenodes:
- if there are too many files in HDFS, it's hard for a single namenode to manage all of them
- can add another namenode, so each namenode will manage only a portion of the namespace
Pros and Cons
Cons
Not good for:
- low-latency reads and writes (it's not a Database!)
- lots of small files
HDFS has:
- no indexing
- no random writes - files can only be appended to
- no SQL
- if you need database capabilities on top of HDFS, use HBase
Sources