Distributed File Systems

Typically MapReduce I/O operations are performed on distributed file systems.

  • a distributed FS is a file system that manages data across several machines, so it is network-based
  • One such file system is HDFS - the Hadoop Distributed File System


Hadoop DFS

Hadoop MapReduce (and other engines) run on top of an underlying storage layer that they read from and write to

  • This storage is typically HDFS
  • Large files are typically distributed in chunks, and they are stored in data nodes.
  • Each chunk is replicated (typically stored on 3 servers)


File Storage

Blocks

  • On disks, the disk block size is the minimal amount of data that a disk can read or write
  • same for DFS: files are broken into block-sized chunks, which are stored as independent units
  • by default a block is 128 MB
  • result: a file can be larger than any of the disks on the cluster
  • it's also good for replication
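
As a rough illustration of the points above, the sketch below splits a file into fixed-size blocks and assigns each block to three datanodes. The function names and the round-robin placement are assumptions made for this example only; they are not how HDFS actually chooses replica locations.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default block size mentioned above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` datanodes (hypothetical round-robin).

    The nodes are distinct as long as replication <= len(datanodes).
    """
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


# A 300 MB file becomes two full blocks and one partial block.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Because every block is placed independently, the blocks of one file can end up on different machines, which is why a file can be larger than any single disk.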


Namenodes & Datanodes

  • HDFS uses the master-workers pattern
  • the master is called the Name node
  • the workers are called the Data nodes


Namenode

  • Name Node - the node that orchestrates the distribution of data and knows where everything is stored
  • Namenode manages the filesystem namespace, maintains the file tree and metadata for all files and directories
  • given a file, it knows where its blocks are located - on which datanodes
  • the namenode knows how to reconstruct a file from the blocks
  • it's also a single point of failure - if it fails, this information is lost


Datanodes

  • datanodes store blocks, and retrieve them when asked by clients
  • periodically report the list of stored blocks to the namenode
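
The division of work between the two node types can be sketched as a toy in-memory model. `ToyNameNode` and its methods are hypothetical names invented for this illustration, not part of any Hadoop API: the namenode keeps the file-to-blocks mapping, while block locations are rebuilt from the reports that datanodes send.

```python
from collections import defaultdict


class ToyNameNode:
    """Hypothetical model of the namenode's metadata (not a Hadoop class)."""

    def __init__(self):
        self.file_blocks = {}                     # path -> ordered list of block ids
        self.block_locations = defaultdict(set)   # block id -> datanodes holding it

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def receive_block_report(self, datanode, stored_block_ids):
        # Forget what this datanode reported earlier, then record the new list.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in stored_block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Return the file's blocks in order, each with the datanodes storing it."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]


nn = ToyNameNode()
nn.add_file("/logs/a.txt", ["b1", "b2"])
nn.receive_block_report("dn1", ["b1", "b2"])
nn.receive_block_report("dn2", ["b1"])
locations = nn.locate("/logs/a.txt")
```

The ordered block list is what lets the namenode describe how to reconstruct a file, and the periodic reports are what keep its view of block locations current.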


Maintaining Consistency

How can consistency be maintained across all these replicas?


Reading

When a client needs to read data, it needs to know where this piece of data is:

  • a "read" command is issued with an offset and a length - where in the file to start and how many bytes the client wants to read
  • The name node knows where every chunk of data is kept, so the clients read the metadata from it.
  • After getting the metadata, the client reads the data from the data node (so there's no centralized bottleneck - all reads are in parallel)

If the client fails to read a chunk of data from one data node, it asks the name node where the next replica is and tries again
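
The retry behaviour can be sketched as follows. This is a simplified illustration under stated assumptions: the client already holds the full list of replica locations instead of asking the name node again after each failure, and `read_block`, `fake_fetch`, `ReplicaUnavailable` and `STORAGE` are hypothetical names, not HDFS client calls.

```python
class ReplicaUnavailable(Exception):
    """Raised by a fetch function when a datanode cannot serve a block."""


def read_block(block_id, replica_locations, fetch):
    """Try each replica in turn and return data from the first one that answers."""
    errors = []
    for datanode in replica_locations:
        try:
            return fetch(datanode, block_id)
        except ReplicaUnavailable as exc:
            errors.append((datanode, exc))
    raise IOError(f"all replicas failed for block {block_id}: {errors}")


# Fake cluster state for the example: dn1 is down, dn2 and dn3 hold the block.
STORAGE = {"dn2": {"b1": b"hello"}, "dn3": {"b1": b"hello"}}


def fake_fetch(datanode, block_id):
    if datanode not in STORAGE:
        raise ReplicaUnavailable(datanode)
    return STORAGE[datanode][block_id]


data = read_block("b1", ["dn1", "dn2", "dn3"], fake_fetch)
```

The read succeeds as long as at least one replica is reachable, which is the practical benefit of keeping several copies of every chunk.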


Writing

We need to make sure that all the replicas contain the same data (i.e. they are consistent)

  • One replica is considered "main", and the master (the name node) knows which one.
  • The client sends the data to be written to all replicas
  • the write is applied on the main replica first and then propagated to the rest
  • So the system supports parallel reads and writes from a large number of clients
  • The reads are arbitrary and random access, but the writes are best when they are added to the end (i.e. appended)
  • Because the architecture relies on the main replica for deciding the order in which multiple append requests are processed, the data is always consistent
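
The ordering argument can be illustrated with a small sketch. It assumes a single main replica that assigns a sequence number to every append and forwards it to the others; the class names are hypothetical, and networking, failures and concurrency are left out.

```python
class Replica:
    """Hypothetical replica that stores appended records in order."""

    def __init__(self):
        self.records = []

    def apply(self, seq, record):
        # A record is accepted only at the position chosen by the main replica.
        if seq != len(self.records):
            raise ValueError(f"expected seq {len(self.records)}, got {seq}")
        self.records.append(record)


class MainReplica(Replica):
    """The replica that decides the order of all appends."""

    def __init__(self, others):
        super().__init__()
        self.others = others

    def append(self, record):
        seq = len(self.records)        # the main replica picks the position
        self.apply(seq, record)
        for replica in self.others:    # then propagates it in the same order
            replica.apply(seq, record)
        return seq


r1, r2 = Replica(), Replica()
main = MainReplica([r1, r2])
main.append("from-client-A")
main.append("from-client-B")
```

Since every append passes through one place that fixes its position, all replicas end up with the same records in the same order, whichever client sent them.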


Failure Handling

  • Namenode is a single point of failure
  • to reduce the risk of losing this metadata, we can run a secondary namenode
  • it's not really a "namenode": it only keeps a copy of the namespace image, which it periodically merges with the edit log
  • but this copy lags behind the primary, so you may lose some recent changes if the namenode fails
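
Why some changes can be lost is easier to see in a toy model. The sketch below assumes the namespace image is a dictionary and the log is a list of (operation, path) pairs; `apply_edits` is a hypothetical helper, not a Hadoop function.

```python
def apply_edits(image, edits):
    """Return a new namespace image with the logged operations applied in order."""
    namespace = dict(image)
    for op, path in edits:
        if op == "create":
            namespace[path] = []
        elif op == "delete":
            namespace.pop(path, None)
    return namespace


image = {"/a": []}
edit_log = [("create", "/b"), ("delete", "/a"), ("create", "/c")]

# The secondary namenode last merged the log when it contained two entries.
checkpoint = apply_edits(image, edit_log[:2])

# The primary namenode's actual state includes the third entry as well.
current = apply_edits(image, edit_log)

# If the primary fails now, whatever happened after the checkpoint is gone.
lost = set(current) - set(checkpoint)
```

Here `lost` contains only `/c`, the file created after the last merge; the more often the image and log are merged, the smaller this window becomes.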


HDFS Federation

Can federate several namenodes:

  • if there are too many files in HDFS - it's hard for the namenode to manage all of them
  • can add another namenode, so each namenode will manage only a portion of the namespace
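
One way to picture the split is a table that maps path prefixes to namenodes. The table contents and the `namenode_for` helper below are assumptions made for this illustration, not HDFS configuration.

```python
# Hypothetical mapping: each namenode manages one portion of the namespace.
MOUNT_TABLE = {
    "/user": "namenode-1",
    "/logs": "namenode-2",
}


def namenode_for(path, mount_table=MOUNT_TABLE):
    """Return the namenode responsible for a path (longest matching prefix)."""
    matches = [
        prefix
        for prefix in mount_table
        if path == prefix or path.startswith(prefix.rstrip("/") + "/")
    ]
    if not matches:
        raise KeyError(f"no namenode manages {path}")
    return mount_table[max(matches, key=len)]


owner = namenode_for("/user/alice/data.csv")
```

Each namenode then only has to keep the metadata for the files under its own prefixes.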


Pros and Cons

Cons

Not good for:

  • low-latency reads and writes (it's not a Database!)
  • lots of small files


HDFS is not a Database!

HDFS has:

  • no indexing
  • no random-access writes to files (existing data cannot be modified in place, only appended to)
  • no SQL
  • if you need DB capabilities on top of HDFS, use HBase

