== Distributed File Systems ==
Typically [[MapReduce]] I/O operations are performed on distributed file systems.
* a distributed FS is a file system that manages data across several machines, so it's network-based
* one such file system is HDFS - the [[Hadoop]] Distributed File System
− | |||
− | |||

=== Hadoop DFS ===
[[Hadoop MapReduce]] (and other engines) run on an underlying storage layer for reading and writing:
* this storage is typically HDFS
* large files are typically distributed in chunks, which are stored on data nodes
* each chunk is replicated (typically stored on 3 servers)

== File Storage ==

=== Blocks ===
* on [[Secondary Storage|disks]], the block size is the minimal amount of data that a disk can read
* the same holds for a DFS: files are broken into block-sized chunks, which are stored as independent units
* by default a block is ~128 MB
* as a result, a file can be larger than any single disk in the cluster
* blocks are also a convenient unit for replication
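The block idea can be sketched in a few lines of Python (the 128 MB constant is the HDFS default; the function name and the tiny block size in the example are illustrative, not part of HDFS):

```python
# Sketch: splitting a file into HDFS-style block-sized chunks.
# BLOCK_SIZE mirrors the HDFS default of 128 MB; the 128-byte value
# used below is only to keep the example small.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Break a byte string into independent block-sized chunks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# A 300-byte "file" with a 128-byte block size yields 3 blocks,
# the last one smaller than the block size:
blocks = split_into_blocks(b"x" * 300, block_size=128)
print([len(b) for b in blocks])  # [128, 128, 44]
```

Note that the last block is smaller than the block size - a file does not occupy a whole block's worth of disk space for its final chunk.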

=== Namenodes &amp; Datanodes ===
* HDFS uses the master-workers pattern
* the master is called the namenode
* the workers are called datanodes


Namenode
* the ''namenode'' orchestrates the distribution of data and knows where everything is stored
* it manages the filesystem namespace: it maintains the file tree and the metadata for all files and directories
* given a file, it knows where its blocks are located - on which datanodes
* the namenode knows how to reconstruct a file from its blocks
* it's also a single point of failure - if it fails, this information is lost
+ | |||
+ | |||
+ | Datanodes | ||
+ | * datanodes store blocks, and retrieve them when asked by clients | ||
+ | * periodially report the namenode the list of stored blocks | ||
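A toy model of the namenode's bookkeeping - mapping each file to its blocks and each block to the datanodes holding its replicas. The class and method names are made up for illustration; this is not the real HDFS API:

```python
import random

# Sketch: a toy namenode tracking, for each file, its block IDs and the
# datanodes holding each replica (replication factor 3, as above).

class ToyNameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = replication
        self.files = {}  # path -> list of (block_id, [datanode, ...])

    def add_file(self, path, n_blocks):
        blocks = []
        for i in range(n_blocks):
            # pick `replication` distinct datanodes for each block
            replicas = random.sample(self.datanodes, self.replication)
            blocks.append((f"{path}#blk{i}", replicas))
        self.files[path] = blocks

    def locate(self, path):
        """Return block locations so a client can read from datanodes directly."""
        return self.files[path]

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"], replication=3)
nn.add_file("/logs/2024.txt", n_blocks=2)
for block_id, replicas in nn.locate("/logs/2024.txt"):
    print(block_id, replicas)
```

The point of the sketch: the namenode holds only metadata (the `files` dictionary), never the blocks themselves - which is also why losing it loses the mapping, not the data.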
== Maintaining Consistency ==
How to maintain [[Consistency (databases)|consistency]] across all these replicas?
+ | |||
=== Reading ===
When a client needs to read data, it needs to know where this piece of data is:
* a "read" command is issued with an offset and the number of bytes the client wants to read
* the '''name node''' knows where every chunk of data is kept, so the client first reads the metadata from it
* after getting the metadata, the client reads the data directly from the '''data nodes''' (so there's no centralized bottleneck - all reads happen in parallel)
In case the client fails to read a chunk of data, it asks the '''name node''' where the next replica is - and tries again
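The read-with-retry behaviour can be sketched like this (the in-memory dictionaries stand in for datanodes; all names are hypothetical):

```python
# Sketch: a client read with failover - if one datanode cannot serve a
# block, the client tries the next replica from the namenode's list.

def read_block(block_id, replicas, datanodes):
    """Try each replica in turn; skip datanodes that are down or missing the block."""
    for node in replicas:
        store = datanodes.get(node)
        if store is not None and block_id in store:
            return store[block_id]
    raise IOError(f"all replicas of {block_id} are unreachable")

datanodes = {
    "dn2": {"blk0": b"hello "},
    "dn3": {"blk0": b"hello "},
}
# dn1 is down (absent from the cluster map), so the client
# falls back to the next replica, dn2:
print(read_block("blk0", ["dn1", "dn2", "dn3"], datanodes))  # b'hello '
```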
=== Writing ===
We need to make sure that all the replicas contain the same data (i.e. they are consistent):
* one replica is considered the "main" one, and the master knows which one it is
* the client sends the data to be written to all replicas
* the data is written to the main replica first and then propagated to the rest
* So it supports parallel reads and writes from a large number of processors
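A minimal sketch of the write path, assuming the "write to main, then propagate" scheme described above (the data structures and function name are illustrative):

```python
# Sketch: writing through a "main" replica. The data is applied to the
# main replica first and then propagated to the others, so that all
# copies end up identical (consistent).

def write_block(block_id, data, main, others):
    main[block_id] = data              # written to the main replica first
    for replica in others:             # then propagated to the rest
        replica[block_id] = data

main, r2, r3 = {}, {}, {}
write_block("blk0", b"payload", main, [r2, r3])
assert main == r2 == r3 == {"blk0": b"payload"}
```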
https://raw.github.com/alexeygrigorev/ulb-adb-project-couchbd/master/report/images/DFS.png
+ | |||
+ | |||
+ | === Failure Handling === | ||
+ | * Namenode is a single point of failure | ||
+ | * to prevent losing data, we can have a secondary namenode | ||
+ | * it's not really a "namenode", it only keeps a copy of the namespace image - with some logs | ||
+ | * but logs might not be up-to-date, so you potentially may lose some data if the namenode fails | ||
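A small sketch of why a stale edit log loses data: recovery replays the log on top of the last namespace image, so any operation logged after the last sync simply never happened (all names and structures here are made up for illustration):

```python
# Sketch: recovering the namespace from an image plus an edit log.
# Operations that were not yet synced to the secondary are lost
# along with the failed namenode.

image = {"/a": "blocks-of-a"}                      # last checkpointed namespace image
synced_log = [("create", "/b", "blocks-of-b")]     # survived on the secondary
unsynced_log = [("create", "/c", "blocks-of-c")]   # lost with the namenode

def recover(image, log):
    """Replay an edit log on top of a namespace image."""
    ns = dict(image)
    for op, path, meta in log:
        if op == "create":
            ns[path] = meta
        elif op == "delete":
            ns.pop(path, None)
    return ns

recovered = recover(image, synced_log)
print(sorted(recovered))  # ['/a', '/b'] - '/c' is gone
```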
+ | |||
+ | |||
+ | == HDFS Federation == | ||
+ | Can federate several namenodes: | ||
+ | * if there are too many files in HDFS - it's hard for the namenode to manage all of them | ||
+ | * can add another namenode, so each namenode will manage only a portion of the namespace | ||
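Federation can be pictured as routing each path to the namenode that owns its portion of the namespace, e.g. by longest matching path prefix. The mount table below is an assumption for illustration, not real HDFS configuration:

```python
# Sketch: namespace partitioning across federated namenodes.
# Each namenode owns a subtree of the file tree.

MOUNTS = {
    "/user": "namenode-1",
    "/logs": "namenode-2",
}

def route(path):
    """Pick the namenode responsible for a path by longest matching prefix."""
    best = max((p for p in MOUNTS if path.startswith(p)), key=len, default=None)
    if best is None:
        raise KeyError(f"no namenode owns {path}")
    return MOUNTS[best]

print(route("/logs/2024/app.log"))  # namenode-2
```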
+ | |||
+ | |||
+ | == Pros and Cons == | ||
+ | === Cons === | ||
+ | Not good for: | ||
+ | * low-latency reads and writes (it's not a Database!) | ||
+ | * lots of small files | ||
+ | |||
+ | |||
+ | === HDFS is not a [[Database]]! === | ||
+ | HDFS has: | ||
+ | * no indexing | ||
+ | * no random access to files | ||
+ | * no SQL | ||
+ | * if you need DB capabilities on top of HDFS use [[HBase]] | ||
== Sources ==
* [[Hadoop: The Definitive Guide (book)]]
* [[Web Intelligence and Big Data (coursera)]]

[[Category:Distributed Systems]]
[[Category:Hadoop]]