Networks #
- Wi-Fi: Each host has a wireless transmitter and receiver over which it can send/receive signals
- Connections between cities/ISPs are usually 100 Gb/s or faster; it's the connection between the ISP and the user that is the bottleneck
- The routing table on routers is populated automatically with BGP (Border Gateway Protocol) (see the lookup sketch below)
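A minimal sketch of the kind of table BGP ends up populating: prefixes mapped to next hops, with lookups done by longest-prefix match. This does not implement BGP itself, only the table it fills; the prefixes and next-hop names are made up for illustration.

```python
import ipaddress

# Hypothetical routing table: IP prefix -> next hop (names are illustrative).
routing_table = {
    ipaddress.ip_network("0.0.0.0/0"):      "upstream-isp",   # default route
    ipaddress.ip_network("203.0.113.0/24"): "peer-a",
    ipaddress.ip_network("203.0.113.0/25"): "peer-b",         # more specific prefix
}

def next_hop(dst: str) -> str:
    addr = ipaddress.ip_address(dst)
    # Longest-prefix match: among entries containing the address, pick the most specific.
    matches = [net for net in routing_table if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return routing_table[best]

print(next_hop("203.0.113.10"))   # -> "peer-b" (the /25 wins over the /24)
print(next_hop("198.51.100.7"))   # -> "upstream-isp" (falls through to the default route)
```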
Distributed File Systems #
- 2 main approaches:
- Client/server (e.g. NFS)
- Cluster-based (e.g. HDFS, GFS (Google File System)) -> for many applications running in parallel with high scalability
Cluster Based #
- Built for much larger data and I/O volumes: NFS sends data back and forth a lot, while cluster-based systems send less
- Traditional client/server models couldn’t scale to the level of data cluster-based systems support
- Clients connect to a master metadata server which connects to various data servers that hold chunks of files
- The metadata server holds information about which servers hold which portions of which files
- File chunks are replicated n times (e.g. n = 3)
- Thus, processing can be split across the different servers by file chunk
- This is a common sense architecture
- Clients first connect to the metadata server, which returns information about which data servers house the file of interest (see the sketch after this list)
- Clients can connect to the closest replica, since the data is replicated
- Files can be read from or written to in parallel if they are stored on different machines
- The metadata server is responsible for rebalancing load/file chunks amongst the different servers
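A minimal sketch of the read path just described, assuming hypothetical file paths, chunk ids, and data-server names (none of these come from the real HDFS or GFS APIs): the client asks the metadata server where a file's chunks live, then fetches each chunk from one of its replicas.

```python
from dataclasses import dataclass

@dataclass
class ChunkInfo:
    chunk_id: str
    replicas: list[str]          # data servers holding a copy (replication factor 3)

# Metadata held by the master: file name -> ordered list of chunks (illustrative values).
METADATA = {
    "/logs/access.log": [
        ChunkInfo("chunk-0001", ["ds-1", "ds-4", "ds-7"]),
        ChunkInfo("chunk-0002", ["ds-2", "ds-5", "ds-8"]),
    ],
}

def locate(path: str) -> list[ChunkInfo]:
    """Step 1: the client asks the metadata server where the file's chunks live."""
    return METADATA[path]

def read_file(path: str, fetch_chunk) -> bytes:
    """Step 2: the client fetches each chunk from one replica and reassembles the file."""
    data = b""
    for chunk in locate(path):
        replica = chunk.replicas[0]          # e.g. pick the closest or least-loaded replica
        data += fetch_chunk(replica, chunk.chunk_id)
    return data

# Stand-in fetch function; a real client would contact the data server over the network.
fake_fetch = lambda server, chunk_id: f"<{chunk_id} from {server}>".encode()
print(read_file("/logs/access.log", fake_fetch))
```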
- MapReduce uses the cluster-based distributed file system to execute jobs in parallel (see the word-count sketch below)
- HDFS only allows append-only writes (which is what most writes are in general) and a single writer per file, while GFS allows random writes and concurrent writers
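A minimal local sketch of the MapReduce pattern using the classic word count. In a real deployment each map task would run on the data server holding the corresponding file chunk; here the "chunks" are just in-memory strings and the function names are illustrative, not a real framework.

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk: str) -> list[tuple[str, int]]:
    # Map: emit (word, 1) for every word in this chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs) -> dict[str, int]:
    # Shuffle + reduce: group by key and sum the counts.
    counts: dict[str, int] = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

chunks = ["the quick brown fox", "the lazy dog", "the fox"]   # stand-ins for file chunks
mapped = [map_phase(c) for c in chunks]                        # would run in parallel per chunk
print(reduce_phase(chain.from_iterable(mapped)))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```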