This course is an introduction to distributed systems.
The lectures will cover fundamental concepts in distributed
systems showing how they are applied when building reliable
distributed systems and services. Topics include:
- How and why computer systems fail. How to overcome failures in a distributed system.
- Failure models. The distributed commit problem.
- Clock synchronization and synchronous systems.
- Dynamic membership. Replicating data with malicious failures. Impossibility of asynchronous consensus.
- Group communication systems, properties and dynamic group membership.
- Causal and total order.
- Virtually synchronous algorithms and tools: replicated data, state transfer, load balancing, primary-backup and coordinator-cohort fault tolerance.
- Transactional model and implementation of a transactional storage system.
- Distributed transactions and multiphase commit.
- Distributed hash tables.
- Application of distributed systems concepts to real systems: file systems (GFS, HDFS), databases (BigTable, HBase, Spanner), lock services (Chubby, Zookeeper, Zab), computational services.
- Applications of distributed systems to blockchains, digital currencies, credit systems, smart contracts, and distributed ledgers.
The grade will be based on several written homework assignments (HW),
programming projects (PP), a final project (FP), and class
participation (in class, on Piazza, in office hours, etc.) (CP):
Grade = 24%*HW + 40%*PP + 26%*FP + 10%*CP.
The required programming language is C; the platform is Linux.
For the final project, any programming language can be used.
Reading list and resources
- Reading will be assigned during each lecture; see also the list at the end of the page.
- Lectures for the undergraduate introduction to C course I taught at Purdue: [www]
- Socket programming: [www]
- Unix programming links: [www]
Academic honesty and ethical behavior are required in this course,
as they are in all courses at Northeastern University. There is zero
tolerance for cheating.
You are encouraged to talk with the professor about any questions
you have about what is permitted on any particular assignment.
Lecture slides will be posted below. Homework and projects will be handed
out in class and/or posted on Piazza. All class communication will take place on Piazza.
Preliminary plan and topics below.
Topic 1 - Introduction. Class
Topic 2 - Time in distributed systems (Lamport clocks, vector clocks, NTP). Global states and distributed snapshots. Failure detectors.

Week 3
Topic 3 - Consensus: synchronous systems, asynchronous systems, Byzantine failures (including
Project 1 assigned.

Week 4
Topic 4 - Distributed commit (2PC and 3PC).

Week 5
Topic 4 cont. No class on Wednesday.
Hw2 assigned.
Project 1 due.

Week 6
Topic 5 - Process groups: leader election, membership, reliable multicast, virtual synchrony.
Hw2 due.
Project 2 assigned.

Week 7
No class on Monday.
Topic 6 - Quorums. Paxos. Viewstamped replication. BFT.

Week 8
Topic 6 cont.
Topic 7 - Peer-to-peer overlays. Gossip protocols. Distributed hash tables.
Project 2 due.

Topic 8 - Blockchains, digital currencies, credit systems, smart contracts, distributed ledgers.
Final project assigned/selected.
Topic 9 - GFS, HDFS.
Topic 10 - BigTable, HBase, Spanner. Dynamo.
Hw3 assigned.
Topic 11 - MapReduce. Hadoop. Spark. Mesos. YARN.
Hw3 due.
Topic 12 - Infrastructure for ML. TensorFlow, GraphLab.
Topic 13 - Edge computing.
Class summary: Ten things to remember.
Final project presentations.
Final project presentations will take place in class.
- Why Do Computers Stop and What Can Be Done About It? J. Gray, 1985.
- End-to-End Arguments in System Design. Saltzer, Reed, Clark. TOCS 1984.
- Why do Internet services fail, and what can be done about it? D. Oppenheimer, A. Ganapathi and D. A. Patterson, 2003.
- Time, Clocks, and the Ordering of Events in a Distributed System, L. Lamport 1978, SIGOPS Hall of Fame.
- Virtual Time and Global States of Distributed Systems, F. Mattern, 1988.
- Distributed Snapshots: Determining Global States of Distributed Systems. K. M. Chandy and L. Lamport, 1985, SIGOPS Hall of Fame.
- Unreliable Failure Detectors for Reliable Distributed Systems, T. Chandra and S. Toueg, 1996.
- Knowledge and Common Knowledge in a Distributed Environment, J. Halpern and Y. Moses, E.W. Dijkstra Prize 2009.
- Impossibility of Distributed Consensus with One Faulty Process. M.J. Fischer, N.A. Lynch and M.S. Paterson, 1983. E.W. Dijkstra Prize, 2001.
- The Byzantine Generals Problem, L. Lamport, R. Shostak, and M. Pease, 1982.
- Another advantage of free choice (Extended Abstract): Completely asynchronous agreement protocols. M. Ben-Or, 1983.
- Exploiting virtual synchrony in distributed systems. K. P. Birman and T. A. Joseph, 1987.
- Extended Virtual Synchrony, L. E. Moser, Y. Amir, P. M. Melliar-Smith, D. A. Agarwal, 1994.
- Distributed Recovery, Bernstein, Goodman and Hadzilacos.
- Non-blocking Commit Protocols, D. Skeen.
- Determining the Last Process to Fail, D. Skeen.
- The State Machine Approach. F.B. Schneider, SIGOPS Hall of Fame.
- Hypervisor-based Fault-Tolerance, T. Bressoud and F.B. Schneider
- A Survey of Rollback Recovery Protocols in Message Passing Systems, E. Elnozahy, L. Alvisi, Y.M.Wang, and D.B. Johnson.
- Paxos Made Simple, L. Lamport.
- The Part-Time Parliament, L. Lamport, SIGOPS Hall of Fame.
- Paxos for System Builders, J. Kirsch and Y. Amir (the technical report).
- Viewstamped Replication Revisited, B. Liskov and J. Cowling
- From Viewstamped Replication to Byzantine Replication. B. Liskov.
- Bimodal Multicast, K.P. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky
- Byzantine Quorum Systems, D. Malkhi and M. Reiter
- Practical Byzantine Fault-Tolerance, M. Castro and B. Liskov
- Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, H. Balakrishnan, 2001.
- The Google File System. S. Ghemawat, H. Gobioff and S.-T. Leung. SOSP 2003.
- The Chubby Lock Service for Loosely-Coupled Distributed Systems. Mike Burrows, OSDI 2006
- Bigtable: A Distributed Storage System for Structured Data. 2008. ACM Trans. Comput. Syst. 26, 2 (Jun. 2008), 1-26
- Spanner: Google's Globally-Distributed Database. OSDI 2012.
- MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004
- Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, NSDI 2011
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI 2012, best paper
- Apache Hadoop YARN: Yet Another Resource Negotiator, SoCC 2013 (best paper)
- Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud, VLDB 2012
- Pregel: A System for Large-Scale Graph Processing, SIGMOD 2010
- TensorFlow: A System for Large-Scale Machine Learning, OSDI 2016
- Bitcoin: A Peer-to-Peer Electronic Cash System, Satoshi Nakamoto
- Majority is not Enough: Bitcoin Mining is Vulnerable, Ittay Eyal and Emin Gün Sirer
- Hyperledger fabric: a distributed operating system for permissioned blockchains, EuroSys 2018.
Copyright© 2014 Cristina Nita-Rotaru. Send your comments and questions to Cristina Nita-Rotaru