6.208 Intro to Distribution-oblivious Programming

Concepts, techniques, and models for developing programs that execute on any number of computers

The seminar’s focus is distribution-oblivious programming, Quick Links: Homework & Projects, Reading & References, Schedule & Topics, Other Details.

an emerging area at the intersection of distributed systems and programming languages, and its interface with security and artificial intelligence. Distribution-oblivious programs perform on any number of computers rather than just one, have many benefits (e.g., can speed up computations, mitigate resource-exhaustion attacks, improve fault-tolerance, etc.), and are widely used at places such as Google and Facebook—but also increasingly by data-processing occurring at the edge. For example, techniques developed in the seminar will allow the following code—counting file lines—to run on any number of computers, agnostic to the file location:

…or automatically transform the following streaming pipeline—identifying the 10 highest temperatures during the last year across the U.S.—to run on as many computers as are available:

The seminar presents a hands-on introduction to the subject, guides participants through developing a series of (small, but actual!) systems, and prepares them for scientific contributions in the area. It follows a distribution-as-a-library approach, in which we incrementally augment a conventional programming language with distribution capabilities, and advocates a programming-language outlook—e.g., program analysis and transformation, declarative DSLs, and program synthesis—to simplify development and allow proving certain guarantees.

Reading & References

The seminar combines a theoretical and practical component, which involve different reading resources focusing on the respective aspects of distributed systems.

Theoretical Component

The following books contain useful course material for the former aspect, and much of the lecture content is derived from them (and other sources):

As in many fields in computer science, state-of-the-art research has not found its way into a book yet; rather, it is still in the form of conference papers, which we will regularly use as our primary resource.

Practical Component

The practical aspect of the course alternates between Lua, JavaScript, and Typescript. For this incarnation of the course we will be using JavaScript; helpful resources include:

Students are expected to have programmed before (in any language or form—i.e., can use an editor), but we will learn git, andromeda, and other tools together.

Schedule & Topics

Below is tentative schedule; at multiple points the schedule will depend on the interests of the students—so it’s by no means final!

Topic Details
0. Transformations (slides, code, hw) Language-based distribution (+administrivia)
Principles: Syntax, semantics, pragmatics
Practice: Crypto-currencies and lib-transform
1. Analysis Static and dynamic program analysis
Principles: Execution invariants, {state, server}-less
Practice: Functional purity and performance pathologies
2. Statefulness Distributed state management
Principles: Leader election, consistent hashing
Practice: Data partitioning, distributed indexing, and scalable file-system
3. Look-ups Accelerating queries and acceleration trade-offs
Principles: Hyperspace, unispace, and query DSLs
Practice: Distributed unispace indexing and queries
4. Failures Fault-tolernace and replication
Principles: Naive replication, impossibility results, chaining
Practice: Replicated file-system
5. Consistency Consistency trade-offs revisited
Principles: Consistency spectrum, mix-ins, and relaxations
Practice: Mixed-consistency file-system and DSLs
6. Execution Parallelism and distributability classes
Principles: Execution properties (monotone, det/stic, commutative, …)
Practice: Characterization and map-reduce
7. Composition Modularity and abstraction
Principles: Data-flow analysis, cost models, higher-order modules
Practice: Unix pipeline distributability
8. Synthesis Learning and synthesizing programs
Principles: Rich type hierarchies, decomposition
Practice: Automated decomposition into map-reduce primitives
9. Beyond Other concerns and open research problems
Principles: Multi-layer optimization, security, etc.
Practice: Combining analyses, transformations, and next steps

Homeworks & Projects

There are several homework projects: during the seminar, we will build several small-but-real components of a large distributed environment.

Homework 1 is an individual assignment, to ensure everyone has a working version of the toolchain. This assignment includes general tool-chain instructions, for getting access to the tools (and environment) you need to complete course assignments. The final objective is to run a few relatively simple tasks on the runtime environment used by the course.

Homeworks 2–4 are pair-programming mini-projects—students are strongly encouraged to work in groups of two or three students (in rare cases, they may work alone on them, provided they ask permission). Students will build three distributed services, which we will deploy on real distributed hardware!

Homework 5 is a project that students propose–e.g., an implementation of MapReduce, a distributed Unix file-system, etc. Students are provided a complete version of the distributed environment developed in HW2-4. This homework may include original research, which can lead to a successful paper publication. The only rule for HW5 is that no two HW5 projects can be the same; however, different HW5 projects may complement each other, so that the sum of them leads to something even more interesting.

List of homeworks (URLs become available as we progress in the seminar):

Policies

Contrary to most classes, this one encourages collaboration: the subject of the class as well as some of the tools are complex—as a result, it may be necessary to collaborate as a class on figuring out how to best internalize the content and use the tools. It is also helpful to get different perspectives on the same subject, and even catch or correct misconceptions early on. Finally, there’s not better way to learn something than to explain it:-) Thus, you are encouraged to talk with each other as much as possible (and write English or pseudocode).

This is a high-loadNot difficulty, just time. Significant instruction effort was invested to simplify the development of distributed systems.

but high-reward course, with a significant hands-on component. Rewards include the satisfaction of having built cutting-edge systems that solve actual problems, the likelihood of interning at tech companies such as Google and Facebook, and the ability to conduct cutting-edge research in several areas, including distributed systems, programming languages, and computer security.

Other Details

A mailing list and GitHub organization will be set up for the course. Before the first class, participants should send me an email (so that I include them in the mailing list) that includes their GitHub ID (so that I add them to the org).

Prerequisites

Familiarity with a programming language, basic mathematical maturity, and excitement are important; the course will cover the rest. If you are unsure about taking this course, you should attend the first lecture:-)

Contributions / Fixes

The entire course (including the page you’re reading!) are hosted on GitHub—feel free to send a pull requests! The version displayed here is 7754b3f. You can also follow the course on GitHub, to receive notifications about announcements.