Reliable Distributed Systems
Developing distributed systems can be difficult, and many of the patterns that succeed in conventional applications (such as constructing complex operations by composing simpler ones) lead to applications that work… some of the time. Although researchers have known it for years, a new generation of practitioners is learning the hard way that there’s an intractable contradiction between scalability, reliability and data integrity.
Ken Birman’s textbook Reliable Distributed Systems is an excellent introduction to this brave new world, focused on the construction of systems that are reliable, meaning that they keep working when something goes wrong. This is critical for rich internet applications (which work over an unreliable public internet) and for applications that run on large clusters (where there’s a lot of hardware to fail). If you find the text pricey, you’ll appreciate the slides from his Cornell course, available on his home page.
Paul Houle on April 15th 2008 in Distributed