Further MapReduce Observations and Ramblings

Mark Mzyk | February 12, 2008

I’ve posted several times about MapReduce, and I’ll admit I’m fascinated by it. Here are some observations I noted while reading the white paper, along with several direct quotations from it.

MapReduce was “inspired by the map and reduce primitives present in Lisp and many other functional languages.” If that isn’t a strong argument for learning a different language and a different style of programming, I don’t know what is. You never know where you can pick up new ways of doing things, or where inspiration might come from.
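For anyone who hasn’t met those two primitives, here is a minimal sketch in Python rather than Lisp. The word list is made up for illustration, and this shows only the functional idea, not anything from Google’s implementation.

```python
from functools import reduce

# Hypothetical input, chosen only for illustration.
words = ["map", "reduce", "lisp"]

# map: apply a function to every element independently.
lengths = list(map(len, words))

# reduce: fold the mapped values into a single result.
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths, total)
```

Because each call to the mapped function depends on only one element, the map step is the part that can be spread across many machines.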

MapReduce makes use of hash maps. It seems to me these are the forgotten data structure. Certainly a lot of people do use them, but in my experience the tendency is to try to fit an array to a problem and, only after every variation on the basic array has failed, to look at other data structures. The world might be a better place if the hash map were kept closer at hand (and all the other data structures as well).
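To show why a hash map is a natural fit, here is a rough sketch of grouping intermediate key/value pairs by key, which is roughly what has to happen between the map and reduce steps. The function name `group_by_key` and the sample pairs are my own, not taken from the paper.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect every value that shares a key into one list (hypothetical helper)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# Made-up intermediate pairs, such as a word-count map step might emit.
pairs = [("a", 1), ("b", 1), ("a", 1)]
grouped = group_by_key(pairs)
print(grouped)
```

Doing the same thing with a plain array would mean scanning or sorting to find matching keys; the hash map makes each lookup a single step.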

Parallel processing is going to become ever more important. This isn’t just because the processor companies have to figure out ways to keep up with Moore’s law. As corporations continue to amass huge amounts of data, parallel processing is going to become the only cost-effective and time-effective way to analyze it.

Parallel processing will also likely lead to new languages and new programming techniques. Erlang and other languages that make parallel processing easier than the current crop of popular languages, such as Java, seem poised to jump into the mainstream, if not this year, then certainly sometime in the next several.

Programmers are going to have to work with ever larger data sets. Gone will be the days of thinking in bits and bytes, although that knowledge will still be important. Google’s implementation of MapReduce works with chunks of data that are 16 MB to 64 MB in size. How soon until a gigabyte chunk of data is considered commonplace? Languages that have inherent limits, like PHP’s 2GB filesize limit, will quickly become liabilities, or else libraries will be developed that allow developers to forget the limit exists. Bad workarounds will continue to persist, though, even as we all wish they would be banished to the seventh level of Hell.
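Working in chunks doesn’t require anything exotic. As a minimal sketch, assuming the data arrives as a binary stream, a generator can hand out fixed-size pieces without ever loading the whole input. The name `read_chunks` is hypothetical, and the 64 MB default simply mirrors the upper figure above.

```python
import io

def read_chunks(stream, chunk_size=64 * 1024 * 1024):
    """Yield successive pieces of at most chunk_size bytes from a binary stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read means the stream is exhausted
            break
        yield chunk

# A tiny in-memory stream and a tiny chunk size, just to show the behavior.
data = io.BytesIO(b"abcdefghij")
sizes = [len(chunk) for chunk in read_chunks(data, chunk_size=4)]
print(sizes)
```

The same function works on a file opened with `open(path, "rb")`, so memory use stays bounded by the chunk size rather than by the file size.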

I’ve questioned before how Google is going to maintain their code quality while growing so quickly. Well, the sorting program described in the paper consists of fewer than 50 lines of user code. With programs that small, maintainability becomes much easier.

Google now uses MapReduce for all kinds of problems beyond the original application it was written for:

  • large-scale machine learning problems
  • clustering problems for the Google News and Froogle products
  • extracting data to produce reports of popular queries
  • extracting properties of web pages for new experiments and products
  • processing of satellite imagery data
  • language model processing for statistical machine translation
  • large-scale graph computations

That is the definition of code reuse. While the same code isn’t being used in each instance, the same algorithm and structure are. Again, maintainability is going to be much, much easier. It also shows that with a bit of imagination, something that might seem limited can have a large array of possibilities.
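Here is a toy, single-machine sketch of that kind of reuse: one skeleton, two unrelated jobs. Everything in it (`map_reduce`, the mapper and reducer names, the sample lines) is my own invention for illustration; Google’s real library is C++ and handles distribution, scheduling, and failures, none of which appears here.

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Run mapper over every input, group the emitted pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Job 1: count how often each word appears.
def word_mapper(line):
    for word in line.split():
        yield word.lower(), 1

def count_reducer(key, values):
    return sum(values)

# Job 2: list the distinct words of each length.
def length_mapper(line):
    for word in line.split():
        yield len(word), word.lower()

def distinct_reducer(key, values):
    return sorted(set(values))

lines = ["the quick fox", "The lazy dog"]
word_counts = map_reduce(lines, word_mapper, count_reducer)
words_by_length = map_reduce(lines, length_mapper, distinct_reducer)
print(word_counts)
print(words_by_length)
```

Only the small mapper and reducer functions change between the two jobs; the skeleton that does the grouping stays the same.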

A direct quote from the white paper that every programmer should read (added emphasis is my own):

MapReduce has been so successful because it makes it possible to write a simple program and run it efficiently on a thousand machines in a half hour, greatly speeding up the development and prototyping cycle. Furthermore, it allows programmers who have no experience with distributed and/or parallel systems to exploit large amounts of resources easily.

What Google is dealing with now will likely be the norm in the future. Learn what they are doing so you can exploit it when it is your turn.

What else has MapReduce done for Google? The paper states that, because of it, parts of Google’s code are smaller and simpler to understand. As an example, one part of Google’s code dropped from approximately 3800 lines of C++ to approximately 700 lines. That’s incredible.

Read the white paper for more amazing bits of information. If you haven’t asked yourself how your company might be able to use MapReduce, why not? While it won’t be applicable to everything, it could be incredibly beneficial in some areas. There is even an open source implementation of MapReduce.

The story of MapReduce points out the importance of diversity. Perhaps you should go out on a limb and hire a different type of developer for that open position. If you’re a Java shop, why not consider hiring a Ruby developer? Or if you’re a PHP shop, maybe you should hire a Haskell developer. While it might take the new developer some time to get up to speed initially, you can’t quantify the benefit their outside experience might bring.

If you’re a developer and only know one language, maybe this has shown you why you should learn another. At the very least, learning another language will expand your own personal knowledge, and that is always a venture worth pursuing.