Data Intensive Text Processing with MapReduce
There’s a big learning curve when you jump from studying statistics in school to programming statistical tools for Amazon-scale data. Big Data’s Swiss Army knife is Map Reduce, so it’s often the case that you have to describe any data manipulation algorithm in the Map Reduce paradigm. Since I’m new to Map Reduce, I’ve been reading up on best practices. One book that I saw recommended on the web was Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer. You can get it free from here. The book is written by two professors from the Natural Language Processing world. While I don’t directly deal with NLP, it interested me because of its machine learning applications.
Review
The book describes what Map Reduce is, mostly from a Hadoop bent. It then talks about a few common patterns for implementing a Map Reduce algorithm, and goes on to describe detailed implementations of Web Crawling, Breadth First Search and Expectation Maximization. The book stays above nitty-gritty Map Reduce issues such as syntax and installation, which is good because the algorithms are transferable to other languages like Pig. However, the authors don’t shy away from algorithm design optimizations which are specific to the Hadoop implementation of Map Reduce, which is also good because that’s the implementation everybody working outside of Google uses.
I found the book very helpful as an introduction, and a good length at about 150 pages. The sections on Web Crawling, Graph Algorithms and EM spend a lot of time describing background information about the problem space. Of course this background isn’t really full of Map Reduce goodness, so I found some of it tedious. For example, if you haven’t studied storing sparse and dense graphs as lists or matrices, a book on Map Reduce algorithms is probably above your level. The web crawling chapter also dragged on a bit; it’s such a common example for Map Reduce that I feel its value could have been distilled to a couple of pages. The second-to-last chapter gives a big introduction to Hidden Markov Models and Expectation Maximization. With my statistics background I’ve come across these concepts before, but I still found the description of EM confusing. While the background information took up about 40 pages, the actual Map Reduce implementations took only 5. I’ll do my best to clarify this chapter below.
Overall I’d highly recommend this book to anybody serious about writing Map Reduce algorithms. For those who are new to the whole system, like me, I’d recommend reading the whole book. People with a bit more Map Reduce experience under their belt might just skip ahead to the specific Map Reduce implementations in each chapter to see if there are any tricks they can learn.
Stripes & Pairs
Stripes and Pairs are the two general patterns introduced at the start of the book which the rest of the Map Reduce algorithms make use of. They represent two different ways to operate on data keys which are tuples of the input data. The example problem used in the book is counting pairs of words that occur in the same sentence over a collection of documents. In this case the data keys are pairs of words, and the output value associated with each data key is its count.
Both Stripes and Pairs calculate the same output, but they differ in their intermediate values. The Pairs mapper breaks up the input into each individual tuple it finds, while the Stripes mapper does some aggregation within the mapper before emitting its values to the reducer. The Stripes algorithm is faster, while the Pairs approach scales to problems that have a large number of unique data key tuples. An implementation of the word co-occurrence solution with each pattern is given below.
Pairs
Map(doc)
  foreach sentence in doc
    foreach word1 in sentence
      foreach word2 in sentence
        EMIT(tuple(word1, word2), 1)

Reduce(tuple(word1, word2), counts[])
  sum = 0
  foreach c in counts
    sum += c
  EMIT(tuple(word1, word2), sum)
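To make the pattern concrete, here is a minimal Python sketch of the Pairs approach, simulating the map and reduce functions in memory. The function names and the toy representation of a document as a list of sentences, each a list of words, are my own assumptions for illustration, not code from the book.

def pairs_map(doc):
    # doc is assumed to be a list of sentences, each a list of words
    for sentence in doc:
        for i, word1 in enumerate(sentence):
            for j, word2 in enumerate(sentence):
                if i != j:  # don't pair a word position with itself
                    yield (word1, word2), 1

def pairs_reduce(pair, counts):
    # counts holds every 1 emitted for this (word1, word2) pair
    yield pair, sum(counts)

The framework’s shuffle phase is what groups all the emitted 1s by pair before the reducer runs; a toy shuffle is sketched after the Stripes section below.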
Stripes
Map(doc)
  A = new Map(word, Map(word, count))
  foreach sentence in doc
    foreach word1 in sentence
      if word1 not in A
        A[word1] = new Map(word, count)
      foreach word2 in sentence
        if word2 not in A[word1]
          A[word1][word2] = 1
        else
          A[word1][word2] += 1
  foreach word in A
    EMIT(word, A[word])

Reduce(word1, Stripes[]) // each stripe is a Map(word2, count)
  S = new Map(word2, count)
  foreach s in Stripes
    S += s // element-wise sum of the two maps
  foreach word2 in S
    EMIT(tuple(word1, word2), S[word2])
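And here is a matching Python sketch of the Stripes approach, together with a toy in-memory “shuffle” so the two patterns can be compared end to end. Again, the function names and the document format are assumptions of mine for illustration, not the book’s code.

from collections import defaultdict

def stripes_map(doc):
    # accumulate one stripe (word2 -> count) per word1 across the whole document
    A = defaultdict(lambda: defaultdict(int))
    for sentence in doc:
        for i, word1 in enumerate(sentence):
            for j, word2 in enumerate(sentence):
                if i != j:
                    A[word1][word2] += 1
    for word1, stripe in A.items():
        yield word1, dict(stripe)

def stripes_reduce(word1, stripes):
    # element-wise sum of all the stripes emitted for word1
    S = defaultdict(int)
    for stripe in stripes:
        for word2, count in stripe.items():
            S[word2] += count
    for word2, count in S.items():
        yield (word1, word2), count

def run(mapper, reducer, docs):
    # simulate the shuffle: group mapper output by key, then reduce each group
    groups = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            groups[key].append(value)
    return {k: v for key, values in groups.items() for k, v in reducer(key, values)}

docs = [[["a", "b", "a"]], [["b", "c"]]]
print(run(stripes_map, stripes_reduce, docs))

Running the Pairs functions from the earlier sketch through the same run() helper produces exactly the same dictionary, which is the point: the two patterns only differ in their intermediate values.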
The Pairs mapper emits each pair of co-occurring words it comes across, i.e. tuple(word1, word2). The Stripes mapper emits a “stripe” of data for each word in the document, i.e. tuple(word1, *). In this case a “stripe” is represented by a Map. The Stripes implementation is faster because it performs a lot of the summing within the mapper, reducing the amount of information that needs to be transferred between the mapper and reducer. However, the Stripes algorithm runs into a problem if the number of co-occurring words grows too large for a stripe’s map to fit in memory. If that’s the case you’ll have to revert to the Pairs approach.
The Future of Machine Learning on Map Reduce
The book was published in 2010 and it’s starting to show its age. It’s a real trip down Map Reduce memory lane. At the time of the book’s writing, Google’s paper on the Pregel system had just been released, and Apache Pig was a big unknown. Pig is a scripting language that describes data flows which get translated into Map Reduce jobs. Fast forward to the present and Pig is my go-to language for writing new Map Reduce jobs.
It seems there’s been a push back against Map Reduce from the machine learning world since the time of the book’s writing. There are a lot of specialized distributed machine learning frameworks under development that act as alternatives to Map Reduce. In the past month alone I’ve seen two presentations on new frameworks which claim to run machine learning algorithms orders of magnitude faster than Map Reduce. The arguments against Map Reduce were:
- The implementation unnecessarily sacrifices too much speed for safety (e.g. writing every intermediate value to the file system).
- Most machine learning algorithms can’t be turned into efficient Map Reduce algorithms because of complex dependencies between the data.
Carlos Guestrin’s presentation on GraphLab seemed very promising and speedy. There’s also, of course, Vowpal Wabbit, although it only implements a specific machine learning algorithm. We’ll be able to read their ideas for how to perform large-scale machine learning in Scaling up Machine Learning.