Data-Intensive Text Processing with MapReduce


There’s a big learning curve when you jump from studying statistics in school to programming statistical tools for Amazon-scale data. Big Data’s Swiss Army knife is Map Reduce, so it’s often the case that you have to describe any data manipulation algorithm in the Map Reduce paradigm. Since I’m new to Map Reduce, I’ve been reading up on best practices. One book that I saw recommended on the web was Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer. You can get it free from here. This book is written by two professors from the Natural Language Processing world. While I don’t directly deal with NLP, it interested me because of its machine learning applications.




The book describes what Map Reduce is, mostly from a Hadoop bent. It then talks about a few common patterns for implementing a Map Reduce algorithm, and goes on to describe some detailed implementations of Web Crawling, Breadth First Search and Expectation Maximization. The book stays above any nitty-gritty Map Reduce issues such as syntax and installation, which is good because the algorithms are transferable to other languages like Pig. However, the authors don’t shy away from algorithm design optimizations which are specific to the Hadoop implementation of Map Reduce, which is also good because that’s the implementation everybody working outside of Google uses.


I found the book very helpful as an introduction, and a good length at about 150 pages. The sections on Web Crawling, Graph algorithms and EM spend a lot of time describing background information about the problem space. Of course this background isn’t really full of Map Reduce goodness, so I found some of it tedious. For example, if you haven’t studied storing sparse/dense graphs as lists or matrices, a book on Map Reduce algorithms is probably beyond you. The whole web crawling chapter also dragged on a bit; it’s such a common example for map reduce that I feel its value could have been distilled to a couple of pages. The second-to-last chapter gives a big intro to Hidden Markov Models and Expectation Maximization. With my statistics background, I’ve come across these concepts before, but still found their description of EM confusing. While the background info took up about 40 pages, the actual Map Reduce implementations took only 5. I’ll do my best to clarify this chapter below.


Overall I’d highly recommend this book to anybody serious about writing map reduce algorithms. For those who are new to the whole system, like me, I’d recommend reading the whole book. People with a bit more Map Reduce experience under their belt might just skip ahead to the specific Map Reduce implementations in each chapter to see if there are any tricks they can learn.


Stripes & Pairs

Stripes and Pairs are the two general patterns introduced at the start of the book which the rest of the Map Reduce algorithms make use of. They represent two different ways to operate on data keys which are tuples of the input data. The example problem used in the book is counting pairs of words which occur in the same sentence over a collection of documents. In this case the data keys are pairs of words, and the output value associated with each data key is its count.

Both Stripes and Pairs calculate the same output, but they differ in their intermediate values. The Pairs mapper breaks up the input into each individual tuple it finds, while the Stripes mapper does some preprocessing before emitting its values to the reducer. The Stripes algorithm is faster, while the Pairs algorithm scales to problems that have a large number of unique data key tuples. Each implementation of the word co-occurrence solution is given below.


Map(doc)
  foreach sentence in doc
    foreach word1 in sentence
      foreach word2 in sentence
        EMIT(tuple(word1, word2), 1)

Reduce(tuple(word1, word2), counts[])
  sum = 0
  foreach c in counts
    sum += c
  EMIT(tuple(word1, word2), sum)


Map(doc)
  A = new Map(word, Map(word, count))
  foreach sentence in doc
    foreach word1 in sentence
      if word1 not in A
        A[word1] = new Map(word, count)
      foreach word2 in sentence
        if word2 not in A[word1]
          A[word1][word2] = 0
        A[word1][word2] += 1

  foreach word1 in A
    EMIT(word1, A[word1])

Reduce(word1, Map(word2, count) stripes[])
  S = new Map(word2, count)
  foreach s in stripes
    S += s // element-wise sum of the two maps

  foreach word2 in S
    EMIT(tuple(word1, word2), S[word2])

The Pairs mapper emits each pair of co-occurring words it comes across, ie. tuple(word1, word2). The Stripes mapper emits a “stripe” of data for each word in the document, ie. tuple(word1, *). In this case, a “stripe” is represented by a Map. The Stripes implementation is faster because it performs a lot of summing within the mapper, reducing the amount of information that needs to be transferred between the mapper and reducer. However, the Stripes algorithm runs into a problem if the number of co-occurring words grows too large for the stripe map to fit in memory. If this is the case you’ll have to revert to the Pairs approach.
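As a minimal, runnable sketch of the difference, here is plain Python standing in for the Map Reduce plumbing (the toy document, sentences and helper names are all made up for illustration; a real job would shuffle the mapper output by key before reducing):

```python
from collections import Counter, defaultdict
from itertools import product

# A hypothetical two-sentence "document"; each sentence is a list of words.
doc = [["the", "cat", "sat"], ["the", "dog", "sat"]]

def pairs_map(doc):
    """Pairs mapper: emit ((word1, word2), 1) for every co-occurring pair."""
    for sentence in doc:
        for w1, w2 in product(sentence, repeat=2):
            if w1 != w2:
                yield (w1, w2), 1

def stripes_map(doc):
    """Stripes mapper: pre-aggregate counts per word, then emit one map per word."""
    stripes = defaultdict(Counter)
    for sentence in doc:
        for w1, w2 in product(sentence, repeat=2):
            if w1 != w2:
                stripes[w1][w2] += 1
    yield from stripes.items()

# Simulate shuffle + reduce for the Pairs output: sum the 1s per key.
pair_counts = Counter()
for key, one in pairs_map(doc):
    pair_counts[key] += one

# The Stripes output reduces (element-wise map sum) to the same counts.
stripe_counts = Counter()
for w1, stripe in stripes_map(doc):
    for w2, n in stripe.items():
        stripe_counts[(w1, w2)] += n

assert pair_counts == stripe_counts
print(pair_counts[("the", "sat")])  # -> 2: "the" and "sat" co-occur in both sentences
```

Note how the Stripes mapper emits two stripes here instead of twelve key/value pairs; that in-mapper aggregation is exactly where its speed advantage comes from.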


The future of Machine Learning on Map Reduce


The book was published in 2010 and it’s starting to show its age. It’s a real trip down Map Reduce memory lane. At the time of the book’s writing, Google’s paper on the Pregel system had just been released, and Apache Pig was a big unknown. Pig is a scripting language which describes data flows that get translated into Map Reduce jobs. Fast-forward to the present and Pig is my go-to language for writing new Map Reduce jobs.


It seems there’s been a push back against Map Reduce from the machine learning world since the time of the book’s writing. There are a lot of specialized distributed machine learning frameworks under development that act as alternatives to Map Reduce. In the past month alone I’ve seen two presentations on new frameworks which claim to run machine learning algorithms orders of magnitude faster than Map Reduce. The arguments against Map Reduce were:

  1. The implementation unnecessarily sacrifices too much speed for safety (eg. writing every intermediate value to the file system).
  2. Most machine learning algorithms can’t be turned into efficient Map Reduce algorithms because of complex dependencies between the data.


Carlos Guestrin’s presentation on GraphLab seemed very promising and speedy. There’s also, of course, Vowpal Wabbit, although it only implements a specific machine learning algorithm. We’ll be able to read their ideas for how to perform large scale machine learning in Scaling up Machine Learning.

The Thank You Economy – Gary Vaynerchuk

Thesis: In The Thank You Economy, Gary Vaynerchuk says that in order for companies of any size to grow in the future, they will have to engage their customers at a personal level. His logic is that in the past, commerce was dominated by small mom and pop shops, and interacting with customers at a personal level was the norm. Then in the 1950s, people moved out to the suburbs, which led to increased isolation. People had weaker social networks, so their bond with commerce was weakened, and there was less opportunity to speak to other members of their community and bring up recent customer experiences. This led to faceless large companies replacing small shops. Now, with the advent of social networking, people are more connected to their community (and thus can spread recommendations and negative stories about brands), and it’s easier for people to communicate with their brands again.

Contents: The book contains a lot of anecdotes about how big and small businesses have used social networking to advertise their brand. Gary highlights a lot of good practices and also offers improvements. He also goes on to describe how to combine social networking campaigns with traditional media such as billboards and TV commercials.

Review: I recommend reading this book because there are a lot of great ideas about how to carry out marketing campaigns using social media. Gary really does have some good and original ideas for both large and small companies. The book was fairly short, which is a good thing because there wasn’t much fluff in it.

I did find issue with his main thesis; I feel like it ignores simple economic logic. I think that a certain segment of the population will switch to a brand if it starts replying to their tweets, while a much larger segment of the population will shop at the cheapest, most convenient place (Walmart). Customer service has a cost, which ultimately gets added to the price of the product sold. He didn’t do a good job of explaining why the entire population would decide to pay more in order for companies to tweet with them.

This flaw is highlighted at the start of the book when he laments that full-service gas stations are much less popular than self-serve, citing this as an example of poor customer service. The fact is that both types of gas stations exist, and the full-service stations cost a few cents more. If people were willing to pay a few more cents for better customer service, then all gas stations would be full-service.

Overall, his stories about how small businesses advanced their brand online were a lot more convincing than those about how big companies can be successful. Small businesses are really building a relationship between their sole owner and the customer. People can picture who they’re speaking to. When Dairy Queen posts “What’s your favourite Blizzard flavour?” on Facebook, who are you actually having a conversation with? In the book, Gary speaks about the wildly successful Old Spice campaign, which used both TV commercials and social networking. Gary claims that at the start of the campaign he bought some Old Spice deodorant, but after the campaign ended and he hadn’t received any more tweets from them, he no longer bought Old Spice. This is an odd narrative to me. Some great advertising campaign got him to try a new deodorant, and either he liked it or he didn’t. If social networking relations dictated his deodorant choice, then which deodorant’s tweets caused him to switch?

This book did get me thinking about social relationships between people and large companies, but that will be the topic of another post.

The Quest for The Cure Review

The Quest for The Cure is a very interesting book that gives an overview of past and present pharmaceutical breakthroughs. I don’t have much formal education in chemistry or biology, but I am drawn to medical and biotechnology breakthroughs. Medical issues touch everyone, and in my view this field has the greatest impact on improving the human condition. That being said, this is a one-of-a-kind book that gives just the right amount of detail on the past, present and future of drug development. The book essentially chronicles several major breakthroughs in drug development, beginning with the birth of biochemistry.

I think that anybody who’s inclined to solving problems (eg. engineers like me) will find the writing style very engaging. The typical chapter starts by describing some unsolved biological or chemical issue in drug treatments. The author then explains an innovation that addresses the issue. You learn about the background and personality of the researcher investigating the problem, the hurdles they went through and some of the scientific mechanisms behind how the breakthrough works. One of the most interesting stories is how American scientists studying the effects of mustard gas during World War 2 led to the first chemotherapy drugs. The author gives just the right amount of scientific detail to explain the crux of each breakthrough. With my very modest background I was able to understand most of it, although you’ll probably get more out of the book if you understand a few basics, like what the differences between genes, RNA and proteins are.

Overall I’d highly recommend that anybody interested in medical breakthroughs check out this book. It’s inspiring to read about researchers who overcame great obstacles and persevered for years (or decades) in order to create life-saving drugs. Had I read this in high school, I might just be a chemistry PhD right now.

What is Stochastic Calculus?

One of the cool things I’ve learned about in grad school is stochastic calculus. I’m far from an expert on the subject but I’m going to share the basic idea of a Stochastic Integral.

Quick Review

Riemann Integral Equation

The integral everybody learns about in high school and undergrad is the Riemann integral, where you find the area under a function by partitioning the space into a sequence of rectangles. As you make the widths of the rectangles smaller, the approximation gets better, and in the limit where the rectangle widths go to zero, you get the area of the space under the curve.
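Written out, the Riemann construction is just the limit of those rectangle sums:

```latex
\int_a^b f(x)\,dx \;=\; \lim_{n \to \infty} \sum_{i=1}^{n} f(x_i^*)\,\Delta x,
\qquad \Delta x = \frac{b-a}{n},\quad x_i^* \in [x_{i-1}, x_i]
```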

Stochastic Integral

With a stochastic integral, we still partition the space into rectangles and sum up the area of the rectangles as the widths get smaller. So what makes a stochastic integral different?

Each rectangle width is a random variable!

Just think about how crazy that is: you’re now summing a series of random variables (ie. a stochastic process). In fact, you’ll notice that the integral equation uses dXs instead of just dX like the Riemann integral; dXs is a stochastic process that lasts for time s.
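In the same limit-of-sums notation, the stochastic (Itô) integral replaces the deterministic widths with random increments of the process X, with the integrand evaluated at the left endpoint of each interval:

```latex
\int_0^t f(s)\,dX_s \;=\; \lim_{n \to \infty} \sum_{i=1}^{n} f(s_{i-1})\,\bigl(X_{s_i} - X_{s_{i-1}}\bigr)
```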

What is the output of a stochastic integral?

So a stochastic integral is a sum of widths, which are each random variables. That means the output of a stochastic integral is also a random variable. The characteristics of that random variable depend on the characteristics of the stochastic process you integrated over. The most common process to use is “Brownian Motion”/“Wiener Process” (teehee). This describes the motion of a stock price, essentially saying it’s just as likely to go up or down at any given time. In that case the stochastic integral is a random variable that has a normal distribution, with an expected value of 0.

Why in god’s name would you do this?

The first thing you might be wondering is whether this is all some elaborate homework question from a theoretical math professor. Well, the big fields which use this math are physics and finance. It’s hard to think of how to directly represent some relationship or model with a stochastic integral. However, it’s a lot easier to picture modelling some relationship with a stochastic differential equation. This would just be any value whose rate of change is dependent on the rate of change of some random process. Once you have the differential equation, you use a stochastic integral to solve for the value.

I was introduced to the concept through a finance class, so let’s look at an example from the stock market.

dSt = u * St * dt + sigma * St * dWt

The above equation describes the price of a stock.

St is the price of the stock.

dSt is the change in the stock price.

dt is the change in time.

u is the average expected return of the stock over time.

sigma is the volatility of the stock.

dWt is the change in Brownian Motion (random ups and downs of the stock).

So the equation says that the stock price changes in relation to two terms. The first term is u * St * dt: the long term expected return of the stock, times the current price of the stock, times the length of time. For example, if the expected return of the stock were 3% per year (u = 0.03), the current price of the stock was 100 (St = 100), and we wanted the change of the stock price in one year (dt = 1), then this term would equal 3, ie. the stock price has gone up by 3 in one year. Of course stocks don’t just go up in such a predictable way; they go up and down along the way.

The second term models the random ups and downs of the stock. To be technical, the ups and downs are represented by Brownian Motion, with dWt representing the change in Brownian Motion over the time period.

Now the rate of change of the stock on its own isn’t that useful; we want to know the actual price of the stock at some given time (ie. solve for St). If the dWt term weren’t in there, you could just use basic Ordinary Differential Equation solving techniques. In order to deal with the change in Brownian Motion inside this equation, we’ll need to bring in the big guns: stochastic calculus, and in this case Ito’s Lemma.
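For the record (the derivation is skipped here), applying Ito’s Lemma to ln St turns the stock price equation into the well-known closed form:

```latex
S_t = S_0 \exp\!\Bigl(\bigl(u - \tfrac{\sigma^2}{2}\bigr)t \;+\; \sigma W_t\Bigr)
```

Since Wt is a normally distributed random variable, St is lognormally distributed, which is exactly the “output is a random variable” point below.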

Why can’t we solve this equation to predict the stock market and get rich? Remember what I said earlier: the output of a stochastic integral is a random variable. So solving this equation won’t actually give you the definite price of the stock 1 year from now, but a random variable, along with its mean and variance (assuming real-world stocks actually follow Brownian Motion!)
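To see concretely that “the answer is a random variable”, here’s a small simulation sketch. The parameters are the hypothetical ones from the example above (u = 0.03, St starting at 100); the volatility value and the simple Euler discretization are my own illustrative choices, not anything from a specific textbook or library:

```python
import math
import random

# Hypothetical parameters: start price 100, 3%/year drift, 20% volatility, 1 year.
S0, u, sigma, T = 100.0, 0.03, 0.2, 1.0

def simulate_terminal_price(n_steps=252, rng=random):
    """One Euler-discretized path of dSt = u*St*dt + sigma*St*dWt over [0, T]."""
    dt = T / n_steps
    S = S0
    for _ in range(n_steps):
        dW = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment ~ N(0, dt)
        S += u * S * dt + sigma * S * dW
    return S

random.seed(42)
samples = [simulate_terminal_price() for _ in range(2000)]
mean_price = sum(samples) / len(samples)

# Each run of simulate_terminal_price() gives a different "1 year from now"
# price; only the distribution is pinned down. Its mean should sit near
# S0 * exp(u * T), i.e. about 103, while individual samples scatter widely.
print(round(mean_price, 1))
```

Printing a few individual `samples` values makes the point even more vividly: they range far above and below 103, which is why solving the equation doesn’t hand you tomorrow’s price.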

Where can I learn more?

I learned about this through financial applications, so those are the only resources I know about. Essentially, if you carry on with more complex examples, all roads lead to the infamous Black Scholes Equation. It’s described in a fairly accessible way in the Hull book. The teacher for my financial stochastic calculus course, Prof. Jaimungal at U of T, also has all of his lectures and notes online. The videos are very instructive, probably the best resource for an introduction to this field. Lectures 7 and 8 basically cover an intro to stochastic calculus independently of finance.

Test Driven Development – Kent Beck

A good saying in software development is that “if it’s not tested, it doesn’t work”. Every programmer agrees it’s a good idea to test your code; the question is what’s the best methodology for testing. One very useful tool is unit tests: code that performs a series of specific tests on a module to verify its functionality. By using a common unit testing framework across a project, it’s easy to create new tests, run the entire suite of tests, and check if any tests are failing. Unit tests are great for verifying that new code works, and also that a new code change hasn’t broken any previously working code. Kent Beck was an early advocate of unit testing, and helped create the JUnit testing framework.

Test Driven Development flips the idea of unit testing on its head. It advocates writing unit tests before writing the code that gets tested. This way, the tests are actually used to guide the design process. The book goes through several examples of how to apply this methodology, including using Test Driven Development to create a unit testing framework, which results in a very interesting development cycle. Overall I found the writing and example code quite clear. Kent Beck did well in emphasizing the benefits of TDD:

  • By writing unit tests before the actual code, you ensure that your code will have unit tests!
  • Unit tests help make the final code more modular and usable, since the tests give the developer a taste of how each class will be used before they even write it.
  • You split the programming work up into very tiny increments, and this increases the momentum.
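A minimal sketch of the cycle with Python’s built-in unittest module (the Stack class is a hypothetical example of mine, not one from the book): in true TDD the test class would be written first and run to watch it fail, and only then would the Stack code be filled in to make it pass.

```python
import unittest

# Written second in TDD: just enough code to make the tests below pass.
class Stack:
    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.pop()

    def is_empty(self):
        return not self._items

# Written first in TDD: these tests fail ("red") until Stack exists,
# then pass ("green"), guiding the design of Stack's interface.
class TestStack(unittest.TestCase):
    def test_new_stack_is_empty(self):
        self.assertTrue(Stack().is_empty())

    def test_pop_returns_last_pushed_item(self):
        s = Stack()
        s.push(1)
        s.push(2)
        self.assertEqual(s.pop(), 2)
        self.assertFalse(s.is_empty())

# Run the suite programmatically (equivalent to `python -m unittest`).
runner = unittest.TextTestRunner()
runner.run(unittest.defaultTestLoader.loadTestsFromTestCase(TestStack))
```

Notice how writing `test_pop_returns_last_pushed_item` first forces you to decide what push/pop should look like to a caller before committing to an implementation; that design pressure is the point of the methodology.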

I read Test Driven Development in 2010, well after it was first published and this development methodology had been introduced to the world. It seems like since it was first released, mainstream users of TDD have backed off a bit from literally starting all software development by writing unit tests. For large projects, I think it’s more practical to do some up-front design work. However, I have found that TDD is great for working on smaller-scale projects and modules. It’s also really helpful when you’re dealing with a new language or library, since you learn really quickly what works and what doesn’t. Shortly after reading this book I began work on a new feature at my job which used a lot of graph algorithms. I decided to represent the graphs using the notoriously difficult (but efficient) Boost Graph Library, and TDD made the learning process go a whole lot smoother.

Clean Code – Robert “Uncle Bob” Martin

I found this book to be a great refresher on writing clean code. By clean code, I don’t mean code that has good “style”, such as indentation and comments. The essence of being a good programmer is splitting a problem into clean abstractions. This book advocates writing extremely short functions and classes, each with one specific purpose. While some may argue that increasing the number of functions or classes also increases the complexity of the program, Robert Martin simply views this plethora of functions as exposing the true complexity of the problem. I tend to agree with him.

Code Complete 2

Code Complete 2 is a really useful book on good programming, software engineering and management principles. I read this several years ago and it’s really shaped the way I program. It’s a thick book, but it covers creating software at a lot of different levels. It starts at a really low level with coding style issues, like variable naming, spacing, and other low level details. It progresses to higher level software design issues; abstracting code into classes and functions, design patterns. It then moves on to good practices for managing large software projects, such as scheduling and the programmer/management relationship. Overall, I would say this is the best book that a fresh graduate can read to learn about programming professionally.