I wanted to write up a brief discussion of Dave Thomas's 19th Code Kata. The gist of the problem: given a wordlist, can you determine a path from a source word to an end word, changing only one letter at a time, such that every intermediate step is also a valid word in the dictionary? For instance, the supplied example from 'cat' to 'dog' is: cat, cot, cog, dog. The general idea behind these code katas is that they are well-formed coding exercises designed to give programmers a chance to stretch and develop new skills. I've written more graph algorithms that run in a single-process space than I'd care to admit (here's a simple graph implementation I wrote to learn Google's Go, for instance: Go-lang datastructures), so I wanted to solve this one using a different problem-solving paradigm: I decided to write it as a MapReduce job. I'm familiar with CouchDB and its built-in MapReduce implementation, but I wanted something that works in a fully-distributed mode and doubles as practice with Hadoop. I love all my NoSQL options. The code for this discussion, along with instructions to set it up and run it, is available at codekata19-mapreduce. For the sake of discussion, let's decompose the problem into two subproblems: graph construction and graph traversal.
Problem 1 - Graph construction
Given a list of words, we need to construct a graph in which every vertex is a valid word and every edge represents a valid single-letter transform to the next vertex. To solve this with MapReduce, the wordlist itself becomes the input data set. The only "global" information we need to pass around with the processing job is the dictionary, so that each JVM/Mapper in the system knows what constitutes a valid word (this is done using Hadoop's DistributedCache mechanism). The Map function emits a key/value pair for every valid single-letter transform it finds, so the output is essentially the edge set of our graph. The Reduce function's only job is to eliminate duplicates and to format the data so that this job's output can serve as the input to the next phase. The final results will look something like:
cat cog,|-1|WHITE|
with an intentional trailing pipe. The code for this MapReduce job is available here: CodeKata19.java.
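To make the Map step concrete, here is a minimal single-process sketch of its core logic: generating every dictionary word exactly one letter away from an input word. The class and method names here are my own for illustration; they don't correspond to the actual code in CodeKata19.java.

```java
import java.util.*;

// Sketch of the Map step's core logic: for a given word, find every
// dictionary word reachable by changing exactly one letter. In the real
// job the Mapper would emit each (word, neighbor) pair as an edge.
public class OneLetterNeighbors {

    // Returns all valid one-letter transforms of `word` found in `dictionary`.
    public static Set<String> neighbors(String word, Set<String> dictionary) {
        Set<String> result = new TreeSet<>();
        char[] chars = word.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char original = chars[i];
            for (char c = 'a'; c <= 'z'; c++) {
                if (c == original) continue;      // skip the word itself
                chars[i] = c;
                String candidate = new String(chars);
                if (dictionary.contains(candidate)) {
                    result.add(candidate);        // a valid edge: word -> candidate
                }
            }
            chars[i] = original;                  // restore before the next position
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("cat", "cot", "cog", "dog", "dot"));
        System.out.println(neighbors("cat", dict)); // prints [cot]
    }
}
```

The alphabet loop runs 25 checks per letter position, so for a word of length n the work per word is O(25n) dictionary lookups, which is why shipping the dictionary to every Mapper via the DistributedCache is enough global state for this phase.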
Problem 2 - Graph traversal
Graph traversal in this case is a brute-force breadth-first search of the entire graph. Cailin has done an excellent writeup of parallel, distributed breadth-first search in Breadth-first graph search using iterative map reduce algorithm. The only modification I made to the proposed setup was appending the complete path after the last pipe. The algorithm runs until no GRAY nodes remain in the network, meaning every node reachable from the selected start point has been visited. It's worth noting that this is not the "fastest" implementation — the problem easily fits inside a single-process space — but rather an exercise in distributed algorithms. One key advantage is that when the job completes we have the single-source shortest path to every reachable destination, not just the chosen end word: the final result contains a shortest path for every feasible solution. The code for this MapReduce job is available here: CodeKata19Search.java.
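As a rough illustration of what each search iteration does, here is a single-process sketch of the coloring scheme from Cailin's writeup: every GRAY node marks its WHITE neighbors GRAY with distance + 1, then turns BLACK, and the job repeats until no GRAY nodes remain. The Node class and its fields are illustrative stand-ins for the pipe-delimited record format shown above, not the actual code in CodeKata19Search.java (path tracking is omitted for brevity).

```java
import java.util.*;

// Single-process sketch of one BFS iteration from the MapReduce job:
// each GRAY node expands its neighbors (the "map" step) and the darkest
// color / shortest distance wins per word (the "reduce" step).
public class BfsIteration {
    enum Color { WHITE, GRAY, BLACK }

    static class Node {
        List<String> neighbors;
        int distance;   // -1 means "not yet reached"
        Color color;
        Node(List<String> neighbors, int distance, Color color) {
            this.neighbors = neighbors; this.distance = distance; this.color = color;
        }
    }

    // One map+reduce pass over the whole graph, in memory.
    static void iterate(Map<String, Node> graph) {
        List<String> frontier = new ArrayList<>();
        for (Map.Entry<String, Node> e : graph.entrySet()) {
            if (e.getValue().color == Color.GRAY) frontier.add(e.getKey());
        }
        for (String word : frontier) {
            Node n = graph.get(word);
            for (String next : n.neighbors) {
                Node m = graph.get(next);
                if (m.color == Color.WHITE) {   // first time reached: shortest distance
                    m.color = Color.GRAY;
                    m.distance = n.distance + 1;
                }
            }
            n.color = Color.BLACK;              // fully expanded
        }
    }

    public static void main(String[] args) {
        // The cat/cot/cog/dot/dog graph from the construction phase.
        Map<String, Node> g = new HashMap<>();
        g.put("cat", new Node(List.of("cot"), 0, Color.GRAY)); // source node
        g.put("cot", new Node(List.of("cat", "cog", "dot"), -1, Color.WHITE));
        g.put("cog", new Node(List.of("cot", "dog"), -1, Color.WHITE));
        g.put("dot", new Node(List.of("cot", "dog"), -1, Color.WHITE));
        g.put("dog", new Node(List.of("cog", "dot"), -1, Color.WHITE));
        while (g.values().stream().anyMatch(n -> n.color == Color.GRAY)) {
            iterate(g);
        }
        System.out.println(g.get("dog").distance); // prints 3
    }
}
```

In the real job each `iterate` call is a full MapReduce pass over the record file, and the driver re-launches the job while any GRAY record remains, so every reachable word ends up BLACK with its shortest distance (and, with the extra trailing field, its path) filled in.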