# Nick Nack

Format your Hadoop streaming output like a true villain!
## Why
Ever wanted to split your Hadoop streaming output into multiple directories? How about filtering by multiple keys in a single pass?
Nick Nack makes your Hadoop streaming jobs even better by hooking into the rich support for writing to multiple outputs that Hadoop offers. This library (and further documentation) is tailored for working with mrjob, but can be used with any Hadoop streaming job.
## How
Start slicing your mapper or reducer output with these simple steps.
- Download the latest Nick Nack jar.
- Augment your MRJob class to use your desired formatter:
### mrjob 0.5.3+
- As long as the nicknack jar is in the same directory as your script, it will be discovered and uploaded by mrjob.
See the docs for details.
### Pre 0.5.3
- Configure mrjob to ship the jar to the cluster, either from your job class or via your mrjob.conf config file.
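A sketch of the config-file route, assuming mrjob's `hadoop_extra_args` option can pass `-libjars` through to the streaming command (double-check against your mrjob version's docs; the jar path is hypothetical):

```yaml
# mrjob.conf -- ship the nicknack jar with every hadoop run
runners:
  hadoop:
    hadoop_extra_args:
      - -libjars
      - /path/to/nicknack-1.0.0.jar
```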
That’s it! Your mrjob streaming output will now be processed by a Nick Nack formatter.
## Formatters
To use a Nick Nack formatter you must set both the `OUTPUT_PROTOCOL` and the `HADOOP_OUTPUT_FORMAT` fields in your class definition. All the Nick Nack formatters live in the `nicknack` package, so prefix the formatter name with it. For example, to use the MultipleValueOutputFormat formatter, your class should have:
### MultipleValueOutputFormat
Use Case: You want to split your output into directories specified by a key, and not include the key in the output.
Given mapper or reducer output of the form `key<TAB>value`, each value is written, without its key, to `[outputdir]/<key>/part-00000` — so output keyed by `filename1` lands in `[outputdir]/filename1/part-00000`, and output keyed by `otherfile` in `[outputdir]/otherfile/part-00000`.
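The routing can be sketched in Python (nicknack itself does this in Java inside Hadoop; the function name and sample data here are illustrative):

```python
# Simulate MultipleValueOutputFormat: the key picks the directory,
# and only the value is written to the part file.
def route_value(line):
    key, value = line.split('\t', 1)
    # returns (path under [outputdir], line written to that file)
    return f'{key}/part-00000', value
```

So `route_value('filename1\tsome data')` lands `some data` in `[outputdir]/filename1/part-00000`.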
### MultipleTextOutputFormatByKey
Use Case: You want to split your output into directories specified by a key, but also keep the keys in the output.
- This is nearly the same as MultipleValueOutputFormat, except both the key and value are included in the output.
As above, output keyed by `filename1` lands in `[outputdir]/filename1/part-00000` and output keyed by `otherfile` in `[outputdir]/otherfile/part-00000`, but each written line is the full `key<TAB>value` pair.
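Sketched the same way (illustrative names; nicknack does this in Java):

```python
# Simulate MultipleTextOutputFormatByKey: the key picks the directory,
# and the whole key<TAB>value pair is kept in the output.
def route_by_key(line):
    key, value = line.split('\t', 1)
    return f'{key}/part-00000', f'{key}\t{value}'
```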
### MultipleSimpleOutputFormat
Use Case: You want to split your output into specific directories, where the directory name isn’t part of the output. Your key should be a space-delimited string of the form `dirname key`: we break on the first space to determine the directory name, and the remainder is the key.
A key of the form `dirname1 somekey` sends its record to `[outputdir]/dirname1/part-00000`, written as `somekey<TAB>value`; keys beginning with `dirname2` land in `[outputdir]/dirname2/part-00000`.
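A sketch of the key splitting (illustrative; the real work happens in Java inside Hadoop):

```python
# Simulate MultipleSimpleOutputFormat: break the key on the first space;
# the first piece is the directory, the remainder is the key written out.
def route_simple(line):
    key, value = line.split('\t', 1)
    dirname, out_key = key.split(' ', 1)
    return f'{dirname}/part-00000', f'{out_key}\t{value}'
```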
### MultipleJSONOutputFormat
Use Case: You want to split your output into specific directories, where the directory name isn’t part of the output. Your key should be a JSON array containing exactly two elements: 1) the directory name and 2) the key to write in the output.
- This formatter adds a tiny bit of overhead, as the JSON key must be decoded for each line processed.
A key of `["dirname1", "somekey"]` sends its record to `[outputdir]/dirname1/part-00000`, written as `somekey<TAB>value`; records keyed to `dirname2` land in `[outputdir]/dirname2/part-00000`.
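Sketched in Python (illustrative names and data; nicknack decodes the key in Java):

```python
import json

# Simulate MultipleJSONOutputFormat: the key is a two-element JSON array
# of [directory name, output key]; decoding it adds a little overhead.
def route_json(line):
    key, value = line.split('\t', 1)
    dirname, out_key = json.loads(key)
    return f'{dirname}/part-00000', f'{out_key}\t{value}'
```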
### MultipleCSVOutputFormat
Use Case: You want to split your output into specific directories, where the directory name isn’t part of the output. Your key should be a CSV row containing exactly two elements: 1) the directory name and 2) the key to write in the output.
- This formatter adds a tiny bit of overhead, as the CSV key must be decoded for each line processed.
A key of `dirname1,somekey` sends its record to `[outputdir]/dirname1/part-00000`, written as `somekey<TAB>value`; records keyed to `dirname2` land in `[outputdir]/dirname2/part-00000`.
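Sketched in Python (illustrative; the real decoding happens in Java):

```python
import csv

# Simulate MultipleCSVOutputFormat: the key is a two-field CSV row of
# directory name and output key; decoding it adds a little overhead.
def route_csv(line):
    key, value = line.split('\t', 1)
    dirname, out_key = next(csv.reader([key]))
    return f'{dirname}/part-00000', f'{out_key}\t{value}'
```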
### MultipleLeafValueOutputFormat
Use Case: You want complete control of the output files, including the leaf file name (usually part-00XXX). The job must emit keys that are unique across processes, or files will overwrite one another and you will lose data. For example, you may use this format if you can guarantee each key is unique per reducer.
The key is the full path of the output file relative to `[outputdir]`: a key of `full/filename` writes its value to `[outputdir]/full/filename`, and a key of `otherpath/filename` to `[outputdir]/otherpath/filename`.
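Sketched in Python (illustrative names; nicknack does this in Java inside Hadoop):

```python
# Simulate MultipleLeafValueOutputFormat: the key is the full output path
# (including the leaf file name) and only the value is written. Keys must
# be unique across processes or files will clobber each other.
def route_leaf(line):
    key, value = line.split('\t', 1)
    return key, value
```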
## Credits
The majority of this project is shamelessly copied from oddjob; the original author deserves any and all credit. I couldn’t even come up with an original name.