Piglet is a DSL for writing Pig Latin scripts in Ruby:
a = load 'input' b = a.group :c store b, 'output'
The code above will be translated to the following Pig Latin:
relation_2 = LOAD 'input'; relation_1 = GROUP relation_2 BY c; STORE relation_1 INTO 'output';
Piglet aims to look like Pig Latin while allowing for things like loops and control of flow that are missing from Pig. I started working on Piglet out of frustration that my Pig scripts started to be very repetitive. Pig lacks control of flow and mechanisms to apply the same set of operations on multiple relations. Piglet is my way of adding those features.
If you have Gemcutter.org as a source, just do
gem install piglet
there are no dependencies.
It can be used either as a command line tool for translating a file of Piglet code into Pig Latin, or you can use it inline in a Ruby script:
If piggy.rb
contains
store(load('input')), 'output')
then running
piglet piggy.rb
will output
relation_1 = LOAD 'input'; STORE relation_1 INTO 'output';
require 'piglet' @piglet = Piglet::Interpreter.new @piglet.interpret do store(load('input'), 'output') end puts @piglet.to_pig_latin
or
puts Piglet::Interpreter.new { store(load('input'), 'output') }.to_pig_latin
or
@interpreter = Piglet::Interpreter.new puts @interpreter.to_pig_latin { store(load('input'), 'output') }
will print
relation_1 = LOAD 'input'; STORE relation_1 INTO 'output';
to standard out.
a = load 'input', :schema => [:a, :b, :c] b = a.group :c c = b.foreach { [self[0], self[1].a.max, self[1].b.max] } store c, 'output'
will result in the following Pig Latin:
relation_3 = LOAD 'input' AS (a, b, c); relation_2 = GROUP relation_3 BY c; relation_1 = FOREACH relation_2 GENERATE $0, MAX($1.a), MAX($1.b); STORE relation_1 INTO 'output';
There are two kinds of operators in Piglet: load & store operators, and relation operators. Load & store are called as functions with no receiver, like this:
load('input') store(a, 'out') describe(b) illustrate(c) dump(d) explain(e)
and those are also all the load & store operators. They mirror the Pig Latin operators LOAD
, STORE
, DESCRIBE
, ILLUSTRATE
, DUMP
and EXPLAIN
.
Relation operators are called as methods on relations. Relations are created by the load
operator, and can be stored in regular variables:
a = load('input', :schema => [:x, :y, :z]) b = a.group(:x) store(b, 'out')
Unlinke Pig Latin, operators can be chained:
a = load('input', :schema => [:x, :y, :z]) b = a.sample(3).group(:x) store(b, 'out')
In fact, a whole script can be written without using variables at all:
store(load('input', :schema => [:x, :y, :z]).sample(3).group(:x))
The relation operators are meant to be close to the Pig Latin syntax, but there are obvious limitations and tradeoffs, see the documentation for the Piglet::Relation::Relation mixin for syntax examples.
When loading a relation you can specify the schema by passing the :schema
option to load
. The syntax of the schema specification is not perfect at this time: if you don’t care about types you can pass an array of symbols or strings, like this:
load('input', :schema => %w(a b c d)) load('input', :schema => [:a, :b, :c, :d])
But if you want types, then you need to pass an array of arrays, where the inner arrays contain the field name and the field type:
load('input', :schema => [[:a, :chararray], [:b, :long]])
This is a bit inconvenient. I would like to use a hash, like this: {:a => :chararray, :b => :long}
, but since the order of the keys isn’t guaranteed in Ruby 1.8, it’s not possible. I’m working on something better.
If you need to specify tuples or bags in a schema you can use the special syntax [:field_name, :tuple, [[:a, :int], [:b, :float]]]
, i.e. the field name, the field type (:tuple
or :bag
) and the schema of the tuple or bag. See “Types & schemas” below for more info.
You can also specify a load function by passing the :using
option:
load('input', :using => :pig_storage) load('input', :using => 'MyOwnFunction')
Piglet knows to translate :pig_storage
to PigStorage
, as well as the other pre-defined load and store functions: :binary_serializer
, :binary_deserializer
, :bin_storage
, :pig_dump
and :text_loader
.
store
works similarily to load
, but it takes a relation as its first argument, and the path to the output as second. It too takes the option :using
, with the same values as load
.
dump
, describe
, illustrate
and explain
all take a relation as sole argument. explain
can be called without argument (see the Pig Latin manual for what EXPLAIN
without argument does).
These operators are the most straightforward in Piglet. To do the equivalent of
b = DISTINCT a;
you write
b = a.distinct
in Piglet. More examples:
a.cross(b) # => CROSS a, b a.limit(4) # => LIMIT a 4 a.sample(0.1) # => SAMPLE a 0.1 a.union(b, c) # => UNION a, b, c
you get the pattern.
order
works more or less like the operators above, with some extra features: to specify ascending or descending order you can pass an array with two elements instead of a field name – the first element is the field name, the second :asc
or :desc
:
a.order(:x, [:y, :desc]) # => ORDER a BY x, y DESC
In light of the above group
works exactly as you would expect: a.group(:b)
becomes GROUP a BY b
. You can specify which fields to group by either by passing them as separate arguments, or by passing an array as the first parameter. These statements are equivalent:
a.group(:x, :y) a.group([:x, :y]) a.group(%w(x y))
filter
works a little bit different from the operators discussed above. It takes a block in which you specify the arguments to the operator. The block is interpreted in the context of the relation it’s performed on.
The thing that sets filter
apart from the operators above is it needs to support field expressions. For example the x == 3
in FILTER a BY x == 3
. Piglet supports simple field operators like ==
or %
quite transparently, but more complex expressions can be less elegant, see ”Limitations” below. For example a.filter { x == 3 }
works fine, but a.filter { x != 3 }
doesn’t (it has to do with how Ruby parses expressions, unfortunately). To do not equals you can either do x.ne(3)
or (x == 3).not
. See “Limitations” below for more info on field expressions.
The way field expressions are done in Piglet is that you simply use fields as if they were existing local variables, and then call methods on those to build up an expression. Some Ruby operators can be used, but other operations are only available as methods, again, see “Limitations” below for a complete reference.
a.filter { x == 3 } # => FILTER a BY x == 3 a.filter { (x > 4).or(y < 2) } # => FILTER a BY x > 4 OR r < 2
Be careful about the names of the fields. Ruby’s scoping rules apply, which means that if there’s already a variable defined outside of the block with the name x
Ruby will assume you meant that variable. If you get strange results, try prefixing with self
, e.g. self.x
.
FOREACH … GENERATE
is probably the most complex operator in Pig Latin. Piglet tries its best to support most of it, but there are things that are still missing – see “Limitations”. Most things should work without problems though. The operator in Piglet is called simply foreach
, and just as filter
it takes a block, which is interpreted in the context of the relation foreach
was called on.
In contrast to filter
, foreach
should return an array of field references and expressions. This array describes the schema of the new relation. The expressions used in foreach
are usually not the same as those used in filter
, although all are of course available in both situations. In foreach
common operators to use are the aggregate functions (called “eval functions” in the Pig Latin manual) like MAX
, MIN
, COUNT
, SUM
, etc. In Piglet these are method calls on field objects. Let’s look at an example (I like to use lots of whitespace and newlines for foreach
operations, because otherwise it gets very messy):
a.foreach do [ x.max, y.min, z.count, w + q ] end
this would be translated into:
FOREACH a GENERATE MAX(x), MIN(y), COUNT(z), w + q;
pretty straight forward. What if you want to give the fields of the new relation proper names? In Pig Latin you would write MAX(x) AS (x_max)
, and in Piglet you can write x.max.as(:x_max)
. This is such a common thing to do that I’m thinking of adding some kind of feature that automatically adds AS
clauses where appropriate, but it’s not there yet.
If you want to access fields with $0, $1, etc. you can use self[0]
, self[1]
:
a.foreach { [self[0].as(:x)] } # => FOREACH a GENERATE $0
foreach
is a very complex beast, and this is just an overview, so I’ll just give you a few more examples that are not obvious:
Literal values can be specified using literal
:
a.foreach { [literal('hello').as(:hello)] } # => FOREACH a GENERATE 'hello' AS hello
Binary conditionals, a.k.a. the ternary operator are supported through test
(unfortunately the Ruby ternary operator can’t be overridden):
a.foreach { [test(x == 3, y, z)] } # => FOREACH a GENERATE (x == 3 ? y : z)
The first argument to test
is the test expression, the second is the if-true expression and the third is the if-false expression.
In Pig Latin you can use a different syntax if you have a relation with an inner bag, e.g:
x = FOREACH b { S = FILTER a BY c == 'xyz'; GENERATE COUNT(s.z); }
In Piglet you would write this as
x.nested_foreach { s = a.filter { c == 'xyz' } [s.z.count] }
The syntax of split
shouldn’t be surprising if you’ve read this far, but there’s perhaps some details that aren’t obvious. To split a relation into a number of parts you call split
on the relation and pass a block in which you specify the expressions describing each shard. Just as with filter
and foreach
the block operates in the context of the relation split
is called on. split
returns an array containing the relation shards and you can use parallel assignment to make it look really nice:
b, c = a.split { [x > 2, y == 3] } # => SPLIT a INTO b IF x > 2, c IF y == 3
Thes two operators are the different ways to join relations in Pig Latin. They take the relations to join, and the keys to join them. In Piglet you specify the join expression using a hash: the keys are the relations, and the values are the fields on which to join:
a.join(b => :y, a => :x) # => JOIN b BY x, a BY y a.cogroup(b => :y, a => :x) # => COGROUP b BY x, a BY y
Notice that you have to specify the a
relation twice: you call the method on it, but you also have to pass it as a key to the join description. I’m working on an alternative syntax.
If you’re joining on more than one field, simply pass an array of field names:
a.join(b => [:y, :z], a => [:x, :w]) # => JOIN b BY (y, z), a BY (x, w)
I’m not absolutely sure that it is legal to join or cogroup on more than one field, the Pig Latin manual isn’t entirely clear on this, but Piglet supports it for the time being.
COGROUP
lets you specify INNER
and OUTER
for join fields, and in Piglet you can do this by passing :inner
or :outer
as the last element in the array that is the value in the join description:
a.cogroup(b => [:y, :inner], a => [:z, :outer]) # => COGROUP b BY y INNER, a BY z OUTER
The STREAM
operator is supported through stream
. You can either stream through a command, or through a command reference.
To stream through a command, use this syntax:
a.stream(:command => 'cut -f 3') # => STREAM a THROUGH `cut -f 3`
to define a command and then stream a relation through that command, use this syntax:
define(:reverse, :command => 'reverse.rb') # => DEFINE reverse `reverse.rb` a.stream(:reverse) # => STREAM a THROUGH reverse
You can also use define
to define function references:
define(:hello, :function => 'com.example.Hello') # => DEFINE hello com.example.Hello
When you define a UDF it becomes available as a method in the interpreter scope. This means that you can refer to it by name in, for example, a FOREACH … GENERATE
statement:
define :awesome, :function => 'my.awesome.Function' # => DEFINE awesome my.awesome.Function … b = a.foreach { [awesome(self[0]).as(:something_special)] } # => b = FOREACH a GENERATE awesome($0) AS something_special
If you need to register a JAR you can use register
:
register('path/to/lib.jar') # => REGISTER path/to/lib.jar
Streaming multiple relations is supported, just pass more relations as an array:
a.stream([b, c, d], :reverse) # => STREAM a, b, c THROUGH reverse
And finally, this is how you specify the schema of the resulting relation:
a.stream(:reverse, :schema => [:x, :y]) # => STREAM a THROUGH reverse AS (x, y)
the schema syntax is the same as for load
, and you can read more about it under “Types & schemas” below.
For some operators in Pig Latin you can specify the PARALLEL
keyword to tell Pig how many reducers
For the cogroup
, cross
, distinct
, group
, join
and order
you can pass :parallel => n
as the last parameter to specify the amount of parallelism, e.g. a.group(:x, :y, :z, :parallel => 5)
.
The %declare
and %default
preprocessor macros are available as declare
and default
. Each take two parameters, a name and a value:
declare(:foo, 'bar') # => %declare foo 'bar' default('hello', :world) # => %default hello 'world'
If you want to quote the value with backticks, pass :backticks => true
as the third parameter:
default 'CMD', 'uniq', :backticks => true
Let’s look at a more complex example:
students = load('students.txt', :schema => [%w(student chararray), %w(age int), %w(grade int)]) top_acheivers = students.filter { grade == 5 } name_and_age = top_acheivers.foreach { [student.as(:name), age] } name_by_age = name_and_age.group(:age) count_by_age = name_by_age.foreach { [self[0].as(:age), r[1].name.count.as(:count)]} store(count_by_age, 'student_counts_by_age.txt', :using => :pig_storage)
We load the file students.txt
as a relation with three fields: student
, a string, age
an integer and grade
another integer. Next we filter out the top acheivers with filter
. filter
takes a block and that block gets a referece to the relation (the one filter
was called on), the result of the block will be the filter expression, in this case it’s grade == 5
.
When we have the top acheivers we want to make a projection to remove the grades field, since we will not use it in the next set of calculations. In Pig Latin this is done with FOREACH … GENERATE
, which is just foreach
in Piglet. Like filter
, foreach
takes a block that gets a reference to the relation. The result of the block should be an array of expressions, and in this case it’s [r.student.as(:name), r.age]
, which means the student field from the relation, renamed to “name” and the age field. The resulting relation will have two fields: “name” and “age”.
On the next line we group the relation by the age field, and following that we do another projection, this time on the grouped relation. Remember that when doing a grouping in Pig you get a relation that in this case looks like this: (group:int, name_by_age:{name:chararray, age:int})
. In the block passed to foreach
we use r[0]
and r[1]
to reference the first and second fields of name_by_age
, equivalent to $0
and $1
in Pig Latin. In Pig Latin you could also have used the names group
and name_by_age
, but for a number of reasons you can’t do that in Piglet (r.group
unfortunately refers to the group
method, and the relation isn’t actually called name_by_age
after Piglet has translated the code into Pig Latin).
The expression r[1].name.count.as(:count)
means take the “name” field from the relation in the second field of the relation ($1.name
), run the COUNT
operator on it, and rename it count
, i.e. COUNT($1.name) AS count
.
Finally we store the result in a file called student_counts_by_age.txt
, using PigStorage (which isn’t strictly necessary to specify since it’s the default. If you have a custom method you can pass its name as a string instead of :pig_storage
).
Piglet will translate this into the following Pig Latin:
relation_5 = LOAD 'students.txt' AS (student:chararray, age:int, grade:int); relation_4 = FILTER relation_5 BY grade == 5; relation_3 = FOREACH relation_4 GENERATE student AS name, age; relation_2 = GROUP relation_3 BY age; relation_1 = FOREACH relation_2 GENERATE $0 AS age, COUNT($1.name) AS count; STORE relation_1 INTO 'student_counts_by_age.txt' USING PigStorage;
My goal with Piglet was to add control of flow and reuse mechanisms to Pig, so I’d better show some of the things you can do:
input = load('input', :schema => %w(country browser site visit_duration)) %w(country browser site).each do |dimension| grouped = input.group(dimension).foreach do [self[0], self[1].visit_duration.sum] end store(grouped, "output-#{dimension}") end
We load a file that contains an ID field, three dimensions (country, browser and site) and a metric: the duration of a visit. This could be data from a the logs of a set of websites, or an ad server. What we want to do is to sum the the visit_duration
field for each of the three dimensions. This would be a big tedious in Pig Latin:
input = LOAD 'input' AS (country browser site visit_duration); by_country = GROUP input BY country; by_browser = GROUP input BY browser; by_site = GROUP input BY site; sum_by_country = FOREACH by_country GENERATE $0, SUM($1.visit_duration); sum_by_browser = FOREACH by_browser GENERATE $0, SUM($1.visit_duration); sum_by_site = FOREACH by_site GENERATE $0, SUM($1.visit_duration); STORE sum_by_country INTO 'output-country; STORE sum_by_browser INTO 'output-browser; STORE sum_by_site INTO 'output-site;
But in Piglet it’s as simple as looping over the names of the dimensions. You could even define a method that encapsulates the grouping, summing and storing (although in this case it would be a bit overkill):
def sum_dimension(relation, dimension) grouped = relation.group(dimension).foreach do [self[0], self[1].visit_duration.sum] end store(grouped, "output-#{dimension}") end input = load('input', :schema => %w(country browser site visit_duration)) %w(country browser site).each do |dimension| sum_dimension(input, dimension) end
You can even define your own relation operations if you want, just add them to Piglet::Relation::Relation:
module Piglet::Relation::Relation # Returns a list of sampled relations for each given sample size def samples(*sizes) sizes.map { |s| sample(s) } end end
and then use them just as any other operator:
small, medium, large = input.samples(0.01, 0.1, 0.5)
or what about an operator that returns the top n items by some field:
module Piglet::Relation::Relation # Returns the top n tuples from a relation, ordered by field def top(n, field) order([field, :desc]).limit(n) end end which can be used as
input.top(10, :score)
nifty, huh?
Piglet knows the schema of relations, so you can do something else that Pig lacks: introspection. This lets you do things like like this code, which counts the unique values of all fields in a relation:
relation = load('in', :schema => [:a, :b, :c]) relation.schema.field_names.each do |field| grouped = relation.group(field) counted = grouped.foreach { [self[1].count] } store(counted, "out-#{field}") end
This feature obviously only works if you have specified a schema in the call to #load.
There are currently many limitations to this feature, so use it with caution. Since the schema support isn’t completely reliable Piglet does not enforce schemas, and it does not warn you if you try to access a field that doesn’t exist. If it had, it would probably be more annoying and limiting than it would be worth.
The aim is to support most of Pig Latin, but currently there are some limitations.
The following Pig operators are supported:
-
COGROUP
-
CROSS
-
DEFINE
-
DESCRIBE
-
DISTINCT
-
DUMP
-
EXPLAIN
-
FILTER
-
FOREACH … GENERATE
(includingFOREACH { … GENERATE }
) -
GROUP
-
ILLUSTRATE
-
JOIN
-
LIMIT
-
LOAD
-
ORDER
-
REGISTER
-
SAMPLE
-
SPLIT
-
STORE
-
STREAM
-
UNION
The file commands (cd
, cat
, etc.) will probably not be supported for the forseeable future.
All the aggregate functions except one are supported:
-
AVG
-
CONCAT
-
COUNT
-
IsEmpty
-
MAX
-
MIN
-
SIZE
-
SUM
-
TOKENIZE
-
FLATTEN
DIFF
is not supported yet.
Piglet only supports most arithmetic and logic operators (see below) on fields – but check the output and make sure that it’s doing what you expect because some it’s tricky to see where Piglet hijacks the operators and when it’s Ruby that is running the show. I’m doing the best I can, but there are many things that can’t be done, at least not in Ruby 1.8.
Piglet does support these field operators:
-
==
(equality) -
>
(greater than) -
<
(less than) -
>=
(greater or equal to) -
<=
(less than or equal to) -
%
(modulo) -
+
(addition) -
-
(subtraction) -
*
(multiplication) -
/
(division)
It also has these operators, see below for explanations:
-
#not
(logical negation) -
#neg
(numerical negation) -
#ne
(not equals) -
#test
(binary conditionals)
Piglet does not support:
-
!=
(not equals, you have to use==
and aNOT
, e.g.(a == b).not
, which will be translated asNOT (a == b)
or you can use#ne
, which will translate to !=, e.g.a.ne(b)
will becomea != b
. May be supported in the future, but only in Ruby 1.9) -
? :
(the ternary operator) -
-
(negation, but you can use#neg
on a field expression to get the same result, e.g.a.neg
will be translated as-a
. May be supported in the future, but only in Ruby 1.9) -
key#value
(map dereferencing, may be supported in the future)
When you run piglet
on a Piglet script the aliases in the output will be relation_1
, relation_2
, relation_3
, and so on, instead of the names of the variables of the Piglet script – like in the example at the top of this document.
The names a
and b
are lost in translation, this is unfortunate but hard to avoid. Firstly there is no way to discover the names of variables, and secondly there is no correspondence between a statement in a Piglet script and a statement in Pig Latin, a.union(b, c).sample(3).group(:x)
is at least three statements in Pig Latin. It simply wouldn’t be worth the extra complexity of trying to infer some variable names and reuse them as aliases in the Pig Latin output.
In the future I may add a way of manually suggesting relation aliases, so that the Pig Latin output is more readable.
You may also wonder why the relation aliases aren’t in consecutive order. The reason is that they get their names in the order they are evaluated, and the interpreter walks the relation ancestry upwards from a store
(and it only evaluates a relation once).
I’m working on it.
If you try this Piglet code:
a = load 'input' b = a.group :c
You might be surprised that Piglet will not output anything. In fact, Piglet only creates Pig Latin operations on relations that will somehow be outputed. Unless there is a store
, dump
, describe
, illustrate
or explain
that outputs a relation, the operations applied to that relation and its ancestors will not be included.
When you call group
, filter
or any of the other methods that can be applied to a relation a datastructure that encodes these operations is created. When a relation is passed to store
or one of the other output operators the Piglet interpreter traverses the datastructure backwards, building the Pig Latin operations needed to arrive at the relation that should be passed to the output operator. This is similar to how Pig itself interprets a Pig Latin script.
As a side effect of using store
and the other output operators as the trigger for creating the needed relational operations any relations that are not ancestors of relations that are outputed will not be included in the Pig Latin output. On the other hand, they would be no-ops when run by Pig anyway.
Please contact me and give me the Piglet code and what you think the output should be. I’ll try to either fix your Piglet code, or fix Piglet to do what you expect it to do.
-
Theo Hultberg
-
Ning Liang
© 2009-2010 Theo Hultberg / Iconara and contributors. See LICENSE for details.