forked from iconara/piglet

Piglet is a DSL for writing Pig scripts in Ruby


scouredimage/piglet

 
 


Piglet

Piglet is a DSL for writing Pig Latin scripts in Ruby:

a = load 'input'
b = a.group :c
store b, 'output'

The code above will be translated to the following Pig Latin:

relation_2 = LOAD 'input';
relation_1 = GROUP relation_2 BY c;
STORE relation_1 INTO 'output';

Piglet aims to look like Pig Latin while allowing for things like loops and control flow that are missing from Pig. I started working on Piglet out of frustration that my Pig scripts were becoming very repetitive. Pig lacks control flow and mechanisms for applying the same set of operations to multiple relations. Piglet is my way of adding those features.

Installation

If you have Gemcutter.org as a source, just do

gem install piglet

There are no dependencies.

Usage

Piglet can be used either as a command-line tool for translating a file of Piglet code into Pig Latin, or inline in a Ruby script.

Command line usage

If piggy.rb contains

store(load('input'), 'output')

then running

piglet piggy.rb

will output

relation_1 = LOAD 'input';
STORE relation_1 INTO 'output';

Programmatic interface

require 'piglet'

@piglet = Piglet::Interpreter.new
@piglet.interpret do
  store(load('input'), 'output')
end
puts @piglet.to_pig_latin

or

puts Piglet::Interpreter.new { store(load('input'), 'output') }.to_pig_latin

or

@interpreter = Piglet::Interpreter.new
puts @interpreter.to_pig_latin { store(load('input'), 'output') }

will print

relation_1 = LOAD 'input';
STORE relation_1 INTO 'output';

to standard out.

Examples of what it can do

a = load 'input', :schema => [:a, :b, :c]
b = a.group :c
c = b.foreach { [self[0], self[1].a.max, self[1].b.max] }
store c, 'output'

will result in the following Pig Latin:

relation_3 = LOAD 'input' AS (a, b, c);
relation_2 = GROUP relation_3 BY c;
relation_1 = FOREACH relation_2 GENERATE $0, MAX($1.a), MAX($1.b);
STORE relation_1 INTO 'output';

Syntax

There are two kinds of operators in Piglet: load & store operators, and relation operators. Load & store are called as functions with no receiver, like this:

load('input')
store(a, 'out')
describe(b)
illustrate(c)
dump(d)
explain(e)

and those are also all the load & store operators. They mirror the Pig Latin operators LOAD, STORE, DESCRIBE, ILLUSTRATE, DUMP and EXPLAIN.

Relation operators are called as methods on relations. Relations are created by the load operator, and can be stored in regular variables:

a = load('input', :schema => [:x, :y, :z])
b = a.group(:x)
store(b, 'out')

Unlike Pig Latin, operators can be chained:

a = load('input', :schema => [:x, :y, :z])
b = a.sample(3).group(:x)
store(b, 'out')

In fact, a whole script can be written without using variables at all:

store(load('input', :schema => [:x, :y, :z]).sample(3).group(:x), 'out')

The relation operators are meant to be close to the Pig Latin syntax, but there are obvious limitations and tradeoffs. See the documentation for the Piglet::Relation::Relation mixin for syntax examples.

load

When loading a relation you can specify the schema by passing the :schema option to load. The syntax of the schema specification is not perfect at this time: if you don’t care about types you can pass an array of symbols or strings, like this:

load('input', :schema => %w(a b c d))
load('input', :schema => [:a, :b, :c, :d])

But if you want types, then you need to pass an array of arrays, where the inner arrays contain the field name and the field type:

load('input', :schema => [[:a, :chararray], [:b, :long]])

This is a bit inconvenient. I would like to use a hash, like this: {:a => :chararray, :b => :long}, but since the order of the keys isn’t guaranteed in Ruby 1.8, it’s not possible. I’m working on something better.

If you need to specify tuples or bags in a schema you can use the special syntax [:field_name, :tuple, [[:a, :int], [:b, :float]]], i.e. the field name, the field type (:tuple or :bag) and the schema of the tuple or bag. See “Types & schemas” below for more info.
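To make the two flat schema forms concrete, here is a minimal sketch of how such a specification could be rendered as a Pig Latin AS clause. The helper schema_to_pig is hypothetical and not part of Piglet's API, and it deliberately ignores nested tuples and bags.

```ruby
# Hypothetical helper (not part of Piglet's API): renders a flat schema
# specification as the body of a Pig Latin AS clause.
def schema_to_pig(schema)
  fields = schema.map do |field|
    # Array() wraps a bare name and leaves a [name, type] pair untouched.
    name, type = Array(field)
    type ? "#{name}:#{type}" : name.to_s
  end
  "(#{fields.join(', ')})"
end

puts schema_to_pig(%w(a b c d))                      # prints (a, b, c, d)
puts schema_to_pig([[:a, :chararray], [:b, :long]])  # prints (a:chararray, b:long)
```

Because Hash#map yields [key, value] pairs, the same sketch also accepts {:a => :chararray, :b => :long} on Ruby 1.9 and later, where hashes preserve insertion order.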

You can also specify a load function by passing the :using option:

load('input', :using => :pig_storage)
load('input', :using => 'MyOwnFunction')

Piglet knows to translate :pig_storage to PigStorage, as well as the other pre-defined load and store functions: :binary_serializer, :binary_deserializer, :bin_storage, :pig_dump and :text_loader.
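The translation from a symbol such as :pig_storage to PigStorage can be pictured as a simple camel-casing step. This is only a sketch of the idea, assuming the mapping is purely mechanical; function_name is a hypothetical helper, and Piglet may well use a lookup table instead.

```ruby
# Hypothetical helper: turns :pig_storage into "PigStorage" and leaves
# strings (custom function names) untouched.
def function_name(using)
  return using unless using.is_a?(Symbol)
  using.to_s.split('_').map(&:capitalize).join
end

puts function_name(:pig_storage)    # prints PigStorage
puts function_name(:text_loader)    # prints TextLoader
puts function_name('MyOwnFunction') # prints MyOwnFunction
```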

store, dump, describe, etc.

store works similarly to load, but it takes a relation as its first argument, and the path to the output as its second. It too takes the option :using, with the same values as load.

dump, describe, illustrate and explain all take a relation as sole argument. explain can be called without argument (see the Pig Latin manual for what EXPLAIN without argument does).

cross, distinct, limit, sample, union

These operators are the most straightforward in Piglet. To do the equivalent of

b = DISTINCT a;

you write

b = a.distinct

in Piglet. More examples:

a.cross(b) # => CROSS a, b
a.limit(4) # => LIMIT a 4
a.sample(0.1) # => SAMPLE a 0.1
a.union(b, c) # => UNION a, b, c

You get the pattern.

order

order works more or less like the operators above, with some extra features: to specify ascending or descending order you can pass an array with two elements instead of a field name – the first element is the field name, the second :asc or :desc:

a.order(:x, [:y, :desc]) # => ORDER a BY x, y DESC
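As an illustration of how the mixed argument list maps onto the ORDER clause, here is a small standalone sketch. order_clause is a hypothetical function written for this example, not Piglet's implementation.

```ruby
# Hypothetical sketch: each argument is either a field name or a
# [field, direction] pair.
def order_clause(alias_name, *fields)
  specs = fields.map do |field|
    name, direction = Array(field)
    direction ? "#{name} #{direction.to_s.upcase}" : name.to_s
  end
  "ORDER #{alias_name} BY #{specs.join(', ')}"
end

puts order_clause('a', :x, [:y, :desc]) # prints ORDER a BY x, y DESC
```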

group

In light of the above group works exactly as you would expect: a.group(:b) becomes GROUP a BY b. You can specify which fields to group by either by passing them as separate arguments, or by passing an array as the first parameter. These statements are equivalent:

a.group(:x, :y)
a.group([:x, :y])
a.group(%w(x y))

filter

filter works a little differently from the operators discussed above. It takes a block in which you specify the arguments to the operator. The block is interpreted in the context of the relation it’s performed on.

The thing that sets filter apart from the operators above is that it needs to support field expressions, for example the x == 3 in FILTER a BY x == 3. Piglet supports simple field operators like == or % quite transparently, but more complex expressions can be less elegant. For example a.filter { x == 3 } works fine, but a.filter { x != 3 } doesn’t (it has to do with how Ruby parses expressions, unfortunately). To do not equals you can either do x.ne(3) or (x == 3).not. See “Limitations” below for more info on field expressions.

The way field expressions are done in Piglet is that you simply use fields as if they were existing local variables, and then call methods on those to build up an expression. Some Ruby operators can be used, but other operations are only available as methods, again, see “Limitations” below for a complete reference.

a.filter { x == 3 }            # => FILTER a BY x == 3
a.filter { (x > 4).or(y < 2) } # => FILTER a BY x > 4 OR y < 2

Be careful about the names of the fields. Ruby’s scoping rules apply, which means that if there’s already a variable defined outside of the block with the name x Ruby will assume you meant that variable. If you get strange results, try prefixing with self, e.g. self.x.
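The mechanism described above, and the scoping pitfall, can be reproduced in a few lines of plain Ruby. This is a sketch of the general technique (instance_eval plus method_missing plus operator overloading), not Piglet's actual code; Expr, FieldScope and filter_clause are names made up for the example.

```ruby
# A field expression that accumulates Pig Latin text.
class Expr
  def initialize(text)
    @text = text
  end

  def to_s
    @text
  end

  # Overload Ruby operators so that they build text instead of computing.
  %w(== > < >= <= % + - * /).each do |op|
    define_method(op) { |other| Expr.new("#{self} #{op} #{other}") }
  end

  def ne(other)
    Expr.new("#{self} != #{other}")
  end

  def or(other)
    Expr.new("#{self} OR #{other}")
  end
end

# Stand-in for the relation: unknown zero-argument calls become fields.
class FieldScope
  def method_missing(name, *args)
    return super if !args.empty? || name.to_s.start_with?('to_')
    Expr.new(name.to_s)
  end

  def respond_to_missing?(name, include_private = false)
    !name.to_s.start_with?('to_') || super
  end
end

def filter_clause(alias_name, &block)
  "FILTER #{alias_name} BY #{FieldScope.new.instance_eval(&block)}"
end

puts(filter_clause('a') { (x > 4).or(y < 2) }) # prints FILTER a BY x > 4 OR y < 2

x = 10 # a local variable that shadows the field name
puts(filter_clause('a') { x == 3 })      # prints FILTER a BY false
puts(filter_clause('a') { self.x == 3 }) # prints FILTER a BY x == 3
```

The second call shows the problem: Ruby resolves x to the local variable, so the block evaluates 10 == 3 before the DSL ever sees it.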

foreach

FOREACH … GENERATE is probably the most complex operator in Pig Latin. Piglet tries its best to support most of it, but there are things that are still missing – see “Limitations”. Most things should work without problems though. The operator in Piglet is called simply foreach, and just as filter it takes a block, which is interpreted in the context of the relation foreach was called on.

In contrast to filter, foreach should return an array of field references and expressions. This array describes the schema of the new relation. The expressions used in foreach are usually not the same as those used in filter, although all are of course available in both situations. In foreach common operators to use are the aggregate functions (called “eval functions” in the Pig Latin manual) like MAX, MIN, COUNT, SUM, etc. In Piglet these are method calls on field objects. Let’s look at an example (I like to use lots of whitespace and newlines for foreach operations, because otherwise it gets very messy):

a.foreach do
  [
    x.max,
    y.min,
    z.count,
    w + q
  ]
end

this would be translated into:

FOREACH a GENERATE
  MAX(x),
  MIN(y),
  COUNT(z),
  w + q;

Pretty straightforward. What if you want to give the fields of the new relation proper names? In Pig Latin you would write MAX(x) AS (x_max), and in Piglet you can write x.max.as(:x_max). This is such a common thing to do that I’m thinking of adding some kind of feature that automatically adds AS clauses where appropriate, but it’s not there yet.

If you want to access fields with $0, $1, etc. you can use self[0], self[1]:

a.foreach { [self[0].as(:x)] } # => FOREACH a GENERATE $0 AS x
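To show how method calls on field objects can turn into aggregate functions, positional references and AS clauses, here is a standalone sketch. Field, RelationScope and foreach_clause are hypothetical names for this illustration and do not mirror Piglet's internals.

```ruby
class Field
  def initialize(text)
    @text = text
  end

  def to_s
    @text
  end

  # Aggregate functions wrap the field: x.max becomes MAX(x).
  %w(max min count sum avg).each do |fn|
    define_method(fn) { Field.new("#{fn.upcase}(#{self})") }
  end

  def as(name)
    Field.new("#{self} AS #{name}")
  end

  # Any other zero-argument call dereferences a nested field: $1.a
  def method_missing(name, *args)
    return super if !args.empty? || name.to_s.start_with?('to_')
    Field.new("#{self}.#{name}")
  end

  def respond_to_missing?(name, include_private = false)
    !name.to_s.start_with?('to_') || super
  end
end

class RelationScope
  # self[0] becomes $0
  def [](index)
    Field.new("$#{index}")
  end

  def method_missing(name, *args)
    return super if !args.empty? || name.to_s.start_with?('to_')
    Field.new(name.to_s)
  end

  def respond_to_missing?(name, include_private = false)
    !name.to_s.start_with?('to_') || super
  end
end

def foreach_clause(alias_name, &block)
  expressions = RelationScope.new.instance_eval(&block)
  "FOREACH #{alias_name} GENERATE #{expressions.map(&:to_s).join(', ')}"
end

puts(foreach_clause('a') { [self[0], self[1].a.max, x.max.as(:x_max)] })
# prints FOREACH a GENERATE $0, MAX($1.a), MAX(x) AS x_max
```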

foreach is a very complex beast, and this is just an overview, so I’ll just give you a few more examples that are not obvious:

Literal values can be specified using literal:

a.foreach { [literal('hello').as(:hello)] } # => FOREACH a GENERATE 'hello' AS hello

Binary conditionals, a.k.a. the ternary operator are supported through test (unfortunately the Ruby ternary operator can’t be overridden):

a.foreach { [test(x == 3, y, z)] } # => FOREACH a GENERATE (x == 3 ? y : z)

The first argument to test is the test expression, the second is the if-true expression and the third is the if-false expression.

nested_foreach

In Pig Latin you can use a different syntax if you have a relation with an inner bag, e.g:

x = FOREACH b {
  s = FILTER a BY c == 'xyz';
  GENERATE COUNT(s.z);
}

In Piglet you would write this as

x = b.nested_foreach {
  s = a.filter { c == 'xyz' }
  [s.z.count]
}

split

The syntax of split shouldn’t be surprising if you’ve read this far, but there are perhaps some details that aren’t obvious. To split a relation into a number of parts you call split on the relation and pass a block in which you specify the expressions describing each shard. Just as with filter and foreach the block operates in the context of the relation split is called on. split returns an array containing the relation shards and you can use parallel assignment to make it look really nice:

b, c = a.split { [x > 2, y == 3] } # => SPLIT a INTO b IF x > 2, c IF y == 3

cogroup & join

These two operators are the different ways to join relations in Pig Latin. They take the relations to join, and the keys to join them on. In Piglet you specify the join expression using a hash: the keys are the relations, and the values are the fields on which to join:

a.join(b => :y, a => :x)    # => JOIN b BY y, a BY x
a.cogroup(b => :y, a => :x) # => COGROUP b BY y, a BY x

Notice that you have to specify the a relation twice: you call the method on it, but you also have to pass it as a key to the join description. I’m working on an alternative syntax.

If you’re joining on more than one field, simply pass an array of field names:

a.join(b => [:y, :z], a => [:x, :w]) # => JOIN b BY (y, z), a BY (x, w)

I’m not absolutely sure that it is legal to join or cogroup on more than one field; the Pig Latin manual isn’t entirely clear on this, but Piglet supports it for the time being.

COGROUP lets you specify INNER and OUTER for join fields, and in Piglet you can do this by passing :inner or :outer as the last element in the array that is the value in the join description:

a.cogroup(b => [:y, :inner], a => [:z, :outer]) # => COGROUP b BY y INNER, a BY z OUTER
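The join description is just a hash, so its translation can be sketched in a few lines. The Rel struct and join_clause function below are hypothetical stand-ins for illustration; they assume a Ruby version where hashes keep insertion order (1.9 or later).

```ruby
# Hypothetical stand-in for a relation that knows its alias.
Rel = Struct.new(:alias_name)

def join_clause(keyword, description)
  parts = description.map do |relation, keys|
    keys = Array(keys)
    # A trailing :inner or :outer is a modifier, not a field name.
    modifier = keys.last if [:inner, :outer].include?(keys.last)
    fields = modifier ? keys[0..-2] : keys
    by = fields.size > 1 ? "(#{fields.join(', ')})" : fields.first.to_s
    ["#{relation.alias_name} BY #{by}", modifier && modifier.to_s.upcase].compact.join(' ')
  end
  "#{keyword} #{parts.join(', ')}"
end

a = Rel.new('a')
b = Rel.new('b')
puts join_clause('JOIN', b => :y, a => :x)
# prints JOIN b BY y, a BY x
puts join_clause('COGROUP', b => [:y, :inner], a => [:z, :outer])
# prints COGROUP b BY y INNER, a BY z OUTER
```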

stream, define & register

The STREAM operator is supported through stream. You can either stream through a command, or through a command reference.

To stream through a command, use this syntax:

a.stream(:command => 'cut -f 3') # => STREAM a THROUGH `cut -f 3`

To define a command and then stream a relation through that command, use this syntax:

define(:reverse, :command => 'reverse.rb') # => DEFINE reverse `reverse.rb`
a.stream(:reverse)                         # => STREAM a THROUGH reverse

You can also use define to define function references:

define(:hello, :function => 'com.example.Hello') # => DEFINE hello com.example.Hello

When you define a UDF it becomes available as a method in the interpreter scope. This means that you can refer to it by name in, for example, a FOREACH … GENERATE statement:

define :awesome, :function => 'my.awesome.Function' # => DEFINE awesome my.awesome.Function

b = a.foreach { [awesome(self[0]).as(:something_special)] } # => b = FOREACH a GENERATE awesome($0) AS something_special

If you need to register a JAR you can use register:

register('path/to/lib.jar') # => REGISTER path/to/lib.jar

Streaming multiple relations is supported, just pass more relations as an array:

a.stream([b, c, d], :reverse) # => STREAM a, b, c, d THROUGH reverse

And finally, this is how you specify the schema of the resulting relation:

a.stream(:reverse, :schema => [:x, :y]) # => STREAM a THROUGH reverse AS (x, y)

The schema syntax is the same as for load, and you can read more about it under “Types & schemas” below.

:parallel

For some operators in Pig Latin you can specify the PARALLEL keyword to tell Pig how many reducers to use.

For cogroup, cross, distinct, group, join and order you can pass :parallel => n as the last parameter to specify the amount of parallelism, e.g. a.group(:x, :y, :z, :parallel => 5).
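A trailing options hash like this is a common Ruby idiom, and the sketch below shows one way it could be separated from the field names. group_clause is a hypothetical function for illustration; the parenthesized form for several fields follows Pig Latin's GROUP … BY (x, y) syntax.

```ruby
# Hypothetical sketch: field names may be given separately or as one
# array, optionally followed by an options hash.
def group_clause(alias_name, *args)
  options = args.last.is_a?(Hash) ? args.pop : {}
  fields = args.flatten
  by = fields.size > 1 ? "(#{fields.join(', ')})" : fields.first.to_s
  clause = "GROUP #{alias_name} BY #{by}"
  clause += " PARALLEL #{options[:parallel]}" if options[:parallel]
  clause
end

puts group_clause('a', :b)                           # prints GROUP a BY b
puts group_clause('a', :x, :y, :z, :parallel => 5)   # prints GROUP a BY (x, y, z) PARALLEL 5
```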

%declare & %default

The %declare and %default preprocessor macros are available as declare and default. Each takes two parameters, a name and a value:

declare(:foo, 'bar')     # => %declare foo 'bar'
default('hello', :world) # => %default hello 'world'

If you want to quote the value with backticks, pass :backticks => true as the third parameter:

default 'CMD', 'uniq', :backticks => true
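As a rough picture of what these calls produce, here is a sketch that renders the directives as text. The preprocessor function is hypothetical, and the backtick output is an assumption based on Pig's parameter substitution syntax rather than on Piglet's documented output.

```ruby
# Hypothetical sketch: renders a %declare or %default line.
def preprocessor(directive, name, value, options = {})
  quote = options[:backticks] ? '`' : "'"
  "%#{directive} #{name} #{quote}#{value}#{quote}"
end

puts preprocessor(:declare, :foo, 'bar')                       # prints %declare foo 'bar'
puts preprocessor(:default, 'hello', :world)                   # prints %default hello 'world'
puts preprocessor(:default, 'CMD', 'uniq', :backticks => true) # prints %default CMD `uniq`
```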

Putting it all together

Let’s look at a more complex example:

students = load('students.txt', :schema => [%w(student chararray), %w(age int), %w(grade int)])
top_achievers = students.filter { grade == 5 }
name_and_age = top_achievers.foreach { [student.as(:name), age] }
name_by_age = name_and_age.group(:age)
count_by_age = name_by_age.foreach { [self[0].as(:age), self[1].name.count.as(:count)] }
store(count_by_age, 'student_counts_by_age.txt', :using => :pig_storage)

We load the file students.txt as a relation with three fields: student (a string), age (an integer) and grade (another integer). Next we select the top achievers with filter. filter takes a block that is evaluated in the context of the relation (the one filter was called on); the result of the block will be the filter expression, in this case grade == 5.

When we have the top achievers we want to make a projection to remove the grade field, since we will not use it in the next set of calculations. In Pig Latin this is done with FOREACH … GENERATE, which is just foreach in Piglet. Like filter, foreach takes a block that is evaluated in the context of the relation. The result of the block should be an array of expressions, and in this case it’s [student.as(:name), age], which means the student field from the relation, renamed to “name”, and the age field. The resulting relation will have two fields: “name” and “age”.

On the next line we group the relation by the age field, and following that we do another projection, this time on the grouped relation. Remember that when doing a grouping in Pig you get a relation that in this case looks like this: (group:int, name_by_age:{name:chararray, age:int}). In the block passed to foreach we use self[0] and self[1] to reference the first and second fields of name_by_age, equivalent to $0 and $1 in Pig Latin. In Pig Latin you could also have used the names group and name_by_age, but for a number of reasons you can’t do that in Piglet (group unfortunately refers to the group method, and the relation isn’t actually called name_by_age after Piglet has translated the code into Pig Latin).

The expression self[1].name.count.as(:count) means take the “name” field from the relation in the second field of the relation ($1.name), run the COUNT operator on it, and rename it count, i.e. COUNT($1.name) AS count.

Finally we store the result in a file called student_counts_by_age.txt, using PigStorage. Specifying it isn’t strictly necessary since it’s the default; if you have a custom function you can pass its name as a string instead of :pig_storage.

Piglet will translate this into the following Pig Latin:

relation_5 = LOAD 'students.txt' AS (student:chararray, age:int, grade:int);
relation_4 = FILTER relation_5 BY grade == 5;
relation_3 = FOREACH relation_4 GENERATE student AS name, age;
relation_2 = GROUP relation_3 BY age;
relation_1 = FOREACH relation_2 GENERATE $0 AS age, COUNT($1.name) AS count;
STORE relation_1 INTO 'student_counts_by_age.txt' USING PigStorage;

Going beyond Pig Latin

My goal with Piglet was to add control of flow and reuse mechanisms to Pig, so I’d better show some of the things you can do:

input = load('input', :schema => %w(country browser site visit_duration))
%w(country browser site).each do |dimension|
  grouped = input.group(dimension).foreach do
    [self[0], self[1].visit_duration.sum]
  end
  store(grouped, "output-#{dimension}")
end

We load a file that contains three dimensions (country, browser and site) and a metric: the duration of a visit. This could be data from the logs of a set of websites, or an ad server. What we want to do is to sum the visit_duration field for each of the three dimensions. This would be a bit tedious in Pig Latin:

input = LOAD 'input' AS (country, browser, site, visit_duration);
by_country = GROUP input BY country;
by_browser = GROUP input BY browser;
by_site = GROUP input BY site;
sum_by_country = FOREACH by_country GENERATE $0, SUM($1.visit_duration);
sum_by_browser = FOREACH by_browser GENERATE $0, SUM($1.visit_duration);
sum_by_site = FOREACH by_site GENERATE $0, SUM($1.visit_duration);
STORE sum_by_country INTO 'output-country';
STORE sum_by_browser INTO 'output-browser';
STORE sum_by_site INTO 'output-site';

But in Piglet it’s as simple as looping over the names of the dimensions. You could even define a method that encapsulates the grouping, summing and storing (although in this case it would be a bit overkill):

def sum_dimension(relation, dimension)
  grouped = relation.group(dimension).foreach do
    [self[0], self[1].visit_duration.sum]
  end
  store(grouped, "output-#{dimension}")
end

input = load('input', :schema => %w(country browser site visit_duration))
%w(country browser site).each do |dimension|
  sum_dimension(input, dimension)
end

You can even define your own relation operations if you want, just add them to Piglet::Relation::Relation:

module Piglet::Relation::Relation
  # Returns a list of sampled relations for each given sample size
  def samples(*sizes)
    sizes.map { |s| sample(s) }
  end
end

and then use them just as any other operator:

small, medium, large = input.samples(0.01, 0.1, 0.5)

or what about an operator that returns the top n items by some field:

module Piglet::Relation::Relation
  # Returns the top n tuples from a relation, ordered by field
  def top(n, field)
    order([field, :desc]).limit(n)
  end
end

which can be used as

input.top(10, :score)

nifty, huh?

Types & schemas

Piglet knows the schema of relations, so you can do something else that Pig lacks: introspection. This lets you do things like this code, which counts the unique values of all fields in a relation:

relation = load('in', :schema => [:a, :b, :c])
relation.schema.field_names.each do |field|
  grouped = relation.group(field)
  counted = grouped.foreach { [self[1].count] }
  store(counted, "out-#{field}")
end

This feature obviously only works if you have specified a schema in the call to #load.

There are currently many limitations to this feature, so use it with caution. Since the schema support isn’t completely reliable, Piglet does not enforce schemas, and it does not warn you if you try to access a field that doesn’t exist. If it did, it would probably be more annoying and limiting than it would be worth.

Limitations

The aim is to support most of Pig Latin, but currently there are some limitations.

The following Pig operators are supported:

  • COGROUP

  • CROSS

  • DEFINE

  • DESCRIBE

  • DISTINCT

  • DUMP

  • EXPLAIN

  • FILTER

  • FOREACH … GENERATE (including FOREACH { … GENERATE })

  • GROUP

  • ILLUSTRATE

  • JOIN

  • LIMIT

  • LOAD

  • ORDER

  • REGISTER

  • SAMPLE

  • SPLIT

  • STORE

  • STREAM

  • UNION

The file commands (cd, cat, etc.) will probably not be supported for the foreseeable future.

All the aggregate functions except one are supported:

  • AVG

  • CONCAT

  • COUNT

  • IsEmpty

  • MAX

  • MIN

  • SIZE

  • SUM

  • TOKENIZE

  • FLATTEN

DIFF is not supported yet.

Piglet supports most arithmetic and logic operators on fields (see below), but check the output and make sure that it’s doing what you expect, because sometimes it’s tricky to see where Piglet hijacks the operators and when it’s Ruby that is running the show. I’m doing the best I can, but there are many things that can’t be done, at least not in Ruby 1.8.

Piglet does support these field operators:

  • == (equality)

  • > (greater than)

  • < (less than)

  • >= (greater than or equal to)

  • <= (less than or equal to)

  • % (modulo)

  • + (addition)

  • - (subtraction)

  • * (multiplication)

  • / (division)

It also has these operators, see below for explanations:

  • #not (logical negation)

  • #neg (numerical negation)

  • #ne (not equals)

  • #test (binary conditionals)

Piglet does not support:

  • != (not equals, you have to use == and a NOT, e.g. (a == b).not, which will be translated as NOT (a == b) or you can use #ne, which will translate to !=, e.g. a.ne(b) will become a != b. May be supported in the future, but only in Ruby 1.9)

  • ? : (the ternary operator)

  • - (negation, but you can use #neg on a field expression to get the same result, e.g. a.neg will be translated as -a. May be supported in the future, but only in Ruby 1.9)

  • key#value (map dereferencing, may be supported in the future)

Why aren’t the aliases in the Pig Latin the same as the variable names in the Piglet script?

When you run piglet on a Piglet script the aliases in the output will be relation_1, relation_2, relation_3, and so on, instead of the names of the variables of the Piglet script – like in the example at the top of this document.

The names a and b are lost in translation. This is unfortunate but hard to avoid. Firstly, there is no way to discover the names of variables, and secondly, there is no one-to-one correspondence between a statement in a Piglet script and a statement in Pig Latin: a.union(b, c).sample(3).group(:x) is at least three statements in Pig Latin. It simply wouldn’t be worth the extra complexity of trying to infer some variable names and reuse them as aliases in the Pig Latin output.

In the future I may add a way of manually suggesting relation aliases, so that the Pig Latin output is more readable.

You may also wonder why the relation aliases aren’t in consecutive order. The reason is that they get their names in the order they are evaluated, and the interpreter walks the relation ancestry upwards from a store (and it only evaluates a relation once).

Why the verbosity in the code generated from a nested FOREACH?

I’m working on it.

Why aren’t all operations included in the output?

If you try this Piglet code:

a = load 'input'
b = a.group :c

You might be surprised that Piglet will not output anything. In fact, Piglet only creates Pig Latin operations on relations that will somehow be output. Unless there is a store, dump, describe, illustrate or explain that outputs a relation, the operations applied to that relation and its ancestors will not be included.

When you call group, filter or any of the other methods that can be applied to a relation, a data structure that encodes these operations is created. When a relation is passed to store or one of the other output operators, the Piglet interpreter traverses the data structure backwards, building the Pig Latin operations needed to arrive at the relation that should be passed to the output operator. This is similar to how Pig itself interprets a Pig Latin script.

As a side effect of using store and the other output operators as the trigger for creating the needed relational operations, any relations that are not ancestors of relations that are output will not be included in the Pig Latin output. On the other hand, they would be no-ops when run by Pig anyway.
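The behaviour described here, together with the non-consecutive alias order mentioned earlier, falls out naturally from walking the ancestry upwards from each store. The following is a toy model of that idea, assuming a single parent per relation; Node and Script are names invented for the sketch and are not Piglet's classes.

```ruby
# A relation remembers its parent and how to render itself.
class Node
  attr_reader :source

  def initialize(source, &render)
    @source = source
    @render = render
  end

  def statement(source_alias)
    @render.call(source_alias)
  end
end

class Script
  def initialize
    @stores = []
  end

  def load(path)
    Node.new(nil) { "LOAD '#{path}'" }
  end

  def group(relation, field)
    Node.new(relation) { |source| "GROUP #{source} BY #{field}" }
  end

  # Only store registers something to output.
  def store(relation, path)
    @stores << [relation, path]
  end

  def to_pig_latin
    aliases = {}
    lines = []
    @stores.each do |relation, path|
      emit(relation, aliases, lines)
      lines << "STORE #{aliases[relation]} INTO '#{path}';"
    end
    lines.join("\n")
  end

  private

  # Names a node when it is first visited, then emits its ancestors first.
  def emit(node, aliases, lines)
    return if aliases.key?(node)
    aliases[node] = "relation_#{aliases.size + 1}"
    emit(node.source, aliases, lines) if node.source
    lines << "#{aliases[node]} = #{node.statement(aliases[node.source])};"
  end
end

script = Script.new
a = script.load('input')
b = script.group(a, :c)
script.group(a, :d) # never stored, so it never appears in the output
script.store(b, 'output')
puts script.to_pig_latin
```

Running this prints the same three statements as the example at the top of this document: the relation nearest the store is visited first and becomes relation_1, while the load it depends on becomes relation_2 but is emitted before it.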

The output is not what I expected!

Please contact me and give me the Piglet code and what you think the output should be. I’ll try to either fix your Piglet code, or fix Piglet to do what you expect it to do.

Contributors

  • Theo Hultberg

  • Ning Liang

© 2009-2010 Theo Hultberg / Iconara and contributors. See LICENSE for details.
