Skip to content

Commit

Permalink
Update doc with Scala First-N and Partition/Rebalance
Browse files Browse the repository at this point in the history
  • Loading branch information
aljoscha committed Sep 29, 2014
1 parent 66f236d commit 2b4d779
Show file tree
Hide file tree
Showing 2 changed files with 83 additions and 17 deletions.
71 changes: 55 additions & 16 deletions docs/dataset_transformations.md
Original file line number Diff line number Diff line change
Expand Up @@ -284,7 +284,7 @@ When using Case Classes you can also specify the grouping key using the names of

~~~scala
case class MyClass(val a: String, b: Int, c: Double)
val tuples = DataSet[MyClass]] = // [...]
val tuples = DataSet[MyClass] = // [...]
// group on the first and second field
val reducedTuples = tuples.groupBy("a", "b").reduce { ... }
~~~
Expand Down Expand Up @@ -1103,15 +1103,11 @@ val unioned = vals1.union(vals2).union(vals3)
</div>
</div>

### Rebalance (Java API Only)

### Rebalance
Evenly rebalances the parallel partitions of a DataSet to eliminate data skew.
Only Map-like transformations may follow a rebalance transformation, i.e.,

- Map
- FlatMap
- Filter
- MapPartition
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">

~~~java
DataSet<String> in = // [...]
Expand All @@ -1120,16 +1116,26 @@ DataSet<Tuple2<String, String>> out = in.rebalance()
.map(new Mapper());
~~~

### Hash-Partition (Java API Only)
</div>
<div data-lang="scala" markdown="1">

~~~scala
val in: DataSet[String] = // [...]
// rebalance DataSet and apply a Map transformation.
val out = in.rebalance().map { ... }
~~~

</div>
</div>


### Hash-Partition

Hash-partitions a DataSet on a given key.
Keys can be specified as key-selector functions or field position keys (see [Reduce examples](#reduce-on-grouped-dataset) for how to specify keys).
Only Map-like transformations may follow a hash-partition transformation, i.e.,

- Map
- FlatMap
- Filter
- MapPartition
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">

~~~java
DataSet<Tuple2<String, Integer>> in = // [...]
Expand All @@ -1138,10 +1144,25 @@ DataSet<Tuple2<String, String>> out = in.partitionByHash(0)
.mapPartition(new PartitionMapper());
~~~

### First-n (Java API Only)
</div>
<div data-lang="scala" markdown="1">

~~~scala
val in: DataSet[(String, Int)] = // [...]
// hash-partition DataSet by String value and apply a MapPartition transformation.
val out = in.partitionByHash(0).mapPartition { ... }
~~~

</div>
</div>

### First-n

Returns the first n (arbitrary) elements of a DataSet. First-n can be applied on a regular DataSet, a grouped DataSet, or a grouped-sorted DataSet. Grouping keys can be specified as key-selector functions or field position keys (see [Reduce examples](#reduce-on-grouped-dataset) for how to specify keys).

<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">

~~~java
DataSet<Tuple2<String, Integer>> in = // [...]
// Return the first five (arbitrary) elements of the DataSet
Expand All @@ -1155,4 +1176,22 @@ DataSet<Tuple2<String, Integer>> out2 = in.groupBy(0)
DataSet<Tuple2<String, Integer>> out3 = in.groupBy(0)
.sortGroup(1, Order.ASCENDING)
.first(3);
~~~
~~~

</div>
<div data-lang="scala" markdown="1">

~~~scala
val in: DataSet[(String, Int)] = // [...]
// Return the first five (arbitrary) elements of the DataSet
val out1 = in.first(5)

// Return the first two (arbitrary) elements of each String group
val out2 = in.groupBy(0).first(2)

// Return the first three elements of each String group ordered by the Integer field
val out3 = in.groupBy(0).sortGroup(1, Order.ASCENDING).first(3)
~~~

</div>
</div>
29 changes: 28 additions & 1 deletion docs/programming_guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -608,7 +608,7 @@ DataSet<String> result = in.rebalance()
<tr>
<td><strong>Hash-Partition</strong></td>
<td>
<p>Hash-partitions a data set on a given key. Keys can be specified as key-selector functions or field position keys. Only Map-like transformations may follow a hash-partition transformation. (Java API Only)</p>
<p>Hash-partitions a data set on a given key. Keys can be specified as key-selector functions or field position keys.</p>
{% highlight java %}
DataSet<Tuple2<String,Integer>> in = // [...]
DataSet<Integer> result = in.partitionByHash(0)
Expand Down Expand Up @@ -804,6 +804,33 @@ val result: DataSet[(Int, String)] = data1.cross(data2)
<p>Produces the union of two data sets.</p>
{% highlight scala %}
data.union(data2)
{% endhighlight %}
</td>
</tr>
<tr>
<td><strong>Hash-Partition</strong></td>
<td>
<p>Hash-partitions a data set on a given key. Keys can be specified as key-selector functions, tuple positions
or case class fields.</p>
{% highlight scala %}
val in: DataSet[(Int, String)] = // [...]
val result = in.partitionByHash(0).mapPartition { ... }
{% endhighlight %}
</td>
</tr>
<tr>
<td><strong>First-n</strong></td>
<td>
<p>Returns the first n (arbitrary) elements of a data set. First-n can be applied on a regular data set, a grouped data set, or a grouped-sorted data set. Grouping keys can be specified as key-selector functions,
tuple positions or case class fields.</p>
{% highlight scala %}
val in: DataSet[(Int, String)] = // [...]
// regular data set
val result1 = in.first(3)
// grouped data set
val result2 = in.groupBy(0).first(3)
// grouped-sorted data set
val result3 = in.groupBy(0).sortGroup(1, Order.ASCENDING).first(3)
{% endhighlight %}
</td>
</tr>
Expand Down

0 comments on commit 2b4d779

Please sign in to comment.