[FLINK-17285][python][docs] Translate "Python Table API" page into Chinese (data type, metrics, etc)

This closes apache#13116.
billyrrr authored and dianfu committed Aug 17, 2020
1 parent 29bd21a commit b757816
Showing 5 changed files with 83 additions and 90 deletions.
39 changes: 18 additions & 21 deletions docs/dev/table/python/conversion_of_pandas.zh.md
@@ -22,60 +22,57 @@ specific language governing permissions and limitations
under the License.
-->

Conversions between PyFlink Table and Pandas DataFrame are supported.

* This will be replaced by the TOC
{:toc}

## Convert Pandas DataFrame to PyFlink Table

A PyFlink Table can be created from a Pandas DataFrame. Internally, the Pandas DataFrame is serialized
into the Arrow columnar format on the client side, and the serialized data is deserialized and processed
by the Arrow source during execution. The Arrow source can also be used in streaming jobs; it handles
checkpoints properly and provides exactly-once guarantees.

The following example shows how to create a PyFlink Table from a Pandas DataFrame:

{% highlight python %}
import pandas as pd
import numpy as np

from pyflink.table import DataTypes

# Create a Pandas DataFrame
pdf = pd.DataFrame(np.random.rand(1000, 2))

# Create a PyFlink Table from a Pandas DataFrame
table = t_env.from_pandas(pdf)

# Create a PyFlink Table from a Pandas DataFrame with the specified column names
table = t_env.from_pandas(pdf, ['f0', 'f1'])

# Create a PyFlink Table from a Pandas DataFrame with the specified column types
table = t_env.from_pandas(pdf, [DataTypes.DOUBLE(), DataTypes.DOUBLE()])

# Create a PyFlink Table from a Pandas DataFrame with the specified row type
table = t_env.from_pandas(pdf,
                          DataTypes.ROW([DataTypes.FIELD("f0", DataTypes.DOUBLE()),
                                         DataTypes.FIELD("f1", DataTypes.DOUBLE())]))
{% endhighlight %}

## Convert PyFlink Table to Pandas DataFrame

A PyFlink Table can also be converted to a Pandas DataFrame. Internally, the results of the table are
materialized and serialized into multiple Arrow batches of the Arrow columnar format on the client side. The maximum Arrow batch size
is determined by the config option [python.fn-execution.arrow.batch.size]({{ site.baseurl }}/zh/dev/table/python/python_config.html#python-fn-execution-arrow-batch-size).
The serialized data is then converted to a Pandas DataFrame. This collects the content of the table to
the client side, so please make sure that the content of the table fits in memory before calling this method.
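
For example, this option can be set on the table environment before calling `to_pandas` (a minimal sketch; the value 10000 is illustrative, not a recommendation from this page):

{% highlight python %}
# Adjust the maximum Arrow batch size used when serializing table results.
# Assumption: t_env is an existing TableEnvironment; 10000 is an illustrative value.
t_env.get_config().get_configuration().set_string(
    "python.fn-execution.arrow.batch.size", "10000")
{% endhighlight %}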

The following example shows how to convert a PyFlink Table to a Pandas DataFrame:

{% highlight python %}
import pandas as pd
import numpy as np

# Create a PyFlink Table
pdf = pd.DataFrame(np.random.rand(1000, 2))
table = t_env.from_pandas(pdf, ["a", "b"]).filter("a > 0.5")

# Convert the PyFlink Table to a Pandas DataFrame
pdf = table.to_pandas()
{% endhighlight %}
2 changes: 1 addition & 1 deletion docs/dev/table/python/metrics.md
@@ -146,7 +146,7 @@ You can refer to the Java metric document for more details on [Scope definition]

### User Scope

You can define a user scope by calling `MetricGroup.add_group(key: str, value: str = None)`. If value is not None, this creates a new key-value MetricGroup pair. The key group is added to this group's sub-groups, while the value group is added to the key group's sub-groups. In this case, the value group is returned and a user variable is defined.

<div class="codetabs" markdown="1">
<div data-lang="python" markdown="1">
70 changes: 36 additions & 34 deletions docs/dev/table/python/metrics.zh.md
@@ -22,24 +22,24 @@ specific language governing permissions and limitations
under the License.
-->

PyFlink exposes a metric system that allows gathering and exposing metrics to external systems.

* This will be replaced by the TOC
{:toc}

## Registering metrics

You can access the metric system from a [User-defined Function]({{ site.baseurl }}/zh/dev/table/python/python_udfs.html) by calling `function_context.get_metric_group()` in the `open` method.
The `get_metric_group()` method returns a `MetricGroup` object on which you can create and register new metrics.

### Metric types

PyFlink supports `Counters`, `Gauges`, `Distributions` and `Meters`.

#### Counter

A `Counter` is used to count something. The current value can be incremented or decremented using `inc()`/`inc(n: int)` or `dec()`/`dec(n: int)`.
You can create and register a `Counter` by calling `counter(name: str)` on a `MetricGroup`.

<div class="codetabs" markdown="1">
<div data-lang="python" markdown="1">
@@ -63,9 +63,9 @@ class MyUDF(ScalarFunction):

</div>
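
For reference, a minimal counter sketch inside a scalar UDF — the class name, `udf` declaration and data types here are illustrative assumptions, not part of the collapsed example above:

{% highlight python %}
from pyflink.table import DataTypes
from pyflink.table.udf import ScalarFunction, udf

class CounterUDF(ScalarFunction):  # illustrative name

    def __init__(self):
        self.counter = None

    def open(self, function_context):
        super().open(function_context)
        # Create and register a counter named "my_counter"
        self.counter = function_context.get_metric_group().counter("my_counter")

    def eval(self, i):
        self.counter.inc(i)  # increment the counter by i
        return i

my_udf = udf(CounterUDF(), input_types=[DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
{% endhighlight %}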

#### Gauge

A `Gauge` provides a value on demand. You can register a gauge by calling `gauge(name: str, obj: Callable[[], int])` on a `MetricGroup`. The Callable object will be used to report the values. Gauge metrics are restricted to integer-only values.

<div class="codetabs" markdown="1">
<div data-lang="python" markdown="1">
@@ -88,9 +88,9 @@ class MyUDF(ScalarFunction):

</div>
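
A minimal gauge sketch along the same lines (names and types are illustrative assumptions):

{% highlight python %}
from pyflink.table import DataTypes
from pyflink.table.udf import ScalarFunction, udf

class GaugeUDF(ScalarFunction):  # illustrative name

    def __init__(self):
        self.length = 0

    def open(self, function_context):
        super().open(function_context)
        # The callable is invoked each time the gauge value is reported
        function_context.get_metric_group().gauge("my_gauge", lambda: self.length)

    def eval(self, i):
        self.length = i  # the value reported on the next gauge read
        return i

my_udf = udf(GaugeUDF(), input_types=[DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
{% endhighlight %}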

#### Distribution

A `Distribution` reports information (sum, count, min, max and mean) about the distribution of reported values. The value can be updated using `update(n: int)`. You can register a distribution by calling `distribution(name: str)` on a `MetricGroup`. Distribution metrics are restricted to integer-only distributions.

<div class="codetabs" markdown="1">
<div data-lang="python" markdown="1">
@@ -113,9 +113,10 @@ class MyUDF(ScalarFunction):

</div>
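
A minimal distribution sketch (names and types are illustrative assumptions):

{% highlight python %}
from pyflink.table import DataTypes
from pyflink.table.udf import ScalarFunction, udf

class DistributionUDF(ScalarFunction):  # illustrative name

    def __init__(self):
        self.distribution = None

    def open(self, function_context):
        super().open(function_context)
        # Create and register a distribution named "my_distribution"
        self.distribution = function_context.get_metric_group().distribution("my_distribution")

    def eval(self, i):
        self.distribution.update(i)  # record i in the distribution
        return i

my_udf = udf(DistributionUDF(), input_types=[DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
{% endhighlight %}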

#### Meter

A `Meter` measures an average throughput. An occurrence of an event can be registered with the `mark_event()` method; the occurrence of multiple events at the same time can be registered with the `mark_event(n: int)` method. You can register a meter by calling `meter(self, name: str, time_span_in_seconds: int = 60)` on a `MetricGroup`. The default value of `time_span_in_seconds` is 60.

<div class="codetabs" markdown="1">
<div data-lang="python" markdown="1">
@@ -129,7 +130,7 @@ class MyUDF(ScalarFunction):

def open(self, function_context):
super().open(function_context)
# an average rate of events per second over 120s, default is 60s.
self.meter = function_context.get_metric_group().meter("my_meter", time_span_in_seconds=120)

def eval(self, i):
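# The body below is a plausible completion (assumption), not shown in the diff:
self.meter.mark_event(i)  # register i occurrences of the event
return i - 1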
@@ -140,13 +141,14 @@ class MyUDF(ScalarFunction):

</div>

## Scope

You can refer to the Java metric document for more details on [Scope definition]({{ site.baseurl }}/zh/monitoring/metrics.html#Scope).

### User Scope

You can define a user scope by calling `MetricGroup.add_group(key: str, value: str = None)`. If value is not None, this creates a new key-value MetricGroup pair. The key group is added to this group's sub-groups, while the value group is added to the key group's sub-groups. In this case, the value group is returned and a user variable is defined.

<div class="codetabs" markdown="1">
<div data-lang="python" markdown="1">
@@ -167,19 +169,19 @@ function_context

</div>
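
A minimal sketch, reusing the `function_context` variable from the UDF examples above: the counter is registered under the key/value user scope, and the value group defines a user variable.

{% highlight python %}
function_context \
    .get_metric_group() \
    .add_group("my_metrics_key", "my_metrics_value") \
    .counter("my_counter")
{% endhighlight %}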

### System Scope

You can refer to the Java metric document for more details on [System Scope]({{ site.baseurl }}/zh/monitoring/metrics.html#system-scope).

### List of all Variables

You can refer to the Java metric document for more details on [List of all Variables]({{ site.baseurl }}/zh/monitoring/metrics.html#list-of-all-variables).

### User Variables

You can define a user variable by calling `MetricGroup.add_group(key: str, value: str = None)` and specifying the value parameter.

**Important:** User variables cannot be used in scope formats.

<div class="codetabs" markdown="1">
<div data-lang="python" markdown="1">
@@ -193,15 +195,15 @@ function_context

</div>

## Common part between PyFlink and Flink

You can refer to the Java metric document for more details on the following sections:

* [Reporter]({{ site.baseurl }}/zh/monitoring/metrics.html#reporter)
* [System metrics]({{ site.baseurl }}/zh/monitoring/metrics.html#system-metrics)
* [Latency tracking]({{ site.baseurl }}/zh/monitoring/metrics.html#latency-tracking)
* [REST API integration]({{ site.baseurl }}/zh/monitoring/metrics.html#rest-api-integration)
* [Dashboard integration]({{ site.baseurl }}/zh/monitoring/metrics.html#dashboard-integration)


{% top %}
23 changes: 9 additions & 14 deletions docs/dev/table/python/python_types.zh.md
@@ -22,33 +22,28 @@ specific language governing permissions and limitations
under the License.
-->

This page describes the data types supported in PyFlink Table API.

* This will be replaced by the TOC
{:toc}

Data Type
---------

A *data type* describes the logical type of a value in the table ecosystem. It can be used to declare input and/or
output types of Python user-defined functions. Users of the Python Table API work with instances of
`pyflink.table.types.DataType` within the Python Table API or when defining user-defined functions.

A `DataType` instance declares the **logical type** which does not imply a concrete physical representation for transmission
or storage. All pre-defined data types are available in `pyflink.table.types` and can be instantiated with the utility methods
defined in `pyflink.table.types.DataTypes`.
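
For example, a row type with two fields can be instantiated as follows (the field names and types are illustrative):

{% highlight python %}
from pyflink.table.types import DataTypes

# Build a row type from the DataTypes utility methods
row_type = DataTypes.ROW([DataTypes.FIELD("id", DataTypes.BIGINT()),
                          DataTypes.FIELD("score", DataTypes.DOUBLE())])
{% endhighlight %}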

A list of all pre-defined data types can be found [below]({{ site.baseurl }}/zh/dev/table/types.html#list-of-data-types).

Data Type and Python Type Mapping
------------------

A *data type* can be used to declare input and/or output types of Python user-defined functions. The inputs
will be converted to Python objects corresponding to the data type, and the type of the user-defined function's
result must also match the defined data type.

For vectorized Python UDFs, the input types and output type are `pandas.Series`. The element type
of the `pandas.Series` corresponds to the specified data type.
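
A minimal sketch of such a vectorized UDF, assuming the `udf_type="pandas"` declaration style of this PyFlink version (the function and its types are illustrative):

{% highlight python %}
from pyflink.table import DataTypes
from pyflink.table.udf import udf

# i and j arrive as pandas.Series; the result must be a pandas.Series
# of the same length.
@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()],
     result_type=DataTypes.BIGINT(),
     udf_type="pandas")
def add(i, j):
    return i + j
{% endhighlight %}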

| Data Type | Python Type | Pandas Type |
|:-----------------|:------------|:------------|
