PandasTransform

TL;DR

PandasTransform is a TFX component that can be used in place of the standard Transform component and lets you do your feature engineering with Pandas dataframes. Processing is distributed using Beam for scalability. Operations that require a full pass over the dataset are not currently supported. Statistics such as the standard deviation, which operations such as z-score normalization require, are taken from the statistics captured by StatisticsGen.

Project Category

Component

Project Use-Case(s)

The primary use cases are:

  • Developers who are not modeling in TensorFlow
  • Developers who are prototyping and are more comfortable working with dataframes, at least initially, and may not deploy their model for inference
  • Developers whose feature engineering can work with the basic statistics of the dataset (min, max, etc.) and does not need to make full passes over the data
  • Developers who will perform their feature engineering outside of their model during inference, and therefore do not need a Transform graph to prepend to their model

Implementation

This is implemented as a Python-function component, using Beam for processing. As with the Transform component, you supply a module file containing your user code in a function named preprocessing_fn. Your code receives your dataset as a Pandas dataframe and returns its results as a Pandas dataframe. It also receives the basic statistics for your dataset, generated by StatisticsGen, and the schema of your dataset, generated by SchemaGen, each formatted as a Python dictionary.
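
As a rough illustration of the contract described above, a module file might look like the sketch below. The argument order of preprocessing_fn, the feature name "fare", and the layout of the statistics and schema dictionaries are all assumptions made for this example; this document does not specify them, so check the component source before relying on any of these names.

```python
import pandas as pd


def preprocessing_fn(df: pd.DataFrame, schema: dict, statistics: dict) -> pd.DataFrame:
    """Hypothetical user code: z-score one column using supplied statistics.

    Assumption: statistics maps a feature name to a dict holding 'mean' and
    'std_dev'. The real dictionary layout may differ.
    """
    fare_stats = statistics["fare"]
    out = df.copy()  # leave the incoming partition untouched
    # Use the dataset-wide statistics, not values computed from this partition.
    out["fare_zscore"] = (out["fare"] - fare_stats["mean"]) / fare_stats["std_dev"]
    return out


# Stand-in for one partition of the dataset and for the supplied dictionaries.
partition = pd.DataFrame({"fare": [5.0, 10.0, 15.0]})
statistics = {"fare": {"mean": 10.0, "std_dev": 5.0}}
schema = {"fare": "FLOAT"}  # placeholder; unused in this sketch

result = preprocessing_fn(partition, schema, statistics)
print(result["fare_zscore"].tolist())  # [-1.0, 0.0, 1.0]
```

The function only reads precomputed statistics and never aggregates over the dataframe it receives, which keeps it valid when it is handed a single partition.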

Note that PandasTransform is designed to be backward compatible. For TFX releases >= 1.8.0, PandasTransform takes advantage of the use_beam property of the @component decorator to subclass BaseBeamComponent, which honors the shared pipeline-wide configuration, including beam_pipeline_args. When used with earlier releases of TFX, it subclasses BaseComponent.

def PandasTransform(
    transformed_examples: tfx.dsl.components.OutputArtifact[Examples],
    examples: tfx.dsl.components.InputArtifact[Examples] = None,
    schema: tfx.dsl.components.InputArtifact[Schema] = None,
    statistics: tfx.dsl.components.InputArtifact[ExampleStatistics] = None,
    module_file: tfx.dsl.components.Parameter[str] = None,
    beam_pipeline_args: tfx.dsl.components.Parameter[str] = None) -> None:
  """
  Args:
    examples: A TFX input channel containing a dataset artifact.
    schema: A TFX input channel containing a schema artifact.
    statistics: A TFX input channel containing a statistics artifact.
    transformed_examples: A TFX output channel which will be used to output
      the resulting dataset artifact.
    module_file: A component parameter containing a file path to a Python file
      which contains the user code, in a function named 'preprocessing_fn'.
    beam_pipeline_args: A string with the argv options for creating a Beam
      pipeline. Note that this is a string, not a list. It will be split on
      spaces to create a list. If running TFX >= 1.8.0 and beam_pipeline_args
      are specified, they will override the pipeline beam args.

  Returns:
    None. The resulting dataset artifact, after processing by the user code,
    is written to transformed_examples.

  Raises:
    ImportError - When the module file is not found.
  """
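
Because beam_pipeline_args is a single string that is split on spaces, it may help to see what that implies for option values. The helper below is hypothetical and only approximates the documented behavior with str.split(); the component's actual splitting code is not shown in this document.

```python
def split_beam_args(beam_pipeline_args: str) -> list:
    """Hypothetical helper approximating the documented split-on-spaces step."""
    return beam_pipeline_args.split()


# Each option becomes one argv entry.
argv = split_beam_args("--runner=DirectRunner --direct_num_workers=2")
print(argv)  # ['--runner=DirectRunner', '--direct_num_workers=2']

# A value that itself contains a space is broken into two entries.
broken = split_beam_args("--temp_location=/tmp/my dir")
print(broken)  # ['--temp_location=/tmp/my', 'dir']
```

Under this reading, option values passed through beam_pipeline_args should not contain spaces.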

Notes & Caveats

  • Each invocation of your preprocessing_fn is supplied with only part of your dataset, to enable distributed processing. Your user code therefore cannot make a full pass over the dataset, and operations that require one are not supported in the first release. Whether a future release will enable full-pass operations is still to be determined.
  • Unlike the standard Transform component, this PandasTransform component does not output the modified schema and statistics for the altered dataset. To generate a schema and statistics which reflect any changes that you've made to your dataset, you should follow the PandasTransform component with StatisticsGen and SchemaGen components in your pipeline.
  • Unlike the standard Transform component, PandasTransform does not create a Transform graph, so the operations performed in PandasTransform cannot be prepended to a TensorFlow model.
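
The first caveat can be illustrated with plain Pandas, without TFX or Beam. In this sketch the six-row dataframe and its two-way split are made up, and the split merely stands in for however the data is actually partitioned across workers. Centering a column with a mean computed inside each partition gives different results from centering with the dataset-wide mean, which is why such values should come from the supplied statistics.

```python
import pandas as pd

# Toy dataset and an arbitrary two-way split standing in for partitions.
full = pd.DataFrame({"x": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]})
partitions = [full.iloc[:3], full.iloc[3:]]

global_mean = full["x"].mean()
partition_means = [p["x"].mean() for p in partitions]
print(global_mean, partition_means)  # 11.0 [2.0, 20.0]

# Centering with a mean computed inside each partition (what a "full pass"
# written in user code would silently degrade to).
centered_locally = pd.concat([p["x"] - p["x"].mean() for p in partitions])

# Centering with the dataset-wide mean, as precomputed statistics allow.
centered_globally = pd.concat([p["x"] - global_mean for p in partitions])

print(centered_locally.tolist())   # [-1.0, 0.0, 1.0, -10.0, 0.0, 10.0]
print(centered_globally.tolist())  # [-10.0, -9.0, -8.0, -1.0, 9.0, 19.0]
```

The locally centered values depend on where the partition boundaries fall, while the globally centered values do not.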

Project Dependencies

  • Apache Beam
  • PyArrow
  • Pandas
  • TensorFlow
  • TensorFlow Data Validation
  • TFX

Project Team

Robert Crowe (rcrowe-google) robertcrowe--at--google--dot--com

Note

Please be aware of the processes and requirements which are outlined here: