Vectorized Python UDTFs¶
This topic introduces vectorized Python UDTFs.
Overview¶
Vectorized Python UDTFs (user-defined table functions) provide a way to operate over rows in batches.
Snowflake supports two kinds of vectorized UDTFs:
UDTFs with a vectorized end_partition method
UDTFs with a vectorized process method
You must choose one kind because a UDTF can’t have both a vectorized process method and a vectorized end_partition method.
UDTFs with a vectorized end_partition method¶
UDTFs with a vectorized end_partition method enable seamless partition-by-partition processing by operating on partitions as pandas DataFrames and returning results as pandas DataFrames, or as lists of pandas arrays or pandas Series.
This facilitates integration with libraries that operate on pandas DataFrames or pandas arrays.
Use a vectorized end_partition method for the following tasks:
Process your data on a partition-by-partition basis instead of on a row-by-row basis.
Return multiple rows or columns for each partition.
Use libraries that operate on pandas DataFrames for data analysis.
UDTFs with a vectorized process method¶
UDTFs with a vectorized process method provide a way to operate over rows in batches, when the operation performs a 1-to-1 mapping.
In other words, the method returns one output row for each input row. The number of columns is not restricted.
Use a vectorized process method for the following tasks:
Apply a 1-to-1 transformation with a multi-columnar result in batches.
Use a library that requires pandas.DataFrame.
Process rows in batches, without explicit partitioning.
Leverage the to_pandas() API to transform the query result directly to a pandas DataFrame.
Prerequisites¶
The Snowpark Library for Python version 1.14.0 or later is required.
Create a UDTF with a vectorized end_partition method¶
1. Optional: Define a handler class with an __init__ method, which will be invoked before each partition is processed.

   Note: Do not define a process method.

2. Define an end_partition method that takes in a DataFrame argument and returns or yields a pandas.DataFrame, or a tuple of pandas.Series or pandas arrays where each array is a column.

   The column types of the result must match the column types in the UDTF definition.

3. To mark the end_partition method as vectorized, use the @vectorized decorator or the _sf_vectorized_input function attribute. For more information, see Vectorized Python UDFs.

   The @vectorized decorator can only be used when the Python UDTF is executed within Snowflake; for example, when using a SQL worksheet. When you are executing using the client or a Python worksheet, you must use the function attribute.
Note
The default column names for the input DataFrame to a UDTF with a vectorized end_partition method match the signature of the SQL function.
The column names follow the SQL identifier requirements.
That is, if an identifier is unquoted, it will be uppercased; if it is double-quoted, it will remain unchanged.
The following code block is an example of creating a UDTF with a vectorized end_partition method, using the @vectorized decorator:
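A minimal sketch of such a handler, assuming a single-column signature (so the input DataFrame has one column). The @vectorized decorator comes from the _snowflake module, which exists only inside Snowflake; the ImportError fallback below is an illustrative stand-in, included only so the sketch can be inspected outside Snowflake, and simply sets the equivalent function attribute:

```python
import pandas

try:
    from _snowflake import vectorized  # available only inside Snowflake
except ImportError:
    # Illustrative local stand-in (assumption, not the Snowflake API):
    # mimic the decorator by setting the equivalent function attribute.
    def vectorized(input=None):
        def wrapper(func):
            func._sf_vectorized_input = input
            return func
        return wrapper

class handler:
    @vectorized(input=pandas.DataFrame)
    def end_partition(self, df):
        # Receive the whole partition as one DataFrame and return
        # one result DataFrame for the partition.
        return df
```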
The following code block is an example of creating a UDTF with a vectorized end_partition method, using the function attribute:
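The same sketch expressed with the _sf_vectorized_input function attribute, which is the form required when you register the UDTF from a client or a Python worksheet (class and column names are illustrative):

```python
import pandas

class handler:
    def end_partition(self, df):
        # One DataFrame in per partition, one DataFrame out.
        return df

# Mark the method as vectorized via the function attribute instead of
# the @vectorized decorator.
handler.end_partition._sf_vectorized_input = pandas.DataFrame
```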
Note
A UDTF with a vectorized end_partition method must be called with a PARTITION BY clause to build the partitions.
To call the UDTF with all the data in the same partition:
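For example (assuming a UDTF named my_vec_udtf over a table tbl with a column x; partitioning by a constant places every row in a single partition):

```sql
SELECT * FROM tbl, TABLE(my_vec_udtf(x) OVER (PARTITION BY 1));
```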
To call the UDTF with the data partitioned by column x:
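For example (same assumed names as above):

```sql
SELECT * FROM tbl, TABLE(my_vec_udtf(x) OVER (PARTITION BY x));
```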
Example: Row collection using a regular UDTF versus using a UDTF with a vectorized end_partition method¶
Row collection using a regular UDTF:
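A sketch of the row-by-row approach, with hypothetical names: process is called once per row and buffers it, and end_partition emits the buffered rows at the end of the partition:

```python
class handler:
    def __init__(self):
        # Buffer for the rows of the current partition.
        self.rows = []

    def process(self, x, y):
        # Called once per input row; collect it and emit nothing yet.
        self.rows.append((x, y))

    def end_partition(self):
        # Emit all collected rows once the partition is complete.
        yield from self.rows
```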
Row collection using a UDTF with a vectorized end_partition method:
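The vectorized counterpart, sketched under the same assumptions: the whole partition already arrives as one DataFrame, so the buffering loop disappears and the handler simply returns it:

```python
import pandas

class handler:
    def end_partition(self, df):
        # df already holds every row of the partition.
        return df

handler.end_partition._sf_vectorized_input = pandas.DataFrame
```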
Example: Calculate the summary statistic for each column in the partition¶
Here is an example of how to calculate the summary statistic for each column in the partition using
the pandas describe() method.
Create a table and generate three partitions of five rows each:
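One possible setup, with assumed table and column names (id cycles through 0, 1, and 2, giving three partitions of five rows each):

```sql
CREATE OR REPLACE TABLE test_values(id INT, col1 FLOAT, col2 FLOAT);

INSERT INTO test_values
  SELECT SEQ1() % 3,
         UNIFORM(0::FLOAT, 1::FLOAT, RANDOM()),
         UNIFORM(0::FLOAT, 1::FLOAT, RANDOM())
  FROM TABLE(GENERATOR(ROWCOUNT => 15));
```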
Look at the data:
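For example (assuming the table created above is named test_values):

```sql
SELECT * FROM test_values ORDER BY id;
```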
Create the function:
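A sketch of the handler (names are illustrative; registration via Snowpark or CREATE FUNCTION is omitted). pandas describe() produces one row per summary statistic, and the statistic name is copied into a regular column so it appears in the UDTF output:

```python
import pandas

class stats_handler:
    def end_partition(self, df):
        # describe() computes count, mean, std, min, quartiles, and max
        # for each numeric column of the partition.
        desc = df.describe()
        # Move the statistic name from the index into a regular column
        # so it becomes part of the output rows.
        desc.insert(0, "STAT", desc.index)
        return desc

stats_handler.end_partition._sf_vectorized_input = pandas.DataFrame
```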
Do one of the following steps:
Call the function and partition by id:
Call the function and treat the whole table as one partition:
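Assuming the UDTF is registered as summary_stats over a table named test_values, the two calls might look like this:

```sql
-- Partition by id:
SELECT * FROM test_values,
       TABLE(summary_stats(col1, col2) OVER (PARTITION BY id));

-- Treat the whole table as a single partition:
SELECT * FROM test_values,
       TABLE(summary_stats(col1, col2) OVER (PARTITION BY 1));
```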
Create a UDTF with a vectorized process method¶
1. Define a handler class, similar to regular UDTFs, with optional __init__ and end_partition methods.

2. Define a process method that takes in a DataFrame argument and returns either a pandas.DataFrame or a tuple of pandas.Series or pandas arrays where each array is a column.

   The column types of the result must match the column types in the UDTF definition. The returned result must be exactly one DataFrame or tuple. This is different from a vectorized end_partition method, where you can yield or return a list.

3. To mark the process method as vectorized, use the @vectorized decorator or the _sf_vectorized_input function attribute. For more information, see Vectorized Python UDFs.

   The @vectorized decorator can only be used when the Python UDTF is executed within Snowflake; for example, when using a SQL worksheet. When you are executing using the client or a Python worksheet, you must use the function attribute.

4. Optional: If your Python handler function is exceeding the execution time limit, set a target batch size.
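The steps above can be sketched as follows (names are illustrative; the _sf_max_batch_size attribute is shown by analogy with vectorized UDFs and is an assumption here):

```python
import pandas

class multiplier:
    def process(self, df):
        # 1-to-1 mapping: exactly one output row per input row,
        # returned as a single DataFrame.
        return pandas.DataFrame({"DOUBLED": df.iloc[:, 0] * 2})

# Mark the process method as vectorized via the function attribute.
multiplier.process._sf_vectorized_input = pandas.DataFrame
# Optional (assumption, mirroring vectorized UDFs): cap the number of
# rows per batch if the handler risks exceeding the time limit.
multiplier.process._sf_max_batch_size = 100
```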
Note
The default column names for the input DataFrame to a UDTF with a vectorized process method match the signature of the SQL function.
The column names follow the SQL identifier requirements.
Namely, if an identifier is unquoted, it will be uppercased; if it is double-quoted, it will remain unchanged.
The handler for a UDTF with a vectorized process method can be implemented to process batches in a partition-aware manner or to process them simply batch by batch.
For more information, see Stateful and Stateless Processing.
Example: Use a UDTF with a vectorized process method to apply one hot encoding¶
Use a UDTF with a vectorized process method to apply one hot encoding on a table with ten categories:
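A sketch of such a handler (names are illustrative; registration is omitted). Each input row yields exactly one output row holding ten indicator columns, one per category 0 through 9:

```python
import pandas

class one_hot_encoder:
    def process(self, df):
        # One output row per input row: one 0/1 indicator column
        # per category (0..9).
        return pandas.DataFrame({
            f"CAT_{i}": (df.iloc[:, 0] == i).astype(int)
            for i in range(10)
        })

one_hot_encoder.process._sf_vectorized_input = pandas.DataFrame
```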
Sample result:
Prepare to print the table:
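One way to prepare, assuming the goal is to show all ten result columns without truncation (these pandas display options are an illustrative guess at the original step):

```python
import pandas

# Widen the pandas display so every result column prints on one line
# instead of being truncated with "...".
pandas.set_option("display.max_columns", None)
pandas.set_option("display.width", None)
```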
Sample result:
You can obtain the same result with a vectorized UDF, although it is less convenient: you need to package the results into one column, and then unpack the column to restore the results to a usable pandas DataFrame.
Example of using a vectorized UDF:
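A sketch of the UDF route, with hypothetical names: because a vectorized UDF returns a single column, the ten indicator values are packed into one list per row, and a client-side helper unpacks them back into a ten-column DataFrame:

```python
import pandas

def one_hot_packed(df):
    # Vectorized UDFs return one column, so pack the ten indicators
    # into a single list per input row.
    return pandas.Series(
        [[int(v == i) for i in range(10)] for v in df.iloc[:, 0]]
    )

one_hot_packed._sf_vectorized_input = pandas.DataFrame

def unpack(packed):
    # Client-side step: expand the packed column back into ten columns.
    return pandas.DataFrame(packed.tolist(),
                            columns=[f"CAT_{i}" for i in range(10)])
```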
Type support¶
Vectorized UDTFs support the same SQL types as vectorized UDFs. However, for vectorized UDTFs, SQL NUMBER arguments with a scale of 0 that all fit in a 64-bit or smaller integer type are always mapped to Int16, Int32, or Int64. Unlike with scalar UDFs, if the argument of a UDTF is not nullable, it is not converted to int16, int32, or int64.
To view a table showing how SQL types are mapped to pandas dtypes, see the type support table in the vectorized Python UDFs topic.
Best practices¶
If a scalar must be returned with each row, build a list of repeated values instead of unpacking the numpy array to create tuples. For example, for a two-column result, instead of:
Use this:
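Both patterns sketched together (variable names are illustrative; results stands for a two-column numpy array of per-row values, scalar for the value repeated on every row):

```python
import pandas
import numpy

results = numpy.array([[1, 2], [3, 4], [5, 6]])
scalar = 42

# Instead of: unpacking the array row by row into Python tuples
# (builds one tuple object per row).
slow = tuple((scalar, row[0], row[1]) for row in results)

# Use this: repeat the scalar as a column and keep the array
# columns whole.
fast = (pandas.Series([scalar] * len(results)),
        pandas.Series(results[:, 0]),
        pandas.Series(results[:, 1]))
```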
To improve performance, unpack semi-structured data into columns. For example, if you have a variant column obj with elements x (int), y (float), and z (string), instead of defining a UDTF with a signature like this and calling it using vec_udtf(obj):
Define the UDTF with a signature like this, and call it using vec_udtf(obj:x, obj:y, obj:z):
By default, Snowflake encodes the inputs into pandas dtypes that support NULL values (for example, Int64). If you are using a library that requires a primitive type (such as numpy) and your input has no NULL values, cast the column to a primitive type before using the library. For example:
For more information, see Type Support.
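A sketch of the cast, assuming a nullable Int64 column with no NULL values:

```python
import pandas
import numpy

# A nullable Int64 column, as Snowflake would pass it to the handler.
col = pandas.Series([1, 2, 3], dtype="Int64")

# Cast to a primitive dtype before handing it to a library such
# as numpy.
primitive = col.astype("int64")
result = numpy.log(primitive)
```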
When using UDTFs with a vectorized end_partition method, to improve performance and prevent timeouts, avoid using pandas.concat to accumulate partial results. Instead, yield each partial result as soon as it is ready. For example, instead of:
Do this:
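Both versions sketched together (handler and grouping logic are illustrative):

```python
import pandas

# Instead of: accumulating partial results with pandas.concat and
# returning one big DataFrame at the very end of the partition.
class concat_handler:
    def end_partition(self, df):
        result = pandas.DataFrame()
        for _, chunk in df.groupby(df.iloc[:, 0]):
            # Each concat copies everything accumulated so far.
            result = pandas.concat([result, chunk])
        return result

# Do this: yield each partial result as soon as it is ready.
class yield_handler:
    def end_partition(self, df):
        for _, chunk in df.groupby(df.iloc[:, 0]):
            yield chunk

concat_handler.end_partition._sf_vectorized_input = pandas.DataFrame
yield_handler.end_partition._sf_vectorized_input = pandas.DataFrame
```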