Skip to content

Latest commit

 

History

History
240 lines (190 loc) · 12.1 KB

File metadata and controls

240 lines (190 loc) · 12.1 KB

Grouper Objects

Author: Deepak Cherian deepak@cherian.net Created: Nov 21, 2023

Abstract

I propose the addition of Grouper objects to Xarray's public API so that

Dataset.groupby(x=BinGrouper(bins=np.arange(10, 2))))

is identical to today's syntax:

Dataset.groupby_bins("x", bins=np.arange(10, 2))

Motivation and scope

Xarray's GroupBy API implements the split-apply-combine pattern (Wickham, 2011)1, which applies to a very large number of problems: histogramming, compositing, climatological averaging, resampling to a different time frequency, etc. The pattern abstracts the following pseudocode:

results = []
for element in unique_labels:
    subset = ds.sel(x=(ds.x == element))  # split
    # subset = ds.where(ds.x == element, drop=True)  # alternative
    result = subset.mean() # apply
    results.append(result)

xr.concat(results)  # combine

to

ds.groupby('x').mean()  # splits, applies, and combines

Efficient vectorized implementations of this pattern are implemented in numpy's ufunc.at, ufunc.reduceat, numbagg.grouped, numpy_groupies, and probably more. These vectorized implementations all require, as input, an array of integer codes or labels that identify unique elements in the array being grouped over ('x' in the example above).

import numpy as np

# array to reduce
a = np.array([1, 1, 1, 1, 2])

# initial value for result
out = np.zeros((3,), dtype=int)

# integer codes
labels = np.array([0, 0, 1, 2, 1])

# groupby-reduction
np.add.at(out, labels, a)
out  # array([2, 3, 1])

One can 'factorize' or construct such an array of integer codes using pandas.factorize or numpy.unique(..., return_inverse=True) for categorical arrays; pandas.cut, pandas.qcut, or np.digitize for discretizing continuous variables. In practice, since GroupBy objects exist, much of complexity in applying the groupby paradigm stems from appropriately factorizing or generating labels for the operation. Consider these two examples:

  1. Bins that vary in a dimension
  2. Overlapping groups
  3. Rolling resampling

Anecdotally, less experienced users commonly resort to the for-loopy implementation illustrated by the pseudocode above when the analysis at hand is not easily expressed using the API presented by Xarray's GroupBy object. Xarray's GroupBy API today abstracts away the split, apply, and combine stages but not the "factorize" stage. Grouper objects will close the gap.

Usage and impact

Grouper objects

  1. Will abstract useful factorization algorithms, and
  2. Present a natural way to extend GroupBy to grouping by multiple variables: ds.groupby(x=BinGrouper(...), t=Resampler(freq="M", ...)).mean().

In addition, Grouper objects provide a nice interface to add often-requested grouping functionality

  1. A new SpaceResampler would allow specifying resampling spatial dimensions. (issue)
  2. RollingTimeResampler would allow rolling-like functionality that understands timestamps (issue)
  3. A QuantileBinGrouper to abstract away pd.cut (issue)
  4. A SeasonGrouper and SeasonResampler would abstract away common annoyances with such calculations today
    1. Support seasons that span a year-end.
    2. Only include seasons with complete data coverage.
    3. Allow grouping over seasons of unequal length
    4. See this xcdat discussion for a SeasonGrouper like functionality:
    5. Return results with seasons in a sensible order
  5. Weighted grouping (issue)
    1. Once IntervalIndex like objects are supported, Resampler groupers can account for interval lengths when resampling.

Backward Compatibility

Xarray's existing grouping functionality will be exposed using two new Groupers:

  1. UniqueGrouper which uses pandas.factorize
  2. BinGrouper which uses pandas.cut
  3. TimeResampler which mimics pandas' .resample

Grouping by single variables will be unaffected so that ds.groupby('x') will be identical to ds.groupby(x=UniqueGrouper()). Similarly, ds.groupby_bins('x', bins=np.arange(10, 2)) will be unchanged and identical to ds.groupby(x=BinGrouper(bins=np.arange(10, 2))).

Detailed description

All Grouper objects will subclass from a Grouper object

import abc

class Grouper(abc.ABC):
    @abc.abstractmethod
    def factorize(self, by: DataArray):
        raise NotImplementedError

class CustomGrouper(Grouper):
    def factorize(self, by: DataArray):
        ...
        return codes, group_indices, unique_coord, full_index

    def weights(self, by: DataArray) -> DataArray:
        ...
        return weights

The factorize method

Today, the factorize method takes as input the group variable and returns 4 variables (I propose to clean this up below):

  1. codes: An array of same shape as the group with int dtype. NaNs in group are coded by -1 and ignored later.
  2. group_indices is a list of index location of group elements that belong to a single group.
  3. unique_coord is (usually) a pandas.Index object of all unique group members present in group.
  4. full_index is a pandas.Index of all group members. This is different from unique_coord for binning and resampling, where not all groups in the output may be represented in the input group. For grouping by a categorical variable e.g. ['a', 'b', 'a', 'c'], full_index and unique_coord are identical. There is some redundancy here since unique_coord is always equal to or a subset of full_index. We can clean this up (see Implementation below).

The weights method (?)

The proposed weights method is optional and unimplemented today. Groupers with weights will allow composing weighted and groupby (issue). The weights method should return an appropriate array of weights such that the following property is satisfied

gb_sum = ds.groupby(by).sum()

weights = CustomGrouper.weights(by)
weighted_sum = xr.dot(ds, weights)

assert_identical(gb_sum, weighted_sum)

For example, the boolean weights for group=np.array(['a', 'b', 'c', 'a', 'a']) should be

[[1, 0, 0, 1, 1],
 [0, 1, 0, 0, 0],
 [0, 0, 1, 0, 0]]

This is the boolean "summarization matrix" referred to in the classic Iverson (1980, Section 4.3)2 and "nub sieve" in various APLs.

Note

We can always construct weights automatically using group_indices from factorize, so this is not a required method.

For a rolling resampling, windowed weights are possible

[[0.5, 1,    0.5, 0, 0],
 [0,   0.25, 1,   1, 0],
 [0,   0,    0,   1, 1]]

The preferred_chunks method (?)

Rechunking support is another optional extension point. In flox I experimented some with automatically rechunking to make a groupby more parallel-friendly (example 1, example 2). A great example is for resampling-style groupby reductions, for which codes might look like

0001|11122|3333

where | represents chunk boundaries. A simple rechunking to

000|111122|3333

would make this resampling reduction an embarrassingly parallel blockwise problem.

Similarly consider monthly-mean climatologies for which the month numbers might be

1 2 3 4 5 | 6 7 8 9 10 | 11 12 1 2 3 | 4 5 6 7 8 | 9 10 11 12 |

A slight rechunking to

1 2 3 4 | 5 6 7 8 | 9 10 11 12 | 1 2 3 4 | 5 6 7 8 | 9 10 11 12 |

allows us to reduce 1, 2, 3, 4 separately from 5,6,7,8 and 9, 10, 11, 12 while still being parallel friendly (see the flox documentation for more).

We could attempt to detect these patterns, or we could just have the Grouper take as input chunks and return a tuple of "nice" chunk sizes to rechunk to.

def preferred_chunks(self, chunks: ChunksTuple) -> ChunksTuple:
    pass

For monthly means, since the period of repetition of labels is 12, the Grouper might choose possible chunk sizes of ((2,),(3,),(4,),(6,)). For resampling, the Grouper could choose to resample to a multiple or an even fraction of the resampling frequency.

Related work

Pandas has Grouper objects that represent the GroupBy instruction. However, these objects do not appear to be extension points, unlike the Grouper objects proposed here. Instead, Pandas' ExtensionArray has a factorize method.

Composing rolling with time resampling is a common workload:

  1. Polars has group_by_dynamic which appears to be like the proposed RollingResampler.
  2. scikit-downscale provides PaddedDOYGrouper

Implementation Proposal

  1. Get rid of squeeze issue: PR
  2. Merge existing two class implementation to a single Grouper class
    1. This design was implemented in this PR to account for some annoying data dependencies.
    2. See PR
  3. Clean up what's returned by factorize methods.
    1. A solution here might be to have group_indices: Mapping[int, Sequence[int]] be a mapping from group index in full_index to a sequence of integers.
    2. Return a namedtuple or dataclass from existing Grouper factorize methods to facilitate API changes in the future.
  4. Figure out what to pass to factorize
    1. Xarray eagerly reshapes nD variables to 1D. This is an implementation detail we need not expose.
    2. When grouping by an unindexed variable Xarray passes a _DummyGroup object. This seems like something we don't want in the public interface. We could special case "internal" Groupers to preserve the optimizations in UniqueGrouper.
  5. Grouper objects will exposed under the xr.groupers Namespace. At first these will include UniqueGrouper, BinGrouper, and TimeResampler.

Alternatives

One major design choice made here was to adopt the syntax ds.groupby(x=BinGrouper(...)) instead of ds.groupby(BinGrouper('x', ...)). This allows reuse of Grouper objects, example

grouper = BinGrouper(...)
ds.groupby(x=grouper, y=grouper)

but requires that all variables being grouped by (x and y above) are present in Dataset ds. This does not seem like a bad requirement. Importantly Grouper instances will be copied internally so that they can safely cache state that might be shared between factorize and weights.

Today, it is possible to ds.groupby(DataArray, ...). This syntax will still be supported.

Discussion

This proposal builds on these discussions:

  1. xarray-contrib/flox#191 (comment)
  2. pydata#6610

Copyright

This document has been placed in the public domain.

References and footnotes

Footnotes

  1. Wickham, H. (2011). The split-apply-combine strategy for data analysis. https://vita.had.co.nz/papers/plyr.html

  2. Iverson, K.E. (1980). Notation as a tool of thought. Commun. ACM 23, 8 (Aug. 1980), 444–465. https://doi.org/10.1145/358896.358899