Author: Deepak Cherian deepak@cherian.net Created: Nov 21, 2023
I propose the addition of Grouper objects to Xarray's public API so that

```python
Dataset.groupby(x=BinGrouper(bins=np.arange(0, 10, 2)))
```

is identical to today's syntax:

```python
Dataset.groupby_bins("x", bins=np.arange(0, 10, 2))
```

Xarray's GroupBy API implements the split-apply-combine pattern (Wickham, 2011)[^1], which applies to a very large number of problems: histogramming, compositing, climatological averaging, resampling to a different time frequency, etc. The pattern abstracts the following pseudocode:
```python
results = []
for element in unique_labels:
    subset = ds.sel(x=(ds.x == element))  # split
    # subset = ds.where(ds.x == element, drop=True)  # alternative
    result = subset.mean()  # apply
    results.append(result)
xr.concat(results)  # combine
```

to

```python
ds.groupby('x').mean()  # splits, applies, and combines
```

Efficient vectorized implementations of this pattern are provided by numpy's `ufunc.at`, `ufunc.reduceat`, `numbagg.grouped`, `numpy_groupies`, and probably more.
These vectorized implementations all require, as input, an array of integer codes or labels that identify unique elements in the array being grouped over ('x' in the example above).
```python
import numpy as np

# array to reduce
a = np.array([1, 1, 1, 1, 2])
# initial value for result
out = np.zeros((3,), dtype=int)
# integer codes
labels = np.array([0, 0, 1, 2, 1])
# groupby-reduction
np.add.at(out, labels, a)
out  # array([2, 3, 1])
```

One can 'factorize' or construct such an array of integer codes using `pandas.factorize` or `numpy.unique(..., return_inverse=True)` for categorical arrays; or `pandas.cut`, `pandas.qcut`, or `np.digitize` for discretizing continuous variables.
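To make the factorization step concrete, here is a minimal sketch of producing integer codes with pandas, for both a categorical array and a continuous one (the arrays are invented for illustration):

```python
import numpy as np
import pandas as pd

# categorical data: factorize returns integer codes plus the unique values
codes, uniques = pd.factorize(np.array(["a", "b", "a", "c"]))
print(codes)    # [0 1 0 2]
print(uniques)  # ['a' 'b' 'c']

# continuous data: cut discretizes into bins; .codes holds the integer labels
binned = pd.cut(np.array([0.1, 2.5, 7.3]), bins=[0, 5, 10])
print(binned.codes)  # [0 0 1]
```

Either way, the result is exactly the `labels` array that the vectorized reductions above require.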
In practice, since GroupBy objects already exist, much of the complexity in applying the groupby paradigm stems from appropriately factorizing, or generating labels for, the operation.
Anecdotally, less experienced users commonly resort to the for-loopy implementation illustrated by the pseudocode above when the analysis at hand is not easily expressed using the API presented by Xarray's GroupBy object. Xarray's GroupBy API today abstracts away the split, apply, and combine stages but not the "factorize" stage. Grouper objects will close the gap.
Grouper objects
- Will abstract useful factorization algorithms, and
- Present a natural way to extend GroupBy to grouping by multiple variables: `ds.groupby(x=BinGrouper(...), t=Resampler(freq="M", ...)).mean()`.
In addition, Grouper objects provide a nice interface for adding often-requested grouping functionality:

- A new `SpaceResampler` would allow specifying resampling of spatial dimensions (issue).
- A `RollingTimeResampler` would allow rolling-like functionality that understands timestamps (issue).
- A `QuantileBinGrouper` to abstract away `pd.qcut` (issue).
- A `SeasonGrouper` and `SeasonResampler` would abstract away common annoyances with such calculations today:
  - Support seasons that span a year-end.
  - Only include seasons with complete data coverage.
  - Allow grouping over seasons of unequal length.
  - Return results with seasons in a sensible order.
  - See this xcdat discussion for `SeasonGrouper`-like functionality.
- Weighted grouping (issue).
- Once `IntervalIndex`-like objects are supported, `Resampler` groupers can account for interval lengths when resampling.
Xarray's existing grouping functionality will be exposed using three new Groupers:

- `UniqueGrouper`, which uses `pandas.factorize`
- `BinGrouper`, which uses `pandas.cut`
- `TimeResampler`, which mimics pandas' `.resample`
Grouping by a single variable will be unaffected, so that `ds.groupby('x')` will be identical to `ds.groupby(x=UniqueGrouper())`.
Similarly, `ds.groupby_bins('x', bins=np.arange(0, 10, 2))` will be unchanged and identical to `ds.groupby(x=BinGrouper(bins=np.arange(0, 10, 2)))`.
All Grouper objects will subclass an abstract Grouper base class:
```python
import abc

class Grouper(abc.ABC):
    @abc.abstractmethod
    def factorize(self, by: DataArray):
        raise NotImplementedError

class CustomGrouper(Grouper):
    def factorize(self, by: DataArray):
        ...
        return codes, group_indices, unique_coord, full_index

    def weights(self, by: DataArray) -> DataArray:
        ...
        return weights
```

Today, the `factorize` method takes as input the group variable and returns four variables (I propose to clean this up below):
- `codes`: An array of the same shape as `group` with int dtype. NaNs in `group` are coded by `-1` and ignored later.
- `group_indices`: a list of the index locations of `group` elements that belong to a single group.
- `unique_coord`: (usually) a `pandas.Index` object of all unique `group` members present in `group`.
- `full_index`: a `pandas.Index` of all `group` members. This is different from `unique_coord` for binning and resampling, where not all groups in the output may be represented in the input `group`. For grouping by a categorical variable, e.g. `['a', 'b', 'a', 'c']`, `full_index` and `unique_coord` are identical. There is some redundancy here since `unique_coord` is always equal to, or a subset of, `full_index`. We can clean this up (see Implementation below).
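As an illustration (not Xarray's actual internals), the four values can be emulated with pandas for a binning case where one bin receives no elements, showing how `unique_coord` can be a strict subset of `full_index`:

```python
import numpy as np
import pandas as pd

# hypothetical data: the bin (2, 3] receives no elements
group = np.array([0.5, 1.5, 0.7, 3.5])
binned = pd.cut(group, bins=[0, 1, 2, 3, 4])

codes = binned.codes                          # integer code per element
full_index = binned.categories                # all four bins
unique_coord = full_index[np.unique(codes)]   # only the bins present in `group`
group_indices = [
    list(np.flatnonzero(codes == i)) for i in range(len(full_index))
]

print(codes.tolist())  # [0, 1, 0, 3]
print(group_indices)   # [[0, 2], [1], [], [3]]
```

Here `full_index` has four bins but `unique_coord` only three, mirroring the binning/resampling case described above.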
The proposed weights method is optional and unimplemented today.
Groupers with weights will allow composing weighted and groupby (issue).
The `weights` method should return an appropriate array of weights such that the following property is satisfied:

```python
gb_sum = ds.groupby(by).sum()

weights = CustomGrouper.weights(by)
weighted_sum = xr.dot(ds, weights)

assert_identical(gb_sum, weighted_sum)
```

For example, the boolean weights for `group = np.array(['a', 'b', 'c', 'a', 'a'])` should be
```
[[1, 0, 0, 1, 1],
 [0, 1, 0, 0, 0],
 [0, 0, 1, 0, 0]]
```
This is the boolean "summarization matrix" referred to in the classic Iverson (1980, Section 4.3)[^2], and the "nub sieve" in various APLs.
> **Note:** We can always construct `weights` automatically using `group_indices` from `factorize`, so this is not a required method.
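As a sketch of that property with plain numpy (a grouped sum rather than Xarray objects), the one-hot matrix built from the integer codes reproduces the boolean weights shown above, and a matrix product with it performs the grouped reduction:

```python
import numpy as np

group = np.array(["a", "b", "c", "a", "a"])
uniques, codes = np.unique(group, return_inverse=True)

# one-hot "summarization matrix": weights[i, j] = 1 iff element j is in group i
weights = (codes == np.arange(len(uniques))[:, None]).astype(int)
print(weights)
# [[1 0 0 1 1]
#  [0 1 0 0 0]
#  [0 0 1 0 0]]

# the grouped sum is then just a matrix product
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(weights @ data)  # [10.  2.  3.]
```

In the Xarray proposal, `xr.dot(ds, weights)` plays the role of this matrix product.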
For a rolling resampling, windowed weights are possible:

```
[[0.5, 1, 0.5, 0, 0],
 [0, 0.25, 1, 1, 0],
 [0, 0, 0, 1, 1]]
```
Rechunking support is another optional extension point.
In flox, I experimented with automatically rechunking to make a groupby more parallel-friendly (example 1, example 2).
A great example is for resampling-style groupby reductions, for which codes might look like
```
0001|11122|3333
```

where `|` represents chunk boundaries. A simple rechunking to

```
000|111122|3333
```

would make this resampling reduction an embarrassingly parallel blockwise problem.
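To illustrate why the aligned chunking is blockwise, here is a small numpy sketch (with invented data) where, after rechunking, every group lies within a single chunk, so each chunk can be reduced independently and the results simply concatenated:

```python
import numpy as np

# invented data; labels follow the rechunked pattern 000|111122|3333
labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3])
data = np.arange(13.0)
chunks = (3, 6, 4)  # aligned so no group straddles a chunk boundary

# reduce each chunk independently, then simply concatenate the results
results = []
start = 0
for size in chunks:
    stop = start + size
    chunk_labels = labels[start:stop]
    for lab in np.unique(chunk_labels):
        results.append(data[start:stop][chunk_labels == lab].sum())
    start = stop

# identical to a global groupby-sum over the whole array
expected = [data[labels == lab].sum() for lab in np.unique(labels)]
print(results == expected)  # True
```

With the original misaligned chunking, group `1` would straddle a boundary and require a cross-chunk combine step.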
Similarly, consider monthly-mean climatologies, for which the month numbers might be

```
1 2 3 4 5 | 6 7 8 9 10 | 11 12 1 2 3 | 4 5 6 7 8 | 9 10 11 12 |
```

A slight rechunking to

```
1 2 3 4 | 5 6 7 8 | 9 10 11 12 | 1 2 3 4 | 5 6 7 8 | 9 10 11 12 |
```

allows us to reduce `1, 2, 3, 4` separately from `5, 6, 7, 8` and `9, 10, 11, 12` while still being parallel-friendly (see the flox documentation for more).
We could attempt to detect these patterns, or we could just have the Grouper take as input chunks and return a tuple of "nice" chunk sizes to rechunk to.
```python
def preferred_chunks(self, chunks: ChunksTuple) -> ChunksTuple:
    pass
```

For monthly means, since the period of repetition of labels is 12, the Grouper might choose possible chunk sizes of `((2,), (3,), (4,), (6,))`.
For resampling, the Grouper could choose to resample to a multiple or an even fraction of the resampling frequency.
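A hypothetical helper (the name `preferred_chunk_sizes` is invented for this sketch) shows one way a Grouper could enumerate such choices: any chunk size that evenly divides the labels' period of repetition keeps whole groups within chunks.

```python
def preferred_chunk_sizes(period: int) -> tuple[int, ...]:
    """Chunk sizes that evenly divide the period of label repetition.

    Proper divisors greater than 1 guarantee chunk boundaries always
    coincide with group boundaries.
    """
    return tuple(n for n in range(2, period) if period % n == 0)

print(preferred_chunk_sizes(12))  # (2, 3, 4, 6)
```

For `period=12` (monthly labels) this recovers the candidate sizes 2, 3, 4, and 6 mentioned above.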
Pandas has Grouper objects that represent the GroupBy instruction.
However, these objects do not appear to be extension points, unlike the Grouper objects proposed here.
Instead, Pandas' ExtensionArray has a factorize method.
Composing rolling with time resampling is a common workload:

- Polars has `group_by_dynamic`, which appears to be like the proposed `RollingResampler`.
- scikit-downscale provides `PaddedDOYGrouper`.
- Get rid of `squeeze` (issue, PR).
- Merge the existing two-class implementation into a single Grouper class.
- Clean up what's returned by `factorize` methods:
  - A solution here might be to have `group_indices: Mapping[int, Sequence[int]]` be a mapping from the group index in `full_index` to a sequence of integers.
  - Return a `namedtuple` or `dataclass` from existing Grouper `factorize` methods to facilitate API changes in the future.
- Figure out what to pass to `factorize`:
  - Xarray eagerly reshapes nD variables to 1D. This is an implementation detail we need not expose.
  - When grouping by an unindexed variable, Xarray passes a `_DummyGroup` object. This seems like something we don't want in the public interface. We could special-case "internal" Groupers to preserve the optimizations in `UniqueGrouper`.
- Grouper objects will be exposed under the `xr.groupers` namespace. At first these will include `UniqueGrouper`, `BinGrouper`, and `TimeResampler`.
One major design choice made here was to adopt the syntax `ds.groupby(x=BinGrouper(...))` instead of `ds.groupby(BinGrouper('x', ...))`.
This allows reuse of Grouper objects, for example:

```python
grouper = BinGrouper(...)
ds.groupby(x=grouper, y=grouper)
```

but requires that all variables being grouped by (`x` and `y` above) are present in the Dataset `ds`. This does not seem like a bad requirement.
Importantly, Grouper instances will be copied internally so that they can safely cache state that might be shared between `factorize` and `weights`.
Today, it is possible to write `ds.groupby(DataArray, ...)`. This syntax will still be supported.
This document has been placed in the public domain.
Footnotes

[^1]: Wickham, H. (2011). The split-apply-combine strategy for data analysis. https://vita.had.co.nz/papers/plyr.html
[^2]: Iverson, K.E. (1980). Notation as a tool of thought. Commun. ACM 23, 8 (Aug. 1980), 444–465. https://doi.org/10.1145/358896.358899