How to reduce dimensions (sum/min/any/all)#
After elementwise functions, dimension-reducer functions are the most commonly used. These functions replace a list of numbers with a single, scalar number by adding, multiplying, minimizing, maximizing, or performing logical-or (“any”) or logical-and (“all”).
These are also called aggregation functions; in relational databases, SQL, and data-frames, aggregations are applied after a “group by” operation. Awkward Array doesn’t have “group by” operations; lists are already grouped.
import awkward as ak
import numpy as np
First reducer: ak.sum#
To illustrate all of these functions, let’s consider addition. Given an array:
array = ak.Array([[1, 2, 3], [4, 5], [], [6]])
ak.sum() with no arguments adds all of the values in the nested lists, just like np.sum.
ak.sum(array)
21
With Awkward Arrays, it’s usually more useful to supply an axis argument to reduce one dimension, rather than all dimensions.
For reasons that will be explained below, axis=-1 is the most frequently useful.
ak.sum(array, axis=-1)
[6, 9, 0, 6] --------------- type: 4 * int64
The axis argument#
Before getting deeper into the axis argument, let’s consider a NumPy array with more dimensions.
array3d = np.array([
    [
        [    1,     2,     3,     4,     5],
        [   10,    20,    30,    40,    50],
        [  100,   200,   300,   400,   500],
    ],
    [
        [0.1  , 0.2  , 0.3  , 0.4  , 0.5  ],
        [0.01 , 0.02 , 0.03 , 0.04 , 0.05 ],
        [0.001, 0.002, 0.003, 0.004, 0.005],
    ],
])
with np.printoptions(suppress=True):
    print(array3d)
[[[ 1. 2. 3. 4. 5. ]
[ 10. 20. 30. 40. 50. ]
[100. 200. 300. 400. 500. ]]
[[ 0.1 0.2 0.3 0.4 0.5 ]
[ 0.01 0.02 0.03 0.04 0.05 ]
[ 0.001 0.002 0.003 0.004 0.005]]]
This array has 3 dimensions, so in addition to axis=None (reduce everything to a scalar), there are 3 possible axis values.
The first case, axis=0, adds the first 3×5 block to the second 3×5 block, i.e. summing over the first (length-2) dimension. Thus, the 1 is added to 0.1, the 2 is added to 0.2, and so on until the 500 is added to 0.005.
with np.printoptions(suppress=True):
    print(np.sum(array3d, axis=0))
[[ 1.1 2.2 3.3 4.4 5.5 ]
[ 10.01 20.02 30.03 40.04 50.05 ]
[100.001 200.002 300.003 400.004 500.005]]
The second case, axis=1, adds vertically within each 3×5 block, i.e. summing over the second (length-3) dimension. What’s left are two lists of length 5.
with np.printoptions(suppress=True):
    print(np.sum(array3d, axis=1))
[[111. 222. 333. 444. 555. ]
[ 0.111 0.222 0.333 0.444 0.555]]
The third case, axis=2, adds horizontally within each 3×5 block, i.e. summing over the third (length-5) dimension. What’s left are two lists of length 3.
with np.printoptions(suppress=True):
    print(np.sum(array3d, axis=2))
[[ 15. 150. 1500. ]
[ 1.5 0.15 0.015]]
Since a negative axis counts from the other end of the scale,
axis=0 is equivalent to axis=-3,
axis=1 is equivalent to axis=-2,
axis=2 is equivalent to axis=-1.
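The correspondence between positive and negative axis values can be checked directly. The following is a minimal sketch in NumPy; the array named example is a hypothetical stand-in with arbitrary values, not the array3d defined above.

```python
import numpy as np

# Hypothetical 2×3×5 array with arbitrary values, standing in for array3d.
example = np.arange(30, dtype=np.float64).reshape(2, 3, 5)

# Reducing over one axis removes that dimension from the shape.
shapes = {axis: np.sum(example, axis=axis).shape for axis in range(example.ndim)}
print(shapes)

# For a 3-dimensional array, axis=k and axis=k-3 name the same dimension.
for axis in range(example.ndim):
    negative_axis = axis - example.ndim
    assert np.array_equal(
        np.sum(example, axis=axis),
        np.sum(example, axis=negative_axis),
    )
```

Each reduction drops exactly one entry from the shape (2, 3, 5), whichever sign is used to name the axis.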
The axis argument with ragged lists#
Awkward Arrays allow the lengths of lists in an array to differ, so we can have
array_ragged = ak.Array([
[ 1, 2, 3 ],
[ 10, 20 ],
[100, 200, 300, 400],
])
array_ragged
[[1, 2, 3], [10, 20], [100, 200, 300, 400]] ---------------------- type: 3 * var * int64
As before, axis=-1 sums over the innermost lists, replacing each of the 3 horizontal rows with a sum.
ak.sum(array_ragged, axis=-1)
[6, 30, 1000] --------------- type: 3 * int64
And axis=-2 (the same as axis=0 for this two-dimensional array) sums vertically, replacing each of the 4 vertical columns with a sum. Since the list lengths differ, some of the places where we might expect to see a value are empty gaps, which contribute nothing to the result.
ak.sum(array_ragged, axis=0)
[111, 222, 303, 400] --------------- type: 4 * int64
We also have to choose a convention: should the values be left-aligned or right-aligned within their lists? Awkward Array chooses left-aligned.
In ragged data from real datasets, summing over whole lists usually has more meaning than summing over parts of different lists, so axis=-1 is usually the most meaningful choice of axis.
The axis argument with missing data#
Just as empty gaps contribute nothing to the sum, missing values (None) don’t contribute anything, either.
array_ragged = ak.Array([
[None, None, 3, 4],
[ 10, None, 30 ],
[ 100, 200, 300, 400],
])
array_ragged
[[None, None, 3, 4], [10, None, 30], [100, 200, 300, 400]] ---------------------- type: 3 * var * ?int64
axis=-1 sums over each inner list, horizontally, replacing it with a scalar.
ak.sum(array_ragged, axis=-1)
[7, 40, 1000] --------------- type: 3 * int64
And axis=-2 sums over the outer dimension, vertically.
ak.sum(array_ragged, axis=-2)
[110, 200, 333, 404] --------------- type: 4 * int64
For ak.sum(), each None has the same effect as a 0 value; for ak.prod() (multiplication), each None has the same effect as a 1 value, etc.
The keepdims argument#
Sometimes, you want to replace lists with a length-1 list, rather than a scalar. keepdims=True does that.
ak.sum(array_ragged, axis=-1, keepdims=True)
[[7], [40], [1000]] ------------------- type: 3 * 1 * int64
ak.sum(array_ragged, axis=-2, keepdims=True)
[[110, 200, 333, 404]] ---------------------- type: 1 * var * int64
The keepdims argument is particularly useful for ak.argmin() and ak.argmax(), which return positions in a list where the value is minimized or maximized. Those positions can only be used as slice indexes if they’re at the right nesting level, which keepdims=True maintains.
Other reducers#
The ak.prod() reducer multiplies, rather than adding.
ak.min() and ak.max() minimize and maximize, returning None for empty lists.
ak.argmin() and ak.argmax() return the index positions of the minimum or maximum value, with None for empty lists.
ak.nansum(), ak.nanprod(), ak.nanmin(), ak.nanmax(), ak.nanargmin(), and ak.nanargmax() ignore floating-point nan values before operating, the way that all reducers ignore None values before operating.
ak.count_nonzero() counts non-zero values.
ak.count() simply counts values. In NumPy, there’s no need for such a function because it would return constants (drawn from the NumPy array’s shape), but for ragged arrays, it counts the number of values that enter into a reduction. ak.num() also returns lengths of lists, but in a way that’s more useful for slicing; ak.count() is useful as the denominator of expressions in which another reducer (with the same axis and keepdims choices) is in the numerator.
ak.any() and ak.all() reduce like logical-or and logical-and, which makes them particularly useful in slices (below).
Reducing over “any” and “all”#
ak.any() and ak.all() reduce boolean arrays, asking if a predicate is satisfied by “any” item or “all” items, respectively.
array_bool = ak.Array([
[False, False, True, True],
[False, True, False, True],
[False, True, True, True],
])
array_bool
[[False, False, True, True], [False, True, False, True], [False, True, True, True]] ---------------------------- type: 3 * var * bool
ak.any(array_bool, axis=-1)
[True, True, True] -------------- type: 3 * bool
ak.any(array_bool, axis=-2)
[False, True, True, True] -------------- type: 4 * bool
ak.all(array_bool, axis=-1)
[False, False, False] -------------- type: 3 * bool
ak.all(array_bool, axis=-2)
[False, False, False, True] -------------- type: 4 * bool
Since logical-or is like addition of booleans and logical-and is like multiplication, these reducers could have been replaced with ak.sum() and ak.prod(), but they’re very useful to have because they make some boolean-array slices easier to read.
array = ak.Array([[0, 1, 2], [], [-3, 4], [-5], [-6, -7, -8, -9]])
array
[[0, 1, 2], [], [-3, 4], [-5], [-6, -7, -8, -9]] --------------------- type: 5 * var * int64
Select whole lists if any of their values are negative:
array[ak.any(array < 0, axis=-1)]
[[-3, 4], [-5], [-6, -7, -8, -9]] --------------------- type: 3 * var * int64
Select whole lists if all of their values are negative:
array[ak.all(array < 0, axis=-1)]
[[], [-5], [-6, -7, -8, -9]] --------------------- type: 3 * var * int64
(If a list is empty, all of its elements satisfy a constraint.)
In both cases above, the selection can be read like an English sentence, “select lists if any…” or “select lists if all…”.
Heterogeneous data and records cannot be reduced#
Neither of these two kinds of data can be reduced. Heterogeneous (union-type) data allows the items of a single array to have different numbers of dimensions, so the problem is ill-posed:
ak.sum(ak.Array([[1.1, 2.2, 3.3], [], 4.4, 5.5]))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[25], line 1
----> 1 ak.sum(ak.Array([[1.1, 2.2, 3.3], [], 4.4, 5.5]))
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_dispatch.py:64, in named_high_level_function.<locals>.dispatch(*args, **kwargs)
62 # Failed to find a custom overload, so resume the original function
63 try:
---> 64 next(gen_or_result)
65 except StopIteration as err:
66 return err.value
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/operations/ak_sum.py:210, in sum(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)
207 yield (array,)
209 # Implementation
--> 210 return _impl(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/operations/ak_sum.py:277, in _impl(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)
274 layout = ctx.unwrap(array, allow_record=False, primitive_policy="error")
275 reducer = ak._reducers.Sum()
--> 277 out = ak._do.reduce(
278 layout,
279 reducer,
280 axis=axis,
281 mask=mask_identity,
282 keepdims=keepdims,
283 behavior=ctx.behavior,
284 )
285 return ctx.wrap(out, highlevel=highlevel, allow_other=True)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_do.py:244, in reduce(layout, reducer, axis, mask, keepdims, behavior)
232 parts = remove_structure(
233 layout,
234 flatten_records=False,
(...)
238 list_to_regular=True,
239 )
241 if len(parts) > 1:
242 # We know that `flatten_records` must fail, so the only other type
243 # that can return multiple parts here is the union array
--> 244 raise ValueError(
245 "cannot use axis=None on an array containing irreducible unions"
246 )
247 elif len(parts) == 0:
248 layout = ak.contents.EmptyArray()
ValueError: cannot use axis=None on an array containing irreducible unions
This error occurred while calling
ak.sum(
<Array [[1.1, 2.2, 3.3], [], 4.4, 5.5] type='4 * union[var * float6...'>
)
And records are sometimes used to represent data with coordinates; applying ak.sum() to non-Cartesian coordinates would be a subtle error.
ak.sum(ak.Array([{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}]), axis=-1)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[26], line 1
----> 1 ak.sum(ak.Array([{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}]), axis=-1)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_dispatch.py:64, in named_high_level_function.<locals>.dispatch(*args, **kwargs)
62 # Failed to find a custom overload, so resume the original function
63 try:
---> 64 next(gen_or_result)
65 except StopIteration as err:
66 return err.value
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/operations/ak_sum.py:210, in sum(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)
207 yield (array,)
209 # Implementation
--> 210 return _impl(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/operations/ak_sum.py:277, in _impl(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)
274 layout = ctx.unwrap(array, allow_record=False, primitive_policy="error")
275 reducer = ak._reducers.Sum()
--> 277 out = ak._do.reduce(
278 layout,
279 reducer,
280 axis=axis,
281 mask=mask_identity,
282 keepdims=keepdims,
283 behavior=ctx.behavior,
284 )
285 return ctx.wrap(out, highlevel=highlevel, allow_other=True)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_do.py:296, in reduce(layout, reducer, axis, mask, keepdims, behavior)
294 parents = ak.index.Index64.zeros(layout.length, layout.backend.index_nplike)
295 shifts = None
--> 296 next = layout._reduce_next(
297 reducer,
298 negaxis,
299 starts,
300 shifts,
301 parents,
302 1,
303 mask,
304 keepdims,
305 behavior,
306 )
308 return next[0]
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/recordarray.py:888, in RecordArray._reduce_next(self, reducer, negaxis, starts, shifts, parents, outlength, mask, keepdims, behavior)
886 reducer_recordclass = find_record_reducer(reducer, self, behavior)
887 if reducer_recordclass is None:
--> 888 raise TypeError(
889 "no ak.{} overloads for custom types: {}".format(
890 reducer.name, ", ".join(self.fields)
891 )
892 )
893 else:
894 # Positional reducers ultimately need to do more work when rebuilding the result
895 # so asking for a mask doesn't help us!
896 reducer_should_mask = mask and not reducer.needs_position
TypeError: no ak.sum overloads for custom types: x, y
This error occurred while calling
ak.sum(
<Array [{x: 1.1, y: [1]}, {...}] type='2 * {x: float64, y: var * in...'>
axis = -1
)