How to filter arrays: cutting vs. masking
import awkward as ak
import numpy as np
The problem with slicing
When you write a mathematical formula using binary operators like + and *, or NumPy universal functions (ufuncs) like np.sqrt, the shapes of nested lists must align. If the arrays in an expression were derived from a single array, this is often automatic. For instance,
original_array = ak.Array([
[
{"title": "zero", "x": 0, "y": 0},
{"title": "one", "x": 1, "y": 1.1},
{"title": "two", "x": 2, "y": 2.2},
],
[],
[
{"title": "three", "x": 3, "y": 3.3},
{"title": "four", "x": 4, "y": 4.4},
],
[
{"title": "five", "x": 5, "y": 5.5},
],
[
{"title": "six", "x": 6, "y": 6.6},
{"title": "seven", "x": 7, "y": 7.7},
{"title": "eight", "x": 8, "y": 8.8},
{"title": "nine", "x": 9, "y": 9.9},
],
])
array_x = original_array.x
array_y = original_array.y
Both array_x and array_y have the same number of lists and the same number of items in each list because they were both sliced from original_array.
array_x
[[0, 1, 2],
 [],
 [3, 4],
 [5],
 [6, 7, 8, 9]]
---------------------
backend: cpu
nbytes: 128 B
type: 5 * var * int64
array_y
[[0, 1.1, 2.2],
 [],
 [3.3, 4.4],
 [5.5],
 [6.6, 7.7, 8.8, 9.9]]
-----------------------
backend: cpu
nbytes: 128 B
type: 5 * var * float64
Thus, they can be used together in a mathematical formula.
array_x**2 + array_y**2
[[0, 2.21, 8.84],
 [],
 [19.9, 35.4],
 [55.2],
 [79.6, 108, 141, 179]]
-----------------------
backend: cpu
nbytes: 128 B
type: 5 * var * float64
However, if only one array is sliced, or if the two arrays are sliced by different criteria, they may no longer line up:
sliced_x = array_x[array_x > 3]
sliced_y = array_y[array_y > 3]
sliced_x
[[],
 [],
 [4],
 [5],
 [6, 7, 8, 9]]
---------------------
backend: cpu
nbytes: 96 B
type: 5 * var * int64
sliced_y
[[],
 [],
 [3.3, 4.4],
 [5.5],
 [6.6, 7.7, 8.8, 9.9]]
-----------------------
backend: cpu
nbytes: 104 B
type: 5 * var * float64
Notice that the first was sliced with array_x > 3 and the second was sliced with array_y > 3, and as a result, the third list differs in length between the two arrays:
sliced_x[2], sliced_y[2]
(<Array [4] type='1 * int64'>, <Array [3.3, 4.4] type='2 * float64'>)
If we try to use these together, we get a ValueError:
sliced_x**2 + sliced_y**2
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[11], line 1
----> 1 sliced_x**2 + sliced_y**2
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_operators.py:54, in _binary_method.<locals>.func(self, other)
51 if _disables_array_ufunc(other):
52 return NotImplemented
---> 54 return ufunc(self, other)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1616, in Array.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
1551 """
1552 Intercepts attempts to pass this Array to a NumPy
1553 [universal functions](https://docs.scipy.org/doc/numpy/reference/ufuncs.html)
(...)
1613 See also #__array_function__.
1614 """
1615 name = f"{type(ufunc).__module__}.{ufunc.__name__}.{method!s}"
-> 1616 with ak._errors.OperationErrorContext(name, inputs, kwargs):
1617 return ak._connect.numpy.array_ufunc(ufunc, method, inputs, kwargs)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_errors.py:80, in ErrorContext.__exit__(self, exception_type, exception_value, traceback)
78 self._slate.__dict__.clear()
79 # Handle caught exception
---> 80 raise self.decorate_exception(exception_type, exception_value)
81 else:
82 # Step out of the way so that another ErrorContext can become primary.
83 if self.primary() is self:
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1617, in Array.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
1615 name = f"{type(ufunc).__module__}.{ufunc.__name__}.{method!s}"
1616 with ak._errors.OperationErrorContext(name, inputs, kwargs):
-> 1617 return ak._connect.numpy.array_ufunc(ufunc, method, inputs, kwargs)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_connect/numpy.py:469, in array_ufunc(ufunc, method, inputs, kwargs)
461 raise TypeError(
462 "no {}.{} overloads for custom types: {}".format(
463 type(ufunc).__module__, ufunc.__name__, ", ".join(error_message)
464 )
465 )
467 return None
--> 469 out = ak._broadcasting.broadcast_and_apply(
470 inputs,
471 action,
472 depth_context=depth_context,
473 lateral_context=lateral_context,
474 allow_records=False,
475 function_name=ufunc.__name__,
476 )
478 out_named_axis = functools.reduce(
479 _unify_named_axis, lateral_context[NAMED_AXIS_KEY].named_axis
480 )
481 if len(out) == 1:
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1200, in broadcast_and_apply(inputs, action, depth_context, lateral_context, allow_records, left_broadcast, right_broadcast, numpy_to_regular, regular_to_jagged, function_name, broadcast_parameters_rule)
1198 backend = backend_of(*inputs, coerce_to_common=False)
1199 isscalar = []
-> 1200 out = apply_step(
1201 backend,
1202 broadcast_pack(inputs, isscalar),
1203 action,
1204 0,
1205 depth_context,
1206 lateral_context,
1207 {
1208 "allow_records": allow_records,
1209 "left_broadcast": left_broadcast,
1210 "right_broadcast": right_broadcast,
1211 "numpy_to_regular": numpy_to_regular,
1212 "regular_to_jagged": regular_to_jagged,
1213 "function_name": function_name,
1214 "broadcast_parameters_rule": broadcast_parameters_rule,
1215 },
1216 )
1217 assert isinstance(out, tuple)
1218 return tuple(broadcast_unpack(x, isscalar) for x in out)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1178, in apply_step(backend, inputs, action, depth, depth_context, lateral_context, options)
1176 return result
1177 elif result is None:
-> 1178 return continuation()
1179 else:
1180 raise AssertionError(result)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1147, in apply_step.<locals>.continuation()
1145 # Any non-string list-types?
1146 elif any(x.is_list and not is_string_like(x) for x in contents):
-> 1147 return broadcast_any_list()
1149 # Any RecordArrays?
1150 elif any(x.is_record for x in contents):
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:671, in apply_step.<locals>.broadcast_any_list()
668 nextinputs.append(x)
669 nextparameters.append(NO_PARAMETERS)
--> 671 outcontent = apply_step(
672 backend,
673 nextinputs,
674 action,
675 depth + 1,
676 copy.copy(depth_context),
677 lateral_context,
678 options,
679 )
680 assert isinstance(outcontent, tuple)
681 parameters = parameters_factory(nextparameters, len(outcontent))
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1178, in apply_step(backend, inputs, action, depth, depth_context, lateral_context, options)
1176 return result
1177 elif result is None:
-> 1178 return continuation()
1179 else:
1180 raise AssertionError(result)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1147, in apply_step.<locals>.continuation()
1145 # Any non-string list-types?
1146 elif any(x.is_list and not is_string_like(x) for x in contents):
-> 1147 return broadcast_any_list()
1149 # Any RecordArrays?
1150 elif any(x.is_record for x in contents):
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:722, in apply_step.<locals>.broadcast_any_list()
718 for i, ((named_axis, ndim), x, x_is_string) in enumerate(
719 zip(named_axes_with_ndims, inputs, input_is_string)
720 ):
721 if isinstance(x, listtypes) and not x_is_string:
--> 722 next_content = broadcast_to_offsets_avoiding_carry(x, offsets)
723 nextinputs.append(next_content)
724 nextparameters.append(x._parameters)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:385, in broadcast_to_offsets_avoiding_carry(list_content, offsets)
383 return list_content.content[:next_length]
384 else:
--> 385 return list_content._broadcast_tooffsets64(offsets).content
386 elif isinstance(list_content, ListArray):
387 # Is this list contiguous?
388 if index_nplike.array_equal(
389 list_content.starts.data[1:], list_content.stops.data[:-1]
390 ):
391 # Does this list match the offsets?
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/listoffsetarray.py:420, in ListOffsetArray._broadcast_tooffsets64(self, offsets)
415 next_content = self._content[this_start:]
417 if index_nplike.known_data and not index_nplike.array_equal(
418 this_zero_offsets, offsets.data
419 ):
--> 420 raise ValueError("cannot broadcast nested list")
422 return ListOffsetArray(
423 offsets, next_content[: offsets[-1]], parameters=self._parameters
424 )
ValueError: cannot broadcast nested list
This error occurred while calling
numpy.add.__call__(
<Array [[], [], ..., [25], [36, 49, 64, 81]] type='5 * var * int64'>
<Array [[], [], ..., [43.6, 59.3, 77.4, 98]] type='5 * var * float64'>
)
Sometimes, these misalignments are overt, but sometimes they’re subtle and embedded deep within a very large array. You can start investigating a problem like this with ak.num():
ak.num(sliced_x) != ak.num(sliced_y)
[False,
 False,
 True,
 False,
 False]
--------------
backend: cpu
nbytes: 5 B
type: 5 * bool
np.nonzero(ak.to_numpy(ak.num(sliced_x) != ak.num(sliced_y)))
(array([2]),)
But it’s also possible to avoid them in the first place.
Masking with missing values
The problem was that the two arrays’ shapes changed differently; instead, we’ll slice them in such a way that their shapes don’t change at all.
The ak.mask() function takes a boolean array, like a slice, but instead of removing the values that line up with False, it replaces them with None.
ak.mask(array_x, array_x > 3)
[[None, None, None],
 [],
 [None, 4],
 [5],
 [6, 7, 8, 9]]
----------------------
backend: cpu
nbytes: 138 B
type: 5 * var * ?int64
It can also be accessed as an array property, with square brackets, so that it resembles a slice:
masked_x = array_x.mask[array_x > 3]
masked_y = array_y.mask[array_y > 3]
masked_x
[[None, None, None],
 [],
 [None, 4],
 [5],
 [6, 7, 8, 9]]
----------------------
backend: cpu
nbytes: 138 B
type: 5 * var * ?int64
masked_y
[[None, None, None],
 [],
 [3.3, 4.4],
 [5.5],
 [6.6, 7.7, 8.8, 9.9]]
------------------------
backend: cpu
nbytes: 138 B
type: 5 * var * ?float64
The results of these two masks can be used in a mathematical expression because they line up:
result = masked_x**2 + masked_y**2
result
[[None, None, None],
 [],
 [None, 35.4],
 [55.2],
 [79.6, 108, 141, 179]]
------------------------
backend: cpu
nbytes: 176 B
type: 5 * var * ?float64
Now only one problem remains: the None (missing) values might be undesirable in the output. There are several ways to get rid of them:
- ak.drop_none() eliminates None, like a slice, but it can be done once at the end of a calculation,
- ak.fill_none() replaces None with a chosen value,
- ak.flatten() removes list structure, and if the None values are at the level of a list (the ones in result aren’t), they’ll be removed too,
- ak.singletons() replaces None with [] and any other value x with [x]. The resulting lists all have length 0 or length 1.
ak.drop_none(result, axis=1)
[[],
 [],
 [35.4],
 [55.2],
 [79.6, 108, 141, 179]]
-----------------------
backend: cpu
nbytes: 96 B
type: 5 * var * float64
ak.fill_none(result, -1, axis=1)
[[-1, -1, -1],
 [],
 [-1, 35.4],
 [55.2],
 [79.6, 108, 141, 179]]
-----------------------
backend: cpu
nbytes: 128 B
type: 5 * var * float64
ak.singletons(result, axis=1)
[[[], [], []],
 [],
 [[], [35.4]],
 [[55.2]],
 [[79.6], [108], [141], [179]]]
-------------------------------
backend: cpu
nbytes: 184 B
type: 5 * var * var * float64
As a final note, the difference between using ak.drop_none() and slicing with the result of ak.is_none() is that ak.drop_none() also removes “missingness” from the data type; a slice does not.
result[~ak.is_none(result, axis=1)]
[[],
 [],
 [35.4],
 [55.2],
 [79.6, 108, 141, 179]]
------------------------
backend: cpu
nbytes: 144 B
type: 5 * var * ?float64
(Note the ? for “option-type” before float64. This could have consequences, good or bad, at a later stage in processing.)