How to filter with arrays containing missing values
import awkward as ak
import numpy as np
Indexing with missing values
In Building an awkward index, we looked at building arrays of integers to perform awkward indexing using ak.argmin() and ak.argmax(). In particular, the keepdims argument of ak.argmin() and ak.argmax() is very useful for creating arrays that can be used to index into the original array. However, reducers such as ak.argmax() need special handling when they are asked to operate upon empty lists, which contain no position to return.
Let’s first create an array that contains empty sublists:
array = ak.Array(
[
[],
[10, 3, 2, 9],
[4, 5, 5, 12, 6],
[],
[8, 9, -1],
]
)
array
[[], [10, 3, 2, 9], [4, 5, 5, 12, 6], [], [8, 9, -1]] ------------------ backend: cpu nbytes: 144 B type: 5 * var * int64
Awkward reducers accept a mask_identity argument, which changes the ak.Array.type and the values of the result:
ak.argmax(array, keepdims=True, axis=-1, mask_identity=False)
[[-1], [0], [3], [-1], [1]] ------ backend: cpu nbytes: 40 B type: 5 * 1 * int64
ak.argmax(array, keepdims=True, axis=-1, mask_identity=True)
[[None], [0], [3], [None], [1]] -------- backend: cpu nbytes: 45 B type: 5 * 1 * ?int64
Setting mask_identity=False yields the identity value for the reducer instead of None when reducing empty lists. From the above examples of ak.argmax(), we can see that the identity for ak.argmax() is -1. What happens if we try to use the array produced with mask_identity=False to index into array?
As discussed in Indexing with argmin and argmax, we first need to convert at least one dimension to a ragged dimension:
index = ak.from_regular(
ak.argmax(array, keepdims=True, axis=-1, mask_identity=False)
)
Now, if we try to index into array with index, it raises an exception:
array[index]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[6], line 1
----> 1 array[index]
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1104, in Array.__getitem__(self, where)
675 def __getitem__(self, where):
676 """
677 Args:
678 where (many types supported; see below): Index of positions to
(...) 1102 have the same dimension as the array being indexed.
1103 """
-> 1104 with ak._errors.SlicingErrorContext(self, where):
1105 # Handle named axis
1106 (_, ndim) = self._layout.minmax_depth
1107 named_axis = _get_named_axis(self)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_errors.py:80, in ErrorContext.__exit__(self, exception_type, exception_value, traceback)
78 self._slate.__dict__.clear()
79 # Handle caught exception
---> 80 raise self.decorate_exception(exception_type, exception_value)
81 else:
82 # Step out of the way so that another ErrorContext can become primary.
83 if self.primary() is self:
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1112, in Array.__getitem__(self, where)
1108 where = _normalize_named_slice(named_axis, where, ndim)
1110 NamedAxis.mapping = named_axis
-> 1112 indexed_layout = prepare_layout(self._layout._getitem(where, NamedAxis))
1114 if NamedAxis.mapping:
1115 return ak.operations.ak_with_named_axis._impl(
1116 indexed_layout,
1117 named_axis=NamedAxis.mapping,
(...) 1120 attrs=self._attrs,
1121 )
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:659, in Content._getitem(self, where, named_axis)
656 return out._getitem_at(0)
658 elif isinstance(where, ak.highlevel.Array):
--> 659 return self._getitem(where.layout, named_axis)
661 # Convert between nplikes of different backends
662 elif (
663 isinstance(where, ak.contents.Content)
664 and where.backend is not self._backend
665 ):
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:736, in Content._getitem(self, where, named_axis)
733 return where.to_NumpyArray(np.int64)
735 elif isinstance(where, Content):
--> 736 return self._getitem((where,), named_axis)
738 elif is_sized_iterable(where):
739 # Do we have an array
740 nplike = nplike_of_obj(where, default=None)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:651, in Content._getitem(self, where, named_axis)
642 named_axis.mapping = _named_axis
644 next = ak.contents.RegularArray(
645 this,
646 this.length,
647 1,
648 parameters=None,
649 )
--> 651 out = next._getitem_next(nextwhere[0], nextwhere[1:], None)
653 if out.length is not unknown_length and out.length == 0:
654 return out._getitem_nothing()
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/regulararray.py:722, in RegularArray._getitem_next(self, head, tail, advanced)
706 assert head.offsets.nplike is nplike
707 self._maybe_index_error(
708 self._backend[
709 "awkward_RegularArray_getitem_jagged_expand",
(...) 720 slicer=head,
721 )
--> 722 down = self._content._getitem_next_jagged(
723 multistarts, multistops, head._content, tail
724 )
726 return RegularArray(
727 down, headlength, self.length, parameters=self._parameters
728 )
730 elif isinstance(head, ak.contents.IndexedOptionArray):
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/listoffsetarray.py:446, in ListOffsetArray._getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
440 def _getitem_next_jagged(
441 self, slicestarts: Index, slicestops: Index, slicecontent: Content, tail
442 ) -> Content:
443 out = ak.contents.ListArray(
444 self.starts, self.stops, self._content, parameters=self._parameters
445 )
--> 446 return out._getitem_next_jagged(slicestarts, slicestops, slicecontent, tail)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/listarray.py:583, in ListArray._getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
572 nextcarry = ak.index.Index64.empty(carrylen, self._backend.nplike)
574 assert (
575 outoffsets.nplike is self._backend.nplike
576 and nextcarry.nplike is self._backend.nplike
(...) 581 and self._stops.nplike is self._backend.nplike
582 )
--> 583 self._maybe_index_error(
584 self._backend[
585 "awkward_ListArray_getitem_jagged_apply",
586 outoffsets.dtype.type,
587 nextcarry.dtype.type,
588 slicestarts.dtype.type,
589 slicestops.dtype.type,
590 sliceindex.dtype.type,
591 self._starts.dtype.type,
592 self._stops.dtype.type,
593 ](
594 outoffsets.data,
595 nextcarry.data,
596 slicestarts.data,
597 slicestops.data,
598 slicestarts.length,
599 sliceindex.data,
600 sliceindex.length,
601 self._starts.data,
602 self._stops.data,
603 self._content.length,
604 ),
605 slicer=ak.contents.ListArray(slicestarts, slicestops, slicecontent),
606 )
607 nextcontent = self._content._carry(nextcarry, True)
608 nexthead, nexttail = ak._slicing.head_tail(tail)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:297, in Content._maybe_index_error(self, error, slicer)
295 else:
296 message = self._backend.format_kernel_error(error)
--> 297 raise ak._errors.index_error(self, slicer, message)
IndexError: cannot slice ListArray (of length 5) with [[-1], [0], [3], [-1], [1]]: index out of range while attempting to get index -1 (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-49/awkward-cpp/src/cpu-kernels/awkward_ListArray_getitem_jagged_apply.cpp#L43)
This error occurred while attempting to slice
<Array [[], [10, 3, 2, 9], ..., [], [8, 9, -1]] type='5 * var * int64'>
with
<Array [[-1], [0], [3], [-1], [1]] type='5 * var * int64'>
From the error message, it is clear that for some sublist(s) the index -1 is out of range. This makes sense; some of our sublists are empty, meaning that there is no valid integer to index into them.
Now let’s look at the result of indexing with mask_identity=True.
index = ak.argmax(array, keepdims=True, axis=-1, mask_identity=True)
Because it contains an option type, index already satisfies rule (2) in Building an awkward index, and we do not need to convert it to a ragged array. We can see that indexing with it succeeds:
array[index]
[[None], [10], [12], [None], [9]] -------- backend: cpu nbytes: 112 B type: 5 * var * ?int64
Here, the missing values in the index array correspond to missing values in the output array.
Indexing with missing sublists
Ragged indexing also supports using None in place of empty sublists within an index. For example, given the following array,
array = ak.Array(
[
[10, 3, 2, 9],
[4, 5, 5, 12, 6],
[],
[8, 9, -1],
]
)
array
[[10, 3, 2, 9], [4, 5, 5, 12, 6], [], [8, 9, -1]] ------------------ backend: cpu nbytes: 136 B type: 4 * var * int64
let’s build a ragged index to pull out some particular values. Rather than using empty lists, we can use None to mask out sublists that we don’t care about:
array[
[
[0, 1],
None,
[],
[2],
],
]
[[10, 3], None, [], [-1]] --------- backend: cpu nbytes: 96 B type: 4 * option[var * int64]
If we compare this with simply providing an empty sublist,
array[
[
[0, 1],
[],
[],
[2],
],
]
[[10, 3], [], [], [-1]] --------- backend: cpu nbytes: 64 B type: 4 * var * int64
we can see that the None value introduces an option-type into the final result. None values can be used at any level in the index array to introduce an option-type at that depth in the result.