How to create arrays of missing data

Data at any level of an Awkward Array can be “missing,” represented by None in Python.

This functionality is somewhat like NumPy’s masked arrays, but masked arrays can only declare numerical values to be missing (not, for instance, a row of a 2-dimensional array) and they represent missing data with an np.ma.masked object instead of None.

Pandas also handles missing data, but in several different ways. For floating point columns, NaN (not a number) is used to mean “missing,” and as of version 1.0, Pandas has a pd.NA object for missing data in other data types.

In Awkward Array, floating point NaN and a missing value are clearly distinct. Like all data in Awkward Arrays, missing values are not stored as Python objects; ak.from_iter() converts None into a missing value and ak.to_list() converts a missing value back into None.

import awkward as ak
import numpy as np

From Python None

The ak.Array constructor and ak.from_iter() interpret None as a missing value, and ak.to_list() converts missing values back into None.

ak.Array([1, 2, 3, None, 4, 5])
[1,
 2,
 3,
 None,
 4,
 5]
----------------
type: 6 * ?int64

The missing values can be deeply nested (missing integers):

ak.Array([[[[], [1, 2, None]]], [[[3]]], []])
[[[[], [1, 2, None]]],
 [[[3]]],
 []]
----------------------------------
type: 3 * var * var * var * ?int64

They can be shallow (missing lists):

ak.Array([[[[], [1, 2]]], None, [[[3]]], []])
[[[[], [1, 2]]],
 None,
 [[[3]]],
 []]
-----------------------------------------
type: 4 * option[var * var * var * int64]

Or both:

ak.Array([[[[], [3]]], None, [[[None]]], []])
[[[[], [3]]],
 None,
 [[[None]]],
 []]
------------------------------------------
type: 4 * option[var * var * var * ?int64]

Records can also be missing:

ak.Array([{"x": 1, "y": 1}, None, {"x": 2, "y": 2}])
[{x: 1, y: 1},
 None,
 {x: 2, y: 2}]
--------------
type: 3 * ?{
    x: int64,
    y: int64
}

Potentially missing values appear in the type string as a “?” prefix, or as “option[...]” when the nested type is a list and needs brackets for clarity.

From NumPy arrays

Normal NumPy arrays can’t represent missing data, but masked arrays can. Here is how one is constructed in NumPy:

numpy_array = np.ma.MaskedArray([1, 2, 3, 4, 5], [False, False, True, True, False])
numpy_array
masked_array(data=[1, 2, --, --, 5],
             mask=[False, False,  True,  True, False],
       fill_value=999999)

It returns np.ma.masked objects if you try to access missing values:

numpy_array[0], numpy_array[1], numpy_array[2], numpy_array[3], numpy_array[4]
(1, 2, masked, masked, 5)

But it uses None for missing values in tolist:

numpy_array.tolist()
[1, 2, None, None, 5]

The ak.from_numpy() function converts masked arrays into Awkward Arrays with missing values, as does the ak.Array constructor.

awkward_array = ak.Array(numpy_array)
awkward_array
[1,
 2,
 None,
 None,
 5]
----------------
type: 5 * ?int64

The reverse, ak.to_numpy(), returns masked arrays if the Awkward Array has missing data.

ak.to_numpy(awkward_array)
masked_array(data=[1, 2, --, --, 5],
             mask=[False, False,  True,  True, False],
       fill_value=999999)

But np.asarray, the usual way of casting data as NumPy arrays, does not. (np.asarray is supposed to return a plain np.ndarray, which np.ma.masked_array is not.)

np.asarray(awkward_array)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[12], line 1
----> 1 np.asarray(awkward_array)

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/highlevel.py:1303, in Array.__array__(self, *args, **kwargs)
   1301 arguments.update(kwargs)
   1302 with ak._errors.OperationErrorContext("numpy.asarray", arguments):
-> 1303     return ak._connect.numpy.convert_to_array(self._layout, args, kwargs)

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/_connect/numpy.py:28, in convert_to_array(layout, args, kwargs)
     27 def convert_to_array(layout, args, kwargs):
---> 28     out = ak.operations.to_numpy(layout, allow_missing=False)
     29     if args == () and kwargs == {}:
     30         return out

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/operations/ak_to_numpy.py:41, in to_numpy(array, allow_missing)
      7 """
      8 Args:
      9     array: Array-like data (anything #ak.to_layout recognizes).
   (...)
     35 See also #ak.from_numpy and #ak.to_cupy.
     36 """
     37 with ak._errors.OperationErrorContext(
     38     "ak.to_numpy",
     39     {"array": array, "allow_missing": allow_missing},
     40 ):
---> 41     return _impl(array, allow_missing)

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/operations/ak_to_numpy.py:53, in _impl(array, allow_missing)
     50 backend = ak._backends.NumpyBackend.instance()
     51 numpy_layout = layout.to_backend(backend)
---> 53 return numpy_layout.to_backend_array(allow_missing=allow_missing)

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/contents/content.py:1076, in Content.to_backend_array(self, allow_missing, backend)
   1074 else:
   1075     backend = ak._backends.regularize_backend(backend)
-> 1076 return self._to_backend_array(allow_missing, backend)

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/contents/bytemaskedarray.py:978, in ByteMaskedArray._to_backend_array(self, allow_missing, backend)
    977 def _to_backend_array(self, allow_missing, backend):
--> 978     return self.to_IndexedOptionArray64()._to_backend_array(allow_missing, backend)

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/contents/indexedoptionarray.py:1535, in IndexedOptionArray._to_backend_array(self, allow_missing, backend)
   1533         return nplike.ma.MaskedArray(data, mask)
   1534     else:
-> 1535         raise ak._errors.wrap_error(
   1536             ValueError(
   1537                 "Content.to_nplike cannot convert 'None' values to "
   1538                 "np.ma.MaskedArray unless the "
   1539                 "'allow_missing' parameter is set to True"
   1540             )
   1541         )
   1542 else:
   1543     if allow_missing:

ValueError: while calling

    numpy.asarray(
        <Array [1, 2, None, None, 5] type='5 * ?int64'>
    )

Error details: Content.to_nplike cannot convert 'None' values to np.ma.MaskedArray unless the 'allow_missing' parameter is set to True

Missing rows vs missing numbers

In Awkward Array, a missing list is a different thing from a list whose values are missing. NumPy masked arrays cannot express a missing row, so ak.to_numpy() converts a missing list into a row in which every value is masked.

missing_row = ak.Array([[1, 2, 3], None, [4, 5, 6]])
missing_row
[[1, 2, 3],
 None,
 [4, 5, 6]]
-----------------------------
type: 3 * option[var * int64]

ak.to_numpy(missing_row)
masked_array(
  data=[[1, 2, 3],
        [--, --, --],
        [4, 5, 6]],
  mask=[[False, False, False],
        [ True,  True,  True],
        [False, False, False]],
  fill_value=999999)

NaN is not missing

Floating point NaN values are simply unrelated to missing values, in both Awkward Array and NumPy.

missing_with_nan = ak.Array([1.1, 2.2, np.nan, None, 3.3])
missing_with_nan
[1.1,
 2.2,
 nan,
 None,
 3.3]
------------------
type: 5 * ?float64

ak.to_numpy(missing_with_nan)
masked_array(data=[1.1, 2.2, nan, --, 3.3],
             mask=[False, False, False,  True, False],
       fill_value=1e+20)

Missing values as empty lists

Sometimes, it’s useful to think about a potentially missing value as a length-1 list if it is not missing and a length-0 list if it is. (Some languages define the option type as a kind of list.)

The Awkward functions ak.singletons() and ak.firsts() convert between the two: ak.singletons() goes from “None form” to “lists form,” and ak.firsts() goes back.

none_form = ak.Array([1, 2, 3, None, None, 5])
none_form
[1,
 2,
 3,
 None,
 None,
 5]
----------------
type: 6 * ?int64

lists_form = ak.singletons(none_form)
lists_form
[[1],
 [2],
 [3],
 [],
 [],
 [5]]
---------------------
type: 6 * var * int64

ak.firsts(lists_form)
[1,
 2,
 3,
 None,
 None,
 5]
----------------
type: 6 * ?int64

Masking instead of slicing

The most common way of filtering data is to slice it with an array of booleans (usually the result of a calculation).

array = ak.Array([1, 2, 3, 4, 5])
array
[1,
 2,
 3,
 4,
 5]
---------------
type: 5 * int64

booleans = ak.Array([True, True, False, False, True])
booleans
[True,
 True,
 False,
 False,
 True]
--------------
type: 5 * bool

array[booleans]
[1,
 2,
 5]
---------------
type: 3 * int64

The data can also be effectively filtered by replacing values with None. The following syntax does that:

array.mask[booleans]
[1,
 2,
 None,
 None,
 5]
----------------
type: 5 * ?int64

(Or use the ak.mask() function.)

An advantage of masking is that the length and nesting structure of the masked array is the same as the original array, so anything that broadcasts with one broadcasts with the other (so that unfiltered data can be used interchangeably with filtered data).

array + array.mask[booleans]
[2,
 4,
 None,
 None,
 10]
----------------
type: 5 * ?int64

whereas adding the sliced array fails, because the two arrays no longer have the same length:

array + array[booleans]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[25], line 1
----> 1 array + array[booleans]

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/numpy/lib/mixins.py:21, in _binary_method.<locals>.func(self, other)
     19 if _disables_array_ufunc(other):
     20     return NotImplemented
---> 21 return ufunc(self, other)

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/highlevel.py:1376, in Array.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
   1374 arguments.update(kwargs)
   1375 with ak._errors.OperationErrorContext(name, arguments):
-> 1376     return ak._connect.numpy.array_ufunc(ufunc, method, inputs, kwargs)

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/_connect/numpy.py:291, in array_ufunc(ufunc, method, inputs, kwargs)
    282     out = ak._do.recursively_apply(
    283         inputs[where],
    284         unary_action,
   (...)
    287         allow_records=False,
    288     )
    290 else:
--> 291     out = ak._broadcasting.broadcast_and_apply(
    292         inputs, action, behavior, allow_records=False, function_name=ufunc.__name__
    293     )
    294     assert isinstance(out, tuple) and len(out) == 1
    295     out = out[0]

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/_broadcasting.py:1060, in broadcast_and_apply(inputs, action, behavior, depth_context, lateral_context, allow_records, left_broadcast, right_broadcast, numpy_to_regular, regular_to_jagged, function_name, broadcast_parameters_rule)
   1058 backend = ak._backends.backend_of(*inputs)
   1059 isscalar = []
-> 1060 out = apply_step(
   1061     backend,
   1062     broadcast_pack(inputs, isscalar),
   1063     action,
   1064     0,
   1065     depth_context,
   1066     lateral_context,
   1067     behavior,
   1068     {
   1069         "allow_records": allow_records,
   1070         "left_broadcast": left_broadcast,
   1071         "right_broadcast": right_broadcast,
   1072         "numpy_to_regular": numpy_to_regular,
   1073         "regular_to_jagged": regular_to_jagged,
   1074         "function_name": function_name,
   1075         "broadcast_parameters_rule": broadcast_parameters_rule,
   1076     },
   1077 )
   1078 assert isinstance(out, tuple)
   1079 return tuple(broadcast_unpack(x, isscalar, backend) for x in out)

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/_broadcasting.py:1039, in apply_step(backend, inputs, action, depth, depth_context, lateral_context, behavior, options)
   1037     return result
   1038 elif result is None:
-> 1039     return continuation()
   1040 else:
   1041     raise ak._errors.wrap_error(AssertionError(result))

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/_broadcasting.py:736, in apply_step.<locals>.continuation()
    734         nextinputs.append(x.content[:0])
    735     else:
--> 736         raise ak._errors.wrap_error(
    737             ValueError(
    738                 "cannot broadcast RegularArray of size "
    739                 "{} with RegularArray of size {} {}".format(
    740                     x.size, dimsize, in_function(options)
    741                 )
    742             )
    743         )
    744 else:
    745     nextinputs.append(x)

ValueError: while calling

    numpy.add.__call__(
        <Array [1, 2, 3, 4, 5] type='5 * int64'>
        <Array [1, 2, 5] type='3 * int64'>
    )

Error details: cannot broadcast RegularArray of size 3 with RegularArray of size 5 in add

With ArrayBuilder

ak.ArrayBuilder is described in more detail in this tutorial. To add missing values to an array, either call the builder's null method or append None.

(This is what ak.from_iter() uses internally to accumulate data.)

builder = ak.ArrayBuilder()

builder.append(1)
builder.append(2)
builder.null()
builder.append(None)
builder.append(3)

array = builder.snapshot()
array
[1,
 2,
 None,
 None,
 3]
----------------
type: 5 * ?int64

In Numba

Functions that Numba Just-In-Time (JIT) compiles can use ak.ArrayBuilder or construct a boolean array for ak.mask().

(ak.ArrayBuilder can’t be constructed or converted to an array using snapshot inside a JIT-compiled function, but can be outside the compiled context.)

import numba as nb

@nb.jit
def example(builder):
    builder.append(1)
    builder.append(2)
    builder.null()
    builder.append(None)
    builder.append(3)
    return builder


builder = example(ak.ArrayBuilder())

array = builder.snapshot()
array
[1,
 2,
 None,
 None,
 3]
----------------
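type: 5 * ?int64

@nb.jit
def faster_example():
    # Fill plain NumPy arrays inside the compiled function.
    data = np.empty(5, np.int64)
    mask = np.empty(5, np.bool_)
    data[0] = 1
    mask[0] = True
    data[1] = 2
    mask[1] = True
    # Entries 2 and 3 are missing: mask is False and data is left unset.
    mask[2] = False
    mask[3] = False
    data[4] = 5
    mask[4] = True
    return data, mask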


data, mask = faster_example()

array = ak.Array(data).mask[mask]
array
[1,
 2,
 None,
 None,
 5]
----------------
type: 5 * ?int64