How to create arrays of missing data

Data at any level of an Awkward Array can be “missing,” represented by None in Python.

This functionality is somewhat like NumPy’s masked arrays, but masked arrays can only declare numerical values to be missing (not, for instance, a row of a 2-dimensional array) and they represent missing data with an np.ma.masked object instead of None.

Pandas also handles missing data, but in several different ways. For floating point columns, NaN (not a number) is used to mean “missing,” and as of version 1.0, Pandas has a pd.NA object for missing data in other data types.

In Awkward Array, floating point NaN and a missing value are clearly distinct. Missing data, like all data in Awkward Arrays, are also not represented by any Python object; they are converted from None by ak.from_iter and back into None by ak.to_list.

import awkward as ak
import numpy as np

From Python None

The ak.Array constructor and ak.from_iter interpret None as a missing value, and ak.to_list converts missing values back into None.

ak.Array([1, 2, 3, None, 4, 5])
<Array [1, 2, 3, None, 4, 5] type='6 * ?int64'>

The missing values can be deeply nested (missing integers):

ak.Array([[[[], [1, 2, None]]], [[[3]]], []])
<Array [[[[], [1, 2, None]]], [[[3]]], []] type='3 * var * var * var * ?int64'>

They can be shallow (missing lists):

ak.Array([[[[], [1, 2]]], None, [[[3]]], []])
<Array [[[[], [1, 2]]], None, [[[3]]], []] type='4 * option[var * var * var * in...'>

Or both:

ak.Array([[[[], [3]]], None, [[[None]]], []])
<Array [[[[], [3]]], None, [[[None]]], []] type='4 * option[var * var * var * ?i...'>

Records can also be missing:

ak.Array([{"x": 1, "y": 1}, None, {"x": 2, "y": 2}])
<Array [{x: 1, y: 1}, None, {x: 2, y: 2}] type='3 * ?{"x": int64, "y": int64}'>

Potentially missing values are represented in the type string as “?”, or as “option[...]” when the nested type is a list, which needs to be bracketed for clarity.

From NumPy arrays

Normal NumPy arrays can’t represent missing data, but masked arrays can. Here is how one is constructed in NumPy:

numpy_array = np.ma.MaskedArray([1, 2, 3, 4, 5], [False, False, True, True, False])
numpy_array
masked_array(data=[1, 2, --, --, 5],
             mask=[False, False,  True,  True, False],
       fill_value=999999)

It returns np.ma.masked objects if you try to access missing values:

numpy_array[0], numpy_array[1], numpy_array[2], numpy_array[3], numpy_array[4]
(1, 2, masked, masked, 5)

But it uses None for missing values in tolist:

numpy_array.tolist()
[1, 2, None, None, 5]

The ak.from_numpy function converts masked arrays into Awkward Arrays with missing values, as does the ak.Array constructor.

awkward_array = ak.Array(numpy_array)
awkward_array
<Array [1, 2, None, None, 5] type='5 * ?int64'>

The reverse, ak.to_numpy, returns masked arrays if the Awkward Array has missing data.

ak.to_numpy(awkward_array)
masked_array(data=[1, 2, --, --, 5],
             mask=[False, False,  True,  True, False],
       fill_value=999999)

But np.asarray, the usual way of casting data as NumPy arrays, does not; it raises an error instead. (np.asarray is supposed to return a plain np.ndarray, which np.ma.MaskedArray is not.)

np.asarray(awkward_array)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [12], in <cell line: 1>()
----> 1 np.asarray(awkward_array)

File ~/python3.8/lib/python3.8/site-packages/awkward/highlevel.py:1351, in Array.__array__(self, *args, **kwargs)
   1326 def __array__(self, *args, **kwargs):
   1327     """
   1328     Intercepts attempts to convert this Array into a NumPy array and
   1329     either performs a zero-copy conversion or raises an error.
   (...)
   1349     cannot be sliced as dimensions.
   1350     """
-> 1351     return ak._connect._numpy.convert_to_array(self.layout, args, kwargs)

File ~/python3.8/lib/python3.8/site-packages/awkward/_connect/_numpy.py:13, in convert_to_array(layout, args, kwargs)
     12 def convert_to_array(layout, args, kwargs):
---> 13     out = ak.operations.convert.to_numpy(layout, allow_missing=False)
     14     if args == () and kwargs == {}:
     15         return out

File ~/python3.8/lib/python3.8/site-packages/awkward/operations/convert.py:302, in to_numpy(array, allow_missing)
    300         return numpy.ma.MaskedArray(data, mask)
    301     else:
--> 302         raise ValueError(
    303             "ak.to_numpy cannot convert 'None' values to "
    304             "np.ma.MaskedArray unless the "
    305             "'allow_missing' parameter is set to True"
    306             + ak._util.exception_suffix(__file__)
    307         )
    308 else:
    309     if allow_missing:

ValueError: ak.to_numpy cannot convert 'None' values to np.ma.MaskedArray unless the 'allow_missing' parameter is set to True

(https://github.com/scikit-hep/awkward-1.0/blob/1.9.0rc4/src/awkward/operations/convert.py#L306)

Missing rows vs missing numbers

In Awkward Array, a missing list is a different thing from a list whose values are missing. However, ak.to_numpy converts a missing list into a row of masked values, because a NumPy masked array can only mask individual elements.

missing_row = ak.Array([[1, 2, 3], None, [4, 5, 6]])
missing_row
<Array [[1, 2, 3], None, [4, 5, 6]] type='3 * option[var * int64]'>
ak.to_numpy(missing_row)
masked_array(
  data=[[1, 2, 3],
        [--, --, --],
        [4, 5, 6]],
  mask=[[False, False, False],
        [ True,  True,  True],
        [False, False, False]],
  fill_value=999999)

NaN is not missing

Floating point NaN values are simply unrelated to missing values, in both Awkward Array and NumPy.

missing_with_nan = ak.Array([1.1, 2.2, np.nan, None, 3.3])
missing_with_nan
<Array [1.1, 2.2, nan, None, 3.3] type='5 * ?float64'>
ak.to_numpy(missing_with_nan)
masked_array(data=[1.1, 2.2, nan, --, 3.3],
             mask=[False, False, False,  True, False],
       fill_value=1e+20)

Missing values as empty lists

Sometimes, it’s useful to think about a potentially missing value as a length-1 list if it is not missing and a length-0 list if it is. (Some languages define the option type as a kind of list.)

The Awkward functions ak.singletons and ak.firsts convert from “None form” to “lists form” and back.

none_form = ak.Array([1, 2, 3, None, None, 5])
none_form
<Array [1, 2, 3, None, None, 5] type='6 * ?int64'>
lists_form = ak.singletons(none_form)
lists_form
<Array [[1], [2], [3], [], [], [5]] type='6 * var * int64'>
ak.firsts(lists_form)
<Array [1, 2, 3, None, None, 5] type='6 * ?int64'>
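The idea behind the two forms can be sketched in plain Python, without Awkward Array. The helpers to_lists_form and to_none_form below are hypothetical names invented for this illustration; they are not part of the library and only mimic the one-dimensional case shown above.

```python
def to_lists_form(values):
    """Hypothetical helper: wrap each value in a list; None becomes []."""
    return [[] if value is None else [value] for value in values]


def to_none_form(lists):
    """Hypothetical helper: take the first item of each list; [] becomes None."""
    return [items[0] if len(items) > 0 else None for items in lists]


none_form = [1, 2, 3, None, None, 5]
lists_form = to_lists_form(none_form)
restored = to_none_form(lists_form)

print(lists_form)
print(restored)
```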

Masking instead of slicing

The most common way of filtering data is to slice it with an array of booleans (usually the result of a calculation).

array = ak.Array([1, 2, 3, 4, 5])
array
<Array [1, 2, 3, 4, 5] type='5 * int64'>
booleans = ak.Array([True, True, False, False, True])
booleans
<Array [True, True, False, False, True] type='5 * bool'>
array[booleans]
<Array [1, 2, 5] type='3 * int64'>

The data can also be effectively filtered by replacing values with None. The following syntax does that:

array.mask[booleans]
<Array [1, 2, None, None, 5] type='5 * ?int64'>

(Or use the ak.mask function.)

An advantage of masking is that the length and nesting structure of the masked array is the same as the original array, so anything that broadcasts with one broadcasts with the other (so that unfiltered data can be used interchangeably with filtered data).

array + array.mask[booleans]
<Array [2, 4, None, None, 10] type='5 * ?int64'>

whereas

array + array[booleans]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [25], in <cell line: 1>()
----> 1 array + array[booleans]

File ~/python3.8/lib/python3.8/site-packages/numpy/lib/mixins.py:21, in _binary_method.<locals>.func(self, other)
     19 if _disables_array_ufunc(other):
     20     return NotImplemented
---> 21 return ufunc(self, other)

File ~/python3.8/lib/python3.8/site-packages/awkward/highlevel.py:1411, in Array.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
   1354 """
   1355 Intercepts attempts to pass this Array to a NumPy
   1356 [universal functions](https://docs.scipy.org/doc/numpy/reference/ufuncs.html)
   (...)
   1408 See also #__array_function__.
   1409 """
   1410 if not hasattr(self, "_tracers"):
-> 1411     return ak._connect._numpy.array_ufunc(ufunc, method, inputs, kwargs)
   1412 else:
   1413     return ak._connect._jax.jax_utils.array_ufunc(
   1414         self, ufunc, method, inputs, kwargs
   1415     )

File ~/python3.8/lib/python3.8/site-packages/awkward/_connect/_numpy.py:250, in array_ufunc(ufunc, method, inputs, kwargs)
    240         raise ValueError(
    241             "no overloads for custom types: {}({})".format(
    242                 ufunc.__name__,
   (...)
    245             + ak._util.exception_suffix(__file__)
    246         )
    248     return None
--> 250 out = ak._util.broadcast_and_apply(
    251     inputs, getfunction, behavior, allow_records=False, pass_depth=False
    252 )
    253 assert isinstance(out, tuple) and len(out) == 1
    254 return ak._util.wrap(out[0], behavior)

File ~/python3.8/lib/python3.8/site-packages/awkward/_util.py:1166, in broadcast_and_apply(inputs, getfunction, behavior, allow_records, pass_depth, pass_user, user, left_broadcast, right_broadcast, numpy_to_regular, regular_to_jagged)
   1164 else:
   1165     isscalar = []
-> 1166     out = apply(broadcast_pack(inputs, isscalar), 0, user)
   1167     assert isinstance(out, tuple)
   1168     return tuple(broadcast_unpack(x, isscalar) for x in out)

File ~/python3.8/lib/python3.8/site-packages/awkward/_util.py:733, in broadcast_and_apply.<locals>.apply(inputs, depth, user)
    730 if pass_user:
    731     args = args + (user,)
--> 733 custom = getfunction(inputs, *args)
    734 if callable(custom):
    735     return custom()

File ~/python3.8/lib/python3.8/site-packages/awkward/_connect/_numpy.py:195, in array_ufunc.<locals>.getfunction(inputs)
    186 if all(
    187     (
    188         isinstance(x, ak.layout.NumpyArray)
   (...)
    192     for x in inputs
    193 ):
    194     nplike = ak.nplike.of(*inputs)
--> 195     result = getattr(ufunc, method)(
    196         *[nplike.asarray(x) for x in inputs], **kwargs
    197     )
    198     return lambda: (ak.operations.convert.from_numpy(result, highlevel=False),)
    199 elif all(
    200     isinstance(x, ak.layout.NumpyArray) and (x.format.upper().startswith("M"))
    201     for x in inputs
    202 ):

ValueError: operands could not be broadcast together with shapes (1,5) (1,3) 

With ArrayBuilder

ak.ArrayBuilder is described in more detail in this tutorial; in short, you can add missing values to an array by calling the null method or by appending None.

(This is what ak.from_iter uses internally to accumulate data.)

builder = ak.ArrayBuilder()

builder.append(1)
builder.append(2)
builder.null()
builder.append(None)
builder.append(3)

array = builder.snapshot()
array
<Array [1, 2, None, None, 3] type='5 * ?int64'>

In Numba

Functions that Numba Just-In-Time (JIT) compiles can use ak.ArrayBuilder or construct a boolean array for ak.mask.

(An ak.ArrayBuilder can’t be constructed, or converted to an array with snapshot, inside a JIT-compiled function; both steps have to happen outside the compiled context.)

import numba as nb
@nb.jit
def example(builder):
    builder.append(1)
    builder.append(2)
    builder.null()
    builder.append(None)
    builder.append(3)
    return builder

builder = example(ak.ArrayBuilder())

array = builder.snapshot()
array
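@nb.jit
def faster_example():
    # data holds the values; mask marks which entries are valid (True = keep).
    data = np.empty(5, np.int64)
    mask = np.empty(5, np.bool_)
    data[0] = 1
    mask[0] = True
    data[1] = 2
    mask[1] = True
    # Entries 2 and 3 are missing, so their data slots are left unset.
    mask[2] = False
    mask[3] = False
    data[4] = 5
    mask[4] = True
    return data, mask

data, mask = faster_example()

# Apply the mask outside the JIT-compiled function.
array = ak.Array(data).mask[mask]
array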