How to pad and clip arrays, particularly for machine learning

Most applications of arrays expect them to be rectilinear—a rectangular table of numbers in N dimensions. Machine learning frameworks refer to these blocks of numbers as “tensors,” but they are equivalent to N-dimensional NumPy arrays. Awkward Array handles more general data than these, but if your intention is to pass the data to a framework that wants a tensor, you have to reduce your data to a tensor.

This tutorial presents several ways of doing that. Like flattening for plots, the method you choose will depend on what your data mean and what you want to get out of the machine learning process. For instance, if you remove all list structures with ak.flatten, you can’t expect the machine learning algorithm to learn about lists (whatever they mean in your application), and if you truncate them at a particular length, it won’t learn about the values that have been removed.

(In principle, graph neural networks should be able to learn about variable-length data without any losses, but I’m not familiar enough with how to set up these processes to explain it. If you’re an expert, we’d like to hear tips and tricks in GitHub Discussions!)

import awkward as ak
import numpy as np

Flattening an array

Suppose you have an array like this:

array = ak.Array([[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]])
array
<Array [[0, 1.1, 2.2], ... 6.6, 7.7, 8.8, 9.9]] type='5 * var * float64'>

If the list boundaries are irrelevant to your machine learning application, you can simply ak.flatten them.

ak.flatten(array)
<Array [0, 1.1, 2.2, 3.3, ... 7.7, 8.8, 9.9] type='10 * float64'>

The default axis for ak.flatten is 1, which is to say, the first level of nested lists (axis=1) is merged into the outer level of the array (axis=0). If you have many levels to flatten at once, you can use axis=None:

ak.flatten(ak.Array([[[[[[1.1, 2.2, 3.3]]], [[[4.4, 5.5]]]]]]), axis=None)
<Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>

However, be aware that ak.flatten with axis=None also merges all fields of a record, which is usually undesirable. The values are grouped field by field (all of x, then all of y), not record by record, so the order might not be what you expect.

ak.flatten(ak.Array([{"x": 1.1, "y": 10}, {"x": 2.2, "y": 20}, {"x": 3.3, "y": 30}]), axis=None)
<Array [1.1, 2.2, 3.3, 10, 20, 30] type='6 * float64'>

Also be aware that flattening (for any axis) removes missing values (at that axis). That is, at the level where lists are concatenated, missing lists are treated the same way as empty lists.

ak.flatten(ak.Array([[1.1, 2.2, 3.3], None, [4.4, 5.5]]))
<Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>

But only at that level: missing values inside the lists are preserved.

ak.flatten(ak.Array([[1.1, 2.2, None, 3.3], [], [4.4, 5.5]]))
<Array [1.1, 2.2, None, 3.3, 4.4, 5.5] type='6 * ?float64'>

Markers at the end of each list

Flattening is likely to be useful for training recurrent neural networks, which learn a sequence in order. If, for instance, the values in your nested lists represent words and the lists are sentences, the machine would learn what typical sentences look like. However, it would not learn that the ends of sentences are special.

Suppose we have a nested array of integers representing words in a corpus like this:

array = ak.Array([[5512, 1364], [657], [4853, 6421, 3461, 7745], [5245, 654, 4216]])
array
<Array [[5512, 1364], ... [5245, 654, 4216]] type='4 * var * int64'>

Flattening it would turn the sentences into one big run-on sentence, which may be a bad thing to learn.

ak.flatten(array)
<Array [5512, 1364, 657, ... 5245, 654, 4216] type='10 * int64'>

Suppose that we want to fix this by adding 0 as a marker meaning “end of sentence/stop.” One way to do it is to make an array of [0] lists with the same length as array and ak.concatenate them at axis=1.

periods = np.zeros((len(array), 1), np.int64)
periods
array([[0],
       [0],
       [0],
       [0]])
ak.concatenate([array, periods], axis=1)
<Array [[5512, 1364, 0], ... 654, 4216, 0]] type='4 * var * int64'>
ak.concatenate([array, periods], axis=1).tolist()
[[5512, 1364, 0], [657, 0], [4853, 6421, 3461, 7745, 0], [5245, 654, 4216, 0]]
ak.flatten(ak.concatenate([array, periods], axis=1))
<Array [5512, 1364, 0, 657, ... 654, 4216, 0] type='14 * int64'>

Padding lists to a common length

A general function for turning an array of variable-length lists into lists of the same length is ak.pad_none. With the default clip=False, it ensures that every list has at least the given target length; shorter lists are padded with None and longer lists are left unchanged.

array = ak.Array([[0, 1, 2], [], [3, 4], [5], [6, 7, 8, 9]])
array
<Array [[0, 1, 2], [], ... [5], [6, 7, 8, 9]] type='5 * var * int64'>
ak.pad_none(array, 2).tolist()
[[0, 1, 2], [None, None], [3, 4], [5, None], [6, 7, 8, 9]]

“At least length 2” means that the list is still a variable-length type, which we can see with the “var *” in its type string.

ak.pad_none(array, 2).type
5 * var * ?int64

To produce lists of an exact length, set clip=True.

ak.pad_none(array, 2, clip=True).tolist()
[[0, 1], [None, None], [3, 4], [5, None], [6, 7]]
ak.pad_none(array, 2, clip=True).type
5 * 2 * ?int64

Now the type string says that the nested lists all have exactly two elements each. An array of this type can be converted directly into NumPy. Use ak.to_numpy for the conversion, because it represents the missing values with a NumPy masked array; np.asarray can only return a plain array, so it can't represent missing values.

ak.to_numpy(ak.pad_none(array, 2, clip=True))
masked_array(
  data=[[0, 1],
        [--, --],
        [3, 4],
        [5, --],
        [6, 7]],
  mask=[[False, False],
        [ True,  True],
        [False, False],
        [False,  True],
        [False, False]],
  fill_value=999999)

Perhaps your machine learning library knows how to deal with NumPy masked arrays. If it does not, you can replace all of the missing values with ak.fill_none.

ak.pad_none and ak.fill_none are frequently used together.

ak.fill_none(ak.pad_none(array, 2, clip=True), 999, axis=1)
<Array [[0, 1], [999, 999, ... 5, 999], [6, 7]] type='5 * 2 * int64'>
np.asarray(ak.fill_none(ak.pad_none(array, 2, clip=True), 999, axis=1))
array([[  0,   1],
       [999, 999],
       [  3,   4],
       [  5, 999],
       [  6,   7]])

Record fields into lists

Sometimes, the data you need to put into one big array (tensor) for machine learning is scattered among several record fields. In Awkward Array, record fields are discontiguous (each field is stored in a separate array), whereas the items of nested lists are contiguous (stored in the same array). Gathering fields into a single tensor therefore requires a copy, which ak.concatenate performs.

array = ak.Array([
    {"a": 11, "b": 12, "c": 13, "d": 14, "e": 15, "f": 16, "g": 17, "h": 18},
    {"a": 21, "b": 22, "c": 23, "d": 24, "e": 25, "f": 26, "g": 27, "h": 28},
    {"a": 31, "b": 32, "c": 33, "d": 34, "e": 35, "f": 36, "g": 37, "h": 38},
    {"a": 41, "b": 42, "c": 43, "d": 44, "e": 45, "f": 46, "g": 47, "h": 48},
    {"a": 51, "b": 52, "c": 53, "d": 54, "e": 55, "f": 56, "g": 57, "h": 58},
])
array
<Array [{a: 11, b: 12, c: 13, ... h: 58}] type='5 * {"a": int64, "b": int64, "c"...'>

To concatenate, say, array.a and array.b as though [a, b] were a list, we have to put them into lists (of length 1). NumPy’s np.newaxis slice will do that.

array.a[:, np.newaxis], array.b[:, np.newaxis]
(<Array [[11], [21], [31], [41], [51]] type='5 * 1 * int64'>,
 <Array [[12], [22], [32], [42], [52]] type='5 * 1 * int64'>)
ak.concatenate([array.a[:, np.newaxis], array.b[:, np.newaxis]], axis=1)
<Array [[11, 12], [21, 22, ... 42], [51, 52]] type='5 * var * int64'>

If there are a lot of fields, doing this manually for each one would be a chore, so we use ak.fields and ak.unzip.

ak.fields(array)
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
ak.unzip(array)
(<Array [11, 21, 31, 41, 51] type='5 * int64'>,
 <Array [12, 22, 32, 42, 52] type='5 * int64'>,
 <Array [13, 23, 33, 43, 53] type='5 * int64'>,
 <Array [14, 24, 34, 44, 54] type='5 * int64'>,
 <Array [15, 25, 35, 45, 55] type='5 * int64'>,
 <Array [16, 26, 36, 46, 56] type='5 * int64'>,
 <Array [17, 27, 37, 47, 57] type='5 * int64'>,
 <Array [18, 28, 38, 48, 58] type='5 * int64'>)

Slicing the whole record array with np.newaxis wraps every field in length-1 lists at once, so now it’s a one-liner.

ak.concatenate(ak.unzip(array[:, np.newaxis]), axis=1)
<Array [[11, 12, 13, 14, ... 55, 56, 57, 58]] type='5 * var * int64'>
np.asarray(ak.concatenate(ak.unzip(array[:, np.newaxis]), axis=1))
array([[11, 12, 13, 14, 15, 16, 17, 18],
       [21, 22, 23, 24, 25, 26, 27, 28],
       [31, 32, 33, 34, 35, 36, 37, 38],
       [41, 42, 43, 44, 45, 46, 47, 48],
       [51, 52, 53, 54, 55, 56, 57, 58]])