How to create arrays of lists

import awkward as ak
import numpy as np

From Python lists

If you have a collection of Python lists, the easiest way to turn them into an Awkward Array is to pass them to the ak.Array constructor, which recognizes a non-dict, non-NumPy iterable and calls ak.from_iter.

python_lists = [[1, 2, 3], [], [4, 5], [6], [7, 8, 9, 10]]
python_lists
[[1, 2, 3], [], [4, 5], [6], [7, 8, 9, 10]]
awkward_array = ak.Array(python_lists)
awkward_array
<Array [[1, 2, 3], [], ... [6], [7, 8, 9, 10]] type='5 * var * int64'>

The lists of lists can be arbitrarily deep.

python_lists = [[[[], [1, 2, 3]]], [[[4, 5]]], []]
python_lists
[[[[], [1, 2, 3]]], [[[4, 5]]], []]
awkward_array = ak.Array(python_lists)
awkward_array
<Array [[[[], [1, 2, 3]]], [[[4, 5]]], []] type='3 * var * var * var * int64'>

The “var *” in the type string indicates nested levels of variable-length lists. This is an array of lists of lists of lists of integers.

The advantage of an Awkward Array is that all of the numerical data are in a single array buffer, so calculations such as NumPy universal functions are vectorized across the whole array.

np.sqrt(awkward_array)
<Array [[[[], [1, 1.41, ... 2, 2.24]]], []] type='3 * var * var * var * float64'>

Unlike Python lists, arrays have a homogeneous type. A Python list doesn’t care whether numerical data appear at two different levels of nesting, but to an Awkward Array that difference changes the type.

union_array = ak.Array([[[[], [1, 2, 3]]], [[4, 5]], []])
union_array
<Array [[[[], [1, 2, 3]]], [[4, 5]], []] type='3 * var * var * union[var * int64...'>

In this example, the integers appear both two and three lists deep, so the data type contains a “union”: the innermost items are either lists of integers or bare integers.

union_array.type
3 * var * var * union[var * int64, int64]

Some operations work on union arrays, but not all. (Iteration in Numba is one that does not.)

From NumPy arrays

The ak.Array constructor loads NumPy arrays differently from Python lists. The inner dimensions of a NumPy array are guaranteed to have the same lengths, so they are interpreted as a fixed-length list type.

numpy_array = np.arange(2*3*5).reshape(2, 3, 5)
numpy_array
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14]],

       [[15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29]]])
regular_array = ak.Array(numpy_array)
regular_array
<Array [[[0, 1, 2, 3, 4, ... 26, 27, 28, 29]]] type='2 * 3 * 5 * int64'>

The type in this case has no “var *” in it, only “2 *”, “3 *”, and “5 *”. It’s a length-2 array of length-3 lists containing length-5 lists of integers.

Furthermore, if NumPy arrays are nested within Python lists (or other iterables), they’ll be treated as variable-length (“var *”) because there’s no guarantee at the start of a sequence that all NumPy arrays in the sequence will have the same shape.

numpy_arrays = [np.arange(3*5).reshape(3, 5), np.arange(3*5, 2*3*5).reshape(3, 5)]
numpy_arrays
[array([[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14]]),
 array([[15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29]])]
irregular_array = ak.Array(numpy_arrays)
irregular_array
<Array [[[0, 1, 2, 3, 4, ... 26, 27, 28, 29]]] type='2 * var * var * int64'>

Both regular_array and irregular_array have the same data values:

regular_array.tolist() == irregular_array.tolist()
True

but they have different types:

regular_array.type, irregular_array.type
(2 * 3 * 5 * int64, 2 * var * var * int64)

This can make a difference in some operations, such as broadcasting.

If you want more control over this, use the explicit ak.from_iter and ak.from_numpy functions instead of the general-purpose ak.Array constructor.

Unflattening

Another difference between ak.from_iter and ak.from_numpy is that iteration over Python lists is slow and necessarily copies the data, whereas ingesting a NumPy array is zero-copy. (You can see that it’s zero-copy by changing the data in place.)

In some cases, list-making can be vectorized. If you have a flat NumPy array of data and an array of “counts” that add up to the length of the data, then you can ak.unflatten it.

data = np.array([1, 2, 3, 4, 5, 6, 7, 8])
counts = np.array([3, 0, 1, 4])

unflattened = ak.unflatten(data, counts)
unflattened
<Array [[1, 2, 3], [], [4], [5, 6, 7, 8]] type='4 * var * int64'>

The first list has length 3, the second has length 0, the third has length 1, and the last has length 4. This is close to Awkward Array’s internal representation of variable-length lists, so it can be performed quickly.
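
To make the link to the internal representation concrete, here is a NumPy-only sketch. It assumes the offsets-based layout that Awkward Array documents for variable-length lists (ListOffsetArray), in which list i spans data[offsets[i]:offsets[i + 1]]. The names offsets and lists are illustrative.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8])
counts = np.array([3, 0, 1, 4])

# Offsets are the running total of the counts, with a leading zero.
offsets = np.concatenate([[0], np.cumsum(counts)])

# Each list is a slice of the flat data between consecutive offsets.
lists = [data[start:stop].tolist() for start, stop in zip(offsets[:-1], offsets[1:])]

print(offsets.tolist())  # [0, 3, 3, 4, 8]
print(lists)             # [[1, 2, 3], [], [4], [5, 6, 7, 8]]
```

Turning counts into offsets is a single cumulative sum, which is why no Python-level loop over the lists is needed.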

This function is named ak.unflatten because it has the opposite effect of ak.flatten and ak.num:

ak.flatten(unflattened)
<Array [1, 2, 3, 4, 5, 6, 7, 8] type='8 * int64'>
ak.num(unflattened)
<Array [3, 0, 1, 4] type='4 * int64'>

With ArrayBuilder

ak.ArrayBuilder is described in more detail in this tutorial. In short, you can construct arrays of lists with its begin_list/end_list methods or its list context manager.

(This is what ak.from_iter uses internally to accumulate lists.)

builder = ak.ArrayBuilder()

builder.begin_list()
builder.append(1)
builder.append(2)
builder.append(3)
builder.end_list()

builder.begin_list()
builder.end_list()

builder.begin_list()
builder.append(4)
builder.append(5)
builder.end_list()

array = builder.snapshot()
array
<Array [[1, 2, 3], [], [4, 5]] type='3 * var * int64'>

The same array can be built with the list context manager:

builder = ak.ArrayBuilder()

with builder.list():
    builder.append(1)
    builder.append(2)
    builder.append(3)

with builder.list():
    pass

with builder.list():
    builder.append(4)
    builder.append(5)

array = builder.snapshot()
array
<Array [[1, 2, 3], [], [4, 5]] type='3 * var * int64'>

In Numba

Functions that Numba Just-In-Time (JIT) compiles can use ak.ArrayBuilder or construct flat data and “counts” arrays for ak.unflatten.

(At this time, Numba can’t compile context managers (the with statement) in fully compiled code, so use begin_list/end_list there. An ak.ArrayBuilder can’t be constructed inside a JIT-compiled function, nor converted to an array with snapshot; do both outside the compiled context. Similarly, ak.* functions such as ak.unflatten can’t be called inside a JIT-compiled function, only outside.)

import numba as nb
@nb.jit
def append_list(builder, start, stop):
    builder.begin_list()
    for x in range(start, stop):
        builder.append(x)
    builder.end_list()

@nb.jit
def example(builder):
    append_list(builder, 1, 4)
    append_list(builder, 999, 999)
    append_list(builder, 4, 6)
    return builder

builder = example(ak.ArrayBuilder())

array = builder.snapshot()
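array

The same lists can be built by filling flat data and “counts” arrays inside the compiled functions and calling ak.unflatten outside them:
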
@nb.jit
def append_list(i, data, j, counts, start, stop):
    for x in range(start, stop):
        data[i] = x
        i += 1
    counts[j] = stop - start
    j += 1
    return i, j

@nb.jit
def example():
    data = np.empty(5, np.int64)
    counts = np.empty(3, np.int64)
    i, j = 0, 0
    i, j = append_list(i, data, j, counts, 1, 4)
    i, j = append_list(i, data, j, counts, 999, 999)
    i, j = append_list(i, data, j, counts, 4, 6)
    return data, counts

data, counts = example()

array = ak.unflatten(data, counts)
array