How to create arrays of lists#
import awkward as ak
import numpy as np
From Python lists#
If you have a collection of Python lists, the easiest way to turn them into an Awkward Array is to pass them to the ak.Array constructor, which recognizes a non-dict, non-NumPy iterable and calls ak.from_iter().
python_lists = [[1, 2, 3], [], [4, 5], [6], [7, 8, 9, 10]]
python_lists
[[1, 2, 3], [], [4, 5], [6], [7, 8, 9, 10]]
awkward_array = ak.Array(python_lists)
awkward_array
[[1, 2, 3],
 [],
 [4, 5],
 [6],
 [7, 8, 9, 10]]
---------------------
type: 5 * var * int64
The lists of lists can be arbitrarily deep.
python_lists = [[[[], [1, 2, 3]]], [[[4, 5]]], []]
python_lists
[[[[], [1, 2, 3]]], [[[4, 5]]], []]
awkward_array = ak.Array(python_lists)
awkward_array
[[[[], [1, 2, 3]]],
 [[[4, 5]]],
 []]
---------------------------------
type: 3 * var * var * var * int64
Each “var *” in the type string indicates one nested level of variable-length lists. This is an array of lists of lists of lists of integers.
The advantage of the Awkward Array is that the numerical data are now all in one array buffer, so calculations such as NumPy universal functions are vectorized across the whole array.
np.sqrt(awkward_array)
[[[[], [1, 1.41, 1.73]]],
 [[[2, 2.24]]],
 []]
-----------------------------------
type: 3 * var * var * var * float64
Unlike Python lists, arrays have a single, homogeneous type. A Python list wouldn’t notice if numerical data were given at two different levels of nesting, but that makes a big difference to an Awkward Array.
union_array = ak.Array([[[[], [1, 2, 3]]], [[4, 5]], []])
union_array
[[[[], [1, 2, 3]]],
 [[4, 5]],
 []]
----------------------------
type: 3 * var * var * union[
    var * int64,
    int64
]
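In this example, the integers appear at two different depths (two levels deep and three levels deep), so the data type at that level is a “union” of lists of integers and bare integers.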
union_array.type
ArrayType(ListType(ListType(UnionType([ListType(NumpyType('int64')), NumpyType('int64')]))), 3, None)
Some operations are possible with union arrays, but not all. (Iteration in Numba is one example of an operation that is not.)
From NumPy arrays#
The ak.Array constructor loads NumPy arrays differently from Python lists. The inner dimensions of a NumPy array are guaranteed to have the same lengths, so they are interpreted as a fixed-length list type.
numpy_array = np.arange(2 * 3 * 5).reshape(2, 3, 5)
numpy_array
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14]],

       [[15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29]]])
regular_array = ak.Array(numpy_array)
regular_array
[[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14]],
 [[15, 16, 17, 18, 19], [20, 21, 22, 23, 24], [25, 26, 27, 28, 29]]]
--------------------------------------------------------------------
type: 2 * 3 * 5 * int64
The type in this case has no “var *” in it, only “2 *”, “3 *”, and “5 *”. It’s a length-2 array of length-3 lists containing length-5 lists of integers.
Furthermore, if NumPy arrays are nested within Python lists (or other iterables), they’ll be treated as variable-length (“var *”) because there’s no guarantee at the start of a sequence that all NumPy arrays in the sequence will have the same shape.
numpy_arrays = [
    np.arange(3 * 5).reshape(3, 5),
    np.arange(3 * 5, 2 * 3 * 5).reshape(3, 5),
]
numpy_arrays
[array([[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14]]),
 array([[15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29]])]
irregular_array = ak.Array(numpy_arrays)
irregular_array
[[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14]],
 [[15, 16, 17, 18, 19], [20, 21, 22, 23, 24], [25, 26, 27, 28, 29]]]
--------------------------------------------------------------------
type: 2 * var * var * int64
Both regular_array and irregular_array have the same data values:
regular_array.to_list() == irregular_array.to_list()
True
but they have different types:
regular_array.type, irregular_array.type
(ArrayType(RegularType(RegularType(NumpyType('int64'), 5), 3), 2, None),
ArrayType(ListType(ListType(NumpyType('int64'))), 2, None))
This can make a difference in some operations, such as broadcasting.
If you want more control over this, use the explicit ak.from_iter() and ak.from_numpy() functions instead of the general-purpose ak.Array constructor.
Unflattening#
Another difference between ak.from_iter() and ak.from_numpy() is that iteration over Python lists is slow and necessarily copies the data, whereas ingesting a NumPy array is zero-copy. (You can see that it’s zero-copy by changing the data in-place.)
In some cases, list-making can be vectorized. If you have a flat NumPy array of data and an array of “counts” that add up to the length of the data, then you can ak.unflatten() it.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8])
counts = np.array([3, 0, 1, 4])
unflattened = ak.unflatten(data, counts)
unflattened
[[1, 2, 3],
 [],
 [4],
 [5, 6, 7, 8]]
---------------------
type: 4 * var * int64
The first list has length 3, the second has length 0, the third has length 1, and the last has length 4. This is close to Awkward Array’s internal representation of variable-length lists, so it can be performed quickly.
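To illustrate why counts are close to the internal representation, here is a NumPy-only sketch. It assumes the lists are stored as cumulative offsets into the flat data; the list comprehension is for illustration only and is not how the library builds the array:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8])
counts = np.array([3, 0, 1, 4])

# A cumulative sum turns list lengths into start/stop boundaries.
offsets = np.concatenate(([0], np.cumsum(counts)))

# List i spans data[offsets[i]:offsets[i + 1]].
lists = [
    data[start:stop].tolist()
    for start, stop in zip(offsets[:-1], offsets[1:])
]

print(offsets.tolist())  # [0, 3, 3, 4, 8]
print(lists)             # [[1, 2, 3], [], [4], [5, 6, 7, 8]]
```

Converting counts to offsets is a single vectorized operation, and the flat data does not need to be rearranged.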
This function is named ak.unflatten() because it is the inverse of ak.flatten() and ak.num(), which recover the flat data and the counts, respectively:
ak.flatten(unflattened)
[1, 2, 3, 4, 5, 6, 7, 8]
---------------
type: 8 * int64
ak.num(unflattened)
[3, 0, 1, 4]
---------------
type: 4 * int64
With ArrayBuilder#
ak.ArrayBuilder is described in more detail in this tutorial, but you can also construct arrays of lists using the begin_list/end_list methods or the list context manager.
(This is what ak.from_iter() uses internally to accumulate lists.)
builder = ak.ArrayBuilder()
builder.begin_list()
builder.append(1)
builder.append(2)
builder.append(3)
builder.end_list()
builder.begin_list()
builder.end_list()
builder.begin_list()
builder.append(4)
builder.append(5)
builder.end_list()
array = builder.snapshot()
array
[[1, 2, 3],
 [],
 [4, 5]]
---------------------
type: 3 * var * int64
builder = ak.ArrayBuilder()
with builder.list():
    builder.append(1)
    builder.append(2)
    builder.append(3)
with builder.list():
    pass
with builder.list():
    builder.append(4)
    builder.append(5)
array = builder.snapshot()
array
[[1, 2, 3],
 [],
 [4, 5]]
---------------------
type: 3 * var * int64
In Numba#
Functions that Numba Just-In-Time (JIT) compiles can use ak.ArrayBuilder or construct flat data and “counts” arrays for ak.unflatten().
(At this time, Numba can’t use context managers, the with statement, in fully compiled code. ak.ArrayBuilder can’t be constructed or converted to an array using snapshot inside a JIT-compiled function, but can be outside the compiled context. Similarly, ak.* functions like ak.unflatten() can’t be called inside a JIT-compiled function, but can be outside.)
import numba as nb

@nb.jit
def append_list(builder, start, stop):
    builder.begin_list()
    for x in range(start, stop):
        builder.append(x)
    builder.end_list()

@nb.jit
def example(builder):
    append_list(builder, 1, 4)
    append_list(builder, 999, 999)
    append_list(builder, 4, 6)
    return builder

builder = example(ak.ArrayBuilder())
array = builder.snapshot()
array
[[1, 2, 3],
 [],
 [4, 5]]
---------------------
type: 3 * var * int64
@nb.jit
def append_list(i, data, j, counts, start, stop):
    for x in range(start, stop):
        data[i] = x
        i += 1
    counts[j] = stop - start
    j += 1
    return i, j

@nb.jit
def example():
    data = np.empty(5, np.int64)
    counts = np.empty(3, np.int64)
    i, j = 0, 0
    i, j = append_list(i, data, j, counts, 1, 4)
    i, j = append_list(i, data, j, counts, 999, 999)
    i, j = append_list(i, data, j, counts, 4, 6)
    return data, counts

data, counts = example()
array = ak.unflatten(data, counts)
array
[[1, 2, 3],
 [],
 [4, 5]]
---------------------
type: 3 * var * int64