How to create arrays with ArrayBuilder (easy and general)

If you’re not getting data from a file or converting it from another format, you may need to create it from scratch. ak.ArrayBuilder is a general, high-level way to do that, though it has performance limitations.

The biggest difference between an ak.ArrayBuilder and an ak.Array is that you can append data to a builder, but an array is immutable (see qualifications on mutability). It’s a bit like a Python list, which has an append method, but ak.ArrayBuilder has many methods for appending different types of structures.

Appending

import awkward as ak

When a builder is first created, it has zero length and unknown type.

builder = ak.ArrayBuilder()
builder
<ArrayBuilder [] type='0 * unknown'>

Calling its append method adds data and also determines its type.

builder.append(1)
builder
<ArrayBuilder [1] type='1 * int64'>
builder.append(2.2)
builder
<ArrayBuilder [1, 2.2] type='2 * float64'>
builder.append(3+1j)
builder
<ArrayBuilder [(1+0j), (2.2+0j), (3+1j)] type='3 * complex128'>

Note that this can include missing data by promoting to an option-type,

builder.append(None)
builder
<ArrayBuilder [(1+0j), (2.2+0j), (3+1j), None] type='4 * ?complex128'>

and mix types by promoting to a union-type:

builder.append("five")
builder
<ArrayBuilder [(1+0j), (2.2+0j), ... None, 'five'] type='5 * ?union[complex128, ...'>
builder.type
5 * ?union[complex128, string]

We’ve been using “append” because it is generic (it recognizes the type of its argument and builds the corresponding structure), but there are also methods for building structure explicitly.

builder = ak.ArrayBuilder()
builder.boolean(False)
builder.integer(1)
builder.real(2.2)
builder.complex(3+1j)
builder.null()
builder.string("five")
builder
<ArrayBuilder [False, (1+0j), ... None, 'five'] type='6 * ?union[bool, complex12...'>
builder.type
6 * ?union[bool, complex128, string]

Snapshot

To turn an ak.ArrayBuilder into an ak.Array, call snapshot. This is an inexpensive operation that may be done multiple times; the builder is unaffected and can continue to be filled.

array = builder.snapshot()
array
<Array [False, (1+0j), ... None, 'five'] type='6 * ?union[bool, complex128, string]'>

Builders don’t have all the high-level methods that arrays do, so if you want to use the array for normal analysis, remember to take a snapshot.

Nested lists

The most useful of these explicit methods create nested data structures:

  • begin_list/end_list

  • begin_record/end_record

  • begin_tuple/end_tuple

which switch into a mode that starts filling inside of a list, record, or tuple. For records and tuples, you additionally have to specify the field or index of the record or tuple (respectively).

builder = ak.ArrayBuilder()

builder.begin_list()
builder.append(1.1)
builder.append(2.2)
builder.append(3.3)
builder.end_list()

builder.begin_list()
builder.end_list()

builder.begin_list()
builder.append(4.4)
builder.append(5.5)
builder.end_list()

builder
<ArrayBuilder [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>

Appending after begin_list puts data inside the list; appending after the matching end_list puts it outside:

builder.append(9.9)
builder
<ArrayBuilder [[1.1, 2.2, 3.3], ... [4.4, 5.5], 9.9] type='4 * union[var * float...'>

This 9.9 is outside of the lists, and hence the type is now “lists of numbers or numbers.”

builder.type
4 * union[var * float64, float64]

Since begin_list and end_list are imperative, the nesting structure of an array can be determined by program flow:

def arbitrary_nesting(builder, depth):
    if depth == 0:
        builder.append(1)
        builder.append(2)
        builder.append(3)
    else:
        builder.begin_list()
        arbitrary_nesting(builder, depth - 1)
        builder.end_list()

builder = ak.ArrayBuilder()
arbitrary_nesting(builder, 5)
builder
<ArrayBuilder [[[[[[1, 2, 3]]]]]] type='1 * var * var * var * var * var * int64'>

Often, you’ll know the exact depth of nesting you want. The Python with statement can be used to restrict the generality (and free you from having to remember to end what you begin).

builder = ak.ArrayBuilder()

with builder.list():
    with builder.list():
        builder.append(1)
        builder.append(2)
        builder.append(3)

builder
<ArrayBuilder [[[1, 2, 3]]] type='1 * var * var * int64'>

(Note that the Python with statement, a.k.a. “context manager,” is not available in Numba jit-compiled functions, in case you’re using ak.ArrayBuilder in Numba.)

Nested records

When using begin_record/end_record (or the equivalent record in the with statement), you have to specify which field each “append” is associated with.

  • field("fieldname"): switches to fill a field with a given name (and returns the builder, for convenience).

builder = ak.ArrayBuilder()

with builder.record():
    builder.field("x").append(1)
    builder.field("y").append(2.2)
    builder.field("z").append("three")

builder
<ArrayBuilder [{x: 1, y: 2.2, z: 'three'}] type='1 * {"x": int64, "y": float64, ...'>

The record type can also be given a name.

builder = ak.ArrayBuilder()

with builder.record("Point"):
    builder.field("x").real(1.1)
    builder.field("y").real(2.2)
    builder.field("z").real(3.3)

builder
<ArrayBuilder [{x: 1.1, y: 2.2, z: 3.3}] type='1 * Point["x": float64, "y": floa...'>

This gives the resulting records a type named “Point”, which might have specialized behaviors.

array = builder.snapshot()
array
<Array [{x: 1.1, y: 2.2, z: 3.3}] type='1 * Point["x": float64, "y": float64, "z...'>
array.type
1 * Point["x": float64, "y": float64, "z": float64]

Nested tuples

The same is true for tuples, but the next field to fill is selected by “index” (integer), rather than “field” (string), and the tuple size has to be given up-front.

builder = ak.ArrayBuilder()

with builder.tuple(3):
    builder.index(0).append(1)
    builder.index(1).append(2.2)
    builder.index(2).append("three")

builder
<ArrayBuilder [(1, 2.2, 'three')] type='1 * (int64, float64, string)'>

Records and unions

If the set of fields changes while collecting records, the builder algorithm could handle it one of two possible ways:

  1. Assume that the new field or fields have simply been missing up to this point, and that any now-unspecified fields are also missing.

  2. Assume that a different set of fields means a different type and make a union.

By default, ak.ArrayBuilder follows policy (1), but it can be made to follow policy (2) if the names of the records are different.

policy1 = ak.ArrayBuilder()

with policy1.record():
    policy1.field("x").append(1)
    policy1.field("y").append(1.1)

with policy1.record():
    policy1.field("y").append(2.2)
    policy1.field("z").append("three")

print(policy1)
policy1.type
[{x: 1, y: 1.1, z: None}, {x: None, y: 2.2, z: 'three'}]
2 * {"x": ?int64, "y": float64, "z": option[string]}

policy2 = ak.ArrayBuilder()

with policy2.record("First"):
    policy2.field("x").append(1)
    policy2.field("y").append(1.1)

with policy2.record("Second"):
    policy2.field("y").append(2.2)
    policy2.field("z").append("three")

print(policy2)
policy2.type
[{x: 1, y: 1.1}, {y: 2.2, z: 'three'}]
2 * union[First["x": int64, "y": float64], Second["y": float64, "z": string]]

Comments on union-type

Although it’s easy to make union-type data with ak.ArrayBuilder, the applications of union-type data are more limited. For instance, we can select a field that belongs to all types of the union, but not a field that belongs to only some of them.

array2 = policy2.snapshot()
array2
<Array [{x: 1, y: 1.1}, ... z: 'three'}] type='2 * union[First["x": int64, "y": ...'>
array2.y
<Array [1.1, 2.2] type='2 * float64'>
# array2.x    would raise AttributeError

The above would be no problem for records collected using policy 1 (see previous section).

array1 = policy1.snapshot()
array1
<Array [{x: 1, y: 1.1, ... z: 'three'}] type='2 * {"x": ?int64, "y": float64, "z...'>
array1.y
<Array [1.1, 2.2] type='2 * float64'>
array1.x
<Array [1, None] type='2 * ?int64'>

At the time of writing, union-types are not supported in Numba (issue 174).

Use in Numba

ak.ArrayBuilder can be used in Numba-compiled functions, and that can often be the most convenient way to build up an array, relatively quickly (see below).

There are a few limitations, though:

  • At the time of writing, Numba doesn’t support Python’s with statement (context manager), so begin_list/end_list will have to be used instead.

  • Builders cannot be constructed inside of the compiled function; they have to be passed in.

  • The snapshot method cannot be called inside of the compiled function; it has to be applied to the output.

Therefore, a common pattern is:

import numba as nb

@nb.jit
def build(builder):
    builder.begin_list()
    builder.append(1.1)
    builder.append(2.2)
    builder.append(3.3)
    builder.end_list()
    builder.begin_list()
    builder.end_list()
    builder.begin_list()
    builder.append(4.4)
    builder.append(5.5)
    builder.end_list()
    return builder

array = build(ak.ArrayBuilder()).snapshot()
array
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>

Appending parts of an existing array

If the argument of the append function is part of another Awkward Array, that array will be linked into the new array, rather than reconstructing the original by iterating over it. That can be a performance advantage (appending records with 1000 fields takes as much time as appending records with 1 field), but it can prevent large data structures from being garbage-collected, because a reference to them exists in the new array.

original = ak.Array([{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}])

builder = ak.ArrayBuilder()
builder.append(original[2])
builder.append(original[1])
builder.append(original[0])
builder.append(original[1])
builder.append(original[1])
builder.append(original[0])
builder.append(original[2])
builder.append(original[2])

new_array = builder.snapshot()
new_array
<Array [{x: 3, y: 3.3}, ... {x: 3, y: 3.3}] type='8 * {"x": int64, "y": float64}'>
new_array.layout
<IndexedArray64>
    <index><Index64 i="[2 1 0 1 1 0 2 2]" offset="0" length="8" at="0x00000390ef10"/></index>
    <content><RecordArray length="3">
        <field index="0" key="x">
            <NumpyArray format="l" shape="3" data="1 2 3" at="0x00000390aef0"/>
        </field>
        <field index="1" key="y">
            <NumpyArray format="d" shape="3" data="1.1 2.2 3.3" at="0x00000390cf00"/>
        </field>
    </RecordArray></content>
</IndexedArray64>

Above, we see that new_array just holds references (an ak.layout.IndexedArray) into an ak.layout.RecordArray with x = [1, 2, 3] and y = [1.1, 2.2, 3.3].

Comments on performance

Although ak.ArrayBuilder is implemented in C++, it is dynamically typed by design. The advantage of compiled code over interpreted code often comes in the knowledge of data types at compile-time, enabling fewer runtime checks and more compiler optimizations.

If you’re using a builder in Python, there’s also the overhead of calling from Python.

If you’re using a builder in Numba, the builder calls are external function calls and LLVM can’t inline them for optimizations.

Whenever you have a choice between

  1. using the ak.ArrayBuilder,

  2. constructing an array manually from layouts (next chapter), or

  3. filling a NumPy array and using it as an index,

options 2 and 3 are often faster than the builder. The point of ak.ArrayBuilder is that it is easy.