How to create arrays with ArrayBuilder (easy and general)#
If you’re not getting data from a file or conversion from another format, you may need to create it from scratch. ak.ArrayBuilder is a general, high-level way to do that, though it has performance limitations.
The biggest difference between an ak.ArrayBuilder and an ak.Array is that you can append data to a builder, but an array is immutable (see qualifications on mutability). It’s a bit like a Python list, which has an append method, but ak.ArrayBuilder has many methods for appending different types of structures.
Appending#
import awkward as ak
When a builder is first created, it has zero length and unknown type.
builder = ak.ArrayBuilder()
builder
<ArrayBuilder type='0 * unknown'>
Calling its append method adds data and also determines its type.
builder.append(1)
builder
<ArrayBuilder type='1 * int64'>
builder.append(2.2)
builder
<ArrayBuilder type='2 * float64'>
builder.append(3 + 1j)
builder
<ArrayBuilder type='3 * complex128'>
Note that this can include missing data by promoting to an option-type,
builder.append(None)
builder
<ArrayBuilder type='4 * ?complex128'>
and mix types by promoting to a union-type:
builder.append("five")
builder
<ArrayBuilder type='5 * union[?complex128, ?string]'>
builder.type.show()
5 * union[
?complex128,
?string
]
We’ve been using “append” because it is generic (it recognizes the types of its arguments and builds that), but there are also methods for building structure explicitly.
builder = ak.ArrayBuilder()
builder.boolean(False)
builder.integer(1)
builder.real(2.2)
builder.complex(3 + 1j)
builder.null()
builder.string("five")
builder
<ArrayBuilder type='6 * union[?bool, ?complex128, ?string]'>
builder.type.show()
6 * union[
?bool,
?complex128,
?string
]
Snapshot#
To turn an ak.ArrayBuilder into an ak.Array, call snapshot. This is an inexpensive operation that can be done multiple times; the builder is unaffected.
array = builder.snapshot()
array
[False, 1+0j, 2.2+0j, 3+1j, None, 'five']
----------------------------------------------------------
backend: cpu
nbytes: 139 B
type: 6 * union[
    ?bool,
    ?complex128,
    ?string
]
Builders don’t have all the high-level methods that arrays do, so if you want to use the array for normal analysis, remember to take a snapshot.
Nested lists#
The most useful of these methods create nested data structures:
begin_list/end_list
begin_record/end_record
begin_tuple/end_tuple
which switch into a mode that starts filling inside of a list, record, or tuple. For records and tuples, you additionally have to specify the field or index of the record or tuple (respectively).
builder = ak.ArrayBuilder()
builder.begin_list()
builder.append(1.1)
builder.append(2.2)
builder.append(3.3)
builder.end_list()
builder.begin_list()
builder.end_list()
builder.begin_list()
builder.append(4.4)
builder.append(5.5)
builder.end_list()
builder
<ArrayBuilder type='3 * var * float64'>
Appending between begin_list and end_list puts data inside the list; appending after end_list puts it outside, at the top level:
builder.append(9.9)
builder
<ArrayBuilder type='4 * union[var * float64, float64]'>
This 9.9 is outside of the lists, and hence the type is now “lists of numbers or numbers.”
builder.type.show()
4 * union[
var * float64,
float64
]
Since begin_list and end_list are imperative, the nesting structure of an array can be determined by program flow:
def arbitrary_nesting(builder, depth):
    if depth == 0:
        builder.append(1)
        builder.append(2)
        builder.append(3)
    else:
        builder.begin_list()
        arbitrary_nesting(builder, depth - 1)
        builder.end_list()
builder = ak.ArrayBuilder()
arbitrary_nesting(builder, 5)
builder
<ArrayBuilder type='1 * var * var * var * var * var * int64'>
Often, you’ll know the exact depth of nesting you want. The Python with statement can be used to restrict the generality (and free you from having to remember to end what you begin).
builder = ak.ArrayBuilder()
with builder.list():
    with builder.list():
        builder.append(1)
        builder.append(2)
        builder.append(3)
builder
<ArrayBuilder type='1 * var * var * int64'>
(Note that the Python with statement, a.k.a. “context manager,” is not available in Numba jit-compiled functions, in case you’re using ak.ArrayBuilder in Numba.)
Nested records#
When using begin_record/end_record (or the equivalent record in the with statement), you have to specify which field each “append” is associated with.
field("fieldname"): switches to fill a field with a given name (and returns the builder, for convenience).
builder = ak.ArrayBuilder()
with builder.record():
    builder.field("x").append(1)
    builder.field("y").append(2.2)
    builder.field("z").append("three")
builder
<ArrayBuilder type='1 * {x: int64, y: float64, z: string}'>
The record type can also be given a name.
builder = ak.ArrayBuilder()
with builder.record("Point"):
    builder.field("x").real(1.1)
    builder.field("y").real(2.2)
    builder.field("z").real(3.3)
builder
<ArrayBuilder type='1 * Point[x: float64, y: float64, z: float64]'>
This gives the resulting records a type named “Point”, which might have specialized behaviors.
array = builder.snapshot()
array
[{x: 1.1, y: 2.2, z: 3.3}]
-----------------------------------------------------------------
backend: cpu
nbytes: 24 B
type: 1 * Point[
    x: float64,
    y: float64,
    z: float64
]
Nested tuples#
The same is true for tuples, but the next field to fill is selected by “index” (integer), rather than “field” (string), and the tuple size has to be given up-front.
builder = ak.ArrayBuilder()
with builder.tuple(3):
    builder.index(0).append(1)
    builder.index(1).append(2.2)
    builder.index(2).append("three")
builder
<ArrayBuilder type='1 * (int64, float64, string)'>
Records and unions#
If the set of fields changes while collecting records, the builder algorithm could handle it one of two possible ways:
1. Assume that the new field or fields have simply been missing up to this point, and that any now-unspecified fields are also missing.
2. Assume that a different set of fields means a different type and make a union.
By default, ak.ArrayBuilder follows policy (1), but it can be made to follow policy (2) if the names of the records are different.
policy1 = ak.ArrayBuilder()
with policy1.record():
    policy1.field("x").append(1)
    policy1.field("y").append(1.1)
with policy1.record():
    policy1.field("y").append(2.2)
    policy1.field("z").append("three")
policy1.type.show()
2 * {
x: ?int64,
y: float64,
z: ?string
}
policy2 = ak.ArrayBuilder()
with policy2.record("First"):
    policy2.field("x").append(1)
    policy2.field("y").append(1.1)
with policy2.record("Second"):
    policy2.field("y").append(2.2)
    policy2.field("z").append("three")
policy2.type.show()
2 * union[
First[
x: int64,
y: float64
],
Second[
y: float64,
z: string
]
]
Use in Numba#
ak.ArrayBuilder can be used in Numba-compiled functions, and that can often be the most convenient way to build up an array, relatively quickly (see below).
There are a few limitations, though:
At the time of writing, Numba doesn’t support Python’s with statement (context manager), so begin_list/end_list will have to be used instead.
Builders cannot be constructed inside of the compiled function; they have to be passed in.
The snapshot method cannot be called inside of the compiled function; it has to be applied to the output.
Therefore, a common pattern is:
import numba as nb
@nb.jit
def build(builder):
    builder.begin_list()
    builder.append(1.1)
    builder.append(2.2)
    builder.append(3.3)
    builder.end_list()
    builder.begin_list()
    builder.end_list()
    builder.begin_list()
    builder.append(4.4)
    builder.append(5.5)
    builder.end_list()
    return builder
array = build(ak.ArrayBuilder()).snapshot()
array
[[1.1, 2.2, 3.3], [], [4.4, 5.5]]
-----------------------
backend: cpu
nbytes: 72 B
type: 3 * var * float64
Setting the type of empty lists#
In addition to supporting type-discovery at execution time, ak.ArrayBuilder also makes it convenient to work with complex, ragged arrays when the type is known ahead of time. Although it is not the most performant means of constructing an array whose type is already known, it provides a readable abstraction in the event that building the array is not a limiting factor for performance. However, due to this “on-line” type-discovery, the type of the result of ak.ArrayBuilder.snapshot() can depend on which data were filled. Consider this function that builds an array from the contents of some iterable:
def process_data(builder, data):
    for item in data:
        if item < 0:
            builder.null()
        else:
            builder.integer(item)
    return builder.snapshot()
If we pass in only positive integers, the result is an array of integers:
process_data(
ak.ArrayBuilder(),
[1, 2, 3, 4],
)
[1, 2, 3, 4]
---------------
backend: cpu
nbytes: 32 B
type: 4 * int64
If we pass in only negative integers, the result is an array of Nones with an unknown type:
process_data(
ak.ArrayBuilder(),
[-1, -2, -3, -4],
)
[None, None, None, None]
------------------
backend: cpu
nbytes: 32 B
type: 4 * ?unknown
It is only if we pass in a mix of these values that we see the “full” array type:
process_data(
ak.ArrayBuilder(),
[1, 2, 3, 4, -1, -2, -3, -4],
)
[1, 2, 3, 4, None, None, None, None]
----------------
backend: cpu
nbytes: 96 B
type: 8 * ?int64
A simple way to solve this problem is to explore all code branches explicitly, and remove the generated entry(ies) from the final array:
def process_data(builder, data):
    for item in data:
        if item < 0:
            builder.null()
        else:
            builder.integer(item)
    # Ensure we have the proper type
    builder.integer(1)
    builder.null()
    return builder.snapshot()[:-2]
The previous examples now have the same type:
process_data(
ak.ArrayBuilder(),
[1, 2, 3, 4],
)
[1, 2, 3, 4]
----------------
backend: cpu
nbytes: 72 B
type: 4 * ?int64
process_data(
ak.ArrayBuilder(),
[-1, -2, -3, -4],
)
[None, None, None, None]
----------------
backend: cpu
nbytes: 40 B
type: 4 * ?int64
process_data(
ak.ArrayBuilder(),
[1, 2, 3, 4, -1, -2, -3, -4],
)
[1, 2, 3, 4, None, None, None, None]
----------------
backend: cpu
nbytes: 104 B
type: 8 * ?int64
Comments on performance#
Although ak.ArrayBuilder is implemented in C++, it is dynamically typed by design. The advantage of compiled code over interpreted code often comes in the knowledge of data types at compile-time, enabling fewer runtime checks and more compiler optimizations.
If you’re using a builder in Python, there’s also the overhead of calling from Python.
If you’re using a builder in Numba, the builder calls are external function calls and LLVM can’t inline them for optimizations.
Whenever you have a choice between
using the ak.ArrayBuilder,
constructing an array manually from layouts (next chapter), or
filling a NumPy array and using it as an index,
the alternatives are often faster. The point of ak.ArrayBuilder is that it is easy.
Comments on union-type#
Although it’s easy to make union-type data with ak.ArrayBuilder, the applications of union-type data are more limited. For instance, we can select a field that belongs to all types of the union, but not a field that belongs to only some of them.
Selecting such a field would be no problem for records collected using policy 1 (see previous section).
At the time of writing, union-types are not supported in Numba (issue 174).