How to restructure arrays with zip/unzip and project#

Unzipping an array of records#

As discussed in How to create arrays of records, in addition to primitive types like numpy.float64 and numpy.datetime64, Awkward Arrays can also contain records. These records are formed from a fixed number of optionally named fields.

import awkward as ak
import numpy as np

records = ak.Array(
    [
        {"x": 1, "y": 1.1, "z": "one"},
        {"x": 2, "y": 2.2, "z": "two"},
        {"x": 3, "y": 3.3, "z": "three"},
        {"x": 4, "y": 4.4, "z": "four"},
        {"x": 5, "y": 5.5, "z": "five"},
    ]
)

[{x: 1, y: 1.1, z: 'one'},
 {x: 2, y: 2.2, z: 'two'},
 {x: 3, y: 3.3, z: 'three'},
 {x: 4, y: 4.4, z: 'four'},
 {x: 5, y: 5.5, z: 'five'}]
---------------------------------------------------------
backend: cpu
nbytes: 147 B
type: 5 * {
    x: int64,
    y: float64,
    z: string
}

Although it is useful to be able to create arrays from a sequence of records (as arrays of structures), Awkward Array implements arrays as structures of arrays. It is therefore more natural to think about arrays in terms of their fields. In the above example, we have created an array of records from a list of dictionaries. We can see that the x field of records contains five numpy.int64 values:

records.x

[1,
 2,
 3,
 4,
 5]
---------------
backend: cpu
nbytes: 40 B
type: 5 * int64

If we wanted to look at each of the fields of records, we could pull them out individually from the array:

records.y

[1.1,
 2.2,
 3.3,
 4.4,
 5.5]
-----------------
backend: cpu
nbytes: 40 B
type: 5 * float64

records.z

['one',
 'two',
 'three',
 'four',
 'five']
----------------
backend: cpu
nbytes: 67 B
type: 5 * string

Clearly, for arrays with a large number of fields, retrieving each field in this manner would become tedious rather quickly. ak.unzip() can be used to directly build a tuple of the field arrays:

ak.unzip(records)

(<Array [1, 2, 3, 4, 5] type='5 * int64'>,
 <Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>,
 <Array ['one', 'two', 'three', 'four', 'five'] type='5 * string'>)

Records are not required to have field names. A record without field names is known as a “tuple”, e.g.

tuples = ak.Array(
    [
        (1, 1.1, "one"),
        (2, 2.2, "two"),
        (3, 3.3, "three"),
        (4, 4.4, "four"),
        (5, 5.5, "five"),
    ]
)

[(1, 1.1, 'one'),
 (2, 2.2, 'two'),
 (3, 3.3, 'three'),
 (4, 4.4, 'four'),
 (5, 5.5, 'five')]
------------------------------------------------
backend: cpu
nbytes: 147 B
type: 5 * (
    int64,
    float64,
    string
)

If we unzip an array of tuples, we obtain the same result as for records:

ak.unzip(tuples)

(<Array [1, 2, 3, 4, 5] type='5 * int64'>,
 <Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>,
 <Array ['one', 'two', 'three', 'four', 'five'] type='5 * string'>)

ak.unzip() can be combined with ak.fields() to build a mapping from field name to field array:

dict(zip(ak.fields(records), ak.unzip(records)))

{'x': <Array [1, 2, 3, 4, 5] type='5 * int64'>,
 'y': <Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>,
 'z': <Array ['one', 'two', 'three', 'four', 'five'] type='5 * string'>}

For tuples, the field names will be strings corresponding to the field index:

dict(zip(ak.fields(tuples), ak.unzip(tuples)))

{'0': <Array [1, 2, 3, 4, 5] type='5 * int64'>,
 '1': <Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>,
 '2': <Array ['one', 'two', 'three', 'four', 'five'] type='5 * string'>}

Zipping together arrays#

Because Awkward Arrays unzip into distinct arrays, it is reasonable to ask whether the reverse is possible, i.e. given the following arrays

age = ak.Array([18, 32, 87, 55])
name = ak.Array(["Dorit", "Caitlin", "Theodor", "Albano"])

can we form an array of records? The ak.zip() function provides a way to join compatible arrays into a single array of records:

people = ak.zip({"age": age, "name": name})

[{age: 18, name: 'Dorit'},
 {age: 32, name: 'Caitlin'},
 {age: 87, name: 'Theodor'},
 {age: 55, name: 'Albano'}]
----------------------------------------------
backend: cpu
nbytes: 97 B
type: 4 * {
    age: int64,
    name: string
}

Similarly, we could also build an array of tuples by passing a sequence of arrays:

ak.zip([age, name])

[(18, 'Dorit'),
 (32, 'Caitlin'),
 (87, 'Theodor'),
 (55, 'Albano')]
-----------------------------------
backend: cpu
nbytes: 97 B
type: 4 * (
    int64,
    string
)

Zipping and unzipping arrays is a lightweight operation, and so you should not hesitate to zip together arrays if it makes sense for the problem at hand. One of the benefits of combining arrays into an array of records is that slicing and masking operations are applied to all fields, e.g.

people[age > 35]

[{age: 87, name: 'Theodor'},
 {age: 55, name: 'Albano'}]
----------------------------------------------
backend: cpu
nbytes: 113 B
type: 2 * {
    age: int64,
    name: string
}

Arrays with different dimensions#

So far, we’ve looked at simple arrays with the same dimension in each field. It is actually possible to build arrays with fields of different dimensions, e.g.

x = ak.Array(
    [
        103,
        450,
        33,
        4,
    ]
)

digits_of_x = ak.Array(
    [
        [1, 0, 3],
        [4, 5, 0],
        [3, 3],
        [4],
    ]
)
x_and_digits = ak.zip({"x": x, "digits": digits_of_x})

[[{x: 103, digits: 1}, {x: 103, digits: 0}, {x: 103, digits: 3}],
 [{x: 450, digits: 4}, {x: 450, digits: 5}, {x: 450, digits: 0}],
 [{x: 33, digits: 3}, {x: 33, digits: 3}],
 [{x: 4, digits: 4}]]
-----------------------------------------------------------------
backend: cpu
nbytes: 184 B
type: 4 * var * {
    x: int64,
    digits: int64
}

The type of this array is

x_and_digits.type

ArrayType(ListType(RecordType([NumpyType('int64'), NumpyType('int64')], ['x', 'digits'])), 4, None)

Note that the x field has changed type:

x.type

ArrayType(NumpyType('int64'), 4, None)

x_and_digits.x.type

ArrayType(ListType(NumpyType('int64')), 4, None)

In zipping the two arrays together, x has been broadcast against digits_of_x, so each value of x is repeated once per digit. Sometimes you might want to limit the broadcasting to a particular depth (dimension). This can be done by passing the depth_limit parameter:

x_and_digits = ak.zip({"x": x, "digits": digits_of_x}, depth_limit=1)

[{x: 103, digits: [1, 0, 3]},
 {x: 450, digits: [4, 5, 0]},
 {x: 33, digits: [3, 3]},
 {x: 4, digits: [4]}]
---------------------------------------------------
backend: cpu
nbytes: 144 B
type: 4 * {
    x: int64,
    digits: var * int64
}

Now the x field has a single dimension:

x_and_digits.x.type

ArrayType(NumpyType('int64'), 4, None)

Arrays with different dimension lengths#

What happens if we zip together arrays with the same number of dimensions, but different lengths in each dimension?

x_and_y = ak.Array(
    [
        [103, 903],
        [450, 83],
        [33, 8],
        [4, 109],
    ]
)

digits_of_x_and_y = ak.Array(
    [
        [1, 0, 3, 9, 0, 3],
        [4, 5, 0, 8, 3],
        [3, 3, 8],
        [4, 1, 0, 9],
    ]
)

ak.zip({"x_and_y": x_and_y, "digits": digits_of_x_and_y})

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[21], line 19
x_and_y = ak.Array(
   [
       [103, 903],
   (...)
   ]
)
digits_of_x_and_y = ak.Array(
   [
       [1, 0, 3, 9, 0, 3],
   (...)
   ]
)
---> 19 ak.zip({"x_and_y": x_and_y, "digits": digits_of_x_and_y})

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_dispatch.py:41, in named_high_level_function.<locals>.dispatch(*args, **kwargs)
@wraps(func)
def dispatch(*args, **kwargs):
   # NOTE: this decorator assumes that the operation is exposed under `ak.`
---> 41     with OperationErrorContext(name, args, kwargs):
       gen_or_result = func(*args, **kwargs)
       if isgenerator(gen_or_result):

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_errors.py:80, in ErrorContext.__exit__(self, exception_type, exception_value, traceback)
   self._slate.__dict__.clear()
   # Handle caught exception
---> 80     raise self.decorate_exception(exception_type, exception_value)
else:
   # Step out of the way so that another ErrorContext can become primary.
   if self.primary() is self:

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_dispatch.py:67, in named_high_level_function.<locals>.dispatch(*args, **kwargs)
# Failed to find a custom overload, so resume the original function
try:
---> 67     next(gen_or_result)
except StopIteration as err:
   return err.value

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/operations/ak_zip.py:153, in zip(arrays, depth_limit, parameters, with_name, right_broadcast, optiontype_outside_record, highlevel, behavior, attrs)
   yield arrays
# Implementation
--> 153 return _impl(
   arrays,
   depth_limit,
   parameters,
   with_name,
   right_broadcast,
   optiontype_outside_record,
   highlevel,
   behavior,
   attrs,
)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/operations/ak_zip.py:247, in _impl(arrays, depth_limit, parameters, with_name, right_broadcast, optiontype_outside_record, highlevel, behavior, attrs)
       return None
depth_context, lateral_context = NamedAxesWithDims.prepare_contexts(
   list(arrays.values()) if isinstance(arrays, Mapping) else list(arrays)
)
--> 247 out = ak._broadcasting.broadcast_and_apply(
   layouts,
   action,
   depth_context=depth_context,
   lateral_context=lateral_context,
   right_broadcast=right_broadcast,
)
assert isinstance(out, tuple) and len(out) == 1
out = out[0]

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1219, in broadcast_and_apply(inputs, action, depth_context, lateral_context, allow_records, left_broadcast, right_broadcast, numpy_to_regular, regular_to_jagged, function_name, broadcast_parameters_rule)
backend = backend_of(*inputs, coerce_to_common=False)
isscalar = []
-> 1219 out = apply_step(
   backend,
   broadcast_pack(inputs, isscalar),
   action,
   0,
   depth_context,
   lateral_context,
   {
       "allow_records": allow_records,
       "left_broadcast": left_broadcast,
       "right_broadcast": right_broadcast,
       "numpy_to_regular": numpy_to_regular,
       "regular_to_jagged": regular_to_jagged,
       "function_name": function_name,
       "broadcast_parameters_rule": broadcast_parameters_rule,
   },
)
assert isinstance(out, tuple)
return tuple(broadcast_unpack(x, isscalar) for x in out)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1197, in apply_step(backend, inputs, action, depth, depth_context, lateral_context, options)
   return result
elif result is None:
-> 1197     return continuation()
else:
   raise AssertionError(result)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1166, in apply_step.<locals>.continuation()
# Any non-string list-types?
elif any(x.is_list and not is_string_like(x) for x in contents):
-> 1166     return broadcast_any_list()
# Any RecordArrays?
elif any(x.is_record for x in contents):

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:670, in apply_step.<locals>.broadcast_any_list()
       nextinputs.append(x)
       nextparameters.append(NO_PARAMETERS)
--> 670 outcontent = apply_step(
   backend,
   nextinputs,
   action,
   depth + 1,
   copy.copy(depth_context),
   lateral_context,
   options,
)
assert isinstance(outcontent, tuple)
parameters = parameters_factory(nextparameters, len(outcontent))

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1197, in apply_step(backend, inputs, action, depth, depth_context, lateral_context, options)
   return result
elif result is None:
-> 1197     return continuation()
else:
   raise AssertionError(result)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1166, in apply_step.<locals>.continuation()
# Any non-string list-types?
elif any(x.is_list and not is_string_like(x) for x in contents):
-> 1166     return broadcast_any_list()
# Any RecordArrays?
elif any(x.is_record for x in contents):

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:721, in apply_step.<locals>.broadcast_any_list()
for i, ((named_axis, ndim), x, x_is_string) in enumerate(
   zip(named_axes_with_ndims, inputs, input_is_string)
):
   if isinstance(x, listtypes) and not x_is_string:
--> 721         next_content = broadcast_to_offsets_avoiding_carry(x, offsets)
       nextinputs.append(next_content)
       nextparameters.append(x._parameters)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:386, in broadcast_to_offsets_avoiding_carry(list_content, offsets)
       return list_content.content[:next_length]
   else:
--> 386         return list_content._broadcast_tooffsets64(offsets).content
elif isinstance(list_content, ListArray):
   # Is this list contiguous?
   if nplike.array_equal(
       list_content.starts.data[1:], list_content.stops.data[:-1]
   ):
       # Does this list match the offsets?

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/listoffsetarray.py:429, in ListOffsetArray._broadcast_tooffsets64(self, offsets)
   next_content = self._content[this_start:]
if nplike.known_data and not nplike.array_equal(
   this_zero_offsets, offsets.data
):
--> 429     raise ValueError("cannot broadcast nested list")
return ListOffsetArray(
   offsets, next_content[: offsets[-1]], parameters=self._parameters
)

ValueError: cannot broadcast nested list

This error occurred while calling

    ak.zip(
        {'x_and_y': <Array [[103, 903], [450, 83], [33, ...], [4, 109]] type=...
    )

Zipping arrays that cannot be broadcast against each other raises a ValueError. In this case, the inner lists have different lengths, so we want to stop broadcasting at the first dimension (depth_limit=1):

ak.zip({"x_and_y": x_and_y, "digits": digits_of_x_and_y}, depth_limit=1)

[{x_and_y: [103, 903], digits: [1, 0, 3, ..., 0, 3]},
 {x_and_y: [450, 83], digits: [4, 5, 0, 8, 3]},
 {x_and_y: [33, 8], digits: [3, 3, 8]},
 {x_and_y: [4, 109], digits: [4, 1, 0, 9]}]
---------------------------------------------------------------
backend: cpu
nbytes: 288 B
type: 4 * {
    x_and_y: var * int64,
    digits: var * int64
}

Projecting arrays#

Sometimes we are interested only in a subset of the fields of an array. For example, imagine that we have an array of coordinates on the $\hat{x} \hat{y}$ plane:

triangle = ak.Array(
    [
        {"x": 1, "y": 6, "z": 0},
        {"x": 2, "y": 7, "z": 0},
        {"x": 3, "y": 8, "z": 0},
    ]
)

[{x: 1, y: 6, z: 0},
 {x: 2, y: 7, z: 0},
 {x: 3, y: 8, z: 0}]
------------------------------------------------------
backend: cpu
nbytes: 72 B
type: 3 * {
    x: int64,
    y: int64,
    z: int64
}

If we know that these points should lie on a plane, then we might wish to discard the $\hat{z}$ coordinate. We can do this by slicing only the $\hat{x}$ and $\hat{y}$ fields:

triangle_2d = triangle[["x", "y"]]

[{x: 1, y: 6},
 {x: 2, y: 7},
 {x: 3, y: 8}]
----------------------------------------
backend: cpu
nbytes: 48 B
type: 3 * {
    x: int64,
    y: int64
}

Note that the key passed to the subscript operator is a list ["x", "y"], not a tuple. Awkward Array recognises the list to mean “take both the "x" and "y" fields”.

Projections can be combined with array slicing and masking, e.g.

triangle_2d_first_2 = triangle[:2, ["x", "y"]]

[{x: 1, y: 6},
 {x: 2, y: 7}]
----------------------------------------
backend: cpu
nbytes: 64 B
type: 2 * {
    x: int64,
    y: int64
}

Let’s now consider an array of triangles, each stored as a list of vertices:

triangles = ak.Array(
    [
        [
            {"x": 1, "y": 6, "z": 0},
            {"x": 2, "y": 7, "z": 0},
            {"x": 3, "y": 8, "z": 0},
        ],
        [
            {"x": 4, "y": 9, "z": 0},
            {"x": 5, "y": 10, "z": 0},
            {"x": 6, "y": 11, "z": 0},
        ],
    ]
)

[[{x: 1, y: 6, z: 0}, {x: 2, y: 7, z: 0}, {x: 3, y: 8, z: 0}],
 [{x: 4, y: 9, z: 0}, {x: 5, y: 10, z: 0}, {x: 6, y: 11, z: 0}]]
----------------------------------------------------------------
backend: cpu
nbytes: 168 B
type: 2 * var * {
    x: int64,
    y: int64,
    z: int64
}

We can combine an int index 0 with a str projection to view the "x" coordinates of the first triangle’s vertices:

triangles[0, "x"]

[1,
 2,
 3]
---------------
backend: cpu
nbytes: 24 B
type: 3 * int64

We could even ignore the first vertex of that triangle:

triangles[0, 1:, "x"]

[2,
 3]
---------------
backend: cpu
nbytes: 64 B
type: 2 * int64

Projections commute (to the left) with other indices to produce the same result as their “natural” position. This means that the above projection could also be written as

triangles[0, "x", 1:]

[2,
 3]
---------------
backend: cpu
nbytes: 16 B
type: 2 * int64

or even

triangles["x", 0, 1:]

[2,
 3]
---------------
backend: cpu
nbytes: 16 B
type: 2 * int64

For columnar Awkward Arrays, there is no performance difference between any of these approaches; projecting the records of an array just changes its metadata, rather than invoking any loops over the data.

Projecting records-of-records#

The records of an array can themselves contain records:

polygon = ak.Array(
    [
        {
            "vertex": [
                {"x": 1, "y": 6, "z": 0},
                {"x": 2, "y": 7, "z": 0},
                {"x": 3, "y": 8, "z": 0},
            ],
            "normal": [
                {"x": 0.164, "y": 0.986, "z": 0.0},
                {"x": 0.275, "y": 0.962, "z": 0.0},
                {"x": 0.351, "y": 0.936, "z": 0.0},
            ],
            "n_vertex": 3,
        },
        {
            "vertex": [
                {"x": 4, "y": 9, "z": 0},
                {"x": 5, "y": 10, "z": 0},
                {"x": 6, "y": 11, "z": 0},
                {"x": 7, "y": 12, "z": 0},
            ],
            "normal": [
                {"x": 0.406, "y": 0.914, "z": 0.0},
                {"x": 0.447, "y": 0.894, "z": 0.0},
                {"x": 0.470, "y": 0.878, "z": 0.0},
                {"x": 0.504, "y": 0.864, "z": 0.0},
            ],
            "n_vertex": 4,
        },
    ]
)

[{vertex: [{x: 1, y: 6, z: 0}, ..., {...}], normal: [...], n_vertex: 3},
 {vertex: [{x: 4, y: 9, z: 0}, ..., {...}], normal: [...], n_vertex: 4}]
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
backend: cpu
nbytes: 400 B
type: 2 * {
    vertex: var * {
        x: int64,
        y: int64,
        z: int64
    },
    normal: var * {
        x: float64,
        y: float64,
        z: float64
    },
    n_vertex: int64
}

Naturally, we can access the "vertex" field with the . operator:

polygon.vertex

[[{x: 1, y: 6, z: 0}, {x: 2, y: 7, z: 0}, {x: 3, y: 8, z: 0}],
 [{x: 4, y: 9, z: 0}, {x: 5, y: 10, z: 0}, {...}, {x: 7, y: 12, z: 0}]]
-----------------------------------------------------------------------
backend: cpu
nbytes: 192 B
type: 2 * var * {
    x: int64,
    y: int64,
    z: int64
}

We can view the "x" field of the vertex array with an additional lookup:

polygon.vertex.x

[[1, 2, 3],
 [4, 5, 6, 7]]
---------------------
backend: cpu
nbytes: 80 B
type: 2 * var * int64

The . operator is shorthand for the simplest kind of slice, a single string, i.e.

polygon["vertex"]

[[{x: 1, y: 6, z: 0}, {x: 2, y: 7, z: 0}, {x: 3, y: 8, z: 0}],
 [{x: 4, y: 9, z: 0}, {x: 5, y: 10, z: 0}, {...}, {x: 7, y: 12, z: 0}]]
-----------------------------------------------------------------------
backend: cpu
nbytes: 192 B
type: 2 * var * {
    x: int64,
    y: int64,
    z: int64
}

The slice corresponding to the nested lookup .vertex.x is given by a tuple of str:

polygon[("vertex", "x")]

[[1, 2, 3],
 [4, 5, 6, 7]]
---------------------
backend: cpu
nbytes: 80 B
type: 2 * var * int64

It is even possible to combine multiple and single projections. Let’s project the "x" field of the "vertex" and "normal" fields:

polygon[["vertex", "normal"], "x"]

[{vertex: [1, 2, 3], normal: [0.164, ..., 0.351]},
 {vertex: [4, 5, 6, 7], normal: [0.406, ..., 0.504]}]
----------------------------------------------------------------
backend: cpu
nbytes: 160 B
type: 2 * {
    vertex: var * int64,
    normal: var * float64
}