How to restructure arrays with zip/unzip and project#

%config InteractiveShell.ast_node_interactivity = "last_expr_or_assign"

Unzipping an array of records#

As discussed in How to create arrays of records, in addition to primitive types like numpy.float64 and numpy.datetime64, Awkward Arrays can also contain records. These records are formed from a fixed number of optionally named fields.

import awkward as ak
import numpy as np

records = ak.Array(
    [
        {"x": 1, "y": 1.1, "z": "one"},
        {"x": 2, "y": 2.2, "z": "two"},
        {"x": 3, "y": 3.3, "z": "three"},
        {"x": 4, "y": 4.4, "z": "four"},
        {"x": 5, "y": 5.5, "z": "five"},
    ]
)
[{x: 1, y: 1.1, z: 'one'},
 {x: 2, y: 2.2, z: 'two'},
 {x: 3, y: 3.3, z: 'three'},
 {x: 4, y: 4.4, z: 'four'},
 {x: 5, y: 5.5, z: 'five'}]
----------------------------
type: 5 * {
    x: int64,
    y: float64,
    z: string
}

Although it is useful to be able to create arrays from a sequence of records (as arrays of structures), Awkward Array implements arrays as structures of arrays. It is therefore more natural to think about arrays in terms of their fields. In the above example, we have created an array of records from a list of dictionaries. We can see that the x field of records contains five numpy.int64 values:

records.x
[1,
 2,
 3,
 4,
 5]
---------------
type: 5 * int64

If we wanted to look at each of the fields of records, we could pull them out individually from the array:

records.y
[1.1,
 2.2,
 3.3,
 4.4,
 5.5]
-----------------
type: 5 * float64
records.z
['one',
 'two',
 'three',
 'four',
 'five']
----------------
type: 5 * string

Clearly, for arrays with a large number of fields, retrieving each field in this manner would become tedious rather quickly. ak.unzip() can be used to directly build a tuple of the field arrays:

ak.unzip(records)
(<Array [1, 2, 3, 4, 5] type='5 * int64'>,
 <Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>,
 <Array ['one', 'two', 'three', 'four', 'five'] type='5 * string'>)

Records are not required to have field names. A record without field names is known as a “tuple”, e.g.

tuples = ak.Array(
    [
        (1, 1.1, "one"),
        (2, 2.2, "two"),
        (3, 3.3, "three"),
        (4, 4.4, "four"),
        (5, 5.5, "five"),
    ]
)
[(1, 1.1, 'one'),
 (2, 2.2, 'two'),
 (3, 3.3, 'three'),
 (4, 4.4, 'four'),
 (5, 5.5, 'five')]
-------------------
type: 5 * (
    int64,
    float64,
    string
)

If we unzip an array of tuples, we obtain the same result as for records:

ak.unzip(tuples)
(<Array [1, 2, 3, 4, 5] type='5 * int64'>,
 <Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>,
 <Array ['one', 'two', 'three', 'four', 'five'] type='5 * string'>)

ak.unzip() can be combined with ak.fields() to build a mapping from field name to field array:

dict(zip(ak.fields(records), ak.unzip(records)))
{'x': <Array [1, 2, 3, 4, 5] type='5 * int64'>,
 'y': <Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>,
 'z': <Array ['one', 'two', 'three', 'four', 'five'] type='5 * string'>}

For tuples, the field names will be strings corresponding to the field index:

dict(zip(ak.fields(tuples), ak.unzip(tuples)))
{'0': <Array [1, 2, 3, 4, 5] type='5 * int64'>,
 '1': <Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>,
 '2': <Array ['one', 'two', 'three', 'four', 'five'] type='5 * string'>}

Zipping together arrays#

Because Awkward Arrays unzip into distinct arrays, it is reasonable to ask whether the reverse is possible, i.e. given the following arrays

age = ak.Array([18, 32, 87, 55])
name = ak.Array(["Dorit", "Caitlin", "Theodor", "Albano"]);

can we form an array of records? The ak.zip() function provides a way to join compatible arrays into a single array of records:

people = ak.zip({"age": age, "name": name})
[{age: 18, name: 'Dorit'},
 {age: 32, name: 'Caitlin'},
 {age: 87, name: 'Theodor'},
 {age: 55, name: 'Albano'}]
----------------------------
type: 4 * {
    age: int64,
    name: string
}

Similarly, we could also build an array of tuples by passing a sequence of arrays:

ak.zip([age, name])
[(18, 'Dorit'),
 (32, 'Caitlin'),
 (87, 'Theodor'),
 (55, 'Albano')]
-----------------
type: 4 * (
    int64,
    string
)

Zipping and unzipping arrays are lightweight operations, so you should not hesitate to zip arrays together if it makes sense for the problem at hand. One of the benefits of combining arrays into an array of records is that slicing and masking operations are applied to all fields at once, e.g.

people[age > 35]
[{age: 87, name: 'Theodor'},
 {age: 55, name: 'Albano'}]
----------------------------
type: 2 * {
    age: int64,
    name: string
}

Arrays with different dimensions#

So far, we’ve looked at simple arrays with the same number of dimensions in each field. It is actually possible to build arrays from fields with different numbers of dimensions, e.g.

x = ak.Array(
    [
        103,
        450,
        33,
        4,
    ]
)

digits_of_x = ak.Array(
    [
        [1, 0, 3],
        [4, 5, 0],
        [3, 3],
        [4],
    ]
)
x_and_digits = ak.zip({"x": x, "digits": digits_of_x})
[[{x: 103, digits: 1}, {x: 103, digits: 0}, {x: 103, digits: 3}],
 [{x: 450, digits: 4}, {x: 450, digits: 5}, {x: 450, digits: 0}],
 [{x: 33, digits: 3}, {x: 33, digits: 3}],
 [{x: 4, digits: 4}]]
-----------------------------------------------------------------
type: 4 * var * {
    x: int64,
    digits: int64
}

The type of this array is

x_and_digits.type
ArrayType(ListType(RecordType([NumpyType('int64'), NumpyType('int64')], ['x', 'digits'])), 4)

Note that the x field has changed type:

x.type
ArrayType(NumpyType('int64'), 4)
x_and_digits.x.type
ArrayType(ListType(NumpyType('int64')), 4)

In zipping the two arrays together, the x array has been broadcast against digits_of_x, so each value of x is repeated once per digit. Sometimes you might want to limit the broadcasting to a particular depth (dimension). This can be done by passing the depth_limit parameter:

x_and_digits = ak.zip({"x": x, "digits": digits_of_x}, depth_limit=1)
[{x: 103, digits: [1, 0, 3]},
 {x: 450, digits: [4, 5, 0]},
 {x: 33, digits: [3, 3]},
 {x: 4, digits: [4]}]
-----------------------------
type: 4 * {
    x: int64,
    digits: var * int64
}

Now the x field has a single dimension:

x_and_digits.x.type
ArrayType(NumpyType('int64'), 4)

Arrays with different dimension lengths#

What happens if we zip together arrays with the same number of dimensions, but different lengths in each dimension?

x_and_y = ak.Array(
    [
        [103, 903],
        [450, 83],
        [33, 8],
        [4, 109],
    ]
)

digits_of_x_and_y = ak.Array(
    [
        [1, 0, 3, 9, 0, 3],
        [4, 5, 0, 8, 3],
        [3, 3, 8],
        [4, 1, 0, 9],
    ]
)

ak.zip({"x_and_y": x_and_y, "digits": digits_of_x_and_y})
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[21], line 19
      1 x_and_y = ak.Array(
      2     [
      3         [103, 903],
   (...)
      7     ]
      8 )
     10 digits_of_x_and_y = ak.Array(
     11     [
     12         [1, 0, 3, 9, 0, 3],
   (...)
     16     ]
     17 )
---> 19 ak.zip({"x_and_y": x_and_y, "digits": digits_of_x_and_y})

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/operations/ak_zip.py:144, in zip(arrays, depth_limit, parameters, with_name, right_broadcast, optiontype_outside_record, highlevel, behavior)
     19 """
     20 Args:
     21     arrays (dict or iterable of arrays): Each value in this dict or iterable
   (...)
    129     <Array [None, (2, 5), None] type='3 * ?(int64, int64)'>
    130 """
    131 with ak._errors.OperationErrorContext(
    132     "ak.zip",
    133     dict(
   (...)
    142     ),
    143 ):
--> 144     return _impl(
    145         arrays,
    146         depth_limit,
    147         parameters,
    148         with_name,
    149         right_broadcast,
    150         optiontype_outside_record,
    151         highlevel,
    152         behavior,
    153     )

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/operations/ak_zip.py:238, in _impl(arrays, depth_limit, parameters, with_name, right_broadcast, optiontype_outside_record, highlevel, behavior)
    235     else:
    236         return None
--> 238 out = ak._broadcasting.broadcast_and_apply(
    239     layouts, action, behavior, right_broadcast=right_broadcast
    240 )
    241 assert isinstance(out, tuple) and len(out) == 1
    242 out = out[0]

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/_broadcasting.py:1032, in broadcast_and_apply(inputs, action, behavior, depth_context, lateral_context, allow_records, left_broadcast, right_broadcast, numpy_to_regular, regular_to_jagged, function_name, broadcast_parameters_rule)
   1030 backend = ak._backends.backend_of(*inputs)
   1031 isscalar = []
-> 1032 out = apply_step(
   1033     backend,
   1034     broadcast_pack(inputs, isscalar),
   1035     action,
   1036     0,
   1037     depth_context,
   1038     lateral_context,
   1039     behavior,
   1040     {
   1041         "allow_records": allow_records,
   1042         "left_broadcast": left_broadcast,
   1043         "right_broadcast": right_broadcast,
   1044         "numpy_to_regular": numpy_to_regular,
   1045         "regular_to_jagged": regular_to_jagged,
   1046         "function_name": function_name,
   1047         "broadcast_parameters_rule": broadcast_parameters_rule,
   1048     },
   1049 )
   1050 assert isinstance(out, tuple)
   1051 return tuple(broadcast_unpack(x, isscalar, backend) for x in out)

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/_broadcasting.py:1011, in apply_step(backend, inputs, action, depth, depth_context, lateral_context, behavior, options)
   1009     return result
   1010 elif result is None:
-> 1011     return continuation()
   1012 else:
   1013     raise ak._errors.wrap_error(AssertionError(result))

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/_broadcasting.py:730, in apply_step.<locals>.continuation()
    727             assert length == x.length
    728 assert length is not None
--> 730 outcontent = apply_step(
    731     backend,
    732     nextinputs,
    733     action,
    734     depth + 1,
    735     copy.copy(depth_context),
    736     lateral_context,
    737     behavior,
    738     options,
    739 )
    740 assert isinstance(outcontent, tuple)
    741 parameters = parameters_factory(len(outcontent))

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/_broadcasting.py:1011, in apply_step(backend, inputs, action, depth, depth_context, lateral_context, behavior, options)
   1009     return result
   1010 elif result is None:
-> 1011     return continuation()
   1012 else:
   1013     raise ak._errors.wrap_error(AssertionError(result))

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/_broadcasting.py:893, in apply_step.<locals>.continuation()
    891     nextinputs.append(fcn(x, offsets))
    892 elif isinstance(x, listtypes):
--> 893     nextinputs.append(x._broadcast_tooffsets64(offsets).content)
    895 # Handle implicit left-broadcasting (non-NumPy-like broadcasting).
    896 elif options["left_broadcast"] and isinstance(x, Content):

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/contents/listoffsetarray.py:321, in ListOffsetArray._broadcast_tooffsets64(self, offsets)
    314 nextcarry = ak.index.Index64.empty(offsets[-1], self._backend.index_nplike)
    315 assert (
    316     nextcarry.nplike is self._backend.index_nplike
    317     and offsets.nplike is self._backend.index_nplike
    318     and starts.nplike is self._backend.index_nplike
    319     and stops.nplike is self._backend.index_nplike
    320 )
--> 321 self._handle_error(
    322     self._backend[
    323         "awkward_ListArray_broadcast_tooffsets",
    324         nextcarry.dtype.type,
    325         offsets.dtype.type,
    326         starts.dtype.type,
    327         stops.dtype.type,
    328     ](
    329         nextcarry.data,
    330         offsets.data,
    331         offsets.length,
    332         starts.data,
    333         stops.data,
    334         self._content.length,
    335     )
    336 )
    338 nextcontent = self._content._carry(nextcarry, True)
    340 return ListOffsetArray(offsets, nextcontent, parameters=self._parameters)

File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/contents/content.py:231, in Content._handle_error(self, error, slicer)
    228 message += filename
    230 if slicer is None:
--> 231     raise ak._errors.wrap_error(ValueError(message))
    232 else:
    233     raise ak._errors.index_error(self, slicer, message)

ValueError: while calling

    ak.zip(
        arrays = {'x_and_y': <Array [[103, 903], [450, 83], [33, ...], [4, 10...
        depth_limit = None
        parameters = None
        with_name = None
        right_broadcast = False
        optiontype_outside_record = False
        highlevel = True
        behavior = None
    )

Error details: cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward-1.0/blob//src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)

Zipping arrays that cannot be broadcast against each other raises a ValueError. In this case, we want to stop broadcasting at the first dimension (depth_limit=1):

ak.zip({"x_and_y": x_and_y, "digits": digits_of_x_and_y}, depth_limit=1)
[{x_and_y: [103, 903], digits: [1, 0, 3, ..., 0, 3]},
 {x_and_y: [450, 83], digits: [4, 5, 0, 8, 3]},
 {x_and_y: [33, 8], digits: [3, 3, 8]},
 {x_and_y: [4, 109], digits: [4, 1, 0, 9]}]
-----------------------------------------------------
type: 4 * {
    x_and_y: var * int64,
    digits: var * int64
}

Projecting arrays#

Sometimes we are interested only in a subset of the fields of an array. For example, imagine that we have an array of coordinates on the \(\hat{x}\hat{y}\) plane:

triangle = ak.Array(
    [
        {"x": 1, "y": 6, "z": 0},
        {"x": 2, "y": 7, "z": 0},
        {"x": 3, "y": 8, "z": 0},
    ]
)
[{x: 1, y: 6, z: 0},
 {x: 2, y: 7, z: 0},
 {x: 3, y: 8, z: 0}]
--------------------
type: 3 * {
    x: int64,
    y: int64,
    z: int64
}

If we know that these points should lie on a plane, then we might wish to discard the \(\hat{z}\) coordinate. We can do this by slicing only the \(\hat{x}\) and \(\hat{y}\) fields:

triangle_2d = triangle[["x", "y"]]
[{x: 1, y: 6},
 {x: 2, y: 7},
 {x: 3, y: 8}]
--------------
type: 3 * {
    x: int64,
    y: int64
}

Note that the key passed to the subscript operator is a list ["x", "y"], not a tuple. Awkward Array recognises the list to mean “take both the "x" and "y" fields”.

Projections can be combined with array slicing and masking, e.g.

triangle_2d_first_2 = triangle[:2, ["x", "y"]]
[{x: 1, y: 6},
 {x: 2, y: 7}]
--------------
type: 2 * {
    x: int64,
    y: int64
}

Let’s now consider an array of triangles, i.e. a polygon:

triangles = ak.Array(
    [
        [
            {"x": 1, "y": 6, "z": 0},
            {"x": 2, "y": 7, "z": 0},
            {"x": 3, "y": 8, "z": 0},
        ],
        [
            {"x": 4, "y": 9, "z": 0},
            {"x": 5, "y": 10, "z": 0},
            {"x": 6, "y": 11, "z": 0},
        ],
    ]
)
[[{x: 1, y: 6, z: 0}, {x: 2, y: 7, z: 0}, {x: 3, y: 8, z: 0}],
 [{x: 4, y: 9, z: 0}, {x: 5, y: 10, z: 0}, {x: 6, y: 11, z: 0}]]
----------------------------------------------------------------
type: 2 * var * {
    x: int64,
    y: int64,
    z: int64
}

We can combine an int index 0 with a str projection to view the "x" coordinates of the vertices of the first triangle:

triangles[0, "x"]
[1,
 2,
 3]
---------------
type: 3 * int64

We could even ignore the first vertex of that triangle:

triangles[0, 1:, "x"]
[2,
 3]
---------------
type: 2 * int64

Projections commute (to the left) with other indices: moving a field name to the left of an integer or slice index produces the same result as leaving it in its “natural” position. This means that the above projection could also be written as

triangles[0, "x", 1:]
[2,
 3]
---------------
type: 2 * int64

or even

triangles["x", 0, 1:]
[2,
 3]
---------------
type: 2 * int64

For columnar Awkward Arrays, there is no performance difference between any of these approaches; projecting the records of an array just changes its metadata, rather than invoking any loops over the data.

Projecting records-of-records#

The records of an array can themselves contain records:

polygon = ak.Array(
    [
        {
            "vertex": [
                {"x": 1, "y": 6, "z": 0},
                {"x": 2, "y": 7, "z": 0},
                {"x": 3, "y": 8, "z": 0},
            ],
            "normal": [
                {"x": 0.164, "y": 0.986, "z": 0.0},
                {"x": 0.275, "y": 0.962, "z": 0.0},
                {"x": 0.351, "y": 0.936, "z": 0.0},
            ],
            "n_vertex": 3,
        },
        {
            "vertex": [
                {"x": 4, "y": 9, "z": 0},
                {"x": 5, "y": 10, "z": 0},
                {"x": 6, "y": 11, "z": 0},
                {"x": 7, "y": 12, "z": 0},
            ],
            "normal": [
                {"x": 0.406, "y": 0.914, "z": 0.0},
                {"x": 0.447, "y": 0.894, "z": 0.0},
                {"x": 0.470, "y": 0.878, "z": 0.0},
                {"x": 0.504, "y": 0.864, "z": 0.0},
            ],
            "n_vertex": 4,
        },
    ]
)
[{vertex: [{x: 1, y: 6, z: 0}, ..., {...}], normal: [...], n_vertex: 3},
 {vertex: [{x: 4, y: 9, z: 0}, ..., {...}], normal: [...], n_vertex: 4}]
------------------------------------------------------------------------
type: 2 * {
    vertex: var * {
        x: int64,
        y: int64,
        z: int64
    },
    normal: var * {
        x: float64,
        y: float64,
        z: float64
    },
    n_vertex: int64
}

Naturally we can access the "vertex" field with the . operator:

polygon.vertex
[[{x: 1, y: 6, z: 0}, {x: 2, y: 7, z: 0}, {x: 3, y: 8, z: 0}],
 [{x: 4, y: 9, z: 0}, {x: 5, y: 10, z: 0}, {...}, {x: 7, y: 12, z: 0}]]
-----------------------------------------------------------------------
type: 2 * var * {
    x: int64,
    y: int64,
    z: int64
}

We can view the "x" field of the vertex array with an additional lookup:

polygon.vertex.x
[[1, 2, 3],
 [4, 5, 6, 7]]
---------------------
type: 2 * var * int64

The . operator represents the simplest slice of a single string, i.e.

polygon["vertex"]
[[{x: 1, y: 6, z: 0}, {x: 2, y: 7, z: 0}, {x: 3, y: 8, z: 0}],
 [{x: 4, y: 9, z: 0}, {x: 5, y: 10, z: 0}, {...}, {x: 7, y: 12, z: 0}]]
-----------------------------------------------------------------------
type: 2 * var * {
    x: int64,
    y: int64,
    z: int64
}

The slice corresponding to the nested lookup .vertex.x is given by a tuple of str:

polygon[("vertex", "x")]
[[1, 2, 3],
 [4, 5, 6, 7]]
---------------------
type: 2 * var * int64

It is even possible to combine multiple and single projections. Let’s project the "x" field of the "vertex" and "normal" fields:

polygon[["vertex", "normal"], "x"]
[{vertex: [1, 2, 3], normal: [0.164, ..., 0.351]},
 {vertex: [4, 5, 6, 7], normal: [0.406, ..., 0.504]}]
-----------------------------------------------------
type: 2 * {
    vertex: var * int64,
    normal: var * float64
}