How to restructure arrays with zip/unzip and project#
%config InteractiveShell.ast_node_interactivity = "last_expr_or_assign"
Unzipping an array of records#
As discussed in How to create arrays of records, in addition to primitive types like numpy.float64 and numpy.datetime64, Awkward Arrays can also contain records. These records are formed from a fixed number of optionally named fields.
import awkward as ak
import numpy as np
records = ak.Array(
[
{"x": 1, "y": 1.1, "z": "one"},
{"x": 2, "y": 2.2, "z": "two"},
{"x": 3, "y": 3.3, "z": "three"},
{"x": 4, "y": 4.4, "z": "four"},
{"x": 5, "y": 5.5, "z": "five"},
]
)
[{x: 1, y: 1.1, z: 'one'}, {x: 2, y: 2.2, z: 'two'}, {x: 3, y: 3.3, z: 'three'}, {x: 4, y: 4.4, z: 'four'}, {x: 5, y: 5.5, z: 'five'}] ---------------------------- type: 5 * { x: int64, y: float64, z: string }
Although it is useful to be able to create arrays from a sequence of records (as arrays of structures), Awkward Array implements arrays as structures of arrays. It is therefore more natural to think about arrays in terms of their fields.
In the above example, we have created an array of records from a list of dictionaries. We can see that the x field of records contains five numpy.int64 values:
records.x
[1, 2, 3, 4, 5] --------------- type: 5 * int64
If we wanted to look at each of the fields of records, we could pull them out individually from the array:
records.y
[1.1, 2.2, 3.3, 4.4, 5.5] ----------------- type: 5 * float64
records.z
['one', 'two', 'three', 'four', 'five'] ---------------- type: 5 * string
Clearly, for arrays with a large number of fields, retrieving each field in this manner would quickly become tedious. ak.unzip() can be used to directly build a tuple of the field arrays:
ak.unzip(records)
(<Array [1, 2, 3, 4, 5] type='5 * int64'>,
<Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>,
<Array ['one', 'two', 'three', 'four', 'five'] type='5 * string'>)
Records are not required to have field names. A record without field names is known as a “tuple”, e.g.
tuples = ak.Array(
[
(1, 1.1, "one"),
(2, 2.2, "two"),
(3, 3.3, "three"),
(4, 4.4, "four"),
(5, 5.5, "five"),
]
)
[(1, 1.1, 'one'), (2, 2.2, 'two'), (3, 3.3, 'three'), (4, 4.4, 'four'), (5, 5.5, 'five')] ------------------- type: 5 * ( int64, float64, string )
If we unzip an array of tuples, we obtain the same result as for records:
ak.unzip(tuples)
(<Array [1, 2, 3, 4, 5] type='5 * int64'>,
<Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>,
<Array ['one', 'two', 'three', 'four', 'five'] type='5 * string'>)
ak.unzip() can be combined with ak.fields() to build a mapping from field name to field array:
dict(zip(ak.fields(records), ak.unzip(records)))
{'x': <Array [1, 2, 3, 4, 5] type='5 * int64'>,
'y': <Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>,
'z': <Array ['one', 'two', 'three', 'four', 'five'] type='5 * string'>}
For tuples, the field names will be strings corresponding to the field index:
dict(zip(ak.fields(tuples), ak.unzip(tuples)))
{'0': <Array [1, 2, 3, 4, 5] type='5 * int64'>,
'1': <Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>,
'2': <Array ['one', 'two', 'three', 'four', 'five'] type='5 * string'>}
Zipping together arrays#
Because Awkward Arrays unzip into distinct arrays, it is reasonable to ask whether the reverse is possible, i.e. given the following arrays
age = ak.Array([18, 32, 87, 55])
name = ak.Array(["Dorit", "Caitlin", "Theodor", "Albano"]);
can we form an array of records? The ak.zip() function provides a way to join compatible arrays into a single array of records:
people = ak.zip({"age": age, "name": name})
[{age: 18, name: 'Dorit'}, {age: 32, name: 'Caitlin'}, {age: 87, name: 'Theodor'}, {age: 55, name: 'Albano'}] ---------------------------- type: 4 * { age: int64, name: string }
Similarly, we could also build an array of tuples by passing a sequence of arrays:
ak.zip([age, name])
[(18, 'Dorit'), (32, 'Caitlin'), (87, 'Theodor'), (55, 'Albano')] ----------------- type: 4 * ( int64, string )
Zipping and unzipping arrays is a lightweight operation, and so you should not hesitate to zip together arrays if it makes sense for the problem at hand. One of the benefits of combining arrays into an array of records is that slicing and masking operations are applied to all fields, e.g.
people[age > 35]
[{age: 87, name: 'Theodor'}, {age: 55, name: 'Albano'}] ---------------------------- type: 2 * { age: int64, name: string }
Arrays with different dimensions#
So far, we’ve looked at simple arrays with the same dimension in each field. It is actually possible to build arrays with fields of different dimensions, e.g.
x = ak.Array(
[
103,
450,
33,
4,
]
)
digits_of_x = ak.Array(
[
[1, 0, 3],
[4, 5, 0],
[3, 3],
[4],
]
)
x_and_digits = ak.zip({"x": x, "digits": digits_of_x})
[[{x: 103, digits: 1}, {x: 103, digits: 0}, {x: 103, digits: 3}], [{x: 450, digits: 4}, {x: 450, digits: 5}, {x: 450, digits: 0}], [{x: 33, digits: 3}, {x: 33, digits: 3}], [{x: 4, digits: 4}]] ----------------------------------------------------------------- type: 4 * var * { x: int64, digits: int64 }
The type of this array is
x_and_digits.type
ArrayType(ListType(RecordType([NumpyType('int64'), NumpyType('int64')], ['x', 'digits'])), 4)
Note that the x field has changed type:
x.type
ArrayType(NumpyType('int64'), 4)
x_and_digits.x.type
ArrayType(ListType(NumpyType('int64')), 4)
In zipping the two arrays together, x has been broadcast against digits_of_x. Sometimes you might want to limit the broadcasting to a particular depth (dimension). This can be done by passing the depth_limit parameter:
x_and_digits = ak.zip({"x": x, "digits": digits_of_x}, depth_limit=1)
[{x: 103, digits: [1, 0, 3]}, {x: 450, digits: [4, 5, 0]}, {x: 33, digits: [3, 3]}, {x: 4, digits: [4]}] ----------------------------- type: 4 * { x: int64, digits: var * int64 }
Now the x field has a single dimension:
x_and_digits.x.type
ArrayType(NumpyType('int64'), 4)
Arrays with different dimension lengths#
What happens if we zip together arrays with the same number of dimensions, but different lengths in each dimension?
x_and_y = ak.Array(
[
[103, 903],
[450, 83],
[33, 8],
[4, 109],
]
)
digits_of_x_and_y = ak.Array(
[
[1, 0, 3, 9, 0, 3],
[4, 5, 0, 8, 3],
[3, 3, 8],
[4, 1, 0, 9],
]
)
ak.zip({"x_and_y": x_and_y, "digits": digits_of_x_and_y})
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
ValueError: cannot broadcast nested list (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-15/awkward-cpp/src/cpu-kernels/awkward_ListArray_broadcast_tooffsets.cpp#L27)
This error occurred while calling
ak.zip(
arrays = {'x_and_y': <Array [[103, 903], [450, 83], [33, ...], [4, 10...
depth_limit = None
parameters = None
with_name = None
right_broadcast = False
optiontype_outside_record = False
highlevel = True
behavior = None
)
Zipping arrays which cannot be broadcast against each other raises a ValueError. In this case, we want to stop broadcasting at the first dimension (depth_limit=1):
ak.zip({"x_and_y": x_and_y, "digits": digits_of_x_and_y}, depth_limit=1)
[{x_and_y: [103, 903], digits: [1, 0, 3, ..., 0, 3]}, {x_and_y: [450, 83], digits: [4, 5, 0, 8, 3]}, {x_and_y: [33, 8], digits: [3, 3, 8]}, {x_and_y: [4, 109], digits: [4, 1, 0, 9]}] ----------------------------------------------------- type: 4 * { x_and_y: var * int64, digits: var * int64 }
Projecting arrays#
Sometimes we are interested only in a subset of the fields of an array. For example, imagine that we have an array of coordinates on the \(\hat{x}\hat{y}\) plane:
triangle = ak.Array(
[
{"x": 1, "y": 6, "z": 0},
{"x": 2, "y": 7, "z": 0},
{"x": 3, "y": 8, "z": 0},
]
)
[{x: 1, y: 6, z: 0}, {x: 2, y: 7, z: 0}, {x: 3, y: 8, z: 0}] -------------------- type: 3 * { x: int64, y: int64, z: int64 }
If we know that these points should lie on a plane, then we might wish to discard the \(\hat{z}\) coordinate. We can do this by slicing only the \(\hat{x}\) and \(\hat{y}\) fields:
triangle_2d = triangle[["x", "y"]]
[{x: 1, y: 6}, {x: 2, y: 7}, {x: 3, y: 8}] -------------- type: 3 * { x: int64, y: int64 }
Note that the key passed to the subscript operator is a list ["x", "y"], not a tuple. Awkward Array recognises the list to mean “take both the "x" and "y" fields”.
Projections can be combined with array slicing and masking, e.g.
triangle_2d_first_2 = triangle[:2, ["x", "y"]]
[{x: 1, y: 6}, {x: 2, y: 7}] -------------- type: 2 * { x: int64, y: int64 }
Let’s now consider an array of triangles, i.e. a polygon:
triangles = ak.Array(
[
[
{"x": 1, "y": 6, "z": 0},
{"x": 2, "y": 7, "z": 0},
{"x": 3, "y": 8, "z": 0},
],
[
{"x": 4, "y": 9, "z": 0},
{"x": 5, "y": 10, "z": 0},
{"x": 6, "y": 11, "z": 0},
],
]
)
[[{x: 1, y: 6, z: 0}, {x: 2, y: 7, z: 0}, {x: 3, y: 8, z: 0}], [{x: 4, y: 9, z: 0}, {x: 5, y: 10, z: 0}, {x: 6, y: 11, z: 0}]] ---------------------------------------------------------------- type: 2 * var * { x: int64, y: int64, z: int64 }
We can combine an int index 0 with a str projection to view the "x" coordinates of the first triangle's vertices:
triangles[0, "x"]
[1, 2, 3] --------------- type: 3 * int64
We could even ignore the first vertex of that triangle:
triangles[0, 1:, "x"]
[2, 3] --------------- type: 2 * int64
Projections commute (to the left) with other indices to produce the same result as their “natural” position. This means that the above projection could also be written as
triangles[0, "x", 1:]
[2, 3] --------------- type: 2 * int64
or even
triangles["x", 0, 1:]
[2, 3] --------------- type: 2 * int64
For columnar Awkward Arrays, there is no performance difference between any of these approaches; projecting the records of an array just changes its metadata, rather than invoking any loops over the data.
Projecting records-of-records#
The records of an array can themselves contain records:
polygon = ak.Array(
[
{
"vertex": [
{"x": 1, "y": 6, "z": 0},
{"x": 2, "y": 7, "z": 0},
{"x": 3, "y": 8, "z": 0},
],
"normal": [
{"x": 0.164, "y": 0.986, "z": 0.0},
{"x": 0.275, "y": 0.962, "z": 0.0},
{"x": 0.351, "y": 0.936, "z": 0.0},
],
"n_vertex": 3,
},
{
"vertex": [
{"x": 4, "y": 9, "z": 0},
{"x": 5, "y": 10, "z": 0},
{"x": 6, "y": 11, "z": 0},
{"x": 7, "y": 12, "z": 0},
],
"normal": [
{"x": 0.406, "y": 0.914, "z": 0.0},
{"x": 0.447, "y": 0.894, "z": 0.0},
{"x": 0.470, "y": 0.878, "z": 0.0},
{"x": 0.504, "y": 0.864, "z": 0.0},
],
"n_vertex": 4,
},
]
)
[{vertex: [{x: 1, y: 6, z: 0}, ..., {...}], normal: [...], n_vertex: 3}, {vertex: [{x: 4, y: 9, z: 0}, ..., {...}], normal: [...], n_vertex: 4}] ------------------------------------------------------------------------ type: 2 * { vertex: var * { x: int64, y: int64, z: int64 }, normal: var * { x: float64, y: float64, z: float64 }, n_vertex: int64 }
Naturally, we can access the "vertex" field with the . operator:
polygon.vertex
[[{x: 1, y: 6, z: 0}, {x: 2, y: 7, z: 0}, {x: 3, y: 8, z: 0}], [{x: 4, y: 9, z: 0}, {x: 5, y: 10, z: 0}, {...}, {x: 7, y: 12, z: 0}]] ----------------------------------------------------------------------- type: 2 * var * { x: int64, y: int64, z: int64 }
We can view the "x" field of the vertex array with an additional lookup:
polygon.vertex.x
[[1, 2, 3], [4, 5, 6, 7]] --------------------- type: 2 * var * int64
The . operator represents the simplest slice of a single string, i.e.
polygon["vertex"]
[[{x: 1, y: 6, z: 0}, {x: 2, y: 7, z: 0}, {x: 3, y: 8, z: 0}], [{x: 4, y: 9, z: 0}, {x: 5, y: 10, z: 0}, {...}, {x: 7, y: 12, z: 0}]] ----------------------------------------------------------------------- type: 2 * var * { x: int64, y: int64, z: int64 }
The slice corresponding to the nested lookup .vertex.x is given by a tuple of str:
polygon[("vertex", "x")]
[[1, 2, 3], [4, 5, 6, 7]] --------------------- type: 2 * var * int64
It is even possible to combine multiple and single projections. Let’s project the "x" field of the "vertex" and "normal" fields:
polygon[["vertex", "normal"], "x"]
[{vertex: [1, 2, 3], normal: [0.164, ..., 0.351]}, {vertex: [4, 5, 6, 7], normal: [0.406, ..., 0.504]}] ----------------------------------------------------- type: 2 * { vertex: var * int64, normal: var * float64 }