How to concatenate and interleave arrays
import awkward as ak
import numpy as np
import pandas as pd
Simple concatenation
ak.concatenate() is an analog of np.concatenate (in fact, you can call np.concatenate on Awkward Arrays where you mean ak.concatenate()). However, it applies to data with arbitrary data structures:
array1 = ak.Array([
    [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],
    [],
    [{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}],
])
array2 = ak.Array([
    [{"x": 6.6, "y": [1, 2, 3, 4, 5, 6]}],
    [{"x": 7.7, "y": [1, 2, 3, 4, 5, 6, 7]}],
])
ak.concatenate([array1, array2])
[[{x: 1.1, y: [1]}, {x: 2.2, y: [...]}, {x: 3.3, y: [1, 2, 3]}],
 [],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}],
 [{x: 6.6, y: [1, 2, 3, 4, 5, 6]}],
 [{x: 7.7, y: [1, 2, 3, 4, 5, 6, 7]}]]
----------------------------------------------------------------
backend: cpu
nbytes: 392 B
type: 5 * var * { x: float64, y: var * int64 }
The arrays can even have different data types, in which case the output has a union type.
array3 = ak.Array([{"z": None}, {"z": 0}, {"z": 123}])
ak.concatenate([array1, array2, array3])
[[{x: 1.1, y: [1]}, {x: 2.2, y: [...]}, {x: 3.3, y: [1, 2, 3]}],
 [],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}],
 [{x: 6.6, y: [1, 2, 3, 4, 5, 6]}],
 [{x: 7.7, y: [1, 2, 3, 4, 5, 6, 7]}],
 {z: None},
 {z: 0},
 {z: 123}]
--------------------------------------------------------------------------------------------------------------
backend: cpu
nbytes: 504 B
type: 8 * union[ var * { x: float64, y: var * int64 }, { z: ?int64 } ]
Keep in mind, however, that some operations can’t handle union types (heterogeneous data), so you may want to avoid producing them.
Interleaving lists with axis > 0
The default axis=0 returns an array whose length is equal to the sum of the lengths of the input arrays.
Other axis values combine lists within the arrays, as long as the arrays have the same lengths.
array1 = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
array2 = ak.Array([[10, 20], [30], [40, 50, 60, 70]])
len(array1), len(array2)
(3, 3)
ak.concatenate([array1, array2], axis=1)
[[1.1, 2.2, 3.3, 10, 20],
 [30],
 [4.4, 5.5, 40, 50, 60, 70]]
----------------------------
backend: cpu
nbytes: 128 B
type: 3 * var * float64
This can be used in some non-trivial ways: sometimes a problem that doesn’t seem to have anything to do with concatenation can be solved this way.
For instance, suppose that you have to pad some lists so that they start and stop with 0 (for some window-averaging procedure, perhaps). You can make the pad as a new array:
pad = np.zeros(len(array1))[:, np.newaxis]
pad
array([[0.],
       [0.],
       [0.]])
and concatenate it with axis=1 to get the desired effect:
ak.concatenate([pad, array1, pad], axis=1)
[[0, 1.1, 2.2, 3.3, 0],
 [0, 0],
 [0, 4.4, 5.5, 0]]
-----------------------
backend: cpu
nbytes: 120 B
type: 3 * var * float64
Or similarly, to duplicate the first and last values of each list (without affecting empty lists):
ak.concatenate([array1[:, :1], array1, array1[:, -1:]], axis=1)
[[1.1, 1.1, 2.2, 3.3, 3.3],
 [],
 [4.4, 4.4, 5.5, 5.5]]
---------------------------
backend: cpu
nbytes: 104 B
type: 3 * var * float64
The same applies for more deeply nested lists and axis > 1. Remember that axis=-1 starts counting from the innermost dimension, outward.
Emulating NumPy’s “stack” functions
np.stack, np.hstack, np.vstack, and np.dstack can all be expressed as concatenations, combined with np.newaxis (reshaping to add a dimension of length 1) where a new axis is needed.
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.stack([a, b])
array([[1, 2, 3],
       [4, 5, 6]])
np.concatenate([a[np.newaxis], b[np.newaxis]], axis=0)
array([[1, 2, 3],
       [4, 5, 6]])
np.stack([a, b], axis=1)
array([[1, 4],
       [2, 5],
       [3, 6]])
np.concatenate([a[:, np.newaxis], b[:, np.newaxis]], axis=1)
array([[1, 4],
       [2, 5],
       [3, 6]])
Since ak.concatenate() has the same interface as np.concatenate and Awkward Arrays can also be sliced with np.newaxis, they can be stacked the same way, with the addition of arbitrary data structures.
a = ak.Array([[1], [1, 2], [1, 2, 3]])
b = ak.Array([[4], [4, 5], [4, 5, 6]])
ak.concatenate([a[np.newaxis], b[np.newaxis]], axis=0)
[[[1], [1, 2], [1, 2, 3]],
 [[4], [4, 5], [4, 5, 6]]]
--------------------------
backend: cpu
nbytes: 152 B
type: 2 * 3 * var * int64
ak.concatenate([a[:, np.newaxis], b[:, np.newaxis]], axis=1)
[[[1], [4]],
 [[1, 2], [4, 5]],
 [[1, 2, 3], [4, 5, 6]]]
-------------------------
backend: cpu
nbytes: 192 B
type: 3 * 2 * var * int64
Differences from Pandas
Concatenation in Awkward Array combines arrays lengthwise: by adding the lengths of the arrays or adding the lengths of lists within an array. It does not refer to adding fields to a record (that is, “adding columns to a table”). To add fields to a record, see ak.zip() or ak.Array.__setitem__() in how to zip/unzip and project and how to add fields. This is important to note because pandas.concat does both, depending on its axis argument (and there’s no equivalent in NumPy).
Here’s a table-like example of concatenation in Awkward Array:
array1 = ak.Array({"column": [[1, 2, 3], [], [4, 5]]})
array2 = ak.Array({"column": [[1.1, 2.2, 3.3], [], [4.4, 5.5]]})
array1
[{column: [1, 2, 3]},
 {column: []},
 {column: [4, 5]}]
-------------------------------------
backend: cpu
nbytes: 72 B
type: 3 * { column: var * int64 }
array2
[{column: [1.1, 2.2, 3.3]},
 {column: []},
 {column: [4.4, 5.5]}]
---------------------------------------
backend: cpu
nbytes: 72 B
type: 3 * { column: var * float64 }
ak.concatenate([array1, array2], axis=0)
[{column: [1, 2, 3]},
 {column: []},
 {column: [4, 5]},
 {column: [1.1, 2.2, 3.3]},
 {column: []},
 {column: [4.4, 5.5]}]
---------------------------------------
backend: cpu
nbytes: 136 B
type: 6 * { column: var * float64 }
This is like Pandas for axis=0:
df1 = pd.DataFrame({"column": [[1, 2, 3], [], [4, 5]]})
df2 = pd.DataFrame({"column": [[1.1, 2.2, 3.3], [], [4.4, 5.5]]})
df1
| | column |
|---|---|
| 0 | [1, 2, 3] |
| 1 | [] |
| 2 | [4, 5] |
df2
| | column |
|---|---|
| 0 | [1.1, 2.2, 3.3] |
| 1 | [] |
| 2 | [4.4, 5.5] |
pd.concat([df1, df2], axis=0)
| | column |
|---|---|
| 0 | [1, 2, 3] |
| 1 | [] |
| 2 | [4, 5] |
| 0 | [1.1, 2.2, 3.3] |
| 1 | [] |
| 2 | [4.4, 5.5] |
But for axis=1, they’re quite different:
ak.concatenate([array1, array2], axis=1)
[{column: [1, 2, 3, 1.1, 2.2, 3.3]},
 {column: []},
 {column: [4, 5, 4.4, 5.5]}]
---------------------------------------
backend: cpu
nbytes: 112 B
type: 3 * { column: var * float64 }
pd.concat([df1, df2], axis=1)
| | column | column |
|---|---|---|
| 0 | [1, 2, 3] | [1.1, 2.2, 3.3] |
| 1 | [] | [] |
| 2 | [4, 5] | [4.4, 5.5] |
ak.concatenate() accepts any axis less than the number of dimensions in the arrays, but Pandas has only two choices, axis=0 and axis=1.
Fields (“columns”) of an Awkward Array are unrelated to array dimensions. If you want what pandas.concat does with axis=1, you would use ak.zip():
ak.zip({"column1": array1.column, "column2": array2.column}, depth_limit=1)
[{column1: [1, 2, 3], column2: [1.1, 2.2, 3.3]},
 {column1: [], column2: []},
 {column1: [4, 5], column2: [4.4, 5.5]}]
------------------------------------------------------------------
backend: cpu
nbytes: 144 B
type: 3 * { column1: var * int64, column2: var * float64 }
The depth_limit prevents ak.zip() from interleaving the lists further:
ak.zip({"column1": array1.column, "column2": array2.column})
[[{column1: 1, column2: 1.1}, {...}, {column1: 3, column2: 3.3}],
 [],
 [{column1: 4, column2: 4.4}, {column1: 5, column2: 5.5}]]
-----------------------------------------------------------------
backend: cpu
nbytes: 112 B
type: 3 * var * { column1: int64, column2: float64 }
Pandas doesn’t interleave lists like this, because lists in Pandas cells are Python objects that it doesn’t modify.