# Generic buffers

Most of the conversion functions target a particular library: NumPy, Arrow, Pandas, or Python itself. As a catch-all for other storage formats, Awkward Arrays can be converted to and from sets of named buffers. The buffers are not (usually) intelligible on their own; the length of the array and a JSON document are needed to reconstitute the original structure. This section will demonstrate how an array-set can be used to store an Awkward Array in an HDF5 file, which ordinarily wouldn’t be able to represent nested, irregular data structures.

import awkward as ak
import numpy as np
import h5py
import json


## From Awkward to buffers

Consider the following complex array:

ak_array = ak.Array([
    [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],
    [],
    [{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}],
])
ak_array

<Array [[{x: 1.1, y: [1]}, ... 2, 3, 4, 5]}]] type='3 * var * {"x": float64, "y"...'>


The ak.to_buffers function decomposes it into a set of one-dimensional arrays (a zero-copy operation).

form, length, container = ak.to_buffers(ak_array)


The pieces needed to reconstitute this array are:

• the Form, which defines how structure is built from one-dimensional arrays,

• the length of the original array or lengths of all of its partitions (ak.partitions),

• the one-dimensional arrays in the container (a MutableMapping).

The Form is like an Awkward Type in that it describes how the data are structured, but with more detail: it includes distinctions such as the difference between ListArray and ListOffsetArray, as well as the integer types of structural Indexes.

It is usually presented as pretty-printed JSON, as below; calling Form.tojson produces a compact, single-line JSON string.

form

{
"class": "ListOffsetArray64",
"offsets": "i64",
"content": {
"class": "RecordArray",
"contents": {
"x": {
"class": "NumpyArray",
"itemsize": 8,
"format": "d",
"primitive": "float64",
"form_key": "node2"
},
"y": {
"class": "ListOffsetArray64",
"offsets": "i64",
"content": {
"class": "NumpyArray",
"itemsize": 8,
"format": "l",
"primitive": "int64",
"form_key": "node4"
},
"form_key": "node3"
}
},
"form_key": "node1"
},
"form_key": "node0"
}


In this case, the length is just an integer. It would be a list of integers if ak_array were partitioned.

length

3


This container is a new dict, but it could have been a user-specified MutableMapping if passed into ak.to_buffers as an argument.

container

{'part0-node0-offsets': array([0, 3, 3, 5], dtype=int64),
'part0-node2-data': array([1.1, 2.2, 3.3, 4.4, 5.5]),
'part0-node3-offsets': array([ 0,  1,  3,  6, 10, 15], dtype=int64),
'part0-node4-data': array([1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5])}
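
To make the role of the offsets buffers concrete, here is a minimal pure-Python sketch that rebuilds the nested lists from the four buffers shown above. It uses plain lists in place of NumPy arrays, and the helper `split_by_offsets` is a hypothetical name introduced for illustration; it is not part of Awkward Array and is not how ak.from_buffers is implemented.

```python
# Plain-list stand-ins for the four buffers shown above (assumption: lists, not NumPy arrays).
container = {
    "part0-node0-offsets": [0, 3, 3, 5],
    "part0-node2-data": [1.1, 2.2, 3.3, 4.4, 5.5],
    "part0-node3-offsets": [0, 1, 3, 6, 10, 15],
    "part0-node4-data": [1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5],
}


def split_by_offsets(offsets, content):
    """Hypothetical helper: list i spans content[offsets[i]:offsets[i + 1]]."""
    return [content[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


# Inner lists: the "y" field of each of the five records.
y = split_by_offsets(container["part0-node3-offsets"], container["part0-node4-data"])
x = container["part0-node2-data"]

# Pair the two fields into records, then cut the records into the outer lists.
records = [{"x": xi, "y": yi} for xi, yi in zip(x, y)]
rebuilt = split_by_offsets(container["part0-node0-offsets"], records)
print(rebuilt)
```

The nesting of the calls mirrors the nesting of the Form: node3 and node4 build the inner "y" lists, node2 supplies "x", and node0 cuts the resulting records into the outer lists.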


## From buffers to Awkward

The function that reverses ak.to_buffers is ak.from_buffers. Its first three arguments are form, length, and container.

ak.from_buffers(form, length, container)

<Array [[{x: 1.1, y: [1]}, ... 2, 3, 4, 5]}]] type='3 * var * {"x": float64, "y"...'>


## Minimizing the size of the output buffers

The ak.to_buffers/ak.from_buffers functions exactly preserve an array, warts and all. Often, you’ll want to write only packed arrays (see ak.packed). “Packing” replaces an array structure with an equivalent structure that has no unreachable elements: data that you can’t see as part of the array, and therefore probably don’t want to write.

Here is an example of an array in need of packing:

unpacked = ak.Array(
    ak.layout.ListArray64(
        ak.layout.Index64(np.array([4, 10, 1])),
        ak.layout.Index64(np.array([7, 10, 3])),
        ak.layout.NumpyArray(
            np.array([999, 4.4, 5.5, 999, 1.1, 2.2, 3.3, 999])
        )
    )
)
unpacked

<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>


This ListArray is in a strange order and the 999 values are unreachable. (Also, using starts[1] == stops[1] == 10 to represent an empty list is a little odd, though allowed by the specification.)
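
As a rough illustration of why the 999 values are unreachable, the following pure-Python sketch interprets the starts and stops by hand and lists the content positions that no list refers to. It uses plain lists rather than Awkward or NumPy objects, so it is only a model of the ListArray semantics described above.

```python
starts = [4, 10, 1]
stops = [7, 10, 3]
content = [999, 4.4, 5.5, 999, 1.1, 2.2, 3.3, 999]

# List i is content[starts[i]:stops[i]]; the slice 10:10 is empty even though
# position 10 lies beyond the end of the content.
lists = [content[a:b] for a, b in zip(starts, stops)]

# Collect every content position that some list covers.
reachable = set()
for a, b in zip(starts, stops):
    reachable.update(range(a, b))

# Whatever is left over can never be seen through the array.
unreachable = [i for i in range(len(content)) if i not in reachable]

print(lists)
print(unreachable, [content[i] for i in unreachable])
```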

The ak.to_buffers function dutifully writes the 999 values into the output, even though they’re not visible in the array.

ak.to_buffers(unpacked)

({
"class": "ListArray64",
"starts": "i64",
"stops": "i64",
"content": {
"class": "NumpyArray",
"itemsize": 8,
"format": "d",
"primitive": "float64",
"form_key": "node1"
},
"form_key": "node0"
},
3,
{'part0-node0-starts': array([ 4, 10,  1], dtype=int64),
'part0-node0-stops': array([ 7, 10,  3], dtype=int64),
'part0-node1-data': array([999. ,   4.4,   5.5, 999. ,   1.1,   2.2,   3.3, 999. ])})


If the intended purpose of calling ak.to_buffers is to write to a file or send data over a network, this is wasted space. It can be trimmed by calling the ak.packed function.

packed = ak.packed(unpacked)
packed

<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>


At a high level, the array appears to be the same, but its low-level structure is quite different:

unpacked.layout

<ListArray64>
<starts><Index64 i="[4 10 1]" offset="0" length="3" at="0x000002419ce0"/></starts>
<stops><Index64 i="[7 10 3]" offset="0" length="3" at="0x000002425ee0"/></stops>
<content><NumpyArray format="d" shape="8" data="999 4.4 5.5 999 1.1 2.2 3.3 999" at="0x0000023fc7e0"/></content>
</ListArray64>

packed.layout

<ListOffsetArray64>
<offsets><Index64 i="[0 3 3 5]" offset="0" length="4" at="0x000002426490"/></offsets>
<content><NumpyArray format="d" shape="5" data="1.1 2.2 3.3 4.4 5.5" at="0x00000241dd40"/></content>
</ListOffsetArray64>
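
The transformation from the first layout to the second can be sketched in a few lines of pure Python. This is an illustrative model under the assumption of a single list level over flat data; the function name `pack` is hypothetical and this is not the implementation of ak.packed.

```python
def pack(starts, stops, content):
    """Copy only the reachable elements, in list order, and record offsets."""
    offsets = [0]
    packed_content = []
    for a, b in zip(starts, stops):
        packed_content.extend(content[a:b])   # keep only what list i can see
        offsets.append(len(packed_content))   # end of list i == start of list i + 1
    return offsets, packed_content


offsets, packed_content = pack(
    [4, 10, 1],
    [7, 10, 3],
    [999, 4.4, 5.5, 999, 1.1, 2.2, 3.3, 999],
)
print(offsets)
print(packed_content)
```

Because consecutive lists share a boundary after packing, one offsets buffer of length 4 replaces the two starts/stops buffers of length 3, and the content shrinks from 8 to 5 values.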


This version of the array is more concise when written with ak.to_buffers:

ak.to_buffers(packed)

({
"class": "ListOffsetArray64",
"offsets": "i64",
"content": {
"class": "NumpyArray",
"itemsize": 8,
"format": "d",
"primitive": "float64",
"form_key": "node1"
},
"form_key": "node0"
},
3,
{'part0-node0-offsets': array([0, 3, 3, 5], dtype=int64),
'part0-node1-data': array([1.1, 2.2, 3.3, 4.4, 5.5])})


## Saving Awkward Arrays to HDF5

The h5py library presents each group in an HDF5 file as a MutableMapping, which we can use as a container for an array-set. We must also save the form and length as metadata for the array to be retrievable.

file = h5py.File("/tmp/example.hdf5", "w")
group = file.create_group("awkward")
group

<HDF5 group "/awkward" (0 members)>


We can fill this group as a container by passing it to ak.to_buffers as the container argument. (See the previous section for more on ak.packed.)

form, length, container = ak.to_buffers(ak.packed(ak_array), container=group)

container

<HDF5 group "/awkward" (4 members)>


Now the HDF5 group has been filled with array pieces.

container.keys()

<KeysViewHDF5 ['part0-node0-offsets', 'part0-node2-data', 'part0-node3-offsets', 'part0-node4-data']>


Here’s one.

np.asarray(container["part0-node0-offsets"])

array([0, 3, 3, 5])


Now we need to add the other information to the group as metadata. Since HDF5 accepts string-valued metadata, we can put it all in as JSON or numbers.

group.attrs["form"] = form.tojson()
group.attrs["form"]

'{"class":"ListOffsetArray64","offsets":"i64","content":{"class":"RecordArray","contents":{"x":{"class":"NumpyArray","inner_shape":[],"itemsize":8,"format":"d","primitive":"float64","has_identities":false,"parameters":{},"form_key":"node2"},"y":{"class":"ListOffsetArray64","offsets":"i64","content":{"class":"NumpyArray","inner_shape":[],"itemsize":8,"format":"l","primitive":"int64","has_identities":false,"parameters":{},"form_key":"node4"},"has_identities":false,"parameters":{},"form_key":"node3"}},"has_identities":false,"parameters":{},"form_key":"node1"},"has_identities":false,"parameters":{},"form_key":"node0"}'

group.attrs["length"] = json.dumps(length)   # JSON-encode it because it might be a list
group.attrs["length"]

'3'
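
The reason for JSON-encoding the length can be shown with the standard library alone. In this small sketch, the helper names `encode_length` and `decode_length` and the partition lengths `[2, 1]` are hypothetical, chosen only to show that a single integer and a list of integers survive the same round trip through a string.

```python
import json


def encode_length(length):
    """Turn an int (one partition) or a list of ints (several) into a string."""
    return json.dumps(length)


def decode_length(text):
    """Recover the int or list of ints from the stored string."""
    return json.loads(text)


single = encode_length(3)            # unpartitioned array, as in this example
partitioned = encode_length([2, 1])  # hypothetical two-partition array

print(repr(single), decode_length(single))
print(repr(partitioned), decode_length(partitioned))
```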


## Reading Awkward Arrays from HDF5

With that, we can reconstitute the array by supplying ak.from_buffers the right arguments from the group and metadata.

The group can’t be used as a container as-is, since subscripting it returns h5py.Dataset objects, rather than arrays.

reconstituted = ak.from_buffers(
    ak.forms.Form.fromjson(group.attrs["form"]),
    json.loads(group.attrs["length"]),   # the length was stored as a JSON string
    {k: np.asarray(v) for k, v in group.items()},
)
reconstituted

<Array [[{x: 1.1, y: [1]}, ... 2, 3, 4, 5]}]] type='3 * var * {"x": float64, "y"...'>


Like ak.from_parquet, ak.from_buffers has the option to read lazily, loading only the record fields and partitions that are actually accessed.

class LazyGet:
    def __init__(self, group):
        self.group = group

    def __getitem__(self, key):
        print(key)
        return np.asarray(self.group[key])

lazy = ak.from_buffers(
    ak.forms.Form.fromjson(group.attrs["form"]),
    json.loads(group.attrs["length"]),   # the length was stored as a JSON string
    LazyGet(group),
    lazy=True,
)


The LazyGet class prints out any keys that actually get read from the HDF5 file, when they get read. Nothing has been printed yet.

lazy.x

part0-node0-offsets
part0-node2-data

<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>


Now that we have looked at the "x" field, "part0-node0-offsets" (the outer list structure) and "part0-node2-data" (the "x" values) have been read. Looking at it again prints nothing:

lazy.x

<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>


lazy.y

part0-node3-offsets
part0-node4-data

<Array [[[1], [1, 2], ... [1, 2, 3, 4, 5]]] type='3 * var * var * int64'>


Looking at the "y" field causes "part0-node3-offsets" (the inner list structure) and "part0-node4-data" (the "y" values) to be read.

lazy.y

<Array [[[1], [1, 2], ... [1, 2, 3, 4, 5]]] type='3 * var * var * int64'>


Each buffer is read only once: accessing the same field a second time prints nothing.
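
The read-once behavior can be modeled with a small caching wrapper around any mapping. This is a sketch under stated assumptions: `RecordingCache` is a hypothetical class, a plain dict of lists stands in for the HDF5 group, and the caching shown here only imitates the observable effect; it is not a description of how Awkward Array caches lazy buffers internally.

```python
class RecordingCache:
    """Wrap a mapping; fetch each key from the source at most once."""

    def __init__(self, source):
        self.source = source
        self.cache = {}
        self.reads = []   # keys actually fetched from the source, in order

    def __getitem__(self, key):
        if key not in self.cache:
            self.reads.append(key)              # first access: go to the source
            self.cache[key] = self.source[key]
        return self.cache[key]                  # later accesses: served from the cache


# Plain dict standing in for the HDF5 group (assumption for illustration).
source = {
    "part0-node0-offsets": [0, 3, 3, 5],
    "part0-node2-data": [1.1, 2.2, 3.3, 4.4, 5.5],
}
buffers = RecordingCache(source)

# Simulate looking at the "x" field twice.
for _ in range(2):
    buffers["part0-node0-offsets"]
    buffers["part0-node2-data"]

print(buffers.reads)
```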