Generic buffers

Most of the conversion functions target a particular library: NumPy, Arrow, pandas, or Python itself. As a catch-all for other storage formats, Awkward Arrays can be converted to and from sets of named buffers. The buffers are not (usually) intelligible on their own; the length of the array and a JSON document (the Form) are needed to reconstitute the original structure. This section demonstrates how a set of buffers can be used to store an Awkward Array in an HDF5 file, which ordinarily wouldn’t be able to represent nested, irregular data structures.

import awkward as ak
import numpy as np
import h5py
import json

From Awkward to buffers

Consider the following array of variable-length lists of records:

ak_array = ak.Array([
    [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],
    [],
    [{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}],
])
ak_array
<Array [[{x: 1.1, y: [1]}, ... 2, 3, 4, 5]}]] type='3 * var * {"x": float64, "y"...'>

The ak.to_buffers function decomposes it into a set of one-dimensional arrays (a zero-copy operation).

form, length, container = ak.to_buffers(ak_array)

The pieces needed to reconstitute this array are:

  • the Form, which defines how structure is built from one-dimensional arrays,

  • the length of the original array or lengths of all of its partitions (ak.partitions),

  • the one-dimensional arrays in the container (a MutableMapping).

The Form is like an Awkward Type in that it describes how the data are structured, but with more detail: it includes distinctions such as the difference between ListArray and ListOffsetArray, as well as the integer types of structural Indexes.

It is usually displayed as JSON; calling Form.tojson returns a compact, single-line JSON string.

form
{
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "RecordArray",
        "contents": {
            "x": {
                "class": "NumpyArray",
                "itemsize": 8,
                "format": "d",
                "primitive": "float64",
                "form_key": "node2"
            },
            "y": {
                "class": "ListOffsetArray64",
                "offsets": "i64",
                "content": {
                    "class": "NumpyArray",
                    "itemsize": 8,
                    "format": "l",
                    "primitive": "int64",
                    "form_key": "node4"
                },
                "form_key": "node3"
            }
        },
        "form_key": "node1"
    },
    "form_key": "node0"
}

In this case, the length is just an integer. It would be a list of integers if ak_array were partitioned.

length
3

This container is a new dict, but it could have been a user-specified MutableMapping if passed into ak.to_buffers as an argument.

container
{'part0-node0-offsets': array([0, 3, 3, 5], dtype=int64),
 'part0-node2-data': array([1.1, 2.2, 3.3, 4.4, 5.5]),
 'part0-node3-offsets': array([ 0,  1,  3,  6, 10, 15], dtype=int64),
 'part0-node4-data': array([1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5])}
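
To make the role of the offsets buffers concrete, the following sketch rebuilds the nested structure by hand in plain Python, without Awkward. The buffer contents are copied from the container printed above; the helper split_by_offsets is a hypothetical name introduced only for this illustration, and this is not how ak.from_buffers is implemented.

```python
# Buffer contents copied from the container above, as plain Python lists.
outer_offsets = [0, 3, 3, 5]                              # part0-node0-offsets
x_data = [1.1, 2.2, 3.3, 4.4, 5.5]                        # part0-node2-data
y_offsets = [0, 1, 3, 6, 10, 15]                          # part0-node3-offsets
y_data = [1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5]    # part0-node4-data


def split_by_offsets(offsets, content):
    """Hypothetical helper: list i is content[offsets[i]:offsets[i + 1]]."""
    return [content[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]


# Inner lists for the "y" field: one list per record.
y_lists = split_by_offsets(y_offsets, y_data)

# Records pair the i-th "x" value with the i-th "y" list.
records = [{"x": x, "y": y} for x, y in zip(x_data, y_lists)]

# The outer offsets group the five records into three lists (the middle one empty).
rebuilt = split_by_offsets(outer_offsets, records)
```

Each offsets buffer has one more entry than the number of lists it describes, which is why the outer buffer has four entries for an array of length 3.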

From buffers to Awkward

The function that reverses ak.to_buffers is ak.from_buffers. Its first three arguments are form, length, and container.

ak.from_buffers(form, length, container)
<Array [[{x: 1.1, y: [1]}, ... 2, 3, 4, 5]}]] type='3 * var * {"x": float64, "y"...'>

Saving Awkward Arrays to HDF5

The h5py library presents each group in an HDF5 file as a MutableMapping, which we can use as a container for a set of buffers. We must also save the form and length as metadata for the array to be retrievable.

file = h5py.File("/tmp/example.hdf5", "w")   # "w" creates the file, overwriting any existing one
group = file.create_group("awkward")
group
<HDF5 group "/awkward" (0 members)>

We can fill this group as a container by passing it in to ak.to_buffers.

form, length, container = ak.to_buffers(ak_array, container=group)
container
<HDF5 group "/awkward" (4 members)>
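
The HDF5 group works here only because it behaves like a MutableMapping, so other objects with that interface could presumably stand in for it. As a rough sketch, the hypothetical LoggingContainer below is a dict-backed MutableMapping that records which keys are written to it. The class name, its written attribute, and the two manual assignments (which imitate what a buffer-writing function would do) are all assumptions made for illustration.

```python
from collections.abc import MutableMapping


class LoggingContainer(MutableMapping):
    """Hypothetical dict-backed container that records every key written to it."""

    def __init__(self):
        self._store = {}
        self.written = []  # keys in the order they were assigned

    def __getitem__(self, key):
        return self._store[key]

    def __setitem__(self, key, value):
        self.written.append(key)
        self._store[key] = value

    def __delitem__(self, key):
        del self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)


# Stand-in for a function that fills the container with named buffers.
logged = LoggingContainer()
logged["part0-node0-offsets"] = [0, 3, 3, 5]
logged["part0-node2-data"] = [1.1, 2.2, 3.3, 4.4, 5.5]
```

Under the same assumption, passing an instance as the container argument of ak.to_buffers would be expected to fill it the way the HDF5 group is filled above.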

Now the HDF5 group has been filled with array pieces.

container.keys()
<KeysViewHDF5 ['part0-node0-offsets', 'part0-node2-data', 'part0-node3-offsets', 'part0-node4-data']>

Here’s one of them, read back as a NumPy array.

np.asarray(container["part0-node0-offsets"])
array([0, 3, 3, 5])

Now we need to add the form and length to the group as metadata. HDF5 attributes can hold strings and numbers, so the form can be stored as JSON text and the length as a JSON-encoded value.

group.attrs["form"] = form.tojson()
group.attrs["form"]
'{"class":"ListOffsetArray64","offsets":"i64","content":{"class":"RecordArray","contents":{"x":{"class":"NumpyArray","inner_shape":[],"itemsize":8,"format":"d","primitive":"float64","has_identities":false,"parameters":{},"form_key":"node2"},"y":{"class":"ListOffsetArray64","offsets":"i64","content":{"class":"NumpyArray","inner_shape":[],"itemsize":8,"format":"l","primitive":"int64","has_identities":false,"parameters":{},"form_key":"node4"},"has_identities":false,"parameters":{},"form_key":"node3"}},"has_identities":false,"parameters":{},"form_key":"node1"},"has_identities":false,"parameters":{},"form_key":"node0"}'
group.attrs["length"] = json.dumps(length)   # JSON-encode it because it might be a list
group.attrs["length"]
'3'
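
The reason for JSON-encoding the length can be seen with the standard library alone: the same two calls handle both an integer and a list of integers. In this small sketch, the list [3, 2] is a made-up example of partition lengths, not a value taken from the array above.

```python
import json

single_length = 3            # unpartitioned array, as in this example
partition_lengths = [3, 2]   # hypothetical lengths of a two-partition array

# Both cases become short strings suitable for a string-valued attribute.
encoded_single = json.dumps(single_length)
encoded_partitions = json.dumps(partition_lengths)

# Decoding restores the original Python values and types.
decoded_single = json.loads(encoded_single)
decoded_partitions = json.loads(encoded_partitions)
```

Because json.loads restores an int or a list as appropriate, the reading code does not need to know in advance which of the two was stored.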

Reading Awkward Arrays from HDF5

With that, we can reconstitute the array by supplying ak.from_buffers the right arguments from the group and metadata.

The group can’t be used as a container as-is, since subscripting it returns h5py.Dataset objects, rather than arrays.

reconstituted = ak.from_buffers(
    ak.forms.Form.fromjson(group.attrs["form"]),
    json.loads(group.attrs["length"]),
    {k: np.asarray(v) for k, v in group.items()},
)
reconstituted
<Array [[{x: 1.1, y: [1]}, ... 2, 3, 4, 5]}]] type='3 * var * {"x": float64, "y"...'>

Like ak.from_parquet, ak.from_buffers has the option to read lazily, loading only the record fields and partitions that are actually accessed.

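class LazyGet:
    """Read-only view of an HDF5 group that reports each key as it is read."""

    def __init__(self, group):
        self.group = group

    def __getitem__(self, key):
        print(key)                          # show which buffer is being fetched
        return np.asarray(self.group[key])  # convert the h5py.Dataset to an array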

lazy = ak.from_buffers(
    ak.forms.Form.fromjson(group.attrs["form"]),
    json.loads(group.attrs["length"]),
    LazyGet(group),
    lazy=True,
)

The LazyGet class prints each key at the moment it is read from the HDF5 file. Nothing has been printed yet.

lazy.x
part0-node0-offsets
part0-node2-data
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>

Now that we have looked at the "x" field, "part0-node0-offsets" (the outer list structure) and "part0-node2-data" (the "x" values) have been read.

lazy.x
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>

Nothing is printed this time: the buffers were read only once.

lazy.y
part0-node3-offsets
part0-node4-data
<Array [[[1], [1, 2], ... [1, 2, 3, 4, 5]]] type='3 * var * var * int64'>

Looking at the "y" field causes "part0-node3-offsets" (the inner list structure) and "part0-node4-data" (the "y" values) to be read.

lazy.y
<Array [[[1], [1, 2], ... [1, 2, 3, 4, 5]]] type='3 * var * var * int64'>

Again, each buffer is read only once.
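
In the example above, the read-once behavior comes from the lazy array, not from LazyGet. If a container had to give the same guarantee on its own, one possible approach is a small caching wrapper. The sketch below uses a plain dict in place of the HDF5 group; the CachedGet class and its reads attribute are hypothetical names, and this is not a description of Awkward's internal caching.

```python
class CachedGet:
    """Hypothetical read-only wrapper that fetches each key from `source` at most once."""

    def __init__(self, source):
        self.source = source
        self.cache = {}
        self.reads = []  # keys actually fetched from the source, in order

    def __getitem__(self, key):
        if key not in self.cache:
            self.reads.append(key)
            self.cache[key] = self.source[key]
        return self.cache[key]


# A plain dict stands in for the HDF5 group.
source = {
    "part0-node0-offsets": [0, 3, 3, 5],
    "part0-node2-data": [1.1, 2.2, 3.3, 4.4, 5.5],
}
cached = CachedGet(source)

first = cached["part0-node0-offsets"]
second = cached["part0-node0-offsets"]   # served from the cache, no second fetch
```

After these two lookups, cached.reads holds a single entry, mirroring how each key is printed only once above.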