What is an Awkward Array?

Efficiency and generality

Arrays are the most efficient data structures for sequential numeric processing, and NumPy makes it easy to interact with arrays in Python. However, NumPy’s arrays are rectangular tables or tensors that cannot express variable-length structures.

General tree-like data are often expressed using JSON, but at the expense of memory use and processing speed.

Awkward Arrays are general tree-like data structures, like JSON, but contiguous in memory and operated upon with compiled, vectorized code like NumPy. They’re basic building blocks for data analyses that are, well, more awkward than those involving neat tables.

This library was originally developed for high-energy particle physics. Particle physics datasets have rich data structures that usually can’t be flattened into rectangular arrays, but physicists need to process them efficiently because the datasets are enormous. Awkward Arrays combine generic data structures with high-performance number-crunching.

Let’s illustrate this with a non-physics dataset: maps of bike routes in my hometown of Chicago. You can also follow this as a video tutorial.

JSON to array

Here is a GeoJSON of bike paths throughout the city of Chicago.

If you dig into the JSON, you’ll see that it contains street names, metadata, and longitude, latitude coordinates all along the bike paths.

Start by loading it into Python as Python objects,

import urllib.request
import json

url = "https://raw.githubusercontent.com/Chicago/osd-bike-routes/master/data/Bikeroutes.geojson"
bikeroutes_json = urllib.request.urlopen(url).read()
bikeroutes_pyobj = json.loads(bikeroutes_json)

and then as an Awkward Array (actually an ak.Record because the top-level construct is a JSON object).

import awkward as ak

bikeroutes = ak.from_json(bikeroutes_json)
# Alternatively, bikeroutes = ak.Record(bikeroutes_pyobj)
bikeroutes
<Record ... [-87.7, 42], [-87.7, 42]]]}}]} type='{"type": string, "crs": {"type"...'>

We only see a part of the data and its type if we don’t deliberately expand it out.

Data types

To get a full view of the type (Awkward’s equivalent of a NumPy dtype + shape), use the ak.type function. The display format adheres to Datashape syntax, when possible.

ak.type(bikeroutes)
{"type": string, "crs": {"type": string, "properties": {"name": string}}, "features": var * {"type": string, "properties": {"STREET": string, "TYPE": string, "BIKEROUTE": string, "F_STREET": string, "T_STREET": option[string]}, "geometry": {"type": string, "coordinates": var * var * var * float64}}}

In the above, {"field name": type, ...} denotes a record structure, which can be nested, and var indicates variable-length lists. The "coordinates" (at the end) are var * var * var * float64, lists of lists of lists of numbers, and any of these lists can have an arbitrary length.

In addition, there are strings (variable-length lists interpreted as text) and “option” types, meaning that values are allowed to be null.

Slicing

NumPy-like slicing extracts structures within the array. The slice may consist of integers, ranges, and many other slice types, like NumPy, and commas indicate different slices applied to different dimensions. Since the bike routes dataset contains records, we can use strings to select nested fields sequentially.

bikeroutes["features", "geometry", "coordinates"]
<Array [[[[-87.8, 41.9], ... [-87.7, 42]]]] type='1061 * var * var * var * float64'>

Alternatively, we could use dots for record field specifiers (if the field names are syntactically allowed in Python):

bikeroutes.features.geometry.coordinates
<Array [[[[-87.8, 41.9], ... [-87.7, 42]]]] type='1061 * var * var * var * float64'>

Slicing by field names (even if the records those fields belong to are nested within lists) slices across all elements of the lists. We can pick out just one object by putting integers in the square brackets:

bikeroutes["features", "geometry", "coordinates", 100, 0]
<Array [[-87.7, 42], ... [-87.7, 42]] type='7 * var * float64'>

or

bikeroutes.features.geometry.coordinates[100, 0]
<Array [[-87.7, 42], ... [-87.7, 42]] type='7 * var * float64'>

or even

bikeroutes.features[100].geometry.coordinates[0]
<Array [[-87.7, 42], ... [-87.7, 42]] type='7 * var * float64'>

(The strings that select record fields may be placed before or after integers and other slice types.)
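To make the meaning of field-name slicing concrete, here is a rough pure-Python counterpart using ordinary dicts and lists. The tiny `doc` structure and its values are made up for illustration; only the nesting mirrors the GeoJSON.

```python
# A tiny stand-in for the GeoJSON structure (values are invented for illustration).
doc = {
    "features": [
        {"properties": {"STREET": "A ST"},
         "geometry": {"coordinates": [[[-87.8, 41.9], [-87.7, 41.9]]]}},
        {"properties": {"STREET": "B AVE"},
         "geometry": {"coordinates": [[[-87.6, 42.0]],
                                      [[-87.5, 42.0], [-87.5, 42.1]]]}},
    ]
}

# Counterpart of bikeroutes["features", "geometry", "coordinates"]:
# the field lookups are applied to every element of the "features" list.
coordinates = [f["geometry"]["coordinates"] for f in doc["features"]]

# Counterpart of appending integers to the slice, e.g. [..., 1, 0]:
# pick feature 1, then its segment 0.
one_segment = coordinates[1][0]
print(one_segment)
```

With Python objects, the field lookup has to be written as an explicit loop over the list, which is the step that a string in an Awkward slice performs for us.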

To get full detail of one structured object, we can use the ak.to_list function, which converts Awkward records and lists into Python dicts and lists.

ak.to_list(bikeroutes.features[751])
{'type': 'Feature',
 'properties': {'STREET': 'E 26TH ST',
  'TYPE': '1',
  'BIKEROUTE': 'EXISTING BIKE LANE',
  'F_STREET': 'S STATE ST',
  'T_STREET': 'S DR MARTIN LUTHER KING JR DR'},
 'geometry': {'type': 'MultiLineString',
  'coordinates': [[[-87.62685625163756, 41.845587148411795],
    [-87.62675996392576, 41.84558902593194],
    [-87.62637708895348, 41.845596494328554],
    [-87.62626461651281, 41.845598326696425],
    [-87.62618268489399, 41.84559966093136],
    [-87.6261438116618, 41.84560027230502],
    [-87.62613206507362, 41.845600474403334],
    [-87.6261027723024, 41.8456009526551],
    [-87.62579736038116, 41.84560626159298],
    [-87.62553890383363, 41.845610239979905],
    [-87.62532611036139, 41.845613593674],
    [-87.6247932635836, 41.84562202574476]],
   [[-87.62532611036139, 41.845613593674],
    [-87.6247932635836, 41.84562202574476]],
   [[-87.6247932635836, 41.84562202574476],
    [-87.62446484629729, 41.84562675013391],
    [-87.62444032614908, 41.845627092762086]],
   [[-87.6247932635836, 41.84562202574476],
    [-87.62446484629729, 41.84562675013391],
    [-87.62444032614908, 41.845627092762086]],
   [[-87.62444032614908, 41.845627092762086],
    [-87.62417259047609, 41.84563048939241]],
   [[-87.62417259047609, 41.84563048939241],
    [-87.62407957610536, 41.845631726253856],
    [-87.62363619038386, 41.84563829041728],
    [-87.62339190417225, 41.845641912449615],
    [-87.62213773032211, 41.8456604706941],
    [-87.620481318361, 41.84568497173672],
    [-87.62033059867875, 41.84568719208078],
    [-87.61886420422526, 41.84571018731772],
    [-87.61783987848477, 41.845726258794926],
    [-87.61768559736353, 41.84572529758383],
    [-87.61767695024436, 41.84572400878766]],
   [[-87.62417259047609, 41.84563048939241],
    [-87.62407957610536, 41.845631726253856],
    [-87.62363619038386, 41.84563829041728]]]}}

Looking at one record in full detail can make it clear why, for instance, the “coordinates” field contains lists of lists of lists: they are path segments that collectively form a route, and there are many routes, each associated with a named street. This item, number 751, is Martin Luther King Drive, a route described by 7 segments. (Presumably, you have to pick up your bike and walk it.)

Variable-length lists

The last dimension of these lists always happens to have length 2. This is because it represents the longitude and latitude of each point along a path. You can see this with the ak.num function:

ak.num(bikeroutes.features[751].geometry.coordinates, axis=2)
<Array [[2, 2, 2, 2, 2, 2, ... 2], [2, 2, 2]] type='7 * var * int64'>

The axis is the depth at which this function is applied; the above could alternatively have been axis=-1 (deepest), and ak.num at less-deep axis values tells us the number of points in each segment:

ak.num(bikeroutes.features[751].geometry.coordinates, axis=1)
<Array [12, 2, 3, 3, 2, 11, 3] type='7 * int64'>

and the number of segments:

ak.num(bikeroutes.features[751].geometry.coordinates, axis=0)
7

By verifying that all lists at this depth have length 2,

ak.all(ak.num(bikeroutes.features.geometry.coordinates, axis=-1) == 2)
True

we can be confident that we can select item 0 and item 1 without errors. Note that this is a major difference between variable-length lists and rectilinear arrays: in NumPy, a given index either exists for all nested lists or for none of them. For variable-length lists, we have to check (or ensure it with another selection).
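As an informal illustration of what counting at each depth means, the following pure-Python sketch computes the same kinds of counts on a small hand-made route. The `route` data and the `num_axis*` names are hypothetical; this is not how ak.num is implemented.

```python
# One route: a list of segments, each a list of [longitude, latitude] points.
route = [
    [[-87.63, 41.85], [-87.62, 41.85], [-87.61, 41.85]],
    [[-87.61, 41.85], [-87.60, 41.85]],
]

num_axis0 = len(route)                             # number of segments
num_axis1 = [len(segment) for segment in route]    # points in each segment
num_axis2 = [[len(point) for point in segment]     # numbers in each point
             for segment in route]

# The check from the text: every innermost list has exactly two items.
all_pairs = all(n == 2 for counts in num_axis2 for n in counts)
print(num_axis0, num_axis1, num_axis2, all_pairs)
```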

Array math

We now know that the "coordinates" are longitude-latitude pairs, so let’s pull them out and name them as such. Item 0 of each of the deepest lists is the longitude and item 1 of each of the deepest lists is the latitude. We want to leave the structure of all lists other than the deepest untouched, which would mean a complete slice (colon : by itself) at each dimension except the last, but we can also use the ellipsis (...) shortcut from NumPy.

longitude = bikeroutes.features.geometry.coordinates[..., 0]
latitude = bikeroutes.features.geometry.coordinates[..., 1]
longitude, latitude
(<Array [[[-87.8, -87.8, ... -87.7, -87.7]]] type='1061 * var * var * float64'>,
 <Array [[[41.9, 41.9, 41.9, ... 42, 42, 42]]] type='1061 * var * var * float64'>)

Note that if we wanted to do this with Python objects, the above would have required many “append” operations in nested “for” loops. As Awkward Arrays, it’s just a slice.
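For comparison, this is roughly what the nested-loop version looks like on a small, made-up `coordinates` list with the same three levels of nesting:

```python
# Made-up data: routes -> segments -> [longitude, latitude] points.
coordinates = [
    [[[-87.8, 41.9], [-87.7, 41.9]]],
    [[[-87.6, 42.0]], [[-87.5, 42.0], [-87.5, 42.1]]],
]

longitude, latitude = [], []
for route in coordinates:
    route_lng, route_lat = [], []
    for segment in route:
        seg_lng, seg_lat = [], []
        for lng, lat in segment:
            seg_lng.append(lng)   # item 0 of the innermost list
            seg_lat.append(lat)   # item 1 of the innermost list
        route_lng.append(seg_lng)
        route_lat.append(seg_lat)
    longitude.append(route_lng)
    latitude.append(route_lat)

print(longitude)
print(latitude)
```

The outer two levels of nesting are rebuilt by hand here, whereas the ellipsis slice leaves them untouched automatically.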

Now that we have arrays of pure numbers (albeit inside of variable-length nested lists), we can run NumPy functions on them. For example,

import numpy as np

np.add(longitude, 180)
<Array [[[92.2, 92.2, 92.2, ... 92.3, 92.3]]] type='1061 * var * var * float64'>

rotates the longitude points 180 degrees around the world while maintaining the triply nested structure. Any “universal function” (ufunc) will work, including ufuncs from libraries other than NumPy (such as SciPy, or a domain-specific package). Simple NumPy functions like addition have the usual shortcuts:

longitude + 180
<Array [[[92.2, 92.2, 92.2, ... 92.3, 92.3]]] type='1061 * var * var * float64'>

In addition, some functions other than ufuncs have an Awkward equivalent, such as ak.mean, which is the equivalent of NumPy’s np.mean (not a ufunc because it takes a whole array and returns one value).

ak.mean(longitude)
-87.67152377693318

Using an extension mechanism within NumPy (introduced in NumPy 1.17), we can use ak.mean and np.mean interchangeably.

np.mean(longitude)
-87.67152377693318

Awkward functions have all or most of the same arguments as their NumPy equivalents. For instance, we can compute the mean along an axis, such as axis=1, which gives us the mean longitude of each path, rather than a single mean of all points.

np.mean(longitude, axis=1)
<Array [[-87.8, -87.8, ... -87.7, -87.7]] type='1061 * var * ?float64'>

To focus our discussion, let’s say that we’re trying to find the length of each path in the dataset. To do this, we need to convert the degrees longitude and latitude into common distance units, and to work with smaller numbers, we’ll start by subtracting the mean.

At Chicago’s latitude, one degree of longitude is 82.7 km and one degree of latitude is 111.1 km, which we can use as conversion factors.

km_east = (longitude - np.mean(longitude)) * 82.7 # km/deg
km_north = (latitude - np.mean(latitude)) * 111.1 # km/deg
km_east, km_north
(<Array [[[-9.68, -9.69, ... -3.58, -3.62]]] type='1061 * var * var * float64'>,
 <Array [[[6.68, 6.68, 6.67, ... 9.68, 9.72]]] type='1061 * var * var * float64'>)
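The longitude factor can be sanity-checked with a short calculation: on a spherical Earth, circles of latitude shrink by the cosine of the latitude. The value 41.88 degrees below is an assumed approximate latitude for Chicago, not a number taken from the dataset.

```python
import math

KM_PER_DEG_LATITUDE = 111.1    # value used in the text
CHICAGO_LATITUDE_DEG = 41.88   # assumption: approximate latitude of Chicago

# One degree of longitude spans cos(latitude) times one degree of latitude.
km_per_deg_longitude = KM_PER_DEG_LATITUDE * math.cos(
    math.radians(CHICAGO_LATITUDE_DEG)
)
print(round(km_per_deg_longitude, 1))  # 82.7
```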

To find distances between points, we first have to pair up points with their neighbors. Each path segment of \(N\) points has \(N-1\) pairs of neighbors. We can construct these pairs by making two partial copies of each list, one with everything except the first element and the other with everything except the last element, so that original index \(i\) can be compared with original index \(i+1\).

In plain NumPy, you would express it like this:

path = np.array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
path[1:] - path[:-1]
array([1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1])

The array[1:] has the first element dropped and the array[:-1] has the last element dropped, so their differences are the 8 distances between each of the 9 points in the original array. In this example, all differences are 1.1.

Here’s what that looks like for the first segment of the first bike path in our sample:

km_east[0, 0, 1:], km_east[0, 0, :-1]
(<Array [-9.69, -9.7, -9.71, ... -9.92, -9.92] type='15 * float64'>,
 <Array [-9.68, -9.69, -9.7, ... -9.91, -9.92] type='15 * float64'>)

and their differences are:

km_east[0, 0, 1:] - km_east[0, 0, :-1]
<Array [-0.00603, -0.0165, ... -0.00203] type='15 * float64'>

If we can do it for one list, we can do it for all of them by swapping index 0 with slice : in the first two dimensions.

km_east[:, :, 1:] - km_east[:, :, :-1]
<Array [[[-0.00603, -0.0165, ... -0.0385]]] type='1061 * var * var * float64'>

This expression subtracts pairs of neighboring points in all lists, each with a different length, maintaining the segments-within-paths structure.

Now that we know how to compute differences in \(x\) (km_east) and \(y\) (km_north) individually, we can compute distances using the distance formula: \(\sqrt{(x_i - x_{i + 1})^2 + (y_i - y_{i + 1})^2}\).

segment_length = np.sqrt(
    ( km_east[:, :, 1:] -  km_east[:, :, :-1])**2 +
    (km_north[:, :, 1:] - km_north[:, :, :-1])**2
)
segment_length
<Array [[[0.00603, 0.0165, ... 0.0523]]] type='1061 * var * var * float64'>

Going back to our example of Martin Luther King Drive, these pairwise distances are

ak.to_list(segment_length[751])
[[0.007965725361784344,
  0.03167462986536074,
  0.009303698353632218,
  0.006777366140642045,
  0.0032155337770008734,
  0.0009717022897204878,
  0.0024230948099507807,
  0.025264451818903598,
  0.021378926022712963,
  0.01760196411492381,
  0.0440763850916575],
 [0.0440763850916575],
 [0.02716518085563118, 0.0020281735105371133],
 [0.02716518085563118, 0.0020281735105371133],
 [0.02214495567782706],
 [0.007693515757648775,
  0.03667525064914765,
  0.020206477031414642,
  0.10374066852963969,
  0.13701231191288935,
  0.012466958457596494,
  0.12129772855897196,
  0.08473055433052853,
  0.012759495625968342,
  0.0007293106274618985],
 [0.007693515757648775, 0.03667525064914765]]

for each of the segments in this discontiguous path. Some of these segments had only two longitude, latitude points, and hence they have only one distance (single-element lists).
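As a cross-check, the same pairing of neighbors can be sketched in plain Python for the first three points of item 751 (copied from the ak.to_list output earlier). Subtracting the mean is skipped here because it cancels in the differences.

```python
import math

KM_PER_DEG_EAST, KM_PER_DEG_NORTH = 82.7, 111.1  # conversion factors from the text

# First three points of the first segment of item 751.
segment = [
    [-87.62685625163756, 41.845587148411795],
    [-87.62675996392576, 41.84558902593194],
    [-87.62637708895348, 41.845596494328554],
]

km_east = [lng * KM_PER_DEG_EAST for lng, lat in segment]
km_north = [lat * KM_PER_DEG_NORTH for lng, lat in segment]

# zip(xs[1:], xs[:-1]) pairs element i+1 with element i, like x[1:] - x[:-1].
distances = [
    math.hypot(x1 - x0, y1 - y0)
    for x1, x0, y1, y0 in zip(km_east[1:], km_east[:-1],
                              km_north[1:], km_north[:-1])
]
print(distances)
```

Three points give two distances, and up to floating-point rounding they should agree with the first two entries of the first list above (about 0.00797 km and 0.0317 km).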

To make path distances from the pairwise distances, we need to add them up. There’s an ak.sum (equivalent to np.sum) that we can use with axis=-1 to add up the innermost lists.

For Martin Luther King Drive, this is

ak.to_list(ak.sum(segment_length[751], axis=-1))
[0.17065347764628935,
 0.0440763850916575,
 0.029193354366168295,
 0.029193354366168295,
 0.02214495567782706,
 0.5373122714812673,
 0.04436876640679643]

and in general, it’s

path_length = np.sum(segment_length, axis=-1)
path_length
<Array [[0.241], [0.0971], ... 0.347], [0.281]] type='1061 * var * float64'>

Notice that segment_length has type

ak.type(segment_length)
1061 * var * var * float64

and path_length has type

ak.type(path_length)
1061 * var * float64

The path_length has one fewer var dimension because we have summed over it. To get total lengths, we can sum once more over the discontiguous paths that 11 of the streets have.

Since there are multiple paths for each bike route, we sum up the innermost dimension again:

route_length = np.sum(path_length, axis=-1)
route_length
<Array [0.241, 0.0971, 0.203, ... 0.347, 0.281] type='1061 * float64'>

Now there’s exactly one of these for each of the 1061 streets.

for i in range(10):
    print(bikeroutes.features.properties.STREET[i], "\t\t", route_length[i])
W FULLERTON AVE 		 0.24076035127094295
N LA CROSSE AVE 		 0.09706818131239836
S DR MARTIN LUTHER KING JR DR W 		 0.20258150113769838
W 51ST ST 		 0.8459916013923557
E 50TH ST 		 0.021616600297903087
W MARQUETTE RD 		 0.7926173720366738
W MARQUETTE RD 		 0.4040218089349682
W 83RD ST 		 0.20738439769758524
E 83RD ST 		 0.12660735184266853
E 103RD ST 		 0.2740970708688548

This would have been incredibly awkward to write using only NumPy, and slow if executed in Python loops.
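For reference, the two summations over the innermost dimension can be sketched with plain Python lists. The grouping into two "routes" below is invented for illustration; the individual distances are borrowed from the item 751 listing.

```python
# routes -> segments -> pairwise distances in km (grouping is made up).
segment_length = [
    [[0.0440763850916575], [0.02716518085563118, 0.0020281735105371133]],
    [[0.02214495567782706]],
]

# First sum over the innermost lists: one length per segment.
path_length = [[sum(segment) for segment in route] for route in segment_length]

# Second sum over what is now the innermost level: one length per route.
route_length = [sum(paths) for paths in path_length]

print(path_length)
print(route_length)
```

Each summation removes one level of nesting, which is the same effect that axis=-1 has on the var dimensions of the Awkward Array.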

Performance

The full analysis, expressed in Python for loops, would be:

%%timeit

route_length = []
for route in bikeroutes_pyobj["features"]:
    path_length = []
    for segment in route["geometry"]["coordinates"]:
        segment_length = []
        last = None
        for lng, lat in segment:
            km_east = lng * 82.7
            km_north = lat * 111.1
            if last is not None:
                dx2 = (km_east - last[0])**2
                dy2 = (km_north - last[1])**2
                segment_length.append(np.sqrt(dx2 + dy2))
            last = (km_east, km_north)
        path_length.append(sum(segment_length))
    route_length.append(sum(path_length))
90.5 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

whereas for Awkward Arrays, it is:

%%timeit

km_east = bikeroutes.features.geometry.coordinates[..., 0] * 82.7
km_north = bikeroutes.features.geometry.coordinates[..., 1] * 111.1

segment_length = np.sqrt((km_east[:, :, 1:] - km_east[:, :, :-1])**2 +
                         (km_north[:, :, 1:] - km_north[:, :, :-1])**2)

path_length = np.sum(segment_length, axis=-1)
route_length = np.sum(path_length, axis=-1)
17.8 ms ± 383 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In addition to being more concise, the latter is typically 5‒8× faster, especially when we scale to ever-larger problems.

The reasons for this speedup all stem from Awkward Array’s data structure, which is better suited to structured numerical math than Python objects are. Like a NumPy array, its numerical data are packed in memory-contiguous arrays of homogeneous type, which means that

  • only a single block of memory needs to be fetched from memory into the CPU cache (no “pointer chasing”),

  • data for fields other than the one being operated upon are not in the same buffer, so they don’t even need to be loaded (“columnar,” rather than “record-oriented”),

  • the data type can be evaluated once before applying a precompiled operation to a whole array buffer, rather than once before each element of a Python list.
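A much-simplified sketch of this idea, using only the standard library: variable-length lists can be stored as one flat buffer of values plus a buffer of list boundaries. The names `content`, `offsets`, `get_list`, and `num` are chosen here for illustration, and the library's actual layout classes carry more information than this.

```python
from array import array

# The lists [[1.1, 2.2, 3.3], [], [4.4, 5.5]] stored as two flat buffers.
content = array("d", [1.1, 2.2, 3.3, 4.4, 5.5])   # all numbers, contiguous
offsets = array("q", [0, 3, 3, 5])                # where each list starts and stops

def get_list(i):
    """Return list i as a slice of the flat buffer."""
    return content[offsets[i]:offsets[i + 1]].tolist()

def num():
    """Length of every list, computed from the offsets alone."""
    return [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]

# An elementwise operation touches only the flat buffer of numbers;
# the offsets describing the list structure can be reused unchanged.
shifted = array("d", (x + 180 for x in content))

print(get_list(0), get_list(1), get_list(2))
print(num())
```

Because the numbers sit in a single typed buffer, an operation such as adding 180 never has to inspect the list structure or the type of each element.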

This memory layout is especially good for applying one operation on all values in the array, thinking about the result, and then applying another. This is the “interactive” style of data analysis that you’re probably familiar with from NumPy and Pandas, especially if you use Jupyter notebooks. It does have a performance cost, however: array buffers need to be allocated and filled after each step of the process, and some of those might never be used again.

Just as NumPy can be accelerated by just-in-time compiling your code with Numba, Awkward Arrays can be accelerated in the same way. The speedups described on Numba’s website are possible because they avoid creating temporary, intermediate arrays and flushing the CPU cache with multiple passes over the same data. The Numba-accelerated equivalent of our bike routes example looks very similar to the pure Python code:

import numba as nb

@nb.jit
def compute_lengths(bikeroutes):
    route_length = np.zeros(len(bikeroutes.features))
    for i in range(len(bikeroutes.features)):
        for path in bikeroutes.features[i].geometry.coordinates:
            first = True
            last_east, last_north = 0.0, 0.0
            for lng_lat in path:
                km_east = lng_lat[0] * 82.7
                km_north = lng_lat[1] * 111.1
                if not first:
                    dx2 = (km_east - last_east)**2
                    dy2 = (km_north - last_north)**2
                    route_length[i] += np.sqrt(dx2 + dy2)
                first = False
                last_east, last_north = km_east, km_north
    return route_length

compute_lengths(bikeroutes)

But it runs 250× faster than the pure Python code:

%%timeit

compute_lengths(bikeroutes)

(Note that these are microseconds, not milliseconds.)

This improvement is due to a combination of streamlined data structures, precompiled logic, and minimizing the number of passes over the data. We haven’t even taken advantage of multithreading yet, which can multiply this speedup by (up to) the number of CPU cores your computer has. (See Numba’s parallel range, multithreading, and nogil mode for more.)

Internal structure

It’s possible to peek into this columnar structure (or manipulate it, if you’re a developer) by accessing the ak.Array’s layout. All of the columnar buffers are accessible this way.

If you look carefully at the following, you’ll see that the values for each field are in a separate buffer; the last of these holds the longitude, latitude coordinates.

bikeroutes.layout

Compatibility

The Awkward Array library is not intended to replace your data analysis tools. It adds one key feature: the ability to manipulate JSON-like data structures with NumPy-like idioms. It “plays well” with the scientific Python ecosystem, providing functions to convert arrays into forms recognized by other libraries and adhering to standard protocols for sharing data.

Awkward Arrays can be converted to and from Apache Arrow:

ak.to_arrow(bikeroutes.features).type

To and from Parquet files (through pyarrow):

ak.to_parquet(bikeroutes.features, "/tmp/bikeroutes.parquet")

To and from JSON:

ak.to_json(bikeroutes.features)[:100]

To Pandas:

ak.to_pandas(bikeroutes.features)

And to NumPy, if the arrays are first padded to be rectilinear:

ak.to_numpy(
    ak.pad_none(
        ak.pad_none(
            bikeroutes.features.geometry.coordinates, 1980, axis=2
        ), 7, axis=1
    )
)
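To illustrate what padding to a rectilinear shape means, here is a pure-Python sketch for a single level of nesting. The `pad_none` function below is a hypothetical stand-in written for this example, not the library function, and it only handles one list depth.

```python
def pad_none(lists, target):
    """Pad each list with None up to at least `target` items.

    Lists that are already longer than `target` are left as they are.
    """
    return [lst + [None] * max(0, target - len(lst)) for lst in lists]

ragged = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
padded = pad_none(ragged, 3)
print(padded)
```

Once every list has the same length, the data fit a rectangular table, with the None placeholders marking the missing entries.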

Where to go next

The rest of these tutorials show how to use Awkward Array with various libraries, as well as how to do things that only Awkward Array can do. They are organized by task: see the left-bar (≡ button on mobile) for what you’re trying to do. If, however, you’re looking for documentation on a specific function, see the Python and C++ references below.

Python
API reference

C++
API reference