How to convert to Pandas

Pandas is a data analysis library for ordered time-series and relational data. In general, Pandas does not define operations for manipulating nested data structures, but in some cases, MultiIndex/advanced indexing can do equivalent things.

import awkward as ak
import pandas as pd
import pyarrow as pa
import urllib.request

From Pandas to Awkward

At the time of writing, there is no ak.from_pandas function, but such a thing could be useful.

However, Apache Arrow can be converted to and from Awkward Arrays, and Arrow can be converted to and from Pandas (sometimes zero-copy). See below for more on conversion through Arrow.

From Awkward to Pandas

The function for Awkward → Pandas conversion is ak.to_pandas.

ak_array = ak.Array([
    {"x": 1.1, "y": 1, "z": "one"},
    {"x": 2.2, "y": 2, "z": "two"},
    {"x": 3.3, "y": 3, "z": "three"},
    {"x": 4.4, "y": 4, "z": "four"},
    {"x": 5.5, "y": 5, "z": "five"},
])
ak_array
<Array [{x: 1.1, y: 1, ... z: 'five'}] type='5 * {"x": float64, "y": int64, "z":...'>
ak.to_pandas(ak_array)
         x  y      z
entry               
0      1.1  1    one
1      2.2  2    two
2      3.3  3  three
3      4.4  4   four
4      5.5  5   five

Awkward record field names are converted into Pandas column names, even if nested within lists.

ak_array = ak.Array([
    [{"x": 1.1, "y": 1, "z": "one"}, {"x": 2.2, "y": 2, "z": "two"}, {"x": 3.3, "y": 3, "z": "three"}],
    [],
    [{"x": 4.4, "y": 4, "z": "four"}, {"x": 5.5, "y": 5, "z": "five"}],
])
ak_array
<Array [[{x: 1.1, y: 1, ... z: 'five'}]] type='3 * var * {"x": float64, "y": int...'>
ak.to_pandas(ak_array)
                  x  y      z
entry subentry               
0     0         1.1  1    one
      1         2.2  2    two
      2         3.3  3  three
2     0         4.4  4   four
      1         5.5  5   five

In this case, we see that the "x", "y", and "z" fields are separate columns, but also that the index is now hierarchical, a MultiIndex. Nested lists become MultiIndex rows and nested records become MultiIndex columns.
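To make the row encoding concrete, the following sketch builds the same kind of entry/subentry MultiIndex by hand, using only Pandas. It illustrates the layout and is not a description of how ak.to_pandas is implemented; the variable names and the single "values" column are chosen here for the example.

```python
import pandas as pd

# A list of variable-length lists, with an empty list in the middle.
nested = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]

# One (entry, subentry) pair per inner value; the empty list contributes none.
pairs = [
    (entry, subentry)
    for entry, sub in enumerate(nested)
    for subentry in range(len(sub))
]
values = [value for sub in nested for value in sub]

index = pd.MultiIndex.from_tuples(pairs, names=["entry", "subentry"])
df = pd.DataFrame({"values": values}, index=index)

# Selecting on the outer level recovers one of the original lists.
print(df.xs(2, level="entry")["values"].tolist())
```

Note that entry 1, the empty list, has no rows at all in this encoding, which is why it is also absent from the ak.to_pandas output above.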

Here is an example with three levels of depth:

ak_array = ak.Array([
    [[1.1, 2.2], [], [3.3]],
    [],
    [[4.4], [5.5, 6.6]],
    [[7.7]],
    [[8.8]],
])
ak_array
<Array [[[1.1, 2.2], [], ... 7.7]], [[8.8]]] type='5 * var * var * float64'>
ak.to_pandas(ak_array)
                            values
entry subentry subsubentry        
0     0        0               1.1
               1               2.2
      2        0               3.3
2     0        0               4.4
      1        0               5.5
               1               6.6
3     0        0               7.7
4     0        0               8.8

And here is an example with nested records/hierarchical columns:

ak_array = ak.Array([
    {"I": {"a": _, "b": {"i": _}}, "II": {"x": {"y": {"z": _}}}}
    for _ in range(0, 50, 10)]
)
ak_array
<Array [{I: {a: 0, b: {i: 0}, ... z: 40}}}}] type='5 * {"I": {"a": int64, "b": {...'>
ak.to_pandas(ak_array)
        I      II
        a   b   x
            i   y
                z
entry            
0       0   0   0
1      10  10  10
2      20  20  20
3      30  30  30
4      40  40  40

Although nested lists and records can be represented using Pandas’s MultiIndex, different-length lists in the same data structure can only be translated without loss into multiple DataFrames. This is because a DataFrame can have only one MultiIndex, but lists of different lengths require different MultiIndexes.

ak_array = ak.Array([
    {"x": [], "y": [4.4, 3.3, 2.2, 1.1]},
    {"x": [1], "y": [3.3, 2.2, 1.1]},
    {"x": [1, 2], "y": [2.2, 1.1]},
    {"x": [1, 2, 3], "y": [1.1]},
    {"x": [1, 2, 3, 4], "y": []},
])
ak_array
<Array [{x: [], y: [4.4, 3.3, ... 4], y: []}] type='5 * {"x": var * int64, "y": ...'>

To avoid losing any data, ak.to_pandas can be used with how=None (the default is how="inner") to return a list of the minimum number of DataFrames needed to encode the data.

In how=None mode, ak.to_pandas always returns a list (sometimes with only one item).

ak.to_pandas(ak_array, how=None)
[                x
 entry subentry   
 1     0         1
 2     0         1
       1         2
 3     0         1
       1         2
       2         3
 4     0         1
       1         2
       2         3
       3         4,
                   y
 entry subentry     
 0     0         4.4
       1         3.3
       2         2.2
       3         1.1
 1     0         3.3
       1         2.2
       2         1.1
 2     0         2.2
       1         1.1
 3     0         1.1]

The default how="inner" combines the above into a single DataFrame using pd.merge. This operation is lossy: only the (entry, subentry) pairs that appear in every DataFrame survive the join.

ak.to_pandas(ak_array, how="inner")
                x    y
entry subentry        
1     0         1  3.3
2     0         1  2.2
      1         2  1.1
3     0         1  1.1

The value of how is passed to pd.merge, so outer joins are possible as well.

ak.to_pandas(ak_array, how="outer")
                  x    y
entry subentry          
0     0         NaN  4.4
      1         NaN  3.3
      2         NaN  2.2
      3         NaN  1.1
1     0         1.0  3.3
      1         NaN  2.2
      2         NaN  1.1
2     0         1.0  2.2
      1         2.0  1.1
3     0         1.0  1.1
      1         2.0  NaN
      2         3.0  NaN
4     0         1.0  NaN
      1         2.0  NaN
      2         3.0  NaN
      3         4.0  NaN
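The inner and outer results above can be reproduced with Pandas alone, which may help in seeing where rows are lost. This is a sketch under assumptions: the helper one_column_frame is hypothetical, and joining on the index with left_index=True and right_index=True is a guess at how the merge is keyed, not something taken from the ak.to_pandas source.

```python
import pandas as pd

def one_column_frame(name, lists):
    """Build a one-column DataFrame indexed by (entry, subentry)."""
    pairs = [(i, j) for i, sub in enumerate(lists) for j in range(len(sub))]
    index = pd.MultiIndex.from_tuples(pairs, names=["entry", "subentry"])
    return pd.DataFrame({name: [v for sub in lists for v in sub]}, index=index)

# The same "x" and "y" lists as in the Awkward Array above.
x = one_column_frame("x", [[], [1], [1, 2], [1, 2, 3], [1, 2, 3, 4]])
y = one_column_frame("y", [[4.4, 3.3, 2.2, 1.1], [3.3, 2.2, 1.1], [2.2, 1.1], [1.1], []])

# Inner join keeps only index pairs present in both frames;
# outer join keeps every pair and fills the gaps with NaN.
inner = pd.merge(x, y, how="inner", left_index=True, right_index=True)
outer = pd.merge(x, y, how="outer", left_index=True, right_index=True)

print(len(inner), len(outer))  # prints: 4 16
```

Each input frame has 10 rows, so the inner join discards 6 rows from each side, while the outer join introduces 6 NaN values per column and converts the integer "x" column to floating point.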

Conversion through Apache Arrow

Since Apache Arrow can be converted to and from Awkward Arrays and Pandas, Arrow can connect Awkward and Pandas in both directions. This is an alternative to ak.to_pandas (described above) with different behavior.

As described in the tutorial on Arrow, the ak.to_arrow function returns a pyarrow.lib.Array object. Arrow’s conversion to a Pandas DataFrame requires a pyarrow.lib.Table, so the record fields are first assembled into a table, one column per field.

ak_array = ak.Array([
    [{"x": 1.1, "y": 1, "z": "one"}, {"x": 2.2, "y": 2, "z": "two"}, {"x": 3.3, "y": 3, "z": "three"}],
    [],
    [{"x": 4.4, "y": 4, "z": "four"}, {"x": 5.5, "y": 5, "z": "five"}],
])
ak_array
<Array [[{x: 1.1, y: 1, ... z: 'five'}]] type='3 * var * {"x": float64, "y": int...'>
pa_array = ak.to_arrow(ak_array)
pa_array
<pyarrow.lib.LargeListArray object at 0x7f1d31bd03d0>
[
  -- is_valid: all not null
  -- child 0 type: double
    [
      1.1,
      2.2,
      3.3
    ]
  -- child 1 type: int64
    [
      1,
      2,
      3
    ]
  -- child 2 type: string
    [
      "one",
      "two",
      "three"
    ],
  -- is_valid: all not null
  -- child 0 type: double
    []
  -- child 1 type: int64
    []
  -- child 2 type: string
    [],
  -- is_valid: all not null
  -- child 0 type: double
    [
      4.4,
      5.5
    ]
  -- child 1 type: int64
    [
      4,
      5
    ]
  -- child 2 type: string
    [
      "four",
      "five"
    ]
]
pa_table = pa.Table.from_batches([pa.RecordBatch.from_arrays([
    ak.to_arrow(ak_array.x),
    ak.to_arrow(ak_array.y),
    ak.to_arrow(ak_array.z),
], ["x", "y", "z"])])
pa_table
pyarrow.Table
x: large_list<item: double not null>
  child 0, item: double not null
y: large_list<item: int64 not null>
  child 0, item: int64 not null
z: large_list<item: string not null>
  child 0, item: string not null
pa_table.to_pandas()
                 x          y                  z
0  [1.1, 2.2, 3.3]  [1, 2, 3]  [one, two, three]
1               []         []                 []
2       [4.4, 5.5]     [4, 5]       [four, five]

Note that this is different from the output of ak.to_pandas:

ak.to_pandas(ak_array)
                  x  y      z
entry subentry               
0     0         1.1  1    one
      1         2.2  2    two
      2         3.3  3  three
2     0         4.4  4   four
      1         5.5  5   five

The Awkward → Arrow → Pandas route leaves the lists as nested data within each cell, whereas ak.to_pandas encodes the nested structure with a MultiIndex/advanced indexing and puts simple values in each cell. Depending on your needs, one or the other may be desirable.
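If you start from the nested-cell layout and later want the MultiIndex layout, Pandas can do the reshaping itself. The following is a minimal sketch, assuming Pandas 1.3 or later (needed for exploding several columns at once) and using plain Python lists in the cells as a stand-in for the Arrow-derived DataFrame; the names nested_cells and flat are chosen for this example.

```python
import pandas as pd

# Same shape as the Arrow-derived DataFrame: one list per cell.
nested_cells = pd.DataFrame({
    "x": [[1.1, 2.2, 3.3], [], [4.4, 5.5]],
    "y": [[1, 2, 3], [], [4, 5]],
    "z": [["one", "two", "three"], [], ["four", "five"]],
})
nested_cells.index.name = "entry"

# explode() turns each list element into its own row; an empty list becomes
# a single all-NaN row, which is dropped to match the MultiIndex layout.
flat = nested_cells.explode(["x", "y", "z"]).dropna(how="all")

# Number the rows within each entry to form the second index level.
subentry = flat.groupby(level="entry").cumcount().rename("subentry")
flat = flat.set_index(subentry, append=True)

print(flat)
```

The exploded columns keep the object dtype, so a call such as flat.astype({"x": "float64", "y": "int64"}) may be needed before numerical work.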

Finally, the Pandas → Arrow → Awkward route is currently the only means of turning Pandas DataFrames into Awkward Arrays.

pokemon = urllib.request.urlopen("https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv")
df = pd.read_csv(pokemon)
df
       #                   Name   Type 1  Type 2  Total   HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
0      1              Bulbasaur    Grass  Poison    318   45      49       49       65       65     45           1      False
1      2                Ivysaur    Grass  Poison    405   60      62       63       80       80     60           1      False
2      3               Venusaur    Grass  Poison    525   80      82       83      100      100     80           1      False
3      3  VenusaurMega Venusaur    Grass  Poison    625   80     100      123      122      120     80           1      False
4      4             Charmander     Fire     NaN    309   39      52       43       60       50     65           1      False
...  ...                    ...      ...     ...    ...  ...     ...      ...      ...      ...    ...         ...        ...
795  719                Diancie     Rock   Fairy    600   50     100      150      100      150     50           6       True
796  719    DiancieMega Diancie     Rock   Fairy    700   50     160      110      160      110    110           6       True
797  720    HoopaHoopa Confined  Psychic   Ghost    600   80     110       60      150      130     70           6       True
798  720     HoopaHoopa Unbound  Psychic    Dark    680   80     160       60      170      130     80           6       True
799  721              Volcanion     Fire   Water    600   80     110      120      130       90     70           6       True

800 rows × 13 columns

ak_array = ak.from_arrow(pa.Table.from_pandas(df))
ak_array
<Array [{'#': 1, ... Legendary: True}] type='800 * {"#": ?int64, "Name": option[...'>
ak.type(ak_array)
800 * {"#": ?int64, "Name": option[string], "Type 1": option[string], "Type 2": option[string], "Total": ?int64, "HP": ?int64, "Attack": ?int64, "Defense": ?int64, "Sp. Atk": ?int64, "Sp. Def": ?int64, "Speed": ?int64, "Generation": ?int64, "Legendary": ?bool}
ak.to_list(ak_array[0])
{'#': 1,
 'Name': 'Bulbasaur',
 'Type 1': 'Grass',
 'Type 2': 'Poison',
 'Total': 318,
 'HP': 45,
 'Attack': 49,
 'Defense': 49,
 'Sp. Atk': 65,
 'Sp. Def': 65,
 'Speed': 45,
 'Generation': 1,
 'Legendary': False}

This array is ready for data analysis.

ak_array[ak_array.Legendary].Attack - ak_array[ak_array.Legendary].Defense
<Array [-15, 5, 10, 20, ... 50, 50, 100, -10] type='65 * ?int64'>