How to convert to Pandas#

Pandas is a data analysis library for ordered time-series and relational data. In general, Pandas does not define operations for manipulating nested data structures, but in some cases, MultiIndex/advanced indexing can do equivalent things.

import awkward as ak
import pandas as pd
import pyarrow as pa
import urllib.request

From Pandas to Awkward#

At the time of writing, there is no ak.from_pandas function, but such a thing could be useful.

However, Apache Arrow can be converted to and from Awkward Arrays, and Arrow can be converted to and from Pandas (sometimes zero-copy). See below for more on conversion through Arrow.

From Awkward to Pandas#

The function for Awkward → Pandas conversion is ak.to_pandas.

ak_array = ak.Array([
    {"x": 1.1, "y": 1, "z": "one"},
    {"x": 2.2, "y": 2, "z": "two"},
    {"x": 3.3, "y": 3, "z": "three"},
    {"x": 4.4, "y": 4, "z": "four"},
    {"x": 5.5, "y": 5, "z": "five"},
])
ak_array
<Array [{x: 1.1, y: 1, ... z: 'five'}] type='5 * {"x": float64, "y": int64, "z":...'>
ak.to_pandas(ak_array)
x y z
entry
0 1.1 1 one
1 2.2 2 two
2 3.3 3 three
3 4.4 4 four
4 5.5 5 five

Awkward record field names are converted into Pandas column names, even if nested within lists.

ak_array = ak.Array([
    [{"x": 1.1, "y": 1, "z": "one"}, {"x": 2.2, "y": 2, "z": "two"}, {"x": 3.3, "y": 3, "z": "three"}],
    [],
    [{"x": 4.4, "y": 4, "z": "four"}, {"x": 5.5, "y": 5, "z": "five"}],
])
ak_array
<Array [[{x: 1.1, y: 1, ... z: 'five'}]] type='3 * var * {"x": float64, "y": int...'>
ak.to_pandas(ak_array)
x y z
entry subentry
0 0 1.1 1 one
1 2.2 2 two
2 3.3 3 three
2 0 4.4 4 four
1 5.5 5 five

In this case, we see that the "x", "y", and "z" fields are separate columns, but also that the index is now hierarchical, a MultiIndex. Nested lists become MultiIndex rows and nested records become MultiIndex columns.

Here is an example with three levels of depth:

ak_array = ak.Array([
    [[1.1, 2.2], [], [3.3]],
    [],
    [[4.4], [5.5, 6.6]],
    [[7.7]],
    [[8.8]],
])
ak_array
<Array [[[1.1, 2.2], [], ... 7.7]], [[8.8]]] type='5 * var * var * float64'>
ak.to_pandas(ak_array)
values
entry subentry subsubentry
0 0 0 1.1
1 2.2
2 0 3.3
2 0 0 4.4
1 0 5.5
1 6.6
3 0 0 7.7
4 0 0 8.8

And here is an example with nested records/hierarchical columns:

ak_array = ak.Array([
    {"I": {"a": _, "b": {"i": _}}, "II": {"x": {"y": {"z": _}}}}
    for _ in range(0, 50, 10)]
)
ak_array
<Array [{I: {a: 0, b: {i: 0}, ... z: 40}}}}] type='5 * {"I": {"a": int64, "b": {...'>
ak.to_pandas(ak_array)
I II
a b x
i y
z
entry
0 0 0 0
1 10 10 10
2 20 20 20
3 30 30 30
4 40 40 40

Although nested lists and records can be represented using Pandas’s MultiIndex, different-length lists in the same data structure can only be translated without loss into multiple DataFrames. This is because a DataFrame can have only one MultiIndex, but lists of different lengths require different MultiIndexes.

ak_array = ak.Array([
    {"x": [], "y": [4.4, 3.3, 2.2, 1.1]},
    {"x": [1], "y": [3.3, 2.2, 1.1]},
    {"x": [1, 2], "y": [2.2, 1.1]},
    {"x": [1, 2, 3], "y": [1.1]},
    {"x": [1, 2, 3, 4], "y": []},
])
ak_array
<Array [{x: [], y: [4.4, 3.3, ... 4], y: []}] type='5 * {"x": var * int64, "y": ...'>

To avoid losing any data, ak.to_pandas can be used with how=None (the default is how="inner") to return a list of the minimum number of DataFrames needed to encode the data.

In how=None mode, ak.to_pandas always returns a list (sometimes with only one item).

ak.to_pandas(ak_array, how=None)
[                x
 entry subentry   
 1     0         1
 2     0         1
       1         2
 3     0         1
       1         2
       2         3
 4     0         1
       1         2
       2         3
       3         4,
                   y
 entry subentry     
 0     0         4.4
       1         3.3
       2         2.2
       3         1.1
 1     0         3.3
       1         2.2
       2         1.1
 2     0         2.2
       1         1.1
 3     0         1.1]

The default how="inner" combines the above into a single DataFrame using pd.merge. This operation is lossy.

ak.to_pandas(ak_array, how="inner")
x y
entry subentry
1 0 1 3.3
2 0 1 2.2
1 2 1.1
3 0 1 1.1

The value of how is passed to pd.merge, so outer joins are possible as well.

ak.to_pandas(ak_array, how="outer")
x y
entry subentry
0 0 NaN 4.4
1 NaN 3.3
2 NaN 2.2
3 NaN 1.1
1 0 1.0 3.3
1 NaN 2.2
2 NaN 1.1
2 0 1.0 2.2
1 2.0 1.1
3 0 1.0 1.1
1 2.0 NaN
2 3.0 NaN
4 0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
3 4.0 NaN

Conversion through Apache Arrow#

Since Apache Arrow can be converted to and from Awkward Arrays and Pandas, Arrow can connect Awkward and Pandas in both directions. This is an alternative to ak.to_pandas (described above) with different behavior.

As described in the tutorial on Arrow, the ak.to_arrow function returns a pyarrow.lib.Arrow object. Arrow’s conversion to Pandas requires a pyarrow.lib.Table.

ak_array = ak.Array([
    [{"x": 1.1, "y": 1, "z": "one"}, {"x": 2.2, "y": 2, "z": "two"}, {"x": 3.3, "y": 3, "z": "three"}],
    [],
    [{"x": 4.4, "y": 4, "z": "four"}, {"x": 5.5, "y": 5, "z": "five"}],
])
ak_array
<Array [[{x: 1.1, y: 1, ... z: 'five'}]] type='3 * var * {"x": float64, "y": int...'>
pa_array = ak.to_arrow(ak_array)
pa_array
<pyarrow.lib.LargeListArray object at 0x7f3d3074e4c0>
[
  -- is_valid: all not null
  -- child 0 type: double
    [
      1.1,
      2.2,
      3.3
    ]
  -- child 1 type: int64
    [
      1,
      2,
      3
    ]
  -- child 2 type: string
    [
      "one",
      "two",
      "three"
    ],
  -- is_valid: all not null
  -- child 0 type: double
    []
  -- child 1 type: int64
    []
  -- child 2 type: string
    [],
  -- is_valid: all not null
  -- child 0 type: double
    [
      4.4,
      5.5
    ]
  -- child 1 type: int64
    [
      4,
      5
    ]
  -- child 2 type: string
    [
      "four",
      "five"
    ]
]
pa_table = pa.Table.from_batches([pa.RecordBatch.from_arrays([
    ak.to_arrow(ak_array.x),
    ak.to_arrow(ak_array.y),
    ak.to_arrow(ak_array.z),
], ["x", "y", "z"])])
pa_table
pyarrow.Table
x: large_list<item: double not null>
  child 0, item: double not null
y: large_list<item: int64 not null>
  child 0, item: int64 not null
z: large_list<item: string not null>
  child 0, item: string not null
----
x: [[[1.1,2.2,3.3],[],[4.4,5.5]]]
y: [[[1,2,3],[],[4,5]]]
z: [[["one","two","three"],[],["four","five"]]]
pa_table.to_pandas()
x y z
0 [1.1, 2.2, 3.3] [1, 2, 3] [one, two, three]
1 [] [] []
2 [4.4, 5.5] [4, 5] [four, five]

Note that this is different from the output of ak.to_pandas:

ak.to_pandas(ak_array)
x y z
entry subentry
0 0 1.1 1 one
1 2.2 2 two
2 3.3 3 three
2 0 4.4 4 four
1 5.5 5 five

The Awkward → Arrow → Pandas route leaves the lists as nested data within each cell, whereas ak.to_pandas encodes the nested structure with a MultiIndex/advanced indexing and puts simple values in each cell. Depending on your needs, one or the other may be desirable.

Finally, the Pandas → Arrow → Awkward is currently the only means of turning Pandas DataFrames into Awkward Arrays.

pokemon = urllib.request.urlopen("https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv")
df = pd.read_csv(pokemon)
df
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
795 719 Diancie Rock Fairy 600 50 100 150 100 150 50 6 True
796 719 DiancieMega Diancie Rock Fairy 700 50 160 110 160 110 110 6 True
797 720 HoopaHoopa Confined Psychic Ghost 600 80 110 60 150 130 70 6 True
798 720 HoopaHoopa Unbound Psychic Dark 680 80 160 60 170 130 80 6 True
799 721 Volcanion Fire Water 600 80 110 120 130 90 70 6 True

800 rows × 13 columns

ak_array = ak.from_arrow(pa.Table.from_pandas(df))
ak_array
<Array [{'#': 1, ... Legendary: True}] type='800 * {"#": ?int64, "Name": option[...'>
ak.type(ak_array)
800 * {"#": ?int64, "Name": option[string], "Type 1": option[string], "Type 2": option[string], "Total": ?int64, "HP": ?int64, "Attack": ?int64, "Defense": ?int64, "Sp. Atk": ?int64, "Sp. Def": ?int64, "Speed": ?int64, "Generation": ?int64, "Legendary": ?bool}
ak.to_list(ak_array[0])
{'#': 1,
 'Name': 'Bulbasaur',
 'Type 1': 'Grass',
 'Type 2': 'Poison',
 'Total': 318,
 'HP': 45,
 'Attack': 49,
 'Defense': 49,
 'Sp. Atk': 65,
 'Sp. Def': 65,
 'Speed': 45,
 'Generation': 1,
 'Legendary': False}

This array is ready for data analysis.

ak_array[ak_array.Legendary].Attack - ak_array[ak_array.Legendary].Defense
<Array [-15, 5, 10, 20, ... 50, 50, 100, -10] type='65 * ?int64'>