How to convert to/from ROOT RDataFrame
The ROOT RDataFrame is a declarative, parallel framework for data analysis and manipulation. RDataFrame reads columnar data via a data source. Transformations can be applied to the data to select rows, to define new columns, and to produce results such as histograms.
import awkward as ak
import ROOT
From Awkward to RDataFrame
The function for Awkward → RDataFrame conversion is ak.to_rdataframe(). Its argument must be a dictionary of the form { <column name string> : <awkward array> }. This function always returns a cppyy.gbl.ROOT.RDF.RInterface object.
array_x = ak.Array(
    [
        {"x": [1.1, 1.2, 1.3]},
        {"x": [2.1, 2.2]},
        {"x": [3.1]},
        {"x": [4.1, 4.2, 4.3, 4.4]},
        {"x": [5.1]},
    ]
)
array_y = ak.Array([1, 2, 3, 4, 5])
array_z = ak.Array([[1.1], [2.1, 2.3, 2.4], [3.1], [4.1, 4.2, 4.3], [5.1]])
The arrays given for each column must all have the same length:
assert len(array_x) == len(array_y) == len(array_z)
The dictionary key defines a column name in RDataFrame.
df = ak.to_rdataframe({"x": array_x, "y": array_y, "z": array_z})
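The equal-length requirement can also be checked for a whole dictionary at once, with an error message that names the offending columns. The helper below is a minimal sketch: `check_equal_lengths` is a hypothetical name, not part of Awkward Array, and it relies only on `len()`, so plain lists stand in for Awkward Arrays here.

```python
def check_equal_lengths(columns):
    """Return the common length of all columns, or raise ValueError.

    `columns` maps column names to any sized objects (lists, ak.Array, ...).
    Hypothetical helper; not part of Awkward Array.
    """
    if not columns:
        raise ValueError("at least one column is required")
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns differ in length: {lengths}")
    return next(iter(lengths.values()))


# Plain lists stand in for Awkward Arrays; both support len().
columns = {
    "y": [1, 2, 3, 4, 5],
    "z": [[1.1], [2.1, 2.3, 2.4], [3.1], [4.1, 4.2, 4.3], [5.1]],
}
n_entries = check_equal_lengths(columns)
```

A dictionary that passes this check could then be handed to ak.to_rdataframe() unchanged.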
The ak.to_rdataframe() function presents a generated-on-demand Awkward Array view as an RDataFrame source. Generating the Awkward RDataSource C++ code adds a small overhead. This operation does not execute the RDataFrame event loop, and the array data are not copied.
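The idea that building a dataframe does not execute the event loop can be pictured with a toy model of lazy evaluation. The sketch below is plain Python and only an analogy: `LazyFrame`, `filter`, `collect`, and `loops_run` are hypothetical names, not ROOT or Awkward Array API.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated dataframe (hypothetical, not ROOT)."""

    def __init__(self, rows, predicates=()):
        self._rows = rows  # the same list object is shared, not copied
        self._predicates = tuple(predicates)
        self.loops_run = 0

    def filter(self, predicate):
        # Only records the predicate; no row is visited yet.
        return LazyFrame(self._rows, self._predicates + (predicate,))

    def collect(self):
        # The single pass over the rows happens here.
        self.loops_run += 1
        return [r for r in self._rows if all(p(r) for p in self._predicates)]


frame = LazyFrame([1, 2, 3, 4, 5]).filter(lambda v: v % 2 == 0)
loops_before = frame.loops_run  # still 0: nothing has been processed
result = frame.collect()        # the loop runs only when a result is requested
```

In this toy model, work is deferred until a result is requested, which is the same pattern the paragraph above describes for the generated data source.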
The column readers are generated based on the run-time type of the views. Here is a description of the RDataFrame columns:
df.Describe().Print()
Dataframe from datasource Custom Datasource

Property                Value
--------                -----
Columns in total            3
Columns from defines        0
Event loops run             0
Processing slots            1

Column  Type                            Origin
------  ----                            ------
x       awkward::Record_GhCqhYAvGs      Dataset
y       int64_t                         Dataset
z       ROOT::VecOps::RVec<double>      Dataset
The x column contains an Awkward Array with a generated type name, such as awkward::Record_cKnX5DyNVM. Awkward Arrays are dynamically typed, so in a C++ context the type name is hashed, and the hashed suffix may differ between sessions, as it does from the name printed above. In practice, there is no need to know the type: C++ code should use the placeholder type specifier auto, so that the type of the declared variable is deduced automatically from its initializer.
From RDataFrame to Awkward
The function for RDataFrame → Awkward conversion is ak.from_rdataframe(). Its columns argument accepts a tuple of strings that are the RDataFrame column names. By default, this function returns an ak.Array.
array = ak.from_rdataframe(
    df,
    columns=(
        "x",
        "y",
        "z",
    ),
)
array
[{y: 1, z: [1.1], x: {x: [1.1, ..., 1.3]}},
 {y: 2, z: [2.1, 2.3, 2.4], x: {x: [2.1, ...]}},
 {y: 3, z: [3.1], x: {x: [3.1]}},
 {y: 4, z: [4.1, 4.2, 4.3], x: {x: [4.1, ...]}},
 {y: 5, z: [5.1], x: {x: [5.1]}}]
-----------------------------------------------------------------------------------------
backend: cpu
nbytes: 328 B
type: 5 * {
    y: int64,
    z: var * float64,
    x: {
        x: var * float64
    }
}
When RDataFrame runs multi-threaded event loops, the entry processing order is not guaranteed:
ROOT.ROOT.EnableImplicitMT()
Let’s recreate the dataframe so that it reflects the new multi-threading mode:
df = ak.to_rdataframe({"x": array_x, "y": array_y, "z": array_z})
If the keep_order parameter is set to True, the entries keep their original order after filtering:
df = df.Filter("y % 2 == 0")
array = ak.from_rdataframe(
    df,
    columns=(
        "x",
        "y",
        "z",
    ),
    keep_order=True,
)
array
[{y: 2, z: [2.1, 2.3, 2.4], x: {x: [2.1, ...]}},
 {y: 4, z: [4.1, 4.2, 4.3], x: {x: [4.1, ...]}}]
-----------------------------------------------------------------------------------------
backend: cpu
nbytes: 288 B
type: 2 * {
    y: int64,
    z: var * float64,
    x: {
        x: var * float64
    }
}
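To see why entry order needs extra care once several threads share the work, consider the plain-Python sketch below. It is an illustration only and makes no claim about how ak.from_rdataframe() implements keep_order: it assumes each worker tags the entries it keeps with their original index, so that sorting by index restores the order. The names `process_chunk`, `chunks`, `collected`, and `ordered` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

y = [1, 2, 3, 4, 5]  # same values as the "y" column above


def process_chunk(start, stop):
    # Keep even values, tagged with their original entry index.
    return [(i, y[i]) for i in range(start, stop) if y[i] % 2 == 0]


chunks = [(0, 2), (2, 4), (4, 5)]
collected = []
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(process_chunk, a, b) for a, b in chunks]
    # Chunks finish in no particular order, so `collected` may be shuffled.
    for future in as_completed(futures):
        collected.extend(future.result())

# Sorting by the entry index restores the original order.
ordered = [value for _, value in sorted(collected)]
```

Without the final sort, the surviving entries would appear in whatever order the chunks happened to finish.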