{ "cells": [ { "cell_type": "markdown", "id": "ae14d3ce-0dc5-4a62-bed2-a0f75e72887c", "metadata": {}, "source": [ "# Uproot Awkward Columnar HATS" ] }, { "cell_type": "markdown", "id": "d2fa2b35-5e1c-4be2-aa58-385f3b370683", "metadata": {}, "source": [ "_Originally presented as [part](https://github.com/jpivarski-talks/2021-06-14-uproot-awkward-columnar-hats/blob/main/3-awkward-array.ipynb) of [CMS HATS training on June 14, 2021](https://indico.cern.ch/event/1042866/)._" ] }, { "cell_type": "markdown", "id": "0f98f8c2-91ce-4a15-b06a-c20f1d40256b", "metadata": {}, "source": [ "




" ] }, { "cell_type": "markdown", "id": "8da7642a-d311-488c-be06-8fd51114b71c", "metadata": {}, "source": [ "## What about an array of lists?" ] }, { "cell_type": "code", "execution_count": null, "id": "c8e59475-25c6-41b1-a37c-3553517b3a98", "metadata": {}, "outputs": [], "source": [ "import skhep_testdata\n", "import awkward as ak\n", "import numpy as np\n", "import uproot" ] }, { "cell_type": "code", "execution_count": null, "id": "b3a79fec-71a0-40fd-83c6-0c3369cf7597", "metadata": {}, "outputs": [], "source": [ "events = uproot.open(skhep_testdata.data_path(\"uproot-HZZ.root\"))[\"events\"]\n", "events.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "39208788-6a41-4afe-be49-9b42321a899f", "metadata": {}, "outputs": [], "source": [ "events[\"Muon_Px\"].array()" ] }, { "cell_type": "code", "execution_count": null, "id": "98d28d9b-96bd-4316-b26c-42f12db10614", "metadata": {}, "outputs": [], "source": [ "events[\"Muon_Px\"].array(entry_stop=20).tolist()" ] }, { "cell_type": "markdown", "id": "e163f018-cd77-47eb-be1a-e90e8252a796", "metadata": {}, "source": [ "This is what Awkward Array was made for. NumPy's equivalent is cumbersome and inefficient." ] }, { "cell_type": "code", "execution_count": null, "id": "6a406416-f5b6-49aa-afb9-720446a8b990", "metadata": {}, "outputs": [], "source": [ "jagged_numpy = events[\"Muon_Px\"].array(entry_stop=20, library=\"np\")\n", "jagged_numpy" ] }, { "cell_type": "markdown", "id": "0f6366b4-a61f-4b59-9574-ce2a203d6d39", "metadata": {}, "source": [ "What if I want the first item in each list as an array?" 
] }, { "cell_type": "code", "execution_count": null, "id": "d93944f5-cdc2-4280-82a3-9bc4865e2f25", "metadata": {}, "outputs": [], "source": [ "np.array([x[0] for x in jagged_numpy])" ] }, { "cell_type": "markdown", "id": "f611c7ac-aa33-4446-aea9-4dc1224e488a", "metadata": {}, "source": [ "That list comprehension works, but it violates the rule from [1-python-performance.ipynb](https://github.com/jpivarski-talks/2021-06-14-uproot-awkward-columnar-hats/blob/main/1-python-performance.ipynb): don't iterate in Python. With an Awkward Array, the same selection is a slice." ] }, { "cell_type": "code", "execution_count": null, "id": "e8bb7a35-4a43-4956-ab78-613742726ae5", "metadata": {}, "outputs": [], "source": [ "jagged_awkward = events[\"Muon_Px\"].array(entry_stop=20, library=\"ak\")\n", "jagged_awkward" ] }, { "cell_type": "code", "execution_count": null, "id": "73d7617c-5951-448b-8202-77ee6ae4354b", "metadata": {}, "outputs": [], "source": [ "jagged_awkward[:, 0]" ] }, { "cell_type": "markdown", "id": "0c1cdce9-4b40-4878-a3a9-a42cf858a910", "metadata": {}, "source": [ "




" ] }, { "cell_type": "markdown", "id": "237987c8-97ff-4002-adf1-b735ff0bc640", "metadata": {}, "source": [ "## Awkward Array is a general-purpose library: NumPy-like idioms on JSON-like data" ] }, { "cell_type": "markdown", "id": "9eaca985-580b-4564-a9be-a05cf434fb89", "metadata": {}, "source": [ "![](pivarski-one-slide-summary.svg)" ] }, { "cell_type": "markdown", "id": "93577b1f-2008-4ae1-a4d9-d78da0859d44", "metadata": {}, "source": [ "




" ] }, { "cell_type": "markdown", "id": "3632e9fe-91c7-4319-9041-0abda61b0a62", "metadata": {}, "source": [ "## Main idea: slicing through structure is computationally inexpensive" ] }, { "cell_type": "markdown", "id": "bebb13ec-3c82-4c85-a4fa-8668fbe383f4", "metadata": {}, "source": [ "Slicing by field name doesn't modify any large buffers and [ak.zip](https://awkward-array.readthedocs.io/en/latest/_auto/ak.zip.html) only scans them to ensure they're compatible (not even that if `depth_limit=1`)." ] }, { "cell_type": "code", "execution_count": null, "id": "c272052a-1a9e-4fe3-951b-db38a6cceb40", "metadata": {}, "outputs": [], "source": [ "array = events.arrays()\n", "array" ] }, { "cell_type": "markdown", "id": "d93d9d83-a5f6-49d2-a1d6-9e985b94465c", "metadata": {}, "source": [ "Think of this as zero-cost:" ] }, { "cell_type": "code", "execution_count": null, "id": "8c6f8d39-75f3-4d6e-867f-c60bd16d83ba", "metadata": {}, "outputs": [], "source": [ "array.Muon_Px, array.Muon_Py, array.Muon_Pz" ] }, { "cell_type": "markdown", "id": "e2ed505d-6eca-4807-b43b-880ed4c4fd0c", "metadata": {}, "source": [ "Think of this as zero-cost:" ] }, { "cell_type": "code", "execution_count": null, "id": "684275da-c070-4523-ab77-1f4e1727cf0e", "metadata": {}, "outputs": [], "source": [ "ak.zip({\"px\": array.Muon_Px, \"py\": array.Muon_Py, \"pz\": array.Muon_Pz})" ] }, { "cell_type": "markdown", "id": "f534ea92-4d94-4265-9166-c3789548cfb1", "metadata": {}, "source": [ "(The above is a manual version of `how=\"zip\"`.)" ] }, { "cell_type": "markdown", "id": "74f6e268-26ff-45ff-af49-24e1fc4be70c", "metadata": {}, "source": [ "


\n", "\n", "NumPy ufuncs work on these arrays (if they're \"[broadcastable](https://awkward-array.readthedocs.io/en/latest/_auto/ak.broadcast_arrays.html)\")." ] }, { "cell_type": "code", "execution_count": null, "id": "c107261e-e687-4844-aad3-65ce162531c3", "metadata": {}, "outputs": [], "source": [ "np.sqrt(array.Muon_Px**2 + array.Muon_Py**2)" ] }, { "cell_type": "markdown", "id": "9f96c45a-dac4-4bf8-bc0d-e8e539129ee4", "metadata": {}, "source": [ "


\n", "\n", "And there are specialized operations that only make sense in a variable-length context.\n", "\n", "{func}`ak.cartesian`\n", "\n", "![](cartoon-cartesian.png)\n", "\n", "{func}`ak.combinations`\n", "\n", "![](cartoon-combinations.png)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "020c5f92-13d5-48b6-a29d-8ae19827becf", "metadata": {}, "outputs": [], "source": [ "ak.cartesian((array.Muon_Px, array.Jet_Px))" ] }, { "cell_type": "code", "execution_count": null, "id": "f7976a9d-4e14-4f71-821b-b07659701bec", "metadata": {}, "outputs": [], "source": [ "ak.combinations(array.Muon_Px, 2)" ] }, { "cell_type": "markdown", "id": "b836817e-1fea-405a-ae1e-92ba4f6c09cb", "metadata": {}, "source": [ "

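As a reference for what these two operations produce, the sketch below spells out the per-event logic with `itertools` on plain Python lists. It deliberately iterates in Python, so it is only a description of the semantics (with default arguments, as I understand them), not something to run on real data. The item labels are made up.

```python
import itertools

# Toy events: two collections with a different number of items per event.
muons = [['m1', 'm2'], [], ['m3']]
jets = [['j1'], ['j2'], ['j3', 'j4']]

# Like ak.cartesian: per event, pair every item of one collection
# with every item of the other.
cartesian_like = [list(itertools.product(ms, js)) for ms, js in zip(muons, jets)]

# Like ak.combinations(..., 2): per event, pick unordered pairs from a
# single collection, without pairing an item with itself.
combinations_like = [list(itertools.combinations(ms, 2)) for ms in muons]

print(cartesian_like)
print(combinations_like)
```

An event with no muons contributes an empty list to both results, and an event with a single muon has no pairs.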



" ] }, { "cell_type": "markdown", "id": "2b021955-b508-4fe1-9e91-98cd5ab93241", "metadata": {}, "source": [ "## Arrays can have custom [behavior](https://awkward-array.readthedocs.io/en/latest/ak.behavior.html)" ] }, { "cell_type": "markdown", "id": "f7f35dea-5745-4953-8053-36d744a5c196", "metadata": {}, "source": [ "The following come from the new [Vector](https://github.com/scikit-hep/vector#readme) library." ] }, { "cell_type": "code", "execution_count": null, "id": "a09f2884-dc3b-4f15-a4e0-3afbcd77a984", "metadata": {}, "outputs": [], "source": [ "import vector\n", "vector.register_awkward()" ] }, { "cell_type": "code", "execution_count": null, "id": "0f497377-398d-4c11-ad08-8483c61f2239", "metadata": {}, "outputs": [], "source": [ "muons = ak.zip({\"px\": array.Muon_Px, \"py\": array.Muon_Py, \"pz\": array.Muon_Pz, \"E\": array.Muon_E}, with_name=\"Momentum4D\")\n", "muons" ] }, { "cell_type": "markdown", "id": "3099e3d5-2dc6-41ec-8cb9-372923904c45", "metadata": {}, "source": [ "This is an array of lists of vectors, and methods like `pt`, `eta`, `phi` apply through the whole array." ] }, { "cell_type": "code", "execution_count": null, "id": "877cac01-693b-435d-9e6f-f67325cbe9d0", "metadata": {}, "outputs": [], "source": [ "muons.pt" ] }, { "cell_type": "code", "execution_count": null, "id": "67d0cb46-a576-4de4-8968-f7734a049fad", "metadata": {}, "outputs": [], "source": [ "muons.eta" ] }, { "cell_type": "code", "execution_count": null, "id": "830140bb-fe03-4c78-8178-f15a8748dd60", "metadata": {}, "outputs": [], "source": [ "muons.phi" ] }, { "cell_type": "markdown", "id": "7e56579b-d3e2-4fa3-9774-da3f15fbe0a5", "metadata": {}, "source": [ "

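For intuition, these three quantities have standard definitions in terms of the Cartesian momentum components. The sketch below writes them out in plain NumPy on flat arrays. It is not Vector's actual implementation, and the function name `pt_eta_phi` is hypothetical.

```python
import numpy as np

def pt_eta_phi(px, py, pz):
    # Transverse momentum: magnitude of the momentum in the x-y plane.
    pt = np.hypot(px, py)
    # Azimuthal angle in the x-y plane, between -pi and pi.
    phi = np.arctan2(py, px)
    # Pseudorapidity; this simple form divides by zero when pt is exactly zero.
    eta = np.arcsinh(pz / pt)
    return pt, eta, phi

px = np.array([3.0, 0.0])
py = np.array([4.0, 2.0])
pz = np.array([0.0, 2.0])
pt, eta, phi = pt_eta_phi(px, py, pz)

print(pt)   # transverse momenta: 5 and 2
print(eta)  # 0 for the particle with pz = 0
print(phi)
```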

" ] }, { "cell_type": "markdown", "id": "a4e6b57c-09d2-4b2d-b112-8a89f04c9e75", "metadata": {}, "source": [ "Let's try an example: ΔR(muons, jets)" ] }, { "cell_type": "code", "execution_count": null, "id": "0646a94b-7e04-48e5-b1b9-cf347e0b16d7", "metadata": {}, "outputs": [], "source": [ "jets = ak.zip({\"px\": array.Jet_Px, \"py\": array.Jet_Py, \"pz\": array.Jet_Pz, \"E\": array.Jet_E}, with_name=\"Momentum4D\")\n", "jets" ] }, { "cell_type": "code", "execution_count": null, "id": "38732ecc-4850-4e11-956e-7413a0845cbb", "metadata": {}, "outputs": [], "source": [ "ak.num(muons), ak.num(jets)" ] }, { "cell_type": "code", "execution_count": null, "id": "1933b22d-42fe-45dd-a0cd-cfbf949053bc", "metadata": {}, "outputs": [], "source": [ "ms, js = ak.unzip(ak.cartesian((muons, jets)))\n", "ms, js" ] }, { "cell_type": "code", "execution_count": null, "id": "ac76799e-b86d-4872-a09f-aeb9d3ed6fb7", "metadata": {}, "outputs": [], "source": [ "ak.num(ms), ak.num(js)" ] }, { "cell_type": "code", "execution_count": null, "id": "0206ebe6-b580-4872-9b16-58d606e92b09", "metadata": {}, "outputs": [], "source": [ "ms.deltaR(js)" ] }, { "cell_type": "markdown", "id": "12b2c1e7-5cfd-44b8-8870-d37878422a28", "metadata": {}, "source": [ "

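The quantity computed by `deltaR` is the distance in the (eta, phi) plane, with the azimuthal difference wrapped so that angles on either side of the boundary at pi count as close. A minimal NumPy sketch of that definition follows. The function name `delta_r` is hypothetical, and this is not Vector's code.

```python
import numpy as np

def delta_r(eta1, phi1, eta2, phi2):
    # Wrap the azimuthal difference into [-pi, pi) so that directions on
    # either side of the +/-pi boundary are treated as neighbours.
    dphi = (phi1 - phi2 + np.pi) % (2 * np.pi) - np.pi
    deta = eta1 - eta2
    return np.sqrt(deta**2 + dphi**2)

# Same eta, just either side of phi = pi: close together, not 6.0 apart.
near_boundary = delta_r(0.0, 3.0, 0.0, -3.0)

# A 3-4-5 style check, scaled down so that no wrapping is involved.
simple = delta_r(0.3, 0.4, 0.0, 0.0)

print(near_boundary)  # about 0.283
print(simple)         # about 0.5
```

The function also works element by element on NumPy arrays, which is the same pattern that `ms.deltaR(js)` applies to every muon-jet pair.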

\n", "\n", "And another: muon pairs (all combinations, not just the first two per event)." ] }, { "cell_type": "code", "execution_count": null, "id": "b366c0b7-a4e3-4ebc-b4e2-a6150019ca16", "metadata": {}, "outputs": [], "source": [ "ak.num(muons)" ] }, { "cell_type": "code", "execution_count": null, "id": "207bf9a9-84c0-428a-815e-6de6fb8694a3", "metadata": {}, "outputs": [], "source": [ "m1, m2 = ak.unzip(ak.combinations(muons, 2))\n", "m1, m2" ] }, { "cell_type": "code", "execution_count": null, "id": "a3b698ef-989d-4185-8de0-62a70087072c", "metadata": {}, "outputs": [], "source": [ "ak.num(m1), ak.num(m2)" ] }, { "cell_type": "code", "execution_count": null, "id": "6d2444a8-a7ef-4731-b3cd-923c0ed0c7ea", "metadata": {}, "outputs": [], "source": [ "m1 + m2" ] }, { "cell_type": "code", "execution_count": null, "id": "9bc067cb-a97e-4333-92b4-48d705fe5107", "metadata": {}, "outputs": [], "source": [ "(m1 + m2).mass" ] }, { "cell_type": "code", "execution_count": null, "id": "2722fa73-649d-43f2-8312-2703776a9433", "metadata": {}, "outputs": [], "source": [ "import hist\n", "\n", "hist.Hist.new.Reg(120, 0, 120, name=\"mass\").Double().fill(\n", " ak.flatten((m1 + m2).mass)\n", ").plot()\n", "\n", "None" ] }, { "cell_type": "markdown", "id": "0a2bbb34-0e56-42e3-9251-8e53b7df1f16", "metadata": {}, "source": [ "

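The expression `(m1 + m2).mass` adds the two four-vectors component by component and then takes the invariant mass of the sum, using m^2 = E^2 - |p|^2 in units where c = 1. The sketch below spells that out in NumPy. The function name `pair_mass` and the clipping of negative m^2 to zero are illustrative choices, and Vector may treat that case differently.

```python
import numpy as np

def pair_mass(p1, p2):
    # Each argument is a tuple (px, py, pz, E) of floats or NumPy arrays.
    px, py, pz, E = (a + b for a, b in zip(p1, p2))
    # m^2 = E^2 - |p|^2; clip at zero to guard against rounding below zero.
    m2 = E**2 - (px**2 + py**2 + pz**2)
    return np.sqrt(np.maximum(m2, 0.0))

# Two massless particles, back to back along x, each with energy 45.
mu1 = (45.0, 0.0, 0.0, 45.0)
mu2 = (-45.0, 0.0, 0.0, 45.0)

back_to_back = pair_mass(mu1, mu2)
collinear = pair_mass(mu1, mu1)

print(back_to_back)  # 90.0
print(collinear)     # 0.0
```

Back-to-back particles give the largest mass for a given energy, while collinear massless particles give zero.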

" ] }, { "cell_type": "markdown", "id": "4e3f780c-5fcd-4281-b4dd-be0a1c5f1ace", "metadata": {}, "source": [ "### It doesn't matter which coordinates were used to construct it" ] }, { "cell_type": "code", "execution_count": null, "id": "118fd5b1-894f-486d-9e23-548ba1c84c63", "metadata": {}, "outputs": [], "source": [ "array2 = uproot.open(\n", " \"https://github.com/jpivarski-talks/2023-12-18-hsf-india-tutorial-bhubaneswar/raw/main/data/SMHiggsToZZTo4L.root:Events\"\n", ").arrays([\"Muon_pt\", \"Muon_eta\", \"Muon_phi\", \"Muon_charge\"], entry_stop=100000)" ] }, { "cell_type": "code", "execution_count": null, "id": "d266244f-461d-4590-9214-1d4380a8866d", "metadata": {}, "outputs": [], "source": [ "import particle\n", "\n", "muons2 = ak.zip({\"pt\": array2.Muon_pt, \"eta\": array2.Muon_eta, \"phi\": array2.Muon_phi, \"q\": array2.Muon_charge}, with_name=\"Momentum4D\")\n", "muons2[\"mass\"] = particle.Particle.findall(\"mu-\")[0].mass / 1000.0\n", "muons2" ] }, { "cell_type": "markdown", "id": "d0391ff6-7281-46ff-801e-0b8928347fc3", "metadata": {}, "source": [ "As long as you use properties (dots, not strings in brackets), you don't need to care what coordinates it's based on." 
] }, { "cell_type": "code", "execution_count": null, "id": "7eb3b75b-f5a3-4658-a9cc-c0b29d1b0e4b", "metadata": {}, "outputs": [], "source": [ "muons2.px" ] }, { "cell_type": "code", "execution_count": null, "id": "6566a49c-4ce4-481a-9931-b8c2e95e80a6", "metadata": {}, "outputs": [], "source": [ "muons2.py" ] }, { "cell_type": "code", "execution_count": null, "id": "8d13dc67-35b1-4fad-8e98-f0d9733f577d", "metadata": {}, "outputs": [], "source": [ "muons2.pz" ] }, { "cell_type": "code", "execution_count": null, "id": "7eef9466-6d63-4e9e-b1c9-c0ea896a6118", "metadata": {}, "outputs": [], "source": [ "muons2.E" ] }, { "cell_type": "code", "execution_count": null, "id": "0da4eb75-6b02-4cca-b770-0944f40a5da8", "metadata": {}, "outputs": [], "source": [ "m1, m2 = ak.unzip(ak.combinations(muons2, 2))\n", "hist.Hist.new.Log(200, 0.1, 120, name=\"mass\").Double().fill(\n", " ak.flatten((m1 + m2).mass)\n", ").plot()\n", "\n", "None" ] }, { "cell_type": "markdown", "id": "9758607a-8216-47ee-a41a-2e47694fd6cb", "metadata": {}, "source": [ "

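Behind properties such as `px` and `E`, a vector stored as (pt, eta, phi, mass) has to be converted with the usual coordinate relations. The NumPy sketch below writes those relations out for flat arrays. The function name `to_cartesian` is hypothetical, and this is not Vector's own code.

```python
import numpy as np

def to_cartesian(pt, eta, phi, mass):
    # Transverse components from the azimuthal angle.
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    # Longitudinal component from the pseudorapidity.
    pz = pt * np.sinh(eta)
    # Energy from the mass-shell relation E^2 = |p|^2 + m^2 (c = 1).
    E = np.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return px, py, pz, E

# One particle moving along x with pt = 5 and mass = 12, so E = 13.
px, py, pz, E = to_cartesian(np.array([5.0]), np.array([0.0]), np.array([0.0]), np.array([12.0]))

print(px, py, pz, E)
```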

" ] }, { "cell_type": "markdown", "id": "8b93a41f-3099-4889-ba46-55b87bd64e71", "metadata": {}, "source": [ "## Awkward Arrays and Vector in Numba" ] }, { "cell_type": "markdown", "id": "e84b6e59-dbee-4cf6-9349-8b989935e3ca", "metadata": {}, "source": [ "Remember Numba, the JIT-compiler from [1-python-performance.ipynb](https://github.com/jpivarski-talks/2021-06-14-uproot-awkward-columnar-hats/blob/main/1-python-performance.ipynb)? Awkward Array and Vector have been implemented in Numba's compiler." ] }, { "cell_type": "code", "execution_count": null, "id": "27f91719-7143-47d1-bc93-ece0e14d1515", "metadata": {}, "outputs": [], "source": [ "import numba as nb\n", "\n", "@nb.njit\n", "def first_big_dimuon(events):\n", " for event in events:\n", " for i in range(len(event)):\n", " mu1 = event[i]\n", " for j in range(i + 1, len(event)):\n", " mu2 = event[j]\n", " dimuon = mu1 + mu2\n", " if dimuon.mass > 10:\n", " return dimuon" ] }, { "cell_type": "code", "execution_count": null, "id": "aa36234f-f1f8-432d-80e4-697072a8be85", "metadata": {}, "outputs": [], "source": [ "first_big_dimuon(muons2)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 5 }