---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.15.0
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

How to create arrays by "unflattening" or "grouping"
====================================================

```{code-cell} ipython3
import awkward as ak
import numpy as np
```

## Finding runs in an array
We often have an array of data that we wish to subdivide into groups sharing a common label. Let's imagine that we're looking at NASA's [Earth Meteorite Landings dataset](https://data.nasa.gov/resource/y77d-th95.json), and that we wish to find the largest meteorite in each classification. This is known as a `groupby` operation, followed by a reduction.

+++

First, we should load the data:

```{code-cell} ipython3
with open("../data/y77d-th95.json", "rb") as f:
    landing = ak.from_json(f)
landing.fields
```

In order to find the _largest_ meteorite in each category, we must first group the entries by category. A `groupby` operation arranges the entire array into subgroups, one for each distinct label. To perform a `groupby` in Awkward Array, we must first sort the array by the category:

```{code-cell} ipython3
landing_sorted_class = landing[ak.argsort(landing.recclass)]
landing_sorted_class
```

This sorted array can be subdivided into sublists of the same category. To determine how long each of these sublists must be, Awkward provides the function {func}`ak.run_lengths` which, as the name implies, finds the lengths of consecutive _runs_ of equal values in an array, e.g.

```{code-cell} ipython3
ak.run_lengths([1, 1, 1, 3, 3, 2, 4, 4, 4])
```

The function does not accept an `axis` argument; Awkward Array only supports finding runs in the innermost (`axis=-1`) dimension of the array. Let's find the lengths of each category sublist using {func}`ak.run_lengths`:

```{code-cell} ipython3
lengths = ak.run_lengths(landing_sorted_class.recclass)
lengths
```

## Dividing an array into sublists

+++

Awkward Array provides an {func}`ak.unflatten` operation that adds a new dimension to an array, using either a single integer denoting the (regular) size of the dimension, or a list of integers representing the lengths of the sublists to create, e.g.

```{code-cell} ipython3
ak.unflatten(
    ["Do", "re", "mi", "fa", "so", "la"],
    [1, 2, 2, 1]
)
```

If we pass an integer instead of a list of lengths, we get a regular array:

```{code-cell} ipython3
ak.unflatten(
    ["Do", "re", "mi", "fa", "so", "la"],
    2
)
```

We can unflatten our sorted array using the run lengths of each classification, in order to finalise our groupby operation.

```{code-cell} ipython3
landing_by_class = ak.unflatten(
    landing_sorted_class, 
    lengths
)
landing_by_class
```

We can see the categories of this grouped array by pulling out the first item of each sublist:

```{code-cell} ipython3
landing_by_class.recclass[..., 0]
```

The above three steps:
1. Sort the array
2. Compute the length of runs within the sorted array
3. Unflatten the sorted array by the run lengths

form a `groupby` operation.

+++

### Computing the mass of the largest meteorites

+++

Now that we have grouped our meteorite landings by classification, we can find the most massive meteorite in each group. If we look at the type of the array, we can see that the `mass` field is actually a string:

```{code-cell} ipython3
landing_by_class.type.show()
```

Let's convert it to a floating point number:

```{code-cell} ipython3
landing_by_class['mass'] = ak.strings_astype(landing_by_class.mass, np.float64)
```

Now we can find the index of the largest mass in each sublist. We'll use `keepdims=True` in order to be able to use this array to index `landing_by_class` and pull out the corresponding record.

```{code-cell} ipython3
i_largest_mass = ak.argmax(landing_by_class.mass, axis=-1, keepdims=True)
```

Finding the largest meteorite is then a simple case of using `i_largest_mass` as an index, and flattening the result to drop the unneeded dimension:

```{code-cell} ipython3
largest_meteorite = ak.flatten(
    landing_by_class[i_largest_mass], 
    axis=1,
)
largest_meteorite
```

Here are their names!

```{code-cell} ipython3
largest_meteorite.name
```