Splitting and joining strings

Strings in Awkward Array can be joined together and split into sublists. Let’s start by creating an array of strings that we can later manipulate. The following timestamp array contains a list of timestamp-like strings:

import awkward as ak
timestamp = ak.from_iter(
    [
        "12-17 19:31:36.263",
        "12-17 19:31:36.263",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.267",
        "12-17 19:31:36.270",
        "12-17 19:31:36.271",
        "12-17 19:31:36.275",
        "12-17 19:31:36.275",
        "12-17 19:31:36.275",
        "12-17 19:31:36.276",
        "12-17 19:31:36.278",
        "12-17 19:31:36.279",
        "12-17 19:31:36.279",
        "12-17 19:31:36.279",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.281",
        "12-17 19:31:36.282",
        "12-17 19:31:36.283",
        "12-17 19:31:36.284",
        "12-17 19:31:36.285",
        "12-17 19:31:36.285",
        "12-17 19:31:36.289",
        "12-17 19:31:36.295",
        "12-17 19:31:36.297",
        "12-17 19:31:36.297",
        "12-17 19:31:36.298",
        "12-17 19:31:36.299",
        "12-17 19:31:36.300",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.302",
        "12-17 19:31:36.304",
        "12-17 19:31:36.311",
        "12-17 19:31:36.311",
        "12-17 19:31:36.311",
        "12-17 19:31:36.311",
        "12-17 19:31:36.313",
    ]
)

Joining strings together

Parsing datetimes in a performant manner is tricky. Pandas has such an ability, but it uses NumPy’s fixed-width strings. Arrow provides strptime, but it does not handle fractional seconds or timedeltas, and it requires a full date. To use Arrow’s pyarrow.compute.strptime() function, we therefore need to manipulate the strings first: prepend the missing year, then parse only the non-fractional part of each timestamp.

Let’s assume that these timestamps were recorded in the year 2022. We can prepend the string “2022”, joined with the “-” delimiter, to complete each timestamp string:

timestamp_with_year = ak.str.join_element_wise(["2022"], timestamp, ["-"])
timestamp_with_year
['2022-12-17 19:31:36.263',
 '2022-12-17 19:31:36.263',
 '2022-12-17 19:31:36.264',
 '2022-12-17 19:31:36.264',
 '2022-12-17 19:31:36.264',
 '2022-12-17 19:31:36.264',
 '2022-12-17 19:31:36.264',
 '2022-12-17 19:31:36.264',
 '2022-12-17 19:31:36.264',
 '2022-12-17 19:31:36.264',
 ...,
 '2022-12-17 19:31:36.301',
 '2022-12-17 19:31:36.301',
 '2022-12-17 19:31:36.302',
 '2022-12-17 19:31:36.304',
 '2022-12-17 19:31:36.311',
 '2022-12-17 19:31:36.311',
 '2022-12-17 19:31:36.311',
 '2022-12-17 19:31:36.311',
 '2022-12-17 19:31:36.313']
---------------------------
backend: cpu
nbytes: 2.0 kB
type: 64 * string

The last argument, ["-"], is the separator. Both it and the ["2022"] array are broadcast against the timestamp array before joining element-wise.

ak.str.join_element_wise() is useful for building new strings from separate arrays. Sometimes, though, one has a single array of lists of strings and wishes to join each list along the final axis (like a reducer). The separate function ak.str.join() exists for this purpose:

ak.str.join(
    [
        ["do", "re", "me"],
        ["fa", "so"],
        ["la"],
        ["ti", "da"],
    ],
    separator="-🎵-",
)
['do-🎵-re-🎵-me',
 'fa-🎵-so',
 'la',
 'ti-🎵-da']
----------------
backend: cpu
nbytes: 60 B
type: 4 * string
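For intuition, the result matches what Python's built-in str.join gives when applied to each sublist. This is only a plain-Python reference for small inputs, not a description of how Awkward Array implements the operation.

```python
# Plain-Python reference: join every sublist with the same separator
notes = [["do", "re", "me"], ["fa", "so"], ["la"], ["ti", "da"]]
separator = "-🎵-"

joined = [separator.join(sublist) for sublist in notes]
print(joined)
```

A single-element sublist such as ["la"] comes back unchanged, because there is nothing to place the separator between.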

Splitting strings apart

The timestamps above still cannot be parsed by Arrow; the fractional time component is not (at time of writing) yet supported. To fix this, we can split the fractional component from the timestamp, and add it as a timedelta64[ms] later on.

Let’s split each timestamp into two parts, the whole-second part and the fractional component, using ak.str.split_pattern(). Passing max_splits=1 limits the split to the first period.

timestamp_split = ak.str.split_pattern(timestamp_with_year, ".", max_splits=1)
timestamp_split
[['2022-12-17 19:31:36', '263'],
 ['2022-12-17 19:31:36', '263'],
 ['2022-12-17 19:31:36', '264'],
 ['2022-12-17 19:31:36', '264'],
 ['2022-12-17 19:31:36', '264'],
 ['2022-12-17 19:31:36', '264'],
 ['2022-12-17 19:31:36', '264'],
 ['2022-12-17 19:31:36', '264'],
 ['2022-12-17 19:31:36', '264'],
 ['2022-12-17 19:31:36', '264'],
 ...,
 ['2022-12-17 19:31:36', '301'],
 ['2022-12-17 19:31:36', '301'],
 ['2022-12-17 19:31:36', '302'],
 ['2022-12-17 19:31:36', '304'],
 ['2022-12-17 19:31:36', '311'],
 ['2022-12-17 19:31:36', '311'],
 ['2022-12-17 19:31:36', '311'],
 ['2022-12-17 19:31:36', '311'],
 ['2022-12-17 19:31:36', '313']]
--------------------------------
backend: cpu
nbytes: 2.7 kB
type: 64 * var * string

timestamp_non_fractional = timestamp_split[:, 0]
timestamp_fractional = timestamp_split[:, 1]

Now we can parse these timestamps using Arrow!

import pyarrow.compute

datetime = ak.from_arrow(
    pyarrow.compute.strptime(
        ak.to_arrow(timestamp_non_fractional, extensionarray=False),
        "%Y-%m-%d %H:%M:%S",
        "ms",
    )
)
datetime
[2022-12-17T19:31:36.000,
 2022-12-17T19:31:36.000,
 2022-12-17T19:31:36.000,
 2022-12-17T19:31:36.000,
 2022-12-17T19:31:36.000,
 2022-12-17T19:31:36.000,
 2022-12-17T19:31:36.000,
 2022-12-17T19:31:36.000,
 2022-12-17T19:31:36.000,
 2022-12-17T19:31:36.000,
 ...,
 2022-12-17T19:31:36.000,
 2022-12-17T19:31:36.000,
 2022-12-17T19:31:36.000,
 2022-12-17T19:31:36.000,
 2022-12-17T19:31:36.000,
 2022-12-17T19:31:36.000,
 2022-12-17T19:31:36.000,
 2022-12-17T19:31:36.000,
 2022-12-17T19:31:36.000]
--------------------------
backend: cpu
nbytes: 520 B
type: 64 * ?datetime64[ms]

Finally, we build an offset for the fractional component (in milliseconds) using ak.strings_astype():

import numpy as np

datetime_offset = ak.strings_astype(timestamp_fractional, np.dtype("timedelta64[ms]"))
datetime_offset
[263 milliseconds,
 263 milliseconds,
 264 milliseconds,
 264 milliseconds,
 264 milliseconds,
 264 milliseconds,
 264 milliseconds,
 264 milliseconds,
 264 milliseconds,
 264 milliseconds,
 ...,
 301 milliseconds,
 301 milliseconds,
 302 milliseconds,
 304 milliseconds,
 311 milliseconds,
 311 milliseconds,
 311 milliseconds,
 311 milliseconds,
 313 milliseconds]
--------------------------
backend: cpu
nbytes: 512 B
type: 64 * timedelta64[ms]
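This conversion reads the digits as a whole number of milliseconds, which is only right when every fractional part has exactly three digits, as it does in this dataset. If the precision varied, the digits would need normalising first. A minimal plain-Python sketch follows, in which fraction_to_ms is a hypothetical helper and not part of Awkward Array.

```python
import numpy as np


def fraction_to_ms(fraction: str) -> int:
    """Interpret the digits after the decimal point as milliseconds."""
    # Right-pad to three digits, then drop anything finer than a millisecond
    return int(fraction.ljust(3, "0")[:3])


# Hypothetical fractional parts of differing precision
fractions = ["263", "5", "26", "123456"]

millis = np.array([fraction_to_ms(f) for f in fractions], dtype=np.int64)
offsets = millis.astype("timedelta64[ms]")
print(offsets)
```

Here "5" is treated as 500 ms and "26" as 260 ms, whereas a direct integer conversion would give 5 ms and 26 ms.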

Adding this offset to the whole-second datetime yields the full timestamp:

timestamp = datetime + datetime_offset
timestamp
[2022-12-17T19:31:36.263,
 2022-12-17T19:31:36.263,
 2022-12-17T19:31:36.264,
 2022-12-17T19:31:36.264,
 2022-12-17T19:31:36.264,
 2022-12-17T19:31:36.264,
 2022-12-17T19:31:36.264,
 2022-12-17T19:31:36.264,
 2022-12-17T19:31:36.264,
 2022-12-17T19:31:36.264,
 ...,
 2022-12-17T19:31:36.301,
 2022-12-17T19:31:36.301,
 2022-12-17T19:31:36.302,
 2022-12-17T19:31:36.304,
 2022-12-17T19:31:36.311,
 2022-12-17T19:31:36.311,
 2022-12-17T19:31:36.311,
 2022-12-17T19:31:36.311,
 2022-12-17T19:31:36.313]
--------------------------
backend: cpu
nbytes: 1.0 kB
type: 64 * ?datetime64[ms]
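A single value can be spot-checked with NumPy scalars and the standard library. This is not vectorised and serves only as a reference; the name stdlib_datetime is an alias chosen here to avoid clashing with the datetime variable used in this guide.

```python
from datetime import datetime as stdlib_datetime

import numpy as np

# Whole-second part plus a millisecond offset, mirroring the array arithmetic
whole = np.datetime64("2022-12-17T19:31:36", "ms")
offset = np.timedelta64(263, "ms")
combined = whole + offset

# The standard library can parse fractional seconds directly with %f
reference = stdlib_datetime.strptime(
    "2022-12-17 19:31:36.263", "%Y-%m-%d %H:%M:%S.%f"
)

print(combined, reference.isoformat(timespec="milliseconds"))
```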

If we had a different parsing library that could only handle dates and times separately, then we could also split on the whitespace. Although ak.str.split_pattern() can split on a whitespace character, it is more performant (and versatile) to use ak.str.split_whitespace():

ak.str.split_whitespace(timestamp_with_year)
[['2022-12-17', '19:31:36.263'],
 ['2022-12-17', '19:31:36.263'],
 ['2022-12-17', '19:31:36.264'],
 ['2022-12-17', '19:31:36.264'],
 ['2022-12-17', '19:31:36.264'],
 ['2022-12-17', '19:31:36.264'],
 ['2022-12-17', '19:31:36.264'],
 ['2022-12-17', '19:31:36.264'],
 ['2022-12-17', '19:31:36.264'],
 ['2022-12-17', '19:31:36.264'],
 ...,
 ['2022-12-17', '19:31:36.301'],
 ['2022-12-17', '19:31:36.301'],
 ['2022-12-17', '19:31:36.302'],
 ['2022-12-17', '19:31:36.304'],
 ['2022-12-17', '19:31:36.311'],
 ['2022-12-17', '19:31:36.311'],
 ['2022-12-17', '19:31:36.311'],
 ['2022-12-17', '19:31:36.311'],
 ['2022-12-17', '19:31:36.313']]
--------------------------------
backend: cpu
nbytes: 2.7 kB
type: 64 * var * string

If we also needed to split off the fractional component (and manually build the time delta), then we could use ak.str.split_pattern_regex() to split on both whitespace and the period in a single pass:

ak.str.split_pattern_regex(timestamp_with_year, r"\.|\s")
[['2022-12-17', '19:31:36', '263'],
 ['2022-12-17', '19:31:36', '263'],
 ['2022-12-17', '19:31:36', '264'],
 ['2022-12-17', '19:31:36', '264'],
 ['2022-12-17', '19:31:36', '264'],
 ['2022-12-17', '19:31:36', '264'],
 ['2022-12-17', '19:31:36', '264'],
 ['2022-12-17', '19:31:36', '264'],
 ['2022-12-17', '19:31:36', '264'],
 ['2022-12-17', '19:31:36', '264'],
 ...,
 ['2022-12-17', '19:31:36', '301'],
 ['2022-12-17', '19:31:36', '301'],
 ['2022-12-17', '19:31:36', '302'],
 ['2022-12-17', '19:31:36', '304'],
 ['2022-12-17', '19:31:36', '311'],
 ['2022-12-17', '19:31:36', '311'],
 ['2022-12-17', '19:31:36', '311'],
 ['2022-12-17', '19:31:36', '311'],
 ['2022-12-17', '19:31:36', '313']]
-----------------------------------
backend: cpu
nbytes: 3.1 kB
type: 64 * var * string
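Before running a pattern over a whole array, it can help to try it on one string with Python's re module. This is a rough check only: re is a different regex engine from the one Arrow uses, although a simple alternation like this one should behave the same in both.

```python
import re

# Same pattern as above: split on a literal period or any whitespace character
pattern = re.compile(r"\.|\s")

parts = pattern.split("2022-12-17 19:31:36.263")
date_part, time_part, fraction_part = parts
print(parts)
```

Unpacking into three names doubles as a sanity check, since it raises an error if the string does not split into exactly three parts.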