How to create arrays of strings

Awkward Arrays can contain strings, which are implemented as a special view of lists of uint8 numbers. Because of this, variable-length string data are stored compactly, without padding.
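
To make the "lists of uint8 numbers" idea concrete, here is a minimal pure-Python sketch of one way to store variable-length strings: a flat byte buffer plus a list of offsets. The helper names encode_strings and decode_strings are hypothetical, and the sketch assumes UTF-8 encoding; it illustrates the concept and is not the library's implementation.

```python
# Sketch: variable-length strings as one flat byte buffer plus offsets.
# `encode_strings` and `decode_strings` are hypothetical helper names.

def encode_strings(strings):
    """Pack strings into (offsets, content) using UTF-8 bytes."""
    offsets = [0]
    content = bytearray()
    for s in strings:
        content.extend(s.encode("utf-8"))
        offsets.append(len(content))  # end of this string, start of the next
    return offsets, bytes(content)

def decode_strings(offsets, content):
    """Rebuild the Python strings from the flat buffers."""
    return [
        content[start:stop].decode("utf-8")
        for start, stop in zip(offsets[:-1], offsets[1:])
    ]

offsets, content = encode_strings(["one", "two", "three"])
print(offsets)        # [0, 3, 6, 11]
print(list(content))  # the uint8 values of b"onetwothree"
print(decode_strings(offsets, content))  # ['one', 'two', 'three']
```

Each string costs only its own bytes plus one offset, regardless of how long its neighbors are.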

By contrast, NumPy’s strings are padded to a common width, and Pandas’s strings are Python objects. However, Awkward Array has far fewer functions for manipulating arrays of strings than NumPy and Pandas do.
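
The cost of padding can be estimated with simple arithmetic. The sketch below compares a fixed-width layout with a variable-length one. It assumes 4 bytes per character for the padded layout (as in NumPy's unicode dtype) and 64-bit offsets with UTF-8 content for the variable-length layout. The function layout_sizes is a hypothetical helper, and the numbers are estimates of buffer sizes only, not measurements of any library.

```python
# Sketch: estimated buffer sizes for padded vs. variable-length storage.
# `layout_sizes` is a hypothetical helper; the byte widths are assumptions.

def layout_sizes(strings, bytes_per_char=4, bytes_per_offset=8):
    """Return (padded_bytes, variable_bytes) for a list of strings."""
    # Fixed-width: every slot is as wide as the longest string.
    width = max(len(s) for s in strings)
    padded = len(strings) * width * bytes_per_char
    # Variable-length: UTF-8 content plus one offset per string boundary.
    content = sum(len(s.encode("utf-8")) for s in strings)
    offsets = (len(strings) + 1) * bytes_per_offset
    return padded, content + offsets

print(layout_sizes(["one", "two", "three", "four"]))  # (80, 55)

# One long string forces every padded slot to be long.
skewed = ["a"] * 99 + ["x" * 1000]
print(layout_sizes(skewed))  # (400000, 1907)
```

The gap is small when all strings have similar lengths and grows quickly when a few strings are much longer than the rest.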

import awkward as ak
import numpy as np

From Python strings

The ak.Array constructor and ak.from_iter recognize strings, and ak.to_list returns them as Python strings.

ak.Array(["one", "two", "three"])
<Array ['one', 'two', 'three'] type='3 * string'>

Strings may be nested within any other data structure, such as variable-length lists.

ak.Array([["one", "two"], [], ["three"]])
<Array [['one', 'two'], [], ['three']] type='3 * var * string'>
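
Nesting adds one more level of offsets on top of the string offsets. The pure-Python sketch below builds both levels for the nested example above. The variable names are illustrative, UTF-8 encoding is an assumption, and the layout is a conceptual picture rather than a dump of the library's internal buffers.

```python
# Sketch: two levels of offsets for a list of lists of strings.
nested = [["one", "two"], [], ["three"]]

outer_offsets = [0]    # boundaries between sublists, counted in strings
inner_offsets = [0]    # boundaries between strings, counted in bytes
content = bytearray()  # all characters, back to back

for sublist in nested:
    for s in sublist:
        content.extend(s.encode("utf-8"))
        inner_offsets.append(len(content))
    outer_offsets.append(len(inner_offsets) - 1)

print(outer_offsets)  # [0, 2, 2, 3]
print(inner_offsets)  # [0, 3, 6, 11]

# Rebuild the nested Python lists to check that nothing was lost.
strings = [
    bytes(content[a:b]).decode("utf-8")
    for a, b in zip(inner_offsets[:-1], inner_offsets[1:])
]
rebuilt = [strings[a:b] for a, b in zip(outer_offsets[:-1], outer_offsets[1:])]
print(rebuilt)  # [['one', 'two'], [], ['three']]
```

The empty sublist shows up as a repeated value in outer_offsets and consumes no string data at all.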

From NumPy arrays

NumPy string arrays are also recognized by ak.from_numpy and ak.to_numpy, as well as by the ak.Array constructor.

numpy_array = np.array(["one", "two", "three", "four"])
numpy_array
array(['one', 'two', 'three', 'four'], dtype='<U5')
awkward_array = ak.Array(numpy_array)
awkward_array
<Array ['one', 'two', 'three', 'four'] type='4 * string'>

Operations with strings

Since strings are really just lists, some of the list operations “just work” on strings.

ak.num(awkward_array)
<Array [3, 3, 5, 4] type='4 * int64'>
awkward_array[:, 1:]
<Array ['ne', 'wo', 'hree', 'our'] type='4 * string'>
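
Because these operations act on the underlying lists, they count and slice list elements rather than characters. The pure-Python sketch below mimics the two results above on UTF-8 bytes, and shows why the distinction matters for non-ASCII text. It assumes the list elements are UTF-8 bytes; treat the non-ASCII remark as a caveat to verify against your installed version rather than a guarantee.

```python
# Sketch: counting and slicing strings as lists of UTF-8 bytes.
words = ["one", "two", "three", "four"]
encoded = [w.encode("utf-8") for w in words]

# Analogue of counting list elements: one length per string.
lengths = [len(b) for b in encoded]
print(lengths)  # [3, 3, 5, 4]

# Analogue of slicing [:, 1:]: drop the first element of every list.
tails = [b[1:].decode("utf-8") for b in encoded]
print(tails)  # ['ne', 'wo', 'hree', 'our']

# For non-ASCII text, bytes and characters are not the same count.
accented = "é"
print(len(accented))                  # 1 character
print(len(accented.encode("utf-8")))  # 2 bytes
```

For pure ASCII data the two views agree, which is why the examples above behave as if they operated on characters.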

Other operations, such as string equality, had to be specially overloaded for the string case. Without the overload, == would descend to the lowest level and compare numbers (individual characters, in this case).

awkward_array == "three"
<Array [False, False, True, False] type='4 * bool'>
awkward_array == ak.Array(["ONE", "TWO", "three", "four"])
<Array [False, False, True, True] type='4 * bool'>
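
The difference between the two meanings of equality can be sketched in pure Python. Both functions below are hypothetical illustrations: string_equal returns one boolean per string, as in the outputs above, while elementwise_equal shows what a comparison at the lowest level would look like, with one boolean per byte.

```python
# Sketch: whole-string equality vs. comparison at the character level.
# Both function names are hypothetical.

def string_equal(left, right):
    """One boolean per string."""
    return [a == b for a, b in zip(left, right)]

def elementwise_equal(left, right):
    """One boolean per byte; only defined when the lengths match."""
    out = []
    for a, b in zip(left, right):
        a_bytes, b_bytes = a.encode("utf-8"), b.encode("utf-8")
        if len(a_bytes) != len(b_bytes):
            raise ValueError("lists of different lengths cannot be compared")
        out.append([x == y for x, y in zip(a_bytes, b_bytes)])
    return out

left = ["one", "two", "three", "four"]
right = ["ONE", "TWO", "three", "four"]

print(string_equal(left, right))       # [False, False, True, True]
print(elementwise_equal(left, right))  # nested booleans, one per byte
```

The element-wise version produces a nested result and fails outright when the strings differ in length, which is rarely what a string comparison should do.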

Similarly, ak.sort and ak.argsort sort whole strings lexicographically, rather than sorting the individual characters within each string.

ak.sort(awkward_array)
<Array ['four', 'one', 'three', 'two'] type='4 * string'>
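
For comparison, here is what the two interpretations of sorting look like in plain Python. This is only an illustration of the distinction, using built-in functions, and the variable names are arbitrary.

```python
# Sketch: sorting whole strings vs. sorting characters within each string.
words = ["one", "two", "three", "four"]

# Lexicographic sort of the strings themselves.
by_string = sorted(words)
print(by_string)  # ['four', 'one', 'three', 'two']

# The corresponding argsort: positions that would sort the strings.
order = sorted(range(len(words)), key=words.__getitem__)
print(order)  # [3, 0, 2, 1]

# What sorting at the character level would do instead.
by_character = ["".join(sorted(w)) for w in words]
print(by_character)  # ['eno', 'otw', 'eehrt', 'foru']
```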

Still other operations had to be inhibited, since they wouldn’t make sense for strings.

np.sqrt(awkward_array)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [11], in <cell line: 1>()
----> 1 np.sqrt(awkward_array)

File ~/python3.8/lib/python3.8/site-packages/awkward/highlevel.py:1411, in Array.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
   1354 """
   1355 Intercepts attempts to pass this Array to a NumPy
   1356 [universal functions](https://docs.scipy.org/doc/numpy/reference/ufuncs.html)
   (...)
   1408 See also #__array_function__.
   1409 """
   1410 if not hasattr(self, "_tracers"):
-> 1411     return ak._connect._numpy.array_ufunc(ufunc, method, inputs, kwargs)
   1412 else:
   1413     return ak._connect._jax.jax_utils.array_ufunc(
   1414         self, ufunc, method, inputs, kwargs
   1415     )

File ~/python3.8/lib/python3.8/site-packages/awkward/_connect/_numpy.py:250, in array_ufunc(ufunc, method, inputs, kwargs)
    240         raise ValueError(
    241             "no overloads for custom types: {}({})".format(
    242                 ufunc.__name__,
   (...)
    245             + ak._util.exception_suffix(__file__)
    246         )
    248     return None
--> 250 out = ak._util.broadcast_and_apply(
    251     inputs, getfunction, behavior, allow_records=False, pass_depth=False
    252 )
    253 assert isinstance(out, tuple) and len(out) == 1
    254 return ak._util.wrap(out[0], behavior)

File ~/python3.8/lib/python3.8/site-packages/awkward/_util.py:1166, in broadcast_and_apply(inputs, getfunction, behavior, allow_records, pass_depth, pass_user, user, left_broadcast, right_broadcast, numpy_to_regular, regular_to_jagged)
   1164 else:
   1165     isscalar = []
-> 1166     out = apply(broadcast_pack(inputs, isscalar), 0, user)
   1167     assert isinstance(out, tuple)
   1168     return tuple(broadcast_unpack(x, isscalar) for x in out)

File ~/python3.8/lib/python3.8/site-packages/awkward/_util.py:919, in broadcast_and_apply.<locals>.apply(inputs, depth, user)
    916     else:
    917         nextinputs.append(x)
--> 919 outcontent = apply(nextinputs, depth + 1, user)
    920 assert isinstance(outcontent, tuple)
    922 length = None

File ~/python3.8/lib/python3.8/site-packages/awkward/_util.py:733, in broadcast_and_apply.<locals>.apply(inputs, depth, user)
    730 if pass_user:
    731     args = args + (user,)
--> 733 custom = getfunction(inputs, *args)
    734 if callable(custom):
    735     return custom()

File ~/python3.8/lib/python3.8/site-packages/awkward/_connect/_numpy.py:240, in array_ufunc.<locals>.getfunction(inputs)
    238         else:
    239             custom_types.append(type(x).__name__)
--> 240     raise ValueError(
    241         "no overloads for custom types: {}({})".format(
    242             ufunc.__name__,
    243             ", ".join(custom_types),
    244         )
    245         + ak._util.exception_suffix(__file__)
    246     )
    248 return None

ValueError: no overloads for custom types: sqrt(string)

(https://github.com/scikit-hep/awkward-1.0/blob/1.9.0rc4/src/awkward/_connect/_numpy.py#L245)

Categorical strings

A large set of strings with few unique values is more efficiently manipulated as integers than as strings. In Pandas, this is categorical data; in R, it’s called a factor; and in Arrow and Parquet, it’s dictionary encoding.

The ak.to_categorical function makes Awkward Arrays categorical in this sense. ak.to_arrow and ak.to_parquet recognize categorical data and convert it to the corresponding Arrow and Parquet types.

uncategorized = ak.Array(["three", "one", "two", "two", "three", "one", "one", "one"])
uncategorized
<Array ['three', 'one', ... 'one', 'one'] type='8 * string'>
categorized = ak.to_categorical(uncategorized)
categorized
<Array ['three', 'one', ... 'one', 'one'] type='8 * categorical[type=string]'>

Internally, the data now have an index that selects from a set of unique strings.

categorized.layout.index
<Index64 i="[0 1 2 2 0 1 1 1]" offset="0" length="8" at="0x0000033390d0"/>
ak.Array(categorized.layout.content)
<Array ['three', 'one', 'two'] type='3 * string'>
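
The index-plus-unique-values structure can be reproduced in a few lines of plain Python. The function dictionary_encode below is a hypothetical helper that assigns codes in order of first appearance, which happens to match the index shown above. It is a sketch of the general technique, not the library's algorithm.

```python
# Sketch: dictionary encoding with codes assigned in order of first appearance.
# `dictionary_encode` is a hypothetical helper.

def dictionary_encode(strings):
    """Return (index, categories) such that categories[index[i]] == strings[i]."""
    categories = []  # unique strings, in order of first appearance
    positions = {}   # string -> integer code
    index = []       # one integer code per input string
    for s in strings:
        if s not in positions:
            positions[s] = len(categories)
            categories.append(s)
        index.append(positions[s])
    return index, categories

data = ["three", "one", "two", "two", "three", "one", "one", "one"]
index, categories = dictionary_encode(data)
print(index)       # [0, 1, 2, 2, 0, 1, 1, 1]
print(categories)  # ['three', 'one', 'two']

# Decoding is a lookup per element.
print([categories[i] for i in index] == data)  # True
```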

The main advantage of Awkward categorical data (other than proper conversions to Arrow and Parquet) is that equality comparisons are performed on the index integers rather than on the string data.

categorized == "one"
<Array [False, True, False, ... True, True] type='8 * bool'>
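
Comparing through the index means the target string is looked up once among the unique values, and every element is then compared as an integer. A minimal sketch of that idea, with the hypothetical helper categorical_equal and the index and categories from the example above written out by hand:

```python
# Sketch: equality on dictionary-encoded data using integer codes.
# `categorical_equal` is a hypothetical helper.

index = [0, 1, 2, 2, 0, 1, 1, 1]
categories = ["three", "one", "two"]

def categorical_equal(index, categories, value):
    """Compare every element to `value` using only integer comparisons."""
    try:
        code = categories.index(value)  # one string search, in the small set
    except ValueError:
        return [False] * len(index)     # value is not a category at all
    return [i == code for i in index]

print(categorical_equal(index, categories, "one"))
# [False, True, False, False, False, True, True, True]
print(categorical_equal(index, categories, "four"))
# [False, False, False, False, False, False, False, False]
```

String comparisons happen only against the unique values, so the work per element does not depend on how long the strings are.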

With ArrayBuilder

ak.ArrayBuilder is described in more detail in its own tutorial. To add strings, call the string method or simply append them.

(This is what ak.from_iter uses internally to accumulate data.)

builder = ak.ArrayBuilder()

builder.string("one")
builder.append("two")
builder.append("three")

array = builder.snapshot()
array
<Array ['one', 'two', 'three'] type='3 * string'>
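
The builder pattern itself is simple to picture: data accumulate in growable buffers, and a snapshot produces a view of what has been added so far. The class TinyStringBuilder below is a hypothetical, strings-only toy written for illustration. It borrows the method names used above but is not how ak.ArrayBuilder is implemented, and its snapshot returns a plain Python list.

```python
# Sketch: a toy append-only builder for strings.
# `TinyStringBuilder` is hypothetical and handles only str values.

class TinyStringBuilder:
    def __init__(self):
        self._offsets = [0]
        self._content = bytearray()

    def string(self, value):
        """Add one string by appending its UTF-8 bytes and a new offset."""
        self._content.extend(value.encode("utf-8"))
        self._offsets.append(len(self._content))

    def append(self, value):
        """Generic entry point; this sketch dispatches only on str."""
        if not isinstance(value, str):
            raise TypeError("this sketch only handles str")
        self.string(value)

    def snapshot(self):
        """Return the strings accumulated so far as a new list."""
        content = bytes(self._content)
        pairs = zip(self._offsets[:-1], self._offsets[1:])
        return [content[a:b].decode("utf-8") for a, b in pairs]

toy = TinyStringBuilder()
toy.string("one")
toy.append("two")
toy.append("three")

first = toy.snapshot()
print(first)  # ['one', 'two', 'three']

# The builder can keep growing; the earlier snapshot is unaffected.
toy.append("four")
print(first)           # ['one', 'two', 'three']
print(toy.snapshot())  # ['one', 'two', 'three', 'four']
```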