How to create arrays of strings#

Awkward Arrays can contain strings, although a string is just a specially labeled list of uint8 numbers (its UTF-8 bytes). As a result, variable-length string data are stored compactly, without padding.

NumPy’s strings are padded to have equal width, and Pandas’s strings are Python objects. Awkward Array doesn’t have nearly as many functions for manipulating arrays of strings as NumPy and Pandas, though.

import awkward as ak
import numpy as np

From Python strings#

The ak.Array constructor and ak.from_iter() recognize strings, and strings are returned by ak.to_list().

ak.Array(["one", "two", "three"])
['one',
 'two',
 'three']
----------------
type: 3 * string

They may be nested within anything.

ak.Array([["one", "two"], [], ["three"]])
[['one', 'two'],
 [],
 ['three']]
----------------------
type: 3 * var * string

From NumPy arrays#

NumPy strings are also recognized by ak.from_numpy() and ak.to_numpy().

numpy_array = np.array(["one", "two", "three", "four"])
numpy_array
array(['one', 'two', 'three', 'four'], dtype='<U5')

awkward_array = ak.Array(numpy_array)
awkward_array
['one',
 'two',
 'three',
 'four']
----------------
type: 4 * string

Operations with strings#

Since strings are really just lists, some of the list operations “just work” on strings.

ak.num(awkward_array)
[3,
 3,
 5,
 4]
---------------
type: 4 * int64

awkward_array[:, 1:]
['ne',
 'wo',
 'hree',
 'our']
----------------
type: 4 * string

Other operations, such as string equality, had to be specially overloaded for strings. Without the overload, == would descend to the lowest level and compare numbers (individual characters, in this case).

awkward_array == "three"
[False,
 False,
 True,
 False]
--------------
type: 4 * bool

awkward_array == ak.Array(["ONE", "TWO", "three", "four"])
[False,
 False,
 True,
 True]
--------------
type: 4 * bool

Similarly, ak.sort() and ak.argsort() sort strings lexicographically, not individual characters.

ak.sort(awkward_array)
['four',
 'one',
 'three',
 'two']
----------------
type: 4 * string

Still other operations had to be inhibited, since they wouldn’t make sense for strings.

np.sqrt(awkward_array)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 np.sqrt(awkward_array)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1511, in Array.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
   1509 name = f"{type(ufunc).__module__}.{ufunc.__name__}.{method!s}"
   1510 with ak._errors.OperationErrorContext(name, inputs, kwargs):
-> 1511     return ak._connect.numpy.array_ufunc(ufunc, method, inputs, kwargs)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_connect/numpy.py:466, in array_ufunc(ufunc, method, inputs, kwargs)
    458         raise TypeError(
    459             "no {}.{} overloads for custom types: {}".format(
    460                 type(ufunc).__module__, ufunc.__name__, ", ".join(error_message)
    461             )
    462         )
    464     return None
--> 466 out = ak._broadcasting.broadcast_and_apply(
    467     inputs, action, allow_records=False, function_name=ufunc.__name__
    468 )
    470 if len(out) == 1:
    471     return wrap_layout(out[0], behavior=behavior, attrs=attrs)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1108, in broadcast_and_apply(inputs, action, depth_context, lateral_context, allow_records, left_broadcast, right_broadcast, numpy_to_regular, regular_to_jagged, function_name, broadcast_parameters_rule)
   1106 backend = backend_of(*inputs, coerce_to_common=False)
   1107 isscalar = []
-> 1108 out = apply_step(
   1109     backend,
   1110     broadcast_pack(inputs, isscalar),
   1111     action,
   1112     0,
   1113     depth_context,
   1114     lateral_context,
   1115     {
   1116         "allow_records": allow_records,
   1117         "left_broadcast": left_broadcast,
   1118         "right_broadcast": right_broadcast,
   1119         "numpy_to_regular": numpy_to_regular,
   1120         "regular_to_jagged": regular_to_jagged,
   1121         "function_name": function_name,
   1122         "broadcast_parameters_rule": broadcast_parameters_rule,
   1123     },
   1124 )
   1125 assert isinstance(out, tuple)
   1126 return tuple(broadcast_unpack(x, isscalar) for x in out)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1086, in apply_step(backend, inputs, action, depth, depth_context, lateral_context, options)
   1084     return result
   1085 elif result is None:
-> 1086     return continuation()
   1087 else:
   1088     raise AssertionError(result)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1055, in apply_step.<locals>.continuation()
   1053 # Any non-string list-types?
   1054 elif any(x.is_list and not is_string_like(x) for x in contents):
-> 1055     return broadcast_any_list()
   1057 # Any RecordArrays?
   1058 elif any(x.is_record for x in contents):

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:623, in apply_step.<locals>.broadcast_any_list()
    620         nextinputs.append(x)
    621         nextparameters.append(NO_PARAMETERS)
--> 623 outcontent = apply_step(
    624     backend,
    625     nextinputs,
    626     action,
    627     depth + 1,
    628     copy.copy(depth_context),
    629     lateral_context,
    630     options,
    631 )
    632 assert isinstance(outcontent, tuple)
    633 parameters = parameters_factory(nextparameters, len(outcontent))

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1068, in apply_step(backend, inputs, action, depth, depth_context, lateral_context, options)
   1061     else:
   1062         raise ValueError(
   1063             "cannot broadcast: {}{}".format(
   1064                 ", ".join(repr(type(x)) for x in inputs), in_function(options)
   1065             )
   1066         )
-> 1068 result = action(
   1069     inputs,
   1070     depth=depth,
   1071     depth_context=depth_context,
   1072     lateral_context=lateral_context,
   1073     continuation=continuation,
   1074     backend=backend,
   1075     options=options,
   1076 )
   1078 if isinstance(result, tuple) and all(isinstance(x, Content) for x in result):
   1079     if any(content.backend is not backend for content in result):

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_connect/numpy.py:402, in array_ufunc.<locals>.action(inputs, **ignore)
    397     # Do we have all-strings? If so, we can't proceed
    398     if all(
    399         x.is_list and x.parameter("__array__") in ("string", "bytestring")
    400         for x in contents
    401     ):
--> 402         raise TypeError(
    403             f"{type(ufunc).__module__}.{ufunc.__name__} is not implemented for string types. "
    404             "To register an implementation, add a name to these string(s) and register a behavior overload"
    405         )
    407 if ufunc is numpy.matmul:
    408     raise NotImplementedError(
    409         "matrix multiplication (`@` or `np.matmul`) is not yet implemented for Awkward Arrays"
    410     )

TypeError: numpy.sqrt is not implemented for string types. To register an implementation, add a name to these string(s) and register a behavior overload

This error occurred while calling

    numpy.sqrt.__call__(
        <Array ['one', 'two', 'three', 'four'] type='4 * string'>
    )

Categorical strings#

A large set of strings with few unique values is more efficiently manipulated as integers than as strings. Pandas calls this categorical data, R calls it a factor, and Arrow and Parquet call it dictionary encoding.

The ak.str.to_categorical() function (which requires PyArrow) makes Awkward Arrays categorical in this sense. ak.to_arrow() and ak.to_parquet() recognize categorical data and convert it to the corresponding Arrow and Parquet types.

uncategorized = ak.Array(["three", "one", "two", "two", "three", "one", "one", "one"])
uncategorized
['three',
 'one',
 'two',
 'two',
 'three',
 'one',
 'one',
 'one']
----------------
type: 8 * string

categorized = ak.str.to_categorical(uncategorized)
categorized
['three',
 'one',
 'two',
 'two',
 'three',
 'one',
 'one',
 'one']
----------------------------------
type: 8 * categorical[type=string]

Internally, the data now have an index that selects from a set of unique strings.

categorized.layout.index
<Index dtype='int64' len='8'>[0 1 2 2 0 1 1 1]</Index>

ak.Array(categorized.layout.content)
['three',
 'one',
 'two']
----------------
type: 3 * string

The main advantage of Awkward categorical data (other than proper conversions to Arrow and Parquet) is that equality is computed on the index integers rather than on the strings themselves.

categorized == "one"
[False,
 True,
 False,
 False,
 False,
 True,
 True,
 True]
--------------
type: 8 * bool
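To illustrate why comparing index integers is cheap, here is the same kind of encoding built by hand in plain NumPy. This is a conceptual sketch only, not how Awkward constructs its index: np.unique sorts the categories, so the integer codes differ from the layout shown above.

```python
import numpy as np

values = np.array(["three", "one", "two", "two", "three", "one", "one", "one"])

# Dictionary-encode: unique categories plus one integer code per element.
categories, codes = np.unique(values, return_inverse=True)
print(categories)  # ['one' 'three' 'two']
print(codes)       # [1 0 2 2 1 0 0 0]

# Equality needs one lookup among the categories, then only integer comparisons.
code_for_one = np.flatnonzero(categories == "one")[0]
mask = codes == code_for_one
print(mask.tolist())
```

The string comparison touches three categories instead of eight elements; the saving grows with the ratio of elements to unique values.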

With ArrayBuilder#

ak.ArrayBuilder is described in more detail in this tutorial; to add strings, call its string method or simply append them.

(This is what ak.from_iter() uses internally to accumulate data.)

builder = ak.ArrayBuilder()

builder.string("one")
builder.append("two")
builder.append("three")

array = builder.snapshot()
array
['one',
 'two',
 'three']
----------------
type: 3 * string