How to create arrays of strings

Awkward Arrays can contain strings, although these strings are just a special view of lists of uint8 numbers. As a result, variable-length strings are stored compactly, with no padding.

By contrast, NumPy’s strings are padded to equal width, and Pandas’s strings are Python objects. Awkward Array has far fewer functions for manipulating arrays of strings than NumPy and Pandas do, though.
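To make the storage idea concrete, here is a minimal pure-Python sketch of a "flat bytes plus offsets" layout. It is only a conceptual illustration, not Awkward Array's actual implementation; the names `content`, `offsets`, and `get_string` are hypothetical.

```python
from itertools import accumulate

words = ["one", "two", "three"]

# Encode every string as UTF-8 and concatenate the bytes into one flat buffer.
encoded = [w.encode("utf-8") for w in words]
content = b"".join(encoded)

# offsets[i]:offsets[i + 1] delimits the bytes that belong to string i.
offsets = [0, *accumulate(len(b) for b in encoded)]


def get_string(i):
    """Decode string i from the flat buffer (hypothetical helper)."""
    return content[offsets[i]:offsets[i + 1]].decode("utf-8")


decoded = [get_string(i) for i in range(len(words))]

# A fixed-width layout pads every string to the longest one,
# so it needs more slots whenever the lengths differ.
padded_size = len(words) * max(len(b) for b in encoded)

print(offsets, len(content), padded_size)
```

In this sketch the variable-length layout holds 11 bytes of character data, while padding each entry to the longest string would need 15 slots.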

import awkward as ak
import numpy as np

From Python strings

The ak.Array constructor and ak.from_iter() recognize strings, and strings are returned by ak.to_list().

ak.Array(["one", "two", "three"])
['one',
 'two',
 'three']
----------------
type: 3 * string

Strings may be nested within any other structure, such as variable-length lists.

ak.Array([["one", "two"], [], ["three"]])
[['one', 'two'],
 [],
 ['three']]
----------------------
type: 3 * var * string

From NumPy arrays

NumPy string arrays are also recognized by ak.from_numpy() and ak.to_numpy(), as well as by the ak.Array constructor.

numpy_array = np.array(["one", "two", "three", "four"])
numpy_array
array(['one', 'two', 'three', 'four'], dtype='<U5')
awkward_array = ak.Array(numpy_array)
awkward_array
['one',
 'two',
 'three',
 'four']
----------------
type: 4 * string

Operations with strings

Since strings are really just lists, some of the list operations “just work” on strings.

ak.num(awkward_array)
[3,
 3,
 5,
 4]
---------------
type: 4 * int64
awkward_array[:, 1:]
['ne',
 'wo',
 'hree',
 'our']
----------------
type: 4 * string

Other operations, such as string equality, had to be specially overloaded for strings. Without the overload, == would descend to the lowest level and compare numbers (characters, in this case).

awkward_array == "three"
[False,
 False,
 True,
 False]
--------------
type: 4 * bool
awkward_array == ak.Array(["ONE", "TWO", "three", "four"])
[False,
 False,
 True,
 True]
--------------
type: 4 * bool

Similarly, ak.sort() and ak.argsort() order whole strings lexicographically, rather than sorting the characters within each string.

ak.sort(awkward_array)
['four',
 'one',
 'three',
 'two']
----------------
type: 4 * string

Still other operations had to be inhibited, since they wouldn’t make sense for strings.

np.sqrt(awkward_array)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 np.sqrt(awkward_array)

TypeError: numpy.sqrt is not implemented for string types. To register an implementation, add a name to these string(s) and register a behavior overload

This error occurred while calling

    numpy.sqrt.__call__(
        <Array ['one', 'two', 'three', 'four'] type='4 * string'>
    )

Categorical strings

A large set of strings with few unique values is more efficiently manipulated as integers than as strings. In Pandas, this is called categorical data; in R, a factor; and in Arrow and Parquet, dictionary encoding.

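The ak.to_categorical() function makes Awkward Arrays categorical in this sense. As the deprecation warning in the output below notes, this general-purpose function has been replaced by ak.str.to_categorical(). ak.to_arrow() and ak.to_parquet() recognize categorical data and convert it to the corresponding Arrow and Parquet types.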

uncategorized = ak.Array(["three", "one", "two", "two", "three", "one", "one", "one"])
uncategorized
['three',
 'one',
 'two',
 'two',
 'three',
 'one',
 'one',
 'one']
----------------
type: 8 * string
categorized = ak.to_categorical(uncategorized)
categorized
/home/runner/micromamba/envs/awkward-docs/lib/python3.10/site-packages/awkward/operations/ak_to_categorical.py:92: DeprecationWarning: In version 2.5.0, this will be an error.
To raise these warnings as errors (and get stack traces to find out where they're called), run
    import warnings
    warnings.filterwarnings("error", module="awkward.*")
after the first `import awkward` or use `@pytest.mark.filterwarnings("error:::awkward.*")` in pytest.
Issue: The general purpose `ak.to_categorical` has been replaced by `ak.str.to_categorical`.
  return _impl(array, highlevel, behavior)
['three',
 'one',
 'two',
 'two',
 'three',
 'one',
 'one',
 'one']
----------------------------------
type: 8 * categorical[type=string]

Internally, the data now have an index that selects from a set of unique strings.

categorized.layout.index
<Index dtype='int64' len='8'>[0 1 2 2 0 1 1 1]</Index>
ak.Array(categorized.layout.content)
['three',
 'one',
 'two']
----------------
type: 3 * string

The main advantage of Awkward categorical data (other than proper conversions to Arrow and Parquet) is that equality comparisons are performed on the index integers rather than on the strings themselves.

categorized == "one"
[False,
 True,
 False,
 False,
 False,
 True,
 True,
 True]
--------------
type: 8 * bool
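The index-plus-unique-values idea can be sketched in plain Python. This is a conceptual illustration of dictionary encoding, not Awkward Array's implementation; `categories`, `lookup`, and `index` are hypothetical names.

```python
values = ["three", "one", "two", "two", "three", "one", "one", "one"]

categories = []  # unique strings, in order of first appearance
lookup = {}      # string -> integer code
index = []       # one integer code per original value

for value in values:
    if value not in lookup:
        lookup[value] = len(categories)
        categories.append(value)
    index.append(lookup[value])

# Comparing against a literal needs a single string lookup,
# followed by cheap integer comparisons over the index.
target = lookup.get("one")
is_one = [code == target for code in index]

print(categories)
print(index)
print(is_one)
```

With this input, the sketch reproduces the index and the boolean result shown above.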

With ArrayBuilder

ak.ArrayBuilder is described in more detail in its own tutorial. To add strings, call the builder's string method or simply append them.

(This is what ak.from_iter() uses internally to accumulate data.)

builder = ak.ArrayBuilder()

builder.string("one")
builder.append("two")
builder.append("three")

array = builder.snapshot()
array
['one',
 'two',
 'three']
----------------
type: 3 * string