--- jupytext: text_representation: extension: .md format_name: myst format_version: 0.13 jupytext_version: 1.10.3 kernelspec: display_name: Python 3 language: python name: python3 --- How to create arrays of strings =============================== Awkward Arrays can contain strings, although these strings are just a special view of lists of `uint8` numbers. As such, the variable-length data are efficiently stored. NumPy's strings are padded to have equal width, and Pandas's strings are Python objects. Awkward Array doesn't have nearly as many functions for manipulating arrays of strings as NumPy and Pandas, though. ```{code-cell} ipython3 import awkward as ak import numpy as np ``` From Python strings ------------------- The {class}`ak.Array` constructor and {func}`ak.from_iter` recognize strings, and strings are returned by {func}`ak.to_list`. ```{code-cell} ipython3 ak.Array(["one", "two", "three"]) ``` They may be nested within anything. ```{code-cell} ipython3 ak.Array([["one", "two"], [], ["three"]]) ``` From NumPy arrays ----------------- NumPy strings are also recognized by {func}`ak.from_numpy` and {func}`ak.to_numpy`. ```{code-cell} ipython3 numpy_array = np.array(["one", "two", "three", "four"]) numpy_array ``` ```{code-cell} ipython3 awkward_array = ak.Array(numpy_array) awkward_array ``` Operations with strings ----------------------- Since strings are really just lists, some of the list operations "just work" on strings. ```{code-cell} ipython3 ak.num(awkward_array) ``` ```{code-cell} ipython3 awkward_array[:, 1:] ``` Others had to be specially overloaded for the string case, such as string-equality. The default meaning for `==` would be to descend to the lowest level and compare numbers (characters, in this case). ```{code-cell} ipython3 awkward_array == "three" ``` ```{code-cell} ipython3 awkward_array == ak.Array(["ONE", "TWO", "three", "four"]) ``` Similarly, {func}`ak.sort` and {func}`ak.argsort` sort strings lexicographically, not individual characters. ```{code-cell} ipython3 ak.sort(awkward_array) ``` Still other operations had to be inhibited, since they wouldn't make sense for strings. ```{code-cell} ipython3 :tags: [raises-exception] np.sqrt(awkward_array) ``` Categorical strings ------------------- A large set of strings with few unique values are more efficiently manipulated as integers than as strings. In Pandas, this is [categorical data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html), in R, it's called a [factor](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/factor), and in Arrow and Parquet, it's [dictionary encoding](https://arrow.apache.org/blog/2019/09/05/faster-strings-cpp-parquet/). The {func}`ak.str.to_categorical` (requires PyArrow) function makes Awkward Arrays categorical in this sense. {func}`ak.to_arrow` and {func}`ak.to_parquet` recognize categorical data and convert it to the corresponding Arrow and Parquet types. ```{code-cell} ipython3 uncategorized = ak.Array(["three", "one", "two", "two", "three", "one", "one", "one"]) uncategorized ``` ```{code-cell} ipython3 categorized = ak.str.to_categorical(uncategorized) categorized ``` Internally, the data now have an index that selects from a set of unique strings. ```{code-cell} ipython3 categorized.layout.index ``` ```{code-cell} ipython3 ak.Array(categorized.layout.content) ``` The main advantage to Awkward categorical data (other than proper conversions to Arrow and Parquet) is that equality is performed using the index integers. ```{code-cell} ipython3 categorized == "one" ``` With ArrayBuilder ----------------- {func}`ak.ArrayBuilder` is described in more detail [in this tutorial](how-to-create-arraybuilder), but you can add strings by calling the `string` method or simply appending them. (This is what {func}`ak.from_iter` uses internally to accumulate data.) ```{code-cell} ipython3 builder = ak.ArrayBuilder() builder.string("one") builder.append("two") builder.append("three") array = builder.snapshot() array ```