ak.from_json
Defined in awkward.operations.ak_from_json on line 28.
- ak.from_json(source, *, line_delimited=False, schema=None, nan_string=None, posinf_string=None, neginf_string=None, complex_record_fields=None, buffersize=65536, initial=1024, resize=8, highlevel=True, behavior=None, attrs=None)
- Parameters
source (bytes/str, pathlib.Path, or file-like object) – Data source of the JSON-formatted string(s). If bytes/str, the string is parsed. If a pathlib.Path, a file with that name is opened, parsed, and closed. If that path has a URI protocol (like "https://" or "s3://"), this function attempts to open the file with the fsspec library. If a file-like object with a read method, this function reads from the object, but does not close it.
line_delimited (bool) – If False, a single JSON document is read as an entire array or record. If True, this function reads line-delimited JSON into an array (regardless of how many documents there are). The line delimiter is not actually checked, so it may be "\n", "\r\n" or anything else.
schema (None, JSON str or equivalent lists/dicts) – If None, the data type is discovered while parsing. If a JSONSchema (json-schema.org), that schema is used to parse the JSON more quickly by skipping type-discovery.
nan_string (None or str) – If not None, strings with this value will be interpreted as floating-point NaN values.
posinf_string (None or str) – If not None, strings with this value will be interpreted as floating-point positive infinity values.
neginf_string (None or str) – If not None, strings with this value will be interpreted as floating-point negative infinity values.
complex_record_fields (None or (str, str)) – If not None, defines a pair of field names to interpret 2-field records as complex numbers.
buffersize (int) – Number of bytes in each read from source: larger values use more memory but read less frequently. (Python GIL is released before and after read events.)
initial (int) – Initial size (in bytes) of buffers used by the ak::ArrayBuilder.
resize (float) – Resize multiplier for buffers used by the ak::ArrayBuilder; should be strictly greater than 1.
highlevel (bool) – If True, return an ak.Array; otherwise, return a low-level ak.contents.Content subclass.
behavior (None or dict) – Custom ak.behavior for the output array, if high-level.
attrs (None or dict) – Custom attributes for the output array, if high-level.
Converts a JSON string into an Awkward Array.
There are a few different dichotomies in JSON-reading; all of the combinations are supported:
Reading from in-memory str/bytes, on-disk or over-network file, or an arbitrary Python object with a read(num_bytes) method.
Reading a single JSON document or a sequence of line-delimited documents.
Unknown schema (slow and general) or with a provided JSONSchema (fast, but not all possible cases are supported).
Conversion of strings representing not-a-number, plus and minus infinity into the appropriate floating-point numbers.
Conversion of records with a real and imaginary part into complex numbers.
Non-JSON features are not allowed, including literals for not-a-number or infinite
numbers; to be recognized by nan_string, posinf_string, and neginf_string, such
values must appear as quoted strings. The document or line-delimited documents must
adhere to strict JSON syntax.
Sources
In-memory strings or bytes are simply passed as the first argument:
>>> ak.from_json("[[1.1, 2.2, 3.3], [], [4.4, 5.5]]")
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
File names/paths need to be wrapped in pathlib.Path, and remote files are
recognized by URI protocol (like "https://" or "s3://") and handled by fsspec
(which must be installed).
>>> import pathlib
>>> with open("tmp.json", "w") as file:
...     file.write("[[1.1, 2.2, 3.3], [], [4.4, 5.5]]")
...
33
>>> ak.from_json(pathlib.Path("tmp.json"))
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
And any object with a read(num_bytes) method can be used as the source.
>>> class HasReadMethod:
...     def __init__(self, data):
...         self.bytes = data.encode()
...         self.pos = 0
...     def read(self, num_bytes):
...         start = self.pos
...         self.pos += num_bytes
...         return self.bytes[start:self.pos]
...
>>> filelike_obj = HasReadMethod("[[1.1, 2.2, 3.3], [], [4.4, 5.5]]")
>>> ak.from_json(filelike_obj)
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
If this function opens a file or network connection (because it is passed as
a pathlib.Path), then this function will also close that file or connection.
If this function is provided a file-like object with a read(num_bytes) method,
this function will not close it. (It might not even have a close method.)
Data structures
This function interprets JSON arrays and JSON objects in the same way that
ak.from_iter interprets Python lists and Python dicts. It could be used as a
synonym for Python’s json.loads followed by ak.from_iter, but the direct
JSON-reading is faster (especially with a schema) and uses less memory.
Consider
>>> import json
>>> json_data = "[[1.1, 2.2, 3.3], [], [4.4, 5.5]]"
>>> ak.from_iter(json.loads(json_data))
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
>>> ak.from_json(json_data)
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
and
>>> json_data = '{"x": 1.1, "y": [1, 2]}'
>>> ak.from_iter(json.loads(json_data))
<Record {x: 1.1, y: [1, 2]} type='{x: float64, y: var * int64}'>
>>> ak.from_json(json_data)
<Record {x: 1.1, y: [1, 2]} type='{x: float64, y: var * int64}'>
As shown above, reading JSON may result in ak.Array or ak.Record, but line-delimited
(line_delimited=True) only results in ak.Array:
>>> ak.from_json(
...     '{"x": 1.1, "y": [1]}\n{"x": 2.2, "y": [1, 2]}\n{"x": 3.3, "y": [1, 2, 3]}',
...     line_delimited=True,
... )
<Array [{x: 1.1, y: [1]}, ..., {x: 3.3, ...}] type='3 * {x: float64, y: var...'>
Even arrays of length zero:
>>> ak.from_json("", line_delimited=True)
<Array [] type='0 * unknown'>
Note that JSON interpreted with line_delimited doesn’t actually need delimiters
between JSON documents or an absence of delimiters within each document. Parsing
with line_delimited=True continues to the end of a JSON document and starts
again with the next JSON document. It may be necessary to require actual delimiters
between and never within JSON documents to split a large source for
parallel-processing, but that consideration is beyond this function.
If a JSONSchema is provided, the schema describes the structure of the JSON
document, regardless of whether there’s only one of them (may be an ak.Record)
or many of them (must be an ak.Array).
>>> schema = {
...     "type": "object",
...     "properties": {
...         "x": {"type": "number"},
...         "y": {"type": "array", "items": {"type": "integer"}},
...     },
...     "required": ["x", "y"],
... }
>>> ak.from_json(
...     '{"x": 1.1, "y": [1, 2, 3]}',
...     schema=schema,
... )
<Record {x: 1.1, y: [1, ..., 3]} type='{x: float64, y: var * int64}'>
>>> ak.from_json(
...     '{"x": 1.1, "y": [1]}\n{"x": 2.2, "y": [1, 2]}\n{"x": 3.3, "y": [1, 2, 3]}',
...     schema=schema,
...     line_delimited=True,
... )
<Array [{x: 1.1, y: [1]}, ..., {x: 3.3, ...}] type='3 * {x: float64, y: var...'>
All numbers in the final array are signed 64-bit (integers and floating-point).
JSONSchemas
This function supports a subset of JSONSchema (see the JSONSchema specification). The schemas may be passed as JSON text or as Python lists and dicts representing JSON, but the following conditions apply:
The root of the schema must be "type": "array" or "type": "object".
Every level must have a "type", which can only name one type (as a string or length-1 list) or one type and "null" (as a length-2 list).
"type": "boolean" → 1-byte boolean values.
"type": "integer" → 8-byte integer values. If a part of the schema is declared to have integer type but the JSON numbers are expressed as floating-point, such as 3.14, 3.0, or 3e0, this function raises an error.
"type": "number" → 8-byte floating-point values. If used with this function’s nan_string, posinf_string, and/or neginf_string, the value in the JSON could be a string, as long as it matches one of these three.
"type": "string" → UTF-8 encoded strings. All JSON escape sequences are supported. Remember that the source data are ASCII; Unicode is derived from “\uXXXX” escape sequences. If an "enum" is given, strings are represented as categorical values (ak.contents.IndexedArray or ak.contents.IndexedOptionArray).
"type": "array" → nested lists. The "items" must be specified. If "minItems" and "maxItems" are specified and equal to each other, the list has regular-type (ak.types.RegularType); otherwise, it has variable-length type (ak.types.ListType).
"type": "object" → nested records. The "properties" must be specified, and any properties in the data not described by "properties" will not appear in the output.
Substitutions for non-finite and complex numbers
JSON doesn’t support not-a-number values, infinite values, or complex number types (as in numbers with a real and imaginary part). Some work-arounds use non-JSON syntax, but this function converts valid JSON into these numbers with user-specified rules.
The nan_string, posinf_string, and neginf_string convert quoted strings
into floating-point numbers. You can specify what these strings are.
>>> ak.from_json(
...     '[1, 2, "nan", "inf", "-inf"]',
...     nan_string="nan",
...     posinf_string="inf",
...     neginf_string="-inf",
... )
<Array [1, 2, nan, inf, -inf] type='5 * float64'>
Without these rules, the array would be interpreted as a union of numbers and strings:
>>> ak.from_json(
...     '[1, 2, "nan", "inf", "-inf"]',
... )
<Array [1, 2, 'nan', 'inf', '-inf'] type='5 * union[int64, string]'>
When combined with a JSONSchema, you need to say that these values have type
"number", not a union of strings and numbers (i.e. the conversion is performed
before schema-validation). Note that they can’t be "integer", since
not-a-number and infinite values are only possible for floating-point numbers.
>>> ak.from_json(
...     '[1, 2, "nan", "inf", "-inf"]',
...     nan_string="nan",
...     posinf_string="inf",
...     neginf_string="-inf",
...     schema={"type": "array", "items": {"type": "number"}}
... )
<Array [1, 2, nan, inf, -inf] type='5 * float64'>
The complex_record_fields is a 2-tuple of field names (strings) of objects
to identify as the real and imaginary parts of complex numbers. Complex number
representations in JSON vary, though most are JSON objects with real and
imaginary parts and possibly other fields. Any other fields will be excluded
from the output array.
>>> ak.from_json(
...     '[{"r": 1, "i": 1.1, "other": ""}, {"r": 2, "i": 2.2, "other": ""}]',
...     complex_record_fields=("r", "i"),
... )
<Array [1+1.1j, 2+2.2j] type='2 * complex128'>
Without this rule, the array would be interpreted as an array of records:
>>> ak.from_json(
...     '[{"r": 1, "i": 1.1, "other": ""}, {"r": 2, "i": 2.2, "other": ""}]',
... )
<Array [{r: 1, i: 1.1, other: ''}, {...}] type='2 * {r: int64, i: float64, ...'>
When combined with a JSONSchema, you need to specify the object type (i.e. the
conversion is performed after schema-validation). Note that even the fields
that will be ignored by complex_record_fields need to be specified.
>>> ak.from_json(
...     '[{"r": 1, "i": 1.1, "other": ""}, {"r": 2, "i": 2.2, "other": ""}]',
...     complex_record_fields=("r", "i"),
...     schema={
...         "type": "array",
...         "items": {
...             "type": "object",
...             "properties": {
...                 "r": {"type": "number"},
...                 "i": {"type": "number"},
...                 "other": {"type": "string"},
...             },
...             "required": ["r", "i"],
...         },
...     },
... )
<Array [1+1.1j, 2+2.2j] type='2 * complex128'>
See also ak.to_json.