JSONReader
A simple JSON data loader with various options. It either parses the entire string, cleans it, and treats each line as an embedding, or performs a recursive depth-first traversal that yields JSON paths. It supports streaming of large JSON data using @discoveryjs/json-ext.
Usage
import { JSONReader } from "llamaindex";
const file = "../../PATH/TO/FILE";
const content = new TextEncoder().encode("JSON_CONTENT");
const reader = new JSONReader({ levelsBack: 0, collapseLength: 100 });
// Both methods are asynchronous, so await their results.
const docsFromFile = await reader.loadData(file);
const docsFromContent = await reader.loadDataAsContent(content);
Options
Basic:
- streamingThreshold?: The size threshold, in MB of JSON data, above which streaming mode is used. The byte size is converted to an estimated character count, (streamingThreshold * 1024 * 1024) / 2, which is compared against the .length of the JSON string. Set to undefined to disable streaming or to 0 to always use streaming. Default is 50 MB.
- ensureAscii?: Whether to ensure that only ASCII characters are present in the output by converting non-ASCII characters to their Unicode escape sequences. Default is false.
- isJsonLines?: Whether the JSON is in JSON Lines format. If true, the input is split into lines, empty lines are removed, and each line is parsed as JSON. Note: this uses a custom streaming parser, which is most likely less robust than json-ext. Default is false.
- cleanJson?: Whether to clean the JSON by filtering out structural characters ({}, [], and ,). If set to false, the JSON is only parsed and structural characters are kept. Default is true.
- logger?: A placeholder for a custom logger function.
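Three of the options above can be sketched as small standalone helpers. This is only an illustration of the behavior described in the list, not the library's source: the function names are hypothetical, the strict ">" comparison is an assumption, and the JSON Lines helper is not a streaming parser.

```typescript
// Hypothetical helpers illustrating three of the options above.
// None of these names are part of the llamaindex API.

// streamingThreshold: estimate a character limit from a size in MB and
// compare it against the string length. The strict ">" is an assumption.
function exceedsStreamingThreshold(
  jsonStr: string,
  streamingThreshold: number | undefined,
): boolean {
  // undefined disables streaming: no length can exceed Infinity.
  const thresholdMb = streamingThreshold ?? Infinity;
  const charLimit = (thresholdMb * 1024 * 1024) / 2;
  return jsonStr.length > charLimit;
}

// ensureAscii: replace every non-ASCII UTF-16 code unit with \uXXXX.
function toAsciiEscaped(text: string): string {
  return text.replace(/[\u0080-\uffff]/g, (ch) => {
    const hex = ch.charCodeAt(0).toString(16);
    return "\\u" + ("0000" + hex).slice(-4);
  });
}

// isJsonLines: split into lines, drop empty ones, parse each line as JSON.
// This version is not streaming; it only shows the per-line parsing.
function parseJsonLines(content: string): unknown[] {
  return content
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

const small = '{"tweet": "こんにちは世界"}';
console.log(exceedsStreamingThreshold(small, 50)); // false
console.log(exceedsStreamingThreshold(small, 0)); // true
console.log(exceedsStreamingThreshold(small, undefined)); // false
console.log(toAsciiEscaped(small));
console.log(parseJsonLines('{"tweet": "Hello world"}\n\n' + small).length); // 2
```

With the default of 50, the estimated limit works out to 26,214,400 characters, so small inputs like the one above are parsed without streaming.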
Depth-First-Traversal:
- levelsBack?: Specifies how many levels up the JSON structure to include in the output. When set, cleanJson is ignored. If set to 0, all levels are included. If undefined, the reader parses the entire JSON, treats each line as an embedding, and creates a document per top-level array. Default is undefined.
- collapseLength?: The maximum length of a JSON string representation to be collapsed into a single line. Only applicable when levelsBack is set. Default is undefined.
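The interaction between levelsBack and collapseLength can be sketched as a small recursive function. This is a minimal illustration under assumptions, not the library's implementation: the traverse function and the JSONValue type are hypothetical names, and array items are assumed to share their parent's path.

```typescript
type JSONValue =
  | string
  | number
  | boolean
  | null
  | JSONValue[]
  | { [key: string]: JSONValue };

// Hypothetical sketch of the traversal; not the library's actual code.
function traverse(
  data: JSONValue,
  levelsBack: number,
  collapseLength?: number,
  path: string[] = [],
): string[] {
  // Keep only the last `levelsBack` path segments; 0 keeps the full path.
  const visible = levelsBack === 0 ? path : path.slice(-levelsBack);

  // Leaf value: emit the visible path followed by the value.
  if (data === null || typeof data !== "object") {
    return [visible.concat(String(data)).join(" ")];
  }

  // Subtree short enough to collapse: emit it on a single line.
  const jsonStr = JSON.stringify(data);
  if (collapseLength !== undefined && jsonStr.length <= collapseLength) {
    return [visible.concat(jsonStr).join(" ")];
  }

  const lines: string[] = [];
  if (Array.isArray(data)) {
    // Assumption: array items share the parent's path.
    for (const item of data) {
      lines.push(...traverse(item, levelsBack, collapseLength, path));
    }
  } else {
    for (const key of Object.keys(data)) {
      lines.push(
        ...traverse(data[key], levelsBack, collapseLength, path.concat(key)),
      );
    }
  }
  return lines;
}

const doc: JSONValue = {
  a: { "1": { key1: "value1" }, "2": { key2: "value2" } },
  b: { "3": { k3: "v3" }, "4": { k4: "v4" } },
};
console.log(traverse(doc, 0).join("\n")); // full paths
console.log(traverse(doc, 0, 35).join("\n")); // short subtrees collapsed
console.log(traverse(doc, 2).join("\n")); // last two path segments only
```

In this sketch the collapse check runs before descending into a subtree, which is why a whole branch can end up on one line while a larger sibling branch is still expanded key by key.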
Examples
Input:
{"a": {"1": {"key1": "value1"}, "2": {"key2": "value2"}}, "b": {"3": {"k3": "v3"}, "4": {"k4": "v4"}}}
Default options:
levelsBack = undefined & cleanJson = true
Output:
"a": {
"1": {
"key1": "value1"
"2": {
"key2": "value2"
"b": {
"3": {
"k3": "v3"
"4": {
"k4": "v4"
Depth-First Traversal all levels:
levelsBack = 0
Output:
a 1 key1 value1
a 2 key2 value2
b 3 k3 v3
b 4 k4 v4
Depth-First Traversal and Collapse:
levelsBack = 0 & collapseLength = 35
Output:
a 1 {"key1":"value1"}
a 2 {"key2":"value2"}
b {"3":{"k3":"v3"},"4":{"k4":"v4"}}
Depth-First Traversal limited levels:
levelsBack = 2
Output:
1 key1 value1
2 key2 value2
3 k3 v3
4 k4 v4
Uncleaned JSON:
levelsBack = undefined & cleanJson = false
Output:
{"a":{"1":{"key1":"value1"},"2":{"key2":"value2"}},"b":{"3":{"k3":"v3"},"4":{"k4":"v4"}}}
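The difference between the default cleaned output and the uncleaned output can be sketched as follows. This is a minimal illustration, not the library's code: formatJson is a hypothetical name, and trimming the indentation of each line is an assumption made so that the result matches the cleaned output as printed in the default-options example.

```typescript
// Hypothetical sketch contrasting cleanJson = true and cleanJson = false.
// It is not the library's actual code.
function formatJson(data: unknown, cleanJson: boolean): string {
  if (!cleanJson) {
    // Uncleaned: serialize compactly and keep all structural characters.
    return JSON.stringify(data);
  }
  // Cleaned: pretty-print, then drop lines made up only of {}, [] and commas.
  // Trimming the indentation is an assumption (see the lead-in).
  return JSON.stringify(data, null, 2)
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => !/^[{}\[\],]*$/.test(line))
    .join("\n");
}

const example = {
  a: { "1": { key1: "value1" }, "2": { key2: "value2" } },
  b: { "3": { k3: "v3" }, "4": { k4: "v4" } },
};
console.log(formatJson(example, true)); // one key or key/value pair per line
console.log(formatJson(example, false)); // single compact JSON string
```

Note that this sketch only removes lines consisting entirely of structural characters; an opening brace that follows a key on the same line is kept, as in the default-options output.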
ASCII-Conversion:
ensureAscii = true
Input:
{ "message": "こんにちは世界" }
Output:
"message": "\u3053\u3093\u306b\u3061\u306f\u4e16\u754c"
JSON Lines Format:
isJsonLines = true
Input:
{"tweet": "Hello world"}\n{"tweet": "こんにちは世界"}
Output:
"tweet": "Hello world"
"tweet": "こんにちは世界"