This example demonstrates how to create a Tesseract that can ingest large datasets without loading the whole dataset into memory.
For many more examples, see the examples/ directory in the Tesseract repository.
An example dataset is stored at examples/unit_tesseracts/dataloader/testdata:
# run from the root of the tesseract-core repo
$ ls examples/unit_tesseracts/dataloader/testdata/
sample_0.json sample_2.json sample_4.json sample_6.json sample_8.json
sample_1.json sample_3.json sample_5.json sample_7.json sample_9.json
Each sample contains just a small, JSON-encoded array:
{
  "object_type": "array",
  "shape": [3, 3],
  "dtype": "float32",
  "data": {
    "buffer": [
      [0.6369616985321045, 0.2697867155075073, 0.04097352549433708],
      [0.016527635976672173, 0.8132702112197876, 0.91275554895401],
      [0.6066357493400574, 0.7294965386390686, 0.543624997138977]
    ],
    "encoding": "json"
  }
}
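To experiment with a dataset of your own, files in the same layout can be produced with a few lines of Python. The following is a minimal sketch using only the standard library; the helper names `make_sample` and `write_samples` are hypothetical and not part of Tesseract, and the values are plain Python floats that are merely labeled as float32 in the payload.

```python
import json
import random
from pathlib import Path


def make_sample(rows: int = 3, cols: int = 3, seed: int = 0) -> dict:
    """Build one array payload in the JSON encoding shown above."""
    rng = random.Random(seed)
    buffer = [[rng.random() for _ in range(cols)] for _ in range(rows)]
    return {
        "object_type": "array",
        "shape": [rows, cols],
        "dtype": "float32",
        "data": {"buffer": buffer, "encoding": "json"},
    }


def write_samples(out_dir: Path, count: int = 10) -> list[Path]:
    """Write sample_0.json ... sample_{count-1}.json into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        path = out_dir / f"sample_{i}.json"
        path.write_text(json.dumps(make_sample(seed=i), indent=2))
        paths.append(path)
    return paths
```

Calling `write_samples(Path("my_testdata"))` would create ten files named like the ones listed above, each holding one 3x3 array.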
These arrays are small, but suppose they were too large to fit into memory all at once. We can still process them one by one by defining an InputSchema with a LazySequence.
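from pydantic import BaseModel, Field
from tesseract_core.runtime import Array, Differentiable, Float32
from tesseract_core.runtime.experimental import LazySequence

class InputSchema(BaseModel):
    # NOTE: no file references here
    data: LazySequence[Differentiable[Array[(None, 3), Float32]]] = Field(
        description="Data to be processed."
    )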
Note that the InputSchema only contains the actual types that the data will be parsed into, not file references. The paths to the data are provided later, in the inputs.json payload. This makes the Tesseract more flexible, as it accepts either file references or data directly.
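To make the idea of lazy loading concrete, here is a minimal pure-Python sketch of a sequence that reads each file only when the corresponding item is accessed. It is an illustration of the concept, not Tesseract's implementation of LazySequence; the class `LazyJsonFiles` and the function `total_of_all` are hypothetical names, and the sketch supports integer indexing only.

```python
import glob
import json
from collections.abc import Sequence


class LazyJsonFiles(Sequence):
    """Hypothetical stand-in: parses each JSON file only when it is accessed."""

    def __init__(self, pattern: str):
        # Only the file names are held in memory, never the file contents.
        self._paths = sorted(glob.glob(pattern))

    def __len__(self) -> int:
        return len(self._paths)

    def __getitem__(self, index: int) -> dict:
        # An out-of-range index raises IndexError, which ends iteration.
        with open(self._paths[index]) as f:
            return json.load(f)


def total_of_all(samples: Sequence) -> float:
    """Sum all array entries while holding one sample in memory at a time."""
    total = 0.0
    for sample in samples:
        total += sum(sum(row) for row in sample["data"]["buffer"])
    return total
```

Because the loop in `total_of_all` drops each sample before loading the next, peak memory use is bounded by the largest single file rather than by the size of the whole dataset.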
You can build and run this Tesseract locally:
$ tesseract build examples/unit_tesseracts/dataloader
When running the Tesseract, instead of providing the full dataset in the input payload, we can pass a glob pattern prefixed with @ in place of the LazySequence.
# run from the root of the tesseract-core repo
tesseract run dataloader \
--volume $(pwd)/examples/unit_tesseracts/dataloader/testdata:/mnt/testdata:ro \
apply '{"inputs": {"data": "@/mnt/testdata/*.json"}}'
The command above is part of the file examples/unit_tesseracts/dataloader/test-tesseract.sh.
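If you prefer to launch the same command from Python, building the argument list programmatically avoids shell quoting issues around the JSON payload. The sketch below only assembles the arguments already shown in the command above; the function name `build_run_command` is hypothetical, and it assumes the `tesseract` CLI is on your PATH when the command is eventually executed.

```python
import json
from pathlib import Path


def build_run_command(testdata_dir: Path, mount_point: str = "/mnt/testdata") -> list[str]:
    """Assemble the `tesseract run` invocation as an argument list."""
    # The "@" prefix marks the value as a glob pattern over files
    # inside the container, as described above.
    payload = {"inputs": {"data": f"@{mount_point}/*.json"}}
    return [
        "tesseract", "run", "dataloader",
        # Mount the host directory read-only at the mount point.
        "--volume", f"{testdata_dir.resolve()}:{mount_point}:ro",
        "apply", json.dumps(payload),
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` from the root of the tesseract-core repo, for example with `Path("examples/unit_tesseracts/dataloader/testdata")` as the data directory.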