Out-of-core dataloading

This example demonstrates how to create a Tesseract that can ingest large datasets without loading the whole dataset into memory.

Note: For many more examples, see the examples/ directory in the Tesseract repository.

An example dataset is stored at examples/unit_tesseracts/dataloader/testdata:

# run from the root of the tesseract-core repo
$ ls examples/unit_tesseracts/dataloader/testdata/
sample_0.json sample_2.json sample_4.json sample_6.json sample_8.json
sample_1.json sample_3.json sample_5.json sample_7.json sample_9.json

Each sample contains just a small, JSON-encoded array:

{
  "object_type": "array",
  "shape": [3, 3],
  "dtype": "float32",
  "data": {
    "buffer": [
      [0.6369616985321045, 0.2697867155075073, 0.04097352549433708],
      [0.016527635976672173, 0.8132702112197876, 0.91275554895401],
      [0.6066357493400574, 0.7294965386390686, 0.543624997138977]
    ],
    "encoding": "json"
  }
}
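
To make the layout concrete, here is a minimal sketch that checks whether the nested buffer of a sample matches its declared shape. It uses only the Python standard library and assumes the "json" encoding shown above. The helper name buffer_shape is hypothetical and not part of Tesseract, which parses these files for you.

```python
import json

SAMPLE = """
{
  "object_type": "array",
  "shape": [3, 3],
  "dtype": "float32",
  "data": {
    "buffer": [
      [0.6369616985321045, 0.2697867155075073, 0.04097352549433708],
      [0.016527635976672173, 0.8132702112197876, 0.91275554895401],
      [0.6066357493400574, 0.7294965386390686, 0.543624997138977]
    ],
    "encoding": "json"
  }
}
"""


def buffer_shape(buffer) -> tuple:
    """Hypothetical helper: shape of a nested list, assumed rectangular."""
    shape = []
    while isinstance(buffer, list):
        shape.append(len(buffer))
        # Descend along the first element only; ragged rows are not detected.
        buffer = buffer[0] if buffer else None
    return tuple(shape)


sample = json.loads(SAMPLE)
declared = tuple(sample["shape"])
actual = buffer_shape(sample["data"]["buffer"])
print(declared == actual)
```

The check only walks the first element at each nesting level, so it is a quick sanity test rather than full validation.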

Even if these arrays were too large to fit into memory all at once, we could still process them one by one by defining an InputSchema with a LazySequence.

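from pydantic import BaseModel, Field
from tesseract_core.runtime import Array, Differentiable, Float32
from tesseract_core.runtime.experimental import LazySequence

class InputSchema(BaseModel):
    # NOTE: no file references here
    data: LazySequence[Differentiable[Array[(None, 3), Float32]]] = Field(
        description="Data to be processed."
    )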

Note that the InputSchema contains only the types that the data will be parsed into, not file references. The paths to the data are supplied later by the inputs.json payload. This makes the Tesseract more flexible, as it accepts either file references or the data directly.
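
The idea behind lazy loading can be sketched in plain Python: a sequence that stores only file paths and parses a file when the corresponding item is accessed. The class LazyJsonSequence below is a hypothetical stand-in written for illustration; it is not how Tesseract's LazySequence is implemented, and the tiny sample files are generated on the fly.

```python
import json
import tempfile
from collections.abc import Sequence
from pathlib import Path


class LazyJsonSequence(Sequence):
    """Hypothetical lazy sequence: each item is read from disk on access."""

    def __init__(self, paths):
        self._paths = list(paths)

    def __len__(self):
        return len(self._paths)

    def __getitem__(self, index):
        # Only integer indices are supported in this sketch.
        # Only the requested file is opened and parsed.
        with open(self._paths[index]) as f:
            return json.load(f)


# Write three tiny samples to a temporary directory for demonstration.
tmpdir = Path(tempfile.mkdtemp())
for i in range(3):
    sample = {"shape": [1, 3], "data": {"buffer": [[i, i, i]], "encoding": "json"}}
    (tmpdir / f"sample_{i}.json").write_text(json.dumps(sample))

data = LazyJsonSequence(sorted(tmpdir.glob("*.json")))

# Iterating parses one sample at a time instead of loading all files upfront.
row_sums = [sum(item["data"]["buffer"][0]) for item in data]
print(row_sums)
```

Because the sequence holds paths rather than parsed arrays, peak memory use is governed by the largest single sample rather than by the size of the whole dataset.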

You can build and run this Tesseract locally:

$ tesseract build examples/unit_tesseracts/dataloader

When running the Tesseract, instead of providing the full dataset in the input payload, we can pass a glob pattern prefixed with @ in place of the LazySequence.

# run from the root of the tesseract-core repo
$ tesseract run dataloader \
    --volume $(pwd)/examples/unit_tesseracts/dataloader/testdata:/mnt/testdata:ro \
    apply '{"inputs": {"data": "@/mnt/testdata/*.json"}}'

The command above is part of the file examples/unit_tesseracts/dataloader/test-tesseract.sh.
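
As a rough mental model of the @ glob reference, the sketch below expands such a string into a list of matching file paths. The helper expand_reference is hypothetical, and sorting the matches is an assumption; the actual resolution happens inside the Tesseract runtime and may differ in its details.

```python
import glob
import tempfile
from pathlib import Path


def expand_reference(value):
    """Hypothetical helper: expand an "@<glob>" string into sorted paths."""
    if isinstance(value, str) and value.startswith("@"):
        # Drop the leading "@" and match the rest as a glob pattern.
        return sorted(glob.glob(value[1:]))
    # Anything else is treated as inline data and passed through unchanged.
    return value


# Create a few files so the pattern has something to match.
tmpdir = Path(tempfile.mkdtemp())
for i in range(3):
    (tmpdir / f"sample_{i}.json").write_text("{}")
(tmpdir / "notes.txt").write_text("ignored by the pattern")

paths = expand_reference(f"@{tmpdir}/*.json")
print([Path(p).name for p in paths])
```

Note that in the command above the pattern refers to /mnt/testdata, the path at which the volume is mounted, not to the path in the repository checkout.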