Introduction
Computational bioinformatics uses complex pipelines that need to be orchestrated on modern infrastructure and remain reproducible. This requires:
- Running tools reliably in any computational environment
- A well-defined way to orchestrate them
- A data management layer for reproducibility
Pachyderm fulfils these requirements; the thesis contains multiple contributions to Pachyderm as well as an example pipeline that demonstrates its properties.
Bioinformatics has become more complex and data-intensive in recent years. A foundation of the scientific method is that experiments are reproducible by other researchers.
The use of varied processing environments and frameworks makes this hard. Snakemake and Galaxy aid researchers by managing dependencies between tools under the hood. VMs have fallen out of favor since their availability and reproducibility are poor. The microservices architecture inspired the use of Docker containers for pipelines: they are lightweight, scalable, and capture the compute environment reliably.
Pachyderm, built on Kubernetes, promises scalability, as well as reproducibility through the use of containers and a fully versioned filesystem.
Many pipeline solutions exist:
- Bpipe and Reflow are tied to a specific language and do not use containers
- Dat and Git LFS handle data dependencies but are not integrated with any pipelining solution
- iRODS can be used to implement a file access layer and a pipelining solution through a rule-based language
Pachyderm, along with Argo, is one of the few workflow tools that are language- and data-agnostic.
System and methods
Pachyderm leverages open-source software from the container ecosystem (Docker, Kubernetes, etcd). The Pachyderm daemon (pachd) consists of:
- a block store component
- a filesystem component that cooperates with the block store to handle PFS (Pachyderm File System) requests
- a pipelining component that controls the worker pods via the Kubernetes API and coordinates the injection of data with the filesystem component (see the sketch after this list)
- A separate object store is used to save the data
- An etcd key-value store is used for managing pipeline state and related metadata
- Pachyderm has an internal caching system
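A toy sketch of how these pieces could fit together (class and method names here are hypothetical and only illustrate the division of responsibilities; the real pachd is a Go daemon):

```python
class BlockStore:
    """Persists raw blocks; in practice backed by an external object store."""
    def __init__(self):
        self.blocks = {}

    def put(self, key: str, data: bytes) -> None:
        self.blocks[key] = data

    def get(self, key: str) -> bytes:
        return self.blocks[key]


class FileSystem:
    """Serves PFS requests by mapping file paths to blocks in the block store."""
    def __init__(self, store: BlockStore):
        self.store = store
        self.paths = {}  # path -> block key (state Pachyderm keeps in etcd)

    def write(self, path: str, data: bytes) -> None:
        key = f"block-{hash(data)}"
        self.store.put(key, data)
        self.paths[path] = key

    def read(self, path: str) -> bytes:
        return self.store.get(self.paths[path])


class Pipeliner:
    """Drives workers (pachd does this via the Kubernetes API) and
    coordinates data injection with the filesystem component."""
    def __init__(self, fs: FileSystem):
        self.fs = fs

    def run(self, in_path: str, out_path: str, transform) -> None:
        self.fs.write(out_path, transform(self.fs.read(in_path)))
```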
External storage does not have to be used, but it can be: e.g. an external GlusterFS volume with minion processes that pull/upload files between Kubernetes and GlusterFS.
PFS
- Copy-on-write semantics; every dataset that was used must be restorable
- Files are hashed and stored in the filesystem under this hash, similar to DVC (see the sketch after this list)
- The hash is tracked in etcd
- Versions of files live in a Pachyderm data repository (pdr)
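A minimal sketch of this content-addressing scheme (the names and on-disk layout are assumptions for illustration, not Pachyderm's actual implementation):

```python
import hashlib
from pathlib import Path

STORE = Path("blockstore")  # hypothetical on-disk block store
INDEX = {}                  # stands in for the hash index kept in etcd

def put(path: str) -> str:
    """Store a file under the hash of its contents (content addressing)."""
    data = Path(path).read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    STORE.mkdir(exist_ok=True)
    blob = STORE / digest
    if not blob.exists():   # copy-on-write: identical content is stored once
        blob.write_bytes(data)
    INDEX[path] = digest    # track the hash, as Pachyderm does in etcd
    return digest

def get(path: str) -> bytes:
    """Restore a file's contents from its tracked hash."""
    return (STORE / INDEX[path]).read_bytes()
```

Because blobs are addressed by content hash, any old version stays restorable as long as its blob and hash are retained.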
Using this concept of data repositories and pipelines, data provenance/lineage can be tracked automatically. Any state of the data is identified by a commit, and it is also possible to retrieve the versioned pipeline definition that was used to process a datum: "any run of any pipeline producing any result is completely reproducible and explainable"
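One way to picture the commit model: each commit is an immutable snapshot recording both the data hashes and the pipeline version that produced it, so lineage is a walk up the parent chain. A hypothetical sketch:

```python
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass(frozen=True)
class Commit:
    """Immutable snapshot of a repo plus its provenance metadata."""
    files: dict                      # path -> content hash
    pipeline_version: Optional[str]  # hash of the pipeline spec that wrote it
    parent: Optional["Commit"]       # previous state of the repo

def lineage(commit: Optional[Commit]) -> Iterator[Commit]:
    """Walk back through history: every result stays explainable."""
    while commit is not None:
        yield commit
        commit = commit.parent
```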
Pachyderm workers
Workers are long-running pods (by default) that run Pachyderm system containers alongside the user container. They are created together with the pipeline definition.
Processing is incremental: only new data is fed to jobs, as soon as it is committed to a repository.
Data is split into datums, the smallest independently processable units (e.g. lines in a CSV file), for parallel processing. The number of workers is controlled by the user, and each worker processes one datum at a time in isolation, as sketched below.
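For illustration, a pipeline definition along these lines ties the pieces together. The field names follow Pachyderm's pipeline spec, but the repo, image, and command are made-up examples; here the spec is built as a Python dict and printed as the JSON Pachyderm consumes:

```python
import json

# Hypothetical pipeline: count lines in each file committed to a "samples" repo.
spec = {
    "pipeline": {"name": "line-count"},
    "transform": {
        "image": "alpine:3.18",  # the user container
        "cmd": ["sh", "-c", "wc -l /pfs/samples/* > /pfs/out/counts.txt"],
    },
    "input": {
        # glob "/*" makes each top-level file its own datum, so files are
        # processed in parallel and only new files trigger work
        "pfs": {"repo": "samples", "glob": "/*"}
    },
    # user-controlled number of workers, each handling one datum at a time
    "parallelism_spec": {"constant": 4},
}

print(json.dumps(spec, indent=2))
```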
Results
- The authors contributed block storage (S3?) integrations to Pachyderm
- and a Helm chart for deploying it in multiple infrastructure scenarios
Pachyderm was used to implement a highly complex bioinformatics pipeline, and the scaling efficiency and speedup relative to a reference solution were measured.
Discussion
Containers are increasingly used for distributed and reproducible processing, with high adoption in bioinformatics. The performance impact of using containers (as opposed to, say, .jar applications) is negligible. "a scaling efficiency of 79% and a speedup of 63 were achieved when using 80 workers in one of the benchmarks."
Drawbacks of Pachyderm
- limited access to external storage and limited data locality
- suboptimal container networking and security
- performance overhead compared to bare-metal settings
- streaming pipelines can be implemented, but job parallelism is not possible
- completely tied to Kubernetes, which is not easy to set up
Overall, Pachyderm is a good and convenient tool if one can live with the above drawbacks.