DYAD System Design

Where is DYAD useful?

Deep Learning Training by Distributed Stochastic Gradient Descent (SGD)

Deep learning training often requires randomizing the order of input samples at each epoch. In distributed or parallel training, where each worker processes a subset of the samples, the set of files assigned to each worker changes at every epoch due to this randomization.

With DYAD, workers can avoid repeatedly loading files from shared storage at every epoch. Instead, workers can retrieve files from the local storage of other workers when needed. Initially, the dataset is partitioned across workers. When a file is reference for the first time, it is staged into the DYAD managed directory on the local storage of the worker who owns the partition.

Check out our use case with the DLIO representing PyTorch dataloader Devarajan et al. 2024.

Overview of Synchronization and File Transfer

DYAD high-level operation — DYAD data flow between producer and consumer nodes. Numbers indicate the order of steps and correspond to the labeled arrows in the diagram.

DYAD consists of two components: a **wrapper** library that is injected into the application process via LD_PRELOAD or C++ stream wrapper classes to transparently intercept I/O calls, and a **service** that runs as a plugin module to the Flux broker on each node. Together they coordinate file access and transfer between nodes.

The wrapper intercepts I/O only for files in the directory it manages, or DYAD managed directory (DMD). If the file shared between producer and consumer under DMD resides on the producer’s local storage that is not visible to the consumer, DYAD both synchronizes accesses and transfers files to the DMD on the consumer’s local storage. If it is visible to both parties (e.g., residing on a shared storage), DYAD only synchronizes accesses.

Producer path

When a file is written into the DMD or any of its subdirectories, the following steps occur in order (see p.1–p.2 in the diagram):

p.1 write(managed_dir/filename) — the application writes the file to local storage. The wrapper intercepts this call on file close to perform necessary steps.
p.2 publish(<filename, producer_rank>) — the wrapper registers the metadata in the DYAD hierarchical file locator (HFL) that relies on the global key-value store (KVS) managed by Flux. The metadata is a (filename, rank) pair, where rank is the rank of the Flux broker on the node where the file owner has local visibility to the file. In rare cases, there can be multiple brokers running on each node and local coordination protects accesses.

Consumer path

When a consumer application opens a file in the DMD, the following steps occur in order (see c.1–c.3 in the diagram):

c.1 query(filename) → producer_rank — the wrapper queries the HFL to obtain the rank of the file owner (producer). If the entry is not yet present, the consumer blocks until it receives a metadata that matches the key it queried. The HFL service responds to the posted query when it sees the matching entry, which propagates from the producer’s local HFL through the hierarchy.
c.2 rpc_get(producer_rank, filename) — the wrapper sends an RPC to the producer’s DYAD service requesting the file. The service transfers the file over the selected DTL backend (Flux RPC, Margo, or UCX) and stores it on the consumer’s local storage.
c.3 read(managed_dir/filename) — the application reads the file from local storage as if it had always been there, with no knowledge of the transfer or synchronization that took place.

Data Transport Layer (DTL)

The following diagrams show the sequence of DTL function calls for each backend, starting from dyad_get_data() on the consumer side through to the producer service.

General DTL sequence

General DYAD DTL data transfer sequence — The abstract DTL interface sequence starting from `dyad_get_data()`. All backends follow this same call order — the implementation of each step differs per backend. The buffer allocated by `get_buffer()` serves different purposes on each side of the transfer. On the producer side, the file is read from local storage into the buffer, which is then passed to `send()` to transfer the data to the consumer. On the consumer side, the buffer receives the incoming data via `recv()`, after which `dyad_cons_store()` writes the buffer contents to the consumer’s local managed directory. From that point on, the application can read the file through its normal I/O path — whether via the GOTCHA wrapper or the C++ stream wrapper — without any knowledge of the transfer or synchronization that took place. To the application, the file simply appears in local storage as if it had always been there.

Flux RPC backend

DYAD Flux RPC DTL data transfer sequence — The Flux RPC backend routes all data through the Flux broker message bus. Several steps are no-ops since the streaming RPC connection is implicit. The stored `flux_msg_t*` pointer routes `flux_respond_raw()` responses back to the correct consumer.

Margo backend

DYAD Margo DTL data transfer sequence — The Margo backend uses an inverted model — the consumer runs a Margo server and the producer connects back to it. Data is transferred via `HG_BULK_PULL`: the producer registers its buffer and triggers the consumer’s `data_ready_rpc()` handler to pull the data via RDMA.

UCX backend

DYAD UCX DTL data transfer sequence — The UCX backend uses a push model with pre-registered RDMA memory. The producer performs a one-sided `ucp_put_nbx()` directly into the consumer’s pre-registered buffer. The consumer polls a size sentinel to detect arrival. Endpoints are cached across transfers.

DTL backend comparison

All three DTL backends share the same abstract interface — the same sequence of rpc_pack(), rpc_unpack(), rpc_respond(), rpc_recv_response(), establish_connection(), get_buffer(), send(), recv(), return_buffer(), and close_connection() calls — but implement them very differently depending on the underlying transport.

The Flux RPC backend is the simplest. Data travels entirely through the Flux broker message bus: the producer calls flux_respond_raw() with the file contents and the consumer reads them via flux_rpc_get_raw(). No RDMA, no external library, no connection setup — the Flux streaming RPC handles everything. The original request message (flux_msg_t*) is stored in the DTL handle after rpc_unpack() so that send() can route the response back to the correct consumer. Several steps are no-ops: rpc_respond(), rpc_recv_response(), and both establish_connection() calls do nothing because the streaming RPC connection is implicit. This backend requires only Flux and is the most portable, but it routes all data through the broker and cannot exploit RDMA hardware.

The Margo backend uses an inverted client-server model. The consumer initializes Margo in MARGO_SERVER_MODE during dyad_dtl_margo_init() so that the producer can connect back to it. The consumer embeds its own Margo server address (margo_addr_to_string()) in the Flux RPC request payload, which the producer extracts and resolves via margo_addr_lookup() during rpc_unpack(). The actual data transfer uses HG_BULK_PULL — the producer registers its file buffer as read-only, sends a Margo RPC (data_ready_rpc) to the consumer’s Margo server, and the consumer’s handler pulls the data directly from the producer’s buffer via RDMA. The consumer’s recv() busy-waits on margo_handle->recv_ready until data_ready_rpc() sets it to 1. No endpoint caching is used since each transfer creates and destroys the RPC handle within send().

The UCX backend uses a push model with pre-registered RDMA memory. During dyad_dtl_ucx_init(), both producer and consumer allocate a large buffer (UCX_MAX_TRANSFER_SIZE + sizeof(size_t) bytes) and register it with UCX via ucp_mem_map(). The consumer packs its buffer address (cons_buf_ptr) and the packed remote key (rkey_buf, base64-encoded) into the Flux RPC request. The producer decodes these during rpc_unpack(), unpacks the remote key via ucp_ep_rkey_unpack(), and performs a one-sided RDMA push directly into the consumer’s pre-registered memory via ucp_put_nbx(). The consumer’s recv() polls the first sizeof(ssize_t) bytes of its buffer — the producer prepends the file size as a sentinel, and a non-zero value signals that the push has started. To avoid repeated endpoint creation, the UCX backend maintains an endpoint cache (ucx_ep_cache_h) keyed by consumer connection key. The return_buffer() call is a no-op for UCX — the pre-registered buffer persists for the lifetime of the DTL handle and is only freed during finalization.

File Lookup and Hierarchical File Discovery

DYAD employs a multi-level Hierarchical File Locator (HFL) to efficiently locate files across distributed nodes without creating metadata bottlenecks when multiple I/O workers access files concurrently. The locator searches metadata through three levels in order of increasing scope and latency: the local DYAD Managed Directory (DMD), the node-local cache of the HFL key-value store, and the remaining levels of the HFL hierarchy, which distributes the search across nodes via a hierarchical key-value search tree operated by underlying Flux KVS. Workers always check local sources first, reducing redundant network searches and balancing workload across the cluster. As files are located via metadata lookups, their metadata is cached at each level so that future lookups resolve at the lowest possible level. The local DMD uses filesystem metadata for the fastest lookups, i.e., obtaining a file lock on a file descriptor; the local HFL is approximately one order of magnitude slower than the DMD but one order of magnitude faster than the global HFL, providing an efficient intermediate cache for node-local misses.

Hierarchical File Locator — four cases — Four cases illustrating the hierarchical file locator. In Case 1, both worker 1 (W1) and worker 2 (W2) are on the same node and the file is already in the local DMD — W2 finds it immediately without any network lookup because W1 had already cached it there (i.e., written it to the local DMD). In Case 2, the file was evicted from the DMD before W2’s search, so W2 falls back to the local HFL to retrieve the metadata of the file. In Case 3, W1 and W2 are on different nodes. W1 had already cached the files locally (i.e., produced the files into its local DMD), and thus W2 finds the file in the global HFL due to a miss in its local caches. In Case 4, no cache contains the file and the locator falls back to the parallel file system (PFS) for the initial fetch.

Scalable Metadata Key

DYAD maintains a mapping of <filename, owner_rank> pairs in the Flux Key-Value Store (KVS). Naively storing and searching filenames as raw strings would require byte-by-byte comparison for every lookup, which becomes expensive as the number of tracked files grows. DYAD instead uses a hierarchical key structure that enables early termination of costly string comparisons through a sequence of inexpensive hash comparisons.

Each filename is hashed multiple times using different seed values, producing a sequence of hash values. These hash values form a multi-level key hierarchy: at each level, only entries whose hash matches the query hash at that level are considered for the next level. A mismatch at any level immediately eliminates the candidate without requiring further comparison. Only when all hash values match, a full byte-by-byte string comparison is performed to rule out hash collision. This structure trades a small amount of storage overhead for a significant reduction in the number of full string comparisons, making metadata lookups scalable as the number of tracked files increases. Because each level uses an independent hash seed, the probability that two distinct filenames produce matching hashes at every level simultaneously decreases exponentially with depth, keeping the false positive rate low even for large file sets.

Two environment variables control the shape of the key hierarchy at run time, without requiring recompilation:

DYAD_KEY_BINS — the number of hash bins at each level of the hierarchy. A larger value reduces the probability of hash collisions at each level, lowering the number of false positives that proceed to deeper levels or full string comparison.
DYAD_KEY_DEPTH — the number of levels in the hash hierarchy. A greater depth means more hash comparisons are performed before a full string comparison, providing more opportunities for early termination at the cost of a slightly deeper key structure.

The combination of DYAD_KEY_BINS and DYAD_KEY_DEPTH allows users to tune the trade-off between key space size and lookup performance for their workload. The defaults (DYAD_KEY_DEPTH=2, DYAD_KEY_BINS=256) are suitable for most use cases.

The next section details how concurrent accesses to the same file by multiple workers — as in Case 1 — are coordinated. It also describes how files are transferred into local storage, which applies to Cases 2, 3, and 4 where the file is not yet present on the requesting worker’s node.

Local file access coordination via locking

To allow seamless coordination between producers and consumers of files, DYAD supports both intercepting POSIX I/O using GOTCHA and wrapper classes for C++ streams. The two methods currently operate in a mutually exclusive fashion. To assure the soundness of access on files that are visible to multiple workers including producers and consumers, we rely on file locking mechanisms.

DYAD locking protocol summary — This diagram shows the flow of each of the two independent methods. The C GOTCHA wrappers intercept POSIX-level calls while the C++ stream wrappers operate independently. In both paths the producer holds an exclusive lock for the duration of the write to protect consumers with direct file visibility from reading premature files. The consumer acquires an exclusive lock inside `dyad_consume()` to coordinate with local producers as well as to serialize the fetch among concurrent consumers.

Consumer race management — file fetched from remote producer — The exclusive lock serializes consumers so that only one performs the expensive KVS wait and data fetch while avoiding to overwrite the file. When the second consumer eventually acquires the lock, it finds the file already present (size > 0) and skips the fetch entirely. The full sequence for consumer A is: fetch metadata -> data transfer (`dyad_get_data()`) -> write to disk (`dyad_cons_store()`) -> release lock, with consumer B blocked throughout all three steps.

Consumer race management — file already local — If the file is already present locally, there is no need for fetching metadata or the file itself. In that case, the consumer releases the lock immediately and continues to perform read-only access on the file.

Fallback behavior when filesystem locking is unavailable

Filesystem locking in the C++ stream wrapper path relies on the ability to extract the underlying file descriptor from a std::basic_filebuf object via the GCC/libstdc++ internal _M_file.fd() member. This capability is detected automatically at configure time by CMake and exposed as the DYAD_HAS_STD_FSTREAM_FD compile-time flag.

When DYAD_HAS_STD_FSTREAM_FD is not defined, filesystem locking is unavailable in the C++ stream path and ctx->use_fs_locks is set to false. In this case, dyad_consume() cannot rely on file size alone to determine whether a co-located producer has finished writing — since the producer cannot lock the file it is still writing, a non-zero file size does not guarantee the file is complete. DYAD therefore falls back to always consulting the Flux KVS for synchronization regardless of file size, at the cost of additional network overhead compared to the local locking path.

The C GOTCHA wrapper path is unaffected by this fallback. Since it always has direct access to file descriptors, use_fs_locks is explicitly set to true in dyad_wrapper_init() during library initialization.

dyad_consume() is invoked on the consumer side in both interception paths: in the C GOTCHA wrapper it is called inside the open() wrapper, and in the C++ stream path it is called via open_sync() from basic_ifstream_dyad and basic_fstream_dyad constructors and their open() methods.

Similarly, dyad_produce() is invoked on the producer side in both paths: in the C GOTCHA wrapper it is called inside the close() wrapper, and in the C++ stream path it is called via close_sync() from basic_ofstream_dyad and basic_fstream_dyad destructors and their close() methods.

Data durability in the C++ stream path

Unlike the C GOTCHA wrapper, the C++ stream destructor and close() perform an additional step of flushing kernel-buffered stream data to stable storage by calling fsync() on the file descriptor extracted from the stream. This ensures all written data is durable before dyad_produce() notifies consumers, providing a stronger durability guarantee than the C GOTCHA wrapper path which does not perform an explicit fsync().

DYAD entry points and initalization

Client

DYAD initialization points — DYAD has multiple initialization entry points depending on the interception method in use. The C GOTCHA wrapper path initializes via `dyad_wrapper_init()`, which calls `dyad_ctx_init()`, which in turn calls `dyad_init_env()` to read configuration from environment variables before delegating to `dyad_init()`. The C++ stream wrapper path initializes via `dyad_stream_core::init()`, which either calls `dyad_init()` directly when given explicit parameters via `dyad_params`, or follows the same `dyad_ctx_init()` → `dyad_init_env()` chain when reading from environment variables. The Python bindings (pydyad) may call either `dyad_init()` directly or `dyad_init_env()`, depending on whether configuration is provided programmatically or via the environment. All paths ultimately converge on `dyad_init()`, which allocates and configures the DYAD context. The Flux module initialization path is shown separately in the diagram below.

Service

DYAD Flux module initialization — The DYAD Flux module follows the same `dyad_ctx_init()` → `dyad_init_env()` -> `dyad_init()` chain as the C GOTCHA wrapper (shown above), but with an additional step — `mod_main()` first calls `dyad_module_ctx_init()`, which applies any command-line argument overrides to environment variables before delegating to `dyad_ctx_init()`. The module also differs in two other key ways: it initializes with `DYAD_COMM_SEND` (producer mode) rather than `DYAD_COMM_RECV` (consumer mode), and it adopts the Flux handle provided by the broker rather than opening its own.