DYAD System Design
High-level Overview
Comparison of a producer and consumer pair sharing a file with and without DYAD
Without DYAD, producer and consumer share data via a shared file system (e.g., Lustre) with explicit synchronization through dependent jobs, APIs, or polling. With DYAD, they use node-local storage with the abstraction of shared visibility. DYAD transparently intercepts I/O and transfers data across local storages while coordinating accesses through KVS and RDMA-based data movement, eliminating both the shared file system bottleneck and the need for explicit synchronization.
DYAD Overview: Synchronization and File Transfer
DYAD data flow between producer and consumer nodes. Numbers indicate the order of steps and correspond to the labeled arrows in the diagram.
DYAD consists of two components: a **wrapper** library that is injected
into the application process via LD_PRELOAD or C++ stream wrapper
classes to transparently intercept I/O calls, and a **service** that
runs as a plugin module to the Flux broker on each node. Together they
coordinate file access and transfer between nodes.
The wrapper intercepts I/O only for files in the directory it manages, or DYAD managed directory (DMD). If the file shared between producer and consumer under DMD resides on the producer’s local storage that is not visible to the consumer, DYAD both synchronizes accesses and transfers files to the DMD on the consumer’s local storage. If it is visible to both parties (e.g., residing on a shared storage), DYAD only synchronizes accesses.
Producer path
When a file is written into the DMD or any of its subdirectories, the following steps occur in order (see p.1–p.2 in the diagram):
p.1
write(managed_dir/filename)— the application writes the file to local storage. The wrapper intercepts this call on file close to perform necessary steps.p.2
publish(<filename, producer_rank>)— the wrapper registers the metadata in the DYAD hierarchical file locator (HFL) that relies on the global key-value store (KVS) managed by Flux. The metadata is a(filename, rank)pair, whererankis the rank of the Flux broker on the node where the file owner has local visibility to the file. In rare cases, there can be multiple brokers running on each node and local coordination protects accesses.
Consumer path
When a consumer application opens a file in the DMD, the following steps occur in order (see c.1–c.3 in the diagram):
c.1
query(filename) → producer_rank— the wrapper queries the HFL to obtain the rank of the file owner (producer). If the entry is not yet present, the consumer blocks until it receives a metadata that matches the key it queried. The HFL service responds to the posted query when it sees the matching entry, which propagates from the producer’s local HFL through the hierarchy.c.2
rpc_get(producer_rank, filename)— the wrapper sends an RPC to the producer’s DYAD service requesting the file. The service transfers the file over the selected DTL backend (Flux RPC, Margo, or UCX) and stores it on the consumer’s local storage.c.3
read(managed_dir/filename)— the application reads the file from local storage as if it had always been there, with no knowledge of the transfer or synchronization that took place.
Data Transport Layer (DTL)
The following diagrams show the sequence of DTL function calls for each
backend, starting from dyad_get_data() on the consumer side through
to the producer service.
General DTL sequence
The abstract DTL interface sequence starting from dyad_get_data().
All backends follow this same call order — the implementation of each
step differs per backend.
The buffer allocated by get_buffer() serves different purposes on
each side of the transfer. On the producer side, the file is read from
local storage into the buffer, which is then passed to send() to
transfer the data to the consumer. On the consumer side, the buffer
receives the incoming data via recv(), after which
dyad_cons_store() writes the buffer contents to the consumer’s local
managed directory. From that point on, the application can read the file
through its normal I/O path — whether via the GOTCHA wrapper or the C++
stream wrapper — without any knowledge of the transfer or synchronization
that took place. To the application, the file simply appears in local
storage as if it had always been there.
Flux RPC backend
The Flux RPC backend routes all data through the Flux broker message
bus. Several steps are no-ops since the streaming RPC connection is
implicit. The stored flux_msg_t* pointer routes flux_respond_raw()
responses back to the correct consumer.
Margo backend
The Margo backend uses an inverted model — the consumer runs a Margo
server and the producer connects back to it. Data is transferred via
HG_BULK_PULL: the producer registers its buffer and triggers the
consumer’s data_ready_rpc() handler to pull the data via RDMA.
UCX backend
The UCX backend uses a push model with pre-registered RDMA memory.
The producer performs a one-sided ucp_put_nbx() directly into the
consumer’s pre-registered buffer. The consumer polls a size sentinel
to detect arrival. Endpoints are cached across transfers.
DTL backend comparison
All three DTL backends share the same abstract interface — the same
sequence of rpc_pack(), rpc_unpack(), rpc_respond(),
rpc_recv_response(), establish_connection(), get_buffer(),
send(), recv(), return_buffer(), and close_connection()
calls — but implement them very differently depending on the underlying
transport.
The Flux RPC backend is the simplest. Data travels entirely through
the Flux broker message bus: the producer calls flux_respond_raw()
with the file contents and the consumer reads them via
flux_rpc_get_raw(). No RDMA, no external library, no connection
setup — the Flux streaming RPC handles everything. The original request
message (flux_msg_t*) is stored in the DTL handle after
rpc_unpack() so that send() can route the response back to the
correct consumer. Several steps are no-ops: rpc_respond(),
rpc_recv_response(), and both establish_connection() calls do
nothing because the streaming RPC connection is implicit. This backend
requires only Flux and is the most portable, but it routes all data
through the broker and cannot exploit RDMA hardware.
The Margo backend uses an inverted client-server model. The consumer
initializes Margo in MARGO_SERVER_MODE during dyad_dtl_margo_init()
so that the producer can connect back to it. The consumer embeds its own
Margo server address (margo_addr_to_string()) in the Flux RPC request
payload, which the producer extracts and resolves via margo_addr_lookup()
during rpc_unpack(). The actual data transfer uses HG_BULK_PULL —
the producer registers its file buffer as read-only, sends a Margo RPC
(data_ready_rpc) to the consumer’s Margo server, and the consumer’s
handler pulls the data directly from the producer’s buffer via RDMA. The
consumer’s recv() busy-waits on margo_handle->recv_ready until
data_ready_rpc() sets it to 1. No endpoint caching is used since each
transfer creates and destroys the RPC handle within send().
The UCX backend uses a push model with pre-registered RDMA memory.
During dyad_dtl_ucx_init(), both producer and consumer allocate a
large buffer (UCX_MAX_TRANSFER_SIZE + sizeof(size_t) bytes) and register
it with UCX via ucp_mem_map(). The consumer packs its buffer address
(cons_buf_ptr) and the packed remote key (rkey_buf, base64-encoded)
into the Flux RPC request. The producer decodes these during rpc_unpack(),
unpacks the remote key via ucp_ep_rkey_unpack(), and performs a
one-sided RDMA push directly into the consumer’s pre-registered memory
via ucp_put_nbx(). The consumer’s recv() polls the first
sizeof(ssize_t) bytes of its buffer — the producer prepends the file
size as a sentinel, and a non-zero value signals that the push has started.
To avoid repeated endpoint creation, the UCX backend maintains an endpoint
cache (ucx_ep_cache_h) keyed by consumer connection key. The
return_buffer() call is a no-op for UCX — the pre-registered buffer
persists for the lifetime of the DTL handle and is only freed during
finalization.
File Lookup and Hierarchical File Discovery
DYAD employs a multi-level Hierarchical File Locator (HFL) to efficiently locate files across distributed nodes without creating metadata bottlenecks when multiple I/O workers access files concurrently. The locator searches metadata through three levels in order of increasing scope and latency: the local DYAD Managed Directory (DMD), the node-local cache of the HFL key-value store, and the remaining levels of the HFL hierarchy, which distributes the search across nodes via a hierarchical key-value search tree operated by underlying Flux KVS. Workers always check local sources first, reducing redundant network searches and balancing workload across the cluster. As files are located via metadata lookups, their metadata is cached at each level so that future lookups resolve at the lowest possible level. The local DMD uses filesystem metadata for the fastest lookups, i.e., obtaining a file lock on a file descriptor; the local HFL is approximately one order of magnitude slower than the DMD but one order of magnitude faster than the global HFL, providing an efficient intermediate cache for node-local misses.
Four cases illustrating the hierarchical file locator. In Case 1, both worker 1 (W1) and worker 2 (W2) are on the same node and the file is already in the local DMD — W2 finds it immediately without any network lookup because W1 had already cached it there (i.e., written it to the local DMD). In Case 2, the file was evicted from the DMD before W2’s search, so W2 falls back to the local HFL to retrieve the metadata of the file. In Case 3, W1 and W2 are on different nodes. W1 had already cached the files locally (i.e., produced the files into its local DMD), and thus W2 finds the file in the global HFL due to a miss in its local caches. In Case 4, no cache contains the file and the locator falls back to the parallel file system (PFS) for the initial fetch.
Scalable Metadata Key
DYAD maintains a mapping of <filename, owner_rank> pairs in the Flux
Key-Value Store (KVS). Naively storing and searching filenames as raw
strings would require byte-by-byte comparison for every lookup, which
becomes expensive as the number of tracked files grows. DYAD instead uses
a hierarchical key structure that enables early termination of costly
string comparisons through a sequence of inexpensive hash comparisons.
Each filename is hashed multiple times using different seed values, producing a sequence of hash values. These hash values form a multi-level key hierarchy: at each level, only entries whose hash matches the query hash at that level are considered for the next level. A mismatch at any level immediately eliminates the candidate without requiring further comparison. Only when all hash values match, a full byte-by-byte string comparison is performed to rule out hash collision. This structure trades a small amount of storage overhead for a significant reduction in the number of full string comparisons, making metadata lookups scalable as the number of tracked files increases. Because each level uses an independent hash seed, the probability that two distinct filenames produce matching hashes at every level simultaneously decreases exponentially with depth, keeping the false positive rate low even for large file sets.
Two environment variables control the shape of the key hierarchy at run time, without requiring recompilation:
DYAD_KEY_BINS— the number of hash bins at each level of the hierarchy. A larger value reduces the probability of hash collisions at each level, lowering the number of false positives that proceed to deeper levels or full string comparison.DYAD_KEY_DEPTH— the number of levels in the hash hierarchy. A greater depth means more hash comparisons are performed before a full string comparison, providing more opportunities for early termination at the cost of a slightly deeper key structure.
The combination of DYAD_KEY_BINS and DYAD_KEY_DEPTH allows users
to tune the trade-off between key space size and lookup performance for
their workload. The defaults (DYAD_KEY_DEPTH=2,
DYAD_KEY_BINS=256) are suitable for most use cases.
The next section details how concurrent accesses to the same file by multiple workers — as in Case 1 — are coordinated. It also describes how files are transferred into local storage, which applies to Cases 2, 3, and 4 where the file is not yet present on the requesting worker’s node.
Local file access coordination via locking
To allow seamless coordination between producers and consumers of files, DYAD supports both intercepting POSIX I/O using GOTCHA and wrapper classes for C++ streams. The two methods currently operate in a mutually exclusive fashion. To assure the soundness of access on files that are visible to multiple workers including producers and consumers, we rely on file locking mechanisms.
This diagram shows the flow of each of the two independent methods.
The C GOTCHA wrappers intercept POSIX-level calls while the C++ stream
wrappers operate independently. In both paths the producer holds an
exclusive lock for the duration of the write to protect consumers with
direct file visibility from reading premature files.
The consumer acquires an exclusive lock inside dyad_consume() to
coordinate with local producers as well as to serialize the fetch
among concurrent consumers.
The exclusive lock serializes consumers so that only one performs the
expensive KVS wait and data fetch while avoiding to overwrite the file.
When the second consumer eventually acquires the lock, it finds the file
already present (size > 0) and skips the fetch entirely.
The full sequence for consumer A is: fetch metadata ->
data transfer (dyad_get_data()) -> write to disk (dyad_cons_store())
-> release lock, with consumer B blocked throughout all three steps.
If the file is already present locally, there is no need for fetching metadata or the file itself. In that case, the consumer releases the lock immediately and continues to perform read-only access on the file.
Data durability in the C++ stream path
Unlike the C GOTCHA wrapper, the C++ stream destructor and close()
perform an additional step of flushing kernel-buffered stream data to stable
storage by calling fsync() on the file descriptor extracted from the
stream. This ensures all written data is durable before dyad_produce()
notifies consumers, providing a stronger durability guarantee than the C
GOTCHA wrapper path which does not perform an explicit fsync().
DYAD entry points and initalization
Client
DYAD has multiple initialization entry points depending on the interception
method in use. The C GOTCHA wrapper path initializes via
dyad_wrapper_init(), which calls dyad_ctx_init(), which in turn
calls dyad_init_env() to read configuration from environment variables
before delegating to dyad_init(). The C++ stream wrapper path
initializes via dyad_stream_core::init(), which either calls
dyad_init() directly when given explicit parameters via
dyad_params, or follows the same dyad_ctx_init() →
dyad_init_env() chain when reading from environment variables.
The Python bindings (pydyad) may call either dyad_init() directly
or dyad_init_env(), depending on whether configuration is provided
programmatically or via the environment. All paths ultimately converge
on dyad_init(), which allocates and configures the DYAD context.
The Flux module initialization path is shown separately in the diagram
below.
Service
The DYAD Flux module follows the same dyad_ctx_init() →
dyad_init_env() -> dyad_init() chain as the C GOTCHA wrapper
(shown above), but with an additional step — mod_main() first calls
dyad_module_ctx_init(), which applies any command-line argument
overrides to environment variables before delegating to
dyad_ctx_init(). The module also differs in two other key ways: it
initializes with DYAD_COMM_SEND (producer mode) rather than
DYAD_COMM_RECV (consumer mode), and it adopts the Flux handle
provided by the broker rather than opening its own.