Common Tips for Debugging with DYAD

Debugging distributed operations that coordinate multiple jobs under a batch system is challenging. The tips below can help.

Build DYAD for Debugging

To facilitate debugging, DYAD provides several CMake options that can be enabled at build time.

  • For users: Enable DYAD logging support, setting DYAD_LOGGER to either FLUX or CPP_LOGGER:

    -DDYAD_LOGGER=FLUX -DDYAD_LOGGER_LEVEL=DEBUG
    
  • For developers: Treat all compiler warnings as errors:

    -DDYAD_WARNINGS_AS_ERRORS=ON
    
  • For developers: Build with Clang in Debug mode, adding AddressSanitizer flags as needed:

    -DCMAKE_C_COMPILER=clang
    -DCMAKE_CXX_COMPILER=clang++
    -DCMAKE_BUILD_TYPE=Debug
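
Putting these options together, a debug configure step might look like the sketch below. It assumes it is run from the DYAD source tree, an out-of-source build directory named build, and CMake 3.13 or newer (for -S/-B). The AddressSanitizer flags are the standard Clang -fsanitize=address options passed through CMake's CMAKE_C_FLAGS and CMAKE_CXX_FLAGS, not DYAD-specific settings.

```shell
# Sketch: debug build of DYAD with logging, warnings-as-errors, and AddressSanitizer.
# ASSUMPTION: run from the DYAD source tree; "build" is an arbitrary directory name.
# Swap FLUX for CPP_LOGGER if you prefer the C++ logger.
cmake -S . -B build \
    -DDYAD_LOGGER=FLUX \
    -DDYAD_LOGGER_LEVEL=DEBUG \
    -DDYAD_WARNINGS_AS_ERRORS=ON \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DCMAKE_BUILD_TYPE=Debug \
    -DCMAKE_C_FLAGS="-fsanitize=address -fno-omit-frame-pointer" \
    -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer"
cmake --build build
```

Because DYAD code may be loaded into processes that are not themselves built with AddressSanitizer, such as the Flux broker, you may need to preload the ASan runtime (for example via LD_PRELOAD) for the sanitizer to initialize.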
    

Runtime Logging

  • Enable Flux logging when starting an instance to capture DYAD logs:

    flux start -v -o,-S,log-filename=out.txt

  • When creating an allocation with flux alloc, pass broker attributes that write the log to a file, set every log-level attribute to 7 (debug), and print log messages from all brokers on the leader broker's stderr (see flux logging):

    flux alloc -N 2 \
        --broker-opts=--setattr=log-filename="$PWD/flux-${USER}.log" \
        --broker-opts=--setattr=log-level=7 \
        --broker-opts=--setattr=log-forward-level=7 \
        --broker-opts=--setattr=log-critical-level=7 \
        --broker-opts=--setattr=log-stderr-level=7 \
        --broker-opts=--setattr=log-syslog-enable=1 \
        --broker-opts=--setattr=log-stderr-mode=leader
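
Once logs are being captured, a small filter makes DYAD messages easier to find among broker output. The function below, dyad_log_grep, is a hypothetical helper and not part of DYAD or Flux: it searches a log file such as out.txt, or, with no argument, the output of flux dmesg, which prints the broker's log ring buffer. It assumes DYAD messages contain the string "dyad" in some letter case; check this against your build's log format.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DYAD or Flux): print log lines that mention DYAD.
# ASSUMPTION: DYAD log messages contain "dyad" in some letter case.
dyad_log_grep() {
    if [ -n "${1:-}" ]; then
        # Search a log file, prefixing each match with its line number.
        grep -i -n -e 'dyad' -- "$1"
    else
        # No file given: search the local broker's in-memory log ring buffer.
        flux dmesg | grep -i -e 'dyad'
    fi
}
```

For example, after the allocation above finishes, dyad_log_grep "$PWD/flux-${USER}.log" lists the DYAD-related lines with their positions in the log.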

Controlling Job Standard I/O

The following flux run options control a job's standard I/O (see flux-run):

  • Disable output buffering:

    -u, --unbuffered
    
  • Label output by rank:

    -l, --label-io
    
  • Redirect job output streams:

    --output=, --error=, --log=, --log-stderr=
    
  • Use mustache templates, such as {{id}} for the job ID, in the file names given to --output and --error for finer control over where output goes.
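
For example, the options might be combined as in this sketch, where ./consumer.sh stands in for a hypothetical DYAD consumer script and {{id}} is the job-ID tag that flux run expands in --output and --error file names.

```shell
# Unbuffered output, each line prefixed with its task rank (./consumer.sh is hypothetical).
flux run -N 2 -n 2 -u -l ./consumer.sh

# Send stdout and stderr to per-job files; Flux expands {{id}} to the job ID.
# The templates are quoted so the shell passes the braces through unchanged.
flux run -N 2 -n 2 --output='dyad-{{id}}.out' --error='dyad-{{id}}.err' ./consumer.sh
```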

Simulated Multi-Node Debugging

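To debug multi-node behavior on a single node, start a Flux instance with N simulated brokers using flux start --test-size=N. In this configuration, give each broker rank its own DYAD managed path so that each simulated node has separate storage, mimicking operations on distinct nodes.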
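
A minimal sketch of per-rank managed paths, assuming your DYAD version reads its managed directories from DYAD_PATH_PRODUCER and DYAD_PATH_CONSUMER (verify the variable names for your build). The helper dyad_use_rank_paths is hypothetical: it takes a base directory and a rank, falling back to the local broker rank reported by flux getattr rank, and points both variables at a per-rank subdirectory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: give each simulated node (broker rank) its own managed directory.
# ASSUMPTION: DYAD reads DYAD_PATH_PRODUCER / DYAD_PATH_CONSUMER; adjust names for your build.
dyad_use_rank_paths() {
    local base="$1"
    # Use the rank passed as $2; otherwise ask the local Flux broker for its rank.
    local rank="${2:-$(flux getattr rank)}"
    [ -n "$rank" ] || return 1
    local dir="${base}/node-${rank}"
    mkdir -p "$dir" || return 1
    export DYAD_PATH_PRODUCER="$dir"
    export DYAD_PATH_CONSUMER="$dir"
}
```

Inside an instance started with flux start --test-size=2, a job script could call dyad_use_rank_paths /tmp/dyad-test before launching the producer or consumer, so a job on rank 0 and a job on rank 1 see different directories.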

Common Debugging Steps

When isolating errors in DYAD-enabled applications, the following steps are recommended:

  • Verify environment variable propagation by running a script that prints all DYAD-related environment variables in place of a DYAD job.

  • Ensure environment variables are set consistently between producers and consumers.

  • Confirm that DYAD_KVS_NAMESPACE is set and that the namespace exists in the KVS:

    flux kvs namespace list

  • Clear any namespaces or files left over from previous runs.

  • Inspect logging output to identify where a DYAD consumer may be hanging or where a DYAD job may have crashed.

  • Inspect KVS entries at both the producer and consumer as needed:

    flux kvs dir -N "${DYAD_KVS_NAMESPACE}" [key]
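
The environment checks can be scripted. The functions below are hypothetical helpers, not part of DYAD: dump_dyad_env prints every DYAD_* variable visible to the calling process in sorted order, and compare_dyad_env shows the differences between two saved dumps, for example one captured on the producer side and one on the consumer side.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking DYAD environment propagation.

# Print every DYAD_* variable visible to this process, sorted so that
# dumps from different jobs can be compared line by line.
dump_dyad_env() {
    env | grep '^DYAD_' | sort
}

# Show differences between two saved dumps.
# Returns 0 when they match and non-zero otherwise (diff's exit status).
compare_dyad_env() {
    diff -u "$1" "$2"
}
```

For instance, run dump_dyad_env > "env.${FLUX_TASK_RANK}.txt" as the job command in place of both the producer and the consumer, then compare the resulting files. Some differences, such as a variable that only the producer needs, may be expected.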
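
The namespace checks can likewise be wrapped in small hypothetical helpers built on flux kvs namespace list, remove, and create. The sketch assumes the namespace name appears in the first column of flux kvs namespace list output. dyad_ns_reset removes a namespace left over from a previous run and creates it fresh, so run it only when no DYAD job is using that namespace.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of DYAD) for checking and resetting the DYAD namespace.
# ASSUMPTION: the first column of `flux kvs namespace list` holds the namespace name.

# Succeed if the namespace ($1, or $DYAD_KVS_NAMESPACE) exists; return 2 if none is set.
# A failing `flux kvs namespace list` is reported as "not found" (status 1).
dyad_ns_exists() {
    local ns="${1:-${DYAD_KVS_NAMESPACE:-}}"
    if [ -z "$ns" ]; then
        echo "DYAD_KVS_NAMESPACE is not set" >&2
        return 2
    fi
    flux kvs namespace list | awk '{print $1}' | grep -qxF -- "$ns"
}

# Remove a leftover namespace (if present) and create it again.
dyad_ns_reset() {
    local ns="${1:-${DYAD_KVS_NAMESPACE:-}}"
    if [ -z "$ns" ]; then
        echo "DYAD_KVS_NAMESPACE is not set" >&2
        return 2
    fi
    if dyad_ns_exists "$ns"; then
        flux kvs namespace remove "$ns" || return 1
    fi
    flux kvs namespace create "$ns"
}
```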