Getting Started with FedPilot

A step-by-step guide to setting up and running your first federated learning experiment with FedPilot.


Prerequisites

Before running FedPilot, make sure your environment meets the hardware and software requirements described in Requirements & Installation.


Getting Started with Training

Interactive Configuration Setup

FedPilot streamlines the setup process through a Make-based command-line interface. Begin by verifying your environment with:

make validate-setup

This command performs dependency checks and creates all required directories.
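
As a rough illustration of what a pre-flight check of this kind involves, the sketch below looks for command-line tools on PATH and creates a directory if it is missing. It is not FedPilot's implementation: the helper names, the tool list, and the logs directory name are assumptions chosen for the example.

```python
import shutil
import tempfile
from pathlib import Path


def find_missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def ensure_directories(root, names):
    """Create each directory under `root` if absent; return the ones created."""
    created = []
    for name in names:
        path = Path(root) / name
        if not path.exists():
            path.mkdir(parents=True)
            created.append(name)
    return created


if __name__ == "__main__":
    # Assumed tool list; adjust it to your own environment.
    missing = find_missing_tools(["make", "tmux", "python3"])
    print("Missing tools:", missing or "none")
    # A temporary directory keeps the demo from touching a real checkout.
    with tempfile.TemporaryDirectory() as root:
        print("Created:", ensure_directories(root, ["logs"]))
```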

Launching Training

Once setup is complete, initiate the interactive training session with:

make train

Example Output:

Configuration browser
========================================================
Found 796 configuration(s)


Choose mode: [prod/dev]
Enter mode (default: dev): dev

Detected 1 GPU(s).
Choose device: [cpu/gpu]
Enter device type (default: gpu): gpu
Current directory: templates

1. bert
2. cnn
3. enhanced_chunking
4. lenet
5. mobilenet
6. others
7. resnet18
8. resnet50
9. vgg16

Enter your choice (1-9) or 'q' to quit: 2

Current directory: templates/cnn

1. [Go back]
2. dir
3. label-100
4. label-20
5. label-30
...

The make train command launches an interactive configuration browser that guides you through the setup process:

  • Navigates through available configuration templates in the templates/ directory
  • Prompts for key experiment parameters (device type, federation schema, topology)
  • Automatically generates a federation ID with version tracking
  • Creates a complete config.yaml file tailored to your selections

The interactive system is designed to be beginner-friendly while providing access to all of FedPilot’s advanced features.

Configuration Files

Configuration files such as config.yaml, which is created during the make train process, contain all the parameters needed to run your experiments. The example below shows a simplified version of what a configuration file contains:

device: cpu

federation_id: '0.0.1'
federated_learning_schema: 'DecentralizedFederatedLearning'
draw_topology: false
federated_learning_topology: 'k_connect'
adjacency_matrix_file_name: 'adjacency_matrix_2.csv'
client_k_neighbors: 2
client_role: 'train'
placement_group_strategy: 'SPREAD'

random_seed: 42

learning_rate: 0.001
runtime_engine: "torch"
model_type: "cnn"
transformer_model_size: "base"
pretrained_models: false
dataset_type: "fmnist"
loss_function: "CrossEntropy"
optimizer: "sgd"

# Data distribution settings
data_distribution_kind: "20"
desired_distribution: null
dirichlet_beta: 0.1

# Aggregation strategy
aggregation_strategy: "FedAvg"
fed_avg: true
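
The topology fields in the example (federated_learning_topology: 'k_connect' with client_k_neighbors: 2) describe which clients exchange updates with each other. The sketch below builds one possible adjacency matrix for such a layout. It assumes that "k-connected" means each client is linked to its k nearest neighbors on a ring, which may differ from how FedPilot defines it; the function name is hypothetical.

```python
def k_connect_adjacency(num_clients, k):
    """Build a symmetric 0/1 adjacency matrix for a ring lattice.

    Assumption: client i is linked to its k nearest neighbours on a ring,
    k // 2 on each side, so k must be even and smaller than num_clients.
    """
    if k % 2 != 0 or not 0 < k < num_clients:
        raise ValueError("k must be even and satisfy 0 < k < num_clients")
    matrix = [[0] * num_clients for _ in range(num_clients)]
    for i in range(num_clients):
        for step in range(1, k // 2 + 1):
            j = (i + step) % num_clients
            # Links are undirected, so set both directions.
            matrix[i][j] = 1
            matrix[j][i] = 1
    return matrix


if __name__ == "__main__":
    # Print the matrix as CSV rows, one client per row.
    for row in k_connect_adjacency(num_clients=5, k=2):
        print(",".join(str(cell) for cell in row))
```

With k = 2 this reduces to a plain ring, where every client has exactly two neighbors.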

Understanding Configurations

Configurations are YAML files in the templates/ directory. They define:

  • Model: Which ML model to train (CNN, ResNet, BERT, etc.)
  • Dataset: Training data source (MNIST, CIFAR-10, etc.)
  • Topology: Communication structure (Star, Ring, K-connected)
  • Privacy: Differential privacy settings
  • Optimization: Aggregation strategies and parameters

You can modify these configuration files manually or create your own custom configurations. FedPilot includes validation checks that run before training sessions begin, which helps prevent configuration errors and ensures your experiments start with valid parameters.
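
To make the idea of pre-training validation concrete, here is a minimal sketch of two such checks: required-field presence and model/dataset modality compatibility. It operates on an already-parsed configuration dictionary and is not FedPilot's ConfigValidator. The lookup tables are assumptions for illustration (in particular, "shakespeare" as a dataset identifier is a guess), and the real lists may differ.

```python
# Assumed lookup tables for illustration; FedPilot's own lists may differ.
REQUIRED_FIELDS = ("device", "model_type", "dataset_type", "learning_rate")
MODEL_MODALITY = {"cnn": "image", "lenet": "image", "resnet18": "image", "bert": "text"}
DATASET_MODALITY = {"mnist": "image", "fmnist": "image", "shakespeare": "text"}


def check_config(config):
    """Return a list of human-readable problems found in `config`."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if config.get(name) is None
    ]
    model = MODEL_MODALITY.get(config.get("model_type"))
    dataset = DATASET_MODALITY.get(config.get("dataset_type"))
    # Only compare modalities when both names are known to the tables.
    if model and dataset and model != dataset:
        problems.append(
            f"model_type {config['model_type']!r} expects {model} data, "
            f"but dataset_type {config['dataset_type']!r} is {dataset}"
        )
    return problems


if __name__ == "__main__":
    example = {"device": "cpu", "model_type": "bert",
               "dataset_type": "mnist", "learning_rate": 0.001}
    for problem in check_config(example):
        print("ERROR:", problem)
```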

Run with Current Configuration

To execute training with an existing configuration without going through the interactive setup:

make run

Configuration Management

Create a Configuration from a Template

Explore available configuration templates without starting training:

make config

View Current Configuration

Display the active configuration:

make show-config

Show a Grouped Summary of Template Families

You can view a categorized list of available configuration templates:

make config-summary

Example Output:

Available configuration templates
========================================================
Root directory: templates
Total templates: 796

bert - 1 templates
    examples: bert_fl.yaml

cnn - 132 templates
    examples: cfl-coordinate-dp.yaml, cfl-cosine-dp.yaml, cfl-cosine-grads-dp.yaml

enhanced_chunking - 1 templates
    examples: config.yaml

lenet - 132 templates
    examples: cfl-coordinate-dp.yaml, cfl-cosine-dp.yaml, cfl-cosine-grads-dp.yaml

mobilenet - 132 templates
    examples: cfl-coordinate-dp.yaml, cfl-cosine-dp.yaml, cfl-cosine-grads-dp.yaml

others - 2 templates
    examples: shapley_lenet_test.yaml, test.yaml

resnet18 - 132 templates
    examples: cfl-coordinate-dp.yaml, cfl-cosine-dp.yaml, cfl-cosine-grads-dp.yaml

resnet50 - 132 templates
    examples: cfl-coordinate-dp.yaml, cfl-cosine-dp.yaml, cfl-cosine-grads-dp.yaml

vgg16 - 132 templates
    examples: cfl-coordinate-dp.yaml, cfl-cosine-dp.yaml, cfl-cosine-grads-dp.yaml

Use 'make config' to browse and select a specific template interactively.
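
If you want a similar per-family count in your own scripts, the following sketch counts YAML files under each top-level directory of a template tree. It only approximates the grouping shown in the summary and is not the code behind make config-summary; the function name and the demo directory layout are made up for the example.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_templates(root):
    """Count *.yaml files under each top-level directory of `root`."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*.yaml"):
        relative = path.relative_to(root)
        if len(relative.parts) > 1:  # ignore files placed directly in root
            counts[relative.parts[0]] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Build a tiny throwaway tree so the demo runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("cnn/label-20/fl.yaml", "cnn/label-50/fl.yaml", "bert/bert_fl.yaml"):
            target = Path(tmp) / name
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text("device: cpu\n")
        for family, total in count_templates(tmp).items():
            print(f"{family} - {total} template(s)")
```

Pointing count_templates at the templates/ directory of a checkout would give the counts for the real template families.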

List All Available Configurations

You can list all available configuration templates in the templates/ directory (the output is paged with the less command):

make list-configs

Validate Configuration

You can also validate a configuration file. This includes checking for required fields, as well as the dataset and model compatibility:

make validate-config

Example Output:

Validating config.yaml
========================================================
Validating config: /home/Disquiet/Desktop/fed/core/config.yaml
Checks:
  1) YAML parsing and default loading via yaml_loader
  2) Semantic validation using ConfigValidator
  3) Required field presence and missing-field warnings
  4) Model and dataset modality compatibility

Step 1/4: YAML parsing and default loading ... ok
Step 2/4: semantic validation (ConfigValidator) ... using default value for `SENSITIVITY_PERCENTAGE` which is 100
ok
Step 3/4: checking required fields and defaults ... ok (with warnings)
[WARN] The following fields are missing from config.yaml; framework defaults will be used:
  - aggregation_sample_scaling
  - client_k_neighbors
  - gpu_index
  - shapley
  - shapley_type
  - use_global_accuracy_for_noniid
  Consider running 'make fill-config' to write these defaults into the file.
Step 4/4: model/dataset modality check ... ok

Config validation complete.
Summary:
  model_type:                 'cnn'
  dataset_type:               'fmnist'
  federated_learning_schema:  'DecentralizedFederatedLearning'
  federated_learning_topology:'ring'

Session Management

FedPilot automatically creates tmux sessions for long-running experiments to ensure they continue running even if you disconnect. You can manage these sessions using:

# 1. View active sessions
make sessions                  # currently unavailable

# 2. List all available sessions
tmux list-sessions             

# 3. Attach to specific session
tmux attach -t fl-resnet-cifar-12345

# 4. View logs for specific session
tail -n 100 logs/resnet_cifar10_*.log

# 5. Kill session if needed
tmux kill-session -t fl-resnet-cifar-12345
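
Since make sessions is currently unavailable, a small script can fill the gap by filtering the output of tmux list-sessions. The sketch below assumes that FedPilot session names start with "fl-", which is inferred only from the example name above; the function names are hypothetical.

```python
import subprocess


def filter_sessions(output, prefix):
    """Keep the session names in `output` (one per line) that start with `prefix`."""
    return [name for name in output.splitlines() if name.startswith(prefix)]


def list_sessions(prefix="fl-"):
    """Return tmux session names starting with `prefix` (empty list if none)."""
    try:
        result = subprocess.run(
            # -F prints only the session name, one per line.
            ["tmux", "list-sessions", "-F", "#{session_name}"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:  # tmux is not installed
        return []
    # When no tmux server is running, stdout is empty and the result is [].
    return filter_sessions(result.stdout, prefix)


if __name__ == "__main__":
    sessions = list_sessions()
    print("\n".join(sessions) if sessions else "No matching tmux sessions.")
```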

Configuration Selection Guide

Basic Federated Learning Demonstration

Configuration: templates/cnn/label-20/encryption-free/fl.yaml

  • Model: Convolutional Neural Network (CNN)
  • Dataset: MNIST with 20% label heterogeneity
  • Use Case: Introductory experiments and system validation
  • Training Time: 5-10 minutes per federation round
  • Characteristics: Minimal configuration with standard FedAvg aggregation

Realistic Non-IID Data Distribution

Configuration: templates/lenet/label-90/encryption-free/fl.yaml

  • Model: LeNet architecture
  • Dataset: Highly heterogeneous data distribution (90% label skew)
  • Use Case: Studying federation under realistic data partitioning
  • Characteristics: 10 clients with significant statistical heterogeneity
  • Research Focus: Algorithm robustness to non-IID data challenges

Privacy-Preserving Federated Learning

Configuration: templates/resnet18/label-50/differential-privacy/fl-dp.yaml

  • Model: ResNet-18 for complex vision tasks
  • Privacy: Differential Privacy with DP-SGD optimization
  • Use Case: Applications requiring formal privacy guarantees
  • Characteristics: Noise injection and gradient clipping mechanisms
  • Compliance: Meets rigorous privacy preservation standards

Communication-Efficient Federation

Configuration: templates/mobilenet/label-50/encryption-free/cfl-cosine.yaml

  • Model: MobileNet (lightweight architecture)
  • Optimization: Cosine similarity clustering and model compression
  • Use Case: Bandwidth-constrained environments
  • Characteristics: Reduced communication overhead through intelligent chunking
  • Efficiency: Balanced trade-off between accuracy and communication costs

Advanced Clustering Analysis

Configuration: templates/resnet50/label-50/encryption-free/cfl-euclidean.yaml

  • Model: ResNet-50 for high-performance vision tasks
  • Methodology: Data-driven clustering with Euclidean distance metrics
  • Use Case: Investigating client similarity and cluster formation
  • Characteristics: Multiple clustering rounds with detailed analysis output
  • Research Value: Insights into data distribution and client relationships

Supported Models, Datasets, and Distribution Levels

Models

Model      Type   Params  Use Case
CNN        Image  ~200K   Quick testing, baseline
LeNet      Image  ~60K    Fast training, embedded
ResNet-18  Image  ~11M    Standard baseline
ResNet-50  Image  ~25M    Realistic tasks
VGG-16     Image  ~138M   Large-scale tasks
MobileNet  Image  ~4M     Edge devices, compression
ViT-Small  Image  ~22M    Vision transformers
BERT       NLP    ~110M   Language tasks

Datasets

Dataset        Type   Classes  Samples
MNIST          Image  10       70K
Fashion-MNIST  Image  10       70K
CIFAR-10       Image  10       60K
CIFAR-100      Image  100      60K
FMNIST         Image  10       70K
Shakespeare    Text   80       4M characters
BBC News       Text   5        2.2K docs

Data Distribution Levels

  • IID (Uniform): All clients have the same class distribution
  • 20/50/90: Non-IID level (lower = more heterogeneous)
  • Dir (Dirichlet): The beta parameter (dirichlet_beta) controls the skew; smaller values give more heterogeneous clients
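
To show how a Dirichlet parameter shapes the split, the sketch below partitions sample indices across clients class by class, drawing each class's client shares from a symmetric Dirichlet distribution. This is the commonly used label-skew recipe written with the standard library only. It is an illustration under that assumption, not necessarily how FedPilot partitions data, and the function names are hypothetical.

```python
import random


def dirichlet_proportions(num_clients, beta, rng):
    """Sample one Dirichlet(beta, ..., beta) vector via normalised gamma draws."""
    draws = [rng.gammavariate(beta, 1.0) for _ in range(num_clients)]
    total = sum(draws)
    if total == 0.0:  # every draw underflowed; fall back to an even split
        return [1.0 / num_clients] * num_clients
    return [draw / total for draw in draws]


def partition_by_label(labels, num_clients, beta, seed=42):
    """Assign every sample index to exactly one client, class by class."""
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        shares = dirichlet_proportions(num_clients, beta, rng)
        start = 0
        for client, share in enumerate(shares):
            if client == num_clients - 1:
                end = len(indices)  # last client takes the remainder
            else:
                end = min(len(indices), start + round(share * len(indices)))
            clients[client].extend(indices[start:end])
            start = end
    return clients


if __name__ == "__main__":
    labels = [i % 10 for i in range(1000)]  # synthetic 10-class labels
    for beta in (0.1, 100.0):
        sizes = [len(part) for part in partition_by_label(labels, 5, beta)]
        print(f"beta={beta}: samples per client = {sizes}")
```

Small beta values concentrate each class on a few clients, while large values approach an even, IID-like split.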

Monitoring and Analysis

Training Logs

Access training logs:

make logs

Example Output:

FedPilot training logs and metrics
========================================================

Available runs:
  1) run_20251207_183858
  0) Cancel
Select a run [0-1]: 1

Selected run: logs/run_20251207_183858

Contents of this run directory:
client-0-communication-metrics.csv  client-3-round-metrics.csv          client-7-memory-metrics.csv
client-0-convergence-metrics.csv    client-3-system-metrics.csv         client-7-performance-metrics.csv
client-0-memory-metrics.csv         client-4-communication-metrics.csv  client-7-round-metrics.csv
client-0-performance-metrics.csv    client-4-convergence-metrics.csv    client-7-system-metrics.csv
client-0-round-metrics.csv          client-4-memory-metrics.csv         client-8-communication-metrics.csv
client-0-system-metrics.csv         client-4-performance-metrics.csv    client-8-convergence-metrics.csv
client-1-communication-metrics.csv  client-4-round-metrics.csv          client-8-memory-metrics.csv
client-1-memory-metrics.csv         client-4-system-metrics.csv         client-8-performance-metrics.csv
client-1-performance-metrics.csv    client-5-communication-metrics.csv  client-8-round-metrics.csv
client-1-round-metrics.csv          client-5-convergence-metrics.csv    client-8-system-metrics.csv
client-1-system-metrics.csv         client-5-memory-metrics.csv         client-9-communication-metrics.csv
client-2-communication-metrics.csv  client-5-performance-metrics.csv    client-9-memory-metrics.csv
client-2-convergence-metrics.csv    client-5-round-metrics.csv          client-9-performance-metrics.csv
client-2-memory-metrics.csv         client-5-system-metrics.csv         client-9-round-metrics.csv
client-2-performance-metrics.csv    client-6-communication-metrics.csv  client-9-system-metrics.csv
client-2-round-metrics.csv          client-6-memory-metrics.csv         config.yaml
client-2-system-metrics.csv         client-6-performance-metrics.csv    topology-manager-memory-metrics.csv
client-3-communication-metrics.csv  client-6-round-metrics.csv          topology-manager-performance-metrics.csv
client-3-convergence-metrics.csv    client-6-system-metrics.csv         topology-manager-system-metrics.csv
client-3-memory-metrics.csv         client-7-communication-metrics.csv
client-3-performance-metrics.csv    client-7-convergence-metrics.csv

Metric groups available:
  1) Client metrics
  2) Topology-manager metrics
  0) Cancel
Select metric group [0-2]: 
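
For ad hoc analysis outside the interactive viewer, the per-client files in a run directory can be grouped by client with a short script. The sketch below relies only on the file-naming pattern visible in the listing above (client-<id>-<kind>-metrics.csv). The function name is hypothetical, and nothing is assumed about the CSV columns.

```python
import re
from collections import defaultdict

# Matches names such as "client-3-round-metrics.csv".
CLIENT_FILE = re.compile(r"^client-(\d+)-([a-z]+)-metrics\.csv$")


def group_client_metrics(filenames):
    """Map client id -> sorted list of metric kinds found for that client."""
    groups = defaultdict(list)
    for name in filenames:
        match = CLIENT_FILE.match(name)
        if match:  # skips config.yaml and topology-manager files
            groups[int(match.group(1))].append(match.group(2))
    return {client: sorted(kinds) for client, kinds in sorted(groups.items())}


if __name__ == "__main__":
    sample = [
        "client-0-round-metrics.csv",
        "client-0-convergence-metrics.csv",
        "client-1-round-metrics.csv",
        "config.yaml",
    ]
    for client, kinds in group_client_metrics(sample).items():
        print(f"client {client}: {', '.join(kinds)}")
```

Passing the names from a real run directory (for example, those returned by os.listdir) makes gaps easy to spot, such as a client that has no convergence file.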

Ray Dashboard Monitoring

Ray dashboard runs on http://localhost:8266.
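
If the dashboard page does not load, a quick way to tell whether anything is listening on that port is a plain TCP connection attempt. The sketch below uses only the standard library. The helper name is hypothetical, and a successful connection only shows that some service answers on the port, not that it is Ray.

```python
import socket


def is_port_open(host="localhost", port=8266, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or host not resolvable
        return False


if __name__ == "__main__":
    state = "reachable" if is_port_open() else "not reachable"
    print(f"localhost:8266 is {state}")
```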


Troubleshooting

Common Issues

Ray Connection Error:

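Issue: “`--address` is a required flag unless starting a head node with `--head`.”

Issue: “ConnectionError: Ray is trying to start at <ip address>, but is already running at <ip address>. Please specify a different port using the `--port` flag of `ray start` command.”

Both errors usually point to a stale or missing local Ray instance. Check the cluster status, stop any running Ray processes, and start a fresh head node: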

ray status
ray stop --force
ray start --head

GPU Not Detected:

Issue: “AssertionError: Torch not compiled with CUDA enabled”

Issue: “RuntimeError: CUDA error: no CUDA-capable device is detected”

First, make sure your GPU is detected and that the NVIDIA CUDA Toolkit is installed:

nvidia-smi # Make sure it shows the correct information for your GPU

Then install PyTorch with CUDA support:

# Install PyTorch with CUDA 12.1 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Or for CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Verify CUDA
python -c "import torch; print(torch.cuda.get_device_name(0))"

Finally, verify that your GPU is detected:

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA devices: {torch.cuda.device_count()}")

Configuration Errors:

Execute the following command to diagnose potential issues in the configuration file:

make validate

Example Output:

✓ Configuration file exists
Validating YAML syntax...
✓ Valid YAML syntax
Checking required fields...
  ❌ Missing required field: device
  ✓ random_seed: 42
  ✓ learning_rate: 0.001
  ✓ model_type: cnn

If you are familiar with the configuration parameters, you can fix these issues by editing the existing configuration file. Alternatively, generate a new configuration file with the commands below:

make clean-config
make config

Other common validation errors:

# ERROR: Invalid model/dataset combination
model_type: "bert"
dataset_type: "mnist"        # BERT expects text data

# ERROR: Incompatible topology
federated_learning_schema: "traditional"
federated_learning_topology: "ring"  # Ring requires decentralized

# ERROR: Invalid aggregation
aggregation_strategy: "custom"       # Not implemented

# ERROR: Incomplete DP config
dp_enabled: true
dp_epsilon: null            # Must specify epsilon
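
Two of the errors above are cross-field rules: a ring topology needs a decentralized schema, and enabling differential privacy requires an epsilon value. The sketch below expresses just those two rules over a parsed configuration dictionary. It is an illustration rather than FedPilot's validator. The function name is hypothetical, and matching the schema by the substring "decentralized" is an assumption.

```python
def find_rule_violations(config):
    """Check two cross-field rules taken from the examples above."""
    violations = []
    schema = str(config.get("federated_learning_schema", "")).lower()
    topology = config.get("federated_learning_topology")
    # Rule 1: a ring topology needs a decentralized schema.
    if topology == "ring" and "decentralized" not in schema:
        violations.append("ring topology requires a decentralized schema")
    # Rule 2: enabling differential privacy requires an epsilon value.
    if config.get("dp_enabled") and config.get("dp_epsilon") is None:
        violations.append("dp_enabled is true but dp_epsilon is not set")
    return violations


if __name__ == "__main__":
    example = {
        "federated_learning_schema": "traditional",
        "federated_learning_topology": "ring",
        "dp_enabled": True,
        "dp_epsilon": None,
    }
    for violation in find_rule_violations(example):
        print("ERROR:", violation)
```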

Memory Issues:

If training terminates unexpectedly, it may have run out of memory. Lowering the batch sizes and the number of clients in config.yaml reduces resource usage:

train_batch_size: 16
test_batch_size: 16
number_of_clients: 2
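
When trying several reduced settings in a row, it can be convenient to apply such overrides from a script instead of editing the file by hand. The sketch below rewrites top-level key: value lines in YAML text and appends keys that are missing. It is a deliberately simple, line-based helper with a hypothetical name: it does not parse YAML, it ignores nested keys, and it drops any inline comment on a line it replaces.

```python
import re


def override_scalars(yaml_text, overrides):
    """Replace top-level `key: value` lines; append keys that are absent."""
    lines = yaml_text.splitlines()
    remaining = dict(overrides)
    for index, line in enumerate(lines):
        # Top-level keys start at column 0 and are followed by a colon.
        match = re.match(r"^([A-Za-z_][\w-]*):", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            lines[index] = f"{key}: {remaining.pop(key)}"
    lines.extend(f"{key}: {value}" for key, value in remaining.items())
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    original = 'train_batch_size: 64\nmodel_type: "cnn"\n'
    reduced = {"train_batch_size": 16, "test_batch_size": 16, "number_of_clients": 2}
    print(override_scalars(original, reduced), end="")
```

Running make validate-config on the rewritten file afterwards is a sensible way to confirm that it is still valid.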

Ready to dive deeper? Check out the Configuration Guide.