Getting Started with FedPilot
A step-by-step guide to setting up and running your first federated learning experiment with FedPilot.
Prerequisites
Before running FedPilot, make sure your environment meets the hardware and software requirements described in Requirements & Installation.
Getting Started with Training
Interactive Configuration Setup
FedPilot streamlines the setup process through a Make-based command-line interface. Begin by verifying your environment with:
make validate-setup
This command performs dependency checks and creates all required directories.
Launching Training
Once setup is complete, initiate the interactive training session with:
make train
Show example interactive output and explanation
Example Output:
Configuration browser
========================================================
Found 796 configuration(s)
Choose mode: [prod/dev]
Enter mode (default: dev): dev
Detected 1 GPU(s).
Choose device: [cpu/gpu]
Enter device type (default: gpu): gpu
Current directory: templates
1. bert
2. cnn
3. enhanced_chunking
4. lenet
5. mobilenet
6. others
7. resnet18
8. resnet50
9. vgg16
Enter your choice (1-9) or 'q' to quit: 2
Current directory: templates/cnn
1. [Go back]
2. dir
3. label-100
4. label-20
5. label-30
...
This command launches an interactive configuration browser that guides you through the setup process:
- Navigates through available configuration templates in the templates/ directory
- Prompts for key experiment parameters (device type, federation schema, topology)
- Automatically generates a federation ID with version tracking
- Creates a complete config.yaml file tailored to your selections
The interactive system is designed to be beginner-friendly while providing access to all of FedPilot’s advanced features.
Configuration Files
Configuration files such as the config.yaml created during the make train process contain all the parameters needed to run your experiments. The example below shows a simplified version of what a configuration file contains:
Show example configuration
device: cpu
federation_id: '0.0.1'
federated_learning_schema: 'DecentralizedFederatedLearning'
draw_topology: false
federated_learning_topology: 'k_connect'
adjacency_matrix_file_name: 'adjacency_matrix_2.csv'
client_k_neighbors: 2
client_role: 'train'
placement_group_strategy: 'SPREAD'
random_seed: 42
learning_rate: 0.001
runtime_engine: "torch"
model_type: "cnn"
transformer_model_size: "base"
pretrained_models: false
dataset_type: "fmnist"
loss_function: "CrossEntropy"
optimizer: "sgd"
# Data distribution settings
data_distribution_kind: "20"
desired_distribution: null
dirichlet_beta: 0.1
# Aggregation strategy
aggregation_strategy: "FedAvg"
fed_avg: true
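The topology fields above (federated_learning_topology: 'k_connect', client_k_neighbors: 2, adjacency_matrix_file_name) describe which clients exchange models with which neighbors. As a minimal sketch, assuming a symmetric ring where each client links to its k nearest neighbors on each side — FedPilot's own construction may differ:

```python
# Illustrative sketch of the adjacency matrix a 'k_connect' topology
# implies: client i connects to its k nearest ring neighbors on each
# side. This is not FedPilot's actual topology code.

def k_connect_adjacency(num_clients, k):
    """Symmetric 0/1 adjacency matrix for a k-connected ring."""
    matrix = [[0] * num_clients for _ in range(num_clients)]
    for i in range(num_clients):
        for offset in range(1, k + 1):
            j = (i + offset) % num_clients
            matrix[i][j] = matrix[j][i] = 1
    return matrix

if __name__ == "__main__":
    # k=1 over 6 clients yields a plain ring: every client has 2 neighbors.
    for row in k_connect_adjacency(6, 1):
        print(row)
```

With client_k_neighbors: 2 every client talks to four peers, which trades extra communication for faster information spread across the federation.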
Understanding Configurations
Configurations are YAML files in the templates/ directory. They define:
- Model: Which ML model to train (CNN, ResNet, BERT, etc.)
- Dataset: Training data source (MNIST, CIFAR-10, etc.)
- Topology: Communication structure (Star, Ring, K-connected)
- Privacy: Differential privacy settings
- Optimization: Aggregation strategies and parameters
You can modify these configuration files manually or create your own custom configurations. FedPilot includes validation checks that run before training sessions begin, which helps prevent configuration errors and ensures your experiments start with valid parameters.
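The aggregation_strategy: "FedAvg" setting selects the classic federated averaging rule: the server (or each neighborhood, in decentralized mode) combines client models as a sample-count-weighted average. A minimal sketch of what FedAvg computes, with plain lists standing in for tensors — not FedPilot's implementation:

```python
# Sketch of the FedAvg aggregation rule: a sample-count-weighted
# average of client parameter vectors.

def fed_avg(client_weights, client_sample_counts):
    """Weighted average of per-client parameter vectors."""
    total = sum(client_sample_counts)
    aggregated = [0.0] * len(client_weights[0])
    for weights, n in zip(client_weights, client_sample_counts):
        for i, w in enumerate(weights):
            aggregated[i] += w * (n / total)
    return aggregated

if __name__ == "__main__":
    clients = [[1.0, 2.0], [3.0, 4.0]]
    counts = [100, 300]              # client 2 holds 3x the data
    print(fed_avg(clients, counts))  # -> [2.5, 3.5]
```

Weighting by sample count means clients with more data pull the global model further toward their local optimum, which is why the non-IID settings below matter so much.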
Run with Current Configuration
To execute training with an existing configuration without going through the interactive setup:
make run
Configuration Management
Create a configuration from template
Explore available configuration templates without starting training:
make config
View Current Configuration
Display the active configuration:
make show-config
Show grouped summary of template families
You can view a categorized list of available configuration templates:
make config-summary
Show example output of config-summary
Available configuration templates
========================================================
Root directory: templates
Total templates: 796
bert - 1 templates
examples: bert_fl.yaml
cnn - 132 templates
examples: cfl-coordinate-dp.yaml, cfl-cosine-dp.yaml, cfl-cosine-grads-dp.yaml
enhanced_chunking - 1 templates
examples: config.yaml
lenet - 132 templates
examples: cfl-coordinate-dp.yaml, cfl-cosine-dp.yaml, cfl-cosine-grads-dp.yaml
mobilenet - 132 templates
examples: cfl-coordinate-dp.yaml, cfl-cosine-dp.yaml, cfl-cosine-grads-dp.yaml
others - 2 templates
examples: shapley_lenet_test.yaml, test.yaml
resnet18 - 132 templates
examples: cfl-coordinate-dp.yaml, cfl-cosine-dp.yaml, cfl-cosine-grads-dp.yaml
resnet50 - 132 templates
examples: cfl-coordinate-dp.yaml, cfl-cosine-dp.yaml, cfl-cosine-grads-dp.yaml
vgg16 - 132 templates
examples: cfl-coordinate-dp.yaml, cfl-cosine-dp.yaml, cfl-cosine-grads-dp.yaml
Use 'make config' to browse and select a specific template interactively.
List All Available Configurations
You can list all available configuration templates in the templates/ directory (output is paged with less):
make list-configs
Validate Configuration
You can also validate a configuration file. Validation checks for required fields and verifies that the chosen dataset and model are compatible:
make validate-config
Show example validation output and common errors
Example Output:
Validating config.yaml
========================================================
Validating config: /home/Disquiet/Desktop/fed/core/config.yaml
Checks:
1) YAML parsing and default loading via yaml_loader
2) Semantic validation using ConfigValidator
3) Required field presence and missing-field warnings
4) Model and dataset modality compatibility
Step 1/4: YAML parsing and default loading ... ok
Step 2/4: semantic validation (ConfigValidator) ... using default value for `SENSITIVITY_PERCENTAGE` which is 100
ok
Step 3/4: checking required fields and defaults ... ok (with warnings)
[WARN] The following fields are missing from config.yaml; framework defaults will be used:
- aggregation_sample_scaling
- client_k_neighbors
- gpu_index
- shapley
- shapley_type
- use_global_accuracy_for_noniid
Consider running 'make fill-config' to write these defaults into the file.
Step 4/4: model/dataset modality check ... ok
Config validation complete.
Summary:
model_type: 'cnn'
dataset_type: 'fmnist'
federated_learning_schema: 'DecentralizedFederatedLearning'
federated_learning_topology: 'ring'
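Step 3 of the validator compares the config against a list of required fields and warns when framework defaults will be substituted. A sketch of that kind of check — the required set here is illustrative, not FedPilot's authoritative list:

```python
# Sketch of a required-field check like the one `make validate-config`
# runs. REQUIRED_FIELDS is an illustrative subset, not FedPilot's list.

REQUIRED_FIELDS = {"device", "model_type", "dataset_type",
                   "federated_learning_schema", "learning_rate"}

def missing_fields(config: dict):
    """Return required keys absent from the parsed config, sorted."""
    return sorted(REQUIRED_FIELDS - config.keys())

if __name__ == "__main__":
    config = {"model_type": "cnn", "dataset_type": "fmnist",
              "learning_rate": 0.001}
    print(missing_fields(config))  # -> ['device', 'federated_learning_schema']
```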
Session Management
FedPilot automatically creates tmux sessions for long-running experiments to ensure they continue running even if you disconnect. You can manage these sessions using:
# 1. View active sessions (currently unavailable)
make sessions
# 2. List all available sessions
tmux list-sessions
# 3. Attach to specific session
tmux attach -t fl-resnet-cifar-12345
# 4. View logs for specific session
tail -100 logs/resnet_cifar10_*.log
# 5. Kill session if needed
tmux kill-session -t fl-resnet-cifar-12345
Configuration Selection Guide
Recommended Starting Configurations
Basic Federated Learning Demonstration
Configuration: templates/cnn/label-20/encryption-free/fl.yaml
- Model: Convolutional Neural Network (CNN)
- Dataset: MNIST with 20% label heterogeneity
- Use Case: Introductory experiments and system validation
- Training Time: 5-10 minutes per federation round
- Characteristics: Minimal configuration with standard FedAvg aggregation
Realistic Non-IID Data Distribution
Configuration: templates/lenet/label-90/encryption-free/fl.yaml
- Model: LeNet architecture
- Dataset: Highly heterogeneous data distribution (90% label skew)
- Use Case: Studying federation under realistic data partitioning
- Characteristics: 10 clients with significant statistical heterogeneity
- Research Focus: Algorithm robustness to non-IID data challenges
Privacy-Preserving Federated Learning
Configuration: templates/resnet18/label-50/differential-privacy/fl-dp.yaml
- Model: ResNet-18 for complex vision tasks
- Privacy: Differential Privacy with DP-SGD optimization
- Use Case: Applications requiring formal privacy guarantees
- Characteristics: Noise injection and gradient clipping mechanisms
- Compliance: Meets rigorous privacy preservation standards
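The two mechanisms this configuration relies on — gradient clipping and noise injection — can be sketched in a few lines. Parameter names (clip_norm, noise_multiplier) are illustrative, not FedPilot's; this shows the DP-SGD idea, not its implementation:

```python
import math
import random

# Sketch of the two DP-SGD mechanisms: clip each gradient to an L2
# bound, then add Gaussian noise scaled to that bound. Parameter names
# here are illustrative.

def clip_and_noise(grad, clip_norm=1.0, noise_multiplier=1.1, rng=random):
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [g * scale for g in grad]         # now ||clipped|| <= clip_norm
    sigma = noise_multiplier * clip_norm
    return [g + rng.gauss(0.0, sigma) for g in clipped]

if __name__ == "__main__":
    # [3, 4] has L2 norm 5, so it is scaled down to ~[0.6, 0.8] before noise.
    print(clip_and_noise([3.0, 4.0], clip_norm=1.0))
```

Clipping bounds any single example's influence; the noise then hides that influence, which is what yields the formal (epsilon, delta) guarantee.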
Communication-Efficient Federation
Configuration: templates/mobilenet/label-50/encryption-free/cfl-cosine.yaml
- Model: MobileNet (lightweight architecture)
- Optimization: Cosine similarity clustering and model compression
- Use Case: Bandwidth-constrained environments
- Characteristics: Reduced communication overhead through intelligent chunking
- Efficiency: Balanced trade-off between accuracy and communication costs
Advanced Clustering Analysis
Configuration: templates/resnet50/label-50/encryption-free/cfl-euclidean.yaml
- Model: ResNet-50 for high-performance vision tasks
- Methodology: Data-driven clustering with Euclidean distance metrics
- Use Case: Investigating client similarity and cluster formation
- Characteristics: Multiple clustering rounds with detailed analysis output
- Research Value: Insights into data distribution and client relationships
Show supported models, datasets, and distribution levels
Models
| Model | Type | Params | Use Case |
|---|---|---|---|
| CNN | Image | ~200K | Quick testing, baseline |
| LeNet | Image | ~60K | Fast training, embedded |
| ResNet-18 | Image | ~11M | Standard baseline |
| ResNet-50 | Image | ~25M | Realistic tasks |
| VGG-16 | Image | ~138M | Large-scale tasks |
| MobileNet | Image | ~4M | Edge devices, compression |
| ViT-Small | Image | ~22M | Vision transformers |
| BERT | NLP | ~110M | Language tasks |
Datasets
| Dataset | Type | Classes | Samples |
|---|---|---|---|
| MNIST | Image | 10 | 70K |
| Fashion-MNIST (fmnist) | Image | 10 | 70K |
| CIFAR-10 | Image | 10 | 60K |
| CIFAR-100 | Image | 100 | 60K |
| Shakespeare | Text | 80 | 4M characters |
| BBC News | Text | 5 | 2.2K docs |
Data Distribution Levels
- IID (Uniform): All clients have same class distribution
- 20/50/90: Non-IID level (higher = more label skew across clients)
- Dir (Dirichlet): Beta parameter controls distribution
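For the Dirichlet option, each class is split across clients according to a Dirichlet(beta) draw (the dirichlet_beta field in the config, e.g. 0.1): small beta concentrates a class on a few clients, large beta approaches uniform. A sketch using the standard gamma-normalization identity — illustrative only, not FedPilot's partitioner:

```python
import random

# Sketch of Dirichlet label partitioning: for one class, draw a
# Dirichlet(beta) vector deciding what fraction of that class each
# client receives. Smaller beta -> more skew. Dirichlet samples are
# produced by normalizing independent Gamma(beta, 1) draws.

def dirichlet_shares(num_clients, beta, rng=random):
    draws = [rng.gammavariate(beta, 1.0) for _ in range(num_clients)]
    total = sum(draws)
    return [d / total for d in draws]

if __name__ == "__main__":
    random.seed(42)
    print("beta=0.1:", [round(p, 3) for p in dirichlet_shares(4, 0.1)])
    print("beta=10 :", [round(p, 3) for p in dirichlet_shares(4, 10.0)])
```

Running this a few times makes the effect visible: beta=0.1 typically gives one client almost the whole class, while beta=10 spreads it nearly evenly.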
Monitoring and Analysis
Training Logs
Access training logs:
make logs
Show example log output and explanation
FedPilot training logs and metrics
========================================================
Available runs:
1) run_20251207_183858
0) Cancel
Select a run [0-1]: 1
Selected run: logs/run_20251207_183858
Contents of this run directory:
client-0-communication-metrics.csv client-3-round-metrics.csv client-7-memory-metrics.csv
client-0-convergence-metrics.csv client-3-system-metrics.csv client-7-performance-metrics.csv
client-0-memory-metrics.csv client-4-communication-metrics.csv client-7-round-metrics.csv
client-0-performance-metrics.csv client-4-convergence-metrics.csv client-7-system-metrics.csv
client-0-round-metrics.csv client-4-memory-metrics.csv client-8-communication-metrics.csv
client-0-system-metrics.csv client-4-performance-metrics.csv client-8-convergence-metrics.csv
client-1-communication-metrics.csv client-4-round-metrics.csv client-8-memory-metrics.csv
client-1-memory-metrics.csv client-4-system-metrics.csv client-8-performance-metrics.csv
client-1-performance-metrics.csv client-5-communication-metrics.csv client-8-round-metrics.csv
client-1-round-metrics.csv client-5-convergence-metrics.csv client-8-system-metrics.csv
client-1-system-metrics.csv client-5-memory-metrics.csv client-9-communication-metrics.csv
client-2-communication-metrics.csv client-5-performance-metrics.csv client-9-memory-metrics.csv
client-2-convergence-metrics.csv client-5-round-metrics.csv client-9-performance-metrics.csv
client-2-memory-metrics.csv client-5-system-metrics.csv client-9-round-metrics.csv
client-2-performance-metrics.csv client-6-communication-metrics.csv client-9-system-metrics.csv
client-2-round-metrics.csv client-6-memory-metrics.csv config.yaml
client-2-system-metrics.csv client-6-performance-metrics.csv topology-manager-memory-metrics.csv
client-3-communication-metrics.csv client-6-round-metrics.csv topology-manager-performance-metrics.csv
client-3-convergence-metrics.csv client-6-system-metrics.csv topology-manager-system-metrics.csv
client-3-memory-metrics.csv client-7-communication-metrics.csv
client-3-performance-metrics.csv client-7-convergence-metrics.csv
Metric groups available:
1) Client metrics
2) Topology-manager metrics
0) Cancel
Select metric group [0-2]:
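Each per-client CSV above can be post-processed with the standard library. The column names in this sketch ("round", "accuracy") are assumptions for illustration — check the header of your own client-*-round-metrics.csv files before adapting it:

```python
import csv
import io

# Sketch of post-processing a per-client round-metrics CSV: find the
# round with the best accuracy. Column names are assumed, not taken
# from FedPilot's actual schema.

def best_round(csv_text):
    """Return (round, accuracy) for the highest-accuracy round."""
    rows = csv.DictReader(io.StringIO(csv_text))
    best = max(rows, key=lambda r: float(r["accuracy"]))
    return int(best["round"]), float(best["accuracy"])

if __name__ == "__main__":
    sample = "round,accuracy\n1,0.61\n2,0.74\n3,0.72\n"
    print(best_round(sample))  # -> (2, 0.74)
```

For a real run, replace the inline string with `open("logs/run_.../client-0-round-metrics.csv")`.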
Ray Dashboard Monitoring
The Ray dashboard is available at http://localhost:8266.
Troubleshooting
Show troubleshooting guide
Common Issues
Ray Connection Error:
Issue: “`--address` is a required flag unless starting a head node with `--head`.”
Issue: “ConnectionError: Ray is trying to start at <ip address>, but is already running at <ip address>. Please specify a different port using the `--port` flag of `ray start` command.”
ray status
ray stop --force
ray start --head
GPU Not Detected:
Issue: “AssertionError: Torch not compiled with CUDA enabled”
Issue: “RuntimeError: CUDA error: no CUDA-capable device is detected”
First, make sure your GPU is detected and that the NVIDIA CUDA Toolkit is installed:
nvidia-smi  # Make sure it shows the correct information for your GPU
Then install PyTorch with CUDA Support:
# Install PyTorch with CUDA 12.1 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Or for CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Verify CUDA
python -c "import torch; print(torch.cuda.get_device_name(0))"
Finally, verify that your GPU is detected:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA devices: {torch.cuda.device_count()}")
Configuration Errors:
Execute the following command to diagnose potential issues in the configuration file:
make validate
Example output
✓ Configuration file exists
Validating YAML syntax...
✓ Valid YAML syntax
Checking required fields...
❌ Missing required field: device
✓ random_seed: 42
✓ learning_rate: 0.001
✓ model_type: cnn
If you are familiar with the configuration parameters, you may attempt to resolve these issues by modifying the existing configuration file. Alternatively, you may generate a new configuration file by executing the commands below.
make clean-config
make config
Other common validation errors:
# ERROR: Invalid model/dataset combination
model_type: "bert"
dataset_type: "mnist" # BERT expects text data
# ERROR: Incompatible topology
federated_learning_schema: "traditional"
federated_learning_topology: "ring" # Ring requires decentralized
# ERROR: Invalid aggregation
aggregation_strategy: "custom" # Not implemented
# ERROR: Incomplete DP config
dp_enabled: true
dp_epsilon: null # Must specify epsilon
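The first of these errors is what the validator's "model/dataset modality check" (Step 4/4 in the example output) catches. A sketch of such a check, assuming a simple modality lookup — the tables here are illustrative, not FedPilot's internal mapping:

```python
# Sketch of a model/dataset modality compatibility check. The modality
# tables are illustrative; FedPilot's internal mapping may differ.

MODEL_MODALITY = {"cnn": "image", "lenet": "image", "resnet18": "image",
                  "resnet50": "image", "vgg16": "image",
                  "mobilenet": "image", "bert": "text"}
DATASET_MODALITY = {"mnist": "image", "fmnist": "image",
                    "cifar10": "image", "cifar100": "image",
                    "shakespeare": "text"}

def modality_compatible(model_type, dataset_type):
    """True when model and dataset expect the same input modality."""
    return MODEL_MODALITY.get(model_type) == DATASET_MODALITY.get(dataset_type)

if __name__ == "__main__":
    print(modality_compatible("cnn", "fmnist"))  # -> True
    print(modality_compatible("bert", "mnist"))  # -> False: BERT expects text
```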
Memory Issues:
If the program terminates unexpectedly (for example, the OS kills it for exhausting memory), reducing the resource requirements may resolve the issue:
train_batch_size: 16
test_batch_size: 16
number_of_clients: 2
Ready to dive deeper? Check out the Configuration Guide