Overview

This page provides an overview of the Federated Learning module in MEDomicsLab, offering insights into both the application's interface and the backend package employed for conducting experiments.

Introduction

The Federated Learning Module in MEDomicsLab simulates the process of federated learning and allows for training models in a decentralized manner using multiple datasets. This approach preserves privacy and enhances data security by ensuring that data never leaves its original location.

Key Aspects of the Federated Learning Module:

  • Decentralized Training: Models are trained across multiple nodes without transferring raw data.

  • Privacy Preservation: Techniques such as differential privacy ensure data confidentiality.

  • Hyperparameter Optimization: Tools to automatically tune and optimize model hyperparameters for improved performance.

  • Transfer Learning: Pre-trained models can be used to initialize the central server, improving model performance.
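
As a concrete illustration of the privacy-preservation idea, here is a minimal sketch of an epsilon-differentially-private mean query (illustrative only; it is not MEDfl's implementation, and the function name and parameters are hypothetical):

```python
import math
import random

def dp_mean(values, epsilon, value_range):
    """Differentially private mean: the true mean plus Laplace noise
    calibrated to the query's sensitivity (value_range / n) and epsilon.
    Hypothetical sketch -- not MEDfl's actual implementation."""
    n = len(values)
    sensitivity = value_range / n       # removing one record shifts the mean by at most this
    scale = sensitivity / epsilon       # Laplace scale for epsilon-DP
    u = random.random() - 0.5           # inverse-CDF sampling of Laplace(0, scale)
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return sum(values) / n + noise

random.seed(0)
ages = [34, 51, 29, 62, 45]
private_mean = dp_mean(ages, epsilon=1.0, value_range=100)
```

A smaller epsilon means more noise and stronger privacy; in federated learning the same idea is applied to model updates rather than raw statistics.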

Video Tutorial

MEDfl package

The Federated Learning module in the MEDomicsLab application uses MEDfl as its backend: a standalone Python package designed for simulating federated learning.

You can also use MEDfl independently from the app to create your networks and pipelines directly with code. Below is a brief example demonstrating how to do that.

# Install MEDfl
pip install MEDfl

# Import MEDfl
import MEDfl

# ... import the rest of the dependencies here
# (base_url, config and server_rounds are assumed to be defined elsewhere)

# Create a network
Net_1 = Network(name="Auto_Net")
Net_1.create_network()

# Create a MasterDataset from Net_1
Net_1.create_master_dataset()

# Create the FL setup
autoFl = FLsetup(name="Flsetup_1",
                 description="The first FL setup",
                 network=Net_1)
autoFl.create()

# Create a node and upload its dataset
hospital = Node(name="hospital_1", train=1)
Net_1.add_node(hospital)
hospital.upload_dataset("hospital_1_dataset",
                        base_url + "/notebooks/data/nodesData/output_1.csv")

# Create the federated dataset
fl_dataset = autoFl.create_federated_dataset(
    output="deceased",
    fit_encode=[],
    to_drop=["deceased"],
)

# Load the pre-trained model
global_model = Model.load_model(
    "../../notebooks/.ipynb_checkpoints/trainedModels/grid_search_classifier.pth")

# Create the aggregation strategy
aggreg_algo = Strategy(config["aggreg_algo"],
                       fraction_fit=1.0,
                       fraction_evaluate=1.0,
                       min_fit_clients=2,
                       min_evaluate_clients=2,
                       min_available_clients=2,
                       initial_parameters=global_model.get_parameters())
aggreg_algo.create_strategy()

# Create the server
server = FlowerServer(global_model,
                      strategy=aggreg_algo,
                      num_rounds=server_rounds,
                      num_clients=len(fl_dataset.trainloaders),
                      fed_dataset=fl_dataset,
                      diff_privacy=config["dp_activate"],
                      # You can change the resources allocated to each
                      # client based on your machine
                      client_resources={"num_cpus": 1.0, "num_gpus": 0.0})

# Create the pipeline
ppl_1 = FLpipeline(name="fl_pipeline_1",
                   description="Our first FL pipeline",
                   server=server)

# Run the training of the model
history = ppl_1.server.run()

# Test the model
report = ppl_1.auto_test()

For more detailed examples, you can check the tutorials on the GitHub repository.

Application Interface

The interface of the MEDfl module in the MEDomicsLab application provides a user-friendly space where you can visually manage and connect multiple nodes to create your federated learning pipelines. Each node type in the interface has a specific role and attributes, allowing you to build and customize your federated learning networks seamlessly.

The role, input, and output of each node are described below:

The Dataset Node is where you specify the master dataset for your experiment. The master dataset is used differently based on the type of network you create:

  • Auto Network: The master dataset is split between the different created nodes based on a specified column value.

  • Manual Network: The master dataset is used to validate the schema of the dataset selected for each node.

To select a master dataset, click on the "Select Dataset" button, choose the file, and specify the target of the dataset.

Input: none
Output: Dataset
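
The Auto Network split described above can be pictured as grouping the master dataset's rows by a column value. Below is a minimal sketch under that assumption (the function and column names are hypothetical, not MEDfl's API):

```python
from collections import defaultdict

def split_master_dataset(rows, split_column):
    """Group the master dataset's rows by the value of `split_column`,
    yielding one sub-dataset per node. Illustrative sketch of the
    Auto Network split -- not MEDfl's actual implementation."""
    per_node = defaultdict(list)
    for row in rows:
        per_node[row[split_column]].append(row)
    return dict(per_node)

# Toy master dataset: two hospitals identified by a column value.
master = [
    {"hospital_id": "1", "age": "63", "deceased": "0"},
    {"hospital_id": "2", "age": "47", "deceased": "1"},
    {"hospital_id": "1", "age": "55", "deceased": "0"},
]
node_datasets = split_master_dataset(master, "hospital_id")
```

Each resulting group plays the role of one node's local dataset, so no record is shared between nodes.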

The Network Node is responsible for creating the federated network. A new screen will appear when you click on it, displaying additional node types: the Client Node and the Server Node. You will have the option to add multiple clients and a central server that will aggregate the results.

Input: Dataset
Output: Network

The FL Setup Node is responsible for configuring the federated learning setup. The user only needs to specify the name and description of the setup.

Input: Network
Output: FL setup

The FL Dataset Node creates the federated dataset, which generates train, test, and validation loaders from the clients' datasets. To create a federated dataset, the user must specify two parameters:

  • Validation fraction: The fraction of the data used for validation.

  • Test fraction: The fraction of the data used for testing.

Input: FL setup
Output: FL dataset
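
The two fractions above amount to a three-way split of the sample indices. Here is an illustrative sketch (`make_splits` is a hypothetical helper, not MEDfl's API):

```python
import random

def make_splits(n_samples, valid_frac, test_frac, seed=42):
    """Partition sample indices into train/validation/test sets
    according to the two fractions. Hypothetical sketch of what a
    federated dataset builder might do -- not MEDfl's actual code."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_frac)
    n_valid = int(n_samples * valid_frac)
    test = indices[:n_test]
    valid = indices[n_test:n_test + n_valid]
    train = indices[n_test + n_valid:]
    return train, valid, test

# 20% test, 10% validation, remaining 70% for training.
train_idx, valid_idx, test_idx = make_splits(100, valid_frac=0.1, test_frac=0.2)
```

In the federated setting this split is performed per client, and the resulting index sets back the train, validation, and test loaders.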

The Model Node is responsible for creating the model that initializes the federated learning process. The user has several options based on whether they activate or deactivate transfer learning:

  • Transfer Learning Activated: Specify a pre-trained model and additional parameters such as optimizer and learning rate.

  • Transfer Learning Deactivated: Choose between two options:

    1. Use custom models provided by MEDfl, specifying parameters like the number of layers, hidden size, optimizer, and learning rate. Optionally, parameters can be filled using results from a hyperparameter optimization experiment.

    2. Create a model from scratch using a code editor to define a custom model.

Input: FL dataset
Output: Model
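
The "number of layers" and "hidden size" parameters of the custom models can be pictured as follows (an illustrative sketch; `mlp_layer_sizes` is a hypothetical helper, not part of MEDfl):

```python
def mlp_layer_sizes(input_dim, hidden_size, num_layers, output_dim=1):
    """Compute the (in_features, out_features) pairs of a simple MLP
    from the 'number of layers' and 'hidden size' parameters exposed
    by the Model Node. Hypothetical sketch, not MEDfl's actual model."""
    sizes = [input_dim] + [hidden_size] * num_layers + [output_dim]
    return list(zip(sizes[:-1], sizes[1:]))

# Two hidden layers of size 32 on 12 input features, one output unit.
layers = mlp_layer_sizes(input_dim=12, hidden_size=32, num_layers=2)
```

Each pair corresponds to one linear layer; a deep-learning framework would turn this list into the actual network.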

The Optimize Node is responsible for hyperparameter optimization. Users can optimize hyperparameters using the following methods:

  • Grid Search Optimization: A straightforward method where the user specifies lists of hyperparameter values to try, such as the number of layers, hidden size, and others.

  • Optuna Central Optimization: Utilizes Optuna to optimize parameters on the central server. Users can specify Optuna parameters like metric, direction, optimization algorithm, and intervals for each hyperparameter.

  • Optuna Federated Optimization: Uses Optuna for hyperparameter optimization in a federated manner. Optimization occurs during the execution of the federated pipeline, adapting parameters based on distributed data.

For more details on Optuna, you can find additional information here.

Input: Model + Dataset
Output: Model
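
Grid search itself is simple enough to sketch in a few lines (a generic sketch of the idea, not MEDfl's optimizer; the names and toy scoring function are hypothetical):

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Exhaustively evaluate every hyperparameter combination and
    return the best-scoring one. Generic sketch of the Grid Search
    option -- not MEDfl's actual optimizer."""
    best_params, best_score = None, float("-inf")
    for combo in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

grid = {"num_layers": [1, 2, 3], "hidden_size": [16, 32]}
# Toy scoring function standing in for a real validation metric.
best_params, best_score = grid_search(
    grid, lambda p: -abs(p["num_layers"] - 2) + p["hidden_size"] / 100)
```

The Optuna-based options replace this exhaustive loop with guided sampling of the search space.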

The FL Strategy Node is responsible for creating the server strategy to aggregate and manage the network it contains. This includes defining:

  • Aggregation Algorithm: A list of algorithms from the Flower package.

  • fraction_fit: The fraction of clients sampled for training the model.

  • fraction_evaluate: The fraction of clients sampled for model evaluation (validation).

  • min_fit_clients: The minimum number of clients sampled for training in each round.

  • min_evaluate_clients: The minimum number of clients sampled for evaluation in each round.

  • min_available_clients: The minimum required number of available clients to initiate a federation round.

Input: Model
Output: FL strategy
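
These parameters interact roughly as follows when a strategy decides how many clients to sample for a training round (a sketch of Flower-style logic; `num_fit_clients` here is a hypothetical stand-in, see the Flower documentation for the authoritative behavior):

```python
def num_fit_clients(num_available, fraction_fit, min_fit_clients,
                    min_available_clients):
    """How many clients a Flower-style strategy samples for a training
    round. Hypothetical sketch of the logic behind these parameters,
    not Flower's or MEDfl's actual code."""
    if num_available < min_available_clients:
        # Not enough connected clients to start a federation round.
        raise RuntimeError("Not enough clients available to start the round")
    sampled = int(num_available * fraction_fit)
    # Never sample fewer clients than min_fit_clients.
    return max(sampled, min_fit_clients)

# With 10 available clients, fraction_fit=0.5 samples 5 clients.
n = num_fit_clients(10, fraction_fit=0.5, min_fit_clients=2,
                    min_available_clients=2)
```

The evaluation-side parameters (fraction_evaluate, min_evaluate_clients) follow the same pattern for validation rounds.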

The Train Model Node is used to define the client resources for training, specifying whether to utilize GPU or CPU resources during the training process.

Input: FL strategy
Output: Train results

The Save Results Node saves the training results of the pipeline to a file.

Input: Train results
Output: Save results

This node merges two or more results files into a single file.

Input: Save results
Output: none