Step 3: Prepare ML tables
Feb 12 – Feb 26 | Prepare ML tables
If you completed Step 2 - Extract Data, you have data ready for Step 3 - Prepare ML tables.
However, before proceeding to Step 3 - Prepare ML tables, we recommend that you replace your own output data from Step 2 - Extract Data (the extracted_features folder) with the data that we prepared for you (MEDomicsLab_TestingPhase_Step3.zip). This will ensure consistency of results across all participants of the Testing Phase.
An invitation to access the MEDomicsLab_TestingPhase_Step3.zip data was sent by email.
Step 3 - Prepare ML tables is divided into five parts. It uses the features extracted in Step 2 - Extract Data of the Testing Phase to prepare Machine Learning tables, as follows:
Reduce Extracted Features: Use the Input Module to reduce the large CSV files obtained from the previous step via Principal Component Analysis (PCA) and Spearman correlation.
Merge All Data: Combine the reduced extracted features with demographic embeddings into a master CSV table using the MEDprofiles package. Then create MEDprofiles from the master table.
Visualize Data: Use the MEDprofiles figure to visualize the data.
Define Static Time Points: Use the MEDprofiles figure to set static time points and export the data as static CSV tables.
Create Learning and Holdout Sets: Use the Input Module to generate Learning and Holdout sets.
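As an illustration of the last part, here is a minimal sketch of a Learning/Holdout split written outside the application with pandas and NumPy. It is not the Input Module's implementation. The function name `split_learning_holdout`, the column names, the 20% holdout fraction, and the choice to stratify on the outcome are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def split_learning_holdout(table, target, holdout_fraction=0.2, seed=0):
    """Split a table into Learning and Holdout sets, keeping class proportions.

    Hypothetical helper: stratifying on `target` is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    holdout_idx = []
    # Draw the same fraction of rows from each outcome class.
    for _, group in table.groupby(target):
        n_holdout = int(round(len(group) * holdout_fraction))
        picked = rng.choice(group.index.to_numpy(), size=n_holdout, replace=False)
        holdout_idx.extend(picked.tolist())
    holdout = table.loc[sorted(holdout_idx)]
    learning = table.drop(index=holdout_idx)
    return learning, holdout


# Toy master table with hypothetical column names: 80 negatives, 20 positives.
table = pd.DataFrame({
    "patient_id": range(100),
    "feature": np.linspace(0.0, 1.0, 100),
    "outcome": [0] * 80 + [1] * 20,
})
learning, holdout = split_learning_holdout(table, "outcome")
```

The Holdout set is set aside and only used for the final evaluation, while all model development happens on the Learning set.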
The goal of defining static time points is to simulate a longitudinal CDSS (Clinical Decision Support System) scenario using data aggregated over time. In Step 5 - Create Model of the Testing Phase, we will attempt to identify the point in time at which we reach sufficient predictive power (the point in time when, in real life, we could potentially intervene).
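The idea of a static time point can be sketched in a few lines of pandas. How the MEDprofiles figure aggregates data at each time point is not described here, so this example simply assumes a mean over everything observed up to a cutoff. The column names, the cutoffs (24, 48 and 72 hours) and the helper `static_table` are hypothetical.

```python
import pandas as pd

# Hypothetical longitudinal table: one row per patient measurement.
events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "hours_since_admission": [2, 20, 50, 5, 30],
    "feature_a": [1.0, 3.0, 5.0, 2.0, 4.0],
})


def static_table(events, cutoff_hours):
    """Aggregate everything observed up to the cutoff into one row per patient."""
    seen = events[events["hours_since_admission"] <= cutoff_hours]
    return seen.groupby("patient_id")[["feature_a"]].mean()


# One static table per time point; each one only uses data available by then.
tables = {
    f"T{k}": static_table(events, cutoff)
    for k, cutoff in enumerate([24, 48, 72], start=1)
}
```

Each table is a snapshot of what would be known at that moment, so models trained on successive tables show how predictive power evolves as more data accumulates.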
Before proceeding with Step 3 - Prepare ML tables of the MEDomicsLab Testing Phase, we recommend consulting the documentation of the Input Module.
Input Module
Reminder: Make sure to save your datasets when updating column names by pressing the 'Save' button icon (an example is shown at 16:08 in the video above).
If you do not press the 'Save' button icon after modifying a CSV file in the app, the changes will not be applied in your workspace.
Content
Intro 0:00
Reduce extracted features 0:50
Merge all our data 8:24
Visualize MEDprofiles 10:52
Define static time points 12:01
Create learning and holdout sets 14:13
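We acknowledge that using Spearman correlation with the target variable to massively reduce the feature set dimension on the whole dataset is not part of best practices in machine learning, because the selection then sees the outcomes of samples that are later used for evaluation (data leakage).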
This Spearman correlation process, if needed as a feature set reduction method, should normally be performed "on-the-fly" on the training sets of the Learning set (and ideally, the PCA process too).
Here, we decided to use Spearman correlation on the whole dataset during the Reduce extracted features process to get around some difficulties we have in handling large feature sets in downstream processes.
However, please note that we are actively working on enhancing the scalability of our application to eliminate the need to apply Spearman correlation on the whole dataset in the future.
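For readers who want to see what the "on-the-fly" alternative looks like, here is a minimal sketch using only pandas and NumPy. It is not what the application currently does. The synthetic data, the 5-fold split, the value k=5, and the helper names `spearman_with_target` and `select_top_features` are assumptions made for this example. Features are ranked by absolute Spearman correlation with the target using the training rows of each fold only.

```python
import numpy as np
import pandas as pd


def spearman_with_target(X, y):
    """Spearman correlation of each column of X with y.

    Spearman correlation equals the Pearson correlation of the ranks.
    """
    return X.rank().corrwith(y.rank())


def select_top_features(X_train, y_train, k):
    """Return the k features most correlated (in absolute value) with the target."""
    scores = spearman_with_target(X_train, y_train).abs()
    return scores.nlargest(k).index.tolist()


# Synthetic Learning set: 200 samples, 50 noise features, one made informative.
rng = np.random.default_rng(0)
n = 200
y = pd.Series(rng.integers(0, 2, size=n), name="outcome")
X = pd.DataFrame(rng.normal(size=(n, 50)), columns=[f"f{i}" for i in range(50)])
X["f0"] = X["f0"] + 3 * y

# 5-fold cross-validation: selection is redone inside every fold.
folds = np.array_split(rng.permutation(n), 5)
selected_per_fold = []
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    selected = select_top_features(X.iloc[train_idx], y.iloc[train_idx], k=5)
    selected_per_fold.append(selected)
    # A model would be fit on X.iloc[train_idx][selected]
    # and evaluated on X.iloc[val_idx][selected].
```

Because the validation rows never contribute to the correlation scores, the performance estimated on them is not inflated by the feature selection step.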