Step 4: Explore Data
Feb 26 – Mar 11 | Explore Data
If you completed Step 3 - Prepare ML tables, you have data ready for Step 4 - Explore Data.
However, before proceeding to Step 4 - Explore Data, we recommend that you replace your own output data from Step 3 - Prepare ML tables (the MEDprofiles/timePoints folder) with the data that we prepared for you (MEDomicsLab_TestingPhase_Step4.zip). This will ensure consistency of results across all participants of the Testing Phase.
An invitation to access the MEDomicsLab_TestingPhase_Step4.zip data was sent by email.
MEDomicsLab_TestingPhase_Step4.zip also includes a new_demographic_embeddings CSV file that you will need for Step 4 - Explore Data. The rationale for providing this new file is explained in the "Set the demographics in T1 only" section (5:41) of the video.
Step 4 - Explore Data is divided into seven parts. It explores the learning set obtained from Step 3 - Prepare ML tables of the Testing Phase as follows:
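For readers who want to picture what the "Set the demographics in T1 only" operation does to the tables, here is a minimal pandas sketch. It is not the Input Module's implementation: the `subject_id` key, the `demographic_` column prefix, and the `T1`/`T2` table names are hypothetical stand-ins, and the real CSV files may be laid out differently.

```python
import pandas as pd


def set_demographics_in_t1_only(time_points, new_demographics, key, demo_prefix):
    """Drop demographic columns from every time point, then attach the new ones to T1.

    `key` and `demo_prefix` are assumptions about how the tables are laid out.
    """
    cleaned = {}
    for name, table in time_points.items():
        demo_cols = [c for c in table.columns if c.startswith(demo_prefix)]
        cleaned[name] = table.drop(columns=demo_cols)
    # Left join keeps every T1 row even if a subject has no new embedding.
    cleaned["T1"] = cleaned["T1"].merge(new_demographics, on=key, how="left")
    return cleaned


# Toy tables standing in for the time point CSV files (values are made up).
t1 = pd.DataFrame({"subject_id": [1, 2], "demographic_0": [0.1, 0.2], "notes_0": [1.0, 2.0]})
t2 = pd.DataFrame({"subject_id": [1, 2], "demographic_0": [0.1, 0.2], "notes_0": [3.0, 4.0]})
new_demo = pd.DataFrame({"subject_id": [1, 2], "demographic_0": [0.5, 0.6]})

result = set_demographics_in_t1_only(
    {"T1": t1, "T2": t2}, new_demo, key="subject_id", demo_prefix="demographic_"
)
print(result["T1"])
print(result["T2"])
```

In this sketch the same function would be applied to both the learning and the holdout tables, so that the two sets keep identical columns.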
1. Analyze the learning set using YData profiling: Use the YData profiling tool from the Exploratory Module to examine your learning set. Record the percentages of missing values for each class across all time points.
2. Set demographic embeddings in T1 only: Based on the insights from part 1, remove demographic embeddings from all time point CSV files, in both the learning and holdout sets. Consolidate all demographic data into the T1 time point using the new_demographic_embeddings CSV file. Perform this operation in the Input Module.
3. Remove chart events from T1: Based on the analysis in part 1, remove chart events from the T1 datasets (both learning and holdout) using the Input Module.
4. Transform procedure events: Based on the findings from part 1, transform the procedure events columns in all time point datasets (learning and holdout sets) using the Input Module.
5. Analyze the learning set using D-Tale: Use the D-Tale tool from the Exploratory Module to examine your learning set. Explore the inter-variable correlation matrices for each time point.
6. Analyze the learning set using SweetViz: Use the SweetViz tool from the Exploratory Module to study your learning set. Identify sets of highly correlated variables, taking into account the observations made with D-Tale.
7. Remove highly correlated columns from the time point datasets: In the Input Module, remove the variables identified in part 6 as highly correlated from the time point datasets (learning and holdout sets).
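The missing-value figures recorded in the first part can also be cross-checked outside the platform. The following is a minimal pandas sketch, assuming that "class" refers to a family of columns sharing a name prefix. The prefixes `demographic_` and `chartevents_` and the toy values are hypothetical, so adapt them to the column names in your own time point CSV files.

```python
import pandas as pd


def missing_pct_by_class(table, class_prefixes):
    """Return the percentage of missing cells for each class of columns.

    `class_prefixes` maps a readable class name to an assumed column prefix.
    """
    summary = {}
    for class_name, prefix in class_prefixes.items():
        cols = [c for c in table.columns if c.startswith(prefix)]
        if not cols:
            continue  # this class is absent from the table
        # Mean of the boolean "is missing" mask over all cells of the class.
        summary[class_name] = float(table[cols].isna().to_numpy().mean() * 100)
    return summary


# Toy table standing in for one time point of the learning set.
toy = pd.DataFrame({
    "demographic_0": [0.1, None, 0.3, 0.4],
    "demographic_1": [0.5, None, 0.7, 0.8],
    "chartevents_0": [None, None, None, 1.0],
    "target": [0, 1, 0, 1],
})

summary = missing_pct_by_class(
    toy, {"demographics": "demographic_", "chart events": "chartevents_"}
)
print(summary)
```

Running the same function on each time point file gives one small table of percentages per time point, which is the kind of record the first part asks for.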
Please note that the CSV files for the time points obtained from Step 3 - Prepare ML tables are already tagged. This is done by the MEDprofiles Module when exporting the data as time points.
Before proceeding with Step 4 - Explore Data of the MEDomicsLab Testing Phase, we recommend consulting the documentation of the Input Module and Exploratory Module.
Please consider the warnings mentioned on the Input Module page: we are continuously working on enhancing the MEDomicsLab platform.
Content
Intro 0:00
YData-profiling 0:54
Set the demographics in T1 only 5:41
Remove chart events from T1 12:04
Transform procedure events 14:00
D-Tale 18:50
SweetViz 23:24
Remove Highly Correlated Columns 28:26
Please note that the last part, 'Remove Highly Correlated Columns', is optional because it can be time-consuming. We are working on improving the Delete Columns tool in the Input Module to speed up this procedure in the future. Even if you choose to skip this part, we will provide you with our own output data from Step 4 - Explore Data for use in Step 5 - Create Model.
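For readers curious about what removing highly correlated columns amounts to, here is a minimal pandas sketch. It is not the Delete Columns tool itself: the 0.9 threshold, the rule of dropping the later column of each correlated pair, and the toy column names are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def highly_correlated_columns(table, threshold=0.9):
    """List columns whose absolute Pearson correlation with an earlier column exceeds `threshold`."""
    corr = table.select_dtypes("number").corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy learning and holdout tables (values are made up).
learning = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # exact multiple of "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
holdout = pd.DataFrame({"a": [6.0, 7.0], "b": [12.0, 14.0], "c": [2.0, 5.0]})

# Decide on the learning set only, then apply the same drop to both sets.
to_drop = highly_correlated_columns(learning, threshold=0.9)
learning_reduced = learning.drop(columns=to_drop)
holdout_reduced = holdout.drop(columns=to_drop)
print(to_drop)
```

Choosing the columns on the learning set and reusing that list for the holdout set keeps the two tables aligned without looking at holdout values.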
We encourage you not only to follow the video but also to independently utilize the Exploratory Module for exploring the learning set. This self-directed analysis will prove valuable in Step 8 - Challenge.