Step 4: Explore Data
Feb 26 – Mar 11 | Explore Data

Step 4 - Explore Data is divided into seven parts. It consists of exploring the learning set obtained in Step 3 - Prepare ML tables of the Testing Phase, as follows:
1. Analyze the learning set using YData profiling: Use the YData profiling tool from the Exploratory Module to examine your learning set. Record the percentages of missing values for each class across all time points.
2. Set demographic embeddings in T1 only: Based on the insights from part 1, remove the demographic embeddings from all time point CSV files, including the learning and holdout sets. Consolidate all demographic data into the T1 time point using the new_demographic_embeddings CSV file. Perform this operation in the Input Module.
3. Remove chart events from T1: Based on the analysis in part 1, remove the chart events from the T1 datasets (both learning and holdout) using the Input Module.
4. Transform procedure events: Based on the findings from part 1, transform the procedure events columns in all time point datasets (learning and holdout sets) using the Input Module.
5. Analyze the learning set using D-Tale: Use the D-Tale tool from the Exploratory Module to examine your learning set. Explore the inter-variable correlation matrices for each time point.
6. Analyze the learning set using SweetViz: Use the SweetViz tool from the Exploratory Module to study your learning set. Identify sets of highly correlated variables, taking into account the observations made with D-Tale.
7. Remove highly correlated columns from the time point datasets: In the Input Module, remove the variables identified as highly correlated in part 6 from the time point datasets (learning and holdout sets).
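The tutorial performs all of these parts through the MEDomicsLab interface, but the two quantities it asks you to look at (missing-value percentages and highly correlated variables) can also be checked by hand. The sketch below is a minimal illustration using pandas only. It is not how MEDomicsLab, YData profiling, D-Tale, or SweetViz compute their reports. The column names, the demo values, and the 0.9 correlation threshold are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd


def missing_percentages(df):
    """Return the percentage of missing values per column, highest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


def highly_correlated_pairs(df, threshold=0.9):
    """List (col_a, col_b, |r|) for numeric column pairs with |Pearson r| > threshold.

    The threshold of 0.9 is an assumption, not a value taken from the tutorial.
    """
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs > threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]


def drop_correlated(df, threshold=0.9):
    """Drop the second column of every highly correlated pair."""
    to_drop = sorted({b for _, b, _ in highly_correlated_pairs(df, threshold)})
    return df.drop(columns=to_drop)


# Hypothetical stand-in for one time point of the learning set.
demo = pd.DataFrame({
    "age": [50.0, 60.0, 70.0, 80.0],
    "age_copy": [50.0, 60.0, 70.0, 80.0],    # duplicate of "age"
    "heart_rate": [72.0, np.nan, 90.0, 65.0],  # one missing value out of four
})

missing = missing_percentages(demo)
pairs = highly_correlated_pairs(demo)
reduced = drop_correlated(demo)
```

Whatever columns you remove this way, apply the same removal to both the learning and holdout sets so the two keep the same structure, as the parts above require.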
Recommendations
Before proceeding with Step 4 - Explore Data of the MEDomicsLab Testing Phase, we recommend consulting the documentation of the Input Module and Exploratory Module.
Input Module
Exploratory Module
Please consider the warnings mentioned on the Input Module page: we are continuously working on enhancing the MEDomicsLab platform.
Instructions for Step 4 - Explore Data
Content
Intro 0:00
YData-profiling 0:54
Set the demographics in T1 only 5:41
Remove chart events from T1 12:04
Transform procedure events 14:00
D-Tale 18:50
SweetViz 23:24
Remove Highly Correlated Columns 28:26
Note that the last part, 'Remove Highly Correlated Columns', is optional because it can be time-consuming. We are working to improve the Delete Columns tool in the Input Module to speed up this procedure in the future. If you choose to skip this last part, we will provide you with our own output data from Step 4 - Explore Data for use in Step 5 - Create Model.