Step 6: Create Model
Mar 25 – Apr 8 | Create Model
If you completed Step 4 - Explore Data, you have data ready for Step 6 - Create Model.
However, before proceeding to Step 6 - Create Model, we recommend that you replace your own output data from Step 4 - Explore Data (the MEDprofiles/timePoints folder) with the data that we prepared for you (MEDomicsLab_TestingPhase_Step6.zip). This will ensure consistency of results across all participants of the Testing Phase.
An invitation to access the MEDomicsLab_TestingPhase_Step6.zip data was sent by email.
In Step 6 - Create Model, we will use the Learning Module to build machine learning models from the learning set obtained in Step 4 - Explore Data. We will create two Learning scenes:
Scene 1: Time-Dependent Model Comparison
We aim to assess the impact of patient timelines on model performance. Our hypothesis is that performance will increase as more of the timeline is included, particularly as it nears the last hospital stay. We will compare the best models from the following datasets:
Dataset from the data obtained at the first time point (T1_learning_modified.csv).
Dataset combining data from the first and second time points (T1_learning_modified.csv and T2_learning_modified.csv).
Dataset combining data from the first, second, and third time points (T1_learning_modified.csv, T2_learning_modified.csv, and T3_learning_modified.csv).
Dataset combining data from all time points (T1_learning_modified.csv, T2_learning_modified.csv, T3_learning_modified.csv, and T4_learning_modified.csv).
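To make the cumulative structure of these four datasets concrete, here is a minimal sketch in pandas, written outside the application. It is only an illustration: in the Learning Module the files are combined through the interface, and the join key subject_id, the toy feature columns, and the inner join are all assumptions rather than details taken from the real CSV files.

```python
from functools import reduce

import pandas as pd

LABELS = ["T1", "T2", "T3", "T4"]


def make_timepoint(label):
    """Build a toy stand-in for <label>_learning_modified.csv.

    In practice each table would be loaded with pd.read_csv; the
    'subject_id' key and the feature column are hypothetical.
    """
    return pd.DataFrame({
        "subject_id": [1, 2, 3],
        f"feature_{label}": [0.1, 0.2, 0.3],
    })


timepoints = {label: make_timepoint(label) for label in LABELS}


def combine(labels):
    """Inner-join the selected time-point tables on the patient identifier."""
    frames = [timepoints[label] for label in labels]
    return reduce(
        lambda left, right: left.merge(right, on="subject_id", how="inner"),
        frames,
    )


# One dataset per pipeline: T1, T1+T2, T1+T2+T3, T1+T2+T3+T4.
scene1 = {"+".join(LABELS[:k]): combine(LABELS[:k]) for k in range(1, len(LABELS) + 1)}

for name, frame in scene1.items():
    print(name, frame.shape)
```

Each successive dataset keeps the same patients and adds the columns of one more time point, which is what lets the four best models be compared fairly.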
Scene 2: Variable-Dependent Model Comparison
This scene aims to assess the impact of considered variables on model performance. We will use data from the first two time points (T1_learning_modified.csv and T2_learning_modified.csv), assuming that models involving data from the last time points might make predictions too late in a patient's timeline. We'll compare the best models from the following datasets:
All demographic and time-series data (tslab, tsprocedure, and tschart classes) from T1_learning_modified.csv and T2_learning_modified.csv.
All demographic and notes data (ndischarge and nradiology) from T1_learning_modified.csv and T2_learning_modified.csv.
All demographic and image data from T1_learning_modified.csv and T2_learning_modified.csv.
Selected variables from various data types based on observations made using the first three pipelines, aiming to obtain the best possible model.
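The variable groups above can be pictured as column filters. The sketch below is a hypothetical illustration in plain Python: it assumes that each column name starts with its class name (tslab, tsprocedure, tschart, ndischarge, nradiology) and that the demographic columns are called age and gender. Neither assumption is confirmed by the data files. The image group is left out because no class name is given for it here.

```python
# Class names come from the scene description; treating them as column-name
# prefixes is an assumption made for this illustration.
CLASS_PREFIXES = {
    "time_series": ("tslab", "tsprocedure", "tschart"),
    "notes": ("ndischarge", "nradiology"),
}

# Hypothetical demographic column names.
DEMOGRAPHIC_COLUMNS = ("age", "gender")


def select_columns(columns, group):
    """Keep demographic columns plus the columns of one variable group."""
    prefixes = CLASS_PREFIXES[group]
    return [
        name
        for name in columns
        if name in DEMOGRAPHIC_COLUMNS or name.startswith(prefixes)
    ]


# Toy header standing in for the columns of a combined T1 and T2 table.
toy_columns = [
    "age",
    "gender",
    "tslab_glucose_T1",
    "tschart_heart_rate_T2",
    "ndischarge_embedding_0_T1",
    "nradiology_embedding_0_T2",
]

print(select_columns(toy_columns, "time_series"))
print(select_columns(toy_columns, "notes"))
```

In the application itself, this selection is made through the Learning Module interface rather than in code.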
These scenes are designed to provide a comprehensive comparison of models under different temporal and variable considerations.
Before proceeding with Step 6 - Create Model of the MEDomicsLab Testing Phase, we recommend consulting the documentation of the Learning Module.
Please note that the Learning Module is a graphical implementation of the PyCaret Python library. If you are looking for information about an element of the Learning Module, you may therefore find it in the PyCaret documentation.
The PyCaret documentation often refers to other Python packages, because PyCaret builds its functions on top of them. To learn more about the options of a given functionality, you may need to search the documentation of these other packages.
For example, if you are looking for information on the fold_strategy parameter in the Dataset box:
Visit the PyCaret documentation, specifically the Data Preprocessing section.
Look for the category related to the fold_strategy parameter, which is under Other Setup Parameters -> Model Selection.
The Model Selection part explains the related parameters, including fold_strategy. It specifies that this parameter takes, as input, either a predefined string or a cross-validation object compatible with scikit-learn. For additional information about the possible values, you will have to search the scikit-learn documentation on your own. For example, to learn more about the default value of fold_strategy (which is stratifiedkfold), search for 'stratifiedkfold' in the scikit-learn documentation; this leads to the StratifiedKFold class. The page related to this information is available here.
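To see what stratified k-fold cross-validation does, here is a small sketch that uses scikit-learn's StratifiedKFold directly on toy labels. The data and the choice of five splits are assumptions made for illustration; they are not values taken from the Learning Module.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 15 negatives and 5 positives (25% positive).
y = np.array([0] * 15 + [1] * 5)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature matrix

# n_splits=5 is chosen to suit the toy data, not taken from the application.
cv = StratifiedKFold(n_splits=5)

positives_per_fold = []
for train_idx, test_idx in cv.split(X, y):
    # Each held-out fold keeps approximately the class ratio of the full data.
    positives_per_fold.append(int(y[test_idx].sum()))

print(positives_per_fold)
```

An object such as cv is an example of the scikit-learn-compatible cross-validation object that, according to the PyCaret documentation, fold_strategy accepts in place of a predefined string.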
PyCaret is also an open-source library, so if you want to fully understand how it works in the background, the code is available on GitHub. Our application uses version 3.1.0, so we recommend consulting the 3.1.0 code if your research is related to our application.
Please pay particular attention to the last sections of the Learning Module documentation:
What PyCaret does?
PyCaret ROC (Receiver Operating Characteristic)/AUC (Area Under the Curve) plots
Content
Intro 0:00
First Pipeline 1:09
Explanations about PyCaret 5:37
Scene 1: Time-Dependent Model Comparison 7:35
Scene 2: Variable-Dependent Model Comparison 17:12
You are welcome to use this step to conduct your own experiments and explore the functionalities of the Learning Module. However, please note that some options and tooltips are not implemented yet; we intend to add them before Step 8 - Challenge.