Authors: NR Abdul Jalil, NA Mohamed, R Mohamad Yunus
Year: 2022
In this paper, we propose a new method for estimating optimal dynamic treatment regimes (ODTRs). The quadratic inference functions in myopic regret-regression (QIF-MRr) can be used to estimate the parameters of the mean response at each visit, conditional on previous states and actions. Singularity issues may arise during computation when estimating the parameters of ODTRs using QIF-MRr due to multicollinearity. Hence, a ridge penalty was introduced in rQIF-MRr to tackle these issues. A simulation study and an application to an anticoagulation dataset were conducted to investigate the model's performance in parameter estimation. The results show that estimates from rQIF-MRr are more efficient than those from QIF-MRr.
Estimand: The authors state that their method does not apply to data that suffer from missingness or censoring. The method can therefore only be employed where missingness is completely at random (MCAR). No handling of ICEs is reported. The measure is the regret function (the loss in expected outcome when not using the optimal treatment), which is a kind of risk difference. The goal is to minimize this regret (or, equivalently, to maximize the mean outcome).
Estimator Description: The paper introduces a new estimation method called rQIF-MRr (Ridge Quadratic Inference Function for Myopic Regret-Regression). It incorporates Ridge Regression (L2 regularization), a standard machine learning regularization technique, to handle multicollinearity and singularity issues in the estimation process.
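Illustrative Sketch: The authors' QIF-MRr machinery is not reproduced here; this is only a minimal sketch, in plain least-squares form, of how an L2 (ridge) penalty keeps estimation stable under multicollinearity. All data and names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    # Solve (X'X + lam*I) beta = X'y; lam > 0 keeps the system invertible
    # even when X'X is near-singular because of collinear columns.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=200)])  # nearly collinear
y = X @ np.array([1.0, 1.0]) + rng.normal(size=200)
print(ridge_fit(X, y, lam=1.0))  # stable; with lam = 0 this is ill-conditioned
```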
Authors: E Arjas, O Saarela
Year: 2010
A dynamic treatment regime is a decision rule in which the choice of the treatment of an individual at any given time can depend on the known past history of that individual, including baseline covariates, earlier treatments, and their measured responses. In this paper we argue that finding an optimal regime can, at least in moderately simple cases, be accomplished by a straightforward application of nonparametric Bayesian modeling and predictive inference. As an illustration we consider an inference problem in a subset of the Multicenter AIDS Cohort Study (MACS) data set, studying the effect of AZT initiation on future CD4-cell counts during a 12-month follow-up.
Estimand: The paper defines the estimand as the treatment regime that maximizes the expected value of a response variable at the study's conclusion. The mean difference between different regimes is calculated. Censored instances are simply ignored; the resulting estimand therefore cannot be taken to represent an ITT estimand, as that requires outcome data to be recorded for dropouts. The authors state that the estimator still produces valid inference under a MAR (non-informative censoring) assumption and simply exclude participants that dropped out. This can be seen as more akin to a naive per-protocol analysis that excludes instances that did not comply with the treatment strategy. The authors do not explain how the estimator could apply the hypothetical strategy for handling ICEs.
Estimator Description: In this paper the authors argue that finding an optimal regime can, at least in moderately simple cases, be accomplished by a straightforward application of nonparametric Bayesian modeling and predictive inference.
Authors: Xiaofei Bai, Anastasios A. Tsiatis, Wenbin Lu, Rui Song
Year: 2017
A treatment regime at a single decision point is a rule that assigns a treatment, among the available options, to a patient based on the patient's baseline characteristics. The value of a treatment regime is the average outcome of a population of patients if they were all treated in accordance with the treatment regime, where large values are desirable. The optimal treatment regime is a regime which results in the greatest value. Typically, the optimal treatment regime is estimated by positing a regression relationship for the outcome of interest as a function of treatment and baseline characteristics. However, this can lead to suboptimal treatment regimes when the regression model is misspecified. We instead consider value search estimators for the optimal treatment regime where we directly estimate the value for any treatment regime and then maximize this estimator over a class of regimes. For many studies the primary outcome of interest is survival time which is often censored. We derive a locally efficient, doubly robust, augmented inverse probability weighted complete case estimator for the value function with censored survival data and study the large sample properties of this estimator. The optimization is realized from a weighted classification perspective that allows us to use available off-the-shelf software. In some studies one treatment may have greater toxicity or side effects, thus we also consider estimating a quality-adjusted optimal treatment regime that allows a patient to trade some additional risk of death in order to avoid the more invasive treatment.
Estimand: The authors use IPW to deal with informative censoring (which corresponds to the hypothetical strategy). This requires coarsening-at-random to hold. The authors employ a composite strategy to handle toxicity arising from the treatment. They do so by allowing patients to modify the outcome of interest by indicating how much toxicity (that goes along with some invasive treatments) they are willing to trade with mortality risk. The causal effect measure is mean survival time.
Estimator Description: The authors propose a ‘value search estimator’ for the optimal treatment regime (OTR) specifically tailored for censored survival data. The novelty lies in deriving a locally efficient, doubly robust, augmented inverse probability weighted complete case (AIPW-CC) estimator for the value function (mean survival outcome) of any given regime. They then maximize this estimated value over a class of restricted regimes (e.g., linear rules) by reframing the optimization problem as a weighted classification problem. This allows the use of standard, off-the-shelf classification software (like Support Vector Machines or logistic regression) to find the optimal rule, bypassing complex custom optimization routines.
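Illustrative Sketch: A hedged, single-stage, uncensored rendition of the regime-search-as-weighted-classification idea; the paper's AIPW-CC estimator additionally handles censoring and augmentation. Variable names and the simulated data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 2))
A = rng.binomial(1, 0.5, size=n)                  # randomized treatment
Y = X[:, 0] * (2 * A - 1) + rng.normal(size=n)    # outcome; larger is better

pi = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]  # propensity
W = Y / np.where(A == 1, pi, 1 - pi)              # per-subject IPW contribution
C = np.where(A == 1, W, -W)                       # signed gain from treating
labels = (C > 0).astype(int)                      # treatment the rule should pick
weights = np.abs(C)                               # misclassification cost
rule = SVC(kernel="linear").fit(X, labels, sample_weight=weights)
print(rule.predict(X[:5]))                        # estimated optimal actions
```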
Authors: Jessica K. Barrett, Robin Henderson, Susanne Rosthoj
Year: 2014
We compare methods for estimating optimal dynamic decision rules from observational data, with particular focus on estimating the regret functions defined by Murphy (in J. R. Stat. Soc., Ser. B, Stat. Methodol. 65:331-355, 2003). We formulate a doubly robust version of the regret-regression approach of Almirall et al. (in Biometrics 66:131-139, 2010) and Henderson et al. (in Biometrics 66:1192-1201, 2010) and demonstrate that it is equivalent to a reduced form of Robins' efficient g-estimation procedure (Robins, in Proceedings of the Second Symposium on Biostatistics. Springer, New York, pp. 189-326, 2004). Simulation studies suggest that while the regret-regression approach is most efficient when there is no model misspecification, in the presence of misspecification the efficient g-estimation procedure is more robust. The g-estimation method can be difficult to apply in complex circumstances, however. We illustrate the ideas and methods through an application on control of blood clotting time for patients on long term anticoagulation.
Estimand: The proposed method makes no further assumptions aside from identifiability. Treatment switching hinges on treatment success or failure, which is handled as an ICE by being incorporated into the DTR. The resulting metric of comparison is an expected causal risk difference.
Estimator Description: The authors propose a doubly robust version of the regret-regression approach (originally proposed by Almirall et al. and Henderson et al.) for estimating optimal dynamic treatment regimes. The novelty lies in deriving this doubly robust formulation and demonstrating that it is ‘equivalent to a reduced form of Robins' efficient g-estimation procedure’. This addresses the difficulty of applying g-estimation in complex circumstances while improving robustness compared to standard regret-regression.
Authors: A Batorsky, KJ Anstrom, D Zeng
Year: 2024
Sequential multiple assignment randomized trials (SMARTs) are the gold standard for estimating optimal dynamic treatment regimes (DTRs), but are costly and require a large sample size. We introduce the multi-stage augmented Q-learning estimator (MAQE) to improve efficiency of estimation of optimal DTRs by augmenting SMART data with observational data. Our motivating example comes from the Back Pain Consortium, where one of the overarching aims is to learn how to tailor treatments for chronic low back pain to individual patient phenotypes, knowledge which is lacking clinically. The Consortium-wide collaborative SMART and observational studies within the Consortium collect data on the same participant phenotypes, treatments, and outcomes at multiple time points, which can easily be integrated. Previously published single-stage augmentation methods for integration of trial and observational study (OS) data were adapted to estimate optimal DTRs from SMARTs using Q-learning. Simulation studies show the MAQE, which integrates phenotype, treatment, and outcome information from multiple studies over multiple time points, more accurately estimates the optimal DTR, and has a higher average value than a comparable Q-learning estimator without augmentation. We demonstrate this improvement is robust to a wide range of trial and OS sample sizes, addition of noise variables, and effect sizes.
Estimand: It is noteworthy that they do not assume exchangeability in the observational dataset; instead, they use the trial data to correct for the resulting bias through weighting by a shrinkage parameter. The ICE (non-response) was an embedded tailoring variable in the actual trial design (treatment-policy strategy). Specifically, participants who did not meet the response threshold were re-randomized to a new treatment or a combination, while responders continued their initial treatment. The causal effect measure is the value function, which maximizes the expected cumulative outcome.
Estimator Description: The authors introduce a multi-stage augmented Q-learning estimator (MAQE) that integrates observational and trial data for more accurate estimation of the optimal DTR. Notably, to this end, they had access to trial and observational datasets collecting the same participant phenotypes, treatments, and outcomes at multiple time points, which can easily be integrated.
Software: https://github.com/abatorsky/MAQE
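Illustrative Sketch: The exact MAQE augmentation term is not reproduced here; this is only a generic, MSE-motivated illustration of shrinkage weighting between an unbiased trial estimate and a possibly biased observational (OS) estimate, with the weight driven by their discrepancy. All quantities and names are hypothetical.

```python
def shrinkage_combine(theta_trial, var_trial, theta_os, var_os):
    # MSE-minimizing weight on the OS estimate when its bias is estimated
    # by the trial/OS discrepancy: borrow from the OS arm only when it is
    # precise and agrees with the trial.
    bias2 = (theta_os - theta_trial) ** 2
    lam = var_trial / (var_trial + var_os + bias2)
    return (1 - lam) * theta_trial + lam * theta_os

print(shrinkage_combine(1.0, 0.20, 1.1, 0.05))  # OS agrees: borrow heavily
print(shrinkage_combine(1.0, 0.20, 3.0, 0.05))  # OS far off: stay near trial
```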
Authors: I Bhattacharya, A Ertefaie, KG Lynch, JR McKay, BA Johnson
Year: 2023
Existing methods for estimation of dynamic treatment regimes are mostly limited to intention-to-treat analyses, which estimate the effect of randomization to a particular treatment regime without considering the compliance behavior of patients. In this article, we propose a novel nonparametric Bayesian Q-learning approach to construct optimal sequential treatment regimes that adjust for partial compliance. We consider the popular potential compliance framework, where some potential compliances are latent and need to be imputed. The key challenge is learning the joint distribution of the potential compliances, which we accomplish using a Dirichlet process mixture model. Our approach provides two kinds of treatment regimes: (1) conditional regimes that depend on the potential compliance values; and (2) marginal regimes where the potential compliances are marginalized. Extensive simulation studies highlight the usefulness of our method compared to intention-to-treat analyses. We apply our method to the Adaptive Treatment for Alcohol and Cocaine Dependence (ENGAGE) study, where the goal is to construct optimal treatment regimes to engage patients in therapy.
Estimand: The method is based on principal stratification, which requires monotonicity and exclusion-restriction assumptions. Compliance is a central issue addressed by this method: it uses principal strata defined by potential compliance values. The method therefore corrects for the ICE of non-adherence by applying the principal-stratum strategy. It estimates the expected potential outcome under the optimal regime for specific principal strata.
Estimator Description: The contribution of this paper is the integration of nonparametric Bayesian methods with Q-learning to specifically handle partial compliance (an intercurrent event). The authors use a Dirichlet process mixture model (DPMM) to model the joint distribution of the latent ‘potential compliances’ and outcomes, allowing them to impute missing compliance data and estimate effects for specific principal strata conditional on potential compliance. The method allows for estimating (1) a principal treatment regime and (2) a marginal treatment regime. The former shows how the optimal treatment regime varies across different potential compliance values (counterfactual compliance), while the latter depends solely on observed covariates and can thus be used in practice for decision-making.
Software: https://github.com/indrabati646/Bayesian-Q-Learning
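Illustrative Sketch: Only the DPMM ingredient is sketched here, via scikit-learn's truncated variational Dirichlet-process mixture; the paper embeds such a mixture in a full Bayesian Q-learning procedure that imputes the latent potential compliances, which is not reproduced. The toy data are hypothetical.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(2)
# Toy (compliance under a=0, compliance under a=1) pairs, fully observed
# here; in the paper one coordinate per subject is latent and imputed.
comp = np.vstack([rng.beta(2, 5, (150, 2)), rng.beta(5, 2, (150, 2))])

dpmm = BayesianGaussianMixture(
    n_components=10,                                   # truncation level
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(comp)
print(dpmm.predict(comp[:5]))   # cluster memberships
print(dpmm.weights_.round(3))   # mass concentrates on a few components
```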
Authors: Paul H. Chaffee, Mark J. van der Laan
Year: 2012
Sequential Randomized Controlled Trials (SRCTs) are rapidly becoming essential tools in the search for optimized treatment regimes in ongoing treatment settings. Analyzing data for multiple time-point treatments with a view toward optimal treatment regimes is of interest in many types of afflictions: HIV infection, Attention Deficit Hyperactivity Disorder in children, leukemia, prostate cancer, renal failure, and many others. Methods for analyzing data from SRCTs exist but they are either inefficient or suffer from the drawbacks of estimating equation methodology. We describe an estimation procedure, targeted maximum likelihood estimation (TMLE), which has been fully developed and implemented in point treatment settings, including time to event outcomes, binary outcomes and continuous outcomes. Here we develop and implement TMLE in the SRCT setting. As in the former settings, the TMLE procedure is targeted toward a pre-specified parameter of the distribution of the observed data, and thereby achieves important bias reduction in estimation of that parameter. As with the so-called Augmented Inverse Probability of Censoring Weight (A-IPCW) estimator, TMLE is double-robust and locally efficient. We report simulation results corresponding to two data-generating distributions from a longitudinal data structure.
Estimand: The proposed TMLE estimator makes use of inverse probability of censoring weights to estimate the treatment effect under no censoring. This is an application of the hypothetical strategy to the (broad) ICE of informative censoring. It can estimate effect measures for binary outcomes (P(Y=1)), continuous measures (mean outcome) or time-to-event measures (mean survival time).
Estimator Description: The authors provide an efficient implementation of the doubly robust, semiparametric Targeted Maximum Likelihood Estimation (TMLE) for the sequentially randomised controlled trial setting.
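Illustrative Sketch: A minimal point-treatment TMLE for E[Y(1)] with a binary outcome, showing only the targeting (fluctuation) step; the paper develops the analogous procedure for sequential (SRCT) settings. The data and working models are hypothetical.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 1000
W = rng.normal(size=(n, 2))
A = rng.binomial(1, 1 / (1 + np.exp(-W[:, 0])))
Y = rng.binomial(1, 1 / (1 + np.exp(-(A + W[:, 1]))))

# Step 1: initial fits of Q(A, W) = P(Y=1 | A, W) and g(W) = P(A=1 | W).
XA = np.column_stack([A, W])
Qfit = LogisticRegression().fit(XA, Y)
Q_AW = Qfit.predict_proba(XA)[:, 1]
Q_1W = Qfit.predict_proba(np.column_stack([np.ones(n), W]))[:, 1]
g = LogisticRegression().fit(W, A).predict_proba(W)[:, 1]

# Step 2: targeting -- a one-parameter logistic fluctuation along the
# "clever covariate" H = A / g(W), with offset logit(Q(A, W)).
H = (A / g)[:, None]
eps = sm.GLM(Y, H, offset=np.log(Q_AW / (1 - Q_AW)),
             family=sm.families.Binomial()).fit().params[0]

# Step 3: plug in the updated Q*(1, W) and average over W.
logit_Q1 = np.log(Q_1W / (1 - Q_1W)) + eps / g
print(np.mean(1 / (1 + np.exp(-logit_Q1))))  # targeted estimate of E[Y(1)]
```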
Authors: B Chakraborty, P Ghosh, EE Moodie, AJ Rush
Year: 2016
A dynamic treatment regimen consists of decision rules that recommend how to individualize treatment to patients based on available treatment and covariate history. In many scientific domains, these decision rules are shared across stages of intervention. As an illustrative example, we discuss STAR*D, a multistage randomized clinical trial for treating major depression. Estimating these shared decision rules often amounts to estimating parameters indexing the decision rules that are shared across stages. In this article, we propose a novel simultaneous estimation procedure for the shared parameters based on Q-learning. We provide an extensive simulation study to illustrate the merit of the proposed method over simple competitors, in terms of the treatment allocation matching of the procedure with the "oracle" procedure, defined as the one that makes treatment recommendations based on the true parameter values as opposed to their estimates. We also look at bias and mean squared error of the individual parameter-estimates as secondary metrics. Finally, we analyze the STAR*D data using the proposed method.
Estimand: The method assumes that the Markov property holds. One of the included covariates is the slope of symptom deterioration. The estimated DTR concluded that the optimal regime should recommend more aggressive treatment for patients with a large positive slope with respect to disease progression (referring to depression symptoms in this example). This corresponds to the DTR strategy for the ICE of disease progression.
Estimator Description: The authors propose a novel method called Q-Shared, which extends Q-learning to estimate Optimal Shared-Parameter Dynamic Regimens. In standard Q-learning, parameters for the optimal decision rule are estimated separately at each stage. However, in many chronic diseases (like depression in the STAR*D trial), the decision logic (e.g., ‘if score > X, increase dose’) should theoretically be the same across all stages to be clinically interpretable and parsimonious. This method imposes a constraint that the parameters defining the rule are shared across stages, allowing for simultaneous estimation using all stage data. This reduces the dimension of the problem and increases statistical efficiency by pooling information across stages.
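Illustrative Sketch: A toy rendition of the shared-parameter idea under simplifying assumptions: identical rule parameters across two stages are estimated jointly by stacking the stage-wise design matrices into a single regression (main effects omitted for brevity). The actual Q-Shared algorithm iterates, because later-stage pseudo-outcomes themselves depend on the shared parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
psi_true = np.array([0.5, -1.0])                # shared (intercept, slope)
design, resp = [], []
for stage in range(2):                          # two stages, same rule form
    S = rng.normal(size=n)                      # tailoring variable
    A = rng.binomial(1, 0.5, n)
    Y = A * (psi_true[0] + psi_true[1] * S) + rng.normal(size=n)
    design.append(np.column_stack([A, A * S]))  # A * (1, S): rule features
    resp.append(Y)

X = np.vstack(design)                           # stack both stages
y = np.concatenate(resp)
psi_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(psi_hat)                                  # one estimate serving all stages
```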
Authors: B Chakraborty, EB Laber, Y Zhao
Year: 2013
A dynamic treatment regime consists of a set of decision rules that dictate how to individualize treatment to patients based on available treatment and covariate history. A common method for estimating an optimal dynamic treatment regime from data is Q-learning, which involves nonsmooth operations of the data. This nonsmoothness causes standard asymptotic approaches for inference like the bootstrap or Taylor series arguments to break down if applied without correction. Here, we consider the m-out-of-n bootstrap for constructing confidence intervals for the parameters indexing the optimal dynamic regime. We propose an adaptive choice of m and show that it produces asymptotically correct confidence sets under fixed alternatives. Furthermore, the proposed method has the advantage of being conceptually and computationally much simpler than competing methods possessing this same theoretical property. We provide an extensive simulation study to compare the proposed method with currently available inference procedures. The results suggest that the proposed method delivers nominal coverage while being less conservative than alternatives. The proposed methods are implemented in the qLearn R-package and have been made available on the Comprehensive R-Archive Network (http://cran.r-project.org/). Analysis of the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study is used as an illustrative example.
Estimand: The proposed method assumes correct specification of the Q-functions (e.g., linear regression); the exact modeling assumption therefore depends on the choice of Q-function. Furthermore, it relaxes the boundary condition to allow for null treatment effects.
Estimator Description: The authors focus on the inference (uncertainty estimation) for Dynamic Treatment Regimes (DTRs) estimated via Q-learning. The novelty is the proposal of an ‘adaptive m-out-of-n bootstrap scheme’. Standard bootstrap fails (is inconsistent) for Q-learning because the estimator involves ‘nonsmooth operations’ (maximization over treatment options) which causes asymptotic non-regularity. The authors propose an adaptive way to choose the resample size m (where m < n) to restore consistency.
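Illustrative Sketch: A fixed-m rendition of the m-out-of-n bootstrap confidence interval; the paper's actual contribution, the adaptive data-driven choice of m, is not reproduced. The statistic and data are hypothetical.

```python
import numpy as np

def m_out_of_n_ci(data, stat, m, B=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    theta_n = stat(data)
    # Resampling m < n observations: sqrt(m) * (theta*_m - theta_n)
    # approximates the law of sqrt(n) * (theta_n - theta), and remains
    # valid under the non-regularity that breaks the standard bootstrap.
    roots = np.array([
        np.sqrt(m) * (stat(rng.choice(data, size=m, replace=True)) - theta_n)
        for _ in range(B)
    ])
    lo, hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return theta_n - hi / np.sqrt(n), theta_n - lo / np.sqrt(n)

x = np.random.default_rng(5).normal(loc=1.0, size=400)
print(m_out_of_n_ci(x, np.mean, m=int(400 ** 0.8)))  # m = n^0.8, just a demo
```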
Authors: Bibhas Chakraborty, Eric B. Laber, Ying-Qi Zhao
Year: 2014
BACKGROUND: A dynamic treatment regime (DTR) comprises a sequence of decision rules, one per stage of intervention, that recommends how to individualize treatment to patients based on evolving treatment and covariate history. These regimes are useful for managing chronic disorders, and fit into the larger paradigm of personalized medicine. The Value of a DTR is the expected outcome when the DTR is used to assign treatments to a population of interest. PURPOSE: The Value of a data-driven DTR, estimated using data from a Sequential Multiple Assignment Randomized Trial, is both a data-dependent parameter and a non-smooth function of the underlying generative distribution. These features introduce additional variability that is not accounted for by standard methods for conducting statistical inference, for example, the bootstrap or normal approximations, if applied without adjustment. Our purpose is to provide a feasible method for constructing valid confidence intervals (CIs) for this quantity of practical interest. METHODS: We propose a conceptually simple and computationally feasible method for constructing valid CIs for the Value of an estimated DTR based on subsampling. The method is self-tuning by virtue of an approach called the double bootstrap. We demonstrate the proposed method using a series of simulated experiments. RESULTS: The proposed method offers considerable improvement in terms of coverage rates of the CIs over the standard bootstrap approach. LIMITATIONS: In this article, we have restricted our attention to Q-learning for estimating the optimal DTR. However, other methods can be employed for this purpose; to keep the discussion focused, we have not explored these alternatives. CONCLUSION: Subsampling-based CIs provide much better performance compared to standard bootstrap for the Value of an estimated DTR.
Estimand: The authors state no additional assumptions besides identifiability. The ICE of treatment non-response is handled by being incorporated in the DTR. The mean outcome under the treatment rule is used as a causal effect measure.
Estimator Description: The authors propose a method for constructing ‘valid confidence intervals’ for the ‘Value of an estimated DTR’ (the expected outcome of a data-driven regime) using subsampling. The novelty addresses the problem that the Value is a non-smooth function of the underlying generative distribution (due to the max operator in estimating optimal rules), which causes standard bootstrap methods to fail (be inconsistent). The proposed method uses a ‘double bootstrap’ to make the procedure ‘self-tuning’ (adaptively selecting the subsample size).
Authors: Yasin Khadem Charvadeh, Grace Y. Yi
Year: 2024
Research on dynamic treatment regimes has enticed extensive interest. Many methods have been proposed in the literature, which, however, are vulnerable to the presence of misclassification in covariates. In particular, although Q-learning has received considerable attention, its applicability to data with misclassified covariates is unclear. In this article, we investigate how ignoring misclassification in binary covariates can impact the determination of optimal decision rules in randomized treatment settings, and demonstrate its deleterious effects on Q-learning through empirical studies. We present two correction methods to address misclassification effects on Q-learning. Numerical studies reveal that misclassification in covariates induces non-negligible estimation bias and that the correction methods successfully ameliorate bias in parameter estimation.
Estimand: No ICE handling is performed.
Estimator Description: The authors show that misclassification of binary covariates introduces non-negligible bias in Q-learning. They propose two correction methods to this end and show their performance through simulation studies.
Authors: Elynn Chen, Sai Li, Michael I. Jordan
Year: 2025
Time-inhomogeneous finite-horizon Markov decision processes (MDP) are frequently employed to model decision-making in dynamic treatment regimes and other statistical reinforcement learning (RL) scenarios. These fields, especially healthcare and business, often face challenges such as high-dimensional state spaces and time-inhomogeneity of the MDP process, compounded by insufficient sample availability which complicates informed decision-making. To overcome these challenges, we investigate knowledge transfer within time-inhomogeneous finite-horizon MDP by leveraging data from both a target RL task and several related source tasks. We have developed transfer learning (TL) algorithms that are adaptable for both batch and online Q-learning, integrating valuable insights from offline source studies. The proposed transfer Q-learning algorithm contains a novel re-targeting step that enables cross-stage transfer along multiple stages in an RL task, besides the usual cross-task transfer for supervised learning. We establish the first theoretical justifications of TL in RL tasks by showing a faster rate of convergence of the Q*-function estimation in the offline RL transfer, and a lower regret bound in the offline-to-online RL transfer under stage-wise reward similarity and mild design similarity across tasks. Empirical evidence from both synthetic and real datasets is presented to evaluate the proposed algorithm and support our theoretical results.
Estimand: The paper proposes an estimator that assumes problem similarity for transfer learning. It also assumes a Markov decision process. ICE handling is not addressed. The measure is the optimal Q-function / value function (expected cumulative reward).
Estimator Description: The authors study transfer learning within time-inhomogeneous (i.e., transition probabilities and reward functions depend on the specific time step t), finite-horizon (i.e., fixed number of steps) Markov decision processes, leveraging data from both a target reinforcement learning task (e.g., the DTR being optimized) and several related source tasks (other sources of data). They propose transfer learning (TL) algorithms that are adaptable for both batch and online Q-learning, thus leveraging offline source studies. The proposed transfer Q-learning algorithm also contains a novel re-targeting mechanism that enables cross-stage transfer along the multiple stages of an RL task.
Authors: Xinyuan Chen, Li Hu, Fan Li
Year: 2025
In longitudinal observational studies with time-to-event outcomes, a common objective in causal analysis is to estimate the causal survival curve under hypothetical intervention scenarios. The g-formula is a useful tool for this analysis. To enhance the traditional parametric g-formula, we developed an alternative g-formula estimator, which incorporates the Bayesian Additive Regression Trees into the modeling of the time-evolving generative components, aiming to mitigate the bias due to model misspecification. We focus on binary time-varying treatments and introduce a general class of g-formulas for discrete survival data that can incorporate longitudinal balancing scores. The minimum sufficient formulation of these longitudinal balancing scores is linked to the nature of treatment strategies, i.e., static or dynamic. For each type of treatment strategy, we provide posterior sampling algorithms. We conducted simulations to illustrate the empirical performance of the proposed method and demonstrate its practical utility using data from the Yale New Haven Health System's electronic health records.
Estimand: The method assumes ignorability with respect to the counterfactual interventions. One of the variables they intervene on is censoring status. Sequential ignorability with respect to censoring status is akin to CAR given the covariate set. They can then use the hypothetical strategy to handle the ICE of informative censoring. The effect measure of interest is the causal survival curve, but the authors do not show how to use this measure to compare DTRs.
Estimator Description: The novelty of the proposed method lies in enhancing the traditional parametric g-formula by using BART to model the time-evolving components (covariates/outcomes), thereby reducing bias from model misspecification. It also introduces a general class of g-formulas that can incorporate longitudinal balancing scores (like propensity scores) as a dimension reduction device within the Bayesian framework.
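Illustrative Sketch: The Monte-Carlo backbone of a discrete-time g-formula for survival under the static strategy "always treat", with logistic working models standing in for the BART components the paper proposes (the balancing-score refinement is not shown). The data-generating process is hypothetical, and risk-set bookkeeping after events is omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n, K = 2000, 3

# Toy observational cohort: binary covariate L_t, treatment A_t, event D_t.
L = rng.binomial(1, 0.5, n).astype(float)
records = []
for t in range(K):
    A = rng.binomial(1, 0.3 + 0.4 * L)
    D = rng.binomial(1, 0.15 + 0.10 * L - 0.08 * A)
    L_next = rng.binomial(1, 0.4 + 0.3 * L - 0.1 * A).astype(float)
    records.append((L.copy(), A, D, L_next))
    L = L_next

# One discrete-time hazard model and one covariate model per interval.
haz = [LogisticRegression().fit(np.column_stack([Lt, At]), Dt)
       for Lt, At, Dt, _ in records]
cov = [LogisticRegression().fit(np.column_stack([Lt, At]), Ln)
       for Lt, At, _, Ln in records]

# Monte Carlo g-formula under the static strategy "always treat" (A_t = 1).
m = 10000
Ls = rng.binomial(1, 0.5, m).astype(float)
surv = np.ones(m)
for t in range(K):
    X = np.column_stack([Ls, np.ones(m)])              # intervene: set A_t = 1
    surv = surv * (1 - haz[t].predict_proba(X)[:, 1])  # survival recursion
    Ls = rng.binomial(1, cov[t].predict_proba(X)[:, 1]).astype(float)
    print(f"S({t + 1}) under always-treat: {surv.mean():.3f}")
```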
Authors: T Choi, H Lee, S Choi
Year: 2023
Dynamic treatment regime (DTR) is an emerging paradigm in recent medical studies, which searches for a series of decision rules to assign optimal treatments to each patient by taking into account individual features such as genetic, environmental, and social factors. Although there is a large and growing literature on statistical methods to estimate optimal treatment regimes, most methodologies have focused on complete data. In this article, we propose an accountable contrast-learning algorithm for optimal dynamic treatment regimes with survival endpoints. Our estimating procedure originates from a doubly-robust weighted classification scheme, which is a model-based contrast-learning method that directly characterizes the interaction terms between predictors and treatments without main effects. To reflect the censorship, we adopt the pseudo-value approach that replaces survival quantities with pseudo-observations for the time-to-event outcome. Unlike many existing approaches, mostly based on complicated outcome regression modeling or inverse-probability weighting schemes, the pseudo-value approach greatly simplifies the estimating procedure for the optimal treatment regime by allowing investigators to conveniently apply standard machine learning techniques to censored survival data without losing much efficiency. We further explore a SCAD-penalization to find informative clinical variables and modified algorithms to handle multiple treatment options by searching upper and lower bounds of the objective function. We demonstrate the utility of our proposal via extensive simulations and application to AIDS data.
Estimand: In addition to the identifiability conditions, the authors explicitly assume CAR. They highlight that the pseudo-value approach can be used to account for informative censoring, competing risks, and interval censoring. Subsequent treatment and covariate history are only defined if the patient has not experienced the ICE of death (hypothetical DTR strategy).
Estimator Description: The novelty lies in combining the pseudo-value approach with contrast-learning (specifically weighted classification). This allows the use of standard machine learning algorithms (like SVM or Random Forests) for censored survival data without needing complex outcome regression modeling or inverse-probability censoring weighting (IPCW) for the regime search itself.
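Illustrative Sketch: Only the pseudo-value device is sketched: jackknife pseudo-observations for the survival probability S(t0) built from Kaplan-Meier estimates, which can then be handed to any off-the-shelf regression or classification method as if they were uncensored outcomes. The data are hypothetical.

```python
import numpy as np

def km_surv(time, event, t0):
    # Kaplan-Meier estimate of S(t0).
    s = 1.0
    for t in np.sort(np.unique(time[event == 1])):
        if t > t0:
            break
        at_risk = np.sum(time >= t)
        d = np.sum((time == t) & (event == 1))
        s *= 1 - d / at_risk
    return s

def pseudo_values(time, event, t0):
    # Jackknife pseudo-observation: n*S_hat - (n-1)*S_hat^(-i).
    n = len(time)
    s_all = km_surv(time, event, t0)
    mask = np.ones(n, bool)
    po = np.empty(n)
    for i in range(n):
        mask[i] = False
        po[i] = n * s_all - (n - 1) * km_surv(time[mask], event[mask], t0)
        mask[i] = True
    return po   # one "complete-data" outcome per subject, censored or not

rng = np.random.default_rng(7)
T = rng.exponential(2.0, 300); C = rng.exponential(3.0, 300)
time, event = np.minimum(T, C), (T <= C).astype(int)
print(pseudo_values(time, event, t0=1.5)[:5].round(3))
```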
Authors: J Chu, Y Zhang, F Huang, L Si, S Huang, Z Huang
Year: 2022
BACKGROUND AND OBJECTIVE: Treatment effect estimation, as a fundamental problem in causal inference, focuses on estimating the outcome difference between different treatments. However, in clinical observational data, some patient covariates (such as gender, age) not only affect the outcomes but also affect the treatment assignment. Such covariates, known as confounders, produce distribution discrepancies between different treatment groups, thereby introducing selection bias into the estimation of treatment effects. The situation is even more complicated in longitudinal data, because the confounders are time-varying: they are subject to patient history and meanwhile affect the future outcomes and treatment assignments. Existing methods mainly work on cross-sectional data obtained at a specific time point, but cannot process the time-varying confounders hidden in the longitudinal data. METHODS: In this study, we address this problem for the first time by disentangled representation learning, which considers the observational data as consisting of three components, including outcome-specific factors, treatment-specific factors, and time-varying confounders. Based on this, the proposed approach adopts a recurrent neural network-based framework to process sequential information and learn the disentangled representations of the components from longitudinal observational sequences, and captures the posterior distributions of latent factors via a multi-task learning strategy. Moreover, mutual information-based regularization is adopted to eliminate the time-varying confounders. In this way, the association between patient history and treatment assignment is removed and the estimation can be effectively conducted. RESULTS: We evaluate our model in a realistic set-up using a model of tumor growth. The proposed model achieves the best performance over benchmark models for both one-step ahead prediction (0.70% vs 0.74% for the state-of-the-art model, when γ = 3; measured by normalized root mean square error, the lower the better) and five-step ahead prediction (1.47% vs 1.83%) in most cases. By increasing the effect of confounders, our proposed model always shows superiority against the state-of-the-art model. In addition, we adopted t-SNE to visualize the disentangled representations and present the effectiveness of disentanglement explicitly and intuitively. CONCLUSIONS: The experimental results indicate the powerful capacity of our model in learning disentangled representations from longitudinal observational data and dealing with the time-varying confounders, and demonstrate the surpassing performance achieved by our proposed model on dynamic treatment effect estimation.
Estimand: The method does not make assumptions besides identifiability. The estimator predicts the Potential Outcome directly. No ICE handling was specified as only simulated data was used. The causal effect is implicitly the Difference in Potential Outcomes (Risk Difference or Mean Difference) between two treatment sequences.
Estimator Description: The novelty lies in applying disentangled representation learning to sequential data to handle time-varying confounders. The method assumes observed data consists of three components: outcome-specific factors, treatment-specific factors, and time-varying confounders. It uses two Recurrent Neural Networks (RNNs) combined with a Mutual Information (MI)-based regularization to separate these factors and eliminate the confounding information shared between treatment and outcome representations.
Authors: Q Clairon, R Henderson, NJ Young, ED Wilson, CJ Taylor
Year: 2021
A control theory perspective on determination of optimal dynamic treatment regimes is considered. The aim is to adapt statistical methodology that has been developed for medical or other biostatistical applications to incorporate powerful control techniques that have been designed for engineering or other technological problems. Data tend to be sparse and noisy in the biostatistical area and interest has tended to be in statistical inference for treatment effects. In engineering fields, experimental data can be more easily obtained and reproduced and interest is more often in performance and stability of proposed controllers rather than modeling and inference per se. We propose that modeling and estimation should be based on standard statistical techniques but subsequent treatment policy should be obtained from robust control. To bring focus, we concentrate on A-learning methodology as developed in the biostatistical literature and H∞-synthesis from control theory. Simulations and two applications demonstrate robustness of the H∞ strategy compared to standard A-learning in the presence of model misspecification or measurement error.
Estimand: The paper proposes an estimator that assumes that at each time t there is an action which will allow the expected state to reach its target value. Furthermore, they assume state-space evolution, meaning that a patient's next health state is determined by a function of their entire accumulated history and current treatment action, plus a stochastic disturbance, rather than depending solely on the current state (as in a Markov process). ICE handling was not specified in the paper. The method estimates the regret, representing the risk difference between the assessed policy and the optimal policy.
Estimator Description: This paper introduces concepts from control theory (from the engineering field) to the estimation of DTRs, particularly H∞-synthesis. The novelty lies in using robust control techniques to determine treatment policies that are resilient to ‘parametric uncertainty,’ ‘measurement uncertainty,’ and ‘model misspecification’ after the initial modelling stage.
Authors: L Deng, H Xiong, F Wu, S Kapoor, S Ghosh, Z Shahn, LH Lehman
Year: 2024
In medical decision-making, clinicians must choose between different time-varying treatment strategies. Counterfactual prediction via g-computation enables comparison of alternative outcome distributions under such treatment strategies. While deep learning can better model high-dimensional data with complex temporal dependencies, incorporating model uncertainty into predicted conditional counterfactual distributions remains challenging. We propose a principled approach to model uncertainty in deep learning implementations of g-computations using approximate Bayesian posterior predictive distributions of counterfactual outcomes via variational dropout and deep ensembles. We evaluate these methods by comparing their counterfactual predictive calibration and performance in decision-making tasks, using two simulated datasets from mechanistic models and a real-world sepsis dataset. Our findings suggest that the proposed uncertainty quantification approach improves both calibration and decision-making performance, particularly in minimizing risks of worst-case adverse clinical outcomes under alternative dynamic treatment regimes. To our knowledge, this is the first work to propose and compare multiple uncertainty quantification methods in machine learning models of g-computation in estimating conditional treatment effects under dynamic treatment regimes.
Estimand: No additional assumptions are made besides the relevant modeling and identifiability assumptions. The authors incorporate the ICE of disease progression into the outcome to be minimized, therefore using a composite strategy. They do not cite a causal effect measure, but the performance metric is the percentage of times adverse events occur across the test set.
Estimator Description: The authors propose a method to establish counterfactual predictive distributions (Bayesian uncertainty bounds) for machine learning implementations of g-computation aimed at optimal DTR estimation of conditional treatment effects. They do so using variational dropout and deep ensembles.
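Illustrative Sketch: A minimal MC-dropout mechanic under simplifying assumptions: dropout is kept stochastic at prediction time and repeated forward passes approximate a posterior predictive distribution. The paper uses variational dropout and deep ensembles inside deep g-computation models; the tiny regressor below is purely illustrative.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Dropout(p=0.2),
                    nn.Linear(64, 1))

x = torch.randn(8, 5)            # a batch of hypothetical patient histories
net.train()                      # .train() keeps Dropout stochastic
with torch.no_grad():
    draws = torch.stack([net(x) for _ in range(100)])  # 100 MC forward passes
mean, std = draws.mean(dim=0), draws.std(dim=0)        # predictive mean, spread
print(mean.squeeze(), std.squeeze())
```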
Authors: A Ertefaie, S Shortreed, B Chakraborty
Year: 2016
Q-learning is a regression-based approach that uses longitudinal data to construct dynamic treatment regimes, which are sequences of decision rules that use patient information to inform future treatment decisions. An optimal dynamic treatment regime is composed of a sequence of decision rules that indicate how to optimally individualize treatment using the patients' baseline and time-varying characteristics to optimize the final outcome. Constructing optimal dynamic regimes using Q-learning depends heavily on the assumption that regression models at each decision point are correctly specified; yet model checking in the context of Q-learning has been largely overlooked in the current literature. In this article, we show that residual plots obtained from standard Q-learning models may fail to adequately check the quality of the model fit. We present a modified Q-learning procedure that accommodates residual analyses using standard tools. We present simulation studies showing the advantage of the proposed modification over standard Q-learning. We illustrate this new Q-learning approach using data collected from a sequential multiple assignment randomized trial of patients with schizophrenia.
Estimand: The diagnostic plots that the authors propose assume normally distributed errors. They incorporate the ICE of treatment non-response (as determined by the patient) as a tailoring variable in the DTR. The DTR that maximises the expected outcome is chosen. However, it is noteworthy that this paper’s main contribution relates more to model diagnostics.
Estimator Description: The authors propose a diagnostic method called Q-learning with Model Residuals (QL-MR). The novelty is addressing the lack of model-checking tools for Q-learning. In standard Q-learning, the ‘outcome’ for the first-stage regression is a ‘pseudo-outcome’ derived from the second-stage model (specifically, the predicted future optimal value). If the second-stage model is misspecified, this pseudo-outcome is biased, but standard residual plots for the first stage will not reveal this error because the ‘observed’ data for the first stage is itself a model prediction. QL-MR introduces a way to incorporate the residuals from the future stage into the current stage's estimation to ‘restore’ the variability and allow for valid model checking (e.g., checking for non-linearity or outliers) using standard residual plots.
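Illustrative Sketch: The backbone of standard two-stage Q-learning with linear working models, i.e., the setting whose diagnostics the paper repairs. Note that the stage-1 response is a model-derived pseudo-outcome, which is exactly why naive stage-1 residual plots can look clean even when the stage-2 model is misspecified. The data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500
S1 = rng.normal(size=n); A1 = rng.binomial(1, 0.5, n)
S2 = 0.5 * S1 + rng.normal(size=n); A2 = rng.binomial(1, 0.5, n)
Y = S1 + A1 * S1 + S2 + A2 * (1 - S2) + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Stage 2: regress Y on (1, S2, A2, A2*S2).
X2 = np.column_stack([np.ones(n), S2, A2, A2 * S2])
b2 = ols(X2, Y)

# Pseudo-outcome: predicted value had stage-2 treatment been optimal.
def q2(a):
    return np.column_stack([np.ones(n), S2, a, a * S2]) @ b2
Ytilde = np.maximum(q2(np.zeros(n)), q2(np.ones(n)))

# Stage 1: regress the pseudo-outcome on (1, S1, A1, A1*S1).
X1 = np.column_stack([np.ones(n), S1, A1, A1 * S1])
b1 = ols(X1, Ytilde)
print(b1)   # stage-1 rule: treat if b1[2] + b1[3] * S1 > 0
```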
Authors: Ashkan Ertefaie, James R. McKay, David Oslin, Robert L. Strawderman
Year: 2021
Q-learning is a regression-based approach that is widely used to formalize the development of an optimal dynamic treatment strategy. Finite dimensional working models are typically used to estimate certain nuisance parameters, and misspecification of these working models can result in residual confounding and/or efficiency loss. We propose a robust Q-learning approach which allows estimating such nuisance parameters using data-adaptive techniques. We study the asymptotic behavior of our estimators and provide simulation studies that highlight the need for and usefulness of the proposed method in practice. We use the data from the "Extending Treatment Effectiveness of Naltrexone" multi-stage randomized trial to illustrate our proposed methods.
Estimand: No assumptions are made aside from the identifiability assumptions. The ICEs of non-adherence and non-response to treatment are handled by being incorporated into the DTR. The estimator estimates the blip function (or interaction effect), which represents the difference in expected outcome between the optimal treatment and the reference, conditional on history.
Estimator Description: The authors propose a Q-learning algorithm that is robust against model misspecification by using data-adaptive techniques to estimate nuisance parameters. The authors contrast their method with traditional doubly robust estimators, noting that their approach uses flexible learning to reduce inconsistency risk.
Authors: Yanqin Fan, Ming He, Liangjun Su, Xiao-Hua Zhou
Year: 2019
In this paper, we propose a smoothed Q-learning algorithm for estimating optimal dynamic treatment regimes. In contrast to the Q-learning algorithm in which nonregular inference is involved, we show that, under assumptions adopted in this paper, the proposed smoothed Q-learning estimator is asymptotically normally distributed even when the Q-learning estimator is not and its asymptotic variance can be consistently estimated. As a result, inference based on the smoothed Q-learning estimator is standard. We derive the optimal smoothing parameter and propose a data-driven method for estimating it. The finite sample properties of the smoothed Q-learning estimator are studied and compared with several existing estimators including the Q-learning estimator via an extensive simulation study. We illustrate the new method by analyzing data from the Clinical Antipsychotic Trials of Intervention Effectiveness-Alzheimer's Disease (CATIE-AD) study.
Estimand: The method relies on the correct choice of a smoothing parameter (bandwidth) to balance bias and variance, though the authors provide asymptotic guidance. Interestingly, they use the interaction of treatment response with other covariates to decide on subsequent treatment options in the DTR. The method estimates the parameters of the Q-function (the conditional expectation of the reward); the causal effect is typically summarized by the coefficients of the interactions between treatment and history (which determine the optimal decision).
Estimator Description: The paper proposes a Smoothed Q-learning algorithm. It modifies the traditional Q-learning algorithm by replacing the non-smooth ‘max’ operator (or indicator function) with a smooth approximation (e.g., using a kernel function or sigmoid). The primary novelty is addressing the non-regularity of standard Q-learning. Standard Q-learning involves a non-smooth maximization step (finding the max Q-value), which leads to complicated, non-normal asymptotic distributions when the treatment effect is small or zero (non-regular inference). By smoothing this step, the authors restore asymptotic normality, allowing for standard Wald-type inference and confidence intervals. Of primary interest are the tailoring variables and their moderators, captured by the interaction coefficients.
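Illustrative Sketch: The smoothing idea in isolation, using one common kernel choice (Gaussian); whether this matches the paper's exact kernel is not asserted, and the paper's optimal, data-driven bandwidth is not reproduced. The kinked pseudo-outcome max(Q0, Q1) = Q0 + max(0, Q1 - Q0) is replaced by a smooth surrogate so the downstream estimator becomes a regular, asymptotically normal function of the data.

```python
import numpy as np
from scipy.stats import norm

def smoothed_max(q0, q1, h=0.1):
    # Gaussian smoothing of max(0, d): E[max(0, d + h*Z)]
    #   = d * Phi(d / h) + h * phi(d / h),  Z ~ N(0, 1),
    # which converges to the hard max as the bandwidth h -> 0.
    d = q1 - q0
    return q0 + d * norm.cdf(d / h) + h * norm.pdf(d / h)

d = np.linspace(-1.0, 1.0, 5)
print(np.maximum(0.0, d))               # hard max component, kinked at 0
print(smoothed_max(np.zeros(5), d))     # smooth, differentiable surrogate
```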
Authors: Yuhe Gao, Chengchun Shi, Rui Song
Year: 2023
Dynamic treatment regimes assign personalized treatments to patients sequentially over time based on their baseline information and time-varying covariates. In mobile health applications, these covariates are typically collected at different frequencies over a long time horizon. In this paper, we propose a deep spectral Q-learning algorithm, which integrates principal component analysis (PCA) with deep Q-learning to handle the mixed frequency data. In theory, we prove that the mean return under the estimated optimal policy converges to that under the optimal one and establish its rate of convergence. The usefulness of our proposal is further illustrated via simulations and an application to a diabetes dataset.
Estimand: The paper proposes an estimator that assumes the eigenvalues follow a specific decaying trend, a Markovian process, a margin-type condition, and a specific Bellman optimality operator (completeness), as well as consistency of the estimator. No ICE handling was specified. The estimator targets the state-value function (mean return/expected cumulative reward) under the optimal policy.
Estimator Description: In mobile health applications, data is often collected at different frequencies. The contribution of this paper is to integrate Functional Data Analysis (PCA/Spectral decomposition) with Deep Q-Networks (DQN) to handle mixed frequency data in mobile health, where patient covariates (like continuous glucose monitoring) are collected at high frequencies (e.g., every 5 minutes), but treatment decisions are made at lower frequencies (e.g., hourly). Standard DTR methods struggle with this high-dimensional longitudinal data.
Authors: P. Ghosh, X. Wang, T. Nalamada, S. Agarwal, M. Jahja, B. Chakraborty
Year: 2025
A dynamic treatment regimen (DTR) is a set of decision rules to personalize treatments for an individual using their medical history. The Q-learning-based Q-shared algorithm has been used to develop DTRs that involve decision rules shared across multiple stages of intervention. We show that the existing Q-shared algorithm can suffer from non-convergence due to the use of linear models in the Q-learning setup, and identify the condition under which Q-shared fails. We develop a penalized Q-shared algorithm that not only converges in settings that violate the condition, but can outperform the original Q-shared algorithm even when the condition is satisfied. We give evidence for the proposed method in a real-world application and several synthetic simulations.
Estimand: No additional assumptions are made. ICE handling is not mentioned. The measure is the value (expected outcome) of the optimal regime, typically compared against other regimes or a control.
Estimator Description: Shared Q-learning is an existing method for cases in which the decision rules are desired to be constant across stages. The authors demonstrate that shared Q-learning can fail to converge due to the use of linear models. To overcome this, they propose a penalised shared Q-learning algorithm that converges in situations where conventional shared Q-learning fails. Furthermore, the proposed penalised shared Q-learning is shown to outperform conventional shared Q-learning even under conditions in which convergence is not threatened.
Authors: Y Goldberg, MR Kosorok
Year: 2012
We develop methodology for a multistage-decision problem with flexible number of stages in which the rewards are survival times that are subject to censoring. We present a novel Q-learning algorithm that is adjusted for censored data and allows a flexible number of stages. We provide finite sample bounds on the generalization error of the policy learned by the algorithm, and show that when the optimal Q-function belongs to the approximation space, the expected survival time for policies obtained by the algorithm converges to that of the optimal policy. We simulate a multistage clinical trial with flexible number of stages and apply the proposed censored-Q-learning algorithm to find individualized treatment regimens. The methodology presented in this paper has implications in the design of personalized medicine trials in cancer and in other life-threatening diseases.
Estimand: The proposed method uses inverse-probability-of-censoring weighting to correct the bias induced by censoring, which relies on the CAR assumption. In the simulation, the algorithm assigns treatment based on wellness and tumor growth; the employed strategy is thus incorporating the ICE of disease progression into the DTR. Noteworthy: since the authors do not assume that the problem is Markovian, they present a version of Q-learning that uses backward recursion. DTRs are compared based on the mean survival time.
Estimator Description: The authors propose a flexible Q-learning algorithm that can be applied to an arbitrary number of allocation stages. Rewards are defined as survival times, making the estimator applicable to time-to-event data. Since they do not assume that the problem is Markovian, they present a version of Q-learning that uses backward recursion.
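Illustrative Sketch: Only the IPCW ingredient is sketched: uncensored subjects are weighted by the inverse Kaplan-Meier probability of remaining uncensored, so the weighted complete cases represent the full cohort (valid under coarsening at random). The paper builds such weights into its censored-Q-learning objective; the data here are hypothetical.

```python
import numpy as np

def km_prob(time, indicator, t_eval):
    # Kaplan-Meier estimate of P(censoring time > t_eval);
    # indicator = 1 flags a censoring event.
    s = 1.0
    for t in np.sort(np.unique(time[indicator == 1])):
        if t > t_eval:
            break
        s *= 1 - np.sum((time == t) & (indicator == 1)) / np.sum(time >= t)
    return s

rng = np.random.default_rng(9)
T = rng.exponential(2.0, 400)                         # latent event times
C = rng.exponential(3.0, 400)                         # censoring times
time, delta = np.minimum(T, C), (T <= C).astype(int)  # delta = 1: event seen

Khat = np.array([km_prob(time, 1 - delta, t) for t in time])
w = np.zeros(len(time))
w[delta == 1] = 1.0 / Khat[delta == 1]                # IPCW weights, 0 if censored
print(np.sum(w * time) / len(time))                   # IPCW estimate of mean survival
```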
Authors: Wei Gong, Linxiao Cao, Yifei Zhu, Fang Zuo, Xin He, Haoquan Zhou
Year: 2023
Clinical decision-making models have been developed to support therapeutic interventions based on medical data from either a single hospital or multiple hospitals. However, models based on multihospital data require collaboration among hospitals to integrate local data, which can result in information leakage and violate patient privacy. To address this challenge, we propose a novel approach that combines federated learning (FL) with inverse reinforcement learning (IRL) to create an efficient medical decision-making support tool while preserving patient privacy. Our approach uses an IRL algorithm with differential privacy to train a neural network-based agent on local data containing clinician trajectories, which learns a private treatment policy by observing patients' conditions. Additionally, we integrate FL into the proposed algorithm to learn a global optimal action policy collaboratively among various smart intensive care units, overcoming data limitations at each hospital. We evaluate our approach using real-world medical data and demonstrate that it achieves superior performance in a distributed manner.
Estimand: IRL assumes the environment (patient-state evolution) follows an MDP. It also typically implies expert optimality: the observed expert trajectories are assumed to maximize some latent reward function, which the algorithm tries to recover. In addition, the data must be IID, or handled appropriately if not (which the authors claim to do). ICE handling is not addressed. The method estimates a policy and a reward function; performance is measured by the difference between the learned policy's actions and the expert's actions (risk difference/regret).
Estimator Description: The novelty is the combination of federated learning (FL) with inverse reinforcement learning (IRL). Federated learning allows training a global policy across multiple hospitals (ICUs) without sharing raw patient data, overcoming data silos and privacy concerns. Instead of maximizing a predefined reward (like survival), IRL learns a private treatment policy by observing patients' conditions and clinician trajectories: it tries to infer the experts' implicit reward function and then optimize it.
Authors: R Hager, AA Tsiatis, M Davidian
Year: 2018
Clinicians often make multiple treatment decisions at key points over the course of a patient's disease. A dynamic treatment regime is a sequence of decision rules, each mapping a patient's observed history to the set of available, feasible treatment options at each decision point, and thus formalizes this process. An optimal regime is one leading to the most beneficial outcome on average if used to select treatment for the patient population. We propose a method for estimation of an optimal regime involving two decision points when the outcome of interest is a censored survival time, which is based on maximizing a locally efficient, doubly robust, augmented inverse probability weighted estimator for average outcome over a class of regimes. By casting this optimization as a classification problem, we exploit well-studied classification techniques such as support vector machines to characterize the class of regimes and facilitate implementation via a backward iterative algorithm. Simulation studies of performance and application of the method to data from a sequential, multiple assignment randomized clinical trial in acute leukemia are presented.
Estimand: The authors state that ‘censoring is noninformative in that the hazard … depends only on information observed through time and not on unobserved potential outcomes’. This is akin to the coarsening-at-random assumption common with IPW-based correction methods for informative censoring. This requires that all common causes of outcome and censoring are measured and conditioned on. Handling of informative censoring using IPW is part of the hypothetical strategy for handling ICE. They also use a composite outcome ‘event free survival’ to handle toxicity arising as part of the treatment. The estimator measures the expected value (mean) of a transformed time-to-event outcome under a specific regime.
Estimator Description: The authors estimate an optimal regime over two decision points for censored survival outcomes by maximizing a locally efficient, doubly robust, augmented inverse probability weighted estimator of the average outcome over a class of regimes. Casting this optimization as a classification problem lets well-studied techniques such as support vector machines characterize the class of regimes, implemented via a backward iterative algorithm.
Authors: Robin Henderson, Phil Ansell, Deyadeen Alshibani
Year: 2010
We consider optimal dynamic treatment regime determination in practice. Model building, checking, and comparison have had little or no attention so far in this literature. Motivated by an application on optimal dosage of anticoagulants, we propose a modeling and estimation strategy that incorporates the regret functions of Murphy (2003, Journal of the Royal Statistical Society, Series B 65, 331-366) into a regression model for observed responses. Estimation is quick and diagnostics are available, meaning a variety of candidate models can be compared. The method is illustrated using simulation and the anticoagulation application.
Estimand: This paper estimates the optimal dynamic treatment rule, i.e., the optimal sequence of treatment assignments is identified. This resembles the treatment-policy strategy if ICEs are not considered. The paper makes no reference to bias induced by ICEs such as dropout, death, or non-adherence. It could thus be used to calculate an ITT estimand, provided that follow-up data are complete (even for dropouts) or imputed.
Estimator Description: The regret-regression estimator is a method that incorporates causal regret functions into a standard regression model for observed responses.
Authors: X Huang, S Choi, L Wang, PF Thall
Year: 2015
In medical therapies involving multiple stages, a physician's choice of a subject's treatment at each stage depends on the subject's history of previous treatments and outcomes. The sequence of decisions is known as a dynamic treatment regime or treatment policy. We consider dynamic treatment regimes in settings where each subject's final outcome can be defined as the sum of longitudinally observed values, each corresponding to a stage of the regime. Q-learning, which is a backward induction method, is used to first optimize the last stage treatment then sequentially optimize each previous stage treatment until the first stage treatment is optimized. During this process, model-based expectations of outcomes of late stages are used in the optimization of earlier stages. When the outcome models are misspecified, bias can accumulate from stage to stage and become severe, especially when the number of treatment stages is large. We demonstrate that a modification of standard Q-learning can help reduce the accumulated bias. We provide a computational algorithm, estimators, and closed-form variance formulas. Simulation studies show that the modified Q-learning method has a higher probability of identifying the optimal treatment regime even in settings with misspecified models for outcomes. It is applied to identify optimal treatment regimes in a study for advanced prostate cancer and to estimate and compare the final mean rewards of all the possible discrete two-stage treatment sequences.
Estimand: The authors make a Markovian assumption. They use the DTR strategy to adapt to treatment response, thereby incorporating the ICE of non-response into the regime. The outcome of interest is the maximum expected potential outcome, whereby the optimal DTR is selected.
Estimator Description: The authors propose a Modified Q-Learning method to reduce the accumulated bias in multi-stage settings. Standard Q-learning relies on backward induction where future optimal outcomes are predicted using models; if these models are misspecified, bias accumulates as one moves backward to earlier stages. The proposed method uses accumulated data (actual observed rewards/outcomes from future stages) instead of purely model-predicted values during the backward induction steps, thereby reducing sensitivity to model misspecification. A two-stage sketch follows below.
Software: No software implementation, but a detailed computational algorithm is given in Section 2.2.
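Code sketch: A minimal two-stage illustration of the contrast between the standard and modified pseudo-outcomes, with invented data and linear working models; this follows the modification described above (observed future reward plus the estimated loss from a suboptimal stage-2 action) rather than reproducing the authors' full algorithm.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    s1 = rng.normal(size=n); a1 = rng.integers(0, 2, n)
    y1 = s1 + a1 * (0.5 - s1) + rng.normal(size=n)
    s2 = 0.5 * s1 + rng.normal(size=n); a2 = rng.integers(0, 2, n)
    y2 = s2 + a2 * (s2 - 0.2) + rng.normal(size=n)

    def fit_q(s, a, target):
        # Working model: Q(s, a) = b0 + b1*s + a*(b2 + b3*s).
        X = np.column_stack([np.ones_like(s), s, a, a * s])
        return np.linalg.lstsq(X, target, rcond=None)[0]

    def q_val(b, s, a):
        return b[0] + b[1] * s + a * (b[2] + b[3] * s)

    b2 = fit_q(s2, a2, y2)                      # stage-2 fit (shared by both variants)
    q2_max = np.maximum(q_val(b2, s2, 0), q_val(b2, s2, 1))

    standard = y1 + q2_max                      # fully model-predicted future value
    modified = y1 + y2 + (q2_max - q_val(b2, s2, a2))  # observed reward + estimated loss

    b1_std = fit_q(s1, a1, standard)
    b1_mod = fit_q(s1, a1, modified)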
Authors: William Hua, Hongyuan Mei, Sarah Zohar, Magali Giral, Yanxun Xu
Year: 2022
Accurate models of clinical actions and their impacts on disease progression are critical for estimating personalized optimal dynamic treatment regimes (DTRs) in medical/health research, especially in managing chronic conditions. Traditional statistical methods for DTRs usually focus on estimating the optimal treatment or dosage at each given medical intervention, but overlook the important question of ‘when this intervention should happen.’ We fill this gap by developing a two-step Bayesian approach to optimize clinical decisions with timing. In the first step, we build a generative model for a sequence of medical interventions (which are discrete events in continuous time) with a marked temporal point process (MTPP) where the mark is the assigned treatment or dosage. Then this clinical action model is embedded into a Bayesian joint framework where the other components model clinical observations including longitudinal medical measurements and time-to-event data conditional on treatment histories. In the second step, we propose a policy gradient method to learn the personalized optimal clinical decision that maximizes the patient survival by interacting the MTPP with the model on clinical observations while accounting for uncertainties in clinical observations learned from the posterior inference of the Bayesian joint model in the first step. A signature application of the proposed approach is to schedule follow-up visitations and assign a dosage at each visitation for patients after kidney transplantation. We evaluate our approach with comparison to alternative methods on both simulated and real-world datasets. In our experiments, the personalized decisions made by the proposed method are clinically useful: they are interpretable and successfully help improve patient survival.
Estimand: The decision at the current time is independent of the full history given the current clinical measurement (i.e., the Markov property; though the authors note this can be relaxed). Toxicity is used as a factor in determining treatment (hypothetical DTR strategy), but its role is complex: it serves as a latent ‘tailoring’ variable to optimize the dosage and timing in a hypothetical DTR rather than acting as a binary switch in a fixed protocol. The estimator predicts and maximizes the survival time for a specific patient under the optimized policy.
Estimator Description: The paper introduces a novel two-step Bayesian method combining a Marked Temporal Point Process (MTPP) with Policy Gradient (Reinforcement Learning) to estimate and optimize personalized dynamic treatment regimes (DTRs). Unlike standard DTR methods that focus only on what treatment to give, this method uses machine learning (RL) to optimize when to treat (timing) and how much (dosage) in continuous time.
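Code sketch: As a generic illustration of the first-step ingredient, the snippet below simulates visit times from a self-exciting marked point process via Ogata thinning; the exponential kernel, the rate constants, and the dosage marks are assumptions made for illustration, not the model fitted in the paper.

    import numpy as np

    rng = np.random.default_rng(1)

    def intensity(t, history, mu=0.2, alpha=0.6, beta=1.0):
        # Hawkes-type intensity: past visits raise the rate of future
        # visits, with an exponentially decaying contribution.
        return mu + sum(alpha * np.exp(-beta * (t - ti)) for ti, _ in history)

    def simulate_mtpp(horizon=30.0):
        # Ogata thinning: propose from a dominating rate, accept with
        # probability intensity/bound; attach a dosage mark to each visit.
        history, t = [], 0.0
        while t < horizon:
            bound = intensity(t, history)  # valid bound: intensity decays between events
            t += rng.exponential(1.0 / bound)
            if t < horizon and rng.uniform() < intensity(t, history) / bound:
                history.append((t, rng.choice([1.0, 2.0, 4.0])))  # hypothetical dose mark
        return history

    visits = simulate_mtpp()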
Authors: Liyuan Hu, Jitao Wang, Zhenke Wu, Chengchun Shi
Year: 2025
This paper focuses on reinforcement learning (RL) with clustered data, which is commonly encountered in healthcare applications. We propose a generalized fitted Q-iteration (FQI) algorithm that incorporates generalized estimating equations into policy learning to handle the intra-cluster correlations. Theoretically, we demonstrate (i) the optimalities of our Q-function and policy estimators when the correlation structure is correctly specified and (ii) their consistencies when the structure is mis-specified. Empirically, through simulations and analyses of a mobile health dataset, we find the proposed generalized FQI achieves, on average, a half reduction in regret compared to the standard FQI.
Estimand: The authors assume a Markov decision process. They further assume a working correlation structure for intra-cluster correlations, though consistency is proven even if this structure is misspecified. ICE handling is not addressed. The measure is the value function (expected discounted cumulative reward) or regret (the difference in value between the optimal and the learned policy).
Estimator Description: The authors propose a generalised fitted Q-iteration algorithm for clustered data that incorporates generalised estimating equations into policy learning to handle the intra-cluster correlations. A minimal sketch follows below.
Software: Code for reproducing the simulation studies is available at https://github.com/zaza0209/GEERL
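Code sketch: A minimal sketch of the idea, assuming linear Q-function features, a discount factor, and an exchangeable working correlation; it uses the GEE implementation in statsmodels as a stand-in for the authors' code, and the clustered data are invented.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n_clusters, m = 50, 4
    groups = np.repeat(np.arange(n_clusters), m)    # cluster membership
    s = rng.normal(size=n_clusters * m)
    a = rng.integers(0, 2, n_clusters * m)
    r = s + a * (0.3 - s) + rng.normal(size=n_clusters * m)
    s_next = 0.5 * s + rng.normal(size=n_clusters * m)
    gamma = 0.9

    def design(s, a):
        return np.column_stack([np.ones_like(s), s, a, a * s])

    # Fitted Q-iteration in which each regression step is a GEE with an
    # exchangeable working correlation to absorb intra-cluster dependence.
    target = r.copy()
    for _ in range(20):
        fit = sm.GEE(target, design(s, a), groups=groups,
                     family=sm.families.Gaussian(),
                     cov_struct=sm.cov_struct.Exchangeable()).fit()
        q_next = np.maximum(fit.predict(design(s_next, np.zeros_like(s))),
                            fit.predict(design(s_next, np.ones_like(s))))
        target = r + gamma * q_next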
Authors: Nicholas Illenberger, Andrew J. Spieker, Nandita Mitra
Year: 2023
Health policy decisions regarding patient treatment strategies require consideration of both treatment effectiveness and cost. We propose a two-step approach for identifying an optimally cost-effective and interpretable dynamic treatment regime. First, we develop a combined Q-learning and policy-search approach to estimate optimal list-based regimes under a constraint on expected treatment costs. Second, we propose an iterative procedure to select an optimally cost-effective regime from a set of candidate regimes corresponding to different cost constraints. Our approach can estimate optimal regimes in the presence of time-varying confounding, censoring, and correlated outcomes. Through simulation studies, we examine the operating characteristics of our approach under flexible modelling approaches. We also apply our methodology to identify optimally cost-effective treatment strategies for assigning adjuvant therapies to endometrial cancer patients.
Estimand: Besides the usual identifiability conditions, the authors assume CAR to handle censoring (here referred to as ‘sequentially ignorable censoring’). Censoring is then intervened on by treating it as an intervention on a counterfactual outcome, thereby using the hypothetical strategy to deal with the ICE of informative censoring.
Estimator Description: The contribution of this paper lies in integrating cost-effectiveness analysis directly into the DTR optimization process. The authors propose a combined Q-learning and policy-search approach that estimates optimal list-based regimes (interpretable rules) subject to a constraint on expected treatment costs. They then use an iterative procedure to select the regime that maximizes the Net Monetary Benefit (NMB) across a range of willingness-to-pay thresholds.
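Code sketch: The second-step selection can be illustrated with the usual net monetary benefit identity, NMB(lambda) = lambda * effectiveness - cost, where lambda is the willingness-to-pay threshold; the candidate regimes and numbers below are entirely hypothetical.

    # Candidate regimes estimated under different cost constraints:
    # (expected effectiveness, expected cost) pairs (hypothetical values).
    candidates = {"low-cost": (0.60, 8_000),
                  "mid-cost": (0.68, 15_000),
                  "high-cost": (0.71, 30_000)}

    def nmb(effect, cost, wtp):
        # Net monetary benefit at willingness-to-pay threshold wtp.
        return wtp * effect - cost

    for wtp in (20_000, 50_000, 100_000):
        best = max(candidates, key=lambda k: nmb(*candidates[k], wtp))
        print(wtp, best, round(nmb(*candidates[best], wtp)))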
Authors: A Jaman, G Wang, A Ertefaie, M Bally, R Lévesque, RW Platt, ME Schnitzer
Year: 2025
Effect modification occurs when the impact of the treatment on an outcome varies based on the levels of other covariates known as effect modifiers. Modeling these effect differences is important for etiological goals and for purposes of optimizing treatment. Structural nested mean models (SNMMs) are useful causal models for estimating the potentially heterogeneous effect of a time-varying exposure on the mean of an outcome in the presence of time-varying confounding. A data-adaptive selection approach is necessary if the effect modifiers are unknown a priori and need to be identified. Although variable selection techniques are available for estimating the conditional average treatment effects using marginal structural models or for developing optimal dynamic treatment regimens, all of these methods consider a single end-of-follow-up outcome. In the context of an SNMM for repeated outcomes, we propose a doubly robust penalized G-estimator for the causal effect of a time-varying exposure with a simultaneous selection of effect modifiers and prove the oracle property of our estimator. We conduct a simulation study for the evaluation of its performance in finite samples and verification of its double-robustness property. Our work is motivated by the study of hemodiafiltration for treating patients with end-stage renal disease at the Centre Hospitalier de l'Université de Montréal. We apply the proposed method to investigate the effect heterogeneity of dialysis facility on the repeated session-specific hemodiafiltration outcomes.
Estimand: No additional assumptions are made. ICE handling is not specified. The effect measure is the blip function (the conditional effect of the current treatment, relative to a reference treatment, on the mean outcome given history).
Estimator Description: The authors highlight the need for procedures that automatically identify effect modifiers when these are not known a priori, especially in the context of repeated measures outcomes. They propose a doubly-robust penalised g-estimator for the causal effect of a time-varying exposure with a simultaneous selection of effect modifiers.
Authors: C. Jiang, M. Thompson, M. Wallace
Year: 2024
The focus of precision medicine is on decision support, often in the form of dynamic treatment regimes, which are sequences of decision rules. At each decision point, the decision rules determine the next treatment according to the patient’s baseline characteristics, the information on treatments and responses accrued by that point, and the patient’s current health status, including symptom severity and other measures. However, dynamic treatment regime estimation with ordinal outcomes is rarely studied, and rarer still in the context of interference – where one patient’s treatment may affect another’s outcome. In this paper, we introduce the weighted proportional odds model: a regression based, approximate doubly-robust approach to single-stage dynamic treatment regime estimation for ordinal outcomes. This method also accounts for the possibility of interference between individuals sharing a household through the use of covariate balancing weights derived from joint propensity scores. Examining different types of balancing weights, we verify the approximate double robustness of weighted proportional odds model with our adjusted weights via simulation studies. We further extend weighted proportional odds model to multi-stage dynamic treatment regime estimation with household interference, namely dynamic weighted proportional odds model. Lastly, we demonstrate our proposed methodology in the analysis of longitudinal survey data from the Population Assessment of Tobacco and Health study, which motivates this work. Furthermore, considering interference, we provide optimal treatment strategies for households to achieve smoking cessation of the pair in the household.
Estimand: The estimator assumes proportional odds: that the effects of covariates are constant across all outcome categories (i.e., that the slopes are parallel across the cumulative log-odds). ICE handling strategies are not specified. The causal effect is quantified as the difference in the log-odds of the ordinal outcome under a specific treatment configuration versus the null (a log odds ratio).
Estimator Description: The authors propose an estimator of optimal dynamic treatment regimes with ordinal outcomes, the dynamic weighted proportional odds model, which corrects for household-interference by using covariate balancing weights derived from joint propensity scores.
Authors: Simi Job, Xiaohui Tao, Lin Li, Haoran Xie, Taotao Cai, Jianming Yong, Qing Li
Year: 2024
Personalized clinical decision support systems are increasingly being adopted due to the emergence of data-driven technologies, with this approach now gaining recognition in critical care. The task of incorporating diverse patient conditions and treatment procedures into critical care decision-making can be challenging due to the heterogeneous nature of medical data. Advances in Artificial Intelligence (AI), particularly Reinforcement Learning (RL) techniques, enables the development of personalized treatment strategies for severe illnesses by using a learning agent to recommend optimal policies. In this study, we propose a Deep Reinforcement Learning (DRL) model with a tailored reward function and an LSTM-GRU-derived state representation to formulate optimal treatment policies for vasopressor administration in stabilizing patient physiological states in critical care settings. Using an ICU dataset and the Medical Information Mart for Intensive Care (MIMIC-III) dataset, we focus on patients with Acute Respiratory Distress Syndrome (ARDS) that has led to Sepsis, to derive optimal policies that can prioritize patient recovery over patient survival. Both the DDQN (RepDRL-DDQN) and Dueling DDQN (RepDRL-DDDQN) versions of the DRL model surpass the baseline performance, with the proposed model's learning agent achieving an optimal learning process across our performance measuring schemes. The robust state representation served as the foundation for enhancing the model's performance, ultimately providing an optimal treatment policy focused on rapid patient recovery.
Estimand: The authors assume a Markov Decision Process. Handling of ICE is not addressed. The estimator uses cumulative reward (or Q-values) to measure the effect of the regime.
Estimator Description: The authors propose a deep reinforcement learning (DRL) model with a tailored reward function and a Long Short-Term Memory and Gated Recurrent Unit (LSTM-GRU)-derived state representation to formulate optimal treatment policies. The novelty lies in augmenting conventional RL techniques with embeddings as input, in the form of a state representation learned by an LSTM-GRU model to capture the ‘patient trajectory in its entirety’. A sketch of such an encoder follows below.
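Code sketch: A sketch of what such a recurrent state encoder could look like in PyTorch; the stacking order (an LSTM feeding a GRU), the layer sizes, and the input dimensions are assumptions for illustration, as the paper's exact architecture may differ.

    import torch
    import torch.nn as nn

    class StateEncoder(nn.Module):
        # Compresses a patient's measurement history into a fixed-length
        # state representation to be fed to the (dueling) DDQN.
        def __init__(self, n_features, hidden=32):
            super().__init__()
            self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
            self.gru = nn.GRU(hidden, hidden, batch_first=True)

        def forward(self, seq):            # seq: (batch, time, n_features)
            h, _ = self.lstm(seq)
            h, _ = self.gru(h)
            return h[:, -1, :]             # last hidden state as the state

    enc = StateEncoder(n_features=10)
    state = enc(torch.randn(8, 24, 10))    # batch of 8 histories, 24 time steps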
Authors: M Josefsson, MJ Daniels
Year: 2021
Causal inference with observational longitudinal data and time-varying exposures is often complicated by time-dependent confounding and attrition. The G-computation formula is one approach for estimating a causal effect in this setting. The parametric modeling approach typically used in practice relies on strong modeling assumptions for valid inference, and moreover depends on an assumption of missing at random, which is not appropriate when the missingness is missing not at random (MNAR) or due to death. In this work we develop a flexible Bayesian semi-parametric G-computation approach for assessing the causal effect on the subpopulation that would survive irrespective of exposure, in a setting with MNAR dropout. The approach is to specify models for the observed data using Bayesian additive regression trees, and then use assumptions with embedded sensitivity parameters to identify and estimate the causal effect. The proposed approach is motivated by a longitudinal cohort study on cognition, health, and aging, and we apply our approach to study the effect of becoming a widow on memory. We also compare our approach to several standard methods.
Estimand: The authors develop a flexible Bayesian semi-parametric G-computation approach for assessing the causal effect in the subpopulation that would survive irrespective of exposure, assuming dropout is MNAR or due to death. They therefore apply the principal stratum strategy for handling the ICE of death. To handle bias due to MNAR dropout, they introduce sensitivity parameters that model the difference between the outcome distributions of those who dropped out and those who did not, conditional on covariates and history. They use Bayesian Additive Regression Trees (BART) to model the observed data distributions flexibly. The correction is applied by modifying the imputation step in the G-computation algorithm using the sensitivity parameters: when simulating the potential outcomes for the ‘missing’ counterfactuals, they draw from the modified distribution rather than the observed distribution. This is therefore a hypothetical strategy for MNAR dropout. The paper estimates the mean difference in outcomes between two static treatment regimes (e.g., ‘always treated’ vs ‘never treated’) within the principal stratum of always-survivors.
Estimator Description: The novelty of this method is the combination of Bayesian semi-parametric G-computation (using Bayesian Additive Regression Trees, BART) with a specific strategy to handle both Missing Not At Random (MNAR) dropout and truncation by death (using Principal Stratification). It avoids strong parametric assumptions by using flexible BART models for the observed data components.
Authors: EB Laber, F Wu, C Munera, I Lipkovich, S Colucci, S Ripa
Year: 2018
There is growing interest and investment in precision medicine as a means to provide the best possible health care. A treatment regime formalizes precision medicine as a sequence of decision rules, one per clinical intervention period, that specify if, when and how current treatment should be adjusted in response to a patient's evolving health status. It is standard to define a regime as optimal if, when applied to a population of interest, it maximizes the mean of some desirable clinical outcome, such as efficacy. However, in many clinical settings, a high-quality treatment regime must balance multiple competing outcomes; eg, when a high dose is associated with substantial symptom reduction but a greater risk of an adverse event. We consider the problem of estimating the most efficacious treatment regime subject to constraints on the risk of adverse events. We combine nonparametric Q-learning with policy-search to estimate a high-quality yet parsimonious treatment regime. This estimator applies to both observational and randomized data, as well as settings with variable, outcome-dependent follow-up, mixed treatment types, and multiple time points. This work is motivated by and framed in the context of dosing for chronic pain; however, the proposed framework can be applied generally to estimate a treatment regime which maximizes the mean of one primary outcome subject to constraints on one or more secondary outcomes. We illustrate the proposed method using data pooled from 5 open-label flexible dosing clinical trials for chronic pain.
Estimand: The authors use a while-on-treatment strategy for the ICE of dropout: only data up until the participant drops out are used to calculate the outcome. They also incorporate toxicity as a criterion for dose-switching in the DTR. The estimator measures the Expected Cumulative Utility (Efficacy) and Expected Cumulative Cost (Risk) under a regime. They also employ a composite strategy by including a cost term (reflecting toxicity) in the utility function.
Estimator Description: The authors propose a method to estimate an optimal dynamic treatment regime (DTR) that maximizes a primary outcome (efficacy) subject to a user-specified constraint on a secondary outcome (safety/risk). The novelty lies in combining non-parametric Q-learning with policy-search to handle these constraints in a multi-stage setting with flexible dosing.
Authors: EB Laber, DJ Lizotte, B Ferguson
Year: 2014
Dynamic treatment regimes (DTRs) operationalize the clinical decision process as a sequence of functions, one for each clinical decision, where each function maps up-to-date patient information to a single recommended treatment. Current methods for estimating optimal DTRs, for example Q-learning, require the specification of a single outcome by which the "goodness" of competing dynamic treatment regimes is measured. However, this is an over-simplification of the goal of clinical decision making, which aims to balance several potentially competing outcomes, for example, symptom relief and side-effect burden. When there are competing outcomes and patients do not know or cannot communicate their preferences, formation of a single composite outcome that correctly balances the competing outcomes is not possible. This problem also occurs when patient preferences evolve over time. We propose a method for constructing DTRs that accommodates competing outcomes by recommending sets of treatments at each decision point. Formally, we construct a sequence of set-valued functions that take as input up-to-date patient information and give as output a recommended subset of the possible treatments. For a given patient history, the recommended set of treatments contains all treatments that produce non-inferior outcome vectors. Constructing these set-valued functions requires solving a non-trivial enumeration problem. We offer an exact enumeration algorithm by recasting the problem as a linear mixed integer program. The proposed methods are illustrated using data from the CATIE schizophrenia study.
Estimand: The authors propose a novel method for accounting for competing outcomes (not to be confused with competing risks analysis from survival analysis), e.g., symptom relief and side-effect burden. Instead of recommending a single treatment at each decision point, the proposed algorithm provides Set-Valued Dynamic Treatment Regimes (SVDTRs), from which one treatment can then be selected based on patient preference. This can be seen as applying the composite strategy (in this case to the ICE of toxicity, but it could be any outcome, corresponding to any ICE). No additional assumptions are made. The expected mean outcome is given for each of the competing outcomes, which can be on different scales.
Estimator Description: The authors propose Set-Valued Dynamic Treatment Regimes (SVDTRs). The novelty is that instead of recommending a single ‘optimal’ treatment (which requires a pre-specified composite outcome or utility function combining competing objectives like efficacy and safety), the estimator recommends a set of treatments. This set contains all options that are ‘non-dominated’ (Pareto optimal) with respect to the competing outcomes. This allows clinical decision-making to accommodate ‘patient preferences [that] evolve over time’ or are unknown at the design stage. They provide a ‘novel mathematical programming formulation’ to estimate these sets efficiently from data.
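Code sketch: Conceptually, the recommended set at a decision point is the set of non-dominated (Pareto-optimal) treatments given the predicted outcome vectors. The brute-force filter below conveys the idea only; the paper solves the enumeration exactly by recasting it as a linear mixed integer program.

    import numpy as np

    def non_dominated(outcomes):
        # outcomes: (n_treatments, n_outcomes) predicted outcome vectors
        # for one patient history; larger is better on every axis.
        keep = []
        for i, oi in enumerate(outcomes):
            dominated = any(np.all(oj >= oi) and np.any(oj > oi)
                            for j, oj in enumerate(outcomes) if j != i)
            if not dominated:
                keep.append(i)
        return keep

    # Columns: (symptom relief, negated side-effect burden).
    preds = np.array([[0.9, -0.8], [0.5, -0.9], [0.6, -0.1]])
    print(non_dominated(preds))            # -> [0, 2]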
Authors: Eric B. Laber, Kristin A. Linn, Leonard A. Stefanski
Year: 2014
Evidence-based rules for optimal treatment allocation are key components in the quest for efficient, effective health care delivery. Q-learning, an approximate dynamic programming algorithm, is a popular method for estimating optimal sequential decision rules from data. Q-learning requires the modeling of nonsmooth, nonmonotone transformations of the data, complicating the search for adequately expressive, yet parsimonious, statistical models. The default Q-learning working model is multiple linear regression, which is not only provably misspecified under most data-generating models, but also results in nonregular regression estimators, complicating inference. We propose an alternative strategy for estimating optimal sequential decision rules for which the requisite statistical modeling does not depend on nonsmooth, nonmonotone transformed data, does not result in nonregular regression estimators, is consistent under a broader array of data-generation models than Q-learning, results in estimated sequential decision rules that have better sampling properties, and is amenable to established statistical approaches for exploratory data analysis, model building, and validation. We derive the new method, IQ-learning, via an interchange in the order of certain steps in Q-learning. In simulated experiments IQ-learning improves on Q-learning in terms of integrated mean squared error and power. The method is illustrated using data from a study of major depressive disorder.
Estimand: The method assumes the specified parametric (or semi-parametric) models for the mean and variance are correct. The authors use the DTR strategy to tackle the ICE of treatment non-response. While they do not do so themselves, they highlight options for handling competing outcomes through composite strategies. The effect measures used to compare treatments are the Q-functions, which in this case represent conditional means.
Estimator Description: The authors propose Interactive Q-learning (IQ-learning). The novelty is addressing the requirement in standard Q-learning to model non-smooth, non-monotone transformations of the data (specifically, the max operator over future Q-functions), which leads to model misspecification and non-regular estimators. IQ-learning reverses the order of operations: instead of modeling the Q-function directly, it models the conditional mean and conditional variance of the future outcome. By using these standard smooth quantities, one can interactively build models using standard diagnostic tools (like residual plots) and then derive the Q-functions.
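Code sketch: The computational payoff is that, writing the stage-2 Q-function as m(h2) + a2*delta(h2) with a2 in {-1, +1}, the maximized Q-function is m(h2) + |delta(h2)|; if delta given stage-1 history is modeled as normal, its absolute value has the closed-form folded-normal mean below. The modeling choices here are assumptions for illustration, not the authors' full procedure.

    import numpy as np
    from scipy.stats import norm

    def expected_abs_normal(mu, sigma):
        # E|Z| for Z ~ N(mu, sigma^2): the folded-normal mean.
        return (sigma * np.sqrt(2 / np.pi) * np.exp(-mu**2 / (2 * sigma**2))
                + mu * (1 - 2 * norm.cdf(-mu / sigma)))

    def stage1_target(m_mean, mu, sigma):
        # Smooth stage-1 pseudo-outcome: E[m(h2)] + E|delta(h2)|, with
        # (mu, sigma) the fitted mean/sd of delta given (h1, a1).
        return m_mean + expected_abs_normal(mu, sigma)

    print(stage1_target(m_mean=1.0, mu=0.5, sigma=1.0))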
Authors: Jeongjin Lee, Jong-Min Kim
Year: 2025
Treatment strategies are critical in healthcare, particularly when outcomes are subject to censoring. This study introduces the Counterfactual Buckley-James Q-Learning framework, which integrates counterfactual reasoning with the Buckley-James method and reinforcement learning to address challenges arising from longitudinal survival data. The Buckley-James method imputes censored survival times via conditional expectations based on observed data, offering a robust mechanism for handling incomplete outcomes. By incorporating these imputed values into a counterfactual Q-learning framework, the proposed method enables the estimation and comparison of potential outcomes under different treatment strategies. This facilitates the identification of optimal dynamic treatment regimes that maximize expected survival time. Through extensive simulation studies, the method demonstrates robust performance across various sample sizes and censoring scenarios, including right censoring and missing at random. Application to real-world clinical trial data further highlights the utility of this approach in informing personalized treatment decisions, providing an interpretable and reliable tool for optimizing survival outcomes in complex clinical settings.
Estimand: To impute the missing values and thereby correct for informative censoring, CAR needs to be assumed. A hypothetical strategy is then used to correct for the ICE of informative censoring. The effect measure is the expected survival time (or a function thereof, like restricted mean) under the optimal regime.
Estimator Description: The novelty lies in integrating the Buckley-James method (a semiparametric approach that handles censored data by imputing censored times with their conditional expectations) into the Q-learning framework. This allows standard Q-learning to be applied to longitudinal survival data. It is noteworthy that the authors had already published this estimator a year earlier in a different venue.
Authors: Hyobeen Lee, Yeji Kim, Hyungjun Cho, Sangbum Choi
Year: 2021
Dynamic treatment regimes (DTRs) are decision-making rules designed to provide personalized treatment to individuals in multi-stage randomized trials. Unlike classical methods, in which all individuals are prescribed the same type of treatment, DTRs prescribe patient-tailored treatments which take into account individual characteristics that may change over time. The Q-learning method, one of regression-based algorithms to figure out optimal treatment rules, becomes more popular as it can be easily implemented. However, the performance of the Q-learning algorithm heavily relies on the correct specification of the Q-function for response, especially in observational studies. In this article, we examine a number of double-robust weighted least-squares estimating methods for Q-learning in high-dimensional settings, where treatment models for propensity score and penalization for sparse estimation are also investigated. We further consider flexible ensemble machine learning methods for the treatment model to achieve double-robustness, so that optimal decision rule can be correctly estimated as long as at least one of the outcome model or treatment model is correct. Extensive simulation studies show that the proposed methods work well with practical sample sizes. The practical utility of the proposed methods is proven with real data example.
Estimand: (The paper had to be translated from Korean as no English version was available.) The method makes no assumptions besides identifiability. Handling of ICE is not specified. The estimator targets the Blip Function (the difference in expected potential outcomes between treatment and reference given covariates) and the Value of the optimal regime (the expected potential outcome under the optimal rule).
Estimator Description: The paper introduces a doubly-robust weighted least-squares estimating method for Q-learning in high-dimensional settings that incorporates flexible ensemble machine learning methods for the treatment model to achieve doubly robustness.
Authors: Jeongjin Lee, Jong-Min Kim
Year: 2024
This research paper presents the Buckley-James Q-learning (BJ-Q) algorithm, a cutting-edge method designed to optimize personalized treatment strategies, especially in the presence of right censoring. We critically assess the algorithm's effectiveness in improving patient outcomes and its resilience across various scenarios. Central to our approach is the innovative use of the survival time to impute the reward in Q-learning, employing the Buckley-James method for enhanced accuracy and reliability. Our findings highlight the significant potential of personalized treatment regimens and introduce the BJ-Q learning algorithm as a viable and promising approach. This work marks a substantial advancement in our comprehension of treatment dynamics and offers valuable insights for augmenting patient care in the ever-evolving clinical landscape.
Estimand: The authors highlight that they do not make any Markovian assumption. For the Buckley-James imputation method, the assumption that censoring is independent of survival time given the covariates (CAR) must hold. The use of imputation methods is akin to the hypothetical strategy for dealing with dropout. The estimator uses the expected cumulative reward (which equates to mean survival time).
Estimator Description: The authors propose the novel Buckley-James Q-learning estimator for right-censored data. Instead of relying on IPW, the authors employ the Buckley-James method, which handles imputation using existing covariate data. A sketch of the imputation engine follows below.
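Code sketch: For reference, a numpy-only sketch of the classic Buckley-James iteration on the (log-)time scale: impute each censored outcome by its conditional expectation, computed from a Kaplan-Meier estimate of the residual distribution, then refit and repeat. This shows the imputation engine only, not the authors' full BJ-Q algorithm.

    import numpy as np

    def km_tail_mean(resid, event):
        # Kaplan-Meier on residuals, then E[eps | eps > e].
        order = np.argsort(resid)
        r, d = resid[order], event[order]
        n = len(r)
        surv = np.cumprod(1 - d / (n - np.arange(n)))     # S just after each point
        mass = np.concatenate([[1.0], surv[:-1]]) - surv  # KM point masses
        def cond_mean(e):
            tail = r > e
            p = mass[tail].sum()
            # Crude fallback when no KM mass lies beyond e.
            return (r[tail] * mass[tail]).sum() / p if p > 0 else r.max()
        return cond_mean

    def buckley_james(X, y, event, n_iter=20):
        # y: observed log-time (event or censoring time); event: 1 = observed.
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        for _ in range(n_iter):            # fixed iterations; BJ need not converge
            resid = y - X @ beta
            cond_mean = km_tail_mean(resid, event)
            imput = X @ beta + np.array([cond_mean(e) for e in resid])
            y_star = np.where(event == 1, y, imput)
            beta = np.linalg.lstsq(X, y_star, rcond=None)[0]
        return beta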
Authors: S Liang, W Lu, R Song
Year: 2018
Recently deep learning has successfully achieved state-of-the-art performance on many difficult tasks. Deep neural networks outperform many existing popular methods in the field of reinforcement learning. They can also identify important covariates automatically. Parameter sharing of convolutional neural networks (CNN) greatly reduces the amount of parameters in the neural network, which allows for high scalability. However, little research has been done on deep advantage learning (A-learning). In this paper, we present a deep A-learning approach to estimate optimal dynamic treatment regimes. A-learning models the advantage function, which is of direct relevance to the goal. We use an inverse probability weighting (IPW) method to estimate the difference between potential outcomes, which does not require any model assumption on the baseline mean function. We implemented different architectures of deep CNN and convexified convolutional neural networks (CCNN). The proposed deep A-learning methods are applied to data from the STAR*D trial and are shown to have better performance compared with the penalized least squares estimator using a linear decision rule.
Estimand: The authors incorporate the ICE of protocol deviation into their DTR.
Estimator Description: The authors propose Deep Advantage Learning (DAL), a novel method for estimating optimal Dynamic Treatment Regimes (DTRs). The novelty lies in combining convolutional neural networks with Advantage Learning to handle complex, high-dimensional, and nonlinear relationships in clinical data without requiring strict parametric assumptions. The estimator measures the Expected Potential Outcome (Value) under the optimal regime compared to other regimes.
Authors: W Liang, J Jia
Year: 2025
OBJECTIVE: Early fluid resuscitation is crucial in the treatment of sepsis, yet the optimal dosage remains debated. This study aims to determine the optimal multi-stage fluid resuscitation dosage for sepsis patients. METHODS: We propose a reinforcement learning algorithm with neural networks (RL-NN), utilizing the flexibility of deep learning architectures to mitigate model misspecification. We use cross-validation and random search for hyperparameter tuning to further enhance model robustness and generalization. RESULTS: Simulation results demonstrate that our method outperforms existing methods in terms of both the percentage of correctly classified optimal treatments and the predicted counterfactual mean outcome. Applying this method to the sepsis cohort from the Medical Information Mart for Intensive Care III (MIMIC-III), we recommend that all sepsis patients receive adequate fluid resuscitation (≥ 30 mL/kg) within the first 3 h of admission to the MICU. Our approach is expected to significantly reduce the mean SOFA score by 23.71%, enhancing patient outcomes. CONCLUSION: Our RL-NN method offers an accurate, real-time approach to optimizing sepsis treatment and aligns with the 'Surviving Sepsis Campaign' guidelines. It also has the potential to be integrated with existing electronic health record (EHR) systems, guiding clinical decision-making and thereby improving patient prognosis.
Estimand: No additional assumptions are made, and ICE handling is not addressed. The measure is the expected outcome of the regime.
Estimator Description: The authors propose a reinforcement learning algorithm in combination with neural networks (RL-NN) to mitigate model misspecification.
Software: https://github.com/Vicky-LL/RL-NN
Authors: KA Linn, EB Laber, LA Stefanski
Year: 2017
A dynamic treatment regime is a sequence of decision rules, each of which recommends treatment based on features of patient medical history such as past treatments and outcomes. Existing methods for estimating optimal dynamic treatment regimes from data optimize the mean of a response variable. However, the mean may not always be the most appropriate summary of performance. We derive estimators of decision rules for optimizing probabilities and quantiles computed with respect to the response distribution for two-stage, binary treatment settings. This enables estimation of dynamic treatment regimes that optimize the cumulative distribution function of the response at a prespecified point or a prespecified quantile of the response distribution such as the median. The proposed methods perform favorably in simulation experiments. We illustrate our approach with data from a sequentially randomized trial where the primary outcome is remission of depression symptoms.
Estimand: The authors make a parametric distribution assumption: the conditional distribution of the outcome (or next-stage features) given history follows a specified parametric family (e.g., a normal or location-scale family), allowing calculation of quantiles of the mixture distribution at the previous stage. Besides the usual treatment switching in DTRs, no ICE were specified or tackled in the illustrative example. The outcome of interest is a quantile of the response distribution.
Estimator Description: The authors propose Interactive Q-learning (IQ-learning), a method to estimate optimal dynamic treatment regimes (DTRs) that maximize a quantile (e.g., the median) or a probability (e.g., probability of response > X) of the outcome distribution, rather than the mean. Standard Q-learning focuses on maximizing the mean outcome, which may not be robust or clinically relevant for skewed data or when the goal is to ‘raise the floor’ for poor responders. The novelty lies in generalizing the Q-learning framework to handle quantiles by assuming a parametric model for the entire conditional distribution of the outcome (or the residuals) and integrating over it to estimate the future optimal value (the ‘max’ operator applied to quantiles is not straightforward like it is for means). They introduce a nonsmooth maximization similar to Q-learning but adapted for quantiles.
Authors: S Lin, S Saghafian, JM Lipschitz, KE Burdick
Year: 2025
This study introduces a novel multiagent reinforcement learning (MARL) algorithm designed for identifying and optimizing personalized recommendations in bipolar disorder. The algorithm leverages longitudinal offline data from wearables to recommend self-care strategies tailored to individual patients. We focus on self-care strategies involving physical activity (measured by steps), sleep duration, and bedtime consistency, aiming to reduce the periods of mood exacerbations. A key innovation of our MARL approach is the integration of copulas to model interagent dependencies, enhancing coordination among agents and improving policy learning. Findings suggest that following our algorithm's self-care recommendations could significantly reduce periods of elevated mood symptoms, resulting in improved overall well-being. Finally, the algorithm offers important clinical insights for treating bipolar patients, and shows promising theoretical properties independent of the specific application. Thus, this work not only advances MARL applications in personalized healthcare but also provides a new algorithmic approach for adaptive interventions in a wide range of chronic diseases.
Estimand: The authors model their problem as a decentralized partially observable Markov decision process (Dec-POMDP). ICE handling is not addressed. The authors target the optimal policy that minimizes mood exacerbations by assigning negative reward to episodes of mood exacerbation.
Estimator Description: Some problem contexts require that DTRs are suggested for different target outcomes. The authors address the treatment of bipolar patients, in which different categories of treatment considerations need to be adjusted simultaneously (‘step-based activity interventions, sleep duration adjustments, and bedtime consistency enhancements’). This is akin to a multi-agent system in which each agent controls one category of treatment considerations, so the authors propose a multiagent reinforcement learning (MARL) method to this end. The method tackles the challenge of MARL in an offline setting (using observational data) while taking dependencies between the different agents (i.e., different treatment considerations) into account.
Authors: D Liu, W He
Year: 2024
The study of precision medicine involves dynamic treatment regimes (DTRs), which are sequences of treatment decision rules recommended based on patient-level information. The primary goal of the DTR study is to identify an optimal DTR, a sequence of treatment decision rules that optimizes the clinical outcome across multiple decision points. Statistical methods have been developed in recent years to estimate an optimal DTR, including Q-learning, a regression-based method in the DTR literature. Although there are many studies concerning Q-learning, little attention has been paid in the presence of noisy data, such as misclassified outcomes. In this article, we investigate the effect of outcome misclassification on identifying optimal DTRs using Q-learning and propose a correction method to accommodate the misclassification effect on DTR. Simulation studies are conducted to demonstrate the satisfactory performance of the proposed method. We illustrate the proposed method using two examples from the National Health and Nutrition Examination Survey Data I Epidemiologic Follow-up Study and the Population Assessment of Tobacco and Health Study.
Estimand: The authors assume the availability of validation data in which both the true and the misclassified outcome are observed. The ICE of interest here is outcome misclassification, a type of measurement error. They correct for this measurement error using maximum likelihood estimation, which is often used for the calculation of potential outcomes (hence hypothetical strategy). The method effectively applies a measurement correction to estimate the causal effect on the true clinical outcome (quitting smoking) rather than the observed (misreported) one. The estimator uses the expected future reward (which for a binary outcome is the probability of the event, e.g., the probability of quitting smoking).
Estimator Description: Little research has been done on the performance of Q-learning on noisy data. This paper investigates the effect of outcome misclassification on Q-learning and proposes a correction method to accommodate the misclassification effect.
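Code sketch: The core of any such correction is inverting the misclassification model. With sensitivity Se and specificity Sp (estimable from the validation data), P(Y* = 1) = Se*p + (1 - Sp)*(1 - p); the snippet below inverts this for the true event probability p. The paper embeds a correction of this kind in the Q-learning likelihood rather than applying it marginally, and the numbers here are hypothetical.

    # Hypothetical sensitivity/specificity estimated from validation data.
    se, sp = 0.90, 0.85

    def correct_prob(p_obs, se, sp):
        # Invert P(Y*=1) = se*p + (1-sp)*(1-p) for the true probability p.
        return (p_obs + sp - 1) / (se + sp - 1)

    print(correct_prob(0.30, se, sp))   # 0.30 observed -> 0.20 true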
Authors: Huidong Liu, Xiangfei Zhang, Hang Yu, Qingchen Zhang
Year: 2025
Sepsis is a highly complex and heterogeneous critical illness, requiring dynamic adjustments to treatment strategies based on individual patient characteristics. However, existing reinforcement learning (RL) methods for personalized treatment face several challenges, such as incomplete modeling of patient states, limited generalization capability of policies, and distributional shift issues in offline learning. To address these challenges, we propose a novel Dual-Channel Batch-Constrained Deep Q-Learning (DBCDQ) method to enable more precise personalized sepsis treatment. Specifically, we design a dual-channel mechanism that integrates the patient's current physiological state with their historical treatment responses, enabling comprehensive modeling of dynamic patient characteristics and improving responsiveness to individualized treatment needs. Additionally, we introduce a batch-constrained mechanism into the policy network, which enforces consistency between the learned policy and the actual clinical data distribution. This mitigates distributional shift issues in offline reinforcement learning (ORL). We evaluate our approach on the sepsis data from the MIMIC-III dataset, and experimental results show that our method outperforms state-of-the-art methods and can reduce patient clinical mortality by 3.85%.
Estimand: No additional assumptions are stated and ICE handling is not specified. The authors refer to the optimal DTR (or Policy) that maximizes the expected return.
Estimator Description: The authors propose a novel Dual-Channel Batch-Constrained Deep Q-Learning (DBCDQ) method to enable more precise personalized sepsis treatment, addressing challenges of offline learning such as incomplete modeling of patient states and distributional shift.
Authors: D. Liu, W. He
Year: 2025
Dynamic treatment regimes (DTRs) have received increasing interests in recent years. DTRs are sequences of treatment decision rules tailored to patient-level information. The main goal of the DTR study is to identify an optimal DTR, a sequence of treatment decision rules that yields the best expected clinical outcome. Q-learning has been regarded as one of the most popular regression-based methods for estimating the optimal DTR. However, it has been rarely studied in an error-prone setting, where patient information is contaminated with measurement error. In this article, we shed light on the effect of covariate measurement error on Q-learning and propose an effective method to correct the error in Q-learning. Simulation studies are conducted to assess the performance of the proposed correction method in Q-learning. We illustrate the use of the proposed method in an application to the Sequenced Treatment Alternatives to Relieve Depression data.
Estimand: No additional assumptions are stated. The authors use regression calibration to estimate the true covariate values. While this does not fall neatly into any strategy, it can be seen as a hypothetical strategy that calculates estimates under the scenario of ‘no measurement error’ to handle the ICE of erroneous covariate values. The measure is the expected clinical outcome of the regime.
Estimator Description: The authors highlight the lack of research on the effect of measurement error on estimates obtained through Q-learning. They evaluate the effect of mis-measured covariates in particular and propose an effective method for handling such measurement error by using regression calibration.
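Code sketch: A minimal sketch of regression calibration, assuming an internal validation subsample in which both the true covariate X and its error-prone surrogate W are observed: fit E[X | W] on the validation data, then substitute the calibrated value into the downstream Q-learning regressions. All data below are invented.

    import numpy as np

    rng = np.random.default_rng(3)
    # Validation subsample: true x and error-prone w both observed.
    x_val = rng.normal(size=200)
    w_val = x_val + rng.normal(scale=0.5, size=200)

    # Calibration model E[X | W], fitted by least squares.
    A = np.column_stack([np.ones_like(w_val), w_val])
    g = np.linalg.lstsq(A, x_val, rcond=None)[0]

    # Main study: only w observed; use x_hat in place of w downstream.
    w_main = rng.normal(size=1000) + rng.normal(scale=0.5, size=1000)
    x_hat = g[0] + g[1] * w_main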
Authors: M Liu, Y Wang, H Fu, D Zeng
Year: 2024
Dynamic treatment regimen (DTR) is one of the most important tools to tailor treatment in personalized medicine. For many diseases such as cancer and type 2 diabetes mellitus (T2D), more aggressive treatments can lead to a higher efficacy but may also increase risk. However, few methods for estimating DTRs can take into account both cumulative benefit and risk. In this work, we propose a general statistical learning framework to learn optimal DTRs that maximize the reward outcome while controlling the cumulative adverse risk to be below a pre-specified threshold. We convert this constrained optimization problem into an unconstrained optimization using a Lagrange function. We then solve the latter using either backward learning algorithms or simultaneously over all stages based on constructing a novel multistage ramp loss. Theoretically, we establish Fisher consistency of the proposed method and further obtain non-asymptotic convergence rates for both reward and risk outcomes under the estimated DTRs. The finite sample performance of the proposed method is demonstrated via simulation studies and through an application to a two-stage clinical trial for T2D patients. Supplementary materials for this article are available online.
Estimand: The basic premise of the new approach is that it helps define a composite outcome for adverse events. In the given example, change in BMI is used as a side-effect of insulin therapy, and failure to reach a safe hemoglobin A1c level triggers the move to the second-stage intensification treatment. This is an application of the treatment policy strategy to non-response, incorporating the ICE into the DTR as a tailoring variable. The estimator optimizes the mean reward subject to a constraint on the cumulative risk.
Estimator Description: The authors propose a statistical framework specifically concerned with optimising a reward while keeping the cumulative risk below a pre-specified threshold, using a Lagrangian relaxation converted into a weighted classification problem. They propose a new procedure under a multistage ramp loss (MRL) to estimate the DTRs simultaneously across all stages; the MRL can be viewed as an extension of the univariate ramp loss to a multivariate setting. A toy illustration of the Lagrangian device follows below.
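Code sketch: A toy illustration of the Lagrangian device with a handful of hypothetical candidate regimes: increase the multiplier until the regime maximizing reward - lambda*risk meets the risk threshold. The paper instead solves a weighted classification problem under the multistage ramp loss; this shows only the constrained-to-unconstrained conversion.

    # Hypothetical (expected reward, expected cumulative risk) per regime.
    regimes = {"conservative": (1.0, 0.05),
               "moderate": (1.4, 0.12),
               "aggressive": (1.6, 0.25)}
    tau = 0.15                          # pre-specified risk threshold

    def best_for(lam):
        # Unconstrained Lagrangian objective: reward - lam * risk.
        return max(regimes, key=lambda k: regimes[k][0] - lam * regimes[k][1])

    lam = 0.0
    while regimes[best_for(lam)][1] > tau and lam < 50:
        lam += 0.5                      # raise the price of risk
    print(best_for(lam), lam)           # -> moderate at lam = 2.0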
Authors: M Liu, Y Wang, D Zeng
Year: 2025
Dynamic treatment regimens (DTRs), where treatment decisions are tailored to individual patient's characteristics and evolving health status over multiple stages, have gained increasing interest in the modern era of precision medicine. Identifying important features that drive these decisions over stages not only leads to parsimonious DTRs for practical use but also enhances the reliability of learning optimal DTRs. Existing methods for learning optimal DTRs, such as Q-learning and O-learning, rely on a sequential procedure to estimate the optimal decision at each stage backward. Incorporating feature selection in these methods through regularization at each stage of estimation only identifies unimportant tailoring variables at each stage but is not necessary for those variables that are not important across all the stages. As a result, false discovery errors are likely to accumulate over stages in these sequential methods. To overcome this limitation, we propose a framework, namely L1 multistage ramp loss (L1-MRL) learning, to learn the optimal decision rules and, at the same time, perform variable selection across all the stages simultaneously. This framework uses a single multistage ramp loss to estimate optimal DTRs for all stages. Furthermore, a group Lasso-type penalty is imposed to penalize the coefficients in the decision rules across all stages, which enables the identification of features that are important for at least one stage decision. Theoretically, we show that the estimator is consistent and enjoys the oracle property toward the optimal. We demonstrate that the proposed method performs equally well as or better than many existing DTR methods with variable selection capability via extensive simulation studies and an application to electronic health record (EHR) data for type 2 diabetes (T2D) patients.
Estimand: No additional assumptions are stated and ICE handling is not addressed. The measure is the expected cumulative outcome.
Estimator Description: Feature selection can make DTRs more parsimonious and reliable. However, selecting different features at each stage can inflate false discovery rates. The authors propose L1 multistage ramp loss (L1-MRL) learning to learn the optimal decision rules and, at the same time, perform variable selection across all stages simultaneously.
Authors: Ximeng Liu, Robert H. Deng, Kim-Kwang Raymond Choo, Yang Yang
Year: 2021
In this paper, we propose a privacy-preserving reinforcement learning framework for a patient-centric dynamic treatment regime, which we refer to as Preyer. Using Preyer, a patient-centric treatment strategy can be made spontaneously while preserving the privacy of the patient's current health state and the treatment decision. Specifically, we first design a new storage and computation method to support noninteger processing for multiple encrypted domains. A new secure plaintext length control protocol is also proposed to avoid plaintext overflow after executing secure computation repeatedly. Moreover, we design a new privacy-preserving reinforcement learning framework with experience replay to build the model for secure dynamic treatment policymaking. Furthermore, we prove that Preyer facilitates patient dynamic treatment policymaking without leaking sensitive information to unauthorized parties. We also demonstrate the utility and efficiency of Preyer using simulations and analysis.
Estimand: The method relies on the standard Reinforcement Learning assumption that the environment is a Markov Decision Process (the next state and reward depend only on the current state and action). The authors do not apply strategies for ICE. The paper focuses on the computational and privacy aspects of DTRs rather than clinical trial estimands. The measure used is the Q-value (expected cumulative discounted reward) for state-action pairs.
Estimator Description: The authors propose a privacy-preserving reinforcement learning framework for DTR estimation called Preyer. It can provide new patient-centered treatment policies spontaneously while preserving the privacy of the current patient’s health state.
Authors: Y Liu, Y Wang, MR Kosorok, Y Zhao, D Zeng
Year: 2018
Dynamic treatment regimens (DTRs) are sequential treatment decisions tailored by patient's evolving features and intermediate outcomes at each treatment stage. Patient heterogeneity and the complexity and chronicity of many diseases call for learning optimal DTRs that can best tailor treatment according to each individual's time-varying characteristics (eg, intermediate response over time). In this paper, we propose a robust and efficient approach referred to as Augmented Outcome-weighted Learning (AOL) to identify optimal DTRs from sequential multiple assignment randomized trials. We improve previously proposed outcome-weighted learning to allow for negative weights. Furthermore, to reduce the variability of weights for numeric stability and improve estimation accuracy, in AOL, we propose a robust augmentation to the weights by making use of predicted pseudooutcomes from regression models for Q-functions. We show that AOL still yields Fisher-consistent DTRs even if the regression models are misspecified and that an appropriate choice of the augmentation guarantees smaller stochastic errors in value function estimation for AOL than the previous outcome-weighted learning. Finally, we establish the convergence rates for AOL. The comparative advantage of AOL over existing methods is demonstrated through extensive simulation studies and an application to a sequential multiple assignment randomized trial for major depressive disorder.
Estimand: They incorporate treatment response as a tailoring variable, thereby employing the DTR strategy. One of the method's benefits is that it can provide insights into suitable tailoring variables. The estimator measures the Expected Cumulative Reward (Value) of the regime.
Estimator Description: The authors propose Augmented Outcome-weighted Learning (AOL). The novelty lies in creating a ‘hybrid approach’ that integrates Outcome-Weighted Learning (OWL) with regression models for Q-functions to estimate optimal DTRs; it is therefore not a pure classification method, as standard OWL models are. A simplified illustration of the weighting device follows below.
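Code sketch: A simplified single-stage illustration of the weighting device, assuming a randomized trial and a logistic classifier in place of the authors' hinge-loss learner: residualizing the outcome against a working mean model yields possibly negative weights, which are handled by flipping the treatment label and using the absolute weight.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(4)
    n = 1000
    X = rng.normal(size=(n, 2))
    A = rng.integers(0, 2, n) * 2 - 1   # treatment coded in {-1, +1}
    Y = 1 + X[:, 0] + A * X[:, 1] + rng.normal(size=n)
    pi = 0.5                            # randomization probability

    # Working mean model m(X) used to residualize the outcome.
    Xd = np.column_stack([np.ones(n), X])
    m = Xd @ np.linalg.lstsq(Xd, Y, rcond=None)[0]

    w = (Y - m) / pi                    # augmented, possibly negative weights
    labels = np.where(w >= 0, A, -A)    # flip the label when the weight is negative
    clf = LogisticRegression().fit(X, labels, sample_weight=np.abs(w))
    rule = clf.predict(X)               # estimated optimal treatment rule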
Authors: X Li, BR Logan, SMF Hossain, EEM Moodie
Year: 2024
To achieve the goal of providing the best possible care to each individual under their care, physicians need to customize treatments for individuals with the same health state, especially when treating diseases that can progress further and require additional treatments, such as cancer. Making decisions at multiple stages as a disease progresses can be formalized as a dynamic treatment regime (DTR). Most of the existing optimization approaches for estimating dynamic treatment regimes including the popular method of Q-learning were developed in a frequentist context. Recently, a general Bayesian machine learning framework that facilitates using Bayesian regression modeling to optimize DTRs has been proposed. In this article, we adapt this approach to censored outcomes using Bayesian additive regression trees (BART) for each stage under the accelerated failure time modeling framework, along with simulation studies and a real data example that compare the proposed approach with Q-learning. We also develop an R wrapper function that utilizes a standard BART survival model to optimize DTRs for censored outcomes. The wrapper function can easily be extended to accommodate any type of Bayesian machine learning model.
Estimand: The authors assume CAR to correct for informative censoring, using the hypothetical strategy: the accelerated failure time model estimates what survival would have been had censoring not occurred. The method optimizes the mean survival time (or log-survival time) directly.
Estimator Description: In a recent paper, a general Bayesian machine learning framework that facilitates using Bayesian regression modeling for DTRs was proposed. The authors adapt this framework to censored outcomes, using Bayesian additive regression trees (BART) at each stage under an accelerated failure time modeling framework.
Software: https://github.com/xiaoli-mcw/dtrBART
Authors: L Lyu, Y Cheng, AS Wahed
Year: 2023
Q-learning has been one of the most commonly used methods for optimizing dynamic treatment regimes (DTRs) in multistage decision-making. Right-censored survival outcome poses a significant challenge to Q-Learning due to its reliance on parametric models for counterfactual estimation which are subject to misspecification and sensitive to missing covariates. In this paper, we propose an imputation-based Q-learning (IQ-learning) where flexible nonparametric or semiparametric models are employed to estimate optimal treatment rules for each stage and then weighted hot-deck multiple imputation (MI) and direct-draw MI are used to predict optimal potential survival times. Missing data are handled using inverse probability weighting and MI, and the nonrandom treatment assignment among the observed is accounted for using a propensity-score approach. We investigate the performance of IQ-learning via extensive simulations and show that it is more robust to model misspecification than existing Q-Learning methods, imputes only plausible potential survival times contrary to parametric models and provides more flexibility in terms of baseline hazard shape. Using IQ-learning, we developed an optimal DTR for leukemia treatment based on a randomized trial with observational follow-up that motivated this study.
Estimand: The authors assume CAR to handle informative censoring. They use the DTR strategy to handle disease progression in their illustrative example. The method aims to yield the maximal expected overall survival time.
Estimator Description: The method replaces the standard parametric Q-function modeling of survival outcomes (often subject to misspecification and large pseudo-outcomes) with a multiple imputation approach (weighted hot-deck and direct-draw MI). This allows the use of flexible models (non/semiparametric) for the Q-function and ensures predicted survival times are plausible.
Authors: BR Miao, B Shahbaba, A Qu
Year: 2025
Offline reinforcement learning (RL) aims to find optimal policies in dynamic environments in order to maximize the expected total rewards by leveraging pre-collected data. Learning from heterogeneous data is one of the fundamental challenges in offline RL. Traditional methods focus on learning an optimal policy for all individuals with pre-collected data from a single episode or homogeneous batch episodes, and thus, may result in a suboptimal policy for a heterogeneous population. In this paper, we propose an individualized offline policy optimization framework for heterogeneous time-stationary Markov decision processes (MDPs). The proposed heterogeneous model with individual latent variables enables us to efficiently estimate the individual Q-functions, and our Penalized Pessimistic Personalized Policy Learning (P4L) algorithm guarantees a fast rate on the average regret under a weak partial coverage assumption on behavior policies. In addition, our simulation studies and a real data application demonstrate the superior numerical performance of the proposed method compared with existing methods.
Estimand: A Markov decision process is assumed. It also assumes the Q-function can be decomposed into a component shared by all and a component driven by individual latent variables. No ICE handling is specified. The measure of interest is the expected total reward or the regret (difference between the value of the optimal policy and the learned policy).
Estimator Description: The novelty lies in addressing data heterogeneity in offline reinforcement learning. Unlike standard methods that learn a single global policy (ignoring differences) or group-wise policies (requiring known clusters), this method models heterogeneity via individual latent variables within a time-stationary Markov Decision Process (MDP). It uses a pessimistic penalty (specifically a penalty on the Q-function variance/uncertainty) to handle the distributional shift common in offline RL, while simultaneously clustering individuals to borrow strength across similar subjects.
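A schematic sketch of the generic pessimism principle the description refers to (not the authors' exact P4L algorithm): each action's estimated Q-value is penalized by its estimation uncertainty before maximizing, so that poorly covered actions are not chosen on the strength of noise. q_draws is a hypothetical array of bootstrap or posterior draws of the Q-values:

```python
import numpy as np

def pessimistic_action(q_draws, lam=1.0):
    """q_draws: shape (n_draws, n_actions) for one individual's state.

    Returns the action maximizing a lower-confidence-style value estimate;
    lam controls how strongly weakly covered actions are penalized."""
    q_mean = q_draws.mean(axis=0)
    q_unc = q_draws.std(axis=0)          # uncertainty proxy for weak data coverage
    return int(np.argmax(q_mean - lam * q_unc))
```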
Authors: EEM Moodie, DA Stephens, S Alam, MJ Zhang, B Logan, M Arora, S Spellman, EF Krakow
Year: 2019
Cancers treated by transplantation are often curative, but immunosuppressive drugs are required to prevent and (if needed) to treat graft-versus-host disease. Estimation of an optimal adaptive treatment strategy when treatment at either one of two stages of treatment may lead to a cure has not yet been considered. Using a sample of 9563 patients treated for blood and bone cancers by allogeneic hematopoietic cell transplantation drawn from the Center for Blood and Marrow Transplant Research database, we provide a case study of a novel approach to Q-learning for survival data in the presence of a potentially curative treatment, and demonstrate the results differ substantially from an implementation of Q-learning that fails to account for the cure-rate.
Estimand: This estimator accounts for an uncommonly noted ICE in a survival context: the possibility that a participant never experiences the outcome of interest because their disease is cured by the treatment under investigation. To this end, the estimator assumes sufficient follow-up to distinguish cured patients from censored ones (though a mixture model handles this probabilistically). The cure-rate ICE is accounted for by what can be seen as a composite strategy, developing Q-functions for a censored outcome that allow a fraction of the population to be cured. Furthermore, an ICE referring to treatment toxicity, specifically graft-versus-host disease (GVHD), is incorporated into the DTR: the occurrence of acute GVHD triggers the second stage of treatment (a decision point). Finally, they employ a composite strategy for cancer recurrence or death (i.e., disease-free survival time) to deal with the ICE of cancer recurrence.
Estimator Description: The paper proposes a Cure-Rate Q-learning estimator. It adapts the standard Q-learning algorithm (a regression-based Reinforcement Learning method) to handle time-to-event outcomes where a fraction of the population is ‘cured’ (long-term survivors who will not experience the event of interest). The novelty addresses the challenge where standard survival Q-learning methods (e.g., using mean survival) fail because the mean may be undefined or infinite for cured patients, and standard censoring adjustments don't account for the ‘cured’ sub-population structure.
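The core structural idea is the standard mixture cure decomposition of the survival function, which the Q-functions must respect. A minimal sketch, where pi_cure and surv_uncured are hypothetical fitted components rather than the authors' exact parameterization:

```python
def population_survival(t, x, pi_cure, surv_uncured):
    """Mixture cure model: S(t | x) = pi(x) + (1 - pi(x)) * S_u(t | x).

    pi_cure(x): probability that a patient with history x is cured;
    surv_uncured(t, x): survival function among the uncured.
    The curve plateaus at pi(x), so mean survival is undefined for a
    cured fraction, which is why standard mean-survival Q-learning
    breaks down and a cure-rate formulation is needed."""
    pi = pi_cure(x)
    return pi + (1.0 - pi) * surv_uncured(t, x)
```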
Authors: Erica E. M. Moodie, Nema Dean, Yue Ru Sun
Year: 2014
Dynamic treatment regimes are fast becoming an important part of medicine, with the corresponding change in emphasis from treatment of the disease to treatment of the individual patient. Because of the limited number of trials to evaluate personally tailored treatment sequences, inferring optimal treatment regimes from observational data has increased importance. Q-learning is a popular method for estimating the optimal treatment regime, originally in randomized trials but more recently also in observational data. Previous applications of Q-learning have largely been restricted to continuous utility end-points with linear relationships. This paper is the first attempt at both extending the framework to discrete utilities and implementing the modelling of covariates from linear to more flexible modelling using the generalized additive model (GAM) framework. Simulated data results show that the GAM adapted Q-learning typically outperforms Q-learning with linear models and other frequently-used methods based on propensity scores in terms of coverage and bias/MSE. This represents a promising step toward a more fully general Q-learning approach to estimating optimal dynamic treatment regimes.
Estimand: The proposed estimator makes no further assumptions. The authors employ a form of Q-learning that can handle the full history (non-Markovian). Since the authors rely on simulation studies, no specific ICE or corresponding handling strategies are mentioned beyond the DTR strategy in general.
Estimator Description: The authors propose generalized additive model (GAM)-based Q-learning. The novelty lies in extending the standard Q-learning framework (which typically assumes linear relationships and continuous outcomes) by replacing standard linear regressions with GAMs to model the Q-functions. This allows for flexible, non-linear relationships between patient covariates and the utility (outcome) without the strict parametric assumptions of linear models. They also extend the method to handle binary/discrete outcomes (utilities) using logit links within the Q-learning steps, which had been less explored than continuous outcomes.
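As a rough illustration of the approach, here is a two-stage sketch using the pygam package, fitting one smooth model per treatment arm on synthetic placeholder data; the paper's actual GAM specification, smoothing choices, and data structure differ:

```python
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(1)
n = 300
H1 = rng.normal(size=(n, 2)); A1 = rng.integers(0, 2, n)  # stage-1 history, treatment
H2 = rng.normal(size=(n, 2)); A2 = rng.integers(0, 2, n)  # stage-2 history, treatment
y = rng.normal(size=n)                                    # utility (placeholder data)

def fit_arm_gams(H, A, out):
    """Fit a smooth Q-model separately within each treatment arm
    (two spline terms for the two-column history used here)."""
    return {a: LinearGAM(s(0) + s(1)).fit(H[A == a], out[A == a]) for a in (0, 1)}

def best_predicted(gams, H):
    """Pseudo-outcome: the larger of the two arm-specific predictions."""
    return np.maximum(gams[0].predict(H), gams[1].predict(H))

q2 = fit_arm_gams(H2, A2, y)                        # stage 2: flexible Q-functions
q1 = fit_arm_gams(H1, A1, best_predicted(q2, H2))   # stage 1 on pseudo-outcomes
```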
Authors: TA Murray, Y Yuan, PF Thall
Year: 2018
Medical therapy often consists of multiple stages, with a treatment chosen by the physician at each stage based on the patient's history of treatments and clinical outcomes. These decisions can be formalized as a dynamic treatment regime. This paper describes a new approach for optimizing dynamic treatment regimes that bridges the gap between Bayesian inference and existing approaches, like Q-learning. The proposed approach fits a series of Bayesian regression models, one for each stage, in reverse sequential order. Each model uses as a response variable the remaining payoff assuming optimal actions are taken at subsequent stages, and as covariates the current history and relevant actions at that stage. The key difficulty is that the optimal decision rules at subsequent stages are unknown, and even if these decision rules were known the relevant response variables may be counterfactual. However, posterior distributions can be derived from the previously fitted regression models for the optimal decision rules and the counterfactual response variables under a particular set of rules. The proposed approach averages over these posterior distributions when fitting each regression model. An efficient sampling algorithm for estimation is presented, along with simulation studies that compare the proposed approach with Q-learning.
Estimand: No real world example is given, only simulation studies that do not refer to specific ICE handling. The estimator measures the Mean Payoff (Value) under the optimal regime.
Estimator Description: The authors propose Bayesian Machine Learning (BML), an approximate dynamic programming approach. The novelty lies in bridging the gap between Bayesian inference and Q-learning by fitting a series of Bayesian regression models in reverse sequential order (backward induction).
Authors: Inbal Nahum-Shani, Min Qian, Daniel Almirall, William E. Pelham, Beth Gnagy, Gregory A. Fabiano, James G. Waxmonsky, Jihnhee Yu, Susan A. Murphy
Year: 2012
Increasing interest in individualizing and adapting intervention services over time has led to the development of adaptive interventions. Adaptive interventions operationalize the individualization of a sequence of intervention options over time via the use of decision rules that input participant information and output intervention recommendations. We introduce Q-learning, which is a generalization of regression analysis to settings in which a sequence of decisions regarding intervention options or services is made. The use of Q is to indicate that this method is used to assess the relative quality of the intervention options. In particular, we use Q-learning with linear regression to estimate the optimal (i.e., most effective) sequence of decision rules. We illustrate how Q-learning can be used with data from sequential multiple assignment randomized trials (SMARTs; Murphy, 2005) to inform the construction of a more deeply tailored sequence of decision rules than those embedded in the SMART design. We also discuss the advantages of Q-learning compared to other data analysis approaches. Finally, we use the Adaptive Interventions for Children With ADHD SMART study (Center for Children and Families, University at Buffalo, State University of New York, William E. Pelham as principal investigator) for illustration.
Estimand: The authors incorporate non-adherence and treatment failure as decision criteria for treatment switching, thus employing the DTR strategy for these ICE. Because they are exploring additional tailoring variables beyond what was strictly assigned in the protocol, they are estimating non-embedded DTRs (a hypothetical strategy). The estimator specifies the model using linear regression equations, thus making parametric assumptions.
Estimator Description: The authors use Q-learning in multi-stage decision making with linear regression on data arising from a SMART.
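Since this entry describes the canonical regression formulation, a bare-bones sketch of two-stage Q-learning with linear models and treatment interactions may help; the data below are synthetic placeholders, not the authors' SMART analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
H1 = rng.normal(size=(n, 3)); A1 = rng.integers(0, 2, n)     # stage-1 history, treatment
H2 = np.column_stack([H1, A1]); A2 = rng.integers(0, 2, n)   # stage-2 history, treatment
Y = rng.normal(size=n)                                       # end-of-study outcome

def ols_fit(X, y):
    """Least-squares coefficients with an intercept prepended."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def ols_predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

# Stage 2: regress Y on history, treatment, and their interaction.
b2 = ols_fit(np.column_stack([H2, A2, H2 * A2[:, None]]), Y)

def pred_stage2(a):
    """Predicted Y if everyone received stage-2 treatment a."""
    A = np.full(len(H2), a)
    return ols_predict(b2, np.column_stack([H2, A, H2 * a]))

# Pseudo-outcome: predicted Y under the better stage-2 option.
Y_tilde = np.maximum(pred_stage2(0), pred_stage2(1))

# Stage 1: same regression form with the pseudo-outcome as response.
b1 = ols_fit(np.column_stack([H1, A1, H1 * A1[:, None]]), Y_tilde)
```

The interaction terms are what let the fitted rule (the sign of the estimated treatment contrast) depend on patient characteristics.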
Authors: Romain Neugebauer, Malini Chandra, Antonio Paredes, David J. Graham, Carolyn McCloskey, Alan S. Go
Year: 2013
Purpose: Observational studies designed to investigate the safety of a drug in a postmarketing setting typically aim to examine rare and non-acute adverse effects in a population that is not restricted to particular patient subgroups for which the therapy, typically a drug, was originally approved. Large healthcare databases and, in particular, rich electronic medical record (EMR) databases, are well suited for the conduct of these safety studies since they can provide detailed longitudinal information on drug exposure, confounders, and outcomes for large and representative samples of patients that are considered for treatment in clinical settings. Analytic efforts for drawing valid causal inferences in such studies are faced with three challenges: (1) the formal definition of relevant effect measures addressing the safety question of interest; (2) the development of analytic protocols to estimate such effects based on causal methodologies that can properly address the problems of time-dependent confounding and selection bias due to informative censoring, and (3) the practical implementation of such protocols in a large clinical/medical database setting. In this article, we describe an effort to specifically address these challenges with marginal structural modeling based on inverse probability weighting with data reduction and super learning. Methods: We describe the principles of, motivation for, and implementation of an analytical protocol applied in a safety study investigating possible effects of exposure to oral bisphosphonate therapy on the risk of non-elective hospitalization for atrial fibrillation or atrial flutter among older women based on EMR data from the Kaiser Permanente Northern California integrated health care delivery system. Adhering to guidelines brought forward by Hernan (Epidemiology 2011; 22: 290-1), we start by framing the safety research question as one that could be directly addressed by a sequence of ideal randomized experiments before describing the estimation approach that we implemented to emulate inference from such trials using observational data. Results: This report underlines the important computation burden involved in the application of the current R implementation of super learning with large data sets. While computing time and memory requirements did not permit aggressive estimator selection with super learning, this analysis demonstrates the applicability of simplified versions of super learning based on select sets of candidate learners to avoid complete reliance on arbitrary selection of parametric models for confounding and selection bias adjustment. Results do not raise concern over the safety of one-year exposure to BP but may suggest residual bias possibly due to unmeasured confounders or insufficient parametric adjustment for observed confounders with the candidate learners selected. Conclusions: Adjustment for time-dependent confounding and selection bias based on the ad hoc inverse probability weighting approach described in this report may provide a feasible alternative to extended Cox modeling or the point treatment analytic approaches (e.g. based on propensity score matching) that are often adopted in safety research with large data sets. Alternate algorithms are needed to permit the routine and more aggressive application of super learning with large data sets.
Estimand: The authors use IPCW, which assumes coarsening-at-random. They further estimate the treatment effect that would have been observed had none of the participants deviated from the protocol; the estimand is explicitly stated as an adherence-corrected per-protocol contrast. Thus they use the hypothetical strategy to correct for non-adherence. DTRs are compared using a causal hazard ratio. Cumulative incidence is also reported as an outcome metric.
Estimator Description: The authors propose using a Marginal Structural Model (MSM) weighted by Inverse Probability of Treatment Weights (IPTW), where the novelty lies in the use of Super Learning (an ensemble machine learning method) to estimate the propensity scores (treatment and censoring mechanisms).
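To make the weighting concrete, here is a minimal sketch of stabilized inverse probability of treatment weights over time points, with scikit-learn logistic regressions standing in for the Super Learner ensemble the authors actually use (the array layout is hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stabilized_weights(A, baseline, history):
    """A: (n, T) binary treatment matrix; baseline: (n, p) baseline covariates;
    history[t]: (n, q_t) past treatments and time-varying covariates at time t.

    Returns w_i = prod_t P(A_t | baseline) / P(A_t | history), the classic
    stabilized IPTW used to fit a marginal structural model."""
    n, T = A.shape
    w = np.ones(n)
    for t in range(T):
        num = LogisticRegression().fit(baseline, A[:, t]).predict_proba(baseline)
        den = LogisticRegression().fit(history[t], A[:, t]).predict_proba(history[t])
        p_num = np.where(A[:, t] == 1, num[:, 1], num[:, 0])  # prob. of treatment received
        p_den = np.where(A[:, t] == 1, den[:, 1], den[:, 0])
        w *= p_num / p_den
    return w
```

An analogous product of censoring models would give the IPCW factor; multiplying the two yields the weights for the outcome (MSM) regression.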
Authors: X. K. Nie, E. Brunskill, S. Wager
Year: 2020
Many applied decision-making problems have a dynamic component: The policymaker needs not only to choose whom to treat, but also when to start which treatment. For example, a medical doctor may choose between postponing treatment (watchful waiting) and prescribing one of several available treatments during the many visits from a patient. We develop an "advantage doubly robust" estimator for learning such dynamic treatment rules using observational data under the assumption of sequential ignorability. We prove welfare regret bounds that generalize results for doubly robust learning in the single-step setting, and show promising empirical performance in several different contexts. Our approach is practical for policy optimization, and does not need any structural (e.g., Markovian) assumptions.
Estimand: The authors explicitly state that their approach does not require any structural (e.g., Markovian) assumptions, aside from sequential ignorability given a sufficient adjustment set. They use a simulated dataset to test the performance of the proposed method, with no explicit reference to ICE handling, though ICE can be incorporated as part of the patient’s state (covariates). The primary measure is Regret, defined as the difference between the optimal value and the value of the chosen policy, which can be viewed as a type of risk difference. They also estimate the Value Function using an advantage-based formulation.
Estimator Description: The paper proposes an Advantage Doubly Robust (ADR) estimator for learning optimal when-to-treat policies. The novelty lies in generalizing doubly robust techniques (specifically the so-called ‘advantage decomposition’) from single-step settings to the dynamic ‘when-to-treat’ problem without requiring Markovian assumptions (which are standard in Reinforcement Learning). It allows for global policy optimization using a ‘universal score’ that works for all policies in a class, rather than needing to re-estimate nuisance parameters for every policy.
Software: The authors only provide pseudocode for their policy optimization approach with terminal states.
Authors: Chao Yu, Yinzhao Dong, Jiming Liu, Guoqi Ren
Year: 2019
BACKGROUND: Reinforcement learning (RL) provides a promising technique to solve complex sequential decision making problems in health care domains. However, existing studies simply apply naive RL algorithms in discovering optimal treatment strategies for a targeted problem. This kind of direct application ignores the abundant causal relationships between treatment options and the associated outcomes that are inherent in medical domains. METHODS: This paper investigates how to integrate causal factors into an RL process in order to facilitate the final learning performance and increase explanations of learned strategies. A causal policy gradient algorithm is proposed and evaluated in dynamic treatment regimes (DTRs) for HIV based on a simulated computational model. RESULTS: Simulations prove the effectiveness of the proposed algorithm for designing more efficient treatment protocols in HIV, and different definitions of the causal factors could have significant influence on the final learning performance, indicating the necessity of human prior knowledge in defining suitable causal relationships for a given problem. CONCLUSIONS: More efficient and robust DTRs for HIV can be derived through incorporation of causal factors between options of anti-HIV drugs and the associated treatment outcomes.
Estimand: The method explicitly assumes the environment is a Markov Decision Process (MDP). The authors apply the algorithm to learn a policy on a simulated HIV dataset to determine a DTR. Further ICE handling is not specified. The method uses a causal factor to weight the standard policy gradient, contrasting the probability of an event occurring conditional on receiving versus not receiving the treatment.
Estimator Description: The paper proposes a Causal Policy Gradient (CPG) algorithm. The novelty involves modifying standard policy gradient Reinforcement Learning (RL) by integrating a causal factor into the gradient update step. This causal factor explicitly models the relationship between treatment actions and outcomes, aiming to make the learning process more efficient and the resulting policies more interpretable than ‘black-box’ RL methods.
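As a schematic illustration only (the paper's exact algorithm and causal-factor definition are richer), a REINFORCE-style update can be reweighted by a causal contrast, here a hypothetical difference in outcome probability with versus without treatment:

```python
import numpy as np

def cpg_update(theta, states, actions, rewards, causal_factor, lr=0.01):
    """One causal-policy-gradient step for a logistic (two-action) policy.

    causal_factor(s, a): e.g. P(good outcome | s, do(a)) minus
    P(good outcome | s, do(not a)), estimated separately; it rescales each
    sample's contribution to the ordinary policy gradient."""
    grad = np.zeros_like(theta)
    for s, a, r in zip(states, actions, rewards):
        p1 = 1.0 / (1.0 + np.exp(-s @ theta))       # policy probability of action 1
        score = (a - p1) * s                        # d log pi(a | s) / d theta
        grad += causal_factor(s, a) * r * score     # causally weighted REINFORCE term
    return theta + lr * grad / len(rewards)
```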
Authors: A. Oganisian, K. D. Getz, T. A. Alonzo, R. Aplenc, J. A. Roy
Year: 2024
We develop a Bayesian semiparametric model for the impact of dynamic treatment rules on survival among patients diagnosed with pediatric acute myeloid leukemia (AML). The data consist of a subset of patients enrolled in a phase III clinical trial in which patients move through a sequence of four treatment courses. At each course, they undergo treatment that may or may not include anthracyclines (ACT). While ACT is known to be effective at treating AML, it is also cardiotoxic and can lead to early death for some patients. Our task is to estimate the potential survival probability under hypothetical dynamic ACT treatment strategies, but there are several impediments. First, since ACT is not randomized, its effect on survival is confounded over time. Second, subjects initiate the next course depending on when they recover from the previous course, making timing potentially informative of subsequent treatment and survival. Third, patients may die or drop out before ever completing the full treatment sequence. We develop a generative Bayesian semiparametric model based on Gamma Process priors to address these complexities. At each treatment course, the model captures subjects' transition to subsequent treatment or death in continuous time. G-computation is used to compute a posterior over potential survival probability that is adjusted for time-varying confounding. Using our approach, we estimate the efficacy of hypothetical treatment rules that dynamically modify ACT based on evolving cardiac function.
Estimand: The method assumes CAR and uses the hypothetical strategy to correct for informative censoring. Other ICE are not addressed. The estimator uses the restricted mean survival time (RMST) and the survival probability at specific times.
Estimator Description: The paper introduces a new Bayesian semiparametric model to estimate survival distributions under dynamic treatment regimes (DTRs). It specifically addresses the challenge of informative timing, where the timing of treatment decisions (the duration of previous courses) is informative of the patient's health status and potential survival. The model places Gamma Process priors on the hazard, capturing transitions to subsequent treatment or death in continuous time, and combines this with G-computation to obtain a posterior over potential survival probabilities adjusted for time-varying confounding.
Authors: Y. Tao, L. Wang
Year: 2017
Dynamic treatment regimes (DTRs) are sequential decision rules that focus simultaneously on treatment individualization and adaptation over time. To directly identify the optimal DTR in a multi-stage multi-treatment setting, we propose a dynamic statistical learning method, adaptive contrast weighted learning. We develop semiparametric regression-based contrasts with the adaptation of treatment effect ordering for each patient at each stage, and the adaptive contrasts simplify the problem of optimization with multiple treatment comparisons to a weighted classification problem that can be solved by existing machine learning techniques. The algorithm is implemented recursively using backward induction. By combining doubly robust semiparametric regression estimators with machine learning algorithms, the proposed method is robust and efficient for the identification of the optimal DTR, as shown in the simulation studies. We illustrate our method using observational data on esophageal cancer.
Estimand: The authors use the ICEs of treatment non-response and disease progression as covariates in the DTR they estimate, thus employing the DTR strategy. While they do not demonstrate the model on survival data, they state that this is possible by employing accelerated failure time models. The estimator uses the Contrast Function (the difference in expected outcomes between treatment options) to determine the optimal rule, which can be seen as a causal risk difference.
Estimator Description: The authors propose Adaptive Contrast Weighted Learning (ACWL), a method for estimating optimal dynamic treatment regimes (DTRs) specifically in settings with more than two treatment options at each stage. While previous classification-based methods (like Outcome Weighted Learning) worked well for binary treatments, extending them to multi-class settings was computationally difficult or relied on pairwise comparisons that did not scale. ACWL instead develops semiparametric regression-based contrasts with an adaptive ordering of treatment effects for each patient at each stage, reducing the multi-treatment optimization to a weighted classification problem that is solved recursively by backward induction.
Authors: G Rhodes, M Davidian, W Lu
Year: 2024
Clinicians and patients must make treatment decisions at a series of key decision points throughout disease progression. A dynamic treatment regime is a set of sequential decision rules that return treatment decisions based on accumulating patient information, like that commonly found in electronic medical record (EMR) data. When applied to a patient population, an optimal treatment regime leads to the most favorable outcome on average. Identifying optimal treatment regimes that maximize residual life is especially desirable for patients with life-threatening diseases such as sepsis, a complex medical condition that involves severe infections with organ dysfunction. We introduce the residual life value estimator (ReLiVE), an estimator for the expected value of cumulative restricted residual life under a fixed treatment regime. Building on ReLiVE, we present a method for estimating an optimal treatment regime that maximizes expected cumulative restricted residual life. Our proposed method, ReLiVE-Q, conducts estimation via the backward induction algorithm Q-learning. We illustrate the utility of ReLiVE-Q in simulation studies, and we apply ReLiVE-Q to estimate an optimal treatment regime for septic patients in the intensive care unit using EMR data from the Multiparameter Intelligent Monitoring Intensive Care database. Ultimately, we demonstrate that ReLiVE-Q leverages accumulating patient information to estimate personalized treatment regimes that optimize a clinically meaningful function of residual life.
Estimand: The authors assume censoring to be non-informative with respect to treatment assignment. No ICE handling is described. The causal effect measure is the expected cumulative restricted residual life (the expected sum of remaining survival times at each decision point, up to a restriction limit).
Estimator Description: The authors propose a residual life value estimator based on Q-learning, called ReLiVE-Q. Residual life is the remaining survival time of a patient, given that they have already survived until time t.
Software: https://github.com/gmrhodes/ReLiVE-Q
Authors: MM Rizi, JA Dubin, MP Wallace
Year: 2024
Identifying interventions that are optimally tailored to each individual is of significant interest in various fields, in particular precision medicine. Dynamic treatment regimes (DTRs) employ sequences of decision rules that utilize individual patient information to recommend treatments. However, the assumption that an individual's treatment does not impact the outcomes of others, known as the no interference assumption, is often challenged in practical settings. For example, in infectious disease studies, the vaccine status of individuals in close proximity can influence the likelihood of infection. Imposing this assumption when it, in fact, does not hold, may lead to biased results and impact the validity of the resulting DTR optimization. We extend the estimation method of dynamic weighted ordinary least squares (dWOLS), a doubly robust and easily implemented approach for estimating optimal DTRs, to incorporate the presence of interference within dyads (i.e., pairs of individuals). We formalize an appropriate outcome model and describe the estimation of an optimal decision rule in the dyadic-network context. Through comprehensive simulations and analysis of the Population Assessment of Tobacco and Health (PATH) data, we demonstrate the improved performance of the proposed joint optimization strategy compared to the current state-of-the-art conditional optimization methods in estimating the optimal treatment assignments when within-dyad interference exists.
Estimand: The authors specifically developed this method to relax the no-interference assumption. ICE handling is not addressed. The measure is the mean dyad health (the expected value of a summary function of the dyad outcomes).
Estimator Description: The novelty is the extension of dWOLS to handle interference within dyadic networks (pairs of individuals, e.g., spouses), where one individual's treatment affects the other's outcome. It proposes a joint optimization strategy using a dyad-health function rather than optimizing for individuals independently (conditional optimization).
Authors: S. Rosthøj, R. Henderson, J.K. Barrett
Year: 2014
We review methods for determination of optimal dynamic treatment strategies and consider the consequences of patients missing scheduled clinic visits. We describe a Markov chain Monte Carlo procedure for parameter estimation in the presence of incomplete data. We propose an optimal dynamic fixed-dose treatment allocation rule that accommodates the possibility of patients missing future scheduled visits. We compare our strategy with a globally optimal strategy through simulations and an application on control of blood clotting time for patients on long-term anticoagulation.
Estimand: The estimator deals specifically with the event of missed visits. If dose adjustment depends on such visits, the authors propose recommending fixed doses in case future visits do not occur. They do not assume that this is necessarily an ICE (i.e., a confounding event), but they propose imputation methods based on MCAR and MAR to impute measurements missing due to missed visits (last observation carried forward, interpolation, and predicted values). They also assume correct model specification. The causal effect measure used to compare DTRs is the difference in means of the potential outcomes.
Estimator Description: The authors propose a novel ‘Optimal Dynamic Fixed-Dose (ODFD)’ strategy. Unlike standard optimal dynamic (OD) regimes that assume patients attend all future visits for dose adjustments, the ODFD strategy is designed to be robust to missed visits. It determines the best ‘fixed’ dose (unchangeable in the future) at the current visit, anticipating that future updates might not happen.
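The first two imputation options mentioned above are standard; a minimal pandas sketch on a hypothetical series of visit measurements (e.g., blood clotting times) shows them, with predicted-value imputation done by any fitted regression model in the same spirit:

```python
import pandas as pd

# Hypothetical INR-style measurements; NaN marks missed clinic visits.
inr = pd.Series([2.1, None, None, 2.8, None, 3.0])

locf = inr.ffill()          # last observation carried forward
interp = inr.interpolate()  # linear interpolation between observed visits
```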
Authors: Soroush Saghafian
Year: 2024
A main research goal in various studies is to use an observational data set and provide a new set of counterfactual guidelines that can yield causal improvements. Dynamic Treatment Regimes (DTRs) are widely studied to formalize this process and enable researchers to find guidelines that are both personalized and dynamic. However, available methods in finding optimal DTRs often rely on assumptions that are violated in real-world applications (e.g., medical decision making or public policy), especially when (a) the existence of unobserved confounders cannot be ignored, and (b) the unobserved confounders are time varying (e.g., affected by previous actions). When such assumptions are violated, one often faces ambiguity regarding the underlying causal model that is needed to be assumed to obtain an optimal DTR. This ambiguity is inevitable because the dynamics of unobserved confounders and their causal impact on the observed part of the data cannot be understood from the observed data. Motivated by a case study of finding superior treatment regimes for patients who underwent transplantation in our partner hospital (Mayo Clinic) and faced a medical condition known as new-onset diabetes after transplantation, we extend DTRs to a new class termed Ambiguous Dynamic Treatment Regimes (ADTRs), in which the causal impact of treatment regimes is evaluated based on a “cloud” of potential causal models. We then connect ADTRs to Ambiguous Partially Observable Markov Decision Processes (APOMDPs) proposed by Saghafian (2018), and consider unobserved confounders as latent variables but with ambiguous dynamics and causal effects on observed variables. Using this connection, we develop two reinforcement learning methods termed Direct Augmented V-Learning (DAV-Learning) and Safe Augmented V-Learning (SAV-Learning), which enable using the observed data to effectively learn an optimal treatment regime. We establish theoretical results for these learning methods, including (weak) consistency and asymptotic normality. We further evaluate the performance of these learning methods both in our case study (using clinical data) and in simulation experiments (using synthetic data). We find promising results for our proposed approaches, showing that they perform well even compared with an imaginary oracle who knows both the true causal model (of the data-generating process) and the optimal regime under that model. Finally, we highlight that our approach enables a two-way personalization; obtained treatment regimes can be personalized based on both patients’ characteristics and physicians’ preferences. Supplemental Material: The data files and online appendix are available at https://doi.org/10.1287/mnsc.2022.00883.
Estimand: The authors assume an ambiguous partially observable Markov decision process. No ICE handling is discussed. The method estimates the expected value of the cumulative reward (utility) under the worst-case probability distribution within the ambiguity set.
Estimator Description: For successful estimation of the causal effects of DTRs, it is necessary to ensure unconfoundedness. Whether unconfoundedness holds is, however, ambiguous: we can never rule out that unobserved confounders exist, and we therefore have to assume a causal model underlying the reasoning that led us to choose a particular adjustment set of variables. The authors therefore propose ambiguous DTRs (ADTRs), in which the causal impact of a DTR is estimated over a range of potential underlying causal mechanisms (thus acknowledging the ambiguity of structural assumptions). They connect ADTRs to Ambiguous Partially Observable Markov Decision Processes (APOMDPs) and treat unobserved confounders as latent variables with ambiguous dynamics and causal effects on observed variables. Through this, they develop two reinforcement learning methods named Direct Augmented V-Learning (DAV-Learning) and Safe Augmented V-Learning (SAV-Learning).
Authors: N Sani, JJR Lee, I Shpitser
Year: 2020
Causal inference quantifies cause effect relationships by means of counterfactual responses had some variable been artificially set to a constant. A more refined notion of manipulation, where a variable is artificially set to a fixed function of its natural value is also of interest in particular domains. Examples include increases in financial aid, changes in drug dosing, and modifying length of stay in a hospital. We define counterfactual responses to manipulations of this type, which we call shift interventions. We show that in the presence of multiple variables being manipulated, two types of shift interventions are possible. Shift interventions on the treated (SITs) are defined with respect to natural values, and are connected to effects of treatment on the treated. Shift interventions as policies (SIPs) are defined recursively with respect to values of responses to prior shift interventions, and are connected to dynamic treatment regimes. We give sound and complete identification algorithms for both types of shift interventions, and derive efficient semi-parametric estimators for the mean response to a shift intervention in a special case motivated by a healthcare problem. Finally, we demonstrate the utility of our method by using an electronic health record dataset to estimate the effect of extending the length of stay in the intensive care unit (ICU) in a hospital by an extra day on patient ICU readmission probability.
Estimand: They do not make additional assumptions besides identifiability. ICE handling is not addressed in the text or given example. The primary measure is the Expected Counterfactual Outcome (Mean) under the shift intervention.
Estimator Description: Rather than estimating the effect of setting treatment to a purely counterfactual constant, the authors propose the notion of shift interventions, which are functions of the naturally observed exposure value (e.g., a change in drug dosing, or a shift in the length of hospital stay). They define shift interventions on the treated (SITs), which are defined with respect to natural treatment values and are connected to effects of treatment on the treated, and shift interventions as policies (SIPs), which are defined recursively with respect to responses to prior shift interventions and are connected to DTRs. They provide sound and complete identification algorithms for both and showcase a semiparametric estimator for the mean response to a shift intervention in a healthcare example. This implies applicability to ordinal or continuous exposures for which ‘shifts’ are definable.
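The paper derives a semiparametric (influence-function-based) estimator; purely as intuition, a naive plug-in version of the mean response to a single-stage shift of a continuous exposure by delta looks like the sketch below (all arrays hypothetical):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def shift_intervention_mean(X, A, Y, delta):
    """Plug-in estimate of E[Y(A + delta)]: fit an outcome regression on
    (covariates, exposure) and average its predictions with each subject's
    naturally observed exposure shifted by delta."""
    model = GradientBoostingRegressor(random_state=0).fit(np.column_stack([X, A]), Y)
    return model.predict(np.column_stack([X, A + delta])).mean()
```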
Authors: Juliana Schulz, Erica E. M. Moodie
Year: 2021
The goal of precision medicine is to tailor treatment strategies on an individual patient level. Although several estimation techniques have been developed for determining optimal treatment rules, the majority of methods focus on the case of a dichotomous treatment, an example being the dynamic weighted ordinary least squares regression approach of Wallace and Moodie. We propose an extension to the aforementioned framework to allow for a continuous treatment with the ultimate goal of estimating optimal dosing strategies. The proposed method is shown to be doubly robust against model misspecification whenever the implemented weights satisfy a particular balancing condition. A broad class of weight functions can be derived from the balancing condition, providing a flexible regression based estimation method in the context of adaptive treatment strategies for continuous valued treatments.
Estimand: Only identifiability is assumed. Handling of ICE is not specified. The measure uses the Blip Function (expected difference in potential outcomes between a treatment level and a reference level given covariates).
Estimator Description: The authors extend dynamic weighted ordinary least squares for finding optimal dosing strategies (i.e., a continuous exposure variable).
Authors: C Shi, A Fan, R Song, W Lu
Year: 2018
Precision medicine is a medical paradigm that focuses on finding the most effective treatment decision based on individual patient information. For many complex diseases, such as cancer, treatment decisions need to be tailored over time according to patients' responses to previous treatments. Such an adaptive strategy is referred to as a dynamic treatment regime. A major challenge in deriving an optimal dynamic treatment regime arises when an extraordinarily large number of prognostic factors, such as patient's genetic information, demographic characteristics, medical history and clinical measurements over time are available, but not all of them are necessary for making treatment decisions. This makes variable selection an emerging need in precision medicine. In this paper, we propose a penalized multi-stage A-learning for deriving the optimal dynamic treatment regime when the number of covariates is of the non-polynomial (NP) order of the sample size. To preserve the double robustness property of the A-learning method, we adopt the Dantzig selector which directly penalizes the A-learning estimating equations. Oracle inequalities of the proposed estimators for the parameters in the optimal dynamic treatment regime and error bounds on the difference between the value functions of the estimated optimal dynamic treatment regime and the true optimal dynamic treatment regime are established. Empirical performance of the proposed approach is evaluated by simulations and illustrated with an application to data from the STAR*D study.
Estimand: The estimator assumes sparsity, i.e., that the true optimal decision rule depends only on a small subset of the available covariates. Treatment (non-)response is incorporated into the DTR. The estimator measures the Expected Potential Outcome (Value) under the optimal regime.
Estimator Description: The authors propose a penalized multi-stage A-learning method for estimating optimal DTRs. The novelty lies in extending advantage learning (A-learning) to the high-dimensional setting: the Dantzig selector directly penalizes the A-learning estimating equations, preserving the double robustness property while identifying the small subset of variables actually useful for decision-making. The estimating equations can be solved simultaneously for all stages or sequentially.
Authors: J Shi, W Dempsey
Year: 2025
Advances in wearable technologies and health interventions delivered by smartphones have greatly increased the accessibility of mobile health (mHealth) interventions. Micro-randomized trials (MRTs) are designed to assess the effectiveness of the mHealth intervention and introduce a novel class of causal estimands called "causal excursion effects." These estimands enable the evaluation of how intervention effects change over time and are influenced by individual characteristics or context. Existing methods for analyzing causal excursion effects assume known randomization probabilities, complete observations, and a linear nuisance function with prespecified features of the high-dimensional observed history. However, in complex mobile systems, these assumptions often fall short: randomization probabilities can be uncertain, observations may be incomplete, and the granularity of mHealth data makes linear modeling difficult. To address this issue, we propose a flexible and doubly robust inferential procedure, called "DR-WCLS," for estimating causal excursion effects from a meta-learner perspective. We present the bidirectional asymptotic properties of the proposed estimators and compare them with existing methods both theoretically and through extensive simulations. The results show a consistent and more efficient estimate, even with missing observations or uncertain treatment randomization probabilities. Finally, the practical utility of the proposed methods is demonstrated by analyzing data from a multi-institution cohort of first-year medical residents in the United States.
Estimand: The authors explicitly state that they target a ‘causal excursion effect’ estimand, often used in micro-randomised trials (MRTs) of mHealth interventions. These estimands quantify how treatment effects vary over time as a function of pre-defined moderating variables (often defined as the difference in expected proximal outcome between treatment and no treatment, conditional on the moderating variables). They thus differ from conditional average treatment effects (CATEs), which condition on the entire covariate history; instead, a subset of variables is investigated for possible moderation of the causal effect under investigation (e.g., the effect of certain push notifications on mood). While causal excursion effects require complete data, known randomisation probabilities and correct model specification, these assumptions can be challenged in an mHealth context due to software errors, missed observations and a high-dimensional feature space (e.g., due to the use of sensory data from wearable devices). The authors use multiple imputation to handle missing repeated-measures data, which is akin to the hypothetical strategy. Identifiability conditions are necessary for MRTs; however, the doubly robust nature requires only that either the treatment or the outcome model is correctly specified (meaning that a misspecified treatment model due to uncertain allocation probabilities will not induce bias so long as the outcome model is correctly specified).
Estimator Description: The authors use the double machine learning framework to propose a new estimator for causal excursion effects (explained under the estimand above). The proposed estimator builds on the weighted centered least squares (WCLS) criterion: they propose doubly robust WCLS (DR-WCLS) to estimate causal excursion effects from a meta-learner perspective, which allows common supervised learning algorithms to be plugged in. They show that this estimator remains consistent even when the usual assumptions are violated, i.e., in the face of missing data and uncertain treatment assignment probabilities.
Authors: Chamani Shiranthika, Kuo-Wei Chen, Chung-Yih Wang, Chan-Yun Yang, B. H. Sudantha, Wei-Fu Li
Year: 2022
In recent years, reinforcement learning (RL) has achieved remarkable success and has attracted researchers' attention in modeling real-life scenarios by expanding its research beyond conventional complex games. Prediction of optimal treatment regimens from observational real clinical data is being popularized, and more advanced versions of RL algorithms are being implemented in the literature. However, RL-generated medications still need careful supervision by expert parties or doctors in healthcare. Hence, in this paper, a Supervised Optimal Chemotherapy Regimen (SOCR) approach to investigate the optimal chemotherapy-dosing schedule for cancer patients is presented, using offline reinforcement learning. The optimal policy suggested by the RL approach is supervised by incorporating previous treatment decisions of oncologists, which can add clinical expertise to algorithmic results. The presented SOCR approach follows a model-based architecture using the conservative Q-learning (CQL) algorithm. The developed model was tested using a manually constructed database of forty Stage-IV colon cancer patients receiving line-1 chemotherapy treatments, who were clinically classified as 'Bevacizumab based' or 'Cetuximab based' patients. Experimental results revealed that supervision from the oncologists helped stabilize the chemotherapy regimen, suggesting that the proposed framework could be successfully used as a supportive model for oncologists in making their treatment decisions.
Estimand: The method implicitly assumes the Markov property as it uses Reinforcement Learning (MDP framework), where the next state (tumor size) depends only on the current state and action. The DTR triggers a treatment switch at the ICE of disease progression (tumor growth) or toxicity. The effect is maximized by maximizing the expected cumulative reward (reduction in tumor size).
Estimator Description: The paper introduces a new method called SOCR (Supervised Optimal Chemotherapy Regimen) for estimating and optimizing chemotherapy dosing schedules from observational clinical data. The novelty lies in the hybrid architecture, which combines model-based offline reinforcement learning (conservative Q-learning, CQL) with a supervised learning component (k-nearest neighbors). The supervised module acts as a ‘safety check’ by incorporating oncologists' previous decisions to supervise the RL agent's policy, keeping suggested doses within clinically safe and realistic bounds.
Authors: G Simoneau, EEM Moodie, RW Platt, B Chakraborty
Year: 2018
A dynamic treatment regime (DTR) is a set of decision rules to be applied across multiple stages of treatments. The decisions are tailored to individuals, by inputting an individual's observed characteristics and outputting a treatment decision at each stage for that individual. Dynamic weighted ordinary least squares (dWOLS) is a theoretically robust and easily implementable method for estimating an optimal DTR. As many related DTR methods, the dWOLS treatment effects estimators can be non-regular when true treatment effects are zero or very small, which results in invalid Wald-type or standard bootstrap confidence intervals. Inspired by an analysis of the effect of diet in infancy on measures of weight and body size in later childhood-a setting where the exposure is distant in time and whose effect is likely to be small-we investigate the use of the $m$-out-of-$n$ bootstrap with dWOLS as method of analysis for valid inferences of optimal DTR. We provide an extensive simulation study to compare the performance of different choices of resample size $m$ in situations where the treatment effects are likely to be non-regular. We illustrate the methodology using data from the PROmotion of Breastfeeding Intervention Trial to study the effect of solid food intake in infancy on long-term health outcomes.
Estimand: No assumptions are made besides the identifiability assumptions. In a separate analysis, the authors corrected for informative censoring using IPW, which assumes coarsening-at-random. Other ICE are not specified: a DTR is suggested that recommends the time point at which to introduce solid food into an infant's diet, depending on the infant's evolving health (e.g., weight). The primary measure is the Blip Parameter, which represents the difference in expected counterfactual outcomes between the optimal treatment and a reference treatment, conditional on history.
Estimator Description: The paper proposes using the Dynamic Weighted Ordinary Least Squares (dWOLS) estimator combined with the m-out-of-n bootstrap. The novelty lies in addressing the non-regularity of dWOLS estimators (where asymptotic distributions are not uniformly normal when treatment effects are small or zero) by applying the m-out-of-n bootstrap to generate valid confidence intervals, rather than the standard bootstrap which fails in these settings.
Software: https://github.com/gabriellesimoneau/Rcode-Biostatistics
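A minimal sketch of the m-out-of-n bootstrap with the standard rescaled confidence interval, assuming Python with numpy and a generic estimator function; the resample size m is the tuning parameter the simulation study investigates:

```python
import numpy as np

def m_out_of_n(data, estimator, m, B=1000, seed=0):
    """Draw B resamples of size m (< n) with replacement; return the B estimates."""
    rng = np.random.default_rng(seed)
    n = len(data)
    return np.array([estimator(data[rng.integers(0, n, size=m)]) for _ in range(B)])

def mn_confidence_interval(theta_hat, boot, m, n, alpha=0.05):
    """sqrt(m)(theta*_m - theta_hat) approximates sqrt(n)(theta_hat - theta),
    so the bootstrap quantiles are rescaled by sqrt(m / n)."""
    lo, hi = np.quantile(boot - theta_hat, [alpha / 2, 1 - alpha / 2])
    scale = np.sqrt(m / n)
    return theta_hat - hi * scale, theta_hat - lo * scale
```

Undersampling (m < n) is what restores valid intervals under non-regularity, where the standard n-out-of-n bootstrap fails.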
Authors: Aaron Sonabend-W, Nilanjana Laha, Ashwin N. Ananthakrishnan, Tianxi Cai, Rajarshi Mukherjee
Year: 2023
Reinforcement learning (RL) has shown great promise in estimating dynamic treatment regimes which take into account patient heterogeneity. However, health-outcome information, used as the reward for RL methods, is often not well coded but rather embedded in clinical notes. Extracting precise outcome information is a resource-intensive task, so most of the available well-annotated cohorts are small. To address this issue, we propose a semi supervised learning (SSL) approach that efficiently leverages a small-sized labeled data set with actual outcomes observed and a large unlabeled data set with outcome surrogates. In particular, we propose a semi-supervised, efficient approach to Q-learning and doubly robust off-policy value estimation. Generalizing SSL to dynamic treatment regimes brings interesting challenges: 1) Feature distribution for Q-learning is unknown as it includes previous outcomes. 2) The surrogate variables we leverage in the modified SSL framework are predictive of the outcome but not informative of the optimal policy or value function. We provide theoretical results for our Q function and value function estimators to understand the degree of efficiency gained from SSL. Our method is at least as efficient as the supervised approach, and robust to bias from misspecification of the imputation models.
Estimand: No assumptions are made besides identifiability. Besides treatment switching, no ICE are addressed. The causal effect measure is captured by the value function (the expected counterfactual outcome) of the dynamic treatment regime.
Estimator Description: The method specifically addresses the challenge where the primary outcome is expensive to label (e.g., requires manual chart review) but surrogates are abundant in unlabeled data. It uses a two-step fitting procedure: first imputing missing outcomes in the unlabeled set using flexible non-parametric models based on surrogates (semi-supervised learning), and then using these imputations to estimate Q-functions. It uniquely handles the issue that feature distributions in RL are unknown (outcomes of stage t become features for stage t+1).
Software: github.com/asonabend/SSOPRL
Authors: R Song, W Wang, D Zeng, MR Kosorok
Year: 2015
A dynamic treatment regimen incorporates both accrued information and long-term effects of treatment from specially designed clinical trials. As these trials become more and more popular in conjunction with longitudinal data from clinical studies, the development of statistical inference for optimal dynamic treatment regimens is a high priority. In this paper, we propose a new machine learning framework called penalized Q-learning, under which valid statistical inference is established. We also propose a new statistical procedure: individual selection and corresponding methods for incorporating individual selection within penalized Q-learning. Extensive numerical studies are presented which compare the proposed methods with existing methods, under a variety of scenarios, and demonstrate that the proposed approach is both inferentially and computationally superior. It is illustrated with a depression clinical trial study.
Estimand: It is important to note that automated variable selection does not ensure identification of the correct adjustment set required for conditional exchangeability. This paper therefore targets trial data, where exchangeability is more plausible (at least at baseline). The authors suggest extending the method to observational data in future work by employing propensity scores. The method requires correct specification of the Q-functions.
Estimator Description: The authors propose Penalized Q-Learning, a method that integrates variable selection directly into the Q-learning framework using penalty functions (like SCAD or LASSO). The key novelty is the ability to simultaneously perform variable selection and estimation of the optimal dynamic treatment regime (DTR) while handling the non-smoothness of the Q-learning objective function. They prove that their estimator possesses the ‘oracle property’ (it works as well as if the true relevant variables were known beforehand) and provide valid asymptotic inference for the parameters, which was previously difficult for standard Q-learning due to non-regularity.
Authors: Y Song, L Wang
Year: 2024
A dynamic treatment regime (DTR) is a sequence of treatment decision rules that dictate individualized treatments based on evolving treatment and covariate history. It provides a vehicle for optimizing a clinical decision support system and fits well into the broader paradigm of personalized medicine. However, many real-world problems involve multiple competing priorities, and decision rules differ when trade-offs are present. Correspondingly, there may be more than one feasible decision that leads to empirically sufficient optimization. In this paper, we propose a concept of "tolerant regime," which provides a set of individualized feasible decision rules under a prespecified tolerance rate. A multiobjective tree-based reinforcement learning (MOT-RL) method is developed to directly estimate the tolerant DTR (tDTR) that optimizes multiple objectives in a multistage multitreatment setting. At each stage, MOT-RL constructs an unsupervised decision tree by modeling the counterfactual mean outcome of each objective via semiparametric regression and maximizing a purity measure constructed by the scalarized augmented inverse probability weighted estimators (SAIPWE). The algorithm is implemented in a backward inductive manner through multiple decision stages, and it estimates the optimal DTR and tDTR depending on the decision-maker's preferences. Multiobjective tree-based reinforcement learning is robust, efficient, easy-to-interpret, and flexible to different settings. We apply MOT-RL to evaluate 2-stage chemotherapy regimes that reduce disease burden and prolong survival for advanced prostate cancer patients using a dataset collected at MD Anderson Cancer Center.
Estimand: No additional assumptions besides identifiability are made. The method addresses the problem of competing risks/priorities (e.g., toxicity vs. efficacy) using a Composite strategy where multiple outcomes are combined into a scalarized function for optimization. The causal effect measure is a multiobjective counterfactual mean outcome (specifically, a scalarized combination of means, or so-called purity measure).
Estimator Description: The authors propose multiobjective tree-based reinforcement learning. Its novelty involves defining ‘tolerant regimes’ (sets of decision rules that are empirically ‘good enough’ or within a tolerance threshold of the optimal) and extending tree-based RL to handle multiple competing objectives (e.g., efficacy vs. toxicity) simultaneously using a scalarized purity measure.
Software: https://github.com/Team-Wang-Lab/MOTRL
Authors: D Spicker, EEM Moodie, SM Shortreed
Year: 2024
Precision medicine is a framework for developing evidence-based medical recommendations that seeks to determine the optimal sequence of treatments tailored to all of the relevant patient-level characteristics which are observable. Because precision medicine relies on highly sensitive, patient-level data, ensuring the privacy of participants is of great importance. Dynamic treatment regimes (DTRs) provide one formalization of precision medicine in a longitudinal setting. Outcome-Weighted Learning (OWL) is a family of techniques for estimating optimal DTRs based on observational data. OWL techniques leverage support vector machine (SVM) classifiers in order to perform estimation. SVMs perform classification based on a set of influential points in the data known as support vectors. The classification rule produced by SVMs often requires direct access to the support vectors. Thus, releasing a treatment policy estimated with OWL requires the release of patient data for a subset of patients in the sample. As a result, the classification rules from SVMs constitute a severe privacy violation for those individuals whose data comprise the support vectors. This privacy violation is a major concern, particularly in light of the potentially highly sensitive medical data which are used in DTR estimation. Differential privacy has emerged as a mathematical framework for ensuring the privacy of individual-level data, with provable guarantees on the likelihood that individual characteristics can be determined by an adversary. We provide the first investigation of differential privacy in the context of DTRs and provide a differentially private OWL estimator, with theoretical results allowing us to quantify the cost of privacy in terms of the accuracy of the private estimators.
Estimand: No additional assumptions are made. ICE handling is not specified. The measure is the value function (expected outcome ‘supposing that the regime was followed’).
Estimator Description: The authors highlight privacy concerns related to the use of outcome-weighted learning (OWL). OWL is based on support vector machines (SVM). The support vectors are influential data points. Thus, estimating DTRs using SVM can reveal highly sensitive data for a subset of patients. The authors therefore propose differentially private OWL (PrOWL) which allows for provable guarantees on the likelihood that individual characteristics can be determined by an adversary.
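Code Sketch: For intuition, a minimal sketch of the (non-private) OWL weighting idea on simulated single-stage data, using a weighted logistic surrogate in place of the SVM hinge loss; all variable names are hypothetical, and this is not the authors' PrOWL implementation.

    # OWL recasts regime estimation as weighted classification: predict the observed
    # treatment A from covariates, weighting each subject by outcome / propensity.
    set.seed(1)
    n <- 1000
    x <- rnorm(n)
    A <- rbinom(n, 1, 0.5)                 # randomized treatment, propensity = 0.5
    Y <- 1 + x * (2 * A - 1) + rnorm(n)    # the optimal rule treats when x > 0
    w <- (Y - min(Y)) / 0.5                # nonnegative outcome weights / propensity
    fit  <- glm(A ~ x, family = quasibinomial, weights = w)
    dhat <- as.integer(predict(fit, type = "link") > 0)   # estimated treatment rule
    table(estimated = dhat, optimal = as.integer(x > 0))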
Authors: J Sun, B Fu, L Su
Year: 2025
Dynamic treatment regimes (DTRs) formalize medical decision-making as a sequence of rules for different stages, mapping patient-level information to recommended treatments. In practice, estimating an optimal DTR using observational data from electronic medical record (EMR) databases can be complicated by nonignorable missing covariates resulting from informative monitoring of patients. Since complete case analysis can provide consistent estimation of outcome model parameters under the assumption of outcome-independent missingness, Q-learning is a natural approach to accommodating nonignorable missing covariates. However, the backward induction algorithm used in Q-learning can introduce challenges, as nonignorable missing covariates at later stages can result in nonignorable missing pseudo-outcomes at earlier stages, leading to suboptimal DTRs, even if the longitudinal outcome variables are fully observed. To address this unique missing data problem in DTR settings, we propose 2 weighted Q-learning approaches where inverse probability weights for missingness of the pseudo-outcomes are obtained through estimating equations with valid nonresponse instrumental variables or sensitivity analysis. The asymptotic properties of the weighted Q-learning estimators are derived, and the finite-sample performance of the proposed methods is evaluated and compared with alternative methods through extensive simulation studies. Using EMR data from the Medical Information Mart for Intensive Care database, we apply the proposed methods to investigate the optimal fluid strategy for sepsis patients in intensive care units.
Estimand: The authors tackle the problem of covariates missing not at random (MNAR), motivated by their observation that informative monitoring leads to nonignorable missingness in electronic health records: patients with more severe conditions are monitored more closely, meaning that they are less likely to have missing observations. Furthermore, this means that the missingness depends on the missing data itself (condition severity is associated with missing data on condition severity), leading to nonignorable missingness (or MNAR). To resolve this, they propose using instrumental variables and IPW to weight observations according to their missingness probability, and sensitivity parameters to quantify the uncertainty of the imputed pseudo-values. This is akin to the use of the hypothetical strategy to handle the ICE of missing repeated-measures data. While the instrumental variable approach and IPW assume MAR (using observed variables to calculate missingness probabilities), the sensitivity parameters are used because this assumption may not fully hold, thus quantifying the uncertainty where the missingness is MNAR.
Estimator Description: The novelty lies in addressing the challenge of nonignorable missing data (i.e., MNAR) in the context of multi-stage Q-learning. Standard Q-learning fails because the pseudo-outcomes generated in the backward induction step become dependent on the missingness mechanism of future stages. The authors propose using IPW where the weights are estimated from a missingness model that accounts for the non-ignorable nature (using an instrumental variable assumption) to restore consistency.
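Code Sketch: A minimal two-stage illustration of the weighted Q-learning idea on simulated data with hypothetical variable names. For simplicity, the response (non-missingness) model below is fit by logistic regression under a MAR-style assumption; the paper instead obtains the weights from estimating equations with nonresponse instrumental variables (or sensitivity analysis) to handle the nonignorable case.

    set.seed(2)
    n  <- 2000
    x1 <- rnorm(n); a1 <- rbinom(n, 1, 0.5)
    x2 <- rnorm(n, 0.5 * a1); a2 <- rbinom(n, 1, 0.5)
    y  <- x1 + a1 + a2 * (1 + x2) + rnorm(n)
    r  <- rbinom(n, 1, plogis(1 + x1))          # r = 1: stage-2 data observed
    dat <- data.frame(y, x1, a1, x2, a2, r)

    # Stage 2: fit the Q-function on inverse-probability-weighted complete cases
    pr  <- fitted(glm(r ~ x1, family = binomial, data = dat))
    cc  <- dat[dat$r == 1, ]
    wcc <- 1 / pr[dat$r == 1]
    q2  <- lm(y ~ x1 + a1 + a2 * x2, data = cc, weights = wcc)

    # Pseudo-outcome: predicted outcome under the best stage-2 action
    pseudo <- pmax(predict(q2, transform(cc, a2 = 0)),
                   predict(q2, transform(cc, a2 = 1)))

    # Stage 1: weighted regression of the pseudo-outcome on stage-1 history
    q1 <- lm(pseudo ~ x1 + a1, data = cc, weights = wcc)
    coef(q1)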
Authors: Y. L. Sun, L. Wang
Year: 2020
A dynamic treatment regime (DTR) is a sequence of decision rules that adapt to the time-varying states of an individual. Black-box learning methods have shown great potential in predicting the optimal treatments; however, the resulting DTRs lack interpretability, which is of paramount importance for medical experts to understand and implement. We present a stochastic tree-based reinforcement learning (ST-RL) method for estimating optimal DTRs in a multistage multitreatment setting with data from either randomized trials or observational studies. At each stage, ST-RL constructs a decision tree by first modeling the mean of counterfactual outcomes via nonparametric regression models, and then stochastically searching for the optimal tree-structured decision rule using a Markov chain Monte Carlo algorithm. We implement the proposed method in a backward inductive fashion through multiple decision stages. The proposed ST-RL delivers optimal DTRs with better interpretability and contributes to the existing literature in its non-greedy policy search. Additionally, ST-RL demonstrates stable and outstanding performances even with a large number of covariates, which is especially appealing when data are from large observational studies. We illustrate the performance of ST-RL through simulation studies, and also a real data application using esophageal cancer data collected from 1170 patients at MD Anderson Cancer Center from 1998 to 2012. Supplementary materials for this article are available online.
Estimand: No additional assumptions besides identifiability are made. In their illustrative example they incorporate the ICE of toxicity and disease progression in a composite outcome (reward function). Other events can be incorporated as the time-varying states (covariates) of the individual that change over time and trigger different decision paths in the tree. The method estimates the Mean Counterfactual Outcome (or Value) under the optimal tree-structured rule.
Estimator Description: The paper proposes Stochastic Tree-based Reinforcement Learning (ST-RL). The primary novelty is the use of a "non-greedy policy search" algorithm. Unlike traditional tree-based methods (like CART) that grow trees greedily (step-by-step), ST-RL uses a Markov chain Monte Carlo (MCMC) algorithm to stochastically search the entire space of possible decision trees. This aims to avoid local optima and provide interpretable optimal DTRs.
Authors: D Talbot, EEM Moodie, C Diorio
Year: 2023
Precision medicine aims to tailor treatment decisions according to patients' characteristics. G-estimation and dynamic weighted ordinary least squares are double robust methods to identify optimal adaptive treatment strategies. It is underappreciated that they require modeling all existing treatment-confounder interactions to be consistent. Identifying optimal partially adaptive treatment strategies that tailor treatments according to only a few covariates, ignoring some interactions, may be preferable in practice. Building on G-estimation and dWOLS, we propose estimators of such partially adaptive strategies and demonstrate their double robustness. We investigate these estimators in a simulation study. Using data maintained by the Centre des Maladies du Sein, we estimate a partially adaptive treatment strategy for tailoring hormonal therapy use in breast cancer patients. R software implementing our estimators is provided.
Estimand: The authors assume non-informative censoring and employ an accelerated failure time model to correct for censoring. The causal effect measure is the discrepancy between the two treatments under investigation (risk difference).
Estimator Description: The paper highlights that it is underappreciated that G-estimation and dWOLS require modeling of all treatment-confounder interactions. The authors argue that optimal partially adaptive treatment strategies, which tailor treatments based on only a few variables, may be preferable in practice, and they propose a doubly robust estimator to this end.
Authors: Matteo Tortora, Ermanno Cordelli, Rosa Sicilia, Marianna Miele, Paolo Matteucci, Giulio Iannello, Sara Ramella, Paolo Soda
Year: 2021
Lung cancer is by far the leading cause of cancer death among both men and women. Radiation therapy is one of the main approaches to lung cancer treatment, and its planning is crucial for the therapy outcome. However, the current practice that uniformly delivers the dose does not take into account the patient-specific tumour features that may affect treatment success. Since radiation therapy is by its very nature a sequential procedure, Deep Reinforcement Learning (DRL) is a well-suited methodology to overcome this limitation. In this respect, in this work we present a DRL controller optimizing the daily dose fraction delivered to the patient on the basis of CT scans collected over time during the therapy, offering a personalized treatment not only for volume adaptation, as currently intended, but also for daily fractionation. Furthermore, this contribution introduces a virtual radiotherapy environment based on a set of ordinary differential equations modelling the tissue radiosensitivity by combining both the effect of the radiotherapy treatment and cell growth. Their parameters are estimated from CT scans routinely collected using the Particle Swarm Optimization algorithm. This permits the DRL to learn the optimal behaviour through an iterative trial and error process with the environment. We performed several experiments considering three reward functions modelling treatment strategies with different tissue aggressiveness and two exploration strategies for the exploration-exploitation dilemma. The results show that our DRL approach can adapt to radiation therapy treatment, optimizing its behaviour according to the different reward functions and outperforming the current clinical practice.
Estimand: The method relies on the assumption that the clinical problem can be modeled as a Markov Decision Process (MDP). It also relies on the validity of the ordinary differential equations (ODEs) used to create the virtual radiotherapy environment and model tissue radiosensitivity. Handling of ICE is not specified as the algorithm was trained and evaluated solely within a virtual environment. The measure is the Cumulative Reward, which aggregates the immediate rewards received at each time step (reflecting tumor control and side effects). The effect is evaluated by comparing the cumulative reward of the DRL agent against current clinical practice.
Estimator Description: The paper introduces a Deep Reinforcement Learning (DRL) controller to optimize dynamic treatment regimes (daily dose fractionation) for lung cancer patients. The novelty lies in using Deep Reinforcement Learning (specifically D3QN) combined with a ‘virtual radiotherapy environment’ based on ordinary differential equations parameterized by daily CT scans (non-invasive data) to optimize daily fractionation, rather than just volume adaptation. The parameters of the differential equations are estimated using particle swarm optimization.
Authors: KC Trenou, M Mésidor, A Eslami, H Nabi, C Diorio, D Talbot
Year: 2025
Estimating optimal adaptive treatment strategies (ATSs) can be done in several ways, including dynamic weighted ordinary least squares (dWOLS). This approach is doubly robust as it requires modeling both the treatment and the response, but only one of those models needs to be correctly specified to obtain a consistent estimator. For estimating an average treatment effect, doubly robust methods have been shown to combine better with machine learning methods than alternatives. However, the use of machine learning within dWOLS has not yet been investigated. Using simulation studies, we evaluate and compare the performance of the dWOLS estimator when the treatment probability is estimated either using machine learning algorithms or a logistic regression model. We further investigate the use of an adaptive m-out-of-n bootstrap method for producing inferences. SuperLearner performed at least as well as logistic regression in terms of bias and variance in scenarios with simple data-generating models and often had improved performance in more complex scenarios. Moreover, the m-out-of-n bootstrap produced confidence intervals with nominal coverage probabilities for parameters that were estimated with low bias. We also apply our proposed approach to the data from a breast cancer registry in Québec, Canada, to estimate an optimal ATS to personalize the use of hormonal therapy in breast cancer patients. Our method is implemented in the R software and available on GitHub https://github.com/kosstre20/MachineLearningToControlConfoundingPersonalizedMedicine.git. We recommend routine use of machine learning to model treatment within dWOLS, at least as a sensitivity analysis for the point estimates.
Estimand: No additional assumptions are made. Informative censoring was corrected using a hypothetical strategy via an accelerated failure time model, which requires conditional unconfoundedness w.r.t. censoring as a treatment variable. This is equivalent to assuming CAR.
Estimator Description: Doubly robust methods have been shown to perform better when using machine learning to estimate nuisance parameters. The use of machine learning for dWOLS has not been investigated. They compare the calculation of the treatment allocation probabilities for dWOLS using machine learning (random forest, naive Bayes, SVM, neural nets, SuperLearner) to logistic regression. They provide confidence intervals for inference purposes and a software implementation.
Software: https://github.com/kosstre20/MachineLearningToControlConfoundingPersonalizedMedicine.git
Authors: Mark J. van der Laan
Year: 2014
Suppose that we observe a population of causally connected units. On each unit at each time-point on a grid we observe a set of other units the unit is potentially connected with, and a unit-specific longitudinal data structure consisting of baseline and time-dependent covariates, a time-dependent treatment, and a final outcome of interest. The target quantity of interest is defined as the mean outcome for this group of units if the exposures of the units would be probabilistically assigned according to a known specified mechanism, where the latter is called a stochastic intervention. Causal effects of interest are defined as contrasts of the mean of the unit-specific outcomes under different stochastic interventions one wishes to evaluate. This covers a large range of estimation problems from independent units, independent clusters of units, and a single cluster of units in which each unit has a limited number of connections to other units. The allowed dependence includes treatment allocation in response to data on multiple units and so called causal interference as special cases. We present a few motivating classes of examples, propose a structural causal model, define the desired causal quantities, address the identification of these quantities from the observed data, and define maximum likelihood based estimators based on cross-validation. In particular, we present maximum likelihood based super-learning for this network data. Nonetheless, such smoothed/regularized maximum likelihood estimators are not targeted and will thereby be overly biased w.r.t. the target parameter, and, as a consequence, generally not result in asymptotically normally distributed estimators of the statistical target parameter. To formally develop estimation theory, we focus on the simpler case in which the longitudinal data structure is a point-treatment data structure. We formulate a novel targeted maximum likelihood estimator of this estimand and show that the double robustness of the efficient influence curve implies that the bias of the targeted minimum loss-based estimation (TMLE) will be a second-order term involving squared differences of two nuisance parameters. In particular, the TMLE will be consistent if either one of these nuisance parameters is consistently estimated. Due to the causal dependencies between units, the data set may correspond with the realization of a single experiment, so that establishing a (e.g. normal) limit distribution for the targeted maximum likelihood estimators, and corresponding statistical inference, is a challenging topic. We prove two formal theorems establishing the asymptotic normality using advances in weak-convergence theory. We conclude with a discussion and refer to an accompanying technical report for extensions to general longitudinal data structures.
Estimand: This method relaxes the SUTVA assumption by allowing for interference. This is accomplished by defining causal effects of interest as contrasts of the mean of the unit-specific outcomes under different interventions (thus falling under the hypothetical strategy, here applied to the ICE of interference). The estimator uses the Mean Difference (or Risk Difference) between counterfactual population means under different stochastic interventions.
Estimator Description: The paper proposes a Targeted Minimum Loss-based Estimator (TMLE) specifically designed for a population of causally connected units (e.g., networks, infectious disease transmission), where the outcome of one unit depends on the treatment and covariates of others (interference). The novelty lies in defining causal effects in a network setting using stochastic interventions and providing a TMLE that respects the complex dependence structure (using a ‘working model’ of independence for the fluctuation step while preserving consistency). This practically relaxes the SUTVA assumption for this estimator.
Authors: Mark J. van der Laan, Alexander R. Luedtke
Year: 2015
We consider estimation of and inference for the mean outcome under the optimal dynamic two time-point treatment rule defined as the rule that maximizes the mean outcome under the dynamic treatment, where the candidate rules are restricted to depend only on a user-supplied subset of the baseline and intermediate covariates. This estimation problem is addressed in a statistical model for the data distribution that is nonparametric beyond possible knowledge about the treatment and censoring mechanism. This contrasts with the current literature, which relies on parametric assumptions. We establish that the mean of the counterfactual outcome under the optimal dynamic treatment is a pathwise differentiable parameter under conditions, and develop a targeted minimum loss-based estimator (TMLE) of this target parameter. We establish asymptotic linearity and statistical inference for this estimator under specified conditions. In a sequentially randomized trial the statistical inference relies upon a second-order difference between the estimator of the optimal dynamic treatment and the optimal dynamic treatment to be asymptotically negligible, which may be a problematic condition when the rule is based on multivariate time-dependent covariates. To avoid this condition, we also develop TMLEs and statistical inference for data adaptive target parameters that are defined in terms of the mean outcome under the estimate of the optimal dynamic treatment. In particular, we develop a novel cross-validated TMLE approach that provides asymptotic inference under minimal conditions, avoiding the need for any empirical process conditions. We offer simulation results to support our theoretical findings.
Estimand: The method works for any outcome that can be bounded (e.g., scaled to [0,1] for the logistic fluctuation).
Estimator Description: The authors propose a Targeted Minimum Loss-based Estimator (TMLE). The key novelty is providing a fully nonparametric inference framework for the value of the optimal regime itself, whereas previous methods (like Q-learning) often relied on parametric assumptions or did not provide formal asymptotic inference for the value of the estimated rule when the rule is data-adaptive. They introduce a ‘CV-TMLE’ (Cross-Validated TMLE) to avoid the ‘Donsker class’ conditions usually required for the consistency of the rule estimation, making the inference robust even when aggressive machine learning (Super Learner) is used to estimate the nuisance parameters.
Authors: Mark J. van der Laan
Year: 2010
In this article, we provide a template for the practical implementation of the targeted maximum likelihood estimator for analyzing causal effects of multiple time point interventions, for which the methodology was developed and presented in Part I. In addition, the application of this template is demonstrated in two important estimation problems: estimation of the effect of individualized treatment rules based on marginal structural models for treatment rules, and the effect of a baseline treatment on survival in a randomized clinical trial in which the time till event is subject to right censoring.
Estimand: The estimator handles ICE by intervening on them and calculating counterfactual outcomes. It can therefore be used for either a Hypothetical or a Treatment Policy estimand defined by the G-computation formula (intervening on the treatment nodes in the system). The estimator can intervene on censoring status (counterfactually speaking) through the use of inverse probability of censoring weights.
Estimator Description: Targeted Maximum Likelihood Estimator (TMLE) is a two-stage estimator where ‘the first stage applies regularized maximum likelihood based estimation... and the second stage targets the obtained fit from the first stage towards the target parameter of interest through a targeted maximum likelihood step’. This step involves applying a fluctuation function to the initial estimate to remove bias.
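Code Sketch: A minimal single time point illustration of the two-stage TMLE logic described above, targeting the counterfactual mean under treatment on simulated data. This is a sketch only, not the paper's multiple time point template, and the initial glm fits stand in for the regularized/super-learning first stage.

    set.seed(3)
    n <- 500
    W <- rnorm(n)
    A <- rbinom(n, 1, plogis(0.4 * W))                 # treatment mechanism
    Y <- rbinom(n, 1, plogis(0.5 * A + 0.3 * W))       # binary outcome

    # Stage 1: initial estimates of the outcome regression and propensity score
    Qfit <- glm(Y ~ A + W, family = binomial)
    gfit <- glm(A ~ W, family = binomial)
    QAW  <- predict(Qfit, type = "response")
    Q1W  <- predict(Qfit, newdata = data.frame(A = 1, W = W), type = "response")
    gW   <- predict(gfit, type = "response")

    # Stage 2: fluctuate the initial fit with clever covariate H = A / g(W)
    H   <- A / gW
    eps <- coef(glm(Y ~ -1 + H, offset = qlogis(QAW), family = binomial))
    Q1star <- plogis(qlogis(Q1W) + eps / gW)           # updated prediction at A = 1
    mean(Q1star)                                       # targeted estimate of E[Y(1)]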
Authors: Mark J. van der Laan
Year: 2010
Given causal graph assumptions, intervention-specific counterfactual distributions of the data can be defined by the so called G-computation formula, which is obtained by carrying out these interventions on the likelihood of the data factorized according to the causal graph. The obtained G-computation formula represents the counterfactual distribution the data would have had if this intervention would have been enforced on the system generating the data. A causal effect of interest can now be defined as some difference between these counterfactual distributions indexed by different interventions. For example, the interventions can represent static treatment regimens or individualized treatment rules that assign treatment in response to time-dependent covariates, and the causal effects could be defined in terms of features of the mean of the treatment-regimen specific counterfactual outcome of interest as a function of the corresponding treatment regimens. Such features could be defined nonparametrically in terms of so called (nonparametric) marginal structural models for static or individualized treatment rules, whose parameters can be thought of as (smooth) summary measures of differences between the treatment regimen specific counterfactual distributions. In this article, we develop a particular targeted maximum likelihood estimator of causal effects of multiple time point interventions. This involves the use of loss-based super-learning to obtain an initial estimate of the unknown factors of the G-computation formula, and subsequently, applying a target-parameter specific optimal fluctuation function (least favorable parametric submodel) to each estimated factor, estimating the fluctuation parameter(s) with maximum likelihood estimation, and iterating this updating step of the initial factor till convergence. This iterative targeted maximum likelihood updating step makes the resulting estimator of the causal effect double robust in the sense that it is consistent if either the initial estimator is consistent, or the estimator of the optimal fluctuation function is consistent. The optimal fluctuation function is correctly specified if the conditional distributions of the nodes in the causal graph one intervenes upon are correctly specified. The latter conditional distributions often comprise the so called treatment and censoring mechanism. Selection among different targeted maximum likelihood estimators (e.g., indexed by different initial estimators) can be based on loss-based cross-validation such as likelihood based cross-validation or cross-validation based on another appropriate loss function for the distribution of the data. Some specific loss functions are mentioned in this article. Subsequently, a variety of interesting observations about this targeted maximum likelihood estimation procedure are made. This article provides the basis for the subsequent companion Part II-article in which concrete demonstrations for the implementation of the targeted MLE in complex causal effect estimation problems are provided.
Estimand: ICE can be addressed by intervening on them counterfactually. This requires assuming a MAR missingness mechanism, under which one can estimate the treatment effect that would have been observed had no participant been censored, using inverse probability of censoring weights (IPCW). This is a clear example of the hypothetical strategy for handling ICE. Embedded DTR tailoring is used to handle non-response.
Estimator Description: Targeted Maximum Likelihood Estimator (TMLE) is a two-stage estimator where the first stage applies regularized maximum likelihood based estimation... and the second stage targets the obtained fit from the first stage towards the target parameter of interest through a targeted maximum likelihood step. This step involves applying a fluctuation function to the initial estimate to remove bias.
Authors: M.J. Van Der Laan, S. Gruber
Year: 2014
We consider estimation of the effect of a multiple time point intervention on an outcome of interest, where the intervention nodes are subject to time-dependent confounding by intermediate covariates. In previous work van der Laan (2010) and Stitelman and van der Laan (2011a) developed and implemented a closed form targeted maximum likelihood estimator (TMLE) relying on the log-likelihood loss function, and demonstrated important gains relative to inverse probability of treatment weighted estimators and estimating equation based estimators. This TMLE relies on an initial estimator of the entire probability distribution of the longitudinal data structure. To enhance the finite sample performance of the TMLE of the target parameter it is of interest to select the smallest possible relevant part of the data generating distribution, which is estimated and updated by TMLE. Inspired by this goal, we develop a new closed form TMLE of an intervention specific mean outcome based on general longitudinal data structures. The target parameter is represented as an iterative sequence of conditional expectations of the outcome of interest. This collection of conditional means represents the relevant part, which is estimated and updated using the general TMLE algorithm. We also develop this new TMLE for other causal parameters, such as parameters defined by working marginal structural models. The theoretical properties of the TMLE are also practically demonstrated with a small scale simulation study. The proposed TMLE builds upon a previously proposed estimator (Bang and Robins, 2005) by integrating some of its key and innovative ideas into the TMLE framework.
Estimand: The proposed method uses the hypothetical strategy to deal with the ICE of non-adherence and informative censoring. They use TMLE to estimate the potential outcome under full adherence and no censoring. This requires choosing an adjustment set that contains all common causes of treatment (which here includes censoring and adherence) and the outcome. This entails that the coarsening-at-random assumption must hold (which is identical to conditional exchangeability w.r.t. censoring). They estimate the effect of the treatment regime by calculating the intervention-specific mean.
Estimator Description: The novelty is a new closed-form TMLE for longitudinal data that is streamlined and computationally simpler than previous versions. Instead of estimating the entire density of the longitudinal data (as in previous TMLEs), this method iteratively estimates a sequence of conditional means (nested regressions) of the outcome. This focuses the estimation on the ‘relevant part’ of the data distribution needed for the target parameter, improving finite sample performance and reducing computational burden.
Software: The authors provide R code directly in the appendix of the paper.
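Code Sketch: A minimal illustration of the iterated conditional expectation representation described above, shown without the targeting/fluctuation updates (i.e., plain sequential-regression g-computation in the spirit of Bang and Robins, 2005) on a simulated two-stage data structure under the static regime "always treat"; variable names are hypothetical.

    set.seed(4)
    n  <- 1000
    L1 <- rnorm(n); A1 <- rbinom(n, 1, plogis(L1))
    L2 <- rnorm(n, 0.5 * A1 + 0.3 * L1); A2 <- rbinom(n, 1, plogis(L2))
    Y  <- rnorm(n, A1 + A2 + L2)

    # Backward step 1: regress Y on full history, evaluate at the intervened A2 = 1
    fit2 <- lm(Y ~ A2 + L2 + A1 + L1)
    Q2   <- predict(fit2, newdata = data.frame(A2 = 1, L2, A1, L1))

    # Backward step 2: regress the iterated outcome Q2 on earlier history, set A1 = 1
    fit1 <- lm(Q2 ~ A1 + L1)
    Q1   <- predict(fit1, newdata = data.frame(A1 = 1, L1))

    mean(Q1)   # estimate of the intervention-specific mean under (A1, A2) = (1, 1)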
Authors: Michael P. Wallace, Erica E. M. Moodie
Year: 2015
Personalized medicine is a rapidly expanding area of health research wherein patient level information is used to inform their treatment. Dynamic treatment regimens (DTRs) are a means of formalizing the sequence of treatment decisions that characterize personalized management plans. Identifying the DTR which optimizes expected patient outcome is of obvious interest and numerous methods have been proposed for this purpose. We present a new approach which builds on two established methods: Q-learning and G-estimation, offering the doubly robust property of the latter but with ease of implementation much more akin to the former. We outline the underlying theory, provide simulation studies that demonstrate the double-robustness and efficiency properties of our approach, and illustrate its use on data from the Promotion of Breastfeeding Intervention Trial.
Estimand: The method (like standard G-estimation) typically assumes a linear form for the ‘blip’ function (the treatment effect modification), though it can be extended. The authors use the PROBIT dataset to illustrate their method. They do not address any specific ICE.
Estimator Description: The authors propose a new method called dynamic weighted ordinary least squares (dWOLS). The novelty lies in combining the computational simplicity of Q-learning (which uses simple regression) with the double robustness property typically found in G-estimation. Standard Q-learning is not doubly robust (it fails if the outcome model is misspecified). G-estimation is doubly robust but computationally intensive (requiring grid searches or optimization of non-smooth estimating equations). This new estimator allows for closed-form estimation (like Q-learning) using a specific weighting scheme that incorporates the propensity score, thereby achieving double robustness without the computational burden.
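Code Sketch: A one-stage sketch of the dWOLS idea on simulated confounded data, with hypothetical variable names: an ordinary least squares fit with weights |A - pihat(x)| yields consistent blip (treatment interaction) estimates if either the treatment model or the treatment-free outcome model is correct.

    set.seed(5)
    n <- 1000
    x <- rnorm(n)
    A <- rbinom(n, 1, plogis(0.6 * x))        # confounded treatment assignment
    y <- x + A * (1 + 0.8 * x) + rnorm(n)     # true blip: 1 + 0.8 * x

    pihat <- fitted(glm(A ~ x, family = binomial))
    w     <- abs(A - pihat)                   # the dWOLS weights
    fit   <- lm(y ~ x * A, weights = w)
    coef(fit)[c("A", "x:A")]                  # blip estimates, close to (1, 0.8)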
Authors: H Xiong, F Wu, L Deng, M Su, Z Shahn, LH Lehman
Year: 2024
In the context of medical decision making, counterfactual prediction enables clinicians to predict treatment outcomes of interest under alternative courses of therapeutic actions given observed patient history. In this work, we present G-Transformer for counterfactual outcome prediction under dynamic and time-varying treatment strategies. Our approach leverages a Transformer architecture to capture complex, long-range dependencies in time-varying covariates while enabling g-computation, a causal inference method for estimating the effects of dynamic treatment regimes. Specifically, we use a Transformer-based encoder architecture to estimate the conditional distribution of relevant covariates given covariate and treatment history at each time point, then produce Monte Carlo estimates of counterfactual outcomes by simulating forward patient trajectories under treatment strategies of interest. We evaluate G-Transformer extensively using two simulated longitudinal datasets from mechanistic models, and a real-world sepsis ICU dataset from MIMIC-IV. G-Transformer outperforms both classical and state-of-the-art counterfactual prediction models in these settings. To the best of our knowledge, this is the first Transformer-based architecture that supports g-computation for counterfactual outcome prediction under dynamic and time-varying treatment strategies.
Estimand: No additional assumptions are made. In the context of the ICU application (MIMIC-IV), death and discharge are handled as absorbing states or end-of-trajectory events, which is essentially a while-on-treatment strategy (simulation stops or is fixed upon these events). The measure used is the expected counterfactual outcome under a specific regime.
Estimator Description: The authors propose the first Transformer-based architecture enabling g-computation, capturing complex, long-range dependencies in time-varying covariates.
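Code Sketch: The underlying Monte Carlo g-computation recipe the paper implements with a Transformer, shown here in minimal form with glm-based conditional models on simulated data (hypothetical names): fit a model for each post-baseline covariate given history, then simulate trajectories forward under the strategy of interest.

    set.seed(6)
    n  <- 2000
    l1 <- rnorm(n); a1 <- rbinom(n, 1, plogis(l1))
    l2 <- rnorm(n, 0.7 * l1 - 0.5 * a1); a2 <- rbinom(n, 1, plogis(l2))
    y  <- rnorm(n, l2 + a1 + a2)

    # Conditional models for each node given its history (Transformer stand-ins)
    m_l2 <- lm(l2 ~ l1 + a1)
    m_y  <- lm(y ~ l2 + a1 + a2)

    # Monte Carlo forward simulation under the static strategy "always treat"
    B   <- 10000
    l1s <- sample(l1, B, replace = TRUE)      # resample baselines from the data
    l2s <- rnorm(B, predict(m_l2, data.frame(l1 = l1s, a1 = 1)), sigma(m_l2))
    ys  <- predict(m_y, data.frame(l2 = l2s, a1 = 1, a2 = 1))
    mean(ys)   # counterfactual mean outcome under (a1, a2) = (1, 1)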
Authors: Y. X. Xu, P. Müller, A. S. Wahed, P. F. Thall
Year: 2016
We analyze a dataset arising from a clinical trial involving multi-stage chemotherapy regimes for acute leukemia. The trial design was a 2 x 2 factorial for frontline therapies only. Motivated by the idea that subsequent salvage treatments affect survival time, we model therapy as a dynamic treatment regime (DTR), that is, an alternating sequence of adaptive treatments or other actions and transition times between disease states. These sequences may vary substantially between patients, depending on how the regime plays out. To evaluate the regimes, mean overall survival time is expressed as a weighted average of the means of all possible sums of successive transition times. We assume a Bayesian nonparametric survival regression model for each transition time, with a dependent Dirichlet process prior and Gaussian process base measure (DDP-GP). Posterior simulation is implemented by Markov chain Monte Carlo (MCMC) sampling. We provide general guidelines for constructing a prior using empirical Bayes methods. The proposed approach is compared with inverse probability of treatment weighting, including a doubly robust augmented version of this approach, for both single-stage and multi-stage regimes with treatment assignment depending on baseline covariates. The simulations show that the proposed nonparametric Bayesian approach can substantially improve inference compared to existing methods. An R program for implementing the DDP-GP-based Bayesian nonparametric analysis is freely available at www.ams.jhu.edu/yxu70. Supplementary materials for this article are available online.
Estimand: The method does not assume non-informative censoring; instead, it assumes that censoring is independent of transition times (part of the outcome) given baseline covariates and prior transition times. This is akin to a coarsening-at-random assumption. The Bayesian nonparametric approach removes the need for model-specific assumptions (such as proportional hazards). While the model does calculate time-to-event estimates for potential ICE, it does not incorporate them into the overall outcome measure (which remains the mean survival time). However, the model is able to recommend a DTR around these ICE and incorporate them into the treatment plan, thus applying the DTR strategy. While it might seem like the composite strategy is employed, the clock does not ‘stop’ when the competing risk occurs.
Estimator Description: The authors propose a Bayesian Nonparametric (BNP) method using Dependent Dirichlet Process (DDP) mixtures to estimate the survival distribution under dynamic treatment regimes. The key novelty is handling sequential transition times that depend on the occurrence of an ICE (e.g., time to remission, time to relapse, time to death) as the part of the estimated quantity, rather than just survival time. This allows the model to capture the complex, history-dependent nature of disease progression (e.g., the time to relapse depends on the duration of the previous remission). By modeling the joint distribution of these gap times flexibly using BNP, they can reconstruct the overall survival distribution for any dynamic regime without making strong parametric assumptions (like proportional hazards) that are often violated in multi-stage settings.
Software: https://www.ma.utexas.edu/users/yxu/
Authors: Fei Xue, Yanqing Zhang, Wenzhuo Zhou, Haoda Fu, Annie Qu
Year: 2022
An optimal dynamic treatment regime (DTR) consists of a sequence of decision rules in maximizing long-term benefits, which is applicable for chronic diseases such as HIV infection or cancer. In this article, we develop a novel angle-based approach to search the optimal DTR under a multicategory treatment framework for survival data. The proposed method targets to maximize the conditional survival function of patients following a DTR. In contrast to most existing approaches which are designed to maximize the expected survival time under a binary treatment framework, the proposed method solves the multicategory treatment problem given multiple stages for censored data. Specifically, the proposed method obtains the optimal DTR via integrating estimations of decision rules at multiple stages into a single multicategory classification algorithm without imposing additional constraints, which is also more computationally efficient and robust. In theory, we establish Fisher consistency and provide the risk bound for the proposed estimator under regularity conditions. Our numerical studies show that the proposed method outperforms competing methods in terms of maximizing the conditional survival probability. We apply the proposed method to two real datasets: Framingham heart study data and acquired immunodeficiency syndrome clinical data. Supplementary materials for this article are available online.
Estimand: The authors apply the hypothetical strategy to handle the ICE of censoring. They do so by using IPW, which assumes CAR. The method estimates the Conditional Survival Probability (specifically, the probability of surviving beyond a specific time point).
Estimator Description: The paper introduces a new machine learning method called Multicategory Angle-based Learning (MAL) for estimating optimal Dynamic Treatment Regimes (DTRs). It specifically addresses the challenges of censored survival data and multicategory treatments (more than two treatment options) across multiple decision stages. The novelty lies in extending the ‘Angle-based Large-margin Classification’ (a machine learning classification technique) to the context of multi-stage dynamic treatment regimes with censored survival outcomes. Most existing methods focus on binary treatments or mean survival time; this method handles multiple treatment options and targets the conditional survival probability directly using a robust hinge loss function.
Authors: C Yin, R Liu, J Caterino, P Zhang
Year: 2022
Despite intense efforts in basic and clinical research, an individualized ventilation strategy for critically ill patients remains a major challenge. Recently, dynamic treatment regime (DTR) with reinforcement learning (RL) on electronic health records (EHR) has attracted interest from both the healthcare industry and machine learning research community. However, most learned DTR policies might be biased due to the existence of confounders. Although some treatment actions non-survivors received may be helpful, if confounders cause the mortality, the training of RL models guided by long-term outcomes (e.g., 90-day mortality) would punish those treatment actions causing the learned DTR policies to be suboptimal. In this study, we develop a new deconfounding actor-critic network (DAC) to learn optimal DTR policies for patients. To alleviate confounding issues, we incorporate a patient resampling module and a confounding balance module into our actor-critic framework. To avoid punishing the effective treatment actions non-survivors received, we design a short-term reward to capture patients' immediate health state changes. Combining short-term with long-term rewards could further improve the model performance. Moreover, we introduce a policy adaptation method to successfully transfer the learned model to new-source small-scale datasets. The experimental results on one semi-synthetic and two different real-world datasets show the proposed model outperforms the state-of-the-art models. The proposed model provides individualized treatment decisions for mechanical ventilation that could improve patient outcomes.
Estimand: The method assumes the Markov property. The authors handle bias induced by the devaluation of treatment strategies used on critically ill patients who die by engineering the reward to capture short-term effects as well as long-term effects. This can be seen as a sort of hypothetical strategy. The causal effect measure is the cumulative reward.
Estimator Description: The paper introduces a new machine learning method called the Deconfounding Actor-Critic Network (DAC). It is explicitly designed to estimate and optimize Dynamic Treatment Regimes (DTRs) using observational electronic health record (EHR) data. The novelty addresses the issue of confounding bias in Reinforcement Learning (RL) when applied to observational data. Standard RL agents might punish effective treatments if they were given to sicker patients who died (due to severity, not the treatment). To solve this, the authors introduce 1) a patient resampling module to balance the dataset, 2) a confounding balance module (using representation learning/autoencoders) to minimize the association between patient states and treatment assignments, and 3) a short-term reward mechanism to capture immediate health changes (e.g., SOFA score improvement) rather than relying solely on sparse long-term outcomes (mortality). A policy adaptation method additionally allows the learned model to be transferred to new-source small-scale datasets.
Authors: W Yu, H Bondell
Year: 2024
We propose a semiparametric approach to Bayesian modeling of dynamic treatment regimes that is built on a Bayesian likelihood-based regression estimation framework. Methods based on this framework exhibit a probabilistic coherence property that leads to accurate estimation of the optimal dynamic treatment regime. Unlike most Bayesian estimation methods, our proposed method avoids strong distributional assumptions for the intermediate and final outcomes by utilizing empirical likelihoods. Our proposed method allows for either linear, or more flexible forms of mean functions for the stagewise outcomes. A variational Bayes approximation is used for computation to avoid common pitfalls associated with Markov Chain Monte Carlo approaches coupled with empirical likelihood. Through simulations and analysis of the STAR*D sequential randomized trial data, our proposed method demonstrates superior accuracy over Q-learning and parametric Bayesian likelihood-based regression estimation, particularly when the parametric assumptions of regression error distributions may be potentially violated.
Estimand: The benefit of this estimator is that it relaxes strong distributional assumptions (e.g., normality). Responder status was an embedded tailoring variable in the STAR*D SMART design: participants were intentionally re-randomized at the second stage based on this specific ICE. The effect measure is the expected final outcome under the optimal regime.
Estimator Description: The authors propose a semi-parametric Bayesian estimator for DTRs. The novelty of this method lies in replacing the parametric distributional assumptions (e.g., Gaussian errors) of standard Bayesian regression estimators with empirical likelihood (EL). This allows the method to maintain the probabilistic coherence of likelihood-based methods (which Q-learning lacks in some settings) while being robust to misspecification of the error distributions (e.g., non-normality).
Authors: Amir Ebrahimi Zade, Seyedhamidreza Shahabi Haghighi, M. Soltani
Year: 2022
Background and objectives: Glioblastoma multiforme (GBM) is the most common and deadly type of primary cancers of the brain and central nervous system in adults. Despite the importance of designing a personalized treatment regimen for the patient, clinical trials prescribe a set of conventional regimens for GBM patients. We propose a computerized framework for designing chemo-radiation therapy (CRT) regimen based on patient characteristics. Methods: An intelligent agent, based on deep reinforcement learning, interacts with a virtual personalized GBM. The proposed deep Q network (DQN) uses a deep neural network to estimate the state-action value function. The algorithm stores agent experiences in a replay memory to be used for training of the deep neural network. Also, the proliferation-invasion model is used to simulate spatiotemporal dynamics of GBM growth and its response to therapeutic agents. Results: Assuming tumor size at the end of the treatment course as a measure of the quality of the treatment regimen, experiments show that the proposed DQN is superior to the Q learning. Also, while the quality of the protocols obtained by the Q learning as well as its convergence speed decreases sharply with the increase in the dimensions of the state-action value function, the DQN is relatively robust against increasing the initial tumor size or lengthening the treatment period. Conclusion: Our results suggest that the optimal personalized treatment regimen may differ from the conventional regimens suggested by clinical trials. Given the scalability of the proposed DQN in designing treatment regimen for real size tumors, as well as its superiority over previous models, it is a suitable tool for designing personalized CRT regimen for GBM patients.
Estimand: The validity of the method relies on the assumption that the mathematical model (Pillai et al., 2011) for simulating tumor growth accurately represents the biological growth and treatment response of GBM in humans. Furthermore, a Markov decision process is assumed, with the full state (tumor radius and cell densities) observable or estimable at decision points. The ICE of toxicity is handled by incorporating it into a composite outcome within the reward function of the model. The effect measure is the difference in the final tumor radius between the proposed DRL method and the standard protocol.
Estimator Description: The paper introduces a new framework using Deep Reinforcement Learning (DRL), specifically Deep Q-Networks (DQN), to design and estimate the effect of personalized dynamic treatment regimes (chemo-radiation therapy) for Glioblastoma Multiforme (GBM). It utilizes a ‘virtual patient’ environment based on a mechanistic mathematical model to train the ML agent. The novelty lies in the application of Deep Q-Networks (DQN) to a specific, complex biological mathematical model (the Pillai model) of tumor growth. It formulates the clinical problem of concurrent chemo-radiation as a Reinforcement Learning problem, defining specific state spaces (tumor radius, cell densities) and reward functions (tumor reduction vs. toxicity) to optimize the therapeutic ratio in an in silico environment.
Authors: B Zhang, M Zhang
Year: 2018
A dynamic treatment regime is a sequence of decision rules, each corresponding to a decision point, that determine the next treatment based on each individual's own available characteristics and treatment history up to that point. We show that identifying the optimal dynamic treatment regime can be recast as a sequential optimization problem and propose a direct sequential optimization method to estimate the optimal treatment regimes. In particular, at each decision point, the optimization is equivalent to sequentially minimizing a weighted expected misclassification error. Based on this classification perspective, we propose a powerful and flexible C-learning algorithm to learn the optimal dynamic treatment regimes backward sequentially from the last stage until the first stage. C-learning is a direct optimization method that directly targets optimizing decision rules by exploiting powerful optimization/classification techniques and it allows incorporation of patient's characteristics and treatment history to improve performance, hence enjoying advantages of both the traditional outcome regression-based methods (Q- and A-learning) and the more recent direct optimization methods. The superior performance and flexibility of the proposed methods are illustrated through extensive simulation studies.
Estimand: Only the usual identifiability conditions must hold. Handling of ICE is not mentioned. The method uses a Contrast Function to determine the optimal decision, and minimizes a Weighted Misclassification Error to estimate the rule. This can be seen as a type of risk difference under alternative treatment regimes.
Estimator Description: The proposed method is C-learning (Classification-learning). It recasts the identification of optimal Dynamic Treatment Regimes (DTRs) as a sequential classification problem. The novelty is the transformation of the DTR estimation problem into a ‘weighted expected misclassification error’ minimization problem. This allows the use of powerful existing classification techniques (like CART or SVM) while decoupling the optimization step from the outcome modeling step. It aims to unify the benefits of regression-based methods (Q-learning) and direct optimization methods.
Authors: B Zhang, AA Tsiatis, M Davidian, M Zhang, E Laber
Year: 2012
A treatment regime maps observed patient characteristics to a recommended treatment. Recent technological advances have increased the quality, accessibility, and volume of patient-level data; consequently, there is a growing need for powerful and flexible estimators of an optimal treatment regime that can be used with either observational or randomized clinical trial data. We propose a novel and general framework that transforms the problem of estimating an optimal treatment regime into a classification problem wherein the optimal classifier corresponds to the optimal treatment regime. We show that commonly employed parametric and semi-parametric regression estimators, as well as recently proposed robust estimators of an optimal treatment regime can be represented as special cases within our framework. Furthermore, our approach allows any classification procedure that can accommodate case weights to be used without modification to estimate an optimal treatment regime. This introduces a wealth of new and powerful learning algorithms for use in estimating treatment regimes. We illustrate our approach using data from a breast cancer clinical trial.
Estimand: No additional assumptions besides the usual identifiability conditions are mentioned, though further conditions likely depend on the precise classifier that is employed. They use AIPWE to weight individual cases and correct for confounding, making the method doubly robust. No specific ICE are mentioned in their illustrative example using data from the National Surgical Adjuvant Breast and Bowel Project (NSABP). The causal effect measure is defined in terms of contrasts of expected mean outcomes.
Estimator Description: The authors propose a novel framework that ‘transforms the problem of estimating an optimal treatment regime into a classification problem wherein the optimal classifier corresponds to the optimal treatment regime’. Instead of inverting a regression model, they estimate the regime by minimizing an ‘expected weighted misclassification error’. This allows the use of any classification method that can accommodate weighted data (e.g., Support Vector Machines, CART) to estimate the regime.
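Code Sketch: A minimal single-stage illustration of the classification perspective on simulated data (hypothetical names): each subject is labeled by the sign of an estimated treatment contrast and weighted by its magnitude, and any weighted classifier (here a CART tree via rpart) then estimates the regime. A plain regression contrast stands in for the paper's AIPW version.

    library(rpart)
    set.seed(7)
    n <- 1000
    x <- rnorm(n)
    A <- rbinom(n, 1, 0.5)
    Y <- x + A * (x - 0.2) + rnorm(n)          # true contrast: C(x) = x - 0.2

    # Estimate the contrast C(x) = E[Y | x, A = 1] - E[Y | x, A = 0]
    fit  <- lm(Y ~ x * A)
    Chat <- predict(fit, data.frame(x, A = 1)) - predict(fit, data.frame(x, A = 0))

    # Weighted classification: label = who benefits, weight = how much is at stake
    dat  <- data.frame(x, label = factor(Chat > 0), w = abs(Chat))
    tree <- rpart(label ~ x, data = dat, weights = w, method = "class")
    tree   # splits near x = 0.2, recovering the rule "treat iff x > 0.2"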
Authors: Zhang Zhang, Danhui Yi, Yiwei Fan
Year: 2022
Patients with chronic diseases, such as cancer or epilepsy, are often followed through multiple stages of clinical interventions. Dynamic treatment regimes (DTRs) are sequences of decision rules that assign treatments at each stage based on measured covariates for each patient. A DTR is said to be optimal if the expectation of the desirable clinical benefit reaches a maximum when applied to a population. When there are three or more options for treatments at each decision point and the clinical outcome of interest is a time-to-event variable, estimating an optimal DTR can be complicated. We propose a doubly robust method to estimate optimal DTRs with multicategory treatments and survival outcomes. A novel blip function is defined to measure the difference in expected outcomes among treatments, and a doubly robust weighted least squares algorithm is designed for parameter estimation. Simulations using various weight functions and scenarios support the advantages of the proposed method in estimating optimal DTRs over existing approaches. We further illustrate the practical value of our method by applying it to data from the Standard and New Antiepileptic Drugs study. In this analysis, the proposed method supports the use of the new drug lamotrigine over the standard option carbamazepine. When the actual treatments match the estimated optimal treatments, survival outcomes tend to be better. The newly developed method provides a practical approach for clinicians that is not limited to cases of binary treatment options.
Estimand: The method assumes that the probability of censoring is independent of future outcomes given the information history (i.e., CAR). This allows the authors to apply the hypothetical strategy to adjust for informative censoring using IPW. Furthermore, by using treatment failure as the main outcome in one example, they also handle the ICE of toxicity leading to treatment failure. The authors use a blip function, which measures the difference between the expected log-survival time under a specific treatment and a weighted average of expected outcomes under all possible treatments at that stage.
Estimator Description: The authors propose a doubly robust method to estimate optimal DTRs with multicategory treatments and survival outcomes. The novelty lies in the definition of a new blip function for multicategory treatments that avoids the need for an arbitrary ‘reference’ treatment (e.g., placebo), which previous methods relied upon. Instead, it uses a weighted average of outcomes.
Authors: Baqun Zhang, Anastasios A. Tsiatis, Eric B. Laber, Marie Davidian
Year: 2013
A dynamic treatment regime is a list of sequential decision rules for assigning treatment based on a patient's history. Q- and A-learning are two main approaches for estimating the optimal regime, i.e., that yielding the most beneficial outcome in the patient population, using data from a clinical trial or observational study. Q-learning requires postulated regression models for the outcome, while A-learning involves models for that part of the outcome regression representing treatment contrasts and for treatment assignment. We propose an alternative to Q- and A-learning that maximizes a doubly robust augmented inverse probability weighted estimator for population mean outcome over a restricted class of regimes. Simulations demonstrate the method's performance and robustness to model misspecification, which is a key concern.
Estimand: The method does not make additional assumptions besides identifiability. Simulations are used in which treatment switching is based on covariates, but no specific ICE are mentioned. The causal effect measure is a risk difference of mean outcomes; the regime leading to the optimal mean outcome is chosen.
Estimator Description: The authors propose a novel doubly robust estimator for optimal Dynamic Treatment Regimes (DTRs) in a sequential (multi-stage) setting. The method generalizes the authors' previous classification-based framework (see Zhang et al., 2012) to multiple stages. The novelty lies in defining the optimal regime estimation as a sequential classification problem where the ‘outcome’ for the classification at stage k is a constructed doubly robust augmented inverse probability weighted (AIPW) estimator of the ‘contrast function’ (the benefit of treatment vs. control).
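Code Sketch: A single-stage sketch of the value estimation being maximized, for one fixed candidate rule d(x) on simulated randomized data (propensity 0.5): the AIPW estimator augments the IPW value with an outcome-model term, giving double robustness. Names are hypothetical; the paper maximizes this criterion over a restricted class of regimes and extends it across stages.

    set.seed(8)
    n <- 2000
    x <- rnorm(n)
    A <- rbinom(n, 1, 0.5)
    Y <- x + A * x + rnorm(n)

    d   <- function(x) as.integer(x > 0)       # a fixed candidate regime
    pid <- 0.5                                 # P(A = d(X) | X) under randomization
    Cd  <- as.integer(A == d(x))               # indicator of following the regime

    ipw <- mean(Cd * Y / pid)                  # simple IPW value estimate

    m   <- lm(Y ~ x * A)
    md  <- predict(m, data.frame(x, A = d(x))) # modeled outcome under the regime
    aipw <- mean(Cd * Y / pid - (Cd - pid) / pid * md)
    c(ipw = ipw, aipw = aipw)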
Authors: Y Zhang, DM Vock, ME Patrick, TA Murray
Year: 2023
A sequential multiple assignment randomized trial, which incorporates multiple stages of randomization, is a popular approach for collecting data to inform personalized and adaptive treatments. There is an extensive literature on statistical methods to analyze data collected in sequential multiple assignment randomized trials and estimate the optimal dynamic treatment regime. Q-learning with linear regression is widely used for this purpose due to its ease of implementation. However, model misspecification is a common problem with this approach, and little attention has been given to the impact of model misspecification when treatment effects are heterogeneous across subjects. This article describes the integrative impact of two possible types of model misspecification related to treatment effect heterogeneity: omitted early-stage treatment effects in late-stage main effect model, and violated linearity assumption between pseudo-outcomes and predictors despite non-linearity arising from the optimization operation. The proposed method, aiming to deal with both types of misspecification concomitantly, builds interactive models into modified parametric Q-learning with Murphy's regret function. Simulations show that the proposed method is robust to both sources of model misspecification. The proposed method is applied to a two-stage sequential multiple assignment randomized trial with embedded tailoring aimed at reducing binge drinking in first-year college students.
Estimand: Besides identifiability, this approach requires correct specification of the parametric model used to represent the Q-functions. The ICE (becoming a heavy drinker) was an embedded tailoring variable in the actual trial design. The M-bridge study was a two-stage Sequential Multiple Assignment Randomized Trial (SMART) in which heavy drinkers were specifically re-randomized to different interventions (automated email or online health coach), while non-heavy drinkers continued self-monitoring. This is a treatment policy strategy that embeds the ICE of non-response in the DTR as a tailoring variable. The causal effect measure is Murphy’s regret function (a kind of risk difference).
Estimator Description: The authors highlight that the impact of model misspecification is underestimated when the treatment effect varies across subjects. They describe two types of model misspecification arising in this situation: 1) omitted early-stage treatment effects in late-stage main effect models, 2) a violated linearity assumption between pseudo-outcomes and predictors despite non-linearity arising from the optimization operation. They propose a modified interactive Q-learning method to deal with both types of model misspecification in the face of treatment effect heterogeneity.
Authors:
Year: 2023
In recent sequential multiple assignment randomized trials, outcomes were assessed multiple times to evaluate longer-term impacts of the dynamic treatment regimes (DTRs). Q-learning requires a scalar response to identify the optimal DTR. Inverse probability weighting may be used to estimate the optimal outcome trajectory, but it is inefficient, susceptible to model mis-specification, and unable to characterize how treatment effects manifest over time. We propose modified Q-learning with generalized estimating equations to address these limitations and apply it to the M-bridge trial, which evaluates adaptive interventions to prevent problematic drinking among college freshmen. Simulation studies demonstrate our proposed method improves efficiency and robustness.
Estimand: Due to the use of linear regression and marginal models, parametric modeling assumptions need to be made. In the M-bridge SMART study for college drinking, the tailoring variable was defined based on the participant's response to the initial treatment (e.g., whether they had at least one heavy drinking day during the last 4 weeks of the first stage), meaning the ICE was incorporated as a tailoring variable in the DTR. The causal effect measure is the regret between the optimal treatment and the reference treatment (i.e., a risk difference).
Estimator Description: Sometimes the outcome is a series of repeated measures rather than a measurement at one specific time point. To estimate the longitudinal effect of treatments (the ‘trend’ associated with the optimal treatment regime), the authors propose marginal models with time-varying coefficients and estimate the model parameters with generalized estimating equations.
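As a rough illustration of the GEE building block (not the authors' full modified Q-learning pipeline), the following sketch fits a marginal model in which the treatment coefficient varies by visit; the simulated data and variable names are invented for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated repeated-measures data: n subjects, T visits each.
rng = np.random.default_rng(0)
n, T = 200, 4
df = pd.DataFrame({
    "id": np.repeat(np.arange(n), T),
    "time": np.tile(np.arange(T), n),
    "treat": np.repeat(rng.integers(0, 2, n), T),
})
# A treatment effect that grows over visits, plus noise.
df["y"] = 0.5 * df["treat"] * df["time"] + rng.normal(size=n * T)

# treat:C(time) gives the treatment a separate coefficient at every visit,
# i.e. a piecewise time-varying coefficient, while the exchangeable working
# correlation accounts for within-subject dependence.
fit = smf.gee("y ~ C(time) + treat:C(time)", groups="id", data=df,
              cov_struct=sm.cov_struct.Exchangeable()).fit()
print(fit.summary())
```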
Authors: Yingqi Zhao, Donglin Zeng, A. John Rush, Michael R. Kosorok
Year: 2012
There is increasing interest in discovering individualized treatment rules for patients who have heterogeneous responses to treatment. In particular, one aims to find an optimal individualized treatment rule which is a deterministic function of patient specific characteristics maximizing expected clinical outcome. In this paper, we first show that estimating such an optimal treatment rule is equivalent to a classification problem where each subject is weighted proportional to his or her clinical outcome. We then propose an outcome weighted learning approach based on the support vector machine framework. We show that the resulting estimator of the treatment rule is consistent. We further obtain a finite sample bound for the difference between the expected outcome using the estimated individualized treatment rule and that of the optimal treatment rule. The performance of the proposed approach is demonstrated via simulation studies and an analysis of chronic depression data.
Estimand: The proposed method doesn’t address handling of potential ICE; a reward function based on treatment response and covariates determines treatment switching. They assume that the reward R must be bounded and non-negative (which can be achieved by adding a constant). The causal effect measure therefore depends on the definition of the reward, which can take different shapes but is phrased as a mean outcome in this paper.
Estimator Description: The authors propose Outcome Weighted Learning (OWL). The novelty is that it does not require conditional mean modeling because it directly estimates the decision rule which maximizes clinical response. Instead of inverting a regression model (which can lead to suboptimal rules if the model is misspecified), they show that estimating the optimal rule is equivalent to a weighted classification problem where each subject is weighted proportional to his or her clinical outcome. They solve this using a weighted Support Vector Machine (SVM) approach with a hinge loss function.
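A minimal sketch of the OWL reduction, assuming a single stage with treatments coded in {-1, +1}; the linear kernel and the logistic propensity model are illustrative choices (in a randomized trial the propensities are known).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def owl_rule(X, A, R):
    """Outcome weighted learning sketch (single stage, A coded in {-1, +1}).
    Estimating the optimal rule reduces to a weighted classification problem:
    each subject enters a hinge-loss SVM with label A and weight R / pi(A|X),
    so high-reward subjects pull the boundary toward their observed treatment."""
    R = R - R.min()                             # rewards must be non-negative
    ps = LogisticRegression().fit(X, A).predict_proba(X)
    pi = np.where(A == 1, ps[:, 1], ps[:, 0])   # P(A = a_i | X = x_i)
    clf = SVC(kernel="linear")                  # hinge loss, as in the paper
    clf.fit(X, A, sample_weight=R / pi)
    return clf                # clf.predict(x_new) is the estimated rule
```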
Authors: Y. Zhao, D. Zeng, M.A. Socinski, M.R. Kosorok
Year: 2011
Typical regimens for advanced metastatic stage IIIB/IV nonsmall cell lung cancer (NSCLC) consist of multiple lines of treatment. We present an adaptive reinforcement learning approach to discover optimal individualized treatment regimens from a specially designed clinical trial (a "clinical reinforcement trial") of an experimental treatment for patients with advanced NSCLC who have not been treated previously with systemic therapy. In addition to the complexity of the problem of selecting optimal compounds for first- and second-line treatments based on prognostic factors, another primary goal is to determine the optimal time to initiate second-line therapy, either immediately or delayed after induction therapy, yielding the longest overall survival time. A reinforcement learning method called Q-learning is utilized, which involves learning an optimal regimen from patient data generated from the clinical reinforcement trial. Approximating the Q-function with time-indexed parameters can be achieved by using a modification of support vector regression that can utilize censored data. Within this framework, a simulation study shows that the procedure can extract optimal regimens for two lines of treatment directly from clinical data without prior knowledge of the treatment effect mechanism. In addition, we demonstrate that the design reliably selects the best initial time for second-line therapy while taking into account the heterogeneity of NSCLC across patients.
Estimand: The estimator can handle censoring but assumes censoring to be non-informative of the outcome. The method estimates the optimal sequence of treatment assignments, indicating a treatment policy strategy. In addition, it incorporates disease progression as a trigger for treatment switching, handling this event by building it into the decision process (a DTR strategy).
Estimator Description: The paper proposes a Reinforcement Learning estimator, specifically Q-learning, implemented using a modified Support Vector Regression method called ν-SVR-C to handle censored data. This method estimates the optimal dynamic treatment regime (a sequence of decision rules) that maximizes survival.
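The backward-induction skeleton of two-stage Q-learning can be sketched as below with scikit-learn's ν-SVR as the regression engine; the paper's censoring-aware modification (ν-SVR-C) is not reproduced here, and the function and variable names are illustrative.

```python
import numpy as np
from sklearn.svm import NuSVR

def q_learning_two_stage(X1, A1, X2, A2, Y):
    """Backward-induction Q-learning skeleton with nu-SVR as the regression
    engine; the paper's censoring-aware modification (nu-SVR-C) is omitted.
    Treatments A1, A2 are coded 0/1; Xk are stage-k covariate matrices."""
    # Stage 2: regress the final outcome on the full history plus action.
    H2 = np.column_stack([X1, A1, X2, A2])
    q2 = NuSVR().fit(H2, Y)
    # Pseudo-outcome: predicted outcome when acting optimally at stage 2.
    h2 = np.column_stack([X1, A1, X2])
    v2 = np.maximum(q2.predict(np.column_stack([h2, np.zeros(len(Y))])),
                    q2.predict(np.column_stack([h2, np.ones(len(Y))])))
    # Stage 1: regress the pseudo-outcome on stage-1 history plus action.
    q1 = NuSVR().fit(np.column_stack([X1, A1]), v2)
    return q1, q2   # optimal action at each stage: argmax over a of qk(h, a)
```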
Authors: YQ Zhao, R Zhu, G Chen, Y Zheng
Year: 2020
Dynamic treatment regimes are sequential decision rules that adapt throughout disease progression according to a patient's evolving characteristics. In many clinical applications, it is desirable that the format of the decision rules remains consistent over time. Unlike the estimation of dynamic treatment regimes in regular settings, where decision rules are formed without shared parameters, the derivation of the shared decision rules requires estimating shared parameters indexing the decision rules across different decision points. Estimation of such rules becomes more complicated when the clinical outcome of interest is a survival time subject to censoring. To address these challenges, we propose two novel methods: censored shared-Q-learning and censored shared-O-learning. Both methods incorporate clinical preferences into a qualitative rule, where the parameters indexing the decision rules are shared across different decision points and estimated simultaneously. We use simulation studies to demonstrate the superior performance of the proposed methods. The methods are further applied to the Framingham Heart Study to derive treatment rules for cardiovascular disease.
Estimand: The authors use a method by Goldberg and Kosorok (2012) that corrects the bias induced by informative censoring using inverse-probability-of-censoring weighting. This relies on the CAR assumption. Specific ICE handling is not addressed in the text or the illustrative example. The outcome of interest is the mean survival time.
Estimator Description: In many clinical settings it is desirable that the decision rules stay the same across time (while still being dynamic in the sense of proposing treatments based on the patient’s health status and medical history). Existing estimation methods with such parameter sharing across time do not accommodate censored data. This paper proposes censored shared-Q-learning and censored shared-O-learning for survival data. The method also allows for incorporation of clinical preferences.
Authors: YQ Zhao, D Zeng, EB Laber, MR Kosorok
Year: 2015
Dynamic treatment regimes (DTRs) are sequential decision rules for individual patients that can adapt over time to an evolving illness. The goal is to accommodate heterogeneity among patients and find the DTR which will produce the best long term outcome if implemented. We introduce two new statistical learning methods for estimating the optimal DTR, termed backward outcome weighted learning (BOWL), and simultaneous outcome weighted learning (SOWL). These approaches convert individualized treatment selection into an either sequential or simultaneous classification problem, and can thus be applied by modifying existing machine learning techniques. The proposed methods are based on directly maximizing over all DTRs a nonparametric estimator of the expected long-term outcome; this is fundamentally different than regression-based methods, for example Q-learning, which indirectly attempt such maximization and rely heavily on the correctness of postulated regression models. We prove that the resulting rules are consistent, and provide finite sample bounds for the errors using the estimated rules. Simulation results suggest the proposed methods produce superior DTRs compared with Q-learning especially in small samples. We illustrate the methods using data from a clinical trial for smoking cessation.
Estimand: The method assumes non-negative outcomes (however, this can simply be achieved by adding a constant). No ICE were addressed in the illustrative example. Missing data were handled by complete-case analysis, with drop-out incorporated into the reward structure. The measured outcome is the difference in the value function between the optimal regime and a reference regime. DTRs are selected by classifying participants as part of an ‘optimal treatment group’, thereby framing estimation as a classification problem.
Estimator Description: The authors propose Backward and Simultaneous Outcome Weighted Learning (BOWL and SOWL). BOWL differs from previous regression-based methods (like Q-learning or G-estimation) in that it reframes the problem of finding the optimal DTR as a sequential or simultaneous weighted classification problem. It seeks to classify patients into the ‘optimal treatment group’ by maximizing a weighted accuracy where the weights are proportional to the observed clinical outcome (rewards). This allows the use of powerful machine learning classifiers (like Support Vector Machines, SVMs) to directly estimate the optimal decision rule without explicitly modeling the complex underlying outcome distribution.
Authors: W Zheng, M Petersen, MJ van der Laan
Year: 2016
In social and health sciences, many research questions involve understanding the causal effect of a longitudinal treatment on mortality (or time-to-event outcomes in general). Often, treatment status may change in response to past covariates that are risk factors for mortality, and in turn, treatment status may also affect such subsequent covariates. In these situations, Marginal Structural Models (MSMs), introduced by Robins (1997), are well-established and widely used tools to account for time-varying confounding. In particular, a MSM can be used to specify the intervention-specific counterfactual hazard function, i.e. the hazard for the outcome of a subject in an ideal experiment where he/she was assigned to follow a given intervention on their treatment variables. The parameters of this hazard MSM are traditionally estimated using Inverse Probability of Treatment Weighted estimation (IPTW; Robins 1999; Robins et al. 2000; van der Laan and Petersen 2007; Robins et al. 2008). This estimator is easy to implement and admits Wald-type confidence intervals. However, its consistency hinges on the correct specification of the treatment allocation probabilities, and the estimates are generally sensitive to large treatment weights (especially in the presence of strong confounding), which are difficult to stabilize for dynamic treatment regimes. In this paper, we present a pooled targeted maximum likelihood estimator (TMLE; van der Laan and Rubin 2006) for a MSM for the hazard function under longitudinal dynamic treatment regimes. The proposed estimator is semiparametric efficient and doubly robust, offering bias reduction over the incumbent IPTW estimator when treatment probabilities may be misspecified. Moreover, the substitution principle rooted in the TMLE potentially mitigates the sensitivity to large treatment weights in IPTW. We compare the performance of the proposed estimator with the IPTW and a non-targeted substitution estimator in a simulation study.
Estimand: This estimator can handle positivity threats (large weights) through the use of TMLE. It does employ a marginal structural model that requires correct model specification; however, the estimator is doubly robust, meaning that only one of the outcome and treatment models needs to be specified correctly. The MSM estimates the hazard if the population had followed a specific treatment regime (contrary to fact). No specific ICE handling is mentioned.
Estimator Description: The authors propose a Targeted Minimum Loss-based Estimator (TMLE) for the parameters of a Marginal Structural Model (MSM) for the hazard function. While Inverse Probability Weighted (IPW) estimators are commonly used for MSMs, they are inefficient and sensitive to extreme weights (lack of positivity). The novelty of this method is the development of a TMLE specifically for the hazard scale in a longitudinal setting. It maps the hazard estimation problem into a series of pooled logistic regressions (for discrete time) and updates the initial estimate using a ‘clever covariate’ to solve the efficient influence curve equation. This yields an estimator that is both locally efficient (achieving the semiparametric efficiency bound) and doubly robust, offering improvements over IPW when weights are large.
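For readers unfamiliar with the targeting step, here is a generic single-update sketch of the TMLE fluctuation via offset logistic regression; the clever covariate used in the paper's pooled hazard regressions is specific to the MSM and is taken here as a supplied array.

```python
import numpy as np
import statsmodels.api as sm

def tmle_target(y, q0, clever):
    """Generic TMLE targeting step for a binary outcome: fluctuate an
    initial estimate q0 of E[Y | A, W] along the clever covariate,
        logit q1 = logit q0 + eps * clever,
    with eps fit by offset logistic regression.  The paper applies this
    kind of update within pooled (over time) logistic regressions; the
    form of the clever covariate there is specific to the hazard MSM."""
    offset = np.log(q0 / (1.0 - q0))            # logit of the initial fit
    fit = sm.GLM(y, clever.reshape(-1, 1),
                 family=sm.families.Binomial(), offset=offset).fit()
    eps = fit.params[0]
    return 1.0 / (1.0 + np.exp(-(offset + eps * clever)))   # updated q1
```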
Authors: W Zheng, Z Luo, MJ van der Laan
Year: 2018
In health and social sciences, research questions often involve systematic assessment of the modification of treatment causal effect by patient characteristics. In longitudinal settings, time-varying or post-intervention effect modifiers are also of interest. In this work, we investigate the robust and efficient estimation of the Counterfactual-History-Adjusted Marginal Structural Model (van der Laan MJ, Petersen M. Statistical learning of origin-specific statically optimal individualized treatment rules. Int J Biostat. 2007;3), which models the conditional intervention-specific mean outcome given a counterfactual modifier history in an ideal experiment. We establish the semiparametric efficiency theory for these models, and present a substitution-based, semiparametric efficient and doubly robust estimator using the targeted maximum likelihood estimation methodology (TMLE, e.g. van der Laan MJ, Rubin DB. Targeted maximum likelihood learning. Int J Biostat. 2006;2, van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data, 1st ed. Springer Series in Statistics. Springer, 2011). To facilitate implementation in applications where the effect modifier is high dimensional, our third contribution is a projected influence function (and the corresponding projected TMLE estimator), which retains most of the robustness of its efficient peer and can be easily implemented in applications where the use of the efficient influence function becomes taxing. We compare the projected TMLE estimator with an Inverse Probability of Treatment Weighted estimator (e.g. Robins JM. Marginal structural models. In: Proceedings of the American Statistical Association. Section on Bayesian Statistical Science, 1-10. 1997a, Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. EPIDEMIOLOGY: 2000;11:561-570), and a non-targeted G-computation estimator (Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods - application to control of the healthy worker survivor effect. Math Modell. 1986;7:1393-1512.). The comparative performance of these estimators is assessed in a simulation study. The use of the projected TMLE estimator is illustrated in a secondary data analysis for the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) trial where effect modifiers are subject to missing at random.
Estimand: Besides identifiability, they assume MAR in order to impute missing values using a hypothetical strategy (i.e., the report on symptoms was intervened on to be non-missing). Apart from loss to follow-up (due to patient dropout or non-response), no handling of ICE was mentioned. The method estimates the conditional mean outcome under a specific treatment assignment, adjusted for a counterfactual history of effect modifiers.
Estimator Description: The paper proposes a Targeted Maximum Likelihood Estimation (TMLE) estimator for the Counterfactual-History-Adjusted Marginal Structural Model (CHA-MSM). They also introduce a Projected TMLE for high-dimensional settings. The novelty is the development of a ‘projected influence function (and the corresponding projected TMLE estimator), which retains most of the robustness of its efficient peer and can be easily implemented in applications where the use of the efficient influence function becomes taxing’ (e.g., high-dimensional effect modifiers).
Authors: Yingchao Zhong, Chang Wang, Lu Wang
Year: 2021
In this paper, we consider personalized treatment decision strategies in the management of chronic diseases, such as chronic kidney disease, which typically consists of sequential and adaptive treatment decision making. We investigate a two-stage treatment setting with a survival outcome that could be right censored. This can be formulated through a dynamic treatment regime (DTR) framework, where the goal is to tailor treatment to each individual based on their own medical history in order to maximize a desirable health outcome. We develop a new method, Survival Augmented Patient Preference incorporated reinforcement Q-Learning (SAPP-Q-Learning) to decide between quality of life and survival restricted at maximal follow-up. Our method incorporates the latent patient preference into a weighted utility function that balances between quality of life and survival time, in a Q-learning model framework. We further propose a corresponding m-out-of-n Bootstrap procedure to accurately make statistical inferences and construct confidence intervals on the effects of tailoring variables, whose values can guide personalized treatment strategies.
Estimand: The authors use IPW to adjust for informative censoring, which relies on the CAR assumption. They use a composite outcome that incorporates patient preference regarding quality of life and death. This is done to reflect the trade-off often faced between adverse side effects (the ICE of toxicity) and mortality. The authors thus employ the composite strategy to handle adverse ICE related to treatment. The measure is the Expected Cumulative Preference-Weighted Utility (Reward). The reward is a function of Quality of Life and Survival, weighted by patient preference.
Estimator Description: The authors propose Survival Augmented Patient Preference incorporated reinforcement Q-Learning (SAPP-Q-Learning) to decide between quality of life and survival restricted at maximal follow-up.
Authors: N Zhou, RD Brook, ID Dinov, L Wang
Year: 2022
The wide-scale adoption of electronic health records (EHRs) provides extensive information to support precision medicine and personalized health care. In addition to structured EHRs, we leverage free-text clinical information extraction (IE) techniques to estimate optimal dynamic treatment regimes (DTRs), a sequence of decision rules that dictate how to individualize treatments to patients based on treatment and covariate history. The proposed IE of patient characteristics closely resembles "The clinical Text Analysis and Knowledge Extraction System" and employs named entity recognition, boundary detection, and negation annotation. It also utilizes regular expressions to extract numerical information. Combining the proposed IE with optimal DTR estimation, we extract derived patient characteristics and use tree-based reinforcement learning (T-RL) to estimate multistage optimal DTRs. IE significantly improved the estimation in counterfactual outcome models compared to using structured EHR data alone, which often include incomplete data, data entry errors, and other potentially unobserved risk factors. Moreover, including IE in optimal DTR estimation provides larger study cohorts and a broader pool of candidate tailoring variables. We demonstrate the performance of our proposed method via simulations and an application using clinical records to guide blood pressure control treatments among critically ill patients with severe acute hypertension. This joint estimation approach improves the accuracy of identifying the optimal treatment sequence by 14-24% compared to traditional inference without using IE, based on our simulations over various scenarios. In the blood pressure control application, we successfully extracted significant blood pressure predictors that are unobserved or partially missing from structured EHR.
Estimand: No assumptions besides identifiability were made. The ICE of treatment non-response was handled by being incorporated in the DTR. The estimator maximizes the Expected Counterfactual Outcome (Mean Potential Outcome under the optimal rule).
Estimator Description: The paper introduces a new method (or rather, a specific novel pipeline/extension) for estimating optimal Dynamic Treatment Regimes (DTRs) by integrating Natural Language Processing (Information Extraction) with machine learning estimation. The novelty is the integration of Information Extraction (IE) from unstructured clinical text (using cTAKES and regular expressions) with Tree-based Reinforcement Learning (T-RL) to reduce unmeasured confounding and correct data errors found in structured data.
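As a toy illustration of the regular-expression half of such an IE pipeline (the note text and pattern are invented; the actual system also performs named entity recognition, boundary detection, and negation annotation):

```python
import re

# Pull numeric blood-pressure readings out of a free-text clinical note.
note = "Pt hypertensive, BP 182/104 this AM, started on labetalol."
m = re.search(r"\bBP\s*(\d{2,3})\s*/\s*(\d{2,3})\b", note)
if m:
    systolic, diastolic = int(m.group(1)), int(m.group(2))
    print(systolic, diastolic)   # -> 182 104
```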
Authors: Shuying Zhu, Weining Shen, Haoda Fu, Annie Qu
Year: 2024
Schizophrenia is a severe mental disorder that distorts patients' perception of reality, and its treatment with antipsychotics can lead to significant side effects. Despite the heterogeneity in patient responses to treatments, most existing studies on individualized treatment regimes only focus on optimizing treatment efficacy, disregarding potential negative effects. To fill this gap, we propose a restricted outcome weighted learning method that optimizes efficacy outcomes while adhering to individual-level negative effect constraints. Our method is developed for multistage treatment decision problems that include single-stage decision as a special case. We propose an efficient learning algorithm that utilizes the difference-of-convex algorithm and the Lagrange multiplier to solve nonconvex optimization with nonconvex risk constraints. We also establish theoretical properties, including Fisher consistency and strong duality results, for the proposed method. We apply our method to a clinical study to design effective schizophrenia treatment [Stroup et al. (Schizophr. Bull. 29 (2003) 15-31)] and find that our approach reduces side-effect risk by at least 22.5% and improves efficacy by at least 26.3% compared to competing methods. In addition, we discover that certain covariates, such as the PANSS score, clinician global impressions severity score, and BMI, have a significant impact on controlling side effects and determining optimal treatment recommendations. These results are valuable in identifying subgroups of patients who need special attention when prescribing more aggressive treatment plans.
Estimand: No additional assumptions are made. The authors address the problem of intercurrent events (specifically adverse events/side effects like weight gain) using a composite-like logic by treating the adverse event as a constraint that must be satisfied at the individual level. The outcome measure is the expected efficacy.
Estimator Description: The authors propose a restricted outcome weighted learning (OWL) method that optimizes efficacy outcomes under individual-level constraints on negative treatment effects (e.g., toxicity).
Authors: W Zhu, D Zeng, R Song
Year: 2019
Dynamic treatment regimes are a set of decision rules and each treatment decision is tailored over time according to patients' responses to previous treatments as well as covariate history. There is a growing interest in development of correct statistical inference for optimal dynamic treatment regimes to handle the challenges of non-regularity problems in the presence of non-respondents who have zero treatment effects, especially when the dimension of the tailoring variables is high. In this paper, we propose a high-dimensional Q-learning (HQ-learning) to facilitate the inference of optimal values and parameters. The proposed method allows us to simultaneously estimate the optimal dynamic treatment regimes and select the important variables that truly contribute to the individual reward. At the same time, hard thresholding is introduced in the method to eliminate the effects of the non-respondents. The asymptotic properties for the parameter estimators as well as the estimated optimal value function are then established by adjusting the bias due to thresholding. Both simulation studies and real data analysis demonstrate satisfactory performance for obtaining the proper inference for the value function for the optimal dynamic treatment regimes.
Estimand: The method assumes sparsity, i.e., that among the high-dimensional covariates, only a small subset truly contributes to the reward (Variable Selection). They offer an example in which treatment non-response was incorporated into the DTR. The method estimates the Q-function parameters, specifically the interaction terms which dictate the optimal treatment decision (the difference in expected reward between treatments).
Estimator Description: The paper proposes High-Dimensional Q-learning (HQ-learning). It integrates variable selection with the Q-learning framework. The novelty lies in addressing the dual challenge of high-dimensional covariates and non-regular inference (where asymptotic distributions are complex due to non-unique optimal treatments for non-responders). They use a ‘folded-concave penalty... to enforce sparsity’ and ‘hard thresholding... to eliminate the effects of the non-respondents,’ while introducing a bias-correction step to ensure the final inference (confidence intervals) is valid.
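A minimal sketch of the hard-thresholding idea, with the folded-concave penalty and the bias-correction step omitted; the function name is illustrative.

```python
import numpy as np

def thresholded_contrast(X, beta, lam):
    """Hard-thresholding sketch: the estimated treatment contrast x_i'beta
    is set exactly to zero when it falls below lam in magnitude, treating
    such subjects as non-respondents with zero treatment effect.  The
    paper pairs this with a folded-concave penalty for variable selection
    and a bias correction for valid inference, both omitted here."""
    contrast = X @ beta
    return np.where(np.abs(contrast) > lam, contrast, 0.0)

# The estimated optimal action for subject i is then to treat when the
# thresholded contrast is positive; a contrast of exactly zero means
# either action is considered equally good.
```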
Authors: Theresa Blumlein, Joel Persson, Stefan Feuerriegel
Year: 2022
Dynamic treatment regimes (DTRs) are used in medicine to tailor sequential treatment decisions to patients by considering patient heterogeneity. Common methods for learning optimal DTRs, however, have shortcomings: they are typically based on outcome prediction and not treatment effect estimation, or they use linear models that are restrictive for patient data from modern electronic health records. To address these shortcomings, we develop two novel methods for learning optimal DTRs that effectively handle complex patient data. We call our methods DTR causal trees (DTR-CT) and DTR causal forest (DTR-CF). Our methods are based on a data-driven estimation of heterogeneous treatment effects using causal tree methods, specifically causal trees and causal forests, that learn non-linear relationships, control for time-varying confounding, are doubly robust, and explainable. To the best of our knowledge, our paper is the first that adapts causal tree methods for learning optimal DTRs. We evaluate our proposed methods using synthetic data and then apply them to real-world data from intensive care units. Our methods outperform state-of-the-art baselines in terms of cumulative regret and percentage of optimal decisions by a considerable margin. Our work improves treatment recommendations from electronic health record and is thus of direct relevance for personalized medicine.
Estimand: No additional assumptions are made. ICE handling is not specified. The effect measure is the advantage of the optimal action over the alternative at each step, recursively.
Estimator Description: This is the first paper to adapt causal tree methods for the estimation of heterogeneous treatment effects to the DTR setting. The authors propose two methods: DTR causal trees (DTR-CT) and DTR causal forest (DTR-CF).
Software: https://github.com/tbluemlein/DTR
Authors: Rui Li, Stephanie Hu, Mingyu Lu, Yuria Utsumi, Prithwish Chakraborty, Daby M. Sow, Piyush Madan, Jun Li, Mohamed Ghalwash, Zach Shahn, Li-wei Lehman
Year: 2021
Counterfactual prediction is a fundamental task in decision-making. This paper introduces G-Net, a sequential deep learning framework for counterfactual prediction under dynamic time-varying treatment strategies in complex longitudinal settings. G-Net is based upon g-computation, a causal inference method for estimating effects of general dynamic treatment strategies. Past g-computation implementations have mostly been built using classical regression models. G-Net instead adopts a recurrent neural network framework to capture complex temporal and nonlinear dependencies in the data. To our knowledge, G-Net is the first g-computation based deep sequential modeling framework that provides estimates of treatment effects under dynamic and time-varying treatment strategies. We evaluate G-Net using simulated longitudinal data from two sources: CVSim, a mechanistic model of the cardiovascular system, and a pharmacokinetic simulation of tumor growth. G-Net outperforms both classical and state-of-the-art counterfactual prediction models in these settings.
Estimand: No additional assumptions are made. ICE handling is not specified. The effect measure of interest is the risk difference (for binary outcomes) and mean outcome difference (for continuous outcomes).
Estimator Description: The paper introduces G-Net, a sequential deep learning framework based on g-computation for counterfactual prediction under dynamic, time-varying treatment strategies. This is the first use of a deep sequential modeling framework for g-computation of effects under dynamic and time-varying treatment regimes.
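The Monte Carlo core of g-computation that G-Net implements with recurrent networks can be sketched as follows; the simulator callables are hypothetical stand-ins for fitted models, not G-Net's actual API.

```python
def g_comp_simulate(l0, draw_next_l, predict_y, regime, K, rng):
    """Monte Carlo core of g-computation, the quantity G-Net estimates with
    recurrent networks: roll the covariate process forward under the regime
    using estimated conditional distributions, then read off the simulated
    outcome.  draw_next_l(l_hist, a_hist, rng) and predict_y(l_hist, a_hist)
    stand in for fitted simulators; they are hypothetical callables."""
    l_hist, a_hist = [l0], []
    for _ in range(K):
        a_hist.append(regime(l_hist, a_hist))            # action set by the rule
        l_hist.append(draw_next_l(l_hist, a_hist, rng))  # draw next covariates
    return predict_y(l_hist, a_hist)

# Averaging this over many simulated trajectories (and over subjects'
# baseline histories) gives the counterfactual mean outcome under the regime.
```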
Authors: Sihyung Park, Wenbin Lu, Shu Yang
Year: 2025
Truncation by death, a prevalent challenge in critical care, renders traditional dynamic treatment regime (DTR) evaluation inapplicable due to ill-defined potential outcomes. We introduce a principal stratification-based method, focusing on the always-survivor value function. We derive a semiparametrically efficient, multiply robust estimator for multi-stage DTRs, demonstrating its robustness and efficiency. Empirical validation and an application to electronic health records showcase its utility for personalized treatment optimization.
Estimand: The method assumes monotonicity, i.e., that for any given patient, their survival status under treatment would be no worse than their survival status had they not received the treatment. Further, it assumes principal ignorability, which allows the stratum-specific outcomes to be identified from the observed data (given covariate and treatment history). The authors define a specific estimand called the always-survivor value function.
Estimator Description: The paper introduces a principal stratification-based method for estimating and learning optimal DTRs, specifically tackling the challenge of truncation by death. It derives a semiparametrically efficient, multiply robust estimator. The method allows the use of flexible machine learning methods to accurately model the nuisance components. The novelty lies in extending the principal stratification framework (specifically, identifying the always-survivor stratum) to multi-stage dynamic treatment regimes, whereas previous methods were limited to single decision points or survival-only outcomes.
Authors: Aniruddh Raghu, Matthieu Komorowski, Leo Anthony Celi, Peter Szolovits, Marzyeh Ghassemi
Year: 2017
Sepsis is a leading cause of mortality in intensive care units (ICUs) and costs hospitals billions annually. Treating a septic patient is highly challenging, because individual patients respond very differently to medical interventions and there is no universally agreed-upon treatment for sepsis. Understanding more about a patient's physiological state at a given time could hold the key to effective treatment policies. In this work, we propose a new approach to deduce optimal treatment policies for septic patients by using continuous state-space models and deep reinforcement learning. Learning treatment policies over continuous spaces is important, because we retain more of the patient's physiological information. Our model is able to learn clinically interpretable treatment policies, similar in important aspects to the treatment policies of physicians. Evaluating our algorithm on past ICU patient data, we find that our model could reduce patient mortality in the hospital by up to 3.6% over observed clinical policies, from a baseline mortality of 13.7%. The learned treatment policies could be used to aid intensive care clinicians in medical decision making and improve the likelihood of patient survival.
Estimand: The method assumes a Markovian process. ICE handling is unspecified. The outcome measure is the mortality rate defined through an expected discounted future reward.
Estimator Description: The paper introduces a continuous state-space model using deep reinforcement learning to deduce and evaluate optimal dynamic treatment policies for sepsis patients.
Authors: Yebin Tao, Lu Wang, Daniel Almirall
Year: 2018
Dynamic treatment regimes (DTRs) are sequences of treatment decision rules, in which treatment may be adapted over time in response to the changing course of an individual. Motivated by the substance use disorder (SUD) study, we propose a tree-based reinforcement learning (T-RL) method to directly estimate optimal DTRs in a multi-stage multi-treatment setting. At each stage, T-RL builds an unsupervised decision tree that directly handles the problem of optimization with multiple treatment comparisons, through a purity measure constructed with augmented inverse probability weighted estimators. For the multiple stages, the algorithm is implemented recursively using backward induction. By combining semiparametric regression with flexible tree-based learning, T-RL is robust, efficient and easy to interpret for the identification of optimal DTRs, as shown in the simulation studies. With the proposed method, we identify dynamic SUD treatment regimes for adolescents.
Estimand: The method assumes a Markov decision process. ICE handling is not addressed. The effect measure is the value (mean outcome) of the regime.
Estimator Description: The authors propose a tree-based reinforcement learning (T-RL) method for estimation of optimal DTR. T-RL builds a decision tree at each stage that maximizes a ‘purity’ measure. This purity measure is constructed using augmented IPW estimators, allowing the tree to optimize the counterfactual mean outcome directly while handling multiple treatment options.
Software: https://doi.org/10.1214/18-AOAS1137SUPPB
Authors: Michael P. Wallace, Erica E. M. Moodie, David A. Stephens
Year: 2018
Personalized medicine optimizes patient outcome by tailoring treatments to patient-level characteristics. This approach is formalized by dynamic treatment regimes (DTRs): decision rules that take patient information as input and output recommended treatment decisions. The DTR literature has seen the development of increasingly sophisticated causal inference techniques that attempt to address the limitations of our typically observational datasets. Often overlooked, however, is that in practice most patients may be expected to receive optimal or near-optimal treatment, and so the outcome used as part of a typical DTR analysis may provide limited information. In light of this, we propose considering a more standard analysis: ignore the outcome and elicit an optimal DTR by modeling the observed treatment as a function of relevant covariates. This offers a far simpler analysis and, in some settings, improved optimal treatment identification. To distinguish this approach from more traditional DTR analyses, we term it reward ignorant modeling, and also introduce the concept of multimethod analysis, whereby different analysis methods are used in settings with multiple treatment decisions. We demonstrate this concept through a variety of simulation studies, and through analysis of data from the International Warfarin Pharmacogenetics Consortium, which also serve as motivation for this work.
Estimand: The method assumes that most treatments in a given dataset are already near-optimal (expert-optimality). However, it does address cases in which some stages of the observed regimes include poor treatment choices. It does so by ‘using different methods at different stages depending on our beliefs about the optimality (or otherwise) of treatment’. ICE handling is not specified. The effect/comparison measure is the classification accuracy.
Estimator Description: The authors highlight that standard estimators of optimal DTRs implicitly assume that some patients have been treated poorly: after all, an optimal or nearly optimal regime can only be identified in contrast with less optimal or poor regimes. However, if we consider the possibility that most patients in a given dataset are treated optimally, focusing on the outcome of a given regime might not be very informative. The authors therefore suggest reward-ignorant modeling of DTRs, which allocates patients based on their probability of being assigned to a given treatment regime given their covariate history, rather than based on the outcome of their treatment. This shift in focus comes with some information loss (as the outcome is neglected) but greatly simplifies the analysis (see the sketch after the software link below).
Software: https://doi.org/10.1002/bimj.201700322
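A minimal sketch of the reward-ignorant idea for a single decision: fit a classifier of observed treatment on covariates and read its prediction as the recommendation. The random forest is an illustrative choice, not the authors' specification.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def reward_ignorant_rule(X, A):
    """Reward-ignorant sketch: discard the outcome and model the observed
    treatment from covariates; under the premise that observed practice is
    (near-)optimal, the most likely treatment given X becomes the
    recommendation."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, A)
    return clf   # clf.predict(x_new) imitates (near-optimal) observed practice
```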
Authors: Hao Ying, Feng Lin, Rodger D. MacArthur, Jonathan A. Cohn, Daniel C. Barth-Jones, Hong Ye, Lawrence R. Crane
Year: 2007
The U.S. Department of Health and Human Services human immunodeficiency virus (HIV)/acquired immune deficiency syndrome (AIDS) treatment guidelines are modified several times per year to reflect the rapid evolution of the field (e.g., emergence of new antiretroviral drugs). As such, a treatment-decision support system that is capable of self-learning is highly desirable. Based on the fuzzy discrete event system (FDES) theory that we recently created, we have developed a self-learning HIV/AIDS regimen selection system for the initial round of combination antiretroviral therapy, one of the most complex therapies in medicine. The system consisted of a treatment objectives classifier, fuzzy finite state machine models for treatment regimens, and a genetic-algorithm-based optimizer. Supervised learning was achieved through automatically adjusting the parameters of the models by the optimizer. We focused on the four historically popular regimens with 32 associated treatment objectives involving the four most important clinical variables (potency, adherence, adverse effects, and future drug options). The learning targets for the objectives were produced by two expert AIDS physicians on the project, and their averaged overall agreement rate was 70.6%. The system's learning ability and new regimen suitability prediction capability were tested under various conditions of clinical importance. The prediction accuracy was found to be between 84.4% and 100%. Finally, we retrospectively evaluated the system using 23 patients treated by 11 experienced nonexpert faculty physicians and 12 patients treated by the two experts at our AIDS Clinical Center in 2001. The overall exact agreement between the 13 physicians' selections and the system's choices was 82.9%, with the agreement for the two experts being both 100%. For the seven mismatched cases, the system actually chose more appropriate regimens in four cases and equivalent regimens in another two cases. It made a mistake in one case. These (preliminary) results show that 1) the system outperformed the nonexpert physicians and 2) it performed as well as the expert physicians did. This learning and prediction approach, as well as our original FDES theory, is general purpose and can be applied to other medical or nonmedical problems.
Estimand: This method optimizes for expert agreement; it therefore assumes that expert consensus provides the optimal DTR. The ICE of toxicity and non-adherence are addressed by making them part of the learning objective (i.e., a composite strategy). This is done by using a classifier to assign a patient to a composite learning objective based on their characteristics. The outcome optimized is the match with the expert panel's ranking of regimens.
Estimator Description: This method models the clinical decision process as a fuzzy discrete event system, where the ‘state’ is the patient's condition (fuzzy membership in categories like ‘high viral load’) and the ‘event’ is the treatment selection. It uses a genetic algorithm (an evolutionary ML method) to learn the optimal fuzzy membership functions and weights that best match expert consensus (supervised learning). This differs from standard statistical DTR methods (Q-learning, etc.) by using fuzzy logic to handle medical uncertainty and qualitative rules of thumb explicitly.
Authors: Yao Zhang, Mihaela van der Schaar
Year: 2020
Estimand: No additional assumptions are made. ICE handling is not specified. The effect measure is the mean outcome.
Estimator Description: The paper introduces gradient regularized V-learning (GRV), a novel method for estimating the value function of a DTR using neural networks. The novelty is the regularization of the outcome and propensity score models (which can be neural networks) using the gradient of the efficient influence function. This forces the nuisance parameters to satisfy optimality conditions that minimize the mean squared error of the value function estimator, making it robust and efficient in finite samples.
Authors: Jie Zhu, Blanca Gallego
Year: 2021
Causal inference in longitudinal observational health data often requires the accurate estimation of treatment effects on time-to-event outcomes in the presence of time-varying covariates. To tackle this sequential treatment effect estimation problem, we have developed a causal dynamic survival (CDS) model that uses the potential outcomes framework with recurrent sub-networks and random seed ensembles to estimate the difference in survival curves along with its confidence interval. Using simulated survival datasets, the CDS model has shown good causal effect estimation performance across scenarios of sample dimension, event rate, confounding and overlapping. However, increasing the sample size is not effective to alleviate the adverse impact from a high level of confounding. In two large clinical cohort studies, our model identified the expected conditional average treatment effect and detected individual effect heterogeneity over time and patient subgroups. CDS provides individualised absolute treatment effect estimations to improve clinical decisions.
Estimand: No additional assumptions are made. ICE handling is not specified. The effect measure of interest is the absolute risk difference under a binary treatment variable.
Estimator Description: The paper introduces the Causal Dynamic Survival (CDS) model, which uses an ensemble of recurrent neural networks (RNNs) to estimate counterfactual hazard functions in the presence of time-varying covariates and time-varying treatments.
Software: https://github.com/EliotZhu/CDS