- JCausal Inference1
Source Condition Double Robust Inference on Functionals of Inverse Problems, with A. Bennett, X. Mao, W. Newey, V. Syrgkanis, and M. Uehara.
arXiv July 2023.
- Abstract: We consider estimation of parameters defined as linear functionals of solutions to linear inverse problems. Any such parameter admits a doubly robust representation that depends on the solution to a dual linear inverse problem, where the dual solution can be thought of as a generalization of the inverse propensity function. We provide the first source condition double robust inference method that ensures asymptotic normality around the parameter of interest as long as either the primal or the dual inverse problem is sufficiently well-posed, without knowledge of which inverse problem is the more well-posed one. Our result is enabled by novel guarantees for iterated Tikhonov regularized adversarial estimators for linear inverse problems, over general hypothesis spaces, which are developments of independent interest.
- CCausal Inference1
Long-Term Causal Inference with Imperfect Surrogates using Many Weak Experiments, Proxies, and Cross-Fold Moments, with A. Bibaut, S. Ejdemyr, and M. Zhao.
arXiv November 2023.
- Abstract: Inferring causal effects on long-term outcomes using short-term surrogates is crucial to rapid innovation. However, even when treatments are randomized and surrogates fully mediate their effect on outcomes, it is possible that we get the direction of causal effects wrong due to confounding between surrogates and outcomes -- a situation famously known as the surrogate paradox. The availability of many historical experiments offers the opportunity to instrument for the surrogate and bypass this confounding. However, even as the number of experiments grows, two-stage least squares has non-vanishing bias if each experiment has a bounded size, and this bias is exacerbated when most experiments barely move metrics, as occurs in practice. We show how to eliminate this bias using cross-fold procedures, JIVE being one example, and construct valid confidence intervals for the long-term effect in new experiments where the long-term outcome has not yet been observed. Our methodology further allows us to proxy for effects not perfectly mediated by the surrogates, allowing us to handle both confounding and effect leakage as violations of standard statistical surrogacy conditions.
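- Illustrative sketch (not from the paper; all modelling choices below are assumptions): one concrete cross-fold procedure is the classic JIVE1 leave-one-out first stage, which removes the own-observation term that biases two-stage least squares when every experiment is small and weak. The toy data generator is purely for demonstration.
```python
import numpy as np

def jive1(Z, S, Y):
    """JIVE1 sketch: leave-one-out first-stage predictions of the surrogate S from
    instruments Z (e.g., experiment indicators), then an IV ratio for the effect of S on Y."""
    Z, S, Y = (np.asarray(a, dtype=float) for a in (Z, S, Y))
    G = np.linalg.pinv(Z.T @ Z)
    fitted = Z @ (G @ (Z.T @ S))
    h = np.einsum("ij,jk,ik->i", Z, G, Z)      # leverages h_ii
    S_loo = (fitted - h * S) / (1.0 - h)       # leave-one-out first-stage predictions
    return float((S_loo @ Y) / (S_loo @ S))

# Toy check: many tiny experiments; surrogate and outcome share a confounder U.
rng = np.random.default_rng(0)
n_exp, per_exp = 500, 4
Z = np.kron(np.eye(n_exp), np.ones((per_exp, 1)))            # experiment indicators
shock = np.repeat(rng.normal(scale=0.5, size=n_exp), per_exp)
U = rng.normal(size=n_exp * per_exp)
S = shock + U + rng.normal(size=n_exp * per_exp)
Y = 2.0 * S - 3.0 * U + rng.normal(size=n_exp * per_exp)
print(jive1(Z, S, Y), (S @ Y) / (S @ S))   # JIVE1 lands roughly at the true 2.0; naive OLS is pulled down by U
```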
- CSequential Decision-Making1
Low-Rank MDPs with Continuous Action Spaces, with A. Bennett and M. Oprescu.
arXiv November 2023.
- Abstract: Low-Rank Markov Decision Processes (MDPs) have recently emerged as a promising framework within the domain of reinforcement learning (RL), as they allow for provably approximately correct (PAC) learning guarantees while also incorporating ML algorithms for representation learning. However, current methods for low-rank MDPs are limited in that they only consider finite action spaces, and give vacuous bounds as |A|→∞, which greatly limits their applicability. In this work, we study the problem of extending such methods to settings with continuous actions, and explore multiple concrete approaches for performing this extension. As a case study, we consider the seminal FLAMBE algorithm (Agarwal et al., 2020), which is a reward-agnostic method for PAC RL with low-rank MDPs. We show that, without any modifications to the algorithm, we obtain a similar PAC bound when actions are allowed to be continuous. Specifically, when the model for transition functions satisfies a Hölder smoothness condition w.r.t. actions, and either the policy class has a uniformly bounded minimum density or the reward function is also Hölder smooth, we obtain a polynomial PAC bound that depends on the order of smoothness.
- CCausal Inference1
Inferring the Long-Term Causal Effects of Long-Term Treatments from Short-Term Experiments, with A. Tran and A. Bibaut.
arXiv November 2023.
- Abstract: We study inference on the long-term causal effect of a continual exposure to a novel intervention, which we term a long-term treatment, based on an experiment involving only short-term observations. Key examples include the long-term health effects of regularly-taken medicine or of environmental hazards and the long-term effects on users of changes to an online platform. This stands in contrast to short-term treatments or "shocks," whose long-term effect can reasonably be mediated by short-term observations, enabling the use of surrogate methods. Long-term treatments by definition have direct effects on long-term outcomes via continual exposure so surrogacy cannot reasonably hold. Our approach instead learns long-term temporal dynamics directly from short-term experimental data, assuming that the initial dynamics observed persist but avoiding the need for both surrogacy assumptions and auxiliary data with long-term observations. We connect the problem with offline reinforcement learning, leveraging doubly-robust estimators to estimate long-term causal effects for long-term treatments and construct confidence intervals. Finally, we demonstrate the method in simulated experiments.
- CCausal Inference1
Evaluating the Surrogate Index as a Decision-Making Tool Using 200 A/B Tests at Netflix, with V. Zhang, M. Zhao, and A. Le.
arXiv November 2023.
- Abstract: Surrogate index approaches have recently become a popular method of estimating longer-term impact from shorter-term outcomes. In this paper, we leverage 1098 test arms from 200 A/B tests at Netflix to empirically investigate to what degree decisions made using a surrogate index utilizing 14 days of data would align with those made using direct measurement of day-63 treatment effects. Focusing specifically on linear "auto-surrogate" models that utilize the shorter-term observations of the long-term outcome of interest, we find that the statistical inferences that we would draw from using the surrogate index are ~95% consistent with those from directly measuring the long-term treatment effect. Moreover, when we restrict ourselves to the set of tests that would be "launched" (i.e., positive and statistically significant) based on the 63-day directly measured treatment effects, we find that relying instead on the surrogate index achieves 79% and 65% recall.
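- Illustrative sketch (assumed setup: historical rows carry the first 14 daily observations of the metric plus its day-63 value; a new experiment has only the 14-day columns and a treatment flag): fit a linear auto-surrogate index on historical data, then measure the new experiment's treatment effect on the predicted index.
```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_auto_surrogate(hist_short, hist_long):
    """hist_short: (n, 14) short-term observations; hist_long: (n,) day-63 outcome."""
    return LinearRegression().fit(hist_short, hist_long)

def surrogate_index_effect(model, new_short, new_treat):
    """Difference in mean predicted index between treatment and control arms."""
    index = model.predict(new_short)
    return index[new_treat == 1].mean() - index[new_treat == 0].mean()

# Synthetic usage.
rng = np.random.default_rng(1)
hist_short = rng.normal(size=(5000, 14))
hist_long = hist_short @ rng.uniform(0.5, 1.5, size=14) + rng.normal(size=5000)
model = fit_auto_surrogate(hist_short, hist_long)
new_treat = rng.integers(0, 2, size=2000)
new_short = rng.normal(size=(2000, 14)) + 0.1 * new_treat[:, None]
print(surrogate_index_effect(model, new_short, new_treat))
```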
- COptimization and Personalization|Applied ML0
Large Language Models as Zero-Shot Conversational Recommenders, with Z. He, Z. Xie, R. Jha, H. Steck, D. Liang, Y. Feng, B. Majumder, and J. McAuley.
- COptimization and Personalization|Sequential Decision-Making0
The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning, with K. Wang, K. Zhou, R. Wu, and W. Sun.
- COptimization and Personalization|Sequential Decision-Making0
Future-Dependent Value-Based Off-Policy Evaluation in POMDPs, with M. Uehara, H. Kiyohara, A. Bennett, V. Chernozhukov, N. Jiang, C. Shi, and W. Sun.
- COptimization and Personalization|Sequential Decision-Making0
Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage, with M. Uehara, J. D. Lee, and W. Sun.
- COptimization and Personalization|Sequential Decision-Making1
Provable Offline Reinforcement Learning with Human Feedback, with W. Zhan, M. Uehara, J. D. Lee, and W. Sun.
arXiv May 2023.
- Abstract: In this paper, we investigate the problem of offline reinforcement learning with human feedback where feedback is available in the form of preferences between trajectory pairs rather than explicit rewards. Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE. We consider the general reward setting where the reward can be defined over the whole trajectory and provide a novel guarantee that allows us to learn any target policy with a polynomial number of samples, as long as the target policy is covered by the offline data. This guarantee is the first of its kind with general function approximation. To measure the coverage of the target policy, we introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability coefficient. We also establish lower bounds that highlight the necessity of such concentrability and the difference from standard RL, where state-action-wise rewards are directly observed. We further extend and analyze our algorithm when the feedback is given over action pairs.
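- Illustrative sketch of step (1) only, under an assumed Bradley-Terry-style preference model with linear trajectory-feature rewards (both are illustration choices, not necessarily the paper's setting); step (2), the distributionally robust planning over a confidence set, is not shown.
```python
import numpy as np
from scipy.optimize import minimize

def fit_preference_reward(phi_a, phi_b, prefs):
    """phi_a, phi_b: (n, d) feature sums of the two trajectories in each pair;
    prefs: (n,) 1 if trajectory a was preferred, else 0. Returns the MLE of theta."""
    def neg_log_lik(theta):
        margin = (phi_a - phi_b) @ theta                   # r(tau_a) - r(tau_b)
        return np.mean(np.logaddexp(0.0, -np.where(prefs == 1, margin, -margin)))
    return minimize(neg_log_lik, np.zeros(phi_a.shape[1]), method="L-BFGS-B").x

# Synthetic usage: recover a known reward parameter from noisy pairwise preferences.
rng = np.random.default_rng(2)
theta_true = np.array([1.0, -2.0, 0.5])
phi_a, phi_b = rng.normal(size=(2, 1000, 3))
prefs = rng.binomial(1, 1.0 / (1.0 + np.exp(-(phi_a - phi_b) @ theta_true)))
print(fit_preference_reward(phi_a, phi_b, prefs))          # roughly theta_true
```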
- COptimization and Personalization|Causal Inference1
Off-Policy Evaluation for Large Action Spaces via Policy Convolution, with N. Sachdeva, L. Wang, D. Liang, and J. McAuley.
arXiv October 2023.
- Abstract: Developing accurate off-policy estimators is crucial for both evaluating and optimizing for new policies. The main challenge in off-policy estimation is the distribution shift between the logging policy that generates data and the target policy that we aim to evaluate. Typically, techniques for correcting distribution shift involve some form of importance sampling. This approach results in unbiased value estimation but often comes with the trade-off of high variance, even in the simpler case of one-step contextual bandits. Furthermore, importance sampling relies on the common support assumption, which becomes impractical when the action space is large. To address these challenges, we introduce the Policy Convolution (PC) family of estimators. These methods leverage latent structure within actions -- made available through action embeddings -- to strategically convolve the logging and target policies. This convolution introduces a unique bias-variance trade-off, which can be controlled by adjusting the amount of convolution. Our experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC, especially when either the action space or policy mismatch becomes large, with gains of up to 5 - 6 orders of magnitude over existing estimators.
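- Schematic sketch (the exact convolution used in the paper may differ; the Gaussian kernel, bandwidth, and array shapes below are assumptions): smooth both policies over action embeddings and form importance weights from the smoothed probabilities, in the one-step contextual bandit case.
```python
import numpy as np

def convolve_policy(probs, emb, bandwidth):
    """probs: (n, A) per-context action probabilities; emb: (A, d) action embeddings.
    Spreads each action's probability mass over nearby actions in embedding space."""
    d2 = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(-1)   # (A, A) squared distances
    K = np.exp(-d2 / (2.0 * bandwidth ** 2))
    K /= K.sum(axis=0, keepdims=True)     # K[a_new, a_old]: columns sum to one
    return probs @ K.T                    # smoothed probabilities; rows still sum to one

def pc_value_estimate(rewards, actions, logging_probs, target_probs, emb, bandwidth):
    """Importance-weighted value estimate using the convolved (smoothed) policies."""
    pi_t = convolve_policy(target_probs, emb, bandwidth)
    pi_b = convolve_policy(logging_probs, emb, bandwidth)
    idx = np.arange(len(actions))
    w = pi_t[idx, actions] / pi_b[idx, actions]
    return float(np.mean(w * rewards))
```
A larger bandwidth gives more smoothing (lower variance, more bias), mirroring the bias-variance dial described above.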
- CApplied ML|Sequential Decision-Making1
JoinGym: An Efficient Query Optimization Environment for Reinforcement Learning, with K. Wang, J. Wang, Y. Li, I. Trummer, and W. Sun.
arXiv July 2023.
- Abstract: In this paper, we present JoinGym, an efficient and lightweight query optimization environment for reinforcement learning (RL). Join order selection (JOS) is a classic NP-hard combinatorial optimization problem from database query optimization and can serve as a practical testbed for the generalization capabilities of RL algorithms. We describe how to formulate each of the left-deep and bushy variants of the JOS problem as a Markov Decision Process (MDP), and we provide an implementation adhering to the standard Gymnasium API. We highlight that our implementation of JoinGym is completely based on offline traces of all possible joins, which enables RL practitioners to easily and quickly test their methods on a realistic data management problem without needing to set up any systems. Moreover, we also provide all possible join traces on 3300 novel SQL queries generated from the IMDB dataset. Upon benchmarking popular RL algorithms, we find that at least one method can obtain near-optimal performance on train-set queries, but performance degrades by several orders of magnitude on test-set queries. This gap motivates further research for RL algorithms that generalize well in multi-task combinatorial optimization problems.
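- Since the environment follows the standard Gymnasium API, interaction looks like the generic loop below; the environment id, constructor arguments, and reward sign convention are placeholders (consult the JoinGym release for the actual names).
```python
import gymnasium as gym

env = gym.make("JoinGym-LeftDeep-v0")            # hypothetical environment id
obs, info = env.reset(seed=0)
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()           # replace with an RL policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
print(total_reward)
```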
- COptimization and Personalization|Sequential Decision-Making0
Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR, with K. Wang and W. Sun.
- COptimization and Personalization|Sequential Decision-Making0
Smooth Non-Stationary Bandits, with S. Jia, Q. Xie, and P. Frazier.
- COptimization and Personalization|Causal Inference0
B-Learner: Quasi-Oracle Bounds on Heterogeneous Causal Effects Under Hidden Confounding, with M. Oprescu, J. Dorn, M. Ghoummaid, A. Jesson, and U. Shalit.
- COptimization and Personalization|Sequential Decision-Making0
Computationally Efficient PAC RL in POMDPs with Latent Determinism and Conditional Embeddings, with M. Uehara, A. Sekhari, J. D. Lee, and W. Sun.
- C|JCausal Inference0
Inference on Strongly Identified Functionals of Weakly Identified Functions, with A. Bennett, X. Mao, W. Newey, V. Syrgkanis, and M. Uehara.
- CCausal Inference0
Minimax Instrumental Variable Regression and L2 Convergence Guarantees without Identification or Closedness, with A. Bennett, X. Mao, W. Newey, V. Syrgkanis, and M. Uehara.
- JCausal Inference|Sequential Decision-Making1
Near-Optimal Non-Parametric Sequential Tests and Confidence Sequences with Possibly Dependent Observations, with A. Bibaut and M. Lindon.
arXiv December 2022.
- Abstract: Sequential testing, always-valid p-values, and confidence sequences promise flexible statistical inference and on-the-fly decision making. However, unlike fixed-n inference based on asymptotic normality, existing sequential tests either make parametric assumptions and end up under-covering/over-rejecting when these fail or use non-parametric but conservative concentration inequalities and end up over-covering/under-rejecting. To circumvent these issues, we sidestep exact at-least-α coverage and focus on asymptotically exact coverage and asymptotic optimality. That is, we seek sequential tests whose probability of ever rejecting a true hypothesis asymptotically approaches α and whose expected time to reject a false hypothesis approaches a lower bound on all tests with asymptotic coverage at least α, both under an appropriate asymptotic regime. We permit observations to be both non-parametric and dependent and focus on testing whether the observations form a martingale difference sequence. We propose the universal sequential probability ratio test (uSPRT), a slight modification to the normal-mixture sequential probability ratio test, where we add a burn-in period and adjust thresholds accordingly. We show that even in this very general setting, the uSPRT is asymptotically optimal under mild generic conditions. We apply the results to stabilized estimating equations to test means, treatment effects, etc. Our results also provide corresponding guarantees for the implied confidence sequences. Numerical simulations verify our guarantees and the benefits of the uSPRT over alternatives.
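- A minimal sketch in the spirit of the uSPRT described above: a standard Gaussian-mixture sequential statistic monitored only after a burn-in, rejecting when it exceeds 1/α. The variance proxy, prior variance, and threshold handling are simplifying assumptions; the paper's exact burn-in adjustment differs.
```python
import numpy as np

def mixture_sprt(xs, alpha=0.05, burn_in=50, prior_var=1.0):
    """Sequentially test H0: the observations form a mean-zero (martingale difference)
    sequence. Returns the first rejection time after the burn-in, or None."""
    s = v = 0.0
    for t, x in enumerate(xs, start=1):
        s += x                         # running sum
        v += x ** 2                    # running intrinsic time (plug-in variance)
        if t <= burn_in or v == 0.0:
            continue
        m = np.exp(prior_var * s ** 2 / (2.0 * (1.0 + prior_var * v))) / np.sqrt(1.0 + prior_var * v)
        if m >= 1.0 / alpha:
            return t
    return None
```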
- COptimization and Personalization|Sequential Decision-Making0
Provable Safe Reinforcement Learning with Binary Feedback, with A. Bennett and D. Misra.
- COptimization and Personalization|Causal Inference0
Robust and Agnostic Learning of Conditional Distributional Treatment Effects, with M. Oprescu.
- COptimization and Personalization|Causal Inference|Fairness0
What's the Harm? Sharp Bounds on the Fraction Negatively Affected by Treatment.
- COptimization and Personalization|Sequential Decision-Making0
Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems, with M. Uehara, A. Sekhari, J. D. Lee, and W. Sun.
- CCausal Inference0
The Implicit Delta Method, with J. McInerney.
- JOptimization and Personalization|Sequential Decision-Making1
A Review of Off-Policy Evaluation in Reinforcement Learning, with M. Uehara and C. Shi.
- COptimization and Personalization|Sequential Decision-Making0
Learning Bellman Complete Representations for Offline Policy Evaluation, with J. Chang, K. Wang, and W. Sun.
- JCausal Inference|Sequential Decision-Making1
Long-term Causal Inference Under Persistent Confounding via Data Combination, with G. Imbens, X. Mao, and Y. Wang.
arXiv February 2022.
- Abstract: We study the identification and estimation of long-term treatment effects when both experimental and observational data are available. Since the long-term outcome is observed only after a long delay, it is not measured in the experimental data, but only recorded in the observational data. However, both types of data include observations of some short-term outcomes. In this paper, we uniquely tackle the challenge of persistent unmeasured confounders, i.e., some unmeasured confounders that can simultaneously affect the treatment, short-term outcomes and the long-term outcome, noting that they invalidate identification strategies in previous literature. To address this challenge, we exploit the sequential structure of multiple short-term outcomes, and develop three novel identification strategies for the average long-term treatment effect. We further propose three corresponding estimators and prove their asymptotic consistency and asymptotic normality. We finally apply our methods to estimate the effect of a job training program on long-term employment using semi-synthetic data. We numerically show that our proposals outperform existing methods that fail to handle persistent confounders.
- C|JOptimization and Personalization|Causal Inference|Fairness0
Treatment Effect Risk: Bounds and Inference.
Management Science (Fast Track), 2023.
Proceedings of the 5th ACM Conference on Fairness, Accountability, and Transparency (FAccT), 2022 (Extended abstract).
arXiv January 2022.
GitHub.
Accepted to Fast Track in Management Science
- Abstract: Even if the average treatment effect (ATE), which measures the change in social welfare, is positive, there is a risk of negative effect on, say, some 10% of the population. Assessing such risk is difficult, however, because any one individual treatment effect (ITE) is never observed, so the 10% worst-affected cannot be identified, while distributional treatment effects only compare the first deciles within each treatment group, which does not correspond to any 10%-subpopulation. In this paper we consider how to nonetheless assess this important risk measure, formalized as the conditional value at risk (CVaR) of the ITE distribution. We leverage the availability of pre-treatment covariates and characterize the tightest-possible upper and lower bounds on ITE-CVaR given by the covariate-conditional average treatment effect (CATE) function. Some bounds can also be interpreted as summarizing a complex CATE function into a single metric and are of interest independently of being a bound. We then proceed to study how to estimate these bounds efficiently from data and construct confidence intervals. This is challenging even in randomized experiments as it requires understanding the distribution of the unknown CATE function, which can be very complex if we use rich covariates so as to best control for heterogeneity. We develop a debiasing method that overcomes this and prove it enjoys favorable statistical properties even when CATE and other nuisances are estimated by black-box machine learning or even inconsistently. Studying a hypothetical change to French job-search counseling services, our bounds and inference demonstrate that a small social benefit entails a negative impact on a substantial subpopulation.
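- Plug-in sketch of one CATE-based summary discussed above: the lower-tail CVaR of estimated CATE values. This simple plug-in does not reproduce the paper's sharp ITE-CVaR bounds or its debiased inference; the toy numbers are assumptions.
```python
import numpy as np

def cate_cvar(cate_values, alpha=0.10):
    """Average of the worst alpha-fraction of estimated CATE values."""
    cate_values = np.sort(np.asarray(cate_values, dtype=float))
    k = max(1, int(np.floor(alpha * len(cate_values))))
    return float(cate_values[:k].mean())

# Toy usage: with CATE ~ N(0.5, 1), the ATE is +0.5 yet roughly 31% of the population
# is harmed, and the worst-off 10% have an average effect near -1.25.
rng = np.random.default_rng(3)
print(cate_cvar(rng.normal(0.5, 1.0, size=100_000), alpha=0.10))
```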
- JOptimization and Personalization|Causal Inference1
Doubly-Valid/Doubly-Sharp Sensitivity Analysis for Causal Inference with Unmeasured Confounding, with J. Dorn and K. Guo.
Minor revision in JASA.
arXiv December 2021.
GitHub.
- Abstract: We study the problem of constructing bounds on the average treatment effect in the presence of unobserved confounding under the marginal sensitivity model of Tan (2006). Combining an existing characterization involving adversarial propensity scores with a new distributionally robust characterization of the problem, we propose novel estimators of these bounds that we call "doubly-valid/doubly-sharp" (DVDS) estimators. Double sharpness corresponds to the fact that DVDS estimators consistently estimate the tightest possible (i.e., sharp) bounds implied by the sensitivity model even when one of two nuisance parameters is misspecified and achieve semiparametric efficiency when all nuisance parameters are suitably consistent. Double validity is an entirely new property for partial identification: DVDS estimators still provide valid, though not sharp, bounds even when most nuisance parameters are misspecified. In fact, even in cases when DVDS point estimates fail to be asymptotically normal, standard Wald confidence intervals may remain valid. In the case of binary outcomes, the DVDS estimators are particularly convenient and possess a closed-form expression in terms of the outcome regression and propensity score. We demonstrate the DVDS estimators in a simulation study as well as a case study of right heart catheterization.
- JCausal Inference|Sequential Decision-Making1
Controlling for Unmeasured Confounding in Panel Data Using Minimal Bridge Functions: From Two-Way Fixed Effects to Factor Models, with G. Imbens and X. Mao.
arXiv August 2021.
- Abstract: We develop a new approach for identifying and estimating average causal effects in panel data under a linear factor model with unmeasured confounders. Compared to other methods tackling factor models such as synthetic controls and matrix completion, our method does not require the number of time periods to grow infinitely. Instead, we draw inspiration from the two-way fixed effect model as a special case of the linear factor model, where a simple difference-in-differences transformation identifies the effect. We show that analogous, albeit more complex, transformations exist in the more general linear factor model, providing a new means to identify the effect in that model. In fact many such transformations exist, called bridge functions, all identifying the same causal effect estimand. This poses a unique challenge for estimation and inference, which we solve by targeting the minimal bridge function using a regularized estimation approach. We prove that our resulting average causal effect estimator is root-N consistent and asymptotically normal, and we provide asymptotically valid confidence intervals. Finally, we provide extensions for the case of a linear factor model with time-varying unmeasured confounders.
- COptimization and Personalization|Applied ML|Fairness0
Estimating Structural Disparities for Face Models, with S. Ardeshir and C. Segalin.
- COptimization and Personalization|Causal Inference0
Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning, with X. Mao, K. Wang, and Z. Zhou.
- C|JApplied ML|Causal Inference|Fairness0
An Empirical Evaluation of the Impact of New York's Bail Reform on Crime Using Synthetic Controls, with T. Bergin, A. Koo, S. Koppel, R. Peterson, R. Ropac, and A. Zhou.
- COptimization and Personalization|Causal Inference|Sequential Decision-Making0
Post-Contextual-Bandit Inference, with A. Bibaut, A. Chambaz, M. Dimakopoulou, and M. van der Laan.
- COptimization and Personalization|Causal Inference|Sequential Decision-Making0
Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning, with A. Bibaut, A. Chambaz, M. Dimakopoulou, and M. van der Laan.
- JOptimization and Personalization|Causal Inference|Sequential Decision-Making0
Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes, with A. Bennett.
- JCausal Inference1
Causal Inference Under Unmeasured Confounding With Negative Controls: A Minimax Learning Approach, with X. Mao and M. Uehara.
arXiv March 2021.
- Abstract: We study the estimation of causal parameters when not all confounders are observed and instead negative controls are available. Recent work has shown how these can enable identification and efficient estimation via two so-called bridge functions. In this paper, we tackle the primary challenge to causal inference using negative controls: the identification and estimation of these bridge functions. Previous work has relied on uniqueness and completeness assumptions on these functions that may be implausible in practice and has also focused on their parametric estimation. Instead, we provide a new identification strategy that avoids both uniqueness and completeness. And we provide new estimators for these functions based on minimax learning formulations. These estimators accommodate general function classes such as reproducing kernel Hilbert spaces and neural networks. We study finite-sample convergence results both for estimating the bridge functions themselves and for the final estimation of the causal parameter. We do this under a variety of combinations of assumptions that include realizability and closedness conditions on the hypothesis and critic classes employed in the minimax estimator. Depending on how much we are willing to assume, we obtain different convergence rates. In some cases, we show the estimate for the causal parameter may converge even when our bridge function estimators do not converge to any valid bridge function. And, in other cases, we show we can obtain semiparametric efficiency.
- JOptimization and Personalization0
Fast Rates for Contextual Linear Optimization, with Y. Hu and X. Mao.
- JOptimization and Personalization0
Stochastic Optimization Forests, with X. Mao.
- COptimization and Personalization|Causal Inference|Sequential Decision-Making0
Stateful Offline Contextual Policy Evaluation and Learning, with A. Zhou.
- COptimization and Personalization|Causal Inference0
Control Variates for Slate Off-Policy Evaluation, with F. Amat Gil, A. Chandrashekar, and N. Vlassis.
- C|JOptimization and Personalization|Sequential Decision-Making1
Fast Rates for the Regret of Offline Reinforcement Learning, with Y. Hu and M. Uehara.
Minor revision in Math of OR.
Proceedings of the 34th Conference on Learning Theory (COLT), 2021 (Extended abstract).
arXiv February 2021.
- Abstract: We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP). While existing analyses of common approaches, such as fitted Q-iteration (FQI), suggest an O(1/√n) convergence for regret, empirical behavior exhibits much faster convergence. In this paper, we present a finer regret analysis that exactly characterizes this phenomenon by providing fast rates for the regret convergence. First, we show that given any estimate for the optimal quality function Q*, the regret of the policy it defines converges at a rate given by the exponentiation of the Q*-estimate's pointwise convergence rate, thus speeding it up. The level of exponentiation depends on the level of noise in the decision-making problem, rather than the estimation problem. We establish such noise levels for linear and tabular MDPs as examples. Second, we provide new analyses of FQI and Bellman residual minimization to establish the correct pointwise convergence guarantees. As specific cases, our results imply O(1/n) regret rates in linear cases and exp(-Ω(n)) regret rates in tabular cases.
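- For reference, a minimal fitted Q-iteration sketch over offline transitions with a generic regressor and a finite action set; this only illustrates the method analyzed above, not the paper's estimators or rates.
```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fqi(S, A, R, S_next, n_actions, gamma=0.99, n_iters=30):
    """S, S_next: (n, d) states; A: (n,) integer actions; R: (n,) rewards.
    Returns a function mapping states to an (m, n_actions) array of Q-values."""
    X = np.column_stack([S, A])
    target, model = R.copy(), None
    for _ in range(n_iters):
        model = GradientBoostingRegressor().fit(X, target)
        q_next = np.column_stack([
            model.predict(np.column_stack([S_next, np.full(len(R), a)]))
            for a in range(n_actions)
        ])
        target = R + gamma * q_next.max(axis=1)     # Bellman backup
    def q(states):
        return np.column_stack([
            model.predict(np.column_stack([states, np.full(len(states), a)]))
            for a in range(n_actions)
        ])
    return q
```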
- JApplied ML|Causal Inference0
The Effect of Patient Age on Discharge Destination and Complications After Lumbar Spinal Fusion, with B. Pennicooke, M. Santacatterina, J. Lee, and E. Elowitz.
- C|JOptimization and Personalization|Sequential Decision-Making1
Finite Sample Analysis of Minimax Offline Reinforcement Learning: Completeness, Fast Rates and First-Order Efficiency, with M. Imaizumi, N. Jiang, W. Sun, M. Uehara, and T. Xie.
arXiv February 2021.
- Abstract: We offer a theoretical characterization of off-policy evaluation (OPE) in reinforcement learning using function approximation for marginal importance weights and q-functions when these are estimated using recent minimax methods. Under various combinations of realizability and completeness assumptions, we show that the minimax approach enables us to achieve a fast rate of convergence for weights and quality functions, characterized by the critical inequality (Bartlett et al., 2005). Based on this result, we analyze convergence rates for OPE. In particular, we introduce novel alternative completeness conditions under which OPE is feasible and we present the first finite-sample result with first-order efficiency in non-tabular environments, i.e., having the minimal coefficient in the leading term.
- JCausal Inference0
On the Optimality of Randomization in Experimental Design: How to Randomize for Minimax Variance and Design-Based Inference.
- JOptimization and Personalization|Causal Inference0
Rejoinder: New Objectives for Policy Learning.
- COptimization and Personalization|Fairness0
Fairness, Welfare, and Equity in Personalized Pricing, with A. Zhou.
- JCausal Inference0
The Variational Method of Moments, with A. Bennett.
- JOptimization and Personalization|Sequential Decision-Making1
DTR Bandit: Learning to Make Response-Adaptive Decisions With Low Regret, with Y. Hu.
- COptimization and Personalization|Causal Inference|Sequential Decision-Making0
Optimal Off-Policy Evaluation from Multiple Logging Policies, with Y. Saito and M. Uehara.
- COptimization and Personalization|Causal Inference|Sequential Decision-Making0
Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies, with M. Uehara.
- JOptimization and Personalization|Causal Inference|Sequential Decision-Making0
Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning, with M. Uehara.
Biometrika, 2023.
arXiv June 2020.
GitHub.
- Abstract: We study the efficient off-policy evaluation of natural stochastic policies, which are defined in terms of deviations from the behavior policy. This is a departure from the literature on off-policy evaluation, where most work considers the evaluation of explicitly specified policies. Crucially, offline reinforcement learning with natural stochastic policies can help alleviate issues of weak overlap, lead to policies that build upon current practice, and improve policies' implementability in practice. Compared with the classic case of a pre-specified evaluation policy, when evaluating natural stochastic policies, the efficiency bound, which measures the best-achievable estimation error, is inflated since the evaluation policy itself is unknown. In this paper we derive the efficiency bounds of two major types of natural stochastic policies: tilting policies and modified treatment policies. We then propose efficient nonparametric estimators that attain the efficiency bounds under very lax conditions. These also enjoy a (partial) double robustness property.
- COptimization and Personalization|Causal Inference|Sequential Decision-Making0
Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders, with A. Bennett, L. Li, and A. Mousavi.
Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), 2021.
arXiv July 2020.
Oral at AISTATS (3%)
- Abstract: Off-policy evaluation (OPE) in reinforcement learning is an important problem in settings where experimentation is limited, such as education and healthcare. But, in these very same settings, observed actions are often confounded by unobserved variables making OPE even more difficult. We study an OPE problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders, where states and actions can act as proxies for the unobserved confounders. We show how, given only a latent variable model for states and actions, policy value can be identified from off-policy data. Our method involves two stages. In the first, we show how to use proxies to estimate stationary distribution ratios, extending recent work on breaking the curse of horizon to the confounded setting. In the second, we show optimal balancing can be combined with such learned ratios to obtain policy value while avoiding direct modeling of reward functions. We establish theoretical guarantees of consistency, and benchmark our method empirically.
- JCausal Inference1
On the Role of Surrogates in the Efficient Estimation of Treatment Effects With Limited Outcome Data, with X. Mao.
- COptimization and Personalization|Causal Inference|Sequential Decision-Making0
Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning, with A. Zhou.
- COptimization and Personalization|Causal Inference|Sequential Decision-Making0
Statistically Efficient Off-Policy Policy Gradients, with M. Uehara.
- COptimization and Personalization|Causal Inference0
Efficient Policy Learning from Surrogate-Loss Classification Reductions, with A. Bennett.
- JCausal Inference0
Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond, with X. Mao and M. Uehara.
- C|JOptimization and Personalization|Sequential Decision-Making0
Smooth Contextual Bandits: Bridging the Parametric and Non-differentiable Regret Regimes, with Y. Hu and X. Mao.
Operations Research, 2021.
Proceedings of the 33rd Conference on Learning Theory (COLT), 2020 (Extended abstract).
arXiv September 2019.
Finalist, Applied Probability Society Best Paper Competition
- Abstract: We study a nonparametric contextual bandit problem where the expected reward functions belong to a Hölder class with smoothness parameter β. We show how this interpolates between two extremes that were previously studied in isolation: non-differentiable bandits (β≤1), where rate-optimal regret is achieved by running separate non-contextual bandits in different context regions, and parametric-response bandits (β=∞), where rate-optimal regret can be achieved with minimal or no exploration due to infinite extrapolatability. We develop a novel algorithm that carefully adjusts to all smoothness settings and we prove its regret is rate-optimal by establishing matching upper and lower bounds, recovering the existing results at the two extremes. In this sense, our work bridges the gap between the existing literature on parametric and non-differentiable contextual bandit problems and between bandit algorithms that exclusively use global or local information, shedding light on the crucial interplay of complexity and regret in contextual bandits.
- JOptimization and Personalization|Causal Inference|Sequential Decision-Making0
Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning, with M. Uehara.
- C|JOptimization and Personalization|Causal Inference|Sequential Decision-Making0
Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes, with M. Uehara.
Journal of Machine Learning Research (JMLR), 21(167):1--63, 2020.
Proceedings of the 37th International Conference on Machine Learning (ICML), 2020 (Preliminary version).
arXiv August 2019.
GitHub.
- Abstract: Off-policy evaluation (OPE) in reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. We consider for the first time the semiparametric efficiency limits of OPE in Markov decision processes (MDPs), where actions, rewards, and states are memoryless. We show existing OPE estimators may fail to be efficient in this setting. We develop a new estimator based on cross-fold estimation of q-functions and marginalized density ratios, which we term double reinforcement learning (DRL). We show that DRL is efficient when both components are estimated at fourth-root rates and is also doubly robust when only one component is consistent. We investigate these properties empirically and demonstrate the performance benefits due to harnessing memorylessness efficiently.
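- A hedged sketch of the kind of doubly robust combination DRL uses, given already-estimated nuisances: a q-function, its induced state-value function, and marginalized state-action density ratios. The discounted normalization below is one convention; the paper's cross-fold nuisance estimation is omitted here.
```python
import numpy as np

def dr_policy_value(w, q_sa, v_s_next, v_s0, rewards, gamma=0.99):
    """w: (n,) estimated ratios d_pi(s,a)/d_b(s,a) at logged pairs; q_sa: (n,) Q(s,a);
    v_s_next: (n,) V(s') = E_{a~pi}[Q(s',a)]; v_s0: (m,) V at sampled initial states;
    rewards: (n,) logged rewards. Doubly robust: correct if either w or (q, v) is."""
    correction = np.mean(w * (rewards + gamma * v_s_next - q_sa)) / (1.0 - gamma)
    return float(np.mean(v_s0) + correction)
```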
- C|JOptimization and Personalization|Causal Inference|Fairness0
Assessing Algorithmic Fairness with Unobserved Protected Class Using Data Combination, with X. Mao and A. Zhou.
Management Science, 2020.
Proceedings of the 3rd ACM Conference on Fairness, Accountability, and Transparency (FAccT), 110, 2020 (Extended abstract).
arXiv June 2019.
GitHub. MS Blog.
- Abstract: The increasing impact of algorithmic decisions on people's lives compels us to scrutinize their fairness and, in particular, the disparate impacts that ostensibly-color-blind algorithms can have on different groups. Examples include credit decisioning, hiring, advertising, criminal justice, personalized medicine, and targeted policymaking, where in some cases legislative or regulatory frameworks for fairness exist and define specific protected classes. In this paper we study a fundamental challenge to assessing disparate impacts in practice: protected class membership is often not observed in the data. This is particularly a problem in lending and healthcare. We consider the use of an auxiliary dataset, such as the US census, that includes class labels but not decisions or outcomes. We show that a variety of common disparity measures are generally unidentifiable aside from some unrealistic cases, providing a new perspective on the documented biases of popular proxy-based methods. We provide exact characterizations of the sharpest-possible partial identification set of disparities either under no assumptions or when we incorporate mild smoothness constraints. We further provide optimization-based algorithms for computing and visualizing these sets, which enables reliable and robust assessments -- an important tool when disparity assessment can have far-reaching policy implications. We demonstrate this in two case studies with real data: mortgage lending and personalized medicine dosing.
- JOptimization and Personalization0
Data Pooling in Stochastic Optimization, with V. Gupta.
- JOptimization and Personalization|Causal Inference0
More Efficient Policy Learning via Optimal Retargeting.
Journal of the American Statistical Association (JASA), 2020, Special Issue on Precision Medicine and Individualized Policy Discovery.
arXiv June 2019.
GitHub.
Selected as JASA Discussion Paper
- Abstract: Policy learning can be used to extract individualized treatment regimes from observational data in healthcare, civics, e-commerce, and beyond. One big hurdle to policy learning is a commonplace lack of overlap in the data for different actions, which can lead to unwieldy policy evaluation and poorly performing learned policies. We study a solution to this problem based on retargeting, that is, changing the population on which policies are optimized. We first argue that at the population level, retargeting may induce little to no bias. We then characterize the optimal reference policy centering and retargeting weights in both binary-action and multi-action settings. We do this in terms of the asymptotic efficient estimation variance of the new learning objective. We further consider bias regularization. Extensive empirical results in a simulation study and a case study of targeted job counseling demonstrate that retargeting is a fairly easy way to significantly improve any policy learning procedure.
- JOptimization and Personalization|Causal Inference0
Minimax-Optimal Policy Learning Under Unobserved Confounding, with A. Zhou.
Management Science, 2020.
arXiv May 2018.
GitHub.
Winner, INFORMS 2018 Data Mining Best Paper Award
2nd place, INFORMS 2018 Junior Faculty Interest Group (JFIG) Paper Competition
- Abstract: We study the problem of learning personalized decision policies from observational data while accounting for possible unobserved confounding. Previous approaches, which assume unconfoundedness, that is, that no unobserved confounders affect both the treatment assignment as well as the outcome, can lead to policies that introduce harm rather than benefit when some unobserved confounding is present, as is generally the case with observational data. Instead, because policy value and regret may not be point-identifiable, we study a method that minimizes the worst-case estimated regret of a candidate policy against a baseline policy over an uncertainty set for propensity weights that controls the extent of unobserved confounding. We prove generalization guarantees that ensure our policy is safe when applied in practice and in fact obtains the best possible uniform control on the range of all possible population regrets that agree with the possible extent of confounding. We develop efficient algorithmic solutions to compute this minimax-optimal policy. Finally, we assess and compare our methods on synthetic and semisynthetic data. In particular, we consider a case study on personalizing hormone replacement therapy based on observational data, in which we validate our results on a randomized experiment. We demonstrate that hidden confounding can hinder existing policy-learning approaches and lead to unwarranted harm, whereas our robust approach guarantees safety and focuses on well-evidenced improvement, a necessity for making personalized treatment policies learned from observational data reliable in practice.
- JOptimization and Personalization|Causal Inference0
Generalization Bounds and Representation Learning for Estimation of Potential Outcomes and Causal Effects, with F. Johansson, U. Shalit, and D. Sontag.
- COptimization and Personalization|Causal Inference|Sequential Decision-Making0
Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning, with M. Uehara.
- COptimization and Personalization|Fairness0
The Fairness of Risk Scores Beyond Classification: Bipartite Ranking and the xAUC Metric, with A. Zhou.
- CCausal Inference0
Deep Generalized Method of Moments for Instrumental Variable Analysis, with A. Bennett and T. Schnabel.
- COptimization and Personalization|Causal Inference|Fairness0
Assessing Disparate Impacts of Personalized Interventions: Identifiability and Bounds, with A. Zhou.
- COptimization and Personalization|Causal Inference0
Policy Evaluation with Latent Confounders via Optimal Balance, with A. Bennett.
- JOptimization and Personalization|Causal Inference0
Comment: Entropy Learning for Dynamic Treatment Regimes.
- JCausal Inference1
Kernel Optimal Orthogonality Weighting: A Balancing Approach to Estimating Effects of Continuous Treatments, with M. Santacatterina.
arXiv October 2019.
- Abstract: Many scientific questions require estimating the effects of continuous treatments. Outcome modeling and weighted regression based on the generalized propensity score are the most commonly used methods to evaluate continuous effects. However, these techniques may be sensitive to model misspecification, extreme weights, or both. In this paper, we propose Kernel Optimal Orthogonality Weighting (KOOW), a convex optimization-based method for estimating the effects of continuous treatments. KOOW finds weights that minimize the worst-case penalized functional covariance between the continuous treatment and the confounders. By minimizing this quantity, KOOW successfully provides weights that orthogonalize confounders and the continuous treatment, thus providing optimal covariate balance, while controlling for extreme weights. We evaluate its comparative performance in a simulation study. Using data from the Women's Health Initiative observational study, we apply KOOW to evaluate the effect of red meat consumption on blood pressure.
- JCausal Inference0
Optimal Weighting for Estimating Generalized Average Treatment Effects, with M. Santacatterina.
- CCausal Inference0
DeepMatch: Balancing Deep Covariate Representations for Causal Inference Using Adversarial Training.
- COptimization and Personalization|Causal Inference0
Classifying Treatment Responders Under Causal Effect Monotonicity.
- COptimization and Personalization|Causal Inference0
Confounding-Robust Policy Improvement, with A. Zhou.
- CCausal Inference0
Removing Hidden Confounding by Experimental Grounding, with A. M. Puli and U. Shalit.
Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS), 10911--10920, 2018.
arXiv October 2018.
GitHub.
Spotlight at NeurIPS (3.5%)
- Abstract: Observational data is being increasingly used as a means for making individual-level causal predictions and intervention recommendations. The foremost challenge of causal inference from observational data is hidden confounding, whose presence cannot be tested in data and can invalidate any causal conclusion. Experimental data does not suffer from confounding but is usually limited in both scope and scale. We introduce a novel method of using limited experimental data to correct the hidden confounding in causal effect models trained on larger observational data, even if the observational data does not fully overlap with the experimental data. Our method makes strictly weaker assumptions than existing approaches, and we prove conditions under which our method yields a consistent estimator. We demonstrate our method's efficacy using real-world data from a large educational experiment.
- COptimization and Personalization|Causal Inference0
Balanced Policy Evaluation and Learning.
- CCausal Inference0
Causal Inference with Noisy and Missing Covariates via Matrix Factorization, with X. Mao and M. Udell.
- COptimization and Personalization|Fairness0
Fairness under unawareness: assessing disparity when protected class is unobserved, with J. Chen, X. Mao, G. Svacha, and M. Udell.
Proceedings of the 2nd ACM Conference on Fairness, Accountability, and Transparency (FAccT), 339--348, 2019.
arXiv November 2018.
GitHub.
- Abstract: Assessing the fairness of a decision making system with respect to a protected class, such as gender or race, is complicated by not being able to observe class membership labels due to legal or operational reasons. Probabilistic models for predicting the protected class based on observed proxy variables, such as surname and geolocation for race, are sometimes used to impute these missing labels. Such methods are known to be used by government regulators and have been observed to exaggerate disparities. The reason why is unknown, as is whether overestimation is always the case. We decompose the bias of estimating outcome disparity via an existing threshold-based imputation method into multiple interpretable bias sources, which explains when over- or underestimation occurs. We also propose an alternative weighted estimator that uses soft classification rules rather than hard imputation, and show that its bias arises simply from the conditional covariance of the outcome with the true class membership. We illustrate our results with numerical simulations as well as an application to a dataset of mortgage applications, using geolocation as a proxy for race. We confirm that the bias of threshold-based imputation is generally upward; however, its magnitude varies strongly with the threshold chosen due to the complex interplay of multiple sources of bias uncovered by our theoretical analysis. Our new weighted estimator tends to have a negative bias that is much simpler to analyze and reason about.
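- A minimal sketch contrasting the two estimators discussed above (hard threshold-based imputation vs. the probability-weighted alternative) for an outcome-rate disparity; the variable roles are illustrative.
```python
import numpy as np

def disparity_threshold(y, p_class, cutoff=0.5):
    """Hard-impute class membership whenever P(class | proxy) exceeds the cutoff."""
    imputed = p_class >= cutoff
    return float(y[imputed].mean() - y[~imputed].mean())

def disparity_weighted(y, p_class):
    """Weight each observation by its estimated class-membership probability."""
    rate_in = np.sum(p_class * y) / np.sum(p_class)
    rate_out = np.sum((1.0 - p_class) * y) / np.sum(1.0 - p_class)
    return float(rate_in - rate_out)
```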
- COptimization and Personalization|Causal Inference0
Interval Estimation of Individual-Level Causal Effects Under Unobserved Confounding, with X. Mao and A. Zhou.
- COptimization and Personalization|Causal Inference|Fairness0
Residual Unfairness in Fair Machine Learning from Prejudiced Data, with A. Zhou.
- COptimization and Personalization|Causal Inference0
Policy Evaluation and Optimization with Continuous Treatments, with A. Zhou.
Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 84:1243--1251, 2018.
arXiv February 2018.
GitHub.
Finalist, Best Paper of INFORMS 2017 Data Mining and Decision Analytics Workshop
- Abstract: We study the problem of policy evaluation and learning from batched contextual bandit data when treatments are continuous, going beyond previous work on discrete treatments. Previous work for discrete treatment/action spaces focuses on inverse probability weighting (IPW) and doubly robust (DR) methods that use a rejection sampling approach for evaluation and the equivalent weighted classification problem for learning. In the continuous setting, this reduction fails as we would almost surely reject all observations. To tackle the case of continuous treatments, we extend the IPW and DR approaches to the continuous setting using a kernel function that leverages treatment proximity to attenuate discrete rejection. Our policy estimator is consistent and we characterize the optimal bandwidth. The resulting continuous policy optimizer (CPO) approach using our estimator achieves convergent regret and approaches the best-in-class policy for learnable policy classes. We demonstrate that the estimator performs well and, in particular, outperforms a discretization-based benchmark. We further study the performance of our policy optimizer in a case study on personalized dosing based on a dataset of Warfarin patients, their covariates, and final therapeutic doses. Our learned policy outperforms benchmarks and nears the oracle-best linear policy.
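- A minimal sketch of the kernelized inverse-propensity value estimate for a deterministic continuous-treatment policy, in the spirit of the extension described above; the Gaussian kernel, bandwidth, and conditional-density estimate are placeholders for the paper's choices.
```python
import numpy as np

def kernel_ipw_value(policy_doses, logged_doses, outcomes, cond_density, bandwidth):
    """policy_doses: (n,) doses the candidate policy would assign to each context;
    logged_doses, outcomes: (n,) observed doses and outcomes;
    cond_density: (n,) estimated generalized propensity f(logged_dose | context)."""
    u = (policy_doses - logged_doses) / bandwidth
    k = np.exp(-0.5 * u ** 2) / (np.sqrt(2.0 * np.pi) * bandwidth)   # K_h(pi(x) - t)
    return float(np.mean(k * outcomes / cond_density))
```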
- COptimization and Personalization|Causal Inference|Sequential Decision-Making0
Instrument-Armed Bandits.
- JApplied ML|Causal Inference0
More Robust Estimation of Sample Average Treatment Effects using Kernel Optimal Matching in an Observational Study of Spine Surgical Interventions, with B. Pennicooke and M. Santacatterina.
- JCausal Inference|Sequential Decision-Making0
Optimal Balancing of Time-Dependent Confounders for Marginal Structural Models, with M. Santacatterina.
- CCausal Inference1
Learning Weighted Representations for Generalization Across Designs, with F. Johansson, U. Shalit, and D. Sontag.
arXiv February 2018.
- Abstract: Predictive models that generalize well under distributional shift are often desirable and sometimes crucial to building robust and reliable machine learning applications. We focus on distributional shift that arises in causal inference from observational data and in unsupervised domain adaptation. We pose both of these problems as prediction under a shift in design. Popular methods for overcoming distributional shift make unrealistic assumptions such as having a well-specified model or knowing the policy that gave rise to the observed data. Other methods are hindered by their need for a pre-specified metric for comparing observations, or by poor asymptotic properties. We devise a bound on the generalization error under design shift, incorporating both representation learning and sample re-weighting. Based on the bound, we propose an algorithmic framework that does not require any of the above assumptions and which is asymptotically consistent. We empirically study the new framework using two synthetic datasets, and demonstrate its effectiveness compared to previous methods.
- COptimization and Personalization|Causal Inference0
Recursive Partitioning for Personalization using Observational Data.
Proceedings of the 34th International Conference on Machine Learning (ICML), 70:1789--1798, 2017.
arXiv August 2016.
Winner, Best Paper of INFORMS 2016 Data Mining and Decision Analytics Workshop
- Abstract: We study the problem of learning to choose from m discrete treatment options (e.g., news item or medical drug) the one with best causal effect for a particular instance (e.g., user or patient) where the training data consists of passive observations of covariates, treatment, and the outcome of the treatment. The standard approach to this problem is regress and compare: split the training data by treatment, fit a regression model in each split, and, for a new instance, predict all m outcomes and pick the best. By reformulating the problem as a single learning task rather than m separate ones, we propose a new approach based on recursively partitioning the data into regimes where different treatments are optimal. We extend this approach to an optimal partitioning approach that finds a globally optimal partition, achieving a compact, interpretable, and impactful personalization model. We develop new tools for validating and evaluating personalization models on observational data and use these to demonstrate the power of our novel approaches in a personalized medicine and a job training application.
- JCausal Inference0
Generalized Optimal Matching Methods for Causal Inference.
- JOptimization and Personalization|Sequential Decision-Making0
Dynamic Assortment Personalization in High Dimensions, with M. Udell.
- JCausal Inference0
Optimal A Priori Balance in the Design of Controlled Experiments.
- JOptimization and Personalization|Causal Inference0
The Power and Limits of Predictive Approaches to Observational-Data-Driven Optimization, with D. Bertsimas.
- JOptimization and Personalization0
From Predictive to Prescriptive Analytics, with D. Bertsimas.
Management Science, 66(3):1025--1044, 2020.
arXiv February 2014.
Finalist, POMS Applied Research Challenge 2016
- Abstract: In this paper, we combine ideas from machine learning (ML) and operations research and management science (OR/MS) in developing a framework, along with specific methods, for using data to prescribe decisions in OR/MS problems. In a departure from other work on data-driven optimization and reflecting our practical experience with the data available in applications of OR/MS, we consider data consisting, not only of observations of quantities with direct effect on costs/revenues, such as demand or returns, but predominantly of observations of associated auxiliary quantities. The main problem of interest is a conditional stochastic optimization problem, given imperfect observations, where the joint probability distributions that specify the problem are unknown. We demonstrate that our proposed solution methods are generally applicable to a wide range of decision problems. We prove that they are computationally tractable and asymptotically optimal under mild conditions even when data is not independent and identically distributed (iid) and even for censored observations. As an analogue to the coefficient of determination R², we develop a metric P termed the coefficient of prescriptiveness to measure the prescriptive content of data and the efficacy of a policy from an operations perspective. To demonstrate the power of our approach in a real-world setting we study an inventory management problem faced by the distribution arm of an international media conglomerate, which ships an average of 1 billion units per year. We leverage both internal data and public online data harvested from IMDb, Rotten Tomatoes, and Google to prescribe operational decisions that outperform baseline measures. Specifically, the data we collect, leveraged by our methods, accounts for an 88% improvement as measured by our coefficient of prescriptiveness.
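- A hedged sketch of a coefficient-of-prescriptiveness-style metric as recalled from the paper's description: the realized cost of the prescriptive policy compared against a context-blind SAA baseline, both measured relative to the perfect-foresight cost (treat the paper's formal definition as authoritative).
```python
def coefficient_of_prescriptiveness(cost_policy, cost_saa, cost_perfect_foresight):
    """All arguments are average out-of-sample costs. Analogous to R^2: 0 when no
    better than the context-blind SAA baseline, 1 at the perfect-foresight benchmark."""
    return 1.0 - (cost_policy - cost_perfect_foresight) / (cost_saa - cost_perfect_foresight)
```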
- JOptimization and Personalization0
Robust Sample Average Approximation, with D. Bertsimas and V. Gupta.
Mathematical Programming, 171(1--2):217--282, 2018.
arXiv August 2014.
Winner, Best Student Paper Award, MIT Operations Research Center 2013
- Abstract: Sample average approximation (SAA) is a widely popular approach to data-driven decision-making under uncertainty. Under mild assumptions, SAA is both tractable and enjoys strong asymptotic performance guarantees. Similar guarantees, however, do not typically hold in finite samples. In this paper, we propose a modification of SAA, which we term Robust SAA, which retains SAA's tractability and asymptotic properties and, additionally, enjoys strong finite-sample performance guarantees. The key to our method is linking SAA, distributionally robust optimization, and hypothesis testing of goodness-of-fit. Beyond Robust SAA, this connection provides a unified perspective enabling us to characterize the finite sample and asymptotic guarantees of various other data-driven procedures that are based upon distributionally robust optimization. This analysis provides insight into the practical performance of these various methods in real applications. We present examples from inventory management and portfolio allocation, and demonstrate numerically that our approach outperforms other data-driven approaches in these applications.
- JOptimization and Personalization0
Data-Driven Robust Optimization, with D. Bertsimas and V. Gupta.
Mathematical Programming, 167(2):235--292, 2018.
arXiv January 2014.
Finalist, INFORMS Nicholson Paper Competition 2013
- Abstract: The last decade has seen an explosion in the availability of data for operations research applications as part of the Big Data revolution. Motivated by this data rich paradigm, we propose a novel schema for utilizing data to design uncertainty sets for robust optimization using statistical hypothesis tests. The approach is flexible and widely applicable, and robust optimization problems built from our new sets are computationally tractable, both theoretically and practically. Furthermore, optimal solutions to these problems enjoy a strong, finite-sample probabilistic guarantee. We also propose concrete guidelines for practitioners and illustrate our approach with applications in portfolio management and queueing. Computational evidence confirms that our data-driven sets significantly outperform conventional robust optimization techniques whenever data is available.
- CCausal Inference0
A Framework for Optimal Matching for Causal Inference.
- JOptimization and Personalization|Applied ML|Causal Inference0
Personalized Diabetes Management Using Electronic Medical Records, with D. Bertsimas, A. Weinstein, and D. Zhuo.
- COptimization and Personalization0
Revealed Preference at Scale: Learning Personalized Preferences from Assortment Choice, with M. Udell.
- JOptimization and Personalization|Applied ML0
Inventory Management in the Era of Big Data, with D. Bertsimas and A. Hussain.
- BApplied ML0
On the Predictive Power of Web Intelligence and Social Media.
- JCausal Inference0
The Power of Optimization Over Randomization in Designing Experiments Involving Small Samples, with D. Bertsimas and M. Johnson.
Operations Research, 63(4):868--876, 2015.
Code. Editorial in World Bank's Development Impact.
- Abstract: Random assignment, typically seen as the standard in controlled trials, aims to make experimental groups statistically equivalent before treatment. However, with a small sample, which is a practical reality in many disciplines, randomized groups are often too dissimilar to be useful. We propose an approach based on discrete linear optimization to create groups whose discrepancy in their means and variances is several orders of magnitude smaller than with randomization. We provide theoretical and computational evidence that groups created by optimization have exponentially lower discrepancy than those created by randomization.
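- A brute-force illustration of the idea (not the paper's mixed-integer formulation): for a small sample, enumerate balanced splits and keep the one with the smallest discrepancy in group means and variances; the weighting of the two discrepancies is an arbitrary choice.
```python
import itertools
import numpy as np

def optimal_split(x, rho=1.0):
    """x: (n,) covariate values with n small and even. Returns a boolean treatment mask
    minimizing |mean difference| + rho * |variance difference| between the two groups."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    best, best_mask = np.inf, None
    for idx in itertools.combinations(range(n), n // 2):
        mask = np.zeros(n, dtype=bool)
        mask[list(idx)] = True
        d = abs(x[mask].mean() - x[~mask].mean()) + rho * abs(x[mask].var() - x[~mask].var())
        if d < best:
            best, best_mask = d, mask
    return best_mask
```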
- CApplied ML0
Predicting Crowd Behavior with Big Public Data.
Proceedings of the 23rd international conference on World Wide Web (WWW) companion, 23:625--630, 2014.
arXiv February 2014.
Slides. Media coverage.
Winner, INFORMS Social Media Analytics Best Paper Competition 2015
- Abstract: With public information becoming widely accessible and shared on today's web, greater insights are possible into crowd actions by citizens and non-state actors such as large protests and cyber activism. Turning public data into Big Data, the company Recorded Future continually scans over 300,000 open content web sources in 7 languages from all over the world, ranging from mainstream news to government publications to blogs and social media. We study the predictive power of this massive public data in forecasting crowd actions such as large protests and cyber campaigns before they occur. Using natural language processing, event information is extracted from content such as type of event, what entities are involved and in what role, sentiment and tone, and the occurrence time range of the event discussed. The amount of information is staggering and trends can be seen clearly in sheer numbers. In the first half of this paper we show how we use this data to predict large protests in a selection of 19 countries and 37 cities in Asia, Africa, and Europe with high accuracy using standard learning machines. In the second half we delve into predicting the perpetrators and targets of political cyber attacks with a novel application of the naïve Bayes classifier to high-dimensional sequence mining in massive datasets.
- JOptimization and Personalization|Applied ML0
Scheduling, Revenue Management, and Fairness in an Academic-Hospital Division: An Optimization Approach, with D. Bertsimas and R. Baum.