2019.HST.953: Collaborative Data Science in Medicine


HST.953: Collaborative Data Science in Medicine is a guide for students who are interested in performing retrospective research using data from electronic health records (the Medical Information Mart for Intensive Care [MIMIC] database and the eICU Collaborative Research Database [eICU-CRD]). The course covers the steps of parsing a clinical question into a study design and a methodology for data analysis and interpretation, with emphasis on the data curation process that is required before any analysis can be performed. Understanding and navigating the databases requires working closely with the clinicians who work in intensive care units, and can be much more challenging than the statistics and machine learning tasks. Activities include reviewing case studies from MIMIC and eICU-CRD and a collaborative research project. Student teams will choose a question and a clinician to work with for their project, and will meet weekly with clinician mentors at pre-arranged times.


While clinical trials are best at inferring causality, they are poorly suited to detecting the small population-level effect sizes that are typical given heterogeneity of treatment effect. Moreover, clinical trials typically exclude important subgroups (older patients, those with chronic diseases), so their findings may not be generalizable to the real world. Because of the limitations of clinical trials, including cost, many practice guidelines are supported by low-quality evidence. To make matters worse, these guidelines are often adopted in countries where funding for research is limited.

The digitalization of healthcare data may provide an opportunity to develop locally relevant practice guidelines rather than adopting guidelines based on research in populations whose findings may not generalize locally. Digital data is proliferating in diverse forms within the healthcare field, not only because of the adoption of electronic health records, but also because of the growing use of wireless technologies for ambulatory monitoring. Since clinical trials may be too expensive to perform in most countries, digital health data provides an opportunity to conduct locally relevant research. Rigorous observational studies have been shown to correlate well with clinical trials across the medical literature in terms of estimates of risk and effect size.

The world is abuzz with applications of machine learning in almost every field: commerce, transportation, banking, and more recently, healthcare. These breakthroughs are due to rediscovered algorithms, powerful computers to run them, and most importantly, the availability of bigger and better data to train the algorithms.

Course Information

A variety of datasets will be available, including MIMIC-III and Philips eICU from the USA.


There are no prerequisites for this course for MIT, Harvard and Wellesley students. All other students need some experience with R, Python and/or SQL. Everyone is required to complete online human subjects training (if they haven’t already done so) and sign a Data Use Agreement to obtain access to the MIMIC and eICU Collaborative Research Database. This is a project-based course, and all students are required to participate in clinical research using one or both of the databases.


For more information, please contact us.

HST953 Faculty


Aldo Arevalo

Miguel Armengol

Lucas Bulgarelli

Leo Anthony Celi

Kotaro Ebina

Marta Fernandes

Ryan Kindle

Alistair Johnson

Regina Leung

Xiaoli Liu

Ming Yu Lu

Ned McCague

Anthony O’Brien

Kenneth Paik

Tom Pollard

Jesse Raffa

Andre Silva

Wei-Hung Weng