Our projects fall into three groups: Drug Safety Discovery, Advancing Cancer Research & Care,
and Data, Information & Knowledge Extraction. You can find more details about each of these groups below,
or you can browse our publicly available code here
and our publicly available resources here.
Drug Safety Discovery
Drug safety is an essential component of modern healthcare, aiming to maximize therapeutic
benefits while minimizing harm to patients through robust pharmacovigilance practices.
Adverse drug events (ADEs) are the fourth leading cause of death in the US and cost billions
of dollars annually in increased healthcare costs. At the Tatonetti Lab, we use advanced
data science techniques—including artificial intelligence and machine learning—to investigate
drug safety. By leveraging emerging resources such as electronic health records (EHRs) and
genomics databases, we aim to drive innovation and improve our understanding of medication
risks and benefits.
In this effort, we have released and maintain the world’s most comprehensive side effect resource,
OnSIDES. OnSIDES includes over 3.6 million drug–adverse drug event (ADE) pairs for 2,793 drug
ingredients extracted from 46,686 publicly available drug labels. Additionally, we support
research into sex-specific drug effects
and provide access to Sex-ADE, a curated dataset of
side effects that differ between men and women. We also maintain the OffSIDES, KidSIDES, and
TwoSIDES databases. OffSIDES identifies unexpected ADEs by analyzing large-scale observational
data (e.g., from EHRs), TwoSIDES extends this approach to study adverse drug–drug interactions,
and KidSIDES focuses specifically on pediatric drug safety signals during childhood developmental
phases. Full details about these databases and how to access them are available
here.
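As a rough illustration of how a drug–ADE signal can be quantified in observational data, the sketch below computes a proportional reporting ratio (PRR), a measure commonly used in pharmacovigilance. This page does not say that PRR is the statistic behind these databases, so treat it as an assumption. The function name and the counts are hypothetical.

```python
def proportional_reporting_ratio(a: int, b: int, c: int, d: int) -> float:
    """Return the PRR for a 2x2 drug/event contingency table.

    a: reports with the drug of interest and the event
    b: reports with the drug of interest and without the event
    c: reports with other drugs and the event
    d: reports with other drugs and without the event
    """
    drug_total = a + b
    other_total = c + d
    if drug_total == 0 or other_total == 0 or c == 0:
        raise ValueError("PRR is undefined for these counts")
    # Event rate among reports for the drug, divided by the rate among all other drugs.
    return (a / drug_total) / (c / other_total)


# Hypothetical counts, for illustration only (not taken from any real database).
prr = proportional_reporting_ratio(a=20, b=80, c=50, d=950)
print(f"PRR = {prr:.2f}")
```

A PRR well above 1 suggests the event is reported disproportionately often with the drug. On its own it does not correct for confounding.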
In active research, we are investigating how the intermediate layers of large language models (LLMs)
that are trained for ADE prediction can be interpreted to better understand the relationships between
drugs and their associated ADEs. This includes leveraging molecular structure embeddings as well as
patient data from EHRs, particularly unstructured clinical notes. We are also examining the adverse
effects associated with immune checkpoint inhibitor (ICI) treatments by combining clinical notes,
electronic health records, and mechanistic understanding of immune-related adverse drug reactions.
This work will build the foundation for predictive models that can
identify patients at risk of ADEs prior to treatment.
Advancing Cancer Research & Care
Cancer remains the second leading cause of death in the United States—and the leading cause among
individuals under 85. Decades of intensive research have significantly improved cancer outcomes.
Notably, cancer mortality has continued to decline through 2021, largely due to earlier detection
and advances in treatment.
At the Tatonetti Lab, we dedicate another key area of our research to advancing cancer care by
leveraging AI, particularly through the application of LLMs to both structured and unstructured EHR data. Our
cancer-focused projects fall into two main categories: clinical applications and investigations into
the underlying biological mechanisms of cancer.
Our real-world clinical application projects cover the following areas:
- Cancer Staging from Unstructured Data: Cancer staging plays a critical role in prognosis, yet it
is often recorded in unstructured electronic health records. To address this, we developed a
generalizable method using BERT-based models to automatically classify TNM stages from pathology
reports, achieving high accuracy and AUC scores. Our results were recently published in
Nature Communications.
- LLMs for Lymphoma Care: In collaboration with Dr. Akil Merchant at Cedars-Sinai, we are applying
LLMs to improve the understanding, diagnosis, and treatment of lymphoma. By analyzing unstructured
clinical notes, we will identify at-risk patients and predict treatment outcomes.
- Recurrence Risk Prediction: We are developing predictive models to assess the risk of pancreatic cancer
recurrence using individualized clinical data, aiming to support better post-treatment care.
- Pressure Sensor Development: To improve hospital care for cancer patients, we are working on a
pressure sensor pad that can detect high-risk areas for ulcers and bedsores, helping prevent complications
during inpatient stays.
- Interactive Cancer Dashboard: We are creating a cancer dashboard, an interactive web application that
enables users to explore and analyze cancer patient data from Cedars-Sinai. Users can search and filter
diagnoses and medications to build dynamic patient cohorts. The dashboard generates visualizations and
tables that provide insights into demographics (such as ethnicity, sex, smoking status, and living status)
and medication usage patterns.
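The cohort-building step described for the dashboard can be sketched as a filter over patient records followed by a demographic tally. This is a minimal illustration and not the dashboard's actual code. The record layout, field names, and sample patients are all hypothetical.

```python
from collections import Counter

# Hypothetical, synthetic patient records (not real data).
patients = [
    {"id": 1, "sex": "F", "smoking_status": "never",
     "diagnoses": {"lymphoma"}, "medications": {"rituximab"}},
    {"id": 2, "sex": "M", "smoking_status": "former",
     "diagnoses": {"lymphoma"}, "medications": {"rituximab", "prednisone"}},
    {"id": 3, "sex": "M", "smoking_status": "current",
     "diagnoses": {"bladder cancer"}, "medications": {"pembrolizumab"}},
    {"id": 4, "sex": "F", "smoking_status": "never",
     "diagnoses": {"lymphoma"}, "medications": {"prednisone"}},
]


def build_cohort(records, diagnosis=None, medication=None):
    """Keep records matching every filter that is set (None means no filter)."""
    return [
        r for r in records
        if (diagnosis is None or diagnosis in r["diagnoses"])
        and (medication is None or medication in r["medications"])
    ]


def summarize(cohort, field):
    """Count how many cohort members fall into each value of a demographic field."""
    return Counter(r[field] for r in cohort)


cohort = build_cohort(patients, diagnosis="lymphoma", medication="rituximab")
print([r["id"] for r in cohort])
print(summarize(cohort, "sex"))
```

Counts like these are the kind of summary a dashboard could render as tables or charts.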
Beyond clinical applications, we are also investigating the biological mechanisms that drive cancer.
One of our ongoing studies examines the role of
Y chromosome loss in male cancer patients. This age-related mutation appears to help cancer cells
evade the immune system, contributing to aggressive bladder cancer. Paradoxically, it also makes the
disease more responsive to immune checkpoint inhibitors, a standard form of treatment.
Data, Information & Knowledge Extraction
The third key area of our research focuses on advancing data, information, and knowledge extraction
from both structured and unstructured EHRs.
As part of this effort, we developed Chappy, a secure chatbot designed for use within the Cedars-Sinai
ecosystem. Chappy enables users to interact with various large language models (LLMs) in a way that
complies with regulations on protected health information (PHI). These models are mainly deployed through Azure, the cloud infrastructure
built on the Cedars-Sinai–Microsoft platform partnership. This secure environment ensures that data
privacy is maintained while allowing users to modify and experiment with LLM capabilities safely.
We are also studying the impact of accent-related bias in state-of-the-art automatic speech recognition
(ASR) systems and LLMs. This project evaluates the transcription accuracy of voice recordings from
participants with diverse linguistic backgrounds and accents, assessing how ASR and LLM technologies
handle this variability.
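One common way to measure transcription accuracy is word error rate (WER), averaged per accent group. The sketch below assumes WER as the metric, which this page does not confirm. The group labels and sentences are hypothetical.

```python
from collections import defaultdict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the strings, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def mean_wer_by_group(samples):
    """Average WER per group from (group, reference, hypothesis) tuples."""
    scores = defaultdict(list)
    for group, reference, hypothesis in samples:
        scores[group].append(word_error_rate(reference, hypothesis))
    return {group: sum(vals) / len(vals) for group, vals in scores.items()}


# Hypothetical transcripts, for illustration only.
samples = [
    ("group_a", "the patient takes aspirin daily", "the patient takes aspirin daily"),
    ("group_b", "the patient takes aspirin daily", "the patient take aspirin"),
]
print(mean_wer_by_group(samples))
```

A consistent gap in mean WER between groups is one signal of accent-related bias. A real evaluation would also need more careful text normalization and enough samples per group.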
We are leveraging the previously developed RIFTEHR
tool to infer familial relationships from emergency
contact data in Cedars-Sinai's EHR system, enabling large-scale heritability studies. Using this approach,
we’ve estimated the genetic contribution to nearly 500 diseases, providing a scalable alternative to
traditional genetic testing.
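The core idea of inferring relationships from emergency contact data can be sketched as matching each listed contact against the patient table and recording the relationship in both directions. This is a simplified illustration under assumed field names, and it is not the RIFTEHR algorithm itself. The matching rule (exact normalized name plus phone number) and the sample records are hypothetical.

```python
# Inverse of each declared relationship, so a pair can be recorded in both directions.
INVERSE = {
    "mother": "child", "father": "child", "parent": "child",
    "child": "parent", "spouse": "spouse", "sibling": "sibling",
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so minor formatting differences still match."""
    return " ".join(name.lower().split())


def infer_relationships(patients, emergency_contacts):
    """Return (person, relationship, relative) triples: `relative` is `person`'s `relationship`."""
    index = {(normalize(p["name"]), p["phone"]): pid for pid, p in patients.items()}
    triples = set()
    for contact in emergency_contacts:
        key = (normalize(contact["contact_name"]), contact["contact_phone"])
        relative_id = index.get(key)
        relationship = contact["relationship"].lower()
        if relative_id is None or relative_id == contact["patient_id"]:
            continue  # the contact is not a known patient, or points back to the same person
        if relationship not in INVERSE:
            continue  # skip non-familial or unrecognized relationship labels
        triples.add((contact["patient_id"], relationship, relative_id))
        triples.add((relative_id, INVERSE[relationship], contact["patient_id"]))
    return triples


# Hypothetical, synthetic records.
patients = {
    "P1": {"name": "Alex Doe", "phone": "555-0101"},
    "P2": {"name": "Jane Doe", "phone": "555-0100"},
}
emergency_contacts = [
    {"patient_id": "P1", "contact_name": "jane  doe",
     "contact_phone": "555-0100", "relationship": "Mother"},
]
print(sorted(infer_relationships(patients, emergency_contacts)))
```

Pairs of relatives identified this way are the raw material for a heritability analysis, which this sketch does not attempt.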