Trial Outcomes & Findings for Using Digital Data to Predict CHD (NCT04574882)

NCT ID: NCT04574882

Last Updated: 2025-11-04

Results Overview

The primary outcome is the set of language topics and features, with topics derived using Latent Dirichlet Allocation (LDA), a method for clustering language data. For each participant, we included all available Facebook wall posts from the start of their account history through data collection, regardless of whether they were posted before or after a CHD diagnosis. We examined associations between linguistic features (unigrams, LIWC categories, LDA topics) and cardiovascular case status (CHD presence vs absence) using Pearson correlation and logistic regression. LDA, a systematic method for identifying text-based themes, was applied to generate 200 clusters of co-occurring words ("topics"). For each feature type (unigram, LIWC category, LDA topic), we fit separate logistic regression models and calculated Pearson correlation coefficients to assess predictive value for case status. Each language-derived feature was encoded as a normalized frequency per user to enable consistent comparison across participants.

Recruitment status

COMPLETED

Target enrollment

781 participants

Primary outcome timeframe

Through study completion, an average of 3 years

Results posted on

2025-11-04

Participant Flow

Participant milestones

Control: Patients aged 30-74 who have a non-cardiovascular-related history. Survey: Interested participants may complete the informed consent online. After informed consent, the participant will be asked to share the digital data types that they use (Facebook, Instagram, Twitter, Google search, step data) and then participants will complete a cross-sectional survey.
Case: Patients aged 30-74 with and without CHD (ICD-10: I63, I20-I25) within the last 5 years. Survey: Interested participants may complete the informed consent online. After informed consent, the participant will be asked to share the digital data types that they use (Facebook, Instagram, Twitter, Google search, step data) and then participants will complete a cross-sectional survey.

Overall Study | Control | Case
STARTED | 455 | 326
COMPLETED | 333 | 194
NOT COMPLETED | 122 | 132
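As a quick consistency check on the participant flow counts above, the following sketch verifies that completed plus not-completed equals started in each arm and derives the completion rate per arm. The dictionary layout and the `completion_rate` helper are illustrative names, not part of the registry record; the counts are copied from the table.

```python
# Counts copied from the participant flow table; the structure is illustrative.
flow = {
    "Control": {"started": 455, "completed": 333, "not_completed": 122},
    "Case": {"started": 326, "completed": 194, "not_completed": 132},
}


def completion_rate(arm):
    """Fraction of participants who started and also completed the study."""
    return arm["completed"] / arm["started"]


rates = {}
for name, arm in flow.items():
    # Each arm's milestones should add up to the number who started.
    assert arm["completed"] + arm["not_completed"] == arm["started"]
    rates[name] = completion_rate(arm)
    print(f"{name}: {rates[name]:.1%} completed")
# Control: 73.2% completed
# Case: 59.5% completed
```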

Reasons for withdrawal

Withdrawal data not reported

Baseline Characteristics

Participants were allowed to select 'Prefer not to say' for their sex; 38 case and 43 control participants selected that option.

Baseline characteristics by cohort

Case (n=326 Participants): Patients aged 30-74 with and without CHD (ICD-10: I63, I20-I25) within the last 5 years. Survey: Interested participants may complete the informed consent online. After informed consent, the participant will be asked to share the digital data types that they use (Facebook, Instagram, Twitter, Google search, step data) and then participants will complete a cross-sectional survey.
Control (n=455 Participants): Patients aged 30-74 who have a non-cardiovascular-related history. Survey: Interested participants may complete the informed consent online. After informed consent, the participant will be asked to share the digital data types that they use (Facebook, Instagram, Twitter, Google search, step data) and then participants will complete a cross-sectional survey.
Total (n=781 Participants): Total of all reporting groups

Measure | Case (n=326) | Control (n=455) | Total (n=781)
Age, Continuous | 60.1 years (SD 9.5) | 59.5 years (SD 9.2) | 60.1 years (SD 9.8)
Sex: Female | 141 of 288 Participants | 287 of 412 Participants | 428 of 700 Participants
Sex: Male | 147 of 288 Participants | 125 of 412 Participants | 272 of 700 Participants
Race (NIH/OMB): American Indian or Alaska Native | 1 Participants | 1 Participants | 2 Participants
Race (NIH/OMB): Asian | 10 Participants | 13 Participants | 23 Participants
Race (NIH/OMB): Native Hawaiian or Other Pacific Islander | 0 Participants | 0 Participants | 0 Participants
Race (NIH/OMB): Black or African American | 55 Participants | 152 Participants | 207 Participants
Race (NIH/OMB): White | 217 Participants | 254 Participants | 471 Participants
Race (NIH/OMB): More than one race | 0 Participants | 0 Participants | 0 Participants
Race (NIH/OMB): Unknown or Not Reported | 43 Participants | 35 Participants | 78 Participants

Sex rows exclude participants who selected 'Prefer not to say' (38 case, 43 control), so their denominators are 288, 412, and 700.

PRIMARY outcome

Timeframe: Through study completion, an average of 3 years

Population: To evaluate predictive performance for ASCVD risk, we used regression and classification analyses. Pearson correlations assessed how well language features predicted continuous risk scores. For classification, we measured AUC for binary risk (≥10% vs <10%) and for standard risk categories. Three logistic regression models were tested: language alone, demographics alone, and combined. Arms/groups were combined because the goal was to predict ASCVD risk continuously and categorically across participants.

The primary outcome is the set of language topics and features, with topics derived using Latent Dirichlet Allocation (LDA), a method for clustering language data. For each participant, we included all available Facebook wall posts from the start of their account history through data collection, regardless of whether they were posted before or after a CHD diagnosis. We examined associations between linguistic features (unigrams, LIWC categories, LDA topics) and cardiovascular case status (CHD presence vs absence) using Pearson correlation and logistic regression. LDA, a systematic method for identifying text-based themes, was applied to generate 200 clusters of co-occurring words ("topics"). For each feature type (unigram, LIWC category, LDA topic), we fit separate logistic regression models and calculated Pearson correlation coefficients to assess predictive value for case status. Each language-derived feature was encoded as a normalized frequency per user to enable consistent comparison across participants.
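The per-user encoding and correlation step described above can be illustrated with a minimal sketch. It assumes whitespace tokenization, relative frequency as the normalization, and case status coded as 1 (CHD) or 0 (control); the record does not specify the study's tokenizer or normalization, and the toy posts and the helper names `normalized_frequencies`, `pearson`, and `feature_correlation` are hypothetical.

```python
from collections import Counter
from math import sqrt


def normalized_frequencies(posts):
    """Relative frequency of each unigram across all of one user's posts."""
    tokens = [tok for post in posts for tok in post.lower().split()]
    total = len(tokens)
    return {tok: n / total for tok, n in Counter(tokens).items()} if total else {}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def feature_correlation(users, token):
    """Correlate one unigram's per-user frequency with case status."""
    freqs = [normalized_frequencies(posts).get(token, 0.0) for posts, _ in users]
    labels = [label for _, label in users]
    return pearson(freqs, labels)


# Hypothetical toy data: (posts, case_status), 1 = CHD case, 0 = control.
users = [
    (["doctor visit today", "doctor again"], 1),
    (["saw the doctor"], 1),
    (["great game tonight"], 0),
    (["game day", "walk in the park"], 0),
]

r_doctor = feature_correlation(users, "doctor")  # positive on this toy data
r_game = feature_correlation(users, "game")      # negative on this toy data
```

Dividing by each user's total token count keeps prolific and infrequent posters comparable, which is the stated purpose of the normalization.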

Outcome measures

Binary Classification (≥10% vs. <10%): n=119 Participants
All Risk Categories: n=219 Participants
Description (both groups): To evaluate predictive performance for atherosclerotic cardiovascular disease (ASCVD) risk scores, we conducted both regression and classification analyses. Pearson correlation coefficients were used to measure how well language features predicted continuous ASCVD risk scores. For classification tasks, we assessed the area under the receiver operating characteristic curve (AUC) for models predicting both binary risk status (high versus low ASCVD risk, defined as ≥10% versus <10%) and established ASCVD risk categories (<5%, 5%-7.4%, 7.5%-9.9%, and ≥10%).

Measure: Latent Dirichlet Allocation (LDA) Topics - Topics / Themes Discussed Between Patients With and Without Heart Disease
Unit: proportion probability AUC. Interval reported for every value: 0.0 to 1.0

Model | Binary Classification (≥10% vs. <10%) | All Risk Categories
Demographics only | 0.85 | 0.87
LIWC only | 0.55 | 0.67
Unigrams only | 0.61 | 0.71
LDA only | 0.64 | 0.67
LIWC + Demographics (age, sex, race) | 0.79 | 0.80
Unigrams + Demographics (age, sex, race) | 0.61 | 0.74
LDA + Demographics (age, sex, race) | 0.81 | 0.82
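For readers unfamiliar with the metric reported above, the following sketch shows how an AUC can be computed for the binary task, using the ≥10% threshold stated in the record to define high risk. It uses the pairwise-ranking definition of AUC (the probability that a randomly chosen high-risk participant receives a higher model score than a randomly chosen low-risk participant). The risk values, model scores, and function names are hypothetical and are not study data.

```python
def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def high_risk(ascvd_risk_percent, threshold=10.0):
    """Binary label: 1 if the ASCVD risk score is at or above the threshold."""
    return 1 if ascvd_risk_percent >= threshold else 0


# Hypothetical ASCVD risk scores (percent) and model-predicted probabilities.
risk_percent = [3.2, 12.5, 8.0, 21.0, 6.1, 10.0]
model_scores = [0.10, 0.80, 0.55, 0.90, 0.20, 0.50]

labels = [high_risk(r) for r in risk_percent]
auc = roc_auc(model_scores, labels)  # 8 of 9 pairs ranked correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.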

OTHER_PRE_SPECIFIED outcome

Timeframe: Through study completion, an average of 3 years

Reliability in predicting CHD-related events in patients, as measured by the Framingham Risk Score. The Framingham Risk Score (FRS) is a validated means of predicting cardiovascular disease (CVD) risk. Input variables include age, cigarette smoking, total cholesterol, HDL cholesterol, systolic blood pressure measurement, and treatment for hypertension. Point values are calculated for each of these risk factors, and a 10-year risk score can be derived as a percentage. Risk scores range from 0-20%. Low risk: less than 10% risk that you will develop a heart attack or die from coronary disease in the next 10 years. Intermediate risk: a 10 to 20% risk that you will develop a heart attack or die from coronary disease in the next 10 years. High risk: a greater than 20% risk that you will develop a heart attack or die from coronary disease in the next 10 years.
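The three risk bands described above can be expressed as a small lookup. This is a sketch of the thresholds as worded in this record only (less than 10%, 10 to 20%, greater than 20%); it does not compute the Framingham point score itself, and the function name `framingham_category` is hypothetical.

```python
def framingham_category(risk_percent):
    """Map a 10-year risk percentage to the band described in the record."""
    if risk_percent < 0:
        raise ValueError("risk percentage cannot be negative")
    if risk_percent < 10:
        return "low"
    if risk_percent <= 20:
        # The record words this band as "10 to 20%", so both ends are included.
        return "intermediate"
    return "high"


for value in (5, 10, 20, 25):
    print(value, framingham_category(value))
```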

Outcome measures

Outcome data not reported

OTHER_PRE_SPECIFIED outcome

Timeframe: Through study completion, an average of 3 years

Prediction of health care utilization costs for heart disease versus non-heart-disease subjects, as measured by insurance claims data

Outcome measures

Outcome data not reported

Adverse Events

Case

Serious events: 0 serious events
Other events: 0 other events
Deaths: 0 deaths

Control

Serious events: 0 serious events
Other events: 0 other events
Deaths: 0 deaths

Serious adverse events

Adverse event data not reported

Other adverse events

Adverse event data not reported

Additional Information

Director of Research

University of Pennsylvania

Phone: 2674280125

Results disclosure agreements

  • Principal investigator is a sponsor employee
  • Publication restrictions are in place