Trial Outcomes & Findings for Using Digital Data to Predict CHD (NCT04574882)
NCT ID: NCT04574882
Last Updated: 2025-11-04
Results Overview
The primary outcome is topics and features (derived using latent Dirichlet allocation, LDA, for clustering language data). For each participant, we included all available Facebook wall posts from the start of their account history through data collection, regardless of whether they occurred before or after a CHD diagnosis. We examined associations between linguistic features (unigrams, LIWC categories, LDA topics) and cardiovascular case status (CHD presence vs absence) using Pearson correlation and logistic regression. LDA, a systematic method to identify text-based themes, was applied to generate 200 clusters of co-occurring words ("topics"). For each feature type (unigram, LIWC category, LDA topic), we fit separate logistic regression models and calculated Pearson correlation coefficients to assess predictive value for case status. Each language-derived feature was encoded as a normalized frequency count per user to enable consistent comparison across participants.
COMPLETED
781 participants
Through study completion, an average of 3 years
2025-11-04
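As a rough illustration of the per-user normalized frequency encoding and the Pearson correlation with case status described in the overview, the sketch below works through a toy example in Python. It is not the investigators' pipeline: the whitespace tokenizer, the function names `relative_frequencies` and `pearson`, and the four made-up users are all assumptions introduced for illustration.

```python
from collections import Counter
from math import sqrt


def relative_frequencies(posts):
    """Return each word's count divided by the user's total word count."""
    # Assumption: lowercase whitespace tokenization; the study's tokenizer is not described.
    words = [w for post in posts for w in post.lower().split()]
    total = len(words)
    if total == 0:
        return {}
    return {w: c / total for w, c in Counter(words).items()}


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant input
    return cov / sqrt(var_x * var_y)


# Made-up toy data (posts, case status); NOT study data.
users = {
    "u1": (["tired again today", "so tired"], 1),
    "u2": (["great walk today"], 0),
    "u3": (["tired of waiting", "walk later"], 1),
    "u4": (["walk walk walk"], 0),
}

feature = "tired"
freqs = [relative_frequencies(posts).get(feature, 0.0) for posts, _ in users.values()]
labels = [label for _, label in users.values()]
r = pearson(freqs, labels)
print(f"correlation of '{feature}' frequency with case status: {r:.3f}")
```

Dividing by each user's total word count keeps heavy and light posters comparable, which is the stated purpose of the normalization.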
Participant Flow
Participant milestones
Arm descriptions:
- Control: Patients aged 30-74 who have non-cardiovascular-related history.
- Case: Patients aged 30-74 with and without CHD (ICD-10: I63, I20-I25) within the last 5 years.
- Survey (both arms): Interested participants may complete the informed consent online. After informed consent, the participant will be asked to share the digital data types that they use (Facebook, Instagram, Twitter, Google search, step data) and then participants will complete a cross-sectional survey.

| Measure | Control | Case |
|---|---|---|
| Overall Study: STARTED | 455 | 326 |
| Overall Study: COMPLETED | 333 | 194 |
| Overall Study: NOT COMPLETED | 122 | 132 |
Reasons for withdrawal
Withdrawal data not reported
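The participant-flow counts can be cross-checked with a few lines of arithmetic. The sketch below, with the hypothetical helper name `completion_rate`, confirms that completed plus not-completed equals started in each arm and derives the completion proportion; the derived proportions are a convenience calculation, not figures reported in the record.

```python
# Counts copied from the participant-flow table.
flow = {
    "Control": {"started": 455, "completed": 333, "not_completed": 122},
    "Case": {"started": 326, "completed": 194, "not_completed": 132},
}


def completion_rate(arm):
    """Proportion of started participants who completed, after a consistency check."""
    if arm["completed"] + arm["not_completed"] != arm["started"]:
        raise ValueError("completed + not completed does not equal started")
    return arm["completed"] / arm["started"]


rates = {name: completion_rate(arm) for name, arm in flow.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%} completed")
```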
Baseline Characteristics
Participants were allowed to select 'Prefer not to say' for their sex. 38 case and 43 control participants selected that option.
Baseline characteristics by cohort
Group descriptions:
- Case (n=326): Patients aged 30-74 with and without CHD (ICD-10: I63, I20-I25) within the last 5 years.
- Control (n=455): Patients aged 30-74 who have non-cardiovascular-related history.
- Survey (both arms): Interested participants may complete the informed consent online. After informed consent, the participant will be asked to share the digital data types that they use (Facebook, Instagram, Twitter, Google search, step data) and then participants will complete a cross-sectional survey.
- Total (n=781): Total of all reporting groups.

Counts are numbers of participants. Sex was analyzed in the participants who reported it (n=288 case, n=412 control, n=700 total); all other rows use the full groups.

| Measure | Case (n=326) | Control (n=455) | Total (n=781) |
|---|---|---|---|
| Age, Continuous | 60.1 years (SD 9.5) | 59.5 years (SD 9.2) | 60.1 years (SD 9.8) |
| Sex: Female | 141 (n=288) | 287 (n=412) | 428 (n=700) |
| Sex: Male | 147 (n=288) | 125 (n=412) | 272 (n=700) |
| Race (NIH/OMB): American Indian or Alaska Native | 1 | 1 | 2 |
| Race (NIH/OMB): Asian | 10 | 13 | 23 |
| Race (NIH/OMB): Native Hawaiian or Other Pacific Islander | 0 | 0 | 0 |
| Race (NIH/OMB): Black or African American | 55 | 152 | 207 |
| Race (NIH/OMB): White | 217 | 254 | 471 |
| Race (NIH/OMB): More than one race | 0 | 0 | 0 |
| Race (NIH/OMB): Unknown or Not Reported | 43 | 35 | 78 |
PRIMARY outcome
Timeframe: Through study completion, an average of 3 years
Population: To evaluate predictive performance for ASCVD risk, we used regression and classification analyses. Pearson correlations assessed how language features predicted continuous risk scores. For classification, we measured AUC for binary risk (≥10% vs <10%) and standard risk categories. Three logistic regression models were tested: language alone, demographics alone, and combined. Arms/groups were combined because the goal was to predict ASCVD risk continuously and categorically across participants.
The primary outcome is topics and features (derived using latent Dirichlet allocation, LDA, for clustering language data). For each participant, we included all available Facebook wall posts from the start of their account history through data collection, regardless of whether they occurred before or after a CHD diagnosis. We examined associations between linguistic features (unigrams, LIWC categories, LDA topics) and cardiovascular case status (CHD presence vs absence) using Pearson correlation and logistic regression. LDA, a systematic method to identify text-based themes, was applied to generate 200 clusters of co-occurring words ("topics"). For each feature type (unigram, LIWC category, LDA topic), we fit separate logistic regression models and calculated Pearson correlation coefficients to assess predictive value for case status. Each language-derived feature was encoded as a normalized frequency count per user to enable consistent comparison across participants.
Outcome measures
Measure: Latent Dirichlet Allocation (LDA) Topics - Topics / Themes Discussed Between Patients With and Without Heart Disease

Description (both columns): To evaluate predictive performance for atherosclerotic cardiovascular disease (ASCVD) risk scores, we conducted both regression and classification analyses. Pearson correlation coefficients were used to measure how well language features predicted continuous ASCVD risk scores. For classification tasks, we assessed the area under the receiver operating characteristic curve (AUC) for models predicting both binary risk status (high versus low ASCVD risk, defined as ≥10% versus <10%) and established ASCVD risk categories (<5%, 5%-7.4%, 7.5%-9.9%, and ≥10%).

Unit of measure as reported: proportion probability AUC (Area Under t
Every value below is reported with Interval 0.0 to 1.0.

| Model | Binary Classification (≥10% vs. <10%), n=119 Participants | All Risk Categories, n=219 Participants |
|---|---|---|
| Demographics only | 0.85 | 0.87 |
| LIWC only | 0.55 | 0.67 |
| Unigrams only | 0.61 | 0.71 |
| LDA only | 0.64 | 0.67 |
| LIWC + Demographics (age, sex, race) | 0.79 | 0.80 |
| Unigrams + Demographics (age, sex, race) | 0.61 | 0.74 |
| LDA + Demographics (age, sex, race) | 0.81 | 0.82 |
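For readers less familiar with the AUC values reported for this outcome, the following Python sketch shows one standard way to compute an AUC for a binary target such as high versus low ASCVD risk (≥10% versus <10%). It uses the pairwise (Mann-Whitney) definition and made-up scores; the function names `is_high_risk` and `roc_auc` are hypothetical, and the record does not state which software or estimator the investigators used.

```python
def is_high_risk(ascvd_risk_percent):
    """Binary target used in the record: high risk is a 10-year ASCVD risk >= 10%."""
    return 1 if ascvd_risk_percent >= 10.0 else 0


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: ASCVD risk percentages and hypothetical model scores; NOT study data.
risk_percent = [12.5, 3.0, 10.0, 7.4]
model_scores = [0.9, 0.4, 0.3, 0.1]
labels = [is_high_risk(r) for r in risk_percent]
auc = roc_auc(labels, model_scores)
print(f"AUC = {auc:.2f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives a scale for reading the table values.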
OTHER_PRE_SPECIFIED outcome
Timeframe: Through study completion, an average of 3 years
Reliability in predicting CHD-related events in patients, as measured by the Framingham Risk Score. The Framingham Risk Score (FRS) is a validated means of predicting cardiovascular disease (CVD) risk. Input variables include age, cigarette smoking, total cholesterol, HDL cholesterol, systolic blood pressure measurement and treatment for hypertension. Point values are calculated based on each of these risks. A 10-year risk score can be derived as a percentage. Risk scores range from 0-20%. Low risk: less than 10% risk of developing a heart attack or dying from coronary disease in the next 10 years. Intermediate risk: a 10 to 20% risk of developing a heart attack or dying from coronary disease in the next 10 years. High risk: a greater than 20% risk of developing a heart attack or dying from coronary disease in the next 10 years.
Outcome measures
Outcome data not reported
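The risk bands given in this outcome's description can be written as a small lookup. This is a minimal sketch that assumes the boundaries exactly as stated in the description (below 10% low, 10 to 20% intermediate, above 20% high); the function name `framingham_category` is hypothetical, and the sketch does not compute the Framingham point score itself.

```python
def framingham_category(risk_percent):
    """Map a 10-year risk percentage to the band named in the outcome description."""
    if risk_percent < 0:
        raise ValueError("risk percentage cannot be negative")
    if risk_percent < 10:
        return "low"
    if risk_percent <= 20:
        return "intermediate"
    return "high"


for value in (4.0, 10.0, 20.0, 25.0):
    print(value, framingham_category(value))
```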
OTHER_PRE_SPECIFIED outcome
Timeframe: Through study completion, an average of 3 years
Prediction of cost for health care utilization between heart disease and non-heart disease subjects, measured by insurance claims data
Outcome measures
Outcome data not reported
Adverse Events
Case
Control
Serious adverse events
Adverse event data not reported
Other adverse events
Adverse event data not reported
Additional Information
Results disclosure agreements
- Principal investigator is a sponsor employee
- Publication restrictions are in place