Trial Outcomes & Findings for Using Digital Data to Predict CHD (NCT04574882)

Last Updated: 2025-11-04

Results Overview

The primary outcome is the set of topics and features derived from participants' language, with topics generated by Latent Dirichlet Allocation (LDA), a method for clustering language data. For each participant, we included all available Facebook wall posts from the start of their account history through data collection, regardless of whether they occurred before or after a CHD diagnosis. LDA, a systematic method for identifying text-based themes, was applied to generate 200 clusters of co-occurring words ("topics"). Each language-derived feature was encoded as a normalized frequency count per user to enable consistent comparison across participants. For each feature type (unigrams, LIWC (Linguistic Inquiry and Word Count) categories, LDA topics), we fit separate logistic regression models and calculated Pearson correlation coefficients to assess its association with, and predictive value for, cardiovascular case status (CHD presence vs absence).
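As a minimal sketch of the per-user encoding and correlation step described above, the following computes relative word frequencies for each user and correlates one feature with case status. It assumes simple whitespace tokenization and a hypothetical four-user dataset; neither is taken from the study, and the function names are illustrative only.

```python
from collections import Counter
from math import sqrt


def relative_frequencies(posts):
    """Return each token's share of all tokens a user wrote."""
    # Assumption: lowercase + whitespace split; the study's tokenizer is not stated.
    tokens = [tok for post in posts for tok in post.lower().split()]
    total = len(tokens)
    if total == 0:
        return {}
    return {tok: n / total for tok, n in Counter(tokens).items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: user -> (posts, case status; 1 = CHD, 0 = no CHD).
users = {
    "u1": (["tired again today", "so tired"], 1),
    "u2": (["great run today"], 0),
    "u3": (["tired of the rain", "doctor visit today"], 1),
    "u4": (["fun day at the lake"], 0),
}


def feature_vs_status(word):
    """Correlate one unigram's per-user relative frequency with case status."""
    freqs = [relative_frequencies(posts).get(word, 0.0) for posts, _ in users.values()]
    status = [label for _, label in users.values()]
    return pearson(freqs, status)


r_tired = feature_vs_status("tired")
```

Dividing by each user's total token count is what makes heavy and light posters comparable; the same encoding could be applied to LIWC category counts or LDA topic weights.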

Recruitment status

COMPLETED

Target enrollment

781 participants

Primary outcome timeframe

Through study completion, an average of 3 years

Results posted on

2025-11-04

Participant Flow

Participant milestones

Measure	Control Patients aged 30-74 who have a non-cardiovascular-related history. Survey: Interested participants may complete the informed consent online. After informed consent, the participant will be asked to share the digital data types that they use (Facebook, Instagram, Twitter, Google search, step data) and then participants will complete a cross-sectional survey.	Case Patients aged 30-74 with and without CHD (ICD-10: I63, I20-I25) within the last 5 years. Survey: Interested participants may complete the informed consent online. After informed consent, the participant will be asked to share the digital data types that they use (Facebook, Instagram, Twitter, Google search, step data) and then participants will complete a cross-sectional survey.
Overall Study STARTED	455	326
Overall Study COMPLETED	333	194
Overall Study NOT COMPLETED	122	132

Reasons for withdrawal

Withdrawal data not reported

Baseline Characteristics

Participants were allowed to select 'Prefer not to say' for their sex; 38 case and 43 control participants selected that option.

Baseline characteristics by cohort

Measure	Case n=326 Participants Patients aged 30-74 with and without CHD (ICD-10: I63, I20-I25) within the last 5 years. Survey: Interested participants may complete the informed consent online. After informed consent, the participant will be asked to share the digital data types that they use (Facebook, Instagram, Twitter, Google search, step data) and then participants will complete a cross-sectional survey.	Control n=455 Participants Patients aged 30-74 who have a non-cardiovascular-related history. Survey: Interested participants may complete the informed consent online. After informed consent, the participant will be asked to share the digital data types that they use (Facebook, Instagram, Twitter, Google search, step data) and then participants will complete a cross-sectional survey.	Total n=781 Participants Total of all reporting groups
Age, Continuous	60.1 years STANDARD_DEVIATION 9.5 • n=326 Participants	59.5 years STANDARD_DEVIATION 9.2 • n=455 Participants	60.1 years STANDARD_DEVIATION 9.8 • n=781 Participants
Sex: Female, Male Female	141 Participants n=288 Participants	287 Participants n=412 Participants	428 Participants n=700 Participants
Sex: Female, Male Male	147 Participants n=288 Participants	125 Participants n=412 Participants	272 Participants n=700 Participants
Race (NIH/OMB) American Indian or Alaska Native	1 Participants n=326 Participants	1 Participants n=455 Participants	2 Participants n=781 Participants
Race (NIH/OMB) Asian	10 Participants n=326 Participants	13 Participants n=455 Participants	23 Participants n=781 Participants
Race (NIH/OMB) Native Hawaiian or Other Pacific Islander	0 Participants n=326 Participants	0 Participants n=455 Participants	0 Participants n=781 Participants
Race (NIH/OMB) Black or African American	55 Participants n=326 Participants	152 Participants n=455 Participants	207 Participants n=781 Participants
Race (NIH/OMB) White	217 Participants n=326 Participants	254 Participants n=455 Participants	471 Participants n=781 Participants
Race (NIH/OMB) More than one race	0 Participants n=326 Participants	0 Participants n=455 Participants	0 Participants n=781 Participants
Race (NIH/OMB) Unknown or Not Reported	43 Participants n=326 Participants	35 Participants n=455 Participants	78 Participants n=781 Participants

PRIMARY outcome

Timeframe: Through study completion, an average of 3 years

Population: To evaluate predictive performance for ASCVD risk, we used regression and classification analyses. Pearson correlations assessed how language features predicted continuous risk scores. For classification, we measured AUC for binary risk (≥10% vs <10%) and standard risk categories. Three logistic regression models were tested: language alone, demographics alone, and combined. Arms/groups were combined because the goal was to predict ASCVD risk continuously and categorically across participants.
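The AUC used for the classification tasks can be read as the probability that a randomly chosen high-risk participant receives a higher model score than a randomly chosen low-risk participant. The sketch below computes it by direct pairwise comparison; the labels and scores are hypothetical, not study data, and the function name is illustrative.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = 10-year ASCVD risk >= 10%) and model scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)  # 7 of the 9 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the values in the outcome table are reported.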

Outcome measures

Measure	Binary Classification (≥10% vs. <10%) Binary Classification n=119 Participants To evaluate predictive performance for atherosclerotic cardiovascular disease (ASCVD) risk scores, we conducted both regression and classification analyses. Pearson correlation coefficients were used to measure how well language features predicted continuous ASCVD risk scores. For classification tasks, we assessed the area under the receiver operating characteristic curve (AUC) for models predicting both binary risk status (high versus low ASCVD risk, defined as ≥10% versus <10%) and established ASCVD risk categories (<5%, 5%-7.4%, 7.5%-9.9%, and ≥10%).	All Risk Categories n=219 Participants To evaluate predictive performance for atherosclerotic cardiovascular disease (ASCVD) risk scores, we conducted both regression and classification analyses. Pearson correlation coefficients were used to measure how well language features predicted continuous ASCVD risk scores. For classification tasks, we assessed the area under the receiver operating characteristic curve (AUC) for models predicting both binary risk status (high versus low ASCVD risk, defined as ≥10% versus <10%) and established ASCVD risk categories (<5%, 5%-7.4%, 7.5%-9.9%, and ≥10%).
Latent Dirichlet Allocation (LDA) Topics - Topics / Themes Discussed Between Patients With and Without Heart Disease Demographics only	0.85 AUC (proportion probability) • Interval 0.0 to 1.0	0.87 AUC (proportion probability) • Interval 0.0 to 1.0
Latent Dirichlet Allocation (LDA) Topics - Topics / Themes Discussed Between Patients With and Without Heart Disease LIWC only	0.55 AUC (proportion probability) • Interval 0.0 to 1.0	0.67 AUC (proportion probability) • Interval 0.0 to 1.0
Latent Dirichlet Allocation (LDA) Topics - Topics / Themes Discussed Between Patients With and Without Heart Disease Unigrams only	0.61 AUC (proportion probability) • Interval 0.0 to 1.0	0.71 AUC (proportion probability) • Interval 0.0 to 1.0
Latent Dirichlet Allocation (LDA) Topics - Topics / Themes Discussed Between Patients With and Without Heart Disease LDA only	0.64 AUC (proportion probability) • Interval 0.0 to 1.0	0.67 AUC (proportion probability) • Interval 0.0 to 1.0
Latent Dirichlet Allocation (LDA) Topics - Topics / Themes Discussed Between Patients With and Without Heart Disease LIWC + Demographics (age, sex, race)	0.79 AUC (proportion probability) • Interval 0.0 to 1.0	0.80 AUC (proportion probability) • Interval 0.0 to 1.0
Latent Dirichlet Allocation (LDA) Topics - Topics / Themes Discussed Between Patients With and Without Heart Disease Unigrams + Demographics (age, sex, race)	0.61 AUC (proportion probability) • Interval 0.0 to 1.0	0.74 AUC (proportion probability) • Interval 0.0 to 1.0
Latent Dirichlet Allocation (LDA) Topics - Topics / Themes Discussed Between Patients With and Without Heart Disease LDA + Demographics (age, sex, race)	0.81 AUC (proportion probability) • Interval 0.0 to 1.0	0.82 AUC (proportion probability) • Interval 0.0 to 1.0
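
The two classification targets in this table differ only in how a continuous 10-year ASCVD risk score is binned. A minimal sketch of that binning follows, assuming risk is expressed in percent and that values between the printed category bounds (for example 7.45%) fall into the lower category; the example scores are hypothetical and the function names are illustrative.

```python
from collections import Counter


def ascvd_category(risk_percent):
    """Map a 10-year ASCVD risk (percent) to one of the four risk categories."""
    # Assumption: cut points at 5, 7.5 and 10 applied to a continuous score.
    if risk_percent < 5.0:
        return "<5%"
    if risk_percent < 7.5:
        return "5%-7.4%"
    if risk_percent < 10.0:
        return "7.5%-9.9%"
    return ">=10%"


def is_high_risk(risk_percent):
    """Binary target: high risk is a 10-year ASCVD risk of 10% or more."""
    return risk_percent >= 10.0


# Hypothetical risk scores in percent (not study data).
risks = [3.2, 6.0, 7.5, 9.9, 10.0, 24.1]
category_counts = Counter(ascvd_category(r) for r in risks)
binary_labels = [int(is_high_risk(r)) for r in risks]
```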

OTHER_PRE_SPECIFIED outcome

Timeframe: Through study completion, an average of 3 years

Reliability in predicting a CHD-related event in a patient, as measured by the Framingham Risk Score. The Framingham Risk Score (FRS) is a validated means of predicting cardiovascular disease (CVD) risk. Input variables include age, cigarette smoking, total cholesterol, HDL cholesterol, systolic blood pressure measurement and treatment for hypertension. Point values are calculated based on each of these risks. A 10-year risk score can be derived as a percentage. Risk scores range from 0-20%. Low risk: a less than 10% risk of developing a heart attack or dying from coronary disease in the next 10 years. Intermediate risk: a 10 to 20% risk of developing a heart attack or dying from coronary disease in the next 10 years. High risk: a greater than 20% risk of developing a heart attack or dying from coronary disease in the next 10 years.
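
The three risk bands described above can be expressed as a small lookup. This is a sketch of the thresholds only, assuming the 10-year risk has already been derived as a percentage; it does not compute the Framingham point score itself, and the function name is illustrative.

```python
def framingham_category(risk_percent):
    """Classify a 10-year Framingham risk percentage into the bands above."""
    if risk_percent < 0 or risk_percent > 100:
        raise ValueError("risk must be a percentage between 0 and 100")
    if risk_percent < 10:
        return "Low"            # less than 10%
    if risk_percent <= 20:
        return "Intermediate"   # 10% to 20%, inclusive
    return "High"               # greater than 20%
```

For example, a derived risk of 10% is classified as intermediate, and 20.1% as high.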

Outcome measures

Outcome data not reported

OTHER_PRE_SPECIFIED outcome

Timeframe: Through study completion, an average of 3 years

Prediction of the cost of health care utilization for heart disease versus non-heart disease subjects, measured by insurance claims data

Outcome measures

Outcome data not reported

Adverse Events

Case

Serious events: 0 serious events

Other events: 0 other events

Deaths: 0 deaths

Control

Serious events: 0 serious events

Other events: 0 other events

Deaths: 0 deaths

Serious adverse events

Adverse event data not reported

Other adverse events

Adverse event data not reported

Additional Information

Director of Research

University of Pennsylvania

Phone: 2674280125

Email: [email protected]

Results disclosure agreements

Principal investigator is a sponsor employee
Publication restrictions are in place