Introduction
Heart disease remains one of the leading causes of death worldwide, and understanding the risk factors behind it is critical for both prevention and early intervention. In this paper, I use data from the well-known Framingham Heart Study to explore what shapes a patient’s risk of developing Coronary Heart Disease (CHD) over a ten-year period. The variable at the center of this analysis is ‘TenYearCHD’, which indicates whether a patient was considered at high risk of experiencing a heart-related event within the next decade. This binary variable (1 for high risk, 0 for low risk) serves as the outcome in a logistic regression model, which I use to understand and predict heart disease risk.
The dataset used is publicly available on Kaggle and contains a wide range of clinical and demographic information on each patient. These include medical indicators such as blood pressure, cholesterol, heart rate, and glucose levels, as well as lifestyle-related variables like smoking habits, medication use, and education level. This rich dataset allows for a deeper investigation into how various factors interact and contribute to a patient's overall risk.
The primary aim of this project is twofold: first, to investigate which variables significantly influence a patient's risk classification (specifically smoking habits), and second, to build a model capable of accurately predicting risk for new patients. This paper does not just aim to replicate what the researchers have done but also seeks to make the findings more interpretable and relatable to a broader audience.
My particular motivation for conducting this analysis stems from a curiosity about the role of smoking in heart disease. While it is well known that smoking negatively affects the lungs and respiratory system, it is equally important to highlight its role in cardiovascular health. The dataset includes not only whether a patient smokes, but also an estimate of how many cigarettes they smoke per day. This presents an opportunity to examine the nuanced impact of smoking intensity on heart disease risk. By focusing on this factor, I aim to raise awareness about the broader consequences of smoking and show how even data-driven models like logistic regression can help visualize and quantify health risks that might otherwise be underestimated.
Background Research
The International Journal of Epidemiology describes the Framingham Heart Study (the FHS) as the “First Large-Scale Cardiovascular Epidemiology Study in the Country” (Tsao, 2015). This premier longitudinal study, launched in 1948, stood out as the first of its kind, and many authors acknowledge its importance to the cardiovascular field. It is also remembered as the first study to establish relationships between Cardiovascular Diseases (CVD) and “obesity, systolic blood pressure, high cholesterol and cigarette smoking” (Tsao, 2015). So many research papers have been published on the topic because chronic heart disease is one of the leading causes of death and disability worldwide (D’Agostino et al., 2013).
The study focused on the town of Framingham, Massachusetts, and sampled 60% of adults aged 30-69 years, for a final sample of 5209 people. It collected clinical parameters as well as family medical history, X-rays, EKGs and tests. All of this was intended to capture the multifactorial nature of the disease, with a goal to “detect abnormalities thought to play a role in the development of CVD” (Tsao, 2015).
The FHS cast light on the causes of cardiovascular disease. The longitudinal structure of the survey provided an adequate format to capture and study the multifactorial “clinical parameters” that may cause CVD, and it also helps in tracking how risk increases over time, since some authors point out that older age is associated with a higher risk of developing Coronary Heart Disease (CHD). The original hypothesis of the FHS, back in 1948, was that CVD was multifactorial (D’Agostino et al., 2013).
Some of the factors associated with heart disease are age, sex, cholesterol, blood pressure and smoking (Latifah, Slamet, 2020). Age and blood pressure have been extensively researched over the last 70 years, mainly using the FHS dataset. Age is argued to be a significant risk factor for heart disease, and many authors find that younger individuals have lower cardiovascular risk and higher overall survival (Terry et al., 2005).
Along the lines of the multifactor analysis of the FHS, some authors argue that interactions between variables also have a significant effect. For example, cholesterol levels were greater in smokers than in non-smokers, meaning that risk factors can amplify one another (Tsao, 2015). Other authors found that even though systolic and diastolic blood pressure are both strong indicators of possible heart disease, they carry similar information. In 1971, a study found a trend of declining relative importance of diastolic BP and a corresponding increase in the relevance of systolic BP with advancing age, suggesting that these variables could be summarized into a single predictor (Wilson, Castelli, Kannel, 1987).
Most of these studies train and test logistic regression models, since they have good classification capability, but some opted for random forests or other regression models (Latifah, Slamet, 2020).
Research Questions
RQ1: How does smoking affect the risk of developing heart disease?
To assess how smoking influences the odds of developing heart disease in the next 10 years, I will use a logistic regression model. Regression estimates the effect of each independent variable on the dependent one through its coefficient, so to answer RQ1 it is necessary to run a regression using the smoking data and test these hypotheses:
Null Hypothesis: smoking has no significant effect on TenYearCHD.
Alternate Hypothesis: the number of cigarettes smoked per day has, on average, an effect on TenYearCHD that is statistically significant at the 5% level.
RQ2: Can we build a linear model that accurately classifies new data points?
The goal is to test and iterate toward a model that is acceptable, meaning better than guessing at random. For RQ2, the accuracy should be over 80%, and the precision and recall of the model should also be acceptable.
Null Hypothesis: no logistic regression model can classify the ten-year risk of coronary heart disease with at least 80% accuracy using the Framingham dataset.
Alternate Hypothesis: it is possible to train a logistic regression model that classifies the target variable with at least 80% accuracy.
Dataset and Variables
Each row in the dataset represents an individual patient from the town of Framingham, Massachusetts, as part of the well-known Framingham Heart Study. The dataset contains a variety of demographic, behavioral, and clinical attributes used to assess cardiovascular risk.
‘TenYearCHD’: ‘TenYearCHD’ is the target binary variable, representing whether a patient is at risk of developing coronary heart disease (CHD) within the next ten years, based on clinical assessments and historical data; a value of 1 means the patient is classified as risky. It is important to clarify that this variable reflects predicted risk. In this dataset, about 15% of the total sample (562 patients) are labeled as being at high risk for CHD. This class imbalance should be kept in mind when evaluating model performance, especially in terms of precision and recall, as the model might tend to favor predicting the majority class (no risk) without careful tuning.
‘Male’: is a binary indicator that denotes the sex of each individual in the dataset, with a value of 1 representing male. Out of the 3,750 individuals remaining in the cleaned final dataset, 2,419 are female, meaning that women make up the larger portion of the sample (64%).
‘PrevalentHyp’: is a binary indicator that shows whether a patient has a history of hypertension. A value of 1 indicates that the patient has been diagnosed with hypertension. In this dataset, 1316 individuals have a history of hypertension (35%).
‘Glucose’: represents each patient's blood glucose level and is measured as a continuous numerical value. Values range from a low of 45 to a high of 400, indicating a wide spread in the data. The distribution is notably right-skewed, meaning that while most patients have relatively normal glucose levels, a small number of individuals exhibit extremely high values. A closer look at the summary statistics reveals that the median glucose level is 78, and the mean is slightly higher at 82, which again reflects the influence of outliers on the distribution. The interquartile range (IQR) spans from 71 to 87.
An issue with this variable is the relatively high number of missing values. Nearly 10% of the entries for glucose are missing, so I opted for listwise deletion, meaning that any individual with a missing value for any variable, not just glucose, was removed from the dataset. While this approach ensures that the data used in modeling is complete, it comes at a cost: the dataset shrank by approximately 12%, leaving 3,750 of the original 4,238 individuals for analysis.
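Listwise deletion as described above can be sketched in a few lines. This is only an illustration, assuming records stored as dictionaries with `None` marking a missing value; the four rows are invented and are not taken from the Framingham data.

```python
# Toy records; None marks a missing value. These rows are invented for illustration.
rows = [
    {"age": 39, "cigsPerDay": 0, "glucose": 77},
    {"age": 46, "cigsPerDay": None, "glucose": 76},
    {"age": 48, "cigsPerDay": 20, "glucose": None},
    {"age": 61, "cigsPerDay": 30, "glucose": 103},
]


def listwise_delete(records):
    """Keep only the records in which no field is missing (None)."""
    return [r for r in records if all(v is not None for v in r.values())]


complete = listwise_delete(rows)
share_dropped = 1 - len(complete) / len(rows)
print(len(complete), "complete rows;", f"{share_dropped:.0%} of rows dropped")
```

A row is dropped as soon as any one of its fields is missing, which is why the share of rows lost can exceed the share of missing values in any single column.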
‘Age’: ranges from 32 to 70 years old, with both the mean and median at 49 years old, suggesting a roughly symmetrical distribution overall. This dataset is not the longitudinal FHS; each row is a single person. There is a heavier concentration of younger individuals, particularly those between the ages of 32 and 45. As people grow older, their expected likelihood of developing cardiovascular conditions increases.
‘CigsPerDay’: represents the number of cigarettes smoked per day by each patient and ranges from 0 to 70 cigarettes. 2144 individuals report zero cigarettes per day, meaning that they do not smoke. Among the smokers, there is a noticeable concentration of patients who smoke 20 cigarettes a day, with a smaller group consuming 30 cigarettes daily. There were 29 missing values (NA). The distribution of this variable is skewed, with a heavy mass at 0, which highlights the challenge of modeling continuous variables when many individuals report no smoking at all.
‘TotChol’: ‘Total Cholesterol’ represents the cholesterol level in the blood, recorded as a continuous variable ranging from 100 to 700 mg/dL. This variable had 50 missing values. With a median of 234 mg/dL, somewhat fewer than half of the individuals have cholesterol levels exceeding 240 mg/dL, which is typically considered a health risk threshold for cardiovascular disease. The mean cholesterol level is 236 mg/dL, slightly above the median, indicating a relatively symmetric distribution with a slight right skew.
‘SysBP’: represents the systolic blood pressure, which measures the pressure in the arteries when the heart beats. This is an essential indicator of cardiovascular health. The values in the dataset range from 70 to 300 mmHg. The average systolic blood pressure is 132 mmHg, the median is 128 mmHg (so half of the individuals are at or below this level), and the mode is 120 mmHg, making 120 the most commonly observed value in the dataset. Blood pressure readings higher than 140 mmHg are generally considered a sign of hypertension, which significantly increases the risk of heart disease, stroke, and other cardiovascular problems.
‘DiaBP’: is the diastolic blood pressure, which measures the pressure in the arteries when the heart rests between beats. The values in the dataset range from 40 to 150 mmHg. The mean diastolic blood pressure is 83 mmHg and the median is 82 mmHg (so half of the individuals are at or below this level), indicating a fairly symmetric distribution. High diastolic blood pressure (generally considered above 90 mmHg) is associated with an increased risk of heart disease and stroke, especially when combined with high systolic blood pressure.
Healthy individuals typically have a sysBP under 120 mmHg and a diaBP under 80 mmHg.
Hypertension tends to be diagnosed when sysBP is between 130 and 180 mmHg and diaBP is between 80 and 110 mmHg, and a hypertensive crisis occurs when sysBP exceeds 180 mmHg and diaBP exceeds 120 mmHg.
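The cut-offs quoted above can be written as a small classification rule. The sketch below applies them literally; the function name `bp_category` and the "unclassified" label for readings that fall between the quoted ranges are assumptions made for illustration, not clinical guidance.

```python
def bp_category(sys_bp, dia_bp):
    """Classify one blood pressure reading (mmHg) using the cut-offs quoted in the text."""
    if sys_bp > 180 and dia_bp > 120:
        return "hypertensive crisis"
    if 130 <= sys_bp <= 180 and 80 <= dia_bp <= 110:
        return "hypertension"
    if sys_bp < 120 and dia_bp < 80:
        return "healthy"
    # Assumption: readings that fit none of the quoted ranges get a neutral label.
    return "unclassified"


for reading in [(115, 75), (150, 95), (190, 125), (125, 78)]:
    print(reading, "->", bp_category(*reading))
```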
‘Education’: a categorical feature ranging from 1 to 4, indicating the patient’s highest level of education. Education level might not be directly relevant to health outcomes in this specific context. I chose to discard this variable early in the analysis.
‘CurrentSmoker’: binary indicator that identifies whether a patient currently smokes. Including this binary variable in the model seems redundant since the cigarettes per day data is more insightful to quantify the relationship between smoking status and the likelihood of being classified as high-risk for heart disease.
‘BPMeds’: binary indicator that shows whether a patient is currently taking medication for blood pressure. A value of 1 indicates that the patient takes medication, while a value of 0 indicates that the patient is not.
‘PrevalentStroke’: binary indicator that denotes whether a patient has a history of stroke. Only 25 individuals in the whole dataset have a stroke recorded, which represents a very small proportion of the sample. Given the low prevalence of this condition, including the variable in the model would not provide much predictive power.
‘Diabetes’: binary indicator that shows whether a patient has been diagnosed with diabetes. In this dataset, only 109 individuals have diabetes, while the remaining 4,129 do not. Although diabetes is a major risk factor for coronary heart disease, the relatively low number of individuals with diabetes in this dataset may limit its impact in the logistic regression model.
‘BMI’: continuous index of a patient's body mass, calculated using height and weight. There are 19 missing values in this column.
‘HeartRate’: continuous measure of a patient’s resting heart rate, with values in the dataset ranging from 40 to 140 beats per minute (bpm).
Correlations
The correlation analysis revealed several notable relationships among the continuous variables. On the negative side, there were modest inverse correlations between age and the number of cigarettes smoked per day (-0.19), as well as between systolic blood pressure (SysBP) and cigarette consumption (-0.09), suggesting that younger individuals with lower blood pressure tend to smoke more. On the positive side, age showed moderate correlations with total cholesterol (0.26), systolic blood pressure (0.39), and prevalent hypertension history (0.31), while systolic blood pressure was also positively correlated with total cholesterol (0.22) and prevalent hypertension history (0.70). When examining the relationship between the continuous variables and the target variable, TenYearCHD, the highest correlations were found with age (0.23), systolic blood pressure (0.22), glucose (0.12), and total cholesterol (0.09), indicating these factors are more closely associated with coronary heart disease risk. Taken together, these correlations suggest that older individuals tend to have higher cholesterol, higher blood pressure and higher glucose levels, and to smoke somewhat fewer cigarettes on average. This group is also associated with a higher predicted risk on average.
Among the binary variables, some meaningful patterns also emerged. There were negative correlations between being a current smoker and having a history of hypertension (-0.10), between taking blood pressure medication and being male (-0.05), and between BP medication use and current smoker (-0.05). Positive correlations were observed between blood pressure medication and hypertension history (0.26), between male and current smoking status (0.20), and between having diabetes and hypertension history. In terms of the target variable, the strongest binary correlations were found with prevalent hypertension (0.18) and male gender (0.10), both of which showed some association with increased coronary heart disease risk.
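The coefficients reported in this section are pairwise correlations. As a minimal sketch, assuming they are Pearson coefficients, the calculation looks like this; the toy values are invented and do not come from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented toy values in which smoking falls as age rises.
age = [35, 42, 50, 58, 66]
cigs_per_day = [20, 15, 10, 5, 0]
print(round(pearson_r(age, cigs_per_day), 3))
```

The same formula applies when one of the two variables is a 0/1 indicator, which is how a binary variable such as ‘Male’ can be correlated with the target.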
Risky Patients Subset
After this initial assessment of the variables, I created a subset of the data containing only individuals assigned risk level 1, meaning those estimated by the Framingham study to be at risk of CHD within ten years. I refer to this subset as the Risky People Analysis. Some of the correlations show most clearly within it, such as systolic vs diastolic blood pressure (correlation of 0.79), age vs cigarettes smoked per day (-0.19), and age vs pulse pressure (0.39), where pulse pressure is systolic BP minus diastolic BP. SysBP and DiaBP are highly positively correlated, and there is a clear trend of both variables increasing together. For the risky people, sysBP ranges from 75 to 300 and diaBP ranges from 50 to 150. The risky people who had a history of hypertension have a much higher average of both blood pressures than those who did not, which is expected, since a diagnosis of hypertension implies higher BP values.
In the relationship between Age and CigsPerDay, there is a clear decline in the number of cigarettes smoked with increasing age, consistent with the negative correlation between the two variables. It is plausible that, as they age, more patients choose to smoke less or quit. Additionally, the group of risky people differs from the original dataset in how many people smoke: there are far fewer ‘0’ values, which suggests that smoking may be one of the indicators that determine a patient's risk level. For the patients with a history of hypertension, the number of cigarettes smoked per day is considerably lower than for those without a history of hypertension.
The last variable relationship is between age and pulse pressure (PP), a calculated variable obtained by subtracting diaBP from sysBP. In this form, we can summarize the effect of both blood pressures and compare that calculated variable to age. The pulse pressure ranges from 35 to 70. There is a clear positive correlation between age and PP, which is expected because older patients tend to have higher blood pressure, which could lead to other health problems. Again, patients with a history of hypertension have a much higher average pulse pressure, since hypertension is linked to higher blood pressure values.
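The pulse pressure calculation and the comparison by hypertension history can be sketched as follows. The four readings are invented for illustration, and the helper names are hypothetical.

```python
def pulse_pressure(sys_bp, dia_bp):
    """Pulse pressure in mmHg: systolic minus diastolic blood pressure."""
    return sys_bp - dia_bp


# Invented (sysBP, diaBP, prevalentHyp) triples, not real patients.
patients = [(120, 80, 0), (130, 85, 0), (160, 95, 1), (180, 100, 1)]


def mean_pp(records, hyp_flag):
    """Average pulse pressure among records with the given hypertension flag."""
    values = [pulse_pressure(s, d) for s, d, hyp in records if hyp == hyp_flag]
    return sum(values) / len(values)


print("no hypertension history:", mean_pp(patients, 0))
print("hypertension history:", mean_pp(patients, 1))
```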
Methodology
The first research question is: how does smoking cigarettes affect the probability of being at risk of developing heart disease? The null hypothesis is that smoking has no effect on the level of risk assigned to the individual. To test this hypothesis I will use a logistic regression model, which is appropriate for a binary outcome variable and continuous or categorical predictors such as the number of cigarettes smoked per day. The model coefficients are expressed in log-odds, so they cannot be directly interpreted as an increased or decreased likelihood of heart disease, but the odds ratio, obtained by exponentiating a coefficient, indicates the impact of one variable on the outcome. The second research question is whether we can predict the risk of CHD. To test this, I decided to fit 4 different logistic regressions to see which is the best model in terms of accuracy and R².
To test these two hypotheses we must establish thresholds. For the first one, I expect that the ‘currentSmoker’ binary variable does not have a significant impact on the risk of CHD, but that ‘cigsPerDay’ may have a statistically significant effect. If this is the case, the odds of being assigned the high-risk level are higher for those who smoke, and higher still for those who smoke a lot. The second hypothesis tests for predictability. The model that predicts most correctly should have an accuracy of at least 80%, and a high R² too. If this turns out to be the case, then our model could very well be used to assign a level of risk to a new data point.
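To make the link between coefficients, probabilities and odds concrete, here is a minimal sketch of a one-predictor logistic model. The intercept and slope are hypothetical values chosen for illustration; they are not estimates from this analysis.

```python
import math


def predicted_probability(intercept, slope, x):
    """Logistic model: turn the linear predictor (log-odds) into a probability."""
    log_odds = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds(p):
    """Odds that correspond to a probability p."""
    return p / (1.0 - p)


# Hypothetical coefficients for a model with cigarettes per day as the only predictor.
INTERCEPT, SLOPE = -2.0, 0.02

p10 = predicted_probability(INTERCEPT, SLOPE, 10)
p11 = predicted_probability(INTERCEPT, SLOPE, 11)
odds_ratio = odds(p11) / odds(p10)
print(round(p10, 4), round(p11, 4), round(odds_ratio, 4))
```

Whatever starting value of the predictor is chosen, the ratio of the odds for a one-unit increase equals the exponential of the slope, which is why the odds ratio is a convenient summary of a coefficient.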
Models
To train these models, it is important to randomly split the dataset into two sets, train and test, with an 80-20 split: 80% of the data is used to train the model, and the remaining 20% forms the test set used to evaluate how the model reacts to new data points.
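A random 80-20 split can be sketched with the standard library alone. The helper name `train_test_split`, the seed, and the use of row indices in place of real patient records are all assumptions made for illustration.

```python
import random


def train_test_split(rows, test_share=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


# Row indices standing in for the 3,750 patients in the cleaned dataset.
train_set, test_set = train_test_split(range(3750))
print(len(train_set), len(test_set))
```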
Model 1: Logistic Regression
The first model includes all the variables in our data. Only some variables prove to have a significant effect on the outcome variable. These are: age, male, cigs per day, prevalentHyp, totChol, sysBP and glucose.
It is important to test the multicollinearity assumption, which calls for a VIF (Variance Inflation Factor) analysis of the variables used in the model. All variables should have a VIF equal to or lower than 3, and the analysis for this model shows that both ‘sysBP’ and ‘diaBP’ exceed this threshold. This suggests that the model may be suffering from multicollinearity, meaning that these variables are strongly correlated with one or more other predictors in the model, or with each other. Multicollinearity can inflate the standard errors of the coefficients, making them less reliable and potentially distorting the interpretation of the model. To improve the model, it may be necessary to remove or combine correlated predictors.
The variables that the model determines have a statistically significant effect are male, age, cigsPerDay, prevalentHyp, totChol, sysBP, and glucose. However, these coefficients have to be transformed into odds ratios to be interpreted accurately: an odds ratio describes how the odds of the outcome change when the variable increases by one unit. Odds ratios are compared to 1. If the ratio is close to 1, the effect is minimal or zero; if it is greater than 1, there is an associated increase in the odds of the event occurring; and if it is less than 1, the odds of the event happening are, on average, diminished.
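For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below shows only this last step. As a simplifying assumption it uses the two-predictor case, where R² is just the squared correlation between the pair, and borrows the 0.79 correlation between sysBP and diaBP reported for the risky subset purely as an example input.

```python
def vif(r_squared):
    """Variance inflation factor, given the R^2 of a predictor regressed on the others."""
    return 1.0 / (1.0 - r_squared)


# A VIF threshold of 3 corresponds to an R^2 of 2/3.
print(round(vif(2 / 3), 2))

# Two-predictor simplification: R^2 equals the squared pairwise correlation.
print(round(vif(0.79 ** 2), 2))
```

In a model with many predictors, R² can only be at least as large as in the two-predictor case, so the VIF reported for the full model can be higher than this simplified figure.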
The logistic regression results suggest that the odds of being classified as high risk for coronary heart disease (CHD) are 65% higher for males compared to females. This means that, holding other variables constant, being male significantly increases the likelihood of CHD risk. Age has an odds ratio of about 1.067, meaning that for each additional year of age, the odds of being at risk increase by 6.7%. This is particularly relevant since the youngest individuals in the dataset are 32 years old, and risk appears to increase steadily with age.
The variable CigsPerDay shows that for each additional cigarette smoked per day, the odds of CHD risk increase by 2.0%. This reflects the compounding effect of daily smoking on heart health over time. Having a history of hypertension (PrevalentHyp) raises the odds of being at risk by 30%, emphasizing that past or existing blood pressure problems are a strong predictor of future heart conditions. Total cholesterol (totChol) has a smaller but still measurable effect, with an odds ratio of about 1.0022 per unit. This means that for every 1 mg/dL increase in cholesterol, the odds of CHD risk increase by 0.22%, highlighting the cumulative effect of having high cholesterol levels.
For systolic blood pressure (sysBP), the odds increase by 1.5% per mmHg. This indicates that higher systolic pressure is meaningfully associated with greater CHD risk, aligning with medical guidelines that classify elevated blood pressure as a key risk factor. Lastly, glucose has an odds ratio of about 1.0094 per unit, meaning that for every 1 mg/dL increase in blood glucose, the odds of being considered high-risk go up by 0.94%, underscoring how metabolic health also contributes to cardiovascular outcomes.
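The percentages in this section follow from the model coefficients through a simple transformation: the odds ratio is the exponential of the coefficient, and the percentage change in the odds is the odds ratio minus one. A minimal sketch, with hypothetical helper names, that also backs out the coefficient implied by the reported 6.7% per year for age:

```python
import math


def percent_change_in_odds(coefficient):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (math.exp(coefficient) - 1.0) * 100.0


def coefficient_from_percent(percent):
    """Log-odds coefficient implied by a reported percentage change in the odds."""
    return math.log(1.0 + percent / 100.0)


# Coefficient implied by the 6.7% increase in odds per year of age reported above.
beta_age = coefficient_from_percent(6.7)
print(round(beta_age, 4), round(percent_change_in_odds(beta_age), 1))
```

A negative coefficient gives an odds ratio below 1 and therefore a negative percentage change, while a coefficient of zero gives an odds ratio of exactly 1.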
Model 2: Stepwise Logistic Regression
The second model uses the first model as an input and removes variables based on a criterion that can be specified. In this case, the stepwise procedure selects predictors based on the AIC (Akaike Information Criterion), and the iteration with the lowest AIC value is returned. The direction of the stepwise search matters: the forward approach starts with an empty model and adds predictors, while the backward approach starts with all of them and removes variables; both directions are usually used at once to find the best-fitting model. This best model is the one that retains the fewest variables while still explaining a significant amount of the variance.
After running the stepwise model, the resulting regression output showed that the Variance Inflation Factor (VIF) values were all below the threshold of 3, including for variables like ‘sysBP’ which previously indicated potential multicollinearity concerns. Specifically, no VIF value exceeds 2.0, which suggests that the model no longer suffers from problematic levels of multicollinearity, meaning that the predictors included in the final model are not excessively correlated with one another. The variables that the stepwise model kept are the ones from the first model that had a significant effect on the predicted variable: male, age, cigsPerDay, prevalentHyp, totChol, sysBP, and glucose.
The odds ratios of the stepwise model are similar to those of the first model. The odds of being classified as high risk for coronary heart disease (CHD) are still 65% higher for males compared to females. The effect of age rose slightly, to a 6.8% increase in the odds per additional year. CigsPerDay shows that for each additional cigarette smoked per day, the odds of CHD risk increase by 2.2%. Having a history of hypertension (PrevalentHyp) has a smaller effect than before but still raises the odds of being at risk by 28.6%. Systolic blood pressure (sysBP) kept a similar effect, with the odds increasing by 1.4% per mmHg. Total cholesterol (totChol) maintained a small effect, with the odds increasing by 0.20% per mg/dL. Lastly, the effect of glucose decreased slightly, to a 0.90% increase in the odds per unit.
The results from the logistic regression show that the model I built is significantly better at predicting heart disease risk than a model with no variables at all. In statistical terms, the reduction in ‘deviance’ (302.5) tells us how much better the model is at explaining the outcome, and the improvement is quite large. This improvement was obtained with the 7 key predictors that were selected. To check whether it could have happened just by chance, I used a chi-squared test that returned an extremely small p-value (basically zero), meaning it is very unlikely that the results are random. Overall, this tells us that the model is useful and meaningful for identifying patients at higher risk of heart disease.
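The selection criterion itself is simple to state: AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The sketch below only compares candidate models by AIC and does not perform the stepwise search; the log-likelihoods and parameter counts are hypothetical numbers chosen for illustration.

```python
def aic(log_likelihood, n_parameters):
    """Akaike Information Criterion: lower values indicate a better trade-off."""
    return 2 * n_parameters - 2 * log_likelihood


# Hypothetical (log-likelihood, parameter count incl. intercept) per candidate model.
candidates = {
    "all 13 predictors": (-1400.0, 14),
    "7 predictors": (-1402.0, 8),
    "3 predictors": (-1450.0, 4),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "->", best)
```

In this toy comparison the 7-predictor model wins because dropping six parameters costs only a small loss of fit, while the 3-predictor model fits too poorly for its simplicity to compensate.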
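The chi-squared test described above can be reproduced by hand. This is a sketch under two assumptions: that the test has 7 degrees of freedom, one per selected predictor, and that a closed-form expression for the upper tail, which only holds for odd degrees of freedom, is acceptable in place of a statistics library.

```python
import math


def chi2_sf_odd_df(x, df):
    """Upper-tail probability P(X > x) of a chi-squared variable with odd df."""
    if df < 1 or df % 2 != 1:
        raise ValueError("this closed form only covers odd degrees of freedom")
    chi = math.sqrt(x)
    term, total = chi, 0.0
    # Sum chi + chi^3/3 + chi^5/15 + ... with (df - 1) / 2 terms.
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    tail = math.erfc(chi / math.sqrt(2))
    return tail + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total


# Deviance reduction of 302.5 reported above, tested on 7 degrees of freedom.
p_value = chi2_sf_odd_df(302.5, 7)
print(p_value)
```

For comparison, the 5% critical value with 7 degrees of freedom is about 14.07, so a reduction of 302.5 lies far beyond it.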
Model 3: Logistic Regression with Interaction Terms
To strengthen the predictive power of my model while keeping it grounded in clinical logic, I selected four additional variables based on well-established patterns in cardiovascular research. First, I introduced a binary variable for individuals with a diastolic blood pressure over 110 (High diaBP), as this level is clinically recognized as dangerously high and strongly associated with hypertensive crisis, making it a meaningful threshold for risk assessment. I also categorized age into three distinct groups: young adults (32–45 years old), middle-aged adults (45–60 years old), and older adults (60–70 years old). This reflects how heart disease risk may accelerate in non-linear ways across life stages; the aim was to test whether being old has a different impact than being young, rather than assuming that risk rises by the same amount with each additional year. Next, I created a binary variable for heavy smokers, defined as those who smoke more than 10 cigarettes per day. This cut-off is informed by studies showing significantly elevated cardiovascular risk among heavier smokers, and the distribution of cigsPerDay also showed this group as distinct from occasional or light smokers. Finally, I added an interaction term between ‘prevalent hypertension’ and ‘male’ to capture sex-based differences in how high blood pressure contributes to heart disease. These additions allow the model to reflect more nuanced patterns of risk, consistent with the literature, without overfitting or relying on overly complex variable structures.
The variables that the model identified as statistically significant include: male, total cholesterol, systolic blood pressure, glucose, high diastolic blood pressure, age group indicators for ‘old’ and ‘young’, heavy smoker status, and the interaction between male and hypertension. Notably, variables such as the regular age variable, the "middle-aged" category, cigarettes per day, prevalent hypertension and diastolic blood pressure were excluded. This is likely because they overlap closely with the newly engineered variables, so including all of them would add redundant information; the "middle-aged" category, for instance, acts as the reference group for the ‘old’ and ‘young’ indicators. The VIF analysis indicated that systolic blood pressure (sysBP) has a value exceeding 3, suggesting the model is affected by multicollinearity.
The odds of being at risk for coronary heart disease are 51% higher for males compared to females. Individuals with high diastolic blood pressure (greater than 110) have 54.8% higher odds of being at risk. The odds of being at risk increase by 64% for those categorized as "old" (ages 60-70), while the odds decrease by 64.3% for "young" individuals (ages 32-45) when compared to the reference group. Heavy smokers (smoking more than 10 cigarettes per day) have 67.2% higher odds of being at risk for coronary heart disease. Additionally, the odds for males with a history of hypertension are 28.6% higher compared to others. For every one mmHg increase in systolic blood pressure, the odds of being at risk increase by 1.6%. For every unit increase in glucose, the odds of being at risk increase by 0.97%, while for total cholesterol, the odds increase by 0.26% for every unit increase.
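The engineered variables described above can be sketched as a single transformation of one patient record. Two details are assumptions made for illustration: the text lists 45 and 60 in two age groups each, so this sketch places age 45 in the middle-aged group and age 60 in the older group, and the function name `engineer_features` is hypothetical.

```python
def engineer_features(row):
    """Return a copy of one patient record with the derived columns added."""
    out = dict(row)
    out["highDiaBP"] = int(row["diaBP"] > 110)
    # Assumed boundary handling: 45 counts as middle-aged, 60 counts as old.
    out["young"] = int(row["age"] < 45)
    out["old"] = int(row["age"] >= 60)
    out["heavy_smoker"] = int(row["cigsPerDay"] > 10)
    # Interaction term: 1 only for males with a history of hypertension.
    out["male_hypertension"] = row["male"] * row["prevalentHyp"]
    return out


# Invented example patient, not a row from the dataset.
patient = {"age": 62, "male": 1, "prevalentHyp": 1, "diaBP": 115, "cigsPerDay": 20}
print(engineer_features(patient))
```

Middle-aged patients are identified by having both `young` and `old` equal to 0, so no separate indicator is needed for them.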
Model 4: Stepwise Logistic Regression with Interaction Terms
The fourth model uses the third model as an input and selects or removes predictors based on the AIC (Akaike Information Criterion), iterating until the lowest AIC value is obtained. This model is the one that retains the fewest variables while still explaining a significant amount of the variance. The Variance Inflation Factor (VIF) values were below 3 for all variables, including ‘sysBP’, which previously indicated potential multicollinearity. Specifically, no VIF value exceeds 1.7, which suggests that the model does not have problematic levels of multicollinearity. The variables that the stepwise model kept are the ones from the third model that had a significant effect on the predicted variable: male, cigsPerDay, prevalentHyp, totChol, sysBP, glucose, highDiaBP, old, young, heavy_smoker and male_hypertension.
The odds ratios of this stepwise model are similar to those of the third model. The odds of being classified as high risk for coronary heart disease (CHD) are 53% higher for males compared to females. Old individuals have 64% higher odds, while young individuals are still expected to have odds that are 64% lower. For people who smoke more than 10 cigarettes a day (HeavySmokers), the increase in the odds of CHD risk is even larger in the stepwise model, rising from 67% to 71%. If the patient is male and also has a history of hypertension (Male_Hypertension), the odds of being a risky patient are still 28% higher. Systolic blood pressure (sysBP) kept the same effect, with the odds increasing by 1.6% per mmHg. Total cholesterol (totChol) maintained its smaller effect, with the odds increasing by 0.25% per mg/dL. Glucose kept a similar effect to the third model, with an increase of 0.92% per unit.
Results
RQ1: How does smoking affect the risk of developing heart disease?
Across all four models, the variable ‘currentSmoker’ lost statistical significance when ‘cigsPerDay’ was included, suggesting that the number of cigarettes smoked daily captures more meaningful variation than simply whether someone smokes. Interestingly, the ‘cigsPerDay’ variable itself had a skewed distribution, with nearly half of the observations being zero, effectively preserving the information from the original binary smoker variable while adding granularity. In both Models 1 and 2, ‘cigsPerDay’ showed a statistically significant effect (p-value < 0.001) on the likelihood of being classified as high risk, with odds ratios of 1.020 and 1.022, respectively. This implies that each additional cigarette smoked per day is associated with an increase of approximately 2.0% to 2.2% in the odds of being classified as high-risk, holding other factors constant. Models 3 and 4 replaced ‘cigsPerDay’ with a binary variable ‘Heavy_Smoker’, defined as smoking 10 or more cigarettes per day. This new variable was also statistically significant in both models, with increases in odds of 67.2% and 70.8% (odds ratios of 1.672 and 1.708), indicating a substantial increase in risk for heavy smokers compared to those who smoke less or not at all.
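A per-cigarette effect of about 2% can look small, but in a logistic model the effect of each additional unit multiplies rather than adds. The sketch below compounds the 2.2% per-cigarette estimate over several daily amounts. The function name is hypothetical, and the calculation assumes the per-cigarette effect stays constant across the whole range, which is what the model specification implies but not something the data are guaranteed to follow.

```python
def odds_multiplier(pct_per_unit: float, units: float) -> float:
    """Odds ratio for a change of `units`, given a percent change in odds per unit."""
    return (1.0 + pct_per_unit / 100.0) ** units

# 2.2% higher odds per additional cigarette per day (the Model 2 estimate)
ten_per_day = odds_multiplier(2.2, 10)
twenty_per_day = odds_multiplier(2.2, 20)

for cigs, ratio in [(10, ten_per_day), (20, twenty_per_day)]:
    pct = (ratio - 1.0) * 100.0
    print(f"{cigs} cigarettes/day vs none: odds ratio {ratio:.3f} ({pct:.1f}% higher odds)")
```

Under this assumption, 10 cigarettes a day corresponds to roughly 24% higher odds than not smoking, and 20 a day to roughly 55% higher odds.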
RQ2: Can we build a linear model that accurately classifies new data points?
To evaluate whether a logistic regression model can classify future coronary heart disease (CHD) cases with at least 80% accuracy, four different models were constructed and compared using a range of performance metrics. Model 1 included all 13 available variables, while Model 2 applied stepwise regression to Model 1, reducing the number of predictors to 7. Similarly, Model 3 extended the feature set by engineering new binary and interaction variables, resulting in 14 predictors, and Model 4 applied stepwise regression to Model 3, ending with 9 variables.
All four models demonstrated high accuracy, ranging from 84.8% to 85.5%, surpassing the 80% threshold outlined in the alternative hypothesis. This suggests that a logistic regression model can, at least in terms of accuracy, outperform random guessing and provide meaningful classification of individuals at risk of developing CHD within ten years. However, accuracy alone is insufficient when evaluating predictive performance on an imbalanced dataset like this one, where non-events (no CHD) dominate. As such, precision and recall offer additional insight.
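One way to see why accuracy can mislead on imbalanced data is to compute the accuracy of a trivial classifier that always predicts the majority class. The sketch below does this for a hypothetical label vector in which 15% of cases are positive. That prevalence is an assumption chosen for illustration and is not a figure reported in this paper.

```python
from collections import Counter

def majority_baseline_accuracy(labels) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical outcome vector: 85 low-risk patients (0) and 15 high-risk patients (1)
labels = [0] * 85 + [1] * 15

baseline_accuracy = majority_baseline_accuracy(labels)
print(f"always predicting 'no CHD' gives accuracy {baseline_accuracy:.2f}")
```

When the majority class is this dominant, a model has to be judged against that baseline rather than against 50%, which is the reason precision and recall are examined next.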
Precision, the proportion of predicted positives that are actually positive, ranged from 0.425 to 0.487 across models. Model 4 achieved the highest precision, indicating that nearly half of its CHD-positive predictions were correct. While promising, this still means that just over half of its predicted cases are false positives. Recall, or sensitivity, was more concerning. All four models returned low recall values (approximately 0.157 to 0.176), meaning that the models missed around 82% to 84% of actual CHD cases. This suggests the models are far more likely to correctly predict who won’t get CHD than who will.
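The relationship between these metrics is easiest to see from a confusion matrix. The counts below are hypothetical and were chosen only so that the resulting values land close to the ranges reported above. They are not the actual confusion matrix of any of the four models.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical test set: 100 true CHD cases, 600 non-cases
metrics = classification_metrics(tp=17, fp=18, fn=83, tn=582)

for name, value in metrics.items():
    print(f"{name:9s} {value:.3f}")
```

With these counts the classifier is right about 86% of the time overall, yet it flags only 17 of the 100 true cases, which mirrors the pattern of high accuracy and low recall described above.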
The McFadden’s R² values, which range from 11.14% to 11.76%, suggest that while the models do better than a null model, they offer only a modest improvement in fit over it. McFadden’s R² is a pseudo-R² based on log-likelihoods, so it should not be read as a literal share of variance explained. The small variation in McFadden’s R² across models indicates that the feature engineering in Model 3 and its stepwise reduction in Model 4 did not dramatically improve explanatory power over the original models.
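McFadden's R² compares the log-likelihood of the fitted model with that of an intercept-only model. The sketch below shows the calculation. Both log-likelihood values are hypothetical and were picked so that the result falls near the reported range. They are not values taken from the model output.

```python
def mcfadden_r2(loglik_model: float, loglik_null: float) -> float:
    """McFadden's pseudo-R²: 1 - ln(L_model) / ln(L_null)."""
    return 1.0 - loglik_model / loglik_null

# Hypothetical log-likelihoods (both negative, as is usual for binary outcomes)
loglik_null = -1292.5   # intercept-only model
loglik_model = -1140.5  # fitted model

r2 = mcfadden_r2(loglik_model, loglik_null)
deviance_drop = 2.0 * (loglik_model - loglik_null)  # null deviance minus residual deviance

print(f"McFadden R² = {r2:.4f}")
print(f"drop in deviance = {deviance_drop:.1f}")
```

The same two quantities also give the drop in deviance relative to the null model, so the pseudo-R² and the deviance difference are two views of the same improvement in fit.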
The Akaike Information Criterion (AIC) provides a measure of model quality that penalizes complexity. Model 2 had the lowest AIC (2301), suggesting that it offers the most efficient trade-off between fit and simplicity. The difference in deviance, how much the model improves from the null model, was highest for Model 1 (304) and lowest for Model 4 (288), though the differences are relatively minor. The Area Under the Curve (AUC) values also hovered around 0.72–0.74, with Model 2 slightly outperforming the rest (0.7428), confirming that all models perform moderately well in distinguishing between positive and negative CHD cases.
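The AUC has a direct interpretation: it is the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case. The sketch below computes it by comparing every positive-negative pair. The predicted probabilities are made up for illustration, and the function name is hypothetical. Statistical libraries normally use a faster rank-based calculation that gives the same value.

```python
def auc_pairwise(pos_scores, neg_scores) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical predicted probabilities of CHD
pos_scores = [0.90, 0.40, 0.35]        # patients who developed CHD
neg_scores = [0.80, 0.30, 0.20, 0.10]  # patients who did not

auc = auc_pairwise(pos_scores, neg_scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to ranking pairs no better than chance, and 1.0 to ranking every pair correctly, so values around 0.72 to 0.74 indicate moderate discrimination.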
In conclusion, the findings support the alternative hypothesis that a logistic regression model can predict CHD risk with over 80% accuracy. However, the low recall across all models raises concerns about their ability to reliably identify at-risk individuals. While Model 4 offers the highest precision and accuracy, Model 2 performs best in terms of overall model quality (lowest AIC and highest AUC). Future iterations could explore alternative modeling techniques, such as random forests or penalized logistic regression, to improve recall without sacrificing accuracy or interpretability.
Conclusion
All four models estimated the effect of smoking, but only Models 3 and 4 estimated the effect of being a heavy smoker, which increases a patient’s odds of being classified as ‘at risk’ of developing heart disease within ten years by roughly 70%. The effect of heavy smoking is statistically significant (the reported p-value rounds to 0).
Model 4 emerges as the best-performing model overall. It achieves the highest accuracy, precision, and recall among all four models tested. While it includes more variables than the most minimal models, its enhanced predictive power makes it the most effective at correctly identifying individuals at risk. The model strikes a solid balance between complexity and performance, demonstrating strong classification ability in a real-world health context.
This analysis showed that understanding the data and drawing on the medical research can improve a simple logistic regression and achieve better metrics. In this case, some continuous variables could be replaced with binary factors indicating whether a value was high: diastolic blood pressure and cigarettes smoked per day became the binary factors high diastolic BP and heavy smoker. Grouping ‘age’ into categories also made the effects more interpretable. The three new age groups indicated that being young lowered the risk and being old increased the risk of having heart disease. Not only did these new variables improve the predictive performance of the original model, but they also made the results more interpretable and the higher odds easier to understand. The models all showed a McFadden’s R² over 10%, which is considered standard for the medical field and this type of classification study.
Both null hypotheses were rejected, and the alternative hypotheses were supported by the results: it is possible to build a good logistic regression model, and being a heavy smoker has a significant effect on the risk of having heart issues within ten years.
Visualizations
Histogram: Age Variable Distribution, separated by risk group ‘TenYearCHD’
Histogram: CigsPerDay Variable Distribution, showing how many individuals smoke each number of cigarettes per day.
Histogram: totChol Variable Distribution, showing the total cholesterol values of the patients.
Histogram: sysBP Variable Distribution, showing the systolic blood pressure of the patients, separated into the two groups of patients: risky and not risky.
Histogram: diaBP Variable Distribution, showing the diastolic blood pressure of the patients, separated into the two groups of patients: risky and not risky.
Histogram: BMI Variable Distribution, showing the distribution of the Body Mass Index variable.
Histogram: HeartRate Variable Distribution, showing the heart rate values of all the patients.
Matrix: Correlations of Continuous Variables, showing the correlation between all of the continuous variables in the dataset.
Matrix: Correlations of Binary Variables, showing the correlation between all of the binary variables in the dataset.
Matrix: Correlations of All Variables
ScatterPlot: SysBP versus DiaBP, legend = PrevalentHyp, using the subset of risky individuals (TenYearCHD == 1). The graph shows how the two variables relate to each other, their trends, and the differences between the two groups: people with and without hypertension.
ScatterPlot: Age versus CigsPerDay, legend = PrevalentHyp, using the subset of risky individuals (TenYearCHD == 1). The graph shows how the two variables relate to each other, their trends, and the differences between the two groups: people with and without hypertension.
ScatterPlot: Age versus PP (Pulse Pressure), legend = PrevalentHyp, using the subset of risky individuals (TenYearCHD == 1). The graph shows how the two variables relate to each other, their trends, and the differences between the two groups: people with and without hypertension.
Regression Summary: Model 1 logistic regression output table.
VIF Analysis Summary: Model 1 logistic regression Variance Inflation Factor table.
Odds Ratio: Model 1 logistic regression odds ratio table.
Regression Summary: Model 2 stepwise regression output table.
VIF Analysis Summary: Model 2 stepwise regression Variance Inflation Factor table.
Odds Ratio: Model 2 stepwise regression odds ratio table.
Regression Summary: Model 3 logistic regression with interaction terms and new variables output table.
VIF Analysis Summary: Model 3 logistic regression Variance Inflation Factor table.
Odds Ratio: Model 3 logistic regression odds ratio table.
Regression Summary: Model 4 stepwise regression output table.
VIF Analysis Summary: Model 4 stepwise regression Variance Inflation Factor table.
Odds Ratio: Model 4 stepwise regression odds ratio table.