The Effects of Demographics and Geography on Health

Yoo-Jean Han

BUS 314 - Final Project

Research Questions

My research questions for this project focus on finding relationships between health and demographics, and health and geography within Medicare data. The questions are as follows:

Is there a relationship between gender and each health condition?

This analysis would determine if a patient's gender could be a potential indicator for any health conditions that would require hospitalization. This could help patients be more aware of their health and increase the focus on prevention rather than treatment, which would decrease the chance of mortality.

There are only extremely small differences between the number of males and females for each condition, so there is no distinct relationship between the two.

Is health condition affected by age range in any way?

This analysis would focus on finding a relationship between age and risk for certain health conditions. Similar to the previous question, this could also work towards increasing awareness and prevention of certain illnesses rather than treatment.

Again, there are only minute differences in the age distribution across health conditions, so there is no noticeable effect of age range on health condition.

Does race have any impact on hospitalization or mortality?

This analysis would work towards identifying any racial bias within the healthcare system, as well as any relationships between race and health. Obviously, healthcare is a very important field that must eliminate racial bias, so being able to find any possible trends is crucial. There is also a chance a certain race has a higher chance of hospitalization for some unknown reason, which would also be important to identify.

There is a very marginal difference between the proportion of hospitalization conditions by race, but race does seem to have an impact on the proportion of mortality conditions.

Are there any indications in the patient's hospitalization condition to predict mortality?

This analysis would focus on finding variables that could predict patient mortality. Knowing more about the likelihood of mortality would motivate efforts to prevent it. Additionally, finding trends in what makes a patient more likely to die would give more insight into which conditions or characteristics are deadlier, which could help spread awareness.

There seems to be some correlation between mortality and geographic area, but other factors have a very low correlation with mortality.

Are there any demographic groups that could be more likely to be hospitalized with cancer?

Looking at the United States map, cancer was the reason for most hospitalizations per state, so finding any potential indications for risk of cancer would be beneficial for the sake of cancer research, potential treatment, and prevention for the disease.

There are no strong correlations between demographics and cancer patients, but there does seem to be a weak correlation between geographic area and cancer.

Does there exist any relationship between the cause of death and patient demographics?

With this analysis, there would be an increase in insight on which groups of people have an increased risk of certain health conditions that cause death. Knowing this would bring about more discussion and awareness on why specific groups of people are more likely to suffer from certain health conditions, which could lead to more work to find a solution to this issue.

There is no significant correlation between any of the demographic categories and cause of death.

Are there any demographic trends based on geography?

This analysis would focus on finding any relationship between population demographics and location, based on this hospitalization data. This could provide insight on the geographic location of specific groups of people, which could in turn coincide with the analyses we did above, which found demographic and health trends.

There are some geographic trends with age and race: more populous states such as California, Florida, and New Jersey have a slightly older average age, and southeastern states have a higher proportion of Black people, while northern states around Montana and the Dakotas have a higher proportion of White people. The other demographics do not have any noticeable trends.

Are there any particular states or regions where hospitalization is more likely?

This analysis would determine the hospitalization rate based on state or county. This could help to know where most of the data from this dataset originated from, which could be an indication of other location-related factors, such as government funding, political affiliation, population, resources, etc. Knowing this information would create context and a much bigger picture of the data that is being analyzed.

The states with the most hospitalizations are Texas, Alabama, and Virginia, which are all considered southern states. We must take into account population size of each state as well.

What is the effect of the type of geographic area on mortality?

This analysis would work toward finding any differences between living in a rural or urban area on mortality. This could provide insight on any differences in healthcare or Medicare between these two types of areas and how that affects mortality. With this information, it would be possible to allocate any resources to areas suffering unnecessarily.

There is a much higher proportion of deaths in urban areas than in rural areas.

Is there a relationship between state and the reason for hospitalization?

This analysis would focus on the possibility that different states and geographic areas are prone to different health concerns and conditions. There could be some unknown relationship between geography and hospitalizations, which could be a sign of other environmental factors that affect health. Knowing this, there could be better steps to prevent certain conditions and hospitalizations.

There is no relationship between how rural/urban an area is and the reason for hospitalization. The proportion of each condition is also almost equal in every state, but as mentioned before, cancer is the cause for the most hospitalizations (by a very small percent) in almost all states.

Motivation and Background

Healthcare is one of the most important things to research for the betterment of humans. With healthcare data, we may be able to predict individuals with a higher risk for certain life-threatening diseases, and work towards prevention to increase chances for survival. We may also be able to find correlations between different unforeseen variables that may lead to improved awareness of what can impact human health. There are numerous benefits to improving health data, for both individual health as well as population health.

Datasets

The dataset that was used is health data from the Centers for Medicare Disparities division. This data focuses on outcome measures by state and county, and includes only data from the year 2020. There are three tables: one with geographic information on US states and counties, one with hospitalization data, and one with mortality data.

The data can be downloaded here: https://wlu.instructure.com/courses/6524/assignments/36346

Ethical Considerations of Data

The biggest ethical concern when it comes to healthcare data in general is patient privacy. Many hospitals are wary of releasing patient information for public use in case of any data leaks or violations of privacy. Additionally, the collection of this data could be considered unethical if the data is used maliciously. Data relating to patient hospitalizations and mortalities could be used in an attempt to expose the medical and/or healthcare system, which is not the intent of this data.

Methodology

The analysis used in this assignment consisted of visual analysis of data visualizations, trend analysis, and correlation analysis.

When analyzing data visualizations, the focus should be on finding any key differences between the groups that are being compared. For example, we would look for a difference in proportions or ratios, or a difference in the count of a variable. This could be an indication of a technical issue with the data or of a potential relationship between variables.

Next, for trend analysis, the key is to find any patterns in the data that could lead to insight into the variables. For this specific project, most of the trend analysis was on geographic data. The patterns found on the maps could be related to geographic region, urban and rural area, state characteristics, population, etc. Any trends could help determine the geographic effect of certain variables. In general, trends and patterns in data can be a sign of a statistical relationship between variables.

Finally, correlation analysis gives the most concrete depiction of statistical relationships between variables. The variables used for correlation analysis must be numeric or logical. The results from a correlation analysis tell us which pairs of variables are correlated with one another and how strongly. Strong correlation is indicated by values of 0.70 and above or -0.70 and below, while values close to 0 indicate no correlation. Of the three methods, correlation analysis is the most quantitative way to assess relationships within the data.
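The correlation step can also be sketched outside of Exploratory. The following is a minimal Python illustration, not the procedure actually used in this project: it computes a Pearson correlation between two columns (logical columns encoded as 0/1) and labels the result using the 0.70 threshold described above. The function names and the `none_below` cutoff of 0.10 are assumptions made for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

def strength(r, strong=0.70, none_below=0.10):
    """Label a coefficient. `strong` follows the report's threshold;
    `none_below` is an assumed cutoff for 'close to 0'."""
    if abs(r) >= strong:
        return "strong"
    if abs(r) < none_below:
        return "none"
    return "weak to moderate"

# Logical columns must be encoded as 0/1 before calling pearson().
# These six rows are made up purely to exercise the function.
is_rural = [1, 1, 0, 0, 1, 0]
died = [0, 0, 1, 1, 0, 1]
print(strength(pearson(is_rural, died)))
```

Under this labeling, the coefficients reported later in the Results (-0.42 and 0.19) would both fall in the "weak to moderate" band rather than the "strong" one.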

Results

Is there a relationship between gender and each health condition?

Analyzing this graph, we can see that there is not a very large difference between each condition among each gender. Additionally, the conditions are also quite evenly split within each gender. This indicates that there is not a very strong relationship between gender and health condition within this data. This is not particularly surprising, as conditions such as heart failure can affect everyone, since they are not hereditary or gender-specific.

Is health condition affected by age range in any way?

We can see here that the number of hospitalizations for each health condition is very similar across age ranges. The one that differs slightly is Chronic Obstructive Pulmonary Disease, or COPD, which affects a slightly larger number of individuals who are 85+ years of age and a slightly smaller number of individuals who are < 65 years of age. This coincides with the fact that COPD is known to have higher prevalence in the elderly. The other health conditions are not known to follow a similar trend, which would explain why the proportion of ages is almost equal among the other four. Therefore, besides COPD, age range does not seem to affect health condition.

Does race have any impact on hospitalization or mortality?

These two graphs show the proportion of each health condition by race for both hospitalizations and mortality. For the hospitalization data, we can see that there is an almost equal proportion of each health condition between the two races. Thus we can assume that race does not have an impact on hospitalization. However, when analyzing the mortality data, we see that there is a much clearer difference between the two causes of death among White and Black patients. A much higher percentage of Black patients experience death from heart failure compared with White patients. This may be due to the smaller sample of Black patients compared with White patients, or there could be an unknown, underlying health cause that is differentiating the two. More research may need to be done to find the cause of this, but we can deduce that race does have an impact on mortality.
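The proportions behind charts like these can be computed directly. The sketch below is only an illustration in Python (the charts in this report were made in Exploratory); the column names `race` and `cause` and the example rows are hypothetical placeholders, not values from the dataset.

```python
from collections import Counter, defaultdict

def proportions_by_group(records, group_key, value_key):
    """For each group, return the share of its rows that fall in each value."""
    counts = defaultdict(Counter)
    for row in records:
        counts[row[group_key]][row[value_key]] += 1
    result = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        result[group] = {value: n / total for value, n in counter.items()}
    return result

# Placeholder rows for illustration only; not taken from the Medicare data.
example_rows = [
    {"race": "Group A", "cause": "cause_1"},
    {"race": "Group A", "cause": "cause_2"},
    {"race": "Group B", "cause": "cause_1"},
]
shares = proportions_by_group(example_rows, "race", "cause")
print(shares)
```

Because each share is computed within its own group, groups of very different sizes can be compared, although a small group will still give noisier proportions.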

Are there any indications in the patient's hospitalization condition to predict mortality?

Looking at this correlation table, we can see that most of the variables have 0 or close to 0 correlation with each other. The one pair that stands out is death and rural geographic area, which have a correlation of -0.42. This is not an extremely high correlation, but it is noticeably stronger than the other correlations. This means that as the living area becomes more rural, the chance of mortality decreases. Another way to phrase this is that as the living area becomes more urban, the chance of mortality increases. This could be due to a number of reasons. First, urban areas have a higher population, so there could just be a higher number of people dying. Another reason is that patients in urban areas are more likely to go to a hospital to receive care, where they may die, while those in rural areas may not have as much access to healthcare. There may also be some unknown health benefits to living in rural areas, such as better air quality or less access to fast food. The second highest correlation is 0.19 between death and race. This is quite a weak relationship, but it could be an indication of some racial disparity in the healthcare system. Thus, there are no concrete indications to predict mortality, but there could be a correlation between chances of death and geographic area.

Are there any demographic groups that could be more likely to be hospitalized with cancer?

This map tells us that cancer is the leading cause of hospitalization in each state, which is why this analysis could be very important.

However, looking at this correlation map, we can see that there is almost no correlation among variables with the yn_cancer variable. This indicates that gender, race, or age have no relationship with a patient having cancer. Thus, there are no demographic groups that could be more likely to be hospitalized with cancer.

Does there exist any relationship between the cause of death and patient demographics?

This correlation map can show us if there are any meaningful relationships with the yn_heart variable, which indicates cause of death, and any of the demographic variables. Again, none of the correlations are strong, but the "strongest" correlation with yn_heart is the race variable. This could imply that there is some weak relationship between race and cause of death. This was also mentioned in Question 3, where a much higher proportion of black patients died of heart failure than white patients. However, we can see here that this is not a solid correlation between the two variables, so there are no specific relationships between patient demographics and cause of death.

Are there any demographic trends based on geography?

The average age, based on the age ranges given in the dataset, was quite similar among the states, but there are some differences in distribution. We can see that more populous states such as California, Florida, and New Jersey have a slightly older average age. Some states with a lower average age are Oregon, Oklahoma, Kentucky, Vermont, and Maine. There is no definite reason for why this is, but Florida is known to attract retired individuals, which could have an impact.
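Since the dataset records age as ranges rather than exact ages, an average age has to be estimated by assigning each range a representative value. The Python sketch below shows one way this could be done. The range labels other than "<65" and "85+" and all of the representative ages are assumptions made for illustration; they are not necessarily the values used to build the map.

```python
# Assumed representative age for each range; the bins and values are illustrative.
ASSUMED_AGE_FOR_RANGE = {"<65": 60, "65-74": 70, "75-84": 80, "85+": 88}

def average_age(range_counts, age_for_range=ASSUMED_AGE_FOR_RANGE):
    """Estimate a mean age from a mapping of age range -> number of patients."""
    total = sum(range_counts.values())
    if total == 0:
        raise ValueError("no records")
    weighted = sum(age_for_range[r] * n for r, n in range_counts.items())
    return weighted / total

# Placeholder counts for a single hypothetical state.
print(average_age({"<65": 1, "85+": 1}))
```

The estimate is sensitive to the representative ages chosen, especially for the open-ended "<65" and "85+" ranges, so small differences between states should be read with caution.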

By looking at this map, we can see that southeastern states have a higher proportion of Black people, while northern states around Montana and the Dakotas have a higher proportion of White people. This demographic trend matches general population trends, which suggests that the dataset is reasonably accurate when it comes to racial distribution. The other demographics such as gender do not have any noticeable geographic trends.

Are there any particular states or regions where hospitalization is more likely?

By analyzing this graph, we can see that the states with the most hospitalizations are Texas, Alabama, and Virginia, which are all considered southern states. There also seem to be fewer hospitalizations in the northeast region and the mountainous areas in the west. Obviously, states with higher populations will have a larger number of hospitalizations, so we must take into account the population size of each state. However, there could also be other reasons why certain states have fewer hospitalizations. The areas with a smaller number of hospitalizations are more mountainous and rural, which could be another piece of evidence that rural areas have a relationship with better health. Overall, there do seem to be some states that have a higher proportion of hospitalizations than others, but we cannot determine why that is.
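One way to take population size into account is to convert raw counts into a rate per 100,000 residents. The dataset used here does not include population figures, so the Python sketch below uses placeholder numbers for two hypothetical states; it only illustrates how the ranking of states can change once counts are normalized.

```python
def rate_per_100k(hospitalizations, population):
    """Hospitalizations per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return hospitalizations * 100_000 / population

# Placeholder figures for illustration only; they are not taken from the dataset.
example_states = {
    "State A": {"hospitalizations": 5_000, "population": 10_000_000},
    "State B": {"hospitalizations": 1_200, "population": 1_000_000},
}

rates = {
    name: rate_per_100k(v["hospitalizations"], v["population"])
    for name, v in example_states.items()
}
# Rank states from highest to lowest rate.
ranked = sorted(rates, key=rates.get, reverse=True)
print(ranked)
```

In this made-up example, State A has more hospitalizations in total, but State B has the higher rate once population is considered.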

What is the effect of the type of geographic area on mortality?

This map shows the percent of rural areas in a state, and we can compare it with the next map to find any relationship between type of area and mortality. The map shows Nevada, Utah, Colorado, and Wyoming as 0% rural, which seems to be a flaw in the data. However, the rest of the map can be considered.

We can see that the states with the most recorded deaths are Florida, Texas, and Pennsylvania. There are other states in the northeast that also have a higher number of mortalities. In the Rocky Mountain region, there seems to be on average a lower number of mortalities. Most of this data matches with the map with the number of hospitalizations per state, which makes intuitive sense. Additionally, it does seem that there is a little bit of correlation between rural states and a lower number of mortalities. Therefore, this could indicate some relationship between rural/urban areas and mortality.

Is there a relationship between state and the reason for hospitalization?

Analyzing this graph, we can see that there does not seem to be a very large difference in proportions among the five health conditions between states. Cancer does seem to have a slightly higher percentage of hospitalizations per state, but it is a very small difference. Overall, the minute differences indicate that there is not a strong relationship between state and the reason for hospitalization.

Reproducing Results

Step 1: Importing Data

Each tab can be uploaded individually into Exploratory. First, click the + button next to Data Frames, then click "File Data." Next click the "Excel File" option. Repeat these steps for each sheet in the file.

Step 2: Cleaning Data

Looking at the counties dataset, there was no missing data or anything that looked too messy, so only a couple of things needed to be cleaned. A couple of states were abbreviated, so those values were replaced to follow the same format as the others. Next, a couple of incorrect rows mixed up counties from Alaska and Alabama, so those were corrected. Then, we removed any duplicate rows to ensure one row per county/state combination.
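The counties cleaning was done through Exploratory's interface, but the same two operations (expanding abbreviated state names and removing duplicate county/state rows) can be sketched in Python. The abbreviation lookup below is an assumed, partial mapping, and the function name is hypothetical. Correcting the mixed-up Alaska and Alabama rows would still need to be done by hand and is not covered here.

```python
# Assumed lookup for the abbreviations that need expanding; extend as needed.
STATE_ABBREVIATIONS = {"AL": "Alabama", "AK": "Alaska", "VA": "Virginia"}

def clean_counties(rows):
    """rows: iterable of (county, state) tuples.
    Returns a list with state names expanded and duplicate pairs removed."""
    seen = set()
    cleaned = []
    for county, state in rows:
        state = state.strip()
        # Expand an abbreviation if it is known; otherwise keep the value as-is.
        state = STATE_ABBREVIATIONS.get(state.upper(), state)
        county = county.strip()
        key = (county.lower(), state.lower())
        if key not in seen:
            seen.add(key)
            cleaned.append((county, state))
    return cleaned

example = [("Autauga", "AL"), ("Autauga", "Alabama"), ("Juneau", "AK")]
print(clean_counties(example))
```

Expanding the abbreviations before de-duplicating matters: otherwise "Autauga, AL" and "Autauga, Alabama" would be treated as two different counties.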

Next, for the hospitalizations dataset, more steps were needed to clean the data. First, the relevant columns were determined and kept, and the others were removed. Then, there were some missing data in each column, so those were removed. Next, the state column did not have consistent values, so those were all cleaned. Similarly, some columns had some misspelled values, so those were corrected. Then, for the county column, the word "county" was removed from each row in order to be consistent with the counties table so that the tables could be joined later on. Some variables were also converted into either factor, logical, or numeric data in order to perform correlation analysis later on.
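Three of these cleaning steps translate naturally into small helper functions. The Python sketch below is only an approximation of what was done in Exploratory: the function names, the accepted "true" spellings, and the choice to drop whole rows that have a missing value are all assumptions made for illustration.

```python
import re

# Matches a trailing " county" in any capitalization.
_COUNTY_SUFFIX = re.compile(r"\s+county\s*$", re.IGNORECASE)

def strip_county_suffix(name):
    """Remove a trailing 'County' so names match the counties table."""
    return _COUNTY_SUFFIX.sub("", name.strip())

def to_logical(value, true_values=("yes", "y", "true", "1")):
    """Convert a text flag to True/False so it can be used in a correlation."""
    return str(value).strip().lower() in true_values

def drop_incomplete(rows, required):
    """Keep only rows (dicts) where every required column is present and non-empty."""
    return [r for r in rows if all(r.get(c) not in (None, "") for c in required)]

print(strip_county_suffix("Rockbridge County"))
print(to_logical("Yes"))
```

Anchoring the pattern to the end of the string means a name that merely contains the word "county" elsewhere is left unchanged.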

Finally, for the mortalities table, very similar steps as before were taken to clean this data. Only the relevant columns in the dataset were kept, values were made consistent, "county" was removed for consistency, and variables were converted.

Next, the counties table was joined to the hospitalizations table and also to the mortalities table. Finally, one large table was created by joining all three to see any trends among them.
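The join itself can be pictured as a lookup on the state and county columns. The following Python sketch shows a left join under the assumption that both tables share `state` and `county` keys; the key names, the `is_rural` column, and the example rows are hypothetical and only stand in for the real tables.

```python
def left_join(records, counties, keys=("state", "county")):
    """Attach county attributes to each record that has a matching state/county.
    Records without a match are kept unchanged."""
    lookup = {tuple(c[k] for k in keys): c for c in counties}
    joined = []
    for rec in records:
        match = lookup.get(tuple(rec[k] for k in keys), {})
        # Values from the record win if both tables share a column name.
        joined.append({**match, **rec})
    return joined

# Placeholder rows for illustration only.
counties = [{"state": "State A", "county": "County X", "is_rural": True}]
hospitalizations = [
    {"state": "State A", "county": "County X", "condition": "COPD"},
    {"state": "State B", "county": "County Y", "condition": "Cancer"},
]
joined = left_join(hospitalizations, counties)
print(joined)
```

This is why the earlier cleaning steps matter: a county spelled "County X County" or a state left as an abbreviation would silently fail to match and lose its geographic attributes.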

Step 3: Running Data Analysis

The first step for data analysis is creating the visualizations. In order to compare proportions of different variables, bar charts or pie charts are the best choice. Correlation matrices are also used to determine if any variables have relationships with one another. Only the logical and numeric data columns can be used to make matrices. The next step is to create map visualizations. The state data is preferable to the county data, since the county data does not have anything for the state of Louisiana.

Step 4: Interpreting Results

When looking at bar charts or pie charts, we are looking for any noticeable differences in proportions between different groups. This could be an indication of a relationship between variables. When looking at correlation matrices, we are looking for any significant correlation values. However, with this data, there were no correlation values above 0.50, so we are just looking for the most correlated relationships. In terms of map data, we are mostly trying to observe any trends across geography or anything that seems to stand out on the map.
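When no coefficient is strong, finding the "most correlated relationships" amounts to sorting the pairs by the absolute value of their correlation. A minimal Python sketch of that ranking follows. The two coefficients are the ones reported in the Results section; the variable names used as dictionary keys are hypothetical labels.

```python
def rank_pairs(corr):
    """corr: dict mapping (var_a, var_b) -> coefficient.
    Returns the pairs sorted from strongest to weakest by absolute value."""
    return sorted(corr.items(), key=lambda item: abs(item[1]), reverse=True)

# Coefficients from the Results section; the key names are illustrative labels.
reported = {
    ("death", "race"): 0.19,
    ("death", "rural"): -0.42,
}
for pair, r in rank_pairs(reported):
    print(pair, r)
```

Sorting by absolute value keeps strong negative relationships, such as the one between death and rural area, at the top of the list instead of the bottom.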

Reflection

This assignment helped me demonstrate all of the data analysis skills I have learned in this course. I learned how to effectively clean data, how to transform data, how to perform data analysis, how to present data effectively to a layperson, and more. This project was one of the most comprehensive data-focused assignments I have completed, where I was in charge of each step from beginning to end. I believe I have effectively analyzed the data given and communicated the conclusions. Some helpful information before beginning this assignment would be more background about Medicare, how healthcare systems record hospitalizations and mortalities, and the health conditions that were mentioned in the data. This would provide more context and understanding about the dataset, which could lead to more effective analysis. Additionally, it would be helpful to know more about correlation matrices in general in order to carry out more in-depth analysis of the relationships between variables. In the future, I would be more efficient with my data cleaning by doing the same steps all at once, instead of doing them sporadically. I would also spend more time exploring the dataset first and finding background information. Some advice for future students would be to add more data to the dataset, especially more numeric data. This could include population data, any missing data (such as the missing Louisiana county data), and numeric data pertaining to population demographics and healthcare/illnesses.