This dataset was published in the UCI Machine Learning Repository. It has 303 rows and 16 columns. Using R, I am going to check for outliers and missing values, and explore the trends and relationships among the different features. For the original dataset, please click here.
This is the code used in the data wrangling step before visualising the data:
# Load required packages.
library(janitor)
library(tidyr)
library(stringr)
library(readr)
library(forcats)
library(dplyr)
library(tibble)
library(exploratory)
# Set working directory so that the script can read saved data file.
setwd("data path")
# Steps to produce the output
exploratory::read_delim_file("https://raw.githubusercontent.com/PacktWorkshops/The-Data-Analysis-Workshop/master/Chapter07/Dataset/heart.csv", delim = NULL, quote = "\"" , col_names = TRUE , na = c('') , locale=readr::locale(encoding = "UTF-8", decimal_mark = ".", tz = "Africa/Cairo", grouping_mark = "," ), trim_ws = TRUE , progress = FALSE) %>%
readr::type_convert() %>%
exploratory::clean_data_frame() %>%
# Rename the columns from their original abbreviations to more descriptive names.
rename(chest_pain = cp, rest_bp = trestbps, fast_bld_sugar = fbs, rest_ecg = restecg, st_deper = oldpeak, max_hr = thalach, ex_angina = exang, colored_vessels = ca, thalassemia = thal, cholestrol=chol)
In the previous chart, there are a few outliers above 370.
Note: the preceding boxplots show some outliers. However, they will not be imputed because the dataset is small to begin with; I only show them.
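As a rough sketch of how such outliers can be listed rather than only read off a plot, base R's `boxplot.stats()` applies the same 1.5 x IQR whisker rule that `boxplot()` uses by default. The `chol` vector below holds made-up values for illustration only; in this report it would be a numeric column of the cleaned data frame, such as `cholestrol`.

```r
# Toy cholesterol values (made up for illustration, NOT taken from the dataset).
chol <- c(180, 200, 210, 220, 230, 240, 250, 260, 270, 564)

# boxplot.stats() returns the five whisker/hinge statistics and the points
# lying more than coef * IQR beyond the hinges.
stats <- boxplot.stats(chol, coef = 1.5)

outlier_values <- stats$out            # points drawn beyond the whiskers
whisker_range  <- stats$stats[c(1, 5)] # lower and upper whisker ends

print(outlier_values)
print(whisker_range)
```

Listing the values this way makes it easy to report them without removing or imputing anything.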
We can observe that the youngest patient was 29 years old and the oldest was 77, and that the majority of patients were in their 50s and 60s. The most common age is 58.
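These summary figures can be reproduced with a few base R calls. The sketch below uses a short, made-up `age` vector (an assumption for illustration, not the real column) to show how the range and the most frequent value are obtained.

```r
# Toy ages (made up for illustration, NOT the real age column).
age <- c(29, 41, 54, 58, 58, 58, 62, 67, 77)

# Youngest and oldest patient.
age_range <- range(age)

# Most common age: tabulate the values and take the one with the highest count.
age_counts <- table(age)
most_common_age <- as.numeric(names(age_counts)[which.max(age_counts)])

print(age_range)
print(most_common_age)
```

Note that `which.max()` returns only the first maximum, so ties between equally common ages would need separate handling.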
From the previous graphs, we can conclude that 72 out of 96 female patients have been diagnosed with heart disease. The opposite holds for male patients: most of them have not been diagnosed with heart disease.
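Counts like these come from cross-tabulating sex against the diagnosis. The following is a minimal sketch on a made-up data frame; the column names `sex` and `target`, the text labels for sex, and the coding of `target` (1 = diagnosed, 0 = not diagnosed) are assumptions for illustration.

```r
# Toy data (made up for illustration, NOT the real patients).
patients <- data.frame(
  sex    = c("female", "female", "female", "male", "male", "male", "male", "female"),
  target = c(1, 1, 0, 0, 0, 1, 0, 1)
)

# Counts of diagnosed / not diagnosed within each sex.
counts <- table(sex = patients$sex, target = patients$target)

# Row-wise proportions: the share of each sex that was diagnosed.
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```

Row-wise proportions are useful here because the two groups differ in size, so raw counts alone can mislead.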
This chart shows that the most common chest pain type is typical angina, followed by non-anginal pain and then atypical angina. Most patients who had typical angina were not diagnosed with heart disease. The largest group diagnosed with heart disease had non-anginal pain.
Most patients who have 0 colored vessels have been diagnosed with heart disease, which suggests a negative correlation between the number of colored vessels and heart disease.
We can proceed by binning age into groups and then plotting each group against the presence and absence of heart disease.
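One way to do this binning is with base R's `cut()`. The sketch below is illustrative only: the data are made up, and the bin edges (decades from 20 to 80) and labels are assumptions, not necessarily the ones used for the chart in this report.

```r
# Toy data (made up for illustration, NOT the real patients).
patients <- data.frame(
  age    = c(29, 35, 44, 52, 58, 58, 63, 67, 71, 77),
  target = c(1, 1, 1, 0, 1, 0, 0, 1, 0, 0)
)

# Bin age into left-closed intervals: [20,40), [40,50), [50,60), [60,70), [70,80).
patients$age_group <- cut(
  patients$age,
  breaks = c(20, 40, 50, 60, 70, 80),
  labels = c("20-39", "40-49", "50-59", "60-69", "70-79"),
  right  = FALSE
)

# Count patients with (1) and without (0) heart disease in each age group.
group_counts <- table(age_group = patients$age_group, target = patients$target)
print(group_counts)
```

The resulting table can be passed to a grouped bar chart, for example `barplot(t(group_counts), beside = TRUE)`, to compare the two outcomes within each age group.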
This confirms the previous observation: among the youngest patients, more have been diagnosed with heart disease than have not.
A scatter plot is the workhorse for this step.
Again, we categorize cholesterol into bins and plot each bin against the presence and absence of heart disease.
From the previous correlations, slope, maximum heart rate and chest pain are positively correlated with the target column.
This dataset contains medical data of 303 patients. This report has checked for outliers, distributions and some correlations against the presence and absence of heart disease.
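To make the idea concrete, the sketch below computes the Pearson correlation of each feature with the target column using base R's `cor()`. The data frame is made up for illustration, and the column names (`max_hr`, `ex_angina`, `target`) follow the renamed columns above only by assumption; the numbers say nothing about the real dataset.

```r
# Toy data (made up for illustration, NOT the real patients).
patients <- data.frame(
  max_hr    = c(120, 130, 125, 160, 170, 150),
  ex_angina = c(1, 1, 0, 0, 0, 0),
  target    = c(0, 0, 0, 1, 1, 1)
)

# Correlate every feature column with the target column.
feature_names <- setdiff(names(patients), "target")
cors <- sapply(patients[feature_names], function(col) cor(col, patients$target))

# Sort so the most positively correlated features come first.
print(sort(cors, decreasing = TRUE))
```

Pearson correlation against a 0/1 target is only a rough screening tool, and for coded categorical features such as chest pain type the sign depends on how the categories are numbered.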