class:center,middle

# Introduction to Exploratory Data Analysis & Charts

## By: Ryan D. Watts

---

## Objectives

* What is Exploratory Data Analysis?
* How do you conduct an EDA?
* What are the common chart types, and when should you use them?
* How do you report after conducting an EDA?

---

## What is Exploratory Data Analysis?

According to ***[Sisense](https://www.sisense.com)***, a leading analytics and business intelligence company, exploratory data analysis is defined as:

> Exploratory Data Analysis (EDA) is the first step in your data analysis process. Here, you make sense of the data you have and then figure out what questions you want to ask and how to frame them, as well as how best to manipulate your available data sources to get the answers you need.
>
> *[Full Article](https://www.sisense.com/blog/exploratory-data-analysis/)*

---

By conducting an exploratory data analysis, you take the initial steps to understand and make sense of the data you are working with. At this stage in the process we are:

* Gathering an understanding of the dataset
* Identifying anomalies or errors in the data
* Cleaning/wrangling the data for further investigation
* Potentially identifying or confirming our hypothesis
* Preparing the data for forthcoming steps such as machine learning, reporting, business intelligence insights, etc.

---

## Steps to conduct an EDA

* Retrieve/obtain the dataset(s) in question. *For larger teams or a data science team, this task is often handled by a data wrangler/data engineer, but it can also be completed by an analyst or the data scientists themselves.*
* Using a subset of the data, review the data structure and meaning. For example, identify the date formats if the dataset contains dates, and check the irregularities, granularity, and consistency of the data structure.

---

### *continued...*

* Revisit the hypothesis or form a hypothesis based on the dataset.
  This is extremely important because it helps you identify whether the dataset can answer or refine the hypothesis you are testing.
* Improve the data structure by wrangling any data points that need cleanup, so that the data can support or refute the stated hypothesis.

---

## Example of EDA

Let's import the flights dataset and explore the columns.

```r
colnames(flights)
```

```
##  [1] "FL_DATE"           "CARRIER"           "FL_NUM"
##  [4] "ORIGIN"            "city"              "state"
##  [7] "DEST"              "DEST_CITY_NAME"    "DEP_TIME"
## [10] "DEP_DELAY"         "ARR_TIME"          "ARR_DELAY"
## [13] "CANCELLED"         "CANCELLATION_CODE" "AIR_TIME"
## [16] "DISTANCE"          "day_of_week"       "Name"
## [19] "Latitude"          "Longitude"         "FL_YEAR"
```

---

Now let's explore a subset of the dataset.

```r
head(flights)
```

```
## # A tibble: 6 x 21
##   FL_DATE    CARRIER FL_NUM ORIGIN city         state DEST  DEST_CITY_NAME
##   <date>     <chr>    <int> <chr>  <chr>        <chr> <chr> <chr>
## 1 2016-09-01 AA           1 BOS    Boston       MA    JFK   New York, NY
## 2 2016-09-01 AA           1 JFK    New York     NY    LAX   Los Angeles, ~
## 3 2016-09-01 AA           2 LAX    Los Angeles  CA    JFK   New York, NY
## 4 2016-09-01 AA           3 JFK    New York     NY    LAX   Los Angeles, ~
## 5 2016-09-01 AA           4 LAX    Los Angeles  CA    JFK   New York, NY
## 6 2016-09-01 AA           5 DFW    Dallas/Fort~ TX    HNL   Honolulu, HI
## # ... with 13 more variables: DEP_TIME <chr>, DEP_DELAY <int>,
## #   ARR_TIME <chr>, ARR_DELAY <int>, CANCELLED <int>,
## #   CANCELLATION_CODE <chr>, AIR_TIME <int>, DISTANCE <int>,
## #   day_of_week <ord>, Name <chr>, Latitude <dbl>, Longitude <dbl>,
## #   FL_YEAR <dbl>
```

---

We know that we want to present our data by year, so let's go ahead and create that column now.
```r
library(dplyr)

# Extract the four-digit year from FL_DATE and store it as a number
flights <- flights %>%
  mutate(FL_YEAR = as.numeric(format(FL_DATE, '%Y')))
```

---

```r
head(flights$FL_YEAR)
```

```
## [1] 2016 2016 2016 2016 2016 2016
```

```r
colnames(flights)
```

```
##  [1] "FL_DATE"           "CARRIER"           "FL_NUM"
##  [4] "ORIGIN"            "city"              "state"
##  [7] "DEST"              "DEST_CITY_NAME"    "DEP_TIME"
## [10] "DEP_DELAY"         "ARR_TIME"          "ARR_DELAY"
## [13] "CANCELLED"         "CANCELLATION_CODE" "AIR_TIME"
## [16] "DISTANCE"          "day_of_week"       "Name"
## [19] "Latitude"          "Longitude"         "FL_YEAR"
```

As we can see, we have created a new column called `FL_YEAR` and populated it with the four-digit year of each flight.

---

## Chart Types

Let's discuss different chart types and when we should use them. We will cover the following charts:

* Bar Graph
* Line Graph
* Stacked Variants
* Scatter Plot
* Bubble Chart

---

The most important step in data visualization is understanding what you are trying to present to the audience. This is usually determined during or before the Exploratory Data Analysis step. The second step is to determine the audience you are presenting to. The third step is to identify the best way to present this information to that audience. Finally, determine which charts or graphs are best suited to each audience you've identified.

---

## Bar, Line, Pie Chart

These 3 charts are best suited for comparative analysis. When comparing different categories or points of your data for a user, it is best to use one of these 3 charts. It is also important to note the number of items you are comparing. I highly suggest using a different chart if you are providing a high-level overview of comparative data points.

---

.pull-left[
]

.pull-right[
]

---
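
As a rough illustration of the comparative analysis described above, here is a minimal base R sketch of a bar chart. The `sample_flights` data frame is hypothetical: it only mimics the `CARRIER` and `DEP_DELAY` columns of the flights dataset, and its values are invented for the example rather than taken from the real data.

```r
# Hypothetical sample that mimics two columns of the flights dataset;
# the values are invented for illustration only.
sample_flights <- data.frame(
  CARRIER   = c("AA", "AA", "DL", "DL", "UA", "UA"),
  DEP_DELAY = c(10, 20, 5, 15, 30, 0)
)

# Average departure delay per carrier (one value per category).
avg_delay <- aggregate(DEP_DELAY ~ CARRIER, data = sample_flights, FUN = mean)

# A bar chart compares a single metric across a small number of categories.
barplot(
  height    = avg_delay$DEP_DELAY,
  names.arg = avg_delay$CARRIER,
  xlab      = "Carrier",
  ylab      = "Average departure delay",
  main      = "Average departure delay by carrier"
)
```

With only a handful of carriers the bars stay readable; with dozens of categories, a sorted or filtered view would likely communicate the comparison better.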
---

## Pie, Stacked Bar/Column, Area

When conducting a composition analysis, it is best to use one of the charts named above. A composition should consist of data that are similar in structure and directly comparable. For example, it is safe to use a pie chart when comparing the device types or the browser types of user visits in Google Analytics. Showing this composition allows you to quickly identify which segment of the data accounts for the largest share of the selected metric. It does not make sense to use a pie chart to compare `device type` and `browser type` in the same chart; however, it is completely safe to compare them individually.

---

.pull-left[
]

.pull-right[
]

---

## Scatter, Bar, Column

These 3 charts are best used when conducting a distribution analysis. Distribution analysis is usually done to understand the data points in greater detail as they relate to key elements of the data itself. For example, the range, tendencies, and outliers of user behavior around form submissions on a website are better displayed using distribution analysis. When a user visits page x, how likely is that user to convert on page y? To answer this question, we have to look at the behavior of other users that entered the site on page x, and compare it with the behavior of users that converted on page y. From this type of chart we would be able to draw a correlation between users entering on page x and users converting on page y, giving us insight into potential problems or benefits at the conversion point between page x and page y.

---

## Scatter, Bubble, Line

Scatter, bubble, and line charts are great for relational comparison. These 3 charts are also great for time series analysis. During an exploratory data analysis that precedes a machine learning or classification task, these charts help you see the relational dependencies in your data. To properly construct a machine learning model and conduct feature selection, it is important to understand not only the points of your data but also how those points relate to one another. Depending on the question you are trying to answer with machine learning, the outcome you predict and the method you use to train your model may be determined by the structure of, and relationships between, your data points.

---
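
The relational comparison described above can be sketched with a base R scatter plot. This is only an illustration under stated assumptions: `sample_flights` is a hypothetical data frame that mimics the `DISTANCE` and `AIR_TIME` columns of the flights dataset, and its values are invented rather than taken from the real data.

```r
# Hypothetical sample mimicking two numeric columns of the flights dataset;
# the values are invented for illustration only.
sample_flights <- data.frame(
  DISTANCE = c(187, 733, 1389, 2475, 2586, 3784),
  AIR_TIME = c(40, 110, 190, 330, 345, 480)
)

# Pearson correlation quantifies the linear relationship between the columns.
distance_time_cor <- cor(sample_flights$DISTANCE, sample_flights$AIR_TIME)

# A scatter plot shows how one variable changes with the other.
plot(
  x    = sample_flights$DISTANCE,
  y    = sample_flights$AIR_TIME,
  xlab = "Distance",
  ylab = "Air time",
  main = "Air time vs. distance"
)

# Overlay a least-squares trend line to make the relationship easier to read.
abline(lm(AIR_TIME ~ DISTANCE, data = sample_flights))
```

A strong relationship between two candidate features, as in this invented sample, is the kind of dependency worth knowing about before feature selection, since the two columns may carry largely the same information.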