“Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as the things we believe might be there.” - quoted in Exploratory Data Analysis Tukey PDF on Nonparametric Statistical Data Modeling.
This data science blog explores what exploratory data analysis (EDA) is, why performing EDA matters when solving data science problems, the various exploratory data analysis techniques you can use in machine learning projects, and an example of implementing exploratory data analysis in Python.
Exploratory Data Analysis (EDA) is best described as an approach for finding patterns, spotting anomalies or differences, and identifying other features that best summarise the main characteristics of a data set.
This approach involves the use of various EDA techniques, many of which are data visualization methods, to glean insights into the data, validate the assumptions on which we will base our future inferences, and even determine parsimonious models, which describe the data with the minimum number of variables.
However, it is important to remember that Exploratory Data Analysis is not merely a set of techniques, steps, or rules. In the words of the Engineering Statistics Handbook, Exploratory Data Analysis is ‘a philosophy’ of how the data is to be analyzed.
And despite its apparent importance (after all, who wouldn’t want more efficient models!), this lack of rigid structure and its elusive nature mean that EDA isn’t used nearly as often as it should be.
Let’s methodically demystify the EDA concept, starting from the basic differences between EDA and classical data analysis and moving through the exploratory data analysis steps and techniques, sprinkled with a few simple Python examples you can try yourself. Hopefully, by the end, you’ll agree that EDA is not as intimidating as it is often made out to be when working on data science and machine learning projects.
Why not just call EDA plain classical data analysis? If the answer isn’t obvious already, let’s put it out there: because it’s not!
Since you’ve made it this far, you won’t have to take that statement with a pinch of salt. While you may have already sensed some of the differences between classical data analysis and EDA, the following explanation should give you a clearer picture.
EDA is indeed a data analysis approach. However, it differs starkly from the classical approach in the way it addresses a problem and seeks a solution to it.
In the classical approach, the model is imposed on the data and the analysis and testing follow. With EDA, on the other hand, the collected data set is first analyzed to infer what model would be best suited for the data by investigating its underlying structure.
EDA is a data-focused approach, both in its structure and in the models it suggests. Classical data analysis, on the flip side, is aimed at generating predictions from models and is generally quantitative in nature. The rigidity and formality that are prevalent in classical techniques are absent in EDA. The two also differ in the way they deal with information: classical estimation techniques focus on only a few important characteristics, resulting in a loss of information, whereas EDA techniques make almost no assumptions and often make use of all the available data.
Now that we have gone over the differences in a theoretical sense, let’s try to wrap it up with an Exploratory Data Analysis Python example.
Say you have been given the following dataset, based on a customer survey on how the proportion of nuts to chocolate in ice cream affects their preference for a particular brand:
If you were to adopt classical data analysis, you would be able to quickly establish the positive linear relation between the proportion of chocolate and preference and then analyze and test the predictions of your machine learning model.
Or you might even choose to establish the negative relation between the proportion of nuts and preference. (You can obtain this graph by replacing ‘Chocolate’ in the above code block with ‘Nuts’ and using m=-0.12 and c=12.25.)
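The classical, model-first workflow can be sketched roughly as follows. This is a minimal illustration rather than the original listing: the `nuts` and `preference` values are hypothetical stand-ins for the survey data, and only the slope and intercept (m=-0.12, c=12.25) come from the text. The straight line is imposed on the data first, and its fit is checked afterwards.

```python
# Hypothetical survey data (not the article's dataset): proportion of nuts
# in percent, and the percentage of respondents who preferred that mix.
nuts = [10, 20, 30, 40, 50, 60, 70, 80, 90]
preference = [11.0, 10.1, 8.4, 7.6, 6.0, 5.3, 3.7, 2.9, 1.2]

M, C = -0.12, 12.25  # slope and intercept quoted in the text for 'Nuts'


def predict(x, m=M, c=C):
    """Classical step: a straight-line model is imposed on the data."""
    return m * x + c


predicted = [predict(x) for x in nuts]
residuals = [y - p for y, p in zip(preference, predicted)]

# Coefficient of determination: share of the variation the line explains.
mean_pref = sum(preference) / len(preference)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - mean_pref) ** 2 for y in preference)
r_squared = 1 - ss_res / ss_tot

print(f"R^2 of the imposed line: {r_squared:.3f}")
```

A high R² here only says that the assumed line fits; it says nothing about outliers, gaps, or skew, which is where EDA comes in.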
This is all good, and in this case the machine learning model fits quite well. Your approach to exploratory data analysis, however, wouldn’t be so straightforward. Your goal here, after all, is to open-mindedly explore and question not only what is in the data but also what is not.
In this pursuit, there are no limits to the questions you can ask. You could seek to understand, for example, whether there are outliers or whether your distribution of preference is skewed.
And you might choose a box plot to help you with that:
Alternatively, you might choose to plot a histogram of preference across different proportions of nuts.
With this, you may even wonder whether those who preferred more than 70-80% of nuts and those who preferred less than 15% of nuts (Hello allergens!) were sufficiently represented. Additionally, even if the obvious outlier is ignored, you might have noticed that the distribution seems quite skewed.
If it isn’t obvious enough already, the routes you can take here to uncover the underlying structure of your data and to essentially ‘listen’ to the data are infinite. The questions you may choose to ask may differ from person to person, but with enough experience, you should be able to arrive at similar conclusions.
That being said, despite the myriad avenues you could explore, there are a few commonly used techniques in Exploratory Data Analysis, and knowing them will help you get started.
EDA relies so heavily on statistical graphics that the two terms have come to be used almost synonymously. The reason for this heavy reliance on graphics is directly related to the fact that graphics complement the natural pattern recognition capabilities that humans possess. The added aid that graphical techniques provide to uncover structural secrets of patterns is what has made them an indispensable tool in the quest to gain new insights into data.
This does not mean that EDA makes no use of the quantitative techniques of classical analysis. The commonly used EDA techniques can therefore be broadly classified as graphical techniques and quantitative techniques.
Some of the commonly used graphical techniques are:
Box plots can be used to display the distribution of the dataset in a standardized way, based on a summary of five numbers, namely: the minimum, the first quartile (25th percentile), the median, the third quartile (75th percentile), and the maximum.
The Interquartile Range (IQR) is the difference between the 75th and the 25th percentiles.
Box plots, although primitive, are useful to identify outliers and also to check whether a distribution is skewed. The position of the median relative to the first quartile and the third quartile indicates the skew in the variable’s distribution while the spacing between the different parts of the box plot serves to pictorially represent the spread.
We can obtain a box plot of the ‘Preference’ of the given data as follows:
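A minimal sketch of such a box plot with matplotlib, assuming the ‘Preference’ column is available as a plain list. The values below are hypothetical (the last one is a deliberately planted outlier), and the output file name is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for an interactive window
import matplotlib.pyplot as plt

# Hypothetical 'Preference' values (not the article's data); 30.0 is a planted outlier.
preference = [1.2, 2.9, 3.7, 5.3, 6.0, 7.6, 8.4, 10.1, 11.0, 30.0]

fig, ax = plt.subplots()
box = ax.boxplot(preference)  # by default whiskers extend up to 1.5 * IQR from the box
ax.set_xticklabels(["Preference"])
ax.set_ylabel("Preference (%)")
ax.set_title("Box plot of Preference")
fig.savefig("preference_boxplot.png")

# Points drawn beyond the whiskers are the candidate outliers.
outliers = list(box["fliers"][0].get_ydata())
median = box["medians"][0].get_ydata()[0]
print("median:", median, "outliers:", outliers)
```

If the median line sits closer to the bottom of the box than to the top, the upper half of the data is more spread out, which hints at a right-skewed distribution.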
Histograms can be used to summarize both continuous and discrete data. They help to visualize the data distribution. They serve especially well to indicate gaps in data and even outliers.
The given data provides the percentage of people who prefer a particular proportion of nuts. Therefore, the histogram for this can be obtained as follows:
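One possible sketch with matplotlib is shown here. Because the data is already aggregated (one percentage per proportion of nuts), the percentages are passed as `weights` so that each bar's height equals the share of respondents. The numbers and the bin edges are hypothetical choices made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for an interactive window
import matplotlib.pyplot as plt

# Hypothetical aggregated survey data (not the article's dataset).
nuts = [10, 20, 30, 40, 50, 60, 70, 80, 90]                   # proportion of nuts (%)
preference = [11.0, 10.1, 8.4, 7.6, 6.0, 5.3, 3.7, 2.9, 1.2]  # respondents (%)

bin_edges = list(range(5, 100, 10))  # 5, 15, ..., 95: one bin per nuts level

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(nuts, bins=bin_edges, weights=preference, edgecolor="black")
ax.set_xlabel("Proportion of nuts (%)")
ax.set_ylabel("Preference (%)")
ax.set_title("Preference by proportion of nuts")
fig.savefig("preference_histogram.png")
```

With raw, unaggregated responses you would drop `weights` and let the histogram count the observations itself.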
Unlike the previous methods, which are univariate (i.e. involving only a single variable), a scatter plot reveals the relationship between two variables. The relationships reveal themselves as structures in the plot, such as lines or curves, that cannot simply be explained as randomness.
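As an illustration, a scatter plot of preference against the proportion of nuts could be drawn along these lines. The data is hypothetical, and the axis labels and file name are assumptions made for this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for an interactive window
import matplotlib.pyplot as plt

# Hypothetical survey data (not the article's dataset).
nuts = [10, 20, 30, 40, 50, 60, 70, 80, 90]                   # proportion of nuts (%)
preference = [11.0, 10.1, 8.4, 7.6, 6.0, 5.3, 3.7, 2.9, 1.2]  # respondents (%)

fig, ax = plt.subplots()
points = ax.scatter(nuts, preference)  # one marker per (nuts, preference) pair
ax.set_xlabel("Proportion of nuts (%)")
ax.set_ylabel("Preference (%)")
ax.set_title("Preference vs. proportion of nuts")
fig.savefig("preference_scatter.png")
```

Markers that fall roughly along a descending line would suggest a negative linear relationship, while a curve or distinct clusters would suggest something a straight-line model would miss.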
Quantitative techniques are very similar to graphical techniques in the information they present and differ mainly in how they present it. They are, therefore, used less often not because they are inferior but because of personal preference or convenience. Some of the commonly used quantitative techniques are:
Determining the variance or other related parameters of a data set describes the spread of the data or how far the values are from the center. Every variable will have its own unique pattern of variation and investigating this can often lead to interesting findings.
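A small sketch of these spread measures using Python's standard `statistics` module. The `preference` values are hypothetical and stand in for one column of the survey data.

```python
import statistics

# Hypothetical 'Preference' values (not the article's data).
preference = [11.0, 10.1, 8.4, 7.6, 6.0, 5.3, 3.7, 2.9, 1.2]

mean = statistics.mean(preference)
sample_variance = statistics.variance(preference)  # sample variance, divides by n - 1
sample_std = statistics.stdev(preference)          # square root of the sample variance
value_range = max(preference) - min(preference)    # crude but quick measure of spread

print(f"mean={mean:.2f} variance={sample_variance:.2f} "
      f"std={sample_std:.2f} range={value_range:.2f}")
```

The standard deviation is usually the easier number to read because it is in the same units as the variable itself.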
The Analysis of Variance (ANOVA) test is widely used in the testing of experimental data. It compares the variation between group means with the variation within the groups.
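To make the idea concrete, here is a sketch of the one-way ANOVA F statistic computed by hand. The three groups of ratings are hypothetical, and the helper name `one_way_anova_f` is made up for this example. In practice a library routine such as `scipy.stats.f_oneway` would also return the p-value.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    k, n = len(groups), len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Hypothetical preference ratings for three levels of nuts.
low_nuts = [8, 9, 7, 8]
mid_nuts = [6, 5, 7, 6]
high_nuts = [3, 4, 2, 3]

f_statistic = one_way_anova_f([low_nuts, mid_nuts, high_nuts])
print(f"F = {f_statistic:.2f}")
```

A large F value means the group means differ by more than the scatter within the groups would explain.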
A statement that is assumed to be true unless there is strong evidence contradicting it is called a ‘statistical hypothesis’. These statements can be certain assumptions regarding the data set. The process used to determine whether such a proposition is true is termed ‘hypothesis testing’.
Hypothesis testing is accomplished in a series of steps. In this process, a null hypothesis, which is initially assumed to be true, is replaced by an alternative hypothesis if the testing results in the null hypothesis being rejected. This is done by comparing a quantitative measure called the ‘test statistic’, which shows whether sample data is in agreement with the null hypothesis, to a critical value to decide on the rejection of the null hypothesis.
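The steps above can be sketched with a one-sample t-test worked by hand. Everything specific here is an assumption made for illustration: the `preference` values are hypothetical, the null hypothesis (a mean preference of 5.0) is invented, and the critical value 2.306 is the two-sided 5% value of Student's t for 8 degrees of freedom, taken from a standard t-table.

```python
import math
import statistics

# Hypothetical 'Preference' values (not the article's data).
preference = [11.0, 10.1, 8.4, 7.6, 6.0, 5.3, 3.7, 2.9, 1.2]

HYPOTHESIZED_MEAN = 5.0  # null hypothesis: the true mean preference is 5.0
T_CRITICAL = 2.306       # two-sided 5% critical value, 8 degrees of freedom

n = len(preference)
sample_mean = statistics.mean(preference)
standard_error = statistics.stdev(preference) / math.sqrt(n)

# Test statistic: how many standard errors the sample mean lies from the null value.
t_statistic = (sample_mean - HYPOTHESIZED_MEAN) / standard_error

# Reject the null hypothesis only if the statistic falls beyond the critical value.
reject_null = abs(t_statistic) > T_CRITICAL
print(f"t = {t_statistic:.3f}, reject null: {reject_null}")
```

Failing to reject the null hypothesis does not prove it true; it only means this sample offers no strong evidence against it.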
The EDA techniques we have gone over in this section are by no means an exhaustive list of techniques that can be used for accomplishing EDA. On venturing to use EDA and exploring on your own you are bound to discover other techniques and also find that some work for you better than the others do.
Now that we have gone over the techniques and understood their significance, let us move on to the bigger picture. As you might have already guessed, the process of exploratory data analysis isn't what one might call ‘plain sailing’. Rather, it involves an iterative process of asking questions about your data, visualizing and analyzing the data to answer them, and using what you learn to refine your questions.
EDA is a creative process and this lack of a strict set of rules both allows and necessitates that you be curious. It is especially important during the initial phases of EDA that you explore every avenue that occurs to you without being deterred by the fact that probably only a few of them might ultimately lead to fruition.
Once you have asked the questions, it is necessary that you appropriately visualize and analyze the data in accordance with the questions in order to gain further insights into your data.
The observations you have made from the previous two steps will open up new avenues of data exploration allowing you to refine your questions and even make more informed inquiries.
As this process progresses you are bound to arrive at some particularly productive or informative findings which will evolve into the results of your exploratory data analysis.
EDA can be broadly divided into four categories based on the number of variables and the type of technique: univariate non-graphical, univariate graphical, multivariate non-graphical, and multivariate graphical.
In univariate non-graphical analysis, only one variable is analyzed. Consequently, there are no relationships to be analyzed or found, as would be the case with multiple variables. This type of data analysis primarily deals with describing and understanding the data distribution.
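A minimal sketch of describing a single variable's distribution with summary numbers. The data is hypothetical (including a planted outlier of 30.0), and the moment-based skewness formula is just one of several common definitions.

```python
import statistics

# Hypothetical 'Preference' values (not the article's data); 30.0 is a planted outlier.
preference = [1.2, 2.9, 3.7, 5.3, 6.0, 7.6, 8.4, 10.1, 11.0, 30.0]

mean = statistics.mean(preference)
median = statistics.median(preference)
q1, q2, q3 = statistics.quantiles(preference, n=4)  # quartile cut points

# Moment-based skewness: third central moment over the 1.5th power of the second.
n = len(preference)
m2 = sum((x - mean) ** 2 for x in preference) / n
m3 = sum((x - mean) ** 3 for x in preference) / n
skewness = m3 / m2 ** 1.5

print(f"mean={mean:.2f} median={median:.2f} IQR={q3 - q1:.2f} skewness={skewness:.2f}")
```

A mean well above the median, together with positive skewness, points to a long right tail, which here is driven largely by the single extreme value.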
Univariate graphical analysis also deals with only one variable. However, it differs in that graphical methods are adopted in order to visualize the data. The techniques that can be used for this purpose include box plots and histograms.
Multivariate non-graphical analysis involves two or more variables. As a result, in addition to understanding the distribution of each variable, it is necessary to analyze the relationships between the variables.
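One common non-graphical way to quantify such a relationship is the Pearson correlation coefficient, sketched here by hand. The data is hypothetical, `pearson_r` is a helper written for this example, and `chocolate` is derived on the assumption that the mix contains only nuts and chocolate.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical survey data (not the article's dataset).
nuts = [10, 20, 30, 40, 50, 60, 70, 80, 90]
chocolate = [100 - p for p in nuts]  # assumes the mix is only nuts and chocolate
preference = [11.0, 10.1, 8.4, 7.6, 6.0, 5.3, 3.7, 2.9, 1.2]

r_nuts = pearson_r(nuts, preference)
r_chocolate = pearson_r(chocolate, preference)
print(f"r(nuts, preference)={r_nuts:.3f}  r(chocolate, preference)={r_chocolate:.3f}")
```

Values near -1 or +1 indicate a strong linear relationship, but a coefficient near 0 does not rule out a non-linear one, so it is worth pairing this number with a plot.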
In multivariate graphical analysis, graphics are used to describe and visualize the relationships between two or more variables. Statistical graphics that are popularly used for this purpose include scatter plots and heat maps.
Before we dive into this section, let’s reiterate (if it hasn't been done enough already, not discounting the ambiguous heading) that there is no perfect way to go about EDA. There are just ways that work and those that don't. The steps listed below are, therefore, just one logical route you could take as you get started:
Having covered most of what we need to know to get started with EDA, in order that we don’t lose track of what we seek to achieve from all this, let us quickly summarise our goals.
The primary purpose of EDA includes:
Given all that there is to gain, EDA must be our initial interaction with any data set. With practice and informed use of techniques, the process of EDA is bound to become less abstruse.
It is entirely possible, on your initial attempts to apply EDA, that you end up with more questions than answers. But as John Tukey, one of the world-renowned promoters of and contributors to EDA, said: “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”