Data science rests entirely on data. If your data is good, you will get good results; if not, you may have heard the famous data science proverb: garbage in, garbage out. A good (or rather, useful) data science product is like a recipe: if even one ingredient is bad, the final product will not please the audience.
The picture shown above covers the following important parts of a data science product:
Assuming you understand your business requirement sufficiently, let's discuss what we mean by data mining, statistics and machine learning, and how they differ.
Data mining, statistics and machine learning are data-driven disciplines that help organizations make better decisions and support the growth of a business.
Wasserman, a professor in both the Department of Statistics and the Machine Learning Department at Carnegie Mellon, addresses the question: what is the difference between data mining, statistics and machine learning?
“The short answer is: None. They are … concerned with the same question: how do we learn from data?”
By Wasserman's answer, the three disciplines are largely the same, with only minor differences. They can be thought of as identical twins that use different words and terminology and follow different notations.
Data mining is the very first step of a data science product. It is the field in which we try to identify patterns in data and come up with initial insights.
For example, you receive the data, identify the missing values, and then notice that the missing values mostly come from recordings taken manually.
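As a minimal sketch of that kind of first insight, the snippet below counts missing values per recording source. The records, the field names `source` and `value`, and the helper `missing_rate_by_source` are all hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical readings: each record has a source and a value (None = missing).
records = [
    {"source": "manual", "value": None},
    {"source": "manual", "value": 12.5},
    {"source": "manual", "value": None},
    {"source": "sensor", "value": 11.9},
    {"source": "sensor", "value": 12.1},
    {"source": "sensor", "value": None},
    {"source": "sensor", "value": 12.4},
]


def missing_rate_by_source(rows):
    """Return the fraction of missing values for each recording source."""
    totals = Counter(row["source"] for row in rows)
    missing = Counter(row["source"] for row in rows if row["value"] is None)
    return {source: missing[source] / totals[source] for source in totals}


rates = missing_rate_by_source(records)
for source, rate in sorted(rates.items()):
    print(f"{source}: {rate:.0%} of values missing")
```

In this made-up data the manually recorded rows have a noticeably higher missing rate, which is the kind of pattern the example above describes.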
Some people confuse data mining with data extraction. Data mining comes into play only once you have collected the data.
Companies use powerful data mining techniques coupled with advanced tools to extract valuable information from large amounts of data.
For example, Walmart collects point-of-sale data from its 3,000+ stores across the world and stores it in its data warehouse. Walmart suppliers have access to this database; they identify buying patterns among Walmart customers and use them to manage their future inventory. The Walmart data warehouse processes more than a million such queries every year.
Data mining uses the power of machine learning, statistics and database techniques to mine large databases and come up with patterns.
Data mining mostly relies on techniques such as cluster analysis, anomaly detection and association rule mining to find patterns in data.
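To make one of these techniques concrete, here is a small sketch of association rule mining restricted to item pairs. The baskets are invented, and `pair_rules` with its `min_support` threshold is a hypothetical helper, not a full algorithm such as Apriori. Support is the share of baskets containing both items; confidence is the share of baskets with the first item that also contain the second.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; the items are illustrative only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def pair_rules(transactions, min_support=0.4):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # skip pairs that appear together too rarely
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules


rules = pair_rules(baskets)
for antecedent, consequent, support, confidence in sorted(rules):
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

Real retail data would involve far more items and larger item sets, so a dedicated library or algorithm would normally replace this brute-force pair count.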
In short, data mining is the discovery of hidden and interesting patterns stored in large data warehouses, using the power of statistics, artificial intelligence, machine learning and database management techniques.
Statistics is the foundation of all data mining and machine learning algorithms.
Statistics is the discipline of collecting and analyzing data in order to draw inferences and make predictions about the future.
A major task of a statistician is to estimate population parameters from sample metrics. Statistics also deals with designing surveys and experiments to obtain quality data, which can then be used to make estimates about the population. Formally, the tasks of statistics can be listed as follows:
Statistics is used to summarize numbers, for example by computing descriptive statistics such as the mean, median, mode, standard deviation, variance and percentiles, and to test hypotheses.
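The descriptive statistics named above can be sketched with Python's standard `statistics` module. The sample values and the helpers `describe` and `mean_confidence_interval` are hypothetical. The interval uses a normal approximation, which is a simplifying assumption; for a sample this small a t-distribution would usually be preferred.

```python
import statistics

# Hypothetical sample of daily order counts; the numbers are illustrative only.
sample = [12, 15, 15, 18, 20, 22, 25]


def describe(values):
    """Summarize a sample with common descriptive statistics."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),        # sample standard deviation
        "variance": statistics.variance(values),  # sample variance
        "quartiles": statistics.quantiles(values, n=4),  # 25th/50th/75th
    }


def mean_confidence_interval(values, confidence=0.95):
    """Rough interval for the population mean (normal approximation)."""
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * statistics.stdev(values) / len(values) ** 0.5
    center = statistics.mean(values)
    return center - margin, center + margin


summary = describe(sample)
low, high = mean_confidence_interval(sample)
print(summary)
print(f"Approximate 95% interval for the population mean: {low:.2f} to {high:.2f}")
```

The second helper illustrates the estimation task described earlier: using sample metrics to say something about the wider population.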
Machine learning is the part of data science that focuses on writing algorithms in such a way that machines (computers) can learn on their own and apply what they have learned to new data whenever it arrives. Machine learning draws on the power of statistics and learns from a training dataset. For example, we use regression, classification and similar methods to learn from training data and then use what was learned to make estimates on a test dataset.
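A minimal sketch of that train-then-estimate loop is shown below, using ordinary least squares for a single input variable. The training and test values are invented, and `fit_line` and `predict` are hypothetical helper names; in practice a library such as scikit-learn would typically be used instead.

```python
def fit_line(xs, ys):
    """Learn slope and intercept from training data by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, xs):
    """Apply the learned line to new inputs."""
    slope, intercept = model
    return [slope * x + intercept for x in xs]


# Hypothetical training data: the model only ever sees these pairs.
train_x = [1, 2, 3, 4, 5]
train_y = [2.1, 4.0, 6.2, 7.9, 10.1]

# Hypothetical test inputs the model has not seen before.
test_x = [6, 7]

model = fit_line(train_x, train_y)
predictions = predict(model, test_x)
print(f"slope={model[0]:.2f}, intercept={model[1]:.2f}")
print([round(p, 2) for p in predictions])
```

The separation between `fit_line` (learning) and `predict` (using the learnings) mirrors the training and test split described in the paragraph above.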
Data mining and statistics both aim to analyze data, but they are different tools. The data mining process involves modelling, predicting and optimizing on a dataset, while statistics, more or less, describes how efficient a dataset is.
| Data Mining | Statistics |
| --- | --- |
| Explorative: dig into the data first, discover novel patterns and then form theories. | Confirmative: provide a theory first and then test it using various statistical tools. |
| Involves data cleaning. | Statistical methods are applied to clean data. |
| Usually involves working with large datasets. | Usually involves working with small datasets. |
| Makes generous use of heuristic thinking. | Leaves no scope for heuristic thinking. |
| Inductive process. | Deductive process (does not involve making any predictions). |
| Numeric and non-numeric data. | Numeric data. |
| Less concerned about data collection. | More concerned about data collection. |
| Popular data mining methods include estimation, classification, neural networks, clustering, association and visualization. | Popular statistical methods include inferential and descriptive statistics. |
In short,