Data science depends entirely on data. If your data is good, you will get good results; if it is not, you may have heard the famous data science proverb: garbage in, garbage out. A good (or rather, useful) data science product is like a recipe: if even one ingredient is bad, the final product will not please the audience.
The picture shown above highlights the following important parts of a data science product:
- Data – The data part needs no introduction. Any data is enough for a data science product, but a good data science product needs good and sufficient data. Getting there is the primary task of data mining, which we will discuss in detail.
- Business – When we try to solve a business problem using data, we need to ask questions even before looking at the data. The same problem can be solved in different ways depending on the business needs. For example, let's say an airline wants to increase its customer base. If the marketing team is solving this problem, it will look for a potential target population, while the operations team will look to change flight timings or add flights.
- Math – Assuming you have good data and understand the business objective, the next step is to solve the problem. Which hypotheses to test, how to prove or disprove them, and which method to use for a particular problem are some of the areas tackled by math or statistics.
- Technology – Once you have techniques in mind and have finalized the hypotheses, the next task is to solve the problem in the best possible way and in the shortest possible time; that is what we need technology for. You might have built the best regression technique in theory, but if you can't run it on real data in a reasonable time, it is of no use.
Assuming you understand your business requirement sufficiently, let's discuss what we mean by data mining, statistics and machine learning, and how they differ.
Data Mining vs. Statistics vs. Machine Learning
Data mining, statistics and machine learning are data-driven disciplines that help organizations make better decisions and can positively affect the growth of a business.
What is the difference between data mining, statistics and machine learning? According to Wasserman, a professor in both the Department of Statistics and the Machine Learning Department at Carnegie Mellon:
“The short answer is: None. They are … concerned with the same question: how do we learn from data?”
Going by Wasserman’s answer, the three disciplines are largely the same, with only minor differences. They can be thought of as identical twins that use different words, terminology and notation.
Data Mining
Data mining is the very first step of a data science product. It is a field where we try to identify patterns in data and come up with initial insights.
For example, you receive the data, identify the missing values, and then notice that most of the missing values come from recordings that were taken manually.
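A minimal sketch of that kind of first look, using only the Python standard library. The records and the field names `source` and `reading` are made up for illustration; they do not come from any real dataset.

```python
from collections import Counter

# Hypothetical records: each has a recording method ("source") and a
# measured value ("reading"); None marks a missing value.
records = [
    {"source": "manual", "reading": None},
    {"source": "manual", "reading": 4.2},
    {"source": "manual", "reading": None},
    {"source": "sensor", "reading": 3.9},
    {"source": "sensor", "reading": 4.1},
    {"source": "sensor", "reading": None},
    {"source": "sensor", "reading": 4.0},
    {"source": "manual", "reading": None},
]


def missing_rate_by_source(rows):
    """Return {source: share of rows whose reading is missing}."""
    total = Counter(row["source"] for row in rows)
    missing = Counter(row["source"] for row in rows if row["reading"] is None)
    return {source: missing[source] / total[source] for source in total}


rates = missing_rate_by_source(records)
for source, rate in sorted(rates.items()):
    print(f"{source}: {rate:.0%} missing")
```

Grouping the missing-value rate by how the data was recorded is what surfaces the initial insight: in this toy data, the manual recordings account for most of the gaps.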
Some people mistake data mining for data extraction. Data mining only comes into play once you have collected the data.
Companies use powerful data mining techniques coupled with advanced tools to extract valuable information from large amounts of data.
For example, Walmart collects point-of-sale data from its 3,000+ stores across the world and stores it in its data warehouse. Walmart's suppliers have access to this database; they identify buying patterns among Walmart customers and use them to manage their inventory. The Walmart data warehouse processes more than a million such queries every year.
Data mining uses the power of machine learning, statistics and database techniques to mine large databases and come up with patterns.
Data mining mostly relies on cluster analysis, anomaly detection, association rule mining and similar techniques to find patterns in data.
In short, data mining is the discovery of hidden and interesting patterns stored in large data warehouses, using the power of statistics, artificial intelligence, machine learning and database management techniques.
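To make one of those techniques concrete, here is a rough sketch of association rule mining restricted to pairs of items. It counts pairs by brute force rather than using a production algorithm such as Apriori. The baskets, the function name `pair_rules` and the thresholds are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return sorted (antecedent, consequent, support, confidence) tuples.

    support    = share of transactions containing both items
    confidence = share of transactions with the antecedent that also
                 contain the consequent
    """
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to be interesting
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = count / item_counts[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return sorted(rules)


rules = pair_rules(baskets)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

With these thresholds only the bread and milk pair survives, which is the sort of buying pattern a supplier could use when planning inventory.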
Statistics
Statistics is the foundation of all data mining and machine learning algorithms.
Statistics is the discipline of collecting, analyzing and studying data in order to draw inferences and make predictions about the future.
A major task of a statistician is to estimate population parameters from sample metrics. Statistics also deals with designing surveys and experiments to obtain quality data, which can then be used to make estimates about the population. Formally, the tasks of statistics can be listed as follows:
- Designing surveys and experiments
- Summarizing and understanding data
- Estimating population behavior
- Prediction or estimation of future
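As a small illustration of estimating a population from a sample, the sketch below draws a sample from a simulated population and builds a normal-approximation confidence interval for the population mean. The population, the sample size and the helper name `mean_confidence_interval` are assumptions made for this example only.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated order values.
population = [random.gauss(50, 12) for _ in range(10_000)]
# In practice we only get to see a sample of it.
sample = random.sample(population, 200)


def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    m = mean(values)
    standard_error = stdev(values) / len(values) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m - z * standard_error, m + z * standard_error


low, high = mean_confidence_interval(sample)
print(f"sample mean: {mean(sample):.2f}")
print(f"95% interval for the population mean: ({low:.2f}, {high:.2f})")
```

The interval expresses how far the unknown population mean could plausibly be from the sample mean. For small samples a t-distribution would be the more careful choice than the normal approximation used here.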
Statistics is also used to summarize numbers, for example by computing descriptive statistics such as the mean, median, mode, standard deviation, variance and percentiles, and to test hypotheses.
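The descriptive measures named above can be computed with Python's built-in `statistics` module. This is only a sketch; the `sales` figures are invented for illustration.

```python
from statistics import mean, median, mode, quantiles, stdev, variance

# Hypothetical daily sales counts, invented for illustration.
sales = [12, 15, 15, 18, 20, 22, 15, 30, 25, 18]

summary = {
    "mean": mean(sales),
    "median": median(sales),
    "mode": mode(sales),
    "variance": variance(sales),         # sample variance (divides by n - 1)
    "std_dev": stdev(sales),             # sample standard deviation
    "quartiles": quantiles(sales, n=4),  # 25th, 50th and 75th percentiles
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

`variance` and `stdev` treat the data as a sample; `pvariance` and `pstdev` are the counterparts for data that covers the whole population.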
Machine Learning
Machine learning is a part of data science that focuses mainly on writing algorithms so that machines (computers) can learn on their own and apply what they have learned to new data as it comes in. Machine learning draws on the power of statistics and learns from a training dataset. For example, we use regression, classification and similar techniques to learn from training data, then use what was learned to make estimates on a test dataset.
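The train-then-test idea can be sketched with a simple linear regression written in plain Python. The synthetic data, the 40/10 split and the helper names `fit_line` and `mean_squared_error` are assumptions for illustration, not a prescribed workflow.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical data: y depends roughly linearly on x, plus random noise.
data = [(x, 3.0 * x + 2.0 + random.gauss(0, 1.0)) for x in range(50)]
random.shuffle(data)
train, test = data[:40], data[40:]  # learn on 40 points, evaluate on 10


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(points, slope, intercept):
    """Average squared gap between predictions and actual values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


slope, intercept = fit_line(train)                       # learning step
test_error = mean_squared_error(test, slope, intercept)  # check on unseen data

print(f"learned model: y = {slope:.2f} * x + {intercept:.2f}")
print(f"mean squared error on test data: {test_error:.2f}")
```

The model never sees the test points while it is being fitted, so the test error gives a fair picture of how well the learning carries over to new data.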
Data Mining vs. Statistics - Similarities and Differences Unleashed
Data mining and statistics share the objective of analyzing data, but they are different tools. The data mining process involves modelling, predicting and optimizing on a dataset, while statistics, more or less, describes a dataset and how well it supports the conclusions drawn from it.
| Data Mining | Statistics |
| --- | --- |
| Explorative – Dig out the data first, discover novel patterns and then make theories. | Confirmative – Provide theory first and then test it using various statistical tools. |
| Involves data cleaning. | Statistical methods are applied to clean data. |
| Usually involves working with large datasets. | Usually involves working with small datasets. |
| Makes generous use of heuristic thinking. | There is no scope for heuristic thinking. |
| Inductive process | Deductive (does not involve making any predictions) |
| Numeric and non-numeric data | Numeric data |
| Less concerned about data collection. | More concerned about data collection. |
| Some of the popular data mining methods include estimation, classification, neural networks, clustering, association, and visualization. | Some of the popular statistical methods include inferential and descriptive statistics. |
Machine Learning vs. Statistics
- Machine learning and statistics are both concerned with how we learn from data, but statistics is more concerned with the inferences that can be drawn from a model, whereas machine learning focuses on optimization and performance.
- Statistical learning involves forming a hypothesis, that is, making assumptions and validating them, before building a model. In machine learning, the algorithms are run directly on the data, letting the data speak instead of steering it in a specific direction with an initial hypothesis.
- Statistics is all about samples, populations and hypotheses, whereas machine learning is all about predictions, supervised learning and unsupervised learning.
- Machine Learning is about building algorithms that help machines emulate human learning whereas Statistics is about converting the data into aggregate numbers which help understand the structure in data.
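One way to picture the contrast drawn in these points is to fit a single line and then ask the two different questions of it. This is only a sketch: the synthetic data and variable names are assumptions, and the interval uses a normal approximation where a t-distribution would be more rigorous.

```python
import random
from statistics import NormalDist

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical data: a weak linear trend buried in noise.
data = [(x, 0.5 * x + random.gauss(0, 2.0)) for x in range(60)]
random.shuffle(data)
train, test = data[:45], data[45:]

# Fit y = slope * x + intercept by ordinary least squares on the training set.
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
sxx = sum((x - mean_x) ** 2 for x, _ in train)
slope = sum((x - mean_x) * (y - mean_y) for x, y in train) / sxx
intercept = mean_y - slope * mean_x

# Statistics view: how certain are we about the slope itself?
residuals = [y - (slope * x + intercept) for x, y in train]
residual_variance = sum(r * r for r in residuals) / (n - 2)
slope_se = (residual_variance / sxx) ** 0.5
z = NormalDist().inv_cdf(0.975)
slope_interval = (slope - z * slope_se, slope + z * slope_se)

# Machine learning view: how well does the model predict unseen data?
test_mse = sum((y - (slope * x + intercept)) ** 2 for x, y in test) / len(test)

print(f"slope: {slope:.3f}, approx. 95% interval: "
      f"({slope_interval[0]:.3f}, {slope_interval[1]:.3f})")
print(f"mean squared error on held-out data: {test_mse:.2f}")
```

The model is the same in both cases. The statistical question is about the uncertainty of the estimated relationship, while the machine learning question is about predictive performance on data the model has not seen.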
In short,
- Statistics quantifies data from a sample and estimates population behavior
- Data mining finds patterns in data
- Machine learning learns from training data and predicts or estimates the future