"A significant constraint on realizing value from Big Data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from Big Data. We project a need for 1.5 million additional managers and analysts in the United States who can ask the right questions and consume the results of the analysis of Big Data effectively.”- McKinsey Report
The data scientist job title is rising along with big data technology. Data scientists and related roles such as data analysts, data engineers and statisticians are among the most sought-after careers today. Attracted by the compensation, the growing number of job opportunities and the visibility to business leaders, many professionals are heading down the data scientist career path without much knowledge of the business and technical skills, attitude and day-to-day responsibilities the role requires.
Data Science is a fun and challenging field to be in: data scientists are doing some pretty amazing things by exploring organizations’ data to draw business insights. It is no surprise that many people want to know the answer to the question: “How to program your way into data science?”
At DeZyre we have always believed that it is better to learn from Industry Professionals on how to get into the industry. We have organized a DeZyre InSync session to answer this specific question – “How to program your way into data science?”
We had the pleasure of inviting Eeshan Chatterjee, Data Scientist at MEDIA iQ Digital Ltd. MEDIA iQ Digital is an analytics technology company that unlocks insights to help businesses drive growth. The analytics technology at MEDIA iQ drives prediction at scale so that a business's buying outputs can be improved across its campaigns. Eeshan is an integral part of the analytics research for the display advertising space and digital marketing at MEDIA iQ.
You can click on the link below to listen to a recording of the recent webinar on “How to program your way into data science?” by Eeshan Chatterjee.
What is data in the business world?
Anything that can be observed, recorded, stored and measured, and whose measurement helps drive business growth, can be termed ‘data in the business world’ and is important to any organization.
What data does my business generate?
“Each and every department right from CEO’s office to the janitorial division collects data.”
Every department of an organization, whether it is Sales & Marketing, Production, Operations, Finance, HR, Supply Chain or any other division, generates data. The idea behind storing the data collected from all these divisions is to get a holistic picture of the business. This helps organizations look at the business from various perspectives together, and that is what is popularly known as “Data Science”.
What is Data Science?
Data Science helps businesses look at things that were previously difficult or nearly impossible to look at. It helps businesses analyse questions such as:
- How does change in customer behaviour impact the business?
- How does change in the supply chain process impact marketing?
Data Science is the next step for interdisciplinary subjects like business analytics, and it brings together modelling, mathematics, computer programming, statistics and data analytics. It deals with using automated methods to analyse huge amounts of data and extract knowledge and meaningful insights from it.
Twitter’s Sid Patil aptly compares Data Science to the popular game of blackjack. He says:
“Blackjack is the only game in the casino that has a memory. What happened in the past is indicative of what will happen in the future, and this is very much like the world of analytics. We’re constantly trying to understand what has happened in order to understand the probability what might happen next, with varying amounts of certainty.”
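The “memory” in the quote can be made concrete with a little arithmetic. The sketch below is only an illustration and not part of the webinar: it assumes a single standard 52-card deck, and the helper name `prob_ten_valued` is made up for this example. It shows how the probability of the next card being worth ten shifts once some low cards have already been seen.

```python
from fractions import Fraction

def prob_ten_valued(seen_cards):
    """Probability that the next card is worth ten, given the cards already dealt.

    Assumes a single standard 52-card deck (an assumption for illustration);
    cards are given as rank strings such as "K" or "7".
    """
    ten_ranks = {"10", "J", "Q", "K"}  # 4 ranks x 4 suits = 16 ten-valued cards
    tens_left = 16 - sum(1 for card in seen_cards if card in ten_ranks)
    cards_left = 52 - len(seen_cards)
    return Fraction(tens_left, cards_left)

# Before any card is dealt, and after five low cards have been seen.
fresh_deck = prob_ten_valued([])
after_low_cards = prob_ten_valued(["2", "5", "6", "3", "4"])
print(fresh_deck, after_low_cards)
```

What has already happened (five low cards gone) changes the estimate of what happens next, which is the parallel to analytics drawn in the quote.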
The HypeCycle and Data Science
Image Credit : gartner.com
Every new thing in the technology world goes through the hype cycle, and data science is no exception. As the Gartner Hype Cycle shows, data science started climbing towards the peak of inflated expectations in 2014, and as of today it is right at the top. In the future, data scientists are expected to do things with data that were not possible earlier, answering questions such as:
- How can an organization accelerate tomorrow’s business?
- How does depending on data provide better results?
- How can businesses mine data in an effective manner?
“In tomorrow’s business, big data can tell you more about your operations than your people alone,” says Emma Byrne.
The Basics- How did we arrive at Data Science?
Businesses have always collected data and analysed it to make better business decisions; however, business analytics has evolved dramatically since the 1970s:
- The Era of Statistical Insight: In this era, businesses modelled metrics, measured KPIs and spent a lot of time on operations research. For instance, UPS has been a pioneer in using data to make effective business decisions and improve deliveries since the 1970s.
- The Era of Business Intelligence: About 10 years ago, the era of business intelligence was ushered in through dashboards (a kind of static chart that does not allow any interaction). Dynamic, frequent updates to those dashboards were not possible, but there was a need for real-time data analysis, and so business analytics emerged.
- The Era of Data Science: Dashboards (“dash-boring” or “dash-boredom”) have evolved into cockpits, which are interactive charts. For instance, a marketing manager needs to “talk to” the charts by dynamically allocating dollars and seeing how that affects business reach, or a production manager needs to see whether changing the production life cycle will affect the final output. Distributed computation has become hugely important in the era of data science. Intelligent systems, which require training machines to make better business decisions, have evolved into a core part of data science.
The only thing that has not changed since the 1970s is the motive: to help businesses make better decisions. We are in an age where businesses talk to data at all times to make better business decisions.
The most important question is: “Can a statistician call himself a Data Scientist?” The answer is ‘not exactly’, because the job role of a data scientist is diversified and demands a wider skill set. So what are the skills that need to be mastered? The most popular Venn diagram in the field of data science shows the combination of skills that professionals who want to enter the field must possess:
- He/she should be strong at Math and Statistics.
- He/she must be a subject matter expert in a specific business domain: a person who knows the business very well.
- He/she must be technically proficient, with a computer programming background.
But is that realistic? Can we actually find professionals who possess all these skills? It is very difficult to find professionals who fit all the criteria mentioned. And if, by some luck, such brilliant people exist, why would they work for someone else? More commonly, professionals are strong in two of the three areas:
- A person who is relatively good at math and computer programming can be termed as a “Data Engineer”.
- A person who is relatively good at math and possesses strong business or domain knowledge can be termed a “Business Analyst” or a “Requirements Analyst”.
- A person who is relatively good at computer programming and possesses some business domain knowledge can be termed a “Data Design Architect”.
The job of a data scientist is to bring all these roles together, i.e. to strike a balance between the above skills along with good design thinking.
The popular Data Science Cheatsheet
Image Credit : datasciencecentral.com
The next question that bothers professionals yearning to enter the field of data science is: “Do Math and Business Acumen also require programming knowledge?” The answer is definitely ‘yes’. Looking at the cheatsheet above, one cannot tick off even 15% of the checklist without programming. Once a person learns the fundamentals of statistics, the cheatsheet opens up a whole new field of programming that data scientists must learn to proceed further. Having learnt programming, the next step is to master machine learning, which involves complex mathematical problems that would be difficult to solve manually, so the only practical solution is to write a program that solves the mathematical or statistical problem. Visualization is also an integral part of data science, and it requires coding with various data visualization tools. Every section in the cheatsheet requires some kind of programming. The “Toolbox” in the cheatsheet summarizes the various skills a data scientist must master to program their way into data science.
Programming for Math-The Algo Whiz Codebook
- The first step is to choose a scripting language for programming math. The most popular choices today are R and Python.
- The next step is to look for the required packages. Both R and Python have freely available packages for almost all the techniques that a data scientist might want to use today.
- The final step is to visualize the result through interactive plots, which can be created with either scripting language.
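The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than anything shown in the webinar: the data is made up, the line is fitted with ordinary least squares using only the standard-library `statistics` module, and a crude text bar chart stands in for the interactive plots that a plotting library such as matplotlib would normally provide.

```python
import statistics

# Hypothetical data: monthly ad spend (in $1000s) and resulting sales (in units).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.0, 19.0, 29.0, 37.0, 45.0]

def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

def text_chart(xs, ys, units_per_mark=2.0):
    """Return a crude text bar chart, one row per observation."""
    rows = []
    for x, y in zip(xs, ys):
        bar = "#" * round(y / units_per_mark)
        rows.append(f"{x:>4.1f} | {bar} {y:.0f}")
    return "\n".join(rows)

slope, intercept = fit_line(ad_spend, sales)
print(f"sales ~ {slope:.1f} * ad_spend + {intercept:.1f}")
print(text_chart(ad_spend, sales))
```

In practice the fitting step would usually come from an existing package rather than hand-written arithmetic; it is spelled out here only to show what such a package does.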
R or Python –The debate settled
Because R and Python have similar capabilities, data scientists often find it difficult to decide which scripting language is the better choice.
Data Scientists should Choose R Language When-
- They are beginning to explore the data.
- They are developing an analysis methodology or looking for a one-time business insight.
- They want to explore a broad spectrum of techniques to find the best ensembles to use.
- They are working with complex data structures, for which R is preferred over Python.
Data Scientists should Choose Python Language When-
- They have a good understanding of the techniques and of the data they are working with.
- The analysis methodology is to be deployed on large-scale production systems.
- They want to train deep models on GPUs.
Programming for Tech
For a data scientist, it is important to know the basics of C++ and Java as they form the backbone of all at-scale data systems.
When it comes to data in data science, programming for technology happens on four general platforms:
1) Creating data platforms that ingest or manage data. These form the backbone of the data services.
2) Scaling out or distributing data across multiple systems using Hadoop, YARN, Scala and JADE (for multiple parallel analyses).
3) Having distributed the data, processing it effectively using low-level subroutines written in C++.
4) Using GPUs for processing, with code written in C or C++, for example when running machine learning algorithms on millions of data points.
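The idea behind steps 2 and 3, splitting data across workers, processing each part independently and combining the partial results, can be sketched on a single machine. The example below is only an analogy under stated assumptions: it uses Python threads from the standard library in place of a real cluster, it does not use Hadoop or YARN, and the data and the function names (`split`, `partial_stats`, `combine`) are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split(records, n_parts):
    """Deal records into n_parts roughly equal partitions."""
    return [records[i::n_parts] for i in range(n_parts)]

def partial_stats(partition):
    """Work done independently on each partition: a count and a sum."""
    return len(partition), sum(partition)

def combine(partials):
    """Merge the per-partition results into a global mean."""
    total_count = sum(count for count, _ in partials)
    total_sum = sum(subtotal for _, subtotal in partials)
    return total_sum / total_count

# Hypothetical data: click counts for 100 campaigns.
click_counts = list(range(1, 101))

# Each worker thread stands in for one node of a cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_stats, split(click_counts, 4)))

distributed_mean = combine(partials)
print(distributed_mean)
```

The design point is that each partition only returns a small summary (a count and a sum), so the combine step stays cheap no matter how large the partitions are.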
Programming for Business
This is the most interesting part of data science: the results of the analysis should be presented in a manner that end users can easily understand.
Various visualization libraries, such as d3.js, allow data scientists to provide interactive charts that answer business questions intuitively.
- Drill-down capabilities and a bird’s-eye view provide multiple perspectives on a business, so that well-informed decisions can be made without losing context.
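Behind an interactive chart, the bird’s-eye view and the drill-down are two aggregations of the same records at different levels of detail. The sketch below illustrates only that data side, not the charting itself, which a library like d3.js would handle. The records and the function names `birds_eye` and `drill_down` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records: (region, product, revenue).
records = [
    ("North", "Widgets", 120),
    ("North", "Gadgets", 80),
    ("South", "Widgets", 200),
    ("South", "Gadgets", 50),
    ("South", "Widgets", 30),
]

def birds_eye(rows):
    """Total revenue per region: the high-level view."""
    totals = defaultdict(int)
    for region, _product, revenue in rows:
        totals[region] += revenue
    return dict(totals)

def drill_down(rows, region):
    """Revenue per product within one region: the detailed view."""
    totals = defaultdict(int)
    for row_region, product, revenue in rows:
        if row_region == region:
            totals[product] += revenue
    return dict(totals)

overview = birds_eye(records)
south_detail = drill_down(records, "South")
print(overview)
print(south_detail)
```

Because the detailed view is derived from the same records as the overview, the product totals for a region always add up to that region's figure, which is what lets a user drill down without losing context.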
Design Thinking and Programming
A scientist approaches a problem with a ‘problem-focused approach’, whereas an architect approaches it with a ‘solution-based approach’. A scientist breaks down the problem to get to the best solution, whereas an architect does not dwell on the intricacies of the problem but iterates multiple times to arrive at a good solution.
Design Thinking is an elegant approach that combines the two: break the problem down and analyse it in sub-parts, then look at the parts that are important and categorize them into wants and needs. “Wants” are the good-to-have parts and “Needs” are the can’t-do-without parts. Then synthesize the best solution from the multiple solutions possible. The entire concept of design thinking depends on a future goal and not a solution. Design Thinking is the ability to define a better goal irrespective of the problem, which means looking at the BIG PICTURE. When one can define a state in which the problem has been solved and its solution can be reused in several other ways, the desired future state has been identified. There are several roadblocks on the way to that desired future state, and data scientists need to come up with all the possible solutions by prototyping them. The last step is to develop an at-scale solution from the prototype.
If there’s any question you think would be helpful in programming your way into data science, feel free to ask in the comments below!