Is "becoming a data scientist" one of your resolutions for 2021? Data science careers have seen tremendous growth over the years. On top of commanding high salaries (the average data scientist salary is $96,501), data science beginners can expect opportunities to level up in their careers as they upskill and gain experience. As jacks of all trades (and masters of quite a few), beginners will need a well-rounded set of data science skills to enter one of the most in-demand job markets. While the data scientist career path is not straight and narrow, hearing what that path looks like from a real data scientist is the best and most actionable career advice a data science beginner can get.
We asked a data science expert for his best advice on traversing the data scientist career path. Just one big idea can change the world, and so can one amazing PyTorch library. Meet our inquisitive data scientist Manu Joseph, who has envisioned a PyTorch library that will be useful for both research and industry applications. Manu Joseph is a self-taught data scientist. Having transitioned from software engineering to data science, he owns his projects end-to-end, from converting a business problem into a data science problem to executing the solution. He recently launched a new library, PyTorch Tabular, a framework/wrapper library that aims to make deep learning with tabular data easy and accessible for real-world cases and research alike. His aim with PyTorch Tabular is to make the "software engineering" part of working with neural networks as easy and effortless as possible. Read on for this data science expert's advice on how to make a career transition into data science in 2021.
Q) How did you first get into data analytics?
A) If you look at my data science career path, it goes from engineering to software engineering at Cognizant for a couple of years. Then I did my MBA with Supply Chain and Operations as a major, after which I started working as a supply chain consultant, then moved into analytics consultancy, and then became a data scientist. I've always been fascinated with numbers and with modeling real-world business problems as mathematical problems, and you can see that in my choices: engineering, and even the MBA specialization, where I chose something more mathematical. After the MBA I joined a company as an SCM consultant, where I had to utilize a lot of data. So my data scientist career path started there, although it was restricted to classical statistics and supply chain mathematics. Then I realized that this was what I wanted to do, and I slowly made a career transition to data science. My time at Philips, where I was an analytics consultant, facilitated this transition. My peers there were doing a myriad of things and I got interested, so I decided to upskill myself. I then took the Andrew Ng Coursera course. I did not stop after that data science MOOC, however. I've seen a lot of people trying to get into the field who take their first data science MOOC and then more or less stop learning, and that's the first mistake you can make. A MOOC is a starting point, and one shouldn't give up at the very start.
Q) What do you suggest they do next after a data science course?
A) After a MOOC, put whatever data science skills you've learned into practice by working on diverse real-world data science projects. If that is already a part of your work, then that's brilliant. It wasn't for me, so while doing the MOOC I learnt Python for data science in parallel, and whatever SQL processes were happening in my line of work I would replicate in Python to get hands-on experience. I was looking for an opportunity in this area, and when that chance came I approached my manager and told him that this was an opportunity I wanted to pursue and that we could do it together. He was very supportive, and that's how I transitioned to a data science career.
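To make the "replicate SQL in Python" exercise concrete, here is a minimal sketch of what it could look like. The table, column names, and order data are hypothetical and purely illustrative; they are not taken from the interview. The same per-region total is computed once with a SQL GROUP BY (through Python's built-in sqlite3 module) and once with plain Python, and the two results are compared.

```python
import sqlite3
from collections import defaultdict

# Hypothetical order rows: (region, amount). Illustrative data only.
ORDERS = [
    ("north", 120.0),
    ("south", 80.0),
    ("north", 30.0),
    ("east", 50.0),
    ("south", 20.0),
]


def totals_with_sql(rows):
    """Total amount per region, computed by a SQL GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) FROM orders "
            "GROUP BY region ORDER BY region"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


def totals_with_python(rows):
    """The same aggregation written by hand in plain Python."""
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    return dict(sorted(totals.items()))


sql_totals = totals_with_sql(ORDERS)
py_totals = totals_with_python(ORDERS)
print(sql_totals)
print("results match:", sql_totals == py_totals)
```

Checking the hand-written version against the query it replaces is a cheap way to confirm that the port is faithful before moving on to more involved logic.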
Q) Apart from the Andrew Ng Coursera MOOC, are there other specific data science blogs, resources, and projects that you would refer back to?
A) I don't have a single source that I keep going back to, apart from the Kaggle forums, which are a treasure trove of information. Although I don't spend much time participating in data science competitions, I certainly read the forums. The solutions there give a lot of ideas and tricks that you can use to make your machine learning models perform better. My main go-to source is Google. A simple Google search will show at least two or three good data science blogs relevant to the topic you want to learn. Those blogs may in turn mention research papers or other blogs to read and understand.
Q) When you talk about your background, you mention being involved in end-to-end projects. That probably has a pretty straightforward definition for you, but for the benefit of the readers, can you help us understand what an end-to-end project is? What are the different parts of that chain?
A) As a part of my role, I am involved from the conception of a project, including the sales initiatives, where I assist the business developer in pitching successfully. Once the client is with us, I work with them to fine-tune and nail down the business problem. Customers come to us with a problem, and it's up to us to work with them to find a way to apply analytics to solve that problem and provide value to the customer. Then the usual data science project lifecycle happens: the pre-processing, cleaning, and analysis process. When I had just begun my data science career, I worked more on the project lifecycle itself. Now that I have multiple machine learning projects under my belt, I provide technical guidance in most of these projects. The last and important point is stakeholder management. Throughout a data science project we have to consider our relationship with the customer, because we are a consulting organization, so maintaining and growing that relationship is important too.
Q) I was curious about how you somehow find time outside of all this to write a blog called Deep and Shallow. Can you help us understand the motivation behind it and how it has helped you?
A) Finding time is something that I am constantly working on, because it's very difficult to find time for yourself. The motivation to write a blog came from Richard Feynman's quote, "Pretend to teach a concept you want to learn about to a student in the sixth grade." Being a person who constantly wants to upgrade himself and learn new things, I felt that this was one of the best ways to force myself to do so. Basically, before writing about any topic I read and explore it to get a basic understanding. To make this understanding finer, I condense it into a blog post. The blog also serves as a go-to reference for me: if I forget something, or if I want to impart my knowledge to someone, I can do it through my blog, which helps me flourish in my career too.
Q) I noticed that one of your data science projects was about predicting the uplift of a promotion for a CPG company. If I Google for that kind of project, I'll find lots of code on GitHub or Kaggle. How is a real data science project for a client different from what someone would find on Kaggle, and what is the gap there?
A) I think one of the key differences between a Kaggle project and a real-life project is the data. In a Kaggle project, the data set is carefully curated: there are no missing rows, hardly any irrelevant columns, and even the distribution of the data is consistent between your training and test sets. But in a real data science project, the data is scattered across a million places, and you need to find it, bring it together, and connect it properly. In real-world data science problems the data itself is a lot more unorganized and unclean. To cope with that, you have to come up with very specific ways of cleaning it up, and then eventually present the results in a manner that the business can interpret and get value from. All of this is absent in a textbook example or on an online forum. A Kaggle project mainly concentrates on the modeling part. There are a lot of things you can do in the model, for which Kaggle is amazing, but there are a few things outside of that which Kaggle does not cover.
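Two of the differences mentioned here, missing values and a distribution that differs between training and test data, can be checked before any modeling starts. The sketch below is one simple way to do that in plain Python; the rows, the column names ("price", "promo"), and the choice of measuring the shift in training standard deviations are all assumptions made for illustration, not a method described in the interview.

```python
from statistics import mean, pstdev

# Hypothetical records; None marks a missing value. Illustrative data only.
TRAIN = [
    {"price": 10.0, "promo": 1},
    {"price": 12.0, "promo": None},
    {"price": 8.0, "promo": 0},
    {"price": 10.0, "promo": 1},
]
TEST = [
    {"price": 14.0, "promo": 1},
    {"price": 16.0, "promo": 1},
    {"price": 12.0, "promo": None},
    {"price": 14.0, "promo": None},
]


def missing_rate(rows, column):
    """Fraction of rows where `column` is absent or None."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def column_values(rows, column):
    """All non-missing values of `column`."""
    return [row[column] for row in rows if row.get(column) is not None]


def mean_shift(train_values, test_values):
    """Absolute gap between the means, in training standard deviations."""
    spread = pstdev(train_values)
    gap = abs(mean(test_values) - mean(train_values))
    if spread == 0:
        return 0.0 if gap == 0 else float("inf")
    return gap / spread


for column in ("price", "promo"):
    print(column, "missing in train:", missing_rate(TRAIN, column),
          "missing in test:", missing_rate(TEST, column))

shift = mean_shift(column_values(TRAIN, "price"), column_values(TEST, "price"))
print("price mean shift (in train std devs):", round(shift, 2))
```

A large shift or a very different missing rate between the two sets is a hint that a model validated on one will not behave the same way on the other, which is exactly the situation a curated competition data set tends to hide.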
Q) For a data science career aspirant who's trying to break into the field, if they had to focus on only one thing what should that be?
A) I would say it would be reading, since this field is very dynamic. If you're not upskilling yourself regularly, there will always be some new data science or machine learning technique that you don't know about. Research is happening at such a mind-boggling speed that it becomes difficult to stay current. Even if you don't want to keep up with the most cutting-edge research, you still need to read constantly. The best way to do that is to read things that are outside of your comfort zone and then practice them. It helps develop skills as well as confidence.
Q) You recently released your own PyTorch library called PyTorch Tabular. My first question is: what is so special about PyTorch compared to Keras, and why is it suddenly so popular? And the second question is: what motivated you to come up with this library? What did you find missing that made you build it?
A) I'll begin with the first one. When using Keras, you just take a layer, stack another on top of it, and then call fit. You don't get to know what's happening inside, which is a problem if you want to understand deep learning. That's okay for some use cases where standard things are available, but if you want to do something research-oriented or tweak a machine learning model, it is difficult. Keras has TensorFlow as its backend. TensorFlow 1.0, the old one, was a nightmare to work with in my opinion, because it was very difficult to debug. PyTorch, on the other hand, is very Pythonic and you can debug it very easily. You can drop down into the intricacies of the model and make changes, which I find phenomenal, and that's why I moved completely to PyTorch. I have heard that TensorFlow 2.0 is better: the snags have been resolved and it has become similar to PyTorch now.
As for PyTorch Tabular: the tabular modality, meaning standard tables with regression and classification tasks, has been predominantly dominated by gradient boosting, which is all of your XGBoost and other GBMs. Recently there has been a concentrated effort to get deep learning to work better in this modality, and a few research papers have come out, but when I was looking at it there was no real framework out there that tackled this modality apart from fastai. fastai did just that, but it is a little more difficult to hack because it has a lot of custom optimizers inside the code. I was hence leaning towards another framework with PyTorch at the base. There is also an awesome library called PyTorch Lightning, which abstracts the training part of PyTorch into a very scalable and usable platform. Using these two as the base, I built standard data ingestion and configuration components and a base model which can be extended to any other machine learning model just by changing one method. What I envision is that this will be useful for both research and industry applications. It takes away a lot of the software engineering you need to do to make a deep learning model work and puts it behind a .fit method that is as simple as Keras, while still allowing you the flexibility of PyTorch by enabling you to have custom models.
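The idea of a base model that handles the shared plumbing and is extended by changing one method can be sketched in plain Python. To be clear, this is not PyTorch Tabular's actual API, and it uses neither PyTorch nor PyTorch Lightning: the class names, the `build` hook, and the toy models below are all hypothetical, chosen only to show the design pattern under those assumptions.

```python
from statistics import mean


class BaseTabularModel:
    """Hypothetical base class: owns data ingestion, fit, and predict."""

    def build(self, features, targets):
        """The one method a subclass overrides.

        Must return a function mapping one feature vector to a prediction.
        """
        raise NotImplementedError

    def _to_matrix(self, rows):
        # Shared ingestion step: turn dict rows into ordered numeric vectors.
        return [[float(row[name]) for name in self._columns] for row in rows]

    def fit(self, rows, targets):
        if not rows or len(rows) != len(targets):
            raise ValueError("rows and targets must be non-empty and aligned")
        self._columns = sorted(rows[0])
        self._predict_one = self.build(
            self._to_matrix(rows), [float(t) for t in targets]
        )
        return self

    def predict(self, rows):
        return [self._predict_one(vector) for vector in self._to_matrix(rows)]


class MeanBaseline(BaseTabularModel):
    """Always predicts the average training target."""

    def build(self, features, targets):
        average = mean(targets)
        return lambda vector: average


class NearestNeighbour(BaseTabularModel):
    """Predicts the target of the closest training row."""

    def build(self, features, targets):
        def predict_one(vector):
            distances = [
                sum((a - b) ** 2 for a, b in zip(vector, known))
                for known in features
            ]
            return targets[distances.index(min(distances))]

        return predict_one


# Illustrative data only.
ROWS = [
    {"size": 1.0, "age": 5.0},
    {"size": 2.0, "age": 3.0},
    {"size": 3.0, "age": 1.0},
]
TARGETS = [10.0, 20.0, 30.0]
QUERY = [{"size": 2.9, "age": 1.2}]

for model in (MeanBaseline(), NearestNeighbour()):
    print(type(model).__name__, model.fit(ROWS, TARGETS).predict(QUERY))
```

Both subclasses get the same `fit` and `predict` interface for free, which is the point of the pattern: the model-specific logic lives in one place, and everything around it is written once.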
Q) One of the things we focus on at ProjectPro is helping users get their work done faster by giving them reusable templates for data science and machine learning projects. I'm curious, from your own experience, what are some hacks, tactics, or processes that you have relied on to get your projects done more efficiently and faster?
A) Throw XGBoost at it, it almost always works, right? But more seriously, the tools that I use change over time. Currently, I do my initial investigation and modeling through PyCaret, which is a low-code tool. It helps us iterate through a lot of different models with one simple call to the API. After that initial step is done, I figure out which model I want to use, or what kinds of models I want to explore, and then drop back to my own codebase. Whenever I'm working on a machine learning project, I keep saving the code in modular form into a library of my own and then reuse it. So I prototype or build a POC using PyCaret, and then rely on my own repository of libraries and code modules to release it.
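The workflow of trying many models in one call and then picking a direction can be illustrated without any library at all. The sketch below does not use PyCaret and does not reproduce its API; the `compare_models` function here is a hypothetical plain-Python stand-in, the candidate "models" are deliberately trivial constant predictors, and the numbers are made up for illustration.

```python
from statistics import mean, median


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def compare_models(candidates, train_targets, holdout_targets):
    """Fit every candidate, score it on the holdout set, best first.

    Each candidate is a function that takes the training targets and
    returns one constant prediction (a deliberately simple "model").
    """
    scores = []
    for name, fit in candidates.items():
        constant = fit(train_targets)
        predictions = [constant] * len(holdout_targets)
        scores.append((name, mean_absolute_error(holdout_targets, predictions)))
    return sorted(scores, key=lambda item: item[1])


# Hypothetical candidates and data, for illustration only.
CANDIDATES = {
    "mean": mean,
    "median": median,
    "last_value": lambda values: values[-1],
}
TRAIN_TARGETS = [10.0, 12.0, 11.0, 13.0, 50.0]  # note the outlier
HOLDOUT_TARGETS = [12.0, 11.0, 13.0]

ranking = compare_models(CANDIDATES, TRAIN_TARGETS, HOLDOUT_TARGETS)
for name, error in ranking:
    print(f"{name}: holdout MAE = {error:.2f}")
```

Keeping a helper like this in a personal module is in the spirit of the reuse described above: the quick comparison narrows the field, and the reusable code takes over once a candidate has been chosen.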
In today's competitive world, everyone wants to be the best, and every company wants to hire the best. You can certainly be an asset to your company. All you have to do is constantly upgrade your data science skills and release your projects faster. Thankfully, you don't have to look any further, because the ProjectPro team is constantly working to create a library of solved end-to-end data science and machine learning projects. With their solution code you can deliver your projects faster than usual, and with their tutorial videos you can keep enhancing your skills.