Evolution of Data Science: From SAS to LLMs

Explore the evolution of data science from early SAS to cutting-edge LLMs and discover industry-transforming use cases with insights from an industry expert.


The evolution of data science has accelerated with Generative AI and LLMs, ushering in a new era of fierce competition. With Meta's announcement that Llama 3 is poised to rival GPT-4, the data science community is buzzing with anticipation. This underscores the significance of Large Language Models (LLMs) in today's evolving data science landscape. In ProjectPro's recent podcast episode, we delve into the evolution of data science tools with industry expert Ajay Ohri, who brings over two decades of experience. Tune in to learn how data science tools have transformed from traditional SAS to cutting-edge LLMs.

 

Evolution of Data Science Tools

 

The discussion in the "Project Pro Industry Talks" podcast begins by tracing the evolution of data science tools. Ajay highlights the shift from SAS to more versatile and user-friendly programming languages such as R and Python, underscoring their pivotal role in advancing data science. He notes, "While R and Python have been around for some time, it wasn't until around 2008 that R emerged as a prominent analytics language. Python gained momentum as a data science platform after 2012."



He also highlights the importance of cloud computing platforms like GCP, Azure, and AWS in efficiently handling vast datasets and extracting valuable insights from them. However, he acknowledges that the emergence of large language models (LLMs) represents a paradigm shift in the industry. Their immense processing capabilities and natural language understanding have unlocked new avenues for data scientists, with code generation as one of the most noteworthy advancements.

 

LLM’s Impact on the Evolution of Data Science Workflow

 

Ajay Ohri provides a comprehensive overview of the machine learning project lifecycle, delving into the potential integration of LLMs across various stages of the machine learning pipeline. He underscores the promise of LLMs but points out that subjective analysis by data scientists is indispensable in certain aspects like data sourcing, labeling, and preprocessing, which cannot be entirely automated.

 

Furthermore, Ajay highlights the pivotal role of LLMs in code generation at the onset of the data pipeline. However, he maintains that model training and evaluation still require the expertise of data scientists for insightful analysis, indicating that complete automation in these areas is not yet achievable. He also notes the evolving role of data scientists, who increasingly collaborate with LLMs on code generation to improve accuracy and speed up model development, ultimately producing more robust solutions.

 

Traversing the Path of Data Science Evolution: Challenges in LLM Adoption for ROI

 

Ajay points out the challenge smaller companies face in incorporating GenAI capabilities due to the high cost of LLM development and training and the lack of pricing transparency. He proposes collaborative partnerships with larger IT firms as a solution for seamless LLM integration, acknowledging that only companies with substantial financial resources can adopt LLMs in the initial stages.

 

However, a significant obstacle to widespread LLM adoption is the impact of open-source LLMs on commercial offerings. Ajay underscores the importance of robust data protection measures to mitigate privacy concerns as organizations leverage LLMs to drive ROI. He emphasizes the growing demand for specialized GenAI professionals and data scientists who must continually enhance their skills in this evolving domain. With the hype surrounding LLMs and increased adoption, ethical considerations remain crucial for maximizing their potential benefits.

Evolution of Data Science and MLOps Roles with LLMs

 

Ajay highlights the evolving landscape of MLOps and data science roles, noting, "In an ideal world, MLOps should not be separate, but the skill set is not universal among Data Scientists." MLOps, akin to DevOps with Machine learning principles, represents a critical skill that Data Scientists need to embrace, tailored to the requirements of the Data Life Cycle. The traditional demarcation between data engineers and data scientists is fading, with an anticipated convergence in the future, fostering better collaboration within data teams.

 

Ajay also underscores the challenges of establishing custom clouds for LLMs within organizations. He advocates leveraging off-the-shelf applications for Large Language Models, citing their ease of maintenance and preservation of codebase integrity within GenAI or deep learning environments. While purchasing cloud services offers expedited access to platforms like Vertex AI, building new infrastructure entails substantial resources and ongoing maintenance, particularly for ML pipeline development and deployment using frameworks such as Flask and FastAPI.

 

In decision-making regarding LLM implementation, several factors must be considered, including existing infrastructure, capacity for incremental investment, data confidentiality levels, and regulatory compliance obligations. For example, industries like banking and finance prioritize model building due to regulatory constraints, while e-commerce industries often prefer model deployment on cloud platforms.

Enterprise-Grade Use Cases of MLOps

Ajay further elaborates on these concepts by sharing insights from his experience in various industry use cases during our podcast discussion on the evolution of data science.

 

  • In one MLOps project, Ajay collaborated on developing and deploying ML models integrated with MongoDB for data processing. Leveraging Python, the team seamlessly integrated MongoDB data and deployed the models on Vertex AI. Throughout the project, they meticulously calculated and analyzed metrics such as accuracy, precision, recall, and ROC within their software.

 

  • Another notable project involved creating a predictive model to optimize digital sales for a retail client. Given the complexity of numerous SKU variables and their interdependencies, the team utilized advanced feature engineering techniques. The model was subsequently deployed on Google Cloud, accompanied by dashboard creation for monitoring purposes. To ensure real-time model scoring with fresh data, they employed a pickle file.
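The two projects above share a common pattern: compute classification metrics (accuracy, precision, recall, ROC) on a held-out set, then persist the fitted model with pickle so fresh data can be scored later. The sketch below illustrates that pattern on a synthetic dataset; the dataset, model choice, and split are assumptions for demonstration only, not details from Ajay's projects.

```python
# Sketch of the evaluate-then-persist pattern: compute the metrics
# named above for a fitted classifier, then pickle it for later scoring.
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption; real projects pulled from MongoDB).
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]

# The metrics tracked in the project described above.
metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
}

# Persist the fitted model; a deployed scorer loads this artifact
# and calls predict()/predict_proba() on fresh incoming data.
blob = pickle.dumps(model)
scorer = pickle.loads(blob)
fresh_scores = scorer.predict_proba(X_test[:5])[:, 1]
```

In production the pickled artifact would be written to disk or object storage and loaded by the serving process, exactly as with the real-time scoring setup described above.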

 

Evolution of Data Science Skill Set: Expert Recommendations

 

Continuing our discussion on the evolution of data science, Ajay sheds light on crucial skill sets for aspiring and experienced data scientists. He emphasizes the importance of mastering MLFlow for efficient model tracking and management, alongside leveraging AutoML tools such as PyCaret, TPOT, and H2O to streamline the model development process. Additionally, certification in cloud platforms like Azure and Google Vertex AI is highlighted as essential for showcasing expertise in deploying models at scale.

 

Ajay highlights the growing importance of staying updated with emerging skills like Large Language Models (LLMs), Generative AI (GenAI), and Machine Learning Operations (MLOps). Alongside these, traditional skills such as Python programming and data modeling remain essential. As the industry shifts towards enterprise-friendly deployment methods like LLMOps, being proficient in these new technologies becomes increasingly valuable for data scientists.

 

While Ajay suggests that Docker and Kubeflow may be optional, they can enhance a data scientist's toolkit. Lastly, understanding security protocols for ensuring safe and secure model deployments is essential in today's data science world.

Emerging GenAI Use Cases in the Enterprise: The Future of Data Science

 

In the final segment of the podcast, he particularly highlights code generation as a game-changing use case, noting its ability to streamline model development processes and reduce workload. However, he advises caution regarding text prediction applications, citing potential biases observed in projects like Google’s Gemini. Furthermore, Ajay foresees a significant evolution in GenAI tools, advocating for Large Language Models (LLMs) customized to specific enterprise requirements, citing Microsoft as a prime example.

 

He goes on to stress the importance of training in LLMs and GenAI, proposing a comprehensive approach that combines courses on platforms like Coursera and Udemy with hands-on project experience from platforms like ProjectPro. He believes this approach will foster a collaborative ecosystem where trained professionals contribute to open-source repositories, facilitating continuous innovation and improvement of GenAI capabilities.

 

We hope the podcast offered valuable insights into the evolving landscape of data science and the transformative potential of AI, particularly LLMs, in the field. With effective collaboration between domain experts, data scientists, and AI specialists, organizations can harness the full potential of these technologies. However, it is crucial for the community to prioritize responsible AI development to ensure that these advancements benefit society.

 

Explore other top trending data science podcasts with industry experts on ProjectPro’s YouTube channel. Check out the ProjectPro repository, which has over 250 solved end-to-end hands-on data science and big data projects.