If you are an aspiring data scientist, always learning, exploring and playing with data, then this blog post will help you get ready to begin your career in data science with Python. The Python language has a rich and healthy ecosystem with ample libraries for data analysis, data I/O and data munging. The best way to make sure that you are all set to become a data scientist is to become well-versed in the various Python libraries and tools that people use in the industry for doing data science. We asked our data science faculty to list five Python libraries for data science that they think every data scientist must know how to use. Check them out below:

All of us can easily do some kind of data analysis using pen and paper on small data sets. Now imagine a situation where we have to analyze petabytes of data. We would require specialized tools and techniques to analyze and derive meaningful information from such huge datasets. pandas is one of those libraries for data analysis; it contains high-level data structures and tools that help data scientists and data analysts manipulate data in a simple and easy way.

Providing a simple yet effective way to analyze data requires the ability to index, retrieve, split, join and restructure data, and to perform various other analyses on both multi-dimensional and single-dimensional data. The pandas data analysis library has some unique features that provide these capabilities:

Series and DataFrame: these two are high-performance array and table structures for representing heterogeneous and homogeneous data sets in pandas.

pandas provides the flexibility to reshape data structures, so that data can be inserted into both the rows and the columns of tabular data.

To allow automatic alignment of data and indexing, pandas provides labelling on series and tabular data.

Hierarchical indexing of data spread across multiple axes, which helps in creating more than one label on each data item.

The functionality to perform split-apply-combine on series as well as on tabular data.

Using pandas, programmers can easily identify and handle missing data in both floating-point and non-floating-point data.

**vii) Powerful capabilities to load and save data from various formats such as JSON, CSV, HDF5, etc.**

**viii) Conversion from NumPy and Python data structures to pandas objects.**

**ix) Slicing and subsetting of datasets, which include merging and joining data sets with SQL-like constructs.**
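Several of the features listed above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the data is made up, and the column names (`store`, `sales`, `city`) are hypothetical rather than taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records; one value is missing (NaN).
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "sales": [10.0, np.nan, 30.0, 50.0],
})

# Missing data: count the gaps, then fill them with a chosen default.
n_missing = int(sales["sales"].isna().sum())
filled = sales.fillna({"sales": 0.0})

# Split-apply-combine: split by store, apply sum, combine into a Series.
totals = filled.groupby("store")["sales"].sum()

# SQL-like join: attach a city to every record via the shared "store" key.
cities = pd.DataFrame({"store": ["A", "B"], "city": ["Pune", "Delhi"]})
merged = filled.merge(cities, on="store", how="left")

# Hierarchical indexing: two labels (store, city) on each row.
indexed = merged.set_index(["store", "city"])

print(n_missing)
print(totals)
print(indexed)
```

Filling missing values with zero is just one possible choice here; dropping the rows with `dropna()` or filling with a mean may suit other data better.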

Although pandas provides many statistical methods, that alone is not enough for doing data science in Python. pandas works together with other Python libraries for data science, such as NumPy, SciPy, Sci-Kit Learn and Matplotlib, to draw conclusions from large data sets. This makes it possible for pandas applications to take advantage of the robust and extensive Python ecosystem.

Many people say that Python is amazing for doing data science, but that they spent three days installing Python and other libraries before they could start learning data science in Python. Installing the PyData stack manually is not recommended, particularly when one does not know which libraries will actually be needed. If you are one of them, then Anaconda by Continuum is for you.

Anaconda is one of the most popular Python distributions, offering both paid and free components. Anaconda is very popular among the open source community because of its cross-platform support: it runs on Windows, Mac and Linux.

The base package of Anaconda installs pandas as part of the default installation process, which makes it easy to begin using pandas. The default installation also installs the IPython Notebook server, which can be used to run applications interactively.

Excited? Now, let’s install Anaconda and pandas, to write some cool stuff!!

You can download the latest Anaconda from the Continuum Analytics website https://www.continuum.io/downloads. Once you visit the website, it will automatically detect the OS and provide you with different options for downloading.

After the download completes, a Windows system has the installer available as an executable file.

Run the installer and the screens will guide you through installing Anaconda. Just follow the on-screen instructions and finish the installation process.

After the installation process is completed, open the command prompt and type python. On a successful installation, the Python interpreter prompt will appear.

Now that Anaconda is installed successfully, we need to check whether the installed pandas is the most recent version. The pandas version can be verified using the conda package manager from the command line as follows:

conda list pandas

If the installed pandas version is not a recent one, use the command below to update pandas:

conda update pandas

This command will download the latest version of pandas and all its dependencies.
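The installed version can also be checked from inside Python itself. This is a minimal sketch that only reads the version string pandas exposes; the variable name `version` is just for illustration.

```python
import pandas as pd

# pandas exposes its version string as pd.__version__.
version = pd.__version__
print("pandas version:", version)
```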

We will write our first application in the IPython interpreter, as it provides a very convenient way of writing Python applications.

Open the command prompt and type ipython, then enter the following commands:

In [1]: import pandas as pd

In [2]: mydf = pd.DataFrame({'column1': [1, 2, 3]})

In [3]: print (mydf)

column1

0 1

1 2

2 3

That’s it! It’s very easy to write pandas applications using IPython. We can also write them using the web-based GUI of IPython Notebook.
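Building on this first application, the sketch below shows a labelled Series, automatic alignment on labels, and a DataFrame built from a NumPy array. The labels, values and the second column name (`column2`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# A Series is a labelled one-dimensional array.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Arithmetic aligns on labels, not positions; labels present in only
# one Series produce NaN in the result.
total = s1 + s2

# A DataFrame can be built directly from a NumPy array.
arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(arr, columns=["column1", "column2"])

# Select rows with a boolean condition, then one column by name.
subset = df[df["column1"] > 1]["column2"]

print(total)
print(df)
print(subset.tolist())
```

Because alignment happens on labels, only `b` and `c` receive numeric sums here; `a` and `d` each appear in just one Series.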


NumPy, short for Numerical Python, is a Python library for numerical calculations and scientific computations. NumPy provides numerous features that Python enthusiasts and programmers can use to work with high-performance arrays and matrices. NumPy arrays provide vectorization of mathematical operations, which gives them a performance boost over Python’s looping constructs.
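To make the idea of vectorization concrete, here is a small sketch that computes the same squares twice, once with an explicit Python loop and once with a single array expression. The variable names are illustrative only, and no timing is measured here.

```python
import numpy as np

values = list(range(1, 6))

# Pure-Python approach: square every element with an explicit loop.
squares_loop = []
for v in values:
    squares_loop.append(v * v)

# Vectorized approach: one expression operates on the whole array.
arr = np.array(values)
squares_vec = arr * arr

print(squares_loop)
print(squares_vec)
```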

pandas Series and DataFrame objects rely primarily on NumPy arrays for mathematical calculations such as slicing elements and performing vector operations. Below are some of the features provided by NumPy:

- Integration with legacy languages: It provides tools for integrating code written in C, C++ and Fortran.
- Mathematical Operations: It provides all the standard functions required to perform operations on large data sets in a very fast and efficient way, which otherwise have to be performed through looping constructs.
- ndarray: It is a fast and efficient multidimensional array which can perform vector based arithmetic operations and has powerful broadcasting capabilities.
- I/O Operations: It provides various tools which can be used to write/read very large data sets from disk. It also supports I/O operations on memory based file mappings.
- Fourier transform capabilities, Linear Algebra and Random Number Generation.
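A few of the features in this list can be sketched as follows. This is an illustration under stated assumptions: the arrays are made up, the file name `shifted.npy` is hypothetical, and the file is written to a temporary directory.

```python
import os
import tempfile

import numpy as np

# Broadcasting: the 1-D row is stretched across each row of the 2-D array.
matrix = np.array([[1, 2, 3], [4, 5, 6]])
row = np.array([10, 20, 30])
shifted = matrix + row

# I/O: write the array to disk in NumPy's binary .npy format and read it back.
path = os.path.join(tempfile.mkdtemp(), "shifted.npy")
np.save(path, shifted)
restored = np.load(path)

# Linear algebra: determinant of a 2x2 matrix.
det = np.linalg.det(np.array([[1.0, 2.0], [3.0, 4.0]]))

# Fourier transform: the first FFT coefficient is the sum of the signal.
spectrum = np.fft.fft(np.array([1.0, 2.0, 3.0, 4.0]))

# Random numbers: a seeded generator gives reproducible draws.
rng = np.random.default_rng(seed=0)
samples = rng.random(3)

print(shifted)
print(round(det, 6))
print(spectrum[0])
print(samples.shape)
```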

If you have installed Anaconda as mentioned above, then NumPy is installed automatically, as it is one of the dependencies of pandas. If you installed Python some other way, you need to install NumPy separately after installing Python. Also keep in mind that NumPy has to be installed first, before any add-ons that build on it.

Open the command prompt and type ipython. Then type the commands below:

In [1]: import numpy as np

In [2]: myarray = np.array([7,4,3,8,9],int)

In [3]: myarray

Out[3]: array([7, 4, 3, 8, 9])

In [5]: myarray[:2]

Out[5]: array([7, 4])

**iii) Accessing elements from Array**

In [6]: myarray[3]

Out[6]: 8

In [1]: import numpy as np

In [2]: myarray = np.array([4,5,7,8],np.int32)

In [3]: mystring = myarray.tobytes()

In [4]: mystring

Out[4]: b'\x04\x00\x00\x00\x05\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00'
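The raw bytes can be turned back into an array with `np.frombuffer`. This is a minimal sketch, assuming the element type is known in advance; it uses `tobytes()` to produce the bytes, and the variable names are illustrative.

```python
import numpy as np

# A fixed 32-bit integer type gives 4 bytes per element on every platform.
original = np.array([4, 5, 7, 8], dtype=np.int32)
raw = original.tobytes()

# frombuffer needs the same dtype that produced the bytes.
recovered = np.frombuffer(raw, dtype=np.int32)

print(len(raw))
print(recovered)
```

The bytes carry no type information, so decoding them with a different dtype would silently give different numbers.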

Phew!!! Those were some cool commands, let’s move forward to our next Python library in the list.


SciPy, short for Scientific Python, is an assortment of mathematical functions and algorithms built on top of Python’s NumPy extension. SciPy provides various high-level commands and classes for manipulating and visualizing data. SciPy is useful for data processing and for prototyping systems.

Apart from this, SciPy offers other advantages for building scientific applications and many specialized, sophisticated applications, and it is backed by a powerful and fast-growing Python community.

As we are using Anaconda to install Python modules and run the commands, we will use Anaconda for SciPy as well. SciPy also gets installed with the default installation of Anaconda. For Python lovers, SciPy is also available for separate download at http://www.scipy.org/install.html.

In [1]: import numpy as np

In [2]: import scipy as sp

In [3]: import scipy.linalg as spalg

In [4]: myarray1 = np.array([[1,2],[3,4]])

In [6]: myarray1

Out[6]:

array([[1, 2],

[3, 4]])

In [7]: spalg.inv(myarray1)

Out[7]:

array([[-2. , 1. ],

[ 1.5, -0.5]])

In [8]: myarray2 = np.array([[5,6]])

In [11]: myarray1.dot(myarray2.T)

Out[11]:

array([[17],

[39]])
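As a follow-up to the session above, the sketch below checks the inverse numerically and solves a small linear system with `scipy.linalg.solve`. The right-hand side `b` is a made-up example and the variable names are illustrative.

```python
import numpy as np
import scipy.linalg as spalg

a = np.array([[1, 2], [3, 4]])
b = np.array([5, 6])

# A matrix times its inverse should give the identity matrix
# (up to floating-point rounding).
identity_check = np.allclose(a.dot(spalg.inv(a)), np.eye(2))

# Solve a @ x = b directly instead of forming the inverse explicitly.
x = spalg.solve(a, b)

print(identity_check)
print(x)
```

Solving the system directly is generally preferred over multiplying by an explicit inverse, since it avoids an extra step that can lose accuracy.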

Awesome!!! We have just completed three of the most popular Python libraries for data science, and it’s time to go ahead with the next one.

We have all heard the quote “Necessity is the mother of invention”. The same holds true for matplotlib. This open source project was developed to handle different types of data generated from multiple sources in epilepsy research. matplotlib is a 2D graphical Python library. It also supports 3D graphics, but this support is very limited. With demand for Python growing manyfold in recent years, matplotlib has given tough competition to giants like MATLAB and Mathematica.

You might be very happy to hear that matplotlib was also installed by default when you installed Anaconda, so you do not need any extra installation. Knowledge seekers who want to build Matplotlib from source code can visit http://matplotlib.org/users/installing.html.

Demo for creating date plots, loading a sample file of Yahoo stock data that comes with the default installation.

In [1]: import datetime as dt

In [2]: import numpy as np

In [3]: import matplotlib.pyplot as matpy

In [4]: import matplotlib.dates as matdt

In [5]: import matplotlib.cbook as cbook

In [6]: yrs = matdt.YearLocator()

In [7]: mnt = matdt.MonthLocator()

In [8]: yrsFmt = matdt.DateFormatter('%Y')

In [9]: dataFile = cbook.get_sample_data('goog.npy')

In [10]: try:

    ...:     r = np.load(dataFile,encoding='bytes').view(np.recarray)

    ...: except TypeError:

    ...:     r = np.load(dataFile).view(np.recarray)

In [13]: fig,ax = matpy.subplots()

In [14]: ax.plot(r.date,r.adj_close)

Out[14]: []

In [15]: ax.xaxis.set_major_locator(yrs)

In [16]: ax.xaxis.set_major_formatter(yrsFmt)

In [17]: ax.xaxis.set_minor_locator(mnt)

In [19]: mindate = dt.date(r.date.min().year,1,1)

In [20]: maxdate = dt.date(r.date.max().year+1,1,1)

In [21]: ax.set_xlim(mindate,maxdate)

Out[21]: (731581.0, 733408.0)

In [22]: def price(x):

    ...:     return '$%1.2f' % x

In [23]: ax.format_xdata = matdt.DateFormatter('%Y-%m-%d')

In [25]: ax.format_ydata = price

In [26]: ax.grid(True)

In [27]: fig.autofmt_xdate()

In [28]: matpy.show()

Once the last command, matpy.show(), is executed, a pop-up window will appear with the resulting plot.

With some basic coding and a few commands you are able to create a visual graph based on the data. Imagine the brilliance and power of matplotlib.
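For readers who do not have the sample data file at hand, the same date-plot technique can be sketched with made-up values. This is an assumption-laden illustration, not the demo's data: the prices are synthetic, the output file name `date_plot.png` is hypothetical, and the non-interactive `Agg` backend is chosen so the script saves to a file instead of opening a window.

```python
import datetime as dt
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, no window
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

# Made-up data: one value per day for 90 days (not real stock prices).
start = dt.date(2016, 1, 1)
dates = [start + dt.timedelta(days=i) for i in range(90)]
prices = [100 + 0.5 * i for i in range(90)]

fig, ax = plt.subplots()
lines = ax.plot(dates, prices)

# One major tick per month, labelled like "Jan 2016".
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b %Y"))

ax.set_ylabel("price")
ax.grid(True)
fig.autofmt_xdate()  # rotate the date labels so they do not overlap

# Save the figure into a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "date_plot.png")
fig.savefig(out_path)
print(out_path)
```

In an interactive session, dropping the `matplotlib.use("Agg")` line and calling `plt.show()` would open the plot in a window instead.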

For all the machine learning practitioners who want to bring machine learning into production systems, Sci-Kit Learn is the savior. Sci-Kit Learn has several supervised and unsupervised machine learning algorithms with the level of robustness and support required for use in production applications. As this library provides various learning algorithms, it has been named Sci-Kit Learn. Sci-Kit Learn focuses on code quality, good documentation, ease of use and performance. Sci-Kit Learn has a steep learning curve.

Sci-Kit Learn is built upon SciPy, so using Sci-Kit Learn requires other Python libraries to be installed as well: NumPy and SciPy are needed, and Pandas, SymPy and IPython (the enhanced interactive console) are commonly used alongside it. However, on installing Anaconda, Sci-Kit Learn is also installed by default.
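A supervised learning workflow with Sci-Kit Learn (imported as `sklearn`) can be sketched in a few lines. The choices here are assumptions made for illustration and are not taken from the article: the bundled iris dataset, a logistic regression classifier, and a 75/25 train/test split.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# The iris dataset ships with the library: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows to evaluate on data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Supervised learning: fit on the training rows, score on the held-out rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(X_train.shape, X_test.shape)
print(accuracy)
```

Most Sci-Kit Learn estimators follow the same `fit` and `score` pattern, so swapping in another classifier usually changes only the line that constructs the model.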


At the beginning of this article you might have only heard about the popular Python libraries for data science, but now you can do some basic coding and work wonders with your datasets using them. The Python ecosystem is a huge ocean with many more libraries for data scientists to explore. These were just a few of them. Subscribe to our blog for more updates on exploring other Python libraries.

