If you are an aspiring data scientist, always learning, exploring and playing with data, then this blog post will help you get ready to begin your career in data science with Python. Python has a rich and healthy ecosystem with ample libraries for data analysis, data I/O and data munging. The best way to make sure that you are all set to become a data scientist is to become well-versed in the Python libraries and tools that people in the industry use for data science. We asked our data science faculty to list five Python libraries for data science that they think every data scientist must know how to use. Check them out below:
All of us can easily do some kind of data analysis using pen and paper on small data sets. Now imagine a situation where we have to analyze data sets with millions of rows. We would require specialized tools and techniques to analyze and derive meaningful information from such huge datasets. pandas is one of those libraries for data analysis: it contains high-level data structures and tools that help data scientists and data analysts manipulate data in a simple and easy way.
Analyzing data simply yet effectively requires the ability to index, retrieve, split, join, restructure and otherwise manipulate both single- and multi-dimensional data. The pandas data analysis library has some key features that provide these capabilities:
i) Series and DataFrame: these two are high-performance array and table structures for representing homogeneous and heterogeneous data sets in pandas.
ii) pandas provides the flexibility to reshape data structures, so that data can be inserted into both the rows and the columns of tabular data.
iii) To allow automatic alignment of data and indexing, pandas provides labelling on series and tabular data.
iv) Hierarchical indexing of data spread across multiple axes, which helps in creating more than one label on each data item.
v) The functionality to perform split-apply-combine on series as well as on tabular data.
vi) Using pandas, programmers can easily identify and handle missing data in both floating-point and non-floating-point data.
vii) Powerful capabilities to load and save data from various formats such as JSON, CSV, HDF5, etc.
viii) Conversion from NumPy and Python data structures to pandas objects.
ix) Slicing and subsetting of datasets, along with merging and joining data sets using SQL-like constructs.
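Several of the features listed above can be seen together in a short sketch. This is only an illustration, not the definitive way to use pandas: the `sales` and `managers` tables and all their values are made up for the example.

```python
import numpy as np
import pandas as pd

# Labelled Series: arithmetic aligns on index labels, not on position.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])
aligned_sum = s1 + s2  # labels present in only one Series become NaN

# Missing data: detect it, then fill it with a default value.
missing_mask = aligned_sum.isna()
filled = aligned_sum.fillna(0.0)

# Split-apply-combine: group rows by a key and aggregate each group.
sales = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "units": [5, 3, 7, 1],
})  # made-up example data
units_per_region = sales.groupby("region")["units"].sum()

# SQL-like join: merge two tables on a shared column.
managers = pd.DataFrame({
    "region": ["east", "west"],
    "manager": ["Asha", "Ben"],
})  # made-up example data
joined = sales.merge(managers, on="region", how="left")

# Conversion from a NumPy array to a pandas object.
from_numpy = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

print(aligned_sum)
print(units_per_region)
print(joined)
```

Here `aligned_sum` has values only for the labels "b" and "c", which both Series share; the other two labels hold missing values until `fillna` replaces them.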
Although pandas provides many statistical methods, it is not enough on its own for doing data science in Python. pandas is built on NumPy and is typically used alongside other Python libraries for data science, such as SciPy, Sci-Kit Learn and Matplotlib, to draw conclusions from large data sets. This makes it possible for pandas applications to take advantage of the robust and extensive Python ecosystem.
Many people say something like “Python is amazing for doing data science, but I spent 3 days installing Python and the other libraries before I could start learning.” Installing the PyData stack manually is not recommended, particularly when you do not yet know which libraries you will actually need. If you are one of these people, then Anaconda by Continuum is for you.
Anaconda is one of the most popular Python distributions, offering both free and paid components. It is very popular in the open source community because of its cross-platform support: it runs on Windows, Mac and Linux.
The base package of Anaconda installs pandas as a part of the default installation process, which makes it easy to begin using pandas. The default installation also installs IPython Notebook server, which can be used to run the applications interactively.
Excited? Now, let’s install Anaconda and pandas, to write some cool stuff!!
You can download the latest Anaconda from the Continuum Analytics website https://www.continuum.io/downloads. Once you visit the website, it will automatically detect the OS and provide you with different options for downloading.
On Windows, the downloaded installer is an executable file.
Run the installer and its screens will guide you through installing Anaconda. Just follow the on-screen instructions to finish the installation process.
After the installation process is completed, open the command prompt and type python. On a successful installation, the Python interpreter starts and shows its version information.
Now that Anaconda is installed successfully, we need to check whether the installed pandas is a recent version. The pandas version can be verified using the conda package manager from the command line as follows:
conda list pandas
If the pandas version installed is not a recent one, then use the below command to update Pandas-
conda update pandas
This command downloads and installs the latest available version of pandas, along with any dependencies that need to be updated.
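The installed version can also be confirmed from inside Python itself. A minimal check, assuming pandas imports without errors:

```python
import pandas as pd

# pandas exposes its installed version as a plain string.
installed_version = pd.__version__
print(installed_version)
```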
We will write our first application in IPython interpreter, as it provides a very convenient way for writing Python applications.
Open the command prompt and type ipython as shown below-
In : import pandas as pd
In : mydf = pd.DataFrame({'column1': [1, 2, 3]})
In : print (mydf)
That’s it! It’s very easy to write pandas applications using IPython. We can also write using the web based GUI of IPython Notebook.
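To go one small step further, the sketch below builds a DataFrame with two columns and inspects it. The second column, `column2`, and its values are made up for illustration.

```python
import pandas as pd

mydf = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": ["a", "b", "c"],  # hypothetical extra column
})

n_rows, n_cols = mydf.shape            # (rows, columns)
column_types = mydf.dtypes             # data type of each column
column1_mean = mydf["column1"].mean()  # select one column and aggregate it

print(n_rows, n_cols)
print(column_types)
print(column1_mean)
```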
NumPy, short for Numerical Python, is a Python library for numerical calculations and scientific computations. NumPy provides numerous features that Python enthusiasts and programmers can use to work with high-performance arrays and matrices. NumPy arrays support vectorized mathematical operations, which gives them a performance boost over Python’s looping constructs.
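The difference between a Python loop and a vectorized NumPy operation can be sketched as follows. Both approaches compute the same squares; the input values are arbitrary.

```python
import numpy as np

values = list(range(1, 6))

# Pure-Python approach: visit every element explicitly.
squares_loop = [v * v for v in values]

# NumPy approach: one vectorized expression applied to the whole array.
arr = np.array(values)
squares_vec = arr * arr

print(squares_loop)
print(squares_vec.tolist())
```

On five elements the speed difference is negligible; it becomes noticeable on large arrays, which can be measured with the standard `timeit` module.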
pandas Series and DataFrame objects rely primarily on NumPy arrays for operations such as slicing elements and performing vectorized mathematical calculations.
If you have installed Anaconda as mentioned above, then NumPy is installed automatically, as it is one of the dependencies of pandas. If you installed Python some other way, you need to install NumPy separately after installing Python. Keep in mind that NumPy has to be installed first, before any add-ons that build on it.
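The relationship between pandas and NumPy can be seen in a short sketch. The Series values here are arbitrary.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])

# The data behind a Series can be exposed as a NumPy array.
values = s.to_numpy()

# NumPy functions accept a Series directly and return a Series.
roots = np.sqrt(s)

# Arithmetic on a Series is vectorized, just like on a NumPy array.
doubled = s * 2

print(type(values))
print(roots.tolist())
print(doubled.tolist())
```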
Open the command prompt and type ipython. Then type the commands below:
In : import numpy as np
In : myarray = np.array([7,4,3,8,9],int)
In : myarray
Out: array([7, 4, 3, 8, 9])
In : myarray[:2]
Out: array([7, 4])
iii) Accessing elements from Array
In : myarray
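A few common ways of accessing array elements are sketched below, reusing the array from the session above. The particular indices are chosen only for illustration.

```python
import numpy as np

myarray = np.array([7, 4, 3, 8, 9], int)

first = myarray[0]            # indexing starts at 0
last = myarray[-1]            # negative indices count from the end
middle = myarray[1:4]         # slice: positions 1, 2 and 3
large = myarray[myarray > 5]  # boolean mask: keep elements greater than 5

print(first, last)
print(middle)
print(large)
```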
In : import numpy as np
In : myarray = np.array([4,5,7,8],int)
In : mystring = myarray.tobytes()
In : mystring
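In recent NumPy releases the method for this conversion is `tobytes()`, and the result can be turned back into an array with `np.frombuffer`. The sketch below shows the round trip; the 64-bit integer dtype is fixed explicitly here as an assumption, so that the byte count is the same on every platform.

```python
import numpy as np

myarray = np.array([4, 5, 7, 8], dtype=np.int64)

# Raw memory contents of the array as a Python bytes object.
raw = myarray.tobytes()

# The bytes carry no dtype or shape information, so the dtype has to be
# supplied again when rebuilding the array.
restored = np.frombuffer(raw, dtype=np.int64)

print(len(raw))  # 4 elements of 8 bytes each
print(restored)
```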
Phew!!! Those were some cool commands, let’s move forward to our next Python library in the list.
SciPy, short for Scientific Python, is an assortment of mathematical functions and algorithms built on top of NumPy, Python’s numerical extension. SciPy provides various high-level commands and classes for manipulating and visualizing data. SciPy is useful for data processing and for prototyping systems.
Apart from this, SciPy offers other advantages for building scientific applications and many specialized, sophisticated applications, all backed by a powerful and fast-growing Python community.
As we are using Anaconda to install Python modules and run the commands, we will use Anaconda for SciPy as well. SciPy is included in the default installation of Anaconda, so no extra step is needed. For Python lovers, SciPy can also be installed separately by following the instructions at http://www.scipy.org/install.html.
In : import numpy as np
In : import scipy as sp
In : import scipy.linalg as spalg
In : myarray1 = np.array([[1,2],[3,4]])
In : myarray1
In : spalg.inv(myarray1)
Out: array([[-2. ,  1. ],
            [ 1.5, -0.5]])
In : myarray2 = np.array([[5,6]])
In : myarray1.dot(myarray2.T)
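As a small follow-up to the session above, the sketch below checks the inverse and solves a linear system with `scipy.linalg.solve`. The right-hand side `b` is an arbitrary example vector.

```python
import numpy as np
import scipy.linalg as spalg

A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([5.0, 6.0])  # arbitrary right-hand side

A_inv = spalg.inv(A)

# A matrix times its inverse should give the identity, up to rounding.
identity_ok = np.allclose(A.dot(A_inv), np.eye(2))

# Solve A x = b directly instead of multiplying by the inverse.
x = spalg.solve(A, b)

print(identity_ok)
print(x)
```

Solving the system directly is generally preferred over forming the inverse explicitly, since it does less work and tends to be more accurate numerically.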
Awesome!!! We have just completed the three most popular Python libraries for data science, and it’s time to go ahead with the next one.
We have all heard the saying “Necessity is the mother of invention”. The same holds true for matplotlib. This open source project was originally developed to handle different types of data generated from multiple sources in epilepsy research. matplotlib is primarily a 2D plotting library for Python. It also supports 3D graphics, but that support is limited. With demand for Python growing manyfold in recent years, matplotlib has given tough competition to giants like MATLAB and Mathematica.
You will be happy to hear that matplotlib was also installed as part of the default Anaconda installation, so you do not need to do any extra installation. Knowledge seekers who want to build Matplotlib from source can visit http://matplotlib.org/users/installing.html.
Demo for creating date plots, loading a sample file of stock price data (goog.npy) that comes with the default installation.
In : import datetime as dt
In : import numpy as np
In : import matplotlib.pyplot as matpy
In : import matplotlib.dates as matdt
In : import matplotlib.cbook as cbook
In : yrs = matdt.YearLocator()
In : mnt = matdt.MonthLocator()
In : yrsFmt = matdt.DateFormatter('%Y')
In : dataFile = cbook.get_sample_data('goog.npy')
In : try:
         # Python 3 needs encoding='bytes' to read this Python 2 era file
         r = np.load(dataFile,encoding='bytes').view(np.recarray)
     except TypeError:
         r = np.load(dataFile).view(np.recarray)
In : fig,ax = matpy.subplots()
In : ax.plot(r.date,r.adj_close)
In : ax.xaxis.set_major_locator(yrs)
In : ax.xaxis.set_major_formatter(yrsFmt)
In : ax.xaxis.set_minor_locator(mnt)
In : mindate = dt.date(r.date.min().year,1,1)
In : maxdate = dt.date(r.date.max().year+1,1,1)
In : ax.set_xlim(mindate,maxdate)
Out: (731581.0, 733408.0)
In : def price(x):
         return '$%1.2f' % x
In : ax.format_xdata = matdt.DateFormatter('%Y-%m-%d')
In : ax.format_ydata = price
In : ax.grid(True)
In : fig.autofmt_xdate()
In : matpy.show()
Once the last command, matpy.show(), is executed, a pop-up window will appear showing the resulting plot.
With some basic coding and a few commands you are able to create a visual graph based on the data. Imagine the brilliance and power of matplotlib.
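If the sample data file is not available in your installation, the same date-axis techniques can be tried on generated data. The sketch below is an assumption-laden stand-in, not the demo above: the prices are random made-up numbers, the file name `date_plot.png` is hypothetical, and the plot is written to an image file with the non-interactive Agg backend instead of opening a window.

```python
import datetime as dt
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, no window
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: one random-walk "price" per day for about two years.
start = dt.date(2020, 1, 1)
dates = [start + dt.timedelta(days=i) for i in range(730)]
rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(0, 1, size=len(dates)))

fig, ax = plt.subplots()
ax.plot(dates, prices)

# Major ticks on each year, minor ticks on each month.
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
ax.xaxis.set_minor_locator(mdates.MonthLocator())

ax.set_ylabel("price (made-up units)")
ax.grid(True)
fig.autofmt_xdate()

# Hypothetical output location in the system's temporary directory.
out_path = os.path.join(tempfile.gettempdir(), "date_plot.png")
fig.savefig(out_path)
print(out_path)
```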
For machine learning practitioners who want to bring machine learning into production systems, Sci-Kit Learn (officially written scikit-learn) is the savior. It offers several supervised and unsupervised machine learning algorithms with the level of robustness and support required for use in production applications. The name reflects its origin as a “SciKit” (SciPy Toolkit) that provides learning algorithms. Sci-Kit Learn focuses on code quality, good documentation, ease of use and performance. Sci-Kit Learn has a steep learning curve.
Sci-Kit Learn is built upon NumPy and SciPy, so both must be installed before it can be used. Other Python libraries such as Pandas, SymPy and IPython (the enhanced interactive console) are commonly part of the same scientific Python stack. However, on installing Anaconda, Sci-Kit Learn is also installed by default.
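A minimal supervised learning sketch is shown below. The choice of the bundled iris dataset and of logistic regression is an assumption made for illustration; any other dataset or estimator from the library follows the same fit-then-score pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small example dataset bundled with the library.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows to estimate accuracy on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a classifier on the training rows, then score it on the held-out rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(f"test accuracy: {accuracy:.2f}")
```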
At the beginning of this article you may only have heard of the popular Python libraries for data science, but now you can do some basic coding and work wonders with your datasets using them. The Python ecosystem is a huge ocean, with many more libraries for data scientists to explore. These were just a few of them. Subscribe to our blog for more updates on exploring other Python libraries.