We get a lot of questions about where to start when getting into machine learning and data science. Our answer is always the same: mathematics and the basics of Python are mandatory. In fact, you can subscribe to our blog and get a free Math for Machine Learning guide. However, we haven’t covered Python basics on this blog, so we decided to change that. In this article, we learn about two libraries without which data scientists would be lost: NumPy and Pandas. You can kickstart your data science career with this two-part cheatsheet.


1. NumPy

This library is the granddad of all other important data science libraries. It is a fundamental library for scientific computing in Python. Libraries like Pandas, Matplotlib and scikit-learn are built on top of it, while TensorFlow and PyTorch interoperate closely with it. In essence, it is a multidimensional array library. Sounds simple? As you probably already know, linear algebra is one of the main building blocks of machine learning. It makes it all possible. Manipulating vectors (represented as arrays in Python) and matrices is much easier with NumPy.


Ok, but why use it over standard Python lists? Among several reasons, the main one is speed. Unlike lists, NumPy uses fixed types, and that is where the speed comes from. For example, if we observe a simple 32-bit integer, a Python list stores much more information than NumPy does. The built-in integer type stores the following information: size, reference count, object type and object value. So instead of just the 4 bytes that NumPy would use, the list uses ~28 bytes per integer (on 64-bit CPython).
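The size difference can be checked directly. This is a minimal sketch; the variable names are our own, and the exact size of a Python integer depends on the interpreter build:

```python
import sys

import numpy as np

# Size of one built-in Python integer object (typically 28 bytes on 64-bit CPython).
py_int_size = sys.getsizeof(1)

# Size of one element in a NumPy array of 32-bit integers.
np_int_size = np.array([1], dtype=np.int32).itemsize

print(py_int_size, np_int_size)
```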

Apart from that, NumPy uses contiguous memory on the heap, unlike lists. This also brings effective cache utilization. In a nutshell, NumPy is faster than lists because it uses less memory and fixed types. Another benefit is that NumPy has way more operations and features than lists. NumPy comes with the Anaconda Python distribution, so if you installed Anaconda you are good to go. Otherwise, install it with pip (pip install numpy).

To import NumPy in your project, just use:
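The usual convention is to import the library under the alias np:

```python
import numpy as np  # "np" is the conventional alias

# Printing the version is a quick way to confirm the installation works.
print(np.__version__)
```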

1.1 NumPy Array

The NumPy array is the main building block of NumPy. In data science, it is often used to model and abstract vectors and matrices. Here is how you can initialize a simple array:
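A minimal sketch, assuming three illustrative values:

```python
import numpy as np

# A one-dimensional array (vector) built from a Python list.
array = np.array([1, 2, 3])
print(array)  # [1 2 3]
```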

A matrix (2D array) can be initialized like this:
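A sketch with placeholder values, using a nested list where each inner list is one row:

```python
import numpy as np

# Two rows and three columns; the values are illustrative.
twod_array = np.array([[1, 2, 3],
                       [4, 5, 6]])
print(twod_array)
```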

You can get a lot of information about a NumPy array. For example, if you need its number of dimensions, you can read the ndim attribute:
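A short sketch, assuming the two illustrative arrays below:

```python
import numpy as np

array = np.array([1, 2, 3])
twod_array = np.array([[1, 2, 3], [4, 5, 6]])

print(array.ndim)       # 1
print(twod_array.ndim)  # 2
```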


However, more often you will need the shape of the array. When you are working on machine learning and deep learning applications, the shapes of the matrices tell you about the input and output of your architecture. You can get the shape like this:
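A sketch using a one-dimensional array and a matrix with placeholder values:

```python
import numpy as np

array = np.array([1, 2, 3])
twod_array = np.array([[1, 2, 3], [4, 5, 6]])

# shape is a tuple with one entry per dimension.
print(array.shape)       # (3,)
print(twod_array.shape)  # (2, 3)
```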

As you can see, array is a one-dimensional array with three elements, while twod_array is a 2×3 matrix. Apart from that, you can always get the type of the array's elements by checking the dtype attribute. The same name can be used as a parameter during initialization to define the type of the array:
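A minimal sketch; the name typed_array is our own, and the default integer type of the first array depends on the platform:

```python
import numpy as np

array = np.array([1, 2, 3])
print(array.dtype)  # platform default integer type

# Passing dtype at creation time fixes the element type explicitly.
typed_array = np.array([1, 2, 3], dtype=np.int16)
print(typed_array.dtype)  # int16
```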

You may be interested in how many elements there are in some array, and what is its memory footprint. Here is how you can do that with the attributes itemsize, size and nbytes:
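A sketch, assuming an array of three 32-bit integers so that the numbers are easy to check:

```python
import numpy as np

array = np.array([1, 2, 3], dtype=np.int32)

print(array.itemsize)  # 4  (bytes per element)
print(array.size)      # 3  (number of elements)
print(array.nbytes)    # 12 (itemsize * size)
```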

1.2 Reading and Updating Elements, Rows and Columns with Indexes

There are many options when it comes to manipulating arrays in NumPy. Let’s initialize a matrix that we use for these experiments:
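A minimal sketch, assuming a 2×6 matrix of consecutive integers (the values are our own placeholders for x):

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5, 6],
              [7, 8, 9, 10, 11, 12]])
print(x.shape)  # (2, 6)
```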

Sometimes you need a specific element from the array. That is done with indexes. For example, say you want the fourth element of the second row of the array x. That is done like this:
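A sketch, assuming x is the illustrative 2×6 matrix below:

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5, 6],
              [7, 8, 9, 10, 11, 12]])

# Row index 1 is the second row, column index 3 is the fourth element.
print(x[1, 3])  # 10
```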

The first number in the [] brackets is the index along the first dimension of the array, while the second number is the index along the second dimension of the array x. Note that indexes in Python start from 0. Let’s practice it and get the sixth element from the first row:
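With the same illustrative matrix, that could look like this:

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5, 6],
              [7, 8, 9, 10, 11, 12]])

# First row has index 0, sixth element has index 5.
print(x[0, 5])  # 6
```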

Another way to access the same element is by using negative indexes. These indexes indicate that we “count” from the back of the array. It is one neat trick:
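A sketch on the illustrative matrix, where -1 refers to the last position along an axis:

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5, 6],
              [7, 8, 9, 10, 11, 12]])

# -1 is the last element of the row, so this is the same as x[0, 5].
print(x[0, -1])  # 6
```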

If we want to get a complete row as a separate NumPy array, we can do it like this:
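For example, assuming the placeholder matrix x below:

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5, 6],
              [7, 8, 9, 10, 11, 12]])

first_row = x[0, :]
print(first_row)  # [1 2 3 4 5 6]
```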

The ‘:’ indicates that everything should be taken from that dimension, i.e. axis. We can utilize negative indexes too:
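A sketch that picks the last row of the illustrative matrix with a negative index:

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5, 6],
              [7, 8, 9, 10, 11, 12]])

last_row = x[-1, :]
print(last_row)  # [ 7  8  9 10 11 12]
```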

In the same way, by using ‘:’, we can get complete columns as a separate NumPy array:
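A sketch on the same placeholder matrix, this time with ‘:’ in the row position:

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5, 6],
              [7, 8, 9, 10, 11, 12]])

first_column = x[:, 0]
print(first_column)  # [1 7]
```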

The cool thing is that we can get a subset of this matrix. For example, if we want the first three elements from the first row, we can do so like this:
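A sketch using a slice, where the end index is exclusive (x holds our illustrative values):

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5, 6],
              [7, 8, 9, 10, 11, 12]])

# Columns 0, 1 and 2 of the first row.
subset = x[0, 0:3]
print(subset)  # [1 2 3]
```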

Indexing can be used for updating values in the array too:
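A minimal sketch; the new value 100 is arbitrary:

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5, 6],
              [7, 8, 9, 10, 11, 12]])

# Assigning to an index changes the array in place.
x[1, 3] = 100
print(x[1, 3])  # 100
```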

In the same way, you can do a bulk update:
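A sketch of two bulk updates on the illustrative matrix, one with a single value and one with a list of values:

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5, 6],
              [7, 8, 9, 10, 11, 12]])

x[0, :] = 0         # set the whole first row to 0
x[:, 1] = [20, 21]  # replace the second column, one value per row
print(x)
```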

Finally, another cool thing with the NumPy array is Boolean indexing. This means that you can get a true/false value for each element in the array based on some condition. For example, let’s check out which elements in array x are larger than 10:
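A sketch on the placeholder matrix; the name mask is our own. The Boolean array can also be used as an index to pull out the matching values:

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5, 6],
              [7, 8, 9, 10, 11, 12]])

mask = x > 10   # Boolean array with the same shape as x
print(mask)
print(x[mask])  # [11 12]
```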

1.3 NumPy Array Initialization

NumPy provides numerous functions for array initialization. For example, if you need a matrix with all zeros, you can do so like this:
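A sketch with np.zeros, assuming a 2×3 shape:

```python
import numpy as np

# The shape is passed as a tuple; the default element type is float.
zeros = np.zeros((2, 3))
print(zeros)
```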

Or, if you need a matrix with all ones:
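The counterpart is np.ones, sketched here with the same assumed shape:

```python
import numpy as np

ones = np.ones((2, 3))
print(ones)
```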

You can also fill a complete matrix with a specific number, using the full function:
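A sketch with np.full; the shape and the fill value 7 are arbitrary choices:

```python
import numpy as np

sevens = np.full((2, 3), 7)
print(sevens)
```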

The interesting thing about the full function is that there is a variation called full_like. This function picks up the dimensions of an array that already exists and creates a new matrix filled with a defined value. Check it out:
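A minimal sketch; the names template and nines are our own:

```python
import numpy as np

template = np.zeros((2, 3))

# Same shape and element type as template, but filled with 9.
nines = np.full_like(template, 9)
print(nines)
```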


If you need a matrix with random values, NumPy has you covered. There is a whole submodule called random, which you can use to create different random values:
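One option is np.random.rand, sketched here with an assumed 2×3 shape. It draws uniformly distributed floats from the interval [0, 1):

```python
import numpy as np

random_matrix = np.random.rand(2, 3)
print(random_matrix)
```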

You can do the same thing with integers:
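A sketch with np.random.randint; the range 0 to 10 is an arbitrary choice, and the upper bound is exclusive:

```python
import numpy as np

random_ints = np.random.randint(0, 10, size=(2, 3))
print(random_ints)
```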

To create an identity matrix, you can use the identity function, which receives the size of the identity matrix:
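For example, a 3×3 identity matrix:

```python
import numpy as np

# Ones on the main diagonal, zeros everywhere else.
identity = np.identity(3)
print(identity)
```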

Also, there are multiple ways to create new matrices from existing arrays. This can be done with functions like repeat and stack:
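A sketch of repeat, assuming a single-row matrix that we copy three times along the first axis:

```python
import numpy as np

row = np.array([[1, 2, 3]])

# axis=0 repeats along rows, so the result has three identical rows.
repeated = np.repeat(row, 3, axis=0)
print(repeated)
```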

The repeat function has an axis parameter, which defines the axis along which the array is repeated. The stack function is similar, but more useful, since it lets you merge multiple arrays:
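A minimal sketch with two placeholder vectors. np.stack joins them along a new axis, while np.hstack (a related function) joins them end to end:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

stacked = np.stack([a, b])    # shape (2, 3): each input becomes a row
flattened = np.hstack([a, b]) # shape (6,): inputs placed side by side
print(stacked)
print(flattened)
```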

Finally, arrays can be reshaped using the reshape method. This functionality is very useful for deep learning applications:
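A sketch, assuming six consecutive values. The total number of elements has to stay the same, and -1 lets NumPy infer one dimension:

```python
import numpy as np

flat = np.arange(6)            # [0 1 2 3 4 5]
matrix = flat.reshape(2, 3)    # 2 rows, 3 columns
inferred = flat.reshape(3, -1) # 3 rows, column count inferred as 2
print(matrix)
```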

1.4 Operations with Scalars

With NumPy arrays, it is quite easy to perform simple operations like addition, subtraction, multiplication and division with scalars (i.e. simple numbers). Let’s observe NumPy array y and add a number to each of its elements:
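A minimal sketch, assuming four illustrative values for y and adding the scalar 2 to each of them:

```python
import numpy as np

y = np.array([1, 2, 3, 4])

# The scalar is applied to every element.
print(y + 2)  # [3 4 5 6]
```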

Or if we want to subtract number 3 from each element:
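With the same placeholder values for y:

```python
import numpy as np

y = np.array([1, 2, 3, 4])
print(y - 3)  # [-2 -1  0  1]
```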


Multiplication? You name it:
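A sketch that doubles each element of the illustrative array:

```python
import numpy as np

y = np.array([1, 2, 3, 4])
print(y * 2)  # [2 4 6 8]
```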

Division is supported too:
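A sketch with the placeholder array; note that the result is a float array:

```python
import numpy as np

y = np.array([1, 2, 3, 4])
print(y / 2)  # [0.5 1.  1.5 2. ]
```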

You can raise each element to a power as well:
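For example, squaring every element of the illustrative array with the ** operator:

```python
import numpy as np

y = np.array([1, 2, 3, 4])
print(y ** 2)  # [ 1  4  9 16]
```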

1.5 Linear Algebra with NumPy

As we mentioned, NumPy is quite a useful library for linear algebra. This means that we can easily do operations with other arrays. Let’s say that we want to add some array to the array y:
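A minimal sketch; the second array z and all values are our own placeholders. Both arrays need to have compatible shapes:

```python
import numpy as np

y = np.array([1, 2, 3, 4])
z = np.array([10, 20, 30, 40])

# Addition is done element by element.
print(y + z)  # [11 22 33 44]
```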

Or subtract values of one array from another array:
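With the same two placeholder arrays:

```python
import numpy as np

y = np.array([1, 2, 3, 4])
z = np.array([10, 20, 30, 40])

print(z - y)  # [ 9 18 27 36]
```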

You can do element-wise multiplication of two arrays:
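A sketch with illustrative values; the * operator multiplies matching positions, it is not a matrix product:

```python
import numpy as np

y = np.array([1, 2, 3, 4])
z = np.array([10, 20, 30, 40])

print(y * z)  # [ 10  40  90 160]
```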

Or divide one array by another:
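Again element by element, sketched with placeholder values:

```python
import numpy as np

y = np.array([1, 2, 3, 4])
z = np.array([10, 20, 30, 40])

print(z / y)  # [10. 10. 10. 10.]
```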

If you want to multiply two matrices, this can be done with the matmul function:
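A sketch with a 2×3 and a 3×2 matrix (illustrative values). The @ operator is an equivalent shorthand:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
b = np.array([[1, 0],
              [0, 1],
              [1, 1]])

product = np.matmul(a, b)  # same result as a @ b
print(product)  # [[ 4  5]
                #  [10 11]]
```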

If you need the determinant of a matrix, that can be done too:
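The function lives in the linalg submodule. A sketch with a small placeholder matrix:

```python
import numpy as np

m = np.array([[1, 2],
              [3, 4]])

# 1*4 - 2*3 = -2; the result is a float, so expect tiny rounding differences.
determinant = np.linalg.det(m)
print(determinant)
```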


1.6 Basic Statistics with NumPy

There are many statistical operations that you can perform with NumPy. In this tutorial, we cover only the basics. Let’s create an array from random integer values:
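A minimal sketch; the name values, the range and the length are our own choices:

```python
import numpy as np

# Ten random integers from 0 (inclusive) to 100 (exclusive).
values = np.random.randint(0, 100, size=10)
print(values)
```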

For this array (or any other for that matter) we can quickly get max value:
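A sketch that uses a fixed array as a stand-in for the random one, so the result is reproducible:

```python
import numpy as np

values = np.array([4, 8, 15, 16, 23, 42])
print(np.max(values))  # 42
```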

Or the minimum value:
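Using the same kind of fixed stand-in array:

```python
import numpy as np

values = np.array([4, 8, 15, 16, 23, 42])
print(np.min(values))  # 4
```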

Also, we can get the mean and median value of the array, which is quite useful:
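A sketch on a fixed placeholder array, so the numbers can be verified by hand:

```python
import numpy as np

values = np.array([4, 8, 15, 16, 23, 42])

print(np.mean(values))    # 18.0
print(np.median(values))  # 15.5 (average of the two middle values)
```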

1.7 Trigonometry with NumPy

Finally, NumPy has some really useful options when it comes to trigonometry, like getting the sine and cosine of values in the array:
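A sketch with three illustrative angles. Both functions expect radians and work element by element:

```python
import numpy as np

angles = np.array([0, np.pi / 2, np.pi])

sines = np.sin(angles)    # approximately [0, 1, 0]
cosines = np.cos(angles)  # approximately [1, 0, -1]
print(sines)
print(cosines)
```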

2. Pandas

Pandas is one of the popular libraries built on top of NumPy. Some people consider it the most important tool of data analysts, and indeed it is quite useful. There are many things you can do with this library, including data pre-processing and data cleanup. It is one of the best tools for exploratory data analysis and feature engineering. Pandas’ data structures are fast, flexible and designed to make data analysis easy.

It is part of the Anaconda Python distribution, but you can install it with pip too.

Let’s see what it is all about!

2.1 Pandas Series

Series are the basic building block of the Pandas library. They can be seen as NumPy arrays, but with indexes that can be labeled. For example, you can create a simple series like this:
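A minimal sketch; the values and labels are our own placeholders:

```python
import pandas as pd

# Three values, each with its own label in the index.
series = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(series)
```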

We may loosely define it as a “label indexed array”. Indexing works the same way as with NumPy arrays, with the exception that you can use labels as indexes:
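A sketch on an illustrative series. Labels go directly into the brackets, while iloc is the explicit way to index by position:

```python
import pandas as pd

series = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(series["b"])     # 20 (by label)
print(series.iloc[1])  # 20 (by position)
```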

In the same way that we sliced the NumPy Array with ‘:’, the same thing can be done here:
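A sketch with placeholder data. One difference worth knowing: a slice by label includes the end label, while a slice by position excludes the end index:

```python
import pandas as pd

series = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = series["a":"b"]      # labels a and b, end label included
by_position = series.iloc[0:2]  # positions 0 and 1, end index excluded
print(by_label)
```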


2.2 Pandas DataFrames

If Series are arrays, DataFrames are matrices. DataFrames have rows and columns, and each of them is labeled. Here is how we create one:
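A minimal sketch that builds a DataFrame from a dictionary of columns. The column names, row labels and values are hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]},  # one list per column
    index=["x", "y", "z"],                   # row labels
)
print(df)
```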


A DataFrame provides various information out of the box, like its shape:
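A sketch on a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

# (number of rows, number of columns)
print(df.shape)  # (3, 2)
```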

At any moment we can get index information:
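With the same kind of placeholder DataFrame, the row labels are available through the index attribute:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

print(df.index)
print(list(df.index))  # ['x', 'y', 'z']
```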

The same goes for the column information:
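A sketch using the columns attribute on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

print(df.columns)
print(list(df.columns))  # ['a', 'b']
```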

As well as the number of instances per column:
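One way to get this is the count method, which returns the number of non-missing values per column. A sketch on placeholder data:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

counts = df.count()
print(counts)
```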

Finally, a complete description of the DataFrame can be retrieved like this:
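We assume here that the info method is what is meant; it prints the index, the columns with their types and non-null counts, and the memory usage:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

# info() prints its summary and returns None.
df.info()
```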

If you need to extend your DataFrame with another column, that can be done fairly easily:
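A minimal sketch; the new column name c and the way it is computed are arbitrary choices:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

# Assigning to a new column name adds the column.
df["c"] = df["a"] * 10
print(df)
```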


2.3 Reading Operations

The cool thing about Pandas is that there are multiple options for selecting columns, rows as well as specific elements. For example, if we want to select one element, there are multiple ways we can do that. We can use labels:
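A sketch on hypothetical data. Both loc and at take a row label and a column label; at is meant for single values:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

print(df.loc["y", "b"])  # 6.0
print(df.at["y", "b"])   # 6.0
```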

There is also the iat accessor, which we can use if we know the position of the element in the DataFrame:
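A sketch with the placeholder DataFrame, using zero-based row and column positions:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

# Second row (position 1), second column (position 1).
print(df.iat[1, 1])  # 6.0
```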


With labels, we can select a whole column. This operation returns a Series object:
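A sketch that selects the hypothetical column a:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

column_a = df["a"]
print(type(column_a).__name__)  # Series
print(column_a)
```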

Rows can be selected by position with the iloc indexer:
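A sketch on placeholder data. A single position returns the row as a Series, while a slice returns a smaller DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

first_row = df.iloc[0]       # Series holding the values of row "x"
first_two = df.iloc[0:2]     # DataFrame with rows "x" and "y"
print(first_row)
```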

Boolean indexing is available in Pandas too, but there is more control since you can use each column separately:
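A sketch with hypothetical columns. Conditions on different columns can be combined with & (and) or | (or), with each condition wrapped in parentheses:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

large_a = df[df["a"] > 1]                      # rows x and z
both = df[(df["a"] > 1) & (df["b"] > 4.5)]     # only row z
print(both)
```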

2.4 Dropping rows/columns

The drop method is what we use when we want to remove a series from a DataFrame. That series can be a row or a column, which is controlled by the axis parameter. Here is how we can remove a row:
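A sketch on placeholder data. Note that drop returns a new DataFrame and leaves the original unchanged:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

# axis=0 means the label refers to a row.
without_x = df.drop("x", axis=0)
print(without_x)
```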


A column can be removed if we use axis=1:
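With the same hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

# axis=1 means the label refers to a column.
without_b = df.drop("b", axis=1)
print(without_b)
```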


2.5 Sorting and Ranking

Pandas is sometimes compared with Excel, because of the tabular operations you can do with it. Sorting and ranking are among them. For example, you can sort data by index:
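A sketch on placeholder data whose index is already sorted. The second part reorders the rows first (the name shuffled is our own) to show what sort_index does in that case:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

sorted_df = df.sort_index()          # index is already sorted, so nothing moves

shuffled = df.loc[["z", "x", "y"]]   # same rows in a different order
restored = shuffled.sort_index()     # back to x, y, z
print(restored)
```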


In our example, the DataFrame looks exactly the same. However, if our index labels were not sorted, this method would do the trick. Sorting can also be done by any column:
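A sketch that sorts a hypothetical DataFrame by the values of column b:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

# Ascending by default; pass ascending=False to reverse the order.
by_b = df.sort_values(by="b")
print(by_b)
```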


Another useful thing is adding ranks to the values, which can be done with the rank method:
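A sketch on placeholder data. The smallest value in each column gets rank 1, and ranks are returned as floats:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

ranks = df.rank()
print(ranks)
```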


2.6 Basic Statistics with Pandas

Pandas provides several options for statistics of each column and row. The basic information about each column can be retrieved with the describe method:
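A sketch on a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

# One column of summary statistics per numeric column.
summary = df.describe()
print(summary)
```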


As you can see, all the important points are here, like mean, median, standard deviation, max, min, Q1 and Q3. It is a pretty good indication of the distribution of each column. Almost every point from the table above has a separate function that you can use in case you need it. If you need the mean value of a column, use the mean function:
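A sketch with placeholder values; calling mean on the whole DataFrame gives one value per column:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

print(df["a"].mean())  # 2.0
print(df.mean())       # mean of every column
```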

For the median value, there is the median function:
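On the same hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

print(df["b"].median())  # 5.0
```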

Maximum and minimum have functions max and min respectively:
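A sketch on placeholder values for column a:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

print(df["a"].max())  # 3
print(df["a"].min())  # 1
```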

You can also get the index of the maximal and minimal value, using idxmax and idxmin respectively:
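A sketch with hypothetical row labels; both methods return the label of the row, not its position:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

print(df["a"].idxmax())  # x
print(df["a"].idxmin())  # y
```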

Apart from that, you can get the sum of each column using the sum method:
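A sketch on placeholder data:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

# One total per column, returned as a Series indexed by column name.
totals = df.sum()
print(totals)
```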


Or get the cumulative sum using the cumsum method:
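With the same hypothetical values, each row holds the running total of its column up to that row:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

running = df.cumsum()
print(running)
```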


2.7 Using Functions with DataFrames

If you want to transform the data in the DataFrame, you can utilize functions. One way is to use unnamed lambda functions and the apply method:
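A minimal sketch; the doubling is an arbitrary transformation. On a DataFrame, apply passes each column to the function, while on a single column it passes each value:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]}, index=["x", "y", "z"])

doubled = df.apply(lambda column: column * 2)    # applied column by column
squared_a = df["a"].apply(lambda value: value ** 2)  # applied value by value
print(doubled)
```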


2.8 Manipulate Data from CSV

Tabular data is often stored in CSV files. We can load this tabular data into DataFrames using the read_csv function. For example, let’s load data from the Palmer Penguins dataset:
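Since the location of the file depends on your setup, the sketch below reads from an in-memory string instead. The rows are made-up placeholders in the style of the dataset, not actual records; with a real file you would pass its path to read_csv instead:

```python
import io

import pandas as pd

# Placeholder CSV content; replace io.StringIO(...) with the path to your CSV file.
csv_text = (
    "species,island,body_mass_g\n"
    "Adelie,Torgersen,3750\n"
    "Adelie,Torgersen,3800\n"
    "Gentoo,Biscoe,5000\n"
)

data = pd.read_csv(io.StringIO(csv_text))
print(data)
```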


Then you can get more information and manipulate this data with Pandas:
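A sketch of two typical steps, again on made-up placeholder rows rather than the real dataset: a quick look at the first rows, and a mean per group:

```python
import pandas as pd

# Placeholder rows standing in for data loaded with read_csv.
data = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "body_mass_g": [3750, 3800, 5000],
})

print(data.head())  # first rows of the table

# Average body mass for each species.
mean_mass = data.groupby("species")["body_mass_g"].mean()
print(mean_mass)
```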


Once you are done, you can store the data in a new CSV file like this:
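A sketch with the to_csv method. The file name is hypothetical, and a temporary directory is used here only so the example cleans up after itself:

```python
import os
import tempfile

import pandas as pd

data = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "body_mass_g": [3750, 5000],
})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "penguins_clean.csv")  # hypothetical file name
    data.to_csv(path, index=False)  # index=False skips writing the row labels
    restored = pd.read_csv(path)    # read it back to check the result

print(restored)
```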

2.9 Manipulate Data from SQL

Tabular data often comes from an SQL database. In order to load this data into a DataFrame, you can use the SQLAlchemy library in combination with Pandas:
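To keep the sketch runnable without a database server, it swaps in an in-memory SQLite connection from the standard library, which Pandas also accepts. With SQLAlchemy, an engine created by create_engine would take the place of conn. The table name and values are hypothetical:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # stand-in for a SQLAlchemy engine

# Create a small table so there is something to query.
conn.execute("CREATE TABLE measurements (a INTEGER, b REAL)")
conn.executemany("INSERT INTO measurements VALUES (?, ?)", [(3, 4.0), (1, 6.0), (2, 5.0)])
conn.commit()

# Any SELECT query can be used; the result comes back as a DataFrame.
result = pd.read_sql("SELECT * FROM measurements WHERE a > 1", conn)
print(result)
conn.close()
```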

As you can see, you can send SQL queries for specific tables. You can also write data to SQL like this:
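A sketch with the to_sql method, again using an in-memory SQLite connection as a stand-in for a SQLAlchemy engine. The table name is hypothetical:

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [4.0, 6.0, 5.0]})

conn = sqlite3.connect(":memory:")  # stand-in for a SQLAlchemy engine

# if_exists="replace" overwrites the table if it is already there.
df.to_sql("measurements", conn, index=False, if_exists="replace")

# Read the table back to confirm the rows were written.
written = pd.read_sql("SELECT * FROM measurements", conn)
print(written)
conn.close()
```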

Conclusion

In this article, we covered the basics of two fundamental data science libraries: NumPy and Pandas. Of course, we were not able to go through all the options that these libraries provide, but we find that these cheatsheets can be a valid guide for beginners and a reference guide for more experienced data scientists.

Thank you for reading!

Nikola M. Zivkovic

CAIO at Rubik's Code

Nikola M. Zivkovic is the CAIO at Rubik’s Code and the author of the book “Deep Learning for Programmers“. He loves knowledge sharing, and he is an experienced speaker. You can find him speaking at meetups, conferences and as a guest lecturer at the University of Novi Sad.

Rubik’s Code is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the services we provide.