We always get a lot of questions about where one should start when getting into the field of **machine learning** and data science. Our answer is always the same: mathematics and the basics of *Python* are mandatory. In fact, you can subscribe to our blog and get our free **Math for Machine Learning guide here**. However, we haven't covered *Python* basics on this blog, so we decided to change that. In this article, we learn about two libraries without which data scientists would be lost – **NumPy** and **Pandas**. You can kickstart your data science career with this two-part cheatsheet.


## 1. NumPy

This library is the granddad of all other important data science libraries. It is a **fundamental** library for scientific computing in Python. Basically, all other libraries like *Pandas*, *Matplotlib*, *SciKit Learn*, **TensorFlow** and **PyTorch** are built on top of it. In its essence, it is a **multidimensional** array library. That may sound simple, but as you probably already know, **linear algebra** is one of the main building blocks of machine learning – it makes it all possible. Manipulating vectors (represented as arrays in *Python*) and matrices is much easier with **NumPy**.

Ok, but why use it **over** standard *Python* lists? Among several reasons, the main one is **speed**. Unlike lists, *NumPy* uses **fixed types**, and that is where the speed comes from. For example, consider a simple 32-bit integer: a *Python* list would store much more information than *NumPy*. The built-in integer type stores the following information: size, reference count, object type and object value. So instead of just 4 bytes that *NumPy* would use, the list uses ~28 bytes per element.

Apart from that, *NumPy* uses **contiguous** blocks of memory on the heap, unlike lists. This also brings effective **cache** utilization. In a nutshell, one of the reasons that *NumPy* is faster than lists is that it uses less memory and fixed types. Another benefit is that *NumPy* has far more operations and features than lists. *NumPy* comes with the *Anaconda Python* distribution, so if you installed *Anaconda* you are good to go. Otherwise, install it with the command:
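The original command is not shown here; assuming a standard environment with *pip* available, it would be:

```shell
pip install numpy
```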

To import *NumPy* in your project, just use:
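By convention, *NumPy* is imported under the alias *np*, which is the alias used throughout this article:

```python
# Import NumPy under its conventional alias
import numpy as np

# Verify the installation by printing the version
print(np.__version__)
```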

### 1.1 NumPy Array

The *NumPy* array is the main building block of *NumPy*. In data science, it is often used to model and abstract **vectors** and **matrices**. Here is how you can initialize a simple array:

A matrix (2D array) can be initialized like this:
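The original snippets are not shown here, but a minimal sketch of both initializations (with illustrative values) might look like this:

```python
import numpy as np

# One-dimensional array (a vector)
array = np.array([1, 2, 3])
print(array)

# Two-dimensional array (a matrix) with 2 rows and 3 columns
twod_array = np.array([[1, 2, 3], [4, 5, 6]])
print(twod_array)
```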

You can get a lot of information about a *NumPy* *array*. For example, if you need the number of **dimensions**, you can check the *ndim* attribute:

More often, however, you will need the shape of the array: when you are working on machine learning and **deep learning** applications, the shapes of the matrices give you information about the input and output of your architecture. You can get the shape like this:

As you can see, *array* is a one-dimensional array with three elements, while *twod_array* is a 2×3 matrix. Apart from that, you can always get the type of the array by checking the *dtype* attribute. This attribute can also be used during **initialization** of the array to define its **type**:

You may be interested in how many elements there are in some array, and what its memory footprint is. Here is how you can find out with the attributes *itemsize*, *size* and *nbytes*:
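A sketch covering all the attributes mentioned above, using illustrative arrays named *array* and *twod_array*:

```python
import numpy as np

# Explicit dtype during initialization: each element is a 16-bit integer
array = np.array([1, 2, 3], dtype='int16')
twod_array = np.array([[1, 2, 3], [4, 5, 6]])

print(array.ndim)        # 1 - one dimension
print(twod_array.ndim)   # 2 - two dimensions
print(twod_array.shape)  # (2, 3) - 2 rows, 3 columns
print(array.dtype)       # int16
print(array.itemsize)    # 2 - bytes per element
print(array.size)        # 3 - number of elements
print(array.nbytes)      # 6 - itemsize * size
```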

### 1.2 Reading and Updating Elements, Rows and Columns with Indexes

There are many options when it comes to manipulating arrays in *NumPy*. Let's initialize a matrix that we will use for these experiments:
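The article's actual matrix is not shown; as an assumption for the examples that follow, let's use a 2×7 matrix *x* with illustrative values:

```python
import numpy as np

# A 2x7 matrix used for the indexing experiments
x = np.array([[1, 2, 3, 4, 5, 6, 7],
              [8, 9, 10, 11, 12, 13, 14]])
print(x.shape)  # (2, 7)
```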

Sometimes you need a specific element from the array. That is done with **indexes**. For example, you want to get the fourth element from the second row from the array *x*. That is done like this:

The first number in the [] brackets is the index along the first dimension of the array, while the second number is the index along the second dimension of the array *x*. Note that indexes in *Python* start from 0. Let's practice and get the sixth element from the first row:

Another way to access the same element is by using **negative** indexes. These indexes indicate that we “count” from the back of the array. It is one neat trick:
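A sketch of these three access patterns, assuming *x* is the illustrative 2×7 matrix:

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5, 6, 7],
              [8, 9, 10, 11, 12, 13, 14]])

print(x[1, 3])   # 11 - fourth element of the second row
print(x[0, 5])   # 6  - sixth element of the first row
print(x[0, -2])  # 6  - same element, counted from the back
```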

If we want to get a complete row as a separate *NumPy* array, we can do it like this:

The ‘:’ indicates that everything should be taken from that dimension, i.e. axis. We can utilize negative indexes too:

In the same way, by using ‘:’, we can get complete columns as a separate *NumPy* array:

The cool thing is that we can get a subset of this matrix. For example, if we want the first three elements from the first row, we can do so like this:
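A sketch of slicing whole rows, whole columns and subsets, again with the illustrative matrix *x*:

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5, 6, 7],
              [8, 9, 10, 11, 12, 13, 14]])

print(x[0, :])    # the whole first row
print(x[-1, :])   # the whole last row, via a negative index
print(x[:, 2])    # the whole third column
print(x[0, 0:3])  # first three elements of the first row
```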

Indexing can be used for updating values in the array too:

In the same way, you can do a **bulk** update:

Finally, another cool thing with the *NumPy* array is **Boolean** indexing. This means that you can get a true/false value for each element in the array based on some **condition**. For example, let's check which elements in array *x* are larger than 10:
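A sketch of single-element updates, bulk updates and Boolean indexing (values are illustrative):

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5, 6, 7],
              [8, 9, 10, 11, 12, 13, 14]])

x[1, 3] = 20      # update a single element
x[:, 2] = [0, 0]  # bulk update of a whole column

print(x > 10)     # Boolean mask: True where the element is larger than 10
print(x[x > 10])  # only the elements larger than 10
```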

### 1.3 NumPy Array Initialization

*NumPy* provides numerous functions for array initialization. For example, if you need a matrix with all **zeros**, you can do so like this:

Or, if you need a matrix with all ones:

Sure, you can initialize a **complete** matrix with a certain number:

The interesting thing about the *full* method is that there is a variation called *full_like*. This function picks up **dimensions** from another array that is already created and creates a new matrix with a defined value. Check it out:
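Sketches of these initializers (shapes and fill values are illustrative):

```python
import numpy as np

print(np.zeros((2, 3)))       # 2x3 matrix of zeros
print(np.ones((2, 3)))        # 2x3 matrix of ones
print(np.full((2, 3), 7))     # 2x3 matrix filled with the number 7

# full_like picks up the dimensions of an existing array
base = np.array([[1, 2, 3], [4, 5, 6]])
print(np.full_like(base, 9))  # 2x3 matrix filled with 9
```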

If you need a matrix with **random** values, *NumPy* has you covered. There is a whole submodule called **random**, with which you can create different random values:

You can do the same thing with **integers**:

To create an **identity matrix**, you can use the *identity* function, which receives the dimension of the identity matrix:
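A sketch of random floats, random integers and an identity matrix (shapes and ranges are illustrative):

```python
import numpy as np

print(np.random.rand(2, 3))              # 2x3 matrix of random floats in [0, 1)
print(np.random.randint(0, 10, (2, 3)))  # 2x3 matrix of random integers from 0 to 9
print(np.identity(3))                    # 3x3 identity matrix
```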

Also, there are multiple ways to create new matrixes from arrays. This can be done with methods like *repeat* and *stack*:

As you can see, *repeat* has an *axis* parameter, with which you can define along which axis the array is going to be repeated. The *stack* method is similar, but it is more useful, since with it you can merge multiple arrays:

Finally, arrays can be reshaped using the *reshape* method. This functionality is very useful for deep learning applications:
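Sketches of *repeat*, *stack* and *reshape* (input arrays are illustrative):

```python
import numpy as np

a = np.array([[1, 2, 3]])
print(np.repeat(a, 3, axis=0))  # repeats the row 3 times -> 3x3 matrix

b = np.array([4, 5, 6])
print(np.stack([a[0], b]))      # merges two arrays into a 2x3 matrix

c = np.arange(6)
print(c.reshape(2, 3))          # reshapes a 6-element vector into a 2x3 matrix
```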

### 1.4 Operations with Scalars

With *NumPy* *arrays* it is quite easy to perform simple operations like addition, multiplication and division with scalars (i.e. plain numbers). Let's observe the *NumPy* array *y* and add a number to each of its elements:

Or if we want to subtract number 3 from each element:

Multiplication? You name it:

The division is supported too:

You can do exponential operations as well:
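A sketch of all the scalar operations above, assuming an illustrative array *y*:

```python
import numpy as np

y = np.array([1, 2, 3, 4])

print(y + 2)   # [3 4 5 6]       - addition
print(y - 3)   # [-2 -1  0  1]   - subtraction
print(y * 2)   # [2 4 6 8]       - multiplication
print(y / 2)   # [0.5 1. 1.5 2.] - division
print(y ** 2)  # [ 1  4  9 16]   - exponentiation
```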

### 1.5 Linear Algebra with NumPy

As we mentioned, *NumPy* is quite a useful library for **linear algebra**. This means that we can easily perform operations between arrays. Let's say that we want to add some array to the array *y*:

Or subtract values of one array from another array:

You can do **element-wise** multiplication of two arrays:

Or divide one array by another:

If you want to multiply two matrices, this can be done with the *matmul* method:

If you need the determinant of some matrix, that can be done too:
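A sketch of element-wise array operations, matrix multiplication and the determinant (arrays are illustrative):

```python
import numpy as np

y = np.array([1, 2, 3, 4])
z = np.array([5, 6, 7, 8])

print(y + z)  # element-wise addition
print(z - y)  # element-wise subtraction
print(y * z)  # element-wise multiplication
print(z / y)  # element-wise division

a = np.array([[1, 2], [3, 4]])
b = np.identity(2)
print(np.matmul(a, b))   # matrix multiplication (a times the identity is a)
print(np.linalg.det(a))  # determinant: 1*4 - 2*3 = -2
```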

### 1.6 Basic Statistics with NumPy

There are many statistics options that you can perform with *NumPy*. In this tutorial, we cover only the basics. Let’s create an array from **random** integer values:

For this array (or any other for that matter) we can quickly get **max** value:

Or minimal value:

Also, we can get the **mean** and **median** value of the array, which is quite useful:
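A sketch of these basic statistics on an array of random integers (the range and size are illustrative):

```python
import numpy as np

stats = np.random.randint(0, 100, 10)  # 10 random integers from 0 to 99

print(stats.max())     # maximum value
print(stats.min())     # minimum value
print(np.mean(stats))  # mean value
print(np.median(stats))  # median value
```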

### 1.7 Trigonometry with NumPy

Finally, *NumPy* has some really useful options when it comes to *trigonometry*, like getting the sine and cosine of values in the array:
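A sketch with a few well-known angles (in radians):

```python
import numpy as np

angles = np.array([0, np.pi / 2, np.pi])
print(np.sin(angles))  # approximately [0, 1, 0]
print(np.cos(angles))  # approximately [1, 0, -1]
```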

## 2. Pandas

**Pandas** is one of the most popular libraries built on top of *NumPy*. Some consider it the most important tool of the **data analyst**, and indeed it is quite useful. There are many things you can do with this library, including data **pre-processing** and data **cleanup**. It is one of the best tools for exploratory data analysis and **feature engineering**. Pandas' data structures are fast, flexible and designed to make data analysis easy.

It is part of the *Anaconda* Python distribution, but you can install it with *pip* too:
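The original command is not shown here; assuming a standard environment with *pip* available, it would be:

```shell
pip install pandas
```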

Let’s see what it is all about!

### 2.1 Pandas Series

Series are the basic building block of the *Pandas* library. They can be observed as a *NumPy* array, but with indexes that can be **labeled**. For example, you can create a simple series like this:

We may loosely define it as a “label indexed array”. Indexing works the same way as with *NumPy* arrays, with the exception that you can use labels as indexes:

In the same way that we sliced the NumPy Array with ‘:’, the same thing can be done here:
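A sketch of creating, indexing and slicing a *Series* (values and labels are illustrative):

```python
import pandas as pd

# A Series: array-like values with labeled indexes
series = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(series)

print(series['b'])      # access by label
print(series.iloc[0])   # access by position
print(series['a':'b'])  # label-based slicing (both ends inclusive)
```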

### 2.2 Pandas DataFrames

If Series are arrays, DataFrames are matrices. DataFrames have rows and columns, and each of them is labeled. Here is how we create one:

There is a lot of information that a *DataFrame* provides out of the box, like its *shape*:

At any moment we can get index **information**:

The same goes for the **Columns** info:

As well as the number of instances per column:

Finally, a complete description of the *DataFrame* can be retrieved like this:

If you need to extend your *DataFrame* with another column, that can be done fairly easily:
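A sketch covering all of the above; the column names and values are illustrative, not the article's original data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Anna', 'Bob', 'Cora'],
                   'age': [28, 34, 23],
                   'city': ['Oslo', 'Lyon', 'Pula']})

print(df.shape)       # (3, 3) - 3 rows, 3 columns
print(df.index)       # index (row label) information
print(df.columns)     # column label information
print(df.count())     # number of non-null values per column
print(df.describe())  # summary description of the numeric columns

# Extend the DataFrame with another column
df['salary'] = [1000, 2000, 1500]
print(df)
```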

### 2.3 Reading Operations

The cool thing about *Pandas* is that there are multiple options for selecting columns, rows as well as specific elements. For example, if we want to select one element, there are multiple ways we can do that. We can use labels:

There is also the *iat* function, which we can use if we know the position of the element in the *DataFrame*:

With labels, we can select the whole column. This operation returns *Series* object:

Rows can be selected with the *iloc* method:

Boolean indexing is available in *Pandas* too, but there is more control since you can use each column separately:
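A sketch of these selection options on an illustrative *DataFrame*:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Anna', 'Bob', 'Cora'],
                   'age': [28, 34, 23]})

print(df.at[1, 'name'])    # single element by row index and column label
print(df.iat[1, 0])        # single element by position
print(df['age'])           # whole column, returned as a Series
print(df.iloc[0])          # whole row by position
print(df[df['age'] > 25])  # Boolean indexing on a single column
```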

### 2.4 Dropping rows/columns

The *drop* method is what we use when we want to remove a series from a *DataFrame*. A series can be a row or a column; this is controlled by the **axis** parameter. Here is how we can remove a row:

The column can be removed if we use *axis=1*:
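A sketch of both drops on an illustrative *DataFrame* (note that *drop* returns a new *DataFrame* by default):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Anna', 'Bob'], 'age': [28, 34]})

print(df.drop(0, axis=0))      # removes the row with index 0
print(df.drop('age', axis=1))  # removes the 'age' column
```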

### 2.5 Sorting and Ranking

*Pandas* is sometimes compared with *Excel* because of the tabular operations you can do with it. Sorting and ranking are among them. For example, you can sort data by index:

In our example, the *DataFrame* looks exactly the same; however, if our index labels were not sorted, this method would do the trick. The sorting can also be done by **column**:

Another useful thing is adding ranks to the values, which can be done with the *rank* method:
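A sketch of sorting by index, sorting by column and ranking, using an illustrative *DataFrame* with deliberately shuffled index labels:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Cora', 'Anna', 'Bob'],
                   'age': [23, 28, 34]},
                  index=[2, 0, 1])

print(df.sort_index())           # sort rows by index labels
print(df.sort_values(by='age'))  # sort rows by the 'age' column
print(df['age'].rank())          # rank of each value within the column
```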

### 2.6 Basic Statistics with Pandas

*Pandas* provides several options for statistics of each column and row. The basic information about each column can be retrieved with the *describe* method:

As you can see, all the important points are here: mean, median, standard deviation, max, min, Q1 and Q3. It is a pretty good indication of the **distribution** of each column. Almost every point from the table above has a separate function that you can use in case you need it. If you need the **mean** value of a column, use the *mean* function:

For **median** value, there is *median* function:

Maximum and minimum have functions *max* and *min* respectively:

You can also get the index of the maximal and minimal value, using *idxmax* and *idxmin* respectively:

Apart from that you can get the sum of each column using *sum()* method:

Or get the cumulative sum using the *cumsum* method:
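A sketch of all the statistics functions above on an illustrative *DataFrame*:

```python
import pandas as pd

df = pd.DataFrame({'age': [23, 28, 34], 'salary': [1500, 1000, 2000]})

print(df.describe())      # count, mean, std, min, Q1, median, Q3, max
print(df['age'].mean())   # mean value of a column
print(df['age'].median()) # median value of a column
print(df['age'].max(), df['age'].min())       # maximum and minimum
print(df['age'].idxmax(), df['age'].idxmin()) # index of the max/min value
print(df.sum())           # sum of each column
print(df.cumsum())        # cumulative sum of each column
```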

### 2.7 Using Functions with DataFrames

If you want to make some sort of transformation on the data in the *DataFrame*, you can utilize **functions**. One way is to use anonymous lambda functions and the *apply* method:
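A sketch of *apply* with a lambda; the transformation itself (years to months) is illustrative:

```python
import pandas as pd

df = pd.DataFrame({'age': [23, 28, 34]})

# Apply an anonymous lambda function to every value in a column
df['age_in_months'] = df['age'].apply(lambda years: years * 12)
print(df)
```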

### 2.8 Manipulate Data from CSV

Tabular data is often stored in CSV files. We can load this tabular data into DataFrames using the *read_csv* function. For example, let's load data from the *PalmerPenguins* dataset:

Then you can get more information and manipulate this data with *Pandas*:

Once you are done, you can store the data into some new CSV file like this:
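A self-contained sketch of the round trip; the file names and the tiny stand-in for the *PalmerPenguins* data are illustrative:

```python
import pandas as pd

# A tiny stand-in for the PalmerPenguins dataset (columns are illustrative)
pd.DataFrame({'species': ['Adelie', 'Gentoo'],
              'bill_length_mm': [38.1, 47.5]}).to_csv('penguins.csv', index=False)

df = pd.read_csv('penguins.csv')  # load tabular data from a CSV file
print(df.head())                  # first five rows
df.info()                         # column types and non-null counts

# Store the (possibly modified) data into a new CSV file
df.to_csv('penguins_processed.csv', index=False)
```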

### 2.9 Manipulate Data from SQL

Tabular data often comes from the SQL database. In order to load this data into *DataFrame*, you can use *SQL-Alchemy* library in combination with *Pandas*:

As you can see you can send SQL queries for specific tables. Also you can write data to SQL like this:
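The article uses the *SQL-Alchemy* library; as a self-contained sketch we use the built-in *sqlite3* module instead, which *Pandas* also accepts as a connection, with an in-memory database and illustrative data:

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for a real SQL server
conn = sqlite3.connect(':memory:')

# Write a DataFrame to a SQL table
df = pd.DataFrame({'name': ['Anna', 'Bob'], 'age': [28, 34]})
df.to_sql('people', conn, index=False)

# Send a SQL query for a specific table and load the result into a DataFrame
result = pd.read_sql('SELECT * FROM people WHERE age > 30', conn)
print(result)
```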

## Conclusion

In this article, we covered the basics of two fundamental data science libraries – *NumPy* and *Pandas*. Of course, we were not able to go through all the options these libraries provide, but we find that these cheatsheets can serve as a valid guide for beginners and as a reference for more experienced data scientists.

Thank you for reading!

#### Nikola M. Zivkovic

CAIO at Rubik's Code

Nikola M. Zivkovic is CAIO at **Rubik’s Code** and the author of the book “**Deep Learning for Programmers**”. He loves knowledge sharing and is an experienced speaker. You can find him speaking at meetups and conferences, and as a guest lecturer at the University of Novi Sad.

**Rubik’s Code** is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the **services** we provide.
