We always get a lot of questions about where one should start when getting into the field of **machine learning** and data science. Always, our answer is that mathematics and the basics of *Python* are mandatory. In fact, you can subscribe to our blog and get free **Math for Machine Learning guide here**. However, we haven’t covered *Python* basics on this blog, so we decided to change that. In this article, we learn about two libraries without Data Scientists would be lost – **NumPy** and **Pandas**. You can kickstart your data science career with this two-part cheatsheet.

Are you afraid that AI might take your job? Make sure you are the one who is building it.

STAY RELEVANT IN THE RISING AI INDUSTRY! 🖖

## 1. NumPy

This library is the granddad of all other important data science libraries. It is a **fundamental** library for scientific computing in Python. Basically, all other libraries like *Pandas*, *Matplotlib*, *SciKit Learn*, **TensorFlow**, **Pytorch **are built on top of it. In its essence, it is a **multidimensional** array library. So simple? As you probably already know, **linear algebra** is one of the main building blocks of machine learning. It makes it all possible. Manipulating vectors (represented as arrays in *Python*) and matrixes is much easier with **NumPy**.

Ok, but why use it **over** *Python* standard lists? Among several reasons the main one is **speed**. Unlike lists, *NumPy* uses **fixed types** and that is where the speed comes from. For example, if we observe a simple 32 bits integer, *Python* lists would store much more information than *NumPy*. Built-in integer type stores the following information: size, reference count, object type, object value. So instead of just 4 bytes that *NumPy* would use, the list use ~28 bytes.

Apart from that Numpy uses **contiguous** memory in heap, unlike lists. This also brings effective **cache** utilization. In a nutshell, one of the reasons that *NumPy* is faster than lists is that it uses less memory and fixed types. Another benefit *NumPy*has way more operations and features than lists. *NumPy *comes with *Anaconda Python* distribution, so if you installed *Anaconda* you are good to go. Otherwise Install it with command:

`pip install numpy`

To import *NumPy* in your project, just use:

`import numpy as np`

### 1.1 NumPy Array

*NumPy* array is the main building block of the *NumPy*. In data science, it is often used to model and abstract **vectors** and **matrixes**. Here is how you can initialize simple array:

```
array = np.array([1, 2, 3])
print(array)
```

`[1 2 3]`

Matrix (2d array) can be initialized like this:

```
twod_array = np.array([[1, 2, 3], [4, 5, 6]])
print(twod_array)
```

```
[[1 2 3]
[4 5 6]]
```

You can get a lot of information about *NumPy* *Array*. For example, if you need **dimensions** of it you can call the *ndim* attribute:

```
print(array.ndim)
print(twod_array.ndim)
```

```
1
2
```

However, more often you will need the shape of the array, meaning when you are working on machine learning and **deep learning** applications shapes of the matrixes give you information about the input and output of your architecture. You can get a shape like this:

```
print(array.shape)
print(twod_array.shape)
```

```
(3,)
(2, 3)
```

As you can see from here, the *array* is a one-dimensional array with three elements, while twod_array is a 2×3 matrix. Apart from that, you can always get the type of the array by checking the *dtype* attribute. This attribute can be used during **initialization** of the array as well to define the **type** of array:

```
array = np.array([1, 2, 3], dtype='int16')
array.dtype
```

`dtype('int16')`

You may be interested in how many elements there are in some array, and what is its memory footprint. Here is how you can do that with the attributes *itemsize*, *size* and *nbytes*:

```
print(f"**********************************************************************")
print(f"Array has type {array.dtype} which means that every item takes {array.itemsize} bytes.")
print(f"2D Array has {array.size} elements.")
print(f"Array total size is {array.nbytes} bytes.")
print(f"**********************************************************************")
print(f"2D Array has type {twod_array.dtype} which means that every item takes \
{twod_array.itemsize} bytes.")
print(f"2D Array has {twod_array.size} elements.")
print(f"2D Array total size is {twod_array.nbytes} bytes.")
```

```
**********************************************************************
Array has type int16 which means that every item takes 2 bytes.
2D Array has 3 elements.
Array total size is 6 bytes.
**********************************************************************
2D Array has type int32 which means that every item takes 4 bytes.
2D Array has 6 elements.
2D Array total size is 24 bytes.
```

### 1.2 Reading and Updating Elements, Rows and Columns with Indexes

There are many options when it comes to manipulation with the arrays in the *NumPy*. Let’s initialize a matrix that we use for these experiments:

```
x = np.array([[1, 2, 3, 4, 5, 6, 7],[8, 9, 10, 11, 12, 13, 14]])
print(x)
print(f"Shape:{x.shape}")
```

```
[[ 1 2 3 4 5 6 7]
[ 8 9 10 11 12 13 14]]
Shape:(2, 7)
```

Sometimes you need a specific element from the array. That is done with **indexes**. For example, you want to get the fourth element from the second row from the array *x*. That is done like this:

```
print("The fourth element of the second row is:")
print(x[1, 3]) # Indexes are starting from 0
```

```
The fourth element of the second row is:
11
```

The first number in the [] brackets is the index along the first dimension of the array, while the second number is the index along the second dimension of the array *x*. Note that the indexes in *Python* are starting from 0. Let’s practice it and get the sixth element from the first row:

```
print("The sixth element of the first row is:")
print(x[0, 5]) # Indexes are starting from 0
```

```
The sixth element of the first row is:
6
```

Another way to access the same element is by using **negative** indexes. These indexes indicate that we “count” from the back of the array. It is one neat trick:

```
print("The sixth element of the first row is:")
print(x[0, -2]) # Negative indexes go from the back of the array
```

```
The sixth element of the first row is:
6
```

If we want to get a complete row as a separate *NumPy* array, we can do it like this:

```
print("The first row is:")
print(x[0, :])
```

```
The first row is:
[1 2 3 4 5 6 7]
```

The ‘:’ indicate that everything should be taken from that dimension, ie. axis. We can utilize negative indexes too:

```
print("The second row is:")
print(x[-1, :]) # Negative indexes go from the back
```

```
The second row is:
[ 8 9 10 11 12 13 14]
```

In the same way, by using ‘:’, we can get complete columns as a separate *NumPy* array:

```
print("The first column is:")
print(x[:, 0])
```

```
The first column is:
[1 8]
```

The cool thing is that we can get subset of this matrix. For example, if we want the first three elements from the first row, we can do so like this:

```
print("The first 3 elements from the first row:")
print(x[0, 0:3])
```

```
The first 3 elements from the first row:
[1 2 3]
```

Indexing can be used for updating values in the array too:

```
# Replace the value of the 6th element in the second row with 33
x[1, 5] = 33
print(x)
```

```
[[ 1 2 3 4 5 6 7]
[ 8 9 10 11 12 33 14]]
```

In the same way, you can do a **bulk** update:

```
# Bulk Update
# Replace elements from third to fifth in the first row with value 56
x[0, 2:5] = 56
print(x)
```

```
[[ 1 2 56 56 56 6 7]
[ 8 9 10 11 12 33 14]]
```

Finally, another cool thing with the NumPy array is **Boolean** indexing. This means that you can get a true/false value for each element in the array based on some **condition**. For example, let’s check out wich elements in array *x* are larger than 10:

`x > 10`

```
array([[False, False, True, True, True, False, False],
[False, False, False, True, True, True, True]])
```

### 1.3 NumPy Array Initialization

*NumPy* provides numerous functions for array initialization. For example, if you need a matrix with all **zeros**, you can do so like this:

```
# All zeros 5x6 matrix
zeros = np.zeros((5, 6))
print(zeros)
```

```
[[0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0.]]
```

Or, if you need a matrix with all ones:

```
# All ones 5x6 matrix
ones = np.ones((5, 6))
print(ones)
```

```
[[1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1.]]
```

Sure, you can initialize a **complete** matrix with a certain number:

```
# Any other number 5x6 matrix
ninenine = np.full((5,6), 99)
print(ninenine)
```

```
[[99 99 99 99 99 99]
[99 99 99 99 99 99]
[99 99 99 99 99 99]
[99 99 99 99 99 99]
[99 99 99 99 99 99]]
```

The interesting thing about the *full* method is that there is a variation called *full_like*. This function picks up **dimensions** from another array that is already created and creates a new matrix with a defined value. Check it out:

```
# Any other number with dimensions from another array (full_like)
fivefive = np.full_like(x, 55)
print(fivefive)
```

```
[[55 55 55 55 55 55 55]
[55 55 55 55 55 55 55]]
```

If you need a matrix with **random** values, *NumPy* gets you covered. There is a whole submodule called ** random**, using which you can create different random values:

```
# 5x6 matrix with random float numbers
randomize = np.random.rand(5, 6)
print(randomize)
```

```
[[0.96285976 0.00931921 0.61453111 0.6846065 0.17965427 0.92548501]
[0.78119724 0.25421679 0.33713402 0.78393532 0.44122679 0.24001506]
[0.27472269 0.99649043 0.48591665 0.13464157 0.39398014 0.16291523]
[0.48315635 0.33255073 0.16103341 0.55914281 0.59750496 0.80263915]
[0.08985945 0.84173854 0.87351364 0.3576588 0.3742723 0.00807349]]
```

You can do the same thing with **integers**:

```
# 5x6 matrix with random int numbers
randomizeint = np.random.randint(6, size=(5,6))
print(randomizeint)
```

```
[[5 0 1 3 5 4]
[3 5 3 4 1 5]
[1 3 1 0 3 4]
[1 3 2 3 0 4]
[2 1 2 4 3 2]]
```

To create **identity matrix**, you can use identity function, which receives dimensions of identitiy matrix:

```
# 3x3 identity matrix
identity = np.identity(3)
print(identity)
```

```
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
```

Also, there are multiple ways to create new matrixes from arrays. This can be done with methods like *repeat* and *stack*:

```
# Repeat
print(f"*************************Initial*********************************")
initial = np.array([[2, 11, 9]])
print(initial)
print(f"*************************Expended1*******************************")
expanded1 = np.repeat(initial, 3, axis=1)
print(expanded1)
print(f"*************************Expended2*******************************")
expanded2 = np.repeat(initial, 3, axis=0)
print(expanded2)
```

```
*************************Initial*********************************
[[ 2 11 9]]
*************************Expended1*******************************
[[ 2 2 2 11 11 11 9 9 9]]
*************************Expended2*******************************
[[ 2 11 9]
[ 2 11 9]
[ 2 11 9]]
```

As you can see repeat has an axis parameter, using which you can define by which axis array is going to be repeated. The stack method is similar, but it is more useful, since with it you can merge multiple arrays:

```
# Stack
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
stacked = np.vstack([array1, array2])
print(stacked)
```

```
[[1 2 3]
[4 5 6]]
```

Finally, arrays can be reshaped using the *reshape* method. This functionality is very useful for deep learning applications:

```
# Reshape
stacked.reshape((3, 2))
```

```
array([[1, 2],
[3, 4],
[5, 6]])
```

### 1.4 Operations with Scalars

With *NumPy* *Arrays* it is quite easy to perform simple operations like addition, multiplication and division with the scalars (ie. simple numbers). Let’s observe *NumPy* array* y*:

```
y = np.array([1, 2, 3, 4, 5, 6])
print(y)
```

`[1 2 3 4 5 6]`

Or if we want to subtract number 3 from each element:

`print(y - 3)`

`[-2 -1 0 1 2 3]`

Multiplication? You name it:

`print(y * 3)`

`[ 3 6 9 12 15 18]`

The division is supported too:

`print(y / 3)`

`[0.33333333 0.66666667 1. 1.33333333 1.66666667 2. ] `

You can do exponential operations as well:

`print(y ** 3)`

`[ 1 8 27 64 125 216]`

### 1.5 Linear Algebra with NumPy

As we mentioned, *NumPy* is quite a useful library for **linear algebra**. This means that we can easily do operations with other arrays. Let’s say that we want to add some array to the array *y*:

```
z = np.array([6, 5, 4, 3, 2, 1])
print(y + z)
```

`[7 7 7 7 7 7]`

Or subtract values of one array from another array:

`print(y - z)`

`[-5 -3 -1 1 3 5]`

You can do **element-wise** multiplication of two arrays:

`print(y * z)`

`[ 6 10 12 12 10 6]`

Or divide one array with another:

`print(y / z)`

`[0.16666667 0.4 0.75 1.33333333 2.5 6. ]`

If you want to multiply two matrixes, this can be done with *matmul* method:

```
x = np.ones((2, 3))
y = np.full((3, 2), 6)
print(x)
print(y)
```

```
[[1. 1. 1.]
[1. 1. 1.]]
[[6 6]
[6 6]
[6 6]]
```

```
# Matrix multiplication
np.matmul(x, y)
```

```
array([[18., 18.],
[18., 18.]])
```

If you need determinant of some matrix, that can be done too:

```
# Derminant of 3x3 identity matrix
np.linalg.det(np.identity(3))
```

`1.0`

### 1.6 Basic Statistics with NumPy

There are many statistics options that you can perform with *NumPy*. In this tutorial, we cover only the basics. Let’s create an array from **random** integer values:

```
x = np.random.randint(10, size=11)
print(x)
```

`[9 5 5 3 0 2 8 5 2 2 2]`

For this array (or any other for that matter) we can quickly get **max** value:

`np.max(x)`

`9`

Or minimal value:

`np.min(x)`

`0`

Also, we can get the **mean** and **median** value of the array wich is quite usefull:

`np.mean(x)`

`1.0`

`np.median(x)`

`1.0`

### 1.6 Trigonometry with NumPy

Finally, Numpy has some really useful options when it comes to *Trigonometry*, like getting sine and cosine of values in the array:

`np.sin(y)`

```
array([[-0.2794155, -0.2794155],
[-0.2794155, -0.2794155],
[-0.2794155, -0.2794155]])
```

`np.cos(y)`

```
array([[0.96017029, 0.96017029],
[0.96017029, 0.96017029],
[0.96017029, 0.96017029]])
```

## 2. Pandas

**Pandas** is one of the popular libraries that is built on top of *NumPy*. Some people are considering the most important tool of the **data analysts** and indeed it is quite useful. There are many things you can do with this library, including data **pre-processing** and data **cleanup**. It is one of the best tools for exploratory data analysis and **feature engineering**. Pandas’ data structures are fast, flexible and designed to make data analysis easy.

It is part of *Anaconda* Python Distribution, but you can install it with *pip* too.

Let’s see what it is all about!

### 2.1 Pandas Series

Series are the basic building block of *Pandas* library. They can be observed as *NumPy* array, but with indexes that can be **labeled**. For example, you can create a simple series like this:

```
x = pd.Series(np.random.rand(3), index= ['a', 'b', 'c'])
x
```

```
a 0.661511
b 0.441051
c 0.139237
dtype: float64
```

We may loosely define it as a “label indexed array”. The indexing work the same way as with NumPy arrays, with the exception that you can use labels as indexes:

`print(x[0])`

`0.6615108128588524`

`print(x['a'])`

`0.6615108128588524`

In the same way that we sliced the NumPy Array with ‘:’, the same thing can be done here:

`print(x[:2])`

```
a 0.661511
b 0.441051
dtype: float64
```

### 2.2 Pandas DataFrames

If the Series are arrays DataFrames are matrices. DataFrames have rows and columns, and each of them is labeled. Here is how we create one:

```
df = pd.DataFrame(x, columns = ['Column 1'])
df
```

There is various information that *DataFrame* provides out of the box, like *shape*:

`df.shape`

`(3, 1)`

At any moment we can get index **information**:

`df.index`

`Index(['a', 'b', 'c'], dtype='object')`

The same goes for the **Columns** info:

`df.columns`

`Index(['Column 1'], dtype='object')`

As well as the number of instances per column:

`df.count()`

```
Column 1 3
dtype: int64
```

Finall, complete description of the *DataFrame* can be retrieved like this:

`df.info()`

```
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, a to c
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Column 1 3 non-null float64
dtypes: float64(1)
memory usage: 128.0+ bytes
```

If you need to extend your *DataFrame* with another column that can be done fairly easy:

```
df['Column 2'] = df['Column 1'] * 3
df
```

### 2.3 Reading Operations

The cool thing about *Pandas* is that there are multiple options for selecting columns, rows as well as specific elements. For example, if we want to select one element, there are multiple ways we can do that. We can use labels:

`df.loc['a', 'Column 1'] # This funcion is label based`

`0.6615108128588524`

There is also function ** iat**, which we can use if we know the position of the element in the

*DataFrame*:

`df.iat[0, 0] # Use iat if you only need to get or set a single value`

`0.6615108128588524`

With labels, we can select the whole column. This operation returns *Series* object:

`df['Column 1'] # Returns a Series by columb label`

```
a 0.661511
b 0.441051
c 0.139237
Name: Column 1, dtype: float64
```

Rows can be selected with *iloc* method:

`df.iloc[0] # Select Row`

```
Column 1 0.661511
Column 2 1.984532
Name: a, dtype: float64
```

Boolean indexing is available in *Pandas* too, but there is more control since you can use each column separately:

`df['Column 1'] > 0.5 # Boolean Indexing`

```
a True
b False
c False
Name: Column 1, dtype: bool
```

### 2.4 Dropping rows/columns

The *drop* method is the method that we use when we want to remove the series from *DataFrame*. Series can be row or column. This is controlled by the **axis** parameter. Here is how we can remove row:

```
df1 = df.drop('c', axis=0) #Axis 0 indicates that this is a row
df1
```

The column can be removed if we use *axis=1*:

```
df1 = df1.drop('Column 2', axis=1) # Axis 1 indicates that this is a column
df1
```

### 2.5 Sorting and Ranking

*Pandas* is sometimes compared with *Excel*, because of the tabular operations you can do with it. Sorting and rankig are one of them. For example, you can sort data by index:

`df.sort_index()`

In our example *DataFrame* looks exactly the same, however, if our index labels were not sorted, this method would do the trick. The sorting can be done by each **column**:

`df.sort_values(by='Column 2')`

The another useful thing is adding ranks to the values which can be done with rank method:

`df.rank()`

### 2.6 Basic Statistics with Pandas

*Pandas* provide several options for statistics of each column and row. The basic information about each column can be retrieved with the *describe* method:

`df.describe()`

As you can see all the important points are here, like mean, median, standard deviation, max, min, Q1 and Q3. It is a pretty good indication of the **distribution** of each column. Almost, every point from the table above has a separate function that you can use in case you need it. If you need **mean** value of a column use mean function:

`df.mean()`

```
Column 1 0.413933
Column 2 1.241799
dtype: float64
```

For **median** value, there is *median* function:

`df.median()`

```
Column 1 0.441051
Column 2 1.323153
dtype: float64
```

Maximum and minimum have functions *max* and *min* respectively:

`df.min()`

```
Column 1 0.139237
Column 2 0.417710
dtype: float64
```

`df.max()`

```
Column 1 0.661511
Column 2 1.984532
dtype: float64
```

You can also get the index of the maximal and minimal value, using *idxmax* and *idxmin* respectevly:

`df.idxmin()`

```
Column 1 c
Column 2 c
dtype: object
```

`df.idxmin()`

```
Column 1 a
Column 2 a
dtype: object
```

Apart from that you can get the sum of each column using *sum()* method:

`df.sum()`

```
Column 1 1.241799
Column 2 3.725396
dtype: float64
```

Or get cumulative sum usig *cumsum* method:

`df.cumsum()`

### 2.7 Using Functions with DataFrames

If you want to make some sort of transformations on the data in the *DataFrame* you can utilize **functions**. One way is to use unnamed lambda functions and apply the method:

`df.apply(lambda x: x + 3)`

### 2.8 Manipulate Data from CSV

Tabular data is often stored in CSV files. We can load this tabular data in the DataFrames using the *read_csv* function. For example, let’s load data from *PalmerPenguin* dataset:

```
data = pd.read_csv('./data/penguins_size.csv')
data
```

Then you can get more infromation and manipulate this data with *Pandas*:

`data.describe()`

Once you are done you can store data into some new csv file like this:

`data.to_csv('./data/new_file.csv')`

### 2.9 Manipulate Data from SQL

Tabular data often comes from the SQL database. In order to load this data into *DataFrame*, you can use *SQL-Alchemy* library in combination with *Pandas*:

```
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
pd.read_sql('SELECT * FROM palmer_penguins;', engine)
pd.read_sql_table('palmer_penguins', engine)
pd.read_sql_query('SELECT * FROM palmer_penguins;', engine)
```

As you can see you can send SQL queries for specific tables. Also you can write data to SQL like this:

`data.to_sql('penguin_data', engine)`

## Conclusion

In this article, we covered the basics of two fundamental Data Science libraries – *NumPy* and *Pandas*. Of course, we were not able to go through all options that these libraries provide, but we find that these cheatsheets can be a valid guide for begginers and as a refference guide for more expirienced data scientist.

Thank you for reading!

#### Nikola M. Zivkovic

CAIO at Rubik's Code

Nikola M. Zivkovic a CAIO at **Rubik’s Code** and the author of book “**Deep Learning for Programmers**“. He is loves knowledge sharing, and he is experienced speaker. You can find him speaking at meetups, conferences and as a guest lecturer at the University of Novi Sad.

**Rubik’s Code** is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the **services **we provide.

## Trackbacks/Pingbacks