In the previous three articles, we explored the world of Self-Organizing Maps. First, we covered the theoretical background of the subject. Then, in the second article, we saw how to implement Self-Organizing Maps using TensorFlow. After that, in the third article, we did the same thing with a different technology and implemented Self-Organizing Maps using C#. In all of those articles, we focused on how Self-Organizing Maps utilize unsupervised learning for clustering data. Although they use mechanisms similar to standard feed-forward neural networks, these maps are able to cluster input data into different categories without being given expected results beforehand.

They are able to do so by discovering relationships in the input data on their own. It is important to emphasize that the concepts of neurons, connections and weighted connections have a different meaning in Self-Organizing Maps. Neurons are grouped into two categories. The first category is a collection of input neurons. Their number corresponds to the number of features in the dataset that is used. The second category is a collection of output neurons. These neurons are usually organized as one- or two-dimensional arrays and are triggered only by certain input values.

It is also important that every neuron in the Self-Organizing Map has a location assigned to it. These locations are an important parameter, since neurons that lie close to each other are considered to have similar properties and to actually represent a cluster. If you want to learn more details about the structure of Self-Organizing Maps and their learning process, you may do so here.

In this article, we are going to focus more on the ways we can use Self-Organizing Maps on a real-world problem. We will explore how to detect credit card fraud using this mechanism. For that purpose, we will use the TensorFlow implementation that we have already made. But before we jump into the solution of the problem, the first thing we need to check out is the dataset we will use for this purpose.

Statlog (Australian Credit Approval) Dataset

In this article, we will use the Statlog (Australian Credit Approval) Dataset. This dataset holds credit card applications. Basically, the bank issues credit cards to customers and is trying to figure out whether any faulty application got approved by mistake. Our goal is to detect possible frauds so they can be further investigated by the bank. This way, the bank can protect itself from possible losses in the future. In the dataset, all attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data. This dataset has a good mix of attributes – continuous, nominal with small numbers of values, and nominal with larger numbers of values, which makes it perfect for practice. The attributes of this dataset are:

  • CustomerID – Id of the customer
  • A1 – categorical, possible values: 0 and 1
  • A2 – continuous
  • A3 – continuous
  • A4 – categorical, possible values: 1, 2 and 3
  • A5 – categorical, possible values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
  • A6 – categorical, possible values: 1, 2, 3, 4, 5, 6, 7, 8, 9
  • A7 – continuous
  • A8 – categorical, possible values: 0 and 1
  • A9 – categorical, possible values: 0 and 1
  • A10 – continuous
  • A11 – categorical, possible values: 0 and 1
  • A12 – categorical, possible values: 1, 2 and 3
  • A13 – continuous
  • A14 – continuous
  • A15 (Class) – class attribute, possible values: 1 and 2

As you can see, the features have no semantics, except for the final feature, which indicates whether a credit card has been issued to the customer or not. What we want to do is explore attributes A1 to A14 and use a Self-Organizing Map to figure out which customers committed fraud. We will use Python 3.6.5 and the Spyder IDE. Also, TensorFlow version 1.10.0 is used. Here you can find a quick guide on how to install it and how to start working with it.

Feature Analysis

Before we proceed with the implementation, let’s explore the dataset a little bit. Since we don’t have the semantics of each column in the dataset, it is important for us to find out as much as we can about the data. Here is what we get once we load the data into a variable:
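As a reference, here is a minimal sketch, using only standard pandas calls, of how the dataset can be loaded and inspected; the file name matches the one used in the full analysis script below:

import pandas as pd

# Load the dataset and take a first look at its structure
data = pd.read_csv('Credit_Card_Applications.csv')
print(data.head())        # first few rows: CustomerID, A1-A14 and the class attribute
print(data.describe())    # basic statistics for the numerical columns
print(data.dtypes)        # data type of each column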

We can see that the first column represents the identification number of the customer. This should be excluded from our calculations, since it does not carry important information. Another thing we can notice at first glance is that our continuous variables are not on the same scale, but we will explore that in more detail during the outlier detection phase. Let’s see if we are missing any data in this dataset. The missingno Python library is a great tool for that. Here is what we get if we apply it to our dataset:

We can see that there is no missing data in our dataset, which is a big relief. There would be indicators in the missingno output if there were any. If that were the case, we would have to run multiple experiments to determine which algorithm to apply when replacing missing data. This would have to be done because the semantics of the features are unknown.
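Purely as an illustration, since our dataset has no missing values, one such experiment could start with a simple strategy like median imputation in pandas and then be compared against other candidates:

import pandas as pd

# Hypothetical example - this dataset has no missing values, so nothing changes here.
data = pd.read_csv('Credit_Card_Applications.csv')
# One candidate strategy: fill missing values with each column's median;
# mean, mode or model-based imputation would be compared in the same way.
filled = data.fillna(data.median())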

Now we can proceed with outlier detection. It makes no sense to include categorical data in this analysis, so we extracted only the continuous features. Detecting outliers, meaning data samples that are vastly different from the rest of the samples, is actually our main goal, and it will solve our whole problem. We are trying to detect customers who stand out from the crowd, so to say. Their unified result should be different from the rest. That is why detecting outliers on every individual feature is important as well. We use the seaborn library for visualization and get really interesting results:

We see that the majority of our data is not on a common scale, meaning we will have to apply some sort of scaling algorithm before applying any machine learning algorithm (in this post we will use Self-Organizing Maps, but we could use some other clustering algorithm as well). Also, we see that A14 has huge fluctuations, which could be some sort of indication. If we were dealing with some other problem and using some other dataset, we might choose to remove outliers from our calculations, but as we mentioned previously, these values are crucial for this problem.

Finally, let’s check the correlation between the data in our dataset:

We can see that there is a high correlation value between the A8 feature and the final outcome. However, we won’t remove this feature at the moment, as we might do in some other problems. It is important to remember this, because we could optimize our solution later and get even more precise results.

Here is the code we used for this analysis:


import numpy as np
import pandas as pd
import missingno as msno
import seaborn as sn
import matplotlib.pyplot as plt

# Load the dataset and drop the customer identifier, which carries no useful information
data = pd.read_csv('Credit_Card_Applications.csv')
data = data.drop(["CustomerID"], axis=1)

# Missing data detection
msno.matrix(data, figsize=(10, 3))

# Outlier detection and class imbalance
continuousData = pd.DataFrame()
continuousVariableList = ["A2", "A3", "A7", "A10", "A13", "A14"]
for var in continuousVariableList:
    continuousData[var] = data[var].astype("float32")
fig, axes = plt.subplots(nrows=1, ncols=1)
fig.set_size_inches(10, 20)
sn.boxplot(data=continuousData, orient="v", ax=axes)

# Correlation analysis
corrMatt = data.corr()
mask = np.array(corrMatt)
mask[np.tril_indices_from(mask)] = False
fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
sn.heatmap(corrMatt, mask=mask, vmax=.8, square=True, annot=True)

Solving the Problem

For solving this problem, we will use the TensorFlow implementation from one of the previous articles. We are going to do it like this:


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from somtf import SOM

# Load the dataset and separate the input features from the class attribute
data = pd.read_csv('Credit_Card_Applications.csv')
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# Scale every feature to the [0, 1] range
sc = MinMaxScaler(feature_range=(0, 1))
X = sc.fit_transform(X)

# Train a 10x10 Self-Organizing Map on the scaled input data
som = SOM(x=10, y=10, input_dim=15, learning_rate=0.5, num_iter=100, radius=1.0)
som.train(X)

As you can see, the first thing we did was load the data into the data variable. After that, we separated the input data from the output data. We will only use the input data for training the Self-Organizing Map. Here is the data after separation:

During the feature analysis, we determined that we will need to scale the data. For that purpose, we use MinMaxScaler from the Scikit-Learn library. This class scales each individual feature to the defined range. The defined values, in this case, are 0 and 1, meaning that each feature will be scaled to lie between 0 and 1. This is how the input looks after scaling is applied:
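To make the scaling step concrete, here is a small standalone example (with made-up values, not taken from the dataset) showing what MinMaxScaler does to a single feature:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A single feature whose values span very different magnitudes (made-up numbers)
feature = np.array([[2.0], [10.0], [50.0], [100.0]])
sc = MinMaxScaler(feature_range=(0, 1))
scaled = sc.fit_transform(feature)
# Every value becomes (x - min) / (max - min), e.g. 10 -> (10 - 2) / (100 - 2) ≈ 0.082
print(scaled.ravel())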

Finally, we use the TensorFlow implementation and train the Self-Organizing Map. We are using a 10×10 map to represent this data. The initial learning rate is 0.5, the initial radius is 1.0, and training is done in 100 iterations.

Once training is complete, we want to get the MID, or Mean Inter-neuron Distances, between neurons. If you remember, we are trying to detect outliers, meaning the results that stand out from the others. We now have a 10×10 map of neurons, which represents customers clustered by their features. Each neuron has a certain weight vector. If we calculate the differences between those vectors, we get the Mean Inter-neuron Distances. This distance tells us which neuron differs most from the others and, with that, which customers stand out, i.e. which ones committed fraud.
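The exact way to read the trained weights out of the SOM depends on the implementation, so the sketch below only assumes that we can obtain them as a 10×10×15 NumPy array (here called som_weights, a hypothetical name); given that, the Mean Inter-neuron Distance of each node is simply the average Euclidean distance to its immediate neighbours:

import numpy as np

def mean_interneuron_distances(som_weights):
    # som_weights: trained weight grid of shape (rows, cols, input_dim);
    # how it is obtained depends on the SOM implementation (assumption).
    rows, cols, _ = som_weights.shape
    mid = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            neighbours = []
            # Collect the weight vectors of the neighbouring nodes of (i, j)
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    if (di, dj) != (0, 0) and 0 <= ni < rows and 0 <= nj < cols:
                        neighbours.append(som_weights[ni, nj])
            # Mean Euclidean distance from node (i, j) to its neighbours
            diffs = np.array(neighbours) - som_weights[i, j]
            mid[i, j] = np.linalg.norm(diffs, axis=1).mean()
    return mid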

If we map out those distances, we get something like this:

So, in our 10×10 map of nodes, the ones with the coordinates (1, 6) and (5, 1) deviate the most. This means that the customers connected to these neurons are more likely to have committed fraud than the rest. So, when we reverse-map those neurons, we get the list of potential frauds, and there are 23 of them:
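The reverse mapping itself is not tied to any particular SOM API: once we know the trained weight grid and the coordinates of the deviating nodes, each customer can be assigned to the node whose weight vector is closest to their scaled feature vector. A sketch of that idea, assuming the same hypothetical som_weights array as above:

import numpy as np

def best_matching_unit(som_weights, sample):
    # Coordinates of the node whose weight vector is closest to the sample
    distances = np.linalg.norm(som_weights - sample, axis=2)
    return np.unravel_index(np.argmin(distances), distances.shape)

def suspicious_customers(som_weights, X_scaled, data, outlier_nodes):
    # Collect the original rows whose best matching unit is one of the outlier nodes
    hits = [idx for idx, sample in enumerate(X_scaled)
            if best_matching_unit(som_weights, sample) in outlier_nodes]
    return data.iloc[hits]

# Example usage with the coordinates found above:
# frauds = suspicious_customers(som_weights, X, data, {(1, 6), (5, 1)})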

Conclusion

In previous articles, we talked a lot about the possibilities of Self-Organizing Maps, while in this article we utilized them and saw how one can use this type of neural network in a real-world example. We also saw how we can find our way around a dataset even though we don’t know the semantics behind the features, and that unsupervised learning can connect the dots on its own. To sum up, we took all the nice things we learned in the previous articles and used them in a practical example.

Thank you for reading!


This article is a part of the Artificial Neural Networks Series, which you can check out here.


Read more posts from the author at Rubik’s Code.

