Not many people know this, but apart from doing all the techy stuff, I sometimes make music. I even have a few EP releases and one album. They are terrible, but hey, it is something to pass the time. Anyhow, after getting deeper into machine learning, deep learning and artificial intelligence, I wondered whether there is a way to use some of these techniques to help me create music. Of course, there are many sources online offering different approaches to this. However, the most common and most basic approach suggests using Restricted Boltzmann Machines, which we explored in one of the previous articles and implemented in both Python and C#.
To sum it up, a Restricted Boltzmann Machine is a special kind of neural network. In fact, it belongs to the so-called Energy-Based Models – deep learning models which utilize the physics concept of energy. In these models, the dependencies between the variables in the system are associated with a scalar value that represents the energy of the complete system. This scalar value is called the energy, and it is a measure of the probability that the system will be in a certain state.
However, the Restricted Boltzmann Machine is not just an Energy-Based Model, it is also a special case of the Boltzmann Machine. A Boltzmann Machine is a neural network that consists of symmetrically connected neurons, as shown in the image above. Every neuron in this system has a binary state, meaning that it can be either on or off, and the decision regarding the state is made stochastically. Since all neurons are connected to each other, calculating the weights for all connections is resource-demanding, so this architecture needed to be optimized.
In the end, we ended up with the Restricted Boltzmann Machine, an optimized version of the Boltzmann Machine. In order to simplify the initial architecture and minimize the necessary calculations, this model has only two layers of neurons – visible and hidden. This is presented in the image above. As you can see, the hidden neurons are connected only to the visible ones and vice versa, which effectively means that there are no connections between neurons in the same layer.
How does it work?
There are two big parts in the learning process of the Restricted Boltzmann Machine: Gibbs Sampling and Contrastive Divergence. You can find more on the topic in this article, but we will explain them here in less detail. First, we need to calculate the probability that a neuron from the hidden layer is activated, based on the input values on the visible layer – this is Gibbs Sampling. Based on this probability, we then either activate the neuron or not.
In the next step, we use Contrastive Divergence to update the weights. This process is a bit tricky to explain, so stay with me. We will consider a scenario in which we have a visible layer with four nodes and a hidden layer with three nodes, just like in the image above. For example, let's say that the input values on the visible layer are [0, 1, 1, 0].
Then, by applying the formulas from this article, we calculate the probabilities for the neurons in the hidden layer. If a probability is high, the neuron from the hidden layer will get state 1; otherwise, it will get state 0. For example, based on the current weights and biases, we get that the values of the hidden layer are [0, 1, 1].
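The linked article contains the exact formulas; as a quick stand-in, here is a minimal NumPy sketch of this step, with made-up weights and biases (the sigmoid-based probability mirrors what the TensorFlow code later in this post computes):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical parameters of a 4-visible / 3-hidden RBM
v = np.array([0, 1, 1, 0])               # states of the visible layer
W = np.random.normal(0, 0.01, (4, 3))    # one weight per visible-hidden connection
b_h = np.zeros(3)                        # biases of the hidden neurons

# Probability that each hidden neuron turns on, given the visible states
p_h = sigmoid(v @ W + b_h)

# Stochastic activation: each neuron is set to 1 with probability p_h
h = (np.random.uniform(size=3) < p_h).astype(int)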
Once we get the states of the neurons in the hidden layer, we can calculate the so-called positive gradient. This is done by taking the outer product of the visible layer states [0, 1, 1, 0] and the hidden layer states [0, 1, 1]. If you are not familiar with it, the outer product of these two vectors is defined like this:
v[0]*h[0]  v[0]*h[1]  v[0]*h[2]
v[1]*h[0]  v[1]*h[1]  v[1]*h[2]
v[2]*h[0]  v[2]*h[1]  v[2]*h[2]
v[3]*h[0]  v[3]*h[1]  v[3]*h[2]
where v represents the state of a neuron from the visible layer and h represents the state of a neuron from the hidden layer. For the current values we get this result:
0 0 0
0 1 1
0 1 1
0 0 0
This matrix actually corresponds to all the connections in the system. This means that the first element describes the connection between neurons v[0] and h[0]. The positive gradient is called that because wherever we have the value 1 in this matrix, we increase the weight on the corresponding connection. So, in our example we do so for the connections v[1]h[1], v[1]h[2], v[2]h[1] and v[2]h[2].
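If you want to verify the matrix above yourself, NumPy's outer function computes exactly this:

import numpy as np

v = np.array([0, 1, 1, 0])  # visible states from the example
h = np.array([0, 1, 1])     # hidden states from the example

positive_gradient = np.outer(v, h)
print(positive_gradient)
# [[0 0 0]
#  [0 1 1]
#  [0 1 1]
#  [0 0 0]]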
With that, we have performed the first step of the somewhat convoluted Contrastive Divergence process. After the positive gradient, we calculate the negative gradient, and that is done like this. First, by using the formulas from this article once again, we calculate the probabilities for the neurons in the visible layer, this time using the values from the hidden layer. Using these probabilities, we calculate the temporary Contrastive Divergence states for the visible layer – v'[n]. Let's say that in our case we get the values [0, 0, 0, 1].
In the final step, we calculate the probabilities for the neurons in the hidden layer once again, only this time we use the Contrastive Divergence states of the visible layer calculated previously. From these we get the Contrastive Divergence states for the hidden layer – h'[n] – and for this example they are [0, 0, 1].
This time we take the outer product of the visible layer Contrastive Divergence states [0, 0, 0, 1] and the hidden layer Contrastive Divergence states [0, 0, 1] to get the negative gradient:
0 0 0
0 0 0
0 0 0
0 0 1
Just like during the positive gradient calculation, wherever we have the value 1 in the negative gradient matrix we subtract the learning rate from the weight between the two neurons. In this example, we subtract the learning rate from the weight of the connection between neurons v[3] and h[2].
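Putting the two gradients together, the complete weight update for this toy example could look like the sketch below; the 4x3 weight matrix itself is assumed, everything else uses the values from the example:

import numpy as np

learning_rate = 0.005  # same value used later in the TensorFlow implementation

v = np.array([0, 1, 1, 0])     # visible states
h = np.array([0, 1, 1])        # hidden states
v_cd = np.array([0, 0, 0, 1])  # Contrastive Divergence states of the visible layer
h_cd = np.array([0, 0, 1])     # Contrastive Divergence states of the hidden layer

positive_gradient = np.outer(v, h)
negative_gradient = np.outer(v_cd, h_cd)

# Weights grow where the positive gradient is 1 and shrink where the negative one is 1
delta_W = learning_rate * (positive_gradient - negative_gradient)
# weights += delta_W   # applied to the (assumed) 4x3 weight matrix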
MIDI
Ok, now that we are all caught up with the architecture we are going to use, let's get familiar with the MIDI format, which we will also utilize in this solution. This technical standard describes a communication protocol between a variety of electronic music instruments, synthesizers and computers. In essence, it encodes the events that a synthesizer needs to know about in order to synthesize sound: note-on, note-off, tempo changes, pitch, velocity and so on. It can carry up to sixteen channels of information. This is how these files are interpreted in Ableton Live, a software music sequencer and digital audio workstation:
Using this format you can create a map of your composition and then send it to a synthesizer or some computer program, which will use this map to play the sequence on its own. This also gives you the ability to “color” the map with different sounds. In this implementation, we will use only a few pieces of information from the MIDI file: we want to know which note is pressed and when it is released. This is presented on the keyboard like this:
What we want to do is represent this information in numerical form, i.e. encode it. This is done like this:
       Note pressed          Note released
.
.
.
F4#    000000000000   ....   0000000000000
F4     000000000000   ....   0000000000000
E4     000000000000   ....   0000000000000
D#4    000001000000   ....   0000010000000
D4     000000100000   ....   0000001000000
C#4    000000000000   ....   0000000000000
C4     000100011111   ....   0001000111111
.
.
.
Each column represents one timestamp. The “note pressed” events occupy the first n columns of the matrix, while the rest are reserved for the “note released” events. That means that if the note C4 is pressed at timestamp i, we will have the value 1 in the row of the matrix dedicated to that note, in column i. Similarly, if the note is released, we will have the value 1 in the row dedicated to that note, in column n+i.
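As a quick, made-up illustration of this layout in code (the note names and timestamps are arbitrary and not taken from a real MIDI file):

import numpy as np

# 4 notes, n = 5 timestamps -> 5 "pressed" columns followed by 5 "released" columns
notes = ["C4", "C#4", "D4", "D#4"]
n = 5
encoding = np.zeros((len(notes), 2 * n), dtype=int)

# Suppose C4 is pressed at timestamp 1 and released at timestamp 3
c4 = notes.index("C4")
encoding[c4, 1] = 1      # "note pressed" half of the matrix (column i)
encoding[c4, n + 3] = 1  # "note released" half of the matrix (column n + i)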
Converting MIDI files into binary matrices is implemented inside the MidiCoordinator class. This code is inspired by and heavily based on Dan Shiebler's code for MIDI manipulation. Here, the python-midi library is used, which contains the fundamental tools for reading and writing MIDI files in Python. Here is the code of the MidiCoordinator:
import midi
import numpy as np

class MidiCoordinator(object):
    def __init__(self, lowerBound, upperBound):
        self._lowerBound = lowerBound
        self._upperBound = upperBound
        self._span = upperBound - lowerBound

    def midiToMatrix(self, midifile, squash=True):
        schema = midi.read_midifile(midifile)
        totalTime = [track[0].tick for track in schema]
        posns = [0 for track in schema]
        matrix = []
        time = 0
        state = [[0, 0] for x in range(self._span)]
        matrix.append(state)
        end = False
        while not end:
            # Integer division keeps this comparison exact in Python 3
            if time % (schema.resolution // 4) == (schema.resolution // 8):
                oldstate = state
                state = [[oldstate[x][0], 0] for x in range(self._span)]
                matrix.append(state)
            for i in range(len(totalTime)):  # For each track
                if end:
                    break
                while totalTime[i] == 0:
                    track = schema[i]
                    pos = posns[i]
                    evt = track[pos]
                    if isinstance(evt, midi.NoteEvent):
                        if (evt.pitch < self._lowerBound) or (evt.pitch >= self._upperBound):
                            pass
                        else:
                            if isinstance(evt, midi.NoteOffEvent) or evt.velocity == 0:
                                state[evt.pitch - self._lowerBound] = [0, 0]
                            else:
                                state[evt.pitch - self._lowerBound] = [1, 1]
                    elif isinstance(evt, midi.TimeSignatureEvent):
                        if evt.numerator not in (2, 4):
                            # Skip songs whose time signature numerator is not 2 or 4
                            end = True
                            break
                    try:
                        totalTime[i] = track[pos + 1].tick
                        posns[i] += 1
                    except IndexError:
                        totalTime[i] = None
                if totalTime[i] is not None:
                    totalTime[i] -= 1
            if all(t is None for t in totalTime):
                break
            time += 1
        # Flatten the (time, span, 2) state list into a (time, 2*span) matrix
        S = np.array(matrix)
        statematrix = np.hstack((S[:, :, 0], S[:, :, 1]))
        statematrix = np.asarray(statematrix).tolist()
        return statematrix

    def matrixToMidi(self, matrix, name="example"):
        matrix = np.array(matrix)
        if not len(matrix.shape) == 3:
            matrix = np.dstack((matrix[:, :self._span], matrix[:, self._span:]))
        matrix = np.asarray(matrix)
        pattern = midi.Pattern()
        track = midi.Track()
        pattern.append(track)
        tickscale = 55
        lastcmdtime = 0
        prevstate = [[0, 0] for x in range(self._span)]
        # Append one empty state at the end so every note still playing gets a note-off event
        states = np.concatenate((matrix, [prevstate]), axis=0)
        for time, state in enumerate(states):
            offNotes = []
            onNotes = []
            for i in range(self._span):
                n = state[i]
                p = prevstate[i]
                if p[0] == 1:
                    if n[0] == 0:
                        offNotes.append(i)
                    elif n[1] == 1:
                        offNotes.append(i)
                        onNotes.append(i)
                elif n[0] == 1:
                    onNotes.append(i)
            for note in offNotes:
                track.append(midi.NoteOffEvent(tick=(time - lastcmdtime) * tickscale, pitch=note + self._lowerBound))
                lastcmdtime = time
            for note in onNotes:
                track.append(midi.NoteOnEvent(tick=(time - lastcmdtime) * tickscale, velocity=40, pitch=note + self._lowerBound))
                lastcmdtime = time
            prevstate = state
        eot = midi.EndOfTrackEvent(tick=1)
        track.append(eot)
        midi.write_midifile("{}.mid".format(name), pattern)
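Before moving on, here is a small sketch of how this class could be exercised on its own; the MIDI file name is just a placeholder, while the note bounds are the ones used later in the Usage section:

# Requires a real MIDI file under the placeholder name
coordinator = MidiCoordinator(24, 102)
matrix = coordinator.midiToMatrix("example.mid")    # one row per timestep, 2 * span columns
coordinator.matrixToMidi(matrix, "reconstructed")   # writes reconstructed.mid back to disk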
Cool, now we can process MIDI files in our solution and extract the important information from them. All we need now is a dataset on which we can use these functions. For that purpose, you need to download some songs that you want your solution to be trained on. This way you steer the direction in which your generated songs will sound. Here is a nice database of MIDI files of popular songs. Remember, you will need a lot of data for the training – hundreds of songs.
Once we have the files, we use the previously implemented MidiCoordinator to import the songs and transform them into matrices. This is done by the SongImporter class:
from tqdm import tqdm
import glob
import numpy as np

class SongImporter(object):
    def __init__(self, path, midi_coordinator):
        self._path = path
        self._midi_coordinator = midi_coordinator

    def getSongs(self):
        files = glob.glob('{}/*.mid*'.format(self._path))
        songs = []
        for file in tqdm(files):
            try:
                song = np.array(self._midi_coordinator.midiToMatrix(file))
                # Keep only songs that are long enough to be useful for training
                if np.array(song).shape[0] > 50:
                    songs.append(song)
            except Exception as e:
                raise e
        return songs
Implementation
An important thing to mention before going deeper into the specifics of the implementation is that, unlike in the previous TensorFlow articles, Linux was used. Here is the complete list of technologies that are used:
- Ubuntu 18.04.1
- Python 3.6
- TensorFlow 1.10.0
- Spyder IDE
Here you can find a simple guide on how to quickly install TensorFlow and start working with it. As mentioned before, we use Spyder IDE because it is quite good for demonstration purposes. If you find it more convenient, you can use Jupyter as well.
So, now that the data is handled, we can proceed to more interesting topics, like generating music using the Restricted Boltzmann Machine. This code is based on the more general implementation we explored in one of the previous articles, but modifications were necessary due to the specific nature of the data. The class that handles this implementation is called MusicGenerator, and here we finally use TensorFlow to generate music. Let's take a look at the complete MusicGenerator class:
import tensorflow as tf
from tensorflow.python.ops import control_flow_ops
from tqdm import tqdm
import numpy as np

class MusicGenerator(object):
    def __init__(self, midi_coordinator, learning_rate = 0.005, num_timesteps = 15, batch_size = 100, num_of_epochs = 200):
        # Use the learning rate passed to the constructor (0.005 by default)
        self.learning_rate = tf.constant(learning_rate, tf.float32)
        self.num_timesteps = num_timesteps
        self.batch_size = batch_size
        self.num_of_epochs = num_of_epochs
        self._midi_coordinator = midi_coordinator

        # Define TF variables and placeholders
        self._visible_dim = 2 * (midi_coordinator._upperBound - midi_coordinator._lowerBound) * num_timesteps
        self._hidden_dim = 50
        self._input = tf.placeholder(tf.float32, [None, self._visible_dim], name="input")
        self._weights = tf.Variable(tf.random_normal([self._visible_dim, self._hidden_dim], 0.01), name="weights")
        self._hidden_bias = tf.Variable(tf.zeros([1, self._hidden_dim], tf.float32, name="hidden_bias"))
        self._visible_bias = tf.Variable(tf.zeros([1, self._visible_dim], tf.float32, name="visible_bias"))

        visible_cdstates = self.gibsSampling(1)
        hidden_states = self.callculate_state(tf.sigmoid(tf.matmul(self._input, self._weights) + self._hidden_bias))
        hidden_cdstates = self.callculate_state(tf.sigmoid(tf.matmul(visible_cdstates, self._weights) + self._hidden_bias))

        size = tf.cast(tf.shape(self._input)[0], tf.float32)
        weights_delta = tf.multiply(self.learning_rate/size, tf.subtract(tf.matmul(tf.transpose(self._input), hidden_states), tf.matmul(tf.transpose(visible_cdstates), hidden_cdstates)))
        visible_bias_delta = tf.multiply(self.learning_rate/size, tf.reduce_sum(tf.subtract(self._input, visible_cdstates), 0, True))
        hidden_bias_delta = tf.multiply(self.learning_rate/size, tf.reduce_sum(tf.subtract(hidden_states, hidden_cdstates), 0, True))
        self._updates = [self._weights.assign_add(weights_delta), self._visible_bias.assign_add(visible_bias_delta), self._hidden_bias.assign_add(hidden_bias_delta)]

    def gibsSampling(self, number_of_iterations):
        counter = tf.constant(0)
        [_, _, visible_cdstates] = control_flow_ops.while_loop(lambda count, num_iter, *args: count < num_iter,
                                                               self.singleStep, [counter, tf.constant(number_of_iterations), self._input])
        # Stop tensorflow from propagating gradients back through the gibbs step
        visible_cdstates = tf.stop_gradient(visible_cdstates)
        return visible_cdstates

    def callculate_state(self, probability):
        return tf.floor(probability + tf.random_uniform(tf.shape(probability), 0, 1))

    def singleStep(self, count, index, input_indexed):
        hidden_states = self.callculate_state(tf.sigmoid(tf.matmul(input_indexed, self._weights) + self._hidden_bias))
        visible_cdstates = self.callculate_state(tf.sigmoid(tf.matmul(hidden_states, tf.transpose(self._weights)) + self._visible_bias))
        return count + 1, index, visible_cdstates

    def generateSongs(self, songs):
        with tf.Session() as sess:
            init = tf.global_variables_initializer()
            sess.run(init)
            for epoch in tqdm(range(self.num_of_epochs)):
                for song in songs:
                    song = np.array(song)
                    song = song[:int(np.floor(song.shape[0] / self.num_timesteps) * self.num_timesteps)]
                    # Integer division so the new shape is made of whole numbers (Python 3)
                    song = np.reshape(song, [song.shape[0] // self.num_timesteps, song.shape[1] * self.num_timesteps])
                    for i in range(1, len(song), self.batch_size):
                        tr_x = song[i:i + self.batch_size]
                        sess.run(self._updates, feed_dict={self._input: tr_x})
            sample = self.gibsSampling(1).eval(session=sess, feed_dict={self._input: np.zeros((50, self._visible_dim))})
            for i in range(sample.shape[0]):
                if not any(sample[i, :]):
                    continue
                matrix = np.reshape(sample[i, :], (self.num_timesteps, 2 * self._midi_coordinator._span))
                self._midi_coordinator.matrixToMidi(matrix, "song_{}".format(i))
That is quite a lot of code, so let's dissect it into smaller chunks and explain what each piece means. The majority of the code is in the constructor of the class, which takes an instance of the MidiCoordinator class, the learning rate, the number of timesteps, the batch size and the number of epochs as input parameters. The first thing we do inside the constructor is the initialization of the variables and placeholders:
# Use the learning rate passed to the constructor (0.005 by default)
self.learning_rate = tf.constant(learning_rate, tf.float32)
self.num_timesteps = num_timesteps
self.batch_size = batch_size
self.num_of_epochs = num_of_epochs
self._midi_coordinator = midi_coordinator

# Define TF variables and placeholders
self._visible_dim = 2 * (midi_coordinator._upperBound - midi_coordinator._lowerBound) * num_timesteps
self._hidden_dim = 50
self._input = tf.placeholder(tf.float32, [None, self._visible_dim], name="input")
self._weights = tf.Variable(tf.random_normal([self._visible_dim, self._hidden_dim], 0.01), name="weights")
self._hidden_bias = tf.Variable(tf.zeros([1, self._hidden_dim], tf.float32, name="hidden_bias"))
self._visible_bias = tf.Variable(tf.zeros([1, self._visible_dim], tf.float32, name="visible_bias"))
The learning rate defaults to 0.005, and the information from the constructor input is saved into separate fields. One interesting field is num_timesteps; later, this field is used to determine how many timesteps will be generated during the creation of a song.
We also calculate the size of the visible and hidden layers of the Restricted Boltzmann Machine. As you can see, the number of visible neurons is dictated by the number of inputs we can have: the difference between the highest and the lowest note we track, multiplied by two (pressed and released) and by the number of timesteps. For example, with the note bounds used in the Usage section (24 and 102) and the default of 15 timesteps, that gives 2 * (102 - 24) * 15 = 2340 visible neurons. The number of neurons in the hidden layer is hardcoded to 50. Apart from that, the biases for both layers and the weights are defined, as well as the placeholder for the input values. Once this is done, we process the input in the manner described previously:
visible_cdstates = self.gibsSampling(1)
hidden_states = self.callculate_state(tf.sigmoid(tf.matmul(self._input, self._weights) + self._hidden_bias))
hidden_cdstates = self.callculate_state(tf.sigmoid(tf.matmul(visible_cdstates, self._weights) + self._hidden_bias))

size = tf.cast(tf.shape(self._input)[0], tf.float32)
weights_delta = tf.multiply(self.learning_rate/size, tf.subtract(tf.matmul(tf.transpose(self._input), hidden_states), tf.matmul(tf.transpose(visible_cdstates), hidden_cdstates)))
visible_bias_delta = tf.multiply(self.learning_rate/size, tf.reduce_sum(tf.subtract(self._input, visible_cdstates), 0, True))
hidden_bias_delta = tf.multiply(self.learning_rate/size, tf.reduce_sum(tf.subtract(hidden_states, hidden_cdstates), 0, True))
self._updates = [self._weights.assign_add(weights_delta), self._visible_bias.assign_add(visible_bias_delta), self._hidden_bias.assign_add(hidden_bias_delta)]
First, we calculate the probabilities for the hidden layer based on the input values and the current values of the weights and biases. Based on those probabilities, and with the help of the callculate_state function, the states of the hidden layer are calculated. Then the probabilities and the temporary Contrastive Divergence states for the visible layer are calculated, and the same is done for the Contrastive Divergence states of the hidden layer as well. Once this is performed, we can calculate the positive and negative gradients and update the weights and the biases. We also define the _updates operation, which is later used in the TensorFlow session to run these calculations.
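Written out in plain NumPy, the three deltas computed above amount to the following; this is only an illustration, and the array names and random placeholder values are mine, not part of the class:

import numpy as np

# Illustrative shapes only
batch_size, visible_dim, hidden_dim = 10, 2340, 50
learning_rate = 0.005

v = np.random.randint(0, 2, (batch_size, visible_dim))     # input batch
h = np.random.randint(0, 2, (batch_size, hidden_dim))      # hidden states
v_cd = np.random.randint(0, 2, (batch_size, visible_dim))  # CD states of the visible layer
h_cd = np.random.randint(0, 2, (batch_size, hidden_dim))   # CD states of the hidden layer

delta_weights = learning_rate / batch_size * (v.T @ h - v_cd.T @ h_cd)
delta_visible_bias = learning_rate / batch_size * np.sum(v - v_cd, axis=0, keepdims=True)
delta_hidden_bias = learning_rate / batch_size * np.sum(h - h_cd, axis=0, keepdims=True)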
This process relies heavily on helper functions, and some of the mentioned operations are actually performed inside them, so make sure you check them out:
def gibsSampling(self, number_of_iterations):
    counter = tf.constant(0)
    [_, _, visible_cdstates] = control_flow_ops.while_loop(lambda count, num_iter, *args: count < num_iter,
                                                           self.singleStep, [counter, tf.constant(number_of_iterations), self._input])
    # Stop tensorflow from propagating gradients back through the gibbs step
    visible_cdstates = tf.stop_gradient(visible_cdstates)
    return visible_cdstates

def callculate_state(self, probability):
    return tf.floor(probability + tf.random_uniform(tf.shape(probability), 0, 1))

def singleStep(self, count, index, input_indexed):
    hidden_states = self.callculate_state(tf.sigmoid(tf.matmul(input_indexed, self._weights) + self._hidden_bias))
    visible_cdstates = self.callculate_state(tf.sigmoid(tf.matmul(hidden_states, tf.transpose(self._weights)) + self._visible_bias))
    return count + 1, index, visible_cdstates
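One detail worth pointing out is the sampling trick inside callculate_state: adding uniform noise from [0, 1) to a probability and taking the floor produces 1 exactly with that probability. A standalone NumPy check of the same idea (the probability value here is arbitrary):

import numpy as np

p = 0.7  # probability that a neuron is switched on
samples = np.floor(p + np.random.uniform(0, 1, size=100000))
print(samples.mean())  # prints roughly 0.7 - the neuron is on ~70% of the time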
Ok, this class is now initialized and we can start training. This is done inside the generateSongs method:
def generateSongs(self, songs):
    with tf.Session() as sess:
        init = tf.global_variables_initializer()
        sess.run(init)
        for epoch in tqdm(range(self.num_of_epochs)):
            for song in songs:
                song = np.array(song)
                song = song[:int(np.floor(song.shape[0] / self.num_timesteps) * self.num_timesteps)]
                # Integer division so the new shape is made of whole numbers (Python 3)
                song = np.reshape(song, [song.shape[0] // self.num_timesteps, song.shape[1] * self.num_timesteps])
                for i in range(1, len(song), self.batch_size):
                    tr_x = song[i:i + self.batch_size]
                    sess.run(self._updates, feed_dict={self._input: tr_x})
        sample = self.gibsSampling(1).eval(session=sess, feed_dict={self._input: np.zeros((50, self._visible_dim))})
        for i in range(sample.shape[0]):
            if not any(sample[i, :]):
                continue
            matrix = np.reshape(sample[i, :], (self.num_timesteps, 2 * self._midi_coordinator._span))
            self._midi_coordinator.matrixToMidi(matrix, "song_{}".format(i))
The first part of this function opens a TensorFlow session, initializes all the variables and trains the Restricted Boltzmann Machine on the imported songs. The second part actually generates the music: we feed an all-zero input into one Gibbs sampling step, and every non-empty sample is reshaped back into a (num_timesteps, 2 * span) matrix and written to disk as a MIDI file through the MidiCoordinator.
Usage
The whole ecosystem of classes can be used like this:
from midi_coordinator import MidiCoordinator
from song_improter import SongImporter
from music_generator import MusicGenerator

coordinator = MidiCoordinator(24, 102)
importer = SongImporter('midi_songs', coordinator)
generator = MusicGenerator(coordinator)

imported_songs = importer.getSongs()
generator.generateSongs(imported_songs)
Instances of MidiCoordinator, SongImporter and MusicGenerator are created. Make sure to enter the correct folder path as the input value of the SongImporter instance; in my case, the folder is named ‘midi_songs’. Then we just call the getSongs function of the importer object, which returns the songs prepared for the MusicGenerator. Once that is done, the only thing left to do is to call the generateSongs method of the generator with the returned data.
Conclusion
In this article, we had a chance to utilize the knowledge we gathered in the previous articles and apply the theory to a concrete, practical problem. If you are interested in creating a general Restricted Boltzmann Machine, you can find a more comprehensive and complete solution here.
Thank you for reading!
This article is a part of Artificial Neural Networks Series, which you can check out here.
Read more posts from the author at Rubik’s Code.