Music Transcription using a Convolutional Neural Network

Background

As part of our project, we developed a convolutional neural network that automatically transcribes pieces of piano music. Music transcription is a daunting task that requires special skill for humans to perform. We worked to automate the process by using an online library of MIDI files to synthesize audio and provide ground-truth key labels for training our network.

The idea for our project was inspired by last year's projects on music prediction and synthesis. Music transcription felt like a natural extension of that earlier work. Further research turned up a piece of commercial software, Lunaverus, capable of transcribing music, so we set out to replicate something like it ourselves.
There are several key components to our project:

  1. Data Collection
  2. Processing
  3. Model creation
  4. Model training

Data

To get started, we needed to find a large amount of data to work with. Most of the music data available on the Internet comes in the form of raw audio files. Raw audio files contain a lot of noise, which is why we tried to avoid using them as a first step. Instead, we looked for MIDI files, which are files in a specific format that indicate when each note is pressed and how long it is held. We targeted piano music as our data source to narrow the range of instruments. In our search for a dataset, we came across MIDI files on several websites. These included classical music, video game music, Christmas music, and even international performances from the e-Piano Junior Competition. However, our best source turned out to be the public collection of live MIDI performances recorded on Yamaha Disklavier e-pianos. This collection contains more than 10,000 piano pieces in MIDI format, which, together with the other MIDI files we found, gave us more than enough data to work with. Our preprocessing phase transformed these 10,000 files into over 1 million training examples to feed to our network.

Processing

Our goal was to take the MIDI files, convert them to raw audio, and then create spectrograms to use as input to a convolutional neural network (CNN). A spectrogram shows the power of different frequencies in a song over time, and CNNs are well suited to learning from unstructured image data. To create the spectrograms, we used two techniques: the short-time Fourier transform and the constant-Q transform.
We first split our spectrograms and MIDI files into one-second windows to feed in as our CNN inputs. We ended up with about 1 million one-second audio clips and converted 18,137 of them into spectrograms.


After working with the CNN, we noticed that a one-second window was too wide for the output to sound like the original audio file. The reason is that the network only predicts which notes occur somewhere within each one-second window, rather than separating those notes into sub-second time slots. For example, if four notes were played in succession within one second, all four notes would sound at once in that second of the output instead of one after another.
To address this issue, we reconsidered the time window used for the MIDI files and spectrograms. With a half-second window, we can capture an eighth-note melody at 120 beats per minute. That seemed good enough to reproduce songs reasonably well, so we split the spectrograms and MIDI files into half-second windows instead. This second pass left us with about 2 million audio clips, of which 10,517 were transformed into spectrograms. We fed these new inputs into our CNN and retrained it. Since MIDI files are a series of timed events, it can be a bit tricky to split them at exact points in time. As a heuristic, we do our best to take slices that are at least one window long, though they are allowed to run slightly longer.
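
A simplified sketch of the slicing step, assuming the pretty_midi library (the post does not name its MIDI tooling); it only groups notes by the window their onset falls in, and ignores the "allow slices to run longer" part of the heuristic:

import pretty_midi

WINDOW_SEC = 0.5  # half-second slices, as described above

def slice_midi(path, window=WINDOW_SEC):
    # Map each window index to the notes whose onset falls inside that window.
    pm = pretty_midi.PrettyMIDI(path)
    slices = {}
    for instrument in pm.instruments:
        for note in instrument.notes:
            idx = int(note.start // window)
            slices.setdefault(idx, []).append(note)
    return slices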


We used a program called FluidSynth to generate audio waveforms from our sliced MIDI files, and then converted those waveforms to spectrograms. However, there are several variations of spectrograms. One variant uses the short-time Fourier transform (STFT), which converts small segments of the audio signal into their component frequencies. The result is an image in which time increases along the x-axis and frequency lies along the y-axis. We found some shortcomings in using the STFT for music, though. Since the frequencies of the piano's keys are spaced exponentially, a linearly spaced frequency axis crowds most of the information toward the bottom of the image. This leaves a great deal of wasted space in the upper half of the frequency range.
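
For reference, FluidSynth can be driven as a command-line tool to render a MIDI slice to audio; a minimal sketch is below, where the soundfont and file names are placeholders and the exact invocation we used may have differed.

import subprocess

# Render a MIDI slice to a WAV file by calling the FluidSynth CLI.
# "-ni" disables the interactive shell and MIDI input driver,
# "-F" writes the rendered audio to a file, "-r" sets the sample rate.
subprocess.run([
    "fluidsynth", "-ni", "soundfont.sf2", "slice_0001.mid",
    "-F", "slice_0001.wav", "-r", "44100",
], check=True)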

The alternative to the STFT is the constant-Q transform. Like the STFT, it converts audio clips into spectrograms. However, it uses logarithmically spaced filters to decompose the signal, producing a spectrogram whose information is spread more evenly. We tried feeding both kinds of images into our network and found that the constant-Q transform outperformed the STFT for our purposes.
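
A minimal sketch of the two transforms, assuming librosa (purely for illustration; the file name and parameters are placeholders):

import numpy as np
import librosa

SR = 22050  # assumed sample rate for the synthesized audio
# Audio synthesized from a MIDI file (e.g. with FluidSynth).
audio, _ = librosa.load("song_synth.wav", sr=SR)

# Linear-frequency spectrogram via the short-time Fourier transform.
stft = np.abs(librosa.stft(audio, n_fft=2048, hop_length=512))

# Log-frequency spectrogram via the constant-Q transform, spanning the
# 88 piano keys (12 bins per octave starting at A0).
cqt = np.abs(librosa.cqt(audio, sr=SR, fmin=librosa.note_to_hz("A0"),
                         n_bins=88, bins_per_octave=12))

# Convert both to decibels so the network sees a sensible dynamic range.
stft_db = librosa.amplitude_to_db(stft, ref=np.max)
cqt_db = librosa.amplitude_to_db(cqt, ref=np.max)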

Our processing pipeline created an enormous amount of data. In total, we were able to produce about 2 million slices of songs. Of those two million, we managed to convert only about 20,000 into audio files and spectrograms due to time constraints.
During training and testing, we convert the sliced MIDI files to piano rolls on the fly. Each piano roll is represented as a 0/1 vector with 128 entries. Each entry corresponds to a specific piano key and is 1 if that note is being played at that time.
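
A minimal sketch of building one of these 128-entry label vectors, again assuming pretty_midi; the function name and window bounds are illustrative only.

import numpy as np
import pretty_midi

def window_label(pm: pretty_midi.PrettyMIDI, start: float, end: float) -> np.ndarray:
    # Return a 128-entry 0/1 vector with a 1 for each pitch sounding in [start, end).
    label = np.zeros(128, dtype=np.float32)
    for instrument in pm.instruments:
        for note in instrument.notes:
            if note.start < end and note.end > start:  # note overlaps the window
                label[note.pitch] = 1.0
    return label
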
Model creation
The model we used is a CNN, because of its strong performance on image data. Other music projects from last semester used long short-term memory (LSTM) networks, but since we transcribe each window largely independently of the ones around it, we were able to get away with a CNN. We based our initial model on a CNN tutorial that used MNIST as its data. Since MNIST is a single-label classification problem and ours is not, we had to make some changes.
We wanted to be able to detect multiple notes at the same time, so we dropped the softmax layer that is usually used for classification at the end of a CNN. This also meant that our loss function could not be categorical cross-entropy, because our output is not a single class. Instead, we used a sigmoid output layer with a binary cross-entropy loss. Unlike softmax, the sigmoid layer allows an independent probability for each output, which is exactly what we wanted.
As mentioned above, the network architecture is otherwise very similar to what you would find for the MNIST challenge. The layers we used were (a code sketch follows the list):
Conv2D, tanh activation (5x5)
Dropout (0.5)
MaxPooling2D (2x2)
Conv2D, tanh activation (3x3)
Dropout (0.5)
MaxPooling2D (2x2)
Sigmoid output
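
A minimal Keras sketch of this architecture; the filter counts, input shape, flatten/dense layers, and optimizer are assumptions, since only the layer types are listed above.

from tensorflow.keras import layers, models

model = models.Sequential([
    # (frequency bins, time frames, channels) -- placeholder input shape
    layers.Conv2D(32, (5, 5), activation="tanh", input_shape=(88, 22, 1)),
    layers.Dropout(0.5),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="tanh"),
    layers.Dropout(0.5),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="sigmoid"),  # one independent output per MIDI pitch
])

# Multi-label output, so binary cross-entropy rather than categorical.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["binary_accuracy"])
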
Post-Processing
The post-processing after the network's output was fairly straightforward. Each spectrogram corresponds to one short window of audio. The network converts each window into a piano roll, which we write out in MIDI format. We then stitch all of the MIDI segments together into a song and render it using FluidSynth and GarageBand.
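
A sketch of turning the predicted piano rolls back into a MIDI file, assuming pretty_midi; the velocity, threshold, and fixed note lengths are illustrative simplifications.

import numpy as np
import pretty_midi

def rolls_to_midi(rolls: np.ndarray, window: float = 0.5,
                  threshold: float = 0.5) -> pretty_midi.PrettyMIDI:
    # rolls: array of shape (num_windows, 128) of network outputs.
    pm = pretty_midi.PrettyMIDI()
    piano = pretty_midi.Instrument(program=0)  # acoustic grand piano
    for i, frame in enumerate(rolls):
        start = i * window
        for pitch in np.where(frame >= threshold)[0]:
            piano.notes.append(pretty_midi.Note(
                velocity=80, pitch=int(pitch),
                start=start, end=start + window))
    pm.instruments.append(piano)
    return pm

# Example usage: rolls_to_midi(predictions).write("song.mid")
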
Model training results
We were able to get very high accuracy from our model. Part of this is due to the fact that most entries of the label vector are 0 at any given time (most piano keys are not being pressed). With 128 keys, if a song plays only one note at a time, achieving over 99% accuracy is trivial: predicting that every key is silent already scores 127/128, or about 99.2%, per frame. For more complex songs, however, accuracy can be a reasonable health metric. We also found that our training and test cross-entropy decreased over time, which is an encouraging sign.

The network was not able to transcribe the left-hand part, but it appears to have transcribed the melody correctly. There may have been some data issues we were not aware of. More complex recordings yielded more mixed, and noisier, results.
To conclude
We're pleased that our network was able to produce some reasonable transcriptions, even if only for the simplest pieces. Most of the performance gains came from switching from STFT spectrograms to constant-Q spectrograms, which are presumably easier for the network to learn from.
There are many things we could do to improve performance but were unable to test due to time constraints. One way to increase the accuracy of our model would be to use an LSTM in addition to our CNN. The CNN would still do most of the work of deciding whether a note is present at a particular location, but an LSTM could help reduce interference or noise. An LSTM can learn trends in the music and could assign a higher probability to a particular note when the CNN is torn between multiple notes at a given location.
We could also try converting our network architecture into a residual network (ResNet). ResNets use skip connections between layers, which allow later layers to learn only the residual between the output of the preceding layers and the true target. In practice, this has enabled much deeper networks and better performance on image-processing tasks.
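
For illustration only (not something we built), a residual block in Keras looks roughly like this:

from tensorflow.keras import layers

def residual_block(x, filters=64):
    # Assumes x already has `filters` channels so the shapes match for the add.
    shortcut = x
    y = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    # The skip connection lets the block learn a residual on top of its input.
    return layers.Activation("relu")(layers.Add()([shortcut, y]))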

Another way to improve the model would be better data processing. Since many of our existing modeling gains came from cleaner data, we suspect that even cleaner data would give us the biggest boost in performance. For example, the constant-Q transform helped a lot, but there are other things we could do to help the network better differentiate notes. First, we could try different scalings of the constant-Q transform. The network did not correctly detect the left-hand part, and this may be because the lower frequencies are still packed too closely together. We could choose an adjusted scale that provides higher resolution in the lower notes, where we need it.

The last thing that could improve the quality of our results is note onset detection. Onset detection determines where each note begins within an audio clip. Most musical instruments produce signals with an initial burst of power that decays over time. An algorithm can look for these signatures to determine where each note starts. This would also allow our transcription software to detect the duration of notes.
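
As one possible starting point (librosa again as an assumption; we did not implement this), onset detection could look like:

import librosa

audio, sr = librosa.load("song_synth.wav", sr=22050)
# Times (in seconds) where new notes likely begin.
onset_times = librosa.onset.onset_detect(y=audio, sr=sr, units="time")
print(onset_times)
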
Overall, we had a great time with this project and there is still a lot of work to do.
