Chord recognition in symbolic music
(c) Ed Earl 2019; all rights reserved.
This post describes an attempt to address the problem of recognizing chords in symbolic music. The aim is to develop an offline program for labelling each timestep of a MIDI file with a single chord from a simple dictionary of chord labels. No prior information about the location of chord changes in the music is assumed.
Harmonic analyses of fifty songs from "A Corpus Study of Rock Music" (http://rockcorpus.midside.com) are used as ground truth data. The harmonic analyses are carefully aligned with MIDI transcriptions of the songs (a laborious process).
First, timesteps are identified at which simple chords (triads and the most common sevenths) are fully voiced, with no other notes sounding. I assume that at such timesteps, the correct label must be a chord consisting of the sounding notes, possibly plus some others. For example, at a timestep with only a D, F and A sounding, acceptable chord labels would include D minor (D F A) and B flat major seventh (B flat D F A), but not E major. To “seed” the chord labelling, the simplest matching label at each such timestep is extended until the next such timestep, and the first label is also extended backwards to the start of the song.
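The matching step above can be sketched as a subset test against a chord dictionary. The dictionary and chord names below are illustrative, not the study's actual label set:

```python
# Pitch classes as integers 0-11 (C=0, C#=1, ..., B=11).
# A label is acceptable when its chord tones include every sounding
# pitch class; the "simplest" match is the one with the fewest tones.
CHORD_TEMPLATES = {
    "D min": {2, 5, 9},        # D F A
    "Bb maj7": {10, 2, 5, 9},  # Bb D F A
    "E maj": {4, 8, 11},       # E G# B
}

def matching_labels(sounding):
    """Labels whose chord tones are a superset of the sounding notes."""
    return [name for name, pcs in CHORD_TEMPLATES.items()
            if sounding <= pcs]

def simplest_label(sounding):
    """Among matches, prefer the chord with the fewest tones."""
    matches = matching_labels(sounding)
    if not matches:
        return None
    return min(matches, key=lambda name: len(CHORD_TEMPLATES[name]))
```

With D, F and A sounding, both D minor and B flat major seventh match, and D minor is chosen as the seed label.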
Next, the algorithm attempts to relabel blocks of chords by repeatedly iterating backwards from the end of the song to the start, with an inner loop trying each possible block length at the target timestep and evaluating each possible block relabelling against a set of constraints. The algorithm stops when it fails to improve, with reference to a related set of constraints, within a certain number of iterations.
For each block, possible chord labels are eliminated until either one label remains, in which case the block is relabelled, or no labels remain, in which case the block is not relabelled. The process used to eliminate candidate chord labels uses the following constraints.
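The elimination logic can be sketched as below. The constraints themselves are represented as opaque predicates here, since their details are not reproduced in this sketch:

```python
def relabel_block(candidates, constraints, block):
    """Eliminate candidate labels for a block, one constraint at a time.

    Relabel only if exactly one candidate survives; if elimination
    leaves no candidates, the block keeps its current labels.
    'constraints' is a list of predicates taking (label, block) --
    placeholders for the hand-written constraints described in the post.
    """
    labels = list(candidates)
    for satisfies in constraints:
        labels = [label for label in labels if satisfies(label, block)]
        if len(labels) <= 1:
            break  # either a unique winner or nothing left
    return labels[0] if len(labels) == 1 else None
```

Returning `None` signals "do not relabel", matching the behaviour described above when no labels remain.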
A second approach frames chord recognition as a classification problem. A function is defined to calculate the amplitude of a given pitch class at a given timestep, using a decay function to model the decaying amplitude of sustained notes. The following features are then computed for each timestep.
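A minimal sketch of such an amplitude function follows. The exponential decay and the `half_life` parameter are assumptions for illustration; the post does not specify the decay function's exact form:

```python
def pc_amplitude(notes, pitch_class, t, half_life=8.0):
    """Amplitude of one pitch class at timestep t.

    'notes' is a list of (onset, midi_pitch, velocity) tuples.  Each
    note sounding at or before t contributes its velocity, decayed
    exponentially with the time elapsed since its onset (an assumed
    model of a sustained note dying away).
    """
    amp = 0.0
    for onset, pitch, velocity in notes:
        if onset <= t and pitch % 12 == pitch_class:
            amp += velocity * 0.5 ** ((t - onset) / half_life)
    return amp
```

For example, a note struck with velocity 100 contributes 50 to its pitch class one half-life later.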
The complete set of features used as input to the classifier for each timestep consists of the above features computed for twenty past and twenty future timesteps, plus features for the amplitude of each pitch class at the target timestep (computed using a variety of parameters). The class label at each timestep is the chord label.
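Assembling the context window might look like the sketch below. The zero-padding at song boundaries is an assumption; the post does not say how the start and end of a song are handled:

```python
def feature_vector(step_features, t, window=20):
    """Concatenate per-timestep features for the target timestep plus
    'window' past and 'window' future timesteps.

    'step_features' is a list (one entry per timestep) of equal-length
    feature lists; timesteps outside the song are zero-padded.
    """
    n = len(step_features[0])
    pad = [0.0] * n
    vec = []
    for offset in range(-window, window + 1):
        i = t + offset
        vec.extend(step_features[i] if 0 <= i < len(step_features) else pad)
    return vec
```

The resulting vector has (2 × 20 + 1) × n entries for n features per timestep, before any extra target-timestep amplitude features are appended.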
The training set consists of forty songs. Each song is transposed to each of the twelve pitch classes, then features and labels are computed. A random forest classifier is trained. Its accuracy is tested against a set of ten songs which were not used for training (and are not transposed before testing). The final accuracy figure is 69.7%.
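The twelve-fold transposition augmentation amounts to rotating each 12-dimensional pitch-class feature vector and shifting the chord root accordingly. A sketch, assuming labels are (root pitch class, quality) pairs (the actual label encoding is not specified in the post); the trained model itself would be something like scikit-learn's `RandomForestClassifier`:

```python
def transpose_features(pc_amplitudes, semitones):
    """Rotate a 12-dim pitch-class amplitude vector up by 'semitones'."""
    return [pc_amplitudes[(i - semitones) % 12] for i in range(12)]

def transpose_label(label, semitones):
    """Shift a (root_pitch_class, quality) label by 'semitones'."""
    root, quality = label
    return ((root + semitones) % 12, quality)

def augment(dataset):
    """Yield each (features, label) pair in all twelve transpositions."""
    for feats, label in dataset:
        for k in range(12):
            yield transpose_features(feats, k), transpose_label(label, k)
```

Transposing every training song to all twelve pitch classes lets the classifier learn chord shapes independently of key, at the cost of a twelve-fold larger training set.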
The harmonic analyses used as ground truth (again, "A Corpus Study of Rock Music", http://rockcorpus.midside.com) contain alternative analyses for several songs, evidence that chord perception is not unambiguous for expert human listeners. However, this does not seem to account for the classification failures in the algorithms’ output. Chord labels are often clearly inappropriate. The algorithms will often produce a bad chord label for several notes after a chord change, sometimes then changing again to the correct chord label. This is especially interesting given that the hand-written approach specifically penalizes greater numbers of chord changes. Apparently the evidence for the “bad” chord label is strong enough during the first few notes to justify an extra chord change.
Human listeners can often agree on an unambiguous chord change from as little as a solo bass note, when taking all musical factors into account. Although the descriptions in the sections above omit a great deal of experimentation with computed features, techniques and classifier types, several factors which seem musically significant have not yet been modelled and might be explored.
Beat, pulse and groove provide important clues to the location of chord changes, which tend to happen in regular places within a particular piece of music. The start of a bar is common, but if the groove is syncopated, chord changes may occur a quaver or semiquaver before the start of a bar. (In printed music, chords are nevertheless often notated at bar lines in this situation; the harmonic analyses used as ground truth for this study follow this practice.)
Repetition is important (particularly in rock music, but also in other genres). Figures and sections repeated from earlier in the music are an important clue that the chord is also the same. Exact repetition is rare, however; small variations are found everywhere, and repeating a figure of notes against different chords is also a common device.
Larger structure is also significant. Cues such as dynamics and arrangement signpost the way to section changes, bringing the expectation of harmonic development: new chords.
Formulaic cadences and chord progressions are particularly common in rock music, but are also important in other genres. However, as with other elements of music, composers’ chord choices exploit both familiarity and surprise to create interest. Listeners’ expectations are an essential part of the process.
I’m surprised that, while chord recognition in audio receives plenty of research attention, relatively little research seems to address chord recognition in symbolic music. High accuracy seems fairly difficult to achieve, even though the data set I used is rock music, which is commonly believed to be harmonically basic, formulaic and easy to grasp.
A much larger set of data, consisting of music transcriptions in a symbolic format such as MIDI, with chord labels for each timestep, would be of great benefit.