
Automated Violin Fingering Transcription Through Analysis of an Audio Recording

Akira Maezawa, Katsutoshi Itoyama, Kazunori Komatani, Tetsuya Ogata, and Hiroshi G. Okuno
Department of Intelligence Science and Technology
Kyoto University Graduate School of Informatics
Yoshida Honmachi, Sakyo, Kyoto 606-8501, Japan
akira [email protected]@[email protected] {ogata, okuno}@i.kyoto-u.ac.jp

Abstract: We present a method to recover fingerings for a given piece of violin music in order to recreate the timbre of a given audio recording of the piece. This is achieved by first analyzing an audio signal to determine the most likely sequence of two-dimensional fingerboard locations (string number and location along the string), which recovers the elements of violin fingering relevant to timbre. This sequence is then used as a constraint for finding an ergonomic sequence of finger placements that satisfies both the sequence of notated pitches and the given fingerboard-location sequence.

Fingerboard-location-sequence estimation is based on a hidden Markov model, each state of which represents a particular fingerboard location and whose emissions are modeled by a Gaussian mixture model of the relative strengths of harmonics. The relative strengths of harmonics are estimated from a polyphonic mixture using score-informed source segregation, and discrepancies between observed data and training data are compensated for through mean normalization.

Fingering estimation is based on the modeling of a cost function for a sequence of finger placements. We tailor our model to incorporate the playing practices of the violin.

We evaluate the performance of the fingerboard-location estimator with a polyphonic mixture, and with recordings of a violin whose timbral characteristics differ significantly from those of the training data. We subjectively evaluate the fingering estimator and validate the effectiveness of tailoring the fingering model towards the violin.

In musical instrument performance, deciding the sequence of finger placements needed to produce a given sequence of pitches, known as the fingering, is an important and sometimes difficult problem that musicians need to solve. Fingering decisions can be difficult because the fingering must be both musical and ergonomic, two often-conflicting ideals. For example, the “easiest” fingering on a violin often involves unmusically abrupt changes of timbre. This is because most pitches can be found in more than one location on the instrument (i.e., on more than one string). Each string, however, has a different timbre (for reasons we explain later), and often the easiest fingering involves changes between strings. An experienced musician would choose a balanced fingering that not only satisfies ergonomic finger placements but also expresses the musician’s artistic values.

The essence of violin fingering resides in finding both an ergonomic sequence of finger placements and a musically appropriate fingerboard location sequence, the sequence of locations on the fingerboard on which the finger presses the string. Each fingerboard location specifies both the longitudinal location along the string and the latitudinal position, i.e., which of the four strings is played. Each string is tuned differently and has a distinct timbre. Therefore, it is essential for a violinist to choose a fingerboard-location sequence that sounds musically well-motivated. For example, in Air on the G String, August Wilhelmj’s well-known arrangement of the second movement of J. S. Bach’s Orchestral Suite No. 3, the arranger specifies the entire solo violin part to be played on one string (the G string), most likely to maintain the consistent, warm timbre that is characteristic of the G string.

Computer Music Journal, 36:3, pp. 57–72, Fall 2012. © 2012 Massachusetts Institute of Technology.

This study aims to develop a method for analyzing an audio recording of a violin and estimating the fingering required to recreate the “sound” of a particular artist’s performance. Such a method would allow a beginner, for example, to analyze the recordings of past masters and gain insights on how to imitate them. It could also help violin students to appreciate different musical values by comparing fingerings played by different musicians. Finding similarities of fingering among musicians would allow one to find a stylistically suitable way to play a particular piece, a difficult task for a violinist studying alone without a teacher’s help.

Our goal requires (1) analyzing an audio signal to estimate the fingerboard-location sequence, and (2) choosing an ergonomic finger-placement sequence that satisfies the sequence of notated notes and the estimated fingerboard-location sequence. Existing methods in fingering estimation, however, do not focus on timbral differences caused by different fingerboard locations, and existing methods for estimating the fingerboard-location sequence lack robustness in a polyphonic mixture.

Most studies in fingering-estimation research intend not to retrieve a fingering from a particular recording, but rather to determine an ergonomic fingering. In these studies, the general framework involves designing a cost function for moving the hand from one shape to another, and finding a sequence of hand movements that minimizes the accrued cost. In the seminal work by Sayegh (1989), the cost uses the distance that the hand traverses horizontally and vertically across the instrument. Other work involves similar kinds of costs based on the actual distance that the hand needs to traverse under the different constraints posed by different instruments, such as the guitar (Radicioni, Anselma, and Lombardo 2004) or the piano (Kasimi, Nichols, and Raphael 2007; Yonebayashi, Kameoka, and Sagayama 2007). The costs may be given a priori or learned using training data (Radisavljevic and Driessen 2004). The formulation of fingering estimation as a cost-minimization problem has been given a probabilistic interpretation using hidden Markov models (HMM; Yonebayashi, Kameoka, and Sagayama 2007). These studies all seem to have the perspective that musicality resides in the sequence of notated notes, which overlooks the expression added by the performer. Such an orientation implies that given a symbolic representation of music, a fingering may be determined on a one-to-one basis. Each person, however, given a piece of music, might play it using a different fingering based on his or her musical values and physiological constraints. Clearly, performance analysis based on an audio signal or video is essential for realizing our goal of reconstructing fingerings from recorded performances.

Studies in detailed performance analysis fail when the music in the audio recording is polyphonic, as well as when the sound of the instrument that needs to be analyzed is significantly different from the training data. In practical situations, however, it is essential to be able to analyze a musical phrase within a polyphonic mixture. Methods for analyzing audio spectra to determine the control input for the violin (Krishnaswamy and Smith 2003; Barbancho 2009) or the guitar (Traube and Smith 2000) do not work well on instruments with acoustic characteristics other than those used in training, or make highly restrictive assumptions on how the instrument is played. Greater accuracy is reported using audiovisual fusion (Zhang, Zhu, and Leow 2007; Lu et al. 2008). Some studies use only visual information to analyze the fingering of a guitar (Burns and Wanderley 2006). In either case, the necessity of video severely limits the kind of musical recordings that can be analyzed.

In this article, we develop a method to recover the fingering from an audio recording of a violin in a polyphonic mixture by analyzing the audio recording and finding an ergonomic fingering that captures the particular recording’s timbre as expressed in a specific fingerboard-location sequence. Our method is a two-step procedure. First, it analyzes the audio signal to estimate the fingerboard-location sequence. Second, it determines an ergonomic fingering that satisfies the fingerboard-location sequence. When analyzing the fingerboard-location sequence, we use features that are robust in a polyphonic mixture. Moreover, the method uses a feature-adaptation mechanism to improve the robustness when the instrument sounds are different from the ones used for training. For the ergonomic fingering decision, we incorporate a cost function that reflects practices of violin playing (which are not necessarily applicable to other stringed instruments such as the guitar or the cello). We evaluate the performance of the fingerboard-location estimator and the fingering-decision method.


Figure 1. Violin fingering.

Violin Fingering: A Primer

We shall briefly review the fundamentals of violin fingering and its terminology, as these play important roles in understanding our method. The left hand defines the pitch and the string on which a note is played. Although there are two defining factors in fingering (i.e., finger and string), violinists often talk about fingering also in terms of the left hand position, i.e., the general placement of the left hand required to play a given note on a certain string using a certain finger. As shown in Figure 1, in this article we associate the string on which a note is played with the variable s, where s = 1 is the lowest-tuned string and s = 4 is the highest. These strings are conventionally tuned seven semitones apart, where s = 3 is typically tuned to A4 = 440 Hz. The strings are typically notated in a music score using a Roman numeral, where the lowest string (s = 1) is associated with IV, and the highest (s = 4) with I. Placed fingers are labeled by numbers, where 1 refers to the index finger, 2 the middle, 3 the ring, and 4 the little finger. An open string (i.e., no finger is pressed and the string vibrates at the tuned pitch) is referred to by 0; only four pitches on a violin can be played as an open string. In this article, the position is defined by the number of semitones that the first finger must traverse from the nut (the ridge over which the string passes on the end of the fingerboard near the tuning pegs) in order to play a particular fingering.
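The string numbering and tuning described above can be illustrated with a minimal sketch. It assumes equal-tempered MIDI note numbers with A4 = 69; the names `OPEN_STRING_MIDI` and `fingerboard_locations`, and the `max_semitones` limit, are hypothetical and introduced here only for illustration.

```python
# Open-string MIDI pitches, assuming s = 3 is A4 (MIDI 69) and adjacent
# strings are seven semitones apart, as stated in the text.
OPEN_STRING_MIDI = {s: 69 + 7 * (s - 3) for s in (1, 2, 3, 4)}


def fingerboard_locations(midi_pitch, max_semitones=24):
    """List (string, semitones above the open string) pairs for a pitch.

    max_semitones is a hypothetical limit on how far up a string one plays.
    """
    locations = []
    for s, open_pitch in OPEN_STRING_MIDI.items():
        offset = midi_pitch - open_pitch
        if 0 <= offset <= max_semitones:
            locations.append((s, offset))
    return locations


print(OPEN_STRING_MIDI)           # {1: 55, 2: 62, 3: 69, 4: 76}
print(fingerboard_locations(69))  # [(1, 14), (2, 7), (3, 0)]
```

The second line of output shows why the choice of string matters: A4 is available as an open string on s = 3, but also seven and fourteen semitones up the two lower strings.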

Finger placement is determined by considering technical ease and musical effect. Using a certain finger (e.g., the index finger instead of the little finger) facilitates execution of some musical effects related to pitch, such as a smooth transition between two notes (glissando), or a low-frequency modulation of a note (vibrato). At the same time, some sequences of finger placement are easier to execute than others. For example, rapid movement of the little finger is considerably more tiresome than that of the index finger. The choice of the string on which to play a given note is determined by considering the consistency of sonority. For example, because each string has a distinct sonority, violinists often play on one string to prevent abrupt changes of the sound quality. The difference in sonority when a given pitch is played on one string versus another results from (1) the differences in the physical attributes of each string itself, such as the diameter, tension, and material; (2) differences in the position along the string at which the finger must be placed (because each string is tuned differently); and (3) in cases when the pitch is available as an open string, the difference between the rigid termination provided by the nut and the soft termination provided by the finger.

The choice of string is the one factor in fingering that is most likely to produce audible differences. By contrast, the position is sometimes chosen for visual effect and to demonstrate the violinist’s technical skills. For example, the violinist may present a “flashy” playing style through a wide change of position.

The consistency of sonority offered by playing on one string often forms a trade-off with the consistency of position that facilitates playing. In a fast piece, abrupt changes of sonority caused by a certain fingering may be overlooked if it simplifies playing. On the other hand, in a slow piece, a violinist may choose a difficult fingering that produces a certain sonority. A violinist may, for example, value consistency of sonority in a slow, “singing” (cantilena) passage by playing on one string.

Method

Our method is based on determining the fingerboard-location sequence from a polyphonic audio mixture that contains a musical phrase for solo violin, and finding a sensible fingering that satisfies the estimated fingerboard-location sequence. We assume the violin plays a monophonic melody using normal bowing technique (i.e., we do not consider extended playing techniques such as plucking [pizzicato], striking the string with the wooden side of the bow [col legno], or playing very close to the bridge to produce a shimmering sound [sul ponticello]).

Figure 2. System block diagram.

Our task involves three aspects:

1. Designing a feature that performs well in a polyphonic mixture.

2. Designing a scheme such that the performance does not degrade from an acoustical mismatch with the training data, whether because of characteristics of the violins, the rooms, or the recording process.

3. Designing a fingering model that reflects the practices of violin performance.

We attack the first problem by fitting a sum-of-sinusoids plus noise model to the observed spectrum and using the estimated harmonic parameters as the feature. The second problem is attacked by normalizing the average features of the training set and by doing the same for the recording whose fingerboard-location sequence we would like to identify. We solve the third problem by introducing a new model of violin fingering, the violin pedagogical model, which incorporates features that are inspired by practices of violin playing. The system-level block diagram is shown in Figure 2.

Fingerboard-Location Sequence Estimation through Viterbi Alignment

In order to estimate a sensible fingering from the audio, the fingerboard-location sequence must be estimated from a polyphonic mixture of sound that contains a violin melody. Because the string on which a note is played (the bowed string) is the primary factor that influences the timbre for different fingerboard locations, we would like to estimate the sequence of bowed strings. Our estimation method is based on a bowed-string classifier using features that are robust to polyphonic accompaniment, and a classifier that takes into account the playability of a particular fingerboard-location sequence.

Feature Extraction by Harmonic-Model Fitting

We extract, from a polyphonic audio mixture that contains a violin melody, the relative strengths of the first N violin harmonics (partials), where N = 10, the first partial is the fundamental frequency, and the others are overtones. These relative strengths depend on the material properties of the string, the body-resonance characteristics, and on how the instrument is played (e.g., bow force, bow velocity, and the contact point of the bow and the string [Cremer and Allen 1984; Fletcher and Rossing 1998]). The feature extraction involves fitting a sum of harmonically spaced sinusoids onto the observed short-time Fourier transform representation of the input audio signal, Y(f, t), where f is the frequency bin and t is the time index. Because the input contains not only the harmonic sound of the violin but also transients of the violin and accompaniments, the harmonic sound of the violin part must be segregated. This is achieved by generating a time-frequency mask that passes the harmonic component of the violin part, based on the observed spectrogram, the music score given as a standard MIDI file (SMF), and the mapping between positions in the SMF and the audio signal (audio-score alignment). As a signal-processing front end, we attenuate the DC component and emphasize the high-frequency component by applying a two-tap, high-pass filter with filter coefficients [1, –1]. Then our method iterates the following steps for a fixed number of iterations:

1. Update the time-frequency mask to segregate the harmonic components of the violin part, using the audio-score alignment and the estimated fundamental frequency as the cue.

2. Estimate the fundamental frequency of the violin melody, taking into account the pitch notated in the score, the observed spectrogram, and the audio-score alignment.

In the first step, we generate a time-frequency mask, which approaches 1 for time-frequency bins that contain the violin sound and 0 otherwise. We incorporate the musical score information (from the SMF) and the audio-score alignment to find which bins are expected to contain the sound of the violin. Audio-score alignment is determined by extracting a time-sequence of chroma values from both the audio and the SMF, and performing dynamic time-warping between the chroma representation of the audio and SMF to find the optimal path, using the cosine distance as the metric between the audio and the score, similar to the existing work of Hu, Dannenberg, and Tzanetakis (2003).
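The alignment step can be sketched as a textbook dynamic time-warping over chroma frames with the cosine distance. This is only an illustration under stated assumptions: the step pattern (diagonal, vertical, horizontal with equal weights) is a common default and is not specified in the text, and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def dtw_path(audio_chroma, score_chroma):
    """Return the optimal warping path as (audio frame, score frame) pairs."""
    n, m = len(audio_chroma), len(score_chroma)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(audio_chroma[i - 1], score_chroma[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both sequences to the start.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    return path[::-1]


# Toy two-dimensional "chroma" frames (real chroma vectors have 12 bins).
audio = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
score = [[1.0, 0.0], [0.0, 1.0]]
print(dtw_path(audio, score))  # [(0, 0), (1, 0), (2, 1)]
```

In the toy example the first score frame is stretched over two audio frames, which is the kind of mapping that yields the notated pitch for every audio frame.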

Audio-score alignment gives the time-sequence of the notated pitch of the violin part, $\hat{f}_0(t)$, for all time points $t$. Given this, we simultaneously segregate the violin part and estimate the fundamental frequency of the violin part, $f_0(t)$. Our idea, similar to Kameoka’s method (Kameoka, Nishimoto, and Sagayama 2007), is to apply a mask that resembles a comb filter to the spectrum centered about $\hat{f}_0(t)$ to segregate the violin part. Using the segregated signal, we re-estimate the fundamental frequency. We then re-apply to the original spectrum the mask with the updated fundamental frequency. We iterate these two steps of fundamental frequency update and violin-part segregation until the fundamental frequency converges.

We first segregate the violin part by separating the signal into two sub-signals: the violin part and the residual (i.e., the rest of the signal). We assume that the likelihood of observing the violin part at frequency $f$ is $h_V(f \mid f_0(t))$, and is of the form $\sum_{n=1}^{N} \pi_n \mathcal{N}(f \mid n f_0(t), \sigma_f^2)$, and we assume that the likelihood of observing the residual is $h_R(f)$ and is of the form $1/F$, where $F$ is the number of frequency bins to consider. $\pi_n$ is a multinomial variable that indicates the relative strength of the $n$th partial. $\mathcal{N}(x \mid \mu, \sigma^2)$ is the likelihood of $x$ for a normal distribution with mean $\mu$ and variance $\sigma^2$. We assume that the likelihood of observing the violin part is $\alpha$, and the likelihood of observing the residual is $1 - \alpha$. For each frequency bin, we associate a latent variable $Z_V$, which indicates whether the violin or the residual contributed to the observed power. $Z_V(f, t) = 1$ indicates that at time $t$, the power contained in frequency $f$ originated from the violin part, and $Z_O(f, t, n) = 1$ indicates that, of the power generated by the violin, it originated from the $n$th partial of the violin part. The joint likelihood of the observed signal and the latent variables is given as follows:

\[
p(X(f,t), Z(f,t) \mid f_0(t), \alpha, \pi) = \prod_{f,t} \bigl((1-\alpha)\, h_R(f)\bigr)^{X(f,t)\,(1 - Z_V(f,t))} \prod_{n=1}^{N} \bigl(\alpha\, \pi_n\, \mathcal{N}(f \mid n f_0(t), \sigma_f^2)\bigr)^{X(f,t)\, Z_V(f,t)\, Z_O(f,t,n)}
\]

We optimize this model using the expectation-maximization (EM) algorithm, in a manner similar to the work of Kameoka, Nishimoto, and Sagayama (2007). In the E-step, we use the parameters from the previous step to find the distribution of $Z$ given $X$ and the parameters. We assign the following:

\[
Q_O(f,t,n) = p(Z_O(f,t,n) \mid X, f_0, \alpha, \pi) = \frac{\pi_n\, \mathcal{N}(f \mid n f_0(t), \sigma_f^2)}{\sum_{\tilde{n}} \pi_{\tilde{n}}\, \mathcal{N}(f \mid \tilde{n} f_0(t), \sigma_f^2)}
\]

\[
Q_V(f,t) = \frac{\alpha(t) \sum_{n} \pi_n\, \mathcal{N}(f \mid n f_0(t), \sigma_f^2)}{(1 - \alpha(t))\, h_R(f) + \alpha(t) \sum_{n} \pi_n\, \mathcal{N}(f \mid n f_0(t), \sigma_f^2)}
\]
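As a rough illustration of the E-step for a single frame, the following sketch evaluates both posteriors on a grid of frequency bins. It assumes frequencies are measured in bin indices so that the uniform residual density 1/F and the Gaussian densities share units; the function names and the uniform fallback used when every partial underflows to zero are illustrative choices, not part of the described method.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def e_step(freqs, f0, alpha, pi, var_f):
    """Posteriors for one frame.

    q_v[k]    : probability that bin k belongs to the violin part
    q_o[k][n] : probability that violin power in bin k comes from partial n + 1
    """
    h_r = 1.0 / len(freqs)  # uniform residual model, 1 / F
    q_v, q_o = [], []
    for f in freqs:
        comps = [pi[n] * normal_pdf(f, (n + 1) * f0, var_f) for n in range(len(pi))]
        total = sum(comps)
        if total > 0:
            q_o.append([c / total for c in comps])
        else:
            # All partials underflowed; fall back to a uniform split (assumption).
            q_o.append([1.0 / len(pi)] * len(pi))
        q_v.append(alpha * total / ((1.0 - alpha) * h_r + alpha * total))
    return q_v, q_o


freqs = list(range(100))
q_v, q_o = e_step(freqs, f0=20.0, alpha=0.5, pi=[0.5, 0.3, 0.2], var_f=1.0)
print(round(q_v[20], 3), round(q_v[30], 3))  # near 1 on a partial, near 0 between partials
```

The resulting `q_v` behaves like the comb-filter-shaped mask described earlier: it is close to 1 around multiples of the fundamental and close to 0 elsewhere.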


Figure 3. Block diagram of feature extraction step.

In the M-step, we update the fundamental frequency. We incorporate the knowledge that the fundamental frequency does not deviate too much from the notated pitch, by incorporating a prior distribution of the fundamental frequency as $p(f_0(t) \mid \hat{f}_0(t)) = \mathcal{N}(f_0(t) \mid \hat{f}_0(t), \sigma_{\hat{f}}^2)$. Then, the M-step yields the following updates:

\[
f_0 := \frac{\sigma_f^2\, \hat{f}_0 + \sigma_{\hat{f}}^2 \int_F \sum_{n=1}^{N} X(f,t)\, Q_V(f,t)\, Q_O(f,t,n)\, n f \, df}{\sigma_f^2 + \sigma_{\hat{f}}^2 \int_F \sum_{\tilde{n}=1}^{N} X(f,t)\, Q_V(f,t)\, Q_O(f,t,\tilde{n})\, \tilde{n}^2 \, df}
\]

\[
\pi_n := \frac{\int_F X(f,t)\, Q_V(f,t)\, Q_O(f,t,n)\, df}{\sum_{\tilde{n}} \int_F X(f,t)\, Q_V(f,t)\, Q_O(f,t,\tilde{n})\, df}
\]

\[
\alpha(t) = \frac{\int_F X(f,t)\, Q_V(f,t)\, df}{\int_F X(f,t)\, df}
\]
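A discretized version of these updates, with the integrals over frequency replaced by sums over bins, might look as follows. This is a sketch for one frame only; `var_prior` stands for the variance of the prior on the fundamental frequency, and all names are hypothetical.

```python
def m_step(freqs, power, q_v, q_o, f0_score, var_f, var_prior):
    """One M-step for a single frame, with sums over bins instead of integrals.

    power[k]  : observed power X in bin k
    q_v, q_o  : posteriors from the E-step
    f0_score  : notated fundamental frequency (the prior mean)
    """
    n_partials = len(q_o[0])
    num = var_f * f0_score
    den = var_f
    partial_mass = [0.0] * n_partials
    for k, f in enumerate(freqs):
        for n in range(n_partials):
            w = power[k] * q_v[k] * q_o[k][n]
            num += var_prior * w * (n + 1) * f
            den += var_prior * w * (n + 1) ** 2
            partial_mass[n] += w
    violin_mass = sum(partial_mass)
    f0 = num / den
    pi = [m / violin_mass for m in partial_mass]
    alpha = violin_mass / sum(power)
    return f0, pi, alpha


# Toy frame: partial 1 at bin 2, partial 2 at bin 4, accompaniment at bin 7.
freqs = list(range(10))
power = [0, 0, 4, 0, 2, 0, 0, 2, 0, 0]
q_v = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
q_o = [[0.5, 0.5]] * 10
q_o[2], q_o[4] = [1.0, 0.0], [0.0, 1.0]
f0, pi, alpha = m_step(freqs, power, q_v, q_o, f0_score=2.2, var_f=1.0, var_prior=1.0)
print(round(f0, 4), [round(p, 4) for p in pi], alpha)  # 2.0154 [0.6667, 0.3333] 0.75
```

The toy example shows the prior at work: the notated value 2.2 is pulled towards the observed harmonic positions, which on their own would give exactly 2.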

A block diagram of this method is shown in Figure 3.

Finally, we take the logarithm of $\pi(t)$ to enable feature adaptation, as will be discussed. Furthermore, we decorrelate the log-relative strengths by taking the discrete cosine transform. That is, $\pi(t) := \mathrm{DCT}(\log \pi(t))$.
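The transformation can be sketched in a few lines. The text does not state which DCT variant or normalization is used, so the unnormalized DCT-II below is an assumption, and `dct_ii` and `timbre_feature` are hypothetical names.

```python
import math


def dct_ii(x):
    """Unnormalized DCT-II of a sequence (the variant is an assumption)."""
    n = len(x)
    return [sum(x[m] * math.cos(math.pi * (m + 0.5) * k / n) for m in range(n))
            for k in range(n)]


def timbre_feature(pi):
    """DCT of the log relative strengths of the partials."""
    return dct_ii([math.log(p) for p in pi])


a = timbre_feature([0.5, 0.3, 0.2])
b = timbre_feature([1.0, 0.6, 0.4])  # same spectral shape, twice the level
print([round(y - x, 6) for x, y in zip(a, b)])
```

Because the logarithm turns a multiplicative gain into an additive offset and the DCT is linear, a uniform gain change moves only the zeroth coefficient. The same linearity is what allows additive mean normalization to be applied directly to these features.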

Bowed String Identification

Once we extract the parameters $\Theta(t) = \{f_0(t), \pi(t)\}$, we find the likelihood of the observed data for each of the four bowed strings, i.e., the probability at each point in time that the violinist is bowing a given string.

Figure 4. Block diagram of the GMM.

The likelihood of bowed string $s_i$ is modeled using a Gaussian mixture model (GMM). The GMM models the density of $\pi$ for each pitch $p(f_0)$, where $p(f)$ converts frequency $f$ into the closest MIDI note number; we write $p(t) = p(f_0(t))$. Let $\phi_j^{(i)}(p)$, $\mu_j^{(i)}(p)$, and $\Sigma_j^{(i)}(p)$ indicate respectively the weight, the mean, and the covariance of the $j$th component of the GMM for the $i$th bowed string at pitch $p$. Then, the likelihood of observing bowed string $i$ given the feature $\{f_0(t), \pi(t)\}$ and the GMM parameters $\theta_{GMM} = \{\phi_j^{(i)}(p), \mu_j^{(i)}(p), \Sigma_j^{(i)}(p) \mid \forall i, j, p\}$ is given as follows:

\[
p(s = i \mid \pi(t), p(t), \theta_{GMM}) \propto \sum_{j} \phi_j^{(i)}(p(t))\, \mathcal{N}\bigl(\pi(t) \mid \mu_j^{(i)}(p(t)), \Sigma_j^{(i)}(p(t))\bigr)
\]

$\theta_{GMM}$ is trained using violin audio examples with various playing styles, each of which has a known sequence of pitch and fingerboard position. Using examples with a variety of playing styles has the effect of averaging out statistical discrepancies of the feature arising from playing the example in a particular playing style. Hence, different parameters in the GMM correspond to timbral differences arising from the variety of fingerboard position and pitch combinations, and not playing style, e.g., bow pressure or bow velocity. Figure 4 shows the block diagram of the GMM.

The log-likelihood of the bowed string for the $k$th note, then, is a sum of bowed-string log-likelihoods over all audio frames that play the $k$th note. Let $T_k(t)$ be a binary variable that is 1 if frame $t$ plays the $k$th note notated on the score. Then, the bowed-string log-likelihood at note $k$, $\rho_k(s)$, is given as follows:

\[
\rho_k(s) = \sum_{t} T_k(t)\, \log p(s \mid \pi(t), p(t), \theta_{GMM})
\]

Figure 5. Block diagram of the HMM.
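The pooling of frame-level evidence into note-level scores can be sketched as below. The frame-level log-likelihoods are taken as given (in the method they come from the GMM); the indicator $T_k(t)$ is represented by a list that maps each frame to a note index, with `None` for frames in which the violin is silent. These representations and names are assumptions made for the example.

```python
def note_log_likelihoods(frame_log_lik, frame_to_note, num_notes, num_strings=4):
    """Sum per-frame string log-likelihoods over the frames of each note.

    frame_log_lik[t][s] : log-likelihood of string s at frame t
    frame_to_note[t]    : index k of the note sounding at frame t, or None
    Returns rho with rho[k][s] as in the equation above.
    """
    rho = [[0.0] * num_strings for _ in range(num_notes)]
    for log_lik, k in zip(frame_log_lik, frame_to_note):
        if k is None:  # frame not assigned to any note
            continue
        for s in range(num_strings):
            rho[k][s] += log_lik[s]
    return rho


frames = [[-1.0, -2.0, -3.0, -4.0],
          [-1.0, -1.0, -1.0, -1.0],
          [0.0, -5.0, -5.0, -5.0]]
print(note_log_likelihoods(frames, [0, 0, 1], num_notes=2))
# [[-2.0, -3.0, -4.0, -5.0], [0.0, -5.0, -5.0, -5.0]]
```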

Sequence Estimation using the Viterbi Algorithm

We determine the fingerboard-location sequence by finding the most likely bowed-string sequence, taking into account the observed acoustic signal and the difficulty of traversing from one fingerboard location to another. To incorporate both the likelihood of a given bowed-string sequence and the likelihood of observing the features given a particular fingerboard position in which a note is played, we model the bowed-string sequence as an HMM. Figure 5 shows a graphical depiction of our model.

We treat the string the note is played on as the hidden state that needs to be estimated, given the observed features $\pi(t)$, the fundamental frequency $f_0(t)$, and the inherent difficulty of traversing from one fingerboard position to another. Let $v$ be a parameter that governs the state transition probability of the HMM. Then, we find the most likely bowed-string sequence as follows:

\[
\hat{S} = \arg\max_{S = \{s_0, \cdots, s_N\}} p(S \mid \Theta; \theta_{GMM}, v)
= \arg\max_{S = \{s_0, \cdots, s_N\}} \Bigl[ \rho_0(s_0) + \log p(s_0; v) + \sum_{i=1}^{N} \bigl( \log p(s_i \mid s_{i-1}; v) + \rho_i(s_i) \bigr) \Bigr]
\]

$\rho_i(s_i)$ is the bowed-string log-likelihood of the GMM obtained in the previous section, for the $i$th note. The maximum-likelihood sequence is determined inductively, using the Viterbi algorithm. Let $S_{opt}(m, s \mid \pi, \Theta)$ be the optimal bowed-string sequence for the first $m$ notes that ends in bowed string $s$. Then, the log-likelihood of $S_{opt}(m+1, s \mid \pi, \Theta)$ is obtained from the recursion

\[
\log p(S_{opt}(m+1, s \mid \pi, \Theta)) = \log p(\pi(m+1) \mid s, \Theta) + \max_{\hat{s}} \bigl[ \log p(S_{opt}(m, \hat{s} \mid \pi, \Theta)) + \log p(s \mid \hat{s}, \Theta) \bigr]
\]

and $S_{opt}(m+1, s \mid \pi, \Theta)$ is formed by appending $s$ to $S_{opt}(m, \hat{s} \mid \pi, \Theta)$ for the maximizing $\hat{s}$.

Figure 6. Simplified horizontal–vertical model used in the Viterbi algorithm.

We design a suitable bowed-string transition probability $p(s_i \mid s_j; v)$ based on a simplified violin fingering model, as shown in Figure 6:

\[
p(s_i \mid s_j; v) \propto \exp\Bigl( -v \Bigl( \Bigl( \frac{p_{i,j}}{7} \Bigr)^2 + s_{i,j}^2 \Bigr) \Bigr)
\]

Here, $p_{i,j}$ is the amount of change between fingerboard positions and $s_{i,j}$ is the amount of change between string numbers. The constant 7 models the violinists’ tendencies to finger an interval of up to 7 semitones by either playing on the same string or crossing a string, but to finger a larger interval by crossing a string.
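The decoding described in this section can be sketched as a small Viterbi routine over the four strings. Several simplifications are assumptions of the sketch rather than statements about the method: the transition scores are left unnormalized, the initial-state prior is uniform, the change of fingerboard position is measured as the change in semitones above the open string, strings are indexed 0 to 3 (corresponding to s = 1 to 4), and a string is treated as unplayable when the pitch lies below its open pitch.

```python
OPEN_STRING_MIDI = [55, 62, 69, 76]  # strings s = 1..4 stored at indices 0..3


def log_transition(prev_pitch, prev_s, pitch, s, v):
    """Unnormalized log transition score, -v * ((p / 7)^2 + s^2)."""
    prev_loc = prev_pitch - OPEN_STRING_MIDI[prev_s]
    loc = pitch - OPEN_STRING_MIDI[s]
    p_change = loc - prev_loc
    s_change = s - prev_s
    return -v * ((p_change / 7.0) ** 2 + s_change ** 2)


def viterbi_strings(pitches, rho, v):
    """Most likely string index per note.

    pitches[k] : MIDI pitch of note k
    rho[k][s]  : note-level log-likelihood of string s for note k
    """
    neg_inf = float("-inf")

    def emit(k, s):
        # A pitch below the open string cannot be played on that string.
        return rho[k][s] if pitches[k] >= OPEN_STRING_MIDI[s] else neg_inf

    score = [emit(0, s) for s in range(4)]
    back = []
    for k in range(1, len(pitches)):
        new_score, ptr = [], []
        for s in range(4):
            cands = [score[q] + log_transition(pitches[k - 1], q, pitches[k], s, v)
                     for q in range(4)]
            best = max(range(4), key=cands.__getitem__)
            new_score.append(cands[best] + emit(k, s))
            ptr.append(best)
        score = new_score
        back.append(ptr)
    s = max(range(4), key=score.__getitem__)
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    return path[::-1]


# A4, B4, C5; the acoustic evidence slightly prefers the A string for B4 only.
pitches = [69, 71, 72]
rho = [[-10, 0, -10, -10], [-10, -0.5, 0, -10], [-10, 0, -10, -10]]
print(viterbi_strings(pitches, rho, v=1.0))  # [1, 1, 1]
print(viterbi_strings(pitches, rho, v=0.0))  # [1, 2, 1]
```

With the transition penalty switched off (`v = 0`), the weak acoustic preference causes an implausible string change for a single note; with `v = 1` the sequence stays on one string, which is the error-correcting behaviour the HMM is meant to provide.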


Feature Adaptation

Two violins typically sound different, mainly because each violin has a unique body-resonance characteristic. Therefore, it is essential for us to adapt our model to violins with different body-resonance characteristics.

Let $Y_1$ and $Y_2$ be the observed log-magnitude spectra of two violins. Let $B_1$ and $B_2$ be the log-magnitude frequency responses of the bodies of the two violins, and $S$ be the log-magnitude spectrum of the bow–string interaction. Assuming that the acoustic property of the bow–string interaction does not change significantly across different instruments, we obtain the following relations:

\[
Y_1 = B_1 + S, \qquad Y_2 = B_2 + S
\]

Taking the expectation, we obtain $Y_2 = Y_1 - \langle Y_1 \rangle_S + \langle Y_2 \rangle_S$, where $\langle Y \rangle_X$ denotes the expectation of $Y$ under the probability distribution of $X$. In our method, we let $Y_1$ be the observed features, and generate $Y_2$, which is the feature when $Y_1$ is played on an instrument with body-resonance characteristic $B_2$. $\langle Y_1 \rangle_S$ is obtained readily by taking the average of the observed signal, and $\langle Y_2 \rangle_S$ is obtained by summing the average feature of each bowed string, weighted by how often a particular string is played at a given pitch for the given music score. Because our features are linear transformations of the logarithm of the relative powers of harmonic peaks, the same rationale holds. Initially, we set the probability distribution of $S$ as follows:

\[
p(S \mid \mathrm{pitch}, \beta) \sim
\begin{cases}
\exp\Bigl( -\dfrac{\mathrm{pitch} - \mathrm{pitch}_0(S)}{\beta} \Bigr) & \text{if } \mathrm{pitch} > \mathrm{pitch}_0(S), \\
0 & \text{otherwise}
\end{cases}
\]

Here, pitch is the played pitch, $\mathrm{pitch}_0(S)$ is the pitch of the open string $S$, and $\beta$ is a positive parameter that assigns a greater probability to lower fingerboard positions, which are more commonly used and somewhat easier for the violinist.
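A minimal sketch of the adaptation is given below. It assumes that the training means are available per (string, pitch) pair, that the string prior is normalized over the strings, and that averages are taken over frames; none of these details is fixed by the text, and the names are hypothetical.

```python
import math


def string_prior(pitch, open_pitches, beta):
    """Initial string distribution: exp(-(pitch - open) / beta) above the open pitch."""
    weights = [math.exp(-(pitch - p0) / beta) if pitch > p0 else 0.0
               for p0 in open_pitches]
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else weights


def adapt_features(observed, pitches, train_mean, open_pitches, beta):
    """Shift observed features so that their mean matches the expected training mean.

    observed[t]            : feature vector of frame t
    pitches[t]             : notated MIDI pitch at frame t
    train_mean[(s, pitch)] : average training feature for string index s at that pitch
    """
    dim = len(observed[0])
    obs_mean = [sum(x[d] for x in observed) / len(observed) for d in range(dim)]
    target_mean = [0.0] * dim
    for pitch in pitches:
        for s, w in enumerate(string_prior(pitch, open_pitches, beta)):
            if w > 0:
                mu = train_mean[(s, pitch)]
                for d in range(dim):
                    target_mean[d] += w * mu[d] / len(pitches)
    return [[x[d] - obs_mean[d] + target_mean[d] for d in range(dim)] for x in observed]


open_pitches = [55, 62, 69, 76]
train_mean = {(s, 70): [10.0] for s in range(3)}
adapted = adapt_features([[1.0], [3.0]], [70, 70], train_mean, open_pitches, beta=5.0)
print([round(x[0], 6) for x in adapted])  # [9.0, 11.0]
```

The observed frames keep their spread around the mean, but the mean itself is moved to the value expected from the training instrument, which is the mean normalization described above.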

Fingering Estimation

At the core of our fingering estimation is the violin-fingering model, which models the difficulty of a particular fingering. The fingering is determined by designing a cost function between multiple states of hand position, which reflects how difficult it is for the hand to traverse from one position to another.

Violin Fingering Model

In this study, we extend the existing horizontal–vertical cost model presented in the previous section to include fingering practices that are unique to the violin, which we call the violin pedagogical model. It is mainly inspired by practices of violin fingering as suggested by Yampolsky (1967) and Flesch (2000). The violin, in particular, is a small, unfretted bowed stringed instrument, making it susceptible to intonation errors. Hence, violin fingerings are often set such that the weak finger (the little finger) is used sparingly, and the other fingers are used in such a way that a natural hand position is maintained.

We define a fingering as a 4-tuple $\mathbf{n} = (n, s, f, p)$, where $n \in \mathbb{N}$ is the pitch in MIDI note number, $s \in \{1, 2, 3, 4\}$ is the bowed string, $f \in \{0, 1, 2, 3, 4\}$ is the pressed finger (0 = open string, i.e., no finger placed), and $p \in \mathbb{N}$ is the position. Let $\mathcal{F}$ be the set of all possible fingerings. Also, let $n(\mathbf{n})$ be a function that retrieves the pitch of $\mathbf{n} \in \mathcal{F}$, $s(\mathbf{n})$ the bowed string, $f(\mathbf{n})$ the finger, and $p(\mathbf{n})$ the position.

We define an unnotated score, $S_u \in \mathbb{N}^M$, to be an $M$-tuple of notes, where $M$ is the number of notes contained in the music score, and the $i$th element, $S_u(i)$, is the $i$th note of the music score. We define a bowed-string constraint, $c_{bow} \in \{1, 2, 3, 4\}^M$, associated with an unnotated score $S_u$ as an $M$-tuple, where the $i$th element indicates the bowed string for the $i$th note. We finally define a notated score, $S_n \in \mathcal{F}^M$, where the $i$th element contains the fingering for the $i$th note.

We then formulate our problem as finding the optimal notated score $S_{opt}$ that satisfies both the note sequence of the unnotated score obtained using the SMF and the bowed-string constraint determined using the Viterbi algorithm. Namely, we define cost functions for traversing a sequence of two notes or three notes, and find the fingering with the smallest net cost that satisfies the constraints. Let $C_b(s_i, s_{i-1}) : \mathcal{F}^2 \to \mathbb{R}$ be a cost function defined over a sequence of two notes. We define a symbol $Z$ that satisfies the following:

\[
\forall A \in \mathcal{F}.\quad C_b(Z, A) = C_b(A, Z) = 0
\]

Then, the optimal fingering is a notated score $F_{opt}$ that satisfies the following:

\[
F_{opt} = \arg\min_{s_1 \cdots s_M \in \mathcal{F}} \sum_{i=1}^{M} C_b(s_i, s_{i-1})
\]

such that $\forall j < 1,\ s_j = Z$, and $\forall j \in [1, M],\ n(s_j) = S_u(j)$ and $s(s_j) = c_{bow}(j)$.

Let $\tau_p = 3$, $\delta_{i,j}$ be Kronecker’s delta, and $1(c)$ be a function that is 1 if condition $c$ is true and 0 otherwise.

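We define the following quantities for convenience, where $p_0(s)$ denotes the pitch of open string $s$:

\[
\begin{aligned}
p(i,j) &= p(s_i) - p(s_j), \qquad s(i,j) = s(s_i) - s(s_j), \qquad f(i,j) = f(s_i) - f(s_j) \\
p_r(i,j) &= |n(s_i) - p_0(s(s_i))| - |n(s_j) - p_0(s(s_j))| \\
R(i,j) &= [P_{min}(f(s_i), f(s_j)),\ P_{max}(f(s_i), f(s_j))]
\end{aligned}
\]

\[
P_{max} =
\begin{pmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 2 & 4 & 6 \\
0 & 2 & 0 & 2 & 5 \\
0 & 4 & 2 & 0 & 3 \\
0 & 6 & 5 & 3 & 0
\end{pmatrix}
\qquad
P_{nat} =
\begin{pmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 2 & 3 & 5 \\
0 & 2 & 0 & 1 & 2 \\
0 & 3 & 1 & 0 & 2 \\
0 & 5 & 2 & 2 & 0
\end{pmatrix}
\qquad
P_{min} =
\begin{pmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 2.1 & 3.5 \\
0 & 1 & 0 & 1.1 & 2.5 \\
0 & 2.1 & 1.1 & 0 & 1.5 \\
0 & 3.5 & 2.5 & 1.5 & 0
\end{pmatrix}
\]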

Then, we define a two-note generic fingering model (denoted HVM-2) as the following set of features $x_{BN}$:

\[
\begin{aligned}
b_0 &= |p(i,i-1)| \cdot 1(f(s_i) \neq 0 \wedge f(s_{i-1}) \neq 0) \\
b_1 &= 1(f(s_i) \neq 0 \wedge f(i,i-1) = 0 \wedge p(i,i-1) \neq 0) \\
b_2 &= 1(f(i,i-1) \cdot p_r(i,i-1) < 0 \wedge |p(i,i-1)| < \tau_p) \\
b_3 &= |p_r(i,i-1)| - P_{nat}(f(s_i), f(s_{i-1})) \cdot 1(|p_r(i,i-1)| \in R(i,i-1)) \\
x_{BN}(1) &= \infty \cdot 1(n(s_i) - p_0(s(s_i)) - p(s_i) \notin R(1,i)) \\
x_{BN}(2,3) &= (b_0, b_0^2) \\
x_{BN}(4) &= b_1 \\
x_{BN}(5,6) &= (|s(i,i-1)|, |s(i,i-1)|^2) \cdot 1(f(s_i) \cdot f(s_{i-1}) = 0) \\
x_{BN}(7) &= 1(|p_r(i,i-1)| - P_{nat}(f(s_i), f(s_{i-1})) \neq 0) \\
x_{BN}(8) &= b_2 \cdot 1(f(s_i) \neq 0 \wedge f(s_{i-1}) \neq 0) \\
x_{BN}(9) &= b_3 \cdot 1(f(s_i) = 0 \vee f(s_{i-1}) = 0)
\end{aligned}
\]

$x_{BN}(1)$ determines whether a note is physically playable on a particular string, where $\infty \times 0 = 0$; i.e., this feature discards any fingering that is physically unplayable, as shown in Figure 7. $x_{BN}(2, 3)$ applies a penalty for a change of position. $x_{BN}(4)$ penalizes a change of position using the same finger (i.e., a glissando). $x_{BN}(5, 6)$ adds a penalty for a change of fingerboard location. $x_{BN}(7)$ penalizes playing in an unnatural hand position. $x_{BN}(8)$ penalizes playing a sequence in which the second note is placed higher on the fingerboard, but the finger traverses from high finger to low (e.g., little finger to the index finger), and vice versa. Finally, $x_{BN}(9)$ prevents change of position (wrist movement), when it is completely natural not to do so.

Figure 7. Graphical description of $x_{BN}(1)$. A Roman numeral indicates the string (IV = lowest, I = highest), and a number indicates the pressed finger (1 = index finger, 4 = little finger).

Figure 8. Graphical description of $x_{BV}(1)$.

We define a two-note fingering model inspired by the literature of violin pedagogy (denoted VPM-2) as the following set of features $x_{BV}$:

\[
\begin{aligned}
\mathrm{cond}_0 &= f(s_i) \cdot f(s_{i-1}) \neq 0 \wedge |p(i,i-1)| < \tau_p \\
x_{BV}(1) &= 1(\mathrm{cond}_0 \wedge p_r(i,i-1) \neq 0 \wedge f(i,i-1) = 0) \\
x_{BV}(2) &= 1(\mathrm{cond}_0 \wedge |p_r(i,i-1)| = 1 \wedge |f(i,i-1)| > 1) \\
x_{BV}(3) &= 1(f(s_i) \cdot f(s_{i-1}) \neq 0 \wedge p(i,i-1) = 0 \wedge p_r(i,i-1) \cdot f(i,i-1) = -1) \\
x_{BV}(4) &= 1(f(s_i) = f(s_{i-1}) = 4 \wedge |n(s_i) - n(s_{i-1})| > 1) \\
x_{BV}(5) &= p(s_i) \\
x_{BV}(6,7,8) &= (\delta_{2,p(s_i)}, \delta_{5,p(s_i)}, \delta_{1,f(s_i)})
\end{aligned}
\]

$x_{BV}(1)$ penalizes using the same finger to move to a different position, as shown in Figure 8. $x_{BV}(2)$ penalizes a chromatic change of the relative position using a nonadjacent finger, as shown in Figure 9; such movement involves wrist motion, which is considerably harder to tune than placing a finger with a slight stretch. $x_{BV}(3)$ lessens the penalty for shifting up from a higher-numbered finger to a lower-numbered finger, or vice versa, when the shift is small, as shown in Figure 10. $x_{BV}(4)$ penalizes an adjacent use of the little finger, which is weak and prone to intonation errors. $x_{BV}(5)$ adds a penalty for an extremely high position, which is harder to play in tune. $x_{BV}(6, 7)$ adds a preference for playing in the first, second, or third positions, which are typically the three easiest positions to play in. $x_{BV}(8)$ penalizes the half-position, the lowest possible position but an unconventional one; hence, it is unnatural from the perspective of violin pedagogy but perhaps the easiest position to play in tune from a physical perspective.

Figure 9

Figure 10

Using these features, we define the two-note cost function as follows:

\[
C_b = \mathrm{diag}\bigl(W_{BN}\ \ W_{BV}\bigr)\,\bigl[x_{BN}\ \ x_{BV}\bigr]^{T}
\]

WBN and WBV define the relative weight of each feature. Given ample training data, statistical machine learning of these weights might be possible. In this article, we choose to manually adjust the weights, as the violin fingering corpus is too small to permit statistical machine learning.
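As a rough illustration of the weighted feature vector in the equation above, the following Python sketch multiplies each feature by its weight. The weights are the manually tuned values reported in Experiment 3 of this article; the feature values, the function names, and the final summation into a single scalar are assumptions made for illustration only, because the equation gives the cost as diag(W) x without stating how the entries are aggregated.

```python
# Manually tuned weights as reported in Experiment 3 of the article.
W_BN = [1, 0.5, 1, 1, 5, 1, 0.5, 20, 10]
W_BV = [1, 10, -0.5, 10, 1.1, -0.3, -0.2, 0.2]


def weighted_features(x_bn, x_bv, w_bn=W_BN, w_bv=W_BV):
    """Return diag([W_BN W_BV]) [x_BN x_BV]^T as a flat list."""
    if len(x_bn) != len(w_bn) or len(x_bv) != len(w_bv):
        raise ValueError("feature and weight vectors must have equal length")
    weights = list(w_bn) + list(w_bv)
    features = list(x_bn) + list(x_bv)
    return [w * x for w, x in zip(weights, features)]


def two_note_cost(x_bn, x_bv):
    """ASSUMPTION: the scalar cost is the sum of the weighted features."""
    return sum(weighted_features(x_bn, x_bv))


# Hypothetical feature values for one two-note transition (not from the article).
x_bn = [0, 1, 0, 0, 0, 0, 0, 0, 0]  # only xBN(2) is active
x_bv = [0, 0, 0, 0, 3, 0, 1, 0]     # xBV(5) = 3 and xBV(7) = 1
cost = two_note_cost(x_bn, x_bv)
```

Negative weights, such as those on xBV(3), xBV(6), and xBV(7), act as rewards rather than penalties under this reading, which matches the description of those features as lessening a penalty or expressing a preference.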

Experiments

We perform three experiments. The first experiment assesses the performance of feature-adaptation and error-correction algorithms on bowed-string estimation, the second experiment assesses the performance of the bowed-string classifier in a polyphonic mixture, and the third experiment evaluates the playability of our new fingering model through subjective experiments. In the first two experiments, we prepared recordings of three pieces of classical music, using two significantly different fingerings (207 notes for three pieces, yielding a total of 414 notes). For each fingering, we recorded the music with the following three conditions:

1. Using the same violin and strings as were used to record the training data (denoted as setup SS).

2. Using the same violin but with a different brand of strings (setup SD).

3. Using a different violin with a different brand of strings (setup DD).

This results in about 18 minutes of audio for validation. The training data, which lasts about 24 minutes, consists of two-octave chromatic scales played on each string of an electric violin, using various dynamics.

Figure 11. Recognition accuracy (%) for different values of v (all values are for data with data adaptation). SS = same violin and same strings as training data; SD = same violin with different strings; DD = different violin and different strings.

Setups SS and SD were played on the same electric violin, and DD on an acoustic violin. We chose to record the training data using an electric violin and to use an acoustic violin for DD for two reasons. First, it is easy to record noise-free training data using an electric violin. Second, because electric and acoustic violins sound extremely different, the evaluation of the feature adaptation mechanism can be regarded as the worst-case performance. Therefore, in a real-life application, we expect the system to perform somewhere between setup SD and DD, as long as the training data use an acoustic violin. In the EM algorithm, we set $\sigma_f^2 = 0.1$ and $\sigma_{\hat{f}}^2$ to start at 50 and narrow down inverse-linearly with each iteration of the algorithm. The value β used in model adaptation is chosen to be 0.1.
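The phrase "narrow down inverse-linearly" admits more than one reading. A minimal sketch, assuming the variance at iteration k is the starting value divided by k (the function name and the iteration count are hypothetical and not taken from the article):

```python
SIGMA_F_SQ = 0.1          # fixed variance reported in the article
INITIAL_VARIANCE = 50.0   # starting value of the annealed variance


def variance_schedule(initial=INITIAL_VARIANCE, n_iterations=10):
    """Return initial / k for k = 1..n_iterations.

    ASSUMPTION: 'inverse-linear' is read as division by the iteration index.
    """
    return [initial / k for k in range(1, n_iterations + 1)]


schedule = variance_schedule()
```

Under this reading the variance is wide in early iterations, so the estimate can move freely, and it tightens as the EM algorithm proceeds.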

Experiment 1: Evaluation of Bowed-String Identifier

We evaluated the accuracy of the bowed-string estimator with and without feature adaptation, each time evaluating the accuracy (1) of a baseline, i.e., without considering any sequential information; (2) considering sequential information using a previous study (Maezawa et al. 2009, 2010); and (3) considering sequential information using the Viterbi algorithm as proposed in this article. The value of v used for the Viterbi algorithm in the bowed sequence estimation was set to 30, chosen by evaluating the accuracy for various values of v, as shown in Figure 11. Figure 12 shows the result. We find that our method consistently outperforms our previous study. In all cases, adaptation decreases the performance when the training data and the validation data are from the same instrument and the same brand of strings. This is because adaptation itself depends on the estimated sequence of fingerboard positions, which contains errors. The discrepancy between the actual and the estimated sequences of fingerboard positions is small enough that adaptation still absorbs the differences in body resonance characteristics between two different violins or strings, but it becomes significant when the same strings and violin are used.
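To make the difference between the baseline and sequence-aware decoding concrete, the following Python sketch implements a generic log-domain Viterbi decoder. It is not the model of this article: the two states, the probabilities, and the function name are hypothetical, and the way the parameter v enters the article's transition model is not reproduced here.

```python
import math


def viterbi(log_emission, log_transition, log_initial):
    """Return the most likely state path.

    log_emission[t][s]   : log-likelihood of observation t under state s
    log_transition[r][s] : log-probability of moving from state r to state s
    log_initial[s]       : log-probability of starting in state s
    """
    n_states = len(log_initial)
    score = [log_initial[s] + log_emission[0][s] for s in range(n_states)]
    back_pointers = []
    for t in range(1, len(log_emission)):
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s at time t.
            best = max(range(n_states),
                       key=lambda r: score[r] + log_transition[r][s])
            pointers.append(best)
            new_score.append(score[best] + log_transition[best][s]
                             + log_emission[t][s])
        score = new_score
        back_pointers.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def log_matrix(rows):
    return [[math.log(p) for p in row] for row in rows]


# Hypothetical two-state example: staying on a state is far likelier than switching.
emission = log_matrix([[0.8, 0.2], [0.4, 0.6], [0.8, 0.2]])
transition = log_matrix([[0.9, 0.1], [0.1, 0.9]])
initial = [math.log(0.5), math.log(0.5)]

framewise = [max(range(2), key=lambda s: frame[s]) for frame in emission]
decoded = viterbi(emission, transition, initial)
```

In this toy case the frame-by-frame choice flips to the second state for the weakly ambiguous middle observation, whereas the decoded path stays on the first state throughout, which is the kind of error correction that sequential information provides.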


Figure 12. Comparison of recognition accuracy (%) for different sequence estimation methods. SS = same violin and same strings as training data; SD = same violin with different strings; DD = different violin and different strings.

Figure 13. Recognition accuracy with different levels of accompaniment. SS = same violin and same strings as training data; SD = same violin with different strings; DD = different violin and different strings.

Experiment 2: Effect of Accompaniment on Feature Extraction

Piano accompaniments were generated for two of the three pieces, using a synthesizer with uniform note velocity, and the amplitude was adjusted such that the peak values of the solo violin and the accompaniment were identical. Next, the recognition accuracy was evaluated for each violin/string type, by changing the level of the accompaniment. We scaled the amplitude of the violin part such that the root-mean-square power of the violin part relative to that of the accompaniment matched a given signal-to-accompaniment ratio. Figure 13 shows the result. We find that the bowed-string estimator performs similarly for different ratios, as long as the ratio is positive, that is, the signal level of the violin is greater than the "noise" level (i.e., the accompaniment).
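The scaling step described above can be sketched as follows. This is a minimal illustration, assuming the signal-to-accompaniment ratio is expressed in decibels (consistent with a "positive" ratio meaning that the violin is louder) and that power is the mean of the squared samples; the function names and the sample values are hypothetical.

```python
import math


def mean_square_power(signal):
    """Mean of the squared samples."""
    return sum(x * x for x in signal) / len(signal)


def ratio_db(violin, accompaniment):
    """Signal-to-accompaniment ratio in dB (ASSUMPTION: dB scale)."""
    return 10.0 * math.log10(mean_square_power(violin)
                             / mean_square_power(accompaniment))


def scale_to_ratio(violin, accompaniment, target_db):
    """Scale the violin part so that ratio_db(...) equals target_db."""
    p_violin = mean_square_power(violin)
    p_accomp = mean_square_power(accompaniment)
    if p_violin == 0 or p_accomp == 0:
        raise ValueError("both signals must have non-zero power")
    target_power = p_accomp * 10.0 ** (target_db / 10.0)
    gain = math.sqrt(target_power / p_violin)
    return [gain * x for x in violin]


# Hypothetical sample values standing in for real audio.
violin = [0.5, -0.5, 0.5, -0.5]
accompaniment = [1.0, -1.0, 1.0, -1.0]
scaled = scale_to_ratio(violin, accompaniment, 6.0)
```

Only the violin part is rescaled, so the accompaniment level stays fixed across conditions, as in the procedure described above.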

Experiment 3: Subjective Evaluation of the Fingering Estimation

First, ten excerpts from three classical pieces were prepared (approximately 200 notes total). For each excerpt, we generated two fingerings: one using HVM-2, and the other using HVM-2+VPM-2. We assume that bowed-string estimation is perfectly accurate, by not incorporating constraints based on the estimated bowed string. The feature weight W was manually tuned, by first adjusting the parameters for example (1) until it generated satisfactory fingerings for the repertoires considered. Then, parameters pertaining to example (2) were adjusted. They were set to WBN = [1, 0.5, 1, 1, 5, 1, 0.5, 20, 10] and WBV = [1, 10, −0.5, 10, 1.1, −0.3, −0.2, 0.2]. Excerpts that generated different fingerings for each of the setups were then extracted.

Next, seven violinists of various skills (amateur and professional, i.e., ten or more years of experience) evaluated the generated fingerings using a form as shown in Figure 14. Each violinist was presented with fingerings generated using HVM-2 and HVM-2+VPM-2, and was asked to choose the better of the two, if any. Seven violinists were each given ten questions (70 total), but only 66 answers were provided. Table 1 shows the result of the survey. The sign test shows that HVM-2 and HVM-2+VPM-2 are not equally favored (p = 0.01), which suggests that our proposed model (HVM-2+VPM-2) is favored over the baseline (HVM-2 only).

Figure 14. A screen capture of the questionnaire.

Table 1. The Number of Answers Indicating a Preference for the Baseline Model (HVM-2) or for the Violin Pedagogical Model (HVM-2 + VPM-2)

Setup            Total Count
HVM-2            5
HVM-2 + VPM-2    54
No preference    7
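The sign test on the counts in Table 1 can be sketched as an exact binomial test. The article does not state how the "no preference" answers were treated or whether the test was one-sided or two-sided; the sketch below assumes that ties are dropped and that the test is two-sided, and the function name is hypothetical.

```python
from math import comb


def sign_test_p_value(count_a, count_b):
    """Two-sided exact sign test under H0: both options are equally likely.

    ASSUMPTION: 'no preference' answers are excluded before calling this.
    """
    n = count_a + count_b
    k = min(count_a, count_b)
    # Probability of a split at least as extreme as k in one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Counts from Table 1: 5 answers for HVM-2, 54 for HVM-2 + VPM-2.
p_value = sign_test_p_value(5, 54)
```

With these assumptions the computed value falls well below the 0.01 level, which agrees with the conclusion that the two models are not equally favored.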

Discussion

From the results in Experiment 1 we observe that in setup DD (the most realistic situation), adaptation improves the recognition accuracy, whereas in SS and SD, the accuracy decreases. We believe this occurs because of the mismatch between the distribution of the actual bowed-string sequence and that assumed in our model. Moreover, we find that our study offers major improvements compared with our previous study. In all cases, our method improves the recognition accuracy over the baseline, which suggests that considering the playability of a particular sequence of notes over a particular sequence of bowed strings is effective.

Experiment 2 suggests that our features are robust in polyphonic audio mixtures, as long as the signal level of the violin solo part is at least as loud as the accompaniment. This condition seems to hold in pieces where the violin has an important melody (and hence, the choice of fingerboard location becomes an even greater musical issue). Therefore, we believe our method performs without significant degradation of accuracy in practical applications involving works for violin and piano.

Finally, we found that the fingering generated is more natural when the proposed fingering model is used than when existing ones are. We believe incorporating the preference for the first and third positions (index finger placed 2 semitones and 5 semitones above the nut of the violin, respectively) is the chief reason: they are the most frequently used positions on the violin, and fingerings generated on these positions are more natural for violinists to play. We find that incorporating these kinds of heuristics could drastically improve playability.


Figure 15. Estimated fingering from the first 50 bars of Romanze in C played by Joachim (1903).

Next, we shall discuss an application of our system for recovering the fingering of a recording from more than a century ago.

Application: Demystifying Historical Recordings

Our method may be used to analyze a recording from the past to recover the fingering of a legendary violinist. In this example, we attempt to recover the fingering of Romanze in C composed by Joachim, a legendary violinist of the late 19th century, using a gramophone recording made by Joachim in 1903. The recording was pitch-shifted such that the notated A4 was set to 440 Hz. The HVM-2+VPM-2 fingering model was used to estimate the fingering, an excerpt of which is shown in Figure 15.
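The article does not say how the pitch shift was computed or what the original tuning of the recording was. A minimal sketch, assuming the shift is derived from a measured A4 frequency on an equal-tempered semitone scale (the 430 Hz value and the function names are hypothetical, chosen only to make the arithmetic concrete):

```python
import math


def semitone_shift(measured_a4_hz, target_a4_hz=440.0):
    """Shift, in equal-tempered semitones, that maps the measured A4 to the target."""
    return 12.0 * math.log2(target_a4_hz / measured_a4_hz)


def frequency_ratio(shift_semitones):
    """Frequency multiplier corresponding to a shift in semitones."""
    return 2.0 ** (shift_semitones / 12.0)


# Hypothetical measurement; the article reports no such value.
measured_a4 = 430.0
shift = semitone_shift(measured_a4)
ratio = frequency_ratio(shift)
```

Applying the resulting ratio to every frequency in the recording places the notated A4 at 440 Hz, so that the partials line up with the pitches expected from the score.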

We speculate, however, that the actual fingering may have been very different from that estimated by our method. For example, there are notes that are playable using open strings (i.e., no finger pressed) whose estimated fingerings on the notated score show open strings; yet, on the recording, we clearly hear the note played with a vibrato, meaning that it had to have been played by stopping a string with a finger. These kinds of errors might have occurred for three reasons:

A) In Joachim's time, the material used for the violin string was significantly different. In particular, the E-string (the highest-pitched string) used gut as its core, which has a much mellower timbre than the kind of string used today. Such discrepancies in the material property of the string itself would cause the performance of the bowed-string estimator to deteriorate.

B) A high-quality recording of Joachim is not available. The remastered recording we used had a frequency range that extended up to only approximately 5 kHz; the rest was cut off, perhaps by applying an equalizer. Hence, there were only a few partials that conveyed meaningful information.

C) Joachim's use of pitch-based playing techniques, such as vibrato and glissando, gives many clues to violinists who want to infer Joachim's playing. For example, if a transition from one note to another is completely smooth, it strongly suggests that the two notes were played on the same string. Vibrato can be a giveaway for distinguishing whether a note is played using an open string or not. Our method, however, does not incorporate the pitch trajectory for fingerboard-location estimation, and hence, misses such clues.

This output suggests a future direction for research: incorporating pitch-based cues for fingerboard inference, and perhaps robustness to different materials used for a given string. The latter, however, can be ameliorated with our method to a certain extent; we can record the training data using the string that is thought to be used in the recording whose fingering we would like to infer.

On Inharmonicity

Stringed instruments such as the violin do not produce overtones that are exact integer multiples of the fundamental frequency. Such deviation from integer harmonics is caused by the inharmonicity of the violin string, which in turn is created by torsion of the string. Because inharmonicity is dependent on the material of the string, our model initially incorporated inharmonicity, using a beta distribution with a small shape parameter as its prior distribution. We found, however, that such a model produced lower accuracy than that without inharmonicity. Because of the nature of the EM algorithm, we found that the model tends to "explain" partials of the accompaniment as arising from the violin sound with extremely high inharmonicity.

Conclusion

This article presented a method for recovering the violin fingering from an input audio signal and a music score by analyzing the bowed-string sequence, and using it as a constraint to determine the optimal fingering. Bowed-string sequence classification was based on features that are robust in a polyphonic mixture. We also incorporated a sequential model based on violin playing. We found that such sequential modeling drastically improved the accuracy. The fingering estimation incorporated features that are specific to violin playing practices (the pedagogical model), in addition to some of the more fundamental features applicable to other instruments as well. Incorporating such heuristics generated a drastically easier fingering.

Future research directions may involve improved recognition accuracy, refining the fingering model, and more applications. For example, recognition accuracy may be improved by exploiting the smoothness of pitch trajectory (glissando, vibrato, etc.), as we observed that violinists listen to the smoothness of pitch transition to infer the fingerboard location and the fingering. The fingering model may further be improved by incorporating more features that are inspired from violin pedagogy or through machine learning of feature weights by preparing a large violin fingering corpus. Another possibility is to perform joint estimation of fingerboard location and fingering: the problems of fingerboard location estimation and fingering estimation are dependent on each other, suggesting that joint estimation may improve the performance of both. From a musicological perspective, artist classification, skill assessment, and analysis of historical recordings may be interesting applications of our approach to fingering estimation.

Acknowledgments

This work is supported by Grant-in-aid for Scientific Research (S) and CREST-MUSE of JST.

We would like to thank violinists P. Klinger, Dr. P. Sunwoo, and Dr. J. Choi for stimulating and inspiring discussions on violin fingering.

References

Barbancho, I. 2009. "Transcription and Expressiveness Detection System for Violin Music." In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 189–192.


Burns, A. M., and M. M. Wanderley. 2006. "Visual Methods for Retrieval of Guitarist Fingering." In Proceedings of the International Conference on New Interface for Musical Expression, pp. 196–199.

Cremer, L., and J. S. Allen. 1984. The Physics of the Violin. Cambridge, Massachusetts: MIT Press.

Flesch, C. 2000. The Art of Violin Playing. 2nd ed., vol. I. New York: Carl Fischer.

Fletcher, N. H., and T. D. Rossing. 1998. The Physics of Musical Instruments. 2nd ed. New York: Springer.

Hu, N., R. B. Dannenberg, and G. Tzanetakis. 2003. "Polyphonic Audio Matching and Alignment for Music Retrieval." In Proceedings of the Institute of Electrical and Electronics Engineers Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 185–188.

Kameoka, H., T. Nishimoto, and S. Sagayama. 2007. "A Multipitch Analyzer Based on Harmonic Temporal Structured Clustering." Institute of Electrical and Electronics Engineers Transactions on Audio, Speech and Language Processing 15(3):982–994.

Kasimi, A. A., E. Nichols, and C. Raphael. 2007. "A Simple Algorithm for Automatic Generation of Polyphonic Piano Fingerings." In Proceedings of the International Society for Music Information Retrieval Conference, pp. 355–356.

Krishnaswamy, A., and J. Smith. 2003. "Inferring Control Inputs to an Acoustic Violin from Audio Spectra." In Proceedings of the Institute of Electrical and Electronics Engineers International Conference on Multimedia and Expo, pp. 733–736.

Lu, H., et al. 2008. "iDVT: An Interactive Digital Violin Tutoring System Based on Audio-Visual Fusion." In Austria Association for Computing Machinery International Conference on Multimedia, pp. 300–301.

Maezawa, A., et al. 2009. "Bowed String Sequence Estimation of a Violin Based on Adaptive Audio Signal Classification and Context-Dependent Error Correction." In Proceedings of the International Symposium on Multimedia, pp. 9–16.

Maezawa, A., et al. 2010. "Violin Fingering Estimation Based on Violin Pedagogical Fingering Model Constrained by Bowed Sequence Estimation from Audio Input." Trends in Applied Intelligent Systems, 3:249–259.

Radicioni, D. P., L. Anselma, and V. Lombardo. 2004. "A Segmentation-Based Prototype to Compute String Instruments Fingering." In Proceedings of the Conference on Interdisciplinary Musicology. Available online at www.uni-graz.at/richard.parncutt/cim04. Accessed May 2012.

Radisavljevic, A., and P. F. Driessen. 2004. "Path Difference Learning for Guitar Fingering Problems." In Proceedings of the International Computer Music Conference, pp. 730–733.

Sayegh, S. I. 1989. "Fingering for String Instruments with the Optimal Path Paradigm." Computer Music Journal 13(3):76–83.

Traube, C., and J. Smith. 2000. "Estimating the Plucking Point on a Guitar String." In Proceedings of the International Conference on Digital Audio Effects, pp. 153–158.

Yampolsky, I. M. 1967. Principles of Violin Fingering. New York: Oxford University Press.

Yonebayashi, Y., H. Kameoka, and S. Sagayama. 2007. "Automatic Decision of Piano Fingering Based on a Hidden Markov Model." In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 2915–2921.

Zhang, B., J. Zhu, and W. Leow. 2007. "Visual Analysis of Fingering for Pedagogical Violin Transcription." In Proceedings of the ACM Conference on Multimedia, pp. 521–524.

