Dictionary-Free Tweet Classification with String Kernels: An Introduction to String Kernels

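Dictionary-Free Tweet Classification with a String-Kernel SVM: An Introduction to String Kernels. 7th Natural Language Processing Study Group #TokyoNLP, 2011/09/10, @a_bicky (revised edition, 2013/06/02)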

Upload: takeshi-arabiki

Post on 24-May-2015

DESCRIPTION

Slides from the 7th Natural Language Processing Study Group.

TRANSCRIPT

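  • 1. Dictionary-Free Tweet Classification with a String-Kernel SVM: An Introduction to String Kernels. 7th Natural Language Processing Study Group #TokyoNLP, 2011/09/10, @a_bicky (revised 2013/06/02)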

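2. SVM
3. SVM
4. Takeshi Arabiki. Twitter: @a_bicky. Hatena: id:a_bicky. Blog: http://d.hatena.ne.jp/a_bicky/
5. R: Osaka.R #4, Tokyo.R #16 (http://www.slideshare.net/abicky/twitterr, http://www.slideshare.net/abicky/r-9034336). SciPy
6. Web: http://favmemo.com/
7. Web: http://favmemo.com/, http://favolog.org/
8. SVM
9.
10. (figure)
11. For an input $x$: $f(x) = w^T \phi(x)$
12. For an input $x$: $f(x) = \sum_{i=1}^{n} \alpha_i k(x^{(i)}, x) = w^T \phi(x)$ (Representer theorem), where $k(x^{(i)}, x) = \phi(x^{(i)})^T \phi(x)$ and $k$ is the kernel function.
13.
14-16. SVM
17-18. SVM: discriminant function $g(x) = w^T x + w_0$ with weight vector $w$ (figure)
19. SVM
20-22. SVM: discriminant function $g(x) = w^T x + w_0$ with weight vector $w$ (figure)
23. SVM (SV): ... s.t. ...
24-25. SVM: ... s.t. ...; ... s.t. ...
26. SVM
27.
28. Kernel functions
    Linear kernel: $k(x, x') = x^T x'$
    Polynomial kernel: $k(x, x') = (x^T x' + c)^p$
    Gap-weighted String Kernel: $k(s, t) = \sum_{u \in \Sigma^n} \sum_{i: u = s[i]} \sum_{j: u = t[j]} \lambda^{\mathrm{span}(i) + \mathrm{span}(j)}$
29-30. String kernels
    Spectrum Kernel: based on shared contiguous substrings of length $n$ ($n$-grams).
    Gap-weighted String Kernel: based on shared subsequences of length $n$; gaps are OK.
    Mismatch String Kernel: based on substrings of length $n$ with up to $m$ mismatches, e.g. with $m = 1$, "car" and "cat" match.
    String Alignment Kernel
31. Gap-weighted String Kernel: feature map
    $\phi_u(s) = \sum_{i: u = s[i]} \lambda^{\mathrm{span}(i)}$
    Here $u = s[i]$ means $s_{i_1} = u_1, s_{i_2} = u_2, \ldots, s_{i_n} = u_n$; $\mathrm{span}(i) = i_n - i_1 + 1$; and $\lambda \in (0, 1)$ is the decay factor.
    Example with $u = \text{on}$ and $s = \text{nokuno}, \text{tokyonlp}$:
    $\phi_{\text{on}}(\text{tokyonlp}) = \lambda^5 + \lambda^2$, from $\mathrm{span}(i_1 = 2, i_2 = 6) = 5$ and $\mathrm{span}(i_1 = 5, i_2 = 6) = 2$
    $\phi_{\text{on}}(\text{nokuno}) = \lambda^4$, from $\mathrm{span}(i_1 = 2, i_2 = 5) = 4$
32. Gap-weighted String Kernel
    $K_n(s, t) = \sum_{u \in \Sigma^n} \phi_u(s) \phi_u(t) = \sum_{u \in \Sigma^n} \sum_{i: u = s[i]} \lambda^{\mathrm{span}(i)} \sum_{j: u = t[j]} \lambda^{\mathrm{span}(j)} = \sum_{u \in \Sigma^n} \sum_{i: u = s[i]} \sum_{j: u = t[j]} \lambda^{\mathrm{span}(i) + \mathrm{span}(j)}$
    Computed naively, a single kernel evaluation costs $O(|\Sigma|^n)$!!
    Normalization: $\hat{K}(s, t) = \hat{\phi}(s)^T \hat{\phi}(t) = \left(\frac{\phi(s)}{\|\phi(s)\|}\right)^T \frac{\phi(t)}{\|\phi(t)\|} = \frac{1}{\|\phi(s)\| \|\phi(t)\|} \phi(s)^T \phi(t) = \frac{K(s, t)}{\sqrt{K(s, s) K(t, t)}}$
33.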
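Dynamic Programming: $K_n(sx, t) = K_n(s, t) + {}$(terms for subsequences $u$ whose last character is $x$). Example with $s = \text{nokun}$, $x = \text{o}$:
    Before appending $x$:
    $u$ | o-k | o-n | k-n
    $\phi_u(\text{nokun})$ | $\lambda^2$ | $\lambda^4$ | $\lambda^3$
    $\phi_u(\text{tokyonlp})$ | $\lambda^2$ | $\lambda^2 + \lambda^5$ | $\lambda^4$
    $K_2(\text{nokun}, \text{tokyonlp}) = \lambda^4 + \lambda^6 + \lambda^7 + \lambda^9$
    After appending $x$ ($sx = \text{nokuno}$):
    $u$ | o-k | o-n | k-n | o-o | k-o
    $\phi_u(\text{nokuno})$ | $\lambda^2$ | $\lambda^4$ | $\lambda^3$ | $\lambda^5$ | $\lambda^4$
    $\phi_u(\text{tokyonlp})$ | $\lambda^2$ | $\lambda^2 + \lambda^5$ | $\lambda^4$ | $\lambda^4$ | $\lambda^3$
    $K_2(\text{nokuno}, \text{tokyonlp}) = \lambda^4 + \lambda^6 + \lambda^7 + \lambda^9 + \lambda^7 + \lambda^9$
34. Dynamic Programming: $K_n(sx, t) = K_n(s, t) + {}$(terms for subsequences $u$ whose last character is $x$). Example with $s = \text{nokun}$, $x = \text{n}$:
    $u$ | o-k | o-n | k-n
    $\phi_u(\text{nokunn})$ | $\lambda^2$ | $\lambda^4 + \lambda^5$ | $\lambda^3 + \lambda^4$
    $\phi_u(\text{tokyonlp})$ | $\lambda^2$ | $\lambda^2 + \lambda^5$ | $\lambda^4$
    $K_2(\text{nokunn}, \text{tokyonlp}) = \lambda^4 + \lambda^6 + \lambda^7 + \lambda^9 + \lambda^7 + \lambda^8 + \lambda^{10}$
    For comparison, $K_2(\text{nokun}, \text{tokyonlp}) = \lambda^4 + \lambda^6 + \lambda^7 + \lambda^9$.
35. Dynamic Programming: $K_n(sx, t) = K_n(s, t) + {}$(terms for subsequences $u$ whose last character is $x$)
    (figure: the first $n-1$ characters of $u$ are matched in $s$ starting at $i_1$ and in $t$ starting at $j_1$ with $j_{n-1} < k$; the final $x$ is at position $|s| + 1$ of $sx$ and at position $k$ of $t$)
    $K_n(sx, t) = K_n(s, t) + \sum_{u \in \Sigma^{n-1}} \sum_{i: u = s[i]} \sum_{k: t_k = x} \sum_{j: u = t[j],\, j_{n-1} < k} \lambda^{|s| + 1 - i_1 + 1} \lambda^{k - j_1 + 1}$
    $= K_n(s, t) + \sum_{k: t_k = x} \lambda^2 \sum_{u \in \Sigma^{n-1}} \sum_{i: u = s[i]} \sum_{j: u = t[j],\, j_{n-1} < k} \lambda^{|s| - i_1 + 1} \lambda^{k - 1 - j_1 + 1}$
    $= K_n(s, t) + \sum_{k: t_k = x} \lambda^2 K'_{n-1}(s, t[1 : k-1])$
    where $K'_n(s, t) = \sum_{u \in \Sigma^n} \sum_{i: u = s[i]} \sum_{j: u = t[j]} \lambda^{|s| - i_1 + 1} \lambda^{|t| - j_1 + 1}$ and $|s|$ is the length of $s$.
36. Dynamic Programming
    $K'_n(s, t) = \sum_{u \in \Sigma^n} \sum_{i: u = s[i]} \sum_{j: u = t[j]} \lambda^{|s| - i_1 + 1} \lambda^{|t| - j_1 + 1}$
    $K_n(sx, t) = K_n(s, t) + \sum_{k: t_k = x} \lambda^2 K'_{n-1}(s, t[1 : k-1])$
    (figure: in $K'_n$, the exponents run from the first matched positions $i_1$ and $j_1$ to the ends of $s$ and $t$, i.e. $|s| - i_1 + 1$ and $|t| - j_1 + 1$; in the recursion, $K'_{n-1}$ is evaluated on $s$ and the prefix $t[1 : k-1]$)
37. Dynamic Programming
    $K'_n(sx, t) = \lambda K'_n(s, t) + \sum_{u \in \Sigma^{n-1}} \sum_{i: u = s[i]} \sum_{k: t_k = x} \sum_{j: u = t[j],\, j_{n-1} < k} \lambda^{|s| + 1 - i_1 + 1} \lambda^{|t| - j_1 + 1}$
    $= \lambda K'_n(s, t) + \sum_{k: t_k = x} \lambda^{|t| - k + 2} \sum_{u \in \Sigma^{n-1}} \sum_{i: u = s[i]} \sum_{j: u = t[j],\, j_{n-1} < k} \lambda^{|s| - i_1 + 1} \lambda^{k - 1 - j_1 + 1}$
    $= \lambda K'_n(s, t) + \sum_{k: t_k = x} \lambda^{|t| - k + 2} K'_{n-1}(s, t[1 : k-1])$
    $= \lambda K'_n(s, t) + K''_n(sx, t)$
    where $K''_n(sx, t) = \sum_{k: t_k = x} \lambda^{|t| - k + 2} K'_{n-1}(s, t[1 : k-1])$.
38.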
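Dynamic Programming
    $K''_n(sx, ty) = \lambda K''_n(sx, t)$ if $x \neq y$; $\lambda K''_n(sx, t) + \lambda^2 K'_{n-1}(s, t)$ otherwise.
    Derivation, writing $t' = ty$:
    $K''_n(sx, t') = \sum_{k: t'_k = x} \lambda^{|t'| - k + 2} K'_{n-1}(s, t'[1 : k-1])$
    $= \sum_{k: t'_k = x,\, k < |t'|} \lambda^{|t'| - k + 2} K'_{n-1}(s, t'[1 : k-1]) + \lambda^2 K'_{n-1}(s, t) [[t'_k = x, k = |t'|]]$
    $= \lambda K''_n(sx, t) + \lambda^2 K'_{n-1}(s, t) [[t'_k = x, k = |t'|]]$
    The indicator $[[\cdot]]$ is 1 exactly when the last character $y$ of $t'$ equals $x$.
39. Gap-weighted String Kernel: summary of the recursion, for $i = 1, \ldots, n - 1$
    $K'_0(s, t) = 1$ for all $s, t$
    $K''_i(s, t) = 0$ if $\min(|s|, |t|) < i$
    $K'_i(s, t) = 0$ if $\min(|s|, |t|) < i$
    $K''_i(sx, ty) = \lambda K''_i(sx, t)$ if $x \neq y$; $\lambda K''_i(sx, t) + \lambda^2 K'_{i-1}(s, t)$ otherwise
    $K'_i(sx, t) = \lambda K'_i(s, t) + K''_i(sx, t)$
    $K_i(s, t) = 0$ if $\min(|s|, |t|) < i$
    $K_n(sx, t) = K_n(s, t) + \sum_{k: t_k = x} \lambda^2 K'_{n-1}(s, t[1 : k-1])$
40. Gap-weighted String Kernel: implementation of $K$

    l = 0.7  # lambda (decay factor)

    def indices(t, x):
        # Return every 0-based index at which character x occurs in t.
        ret = []
        pos = -1
        while True:
            pos = t.find(x, pos + 1)
            if pos != -1:
                ret.append(pos)
            else:
                break
        return ret

    def K(i, s, t):
        # K_i(s, t); t[0:j] is the prefix of t before each occurrence of s[-1].
        if min(len(s), len(t)) < i:
            return 0
        return K(i, s[0:-1], t) + l ** 2 * sum(
            [K1(i - 1, s[0:-1], t[0:j]) for j in indices(t, s[-1])])

    $K_i(s, t) = 0$ if $\min(|s|, |t|) < i$
    $K_n(sx, t) = K_n(s, t) + \sum_{k: t_k = x} \lambda^2 K'_{n-1}(s, t[1 : k-1])$
41. Gap-weighted String Kernel: implementation of $K'$ (K1) and $K''$ (K2)

    def K1(i, s, t):
        # K'_i(s, t)
        if i == 0:
            return 1
        if min(len(s), len(t)) < i:
            return 0
        return l * K1(i, s[0:-1], t) + K2(i, s, t)

    def K2(i, s, t):
        # K''_i(s, t)
        if min(len(s), len(t)) < i:
            return 0
        if s[-1] == t[-1]:
            return l * (K2(i, s, t[0:-1]) + l * K1(i - 1, s[0:-1], t[0:-1]))
        else:
            return l * K2(i, s, t[0:-1])

    $K'_0(s, t) = 1$ for all $s, t$
    $K'_i(s, t) = 0$ if $\min(|s|, |t|) < i$
    $K''_i(s, t) = 0$ if $\min(|s|, |t|) < i$
    $K'_i(sx, t) = \lambda K'_i(s, t) + K''_i(sx, t)$
    $K''_i(sx, ty) = \lambda K''_i(sx, t)$ if $x \neq y$; $\lambda K''_i(sx, t) + \lambda^2 K'_{i-1}(s, t)$ otherwise
42. SVM
43.
44.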
Tweet collection script

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import tweepy
    import re
    import codecs

    def format_tweet(tweet):
        tweet = re.sub(r"[\r\n]", "", tweet)  # remove CR and LF
        tweet = re.sub(r"\s*https?://[-\w.#%@/?=]*\s*", "", tweet)  # remove URL
        tweet = re.sub(r"(^|\s+)#[^\s]+\s*", "", tweet)  # remove hash tags
        tweet = re.sub(r"\s*(?<!\w)@\w+(?!@)\s*", "", tweet)  # remove user names
        return tweet

    api = tweepy.API()
    f = codecs.open("tweets.dat", "w", "utf_8")
    # Label a_bicky as -1 and midoisan as +1.
    for c, user in zip((-1, 1), ("a_bicky", "midoisan")):
        print(user)
        for i in range(1, 4):
            # Retry each page until the request succeeds.
            while True:
                print(i)
                try:
                    statuses = api.user_timeline(user, count=200, page=i)
                    break
                except Exception:
                    print("Try again...")
            for tweet in map(lambda s: format_tweet(s.text), statuses):
                # One example per line: label, then the tweet after "#".
                f.write("%s #%s\n" % (c, tweet))
    f.close()
    # 1500 ...
    # SVMlight ...

45. Data
    user | train | test
    ... | 337 tweets | 163 tweets
    ... | 305 tweets | 186 tweets
    521 ... midoisan, 21 ... a_bicky
    a_bicky: ... midoisan: ... !!
    Kernels: n-gram, Gap-weighted, unigram; $\lambda = 0.7$
46. n-gram vs. Gap-weighted: 4.0% (figure)
47. nokuno: n-gram vs. Gap-weighted: 3.6% (figure)
48.
49. SVM ... n-gram ... %
50. References
    ..., 2008.
    H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text Classification using String Kernels. Journal of Machine Learning Research, Vol. 2, pp. 419-444, 2002. (Gap-weighted String Kernel)
    http://www.ism.ac.jp/~fukumizu/ISM_lecture_2006/Lecture2006_application.pdf (Kernel)
    Juho Rousu and John Shawe-Taylor. Efficient Computation of Gapped Substring Kernels on Large Alphabets. The Journal of Machine Learning Research, Vol. 6, pp. 1323-1344, 2005.
51.
52. https://github.com/abicky/tokyonlp07_abicky
53. SVMlight: custom kernel

    double custom_kernel(KERNEL_PARM *kernel_parm, SVECTOR *a, SVECTOR *b)
    {
        char* s = a->userdefined;        /* string after "#" of example a */
        char* t = b->userdefined;        /* string after "#" of example b */
        char param[BUFSIZE];
        strcpy(param, kernel_parm->custom);  /* parameters given with -u */
        ...
    }

    Input lines have the form "label #string", e.g. "-1 #SVMlight ..." and "-1 #spectrum kernel ...".
    The kernel is implemented in kernel.h (with kernel.c and the Makefile changed as needed).
    The string after "#" is available as SVECTOR->userdefined, and the parameters given with -u are available as kernel_parm->custom.
54. 2013/06/02 revision: pp. 33-34
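
The plain recursion on slides 40-41 recomputes the same prefix pairs many times. Below is a minimal sketch of one way to avoid that, assuming Python 3 and memoization with `functools.lru_cache`. The names `gap_weighted_kernel` and `normalized_kernel` and the indexing by prefix lengths are illustrative choices, not taken from the slides or the repository. Recursion depth grows with the string lengths, so the sketch is meant for short strings such as the nokuno/tokyonlp example.

```python
from functools import lru_cache
from math import sqrt


def gap_weighted_kernel(s, t, n, lam=0.7):
    """K_n(s, t) via the K / K' / K'' recursion, memoized on prefix lengths.

    p and q are prefix lengths: the subproblem is (s[:p], t[:q]).
    """
    if n < 1:
        raise ValueError("n must be at least 1")

    @lru_cache(maxsize=None)
    def k2(i, p, q):  # K''_i(s[:p], t[:q])
        if min(p, q) < i:
            return 0.0
        if s[p - 1] == t[q - 1]:
            return lam * (k2(i, p, q - 1) + lam * k1(i - 1, p - 1, q - 1))
        return lam * k2(i, p, q - 1)

    @lru_cache(maxsize=None)
    def k1(i, p, q):  # K'_i(s[:p], t[:q])
        if i == 0:
            return 1.0
        if min(p, q) < i:
            return 0.0
        return lam * k1(i, p - 1, q) + k2(i, p, q)

    @lru_cache(maxsize=None)
    def k(p, q):  # K_n(s[:p], t[:q])
        if min(p, q) < n:
            return 0.0
        x = s[p - 1]
        # Sum K'_{n-1} over the prefixes of t that end just before each x.
        hits = sum(k1(n - 1, p - 1, j) for j in range(q) if t[j] == x)
        return k(p - 1, q) + lam ** 2 * hits

    return k(len(s), len(t))


def normalized_kernel(s, t, n, lam=0.7):
    """K(s, t) / sqrt(K(s, s) K(t, t)); 0.0 when either self-kernel is zero."""
    denom = sqrt(gap_weighted_kernel(s, s, n, lam)
                 * gap_weighted_kernel(t, t, n, lam))
    return gap_weighted_kernel(s, t, n, lam) / denom if denom > 0 else 0.0


if __name__ == "__main__":
    print(gap_weighted_kernel("nokuno", "tokyonlp", 2))
    print(normalized_kernel("nokuno", "tokyonlp", 2))
```

With $n = 2$ the first call should agree with the slide 33 value $\lambda^4 + \lambda^6 + 2\lambda^7 + 2\lambda^9$ up to floating-point rounding.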