Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil
TRANSCRIPT
![Page 1: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/1.jpg)
Custom applications with Spark’s RDD
Tejas Patil, Facebook
![Page 2: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/2.jpg)
Agenda
• Use case
• Real world applications
• Previous solution
• Spark version
• Data skew
• Performance evaluation
![Page 3: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/3.jpg)
N-gram language model training
![Page 4: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/4.jpg)
Can you please come here ?
History
5-gram
Word being predicted
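A minimal sketch of the history / predicted-word split, assuming the usual n-gram convention that the last word is predicted from the preceding n-1 words. The object and method names are illustrative and do not come from the talk.

```scala
object NgramSplit {
  /** Splits a whitespace-separated n-gram into (history, predicted word). */
  def split(ngram: String): (Seq[String], String) = {
    val words = ngram.trim.split("\\s+").toSeq
    require(words.nonEmpty && words.head.nonEmpty, "n-gram must contain at least one word")
    // Everything except the last word is the history
    (words.init, words.last)
  }

  def main(args: Array[String]): Unit = {
    val (history, predicted) = split("can you please come here")
    println(s"history = ${history.mkString(" ")}, predicted = $predicted")
  }
}
```

For a 5-gram the history therefore has four words.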
![Page 5: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/5.jpg)
Real world applications
![Page 6: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/6.jpg)
Auto-subtitling for Page videos
![Page 7: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/7.jpg)
Detecting low quality places
• Non-public places
  • My home
  • Home sweet home
• Non-real places
  • Apt #00, Fake lane, Foo City, CA
  • Mordor, Westeros !!
• Non-suitable for watch
  • Anything containing nudity, intense sexuality, profanity or disturbing content
![Page 8: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/8.jpg)
Previous solution
![Page 9: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/9.jpg)
Hive table
Hive query
Sub-model 1 training job
Sub-model 2 training job
…
Sub-model `n` training job
Intermediate sub-models: LM 1, LM 2, …, LM `n`
Interpolation algorithm
Language model
![Page 10: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/10.jpg)
Sub-model 1 training job
INSERT OVERWRITE TABLE sub_model_1
SELECT ....
FROM (
  REDUCE m.ngram, m.group_key, m.count
  USING "./train_model --config=myconfig.json ...."
  AS `ngram`, `count`, ...
  FROM (
    SELECT ...
    FROM data_source
    WHERE ...
    DISTRIBUTE BY group_key
  )
)
GROUP BY `ngram`
![Page 11: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/11.jpg)
Lessons learned
• SQL not good choice for building such applications
  • Duplication
  • Poor readability
  • Brittle, no testing
• Alternatives
  • Map-reduce
  • Query templating
• Latency while training with large data
![Page 12: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/12.jpg)
Spark solution
![Page 13: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/13.jpg)
Spark solution
• Same high level architecture
• Hive tables as final inputs and outputs
• Same binaries used in Hive TRANSFORM
• RDD not Datasets
• `pipe()` operator
• Modular, readable, maintainable
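`RDD.pipe` writes each element of a partition as one line to the stdin of an external process and returns that process's stdout lines as a new RDD. A minimal sketch of the call follows; it assumes a Unix-like machine where `tr` is on the PATH, and `tr` only stands in for a training binary such as `./train_model`. The object and method names are hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object PipeSketch {
  /** Sends every line through an external `tr` process, one process per partition. */
  def upperCaseViaPipe(sc: SparkContext, lines: Seq[String]): Array[String] = {
    val input = sc.parallelize(lines, numSlices = 2)
    // Passing the command as a Seq avoids whitespace tokenization of a single command string
    input.pipe(Seq("tr", "a-z", "A-Z")).collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pipe-sketch").master("local[2]").getOrCreate()
    try upperCaseViaPipe(spark.sparkContext, Seq("how are you", "how are they")).foreach(println)
    finally spark.stop()
  }
}
```

Because the external program only sees lines on stdin and stdout, the same binary can be reused unchanged from a Hive TRANSFORM script.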
![Page 14: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/14.jpg)
Configuration
PipelineConfiguration
- where is the input data?
- where to store final output?
- spark specific configs:
  "spark.dynamicAllocation.maxExecutors"
  "spark.executor.memory"
  "spark.memory.storageFraction"
  …………
- list of ComponentConfiguration
  ……
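One way the configuration above could be modelled is as a pair of case classes. This is a sketch under assumptions: the slide names `PipelineConfiguration` and `ComponentConfiguration` but not their fields, so every field name and every value below is hypothetical.

```scala
// Hypothetical shapes: the slide names the two classes but not their fields.
final case class ComponentConfiguration(
  name: String,         // assumed label for a sub-model component
  order: Int,           // assumed n-gram order of the component
  command: Seq[String]  // assumed binary plus arguments to run through pipe()
)

final case class PipelineConfiguration(
  inputTable: String,                      // where is the input data?
  outputTable: String,                     // where to store final output?
  sparkConfigs: Map[String, String],       // spark specific configs
  components: Seq[ComponentConfiguration]  // list of ComponentConfiguration
)

object PipelineConfigSketch {
  /** Renders the Spark settings as `--conf key=value` argument pairs, sorted by key. */
  def toSubmitArgs(config: PipelineConfiguration): Seq[String] =
    config.sparkConfigs.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }

  // Illustrative values only; none of them come from the talk.
  val example: PipelineConfiguration = PipelineConfiguration(
    inputTable = "data_source",
    outputTable = "language_model",
    sparkConfigs = Map(
      "spark.dynamicAllocation.maxExecutors" -> "200",
      "spark.executor.memory" -> "8g",
      "spark.memory.storageFraction" -> "0.3"
    ),
    components = Seq(
      ComponentConfiguration("sub_model_1", 5, Seq("./train_model", "--config=myconfig.json"))
    )
  )
}
```

Keeping the Spark settings in the same object as the input and output locations lets one file describe a whole training run.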
![Page 15: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/15.jpg)
Scalability challenges
• Executors lost as unable to heartbeat
• Shuffle service OOM
• Frequent executor GC
• Executor OOM
• 2 GB limit in Spark for blocks
• Exceptions while reading output stream of pipe process
![Page 16: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/16.jpg)
![Page 17: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/17.jpg)
Data skew
![Page 18: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/18.jpg)
![Page 19: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/19.jpg)
Sub-model 1 training job
• ngram extraction and counting
• Estimation and pruning
• normalize ngram counts
![Page 20: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/20.jpg)
Training dataset
How are you
How are they
Its raining
How are we going
When are we going
You are awesome
They are working
…..

Word count
<How are we going>: 1
….
<How are you>: 1
<How are they>: 1
….
<How are>: 4
<You are>: 1
<Its raining>: 1
….
<are>: 6
<you>: 1
<How>: 4
…..
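The counting step can be sketched in plain Scala collections. This is an illustration, not the talk's implementation: `NgramCount`, `ngrams` and `count` are hypothetical names, and the Spark job would do the same thing over an RDD.

```scala
object NgramCount {
  /** All contiguous n-grams of length 1..maxOrder in one sentence. */
  def ngrams(sentence: String, maxOrder: Int): Seq[String] = {
    val words = sentence.trim.split("\\s+").filter(_.nonEmpty).toSeq
    for {
      n <- 1 to maxOrder
      // sliding(n) yields one short window when the sentence has fewer than n words, so filter it out
      window <- words.sliding(n) if window.length == n
    } yield window.mkString(" ")
  }

  /** Number of occurrences of every n-gram across all sentences. */
  def count(sentences: Seq[String], maxOrder: Int): Map[String, Long] =
    sentences
      .flatMap(ngrams(_, maxOrder))
      .groupBy(identity)
      .map { case (ngram, occurrences) => ngram -> occurrences.size.toLong }

  def main(args: Array[String]): Unit =
    count(Seq("how are you", "how are they"), maxOrder = 2).foreach(println)
}
```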
![Page 21: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/21.jpg)
Word count
<How are we going>: 1
<are we going>: 2
<we going>: 2
<going>: 1
<When are we going>: 1
<Its raining>: 1
<You are awesome>: 1
…..
Partition based on 2-word suffix
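A minimal sketch of suffix-based shard assignment, written as a plain function rather than a Spark `Partitioner`. It assumes that n-grams shorter than the suffix length go to shard 0 and that the rest are hashed on their suffix into shards 1 to (n-1). That is one reading of the slides, not the talk's `PartitionerForNgram`. The name `shardFor` and its parameters are hypothetical.

```scala
object SuffixSharding {
  /**
   * Shard id for an n-gram.
   * N-grams with fewer than `suffixWords` words go to shard 0;
   * all others are hashed on their last `suffixWords` words into 1 until numShards.
   */
  def shardFor(ngram: String, numShards: Int, suffixWords: Int): Int = {
    require(numShards >= 2, "need shard 0 plus at least one regular shard")
    require(suffixWords >= 1, "suffix must have at least one word")
    val words = ngram.trim.split("\\s+").filter(_.nonEmpty)
    if (words.length < suffixWords) 0
    else {
      val key = words.takeRight(suffixWords).mkString(" ")
      // floorMod keeps the result non-negative even when hashCode is negative
      1 + Math.floorMod(key.hashCode, numShards - 1)
    }
  }
}
```

With `suffixWords = 2`, `how are we going`, `are we going` and `we going` all hash on `we going` and therefore share a shard, while the single word `going` goes to shard 0.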
![Page 22: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/22.jpg)
Word count
<How are we going>: 1
<are we going>: 2
<we going>: 2
<going>: 1
<When are we going>: 1
<Its raining>: 1
<You are awesome>: 1
…..

<How are we going>: 1
<are we going>: 2
<we going>: 2
<When are we going>: 1
…..

<Its raining>: 1
<You are awesome>: 1
…..

…..
![Page 23: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/23.jpg)
Frequency of every word: 0’th shard
<are>: 6
<How>: 4
<you>: 1
<doing>: 1
<going>: 1
<awesome>: 1
<working>: 1
…..
![Page 24: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/24.jpg)
Distribution of shards (1-word sharding)
• shards 1 to (n-1)
• 0-shard (has frequency of every word) and is shipped to all the nodes
• N-grams with same 2-word suffix will fall in the same shard
![Page 25: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/25.jpg)
Distribution of shards (1-word sharding)
• shards 1 to (n-1)
• Skewed shards due to data from frequent phrases, e.g. “how to ..”, “do you ..”
![Page 26: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/26.jpg)
Distribution of shards (2-word sharding)
• shards 1 to (n-1)
• 0-shard has single word frequencies and 2-word frequencies as well
![Page 27: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/27.jpg)
Solution: Progressive sharding
![Page 28: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/28.jpg)
First iteration
Ignore skewed shards
![Page 29: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/29.jpg)
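import org.apache.spark.SparkContext

// Reads "shardId<TAB>count" lines and returns the ids of shards larger than the threshold.
// The slide elides the remaining parameters ("….."); shardCountsFile is the one the body uses.
def findLargeShardIds(sc: SparkContext, threshold: Long, shardCountsFile: String): Set[Int] = {
  val shardSizesRDD = sc.textFile(shardCountsFile).map { line =>
    val Array(indexStr, countStr) = line.split('\t')
    (indexStr.toInt, countStr.toLong)
  }
  shardSizesRDD
    .filter { case (_, count) => count > threshold }
    .map(_._1)
    .collect()
    .toSet
}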
![Page 30: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/30.jpg)
First iteration
Process all the non-skewed shards
![Page 31: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/31.jpg)
Second iteration
Effective 0-shard is small
Re-shard left over with 2-words history
![Page 32: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/32.jpg)
Second iteration
Discard bigger shards
![Page 33: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/33.jpg)
Second iteration
Process all the non-skewed shards
![Page 34: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/34.jpg)
Continue with further iterations ….
![Page 35: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/35.jpg)
var iterationId = 0
var largeShardIds: Set[Int] = Set.empty
do {
  val currentCounts: RDD[(String, Long)] = allCounts(iterationId - 1)
  val partitioner = new PartitionerForNgram(numShards, iterationId)

  // Record how many n-grams land in each shard, one "shardId<TAB>count" line per shard
  val shardCountsFile = s"${shard_sizes}_$iterationId"
  currentCounts
    .map(ngram => (partitioner.getPartition(ngram._1), 1L))
    .reduceByKey(_ + _)
    .map { case (shardId, count) => s"$shardId\t$count" }
    .saveAsTextFile(shardCountsFile)

  // Shards above the threshold are skipped in this iteration
  largeShardIds = findLargeShardIds(sc, config.largeShardThreshold, shardCountsFile)
  trainer.trainedModel(currentCounts, component, largeShardIds)
    .saveAsObjectFile(s"${component.order}_$iterationId")
  iterationId += 1
} while (largeShardIds.nonEmpty)
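The idea behind the loop can be tried on in-memory data. The sketch below is a simplification under stated assumptions, not the talk's implementation: each distinct suffix is treated as its own shard (no hashing), the 0-shard is left out, "processing" just records which n-grams were handled, and `run`, `threshold` and `maxSuffixWords` are hypothetical names.

```scala
object ProgressiveShardingSketch {
  private def suffix(ngram: String, k: Int): String =
    ngram.split(" ").takeRight(k).mkString(" ")

  /**
   * Returns one entry per iteration: the n-grams processed in that iteration.
   * Shards holding more than `threshold` n-grams are deferred and re-sharded
   * in the next iteration on a suffix that is one word longer.
   */
  def run(ngrams: Seq[String], threshold: Int, maxSuffixWords: Int): Vector[Set[String]] = {
    var pending: Set[String] = ngrams.toSet
    var k = 1
    var result = Vector.empty[Set[String]]
    while (pending.nonEmpty) {
      val shards: Map[String, Set[String]] = pending.groupBy(suffix(_, k))
      val processed: Set[String] =
        if (k >= maxSuffixWords) pending // last allowed suffix length: process whatever is left
        else shards.values.filter(_.size <= threshold).flatten.toSet
      result :+= processed
      pending = pending -- processed
      k += 1
    }
    result
  }

  def main(args: Array[String]): Unit = {
    val data = Seq("he is going", "she is going", "we are going", "its raining")
    run(data, threshold = 2, maxSuffixWords = 3).zipWithIndex.foreach {
      case (done, i) => println(s"iteration $i: ${done.toSeq.sorted.mkString(", ")}")
    }
  }
}
```

In the sample data the 1-word suffix `going` is shared by three n-grams and exceeds the threshold, so those n-grams wait for the second iteration, where the 2-word suffixes `is going` and `are going` split them into small enough shards.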
![Page 36: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/36.jpg)
Performance evaluation
Bar chart: Reserved CPU time (days), Hive vs Spark (axis 0 to 2000)
Bar chart: Latency (hours), Hive vs Spark (axis 0 to 9)
![Page 37: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/37.jpg)
Performance evaluation
Reserved CPU time (days), Hive vs Spark: 15x more efficient
Latency (hours), Hive vs Spark: 2.6x faster
![Page 38: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/38.jpg)
Upstream contributions to pipe()
• [SPARK-13793] PipedRDD doesn't propagate exceptions while reading parent RDD
• [SPARK-15826] PipedRDD to allow configurable char encoding
• [SPARK-14542] PipeRDD should allow configurable buffer size for the stdin writer
• [SPARK-14110] PipedRDD to print the command ran on non zero exit
![Page 39: Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil](https://reader031.vdocuments.pub/reader031/viewer/2022021918/58abb4291a28ab04618b4cdd/html5/thumbnails/39.jpg)
Questions?