Spark Hands-On: [e2-spk-s01]

Uploaded by: erhwen-kuo

Posted on 14-Apr-2017

Category: Engineering


TRANSCRIPT

  • 3 . 5 to 3 . 14

    http://go.databricks.com/hubfs/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf

  • 5 . 2

    http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

  • 5 . 4

    http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe

  • 5 . 6

    https://maven.apache.org/

    http://apache.stu.edu.tw/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.zip

  • 5 . 8

    http://www.eclipse.org/downloads/

    http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/mars/2/eclipse-jee-mars-2-win32-x86_64.zip

  • 5 . 11

    http://192.168.0.2/apps/e2-spk-v01/present/e2-spk-s01/assets/files/e2-spk-s01_java.zip

  • 6 . 2

    http://scala-ide.org/index.html

    http://scala-ide.org/download/sdk.html

  • 6 . 5

    http://192.168.0.2/apps/e2-spk-v01/present/e2-spk-s01/assets/files/e2-spk-s01_scala.zip

  • 7 . 2

    http://192.168.0.2/apps/e2-spk-v01/present/e2-spk-s01/spark.apache.org

    http://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz

  • 7 . 7

    val distData = sc.parallelize(Seq("eighty20", "spark", "traing", "hello", "world"))
    val result_count = distData.count()
    println("Count result is: " + result_count)

  • 8 . 3

    val distData = sc.parallelize(Seq("eighty20", "spark", "traing", "hello", "world"))
    val result_count = distData.count()
    println("Count result is: " + result_count)

  • 10 . 4

    http://spark.apache.org/docs/latest/spark-standalone.html

  • 10 . 5

    http://spark.apache.org/docs/latest/running-on-mesos.html

    http://mesos.apache.org/

  • 10 . 6

    http://spark.apache.org/docs/latest/running-on-yarn.html

  • 12 . 2

    http://192.168.0.2/apps/e2-spk-v01/present/e2-spk-s01/index.html#/4

  • 12 . 8

    spark-submit \
      --class cc.eighty20.spark.s01.sc_00_helloworld \
      --master local \
      e2spks01-0.0.1.jar

    http://spark.apache.org/docs/latest/submitting-applications.html

  • 12 . 9

    spark-submit \
      --class cc.eighty20.spark.s01.sc_00_helloworld \
      --master spark://192.168.0.2:7077 \
      e2spks01-0.0.1.jar

  • 13 . 2

    http://192.168.0.2/apps/e2-spk-v01/present/e2-spk-s01/index.html#/5

  • 13 . 8

    spark-submit \
      --class cc.eighty20.spark.s01.sc_00_helloworld \
      --master local \
      e2spks01-0.0.1.jar

    http://spark.apache.org/docs/latest/submitting-applications.html

  • 13 . 9

    spark-submit \
      --class cc.eighty20.spark.s01.sc_00_helloworld \
      --master spark://192.168.0.2:7077 \
      e2spks01-0.0.1.jar

  • 14 . 26

    http://www.csie.ntnu.edu.tw/~u91029/DirectedAcyclicGraph.html

  • 14 . 27

    val r00 = sc.parallelize(0 to 9)
    val r01 = sc.parallelize(0 to 90 by 10)
    val r10 = r00.cartesian(r01)
    val r11 = r00.map(n => (n, n))
    val r12 = r00.zip(r01)
    val r13 = r01.keyBy(_ / 20)
    val r20 = Seq(r11, r12, r13).foldLeft(r10)(_ union _)
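The lineage on slide 14 . 27 can be traced by hand with ordinary collections. The following is a minimal plain-Java sketch, not Spark code: the class name `LineageSketch` and its helper methods are hypothetical, and each list stands in for the RDD of the same name, assuming `union` keeps duplicates as Spark's RDD `union` does.

```java
import java.util.ArrayList;
import java.util.List;

public class LineageSketch {
    // r00 = 0 to 9
    static List<Integer> r00() {
        List<Integer> out = new ArrayList<>();
        for (int n = 0; n <= 9; n++) out.add(n);
        return out;
    }

    // r01 = 0 to 90 by 10
    static List<Integer> r01() {
        List<Integer> out = new ArrayList<>();
        for (int n = 0; n <= 90; n += 10) out.add(n);
        return out;
    }

    // r10 = r00.cartesian(r01): every (a, b) combination
    static List<int[]> r10() {
        List<int[]> out = new ArrayList<>();
        for (int a : r00()) for (int b : r01()) out.add(new int[] {a, b});
        return out;
    }

    // r11 = r00.map(n => (n, n))
    static List<int[]> r11() {
        List<int[]> out = new ArrayList<>();
        for (int n : r00()) out.add(new int[] {n, n});
        return out;
    }

    // r12 = r00.zip(r01): pair elements by position
    static List<int[]> r12() {
        List<Integer> a = r00(), b = r01();
        List<int[]> out = new ArrayList<>();
        for (int i = 0; i < a.size(); i++) out.add(new int[] {a.get(i), b.get(i)});
        return out;
    }

    // r13 = r01.keyBy(_ / 20): (n / 20, n) using integer division
    static List<int[]> r13() {
        List<int[]> out = new ArrayList<>();
        for (int n : r01()) out.add(new int[] {n / 20, n});
        return out;
    }

    // r20 = r10 union r11 union r12 union r13 (duplicates kept)
    static List<int[]> r20() {
        List<int[]> out = new ArrayList<>(r10());
        out.addAll(r11());
        out.addAll(r12());
        out.addAll(r13());
        return out;
    }

    public static void main(String[] args) {
        System.out.println("r10 size: " + r10().size());
        System.out.println("r20 size: " + r20().size());
    }
}
```

Under these assumptions `r10` holds 100 pairs and `r20` holds 130, since the three 10-element pair lists are appended to the cartesian product.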

  • 16 . 2

    package cc.eighty20.spark.s01;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class sc_01_anatomy_driver {
        public static void main(String[] args) {
            String masterURL = "local[*]"; // (1)
            SparkConf conf = new SparkConf() // (2)
                    .setAppName("sc_01_anatomy_driver")
                    .setMaster(masterURL);
            JavaSparkContext sc = new JavaSparkContext(conf); // (3)

            String fileName = "";
            if (args.length > 0 && args[0] != null && !args[0].isEmpty()) // (4)
                fileName = args[0];
            else
                fileName = "pom.xml";

            JavaRDD<String> lines_rdd = sc.textFile(fileName); // (5)
            long lines_count = lines_rdd.count(); // (6)
            System.out.printf("There are %s lines in %s\n", lines_count, fileName);

            sc.close();
        }
    }

  • 16 . 3

    String masterURL = "local[*]"; // (1)

  • 16 . 4

    SparkConf conf = new SparkConf() // (2)
            .setAppName("sc_01_anatomy_driver")
            .setMaster(masterURL);

  • 16 . 5

    JavaSparkContext sc = new JavaSparkContext(conf); // (3)

  • 16 . 6

    String fileName = "";
    if (args.length > 0 && args[0] != null && !args[0].isEmpty()) // (4)
        fileName = args[0];
    else
        fileName = "pom.xml";

    JavaRDD<String> lines_rdd = sc.textFile(fileName); // (5)
    long lines_count = lines_rdd.count(); // (6)
    System.out.printf("There are %s lines in %s\n", lines_count, fileName);

  • 17 . 2

    package cc.eighty20.spark.s01

    import org.apache.spark.{SparkConf, SparkContext}

    object sc_01_anatomy_driver {
      def main(args: Array[String]): Unit = {
        val masterURL = "local[*]" // (1)
        val conf = new SparkConf() // (2)
          .setAppName("sc_01_anatomy_driver")
          .setMaster(masterURL)
        val sc = new SparkContext(conf) // (3)

        val fileName = util.Try(args(0)).getOrElse("pom.xml") // (4)
        val lines_rdd = sc.textFile(fileName).cache() // (5)
        val lines_count = lines_rdd.count() // (6)
        println(s"\nThere are $lines_count lines in $fileName")
      }
    }

  • 17 . 3

    val masterURL = "local[*]" // (1)

  • 17 . 4

    val conf = new SparkConf() // (2)
      .setAppName("sc_01_anatomy_driver")
      .setMaster(masterURL)

  • 17 . 5

    val sc = new SparkContext(conf) // (3)

  • 17 . 6

    val fileName = util.Try(args(0)).getOrElse("pom.xml") // (4)
    val lines_rdd = sc.textFile(fileName).cache() // (5)
    val lines_count = lines_rdd.count() // (6)
    println(s"\nThere are $lines_count lines in $fileName")

  • 19 . 2

    ERROR	php: dying for unknown reasons
    WARN	dave, are you angry at me?
    ERROR	did mysql just barf?
    WARN	xylons approaching
    ERROR	mysql cluster: replace with spark cluster

  • 19 . 3

    // base RDD
    val lines = sc.textFile("hdfs://sample_log_file_path/log.txt")

    // transformed RDDs
    val errors = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split("\t")).map(r => r(1)).cache()

    // action 1
    val mysql_errors = messages.filter(_.contains("mysql")).count()

    // action 2
    val php_errors = messages.filter(_.contains("php")).count()
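To see what the two actions on slide 19 . 3 return for the five sample log lines on slide 19 . 2, the same filter, split, and count steps can be run on an in-memory list. This is a rough plain-Java sketch rather than Spark code: `LogMiningSketch`, `errorMessages`, and `countContaining` are hypothetical names, and the tab between log level and message is an assumption inferred from the `split("\t")` call.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LogMiningSketch {
    // Sample log lines from the slide; the tab separator is assumed.
    static final List<String> LINES = Arrays.asList(
        "ERROR\tphp: dying for unknown reasons",
        "WARN\tdave, are you angry at me?",
        "ERROR\tdid mysql just barf?",
        "WARN\txylons approaching",
        "ERROR\tmysql cluster: replace with spark cluster");

    // Mirrors lines.filter(_.startsWith("ERROR")).map(_.split("\t")).map(r => r(1)).
    static List<String> errorMessages(List<String> lines) {
        return lines.stream()
            .filter(line -> line.startsWith("ERROR"))
            .map(line -> line.split("\t")[1])
            .collect(Collectors.toList());
    }

    // Mirrors messages.filter(_.contains(keyword)).count().
    static long countContaining(List<String> messages, String keyword) {
        return messages.stream().filter(m -> m.contains(keyword)).count();
    }

    public static void main(String[] args) {
        List<String> messages = errorMessages(LINES);
        System.out.println("mysql errors: " + countContaining(messages, "mysql")); // 2
        System.out.println("php errors: " + countContaining(messages, "php"));     // 1
    }
}
```

In the Spark version, `cache()` on `messages` lets the second action reuse the filtered messages instead of re-reading the log file; the list returned by `errorMessages` plays that role here.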

  • 19 . 4 to 19 . 7

    // base RDD
    val lines = sc.textFile("hdfs://sample_log_file_path/log.txt")

  • 19 . 8

    // transformed RDDs
    val errors = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split("\t")).map(r => r(1)).cache()

  • 19 . 9

    // action 1
    val mysql_errors = messages.filter(_.contains("mysql")).count()

  • 19 . 10 to 19 . 12

    // action 2
    val php_errors = messages.filter(_.contains("php")).count()

  • 20 . 2

    # Apache Spark

    Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

    ## Online Documentation

    You can find the latest Spark documentation, including a programming guide, on the [project web page](http://spark.apache.org/documentation.html) and [project wiki](https://cwiki.apache.org/confluence/display/SPARK). This README file only contains basic setup instructions.

    ## Building Spark

    ...

  • 20 . 3

    val topN = 10
    val fileName = "hdfs://log_file_path/README.md"

    // RDD creation from external data source
    val docs = sc.textFile(fileName)

    // Split lines into words
    val lower = docs.map(line => line.toLowerCase())
    val words = lower.flatMap(line => line.split("\\s+"))
    val counts = words.map(word => (word, 1))

    // Count all words (automatic combination)
    val freq = counts.reduceByKey(_ + _)

    // Swap tuples and get top results
    val top = freq.map(_.swap).top(topN)
    top.foreach(println)
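The word-count steps on slide 20 . 3 (lowercase, split on whitespace, count per word, take the top N by count) can be tried on a few lines of text without a cluster. The following is a small plain-Java streams sketch, not the Spark API: `WordCountSketch`, `topWords`, and the three sample lines are hypothetical, and ties are broken by word in descending order to approximate the ordering that `top` applies to the swapped `(count, word)` tuples.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Returns the topN (word, count) entries, highest count first.
    static List<Map.Entry<String, Long>> topWords(List<String> docs, int topN) {
        // Lowercase each line, split it into words, and count occurrences per word.
        Map<String, Long> freq = docs.stream()
            .map(String::toLowerCase)
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        // Order by count, then by word, both descending.
        Comparator<Map.Entry<String, Long>> byCountThenWord =
            Map.Entry.<String, Long>comparingByValue()
                .thenComparing(Map.Entry.<String, Long>comparingByKey());

        return freq.entrySet().stream()
            .sorted(byCountThenWord.reversed())
            .limit(topN)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for the README file.
        List<String> docs = Arrays.asList("Spark is fast", "spark is general", "Spark SQL");
        for (Map.Entry<String, Long> e : topWords(docs, 2)) {
            System.out.println(e.getValue() + " " + e.getKey());
        }
    }
}
```

With the sample input, the two most frequent words are `spark` (3 occurrences) and `is` (2 occurrences).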
