Introduction to Spark SQL 1.3.0
Optimization
● Improved performance
Main objects
● https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.package
spark-shell
● Besides sc, a SQL context is also started:
Spark context available as sc.
15/03/22 02:09:11 INFO SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.
JAR
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
DF from RDD
● First load the data as an RDD
scala> val data = sc.textFile("hdfs://localhost:54310/user/hadoop/ml-100k/u.data")
● Define a case class
case class Rattings(userId: Int, itemID: Int, rating: Int, timestmap: String)
● Convert to a DataFrame
scala> val ratting = data.map(_.split("\t")).map(p => Rattings(p(0).trim.toInt, p(1).trim.toInt, p(2).trim.toInt, p(3))).toDF()
ratting: org.apache.spark.sql.DataFrame = [userId: int, itemID: int, rating: int, timestmap: string]
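The split-and-map parsing step above can be tried on a plain Scala collection first, without a cluster; the two tab-separated sample lines below are invented for this sketch.

```scala
// Same parsing logic as the RDD pipeline, applied to a local Seq.
// The sample lines are made up (same shape as u.data rows).
case class Rattings(userId: Int, itemID: Int, rating: Int, timestmap: String)

val lines = Seq("196\t242\t3\t881250949", "186\t302\t3\t891717742")
val rattings = lines.map(_.split("\t"))
  .map(p => Rattings(p(0).trim.toInt, p(1).trim.toInt, p(2).trim.toInt, p(3)))
```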
DF from JSON
● Format
{"movieID":242,"name":"test1"}
{"movieID":307,"name":"test2"}
● Can be loaded directly
scala> val movie = sqlContext.jsonFile("hdfs://localhost:54310/user/hadoop/ml-100k/movies.json")
DataFrame Operations
● show()
userId itemID rating timestmap
196    242    3      881250949
186    302    3      891717742
22     377    1      878887116
244    51     2      880606923
253    465    5      891628467
● head(5)
res11: Array[org.apache.spark.sql.Row] = Array([196,242,3,881250949], [186,302,3,891717742], [22,377,1,878887116], [244,51,2,880606923], [166,346,1,886397596])
● printSchema() ← a killer feature
scala> ratting.printSchema()
root
 |-- userId: integer (nullable = false)
 |-- itemID: integer (nullable = false)
 |-- rating: integer (nullable = false)
 |-- timestmap: string (nullable = true)
Select
● Select a column
scala> ratting.select("userId").show()
● Conditional select
scala> ratting.select(ratting("itemID")>100).show()
(itemID > 100)
true
true
true
filter
● Filter by condition
scala> ratting.filter(ratting("rating")>3).show()
userId itemID rating timestmap
298    474    4      884182806
253    465    5      891628467
286    1014   5      879781125
200    222    5      876042340
122    387    5      879270459
291    1042   4      874834944
119    392    4
● Shorthand: ratting.filter('rating>3).show()
● Combined with select
scala> ratting.filter(ratting("rating")>3).select("userID","itemID").show()
userID itemID
298    474
286    1014
● Also possible:
ratting.filter('userID>500).select(avg('rating),max('rating),sum('rating)).show()
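What the filter-then-aggregate pattern computes can be sketched on a plain Scala collection; the sample ratings below are invented, and the comments map each step to its DataFrame counterpart.

```scala
// Plain-Scala sketch of filter + avg/max/sum; sample data is invented.
val ratings = Seq(4, 5, 2, 3, 5, 1)
val high = ratings.filter(_ > 3)            // like ratting.filter('rating > 3)
val sumR = high.sum                         // sum('rating)
val maxR = high.max                         // max('rating)
val avgR = high.sum.toDouble / high.size    // avg('rating)
```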
GROUP BY
● count()
scala> ratting.groupBy("userId").count().show()
userId count
831    73
631    20
● agg()
scala> ratting.groupBy("userId").agg("rating"->"avg","userID" -> "count").show()
● Calls can be chained
scala> ratting.groupBy("userId").count().sort("count","userID").show()
GROUP BY
● Other aggregations
o avg
o max
o min
o mean
o sum
● More functions: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
● All together
ratting.groupBy('userID).agg(('userID),avg('rating),max('rating),sum('rating),count('rating)).show()
userID AVG('rating)       MAX('rating) SUM('rating) COUNT('rating)
831    3.5205479452054793 5            257          73
631    3.1                4            62           20
31     3.9166666666666665 5            141          36
431    3.380952380952381  5            71           21
231    3.6666666666666665 5            77           21
832    2.96               5            74           25
632    3.6610169491525424 5            432          118
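The per-user statistics in that table can be reproduced on a local collection to see what groupBy + agg is doing; the (userID, rating) pairs below are invented for illustration.

```scala
// What groupBy('userID).agg(avg, max, sum, count) computes, sketched on
// a local Seq of (userID, rating) pairs; the pairs are invented.
val pairs = Seq((831, 4), (831, 3), (631, 3), (631, 4), (631, 2))
val stats: Map[Int, (Double, Int, Int, Int)] =
  pairs.groupBy(_._1).map { case (user, rows) =>
    val rs = rows.map(_._2)
    user -> (rs.sum.toDouble / rs.size, rs.max, rs.sum, rs.size)
  }
```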
UnionAll
● Merge tables with the same columns
scala> val ratting1_3 = ratting.filter(ratting("rating")<=3)
scala> ratting1_3.count() // res79: Long = 44625
scala> val ratting4_5 = ratting.filter(ratting("rating")>3)
scala> ratting4_5.count() // res80: Long = 55375
ratting1_3.unionAll(ratting4_5).count() // res81: Long = 100000
● Tables with different columns cannot be unioned
scala> ratting1_3.unionAll(test).count()
java.lang.AssertionError: assertion failed
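The two filters above split the data into complementary halves, which is why unionAll restores the original count (44625 + 55375 = 100000); the same property can be checked locally with invented ratings.

```scala
// The <=3 and >3 filters partition the data, so their union has the
// original size; sketched with an invented local Seq.
val ratings = Seq(3, 5, 2, 4, 1, 5)
val (low, high) = ratings.partition(_ <= 3)   // rating <= 3 vs rating > 3
val unioned = low ++ high                     // like unionAll
```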
JOIN
● Basic syntax
scala> ratting.join(movie, $"itemID" === $"movieID", "inner").show()
userId itemID rating timestmap movieID name
196    242    3      881250949 242     test1
63     242    3      875747190 242     test1
● Supported join types: inner, outer, left_outer, right_outer, semijoin.
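The inner-join semantics shown above can be sketched with two plain Seqs: keep every pairing of rows whose join keys match. The rows below are invented, with the same shape as the ratting and movie tables.

```scala
// Inner join on itemID == movieID, sketched on local collections;
// the rows are invented for illustration.
val ratting = Seq((196, 242, 3), (63, 242, 3), (22, 377, 1)) // (userId, itemID, rating)
val movie   = Seq((242, "test1"), (307, "test2"))            // (movieID, name)

val joined = for {
  (userId, itemID, rating) <- ratting
  (movieID, name)          <- movie
  if itemID == movieID                        // the join condition
} yield (userId, itemID, rating, movieID, name)
```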
A DataFrame can also be registered as a TABLE
● Register
scala> ratting.registerTempTable("ratting_table")
● Write SQL
scala> sqlContext.sql("SELECT userID FROM ratting_table").show()
DF supports RDD operations
● map
scala> result.map(t => "user:" + t(0)).collect().foreach(println)
● The extracted values are of type Any
scala> ratting.map(t => t(2)).take(5)
● Convert to String first, then to Int
scala> ratting.map(t => Array(t(0),t(2).toString.toInt * 10)).take(5)
res130: Array[Array[Any]] = Array(Array(196, 30), Array(186, 30), Array(22, 10), Array(244, 20), Array(166, 10))
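Why the toString.toInt dance is needed can be seen with a plain Array[Any] standing in for a Row; the sample values are invented.

```scala
// Indexing a Row-like Array[Any] yields Any, which has no arithmetic
// operators, so the value must be converted before multiplying.
val row: Array[Any] = Array(196, 242, 3, "881250949") // invented sample row
val rating: Any = row(2)
// rating * 10                      // would not compile: Any has no '*'
val scaled = rating.toString.toInt * 10
```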
SAVE DATA
● save()
ratting.select("itemID").save("hdfs://localhost:54310/test2.json","json")
● saveAsParquetFile
● saveAsTable (Hive table)
DataType● Numeric types
● String type
● Binary type
● Boolean type
● Datetime type
o TimestampType: Represents values comprising values of fields year, month, day, hour, minute, and second.
o DateType: Represents values comprising values of fields year, month, day.
● Complex types
Future directions
Reference
1. https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
2. https://www.youtube.com/watch?v=vxeLcoELaP4
3. http://www.slideshare.net/databricks/introducing-dataframes-in-spark-for-large-scale-data-science