http://pixabay.com/en/the-elephant-africa-111695/
This is the fourth entry in my Hadoop Conference Japan 2014 report: I attended "Spark Internals" by Patrick Wendell (a core Apache Spark developer at Databricks), who had also presented during the keynote.
As promised, the talk was beginner-friendly, and I found the content quite approachable.
My notes from the session follow.
14:40- A Deeper Understanding of Spark Internals, Patrick Wendell (Databricks)
- Also friendly for Spark beginners
This talk:
- Understanding how Spark runs, with a focus on performance
- Execution model
- The Shuffle
- Caching
- Not covered in this session.
Why understand internals?
- Example: find the number of distinct names per "first letter"
sc.textFile("hdfs:/names")
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues(names => names.toSet.size)
.collect()
Spark Execution Model
- Create DAG of RDDs to represent computation
- Create logical execution plan for DAG
- Pipeline as much as possible.
- Split into "stages" based on need to reorganize data.
- Schedule and execute individual tasks
- Split each stage into tasks
- A task is data + computation
- Execute all tasks within a stage before moving on.
- HadoopRDD
- map()
- groupBy()
- mapValues()
- collect()
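To illustrate the "pipeline as much as possible" point above, here is a toy Python sketch (not Spark API or internals code, just my own illustration): narrow operations such as map() are fused into a single pass over each partition, while a wide operation such as groupByKey() forces a stage boundary.

```python
# Toy sketch of pipelining: a chain of per-record functions is applied in
# one pass over a partition, with no intermediate collection materialized.
def pipelined_stage(partition, *fns):
    for record in partition:
        for fn in fns:
            record = fn(record)
        yield record

# Stage 1 of the example: reading names and mapping to (first letter, name)
# happen in the same pass; the stage ends where the shuffle begins.
names = ["alice", "bob", "anna"]
stage1 = list(pipelined_stage(names, lambda n: (n[0], n)))
# stage1 == [("a", "alice"), ("b", "bob"), ("a", "anna")]
```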
The Shuffle
- Redistributes data among partitions
- Hash keys into buckets
- Optimization
- Avoided when possible, if data is already properly partitioned.
- Partial aggregation reduces data movement
- Pull based, not push based
- Write intermediate files to disk.
- Network-bound.
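The "hash keys into buckets" point can be sketched in toy Python (my own illustration, not Spark internals code): each record's key is hashed to pick one of numPartitions buckets, so all records sharing a key land in the same reduce-side partition.

```python
# Toy sketch of the shuffle's write side: hash-partition records by key.
def bucket_for(key, num_partitions):
    return hash(key) % num_partitions

def shuffle_write(records, num_partitions):
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[bucket_for(key, num_partitions)].append((key, value))
    return buckets  # in Spark these become intermediate files on disk

buckets = shuffle_write([("a", 1), ("b", 2), ("a", 3)], 4)
# ("a", 1) and ("a", 3) always land in the same bucket
```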
Execution of a groupBy()
- Build a hash map within each partition
- Note: Can spill across keys, but a single key-value pair must fit in memory
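As a toy Python sketch of the reduce side of groupByKey() (my own illustration, not Spark internals code): each partition builds an in-memory hash map from key to the list of all its values, which is why a very large group for one key is a problem.

```python
# Toy sketch: group all values for each key within one partition.
def group_partition(records):
    groups = {}
    for key, value in records:
        groups.setdefault(key, []).append(value)  # one entry per distinct key
    return groups

grouped = group_partition([("a", "alice"), ("a", "anna"), ("b", "bob")])
# grouped == {"a": ["alice", "anna"], "b": ["bob"]}
```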
Done! (at this point)
What went wrong? - worst-case scenario
- The code above can be improved; how?
- Too few partitions to get good concurrency.
- Large per-key groupBy()
- Shipped all data across the cluster
Common issue checklist
- Ensure enough partitions for concurrency
- Minimize memory consumption
- Minimize amount of data shuffled
- Know the standard library.
Importance of Partition Tuning
- Main issue: too few partitions
- Less concurrency
- Secondary issue: too many partitions
- Need reasonable number of partitions
- Commonly between 100 and 10,000 partitions.
- 2x number of cores in cluster
Memory Problems
- Symptoms
- Bad performance
- Diagnosis
- Set spark.executor.extraJavaOptions to include
- -XX:+PrintGCDetails ...
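The notes elide the exact flag list; as an illustration only (the flags shown beyond PrintGCDetails and the jar name are my assumptions, not from the talk), GC logging can be enabled on executors via spark-submit like this:

```shell
# Example only: turn on executor GC logging through Spark's config mechanism.
# my-app.jar is a hypothetical application jar.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  my-app.jar
```

The GC log output then appears in each executor's stderr, viewable from the Spark UI.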
Fixing our mistakes
- Fixing the original code using the ideas above.
sc.textFile("hdfs:/names")
.repartition(6)
.distinct()
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues(names => names.size)
.collect()
- And finally, it can be improved further:
sc.textFile("hdfs:/names")
.distinct(numPartitions = 6)
.map(name => (name.charAt(0), 1))
.reduceByKey(_ + _)
.collect()
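The reason the reduceByKey() version above ships less data is partial (map-side) aggregation, sketched here in toy Python (my own illustration, not Spark internals code): each partition combines its own values per key before the shuffle, so at most one record per distinct key crosses the network.

```python
# Toy sketch of map-side partial aggregation, as done by reduceByKey().
def partial_aggregate(partition, combine):
    acc = {}
    for key, value in partition:
        acc[key] = combine(acc[key], value) if key in acc else value
    return list(acc.items())  # at most one record per distinct key

# Counting per first letter: the partition pre-sums its counts locally.
partition = [("a", 1), ("a", 1), ("b", 1), ("a", 1)]
shuffled = partial_aggregate(partition, lambda x, y: x + y)
# shuffled == [("a", 3), ("b", 1)]  -- 2 records shuffled instead of 4
```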
Demo: Using Spark UI
- Tools for understanding low level performance
- jps | grep Executor
- jstack
- jmap -histo:live
Q&A
Does Spark support Join?
- Yes.
How does join work?
- Shuffle joins.
- Hash joins.
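Of the two strategies mentioned, a hash join can be sketched in toy Python (my own illustration, not Spark's implementation): build a hash table from one side's records, then probe it with the other side's.

```python
# Toy sketch of a hash join on (key, value) pairs.
def hash_join(left, right):
    table = {}
    for key, value in left:                    # build phase
        table.setdefault(key, []).append(value)
    return [(key, (lv, rv))                    # probe phase
            for key, rv in right
            for lv in table.get(key, [])]

joined = hash_join([("a", 1), ("b", 2)], [("a", "x"), ("c", "y")])
# joined == [("a", (1, "x"))]  -- only keys present on both sides survive
```

A shuffle join first hash-partitions both sides by key (as in the shuffle section above) so that matching keys meet on the same node, then joins each partition pair locally.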
Is Spark going to support rule-based optimization?
- (Sorry, I couldn't catch the answer.)
For new installations of Spark, Spark SQL is recommended (not Shark).
That's all for my notes on this session.
See you in the next session report.
Related reading
My #hcj2014 report
My notes on the individual sessions
- 10:00- Hadoop Conference Japan 2014 Keynote.
- 13:00- The Recruit Way of Using Hadoop, 3rd Edition, 石川 信行 (Recruit Technologies)
- 13:50- Batch and Stream Processing with SQL, 田籠 聡 (LINE)
- 15:40- Apache Drill: Building Highly Flexible, High Performance Query Engines, M.C. Srivas (MapR)
- 16:30- Evolution of Impala: the Fast SQL Engine on Hadoop, Latest Updates, 嶋内 翔 (Cloudera)
- 17:20- Presto, a Parallel SQL Engine: How to Quickly Visualize Large Datasets, 古橋 貞之 (Treasure Data)