#garagekidztweetz

The lifelog blog of id:garage-kid@76whizkidz!

Notes on #hcj2014 "A Deeper Understanding of Spark Internals", a session that was friendly even to Spark beginners

conference hcj2014 hcj hadoop lifelog


Image source: http://pixabay.com/en/the-elephant-africa-111695/

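This is the fourth entry in my Hadoop Conference Japan 2014 report series. I attended "Spark Internals" by Patrick Wendell (a core Apache Spark developer at Databricks), who also spoke in the keynote.

He said it was aimed at beginners, and the content was indeed quite approachable even for me.

My notes from the session follow.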

14:40- A Deeper Understanding of Spark Internals, Patrick Wendell (Databricks)

Also friendly for Spark beginners

This talk.

Understanding how Spark runs, with a focus on performance
- Execution model
- The Shuffle
- Caching
  - Not covered in this session.

Why understand internals?

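Goal: find the number of distinct names per first letter.

sc.textFile("hdfs:/names")                  // sc is an existing SparkContext
  .map(name => (name.charAt(0), name))      // key each name by its first letter
  .groupByKey()                             // gather all names for each letter
  .mapValues(names => names.toSet.size)     // count distinct names per letter
  .collect()                                // bring the results back to the driver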
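
To make the goal concrete, here is a minimal local sketch of what the job computes, using plain Scala collections instead of Spark. The object name `DistinctNames` and the sample names are made up for illustration, and it assumes every name is non-empty.

```scala
object DistinctNames {
  // Mirrors the job's logic on a local collection: key each name by its
  // first letter, group by that letter, then count distinct names per group.
  def distinctPerFirstLetter(names: Seq[String]): Map[Char, Int] =
    names
      .map(name => (name.charAt(0), name))
      .groupBy(_._1)
      .map { case (letter, pairs) => (letter, pairs.map(_._2).toSet.size) }
}
```

For example, `DistinctNames.distinctPerFirstLetter(Seq("Ahir", "Pat", "Andy", "Pat"))` yields `Map('A' -> 2, 'P' -> 1)`.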

Spark Execution Model

Create DAG of RDDs to represent computation
Create logical execution plan for DAG
- Pipeline as much as possible.
- Split into "stages" based on need to reorganize data.
Schedule and execute individual tasks
- Split each stage into tasks
- A task is data + computation
- Execute all tasks within a stage before moving on.

DAG for the example: HadoopRDD -> map() -> groupBy() -> mapValues() -> collect()
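
The "split into stages based on need to reorganize data" step can be sketched roughly as follows. This is a toy model in plain Scala, not Spark's scheduler: `StageSplit`, `Op`, and the `needsShuffle` flag are hypothetical names introduced only for this illustration.

```scala
object StageSplit {
  // Hypothetical model of one step in a lineage; this is not a Spark class.
  final case class Op(name: String, needsShuffle: Boolean)

  // Pipeline consecutive operations into one stage, and start a new stage
  // whenever an operation needs the data reorganized (a shuffle).
  def splitIntoStages(ops: List[Op]): List[List[String]] =
    ops
      .foldLeft(List(List.empty[String])) { (stages, op) =>
        if (op.needsShuffle) List(op.name) :: stages
        else (op.name :: stages.head) :: stages.tail
      }
      .map(_.reverse)   // operations were prepended, so restore their order
      .reverse          // stages were prepended too
      .filter(_.nonEmpty)
}
```

Marking only groupBy() as needing a shuffle, the example's five steps fall into two stages: `List("HadoopRDD", "map()")` and `List("groupBy()", "mapValues()", "collect()")`.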

The Shuffle

Redistributes data among partitions
Hash keys into buckets
Optimization
- Avoided when possible, if data is already properly partitioned.
- Partial aggregation reduces data movement.
Pull-based, not push-based
Write intermediate files to disk.
- Network-bound.
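
"Hash keys into buckets" can be illustrated with a small local sketch. It assumes a plain hashCode-modulo scheme and non-null keys; `HashBuckets`, `bucketFor`, and `redistribute` are names invented for this example, not Spark APIs.

```scala
object HashBuckets {
  // Map a key to a bucket index in [0, numBuckets). hashCode may be negative,
  // so a negative remainder is shifted back into range.
  def bucketFor(key: Any, numBuckets: Int): Int = {
    val raw = key.hashCode % numBuckets
    if (raw < 0) raw + numBuckets else raw
  }

  // Redistribute key-value records into buckets so that records with
  // equal keys always land in the same bucket.
  def redistribute[K, V](records: Seq[(K, V)], numBuckets: Int): Map[Int, Seq[(K, V)]] =
    records.groupBy { case (key, _) => bucketFor(key, numBuckets) }
}
```

Because the bucket depends only on the key, all values for one key end up together no matter which partition they started in.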

Execution of a groupBy()

Build a hash map within each partition
Note: Can spill across keys, but a single key-value pair must fit in memory
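
A rough sketch of building such a hash map for one partition is below. It is not Spark's implementation: everything is kept in memory and spilling is not modeled, and the name `GroupWithinPartition` is an assumption for this example.

```scala
import scala.collection.mutable

object GroupWithinPartition {
  // Build a hash map from each key to all of its values for the records of
  // one partition. All values for a key are held in memory at once.
  def group[K, V](partition: Iterator[(K, V)]): Map[K, List[V]] = {
    val table = mutable.HashMap.empty[K, mutable.ListBuffer[V]]
    for ((key, value) <- partition)
      table.getOrElseUpdate(key, mutable.ListBuffer.empty[V]) += value
    table.map { case (key, values) => (key, values.toList) }.toMap
  }
}
```

The sketch makes the memory concern visible: the buffer for a single key grows with the number of values for that key, which is why a very large per-key group is a problem.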

Done! (for the moment)

What went wrong? - worst-case scenario

The code above can be improved. How? Its problems:
- Too few partitions to get good concurrency.
- Large per-key groupBy()
- Shipped all data across the cluster

Common issue checklist

Ensure enough partitions for concurrency
Minimize memory consumption
Minimize amount of data shuffled
Know the standard library.

Importance of Partition Tuning

Main issue: too few partitions
- Less concurrency
Secondary issue: too many partitions
Need a reasonable number of partitions
- Commonly between 100 and 10,000 partitions.
- 2x the number of cores in the cluster
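
As a small worked example of these rules of thumb, the helper below computes the 2x-cores figure and checks it against the commonly cited range. `PartitionHeuristic` and its method names are hypothetical, and the range is treated only as a sanity check, not a hard limit.

```scala
object PartitionHeuristic {
  // Rule of thumb from the notes: 2x the number of cores in the cluster.
  def suggestedPartitions(totalCores: Int): Int = 2 * totalCores

  // The notes give 100 to 10,000 as the common range of partition counts.
  def inCommonRange(partitions: Int): Boolean =
    partitions >= 100 && partitions <= 10000
}
```

For a cluster with 64 cores this suggests 128 partitions, which falls inside the common range.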

Memory Problems

Symptoms
- Bad performance
Diagnosis
- Set spark.executor.extraJavaOptions to include
  - -XX:+PrintGCDetails ...

Fixing our mistakes

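Fixing the original code using the ideas above.

sc.textFile("hdfs:/names")
  .repartition(6)                         // more partitions for better concurrency
  .distinct()                             // drop duplicate names before grouping
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues(names => names.size)         // names are already distinct, so size is enough
  .collect()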
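
Why deduplicating first helps can be seen in a local sketch with plain Scala collections. `DistinctFirst` and the sample data in the usage note are made up for illustration, and names are assumed to be non-empty.

```scala
object DistinctFirst {
  // Original approach: group every name, then deduplicate inside each group.
  def original(names: Seq[String]): Map[Char, Int] =
    names.groupBy(_.charAt(0)).map { case (k, v) => (k, v.toSet.size) }

  // Fixed approach: deduplicate first, then group and simply count.
  def fixed(names: Seq[String]): Map[Char, Int] =
    names.distinct.groupBy(_.charAt(0)).map { case (k, v) => (k, v.size) }

  // Size of the largest per-key group that grouping would have to hold.
  def largestGroupSize(names: Seq[String]): Int =
    names.groupBy(_.charAt(0)).values.map(_.size).foldLeft(0)(_ max _)
}
```

With `Seq("Pat", "Pat", "Pat", "Andy", "Ahir")`, both approaches give the same counts, but the largest group shrinks from 3 entries to 2 once duplicates are removed first.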

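And finally, it can be improved further.

sc.textFile("hdfs:/names")
  .distinct(numPartitions = 6)            // deduplicate and set the partition count in one step
  .map(name => (name.charAt(0), 1))       // emit a count of 1 per distinct name
  .reduceByKey(_ + _)                     // sum the counts per first letter
  .collect()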
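
The benefit of reduceByKey over groupByKey comes from the partial aggregation mentioned in the shuffle notes. The local sketch below counts how many records would cross the shuffle with and without combining inside each partition first. It is a simplified model, and `PartialAggregation` and its method names are assumptions for this example.

```scala
object PartialAggregation {
  // One partition of (firstLetter, 1) pairs before the shuffle.
  type Partition = Seq[(Char, Int)]

  // Without partial aggregation (groupByKey style): every record is shuffled.
  def shuffledWithoutCombine(partitions: Seq[Partition]): Int =
    partitions.map(_.size).sum

  // Sum per key inside one partition before anything is sent.
  def combineLocally(partition: Partition): Map[Char, Int] =
    partition.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2).sum) }

  // With partial aggregation (reduceByKey style): at most one record
  // per key per partition is shuffled.
  def shuffledWithCombine(partitions: Seq[Partition]): Int =
    partitions.map(p => combineLocally(p).size).sum

  // Final counts after merging the per-partition partial sums.
  def mergedCounts(partitions: Seq[Partition]): Map[Char, Int] =
    partitions
      .flatMap(p => combineLocally(p).toSeq)
      .groupBy(_._1)
      .map { case (k, pairs) => (k, pairs.map(_._2).sum) }
}
```

For two partitions holding three records each over the keys 'A' and 'P', six records are shuffled without combining and only four with it, while the final counts are identical.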

Demo: Using Spark UI

Tools for understanding low level performance
- jps | grep Executor
- jstack
- jmap -histo:live

Q&A

Does Spark support joins?

Yes.

How do joins work?

Shuffle joins.
Hash joins.

Is Spark going to support rule-based optimization?

(Sorry, I couldn't catch the answer.)

If you are installing Spark fresh, we recommend using Spark SQL (not Shark).

That's all for my notes from this session.
See you in the next session notes.

Related reading

My personal #hcj2014 attendance report

I attended #hcj2014 Hadoop Conference Japan 2014 (a very personal summary)

Personal notes for each session