
My notes on #hcj2014 "A Deeper Understanding of Spark Internals", which was friendly even to Spark beginners




This is the fourth entry in my Hadoop Conference Japan 2014 report series. I attended "Spark Internals" by Patrick Wendell (a core Apache Spark developer at Databricks), who had also presented in the keynote.

As the "beginner-friendly" billing promised, the content was quite approachable even for me.


14:40- A Deeper Understanding of Spark Internals, Patrick Wendell (Databricks)

  • Friendly even for Spark beginners

This talk:

  • Understanding how Spark runs, focus on performance
    • Execution model
    • The Shuffle
    • Caching
      • Not covered in this session.

Why understand internals?

  • find number of distinct names per "first letter"

sc.textFile(...)  // input source elided in my notes
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues(names => names.toSet.size)
  .collect()
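The computation itself is easy to try locally. Here is a plain-Python sketch of the same distinct-names-per-first-letter count (the names list is made up for illustration):

```python
from collections import defaultdict

names = ["Alice", "Andy", "Alice", "Bob", "Charlie", "Cara"]

# map: name -> (first letter, name)
pairs = [(name[0], name) for name in names]

# groupBy first letter, keeping distinct names per key
groups = defaultdict(set)
for letter, name in pairs:
    groups[letter].add(name)

# mapValues: count distinct names per first letter
result = {letter: len(distinct) for letter, distinct in groups.items()}
print(result)  # {'A': 2, 'B': 1, 'C': 2}
```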

Spark Execution Model

  • Create DAG of RDDs to represent computation
  • Create logical execution plan for DAG
    • Pipeline as much as possible.
    • Split into "stages" based on need to reorganize data.
  • Schedule and execute individual tasks
    • Split each stage into tasks
    • A task is data + computation
    • Execute all tasks within a stage before moving on.
  • HadoopRDD
  • map()
  • groupBy()
  • mapValues()
  • collect()
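"Pipeline as much as possible" means narrow transformations like map() and mapValues() are fused into a single pass over each record, while groupBy() forces a stage boundary. A rough local analogy (my own, not Spark's scheduler) is chained lazy generators in Python, where each record flows through every step before the next record is read:

```python
trace = []

def read():
    # stands in for reading input records
    for name in ["Alice", "Bob"]:
        trace.append("read " + name)
        yield name

def map_first_letter(records):
    # a narrow transformation, fused with the read
    for name in records:
        trace.append("map " + name)
        yield (name[0], name)

# Nothing runs until the pipeline is consumed; steps interleave per record.
pairs = list(map_first_letter(read()))
print(trace)  # ['read Alice', 'map Alice', 'read Bob', 'map Bob']
```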

The Shuffle

  • Redistributes data among partitions
  • Hash keys into buckets
  • Optimization
    • Avoided when possible, if data is already properly partitioned.
    • Partial aggregation reduces data movement
  • Pull based, not push based
  • Writes intermediate files to disk.
    • Network-bound.
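"Hash keys into buckets" means each key is assigned to a destination partition by hashing, so all records sharing a key always land in the same bucket. A minimal sketch of that idea (not Spark's actual partitioner code):

```python
def partition_for(key, num_partitions):
    # Hash the key into one of num_partitions buckets.
    return hash(key) % num_partitions

pairs = [("A", "Alice"), ("B", "Bob"), ("A", "Andy")]
buckets = {}
for key, value in pairs:
    buckets.setdefault(partition_for(key, 4), []).append((key, value))

# Both "A" records end up in the same bucket, so one downstream
# task sees every value for its keys.
```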

Execution of a groupBy()

  • Build a hash map within each partition
  • Note: Can spill across keys, but a single key-value pair must fit in memory
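That per-partition hash map is easy to picture locally (the partition contents are made up). Note how one key's entire value list is a single in-memory object, which is why a very large key cannot be spilled:

```python
# Sketch: a reduce task builds a hash map within its partition,
# mapping each key to all of its values.
partition = [("A", "Alice"), ("A", "Andy"), ("B", "Bob")]
hash_map = {}
for key, value in partition:
    hash_map.setdefault(key, []).append(value)

# Spilling can happen between keys, but one key's whole value
# list must fit in memory at once.
print(hash_map)  # {'A': ['Alice', 'Andy'], 'B': ['Bob']}
```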

Done! (at this point in the walkthrough)

What went wrong? - worst-case scenario

  • How can the code above be improved?
    • Too few partitions to get good concurrency.
    • Large per-key groupBy()
    • Shipped all data across the cluster

Common issue checklist

  • Ensure enough partitions for concurrency
  • Minimize memory consumption
  • Minimize amount of data shuffled
  • Know the standard library.

Importance of Partition Tuning

  • Main issue: too few partitions
    • Less concurrency
  • Secondary issue: too many partitions
  • Need reasonable number of partitions
    • Commonly between 100 and 10,000 partitions.
    • 2x the number of cores in the cluster

Memory Problems

  • Symptoms
    • Bad performance
  • Diagnosis
    • Set spark.executor.extraJavaOptions to include
      • -XX:+PrintGCDetails ...
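My note trails off here; a typical way to enable GC logging on executors looks like the following (the exact JVM flags and the `your-app.jar` name are my own illustrative choices, not necessarily the speaker's list):

```shell
# Pass GC-logging flags to executors via spark.executor.extraJavaOptions
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  your-app.jar
```

The GC details then show up in each executor's stdout log, which helps diagnose whether "bad performance" is really memory pressure.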

Fixing our mistakes

  • Fixing the original code using the ideas above:

.repartition(6)
.distinct()
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues(names => names.size)

  • And finally, it can be improved even further:

.distinct(numPartitions = 6)
.map(name => (name.charAt(0), 1))
.reduceByKey(_ + _)
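The win of reduceByKey over groupByKey is that it pre-aggregates on the map side, so only one record per key per partition crosses the network. A local Python sketch of that idea (the partition contents are made up):

```python
from collections import Counter

# Two "partitions" of (first-letter, 1) pairs, as after distinct() + map().
partitions = [
    [("A", 1), ("A", 1), ("B", 1)],
    [("A", 1), ("C", 1)],
]

# Map-side combine: each partition pre-aggregates before anything is shuffled...
partials = [Counter(k for k, _ in part) for part in partitions]

# ...so the shuffle only merges one small count per key per partition.
shuffled = sum(partials, Counter())
print(dict(shuffled))  # {'A': 3, 'B': 1, 'C': 1}
```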

Demo: Using Spark UI

  • Tools for understanding low-level performance
    • jps | grep Executor
    • jstack
    • jmap -histo:live


Q&A

Does Spark support join?
  • Yes.
How does join work?
  • Shuffle joins.
  • Hash joins.
Is Spark going to support rule-based optimization?
  • ????? (Sorry, I couldn't catch the answer.)
For a new Spark installation, Spark SQL (not Shark) is recommended.



My personal reports from #hcj2014