#hcj2014 Notes on the Hadoop Conference Japan 2014 keynote. The content makes a good guidepost for following the Hadoop scene for at least the next year or so!

http://pixabay.com/en/elephant-elephants-tanzania-safari-289134/

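Today I attended Hadoop Conference Japan 2014, the fifth edition of the event.
For me, a study session is not over until I have blogged about it, so I plan to post my notes from each session I attended, one entry per session, followed by my personal impressions of the event as a whole.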

So, the first post covers the keynote.

The keynote opened with greetings from 米谷修 (Recruit Technologies) and 濱野賢一朗 (Japan Hadoop User Group, NTT Data),
followed by a session from Doug Cutting (creator of Hadoop, Apache Software Foundation, Cloudera),
a session from Patrick Wendell (core Apache Spark developer, Databricks),
and a session from 太田一樹 (Treasure Data CTO): an impressive lineup.

Personally, I thought the talks by Doug Cutting and 太田一樹 were especially good.
(The reason: at least for me, their content will be a good guidepost for following the Hadoop scene over the next year or so.)

My notes on the keynote follow.

Opening video

The ever-widening Hadoop circle
Charting new routes across the ocean of big data
- Hadoop know-how from Japan to the world

米谷修 (Recruit Technologies)

Opening remarks

He wore a happi coat with "Hadoop" written on it
- This is the fifth Hadoop Conference Japan organized by the Japan Hadoop User Group

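濱野賢一朗 (Japan Hadoop User Group, NTT Data)

Greeting as the representative of the Japan Hadoop User Group

This is the 5th conference; the series began in 2009
- The venue setup, lunch, and other preparations were done with the help of Recruit Technologies.

About this event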

#hcj2014
@hadoopconf
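Registered attendees: 1,296 as of this morning (a slight increase, perhaps)
- 65% are attending for the first time
- Are there few repeat attendees?
- Is the user base broadening?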

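The state of Hadoop

So far
- The first parallel distributed processing system to achieve wide adoption
- What made it good?
  - Maximizing data read throughput
    - Made processing of the full data set (big data) feasible
    - A simple model (MR)
      - Handling of failures
      - Solved in the framework (middleware) (map->shuffle->reduce)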
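The overall picture of Hadoop
- Hadoop is still evolving rapidly
  - And becoming more complex
    - Version lines
      - 0.20 line
      - 2.0 line
      - Most distributors are tracking the 2.3 and 2.4 lines
- What is the biggest change among all this?
  - YARN
  - The framework used to run only MR; the trend now is to run other engines on it as well
The world Hadoop will lead from here
- With YARN, multiple parallel distributed processing engines can be used side by side in one environment
  - Inter-node communication has become affordable these days
- Larger memory capacities and the spread of 10Gbps networking
  - In-memory processing is becoming more practical
- An era of choosing among multiple parallel distributed processing systems, building on experience with MR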
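
As a rough illustration of the map->shuffle->reduce model mentioned in these notes, here is a minimal single-process sketch in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and the distribution across nodes and the failure handling that the framework provides are left out entirely.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield word.lower(), 1


def shuffle_phase(pairs):
    """Shuffle: group all values that were emitted under the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    lines = ["Hadoop runs MapReduce", "YARN runs more than MapReduce"]
    print(word_count(lines))
```

The user only writes the map and reduce logic; in a real cluster the shuffle step, task scheduling, and retries on failure are what the middleware takes care of.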

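Hadoop experience (from the survey)

More than 44% have six months or more of experience
Newcomers are also increasing steadily
- Those with less than 3 months of experience

Ecosystem components in use

Hive is overwhelmingly the most used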
- Hive 570
- ZK 289
- HBase 271
- Fluentd 194
- Pig 191
- Mahout 163
- Sqoop 143
- Impala 141
- Spark 108
- Hue 104

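Participation in the Hadoop community (worldwide), by company

By lines of source code, NTT-D and NTT rank 9th in the world
- Why not take an active part in the development community?

Announcements and requests

Photography and press coverage are allowed, but please avoid disturbing others with shutter sounds and the like
There should be no need to photograph the slides, since they are expected to be posted
No wireless LAN is provided
The reception costs 2,000 yen (same-day sign-up is OK)
Exhibition booths
- Cloudera
- SAS etc...
- There is a gift if you draw a company logo on the message board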

"The Future of Data": Doug Cutting (creator of Hadoop, Apache Software Foundation, Cloudera)

I cannot see the future.

I'm here in the present.
But we can predict from the findings (facts).
Tell the facts from the data.
- Where we are heading.

Fact: HW becomes more affordable. (Hardware keeps getting cheaper)

Microprocessor Transistor Counts & Moore's Law.

Fact: Data becomes more valuable. (The value of data keeps rising)

Everything generates data.
- if we can gather it.

Fact: OSS will survive.

Now we have HW, data, and SW (that is, OSS).
He talked about his experience developing Lucene.
- The biggest competitive advantage is that the SW is OSS.
- Everyone can try it.
The platform for analyzing data should be OSS, I believe.

Fact: Hadoop continues improving.

The story of the beginning of Hadoop.
- An episode from the development of Nutch.
- Y! became interested in that technology.
- At first many features were lacking (SPOF, no security implementation)

Fact: Hadoop becomes a standard. (Hadoop becomes the norm)

More capability on top of Hadoop
- Lucene, Solr
- HBase
- Impala
- Graph processing
(Hadoop) becomes more capable.

Fact: Hadoop leads the big data world. (Hadoop sweeps across big data)

Even big players such as Oracle and IBM are starting to adopt Hadoop.
- Silo type of system
- Common type of system

Fact: Even transaction processing is handled by Hadoop. (Even transaction processing can run on Hadoop)

Google has published a paper related to this fact.

Future Image: Enterprise Data Hub

Platform will become low cost.
- Everything will be provided by OSS.
Hadoop is the safe path to the revolution.

QA.

My favorite thing is answering questions.
- Don't be shy and feel free to talk to me.
No one asked a question.

"The Future of Spark": Patrick Wendell (core Apache Spark developer, Databricks)

First things first

Thanks to Recruit Technologies.
They translated all of the slides.

Spark Development.

500 patch updates

Spark Future

Spark has seen rapid growth in the last year.
Where we are heading

Goal of the Spark Project

Empower data scientists and engineers
Expressive, clean APIs
Unified runtime across ...

API stability.

in 1.0

Developer friendly release cadence

Minor releases every 3 months
Maintenance release with fixes as necessary ...

The Spark Stack

Spark SQL
MLlib
GraphX ....

The future of Spark is libraries

Packaged and distributed with Spark to provide full interoperability.

Spark SQL (one of the hottest components in Spark)

Support for SQL and the notion of typed schema RDDs.
Focus going forward
- Optimization (code gen, faster joins)
- Language support (SQL92)

Spark SQL and Schema RDDs

Support
- JSON, Parquet on Hive
- Cassandra, HBase, MongoDB (NoSQL)
- SAP, Vertica, Oracle (RDBMS)

Spark SQL and Shark

Spark 1.1+ will provide a JDBC/ODBC server, allowing a direct upgrade for Shark users.

Next Library: MLlib

Second fastest growing component.

Spark R

Integration with MLlib.

Spark Core

Allow extension and innovation by defining internal APIs
- Internal storage API
- Spark shuffle API

Topics not covered much today

Streaming
GraphX
Core
- Elastic scaling on YARN

Databricks Cloud.

Provision a Spark Cluster instantly in the Cloud.
- Notebooks, dashboards, and scheduled jobs.

Demos.

Create new Spark Cluster (in 2 minutes).
- Notebook
  - language (SQL, python, Scala)
  - Ad hoc query against a Parquet file on S3 (?)
- Dashboards
  - Sample graphs and maps generated dynamically.

Wrap up

Spark will grow substantially in the next year.
- Focus on libraries.
- Developer friendly. (by providing stable releases.)

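"The Evolution of the Hadoop Ecosystem and the Use Cases That Have Emerged": 太田一樹 (Treasure Data CTO)

A session intended to serve as a compass for the Hadoop ecosystem.

Self-introduction

An OSS believer, so please forgive any slightly biased remarks
Treasure Data CTO
- Actively committing to Fluentd and Presto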

In Hadoop's tenth year of development, asking again: why use Hadoop?

What is the value of Hadoop?
Database Landscape Map 2014 (by 451 Research)
- Among so many products, why Hadoop?
A common answer
- Cheap storage
  - If that were all, wouldn't Hadoop be a poor choice?
    - GlusterFS, Ceph, and the like are better at that
Hadoop lets you collect and store any types of data economically for faster and better use of data, to improve your product and mitigate business risks.

The evolution and chaos of the Hadoop ecosystem

Today, let's cover the following four points
(1) Collect Any Types of data
- Fluentd
- Kafka
- Flume
- Sqoop
(2) Store Any types of data economically
- Parquet, ORCFile (format)
- HDFS, HBase, Accumulo
- Ambari, Hue, Cloudera Manager
  - Isn't support for management and operations extremely important?
- Treasure Data, AWS Elastic MR
(3) Faster Use of Data
- How easily can the data be handled? *
- YARN
- Storm, Samza, Norikra
- Apache Tez, Spark
- Data processing frameworks
  - HiveQL, Pig
  - Java: Cascading, Apache Crunch
(4) Better Use of Data
- How well can the data be used?
- SQL on Hadoop
- Impala, SparkSQL, Presto, Drill
- Mahout, Spark MLlib, Hivemall

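Evolution of the other options

Evolution of databases
- Massively Parallel Processing
- Schema-on-Write
- Oracle, DB2, SQL Server
- Teradata, Netezza, Vertica, ParAccel, Greenplum
- Many vendors have announced Hadoop support
- Having to decide the schema up front is a disadvantage, but
  - there are also moves toward Schema-on-Read, such as Vertica Zonemap
- What is the biggest problem with databases?
  - Only a small subset of reports can actually be produced
    - BI staff cannot keep up with requests
    - With the Hadoop approach of just dumping everything in, people who understand the business can run queries themselves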
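The current trend
- Consolidate all raw data in Hadoop
- Store the data summarized and aggregated from it in an MPP database
- Example implementations
  - Twitter: Hadoop + Vertica etc..
The key to the future should be found in today's sessions
- Hadoop is moving toward the boundary with structured data
- MPP databases are stepping into the territory of unstructured data
- We need to watch closely who becomes the market leader
  - M&A
- Users will need even more knowledge and a better grasp of trends.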
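
The pattern described in these notes (keep all raw data, then store aggregated results in a database with a fixed schema) can be sketched roughly as follows. This is only an illustration under loud assumptions: the events in `RAW_EVENTS` are made up, an in-memory SQLite database stands in for the MPP database, a Python list stands in for raw files on Hadoop, and the names `aggregate_actions`, `load_summary`, and `build_report` are hypothetical.

```python
import json
import sqlite3
from collections import Counter

# Hypothetical raw events kept as schemaless JSON lines (the "dump everything in" side).
RAW_EVENTS = [
    '{"user": "a", "action": "view", "item": "x"}',
    '{"user": "b", "action": "buy", "item": "x", "price": 120}',
    '{"user": "a", "action": "buy", "item": "y", "price": 80}',
    "not json at all",
]


def aggregate_actions(raw_lines):
    """Schema-on-read: interpret each line only at query time, skipping bad rows."""
    counts = Counter()
    for line in raw_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # raw storage accepts anything; the reader decides what is usable
        if isinstance(event, dict) and "action" in event:
            counts[event["action"]] += 1
    return counts


def load_summary(conn, counts):
    """Schema-on-write: the summary table has a fixed schema decided up front."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS action_summary "
        "(action TEXT PRIMARY KEY, events INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO action_summary VALUES (?, ?)",
        sorted(counts.items()),
    )
    conn.commit()


def build_report(raw_lines):
    """Aggregate the raw lines, load the summary, and read it back as rows."""
    conn = sqlite3.connect(":memory:")  # stand-in for the MPP database
    load_summary(conn, aggregate_actions(raw_lines))
    rows = conn.execute(
        "SELECT action, events FROM action_summary ORDER BY action"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(build_report(RAW_EVENTS))
```

The raw side tolerates extra fields and malformed lines, while the summary side only ever sees the two columns chosen in advance, which is the trade-off between Schema-on-Read and Schema-on-Write noted above.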

That concludes my keynote notes. See you in the next entry.
