2014-07-08

#hcj2014 Notes on "Parallel SQL Engine Presto - How to Graph Large Data Sets Quickly".

Image source: http://pixabay.com/en/elephant-babies-elephant-family-278524/
This is the last of the session notes in my Hadoop Conference Japan 2014 attendance report series.

The last session I attended was @frsyuki's "Parallel SQL Engine Presto".

In this session, more than Presto itself, it was @frsyuki I kept taking my hat off to (beautiful slides, easy-to-follow explanations, and so on).

Incidentally, of the sessions I attended this time, two stood out as especially appealing, this one included*1.

The Q&A also played off the preceding session by @shiumachi, so attending the two back to back gave a nice sense of continuity.

The slides were in English, so I ended up taking most of my notes in English too. My notes follow.

17:20- Parallel SQL Engine Presto - How to Graph Large Data Sets Quickly, Sadayuki Furuhashi (Treasure Data)

Presto - Hadoop Conference Japan 2014 from Sadayuki Furuhashi

* A fair number of Presto users were in the audience

Open source Hacker

MessagePack
Fluentd
ServerEngine
LS4
kumofs

What's Presto?

A distributed SQL query engine for large-scale data.

History

Fall 2012: project started at Facebook.
- speed of a commercial DW
- scalability to the size of Facebook
Winter 2013: open sourced.

Problems

Couldn't visualise data on HDFS directly using dashboards or BI tools.
- Hive is too slow.
- ODBC connectivity is unavailable/unstable.
Daily batch results are loaded into an interactive DB for quick response
- PostgreSQL, Redshift
  - An interactive DB costs more and is far less scalable
Some data are not on HDFS
- Need to copy them to HDFS

Batch analysis platform and Visualisation platform can be managed better using Presto.

Presto and Hive complement each other: use each where it fits
HDFS -> Presto -> Dashboard
Presto can query any data sets using SQL
- For example, from Cassandra, MySQL, or a commercial RDBMS.

What can Presto do?

Can
- Query interactively
- Query from commercial BI tools
- Query Across Multiple datasources.
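
To make "query across multiple data sources" concrete, here is a minimal sketch of what such a query could look like. It assumes tables are addressed as catalog.schema.table, with one catalog per connector. The catalog, schema, table, and column names below are all hypothetical and were not shown in the talk.

```python
def qualified(catalog, schema, table):
    """Build a catalog.schema.table name (the addressing form assumed here)."""
    return f"{catalog}.{schema}.{table}"


def build_join_query():
    """Return one SQL statement that joins tables from two different sources."""
    events = qualified("hive", "web", "access_logs")  # hypothetical table on HDFS
    users = qualified("mysql", "crm", "users")        # hypothetical table in MySQL
    return (
        "SELECT u.country, count(*) AS hits\n"
        f"FROM {events} AS e\n"
        f"JOIN {users} AS u ON e.user_id = u.id\n"
        "GROUP BY u.country"
    )


if __name__ == "__main__":
    print(build_join_query())
```

The point is only that a single statement names both sources, so nothing has to be copied into HDFS first.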

Who uses Presto?

Facebook
- Multiple geographical regions.
- 1,000 nodes
Netflix, Dropbox, and others

Today's talk from here on

Distributed Architecture

Discovery Service, Client, Coordinator, Worker, Connector Plugin, Storage/Metadata
The slides are the best reference here; they are very easy to follow
What's connector?
- plugin for Presto
  - Java
- Access to storage and metadata
- ex.
  - Hive connector
  - Cassandra connector
  - MySQL connector
  - You can build your own.
Summary
- 3 types of servers
  - Discovery Service, Coordinator, Worker
- Get data/metadata through connector plugins
- client protocol is HTTP + JSON
  - Ruby, Python, etc...
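
Since the client protocol is just HTTP + JSON, a client can be sketched with nothing but the Python standard library. This is a minimal sketch, not an official client: it assumes the `/v1/statement` endpoint, the `X-Presto-*` request headers, and the `nextUri` / `data` / `error` response fields of Presto's client protocol. The server address, user name, and the helper names `http_json` and `run_query` are made up for illustration.

```python
import json
import urllib.request


def http_json(method, url, body=None, headers=None):
    """Send one HTTP request and decode the JSON response body."""
    data = body.encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, headers=headers or {}, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_query(sql, server="http://localhost:8080", user="demo",
              catalog="hive", schema="default", request=http_json):
    """Submit a statement, then follow nextUri links and collect result rows."""
    headers = {
        "X-Presto-User": user,        # assumed header names
        "X-Presto-Catalog": catalog,
        "X-Presto-Schema": schema,
    }
    # The SQL text itself is the POST body.
    page = request("POST", server + "/v1/statement", sql, headers)
    rows = []
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        if next_uri is None:  # no more pages: the query has finished
            return rows
        page = request("GET", next_uri, None, headers)
```

The `request` parameter is injectable so the paging loop can be exercised against canned responses without a running coordinator.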

Data visualisation and demos.

Issues with BI tools.
- ODBC/JDBC connectivity
- ODBC/JDBC are very complicated.
  - Writing a new ODBC/JDBC driver for a new product and making it mature takes a lot of effort
Solution
- Prestogres
  - Furuhashi-san made it.
  - patched pgpool-II
Demos.
- Connect Presto from Tableau (ODBC)

Presto's execution model.

Not using MR.
The slides explain this with diagrams too and are easy to follow
- Query Planner
  - Stages
- Execution Planner
  - divided into tasks.
  - Split
Biggest differences from MR
- All tasks run concurrently
  - However, if one stops, everything stops
- No disk I/O occurs
  - Data is transferred memory to memory
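
The execution model described in these notes can be pictured with a toy pipeline. This is only an illustration using Python threads and an in-memory queue, not how Presto is actually implemented. The names `scan_task`, `sum_task`, and `run_pipeline` are hypothetical.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of one upstream task's output


def scan_task(split, out):
    """Leaf-stage task: read one split and stream its rows downstream."""
    for row in split:
        out.put(row)
    out.put(DONE)


def sum_task(inbox, n_upstream, result):
    """Downstream-stage task: aggregate rows while upstream tasks still run."""
    total, finished = 0, 0
    while finished < n_upstream:
        item = inbox.get()
        if item is DONE:
            finished += 1
        else:
            total += item
    result.append(total)


def run_pipeline(splits):
    """Start one scan task per split plus one aggregation task, all at once."""
    exchange = queue.Queue(maxsize=4)  # bounded in-memory buffer, nothing hits disk
    result = []
    tasks = [threading.Thread(target=scan_task, args=(s, exchange)) for s in splits]
    tasks.append(threading.Thread(target=sum_task, args=(exchange, len(splits), result)))
    for t in tasks:  # tasks of every stage start together
        t.start()
    for t in tasks:
        t.join()
    return result[0]


if __name__ == "__main__":
    print(run_pipeline([[1, 2, 3], [4, 5], [6]]))
```

Because the buffer is bounded, a stalled consumer soon blocks every producer, which is the toy version of "if one stops, everything stops". In MR, by contrast, a stage writes its output to disk before the next stage begins.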

Monitoring and configuration

Remarkably well done?!
Monitoring
- WebUI
- JMX HTTP API
- Event notification
Configuration
- "Please see the slides," he said

Roadmap

From Presto Meetup. (May 2014)
- Huge Join and Group by
- Task Recovery
- CREATE VIEW (already done)
- Native Store
  - Similar to Spark's cache.
- Authentication
- DDL/DML
- Plugin repository
- CLI plugin manager
- Join and aggregation pushdown
- Custom optimizers

Links.

Web site, docs
ML
Github
Guidelines

Treasure Data is now hiring.

QA.

So how does it compare with Impala?

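In practice, Impala is faster than Presto
But new Presto versions come out frequently, and it keeps getting faster
The biggest difference is that even when a query fails, the process does not go down
- Logging is excellent
- The operational side is well thought out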

Reference:

『Prestoとは何か，Prestoで何ができるか』 ("What Presto is and what you can do with Presto") - Treasure Data official blog (added 2014-07-10 15:29)

That concludes my notes on the sessions I attended at #hcj2014.
In the next entry I will finish with a very personal summary of the conference as a whole.

Related reading

My personal #hcj2014 attendance reports

*1: In both content and delivery

#garagekidztweetz

id:garage-kid@76whizkidz's lifelog blog!
