2014-07-08

#hcj2014 Notes on "Batch Processing and Stream Processing by SQL", in effect an introduction to a LINE-style Lambda Architecture implementation

Image: http://pixabay.com/en/elephant-babies-elephant-family-278525/

This is the third entry in my Hadoop Conference Japan 2014 report series: my notes from @tagomoris's session "Batch Processing and Stream Processing by SQL" (SQLによるバッチ処理とストリーム処理).

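This session, which could fairly be called an introduction to a LINE-style implementation of the Lambda Architecture, was one of the sessions I was glad to have attended at this conference. I definitely want to go back through the slides later.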

Below are the notes I took.

13:50- Batch Processing and Stream Processing by SQL, Satoshi Tagomori (LINE)

Batch processing and Stream processing by SQL from SATOSHI TAGOMORI

The only session at this conference that dealt with stream processing.
English title: Batch Processing and Stream Processing by SQL.

LINE

Lots of sub-services
- They handle a wide variety of complex data

fluentd

Quite a lot of people in the audience knew it, and quite a lot had used it

This talk is about SQL, and also about stream processing

Analytics data flow overview

fluentd
- Norikra
  - notification
  - visualisation
  - application metrics
- Hadoop, Hive, Presto
Raw MapReduce (MR) jobs are not written by hand

SQL is not the best.

But SQL is better than None.
SQL is a reasonable choice
- for people who just want to do a bit of data processing

What supports SQL

RDBMS
Hive
Presto, Impala, Drill

Today's timetable is mostly SQL as well

Demand for SQL is stronger than for writing MR yourself

Breaking it down by the nature of the processing

See the slides for details.

Batch Processing

Or, Stream Processing.
Storm was a hot topic for a while, but
when should each approach be used?
Batch Processing
- Hadoop, Hive
- Target window: hours-weeks
- Total throughput: Highest
- Query latency: Largest (20sec-mins-hours)
- Even for relatively short batches, people have put up with MR startup time
Short Batch Processing
- Presto, Impala, Drill
- Target window: seconds-hours-(days)
- Total throughput: Normal
- Query latency: Small (seconds-mins)
Stream Processing
- Storm, Kafka, Esper, Norikra, Fluentd, ...
- Spark Streaming?
- Target window: seconds-mins
- Total throughput: Normal
- Query latency: Smallest (milliseconds)
  - Queries must be written BEFORE DATA
- Once registered, runs forever.
Each case's data flow and latency
- 図解なので資料をみるのがいい
- The question is at which point the data gets processed
- Streaming processes data every time it arrives
  - incremental, but latency is small
Data Window
- Target time range of queries
- Batch or short batch
  - FROM TO: where dt>= and dt<=
- Stream
  - Calculate this query for every 3 mins
  - Requires extended SQL
  - Stream processing with SQL
    - Esper
      - Java library for processing Stream
      - Needs a schema (structured data).
        
        Most stream processing systems require a schema, Storm included.
        
        For LINE that is inconvenient:
        
        the data we want keeps changing
      - Esper EPL
        
        Ex. of extended SQL.
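To make the difference between the two kinds of data window concrete, here is a minimal Python sketch. It is not code from the talk and it is neither SQL nor Esper EPL: the event tuples, the function names `batch_count` and `stream_counts`, and the 180-second window are all assumptions made for illustration. The batch function takes an explicit FROM/TO range, while the stream function keeps emitting one result per 3-minute window as events arrive.

```python
from collections import Counter

# Hypothetical events as (epoch_seconds, path); not data from the talk.
EVENTS = [(0, "/a"), (70, "/b"), (170, "/a"), (185, "/a"), (400, "/b")]


def batch_count(events, start, end):
    """Batch style: the window is an explicit range, like WHERE ts >= start AND ts <= end."""
    return Counter(path for ts, path in events if start <= ts <= end)


def stream_counts(events, window_sec=180):
    """Stream style: yield one result per fixed window as time-ordered events arrive."""
    current = None
    counts = Counter()
    for ts, path in events:
        bucket = ts // window_sec
        if current is not None and bucket != current:
            # The previous window is closed, so emit its result.
            yield current * window_sec, dict(counts)
            counts = Counter()
        current = bucket
        counts[path] += 1
    if current is not None:
        # Flush the last, still-open window when the input ends.
        yield current * window_sec, dict(counts)


if __name__ == "__main__":
    print(batch_count(EVENTS, 0, 179))
    for window_start, result in stream_counts(EVENTS):
        print(window_start, result)
```

In this sketch a window with no events produces no output at all; a real stream engine may behave differently on that point.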

Enter Norikra

Schema-less stream processing with SQL
- OSS
  - based on Esper EPL, GPLv2
- Scalable
  - by Scale Out
- without pre-defined schema.
- HTTP RPC
- Dynamic query registration/removing
- Ultra fast bootstrap
- Able to handle 10k events/sec.
  - On a 2-CPU (8-core) server
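The ideas of "no pre-defined schema" and "dynamic query registration/removing" can be illustrated with a toy in-memory class. This is only a sketch of the concept, not Norikra's implementation or its API: the class `SchemalessStream`, its methods, and the use of Python predicates in place of SQL queries are all hypothetical.

```python
class SchemalessStream:
    """Toy stand-in for a schema-less stream engine (hypothetical, not Norikra)."""

    def __init__(self):
        self.fields = {}   # field name -> type name, learned from incoming events
        self.queries = {}  # query name -> (predicate, list of matched events)

    def register(self, name, predicate):
        """Add a query at runtime; it only sees events sent after this call."""
        self.queries[name] = (predicate, [])

    def remove(self, name):
        """Remove a query at runtime."""
        self.queries.pop(name, None)

    def send(self, event):
        """Accept any dict; unseen fields are recorded instead of rejected."""
        for key, value in event.items():
            self.fields.setdefault(key, type(value).__name__)
        for predicate, matched in self.queries.values():
            if predicate(event):
                matched.append(event)

    def results(self, name):
        return self.queries[name][1]


if __name__ == "__main__":
    stream = SchemalessStream()
    stream.register("errors", lambda e: e.get("status", 0) >= 500)
    stream.send({"path": "/a", "status": 200})
    stream.send({"path": "/b", "status": 503, "user": "u1"})  # new field, no schema change
    print(stream.fields)
    print(stream.results("errors"))
```

The second event carries a `user` field that was never declared, which is the situation the notes describe as awkward for schema-bound stream engines.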

Where is it used? Lambda Architecture

Just the same processing in 2 places:
- Stream processing
- Batch processing
- Write it once, run it on either
Replayable Processing
- Stream processing MUST NOT be replayable
- Queries on stream processing SHOULD be replayable
So that Hybrid processing:
- for FT
  - Stream processing:
    - executes queries in normal operation
  - Batch processing:
    - executes recovery queries
- for latency reduction + accuracy
  - prompt reports (as preliminary figures): Norikra at LINE
  - fixed reports (as final figures): Hive
- against complexity
  - NonSQL stream processing
    - for simple, fixed, high traffic events
  - SQL stream processing
    - for complex, fragile events
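The hybrid idea above (the stream path gives prompt figures, the batch path recomputes final figures and covers stream failures) can be sketched in a few lines of Python. This is an illustration under assumptions, not LINE's system: `count_by_service`, `HybridReport`, the `service` field, and the hour keys are all hypothetical names, and plain Python stands in for the Norikra and Hive queries.

```python
from collections import Counter


def count_by_service(events):
    """The single query definition, shared by the stream path and the batch path."""
    return Counter(e["service"] for e in events)


class HybridReport:
    def __init__(self):
        self.prompt = {}  # hour -> Counter built incrementally by the stream path
        self.fixed = {}   # hour -> Counter recomputed by the batch path

    def on_stream_event(self, hour, event):
        """Stream path: update the preliminary figure as each event arrives."""
        self.prompt.setdefault(hour, Counter()).update(count_by_service([event]))

    def run_batch(self, hour, stored_events):
        """Batch path: recompute from stored data; also recovers events the stream missed."""
        self.fixed[hour] = count_by_service(stored_events)

    def report(self, hour):
        """Serve the final figure once it exists, otherwise the preliminary one."""
        return self.fixed.get(hour, self.prompt.get(hour, Counter()))


if __name__ == "__main__":
    stored = [{"service": "a"}, {"service": "a"}, {"service": "b"}]
    report = HybridReport()
    for event in stored[:2]:  # pretend the stream path lost the third event
        report.on_stream_event(10, event)
    print(report.report(10))  # preliminary figure
    report.run_batch(10, stored)
    print(report.report(10))  # final figure after the batch run
```

Keeping the query in one function is what makes "write it once, run it on either" possible in this sketch; the two paths differ only in when and on which data they call it.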

Case study in LINE

Prompt reports & fixed reports
- Norikra + Hive Hybrid
Error detection from apps and access logs.
- Norikra + Fluentd Hybrid
Realtime aggregation for complex and simple(fixed) objects
- Norikra + Fluentd Hybrid

Ex. Hive: fixed reports

JSON formatted logs.

Ex. Norikra: prompt reports

More queries, more simplicity, and less latency.

Stream processing is convenient
- Once a query is registered, results such as hourly aggregates come back automatically

QA.

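Do you actually take action based on the preliminary figures coming out of the stream?

Yes, for advertising campaigns.
- The budget may be small, but sometimes an ad is pushed out in one big burst over a set period and then it is over
- Some advertisers want to put in more if they can find out before the inventory sells out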

That's all for my notes.
See you in the next entry.

Related reading

Personal reports from #hcj2014

#hcj2014 I attended Hadoop Conference Japan 2014 (a very personal summary)

#garagekidztweetz

The lifelog blog of id:garage-kid@76whizkidz!