#garagekidztweetz

id:garage-kid@76whizkidz's lifelog blog!

#hcj2016 Notes on "Resolving the Trade-off Between Transactional Access and Analytic Performance in Hadoop with Kudu"

hadoop cloudera kudu conference lifelog

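The session I attended in the previous slot ran over, so I joined partway through, but for the second afternoon slot of Hadoop/Spark Conference Japan 2016 I attended the Kudu session.

As usual, my notes follow.

Resolving the Trade-off Between Transactional Access and Analytic Performance in Hadoop with Kudu / Todd Lipcon (Cloudera)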

  • Scalable and fast tabular storage
    • Scalable
      • 1000s of nodes, tens of PBs
    • Fast
      • Millions of reads and writes per second
      • Multiple GB/second
    • Tabular
      • SQL-like schema
      • Fast ALTER TABLE
  • Use cases and Architectures
    • Kudu is good at both sequential and random reads/writes.
    • Time Series.
      • e.g. fraud detection & prevention.
      • Workload: Insert, update, scans, lookups.
    • Online Reporting
      • e.g. ODS
      • Workload: Insert, update, scans, lookups.
  • Realtime Analytics in Hadoop with Kudu
    • Problems to solve before Kudu.
      • Complicated: uses two storage systems.
      • Long latency. Data is not recent.
      • Cannot handle updates/deletes.
    • Kudu makes that system much simpler.
      • Fast for Analytics.
      • One system to operate.
      • No cronjobs or background processes.
  • Xiaomi use case.
    • 4th largest smartphone maker.
    • Runs its own online services such as photo sharing.
    • Needs monitoring and troubleshooting tools for those services.
      • Requirements.
        • High write throughput.
        • Query the latest data with quick response.
        • Can search for individual records.
      • System diagram before Kudu.
        • Long pipeline.
          • High latency (1 hour to 1 day), data conversion pains.
        • No ordering.
          • Log arrival order is not exactly the logical order.
          • Reading 2-3 days of log data takes 1 day.
      • After Kudu.
        • Data Source > Kafka > Storm > Kudu > Impala > result serving.
          • ETL pipeline (0-10sec latency)
          • Direct pipeline (no latency)
  • How it works? (Technical part)
    • Table is horizontally partitioned into tablets.
      • Range or Hash partitioning
      • Each tablet has N replicas (3 or 5), with Raft consensus
        • Automatic Fault Tolerance.
        • MTTR: ~5sec.
      • Tablet servers host tablets on local disk drives.
    • Installation of Kudu.
      • Only Kudu itself needs to be installed.
    • Metadata and the Master.
      • Replicated Master.
      • Not a bottleneck.
        • super fast in-memory lookups.
  • Kudu as Columnar Storage.
    • Explanation of columnar storage, with an example.
      • Each column's data is stored separately.
        • Good for analytics: because columns are stored separately, the data stays smaller and only the needed columns have to be read.
    • Handling inserts and updates
      • Please read the white paper for details.
  • Integration.
    • ???
    • Impala integration.
    • MR
  • Performance
    • TPC-H
      • 75-server cluster
      • Results show Kudu is on average 31% faster than Parquet.
    • Xiaomi benchmark results
    • YCSB
      • It shows HBase is still much faster than Kudu for random access.
  • Project status.
  • Kudu community
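
To make the "Range or Hash partitioning" note under "How it works?" more concrete, here is a minimal sketch of how a row key could be mapped to a tablet. It is not Kudu's API or its actual hash function: the names `hash_partition`, `range_partition`, and `tablet_for`, the use of CRC32, and the example columns (`host`, `timestamp`) are all assumptions made for illustration.

```python
import bisect
import zlib
from typing import Sequence, Tuple


def hash_partition(key: str, num_buckets: int) -> int:
    """Map a key to one of num_buckets buckets.

    CRC32 is used only because it is deterministic across runs;
    it is an assumption, not the hash Kudu uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def range_partition(value: int, split_points: Sequence[int]) -> int:
    """Map a value to a range partition given sorted split points.

    With split points [100, 200] the partitions are
    0: value < 100, 1: 100 <= value < 200, 2: value >= 200.
    """
    return bisect.bisect_right(split_points, value)


def tablet_for(host: str, timestamp: int,
               num_buckets: int, split_points: Sequence[int]) -> Tuple[int, int]:
    """Combine both schemes: hash on host, range on timestamp (hypothetical columns)."""
    return (hash_partition(host, num_buckets),
            range_partition(timestamp, split_points))
```

Hashing spreads writes evenly across tablets, while range partitioning keeps neighbouring keys (for example, a time window) together for scans. Combining the two, as in `tablet_for`, is one way a time-series table might get both properties.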
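
The columnar-storage point under "Kudu as Columnar Storage" (an analytic query only touches the columns it needs) can be sketched in a few lines. This is a toy in-memory illustration, not how Kudu lays out data on disk: `sample_rows`, `to_columns`, and `column_sum` are hypothetical names, and every row is assumed to have the same set of fields.

```python
from typing import Any, Dict, List

# Row-oriented layout: each record keeps all of its fields together.
sample_rows: List[Dict[str, Any]] = [
    {"host": "a", "metric": "cpu", "value": 10},
    {"host": "b", "metric": "cpu", "value": 20},
    {"host": "c", "metric": "cpu", "value": 30},
]


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot rows into a column-oriented layout: one list per column."""
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def column_sum(columns: Dict[str, List[Any]], name: str) -> Any:
    """Aggregate a single column without touching any other column."""
    return sum(columns[name])


columns = to_columns(sample_rows)
total = column_sum(columns, "value")  # reads only the "value" list
```

In the row layout, summing `value` means walking every record and skipping over `host` and `metric`. In the column layout, the same aggregate reads one contiguous list, and values of one type stored side by side also tend to compress better.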

Embedded slides link

  • If the slides are made public, I intend to embed a link to them here.

Links to my other entries from Hadoop/Spark Conference Japan 2016

  • I intend to add the links later.
