#garagekidztweetz

The lifelog blog of id:garage-kid@76whizkidz!

Notes on #hcj13w Hall A, afternoon (1): "Hadoop's Power to Transform Business", Ted Dunning (MapR Technologies)


The overall summary is ➤ here
Following on from "#hcj13w Hadoop Conference Japan 2013 Winter: notes on the morning Keynote - #garagekidztweetz", I am posting the afternoon sessions one entry at a time.
I will publish my notes on the technology sessions held on the 7F of Big Sight, in order.*1

13:05 (7F International Conference Hall) Hadoop's Power to Transform Business, Ted Dunning (MapR Technologies)

Hadoop allows organizations to better leverage data to improve business results and gain a competitive edge. This session will provide insights into how the combination of scale, efficiency, and analytic flexibility creates the power to expand the applications for Hadoop to transform companies as well as entire industries.

  • Has worked on OSS continuously since 1975
  • Mahout, ZK
➤ MapR technologies
  • Silicon Valley startup
➤ What is history?
  • The study of the past.
  • What is the future?
  • It is what comes next.
  • But the future also has a past
  • Some things turned out as expected
  • Many things are different
  • Hadoop has a history and a future
  • Hadoop also has an old future
    • MR and HDFS
    • Ecosystem addition

  • New view of Hadoop
  • Real-time processing
  • integration with traditional IT
  • integration with new tech
  • fast and flexible computation
➤ Example #1: Search Abuse
  • for recommendations
  • (A'A)h
  • (A'A) : the cooccurrence matrix can also be implemented as a search index
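
As a rough illustration of the (A'A)h scoring noted above, here is a minimal sketch in plain Python. It assumes A is a binary user-by-item matrix and h is one user's 0/1 history vector. The function names and the toy data are made up for illustration and are not taken from the talk or from Mahout.

```python
def cooccurrence(histories, n_items):
    """Compute A'A for a binary user-by-item matrix given as per-user item sets."""
    c = [[0] * n_items for _ in range(n_items)]
    for items in histories:
        for i in items:
            for j in items:
                # Each user who touched both i and j adds 1 to entry (i, j).
                c[i][j] += 1
    return c


def recommend_scores(c, h):
    """Compute (A'A)h, where h is a 0/1 history vector for one user."""
    return [sum(row[j] * h[j] for j in range(len(h))) for row in c]


# Hypothetical toy data: four users, four items (item ids 0..3).
histories = [{0, 1}, {0, 1, 2}, {1, 2}, {0, 3}]
C = cooccurrence(histories, 4)
scores = recommend_scores(C, [1, 0, 0, 0])  # a user who has only seen item 0
print(scores)
```

The diagonal of A'A is kept here, so the item the user already has scores highest. A real recommender would probably drop already-seen items before ranking.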

  • how to create this index (MapR approach)
  • complete history -> cooccurrence (Mahout) -> Solr indexing

  • User history -> Web tier -> Solr search
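
The "user history -> search" step can be pictured with a toy inverted index. This is only a sketch under assumptions: each item is treated as a document whose indicator field lists the items it cooccurs with, and the user's history is used as the query terms. It does not call Solr, and the item names, `build_indicator_index`, and `search` are all hypothetical.

```python
from collections import defaultdict


def build_indicator_index(indicators):
    """Invert item -> indicator items into indicator -> items (a toy posting list)."""
    index = defaultdict(set)
    for item, inds in indicators.items():
        for ind in inds:
            index[ind].add(item)
    return index


def search(index, history, top_n=3):
    """Use the user's history as query terms and rank items by matched terms."""
    scores = defaultdict(int)
    for term in history:
        for item in index.get(term, ()):
            if item not in history:  # do not recommend what the user already has
                scores[item] += 1
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Hypothetical output of the offline cooccurrence step.
indicators = {
    "hadoop-book": {"hive-book", "pig-book"},
    "hive-book": {"hadoop-book", "pig-book"},
    "pig-book": {"hadoop-book"},
    "cookbook": {"wine-guide"},
}
index = build_indicator_index(indicators)
result = search(index, {"hadoop-book", "pig-book"})
print(result)
```

Counting matched terms stands in for the relevance score a real search engine would compute. The point is only that a recommendation becomes an ordinary query against an index.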

  • Objective Results
  • processing time cut from 20 hours per day to 3 hours
  • recommendation engine load time decreased from 8 hours to 3 minutes
  • very strong results
➤ Example #2 Web Technology
  • node.js
  • very, very simple web tier components

  • Real time data -> fast analytics (like Storm)
  • Large analytics (MR)
  • analytic output
  • Browser query -> Presentation tier (d3 + node.js)
  • Objective results
  • real-time and long-term analytics are seamless
  • no need to move data
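
One way to picture how a presentation tier could serve a single answer from both the fast path and the large batch path is sketched below. The cutoff-based split and all names are assumptions for illustration only. The notes do not say how the two outputs were actually combined.

```python
from collections import Counter


def batch_counts(events, cutoff):
    """Stand-in for the large (MapReduce-style) job: count events before the cutoff."""
    return Counter(key for ts, key in events if ts < cutoff)


def realtime_counts(events, cutoff):
    """Stand-in for the fast path (Storm-like): count events at or after the cutoff."""
    return Counter(key for ts, key in events if ts >= cutoff)


def query(key, batch, realtime):
    """What a presentation tier might return: one number spanning both layers."""
    return batch[key] + realtime[key]


# Hypothetical (timestamp, page) events.
events = [(1, "page_a"), (2, "page_b"), (5, "page_a"), (9, "page_a"), (11, "page_b")]
batch = batch_counts(events, cutoff=6)
recent = realtime_counts(events, cutoff=6)
total_a = query("page_a", batch, recent)
print(total_a)
```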
➤ Example #3: Apache Drill
  • Stream processing
  • always streaming
  • now we can get realtime results using S4 or Storm, etc...
  • Something in the middle is missing
  • to fill that gap
  • Google Dremel
    -> its design principles -> flexible, easy, dependable, fast (4 principles)
  • they implemented it as a simple architecture
    IF -> Query language (SQL 2003) -> transform -> logical language (Drill logical syntax: JSON) -> optimize -> physical plan -> Execute (Scanner API)

  • ex. Logical Plan Syntax:

  • ex. Logical Streaming Example

  • Data flow, logical plan
  • scan-jo -> filter -> flatten
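
The scan -> filter -> flatten flow can be mimicked with chained Python generators. This is a sketch of the dataflow idea only, not Drill's logical plan syntax or API. It assumes the scan reads one JSON record per line, and the record fields are invented for the example.

```python
import json


def scan_json(lines):
    """Scan: parse one JSON record per input line."""
    for line in lines:
        yield json.loads(line)


def filter_records(records, predicate):
    """Filter: keep only records for which the predicate holds."""
    return (r for r in records if predicate(r))


def flatten(records, field):
    """Flatten: emit one output record per element of a repeated field."""
    for r in records:
        for value in r.get(field, []):
            out = {k: v for k, v in r.items() if k != field}
            out[field] = value
            yield out


# Hypothetical input records.
lines = [
    '{"user": "a", "country": "jp", "tags": ["hadoop", "drill"]}',
    '{"user": "b", "country": "us", "tags": ["storm"]}',
    '{"user": "c", "country": "jp", "tags": []}',
]
rows = list(
    flatten(filter_records(scan_json(lines), lambda r: r["country"] == "jp"), "tags")
)
print(rows)
```

Each stage pulls records from the previous one lazily, so nothing is materialized until the final `list(...)` call.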

  • Execution Plan
  • from node1, 2, 3... aggregate data and execute
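
The idea of gathering data from several nodes and then aggregating can be sketched as a two-phase average: each node computes partial sums and counts, and one place merges them. The node data and function names are hypothetical, and this is not how Drill's physical plan is actually expressed.

```python
from collections import Counter


def local_aggregate(rows):
    """Runs on each node: partial sum and count per key over that node's rows."""
    sums, counts = Counter(), Counter()
    for key, value in rows:
        sums[key] += value
        counts[key] += 1
    return sums, counts


def merge(partials):
    """Runs at the aggregation point: combine partials and compute averages."""
    total_sums, total_counts = Counter(), Counter()
    for sums, counts in partials:
        total_sums.update(sums)      # Counter.update adds values per key
        total_counts.update(counts)
    return {key: total_sums[key] / total_counts[key] for key in total_sums}


# Hypothetical (key, value) rows held by three nodes.
node1 = [("jp", 10), ("us", 4)]
node2 = [("jp", 20)]
node3 = [("us", 6), ("us", 8)]
averages = merge(local_aggregate(rows) for rows in (node1, node2, node3))
print(averages)
```

Shipping (sum, count) pairs instead of raw rows keeps the data sent between nodes small, which is the usual reason for splitting an aggregate this way.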

  • Non SQL queries
  • scan-json -> k-means join -> cluster feature
➤ Summary
  • the future is not what we thought it would be
  • It is much better!
➤ Get Involved
  • tdunning: SlideShare
  • ML: drill-dev-subscribe@incubator.apache.org
  • join the drill project -> tdunning@maprtech.com
  • @ted_dunning
  • join MapR (in Japan) -> jobs@mapr.com
➤ QA
  • What is the difference from Impala?
  • The big difference is the community.
  • Impala comes from a single company.
  • Drill is an Apache community project, so everyone is freer to join.

  • How far has development of the Drill project progressed?
  • right now only a small part of the IF is implemented
  • but it is expected to cover much more in the near future

  • Do you think SQL is the best?
  • SQL is useful.
  • but it is hard to say what is best.
  • it depends on what you want to do and on the data type.
  • design to do SQL good, not design to do SQL is hard.

  • In-database analytics features?
  • it is possible for Drill to combine data sets from multiple sources.

  • Such a flexible and abstract product tends to become complex in itself, which can decrease performance. What about that?
  • performance is not a problem; the worry is rather the need to write more code.
  • Impala goes GA this March. Do you think you can compete with it on development speed?
  • can't guarantee the delivery timing; it's an OSS community, and you can help ;)
    • (It sounded as if he also explained why development speed would not fall behind, but I could not catch all of it)

*1: Going back and forth to the 1F was a hassle, so I attended only the 7F sessions