2013-01-21

#hcj13w Hall A afternoon (1): notes on "Hadoop's Power to Transform Business", Ted Dunning (MapR Technologies)

The overall summary is ➤ here

Following on from "#hcj13w Hadoop Conference Japan 2013 Winter morning Keynote notes - #garagekidztweetz", I will post the afternoon sessions as one entry each.
I will publish my notes on the technology sessions held on the 7F of Big Sight, in order.*1

13:05 (7F International Conference Hall) Hadoop's Power to Transform Business, Ted Dunning (MapR Technologies)

Hcj 2013-01-21 from Ted Dunning

Hadoop allows organizations to better leverage data to improve business results and gain a competitive edge. This session will provide insights into how the combination of scale, efficiency, and analytic flexibility creates the power to expand the applications for Hadoop to transform companies as well as entire industries.

Has been working on OSS continuously since 1975
Mahout, ZooKeeper (ZK)

➤ MapR technologies

Silicon Valley startup

➤ What is history?

The study of the past.
What is the future?
What comes next.
But the future also has a past
Some things turned out as expected
Many things are different
Hadoop has a history and a future
Hadoop also has an old future
- MR and HDFS
- Ecosystem addition

＊

New view of Hadoop
Real-time processing
integration with traditional IT
integration with new tech
fast and flexible computation

➤ Example #1 Search Abuse

for recommendations
(A'A)h
(A'A): the cooccurrence matrix can also be implemented as a search index
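
To make the (A'A)h formula concrete, here is a minimal sketch in plain Python. It assumes A is a user-by-item interaction matrix and h is one user's history vector; the item names and data are hypothetical, and raw cooccurrence counts stand in for whatever weighting Mahout applies in the real pipeline.

```python
# Hypothetical toy data: rows are users, columns are items (1 = user interacted with item).
ITEMS = ["apple", "banana", "cherry", "durian"]
A = [
    [1, 1, 0, 0],  # user 0
    [1, 1, 1, 0],  # user 1
    [0, 1, 1, 0],  # user 2
    [0, 0, 1, 1],  # user 3
]

def cooccurrence(a):
    """Return A'A: entry [i][j] counts users who interacted with both item i and item j."""
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)] for i in range(n)]

def recommend(a, h):
    """Score items as (A'A)h and drop items already present in the history vector h."""
    c = cooccurrence(a)
    scores = [sum(c[i][j] * h[j] for j in range(len(h))) for i in range(len(h))]
    ranked = sorted(
        (i for i in range(len(h)) if h[i] == 0),
        key=lambda i: scores[i],
        reverse=True,
    )
    return [(ITEMS[i], scores[i]) for i in ranked]

# A new user whose history contains only "apple".
print(recommend(A, [1, 0, 0, 0]))
```

Each row of A'A lists the items that tend to appear together with one item, which is why it can be stored as a field of that item's document in a search index.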

＊

how to create this index (MapR approach)
complete history -> coocurrence (Mahout) -> Solr indexing

＊

User history -> Web tier -> Solr search
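
The query side of this flow can be sketched as follows. This is not Solr code: a plain dictionary stands in for the index, a simple overlap count stands in for Solr's relevance scoring, and the names `INDEX` and `search` are hypothetical.

```python
from collections import Counter

# Hypothetical "documents": one per item, with an indicator field listing
# the items that co-occur with it (produced by the offline job).
INDEX = {
    "banana": ["apple", "cherry"],
    "cherry": ["banana", "durian"],
    "durian": ["cherry"],
    "apple": ["banana"],
}

def search(history, index=INDEX, limit=3):
    """Rank items by how many history terms match their indicator field."""
    hits = Counter()
    for item, indicators in index.items():
        if item in history:
            continue  # do not recommend what the user already has
        overlap = len(set(indicators) & set(history))
        if overlap:
            hits[item] = overlap
    return [item for item, _ in hits.most_common(limit)]

# The web tier sends the user's recent history as the query.
print(search(["apple", "cherry"]))
```

The point of the design is that the recommendation request becomes an ordinary search query, so the web tier needs no separate recommendation engine.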

＊

Objective Results
processing time cut from 20 hours per day to 3 hours
recommendation engine load time decreased from 8 hours to 3 minutes
very strong results

➤ Example #2 Web Technology

node.js
very, very simple web tier components

＊

Real time data -> fast analytics (like Storm)
Large analytics (MR)
analytic output
Browser query -> Presentation tier (d3 + node.js)
Objective results
real-time and long-term analytics are seamless
no need to move data
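
One way to picture "seamless" here is a query that reads both outputs from the same place and combines them. The sketch below is an assumption-laden illustration in Python (the talk's presentation tier was d3 + node.js); the counters, page names, and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical outputs: per-page counts from a large batch (MR) job, and from
# a fast streaming job covering events that arrived after the batch ran.
batch_counts = Counter({"/home": 1000, "/about": 120})
realtime_counts = Counter({"/home": 7, "/news": 3})

def page_views(page):
    """Answer a browser query by combining the batch and real-time views."""
    return batch_counts[page] + realtime_counts[page]

def top_pages(n=2):
    """Return the n most viewed pages across both views."""
    return (batch_counts + realtime_counts).most_common(n)

print(page_views("/home"))
print(top_pages())
```

Because both outputs are assumed to live in the same storage, the presentation tier only has to add them up at query time; nothing is copied between systems.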

➤ Example #3: Apache Drill

Stream processing
always streaming
now we can get realtime results using S4 or Storm, etc...
Something in the middle is missing
to fill the missing point
Google Dremel
-> its design principles -> flexible, easy, dependable, fast (4 principles)
they implemented it as a simple architecture
IF -> Query language (SQL 2003) -> transform -> logical language (Drill logical syntax: JSON) -> optimize -> physical plan -> Execute (Scanner API)

＊

ex. Logical Plan Syntax:

＊

ex. Logical Streaming Example

＊

Data flow, logical plan
scan-jo -> filter -> flatten
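
The scan -> filter -> flatten flow can be sketched with Python generators. This is not Drill's logical plan syntax (the slide contents were not captured in these notes); the record layout and function names are hypothetical and only illustrate how each operator streams records to the next.

```python
import json

# Hypothetical input: one JSON record per line.
LINES = [
    '{"user": "a", "tags": ["x", "y"], "score": 5}',
    '{"user": "b", "tags": ["z"], "score": 1}',
    '{"user": "c", "tags": [], "score": 9}',
]

def scan(lines):
    """Parse each line into a record, one at a time."""
    for line in lines:
        yield json.loads(line)

def filter_records(records, predicate):
    """Pass through only the records for which predicate is true."""
    return (r for r in records if predicate(r))

def flatten(records, field):
    """Emit one output record per element of the repeated field."""
    for r in records:
        for value in r[field]:
            out = dict(r)
            out[field] = value
            yield out

def run(lines):
    """Chain the three operators: scan -> filter -> flatten."""
    kept = filter_records(scan(lines), lambda r: r["score"] >= 5)
    return list(flatten(kept, "tags"))

for record in run(LINES):
    print(record)
```

Since every stage is a generator, records flow through one at a time and no stage has to hold the whole data set.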

＊

Execution Plan
aggregate data from node 1, 2, 3, ... and execute
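
As a generic illustration of aggregating across nodes, the sketch below computes a partial result per node and then merges the partials. The notes do not record how Drill actually distributes its execution plan, so the node names, data, and two-step split are assumptions.

```python
from collections import Counter

# Hypothetical data partitions, one per node.
NODES = {
    "node1": ["tokyo", "osaka", "tokyo"],
    "node2": ["osaka", "kyoto"],
    "node3": ["tokyo"],
}

def partial_count(rows):
    """Work done locally on one node: count the values in its own partition."""
    return Counter(rows)

def merge(partials):
    """Work done by the coordinating step: sum the partial counts."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

result = merge(partial_count(rows) for rows in NODES.values())
print(dict(result))
```

Only the small partial counts travel to the merge step, not the raw rows.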

＊

Non SQL queries
scan-json -> k-means join -> cluster feature

➤ Summary

the future is not what we thought it would be
It is much better!

➤ Get Involved

tdunning: SlideShare
ML: drill-dev-subscribe@incubator.apache.org
join the Drill project -> tdunning@maprtech.com
@ted_dunning
join MapR (in Japan) -> jobs@mapr.com

➤ QA

How does it differ from Impala?
The big difference is the community.
Impala comes from only one company.
Drill is in the Apache community, so everyone is freer to join.

＊

How far has development of the Drill project progressed?
Right now only a few features of the IF exist,
but it is expected to cover much more in the near future.

＊

Do you think SQL is the best?
SQL is useful,
but it is hard to say what is best.
It depends on what you want to do and on the data type.
A system designed for SQL does SQL well; doing SQL on a system not designed for it is hard.

＊

In-database analytics feature?
It is possible for Drill to combine data sets from multiple sources.

＊

It is a very flexible and abstract product, which in itself adds complexity and could reduce performance. What about that?
Performance is not a problem; the worry is that more code has to be written.
Impala goes GA this March. Do you think you can compete with it on development speed?
Can't guarantee the delivery timing. It's an OSS community; you can help ;)
- (He also seemed to explain why development speed would not fall behind, but I could not catch all of it.)

✔ #hcj13w Links to my other notes

*1: Going back and forth to the 1F was a hassle, so I attended only the 7F sessions.