#garagekidztweetz

The lifelog blog of id:garage-kid@76whizkidz!

I went to hear the keynote by Doug Cutting, the creator of Hadoop! #ccw #hadoop

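A colleague who saw the above told me that Doug Cutting, the creator of Hadoop, would be giving the keynote at Next Generation Data Center 2011 (subtitled "to leap forward in the new world that lies beyond hardship"), so I went to hear it.
* I would actually have preferred to attend "Cloudera カンファレンス Doug Cutting 講演会(仮) on Zusaar" (a Cloudera conference talk by Doug Cutting, title tentative), which I only learned about later, but I put work first and could not go.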


The venue was the International Forum in Yurakucho.


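I checked in at reception and picked up my attendee badge.

The talk clearly drew a lot of attention: there were scattered empty seats at first, but the room ended up full. We were told that photography was prohibited during the talk, so unfortunately I could not take a picture of Doug.

What follows are the notes I took during Doug's keynote.

◆NOTES◆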

Apache Hadoop: a new paradigm for data processing

Doug Cutting
25-year veteran of Silicon Valley
Xerox PARC, Apple, Excite, Yahoo!, ...
in the last 10 years
mostly working on OSS
1997 wrote Lucene, a full-text search engine
2001 moved Lucene to Apache SW Foundation
2003 founded Apache Nutch
2006 founded Apache Hadoop
currently Architect at Cloudera

the opportunity

data accumulating faster than ever
storage, CPU & NW cheaper than ever
but conventional database tech
isn't priced at commodity HW prices
doesn't scale well to thousands of CPUs & drives

problem: scaling reliably is hard

need to store PBs of data
on 1000s of nodes, MTBF < 1 day
something is always broken
need fault tolerant store
handle HW faults transparently and efficiently
provide availability
need fault tolerant computing framework
even on a big cluster, some things take days
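
The "MTBF < 1 day" line follows from simple arithmetic. A minimal sketch, assuming nodes fail independently at a constant rate; the per-node MTBF of about 3 years is a hypothetical figure chosen for illustration, not a number from the talk:

```python
def cluster_mtbf_days(node_mtbf_days: float, nodes: int) -> float:
    """Expected time between failures anywhere in the cluster.

    Assumes independent nodes with constant failure rates, so the
    cluster-wide failure rate is simply `nodes` times the per-node rate.
    """
    return node_mtbf_days / nodes


# Hypothetical per-node MTBF of about 3 years (not a figure from the talk).
NODE_MTBF_DAYS = 3 * 365

for n in (100, 1000, 4000):
    days = cluster_mtbf_days(NODE_MTBF_DAYS, n)
    print(f"{n:>5} nodes -> a failure roughly every {days:.2f} days")
```

Under these assumptions a cluster of a few thousand nodes sees a failure more than once a day, which is why the store and the compute framework have to tolerate faults on their own.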

problem: bandwidth to data

need to process 100TB dataset
on 1000 node cluster reading from LAN
100 MB all-to-all bandwidth
scanning at 10MB/s: 165min
on 1000 node cluster reading from local drives
scanning at 200MB/s: 8min
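
The scan times in these notes can be reproduced with a few lines of arithmetic. A minimal sketch, assuming the dataset is split evenly across nodes, every node scans its share in parallel, and decimal units are used (1 TB = 1,000,000 MB); the function name is my own:

```python
def scan_minutes(dataset_tb: float, nodes: int, mb_per_s: float) -> float:
    """Minutes for every node to scan its equal share of the dataset in parallel."""
    mb_per_node = dataset_tb * 1_000_000 / nodes  # decimal units: 1 TB = 1,000,000 MB
    return mb_per_node / mb_per_s / 60


lan = scan_minutes(100, 1000, 10)     # each node reads over the LAN at 10 MB/s
local = scan_minutes(100, 1000, 200)  # each node reads its local drives at 200 MB/s
print(f"LAN: {lan:.0f} min, local: {local:.1f} min")
```

This gives roughly 167 minutes over the LAN and a little over 8 minutes from local drives, in line with the rounded 165 min and 8 min figures in my notes.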

Apache Hadoop: a new paradigm

scales to thousands of commodity computers
can effectively use all cores & spindles
new SW stack
built on a different foundation
in use already by many

new paradigms.

commodity HW
sequential file access
sharding of data & computation
automated, high-level reliability
OSS

commodity HW

typically in 2 level architecture
nodes are commodity PCs
30-40 nodes/rack
offers linear scalability
at commodity prices

sequential file access

started building full text indexes in the 80s
first implemented with a B-tree
foundation of RDBs
log(n) random accesses per update
seek time is wasted time
too slow when updates are frequent
instead use batched sort/merge
to index web at Excite in 90s
and in Lucene 2000
transfer time now dominates
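
The batched sort/merge idea can be sketched as follows. This is only an in-memory illustration of the access pattern, not how Lucene or the Excite indexer were actually implemented; the function name and the `(term, doc_id)` record shape are my own assumptions:

```python
import heapq


def merge_batch(index, updates):
    """Merge a batch of (term, doc_id) updates into a sorted index in one pass.

    `index` must already be sorted. The batch is sorted once, then both
    sorted runs are read front to back, so there is no per-update seek.
    """
    return list(heapq.merge(index, sorted(updates)))


index = [("apache", 1), ("hadoop", 1), ("lucene", 2)]
updates = [("nutch", 3), ("hadoop", 3), ("avro", 4)]
merged = merge_batch(index, updates)
print(merged)
```

With a B-tree each update costs log(n) random accesses; here the cost of a whole batch is one sequential read of the old index plus one sequential write of the new one, which is the trade-off the notes describe.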

OSS

reduced costs
free as in beer
development, QA, docs, etc. shared w/ low overhead
better code
publication encourages quality
sharing encourages generality
motivated employees
respect from wider peer pool
Apache SW Foundation
supports diverse collaborative communities
enforces level playing field

Nutch 2002

distribute by sharding URL source
for N nodes, url goes to node hash(url) % N
batch-based
split updates into file per shard
copy shard updates to the shard's node
merge updates w/ existing db there
steps performed manually
begged for automation!
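
The sharding rule and the "file per shard" step can be sketched like this. It is a minimal illustration, not Nutch code: `shard_for` and `split_updates` are hypothetical names, CRC32 stands in for whatever hash Nutch used, and in-memory lists stand in for the per-shard files:

```python
import zlib
from collections import defaultdict


def shard_for(url: str, n_nodes: int) -> int:
    """Node index for a URL, following hash(url) % N.

    CRC32 is used because Python's built-in hash() of a str changes
    between processes, and every process must agree on the shard.
    """
    return zlib.crc32(url.encode("utf-8")) % n_nodes


def split_updates(updates, n_nodes):
    """Split a batch of (url, record) updates into one sorted batch per shard."""
    shards = defaultdict(list)
    for url, record in updates:
        shards[shard_for(url, n_nodes)].append((url, record))
    # Sorting each batch prepares it for a sequential merge on the shard's node.
    return {node: sorted(batch) for node, batch in shards.items()}


updates = [
    ("http://example.com/a", "fetched"),
    ("http://example.org/b", "fetched"),
    ("http://example.com/c", "failed"),
    ("http://example.net/d", "fetched"),
]
batches = split_updates(updates, n_nodes=4)
for node, batch in sorted(batches.items()):
    print(node, batch)
```

Each per-shard batch would then be copied to its node and merged with the existing db there, which are exactly the steps the notes say were performed manually.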

Nutch 2004

Google publishes GFS & MR papers
together, provide automation of
sort/merge+sharding
reliability
we then implemented these in Nutch

Hadoop 2006

Yahoo! joins the effort
split HDFS and MR from Nutch

HDFS

scales well
files sharded across commodity HW
efficient and inexpensive
automated reliability
each block replicated on 3 datanodes
automatically rebalances, replicates, etc...
namenode can have hot spare

MR

simple programming model
generalizes common pattern
# (the slide showed the commonly known MR diagrams)

MR

compute on same nodes as HDFS storage
io on every drive
compute on every core
massive throughput
sequential access
directly supports sort/merge
automated reliability & scaling
datasets are sharded
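
For readers who have not seen the MR programming model, here is a minimal single-process sketch of the pattern using word counting. It does not use the Hadoop API; the function names are my own, and a real job would run the map and reduce calls on many nodes with the framework doing the sort/merge in between:

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map: emit a (word, 1) pair for each word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def run_job(lines):
    # Map phase: each line could be processed on a different node.
    pairs = [pair for line in lines for pair in map_words(line)]
    # Shuffle phase: sorting brings equal keys together (the sort/merge step).
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct key.
    return [
        reduce_counts(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    ]


result = run_job(["hadoop stores data", "hadoop processes data"])
print(result)
```

The user only writes the two small functions; partitioning, sorting, and recovery from failed tasks are what the framework automates.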

Hadoop ecosystem

active, growing community
multiple books in print
commercial support available
expanding NW of complementary tools
# Hive, Pig, etc.
not easy to manage them (dependencies are complicated)

Cloudera's Distribution including Apache Hadoop

packaging of ecosystem
100% apache licensed components
simplified installation, update, etc.
tested, compatible versions

Pig & Hive

higher level query languages
that generate MR jobs
Pig has imperative dataflow language
often used for ETL
Hive uses SQL
often used for DWH

Avro

common data format
expressive schema language
supports evolution
efficient binary encoding
self describing file format
RPC
Java, C, C++, Python, ...
works with MR

Mahout

scalable machine learning library
includes algorithms for
classification, clustering, collaborative filtering / recommendation
most use MR

HBase

inspired by Google's Bigtable
real time DB
data in HDFS
access by primary key, or scan
no indexes built in
massive throughput
works with MR
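
"Access by primary key, or scan" with "no indexes built in" can be pictured with a toy table kept sorted by key. This is an in-memory sketch of the access model only; the class and method names are hypothetical and are not the HBase client API:

```python
import bisect


class SortedKeyStore:
    """Toy in-memory table kept sorted by primary key (not the HBase API)."""

    def __init__(self):
        self._keys = []    # primary keys, always in sorted order
        self._values = {}  # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        """Point lookup by primary key."""
        return self._values.get(key)

    def scan(self, start, stop):
        """Yield (key, value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedKeyStore()
store.put("org.apache/hadoop", "page 1")
store.put("com.cloudera/cdh", "page 2")
store.put("org.apache/avro", "page 3")

print(store.get("com.cloudera/cdh"))
print(list(store.scan("org.apache/", "org.apache0")))
```

Because the only ordering is by primary key, any query that is not a key lookup or a key-range scan has to read everything, which is where pairing the store with MR comes in.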

Flume

data collection framework
distributed, reliable, available
log data, event analysis data
search index, key lookup

Pattern of adoption

initially adopt Hadoop for particular app
for cost effective scaling
then load more datasets & add more users
& find it more valuable for new, unanticipated apps
having all data in one place, usable, empowers
We don't use Hadoop because we have a lot of data, we have a lot of data because we use Hadoop.
-> data-driven
# Makes it easy to test whether your ideas are right.

advantages of paradigm

cost effective
scales linearly on commodity HW
general purpose
powerful, easy to program
low barrier to entry
no schema or DB design required up front
just load raw data & use it

the future is data

storage and compute costs will decrease
Moore's law, etc.
there's more data businesses can collect
more event & context
businesses can use more data to improve
Norvig: "Unreasonable effectiveness of data"
businesses will improve by collecting & analyzing more data
Hadoop is the kernel of a new distributed data OS

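◆END◆OF◆NOTES◆

Impressions:

For better or worse, it was a keynote.

That said, if I had to pick three things I liked about it, they would be the following.

  1. Getting to know Doug Cutting as a person, and his career
    That he has worked at Apple, Excite, and Yahoo!, that he has deep expertise in search, that he was involved with Lucene, and so on.
  2. As a talk for people who do not know Hadoop, it was a good explanation of the changes that are happening (or have happened) in the world of data processing
  3. There was not much of a Cloudera sales pitch (even though I assume selling Cloudera (CDH) was the main reason for this visit to Japan).

There was no time for Q&A, so unfortunately I could not ask Doug Cutting directly about the future of data processing. I plan to follow the notes and write-ups from people who attended "Cloudera カンファレンス Doug Cutting 講演会(仮) on Zusaar", which was held the day before.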

Reference links