Why Apache Hadoop and HBase?
For realtime data?
To analyze it, obviously, but why these tools?
FB's stack before HBase:
LAMP.
+ memcached for caching.
+ Hadoop for data analysis.
Powerful stuff, but there are problems with the existing stack:
MySQL is not inherently distributed.
table size limits.
inflexible schema.
Hadoop is scalable, but:
MapReduce is slow and difficult.
does not support random writes.
poor support for random reads.
Facebook took specialized solutions:
High-throughput persistent key-value store -> Tokyo Cabinet
Large-scale data warehouse -> Hive
Photo store -> Haystack
Custom C++ servers for lots of other stuff.
# The above scale well enough, but no single system fits every service at FB.
What do we need in a data store?
Requirements for Facebook Messages:
massive datasets, with large subsets of cold data.
Elasticity and high availability are important.
need many servers.
strong consistency within a datacenter.
fault isolation.
Some non-requirements:
tolerating network partitions within a single datacenter -> the in-datacenter network is doubled (redundant).
Active-active serving from multiple datacenters.
HBase satisfied our requirements.
In early 2010, engineers at FB compared databases:
Apache Cassandra, Apache HBase, sharded MySQL.
Compared performance, scalability, and features.
HBase gave excellent write performance and good reads.
HBase already included many nice-to-have features.
Introduction to HBase.
What is HBase?
distributed.
designed for clusters of commodity nodes.
Column-oriented.
think map data structures, not spreadsheets.
Multi-dimensional and sorted.
tables are fully sorted on three dimensions: row key, column, timestamp (see the sketch after this list).
Highly available.
tolerant of node failure.
High performance.
log structured for high write throughput.
Originally part of Hadoop.
HBase adds random read/write access to HDFS.
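The "multi-dimensional sorted map" point above can be pictured like this (an analogy only, not how HBase is actually implemented): a table behaves like nested sorted maps keyed by row, then column, then timestamp. A minimal Java sketch of the idea:

    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Analogy only: HBase's logical model as nested sorted maps.
    // row key -> column ("family:qualifier") -> timestamp -> value
    public class SortedMapModel {
        public static void main(String[] args) {
            NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> table =
                    new TreeMap<>();

            // "put" two versions of one cell: row "user123", column "msg:subject"
            table.computeIfAbsent("user123", r -> new TreeMap<>())
                 .computeIfAbsent("msg:subject", c -> new TreeMap<>())
                 .put(1300000000L, "Hello");
            table.get("user123").get("msg:subject").put(1300000100L, "Hello again");

            // a read returns the newest version: the largest timestamp in the sorted map
            System.out.println(table.get("user123").get("msg:subject").lastEntry());
        }
    }

Because everything stays sorted by row key, scans over adjacent rows are cheap; most HBase schema design leans on that property.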
Required some Hadoop changes for FB's usage:
File appends (with the so-called sync call, other readers can see the data before the file is closed; see the sketch below).
HA NameNode.
Read optimizations.
Plus ZooKeeper!
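A rough idea of what the file-append/sync change buys, sketched against the public HDFS client API (hflush() on FSDataOutputStream; older Hadoop releases exposed it as sync()). The NameNode URI and path are placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: write a record to an HDFS file and make it visible to readers
    // before the file is closed. hflush() was called sync() in older Hadoop.
    public class SyncExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // placeholder URI

            try (FSDataOutputStream out = fs.create(new Path("/logs/commitlog"))) {
                out.writeBytes("one log record\n");
                out.hflush();   // concurrent readers can now see this record
                // ... keep appending and flushing
            }
        }
    }

This is the property a write-ahead log needs: the record is durable and readable even though the file stays open.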
Quick overview of HBase # nice image here.
HBase sits on top of HDFS.
ZooKeeper for coordination.
HBase Key Properties.
High write throughput.
Horizontal scalability.
Automatic failover.
Regions sharded dynamically.
HBase uses HDFS and ZooKeeper (FB already uses them).
High Write Throughput.
Commit log.
Memstore (in memory).
Inserting a new piece of data:
first it is appended to the commit log -> a sequential write to HDFS.
then it goes into the in-memory memstore.
when the memstore grows big enough, it is flushed to disk.
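From the client's point of view the whole write path is just a put; the commit-log append and memstore update happen on the server. A sketch using the 0.90-era HBase client API (HTable/Put), roughly what was current at the time of this talk; the table and column names are made up:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch with the old (0.90-era) client API; names are made up.
    public class WritePathExample {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(HBaseConfiguration.create(), "messages");

            Put put = new Put(Bytes.toBytes("user123"));      // row key
            put.add(Bytes.toBytes("msg"),                     // column family
                    Bytes.toBytes("subject"),                 // qualifier
                    Bytes.toBytes("Hello"));                  // value

            // Server side: append to the commit log (a sequential HDFS write),
            // then update the in-memory memstore; flushed to an HFile later.
            table.put(put);
            table.close();
        }
    }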
Horizontal scalability.
a table's row-key space (0 to infinity) is split into regions.
each region is a shard.
if you add a new node, HBase rebalances regions onto it.
Automatic failover.
HBase automatically reassigns a failed server's regions to other nodes.
Applications of HBase at Facebook.
Some specific applications at Facebook.
Regions can be sharded dynamically.
HBase uses HDFS.
we get the benefits of HDFS as a storage system for free.
fault tolerance.
scalability.
checksums to detect and fix corruption.
MapReduce.
HDFS is heavily tested and used at petabyte scale at FB.
3,000 nodes, 40 PB.
Use case 1: Titan.
Facebook Messages.
The new FB Messages:
combines several communication channels into one.
Largest engineering effort in the history of FB.
15 engineers over more than a year.
incorporates over 20 infrastructure technologies.
Hadoop, HBase, Haystack, ZooKeeper, ...
A product at massive scale on day one.
hundreds of millions.
300 TB.
Messaging Challenges.
High write throughput.
Every message, instant message, SMS, and email.
search indexes for all of the above.
Denormalized schema -> a message with 10 recipients is copied 10 times (see the schema sketch after this list).
Massive clusters.
so much data and usage require a large server footprint.
do not want outages.
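To make the denormalization concrete, a hypothetical sketch (not the actual FB Messages schema): each recipient's row gets its own copy of the message plus keyword-index columns, so reads and searches stay single-row. Family and qualifier names are invented:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical denormalized layout: one row per recipient, each row holds
    // its own copy of the message body plus keyword index columns.
    public class DenormalizedMessage {
        static List<Put> storeMessage(String msgId, String body, List<String> recipients) {
            List<Put> puts = new ArrayList<>();
            for (String user : recipients) {              // 10 recipients -> 10 copies
                Put p = new Put(Bytes.toBytes(user));     // row key = recipient
                p.add(Bytes.toBytes("msg"), Bytes.toBytes(msgId), Bytes.toBytes(body));
                for (String word : body.toLowerCase().split("\\s+")) {
                    // keyword -> message id, so search is a single-row lookup
                    p.add(Bytes.toBytes("idx"), Bytes.toBytes(word + ":" + msgId),
                          Bytes.toBytes(""));
                }
                puts.add(p);
            }
            return puts;
        }
    }

The trade is deliberate: the extra copies cost write throughput and storage, which HBase handles well, in exchange for cheap single-row reads.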
Typical cell layout.
Multiple cells for messaging.
20 servers/rack, 5 or more racks per cluster.
Controllers (master/ZooKeeper) spread across racks. ### image
Use case 2: Puma.
Before Puma: traditional ETL with Hadoop.
Web tier (many 1000s of nodes)
-> Scribe (collects the web logs)
-> HDFS
-> MySQL --> the web tier can then query it.
Latency: 15 minutes to 24 hours.
Puma: a realtime ETL system.
Web tier
-> Scribe -> HDFS (uses the HDFS append feature)
-> Puma (uses PTail to collect the web logs)
-> HBase --> Thrift --> web tier.
Latency: 10-30 seconds.
Realtime Data Pipeline.
Utilizes the existing log aggregation pipeline (Scribe-HDFS).
Extends the low-latency capability of HDFS (sync + PTail).
High-throughput writes (HBase).
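PTail itself is Facebook-internal, so the following is only an illustration of the idea: poll a growing HDFS file, read whatever lies beyond a checkpointed offset, and hand it downstream. (Tailing a file that is still being written needs more care in practice, e.g. around how much data HDFS reports as visible.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative tailer in the spirit of PTail: poll an HDFS file, read
    // anything new past our checkpoint, repeat. The path is a placeholder.
    public class HdfsTailer {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path log = new Path("/logs/clicks/current");
            long offset = 0;                    // checkpoint of what we've consumed
            byte[] buf = new byte[64 * 1024];

            while (true) {
                long visible = fs.getFileStatus(log).getLen();  // grows as writers flush
                if (visible > offset) {
                    try (FSDataInputStream in = fs.open(log)) {
                        in.seek(offset);
                        int n;
                        while (offset < visible && (n = in.read(buf)) > 0) {
                            offset += n;
                            // hand buf[0..n) to the Puma aggregation step here
                        }
                    }
                }
                Thread.sleep(1000);             // poll interval
            }
        }
    }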
Support for Realtime Aggregation.
Utilizes HBase atomic increments to maintain rollups.
Complex HBase schemas for unique-user aggregations.
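Why atomic increments matter: the counter update is a server-side read-modify-write, so many Puma workers can bump the same rollup cell without a get-then-put race. A sketch with the 0.90-era incrementColumnValue() call; the table name, row-key layout, and column names are made up:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: maintain hourly rollups with atomic counters.
    public class RollupCounters {
        public static void main(String[] args) throws Exception {
            HTable counters = new HTable(HBaseConfiguration.create(), "insights");

            byte[] row = Bytes.toBytes("example.com:2011041510");  // domain + hour bucket
            byte[] cf  = Bytes.toBytes("c");

            // Atomic server-side read-modify-write; no client-side get/put race.
            counters.incrementColumnValue(row, cf, Bytes.toBytes("clicks"), 1);
            counters.incrementColumnValue(row, cf, Bytes.toBytes("likes"), 1);

            counters.close();
        }
    }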
Puma as Realtime MapReduce.
Map phase with PTail:
divides the input log stream into N shards.
First version only supported random bucketing.
Now supports application-level bucketing (see the sharding sketch below).
Reduce phase with HBase:
every row+column in HBase is an output key.
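A sketch of the "map" side only: deciding which of N shards (and therefore which Puma node) a log line belongs to. The field layout and shard count are invented; the real PTail/Puma plumbing is internal to Facebook:

    // Route each log line to one of N shards.
    public class LogSharder {
        static final int N = 16;   // number of shards / Puma nodes (made up)

        // first version: random bucketing
        static int randomShard() {
            return (int) (Math.random() * N);
        }

        // now: application-level bucketing, e.g. by domain, so every event
        // for a given domain is reduced on the same node
        static int shardFor(String logLine) {
            String domain = logLine.split("\t")[0];   // assume domain is field 0
            return Math.floorMod(domain.hashCode(), N);
        }

        public static void main(String[] args) {
            System.out.println(shardFor("example.com\t/page\tclick"));
        }
    }

Bucketing by an application key such as the domain means all events for that key land on one node, which is what makes per-domain rollups and unique counts feasible.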
Puma for Facebook Insights.
Realtime URL/domain insights.
Domain owners can see deep analytics for their site.
Detailed demographic breakdowns.
Top URLs calculated per-domain and globally.
Massive Throughput.
Billions of URLs.
1 million counter increments within 1 second.
-> click-count analytics, etc.
Use case 3: ODS.
Facebook's internal Operational Data Store.
System metrics (CPU, memory, IO, network).
Application metrics (web, DB, caches).
FB metrics (Usage, Revenue).
easily graph this data over time.
Difficult to scale with MySQL.
Millions of unique time-series with billions of points.
irregular data growth patterns.
MySQL can't automatically shard, sadly.
-> they end up being nothing but MySQL DBAs.
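One hedged illustration (not necessarily what ODS actually does) of how millions of time series can be laid out so regions shard automatically by metric and time: row key = metric name plus a coarse time bucket, one column per offset inside the bucket:

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical time-series layout (not the real ODS schema):
    // row key  = metric name + hour bucket
    // column   = second offset within the hour
    // value    = the data point
    public class TimeSeriesPoint {
        static Put pointFor(String metric, long epochSeconds, double value) {
            long hourBucket = epochSeconds / 3600;
            long offset = epochSeconds % 3600;

            Put p = new Put(Bytes.toBytes(metric + ":" + hourBucket));
            p.add(Bytes.toBytes("d"), Bytes.toBytes(offset), Bytes.toBytes(value));
            return p;
        }
    }

Because rows sort by metric and time, graphing a metric over a window is a simple range scan, and new metrics become new rows instead of new MySQL shards.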
Future of HBase at FB.
User and Graph Data in HBase.
Looking at HBase to augment MySQL.
only the single-row ACID properties of MySQL are used.
DBs are always fronted by an in-memory cache.
HBase is great at storing dictionaries and lists.
DB tier size determined by IOPS.
Cost savings -> FB managed to scale MySQL, but it costs too much.
Lower IOPS translate to lower cost.
Larger tables on denser, cheaper, commodity hardware.
MySQL, ~300 GB, augmented with flash memory.