
HBase at Facebook: what I heard at #hbasetokyo


Today I attended #hbasetokyo, which was held in Harajuku, Japan, so let me share some of what I learned there.

What was #hbasetokyo like?

A lot of people attended #hbasetokyo*1, most of them dressed casually.
There were two monitors at the front: the right one displayed the English slides, and the left one displayed the Japanese version.

My motivation for attending #hbasetokyo.

Starting this July, I need to evaluate HBase at my company. Before that, though, I want to make clear what kinds of workloads HBase is a good fit for.
I figured that knowing Facebook's HBase use cases would help (they are one of the biggest IT companies in the world right now), so I decided to attend #hbasetokyo.

Impressions.

From Facebook's use cases, I think there are four key points to consider when deciding to use HBase:

  1. You need to analyze big data (PB-scale and beyond) literally in real time.
  2. Consistency is the highest priority (here HBase works better than Cassandra or sharded MySQL).
  3. You need high write throughput.
  4. You see HBase as augmenting MySQL, not replacing it.

Jonathan-san (the speaker of this presentation) said that more detailed information about HBase at Facebook is written here (Apache Hadoop Goes Realtime at Facebook).
I have only glanced through this document so far (sorry...), but it must be worth reading.

I couldn't talk with Jonathan-san in person, but I want to say thank you from here.
Today's presentation was quite impressive for me (especially the Puma use case of HBase at FB).
I will actively use today's knowledge in my work :D

From here on are the memos I took there.

Recorded USTREAM.





Title: Realtime Big Data at Facebook with Hadoop and HBase.
Self-introduction of Jonathan Gray.

Previous life: co-founder of streamy.com,
a realtime social news aggregator,
originally based on PostgreSQL.
Its big data problems led him to Hadoop/HBase.

Software Engineer at Facebook.
Data infra and OSS evangelist.
Develops, deploys, and supports HBase.

Why Apache Hadoop and HBase?

For realtime data?
To analyze it, obviously, but why?

Facebook's stack before HBase:
LAMP.
+ memcached for caching.
+ Hadoop for data analysis.

Powerful stuff, but there are problems with the existing stack.

MySQL is stable, but:

not inherently distributed.
table size limits.
inflexible schema.

Hadoop is scalable, but:
MR is slow and difficult.
does not support random writes.
poor support for random reads.

Specialized solutions Facebook built:
high-throughput persistent key-value store -> Tokyo Cabinet.
large-scale DWH -> Hive.
photo store -> Haystack.
custom C++ servers for lots of other stuff.
# The above are scalable enough, but these systems do not fit every service at FB.

What do we need in a data store?
Requirements for Facebook Messages:
massive datasets, with large subsets of cold data.
elasticity and high availability are important.
need many servers.
strong consistency within a datacenter.
fault isolation.

Some non-requirements:
tolerating network partitions within a single datacenter -> the network is made redundant (doubled) instead.
active-active serving from multiple datacenters.

HBase satisfied their requirements.
In early 2010, engineers at FB compared databases:
Apache Cassandra, Apache HBase, sharded MySQL.

They compared performance, scalability, and features.
HBase gave excellent write performance and good reads.
HBase already included many nice-to-have features.

Introduction to HBase.

What is HBase?
distributed.
designed for clusters of commodity nodes.

Column oriented.
map data structures not spreadsheets.

Multi-dimensional and sorted.
fully sorted on three dimensions.
tables sorted.

Highly available.
tolerant of node failure.

High performance.
log structured for high write throughput.

Originally part of Hadoop.
HBase adds random read/write access to HDFS.

Required some Hadoop changes for FB's usage (see the sketch below):
file appends (with the so-called sync call, other readers can see the data before the file is closed).
HA NameNode.
read optimizations.
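
The append/sync point above is what lets a reader see log data before a file is closed. Below is a minimal sketch of that pattern, assuming the plain Hadoop FileSystem API (my own example, not Facebook's code; the path and payload are made up). hflush() is the newer name for the 0.20-append era sync() call.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSyncSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/logs/commitlog-000001"));
        out.writeBytes("row=...\tvalue=...\n");
        // Push the buffered bytes to the datanodes so that other readers
        // (e.g. a log tailer) can already see them before the file is closed.
        out.hflush();
        out.close();
        fs.close();
    }
}
```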

Plus Zookeeper!

Quick overview of HBase # nice image here.
on HDFS.
Zookeeper.

HBase Key Properties.
High write throughput.
Horizontal scalability.
Automatic failover.
Regions sharded dynamically.
HBase uses HDFS and ZooKeeper (FB already uses them).

High Write Throughput.
Commit log.
Memstore (in memory).

Inserting a new piece of data (see the sketch below):
appended to the commit log -> a sequential write to HDFS.

next it goes into the in-memory memstore.
when the memstore grows big enough, it is finally written to disk.
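
As a concrete illustration of that path, here is a minimal sketch of a single write using the classic HBase Java client (my own example; the table, column family, and values are made up). The client just calls put(); the commit-log append and memstore buffering described above happen on the region server.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutSketch {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "messages");
        Put put = new Put(Bytes.toBytes("user123"));          // row key
        put.add(Bytes.toBytes("m"),                           // column family
                Bytes.toBytes("msg:0001"),                    // qualifier
                Bytes.toBytes("hello"));                      // value
        // Server side: appended to the commit log (WAL) on HDFS, then held
        // in the memstore until a flush writes it out to disk.
        table.put(put);
        table.close();
    }
}
```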

Horizontal scalability.
Regions 0 - infinity.
Shards.
If you add a new node, HBase rebalances regions onto it.

Automatic FO.
HBase automatically reassigns regions (shards of data) on failure.

Applications of HBase at Facebook.

Some specific applications at Facebook.

Regions can be sharded dynamically.

HBase uses HDFS.
We get the benefits of HDFS as a storage system for free:
fault tolerance.
scalability.
checksums fix corruption.
MapReduce.
HDFS is heavily tested and used at PB scale at FB:
3,000 nodes, 40 PB.

Use case 1: Titan.
Facebook Messages.

The new FB Messages:
combines several communication channels into one.

Largest engineering effort in the history of FB.
15 engineers for more than a year.
Incorporates over 20 infrastructure technologies:
Hadoop, HBase, Haystack, ZooKeeper.

A product at massive scale on day one:
hundreds of millions of users.
300 TB.

Messaging Challenges.
High write throughput:
every message, instant message, SMS, and email.
search indexes for all of the above.
Denormalized schema -> if a mail has 10 recipients, the mail is copied 10 times (see the sketch below).
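
To make the denormalization point concrete, here is a hypothetical sketch (not Facebook's actual Titan schema): the same message body is written once per recipient, so reading one user's mailbox stays a single-row operation, at the cost of extra copies.

```java
import java.util.List;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DenormalizedWriteSketch {
    // Row key = recipient, qualifier = message id: one copy of the body
    // is stored for every recipient of the message.
    static void storeMessage(HTable table, String messageId, String body,
                             List<String> recipients) throws Exception {
        for (String recipient : recipients) {
            Put put = new Put(Bytes.toBytes(recipient));
            put.add(Bytes.toBytes("m"), Bytes.toBytes(messageId),
                    Bytes.toBytes(body));
            table.put(put);
        }
    }
}
```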

Massive clusters:
so much data and usage requires a large server footprint.
they do not want outages.

Typical cell layout:
multiple cells for messaging.
20 servers/rack, 5 or more racks per cluster.
Controllers (master/ZooKeeper) spread across racks. # an image was shown here.

Use case 2: Puma.
Before Puma: traditional ETL with Hadoop.
Web tier, many 1000s of nodes
-> Scribe (collects the web logs)
-> HDFS
-> Hive MR
-> SQL
-> MySQL --> the web tier can select from it.
Latency: 15 minutes to 24 hours.

Puma: a realtime ETL system.
Web tier
-> Scribe
-> HDFS (uses the HDFS append feature)
-> PTail (Puma collects the web logs with PTail)
-> HTable
-> HBase --> Thrift --> web tier.
Latency: 10-30 seconds.

Realtime Data Pipeline.
Utilize the existing log aggregation pipeline (Scribe-HDFS).
Extend the low-latency capability of HDFS (Sync + PTail).
High-throughput writes (HBase).

Support for Realtime Aggregation.
Utilize HBase atomic increments to maintain rollups (see the sketch below).
Complex HBase schemas for unique-user counting.
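
A minimal sketch of the atomic-increment pattern, assuming the classic HBase Java client (the table, row, and column names are made up; this is not Puma's actual schema):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class RollupSketch {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "url_metrics");
        // One row per (domain, hour) rollup; the increment is atomic per cell,
        // so many writers can bump the same counter concurrently.
        byte[] row = Bytes.toBytes("example.com#2011062115");
        long clicks = table.incrementColumnValue(row,
                Bytes.toBytes("c"),        // column family
                Bytes.toBytes("clicks"),   // qualifier
                1L);                       // delta
        System.out.println("clicks so far: " + clicks);
        table.close();
    }
}
```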

Puma as Realtime MR.
Map phase with PTail:
divide the input log stream into N shards.
The first version only supported random bucketing.
Now supports application-level bucketing (see the sketch below).

Reduce phase with HBase:
every row/column in HBase is an output key.
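
Here is a hypothetical sketch of what application-level bucketing can look like (my illustration, not PTail/Puma internals): route each log record to one of N shards by hashing a key such as the domain, so that all events for the same key end up on the same worker.

```java
public class BucketingSketch {
    // Map a record's key to a shard index in [0, numShards).
    static int shardFor(String key, int numShards) {
        return (key.hashCode() & 0x7fffffff) % numShards;
    }

    public static void main(String[] args) {
        // All events for "example.com" go to the same shard.
        System.out.println(shardFor("example.com", 16));
    }
}
```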

Puma for Facebook Insights.
Realtime URL/domain insights.
Domain owners can see deep analytics for their site.
Detailed demographic breakdowns.
Top URLs calculated per domain and globally.

Massive Throughput.
Billions of URLs.
1 million counter increments per second.
-> click-count analytics, etc.

Use case 3: ODS.
Facebook Incremental.

Operational Data Store:
system metrics (CPU, memory, IO, network).
application metrics (web, DB, caches).
FB metrics (usage, revenue).

Easily graph this data over time.

Difficult to scale with MySQL:
millions of unique time series with billions of points.
irregular data growth patterns.
MySQL can't automatically shard orz...
-> engineers just end up becoming MySQL DBAs.

Future of HBase at FB.
User and graph data in HBase.

Looking at HBase to augment MySQL:
only single-row ACID from MySQL is used.
DBs are always fronted by an in-memory cache.
HBase is great at storing dictionaries and lists.

DB tier size is determined by IOPS.
Cost savings -> FB succeeded in scaling up MySQL, but it costs too much.
Lower IOPS translates to lower cost.
Larger tables on denser, cheaper, commodity hardware.
MySQL: 300 GB per node, augmented with flash memory.
-> HBase: 2 TB each.

Conclusion.

FB is investing in realtime Hadoop/HBase.
It is the work of a large team of FB engineers,
in close collaboration with open source developers.

Finally, here is my memo of the Q&A session.

At the seminar I was getting tired of taking notes in English by this point, so the memo from here on was taken in Japanese :P

Q. We built a system similar to Puma, but had the problem that the buffer is not protected. What do you do if the data in the buffer is lost?

One option is not to buffer at all:
flush the moment an event arrives.
The second option is checkpointing:
save the offset as a checkpoint.
If the offset is stored in HBase, you can re-read from HBase.
-> It has to be transactional.

Scribe does buffer at times, and data does occasionally get lost.

Q. Why didn't you choose Cassandra or sharded MySQL?

Consistency was what we wanted most.
We wanted to be able to assign multiple shards to a single server.
We had many people who know HDFS well.

Q. Why did Puma scale so much better? What was the big bottleneck in the old system?

The buffering problem was the biggest one.
Processing in Hive:
processing 24 hours of data took more than 24 hours.

Did you change the algorithm?
We changed from a batch style to streaming-style processing.
The algorithm itself is the same.

The problem with MR is that you have to sort and shuffle,
which means moving data around over the network.
With HBase, can that work be done inside HDFS?
Sequential.

Q. You said you are making MySQL scale.

MySQL is for transactional data only.
Simple sharding.
FB tables always have a unique ID.
Joins are done on the application side.

Q. I think the row key is important in HBase, but...

Upgrading a schema:
rewrite all the data.
Create the new layout and write to both places,
then rewrite the old data into the new one with an MR job (see the sketch below).
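
A hypothetical sketch of that dual-write migration pattern (my illustration, not Facebook's code): while the schema change is in flight, every new write goes to both the old and the new table, and an MR job backfills the existing rows into the new layout.

```java
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DualWriteSketch {
    private final HTable oldTable;
    private final HTable newTable;

    DualWriteSketch(HTable oldTable, HTable newTable) {
        this.oldTable = oldTable;
        this.newTable = newTable;
    }

    void write(String userId, String qualifier, byte[] value) throws Exception {
        // Old layout: row key is the plain user id.
        Put oldPut = new Put(Bytes.toBytes(userId));
        oldPut.add(Bytes.toBytes("d"), Bytes.toBytes(qualifier), value);
        oldTable.put(oldPut);

        // New layout: for example, a salted row key (placeholder design).
        Put newPut = new Put(Bytes.toBytes(saltedKey(userId)));
        newPut.add(Bytes.toBytes("d"), Bytes.toBytes(qualifier), value);
        newTable.put(newPut);
    }

    static String saltedKey(String userId) {
        return Integer.toHexString(userId.hashCode() & 0xff) + "#" + userId;
    }
}
```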

Q. What did you pay attention to at the network layer?

Making sure network partitions do not happen within a datacenter.
The upper tier of the network was upgraded to 10 Gb.
Otherwise it becomes network bound.

Q. Do clusters span multiple datacenters?

They do not.
Where multiple datacenters are involved, replication is used.

Q. Twitter, Vertical(?)

Hive, Pig

Q. What is the benefit of HBase's flexible schema?

Titan:
holds all the data in a single column family.
Everything is versioned, and all the messages are kept there.
A row can store millions of messages (see the sketch below).
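
A minimal sketch of reading back multiple versions of one cell with the classic Java client (my own example; names are made up, and the column family has to be configured to keep that many versions):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedReadSketch {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "messages");
        Get get = new Get(Bytes.toBytes("user123"));
        get.addColumn(Bytes.toBytes("m"), Bytes.toBytes("thread42"));
        get.setMaxVersions(100);   // fetch up to 100 timestamped versions
        Result result = table.get(get);
        System.out.println(result.size() + " versions returned");
        table.close();
    }
}
```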

Q. How do you do capacity planning?

A technique called a dark launch(?):
build the new system so it behaves like the old one,
let 10% of users use the new one, and gradually migrate away from the old one.

Q. Is Cassandra used inside FB?

Cassandra is not used at FB.
That is not to say it is bad, though.
The biggest difference is availability;
HBase's consistency is stronger.

Q. Isn't 20 servers per rack too few?

Each server has 20 disks.

Between racks the network is 10 Gbps; within a rack it is 1 Gbps.

Q. Will Puma be open-sourced?

There are two components, and one of them is planned to be open-sourced.
Eventually both.

Q. I want PTail.

There was a discussion about it in California yesterday, but I don't know how it will turn out.
It might be open-sourced within a few weeks, though.

Q. How do you handle replication between datacenters?

We use something we built ourselves.
How much delay does it add?
-> It can be kept small.

There are multiple HBase clusters, but nothing spans clusters; each user belongs to exactly one cluster.

Q. How do you back up HBase?

It is under development right now.
Before Titan, we wrote to two places:
one side serves the workload, the other is the backup.

Snapshots for HBase:
you have to keep a copy.
Simply copying makes you network bound,
so instead of copying over the network, write to local disk and copy asynchronously.

Q. When major compactions cause network latency problems, how do you deal with it?

Reduce how often major compactions happen.
Bloom filters?

Region splits are not done automatically.
In Titan's case, the volume of incoming data is known, so the regions are pre-split in advance (see the sketch below).
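
A minimal sketch of pre-splitting a table at creation time, assuming the classic HBaseAdmin API (table name, family, and split keys are made up, not Facebook's tooling): the regions exist from the start, so no splits have to happen while the known volume of data is loaded.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitSketch {
    public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

        HTableDescriptor desc = new HTableDescriptor("messages_presplit");
        desc.addFamily(new HColumnDescriptor("m"));

        // Three split keys -> four regions covering the whole key space.
        byte[][] splitKeys = {
                Bytes.toBytes("4"), Bytes.toBytes("8"), Bytes.toBytes("c")
        };
        admin.createTable(desc, splitKeys);
    }
}
```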

Q. Concrete examples of what Puma made possible?

Facebook Insights: how many people have seen a given ad.
Behavior history and analysis in the commerce system.

Logs are asynchronous by nature, so they can be used this way.

Q. How do you manage the data that is held in both Hadoop and HBase?

Once Scribe has written it to HDFS,
it can be read from both PTail and Hive.

HBase: The Definitive Guide


Appendix.

Zusaar.

HBase at Facebook on Zusaar

Agenda
Facebook has one of the largest Apache Hadoop data warehouses in the world, 
primarily queried through Apache Hive for offline data processing and analytics. 
However, the need for realtime analytics and end-user access has led to the development of several new systems built using Apache HBase. 
This talk will cover specific use cases and the work done at Facebook around building large scale, low latency and high throughput realtime services with Hadoop and HBase. 
This includes several significant contributions to existing projects as well as the release of new open source projects.


Location data (Google Maps embed).

*1: according to [http://www.zusaar.com/event/agZ6dXNhYXJyDQsSBUV2ZW50GOuRAgw:title=Zusaar], 283 people attended.