id:garage-kid@76whizkidz's lifelog blog!

HBase at Facebook: what I heard at #hbasetokyo



Today I attended #hbasetokyo, held in Harajuku, Japan, so let me share some of what I learned there.

What's #hbasetokyo like?

Many people attended #hbasetokyo*1, most of them dressed casually.
There were two monitors at the front: the right one displayed the English slides, and the left one displayed the Japanese version.

My motivation for attending #hbasetokyo.

Starting this July, I need to evaluate HBase at my company. But before that, I want to be clear about what kinds of demands HBase is a good fit for.
I figured that knowing Facebook's HBase use cases would help me (they are one of the world's biggest IT giants by now), so I decided to attend #hbasetokyo.


From Facebook's use cases, I think there are four key points to consider when deciding to use HBase.

  1. You have a demand to analyze big data (over a PB) literally in realtime.
  2. Consistency is the highest priority (HBase handles this better than Cassandra or sharded MySQL).
  3. You need high write throughput.
  4. You see HBase as augmenting MySQL, not replacing it.

Jonathan-san (the speaker of this presentation) said that more detailed information about HBase at Facebook is written here ( Apache Hadoop Goes Realtime at Facebook ).
I have only glanced at this document so far (sorry...), but it must be worth reading.

I couldn't talk with Jonathan-san in person, but I want to say thank you from here.
Today's presentation by Jonathan-san was quite impressive to me (especially FB's HBase use case in Puma).
I will actively use today's knowledge in my work :D

Here are the memos I took there.

Recorded USTREAM.


Title: Realtime Big Data at Facebook with Hadoop and HBase.
Self-introduction by Jonathan Gray.

Previous life: co-founder of streamy.com,
a realtime social news aggregator,
originally based on PostgreSQL.
Big data problems led them to Hadoop/HBase.

Software Engineer at Facebook.
Data infrastructure and OSS evangelist.
Develops, deploys, and supports HBase.

Why Apache Hadoop and HBase?

For realtime data?
To analyze it, obviously. But why these systems?

FB's system before HBase: MySQL,
+ memcached for caching,
+ Hadoop for data analysis.

Powerful stuff, but the existing stack had problems.

MySQL is stable, but:
not inherently distributed.
table size limits.
inflexible schema.

Hadoop is scalable, but:
MapReduce is slow and difficult.
It does not support random writes.
It has poor support for random reads.

Specialized solutions Facebook took:
high-throughput persistent key-value store -> Tokyo Cabinet.
Large-scale DWH -> Hive.
Photo store -> Haystack.
Custom C++ servers for lots of other stuff.
# The above are scalable enough, but these systems don't fit every service at FB.

What do we need in a data store?
Requirements for Facebook Messages:
massive datasets, with large subsets of cold data.
Elasticity and high availability are important.
Need many servers.
Strong consistency within a datacenter.
Fault isolation.

Some non-requirements:
tolerating a network partition within a single datacenter -> the network is redundant (doubled).
Active-active serving from multiple datacenters.

HBase satisfied their requirements.
In early 2010, engineers at FB compared databases:
Apache Cassandra, Apache HBase, and sharded MySQL.

They compared performance, scalability, and features.
HBase gave excellent write performance and good reads.
HBase already included many nice-to-have features.

Introduction to HBase.

What is HBase?
designed for clusters of commodity nodes.

Column-oriented.
Map data structures, not spreadsheets.

Multi-dimensional and sorted.
Fully sorted on three dimensions (row, column, timestamp).
Tables are sorted.
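As an aside, here is my own toy sketch of that sorted, multi-dimensional map. This only illustrates the (row, column, timestamp) sort order, not HBase's real storage code; the class and names are invented:

```python
from bisect import insort

# Toy sketch of HBase's data model: a sorted map of
# (row key, column, timestamp) -> value. Real HBase stores this
# in HFiles on HDFS; this only illustrates the sort order.
class ToyTable:
    def __init__(self):
        self.cells = []  # kept sorted by (row, column, -timestamp)

    def put(self, row, column, timestamp, value):
        # Newest timestamp sorts first within a (row, column) pair.
        insort(self.cells, ((row, column, -timestamp), value))

    def get(self, row, column):
        # Because of the sort, the first match is the newest version.
        for (r, c, neg_ts), v in self.cells:
            if r == row and c == column:
                return v
        return None

t = ToyTable()
t.put("user1", "msg:subject", 100, "hello")
t.put("user1", "msg:subject", 200, "hello again")
print(t.get("user1", "msg:subject"))  # the newest version wins
```

The point is that one global sort order gives you both efficient range scans over rows and cheap "latest version" reads.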

Highly available.
tolerant of node failure.

High performance.
Log-structured for high write throughput.

Originally part of Hadoop.
HBase adds random read/write access to HDFS.

Required some Hadoop changes for FB's usage:
file appends (with the so-called sync command, other readers can see the data).
HA NameNode.
Read optimizations.

Plus Zookeeper!

Quick overview of HBase. # a nice image was shown here.
Runs on HDFS.

HBase Key Properties.
High write throughput.
Horizontal scalability.
Automatic failover.
Regions sharded dynamically.
HBase uses HDFS and Zookeeper (which FB already used).

High write throughput.
Commit log.
Memstore (in memory).

Inserting a new piece of data:
it is first appended to the commit log -> a sequential write to HDFS.

Next it goes into the in-memory memstore.
When the memstore grows big enough, it is finally written to disk.

Horizontal scalability.
Regions 0 - infinity.
If you add a new node, HBase moves regions onto it.

Automatic failover.
HBase automatically shards data.

Applications of HBase at Facebook.

Some specific applications at Facebook.

Regions can be sharded dynamically.

HBase uses HDFS.
We get the benefits of HDFS as a storage system for free:
fault tolerance,
checksums that detect and fix corruption.
HDFS is heavily tested and used at PB scale at FB:
3,000 nodes, 40PB.

Use case 1: Titan.
Facebook Messages.

The new FB Messages:
several communication channels combined into one.

Largest engineering effort in the history of FB.
15 engineers over more than a year.
Incorporates over 20 infrastructure technologies:
Hadoop, HBase, Haystack, Zookeeper, ...

A product at massive scale on day one.
Hundreds of millions.

Messaging challenges.
High write throughput:
every message, instant message, SMS, and email,
plus search indexes for all of the above.
Denormalized schema -> if a message has 10 recipients, 10 copies are stored.

Massive clusters.
So much data and usage require a large server footprint.
They do not want outages.

Typical cell layout:
multiple cells for messaging.
20 servers/rack, 5 or more racks per cluster.
Controllers (master/Zookeeper) spread across racks. ### image

Use case 2: Puma.
Before Puma: traditional ETL with Hadoop.
Web tier (many 1000s of nodes)
-> Scribe (collects web logs)
-> Hive MR
-> SQL
-> MySQL --> the web tier can then select from it.

Puma: a realtime ETL system.
Web tier
-> Scribe
-> HDFS (uses the HDFS append feature)
-> PTail (Puma collects the web logs via PTail)
-> Puma
-> HTable
-> HBase --> Thrift --> web tier.
10-30 sec end to end.

Realtime data pipeline.
Utilizes the existing log aggregation pipeline (Scribe-HDFS).
Extends the low-latency capabilities of HDFS (sync + PTail).
High-throughput writes (HBase).

Support for realtime aggregation.
Utilizes HBase atomic increments to maintain rollups.
Complex HBase schemas for unique-user counting.
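A rough sketch of what rollup maintenance with atomic counters looks like. Here a plain Python dict stands in for HBase atomic increments, and the counter keys are hypothetical, not Facebook's actual schema:

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Puma-style rollups: each log line bumps counters at several
# granularities (per URL, per domain, global) in a single pass.
# In real Puma these are HBase atomic increments; here, a dict.
counters = defaultdict(int)

def record_click(url):
    domain = urlsplit(url).netloc
    counters[("url", url)] += 1        # per-URL rollup
    counters[("domain", domain)] += 1  # per-domain rollup
    counters[("global", "*")] += 1     # global rollup

for u in ["http://example.com/a", "http://example.com/b",
          "http://example.com/a", "http://other.org/x"]:
    record_click(u)

print(counters[("domain", "example.com")])  # 3
```

Because each increment is atomic, many writers can bump the same rollup concurrently without read-modify-write races.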

Puma as realtime MapReduce.
Map phase with PTail:
divide the input log stream into N shards.
The first version only supported random bucketing.
Now it supports application-level bucketing.

Reduce phase with HBase:
every row+column in HBase is an output key.
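The map phase's application-level bucketing can be sketched as hashing a key to one of N shards, with the reduce side then summing per output key. The shard count and event format below are my own assumptions for illustration:

```python
import hashlib

N_SHARDS = 4  # hypothetical shard count

def shard_for(key: str) -> int:
    # Application-level bucketing: all events for the same key
    # (e.g. a domain) land in the same shard, so a single consumer
    # owns all of that key's counters.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

events = [("example.com", 1), ("other.org", 1), ("example.com", 1)]

# "Map": route each event to its shard.
shards = {i: [] for i in range(N_SHARDS)}
for domain, clicks in events:
    shards[shard_for(domain)].append((domain, clicks))

# "Reduce": each shard sums per output key (row+column in HBase terms).
totals = {}
for bucket in shards.values():
    for domain, clicks in bucket:
        totals[domain] = totals.get(domain, 0) + clicks

print(totals["example.com"])  # 2
```

Random bucketing spreads load evenly but forces a cross-shard merge; keyed bucketing lets one shard finish each key's aggregation locally.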

Puma for Facebook Insights.
Realtime URL/domain insights.
Domain owners can see deep analytics for their site.
Detailed demographic breakdowns.
Top URLs calculated per-domain and globally.

Massive throughput.
Billions of URLs.
1 million counter increments per second.

-> click-count analytics, ...

Use case 3: ODS.
Facebook Incremental.

Operational Data Store.
System metrics (CPU, memory, IO, network).
Application metrics (web, DB, caches).
FB metrics (usage, revenue).

Easily graph this data over time.

Difficult to scale with MySQL.
Millions of unique time series with billions of points.
Irregular data growth patterns.
MySQL can't shard automatically, orz...

-> engineers end up becoming nothing but MySQL DBAs.

The future of HBase at FB.
User and graph data in HBase.

Looking at HBase to augment MySQL:
only single-row ACID from MySQL is used.
DBs are always fronted by an in-memory cache.
HBase is great at storing dictionaries and lists.

DB tier size is determined by IOPS.
Cost savings -> FB succeeded in scaling up MySQL, but it costs too much.
Lower IOPS translate to lower cost.
Larger tables on denser, cheaper, commodity hardware.
MySQL: 300GB each, augmented with flash memory.

-> HBase: 2TB each.

FB is investing in realtime Hadoop/HBase.
This is the work of a large team of FB engineers,
in close collaboration with open source developers.

Lastly, here is my memo of the Q&A session.

By this point in the seminar I was too tired to keep taking notes in English, so the rest of my memo was originally written in Japanese :P

Q. We built a system similar to Puma, but ran into a problem where the buffer is not protected.



Q. Why didn't you choose Cassandra or sharded MySQL?

Consistency was what they wanted most.
They wanted to be able to assign multiple shards to a single server.

Q. Why did Puma scale so much better? What was the biggest bottleneck in the old system?

The processing in Hive.



Q. About making MySQL scalable.

MySQL holds transactional data only.
Simple sharding.

Q. I think the row key is important in HBase, but...

Rewrite the data to the new layout with an MR job.

Q. What did they pay attention to at the network layer?

Making sure network partitions do not happen within a datacenter.

Q. Do clusters span multiple datacenters?


Q. About Twitter and Vertica(?).

Hive, Pig.

Q. What is the merit of HBase's flexible schema?


Q. How do they do capacity estimation?

A technique called "dark launch"?

Q. Is Cassandra used inside FB?

FB does not use Cassandra.

Q. Isn't 20 servers per rack rather few?



Q. Will Puma be open-sourced?


Q. I want PTail.


Q. How do they handle replication between datacenters?

There are multiple HBase clusters, but nothing crosses cluster boundaries; each user belongs to exactly one cluster.

Before Titan, they wrote to two places.


Q. When major compaction causes network latency problems, how do they solve it?

Reduce how often major compaction happens.

Region splits are not performed automatically.

Q. Concrete examples of what Puma made possible.

Facebook Insights: how many people have seen a given ad?
Behavior history and analysis in the commerce system.


Q. How do they manage the data held in duplicate across Hadoop and HBase?

Once Scribe has written it to HDFS,
it can be read from both PTail and Hive.

HBase: The Definitive Guide




HBase at Facebook on Zusaar

Facebook has one of the largest Apache Hadoop data warehouses in the world, 
primarily queried through Apache Hive for offline data processing and analytics. 
However, the need for realtime analytics and end-user access has led to the development of several new systems built using Apache HBase. 
This talk will cover specific use cases and the work done at Facebook around building large scale, low latency and high throughput realtime services with Hadoop and HBase. 
This includes several significant contributions to existing projects as well as the release of new open source projects.

Location data.


*1:According to [http://www.zusaar.com/event/agZ6dXNhYXJyDQsSBUV2ZW50GOuRAgw:title=Zusaar], 283 people attended.