2012-01-18

Notes from #mongotokyo, a very educational event

MongoDB conference

They say a conference isn't over until you've blogged about it, so...

I attended MongoDB Tokyo 2012 and would like to share the notes I took.
Note: slides are added here as the speakers publish them on SlideShare and elsewhere.

Today's MongoDB Tokyo 2012 was held at Rakuten Tower in Shinagawa Seaside.

I received a MongoDB sticker, a strap from NAVER, and a pamphlet.

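To give my impressions first:
every session was extremely useful and informative.
I thought "Large-Scale Log Analysis Starting with mongo-hadoop: A New Path to Low Cost (@muddydixon Daichi Morifuji)" was especially good.
~~If the slides are published, I think they will be a must-see.~~
The slides have been published, and I have embedded them in this entry. They are a must-see.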

Below are the notes I took.

Contents.

Opening.

1:00 pm - 1:45 pm Welcome to Mongo Tokyo and Overview of MongoDB (Max Schireson President, 10gen)

1:55 pm - 2:40 pm Basic Application and Schema Design with MongoDB (Alvin Richards, Senior Director of Service & Enterprise Engineering)

2:50 pm - 3:35 pm Effective MongoDB Deployment Architecture (Alvin Richards, Senior Director of Service and Enterprise Engineering)

3:45 pm - 4:30 pm All-in-One NoSQL Combining KVS Performance, RDBMS Indexes, and MapReduce: MongoDB (@crumbjp Hiroaki Kubota)

4:40 pm - 5:25 pm Case Study: MongoDB in a Certain Photo-Sharing App (@just_do_neet Tetsuya Ohira)

5:35 pm - 6:20 pm Large-Scale Log Analysis Starting with mongo-hadoop: A New Path to Low Cost (@muddydixon Daichi Morifuji)

6:30 pm - 7:15 pm A Case Study of Applying MongoDB to Social Games (@matsukaz Masakazu Matsushita)

✔ Opening

hashtag: #mongotokyo

An official 10gen account, @mongodb_jp, has been created. Good news for Japan will be posted from this account from now on. Please follow it and look forward to future updates. #mongotokyo

2012-01-18 14:32:38 via YoruFukurou

✔ 1:00 pm - 1:45 pm Welcome to Mongo Tokyo and Overview of MongoDB (Max Schireson President, 10gen)

▶ Background

Founded by Dwight Merriman (DoubleClick CTO) and Eliot Horowitz (DoubleClick engineer)
Founded four years ago.
$31 million

▶ RDBMS has been an important tech for a long time

Since 1970
Technology has changed a lot since then.

▶ Recent changes

Big data

Millions to billions of records
Hundreds of thousands to millions of queries per second

Cloud - economics

elasticity

Developer empowerment

Developers have a strong voice
- democratic

▶ New DB Requirement

Horizontally scalable

Runs on 100s or 1000s of commodity servers

Agile

Smaller teams respond to more in less time

▶ New requirements lead to a new architecture

Horizontally scalable

Sharding
Replica sets for greater reliability

Document oriented

new data model required to avoid distributed joins and multi statement transactions
JSON for developer productivity

▶ MongoDB is the leading NoSQL Database

Inspired by custom data stores created at DoubleClick to handle 400,000 advertisements per second
Now the market leader
Over 1 million downloads

▶ Partnership - some examples

Amazon
MS
Red Hat
VMware

▶ Leading Commercial Customers

Internet and Tech

eBay, Intuit, SAP

Enterprise

Disney, Ericsson, MTV ...

Government

UK National Archives, India National ID

ー
and some Japanese users as well

▶ Why Customers Are Choosing MongoDB

Capability
Economics

servers, storage, SW

▶ MongoDB in Japan

10gen knows database technology well
and is now learning the Japanese market
They think MongoDB may be a good fit for the Japanese market
They look forward to finding good partners in Japan

No QA

✔ 1:55 pm - 2:40 pm Basic Application and Schema Design with MongoDB (Alvin Richards, Senior Director of Service & Enterprise Engineering)

▶ Topics

Schema design is easy
Common Patterns

▶ Use MongoDB with your language

10gen-supported drivers

Ruby, Python, Perl ....

Object data mappers

Morphia - java
Mongoid, MongoMapper - Ruby

▶ So today's example will use

The presenter's favorite book

▶ Design your objects in your code (code sample)

Java using Driver

get a connection to the DB
Create the object
insert the object into MongoDB

▶ Java using object data mapper

Use Morphia to create the datastore
Create the object
Insert the object into MongoDB

The point emphasized was MongoDB's simplicity.

▶ Terminology

Difference between RDBMS and mongoDB
Schema design in a relational database

ER diagram

MongoDB schema design

Alternatives for the same schema design
embedding, linking

▶ Design Session

Design documents that simply map to your application

ID must be unique, but can be anything you'd like

▶ Add an index, find via index

1 means ascending, -1 means descending

▶ Examine the query plan

B-tree cursor

▶ Query operators

Conditional operators
find posts with any tags
(like a like query)

▶ Extending the Schema

Schemaless
MongoDB will not enforce a schema
No need for a painful conversation with the DBA when you want to add a column
update statement

$push operator
$inc operator

Create an index on nested documents
Find the last 5 posts
Find the most commented post

▶ Common Patterns

Inheritance
how to implement in the RDBMS - sample
When using MongoDB - sample

Missing values are not stored (there is no need to store them)

How to find the data - sample.
find shapes where radius > 0

Index only the values that are present
- indexes are very small

▶ One to many relationship

A one-to-many relationship can express

degree of association between objects
containment
...

Embedded Array

$slice operator to return subset of comments
some queries harder
- e.g. find latest comments across all blogs
> alternatives

Normalized (2 collections)
Now you can fetch the latest 3 records across all blogs

most flexible
more queries
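
The trade-off between the two one-to-many patterns can be sketched in plain Python. This is only an in-memory illustration, not the presenter's code: the documents, field names (`comments`, `blog_id`, `ts`) and helper functions are all assumptions made for the example.

```python
# Hypothetical in-memory stand-ins for MongoDB documents; every name is illustrative.

# Pattern 1: embedded array. Comments live inside the blog document.
embedded_blog = {
    "_id": 1,
    "title": "Hello",
    "comments": [
        {"author": "a", "text": "first", "ts": 1},
        {"author": "b", "text": "second", "ts": 2},
        {"author": "c", "text": "third", "ts": 3},
    ],
}


def last_comments(blog, n):
    """Return the last n embedded comments, like a $slice projection of -n."""
    return blog["comments"][-n:]


# Pattern 2: normalized. Comments sit in their own collection and point at a blog.
comments = [
    {"blog_id": 1, "text": "first", "ts": 1},
    {"blog_id": 2, "text": "other", "ts": 5},
    {"blog_id": 1, "text": "second", "ts": 2},
    {"blog_id": 2, "text": "newest", "ts": 9},
]


def latest_across_blogs(all_comments, n):
    """Latest n comments over every blog: a single sort on one collection."""
    return sorted(all_comments, key=lambda c: c["ts"], reverse=True)[:n]
```

With the embedded form, the "latest comments across all blogs" question would need every blog document to be read; the normalized form answers it with one sort but needs a second query to load a blog together with its comments.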

ー
Summarize
One to many - patterns
Embedded array / array keys

▶ Many - Many

Example - products and categories Original design
Alternative design
Discussed the pros and cons of each.

▶ Tree Pattern

Hierarchical information
Blog - Comments <- Reply
ー
Full tree in Document
Pros. Single Document , Performance , intuitive
Cons. Hard to search , Partial result, 16MB limits

▶ Array of Ancestors

A
B, E
C, D, F
ー
Alternative
Store all Ancestors of a node
Find all threads that "b" is in.
Find all messages that reply directly to "b"
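
A minimal sketch of the array-of-ancestors idea in plain Python. The tree shape (A at the root, B and E below it, C and D under B, F under E) is my reading of the three rows above and may not match the slide exactly; the field names `ancestors` and `parent` are assumptions for illustration.

```python
# Each node stores the ids of all its ancestors plus its direct parent.
# The tree shape and field names are assumptions made for this sketch.
nodes = [
    {"_id": "a", "ancestors": [], "parent": None},
    {"_id": "b", "ancestors": ["a"], "parent": "a"},
    {"_id": "c", "ancestors": ["a", "b"], "parent": "b"},
    {"_id": "d", "ancestors": ["a", "b"], "parent": "b"},
    {"_id": "e", "ancestors": ["a"], "parent": "a"},
    {"_id": "f", "ancestors": ["a", "e"], "parent": "e"},
]


def descendants_of(node_id):
    """Everything below node_id: one membership test on the ancestors array."""
    return sorted(n["_id"] for n in nodes if node_id in n["ancestors"])


def direct_replies_to(node_id):
    """Only the immediate children of node_id."""
    return sorted(n["_id"] for n in nodes if n["parent"] == node_id)
```

Both lookups are a single pass with a simple predicate, which is the kind of query an index on an array field can serve.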

▶ Tree as Paths

Store hierarchy as a path expression

Separate each node with a delimiter, e.g. ","
...
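
The path-expression variant can be sketched the same way. This is an in-memory illustration under assumed names (`path_nodes`, `path`, `subtree`); in a database the prefix test would typically be an anchored pattern match on the path field.

```python
DELIM = ","

# Each node stores its full path from the root as one delimited string.
# The sample tree and field names are assumptions made for this sketch.
path_nodes = [
    {"_id": "a", "path": "a"},
    {"_id": "b", "path": "a,b"},
    {"_id": "c", "path": "a,b,c"},
    {"_id": "e", "path": "a,e"},
]


def subtree(path):
    """Ids of all nodes strictly below the node whose path is given."""
    prefix = path + DELIM  # trailing delimiter keeps "a" from matching "ab"
    return sorted(n["_id"] for n in path_nodes if n["path"].startswith(prefix))


def depth(node):
    """Depth of a node: the root has depth 0."""
    return node["path"].count(DELIM)
```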

▶ Queue

Need to maintain order and state
Ensure that updates are atomic
ー
find highest priority job and mark as in-progress
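
The queue pattern boils down to "select and mark in one atomic step". The sketch below imitates that with a lock around an in-memory list; the class, field names and states are assumptions for illustration. In MongoDB itself the findAndModify command is the usual way to get the same find-and-update atomicity on a single document.

```python
import threading


class JobQueue:
    """In-memory stand-in for a jobs collection (illustrative only)."""

    def __init__(self, jobs):
        self._jobs = [dict(j) for j in jobs]
        self._lock = threading.Lock()

    def claim_next(self, worker):
        """Atomically pick the highest-priority pending job and mark it in-progress."""
        with self._lock:  # find and update happen as one step
            pending = [j for j in self._jobs if j["state"] == "pending"]
            if not pending:
                return None
            job = max(pending, key=lambda j: j["priority"])
            job["state"] = "in-progress"
            job["owner"] = worker
            return dict(job)
```

Because the selection and the state change share one critical section, two workers can never claim the same job.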

▶ Summary

Schema design is different in MongoDB
Basic data design principles stay the same
Focus on how the application manipulates data
Rapidly evolve the schema to meet your requirements
...

▶ QA

1. What happens when multiple clients access the same record?

Compare version numbers
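
Comparing version numbers is the optimistic concurrency pattern: a write only succeeds if the document still carries the version the client read. A minimal in-memory sketch follows; the `store` dict, the `version` field and the function name are assumptions for illustration, and a real deployment would put the version check into the update's query condition so the check and the write are one atomic operation.

```python
# Illustrative in-memory document store keyed by id.
store = {"doc1": {"version": 1, "title": "draft"}}


def update_if_unchanged(doc_id, expected_version, changes):
    """Apply changes only if nobody else has bumped the version since we read it."""
    doc = store[doc_id]
    if doc["version"] != expected_version:
        return False  # someone else wrote first; caller must re-read and retry
    doc.update(changes)
    doc["version"] = expected_version + 1
    return True
```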

2. Write management

Several clients
writing to one collection
or several collections
at the same time.
ー
MongoDB has a blocking mechanism.
Blocking serializes access to the system

3. permission for different collection

4. All writes are managed by a single lock

but the lock is not held for long
Per-collection or per-DB locking is expected in the future.

5. How to embed

How to implement it depends on your application

6. Transaction

MongoDB guarantees atomicity only for a write to a single document.
If a series of writes fails partway, the writes before the failure are applied but those after it are not.
If you need rollback, you have to implement it yourself in your application.
Let's learn how MongoDB approaches transactions: Atomic Operations URL, Perform Two Phase Commits URL #mongotokyo
2012-01-18 14:41:19 via YoruFukurou

✔ 2:50 pm - 3:35 pm Effective MongoDB Deployment Architecture (Alvin Richards, Senior Director of Service and Enterprise Engineering)

▶ Five things to think about

Data protection
Machine sizing
Load testing and monitoring
Backup and restore
Handbook

▶ Types of outage

Planned

HW upgrade
OS or FS tuning
relocation

ー
Unplanned

HW failure
Data center failure

▶ Replication features

a cluster of N servers
any single node can be primary
consensus election of primary
automatic failover
automatic recovery
all writes to primary
reads can be to primary or secondary

▶ How mongoDB replication works

3-node cluster
Members contact each other
An election establishes the primary
ー
When the primary fails
surviving members contact each other
and choose the next primary
ー
member2 becomes primary automatically
ー
When the previous primary comes back
it recovers automatically
and replication is re-established.
The previous primary does not become primary again at this point

▶ Typical Configurations

Single MongoDB
If it fails
you can't read data and can't write data
ー
Replica set with two nodes
When one node dies
the application can read
but cannot write
# a majority must exist to elect the leader
ー
Three member replica set
one node dies
Application can read and write
ー
Replica set
2 in London
2 in SF
1 in NYC
if London fails
application can read and write
ー
All of these happen automatically
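
The configurations above all follow from one rule: a primary can only be elected while a strict majority of the set's members can reach each other. A small sketch of that rule, with a function name of my own choosing:

```python
def can_elect_primary(total_members, alive_members):
    """True when the surviving members form a strict majority of the set."""
    return alive_members > total_members // 2
```

Plugging in the cases from the talk: two members with one down gives `can_elect_primary(2, 1)`, which is false, so the set stays readable but not writable; three members with one down, or the five-member set that loses its two London nodes, still has a majority.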

▶ HW sizing

How many machines do you need to order?
ー
Collection1
index1
Virtual address Space1

> mapped to Physical RAM
> Physical RAM is mapped to Physical Disks

▶ Sizing RAM and Disk

Working set
Document size
Memory versus Disk
Data lifecycle patterns
# important to un

Long tail
Pure random
Bulk remove

▶ Figuring out working set

Easy to find out from stats.
db.blogs.stats
Size of data
Average document size
size on disk (and in memory)
Size of all indexes
Size of each index

stats takes a lock, so you can't really call it in a production environment, can you? #mongotokyo

2012-01-18 14:46:12 via Twitter for iPhone

▶ Data configurations

Comparing Single disk,
RAID0 configuration
RAID10 configuration ( n number of striping )

▶ SSD?

Seek time of 0.1ms vs 5ms
But expensive
ー
A mixture of SSD and HDD is also seen
Data that needs performance moves to SSD; data that does not stays on HDD

▶ Key Points

Know how important page faults are

If you want low latency, avoid page faults

Size memory appropriately

To avoid page faults, fit everything in RAM
Collection data + index data

Provision disk appropriately

RAID10 is recommended
SSD are fast if you can afford them
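
The sizing rule "collection data + index data should fit in RAM" is simple arithmetic. The helper names below are my own, and the sketch deliberately ignores memory taken by the OS and other processes, so treat the result as a lower bound rather than a sizing guarantee.

```python
GIB = 1024 ** 3


def estimated_data_bytes(doc_count, avg_doc_bytes):
    """Rough data size from document count and average document size."""
    return doc_count * avg_doc_bytes


def fits_in_ram(data_bytes, index_bytes, ram_bytes):
    """True if data plus indexes fit in the given amount of RAM."""
    return data_bytes + index_bytes <= ram_bytes
```

For example, 10 GiB of data with 2 GiB of indexes fits on a 16 GiB machine, while 15 GiB of data with the same indexes does not.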

▶ Monitoring is your friend

To understand MongoDB deeply, you need to think about monitoring from the very first stage.

▶ Monitoring Tools

mongostat
MMS! - 10gen's tool
munin, cacti, nagios, zabbix

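For monitoring MongoDB, "Server Density" URL is also recommended. @davidmytton came to Japan and presented at the previous MongoDB study meeting. They are actively looking for customers in Japan, so please consider them! #mongotokyo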

2012-01-18 14:57:11 via YoruFukurou

▶ Load testing: understand what you think the system should do

load and test your hypothesis

use the DB profiler
- Trend

use a trending monitoring tool to analyze - MMS, munin, etc...

▶ Backups

mongodump versus snapshot
Restore a member versus whole rep set versus whole cluster
Don't forget your config dbs in a sharded system

▶ Plan for the worst

Not everything will go to plan
Have an operations handbook
Practice basic procedures
backup & restore, failing over a node
rolling upgrade
ー
Nevertheless, your boss will call you when something goes wrong with your system.

▶ QA

1. What happens when the primary goes down while it holds data that has not been propagated?

2. memory map file, size limitation (2GB)

3. maximum number of nodes

It is OSS, so they cannot know everything people are doing; the largest they know of is in the hundreds
2PB, 260 nodes
MongoDB is a CPU-intensive system, so if you put it on a machine that previously ran Oracle or MySQL, you may see it become CPU-bound.
But you will also see good memory access

4. With auto-sharding

I heard of a bug where the number of nodes is counted incorrectly. Is it fixed?
Counting nodes is not well optimized at the moment.

5. Write operations when the master fails

Mongo writes asynchronously

6. multiple replica set

all reads are going to primary (default)

✔ 3:45 pm - 4:30 pm All-in-One NoSQL Combining KVS Performance, RDBMS Indexes, and MapReduce: MongoDB (@crumbjp Hiroaki Kubota)

MongoTokyo

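Rakuten: Introduces MongoDB's characteristics and ways to take advantage of them through cases such as Rakuten's Infoseek News. Also shares the results of feature and performance verification and operational know-how, and publishes patches for the PHP driver.

Slides are up. They are still misaligned and odd in places... but the key points look fine, so oh well. URL #mongotokyo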

2012-01-18 23:46:59 via web

▶ introduction

▶ MongoDB characteristics

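Suited to read-intensive workloads.
Writes are not as good
Reading data immediately after it is written is not good

MongoDB is very fast for reads. Writes, however, are not as fast. The current version takes a global lock, so you need to be careful when using it for write-heavy workloads. #mongotokyo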

2012-01-18 16:01:54 via TweetDeck

▶ Our new system Cockatoo

A news delivery service
Layout and Component
Sessions were moved to memcached (discussed later)

▶ Cockatoo system diagram(image)

5 nodes of 1-core VMs
developer

CMS

ContentsDB(mongo)

▶ MapReduce

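Rakuten is not really making use of it yet
It is used a little, for example when deleting articles
They want to use it more when expanding into social features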

▶ Benchmark test

Formulas
C = number of cores
DD = dd command performance (byte/sec)
S = document size (byte)

Get qps = 4500 x C
Set bytes/s = 0.05 x DD / S
Set (nsync) qps = 4500
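
The first two rules of thumb can be written down directly. The function names are mine, and the constants are simply the ones from the notes, so this is a transcription of the speaker's empirical formulas rather than anything authoritative. Note that DD / S divides bytes per second by bytes per document, so the second value is dimensionally a count of documents per second.

```python
def get_qps(cores):
    """Estimated read throughput: 4500 queries per second per core."""
    return 4500 * cores


def set_rate(dd_bytes_per_sec, doc_size_bytes):
    """Estimated write rate: 5% of raw dd throughput divided by document size."""
    return 0.05 * dd_bytes_per_sec / doc_size_bytes
```

As a worked example, a 4-core machine comes out at 18,000 reads per second, and a disk that sustains 100 MB/s under dd with 1,000-byte documents comes out at about 5,000 writes per second.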

ー
Shared the AWS instance sizes used
Cost of adding nodes
Introduced the queries used for the performance evaluation
Group by x 60 million documents → 52 min

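▶ MongoDB pitfalls

Diagram
Indexing

Lock operation
Runs first on the primary
then runs on the secondaries → as a result, the service stops

Running it in the background does not help either
ー
Manual index
Even after indexing finishes on the primary, build the index on the secondaries yourself rather than automatically
(I was thinking of proposing this, but)
→ 10gen already plans to implement it
ー
The stale problem
A replica set problem
Secondaries sync by reading the oplog
The oplog is used circularly, so when it is too busy, old entries are erased before they have been synced
The oplog size is fixed at the size set when MongoDB is first started
In other words, the initial design is important
When the oplog overflows, the member goes to look at another secondary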
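
The stale problem is easiest to see with a fixed-size circular log. The class below is a toy model, not MongoDB's implementation: the sequence numbers, method names and capacity measured in entries (rather than bytes) are all assumptions for illustration.

```python
from collections import deque


class Oplog:
    """Toy circular operation log with a fixed capacity in entries."""

    def __init__(self, capacity):
        self._entries = deque(maxlen=capacity)  # oldest entries fall off the front
        self._next_seq = 0

    def append(self, op):
        self._entries.append((self._next_seq, op))
        self._next_seq += 1

    def oldest_seq(self):
        """Sequence number of the oldest entry still available."""
        return self._entries[0][0] if self._entries else self._next_seq

    def is_stale(self, last_applied_seq):
        """A secondary is stale if the next entry it needs was already overwritten."""
        return last_applied_seq + 1 < self.oldest_seq()
```

With a capacity of 3 and five writes, entries 0 and 1 are gone; a secondary that last applied entry 0 can no longer catch up from this log, while one that applied entry 1 still can. A larger log, or a slower write rate, widens that window.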
ー
Disk Space
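Heavy updating of data (inserts, deletes) causes fragmentation
↓
Processing like PostgreSQL's VACUUM
Compaction
They had been using it to store session data
→ Shared a workaround (an application of step down)
ー
PHP problems
1.1.4
Does not discard the connection pool when it breaks
1.2.2
Socket Handle leak
etc ..
The 1.2 series is recommended
→ PHP patch on GitHub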

▶ Summary

High read performance
Good durability
Sharding is very challenging

operation is hard

Can work on a cheap environment
Need to consider high write throughput

▶ QA

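1. Why not use MongoDB for session data?

Because the oplog overflows

2. Reason for fsync=true

To prevent members from going stale
If responses are too fast, the oplog gets used up and old data is erased

3. Checking the contents of the oplog

4. Resizing the oplog

Create secondaries with a larger oplog and swap them in

5. You seem to use it for a read-heavy service; would you also recommend it for services with heavy writes?

It depends on what the writes are.

6. indexing

You don't add new indexes at times such as application changes, right? That is the same as with an RDBMS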

✔ 4:40 pm - 5:25 pm Case Study: MongoDB in a Certain Photo-Sharing App (@just_do_neet Tetsuya Ohira)

Presentation at MongoDB Tokyo 2012

NHN Japan: Case Study: MongoDB in a Certain Photo-Sharing App
The emphasis is on presenting case studies

▶ self intro

▶ Intro of NHN Japan

Korean Company
Japanese subsidiary

search (NAVER)
mobile app

▶ Position talk

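Easy-to-use storage
Schemaless, document-oriented
Data that is hard to pin down to a single schema
ー
MongoDB
A DB that scales
Tough to use for Big Data
There are still few MongoDB experts
ー
Large-scale operation is difficult
ー
There are 3 kinds of enterprise DB
Mongo is the third kind
A place for casual experimentation
It lets you do things you could not do before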

▶ introduction of examples of using MongoDB

▶ NAVER photo album

iOS and Android app also.
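Social features such as the timeline (TL)
MongoDB is used precisely for the TL
ー
The challenges
Complexity
Many activity types, each carrying different information
Relationships between users, also of several kinds
Searching for and deleting past activities
ー
With a DB, you would have to define a schema for each activity type
↓
With MongoDB, you can handle that
Operations are simple
★ Samples of the queries actually issued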

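▶ The story of a certain messenger app (LINE)

They evaluated MongoDB but could not use it
ー
Explanation of what kind of app it is
Famous for its TV commercial featuring Becky
Popular in the Middle East for some reason
10 million downloads in half a year
: awfully modest...
ー
Requirements
Scalable
↓
Compared MongoDB and HBase
ー
MongoDB
Auto sharding, easy to configure
Data bloat is a worry
HBase
Proven track record, e.g. Facebook
Allows fine-grained tuning (memory, compaction)
SPOF
: though I think this is no longer the case
ー
Evaluation → see the slides for details.
Evaluated with 15 machines. In the end a few more were used (because they could not handle the load).
For NAVER's use case, HBase gave the better results.
HBase did better on writes
Reads were about equal (simple lookups by id)
They seem to use the network differently
Mongo appeared to hit a ceiling, as if it were capped
HBase appears to sync periodically
HBase's support for compression was also a plus
ー
In the end they used HBase
Data Size
Network
were the bottlenecks for MongoDB
Note: this is not to say it cannot be used at large scale
★ HBase is currently in the 100TB class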

▶ Data analysis for a certain web service

Wanted to do near-real-time analysis
↓
fluentd, node.js + MongoDB
ー
Explanation of fluentd

Here it comes! A talk on MongoDB + #fluentd + Node.js! NAVER's fluentd introduction material is here: URL Our company's blog entry on fluentd: URL #mongotokyo

2012-01-18 16:52:11 via YoruFukurou

There is also an article on the official Treasure Data blog about combining MongoDB and Fluentd! > URL #mongotokyo

2012-01-18 17:29:50 via YoruFukurou

The speaker's impression is that the plugin mechanism is a big selling point.
It can write to MongoDB without putting excessive load on it
ー
real time analytics system architecture
Components used
apache
fluentd
Hadoop
MongoDB
jubatus
node.js

▶ Demos

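(1) NAVER Matome
Plots on a world map which countries and regions users are visiting from
Extended maptail.js
About a 1-minute lag
(2) Information on the people visiting the service right now
ー
Lessons learned
MongoDB is great on its own, but combining it with fluentd, node.js, and so on lets you build something even better
You can build the whole system consistently in a JavaScript context

▶ Summary

MongoDB is good
If you have a suitable setting, why not try it casually?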

▶ QA

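1. About MongoDB schema design

embedded

2. Was Cassandra not an option?

The person who created MyCassandra works there.
There is talk of wanting to move to Cassandra.
It simply was not a candidate at that time

3. MongoDB's asynchronous behavior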

✔ 5:35 pm - 6:20 pm Large-Scale Log Analysis Starting with mongo-hadoop: A New Path to Low Cost (@muddydixon Daichi Morifuji)

BigData Analysis with mongo-hadoop

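Nifty: Introduces a way to achieve large-scale log analysis with a small team by using mongo-hadoop https://github.com/mongodb/mongo-hadoop to let mongo take over the HDFS management, redundancy, and other work that comes after "we've started using Hadoop", lowering the learning cost of the operations phase, which is the most painful part.

A talk about MongoDB as an input source for Hadoop
Managing Hadoop is hard
Managing HDFS
With HBase, ZooKeeper as well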

▶ self intro.

@muddydixon
The department was formed last year
2 engineers

▶ Log analysis

Tough unless you are a big company
Not enough resources

▶ Requirement

Adhoc Query
OLAP for survey
Distributed processing for non-holiday works

▶ MongoDB

Cons

OOM occurs in MR
Needs rich resources (memory/disk)

Pros

HA
Full index
Schemaless
AutoSharding

▶ Hadoop

Cons

Operation Difficulty
Many daemons
HBase needs more daemons
Hive takes a long time

Pros

Scaleout
Calculate large data by MR
Scalable HDFS

▶ Cons

Mongo is not suited to analysis
ー
Hadoop has too many daemons
Hard to manage

▶ mongo-hadoop

Diagram
A Java adapter
Hadoop can read from and write to Mongo
It enables Hadoop to access MongoDB data
mongo-hadoop supports sharded and chunked environments
ー
The speaker asked us to picture the HDFS part of Hadoop being replaced by MongoDB

▶ Sample

Best to look at the slides for this
You need to know the types of the BSON values that come back

▶ The cons of Hadoop and MongoDB are swapped out

Easy to calculate large data
Limited daemons
Replica set and sharding
Can find data in a moment

▶ Advanced

★ Hadongo

use BSON writable
use Common Mapper
use Reducer class for combinator
merge multiple input resources (now pull requesting)

Records with the same key can be received by the same reducer
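
The "same key, same reducer" guarantee is what makes per-key aggregation work. The sketch below is a single-process imitation of the map, shuffle and reduce steps in plain Python; it is not Hadoop or hadongo code, and the function names and sample log records are assumptions for illustration.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Tiny single-process map/shuffle/reduce, for illustration only."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: equal keys land in one group
    return {key: reducer(key, values) for key, values in groups.items()}


def path_mapper(record):
    """Emit one (path, 1) pair per access-log record."""
    yield record["path"], 1


def count_reducer(key, values):
    """Sum the counts collected for one key."""
    return sum(values)


logs = [{"path": "/a"}, {"path": "/b"}, {"path": "/a"}]
hits = run_map_reduce(logs, path_mapper, count_reducer)
```

Here `hits` maps each path to its hit count, because every value emitted for a given path reaches the same reducer call.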
ー
Why use BSONWritable?
We can use schemaless objects consisting of string, number, boolean, object, and list values
We should handle them with checks such as existence and type validation
ー
Use a common mapper
In hadongo
we create a common MR for distinct processing logic and use it to analyze many validations

▶ Why not publish hadongo

Because it is not finished yet.

▶ Summary

Lets you tell people who want to analyze data to go ahead and do it themselves
New Year's party CROSS
Hiring Data Miners and Data Scientists

Ooh, a hot topic #mongotokyo / The MongoDB NoSQL Database Blog - Operations in the New Aggregation Framework URL

2012-01-18 17:28:43 via Hatena

▶ QA

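1. Writing to Mongo after Reduce is a concern

↓
Change the write destination, for example to HDFS
Tune Reduce so that its output is smaller
Work around it with measures like these

2. The Hadoop and Mongo clusters are separate

Running them co-located was not considered, so the speaker has not looked at the source that closely
Securing data locality would save resources

3. hadongo will be released eventually

It is being built to make data analysis as easy as possible
The selling point is that MongoDB itself is not modified at all

4. Doesn't Sqoop support Mongo?

As far as the speaker remembers, it does not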