
Notes of Hadoop Summit 2012 Day 1. #hadoopsummit


Notes of Day 1.

On June 13-14, I attended Hadoop Summit 2012 in San Jose.
In my opinion, a conference isn't over until I've finished blogging about it.
So first I want to share my notes from Day 1 of Hadoop Summit 2012.
I will follow up with my notes from Day 2 and a summary of the event, so stay tuned ;)

Contents:

8:30am - 10:05am Keynote & Plenary Sessions in the Main Ballroom

Kicked off with this exciting introductory video.

▶ From the Hortonworks CEO
  • Summit driven by the community

#

  • 267 sessions, 120 organizations

#

  • content selection committee was steered by 40 leaders across the ecosystem from 27 organizations
  • dilemma
    1. tremendous data explosion
    2. proper data strategy
    3. pre transaction

#

  • Apache Hadoop is poised to be at the EPICENTER of next-generation data architecture
▶ Hadoop's Opportunity to Power the Next-Generation Architecture
  • BIG DATA: a $100B market opportunity

#

  • transactions + interactions + observations -> BIG DATA

#

  • ERP -> CRM -> WEB -> BIG DATA
  • sensors / RFID / Devices
  • Mobile web
  • User click stream
  • Sentiment
  • User-generated content ....
  • Increasing data variety and complexity

#

  • there is still work to be done to ensure Hadoop powers the big data wave

#

  • Many communities must work as one
  • vendors, OSS, end users
    • be diligent stewards of the OSS core
    • Be tireless innovators beyond the core
      • visualizing data
    • provide robust data platform services and open APIs
    • enable the ecosystem at each layer of the stack

#

  • top 10 influencers of the decade
  • 1. Google
  • 2. Apple
  • 3. Apache Software Foundation ###

#

  • Diligent stewards and tireless innovators
  • Trees are still growing
  • Base is Hadoop
  • Pig, Hive ....
  • Avro ....

#

  • Integrating Hadoop with existing IT investments is vitally important
    • Larry Feinsmith

#

  • Connecting transactions + interactions + observations
  • Business transactions & interactions
  • Business intelligence & analytics
    1. share refined data and runtime models
  • Big data refinery (newly added)
    • (audio, video, ... / docs, text, ... / web logs, clicks, ...)

#

  • Next-generation big data architecture
  • Arrows to/from Hadoop powered by ETL, data movement, and data integration technologies
  • NewSQL

#

  • Data services and open APIs are vital
  • Apache HCatalog: Hadoop's centralized metadata services
  • provides consistent metadata and data models across tools
  • share data as tables in and out of HDFS ...

#

  • Data services and open APIs in action
  • 4-step process example
  • 1. Web log files via WebHDFS APIs (see the sketch below)
  • 2. Customer and order data via Talend & HCatalog for schema
  • 3. Process, analyze, and join data via Talend, Pig & HCatalog
  • 4. Analyze website visits by the type of ...
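A minimal sketch of what step 1 might look like against the WebHDFS REST API (the NameNode host, port, and log path below are placeholders I made up, not values from the demo):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Hedged sketch: list web log files over WebHDFS with plain HTTP.
// Host, port (50070 was the usual NameNode HTTP port then), and path
// are hypothetical.
public class WebHdfsList {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://namenode.example.com:50070"
                + "/webhdfs/v1/logs/weblogs?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);  // JSON FileStatuses response
        }
        in.close();
        conn.disconnect();
    }
}
```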

#

  • Demo

#

  • Ecosystem Completes the Puzzle
  • Application business
    • IBM, Spring, Microsoft, Datameer
  • Data management and movement
    • IBM, Microsoft, MarkLogic, Splunk, Teradata Aster
  • Infrastructure and system management
    • StackIQ, SAVVIS, Cisco, VMware, IBM, Microsoft

#

  • Our opportunity and our role
  • By the end of 2015
  • more than half the world's data will be processed by Apache Hadoop.
▶ Hadoop Now
  • Hadoop Summit is BIG!
  • 2008: first summit, 200+ people
  • 2012: this summit, 2,200+ people

#

  • Timeline: Apache Hadoop 1.0 & 2.0
  • Hadoop 1.0 is the stable release
  • 2.0
    • next-generation MapReduce and HDFS

#

  • Hadoop 1.0 key features
  • Flush / sync for HBase
  • interactive apps - web site personalization
  • Security - Kerberos
  • MR limits
    • solves the whack-a-mole bad-user-job problem
    • benefits: reliability, multi-tenancy

#

  • Hadoop 2.0 innovations
  • Focus on scale and community innovation
  • YARN: scalable pluggable execution framework
    • near-realtime, machine learning, and analytics use cases
  • Federation : scalable pluggable storage
  • Always on : no cluster downtime
    • Wire compatible API
    • HDFS HA
    • rolling upgrade
    • log and checkpoint management

#

  • Balancing Innovation and Stability

#

Highlights

  • 1. Pure Apache Hadoop 1.0 code line, 100% open source
  • 2. OSS management & monitoring via Ambari
  • 3. common metadata services via HCatalog
  • 4. ...

#

  • Management and monitoring system : Ambari
  • Powerful monitoring and alerting dashboard
  • Simple Integration and provisioning

#

  • Full stack HA
  • Proven HA solutions with proven Hadoop 1.0
  • failover and restart for:
    • NN
    • JT
    • other services to come...
  • multi-vendor support via open APIs

#

  • The road ahead
  • Ambari
    • REST API
  • HCatalog
    • ODBC/JDBC
    • more REST APIs
  • Full stack HA
    • Continued work with virtualization and OS vendors
  • Native Windows support

#

  • Help grow the ecosystem!
▶ Bringing the value of Hadoop to the enterprise

Teradata, Scott Gnau

  • Value
  • Meet customer expectations
  • reducing cost
  • increasing market share and revenue

#

  • Technology evolution and where we are today
  • cheaper
  • faster (performance)

#

  • What could derail big data?
    • Skill sets
      • it's hard to utilize existing skill sets
    • No enterprise wide adoption
    • Big data architectures

#

  • History of innovation

#

  • What made the data warehouse successful
    • Affordability
    • Robust ecosystem
    • Reference solutions
      • measurable business value

#

  • What does big data technology need to succeed in business?
  • know the value
  • what if you knew the customer was going to leave?
    • -> analytics

#

  • Unified big data architecture
  • DWH
    • Teradata
  • Discovery platform
    • Aster ( <- SQL-H <- Hadoop )
  • ETL tools
  • BI tools and visualization tools

10:30am - 11:10am Big Data Architectures in the AWS Cloud

Speaker: Adam Gray

Big Data technologies such as Hadoop, NoSQL, and scalable object stores are an ideal fit for the elasticity and scalability of a cloud deployment. In this talk we will take a look at several common architectural patterns that are being used today on the AWS cloud to take advantage of these synergies while overcoming some of its inherent limitations. ...

▶ Should I run Hadoop in the cloud?
▶ Customer case studies
▶ Case #1 Netflix
  • 50B events per day

#

  • Big idea #1: Cloud based data hub
    • S3

#

  • Amazon S3 growth
  • 99.999999999% durability

#

  • 8TB of event data per day
  • legacy dimension data stored in Cassandra
  • ~1PB of data stored in Amazon S3

#

  • Big idea #2 data hub as a Hadoop File system
  • EMR
  • managed Hadoop offering in the cloud
  • integration with other AWS services

#

  • S3 -> prod cluster (EMR)
  • data streamed directly from S3 to the cluster
  • data consumed in multiple ways
  • recommendations, ad-hoc queries...
  • wide range of processing languages used
  • Pig... (see the sketch below)
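As a rough sketch of what "streamed directly from S3" means in job code: the input and output paths just use an S3 filesystem URI instead of HDFS. The bucket and paths here are hypothetical, and s3n:// was the common scheme in the Hadoop 1.x era:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hedged sketch: a Hadoop job reading its input straight from the S3
// data hub. Bucket and paths are hypothetical.
public class S3InputJob {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "s3-input-example");
        job.setJarByClass(S3InputJob.class);
        FileInputFormat.addInputPath(job,
                new Path("s3n://example-data-hub/events/2012/06/13/"));
        FileOutputFormat.setOutputPath(job,
                new Path("s3n://example-data-hub/output/run-0613/"));
        // Mapper/Reducer omitted: the defaults pass records through
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```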

#

  • Big idea #3 Dynamically resizable clusters
  • 300 nodes during the day
  • 400+ nodes in the evening peak time
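This kind of day/night resizing can be driven through the EMR API; a hedged sketch with the AWS SDK for Java, where the credentials and instance-group ID are placeholders:

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.InstanceGroupModifyConfig;
import com.amazonaws.services.elasticmapreduce.model.ModifyInstanceGroupsRequest;

// Hedged sketch: grow an EMR instance group for the evening peak.
public class ResizeCluster {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        emr.modifyInstanceGroups(new ModifyInstanceGroupsRequest()
                .withInstanceGroups(new InstanceGroupModifyConfig()
                        .withInstanceGroupId("ig-XXXXXXXXXXXX") // hypothetical ID
                        .withInstanceCount(400)));              // evening peak size
    }
}
```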

#

  • Big idea #4: one data set, multiple clusters
  • S3 -> prod cluster and query cluster

#

  • cheaper experimentation = faster innovation
▶ case #2
  • Corn production
  • 200TB of data in S3
  • Simulation each month

#

  • Big idea #5: transient Hadoop clusters
  • send data in from S3 to the cluster,
  • and send results back from the cluster to S3

#

  • Big idea #6: spot instances
  • trade-off:
  • price depends on the spot rate at that moment (see the sketch below)

#

  • Mix spot and on-demand instances
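With the EMR API the mix can be expressed as a spot-priced TASK group next to on-demand core nodes; a hedged sketch where the bid price, instance type, and cluster ID are all hypothetical:

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.AddInstanceGroupsRequest;
import com.amazonaws.services.elasticmapreduce.model.InstanceGroupConfig;

// Hedged sketch: add a spot TASK group; task-only nodes hold no HDFS
// data, so losing them to the spot market is survivable.
public class AddSpotTaskGroup {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        InstanceGroupConfig spotTasks = new InstanceGroupConfig()
                .withInstanceRole("TASK")
                .withMarket("SPOT")
                .withBidPrice("0.08")      // hypothetical bid, USD/hour
                .withInstanceType("m1.large")
                .withInstanceCount(100);
        emr.addInstanceGroups(new AddInstanceGroupsRequest()
                .withJobFlowId("j-XXXXXXXXXXXX") // hypothetical cluster ID
                .withInstanceGroups(spotTasks));
    }
}
```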
▶ case #3 airbnb
  • 1M daily searches
  • 10M HTTP requests
  • billions of records

#

  • Big idea #7 scalable realtime data access
  • Amazon DynamoDB
  • DynamoDB: fully managed NoSQL database service, seamless scalability
  • Zero Administration
  • Low Latency SSDs
  • Reserved capacity
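A minimal sketch of writing one aggregate row into DynamoDB with the AWS SDK for Java; the table name and attributes are hypothetical, not Airbnb's actual schema:

```java
import java.util.HashMap;
import java.util.Map;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;

// Hedged sketch: store a precomputed aggregate for low-latency reads.
public class PutAggregate {
    public static void main(String[] args) {
        AmazonDynamoDBClient client = new AmazonDynamoDBClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        Map<String, AttributeValue> item = new HashMap<String, AttributeValue>();
        item.put("metric", new AttributeValue().withS("daily_searches"));
        item.put("day", new AttributeValue().withS("2012-06-13"));
        item.put("count", new AttributeValue().withN("1000000"));
        client.putItem(new PutItemRequest()
                .withTableName("aggregates")  // hypothetical table
                .withItem(item));
    }
}
```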

#

  • A cloud-based Hadoop architecture
  • RDS, log files, click stream
  • Tier 1: S3 (historical data)
  • Tier 2: Hadoop/EMR (working): expandable, replicable
  • Tier 3: DynamoDB (aggregate)
▶ Everyone is Hiring
▶ Q&A

#

  • data transfer from S3

#

  • security
  • solutions should be case by case.

#

  • realtime processing

#

  • EBS for local disk

11:25am - 12:05pm HDFS - What is New and Future

Speaker: Suresh Srinivas, Sanjay Radia

Hadoop 1.0 is a significant milestone in being the most stable and robust Hadoop release tested in production against a variety of applications. It offers improved performance, support for HBase, disk-fail-in-place, Webhdfs, etc over previous releases. The next major release, Hadoop 0.23 offers several significant HDFS improvements including new append-pipeline, federation, wire compatibility, NameNode HA, further performance improvements, etc ...

▶ Outline
  • Hadoop 1 and 2 releases
  • generalized storage service

#

  • enterprise use cases
  • HDFS infra improvements ...
▶ Hadoop 1 and 2
  • Hadoop 1 GA
    • security
    • append/fsync(HBase)
    • WebHDFS + Spnego
    • Write pipeline
  • Hadoop 2 alpha ...

#

  • Testing & Quality - used for each stable release
  • Nightly testing
    • 1200 automated tests on 30 nodes
  • QE certification for release
  • Release testing - alpha and beta
    • sandbox cluster - 3 clusters each 400 - 1k nodes
    • research cluster

#

  • Hadoop 1 and 2 Timeline
  • Federation alpha testing

#

  • development status of HA and wire compatibility
▶ Federation: Generalized Block storage
  • Block storage as generic storage service
  • Block pool
  • multiple independent NNs and namespace volumes in a cluster

#

  • HDFS generic storage service: opportunities for innovation
  • Federation - distributed namespace
  • new services - independent block pools
    • New FS - partial namespace in memory
    • MR tmp storage, HBase directly on block storage
    • Shadow file system in DN

#

  • Shadow File System
  • Custom Namespace to shadow namespace of another system
  • Different policies on the data
  • e.g. single replica, fetch missing ones from source
  • e.g. reduce replication factor for data duplicated in another cluster

#

  • Managing namespaces
  • Federation has multiple namespaces
  • Don't you need a single global namespace?
    • some tenants want a private namespace
      • Hadoop as a service
  • A simple global namespace is one way to share
  • Client-side mount tables are another way to share (see the sketch below)
    • shared mount-table
    • personalized mount-table
  • Client-side implementation of mount tables
    • no SPOF
    • no hotspot for root and top-level directories
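A sketch of what a client-side mount table can look like with ViewFS in Hadoop 2 (the fs.viewfs.mounttable.*.link properties are ViewFS's mechanism; the NameNode hosts and mount points here are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: a per-client mount table stitching two federated
// namespaces into one view. No central server resolves these paths,
// so there is no SPOF and no root hotspot.
public class ViewFsMountTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "viewfs:///");
        conf.set("fs.viewfs.mounttable.default.link./user",
                 "hdfs://nn1.example.com:8020/user");
        conf.set("fs.viewfs.mounttable.default.link./data",
                 "hdfs://nn2.example.com:8020/data");
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/user")));  // resolved client side
    }
}
```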

#

  • Next step: first support for volumes
  • NameServer
  • working set of namespace in memory
  • number of NameServers =
  • sum of (namespace working set) ...
▶ Enterprise Use cases
  • HA
  • Standard interfaces
    • WebHDFS, FUSE, and NFS access
  • Snapshot
  • Disaster Recovery
    • Distcp
      • enhance using journal interface & snapshots
  • Data Efficiency/RAID
    • productize the tools and experience at Facebook
▶ HDFS infrastructure improvement
  • Netty ###
    • Better connection and thread management
  • Image/Edits management
  • Parallel writes
  • Group Blocks
  • Support for Heterogeneous Storage
  • Rolling Upgrades improvement
▶ HA in Hadoop 1!
  • using the full-stack HA architecture
  • when the NN fails over,
  • the JT goes into safemode and apps (including external ones) pause/retry

#

  • once failover completes, everything returns to normal

#

  • HA in Hadoop 1 with HDP1
  • Full stack HA architecture
  • Use industry ...

#

  • Hadoop NN/JT HA with vSphere
  • NN HA with Linux-HA

#

  • failover times
  • NN failover times with vSphere and Linux-HA
  • failure detection and failover - 0.5 to 2 min
  • OS bootup needed for vSphere - 1 min ...
▶ Summary
  • Hadoop 1 - the most stable version
  • Hadoop 2 - alpha testing
    • 3 years of development
    • generalized storage layer
    • Hadoop 2 HA
  • Snapshot and DR improvement
▶ Q&A
  • about DistCp

#

  • Topology is unchanged in Hadoop 1 and 2.

#

  • about failover time
  • the speaker has tried it on a PB-scale cluster.

#

  • about Linux HA

1:30pm - 2:10pm HMS: Scalable and flexible configuration management system for Hadoop stack

Speaker: Kan Zhang, Eric C Yang

To deploy the whole Hadoop stack and manage its full life cycle is no small feat. This is partly due to the interdependency among services and partly due to the large number of nodes in a Hadoop cluster. For example, when upgrading an HDFS cluster, one needs to make sure NameNode is upgraded successfully before starting up the DataNodes. Such cross-node dependencies are difficult to specify in existing configuration management systems like Puppet. ...

▶ Motivation
  • Goal
  • managing Hadoop stack in a data center
  • Scalability
    • fault tolerance adds further complexity
  • real-time interaction and feedback
    • visibility into cluster state is a major pain point for sysadmins
  • cross-node ordering dependencies
    • e.g. start/stop NN, JT
      • simple to specify and efficient to implement
▶ HMS approach
  • ZooKeeper plays a central role
  • fault-tolerant and scalable storage

#

  • asynchronous messaging service
▶ ZooKeeper
  • a hierarchical namespace of znodes for storing data
  • sequential znodes for message queueing

#

  • watches for asynchronous notification
  • Ephemeral znodes for failure detection
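A minimal sketch of these primitives with the plain ZooKeeper Java client; the znode paths are hypothetical, not HMS's actual layout:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hedged sketch: sequential znodes (queueing), watches (async
// notification), ephemeral znodes (failure detection).
public class ZkPrimitives {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk.example.com:2181", 15000,
                new Watcher() {
                    public void process(WatchedEvent e) {
                        System.out.println("notified: " + e);
                    }
                });
        // message queueing: the server appends a sequence number,
        // giving consumers a total order
        zk.create("/hms/queue/msg-", "payload".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        // failure detection: this znode vanishes if the agent's session dies
        zk.create("/hms/agents/host1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        // watch: the default watcher fires when the queue's children change
        zk.getChildren("/hms/queue", true);
    }
}
```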
▶ Leveraging ZooKeeper
  • Storing Cluster size
  • Storing system state

#

  • Distributed orchestration
  • Cross-node dependency
▶ HMS Overview #picture
▶ Design Implications
  • All cluster state is stored in Zookeeper
  • Controllers and agents don't interact directly
    • all communication is via ZK async notifications
  • Controllers and agents are stateless
    • controllers can be replicated for load balancing
    • controller failures are automatically detected and handled
  • Dependency specified in terms of node states
    • actions come and go, but their effects are captured in node states
▶ Details of configuration
  • Node list
    • roles are mapped to hostnames

#

  • Package manifest

#

  • Configuration plan
    • run a list of scripts to configure NN

#

  • Start NN
  • run a script to set up HDFS on the NN

#

  • Compiled plan

#

  • Start JT
  • dependency descriptions

#

  • Node state
▶ Q&A

#

  • as mentioned earlier on a slide,
  • the 5k nodes are spread across multiple clusters

#

  • no OS-level deployment at the moment.

#

  • about ZooKeeper scalability
  • addressed in the ZooKeeper paper.

#

  • Can it be used with systems other than Hadoop?
  • Yes.
  • it depends on your scripts.

#

  • Puppet, Chef?
  • HMS is built for scalability,
  • designed from scratch

2:25pm - 3:05pm Big Data Challenges at NASA

Speaker: Chris Mattmann

I`ll describe what I see as the most current and exciting big data challenges at NASA, across a number of application domains: * Planetary Science * Next generation Earth science decadal missions * Radio Astronomy and the next generation instruments including the Square Kilometre Array * Snow Hydrology and Climate Impacts The session will focus on defining the problem space, suggesting how technologies like Apache Hadoop can be leveraged, ...

  • Big data challenges at NASA
  • and the set of technologies they are developing
▶ Self introduction
▶ Agenda
▶ Big data grand challenges I'm interested in
  • 700 TB/sec of data coming off the wire - when we actually have to keep it around?
  • 10 years of data from the Colorado River basin - how to store and disseminate the output products

#

  • How do we compare PB of climate model output data in a variety of formats ...

#

  • How do we catalog all of NASA's current planetary science data ###

#

  • comparing huge data sets in different formats
  • data placed in different places and data centers
▶ The NASA ESDS Context #picture
▶ Lessons from 90's era missions
  • data volumes
  • complexity of instruments and algorithms

#

  • increasing availability of proxy/sim/ancillary data
  • rate of technology refresh
▶ Where do Big data technologies fit into this?
  • U.S. national climate assessment
  • SKA South Africa: Square ...
▶ MODSCAG Historical Data processing
  • snow.jpl.nasa.gov
  • YARN

#

  • data processing, data flow dependency
▶ EVLA demonstration architecture #picture
  • Python (CASA?)
▶ Apache OODT
  • diverse set of science data system activities in planetary science, earth science, radio astronomy, biomedicine, astrophysics, and more...
▶ OODT: OSS bigdata platform originally pioneered at NASA
  • OODT is meant to be a set of tools to help build data systems
    • turn key ...
▶ Governance Model + NASA = ♥
  • NASA and other government agencies have tons of process
▶ OODT and PCS

#photo.

▶ How do we deploy PCS for a mission?
  • what NASA implements
    • server config
    • product metadata specification
    • ...
▶ OODT 0.4
▶ How do these fit together?
  • Hadoop, Apache OODT
▶ where are we headed with OODT + Hadoop
  • investigate and integrate YARN
  • Plug in HBase as File manager Catalog

#

  • OODT + Hadoop virtual machines and RPMs
    • Easy installation

3:35pm - 4:15pm Infrastructure around Hadoop - backups, failover, configuration, and monitoring

Speaker: Terran Melconian, Ed MacKenty

There's more running on your Hadoop cluster than just the Hadoop binaries. We'll present our accompanying infrastructure – our HDFS and Hive DDL backup system, our monitoring with Ganglia and Nagios, our use of DRBD for master failover, and our configuration management with Puppet. We'll talk about what's worth monitoring, what's worth automating, and some things we tried that didn't work out. We are in the process of open-sourcing several of these components and should be done well before June.

▶ What TripAdvisor does
  • World's largest travel site and community
  • Trip planning
  • > 50 million unique monthly visitors, 30 countries ...
▶ What the Warehouse team does
  • retain and aggregate historic site activity data
  • ~50 nodes in 4 clusters
    • analytics team of 12
  • 3 people fairly new to Hadoop/Hive
  • CDH3u3 (Hadoop 0.20.2)
▶ Why Hadoop at TripAdvisor
  • old RDBMS data warehouse could barely keep up with data integration, even running on expensive HW with a SAN
  • many kinds of statistics
▶ HA NameNode: DRBD, Corosync and Pacemaker
  • NN and JT on the master node
  • DN and TT run on slave nodes

#

  • automatic failover of all master node services to a passive node
  • provision two identical systems ...
▶ DRBD/Corosync Configuration
  • DRBD: replicates NN image, Hive metadata, Oozie job data
  • Corosync: messaging between active/passive masters ###

#

  • Corosync will start Pacemaker for you
  • use /etc/init.d/corosync to manage it, and Pacemaker
▶ Pacemaker configuration
  • Define each resource you want to manage
    • DRBD device, master IP address, Ethernet connectivity checks, Hadoop NN and JT, Hive thrift server, MySQL for Hive metadata...
  • set monitoring intervals for each resource
  • define resource co-location dependencies
  • define resource ordering dependencies
  • Restart failed services, e.g. Hive-Thrift
  • Use crm tool to manage nodes and resources
  • Test with a manual failover

#

  • the SNN runs on the passive machine.
▶ Monitoring: Ganglia and Nagios, Job Tracking
  • visibility into cluster operations
  • monitor HW states and resource usage
▶ Ganglia
  • Graphs for entire cluster and individual nodes
▶ Nagios
  • Their primary notification system
  • about 80 checks, ~25 are their own. #photo memo
  • Some warnings
  • Don't let Nagios run Hadoop fsck
    • it's a slow command (~30 sec) ###
  • LDAP failure causes email cascade
  • High loads can cause timeouts, which cause notifications
▶ Job Tracking ###
  • Perl script invoked frequently by cron
  • Parses JT log entries since last run
  • Records data on each job in PostgreSQL
  • CGI script to do queries

#

  • Helps with post-mortems of problems
  • Used to predict trends, future resource needs
▶ Other cron scripts we run
  • Check load
  • Master nodes
    • Compress Hadoop/Hive logs
    • check_hdfs
    • backup current NN metadata ...
▶ Configuration management
  • Use Puppet as a tool
  • Gets the job done
  • Lacks flexibility in inheritance to specialize defaults per machine
  • Some aspects of operations are hard to debug
▶ Backup ###
  • Runs on separate backup server with storage (NexSan)
  • Pull process driven by processes on backup server

#

  • Incremental HDFS backup
  • Hive DDL backup
▶ Backup Hive DDL
  • OSS java app uses Thrift server
  • also used to move data to MySQL
▶ Other things to potentially backup
  • Backup the NN Metadata ... #photo
▶ Why Oozie?
  • Drawback: several times slower to write than cron jobs, while also less expressive
  • Advantage: ... #photo
▶ How Oozie?
  • Bypass the Oozie GUI
  • Don't ever use Derby
▶ Experiences and Expectations
  • Hadoop is not mature yet
    • reliability, stability
  • Cluster outages are common events, not outliers
  • You must design for failure and have a robust mechanism to cleanly and easily resume execution once the cluster is back up.
  • Important jobs must be ..
▶ Attributes of Robust Jobs
  • Idempotent and resumable regardless ...
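One common way to get there is the write-to-temp-then-rename pattern; a hedged sketch (the paths and runJob() hook are hypothetical, not TripAdvisor's code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: skip if already done, write to a temp dir, promote
// atomically with a single rename. Rerunning after any failure is safe.
public class IdempotentRunner {
    public static void runIdempotent(Configuration conf, Path finalOut)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(finalOut)) {
            return;  // output already published: rerun is a no-op
        }
        Path tmpOut = new Path(finalOut.getParent(),
                finalOut.getName() + "._tmp");
        fs.delete(tmpOut, true);      // clear leftovers of a failed attempt
        runJob(conf, tmpOut);         // hypothetical: the real MR job
        fs.rename(tmpOut, finalOut);  // one rename = atomic publication
    }

    private static void runJob(Configuration conf, Path out) { /* ... */ }
}
```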
▶ Benchmark ###
  • Key insight
  • for the same job, the same task always does the same work.
▶ Features you should use ###
  • Fair scheduler
  • refreshNodes , refreshQueue ...
▶ Hiring
▶ Q&A
  • about Disaster Recovery
▶ Related tweets.

4:30pm - 5:10pm Hadoop and Vertica: The Data Analytics Platform at Twitter

Speaker: Bill Graham

Twitters Data Analytics Platform uses a number of technologies, including Hadoop, Pig, Vertica, MySQL and ZooKeeper, to process hundreds of terabytes of data per day. Hadoop and Vertica are key components of the platform. The two systems are complementary, but their inherent differences create integration challenges. This talk will give an overview of the overall system architecture focusing on integration details, job coordination and resource management. Attendees will learn about the pitfalls we encountered and the solutions we developed while evolving the platform.

▶ Overview
  • Architecture
  • Data flow

#

  • Job coordination....
▶ We count things
  • 140 characters
  • 140M active users
  • ...
  • 400M tweets per day
  • 80-100TB ingested daily
  • ...
▶ Heterogeneous Stack
  • Many job execution applications
    • Crane - Java ETL
    • Oink - Pig scheduler
    • Rasvelg - SQL aggregations
    • Scalding - Cascading via Scala
    • PyCascading - Cascading via Python
    • Indexing Jobs
  • Our users
    • Analytics, Revenue, Growth, Search, Recommendations, etc...
    • PMs, sales!
▶ Data Flow: Analytics #photo
  • production hosts

#

  • staging Hadoop cluster
  • MySQL/Gizzard
  • main Hadoop DW (HBase)

#

#

  • HCatalog
▶ Chaotic? Actually, no.
▶ System Concepts
  • Loose coupling
  • Job coordination as a service

#

  • Resource management as a service
  • Idempotence
▶ Loose coupling
  • Multi job frameworks
  • Right tool for the job

#

  • Common dependency management
▶ Job coordination
  • Shared batch table for job state
  • access via client libraries

#

  • Jobs & data are time-based
  • 3 types of preconditions (see the sketch below)
    1. other job success
    2. existence of data
    3. user defined
  • Failed jobs get retried (usually)
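A minimal sketch of precondition type 2, assuming the standard Hadoop convention of a _SUCCESS marker in a finished output directory (the hourly path layout is hypothetical, not Twitter's client-library API):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: gate a job on the marker an upstream job leaves behind.
public class DataPrecondition {
    public static boolean inputReady(Configuration conf, String hour)
            throws Exception {
        Path marker = new Path("/dw/events/" + hour + "/_SUCCESS");
        return FileSystem.get(conf).exists(marker);
    }
}
```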
▶ Resource management
  • ARM
  • Library above ZK

#

  • Throttles jobs and workers
    • Only 1 job of this name may run at once
    • Only N jobs may be run by this app at once
    • Only M ....
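The "only 1 job of this name at once" rule maps naturally onto a ZooKeeper ephemeral znode; a hedged sketch (the paths are made up, this is not the actual ARM API):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hedged sketch: an ephemeral znode as a job lock. It exists only while
// the holder's session is alive, so a crashed job releases the slot.
public class JobThrottle {
    public static boolean tryAcquire(ZooKeeper zk, String jobName)
            throws KeeperException, InterruptedException {
        try {
            zk.create("/arm/locks/" + jobName, new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;   // we hold the slot until our session ends
        } catch (KeeperException.NodeExistsException e) {
            return false;  // another instance of this job is running
        }
    }
}
```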
▶ Job DAG and state transition
  • Local view
  • is it time for me to run yet?
  • are my dependencies satisfied?
  • any resource constraints?
▶ Example : active users #photo
▶ Vertica or Hadoop? ###
  • Vertica
    • loads 100s of thousands of rows/sec
    • aggregates 100s of millions of rows in seconds ...
  • Hadoop
    • Excels when data size is massive
    • Flexible and powerful
    • Great with nested data ...
▶ Vertica import options
  • Direct import via Crane
  • Atomic import via Crane/Rasvelg
  • Parallel import
▶ Vertica imports - Pros/cons
  • Crane and Rasvelg
    • good for smaller data sets, DB-to-DB transfers
  • Pig
    • great for larger datasets
▶ VerticaStorer
  • PigStorage implementation
  • From Vertica's Hadoop connector suite ...
▶ Pig VerticaStorage
  • Their enhancement
    • Connection credential management
    • truncate before load option
    • Throttle concurrent writers via ZK
  • Future feature
    • counters for rows inserted/rejected
    • Name-based tuple-column bindings
    • atomic load via temp table
▶ Gotcha #1
  • MR data load is not atomic
  • Avoid partial reads

#

  • option 1: load to a temp table, then insert directly (see the sketch below)
  • option 2: add a job dependency concept
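A hedged sketch of option 1 over JDBC: because the promotion happens inside one transaction, readers never observe a half-loaded table. The JDBC URL, tables, and credentials are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hedged sketch: tmp_events was filled by the (non-atomic) MR load;
// moving rows into the target inside a transaction makes the load
// appear atomic to readers.
public class AtomicLoad {
    public static void promote() throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:vertica://vertica.example.com:5433/dw", "loader", "secret");
        conn.setAutoCommit(false);
        Statement st = conn.createStatement();
        try {
            st.executeUpdate("INSERT INTO events SELECT * FROM tmp_events");
            st.executeUpdate("DELETE FROM tmp_events");
            conn.commit();  // both statements become visible together
        } catch (Exception e) {
            conn.rollback();
            throw e;
        } finally {
            conn.close();
        }
    }
}
```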
▶ Gotcha #2 #photo
▶ Gotcha #3
  • isIdempotent() must be a first-class ...
▶ Gotcha #4
  • Vendor code only gets you so far
  • Nice to have == have to write
▶ Future work
  • more VerticaStorer features
  • multiple vertica clusters
  • atomic DB loads with Pig/Oink
  • Better DAG visibility
  • Better job history visibility
  • MR job optimizations via historic stats
  • HCatalog data registry
  • Job push events
▶ Q&A #laugh

Need to check IC recorder later.

Continued in Notes of Day 2 and the Summary.

Related posts.