
Notes of Hadoop Summit 2012 Day 1. #hadoopsummit


Notes of Day 1.

On June 13-14, I attended Hadoop Summit 2012 in San Jose.
In my opinion, a conference isn't over until I've finished blogging about it.
So first I want to share my notes from Day 1 of Hadoop Summit 2012.
I will follow up with my notes from Day 2 and a summary of the event, so stay tuned ;)

Contents:

8:30am - 10:05am Keynote & Plenary Sessions in the Main Ballroom

Kicked off with this exciting introductory video.

▶ From the Hortonworks CEO
  • Summit driven by the community

#

  • 267 sessions, 120 organizations

#

  • content selection committee was steered by 40 leaders across the ecosystem from 27 organizations
  • dilemma
    1. tremendous data explosion
    2. proper data strategy
    3. pre transaction

#

  • Apache Hadoop is poised to be at the EPICENTER of next-generation data architecture
▶ Hadoop's Opportunity to Power the Next-Generation Architecture
  • BIG DATA: a $100B market opportunity

#

  • transactions + interactions + observations -> BIG DATA

#

  • ERP -> CRM -> WEB -> BIG DATA
  • sensors / RFID / Devices
  • Mobile web
  • User click stream
  • Sentiment
  • User-generated content ....
  • Increasing data variety and complexity

#

  • there is still work to be done to ensure Hadoop powers the big data wave

#

  • Many communities must work as one
  • vendors, OSS, end users
    • be diligent stewards of the OSS core
    • Be tireless innovators beyond the core
      • visualizing data
    • provide robust data platform services and open APIs
    • enable the ecosystem at each layer of the stack

#

  • top 10 influencers of the decade
  • 1. Google
  • 2. Apple
  • 3. Apache Software Foundation ###

#

  • Diligent stewards and tireless innovators
  • Trees are still growing
  • Base is Hadoop
  • Pig, Hive ....
  • Avro ....

#

  • Integrating Hadoop with existing IT investments is vitally important
    • Larry Feinsmith

#

  • Connecting transactions + interactions + observations
  • Business transactions & interactions
  • Business intelligence & analytics
    1. share refined data and runtime models
  • Big data refinery (newly added)
    • (audio, video, ... / docs, text, ... / web logs, clicks, ...)

#

  • Next-generation big data architecture
  • Arrows to/from Hadoop powered by ETL, data movement, and data integration technologies
  • NewSQL

#

  • Data services and open APIs are vital
  • Apache HCatalog: Hadoop's centralized metadata services
  • provides consistent metadata and data models across tools
  • share data as tables in and out of HDFS ...

#

  • Data services and open APIs in action
  • 4-step process example
  • 1. Web log files via WebHDFS APIs (see the sketch below)
  • 2. Customer and order data via Talend & HCatalog for schema
  • 3. Process, analyze, and join data via Talend, Pig & HCatalog
  • 4. Analyze website visits by the type of ...
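A minimal sketch of what step 1 might look like against the WebHDFS REST API (the NameNode host, port, and log path below are placeholders I made up, not values from the demo):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Hedged sketch: list web log files over WebHDFS with plain HTTP.
// Host, port (50070 was the usual NameNode HTTP port then), and path
// are hypothetical.
public class WebHdfsList {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://namenode.example.com:50070"
                + "/webhdfs/v1/logs/weblogs?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);  // JSON FileStatuses response
        }
        in.close();
        conn.disconnect();
    }
}
```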

#

  • Demo

#

  • Ecosystem Completes the Puzzle
  • Application business
    • IBM, Spring, Microsoft, Datameer
  • Data management and movement
    • IBM, Microsoft, MarkLogic, Splunk, Teradata Aster
  • Infrastructure and system management
    • StackIQ, SAVVIS, Cisco, VMware, IBM, Microsoft

#

  • Our opportunity and our role
  • By the end of 2015
  • more than half the world's data will be processed by Apache Hadoop.
▶ Hadoop Now
  • Hadoop Summit is BIG!
  • 2008: first summit, 200+ people
  • 2012: this summit, 2,200+ people

#

  • Timeline: Apache Hadoop 1.0 & 2.0
  • Hadoop 1.0 is the stable release
  • 2.0
    • next-generation MapReduce and HDFS

#

  • Hadoop 1.0 key features
  • Flush / sync for HBase
  • interactive apps - web site personalization
  • Security - Kerberos
  • MR limits
    • solves the whack-a-mole bad-user-job problem
    • benefits: reliability, multi-tenancy

#

  • Hadoop 2.0 innovations
  • Focus on scale and community innovation
  • YARN: scalable pluggable execution framework
    • near-realtime, machine learning, and analytics use cases
  • Federation : scalable pluggable storage
  • Always on : no cluster downtime
    • Wire compatible API
    • HDFS HA
    • rolling upgrade
    • log and checkpoint management

#

  • Balancing Innovation and Stability

#

Highlights

  • 1. Pure Apache Hadoop 1.0 code line, 100% open source
  • 2. OSS management & monitoring via Ambari
  • 3. common metadata services via HCatalog
  • 4. ...

#

  • Management and monitoring system : Ambari
  • Powerful monitoring and alerting dashboard
  • Simple Integration and provisioning

#

  • Full stack HA
  • Proven HA solutions with proven Hadoop 1.0
  • failover and restart for:
    • NN
    • JT
    • other services to come...
  • multi-vendor support via open APIs

#

  • The road ahead
  • Ambari
    • REST API
  • HCatalog
    • ODBC/JDBC
    • more REST APIs
  • Full stack HA
    • Continued work with virtualization and OS vendors
  • Native Windows support

#

  • Help grow the ecosystem!
▶ Bringing the value of Hadoop to the enterprise

Teradata, Scott Gnau

  • Value
  • Meet customer expectations
  • reducing cost
  • increasing market share and revenue

#

  • Technology evolution and where we are today
  • cheaper
  • faster (performance)

#

  • What could derail big data?
    • Skill sets
      • it's hard to utilize existing skill sets
    • No enterprise wide adoption
    • Big data architectures

#

  • History of innovation

#

  • What made the data warehouse successful
    • Affordability
    • Robust ecosystem
    • Reference solutions
      • measurable business value

#

  • What does big data technology need to succeed in business?
  • know the value
  • what if you knew the customer was going to leave?
    • -> analytics

#

  • Unified big data architecture
  • DWH
    • Teradata
  • Discovery platform
    • Aster ( <- SQL-H <- Hadoop )
  • ETL tools
  • BI tools and visualization tools

10:30am - 11:10am Big Data Architectures in the AWS Cloud

Speaker: Adam Gray

Big Data technologies such as Hadoop, NoSQL, and scalable object stores are an ideal fit for the elasticity and scalability of a cloud deployment. In this talk we will take a look at several common architectural patterns that are being used today on the AWS cloud to take advantage of these synergies while overcoming some of its inherent limitations. ...

▶ Should I run Hadoop in the cloud?
▶ Customer case studies
▶ Case #1 Netflix
  • 50B events per day

#

  • Big idea #1: Cloud based data hub
    • S3

#

  • Amazon S3 growth
  • 99.999999999% durability

#

  • 8TB of event data per day
  • legacy dimension data stored in Cassandra
  • ~1PB of data stored in Amazon S3

#

  • Big idea #2 data hub as a Hadoop File system
  • EMR
  • managed Hadoop offering in the cloud
  • integration with other AWS services

#

  • S3 -> prod cluster (EMR)
  • data streamed directly from S3 to the cluster
  • data consumed in multiple ways
  • recommendations, ad-hoc queries...
  • wide range of processing languages used
  • Pig... (see the sketch below)
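As a rough sketch of what "streamed directly from S3" means in job code: the input and output paths just use an S3 filesystem URI instead of HDFS. The bucket and paths here are hypothetical, and s3n:// was the common scheme in the Hadoop 1.x era:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hedged sketch: a Hadoop job reading its input straight from the S3
// data hub. Bucket and paths are hypothetical.
public class S3InputJob {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "s3-input-example");
        job.setJarByClass(S3InputJob.class);
        FileInputFormat.addInputPath(job,
                new Path("s3n://example-data-hub/events/2012/06/13/"));
        FileOutputFormat.setOutputPath(job,
                new Path("s3n://example-data-hub/output/run-0613/"));
        // Mapper/Reducer omitted: the defaults pass records through
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```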

#

  • Big idea #3 Dynamically resizable clusters
  • 300 nodes during the day
  • 400+ nodes in the evening peak time
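This kind of day/night resizing can be driven through the EMR API; a hedged sketch with the AWS SDK for Java, where the credentials and instance-group ID are placeholders:

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.InstanceGroupModifyConfig;
import com.amazonaws.services.elasticmapreduce.model.ModifyInstanceGroupsRequest;

// Hedged sketch: grow an EMR instance group for the evening peak.
public class ResizeCluster {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        emr.modifyInstanceGroups(new ModifyInstanceGroupsRequest()
                .withInstanceGroups(new InstanceGroupModifyConfig()
                        .withInstanceGroupId("ig-XXXXXXXXXXXX") // hypothetical ID
                        .withInstanceCount(400)));              // evening peak size
    }
}
```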

#

  • Big idea #4: one data set, multiple clusters
  • S3 -> prod cluster and query cluster

#

  • cheaper experimentation = faster innovation
▶ case #2
  • Corn production
  • 200TB of data in S3
  • Simulation each month

#

  • Big idea #5: transient Hadoop clusters
  • send data in from S3 to the cluster,
  • and send results back from the cluster to S3

#

  • Big idea #6: spot instances
  • trade-off:
  • price depends on the spot rate at that moment (see the sketch below)

#

  • Mix spot and on-demand instances
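With the EMR API the mix can be expressed as a spot-priced TASK group next to on-demand core nodes; a hedged sketch where the bid price, instance type, and cluster ID are all hypothetical:

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.AddInstanceGroupsRequest;
import com.amazonaws.services.elasticmapreduce.model.InstanceGroupConfig;

// Hedged sketch: add a spot TASK group; task-only nodes hold no HDFS
// data, so losing them to the spot market is survivable.
public class AddSpotTaskGroup {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        InstanceGroupConfig spotTasks = new InstanceGroupConfig()
                .withInstanceRole("TASK")
                .withMarket("SPOT")
                .withBidPrice("0.08")      // hypothetical bid, USD/hour
                .withInstanceType("m1.large")
                .withInstanceCount(100);
        emr.addInstanceGroups(new AddInstanceGroupsRequest()
                .withJobFlowId("j-XXXXXXXXXXXX") // hypothetical cluster ID
                .withInstanceGroups(spotTasks));
    }
}
```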
▶ case #3 airbnb
  • 1M daily searches
  • 10M HTTP requests
  • billions of records

#

  • Big idea #7 scalable realtime data access
  • Amazon DynamoDB
  • DynamoDB: fully managed NoSQL database service, seamless scalability
  • Zero Administration
  • Low Latency SSDs
  • Reserved capacity
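A minimal sketch of writing one aggregate row into DynamoDB with the AWS SDK for Java; the table name and attributes are hypothetical, not Airbnb's actual schema:

```java
import java.util.HashMap;
import java.util.Map;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;

// Hedged sketch: store a precomputed aggregate for low-latency reads.
public class PutAggregate {
    public static void main(String[] args) {
        AmazonDynamoDBClient client = new AmazonDynamoDBClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        Map<String, AttributeValue> item = new HashMap<String, AttributeValue>();
        item.put("metric", new AttributeValue().withS("daily_searches"));
        item.put("day", new AttributeValue().withS("2012-06-13"));
        item.put("count", new AttributeValue().withN("1000000"));
        client.putItem(new PutItemRequest()
                .withTableName("aggregates")  // hypothetical table
                .withItem(item));
    }
}
```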

#

  • A cloud-based Hadoop architecture
  • RDS, log files, click stream
  • Tier 1: S3 (historical data)
  • Tier 2: Hadoop/EMR (working): expandable, replicable
  • Tier 3: DynamoDB (aggregate)
▶ Everyone is Hiring
▶ Q&A

#

  • data transfer from S3

#

  • security
  • solutions should be case by case.

#

  • realtime processing

#

  • EBS for local disk

11:25am - 12:05pm HDFS - What is New and Future

Speaker: Suresh Srinivas, Sanjay Radia

Hadoop 1.0 is a significant milestone in being the most stable and robust Hadoop release tested in production against a variety of applications. It offers improved performance, support for HBase, disk-fail-in-place, Webhdfs, etc over previous releases. The next major release, Hadoop 0.23 offers several significant HDFS improvements including new append-pipeline, federation, wire compatibility, NameNode HA, further performance improvements, etc ...

▶ Outline
  • Hadoop 1 and 2 releases
  • generalized storage service

#

  • enterprise use cases
  • HDFS infra improvements ...
▶ Hadoop 1 and 2
  • Hadoop 1 GA
    • security
    • append/fsync(HBase)
    • WebHDFS + Spnego
    • Write pipeline
  • Hadoop 2 alpha ...

#

  • Testing & Quality - used for each stable release
  • Nightly testing
    • 1200 automated tests on 30 nodes
  • QE certification for release
  • Release testing - alpha and beta
    • sandbox cluster - 3 clusters each 400 - 1k nodes
    • research cluster

#

  • Hadoop 1 and 2 Timeline
  • Federation alpha testing

#

  • development status of HA and wire compatibility
▶ Federation: Generalized Block storage
  • Block storage as generic storage service
  • Block pool
  • multiple independent NNs and namespace volumes in a cluster

#

  • HDFS generic storage service: opportunities for innovation
  • Federation - distributed namespace
  • new services - independent block pools
    • New FS - partial namespace in memory
    • MR tmp storage, HBase directly on block storage
    • Shadow file system in DN

#

  • Shadow File System
  • Custom Namespace to shadow namespace of another system
  • Different policies on the data
  • e.g. single replica, fetch missing ones from source
  • e.g. reduce replication factor for data duplicated in another cluster

#

  • Managing namespaces
  • Federation has multiple namespaces
  • Don't you need a single global namespace?
    • some tenants want a private namespace
      • Hadoop as a service
  • A simple global namespace is one way to share
  • Client-side mount tables are another way to share (see the sketch below)
    • shared mount-table
    • personalized mount-table
  • Client-side implementation of mount tables
    • no SPOF
    • no hotspot for root and top-level directories
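A sketch of what a client-side mount table can look like with ViewFS in Hadoop 2 (the fs.viewfs.mounttable.*.link properties are ViewFS's mechanism; the NameNode hosts and mount points here are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: a per-client mount table stitching two federated
// namespaces into one view. No central server resolves these paths,
// so there is no SPOF and no root hotspot.
public class ViewFsMountTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "viewfs:///");
        conf.set("fs.viewfs.mounttable.default.link./user",
                 "hdfs://nn1.example.com:8020/user");
        conf.set("fs.viewfs.mounttable.default.link./data",
                 "hdfs://nn2.example.com:8020/data");
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/user")));  // resolved client side
    }
}
```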

#

  • Next step: first support for volumes
  • NameServer
  • working set of namespace in memory
  • number of NameServers =
  • sum of (namespace working set) ...
▶ Enterprise Use cases
  • HA
  • Standard interfaces
    • WebHDFS, FUSE, and NFS access
  • Snapshot
  • Disaster Recovery
    • Distcp
      • enhance using journal interface & snapshots
  • Data Efficiency/RAID
    • productize the tools and experience at Facebook
▶ HDFS infrastructure improvement
  • Netty ###
    • Better connection and thread management
  • Image/Edits management
  • Parallel writes
  • Group Blocks
  • Support for Heterogeneous Storage
  • Rolling Upgrades improvement
▶ HA in Hadoop 1!
  • using the full-stack HA architecture
  • when the NN fails over,
  • the JT goes into safemode and apps (including external ones) pause/retry

#

  • once failover completes, everything returns to normal

#

  • HA in Hadoop 1 with HDP1
  • Full stack HA architecture
  • Use industry ...

#

  • Hadoop NN/JT HA with vSphere
  • NN HA with Linux-HA

#

  • failover times
  • NN failover times with vSphere and Linux-HA
  • failure detection and failover - 0.5 to 2 min
  • OS bootup needed for vSphere - 1 min ...
▶ Summary
  • Hadoop 1 - the most stable version
  • Hadoop 2 - alpha testing
    • 3 years of development
    • generalized storage layer
    • Hadoop 2 HA
  • Snapshot and DR improvement
▶ Q&A
  • about DistCp

#

  • Topology is unchanged in Hadoop 1 and 2.

#

  • about failover time
  • the speaker has tried it on a PB-scale cluster.

#

  • about Linux HA

1:30pm - 2:10pm HMS: Scalable and flexible configuration management system for Hadoop stack

Speaker: Kan Zhang, Eric C Yang

To deploy the whole Hadoop stack and manage its full life cycle is no small feat. This is partly due to the interdependency among services and partly due to the large number of nodes in a Hadoop cluster. For example, when upgrading an HDFS cluster, one needs to make sure NameNode is upgraded successfully before starting up the DataNodes. Such cross-node dependencies are difficult to specify in existing configuration management systems like Puppet. ...

▶ Motivation
  • Goal
  • managing Hadoop stack in a data center
  • Scalability
    • fault tolerance adds further complexity
  • real-time interaction and feedback
    • visibility into cluster state is a major pain point for sysadmins
  • cross-node ordering dependencies
    • e.g. start/stop NN, JT
      • simple to specify and efficient to implement
▶ HMS approach
  • ZooKeeper plays a central role
  • fault-tolerant and scalable storage

#

  • asynchronous messaging service
▶ ZooKeeper
  • a hierarchical namespace of znodes for storing data
  • sequential znodes for message queueing

#

  • watches for asynchronous notification
  • Ephemeral znodes for failure detection
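A minimal sketch of these primitives with the plain ZooKeeper Java client; the znode paths are hypothetical, not HMS's actual layout:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hedged sketch: sequential znodes (queueing), watches (async
// notification), ephemeral znodes (failure detection).
public class ZkPrimitives {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk.example.com:2181", 15000,
                new Watcher() {
                    public void process(WatchedEvent e) {
                        System.out.println("notified: " + e);
                    }
                });
        // message queueing: the server appends a sequence number,
        // giving consumers a total order
        zk.create("/hms/queue/msg-", "payload".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        // failure detection: this znode vanishes if the agent's session dies
        zk.create("/hms/agents/host1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        // watch: the default watcher fires when the queue's children change
        zk.getChildren("/hms/queue", true);
    }
}
```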
▶ Leveraging ZooKeeper
  • Storing Cluster size
  • Storing system state

#

  • Distributed orchestration
  • Cross-node dependency
▶ HMS Overview #picture
▶ Design Implications
  • All cluster state is stored in Zookeeper
  • Controllers and agents don't interact directly
    • all communication is via ZK async notifications
  • Controllers and agents are stateless
    • controllers can be replicated for load balancing
    • controller failures are automatically detected and handled
  • Dependency specified in terms of node states
    • actions come and go, but their effects are captured in node states
▶ Details of configuration
  • Node list
    • roles are mapped to hostnames

#

  • Package manifest

#

  • Configuration plan
    • run a list of scripts to configure NN

#

  • Start NN
  • run a script to set up HDFS on the NN

#

  • Compiled plan

#

  • Start JT
  • dependency descriptions

#

  • Node state
▶ Q&A

#

  • as mentioned earlier on a slide,
  • the 5k nodes are spread across multiple clusters

#

  • no OS-level deployment at the moment.

#

  • about ZooKeeper scalability
  • addressed in the ZooKeeper paper.

#

  • Can it be used with systems other than Hadoop?
  • Yes.
  • it depends on your scripts.

#

  • Puppet, Chef?
  • HMS is built for scalability,
  • designed from scratch

2:25pm - 3:05pm Big Data Challenges at NASA

Speaker: Chris Mattmann

I`ll describe what I see as the most current and exciting big data challenges at NASA, across a number of application domains: * Planetary Science * Next generation Earth science decadal missions * Radio Astronomy and the next generation instruments including the Square Kilometre Array * Snow Hydrology and Climate Impacts The session will focus on defining the problem space, suggesting how technologies like Apache Hadoop can be leveraged, ...

  • Big data challenges at NASA
  • and the set of technologies they are developing
▶ Self introduction
▶ Agenda
▶ Big data grand challenges I'm interested in
  • 700 TB/sec of data coming off the wire - when we actually have to keep it around?
  • 10 years of data from the Colorado River basin - how to store and disseminate the output products

#

  • How do we compare PB of climate model output data in a variety of formats ...

#

  • How do we catalog all of NASA's current planetary science data ###

#

  • comparing huge data sets in different formats
  • data placed in different places and data centers
▶ The NASA ESDS Context #picture
▶ Lessons from 90's era missions
  • data volumes
  • complexity of instruments and algorithms

#

  • increasing availability of proxy/sim/ancillary data
  • rate of technology refresh
▶ Where do Big data technologies fit into this?
  • U.S. national climate assessment
  • SKA South Africa: Square ...
▶ MODSCAG Historical Data processing
  • snow.jpl.nasa.gov
  • YARN

#

  • data processing, data flow dependency
▶ EVLA demonstration architecture #picture
  • Python (CASA?)
▶ Apache OODT
  • diverse set of science data system activities in planetary science, earth science, radio astronomy, biomedicine, astrophysics, and more...
▶ OODT: OSS bigdata platform originally pioneered at NASA
  • OODT is meant to be a set of tools to help build data systems
    • turn key ...
▶ Governance Model + NASA = ♥
  • NASA and other government agencies have tons of process
▶ OODT and PCS

#photo.

▶ How do we deploy PCS for a mission?
  • what NASA implements
    • server config
    • product metadata specification
    • ...
▶ OODT 0.4
▶ How do these fit together?
  • Hadoop, Apache OODT
▶ where are we headed with OODT + Hadoop
  • investigate and integrate YARN
  • Plug in HBase as File manager Catalog

#

  • OODT + Hadoop virtual machines and RPMs
    • Easy installation

3:35pm - 4:15pm Infrastructure around Hadoop - backups, failover, configuration, and monitoring

Speaker: Terran Melconian, Ed MacKenty

There's more running on your Hadoop cluster than just the Hadoop binaries. We'll present our accompanying infrastructure – our HDFS and Hive DDL backup system, our monitoring with Ganglia and Nagios, our use of DRBD for master failover, and our configuration management with Puppet. We'll talk about what's worth monitoring, what's worth automating, and some things we tried that didn't work out. We are in the process of open-sourcing several of these components and should be done well before June.

▶ What TripAdvisor does
  • World's largest travel site and community
  • Trip planning
  • > 50 million unique monthly visitors, 30 countries ...
▶ What the Warehouse team does
  • retain and aggregate historic site activity data
  • ~50 nodes in 4 clusters
    • analytics team of 12
  • 3 people fairly new to Hadoop/Hive
  • CDH3u3 (Hadoop 0.20.2)
▶ Why Hadoop at TripAdvisor
  • old RDBMS data warehouse could barely keep up with data integration, even running on expensive HW with a SAN
  • many kinds of statistics
▶ HA NameNode: DRBD, Corosync and Pacemaker
  • NN and JT on the master node
  • DN and TT run on slave nodes

#

  • automatic failover of all master node services to a passive node
  • provision two identical systems ...
▶ DRBD/Corosync Configuration
  • DRBD: replicates NN image, Hive metadata, Oozie job data
  • Corosync: messaging between active/passive masters ###

#

  • Corosync will start Pacemaker for you
  • use /etc/init.d/corosync to manage it, and Pacemaker
▶ Pacemaker configuration
  • Define each resource you want to manage
    • DRBD device, master IP address, Ethernet connectivity checks, Hadoop NN and JT, Hive thrift server, MySQL for Hive metadata...
  • set monitoring intervals for each resource
  • define resource co-location dependencies
  • define resource ordering dependencies
  • Restart failed services, e.g. Hive-Thrift
  • Use crm tool to manage nodes and resources
  • Test with a manual failover

#

  • the SNN runs on the passive machine.
▶ Monitoring: Ganglia and Nagios, Job Tracking
  • visibility into cluster operations
  • monitor HW states and resource usage
▶ Ganglia
  • Graphs for entire cluster and individual nodes
▶ Nagios
  • Their primary notification system
  • about 80 checks, ~25 are their own. #photo memo
  • Some warnings
  • Don't let Nagios run Hadoop fsck
    • it's a slow command (~30 sec) ###
  • LDAP failure causes email cascade
  • High loads can cause timeouts, which cause notifications
▶ Job Tracking ###
  • Perl script invoked frequently by cron
  • Parses JT log entries since last run
  • Records data on each job in PostgreSQL
  • CGI script to do queries

#

  • Helps with post-mortems of problems
  • Used to predict trends, future resource needs
▶ Other cron scripts we run
  • Check load
  • Master nodes
    • Compress Hadoop/Hive logs
    • check_hdfs
    • backup current NN metadata ...
▶ Configuration management
  • Use Puppet as a tool
  • Gets the job done
  • Lacks flexibility in inheritance to specialize defaults per machine
  • Some aspects of operations are hard to debug
▶ Backup ###
  • Runs on separate backup server with storage (NexSan)
  • Pull process driven by processes on backup server

#

  • Incremental HDFS backup
  • Hive DDL backup
▶ Backup Hive DDL
  • OSS java app uses Thrift server
  • also used to move data to MySQL
▶ Other things to potentially backup
  • Backup the NN Metadata ... #photo
▶ Why Oozie?
  • Drawback: several times slower to write than cron jobs, while also less expressive
  • Advantage: ... #photo
▶ How Oozie?
  • Bypass the Oozie GUI
  • Don't ever use Derby
▶ Experiences and Expectations
  • Hadoop is not mature yet
    • reliability, stability
  • Cluster outages are common events, not outliers
  • You must design for failure and have a robust mechanism to cleanly and easily resume execution once the cluster is back up.
  • Important jobs must be ..
▶ Attributes of Robust Jobs
  • Idempotent and resumable regardless ...
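One common way to get there is the write-to-temp-then-rename pattern; a hedged sketch (the paths and runJob() hook are hypothetical, not TripAdvisor's code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: skip if already done, write to a temp dir, promote
// atomically with a single rename. Rerunning after any failure is safe.
public class IdempotentRunner {
    public static void runIdempotent(Configuration conf, Path finalOut)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(finalOut)) {
            return;  // output already published: rerun is a no-op
        }
        Path tmpOut = new Path(finalOut.getParent(),
                finalOut.getName() + "._tmp");
        fs.delete(tmpOut, true);      // clear leftovers of a failed attempt
        runJob(conf, tmpOut);         // hypothetical: the real MR job
        fs.rename(tmpOut, finalOut);  // one rename = atomic publication
    }

    private static void runJob(Configuration conf, Path out) { /* ... */ }
}
```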
▶ Benchmark ###
  • Key insight
  • for the same job, the same task always does the same work.
▶ Features you should use ###
  • Fair scheduler
  • refreshNodes , refreshQueue ...
▶ Hiring
▶ Q&A
  • about Disaster Recovery
▶ Related tweets.

4:30pm - 5:10pm Hadoop and Vertica: The Data Analytics Platform at Twitter

Speaker: Bill Graham

Twitters Data Analytics Platform uses a number of technologies, including Hadoop, Pig, Vertica, MySQL and ZooKeeper, to process hundreds of terabytes of data per day. Hadoop and Vertica are key components of the platform. The two systems are complementary, but their inherent differences create integration challenges. This talk will give an overview of the overall system architecture focusing on integration details, job coordination and resource management. Attendees will learn about the pitfalls we encountered and the solutions we developed while evolving the platform.

▶ Overview
  • Architecture
  • Data flow

#

  • Job coordination....
▶ We count things
  • 140 characters
  • 140M active users
  • ...
  • 400M tweets per day
  • 80-100TB ingested daily
  • ...
▶ Heterogeneous Stack
  • Many job execution applications
    • Crane - Java ETL
    • Oink - Pig scheduler
    • Rasvelg - SQL aggregations
    • Scalding - Cascading via Scala
    • PyCascading - Cascading via Python
    • Indexing Jobs
  • Our users
    • Analytics, Revenue, Growth, Search, Recommendations, etc...
    • PMs, sales!
▶ Data Flow: Analytics #photo
  • production hosts

#

  • staging Hadoop cluster
  • MySQL/Gizzard
  • main Hadoop DW (HBase)

#

#

  • HCatalog
▶ Chaotic? Actually, no.
▶ System Concepts
  • Loose coupling
  • Job coordination as a service

#

  • Resource management as a service
  • Idempotence
▶ Loose coupling
  • Multi job frameworks
  • Right tool for the job

#

  • Common dependency management
▶ Job coordination
  • Shared batch table for job state
  • access via client libraries

#

  • Jobs & data are time-based
  • 3 types of preconditions (see the sketch below)
    1. other job success
    2. existence of data
    3. user defined
  • Failed jobs get retried (usually)
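A minimal sketch of precondition type 2, assuming the standard Hadoop convention of a _SUCCESS marker in a finished output directory (the hourly path layout is hypothetical, not Twitter's client-library API):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: gate a job on the marker an upstream job leaves behind.
public class DataPrecondition {
    public static boolean inputReady(Configuration conf, String hour)
            throws Exception {
        Path marker = new Path("/dw/events/" + hour + "/_SUCCESS");
        return FileSystem.get(conf).exists(marker);
    }
}
```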
▶ Resource management
  • ARM
  • Library above ZK

#

  • Throttles jobs and workers
    • Only 1 job of this name may run at once
    • Only N jobs may be run by this app at once
    • Only M ....
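The "only 1 job of this name at once" rule maps naturally onto a ZooKeeper ephemeral znode; a hedged sketch (the paths are made up, this is not the actual ARM API):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hedged sketch: an ephemeral znode as a job lock. It exists only while
// the holder's session is alive, so a crashed job releases the slot.
public class JobThrottle {
    public static boolean tryAcquire(ZooKeeper zk, String jobName)
            throws KeeperException, InterruptedException {
        try {
            zk.create("/arm/locks/" + jobName, new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;   // we hold the slot until our session ends
        } catch (KeeperException.NodeExistsException e) {
            return false;  // another instance of this job is running
        }
    }
}
```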
▶ Job DAG and state transition
  • Local view
  • is it time for me to run yet?
  • are my dependencies satisfied?
  • any resource constraints?
▶ Example : active users #photo
▶ Vertica or Hadoop? ###
  • Vertica
    • loads 100s of thousands of rows/sec
    • aggregates 100s of millions of rows in seconds ...
  • Hadoop
    • Excels when data size is massive
    • Flexible and powerful
    • Great with nested data ...
▶ Vertica import options
  • Direct import via Crane
  • Atomic import via Crane/Rasvelg
  • Parallel import
▶ Vertica imports - Pros/cons
  • Crane and Rasvelg
    • good for smaller data sets, DB-to-DB transfers
  • Pig
    • great for larger datasets
▶ VerticaStorer
  • PigStorage implementation
  • From Vertica's Hadoop connector suite ...
▶ Pig VerticaStorage
  • Their enhancement
    • Connection credential management
    • truncate before load option
    • Throttle concurrent writers via ZK
  • Future feature
    • counters for rows inserted/rejected
    • Name-based tuple-column bindings
    • atomic load via temp table
▶ Gotcha #1
  • MR data load is not atomic
  • Avoid partial reads

#

  • option 1: load to a temp table, then insert directly (see the sketch below)
  • option 2: add a job dependency concept
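A hedged sketch of option 1 over JDBC: because the promotion happens inside one transaction, readers never observe a half-loaded table. The JDBC URL, tables, and credentials are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hedged sketch: tmp_events was filled by the (non-atomic) MR load;
// moving rows into the target inside a transaction makes the load
// appear atomic to readers.
public class AtomicLoad {
    public static void promote() throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:vertica://vertica.example.com:5433/dw", "loader", "secret");
        conn.setAutoCommit(false);
        Statement st = conn.createStatement();
        try {
            st.executeUpdate("INSERT INTO events SELECT * FROM tmp_events");
            st.executeUpdate("DELETE FROM tmp_events");
            conn.commit();  // both statements become visible together
        } catch (Exception e) {
            conn.rollback();
            throw e;
        } finally {
            conn.close();
        }
    }
}
```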
▶ Gotcha #2 #photo
▶ Gotcha #3
  • isIdempotent() must be a first-class ...
▶ Gotcha #4
  • Vendor code only gets you so far
  • Nice to have == have to write
▶ Future work
  • more VerticaStorer features
  • multiple vertica clusters
  • atomic DB loads with Pig/Oink
  • Better DAG visibility
  • Better job history visibility
  • MR job optimizations via historic stats
  • HCatalog data registry
  • Job push events
▶ Q&A #laugh

Need to check IC recorder later.

Continued in Notes of Day 2 and the Summary.

Related posts.