#garagekidztweetz

The lifelog blog of id:garage-kid@76whizkidz!

Memos: #hw2011 (Day1), 08, Nov, 2011.



I attended Hadoop World 2011, held in NYC, so I would like to start by sharing my notes from day one.
Since they are long, I will publish them as three entries: this one, the day-two memo, and a summary.


8:30am-10:00am Keynotes: Hadoop World 2011: Mike Olson Keynote Presentation

Now in its fifth year, Apache Hadoop has firmly established itself as the platform of choice for organizations that need to efficiently store, organize, analyze, and harvest valuable insight from the flood of data that they interact with. Since its inception as an early, promising technology that inspired curiosity, Hadoop has evolved into a widely embraced, proven solution used in production to solve a growing number of business problems that were previously impossible to address. In his opening keynote, Mike will reflect on the growth of the Hadoop platform due to the innovative work of a vibrant developer community and on the rapid adoption of the platform among large enterprises. He will highlight how enterprises have transformed themselves into data-driven organizations, highlighting compelling use cases across vertical markets. He will also discuss Cloudera’s plans to stay at the forefront of Hadoop innovation and its role as the trusted solution provider for Hadoop in the enterprise. He will share Cloudera’s view of the road ahead for Hadoop and Big Data and discuss the vital roles for the key constituents across the Hadoop community, ecosystem and enterprises.


  • Four exciting keynotes
  • Lots of networking opportunities
  • Sixty educational sessions
  • Wireless network: Sheraton Meeting
  • Hashtag: #HW2011
  • Take the surveys
  • Breakout sessions
  • Overall survey
Three years ago: Hadoop is going to be huge.
  • 1,400 people from 580 companies
  • Users with less than one year of experience are the majority
  • Average cluster size is 120 nodes, up from 66 last year
  • The largest cluster is bigger than 20 PB
Two years ago
  • Hadoop is at the center of a new platform for big data.
  • The Apache ecosystem around Hadoop became more developed.
Last year

Hadoop must integrate with datacenter infrastructure.

  • You need to understand how these services contribute to the business.
  • From high level to low level.
  • It needs to be operated reliably.
This year
  • We're talking about the future.
    • Unstructured data's share of total archived data is increasing significantly.

Building Applications

  • WibiData
  • Develop personalized applications on Hadoop and HBase.
  • Analyze which application is the most suitable for you to use.
Data Analysis and Visualization
  • Demand for online app analytics
    • Interactive.
  • Powerful statistical tools
    • Why Hadoop and R?
      • Need to do more than simple statistics.
      • Analyze all of the data.
  • Integration
    • Complex data exploration.
  • US government, Army.
  • Business analytics
    • Can be viewed on iPad and iPhone.
  • An exploding, diverse ecosystem
    • Big, robust market.
ACCEL
  • $100MM dedicated to finding entrepreneurs globally who are building disruptive big data companies.
  • Funding innovation across every layer of the big data stack.
  • Who we are
    • Three decades of technology investing, with over $6B in the US, Europe, China and India.
  • The Big Data wave.
    • Data is exploding.
    • Creating value for companies.
  • Funding the big data ecosystem
    • A big mainstream shift has happened on the big data platform.
    • Hadoop
  • Contact: @bigdatafund
The Next Generation Data Center
  • Hadoop is the key part of the infrastructure.
  • Connects to the ...
  • The future
  • Tackling critical business issues.

Hadoop World 2011 Keynote: eBay Keynote - Hugh E. Williams, eBay

Hugh Williams will discuss building Cassini, a new search engine at eBay which processes over 250 million search queries and serves more than 2 billion page views each day. Hugh will trace the genesis and building of Cassini as well as highlight and demonstrate the key features of this new search platform. He will discuss some of the challenges in scaling arguably the world’s largest real-time search problem, including the unique considerations associated with e-commerce and eBay’s domain, and how Hadoop and HBase are used to solve these problems


  • Project Cassini: a new search engine
  • Three stories: about eBay, auctions, some statistics.
  • Item example: $2.63 million for a lunch with Warren Buffett ...
  • 250MM queries/day
  • 75B database calls/day
  • 9 PB of data in their Hadoop cluster; 2B page views/day
Huge opportunity: taking the E out of ecommerce.
  • Today: offline influenced by online >= offline > online
  • Tomorrow: online + offline
Project Cassini at eBay
  • A new kind of search engine.
Explanation of Voyager: their current search engine.
  • Entirely new codebase
  • World class, from a world-class team
  • Platform for ranking innovation
  • Uses all data by default
  • Flexible - automated
  • Four major tracks, 100+ engineers
  • Complete in less than 18 months
Beginning tests and likely launch in 2012.
A Short Primer on Indexing
  • When a user types a query, it isn't practical to exhaustively scan 200+ million items.
  • Instead, we create an inverted index and use it to rank the items and find the best matches.
An inverted index (a tiny sketch follows below)
  • e.g. find the items containing the word "cat"
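The primer above boils down to one small data structure: a map from each term to the list of item IDs containing it. Here is a minimal, self-contained Java sketch of that idea; the item data and the "cat" query are invented for illustration, this is not eBay's code.

    import java.util.*;

    // Minimal illustration of the inverted-index idea from the primer above.
    public class InvertedIndexDemo {
        public static void main(String[] args) {
            Map<Long, String> items = new HashMap<>();
            items.put(101L, "vintage cat clock");
            items.put(102L, "dog collar large");
            items.put(103L, "cat carrier with wheels");

            // term -> list of item ids containing that term (the "postings list")
            Map<String, List<Long>> index = new HashMap<>();
            for (Map.Entry<Long, String> e : items.entrySet()) {
                for (String term : e.getValue().toLowerCase().split("\\s+")) {
                    index.computeIfAbsent(term, t -> new ArrayList<>()).add(e.getKey());
                }
            }

            // A query only touches the postings for its terms, instead of scanning every item.
            System.out.println("items matching 'cat': "
                    + index.getOrDefault("cat", Collections.emptyList()));
        }
    }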
Distributed Index Construction
Indexing in Cassini
  • Larger index than Voyager
    • Descriptions, seller data, other metadata...
  • More computationally expensive work at index time and less at query time
  • Ability to rescore or reclassify entire site inventory
HBase and Hadoop in Cassini
  • Hadoop
    • Distributed indexing (see the MapReduce sketch after this list)
    • Fault tolerance through HDFS replication
    • Better utilization of HW
  • HBase
    • Column-oriented data store on top of HDFS
    • Used to store eBay's items
    • Bulk and incremental item writes
    • Fast item reads for index construction
    • Fast item reads and writes for item annotation
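As a rough picture of what "distributed indexing" on Hadoop means, here is a hedged MapReduce sketch in Java: the mapper tokenizes item descriptions and emits (term, itemId), and the reducer concatenates the IDs into a postings list. The input layout (one "itemId<TAB>description" line per item) is an assumption for the example; Cassini's real pipeline reads items from HBase and is far more elaborate.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Generic pattern for building an inverted index with MapReduce.
    public class DistributedIndexer {

        public static class TermMapper extends Mapper<LongWritable, Text, Text, Text> {
            private final Text term = new Text();
            private final Text itemId = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] parts = line.toString().split("\t", 2);
                if (parts.length < 2) return;                  // skip malformed lines
                itemId.set(parts[0]);
                for (String t : parts[1].toLowerCase().split("\\W+")) {
                    if (t.isEmpty()) continue;
                    term.set(t);
                    ctx.write(term, itemId);                   // emit (term, itemId)
                }
            }
        }

        public static class PostingsReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text term, Iterable<Text> itemIds, Context ctx)
                    throws IOException, InterruptedException {
                StringBuilder postings = new StringBuilder();
                for (Text id : itemIds) {
                    if (postings.length() > 0) postings.append(',');
                    postings.append(id.toString());
                }
                ctx.write(term, new Text(postings.toString())); // term -> postings list
            }
        }
    }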
Cassini Challenges with HBase
  • Everyone is still learning
  • Some issues only appear at scale
  • Production cluster configuration is challenging
  • HBase stability
  • Monitoring the health of HBase
  • Managing workflows of multi-step MR jobs

Keynote: Larry Feinsmith, JPMorgan Chase

Its global technology footprint:
  • 150 PB of online storage
  • 42 data centers
  • 50K servers
  • Investment:
    • $4B on the application side
    • $4B on the infrastructure side
Why does JPMorgan Chase use Hadoop?
  • Two simple reasons:
    • Increase revenue
    • Reduce cost
Big data analytics sweet spot
  • Favors big data analytics:
    • Per-job data volumes
    • Schema complexity
    • Processing freedom
    • Total data volumes
  • Favors a traditional RDBMS:
    • Concurrent jobs
    • Data update patterns
    • Responsiveness
    • Table-join complexity
JPMorgan Chase big data analytics strategy
  • Offering Hadoop as a shared service to the lines of business
  • 5 of 7 lines of business are actively using Hadoop
  • Building a data-scientist team
  • Advisory partnerships with industry-leading vendors
Use Case: ETL
  • Move lots of data.
  • Private bank
    • Transaction filtering to feed data marts
  • Global Finance Technology
    • Basel III liquidity analysis
      Pair with ETL tools to accelerate underperforming batches ....
Use Case: Common Data Platform
  • Across all consumer businesses
    • 360 degree customer view
  • Treasury & security services
    • Global payment hub
      Low cost storage alternative for occasionally searched data
Use Case: Data Mining
  • Retail Bank and IT infrastructure
    • Fraud prevention
  • Investment Management
    • Trade quality analysis
      # Existing analysis technology needs to be complemented with new technology.
Analysis across structured and unstructured data.
  • Considerations for Hadoop in the enterprise
    • Monitoring, recovery and resiliency
    • Skills shortage
    • Needs a killer app
    • Better integration with the BI ecosystem
    • Security and entitlements
    • Megavendors' proprietary offerings

10:15am-11:05am Building Realtime Big Data Services at Facebook with Hadoop and HBase Jonathan Gray, Facebook

Facebook has one of the largest Apache Hadoop data warehouses in the world, primarily queried through Apache Hive for offline data processing and analytics. However, the need for realtime analytics and end-user access has led to the development of several new systems built using Apache HBase. This talk will cover specific use cases and the work done at Facebook around building large scale, low latency and high throughput realtime services with Hadoop and HBase. This includes several significant contributions to existing projects as well as the release of new open source projects.
Realtime Analytics.


About the speaker
  • Software engineer at Facebook
  • The MySQL and HBase teams work together
Why Hadoop and HBase?
  • The LAMP stack
  • Problems with the existing stack
  • The MySQL story has been heard many times before ...
Memcached is fast, but ...
  • It is only a KV store
  • No write throughput
Hadoop is scalable, but ...
  • MR is slow
  • Difficult
Specialized solutions
  • Inbox search
    • Cassandra
  • High-throughput persistent key-value store
    • Tokyo Cabinet
  • Large-scale DWH
    • Hive
Finding a new online data store
  • Consistent patterns emerge
    • Massive datasets, often largely inactive
    • Lots of writes
    • Fewer reads
    • Dictionaries and lists
    • Entity-centric schemas
      • Per user, per domain, per app
  • Other requirements laid out
    • Elasticity
    • HA
    • Strong consistency within a DC
Some non-requirements
  • Tolerating network partitions within a single DC
  • Active-active serving
In 2010, engineers at FB compared databases
  • Compared performance, scalability, and features
    • HBase gave excellent write performance
    • Multiple shards per server
    • Atomic read-modify-write operations
    • Bulk importing
    • Range scans
HBase uses HDFS
  • We get the benefits of HDFS as a storage layer for free
    • Fault tolerance
    • Scalability
    • Checksums
    • MR
    • Fault isolation....
HBase in a nutshell
  • Sorted and column-oriented
  • High write throughput
  • Horizontal scalability
  • Automatic failover
  • Regions are sharded dynamically
(A minimal client usage sketch follows.)
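To make the "sorted, column-oriented, high write throughput" description above concrete, here is a minimal Java sketch using the 0.90-era HBase client API. The table name, column family and values are invented for the example, and the table is assumed to already exist.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    // Minimal HBase usage sketch: one write, one read.
    public class HBaseNutshellDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "demo_table");     // assumes the table already exists

            // Writes go to the commit log + MemStore, sorted by row key.
            Put put = new Put(Bytes.toBytes("user42"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("NYC"));
            table.put(put);

            // Reads address a (row, column family, qualifier) cell directly.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            String city = Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city")));
            System.out.println("city = " + city);

            table.close();
        }
    }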
Applications of HBase at Facebook
Use case 1: Titan
  • Facebook Messaging
    • Hadoop, HBase, Haystack, ZooKeeper
  • High write throughput
    • Denormalized schema
    • Commit log -> MemStore -> sequential write to file
  • Horizontal scalability
    • Region servers
  • Automatic failover
Facebook Messages stats
  • 6B+ messages per day
    • 75B+ read/write ops to HBase per day
    • 1.5M ops/sec at peak -- 55% reads, 45% writes
    • 16 columns per operation, across multiple families
  • 2PB+ of online data in HBase
    • LZO compressed; growing at 250 TB/month
Use case 2: PUMA
  • Before PUMA: offline ETL

(Before PUMA) web tier -> Scribe -> HDFS -> MR/Hive -> SQL -> MySQL

(After PUMA) web tier -> Scribe -> HDFS -> PTail -> PUMA (PTail client) -> HTable/HBase -> Thrift -> back to the web tier
# end-to-end latency roughly 2-30 sec

PUMA as realtime MR
  • Map phase with PTail
    • Divides the input data stream into N shards
    • The first version supported random bucketing
    • Now supports application-level bucketing
  • Reduce phase with HBase
    • Every row+column in HBase is an output key
    • Aggregates key counts using atomic counters (see the counter sketch after this list)
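The "aggregate key counts using counters" step maps naturally onto HBase's atomic increment call. Below is a hedged Java sketch of that idea; the table, family, qualifier and row names are invented, and this is not Facebook's PUMA code, just the general pattern.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    // Each processed log event bumps an atomic HBase counter for its (domain, metric) cell.
    public class DomainCounterDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable insights = new HTable(conf, "domain_insights");  // assumes the table exists

            // One increment per event; HBase applies it atomically server-side,
            // so no read-modify-write round trip is needed in the client.
            insights.incrementColumnValue(
                    Bytes.toBytes("example.com"),     // row: the domain
                    Bytes.toBytes("counters"),        // column family
                    Bytes.toBytes("page_views"),      // qualifier: the metric
                    1L);                              // amount

            insights.close();
        }
    }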
PUMA for FB insights
  • Real time URL Domain insights
  • Massive Throughput
    • 1M counter increments per second
Future of PUMA
  • Centrally managed service for many products
  • Several other applications in production
    • Commerce tracking
    • Ad insights
  • Making PUMA generic
    • Dynamically configured by product teams
    • Custom query languages
Use Case 3: ODS
Operational Data Store
  • System metrics (CPU, memory, IO, network)
  • Application metrics (web, DB, caches)
  • FB metrics (usage, revenue)
    • Easily graph data over time
  • Difficult to scale with MySQL
    • Irregular data growth
    • Millions of unique time series with billions of points
MySQL to HBase migration.
  • Dynamic sharding of regions
Future of HBase at Facebook
  • User and graph data in HBase
  • Why now?
  • MySQL + memcached is hard to replace, but ...
    • Joins and other RDBMS functionality are already gone
    • Moved from writing SQL to using APIs
    • The persistent storage engine is transparent to the www tier
    • Primarily a financially motivated decision
  • MySQL works, but can HBase save more money?
  • HBase vs MySQL
    • MySQL
      • Tier size determined solely by IOPS
      • Heavy on random IO for reads and writes
      • Relies on fast disks or flash to scale individual nodes
    • HBase is showing promise of cost savings
      • Fewer IOPS for a write-heavy workload
      • Larger tables on denser, cheaper nodes
      • Simpler operations, and replication for free
    • MySQL is not going anywhere soon, but HBase is a great addition to the toolkit
    • A different set of trade-offs
    • Great at storing key-values, dictionaries, and lists
    • Products with heavy write requirements
    • Generated data
UDB challenges
  • MySQL has a 20+ head start
Insane requirements
  • Zero data loss, low latency, very high throughput
  • WAN replication, backups with point-in-time recovery
  • Live migration of critical user data with existing shards
  • queryf() and other fun edge cases to deal with
  • HBase at Facebook

11:15am - 12:05pm The Hadoop Stack – Then, Now and In The Future Charles Zedlewski, Cloudera Eli Collins, Cloudera

Many people refer to Apache Hadoop as their system of choice for big data management but few actually use just Apache Hadoop. Hadoop has become a proxy for a much larger system which has HDFS storage at its core. The Apache Hadoop based “big data stack” has changed dramatically over the past 24 months and will change even more over the next 24 months. This session will explore the trends in the evolution of the Hadoop stack, change in architecture and changes in the kinds of use cases that are supported. It will also review the role of interoperability and cohesion in the Apache Hadoop stack and the role of Apache Bigtop in this regard.


In the beginning
good start
  • Shell/CLI
  • Data processing, Resource management
  • File storage
First cut at the system
  • Shell/CLI
  • Languages, Library, Workflow : Hive, Pig, Mahout
  • Data processing, Resource management : Hadoop
  • Metadata storage : Hive
  • Record storage : HBase
  • File storage : Hadoop
  • Coordination : Zookeeper
  • Formats, RPC, Compression
Where we are today
  • Web(Hue), Shell/CLI, Drivers(JDBC/ODBC)
  • Languages, Libraries, Workflow, Scheduling(Oozie)
  • Data processing, Resource management.
  • Metadata
  • Record storage
  • File storage
  • Coordination
  • Formats, RPC, Authorization, compression
  • Integration
  • Files
  • RDBMS
  • Logs & events
Core use cases
  • Data processing
    • Search index construction
    • click sessionization
  • Analytics
    • Machine learning
    • Batch reporting
  • Live content serving (for the braver folks)
  • Realtime applications
    • Content serving
    • system management
    • realtime aggregates & counters
  • Storage
    • EDW archive
Limitations
  • Redundancy
    • DAG, RPC, serialization, integration, etc.
  • Uniformity
    • Different components require different DBs, management interfaces, etc.
  • Ease of use
    • Improving, but still an obstacle, e.g. non-native file formats require integration
  • Multiple DCs
    • Cross-DC replication for HBase but not HDFS
  • Interoperability
    • Requires conversions
Ongoing work
  • Metadata repositories
    • Shared schema and data types, table abstraction via Apache HCatalog (incubating) and Apache Hive
  • Self-describing data via Apache Avro (see the sketch after this list)
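"Self-describing data" here means the schema travels inside the Avro container file, so a reader needs no external metadata. A small, self-contained Java sketch of that round trip follows; the record shape ("Click" with url/count) and file name are invented for the example.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroSelfDescribingDemo {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
                  + "{\"name\":\"url\",\"type\":\"string\"},"
                  + "{\"name\":\"count\",\"type\":\"long\"}]}");
            File file = new File("clicks.avro");

            // Write: the schema is embedded in the container file's header.
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("url", "http://example.com");
            rec.put("count", 42L);
            DataFileWriter<GenericRecord> writer =
                    new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
            writer.create(schema, file);
            writer.append(rec);
            writer.close();

            // Read: no schema argument needed, it is recovered from the file itself.
            DataFileReader<GenericRecord> reader =
                    new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
            System.out.println("embedded schema: " + reader.getSchema());
            for (GenericRecord r : reader) {
                System.out.println(r);
            }
            reader.close();
        }
    }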
Apache Bigtop
  • Dedicated to Hadoop stack integration and testing.
    • integration - between projects, dependencies, hosts
  • testing
    • interoperability, multi-component use cases
technical trends
  • Moving more forms of computation to Hadoop storage
  • Frameworks to make HBase more app and developer friendly
  • Taking advantage of pluggability to provide more optimized formats , schedulers , codecs, etc...
  • More granular security model...
HW trends
  • Network: 10/40 GigE
  • Storage: 48/60 TB
  • Cloud; low-power CPUs
Enabling future use cases, pt 1
  • More valuable data
    • Cost
    • High-value data not just generated but also consumed by the platform
  • Richer end-user applications
    • Apps built differently on the platform (eBay's Cassini, Facebook Messaging ...)
    • Web 3.0
Enabling future use cases, pt 2
  • Lower latency / higher interactivity
  • Interactive
    • Human-driven, correlated access, e.g. analytics
Enabling future use cases, pt 3
  • Policy - access control, standard management interfaces, SLAs, MDM
  • Operations
    • DR, archive, etc.
    • Traditional features
Things to look forward to
  • Web , Shell/CLI, Drivers
  • Languages, Libraries, Workflow, Scheduling
  • MR, Stream, Graph, MPI, Other
  • Resource management
  • Metadata storage
  • Time series, ORM, OLAP, OLTP
  • Record storage
  • File storage....

1:15pm-2:05pm Security Considerations for Federal Hadoop Deployments, Jeremy Glesner and Richard Clayton, Berico Technologies

Security in a distributed environment is a growing concern for most industries. Few face security challenges like the Defense Community, who must balance complex security constraints with timeliness and accuracy. We propose to briefly discuss the security paradigms defined in DCID 6/3 by NSA for secure storage and access of data (the “Protection Level” system). In addition, we will describe the implications of each level on the Hadoop architecture and various patterns organizations can implement to meet these requirements within the Hadoop ecosystem. We conclude with our “wish list” of features essential to meet the federal security requirements.


Federal IT trends
  • Data Deluge
  • IT consolidation
Traditional security
  • Isolate the Hadoop cluster.
  • Boundary protection through a tiered VLAN architecture ...
Security constraints
  • FISMA
  • Executive orders
  • DoD, NIST
  • What are the security implications for Hadoop?
    1. Protecting data from theft or accidental ...
    2. ...
    3. Securing the results of analytics that cross datasets.
Use cases
Case 1
  • Essential security: users cannot read data at rest or in motion unless ...
  • Limitations of encrypting whole files in HDFS
  • Block-level encryption.
    How to implement block-level encryption:
  • Create a custom file format - use a custom compression codec for use in SequenceFiles.
    • ...
    • # github, hadoop-opengp-codec
  • Configure Hadoop to use the custom compression codec (a configuration sketch follows below).
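For the "configure Hadoop to use the custom codec" step, the wiring is done through ordinary configuration properties. Below is a hedged Java sketch using the 2011-era ("mapred.*") property names; the com.example.crypto.AesSequenceFileCodec class name is hypothetical, standing in for whatever encrypting codec an organization implements, and is not a real published library.

    import org.apache.hadoop.conf.Configuration;

    // Register a (hypothetical) encrypting codec and route compressed job output through it.
    public class EncryptedOutputConfigDemo {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Add the custom codec to the list of codecs Hadoop knows how to load.
            conf.set("io.compression.codecs",
                    "org.apache.hadoop.io.compress.DefaultCodec,"
                  + "org.apache.hadoop.io.compress.GzipCodec,"
                  + "com.example.crypto.AesSequenceFileCodec");    // hypothetical class

            // Make job output (e.g. block-compressed SequenceFiles) go through that codec.
            conf.setBoolean("mapred.output.compress", true);
            conf.set("mapred.output.compression.type", "BLOCK");
            conf.set("mapred.output.compression.codec", "com.example.crypto.AesSequenceFileCodec");

            System.out.println("configured codec: " + conf.get("mapred.output.compression.codec"));
        }
    }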

Encryption Gotchas

  • Temporary data
    • Temporary data must also be encrypted
  • Key management
    • How does the compression codec know what key to use?
      • A property file
      • Environment ...
  • Passing keys via JobConf
    • Depends on the compression codec implementation
    • Set the ...
  • Implementing SSL in Hadoop
    • HDFS and MR communication in Hadoop is done through the Hadoop IPC ...
Case 2
  • Mixed sensitivities
  • Security objective
    • Restrict data based on its classification level
    • Hadoop administrators ...
  • Multi-key encryption caveats
    • Same principles as single-key block encryption, except
    • ...
  • Passing context to the compression codec for decryption
    • Pass the correct key via JobConf
    • Extend SequenceFile to carry encryption context in its header data, and extend the compression codec to accept the SequenceFile ...
  • Controlling data entry
    • In multi-classification instances of Hadoop, data must enter HDFS encrypted.
    • Strategy for multi-classification transport: in a multi-classification system, data in transit must also be segregated.
    • SSL certificates for Hadoop
    • Per-corpus SSL certificates
    • Generate ...
Case 3
  • Securing the results of analytics.
  • Security objective
    • Allow MR operations to occur across classifications
    • Allow the creation of new classifications by combining two or more datasets ...
  • Strategies for record-level encryption
    • Decrypt/encrypt directly in the Mapper, Reducer and Combiner ... (a mapper sketch follows this list)
  • Extending the compression codec to allow record-level decryption ...
  • Controlling data entry with mixed sensitivities.
  • Handling derived data
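To illustrate "decrypt directly in the Mapper", here is a hedged, self-contained Java sketch. It assumes each input line is Base64(IV || AES-CBC ciphertext) and that the key material reaches the task through the job configuration, as discussed under "Encryption Gotchas" above. The property name and record layout are invented; this is not the speakers' implementation.

    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.util.Arrays;
    import javax.crypto.Cipher;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;
    import org.apache.commons.codec.binary.Base64;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RecordDecryptingMapper extends Mapper<LongWritable, Text, Text, Text> {

        private SecretKeySpec key;

        @Override
        protected void setup(Context ctx) {
            // Key is passed via the job configuration (hypothetical property name).
            byte[] rawKey = Base64.decodeBase64(
                    ctx.getConfiguration().get("example.record.encryption.key"));
            key = new SecretKeySpec(rawKey, "AES");
        }

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            try {
                byte[] blob = Base64.decodeBase64(line.toString());
                byte[] iv = Arrays.copyOfRange(blob, 0, 16);
                byte[] ciphertext = Arrays.copyOfRange(blob, 16, blob.length);

                Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
                cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
                String record = new String(cipher.doFinal(ciphertext), Charset.forName("UTF-8"));

                // Downstream logic sees cleartext; re-encrypt before writing out if required.
                ctx.write(new Text("record"), new Text(record));
            } catch (java.security.GeneralSecurityException e) {
                throw new IOException("record decryption failed", e);
            }
        }
    }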
Closing / Results
  • Their next steps:
    • Implement
    • Search
    • Mature
    • Watch list
      • Accumulo
        A new Apache Incubator project sponsored by Doug Cutting:
        a real-time key-value store that incorporates cell-level labels to control what data is returned to the user.

2:15pm-3:05pm Hadoop Network and Compute Architecture Considerations, Jacob Rapp, Cisco

Hadoop is a popular framework for web 2.0 and enterprise businesses who are challenged to store, process and analyze large amounts of data as part of their business requirements. Hadoop’s framework brings a new set of challenges related to the compute infrastructure and underlying network architectures. This session reviews the state of Hadoop enterprise environments, discusses fundamental and advanced Hadoop concepts and reviews benchmarking analysis and projection for big data growth as related to Data Center and Cluster designs. The session also discusses network architecture tradeoffs, and the advantages of close integration between compute and networking.


Cisco and Big Data Hadoop World 2011.

Big data NW infrastructure
  • cisco unified fabric
  • Big data traditional databases
  • storage
Lab environment Overview
  • 128 nodes of UCS C200 M2, 1RU Rack mount servers
  • 4x2TB, dual Xeon ...
  • Topologies used in the following analytics
  • N7k/N5k + N2k topology
  • N7k/N3k topology
Characteristics that affect Hadoop Clusters
  • Cluster size
  • Number of DN
  • Data model
  • MR function
    • Input data size - Total starting dataset
  • Characteristics of DN
    • IO, CPU, Memory, ...
    • Data locality of HDFS
  • Background Activity
  • NW characteristics
Cluster size
  • Time taken becomes lower as the number of nodes grows.
  • 24, 48, 82 nodes
  • # Test results from an ETL-like workload (Y! TeraSort) using 1 TB of data.
MR data model
  • The complexity of the functions used in map and/or reduce has a large impact on the type of job and the network traffic.
  • # Shows the difference between Y! TeraSort and a Shakespeare word count.
ETL workload (1 TB Y! TeraSort)
  • Network graph of all traffic received on a single node (80-node run)
  • # Shortly after the reduce phase starts, map tasks are finishing and data is being shuffled to the reducers; as the maps completely finish, the network ...
BI workload (word count on 200K copies of the complete works of Shakespeare)
  • Network graph of all traffic received on a single node (80-node run)
Input data size
  • Given the same MR job, the larger the input dataset, the longer the job will take.
  • It is important to note that dataset sizes increase completion times ...
Characteristics of data nodes
  • The IO capacity, CPU and memory of the data nodes can have a direct impact on the performance of a cluster.
Data locality in HDFS
  • Data locality - the ability to process data where it is locally stored.
  • # During the map phase, the JobTracker attempts to use data locality to schedule map tasks where the data is locally stored. This is not perfect and is dependent on a data node ...
Multi-use cluster characteristics
  • Hadoop clusters are generally multi-use; the effect of background use can affect any single job's completion.
  • # A given cluster is generally running many different types of jobs, imports into HDFS, etc.
Network characteristics
  • The relative impact of various network characteristics on Hadoop clusters
  • Impact of network characteristics on job completion times
  • Availability and resiliency have the biggest impact.
Availability and resiliency
  • The failure of a networking device can affect multiple data nodes of a Hadoop cluster, with a range of effects.
Burst handling and queue depth
  • Several HDFS operations and phases of MR jobs are very bursty in nature.
TeraSort N3k buffer analysis (10 TB)
  • Buffer utilization is highest during the shuffle and output-replication phases.
  • Optimized buffer sizes are required to avoid packet loss leading to slower job completion times.
    • Buffers are heavily used during the shuffle phase
    • Buffers are heavily used during output replication
TeraSort FEX buffer analysis (10 TB)
  • Buffer utilization is highest during the shuffle and output-replication phases.
  • Optimized buffer sizes are required to avoid packet ...
Multi-use cluster characteristics
  • In the multi-use cluster described previously, multiple jobs (ETL, BI, ...) and imports into HDFS can be happening at the same time.
Over-subscription ratio
  • In the largest workloads, multiple TBs can be transmitted across the network.
Data node network speed
  • Generally 1GE is used, largely due to cost/performance trade-offs, though 10GE can provide benefits ...
1GE vs. 10GE buffer usage
  • Moving from 1GE to 10GE actually lowers the buffer requirement at the switching layer.
Network latency
  • Generally, network latency (while consistent latency is important) does not represent a significant ...
  • cisco.com/go/bigdata


3:30pm-4:20pm Integrating Hadoop with Enterprise RDBMS using Apache SQOOP and Other tools

As Hadoop graduates from pilot project to a mission critical component of the enterprise IT infrastructure, integrating information held in Hadoop and in Enterprise RDBMS becomes imperative. We’ll look at key scenarios driving Hadoop and RDBMS integration and review technical options. In particular, we’ll deep dive into the Apache SQOOP project, which expedites data movement between Hadoop and any JDBC database, as well as providing a framework which allows developers and vendors to create connectors optimized for specific targets such as Oracle, Netezza etc.


  • Structured data transfer to Hadoop.
  • Hadoop data processing
    • Pig, Hive, MR, HBase, Mahout
    • on HDFS
  • The data store is not only HDFS but also RDBMS, document-based systems, enterprise DWH
Needs
  • Ad-hoc data transfer
  • From one ecosystem to another
Ad-hoc processing
  • Direct access to the data sources.
  • Enterprise data stores (RDBMS, document-based systems, enterprise DWH)
      <-> Sqoop import / Sqoop export <->
    Hadoop (Pig, MR, Mahout, HBase, Hive) on HDFS
How Sqoop works
  • Data import
    • The RDB table's key is used for creating partitions.
    • Each partition is handed to a mapper,
    • and each mapper imports its data into HDFS. (A small split sketch follows this list.)
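To make the partitioning step concrete, here is a small, self-contained Java sketch of the key-range splitting that Sqoop-style imports perform, roughly what `sqoop import --split-by id --num-mappers 4` arranges. The numbers and WHERE clause are made up for illustration; this is not Sqoop's actual implementation.

    import java.util.ArrayList;
    import java.util.List;

    public class KeyRangeSplitDemo {

        // Carve [minKey, maxKey] into roughly equal slices, one per mapper.
        static List<long[]> split(long minKey, long maxKey, int numMappers) {
            List<long[]> splits = new ArrayList<long[]>();
            long span = maxKey - minKey + 1;
            long chunk = (span + numMappers - 1) / numMappers;  // ceiling division
            for (long lo = minKey; lo <= maxKey; lo += chunk) {
                long hi = Math.min(lo + chunk - 1, maxKey);
                splits.add(new long[] { lo, hi });              // each mapper gets [lo, hi]
            }
            return splits;
        }

        public static void main(String[] args) {
            // e.g. SELECT MIN(id), MAX(id) FROM some_table  ->  1 and 1000000
            for (long[] s : split(1, 1000000, 4)) {
                System.out.println("mapper imports WHERE id BETWEEN " + s[0] + " AND " + s[1]);
            }
        }
    }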
Overview of Sqoop
1. Pre-processing
  • Connector selection
    a. Identify the connector (metadata lookup)
    b. Generate the Sqoop record
       • Code generation: SQL type -> Java type (type mapping examples)
    c. Generate the MR job
2. Data transfer
  a. Submit the job
3. Post-processing
  a. Cleanup
  b. Hive commands
     • The connector looks up metadata and generates the Hive table definition and Hive script
Sqoop connectors
SQOOP-365
  • Proposal for Sqoop 2.0
  • Highlights
    • Sqoop as a service:
      CLI / browser -> Sqoop (REST service) -> metadata store, with Sqoop driving Hadoop
Scenario for RDBMS - Hadoop integration
  • #1 reference data in RDBMS
  • #2 Hadoop for offline analytics
  • #3 MR output to RDBMS
  • #4 Hadoop as an RDBMS active archive
Case study
  • extending SQOOP for Oracle
  • SQOOP implements a generic approach to RDBMS Hadoop data transfer....
Reading from Oracle - default SQOOP
  • Oracle parallelism gone bad (1)(2).....
Sqoop connectors
  • Oracle Big Data Appliance
  • Teradata
  • Greenplum, MS SQL Server
  • Toad for Cloud Databases ...

4:30pm-5:20pm Next Generation Apache Hadoop MapReduce, Mahadev Konar, Hortonworks

The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale high-availability is built-in from the beginning; as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization. We will be presenting the architecture and design of the next generation of map reduce and will delve into the details of the architecture that makes it much easier to innovate. We will also be presenting large scale and small scale comparisons on some benchmarks with MRV1.


Bio
  • Hadoop and Zookeeper committer
Hadoop MR classic
  • JobTracker (JT)
  • TaskTracker (TT)
Current limitations
  • Hard partition of resources into map and reduce slots
  • Lacks support for alternate paradigms
    • Hacks for the likes of MPI/graph processing
  • Lack of wire-compatible protocols
  • Scalability
    • 4,000 nodes
    • 40,000 concurrent tasks
    • Coarse synchronization in the JobTracker
  • SPOF
  • Restart is very tricky due to complex state
Requirements
  • Reliability
  • Availability
  • Scalability
    • clusters of 6,000 - 10,000 nodes
    • 16 cores , 48G/96G ram, 24TB/36TB disks
    • 100,000+ concurrent tasks
    • 10,000 concurrent jobs
  • wire compatibility
  • agility & evolution
Design center
  • Split up the two major functions of the JobTracker:
    • Resource management
    • Application life-cycle management
  • MR becomes ....
Architecture
  • Application
    • An application is a job submitted to the framework
    • Example: an MR job
  • Container
    • Basic unit of allocation
    • Example: ...
  • Resource Manager
    • Global resource scheduler
  • Node Manager
    • Per-machine agent
  • Application Master
    • Per-application; manages application scheduling and task execution
(A toy sketch of this split follows the Resource Manager breakdown below.)
Resource manager
  • Application manager
    • responsible for launching and monitoring application master - restarts an application on failure
  • Scheduler
    • Responsible for allocating resources to the application
  • Resource Tracker
    • Responsible for managing the nodes in the cluster.
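To show the responsibility split in miniature, here is a purely illustrative Java toy model: a global resource scheduler hands out containers, and a per-application master decides what to run in them. All types here are invented for the sketch; this is not the real next-generation MapReduce (YARN) API.

    import java.util.ArrayList;
    import java.util.List;

    public class NextGenMrSketch {

        static class Container {
            final String node;
            Container(String node) { this.node = node; }
        }

        /** Global resource scheduler: only allocates containers, knows nothing about MR. */
        static class ResourceManager {
            private final List<String> nodes;
            private int next = 0;
            ResourceManager(List<String> nodes) { this.nodes = nodes; }
            Container allocate() {
                return new Container(nodes.get(next++ % nodes.size()));  // round-robin for the toy
            }
        }

        /** Per-application master: owns the application life-cycle, e.g. task scheduling. */
        static class MapReduceApplicationMaster {
            void run(ResourceManager rm, int mapTasks) {
                for (int i = 0; i < mapTasks; i++) {
                    Container c = rm.allocate();
                    System.out.println("launching map task " + i + " in a container on " + c.node);
                }
            }
        }

        public static void main(String[] args) {
            List<String> nodes = new ArrayList<String>();
            nodes.add("node-a"); nodes.add("node-b"); nodes.add("node-c");
            ResourceManager rm = new ResourceManager(nodes);         // global, shared
            new MapReduceApplicationMaster().run(rm, 5);             // per-job, user-land
        }
    }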
Improvements vis a vis classic MR

# Fulfills the requirements above.

  • Utilization
    • Generic ....
  • Scalability
    • Application life-cycle management is very expensive
Fault tolerance and Availability
  • Resource manager
      • No SPOF -- state saved in Zookeeper.
  • Application master
Wire compatibility
  • Protocol
Innovation and agility
  • MR becomes a user-land library
  • Customers can upgrade their MR version as they need
Support for programming paradigms other than MR
  • MPI
  • Master worker
Summary
  • MR next takes Hadoop to the next level
  • Scale out even further
  • Y! deployment on 1,000 servers
  • 0.23.0
  • ongoing work on
    • stability
    • admin ease
  • MPI on hadoop
  • other frameworks on Hadoop
    • Giraph
    • Spark,....
Roadmap
  • RM restart and pre-emption
  • Performance - at par with classic MR
    • Sort, DFSIO
    • Y! performance team

To be continued in the Day 2 memo and the summary.