#garagekidztweetz

#garagekidztweetz

id:garage-kid@76whizkidz のライフログ・ブログ!

Notes of Hadoop Summit 2012 Day2. #hadoopsummit

スポンサーリンク

Notes of Day2.

Next from Notes of Hadoop Summit 2012 Day1. #hadoopsummit, this time I want to share you my Notes of Hadoop Summit Day2.
And maybe until the end of tomorrow I will up Summary of this event (in my view point) lastly.

Contents:

8:30am - 10:05am Keynote - Geoffrey Moore and Plenary Sessions in the Main Ballroom

▶ Digitizing the World - the Driving Force Behind Hadoop's Adoption Life Cycle
  • Digital innovation
  • 4 decades
    1. 1980s digital office work - B2E
    2. 1990s digital value chain - B2B
    3. 2000s digital consumption - B2C
    4. 2010s digital engagement - B2B2C

#

  • Enterprise IT: Pre disruption
  • Transaction system for global commerce
  • Drove three decades of investment
  • Y2K put the capstone on this trend

#

  • IT Disruption: For the past decade
  • we can become more productive
  • Consumer IT on fire

#

  • Consumer IT redefines Human Exp the digitization of culture across the globe
  • 3 principle
  • 1. Access - universal
    • information, opinion, searchable, interactive
    • why dont you google it?
  • 2. Broadband - emotional
    • pictures, music, video, voice, shopping
    • Facebook, iTunes, ...
  • 3. Mobile - ubiquitous
    • anywhare, anytime, any income
    • iPod, iPhone, iPad, Android, ...
  • -> How inpact IT enterprise?

#

  • Enterprise IT - 21th century
  • system of engagement for B2C
  • Big Data system is the censoring system
  • Collaborative filtering
  • Behaviroral Targeting - retargeting
  • Personalized transactions
  • Location based services..
  • -> Big data, analitics, real-time

#

  • system of engagement for B2B
  • Enterprise facebook - something like facebook but not facebook
  • Enterprise YouTube
  • Enterprise Search
  • Enterprise App Store ...
  • And lots more to come... #laugh
  • -> Social, Mobile, Cloud

#

  • The evolution in IT infra
  • Driving the next $1 Trillion of spend in B2B2C
  • Morphing the Stack
  • Cleary Hadoop has a very big role to play but when?
  • -> Seeking the tipping point

#

  • Adoption Life Cycle
  • Early market
    • innovators, early adaptors
  • CHASM
  • TORNADO
  • BOWLING ALLEY
  • MAIN STREET

#

  • Where we are now? ###
  • Getting to tipping point
  • different playbooks for each phase
  • Early market
  • Bowling Alley
  • Tornado
  • Main street

#

  • Final thoughts for Hadoop Developers
  • Common Theme: Optimizing Outcomes at Scale
  • e.g. Media -> Contents...
  • Key takeaway
  • for the foreseeable future, DOMAIN EXPERTISE

#

  • Common tactic is Big data analytics
  • system of record are one foundation
  • log files are the other foundation
  • batch analytics drives pattern detection ...
▶ SVP advertising and data Yahoo!
  • Yahoo! is at the frontier of Hadoop Scale
  • 140PB 42,000 nodes, 3PB
  • 500 users
  • 360,000

#

  • visualize.yahoo.com

#

  • Hadoop drives moneytization to yahoo
  • 2 billion annually in dispay ad revenue from tec built on hadoop.

#

  • Analytics
  • TAO (hadoop+cubes+tableau)
  • AIY (hadoop+oracle+holap+cocktails(node.js))
  • javascript analytics

#

  • the next Hadoop frontiers
    • resource management
    • HW

#

  • Contribute to Hadoop, Y! more than before.
▶ Hadoop in the Enterprise
  • CTO Sears Holdings and Metascale CEO

#

  • Classic enterprise challenges #picture

#

  • The Sears Holdings Approach
  • allowing user to continue to use familiar consumption IF
  • providing ingerent HA
  • enabling business to unlock previously unusable data.

#

  • Hadoop must be ecosystem
  • large company have a complex environment
  • need to build own solution

#

  • The Sears Holding Architecture ### #picture

#

  • The Learnings #picture
  • Hadoop, implementation, unique value

#

  • Some Examples - Use cases
    • Use case#1: analysis for pricing
    • Use case#2: also pricing (COBOL -> Pig )
    • Use case#3: to meet SLAs on growing data volume (not application change)
    • Use case#4: User experience

#

  • Summary #picture

10:30am - 11:10am The Future of HCatalog.

The initial work in HCatalog has allowed users to share their data in Hadoop regardless of the tools they use and relieved them of needing to know where and how their data is stored. But there is much more to be done to deliver on the full promise of providing metadata and table management for Hadoop clusters. It should be easy to store and process semi-structured and unstructured data via HCatalog. ...

▶ Hadoop Ecosystem
▶ Opening up metadata to MR and Pig #picture
▶ Templeton - REST API
  • REST endpoints: databases, tables, partitions, columns, table properties
  • PUT to create/update, GET to list or describe, DELETE to drop

#

  • Get a list of all tables in the default DB
  • return JSON document

#

  • Create new table

#

  • Describe table
▶ Reading and Writing Data in Parallel
  • Use case, example,
  • what exists today -> webHDFS, Sqoop
▶ HCatReader and HCatWriter ### #picture
  • in this moment only java IF.
▶ Storing Semi-/Unstructured Data #picture
  • HCatalog is HA
▶ Hive ODBC/JDBC today
  • Issues
    • JDBC client
    • ODBC client
    • Hive Server
▶ ODBC/JDBC proposal
  • REST server between Hadoop and ODBC/JDBC Client
▶ QA
  • HCatalog OSS?
  • Yes
  • apache incubator

#

  • Integrate with HBase?

#

  • when adopting HCatalog
  • Wont need re-format previous data

#

  • Future integration HCatalog with Oozie
  • Dont honestly know Oozie tried to implement integration with Oozie

#

  • about Security
▶ Related tweets.

11:25am - 12:05pm Analytical Queries with Hive: SQL Windowing and Table Functions

Speaker: Harish Butani

Hive Query Language (HQL) is excellent for productivity and enables reuse of SQL skills, but falls short in advanced analytic queries. Hive`s Map & Reduce scripts mechanism lacks the simplicity of SQL and specifying new analysis is cumbersome. We developed SQLWindowing for Hive(SQW) to overcome these issues. SQW introduces both Windowing and Table Functions to the Hive user. SQW appears as a HQL extension with table functions and windowing clauses interspersed with HQL. This means the user stays within a SQL-like interface, while simultaneously having these capabilities available. ...

windowing
PTF

▶ What PTF?
▶ Analytics expressed using PTFs
  • Ranking, Top N
  • Time series analysis
  • Market Basket Analysis
  • Recursive Queries
  • Aster SQL/MR ...
▶ PTF bottom line
  • Enable more interesting Questions more simply in the SQL context
▶ PTF innovation example
  • ex. market basket analysis
  • input is a large set of baskets, each contains
▶ SQL windowing as a PTF
▶ PTFs with hive
▶ PTFs with Hive CLI
▶ Query Structure
  • Query abstraction is a select statement
▶ Query Example: Basic Query
▶ Demo
▶ Query Example: Top N #picture
  • Calculate the Top3 Tracts by country
▶ PTF Example: NPath
  • find incidents where a flight has been more than 15 min late 5 or more times in a row.
▶ Whats available
  • windowing functions
    • 21 functions
    • for ranking aggregation, navigation ...
  • One pass PTFs
  • Multi pass PTFs
▶ Query Evaluation: a PTF
▶ Multi Pass and Recursive Queries PTFs
▶ Multi Pass PTFs
▶ PTF Examples: Market Basket Analysis
▶ Recursive Queries as PTFs
  • HaLoop Project
  • Giraph Project
▶ PTF Example: Transive Closure
▶ Generalized TC
▶ Summary
  • Enable simpler analiytics.
▶ Next Step
  • Make windowing clauses closer to standard SQL
▶ QA

1:30pm - 2:10pm Status of Hadoop 0.23 Operations at Yahoo!

Speaker: Charles Wimmer

Upgrading from Hadoop 0.20.x to 0.23.x is challenging in several ways. This presentation will include details on how Yahoo! handles these challenges. It will outline the process we use to upgrade existing 0.20.x clusters to 0.23. It will show specific changes we`ve made to startup scripts and monitoring infrastructure. It will show how we manage new configurations for viewfs and hierarchical queues.

  • 4 weeks ago speaker moves to LinkedIn
▶ Summary of this talk
  • Includes
    • Operational changes required to support 0.23
  • Does not include
    • specifics about customer testing...
▶ Scope of This change of Y!
  • 42k+ Hadoop servers
  • 20+ clusters
  • sandbox, research, production
  • 0.20.205
▶ Overview of the process
  • provide customer 0.23 sadnbox cluster
  • provide customer enough data to test their app
  • provide developer support to address app issues quickly
  • upgrade resarch and production clusters ...
▶ Test Cluster
  • 420 nodes
  • 2x westmere 4 core processors
  • 24G RAM
  • 12 x 2 TDisks
  • no federations
▶ Hirechical Queues
  • configuration ex. #photo
▶ Memory Configuration
  • everything in memory by now.
  • yarn.nodemanager.resource.memory-mb
▶ Kerberos Configuration
▶ Init scripts
  • DN/NN
  • SNN
  • HistoryServer (HS)
  • NodeManager (NM)
  • ResourceManager (RM)
▶ DN/NN
  • minor change.
▶ SNN
  • previous downloaded edit log deletion feature added
▶ HistoryServer
  • new script
▶ ResourceManager/NodeManager
  • #Failed to take photo.
▶ QA
  • Application compiled are compatible?
  • mostly works fine.

2:25pm - 3:05pm Network reference architecture for Hadoop – validated and tested approach to define a reference network design for Hadoop

Speaker: Nimish Desai

Hadoop is a popular framework for web 2.0 and enterprise businesses that are challenged to store, process and analyze large amounts of data as part of their business requirements. Hadoop’s framework brings a new set of challenges related to the compute infrastructure and underlined network architectures. The goal of the session is to provide the reference network architecture for deploying Hadoop application. This session will cover the Hadoop application behavior in the context of building a network infrastructure and its design consideration. ...

▶ Session Objectives and takeways
  • Provide reference NW
  • characterize Hadoop Application ...
▶ Validated 96 node Hadoop Cluster
  • benchmarking
  • Terasort
▶ DC infra
▶ Big data application realm - web20 social community NW
  • facebook
▶ Enterprise
  • total different from social media
  • topology
▶ Bigdata building Blocks into the enterprise.
▶ Charactoristics that affect Hadoop Cluster ### #photo
▶ Hadoop Components and Operations
▶ MR data model ### #photo
  • ETL workload
  • BI workload difffrence
▶ ETL Workload (1TB Yahoo Terasort) ### #photo
▶ BI workload
▶ Data Locality in HDFS
  • data locality spike occur
▶ Map to Reducer Ratio impact on Job completion
  • reducers number
▶ Changing Reducers number
  • graph showing
▶ NW characteristics
  • latency
▶ NW reference architecture
▶ Scaling the DC Fabric changing the device paradigm
▶ Hadoop NW Topologies - reference ### #photo
▶ HA switching design
  • prepare variaty to connecting NW
  • Single NIC, Dual NIC, ...
▶ Availability with Single attached server 1G or 10G
▶ Availabitily Single Attached vs. Dual Attached node ### #photo
  • Dual attached node
▶ Availability NW failure result - 1TB terasort - ETL
▶ Oversubscription Design
▶ number of uplinks affection ### #photo
▶ DN speed difference 1G vs. 10G TCPDUMP of Reducers TX ###
  • 10 G can provide benefits depending on workload.
  • bonding
▶ 1GE vs. 10GE Buffer Usage
▶ Multi use Cluster Charactoristic ### #photo
▶ 100 jobs each with 10GB data set stable, node and rack failure.
  • flexibility
▶ Burst Handling and Queue Handling
▶ Nexus 2248 ...
▶ Terasort ETL
  • slide 48 similar
▶ NW latency ### #photo
  • Java slowness, compare to that NW latency is almost zero.
  • DNS lookup something like that cost more.
▶ Summary

3:35pm - 4:15pm PayPal Behavioral Analytics on Hadoop

Speaker: Anil Madan

Online and offline commerce are changing and converging, and technology is dramatically influencing how consumers connect, shop and pay. With over 100 million active accounts in 190 markets and 25 currencies around the world, PayPal is at the forefront of payments innovation. PayPal continues to disrupt traditional means of money exchange and is the faster, safer way to pay and get paid anytime, anywhere and any way. ...

▶ Paypal's vision
  • digital wallet
  • anywhere, anytime, anyway
  • 110 M active account , 190 markets, 25 currencies
▶ Behavioral tacking vision
  • understand your customer's behavior and exp.
  • ensure privacy, security and trust for our customers
  • anywhere, anytime, anyway
  • ensure instrumentation standardization accross channel
  • to drive desirable outcomes for our customers ...
▶ Tracking Platform overview ###
  • Direct/Home page
  • tansaction Emails
  • Email Marketing
  • Display advertising
  • search engine marketing
  • -> tracking service <- metadata, realtime system
  • -> Bigdata platform
  • reporting/visualization, digital metrics, attributition
▶ metadata - entity model
  • page, layout, link...
▶ metadata - event model
  • trackeing event
    • impresson event
      • component impression event
      • page impression event
      • ad impressioon event
    • reaction event
      • ...
    • conversation event ...
▶ Attribution Model ###
  • Channel
    • direct
    • organic search
    • paid search
    • display offers
    • onsite offers
    • transactional emails
    • marketing emails
▶ Logical architecture ###
  • onsite, marketing 2 channels
  • aggregrate data to meet with Hadoop's block size.
  • Hadoop for behavioral intelligence etc...
▶ Data ingest pipeline
  • 3 processes
  • pre-process, sessionization, metrics generation
▶ Sessionization ###
  • Event -> visit container
▶ Dimension and Metrics
  • Dimension -> Metrics -> Time period
▶ Metrics Generation
  • compute sessions sorted by visitor, dimension
  • compute metrics by dimension
▶ PIG - adhoc Queries
▶ Reporting
  • OSS powered
▶ QA
  • Time events

4:30pm - 5:10pm Writing New Application Frameworks on Apache Hadoop Yarn

Speaker: Hitesh Shah

Hadoop YARN is the next generation computing platform in Apache Hadoop with support for programming paradigms besides MapReduce. In the world of Big Data, one cannot solve all the problems wholly using the Map Reduce programming model. Typical installations run separate programming models like MR, MPI, graph-processing frameworks on individual clusters. Running fewer larger clusters is cheaper than running more small clusters. ...

  • Simple sample writing app with YARN.
▶ Agenda
  • YARN Architecture and Concept
  • Writing New app
▶ YARN Architecture ### #picture
  • Resource Manager (RM)
  • Node Manager (NM)
    • per machine agent
  • Application Manager
    • Per application
▶ YARN Concepts
  • Application ID
    • Application Attempt IDs
  • Container
    • ContainerLaunchContext
  • Resource Request
    • Host/Rack/Any match
    • Priority
    • Resource Constraints
  • Local Resource
    • File/Archive
    • Visibility

#

  • Support Cache
▶ What you need for a new FW
  • Application Submission Client
    • MR job client
  • Application Master
    • the core FW library
  • Application History (optional)
    • history of all previously run instances
  • Auxiliary Services (optional)
    • long-running application-specific services running on the NM
▶ Use case: Distributed Shell
  • take a user-provided script or application and run it on a set of nodes in the cluster

#

  • Client: RPC calls (Client->RM)
  • Uses clientRPM protocol
  • get a new application ID th RM
  • application submission
  • application monitoring
  • kill the application?
▶ Client
  • Registration with the RM
    • New Application ID
  • Application Submission
    • User Information
    • Scheduler queue
    • Define the container for the distributed shell app master via the container...
  • Application monitoring
▶ Defining a container
  • ContainerLaunchContext class
  • Command(s) to run
  • Local resources needed for the process to run
  • Environment to setup
  • Security-related ...
▶ Application Master: RPC calls
  • AMRM and CM protocols
  • register AM with RM

#

  • ask RM to allocate resources
  • launch tasks on allocated containers

#

  • manage tasks to final completion
  • inform RM of completion
▶ Application manage (Detail) ###
  • setup RPC to handle requests from client and/or tasks launched on containers
  • register and send regular heatbeats to the RM

#

  • request resourcees from the RM
  • Launch user shell scripit on containers as and when allocated

#

  • monitor status of user script of remote conteainers and manage failures by retrying if needed
  • inform RM of completion when app is done.
▶ AMRM#allocate
  • Request
    • Containers needed
      • not a delta protocol
      • locality consttraints: host/Rack/any
      • resource constraints: memory
      • Priority-baseed assignment
    • Containers to release - extra/unwanted?
  • Responce
    • Allocated containers
    • Completed Containers
▶ YARN application
  • Data processing
  • OpenMPI on Hadoop
  • Spark
  • Realtime data processing
    • Storm
    • Apache S4
  • Beyond data
▶ Ref.

Continued to Notes of Day1 and the Summary.

Relative posts.