Notes of Hadoop Summit 2012 Day2. #hadoopsummit

Notes of Day2.

Next from Notes of Hadoop Summit 2012 Day1. #hadoopsummit

, this time I want to share you my Notes of Hadoop Summit Day2.
And maybe until the end of tomorrow I will up Summary of this event (in my view point) lastly.

Contents:

8:30am - 10:05am Keynote - Geoffrey Moore and Plenary Sessions in the Main Ballroom

10:30am - 11:10am The Future of HCatalog.

11:25am - 12:05pm Analytical Queries with Hive: SQL Windowing and Table Functions

1:30pm - 2:10pm Status of Hadoop 0.23 Operations at Yahoo!

2:25pm - 3:05pm Network reference architecture for Hadoop – validated and tested approach to define a reference network design for Hadoop

3:35pm - 4:15pm PayPal Behavioral Analytics on Hadoop

4:30pm - 5:10pm Writing New Application Frameworks on Apache Hadoop Yarn

Links to Notes of Day1 and my Summary.

Relative posts.

8:30am - 10:05am Keynote - Geoffrey Moore and Plenary Sessions in the Main Ballroom

▶ Digitizing the World - the Driving Force Behind Hadoop's Adoption Life Cycle

Digital innovation
4 decades
1. 1980s digital office work - B2E
2. 1990s digital value chain - B2B
3. 2000s digital consumption - B2C
4. 2010s digital engagement - B2B2C

Enterprise IT: Pre disruption
Transaction system for global commerce
Drove three decades of investment
Y2K put the capstone on this trend

IT Disruption: For the past decade
we can become more productive
Consumer IT on fire

Consumer IT redefines Human Exp the digitization of culture across the globe
3 principle
1. Access - universal
- information, opinion, searchable, interactive
- why dont you google it?
2. Broadband - emotional
- pictures, music, video, voice, shopping
- Facebook, iTunes, ...
3. Mobile - ubiquitous
- anywhare, anytime, any income
- iPod, iPhone, iPad, Android, ...
-> How inpact IT enterprise?

Enterprise IT - 21th century
system of engagement for B2C
Big Data system is the censoring system
Collaborative filtering
Behaviroral Targeting - retargeting
Personalized transactions
Location based services..
-> Big data, analitics, real-time

system of engagement for B2B
Enterprise facebook - something like facebook but not facebook
Enterprise YouTube
Enterprise Search
Enterprise App Store ...
And lots more to come... #laugh
-> Social, Mobile, Cloud

The evolution in IT infra
Driving the next $1 Trillion of spend in B2B2C
Morphing the Stack
Cleary Hadoop has a very big role to play but when?
-> Seeking the tipping point

Adoption Life Cycle
Early market
- innovators, early adaptors
CHASM
TORNADO
BOWLING ALLEY
MAIN STREET

Where we are now? ###
Getting to tipping point
different playbooks for each phase
Early market
Bowling Alley
Tornado
Main street

Final thoughts for Hadoop Developers
Common Theme: Optimizing Outcomes at Scale
e.g. Media -> Contents...
Key takeaway
for the foreseeable future, DOMAIN EXPERTISE

Common tactic is Big data analytics
system of record are one foundation
log files are the other foundation
batch analytics drives pattern detection ...

▶ SVP advertising and data Yahoo!

Yahoo! is at the frontier of Hadoop Scale
140PB 42,000 nodes, 3PB
500 users
360,000

visualize.yahoo.com

Hadoop drives moneytization to yahoo
2 billion annually in dispay ad revenue from tec built on hadoop.

Analytics
TAO (hadoop+cubes+tableau)
AIY (hadoop+oracle+holap+cocktails(node.js))
javascript analytics

the next Hadoop frontiers
- resource management
- HW

Contribute to Hadoop, Y! more than before.

▶ Hadoop in the Enterprise

CTO Sears Holdings and Metascale CEO

Classic enterprise challenges #picture

The Sears Holdings Approach
allowing user to continue to use familiar consumption IF
providing ingerent HA
enabling business to unlock previously unusable data.

Hadoop must be ecosystem
large company have a complex environment
need to build own solution

The Sears Holding Architecture ### #picture

The Learnings #picture
Hadoop, implementation, unique value

Some Examples - Use cases
- Use case#1: analysis for pricing
- Use case#2: also pricing (COBOL -> Pig )
- Use case#3: to meet SLAs on growing data volume (not application change)
- Use case#4: User experience

Summary #picture

10:30am - 11:10am The Future of HCatalog.

Future of HCatalog - Hadoop Summit 2012

View more PowerPoint from Hortonworks

The initial work in HCatalog has allowed users to share their data in Hadoop regardless of the tools they use and relieved them of needing to know where and how their data is stored. But there is much more to be done to deliver on the full promise of providing metadata and table management for Hadoop clusters. It should be easy to store and process semi-structured and unstructured data via HCatalog. ...

▶ Hadoop Ecosystem

▶ Opening up metadata to MR and Pig #picture

▶ Templeton - REST API

REST endpoints: databases, tables, partitions, columns, table properties
PUT to create/update, GET to list or describe, DELETE to drop

Get a list of all tables in the default DB
return JSON document

Create new table

Describe table

▶ Reading and Writing Data in Parallel

Use case, example,
what exists today -> webHDFS, Sqoop

▶ HCatReader and HCatWriter ### #picture

in this moment only java IF.

▶ Storing Semi-/Unstructured Data #picture

HCatalog is HA

▶ Hive ODBC/JDBC today

Issues
- JDBC client
- ODBC client
- Hive Server

▶ ODBC/JDBC proposal

REST server between Hadoop and ODBC/JDBC Client

▶ QA

HCatalog OSS?
Yes
apache incubator

Integrate with HBase?

when adopting HCatalog
Wont need re-format previous data

Future integration HCatalog with Oozie
Dont honestly know Oozie tried to implement integration with Oozie

about Security

▶ Related tweets.

hiveのテーブルをpigでも処理はするよね。Hcatalogは、スキーマの扱いを単純にするという点でかなり強力。まぁ実際にはHA化とかバックアップとか考える点はあるけれどね。

2012-06-15 03:01:01 via Seesmic

11:25am - 12:05pm Analytical Queries with Hive: SQL Windowing and Table Functions

Speaker: Harish Butani

Hive Query Language (HQL) is excellent for productivity and enables reuse of SQL skills, but falls short in advanced analytic queries. Hive`s Map & Reduce scripts mechanism lacks the simplicity of SQL and specifying new analysis is cumbersome. We developed SQLWindowing for Hive(SQW) to overcome these issues. SQW introduces both Windowing and Table Functions to the Hive user. SQW appears as a HQL extension with table functions and windowing clauses interspersed with HQL. This means the user stays within a SQL-like interface, while simultaneously having these capabilities available. ...

windowing
PTF

▶ What PTF?

▶ Analytics expressed using PTFs

Ranking, Top N
Time series analysis
Market Basket Analysis
Recursive Queries
Aster SQL/MR ...

▶ PTF bottom line

Enable more interesting Questions more simply in the SQL context

▶ PTF innovation example

ex. market basket analysis
input is a large set of baskets, each contains

▶ SQL windowing as a PTF

▶ PTFs with hive

hbutani/SQLWindowing · GitHub

▶ PTFs with Hive CLI

▶ Query Structure

Query abstraction is a select statement

▶ Query Example: Basic Query

▶ Demo

▶ Query Example: Top N #picture

Calculate the Top3 Tracts by country

▶ PTF Example: NPath

find incidents where a flight has been more than 15 min late 5 or more times in a row.

▶ Whats available

windowing functions
- 21 functions
- for ranking aggregation, navigation ...
One pass PTFs
Multi pass PTFs

▶ Query Evaluation: a PTF

▶ Multi Pass and Recursive Queries PTFs

▶ Multi Pass PTFs

▶ PTF Examples: Market Basket Analysis

see wiki for details
- Folding into Hive · hbutani/SQLWindowing Wiki · GitHub

▶ Recursive Queries as PTFs

HaLoop Project
Giraph Project

▶ PTF Example: Transive Closure

▶ Generalized TC

▶ Summary

Enable simpler analiytics.

▶ Next Step

Make windowing clauses closer to standard SQL

▶ QA

1:30pm - 2:10pm Status of Hadoop 0.23 Operations at Yahoo!

Speaker: Charles Wimmer

Upgrading from Hadoop 0.20.x to 0.23.x is challenging in several ways. This presentation will include details on how Yahoo! handles these challenges. It will outline the process we use to upgrade existing 0.20.x clusters to 0.23. It will show specific changes we`ve made to startup scripts and monitoring infrastructure. It will show how we manage new configurations for viewfs and hierarchical queues.

4 weeks ago speaker moves to LinkedIn

▶ Summary of this talk

Includes
- Operational changes required to support 0.23
Does not include
- specifics about customer testing...

▶ Scope of This change of Y!

42k+ Hadoop servers
20+ clusters
sandbox, research, production
0.20.205

▶ Overview of the process

provide customer 0.23 sadnbox cluster
provide customer enough data to test their app
provide developer support to address app issues quickly
upgrade resarch and production clusters ...

▶ Test Cluster

420 nodes
2x westmere 4 core processors
24G RAM
12 x 2 TDisks
no federations

▶ Hirechical Queues

configuration ex. #photo

▶ Memory Configuration

everything in memory by now.
yarn.nodemanager.resource.memory-mb

▶ Kerberos Configuration

▶ Init scripts

DN/NN
SNN
HistoryServer (HS)
NodeManager (NM)
ResourceManager (RM)

▶ DN/NN

minor change.

▶ SNN

previous downloaded edit log deletion feature added

▶ HistoryServer

new script

▶ ResourceManager/NodeManager

#Failed to take photo.

▶ QA

Application compiled are compatible?
mostly works fine.

2:25pm - 3:05pm Network reference architecture for Hadoop – validated and tested approach to define a reference network design for Hadoop

Speaker: Nimish Desai

Hadoop is a popular framework for web 2.0 and enterprise businesses that are challenged to store, process and analyze large amounts of data as part of their business requirements. Hadoop’s framework brings a new set of challenges related to the compute infrastructure and underlined network architectures. The goal of the session is to provide the reference network architecture for deploying Hadoop application. This session will cover the Hadoop application behavior in the context of building a network infrastructure and its design consideration. ...

▶ Session Objectives and takeways

Provide reference NW
characterize Hadoop Application ...

▶ Validated 96 node Hadoop Cluster

benchmarking
Terasort

▶ DC infra

▶ Big data application realm - web20 social community NW

facebook

▶ Enterprise

total different from social media
topology

▶ Bigdata building Blocks into the enterprise.

▶ Charactoristics that affect Hadoop Cluster ### #photo

▶ Hadoop Components and Operations

▶ MR data model ### #photo

ETL workload
BI workload difffrence

▶ ETL Workload (1TB Yahoo Terasort) ### #photo

▶ BI workload

▶ Data Locality in HDFS

data locality spike occur

▶ Map to Reducer Ratio impact on Job completion

reducers number

▶ Changing Reducers number

graph showing

▶ NW characteristics

latency

▶ NW reference architecture

▶ Scaling the DC Fabric changing the device paradigm

▶ Hadoop NW Topologies - reference ### #photo

▶ HA switching design

prepare variaty to connecting NW
Single NIC, Dual NIC, ...

▶ Availability with Single attached server 1G or 10G

▶ Availabitily Single Attached vs. Dual Attached node ### #photo

Dual attached node

▶ Availability NW failure result - 1TB terasort - ETL

▶ Oversubscription Design

▶ number of uplinks affection ### #photo

▶ DN speed difference 1G vs. 10G TCPDUMP of Reducers TX ###

10 G can provide benefits depending on workload.
bonding

▶ 1GE vs. 10GE Buffer Usage

▶ Multi use Cluster Charactoristic ### #photo

▶ 100 jobs each with 10GB data set stable, node and rack failure.

flexibility

▶ Burst Handling and Queue Handling

▶ Nexus 2248 ...

▶ Terasort ETL

slide 48 similar

▶ NW latency ### #photo

Java slowness, compare to that NW latency is almost zero.
DNS lookup something like that cost more.

▶ Summary

3:35pm - 4:15pm PayPal Behavioral Analytics on Hadoop

Speaker: Anil Madan

Online and offline commerce are changing and converging, and technology is dramatically influencing how consumers connect, shop and pay. With over 100 million active accounts in 190 markets and 25 currencies around the world, PayPal is at the forefront of payments innovation. PayPal continues to disrupt traditional means of money exchange and is the faster, safer way to pay and get paid anytime, anywhere and any way. ...

▶ Paypal's vision

digital wallet
anywhere, anytime, anyway
110 M active account , 190 markets, 25 currencies

▶ Behavioral tacking vision

understand your customer's behavior and exp.
ensure privacy, security and trust for our customers
anywhere, anytime, anyway
ensure instrumentation standardization accross channel
to drive desirable outcomes for our customers ...

▶ Tracking Platform overview ###

Direct/Home page
tansaction Emails
Email Marketing
Display advertising
search engine marketing
-> tracking service <- metadata, realtime system
-> Bigdata platform
reporting/visualization, digital metrics, attributition

▶ metadata - entity model

page, layout, link...

▶ metadata - event model

trackeing event
- impresson event
  - component impression event
  - page impression event
  - ad impressioon event
- reaction event
  - ...
- conversation event ...

▶ Attribution Model ###

Channel
- direct
- organic search
- paid search
- display offers
- onsite offers
- transactional emails
- marketing emails

▶ Logical architecture ###

onsite, marketing 2 channels
aggregrate data to meet with Hadoop's block size.
Hadoop for behavioral intelligence etc...

▶ Data ingest pipeline

3 processes
pre-process, sessionization, metrics generation

▶ Sessionization ###

Event -> visit container

▶ Dimension and Metrics

Dimension -> Metrics -> Time period

▶ Metrics Generation

compute sessions sorted by visitor, dimension
compute metrics by dimension

▶ PIG - adhoc Queries

▶ Reporting

OSS powered

▶ QA

Time events

4:30pm - 5:10pm Writing New Application Frameworks on Apache Hadoop Yarn

Speaker: Hitesh Shah

Writing Yarn Applications Hadoop Summit 2012

View more PowerPoint from Hortonworks

Hadoop YARN is the next generation computing platform in Apache Hadoop with support for programming paradigms besides MapReduce. In the world of Big Data, one cannot solve all the problems wholly using the Map Reduce programming model. Typical installations run separate programming models like MR, MPI, graph-processing frameworks on individual clusters. Running fewer larger clusters is cheaper than running more small clusters. ...

Simple sample writing app with YARN.

▶ Agenda

YARN Architecture and Concept
Writing New app

▶ YARN Architecture ### #picture

Resource Manager (RM)
Node Manager (NM)
- per machine agent
Application Manager
- Per application

▶ YARN Concepts

Application ID
- Application Attempt IDs
Container
- ContainerLaunchContext
Resource Request
- Host/Rack/Any match
- Priority
- Resource Constraints
Local Resource
- File/Archive
- Visibility

Support Cache

▶ What you need for a new FW

Application Submission Client
- MR job client
Application Master
- the core FW library
Application History (optional)
- history of all previously run instances
Auxiliary Services (optional)
- long-running application-specific services running on the NM

▶ Use case: Distributed Shell

take a user-provided script or application and run it on a set of nodes in the cluster

Client: RPC calls (Client->RM)
Uses clientRPM protocol
get a new application ID th RM
application submission
application monitoring
kill the application?

▶ Client

Registration with the RM
- New Application ID
Application Submission
- User Information
- Scheduler queue
- Define the container for the distributed shell app master via the container...
Application monitoring

▶ Defining a container

ContainerLaunchContext class
Command(s) to run
Local resources needed for the process to run
Environment to setup
Security-related ...

▶ Application Master: RPC calls

AMRM and CM protocols
register AM with RM

ask RM to allocate resources
launch tasks on allocated containers

manage tasks to final completion
inform RM of completion

▶ Application manage (Detail) ###

setup RPC to handle requests from client and/or tasks launched on containers
register and send regular heatbeats to the RM

request resourcees from the RM
Launch user shell scripit on containers as and when allocated

monitor status of user script of remote conteainers and manage failures by retrying if needed
inform RM of completion when app is done.

▶ AMRM#allocate

Request
- Containers needed
  - not a delta protocol
  - locality consttraints: host/Rack/any
  - resource constraints: memory
  - Priority-baseed assignment
- Containers to release - extra/unwanted?
Responce
- Allocated containers
- Completed Containers

▶ YARN application

Data processing
OpenMPI on Hadoop
Spark
Realtime data processing
- Storm
- Apache S4
Beyond data

▶ Ref.

Doc on writing YARN application: Apache Hadoop 2.0.0-alpha