#garagekidztweetz

id:garage-kid@76whizkidz のライフログ・ブログ!

Memos: #hw2011 (Day2), 09, Nov, 2011.

スポンサーリンク

このエントリーをはてなブックマークに追加

NYC で開催された Hadoop World 2011 の第一日目のメモに引き続き、第二日目のメモを公開します。最後に、サマリを書きます。

Table of Contents: Memos: Hadoop World 2011 (Day2), 09, Nov, 2011.

8:30am-09:45am Keynotes.The Future of Data Management - James Markarian, Informatica

James Markarian will discuss historical trends and technology shifts in data management and how the data deluge has contributed to the emergence of Apache Hadoop. James will showcase examples of how forward-looking organizations are leveraging Hadoop to maximize their Return on Data to improve insight and operations. By sharing his perspective about this next major analytics platform, James will discuss why Hadoop is poised to change the face of analytics and data management. Finally he will challenge the Hadoop ecosystem to work together to close the remaining technology gaps in Hadoop.


  • story of Analytics.
Prove Hadoop is the important product from this 30 years.
logical connection
new recognition of the world around us.
  • what surrounding around us information.
Woman's image
  • failed to understand what she needs, we make bad selection.
Commoditization of technology
  • Business agility
  • Democratizatin of Data
  • Explosion of Data
  • #more cheap the infra become.
    |
    How to become the top of this?
from 70's
  • problems = flexibility
    • schema
    • query performance
      | fasten and fasten
      data parallelism
Hotel story
  • analytics give hotel customer sattisfaction
Application everywhere
  • BI
  • risk analysis, ...
  • Security
  • DBAtion???
  • When First read MR, Dynamo paper story.
  • DB industry can become interesting again.

8:30am-09:45am Keynotes. Hadoop World 2011 Keynote: The State of the Apache Hadoop Ecosystem - Doug Cutting, Cloudera

Doug Cutting, co-creator of Apache Hadoop, will discuss the current state of the Apache Hadoop open source ecosystem and its trajectory for the future. Doug will highlight trends and changes happening throughout the platform, including new additions and important improvements. He will address the implications of how different components of the stack work together to form a coherent and efficient platform. He will draw particular attention to Big Top, a project initiated by Cloudera to build a community around the packaging and interoperability testing of Hadoop-related projects with the goal of providing a consistent and interoperable framework. Cutting will also discuss the latest additions in CDH4 and the platform roadmap for CDH.


Outline
  • the ecosystem
  • why we need it
  • what it is
  • why ...
Why are we here
  • HW has improved
    • exponentially for decades
    • both storage and compute
  • We can now store and process much more
    • yet have been slow to leverage
  • Analyzing more data makes us smarter
    #the more you have data, the more you can become more smarter.
  • can make better decisions.
The Ecosystem is the system
  • Hadoop has become the kernel
    • of the distributed OS for Big data
  • No one uses the kernel alone
  • A collection of projects at Apache
Strength of Apache
  • Mandates diversity & transparency
    • you control your fate
  • Insures against vendor lock-in
    • cant buy the ASF
  • Allow competing project
    • survival of the fittest
  • Ecosystem as loose federation
    • lets platform evolve
      #everyone can be the member, can become the influencer.
      #nobody pulling your legs.
      #fork is also allowed
Whats new?
  • Apache Hadoop 0.20.205
    • append
    • security
  • CDH3
    • Mahout included
    • Avro support across components
Whats next?
  • Apache Hadoop 0.23
    • HDFS
      • performance
      • scalability (federation)
      • availability
  • MR2
CDH4
  • Includes Hadoop 0.23
  • BigTop-based (?) later explained
Ecosystem
  • S4, Giraph, Crunch, Blur....
Apache BigTop (incubating)
  • Ecosystem as a project
    • integration test
    • compatible versions
    • common packaging
    • release is a test
  • Basis for CDH
    • Like Fedora is for RHEL
  • community driven
  • Includes:
    • Hadoop, HBase, Zookeeper, Avro, Hive, Pig, Oozie, Flume, Mahout....
Join the community
  • Hadoop and big data are still young
  • HW trends will continue
  • Hadoop started with just two developers
  • Now it has hundreds
  • You can be the next
  • What do you need?
    #we are very lucky to meet this opportunity

QA

Using java risk.
  • agree with the risk.
work together products dependency
  • Like a Linux kernel
  • Hadoop will also work
Realtime capability
  • Storm
Google contribution now?
  • they are not contribute openly so much
  • mostly for its internal.
about Avro
Top3 area of Ecosystem
  • not top down decision they make but...
  • realtime processing like HBase
Commodity

Hadoop is cost efffective.

Development is quite slow
  • still not 1.0
  • release number is not so important.
  • not important to change the numbering the version.

10:00am-10:50am Hadoop and Performance. Todd Lipcon, Cloudera. Yanpei Chen, Cloudera.

Performance is a thing that you can never have too much of. But performance is a nebulous concept in Hadoop. Unlike databases, there is no equivalent in Hadoop to TPC, and different use cases experience performance differently. This talk will discuss advances on how Hadoop performance is measured and will also talk about recent and future advances in performance in different areas of the Hadoop stack.


Measurement of performance
  • MR/HDFS
  • per job latency
  • cluster throughput
  • efficiency
HBase
  • Average 99 percentile performance
  • Write read throughput
  • in cache vs out of cache
Performance from Hadoop developers POV
  • What do we measure?
  • overhead
CPU overhead lock contension
  • CPU seconds per HDFS GB written / read
  • Miicro second of latency to....
  • Scheduler/framework overhead
    • no op MR job ( sleep -mt 1 -rt 1 -m 5000 -r 1)
  • Disk utilization
    • iostat
Hadoop performance myth
  • Java is slow?
  • most of Hadoop is IO or NW bound , not CPU
Java doesn't give you enough low level system accesses
  • JNI allows us to call any syscall that C can
  • we can integrate ....
Lies, damn lies, and statistics benchmark
HDFS/MR improvements IO/Cashing
  • MR is designed do primarily Sequential IO
  • Linux IO sheculer read ahead heuristics are weak
    • HDFS 0.20 does many seeks even for sequential access
  • Many data sets are significally larger trhan memory
  • capacity cashing them is a waste of memroy
  • Linux employs lazy page writeback ...
IO cashing solution
  • Linux provides 3 useful syscalls
    • posix_....
IO cashing results
  • 20% improvements on terasort wall-clock
MR CPU improvement: sort
  • CPU inefficiency
    • optimization
      • Use sun.misc.unsafe for -4x reduction in CPU usage of this function
  • CPU cache usage
    • optimization
      • include 4byte key prefix with pointer , avoiding memory indirection for most comparison
Benchmark results
  • CDH3u2 optimized build with cache conscious ....
MR improvements: Scheduler
  • 0.20.2
    • TT heartbeat once every 3 seconds , even on a small cluster
      Trunk/0.20.205/CDH3
Scheduler results
  • 10node cluster , 10 map slots per machine
  • hadoop jar examples.jar sleep -mt 1 -rt1 ....
HDFS CPU improvements: Checksuming
  • Significant CPU overhead
  • Optimization
  • Switch to bulk API
  • verify compute 64KB at a time instead of 512 bytes Switch to CRC32C ....
Checksum improvements
  • only 16% overhead vx unchecksummed access
  • maintain...
HDFS random access
  • 0.20.2
    • TCP hadnshake overhead
  • 0.23
    • ...
Random read micro benchmark
HBase YCSB, Random-read macro benchmark
MR2 improvement:shuffle
  • shuffle improvents
  • auto tunes
Summary
  • Hadoop is faster than ever
  • 2-3x
  • CPU usage
  • 2x faster wall-clock...
Upsteam JIRA
Coming to a Hadoop near you
  • Apache Hadoop 0,23
  • Many improvements
  • Next
advances in MR performance measurement
state of the art in MR performance Essential but inefficient benchmarks
  • Gridmix2, Hive BM, Hivench, PigMix, Gridmix3
    # acess each benchmark in Datasize , intensity variations , job types, cluster independent, synthetic workload...
  • Representive? Reply?
Need to measure what production MR system REally do
  • from generic large elephant? ...
Need to advance the state of the art
  • develop scienteific knowledge
  • Apply to system engineering
First comparison of production MR
  • facebook trace
  • 6 month in 2009 , approx. 6,000 jobs per day
  • Cloudera customer trace
    • 1 week in 2011, 3000 jobs per day
  • Not in this talk trace from 4 other customer...
Different job sizes
  • input, shuffle, output size.
Different workload time variation
  • number of jobs
  • input + shuffle + output data size
  • Map + reduce task time
Different job types - identify by kmeans
  • %of jobs, input, shuffle, output, duration, maptime, readtime, Description
summary
  • facebook
    95% small jobs
  • CC(Cloudera Costumer)-b
    88.9% small jobs
    #system optimization should be adopted for its specific workloads.
Measure performance by replying workload
  • Major advantage over artificial bbenchmarks
  • Challenges
  • turn ....
Reply concatenated trace samples
  • Problem
    • need short and representative workload
  • Solution
    • E.g. day long workload
  • Strength
    • reproduce all statistics within a sample window
  • Limits
    • introduces sampling error
Verify that we get representative workloads
  • CDF(?) #need to know later
case study: compare FIFO vs fair scheduler
  • fixed workload - rely on system n - performance comparison
  • workload=....
choice of scheduler depends on workload
  • workload one shows fair scheduler is much better.
  • other shows FIFO scheduler is better
    # comparing job's ***Takeaways
  • Need to consider workload level performance
  • Data sizes, arrival patterns, job types. Understanding performance is a community effort.
    • not everyone is like Google, Yahoo, Facebook
    • different feature and metrics relevant to each use case
  • Check out our evolving MR workload repository

11:00am - 11:50pm Leveraging Hadoop to Transform Raw Data to Rich Features at LinkedIn

This presentation focuses on the design and evolution of the LinkedIn recommendations platform. It currently computes more than 100 billion personalized recommendations every week, powering an ever growing assortment of products, including Jobs You May be Interested In, Groups You May Like, News Relevance, and Ad Targeting. We will describe how we leverage Hadoop to transform raw data to rich features using knowledge aggregated from LinkedIn's 100 million member base, how we use Lucene to do real-time recommendations, and how we marshal Lucene on Hadoop to bridge offline analysis with user-facing services.


Think platform leverage Hadoop
  • no analytic platform can complete without Hadoop
the worlds largest professional NW
  • new member joins -2/sec
  • company pages > 2M
Recommendations opportunity
  • right information to right person, ....
  • Recruit....
  • for everyone.
50%
  • allow recomendations.
  • positions, educations, ....
are all the same?
  • job title's name is not same but same.
are all companies the same?
  • IBM's variation
recommendation trade-offs
  • Real time
    • LinkedIn today
  • Time Independent, Content Analysis, Collaborative, Precision, Recall
Recommendation
  • similarity score calculation
  • Normalization , scoring adn Ranking
  • Filtering and FB
Technologies
  • Lucine x Hadoop x Zoie x voldemout x mahout x Kafka(?)
Hadoop case study
  • Scaling
  • Blending recommendation algorithms
  • Grandfathering
  • Model selection AB testing
  • Tracking and reporting
Scaling
  • Billions of recommendations
  • latency > 1 sec
    number of document, size of document
    | minhashing
  • latency < 1 sec
    recall=low
    | Hadoop
  • latency < 1 sec
    recall=high
Blending recommendation Algorithms
  • similarity between the several profiles.
  • how to blend together
  • two algorithm
  • Co-view
    impact latency - min
    complexity = high
    | Hadoop
    co view
    impact laency ...
    complextiy = low
Grandfathering
  • adding and changing features
  • next profile edit
  • notime guarantees minimal disruption
  • Parallel feature
  • Extraction pipeline (Batch)
  • Time - week
  • significant system work
Model selection
  • Features models
  • parameters
  • SVM
  • decision trees
  • logistic regression
  • content,
  • L1+L2 regularization
A/B testing
  • is option A better than option B? letst test.
Think platform leverage Hadoop

1:00pm-1:50pm HBase Roadmap. Jonathan Gray, Facebook.

This technical session will provide a quick review of the Apache HBase project, looking at it from the past to the future. It will cover the imminent HBase 0.92 release as well as what is slated for 0.94 and beyond. A number of companies and use cases will be used as examples to describe the overall direction of the HBase community and project.


3 road map
Apache HBase
  • A friendly open source project
  • (self opinion)
Apache HBase
  • a dynamic and pragmatic community
    • HBase committers scattered around many companies
    • a culture of acceptance ( please contribute
      • perhaps , occasionally , to a fault
    • many HBase committers have moved companies
  • Roadmap driven by sponsoring companies
    • HBase has no enterprise company behind it
The Ghost of HBase Past
  • Histroy
  • 2007 Bigtable clone for Hadoop
  • Six major releases
2008
  • random read/read access for offline processes
  • early users focused on offline , crawl data storage
    • power set was primary user
    • others like worldLingo, OpenPlaces
  • augmenting Offline MR
    • needed random writes for web crawling log
2009
  • OLTP databases for web startups
  • stand upon, trend micro, ...
  • Online HBase
  • Next gen of HBaser wanted OLTP
  • HBase Goes realtime
    • gave this talk at Hadoop summit 2009
    • HBase 0.20 first ever performance release
  • HBase 0.20
  • performance release (aka the unjavafy release)
    • Rewrite of entire read and write paths
  • Zookeeper Integration
2010
  • A highly available, scalable db for tech companies
  • facebook, Cloudera, yfog, twitter, gumgum, adobe, yahoo, ....
    #huge huge period of big change.
  • engineer driven tech.
  • HBase 0.90
  • Durability, Stability, Availabitily release
    • production ready HBase
    • Zero data loss
    • Rewrite of master and Zookeeper interactions
Testing HBase 0.90 production ready
  • Zero data loss
  • Mater rewrite
  • Operational improvements
    • HBCK,
    • Rolling restarts for ....
  • New features
    • Cluster to Cluster replication.....
HBase today
  • 2011
  • a large scale production capable DB system
  • cloudera, facebook , standupon, "eBay, salesforce.com", trendmicro
HBase 0.92
  • stability and feature release
    • lots of usability and stability improvements
  • 0.92 RC sometime in November
    • 0 blockers and 2 criticals as of this morning
Big new features
  • Coprocessors
    • Triggers and Stored procedures
    • Pre/post hooks to all client cals and server operations
    • Dynamically add new RPC call
    • ACL security atop coprocessors
  • HFile V2
    • support for very large regions/files
    • multi level block index and inline blooms
      #Bloom filter
      #can allow 20GB HFile
Performance improvements
  • more seeking and early out hints
  • distributed log splitting
  • cacheonwrite , evictonclose
Compaction improvements
  • multi-thread compaction
  • vastly improved selection algorithm
  • lots of metrics
Operational improvements
  • HBCK improvements , Web UI improvements
  • Slow query log , running tasks and thread status
  • online schema modification
Usability and API improvements
  • increment client API
  • String based fileter language
HBase 0,.92 Documentation
  • The book
    • Definitive Guide -- The apace HBase book
HBase of the Future 0.94 and beyond
2012
  • a usasble, large scale production DB system
  • you!
HBase 0.94
  • stability and usability is the core focus
    • increase stability by decreasing complexity
    • more work on UI, tools, monitoring ,operatively
    • table/family-level metrics
But features will always continue
  • Fast backup w/ point in recovery
  • multi slave replication
  • constraints and other coprocessor based contribs
Thrift 2.0
  • new Thrift API to more closely match java API
  • Embedded Thrift w/ short circuit
Other goodies
  • TTL + minVersions
  • Point in time snapshot scanners
  • Atomic append operation
Performance
  • Scaling for throughput vs. latency
    • early lock release to decrease row contention
    • early thread release to increase throughput
    • remove all global wait()/notify() on HLog
  • Improved seeking and file selection
    • Lazy seek in order file processing
    • DeleteFamily bloom filter
Project management
  • renewed focus on fast release cycle
    • HBase 0.94 branch cut immediately after 0,92 release
    • already close 0.94 feature freeze , 0.93 dev release
    • 3 blockers and 12 criticals left
Apache HBase, A slightly less accepting project
  • stability is really code stability
  • push towards interactive feature dev and branch dev
  • Coprocessors and service interfaces go a long way
2013
  • Beyond HBase 0.94
  • Stablility and usability is still the core focus
    • more tests, testing framework , integration tests
  • But features will always continue...
    • RPC redux
    • Dynamic configuration
    • Request , IO and locality based load balancing...
Conclusion
  • Apache HBase has come a long way.
    • Use case driven development
  • HBase 0.92 comming very soon
    • most stable release to date
  • Contributors and committers drive development
    • consumers cant dictate the road map
    • individuals and organizations solve their problems(they have their own users ... and jobs to keep)

Questions?

Secondary indexing

I'm not working for it. but .... HFile stores timestamp range?
yes

how to access the timestamp data

metadata

2:00pm-2:50pm Hadoop Hadoop 0.23 Arun Murthy, Hortonworks

Apache Hadoop is the de-facto Big Data platform for data storage and processing. The current stable, production release of Hadoop is “hadoop-0.20″. The Apache Hadoop community is preparing to release “hadoop-0.23″ with several major improvements including HDFS Federation and NextGen MapReduce. In this session, Arun Murthy, who is the Apache Hadoop Release Master for “hadoop.next”, will discuss the details of the major improvements in “hadoop-0.23″.


Releases so far
  • 0.20 is still the basis of all current, stable, Hadoop distributions.
  • 0.20.205
  • security + append -> HBase
    #release pace is going slow.
hadoop0.23
  • First stable release off apaceh hadoop iin over 30 monthes
HDFS - federation
  • Significant scaling
  • separation of namespace mgmt and Block mgmt
  • Suresh Srin....
MR - YARN
  • NextGen Hadoop Data processing framework
Performance
  • 2x + accross the board
  • HDFS read/write
    • CRC32
    • fadvise
    • Shortcut for local reads
  • MR
    • Unlock a lot of improvements from terasort record
    • shullfe 30%+
    • small jobs - Uber AM
  • Todd Lipcon - Wed 10am
HDFS - NN HA
  • The famous SPOF
  • well on the way to fix in hadoop 0.23
  • Suresh Srinvas , Aaron Myers , Tue 2:15
More ...
  • HDFS write pipeline improvements for HBase
  • Build - full Mavenization
  • EditLogs re-write, HDFS1713....
Deployment goals
  • Clusters of 6,000 machines
    • each machine with 16+ core 48/96GB RAM, 24TB/36TB disks
    • 200+ PB per cluster
    • 100,000+ concurrent tasks
    • 10,000 concurrent jobs
  • Yahoo: 50,000+ machines
What does it take to get there?
  • Testing lots of it
  • Benchmarks
  • Integration testing
    • HBase, Pig, Hive, Oozie....
  • Deployment discipline
Testing
  • Why is it hard?
    • MR is , effectively, very wide API
    • add streaming
    • add pipes
    • Oh, Pig/Hive ....
  • Functional test
    • Nightly
    • Nearly 1,000 functional test for MR alone
    • several hundred for Pig/Hive
  • Scale tests
    • Simulation
  • Longevity tests
    • Stress test
Benchmarks
  • Benchmarks every part of the HDFS& MR pipeline - HDFS read/write throughput
    • NN operation
    • Scan , Shuffle ,Sort
  • Gridmixv3 #
    • Run production traces in test clusters
    • Thousands of jobs
    • Stress mode v/x replay mode
Integration Testing
  • Several projects in the ecosystem
    • HBase, Pig, Hive, Oozie...
  • Cycle
    • Functional
    • Scale
    • Rinse, repeat
Deployment
  • Alpha/test (early UAT)
    • Starting nov 2011
    • small scale (500-800 nodes
Alpha
    • jan, 2012
    • majority of users
    • 2,000 nodes per cluster > 10,000 nodes in all
Beta
  • misnomer: 100s of PB, millions of user application
  • sighificantly wide variety of applications and load
  • 4,000+ nodes per cluster > 20,000 nodes in all
  • late Q1, 2012
Production
  • well, its production
  • mid to late 2012
  • twitter: @acmurthy

3:20pm-4:10pm Practical HBase Ravi Veeramchaneni, Informatica

Many developers have experience in working on relational databases using SQL. The transition to No-SQL data stores, however, is challenging and often time confusing. This session will share experiences of using HBase from Hardware selection/deployment to design, implementation and tuning of HBase. At the end of the session, audience will be in a better position to make right choices on Hardware selection, Schema design and tuning HBase to their needs.


Talk about...
  • what we can look into HBase
  • Use case
Why HBase?
  • Hadoop Benefits
  • stores and process large amounts of data
  • scales 100s and 1000s of nodes
  • fast 1TB sort in 62s 1 PB in 16.25h
  • Availability....
  • But ...
  • not so good or does not support
  • random access updating the data and/or file
  • apps that require low latency access to data
  • does not to support lots of small files....
Featuring HBase
  • HBase Scales runs on top of Hadoop
  • HBase provides fast table scans for time ranges and fast key based lookups
  • HBase storees null values for free
  • unstructured/semi-structured data
  • built-in version management
to solve big data problems
  • sparse data un or semi-structured data
  • cost effectively scalable
  • versioned data
  • atomic read write update
  • Linear distribution of data across the DN
  • rows are stored in byte-lexographic sorted order...
Navteq's Usecase
  • Contents is - constantly growing
    • sparse and unstructured
    • provided in multiple data formats
    • ingested , processed and delivered in transactional and batch mode
  • Content Breadth
    • 100s of millions of content records
    • 100s of content supplies + community input
Content Processing High level overview #
  • audio, video contents
HBase @ Navteq
  • started in 2009
    • 8node , VMware sandbox cluster
    • switched to CDH
  • Early 2010, HBase 0.20.x CDH2
    • 10node physical sandbox cluster
  • Current HBase 0.90.3
    • moved to CDH3u1 release ....
MEasured Business Value
  • Scalability & deployment
    • Handlining spikes are addressed by simply adding nodes
    • from 15 to 30 to 60 nodes and more , as datga grows...
  • Speed to market
    • By suporting real time transactions
    • batch updates are handle more efficiently ...
  • Faster supplier on boarding...
    #not complete ROI, software licence cost
HBase & Zookeeper
  • ZK - distributed coordination service
HBase depends on ZK and authorizes ZK to manage the state
  • HBase hosts key ...
Design Consideration
  • DB/Schema design
    • Transaction to column oriented or flat schema
  • Understand your access pattern
  • Row-key design / implementation
    • Sequential key
      • Suffers from distribution of load but uses the block caches
      • Can be addressed by pre-spretting the regions
  • Randomize keys to get better distribution
    • Achieved throughput hashing on Key Attributes - SHA1 or MD5
    • Suffers range scans
  • Too many column family is not good.
    • Compactions
Serialization
  • Avro didnt work well - deliveration issue
  • Developed configurable....
  • Secondary indexes
    • were using ITHBase and IHBase from contrib - doesn't work well
    • redesighed schema without need for index
    • we still need it through
  • Performance
    • Tunable parameter
      • Hadoop, HBase, OS, JVM, NW
  • Scalability
Hadoop/HBase processes
  • Maters
    • NN, JT, Zookeeper, HBase Master
  • Workers
    • DN, TT, Region server
    • on top of it
      • Hive , Pig, Java MR, Streaming
HW/Deployment considerations
  • HW
    • DN 24G RAM, 8 cores, 4x1TB
    • 6 mappers and 6 processors per node (16 mappers, 4 reducers)
    • memory allocation by process
      • DN 1GB, 2GB
      • TT - 1GB,2GB
      • map task 6x1GB
      • reduce task 6x1GB
      • region server - 8GB
      • total allocation - 24GB, 64GB
  • Deployment
    • Dont run ZK on DN node, have a separate ZK quorum
    • dont run HMaster on NN node.
    • avoid HMaster's SPOF
  • Configuration / Tuning
  • Configurating HBase
  • configuration is the key
  • many moving parts
OS
  • open files limit
  • vm.swapiness to lower or 0
HDFS
  • increase xcelvers to 2047
  • set socket timeout to 0
HBase
  • need more memory
  • ....
  • Per cluster
    • Turn off block cache if the hit ratio is less
  • Per table
    • Memstore flush size
    • Max file size
  • Per CF
    • Compression
    • Bloom filters
  • Per RS
    • Amount of heap in each RS to reserve for all memstores
    • Memstore flush size
  • Per SF

....

    • write optimization
      • hbase.regionserver.global.memstore.upper...
  • Security
    • Still in flux, need robust RBAC
  • Reliability
Desired Fetures
  • Better operational tools for using Hadoop and HBase
  • support for secondary indexes
  • Full-text indexes and searching
  • Data replication for HA and DR
  • Security at table , CF and Row level
  • documentation -> Oreilly book released. #

QA.

  • Why don't adopt Cassandra , MongoDB?
    • Simply Scalability
    • and Natively support Hadoop(HDFS)

4:20pm-5:10pm The Powerful Marriage of R and Hadoop - David Champagne, Revolution Analytics

When two of the most powerful innovations in modern analytics come together, the result is revolutionary.
This session will cover:
An overview of R, the Open Source programming language used by more than 2 million users that was specifically developed for statistical analysis and data visualization.
The ways that R and Hadoop have been integrated.
A use case that provides real-world experience.
A look at how enterprises can take advantage of both of these industry-leading technologies.


About Revolution difference between R and revolution R enterprize
Why R and Hadoop?
  • R a statistical programming language
  • NEed for more than counts and averages analyze all the data
Motivation for this projcet
  • easy for the R programming to interact with the Hadoop data stores and write MR programs
  • Run R on a massively distributed system without having understand the underlying...
R and Hadoop - The R packages
  • rhdfs - R and HDFS
  • fhbase - R and HBase
  • rmr - R and MR
rhdfs
  • manipulate HDFS directly to ..
  • File namupulations
  • File read/write
rhbase
  • Manupurate Hbase table and their content Uses Thrift
  • Table manipuration
  • Row read/write
Writing MR programs in R (rmr)
  • for programmers
    • a way to access big data sets
    • a simple way to parallel programs
    • very R-like , building on the functional characteristics of R Just a library
  • for developers
    • much simpler than writing Java
    • not as simple as Hive. Pig at what they do, but more general
    • great for prototyping , can transition to production -- optimize instead of rewriting lower risk , always executable
rmr MR function
  • input - input folder
  • output - output folder
  • map - R function used as map
  • reduce - R functioin used as reduce...
  • other advanced for programmers
Some simple things
  • Example showing sampling and counting #... look the slide
Compare with Hive query and rmr programming # ... look the slide
Complex things k-means clustering
k-means implementation
  • well known design
  • comparison of the k-means in MR
  • Pig - 100 line code -> rmr - 20 line code
k-means Highlights
k-means optimizations