2011-11-09

Memos: #hw2011 (Day2), 09, Nov, 2011.

conference hadoop hw2011

f:id:garage-kid:20111115145611j:image:w230

NYC で開催された Hadoop World 2011 の第一日目

のメモに引き続き、第二日目のメモを公開します。最後に、サマリ

を書きます。

✔ Table of Contents: Memos: Hadoop World 2011 (Day2), 09, Nov, 2011.

08:30am-09:45am Keynotes

The Future of Data Management - James Markarian, Informatica

The State of the Apache Hadoop Ecosystem - Doug Cutting, Cloudera

10:00am-10:50am Hadoop and Performance. Todd Lipcon, Cloudera. Yanpei Chen, Cloudera.

11:00am-11:50pm Leveraging Hadoop to Transform Raw Data to Rich Features at LinkedIn

01:00pm-01:50pm HBase Roadmap. Jonathan Gray, Facebook.

02:00pm-02:50pm Hadoop Hadoop 0.23 Arun Murthy, Hortonworks

03:20pm-04:10pm Practical HBase Ravi Veeramchaneni, Informatica

04:20pm-05:10pm The Powerful Marriage of R and Hadoop - David Champagne, Revolution Analytics

Links to the Day1 and the Summary.

✔ 8:30am-09:45am Keynotes.The Future of Data Management - James Markarian, Informatica

Hadoop World 2011 Keynote: The Future of Data Management - James Markarian, Informatica

View more presentations from Cloudera, Inc.

James Markarian will discuss historical trends and technology shifts in data management and how the data deluge has contributed to the emergence of Apache Hadoop. James will showcase examples of how forward-looking organizations are leveraging Hadoop to maximize their Return on Data to improve insight and operations. By sharing his perspective about this next major analytics platform, James will discuss why Hadoop is poised to change the face of analytics and data management. Finally he will challenge the Hadoop ecosystem to work together to close the remaining technology gaps in Hadoop.

story of Analytics.

Prove Hadoop is the important product from this 30 years.

logical connection

new recognition of the world around us.

what surrounding around us information.

Woman's image

failed to understand what she needs, we make bad selection.

Commoditization of technology

Business agility
Democratizatin of Data
Explosion of Data
#more cheap the infra become.
|
How to become the top of this?

from 70's

problems = flexibility
- schema
- query performance
  | fasten and fasten
  data parallelism

Hotel story

analytics give hotel customer sattisfaction

Application everywhere

BI
risk analysis, ...
Security
DBAtion???
When First read MR, Dynamo paper story.
DB industry can become interesting again.

✔ 8:30am-09:45am Keynotes. Hadoop World 2011 Keynote: The State of the Apache Hadoop Ecosystem - Doug Cutting, Cloudera

Hadoop World 2011 Keynote: The State of the Apache Hadoop Ecosystem

View more presentations from Cloudera, Inc.

Doug Cutting, co-creator of Apache Hadoop, will discuss the current state of the Apache Hadoop open source ecosystem and its trajectory for the future. Doug will highlight trends and changes happening throughout the platform, including new additions and important improvements. He will address the implications of how different components of the stack work together to form a coherent and efficient platform. He will draw particular attention to Big Top, a project initiated by Cloudera to build a community around the packaging and interoperability testing of Hadoop-related projects with the goal of providing a consistent and interoperable framework. Cutting will also discuss the latest additions in CDH4 and the platform roadmap for CDH.

Outline

the ecosystem
why we need it
what it is
why ...

Why are we here

HW has improved
- exponentially for decades
- both storage and compute
We can now store and process much more
- yet have been slow to leverage
Analyzing more data makes us smarter
#the more you have data, the more you can become more smarter.
can make better decisions.

The Ecosystem is the system

Hadoop has become the kernel
- of the distributed OS for Big data
No one uses the kernel alone
A collection of projects at Apache

Strength of Apache

Mandates diversity & transparency
- you control your fate
Insures against vendor lock-in
- cant buy the ASF
Allow competing project
- survival of the fittest
Ecosystem as loose federation
- lets platform evolve
  #everyone can be the member, can become the influencer.
  #nobody pulling your legs.
  #fork is also allowed

Whats new?

Apache Hadoop 0.20.205
- append
- security
CDH3
- Mahout included
- Avro support across components

Whats next?

Apache Hadoop 0.23
- HDFS
  - performance
  - scalability (federation)
  - availability
MR2

CDH4

Includes Hadoop 0.23
BigTop-based (?) later explained

Ecosystem

S4, Giraph, Crunch, Blur....

Apache BigTop (incubating)

Ecosystem as a project
- integration test
- compatible versions
- common packaging
- release is a test
Basis for CDH
- Like Fedora is for RHEL
community driven
Includes:
- Hadoop, HBase, Zookeeper, Avro, Hive, Pig, Oozie, Flume, Mahout....

Join the community

Hadoop and big data are still young
HW trends will continue
Hadoop started with just two developers
Now it has hundreds
You can be the next
What do you need?
#we are very lucky to meet this opportunity

QA

Using java risk.

agree with the risk.

work together products dependency

Like a Linux kernel

Hadoop will also work

Realtime capability

Storm

Google contribution now?

they are not contribute openly so much

mostly for its internal.

about Avro

Top3 area of Ecosystem

not top down decision they make but...

realtime processing like HBase

Commodity

Hadoop is cost efffective.

Development is quite slow

still not 1.0

release number is not so important.

not important to change the numbering the version.

✔ 10:00am-10:50am Hadoop and Performance. Todd Lipcon, Cloudera. Yanpei Chen, Cloudera.

Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera

View more presentations from Cloudera, Inc.

Performance is a thing that you can never have too much of. But performance is a nebulous concept in Hadoop. Unlike databases, there is no equivalent in Hadoop to TPC, and different use cases experience performance differently. This talk will discuss advances on how Hadoop performance is measured and will also talk about recent and future advances in performance in different areas of the Hadoop stack.

Measurement of performance

MR/HDFS
per job latency
cluster throughput
efficiency

HBase

Average 99 percentile performance
Write read throughput
in cache vs out of cache

Performance from Hadoop developers POV

What do we measure?
overhead

CPU overhead lock contension

CPU seconds per HDFS GB written / read
Miicro second of latency to....
Scheduler/framework overhead
- no op MR job ( sleep -mt 1 -rt 1 -m 5000 -r 1)
Disk utilization
- iostat

Hadoop performance myth

Java is slow?
most of Hadoop is IO or NW bound , not CPU

Java doesn't give you enough low level system accesses

JNI allows us to call any syscall that C can
we can integrate ....

Lies, damn lies, and statistics benchmark

HDFS/MR improvements IO/Cashing

MR is designed do primarily Sequential IO
Linux IO sheculer read ahead heuristics are weak
- HDFS 0.20 does many seeks even for sequential access
Many data sets are significally larger trhan memory
capacity cashing them is a waste of memroy
Linux employs lazy page writeback ...

IO cashing solution

Linux provides 3 useful syscalls
- posix_....

IO cashing results

20% improvements on terasort wall-clock

MR CPU improvement: sort

CPU inefficiency
- optimization
  - Use sun.misc.unsafe for -4x reduction in CPU usage of this function
CPU cache usage
- optimization
  - include 4byte key prefix with pointer , avoiding memory indirection for most comparison

Benchmark results

CDH3u2 optimized build with cache conscious ....

MR improvements: Scheduler

0.20.2
- TT heartbeat once every 3 seconds , even on a small cluster
  Trunk/0.20.205/CDH3

Scheduler results

10node cluster , 10 map slots per machine
hadoop jar examples.jar sleep -mt 1 -rt1 ....

HDFS CPU improvements: Checksuming

Significant CPU overhead
Optimization
Switch to bulk API
verify compute 64KB at a time instead of 512 bytes Switch to CRC32C ....

Checksum improvements

only 16% overhead vx unchecksummed access
maintain...

HDFS random access

0.20.2
- TCP hadnshake overhead
0.23
- ...

Random read micro benchmark

HBase YCSB, Random-read macro benchmark

MR2 improvement:shuffle

shuffle improvents
auto tunes

Summary

Hadoop is faster than ever
2-3x
CPU usage
2x faster wall-clock...

Upsteam JIRA

Coming to a Hadoop near you

Apache Hadoop 0,23
Many improvements
Next

advances in MR performance measurement

state of the art in MR performance Essential but inefficient benchmarks

Gridmix2, Hive BM, Hivench, PigMix, Gridmix3
# acess each benchmark in Datasize , intensity variations , job types, cluster independent, synthetic workload...
Representive? Reply?

Need to measure what production MR system REally do

from generic large elephant? ...

Need to advance the state of the art

develop scienteific knowledge
Apply to system engineering

First comparison of production MR

facebook trace
6 month in 2009 , approx. 6,000 jobs per day
Cloudera customer trace
- 1 week in 2011, 3000 jobs per day
Not in this talk trace from 4 other customer...

Different job sizes

input, shuffle, output size.

Different workload time variation

number of jobs
input + shuffle + output data size
Map + reduce task time

Different job types - identify by kmeans

%of jobs, input, shuffle, output, duration, maptime, readtime, Description

summary

facebook
95% small jobs
CC(Cloudera Costumer)-b
88.9% small jobs
#system optimization should be adopted for its specific workloads.

Measure performance by replying workload

Major advantage over artificial bbenchmarks
Challenges
turn ....

Reply concatenated trace samples

Problem
- need short and representative workload
Solution
- E.g. day long workload
Strength
- reproduce all statistics within a sample window
Limits
- introduces sampling error

Verify that we get representative workloads

CDF(?) #need to know later

case study: compare FIFO vs fair scheduler

fixed workload - rely on system n - performance comparison
workload=....

choice of scheduler depends on workload

workload one shows fair scheduler is much better.
other shows FIFO scheduler is better
# comparing job's ***Takeaways
Need to consider workload level performance
Data sizes, arrival patterns, job types. Understanding performance is a community effort.
- not everyone is like Google, Yahoo, Facebook
- different feature and metrics relevant to each use case
Check out our evolving MR workload repository
- SWIM - Statistical Workload Injector for MapReduce

✔ 11:00am - 11:50pm Leveraging Hadoop to Transform Raw Data to Rich Features at LinkedIn

Hadoop World 2011: LeveragIng Hadoop to Transform Raw Data to Rich Features at LinkedIn - Abhishek Gupta & Adil Aijaz, LinkedIn

View more presentations from Cloudera, Inc.

This presentation focuses on the design and evolution of the LinkedIn recommendations platform. It currently computes more than 100 billion personalized recommendations every week, powering an ever growing assortment of products, including Jobs You May be Interested In, Groups You May Like, News Relevance, and Ad Targeting. We will describe how we leverage Hadoop to transform raw data to rich features using knowledge aggregated from LinkedIn's 100 million member base, how we use Lucene to do real-time recommendations, and how we marshal Lucene on Hadoop to bridge offline analysis with user-facing services.

Think platform leverage Hadoop

no analytic platform can complete without Hadoop

the worlds largest professional NW

new member joins -2/sec
company pages > 2M

Recommendations opportunity

right information to right person, ....
Recruit....
for everyone.

50%

allow recomendations.
positions, educations, ....

are all the same?

job title's name is not same but same.

are all companies the same?

IBM's variation

recommendation trade-offs

Real time
- LinkedIn today
Time Independent, Content Analysis, Collaborative, Precision, Recall

Recommendation

similarity score calculation
Normalization , scoring adn Ranking
Filtering and FB

Technologies

Lucine x Hadoop x Zoie x voldemout x mahout x Kafka(?)

Hadoop case study

Scaling
Blending recommendation algorithms
Grandfathering
Model selection AB testing
Tracking and reporting

Scaling

Billions of recommendations
latency > 1 sec
number of document, size of document
| minhashing
latency < 1 sec
recall=low
| Hadoop
latency < 1 sec
recall=high

Blending recommendation Algorithms

similarity between the several profiles.
how to blend together
two algorithm
Co-view
impact latency - min
complexity = high
| Hadoop
co view
impact laency ...
complextiy = low

Grandfathering

adding and changing features
next profile edit
notime guarantees minimal disruption
Parallel feature
Extraction pipeline (Batch)
Time - week
significant system work

Model selection

Features models
parameters
SVM
decision trees
logistic regression
content,
L1+L2 regularization

A/B testing

is option A better than option B? letst test.

Think platform leverage Hadoop

✔ 1:00pm-1:50pm HBase Roadmap. Jonathan Gray, Facebook.

Hadoop World 2011: Apache HBase Road Map - Jonathan Gray - Facebook

View more presentations from Cloudera, Inc.

This technical session will provide a quick review of the Apache HBase project, looking at it from the past to the future. It will cover the imminent HBase 0.92 release as well as what is slated for 0.94 and beyond. A number of companies and use cases will be used as examples to describe the overall direction of the HBase community and project.

3 road map

Apache HBase

A friendly open source project
(self opinion)

Apache HBase

a dynamic and pragmatic community
- HBase committers scattered around many companies
- a culture of acceptance ( please contribute
  - perhaps , occasionally , to a fault
- many HBase committers have moved companies
Roadmap driven by sponsoring companies
- HBase has no enterprise company behind it

The Ghost of HBase Past

Histroy
2007 Bigtable clone for Hadoop
Six major releases

2008

random read/read access for offline processes
early users focused on offline , crawl data storage
- power set was primary user
- others like worldLingo, OpenPlaces
augmenting Offline MR
- needed random writes for web crawling log

2009

OLTP databases for web startups
stand upon, trend micro, ...
Online HBase
Next gen of HBaser wanted OLTP
HBase Goes realtime
- gave this talk at Hadoop summit 2009
- HBase 0.20 first ever performance release
HBase 0.20
performance release (aka the unjavafy release)
- Rewrite of entire read and write paths
Zookeeper Integration

2010

A highly available, scalable db for tech companies
facebook, Cloudera, yfog, twitter, gumgum, adobe, yahoo, ....
#huge huge period of big change.
engineer driven tech.
HBase 0.90
Durability, Stability, Availabitily release
- production ready HBase
- Zero data loss
- Rewrite of master and Zookeeper interactions

Testing HBase 0.90 production ready

Zero data loss
Mater rewrite
Operational improvements
- HBCK,
- Rolling restarts for ....
New features
- Cluster to Cluster replication.....

HBase today

2011
a large scale production capable DB system
cloudera, facebook , standupon, "eBay, salesforce.com", trendmicro

HBase 0.92

stability and feature release
- lots of usability and stability improvements
0.92 RC sometime in November
- 0 blockers and 2 criticals as of this morning

Big new features

Coprocessors
- Triggers and Stored procedures
- Pre/post hooks to all client cals and server operations
- Dynamically add new RPC call
- ACL security atop coprocessors
HFile V2
- support for very large regions/files
- multi level block index and inline blooms
  #Bloom filter
  #can allow 20GB HFile

Performance improvements

more seeking and early out hints
distributed log splitting
cacheonwrite , evictonclose

Compaction improvements

multi-thread compaction
vastly improved selection algorithm
lots of metrics

Operational improvements

HBCK improvements , Web UI improvements
Slow query log , running tasks and thread status
online schema modification

Usability and API improvements

increment client API
String based fileter language

HBase 0,.92 Documentation

The book
- Definitive Guide -- The apace HBase book

HBase of the Future 0.94 and beyond

2012

a usasble, large scale production DB system
you!

HBase 0.94

stability and usability is the core focus
- increase stability by decreasing complexity
- more work on UI, tools, monitoring ,operatively
- table/family-level metrics

But features will always continue

Fast backup w/ point in recovery
multi slave replication
constraints and other coprocessor based contribs

Thrift 2.0

new Thrift API to more closely match java API
Embedded Thrift w/ short circuit

Other goodies

TTL + minVersions
Point in time snapshot scanners
Atomic append operation

Performance

Scaling for throughput vs. latency
- early lock release to decrease row contention
- early thread release to increase throughput
- remove all global wait()/notify() on HLog
Improved seeking and file selection
- Lazy seek in order file processing
- DeleteFamily bloom filter

Project management

renewed focus on fast release cycle
- HBase 0.94 branch cut immediately after 0,92 release
- already close 0.94 feature freeze , 0.93 dev release
- 3 blockers and 12 criticals left

Apache HBase, A slightly less accepting project

stability is really code stability
push towards interactive feature dev and branch dev
Coprocessors and service interfaces go a long way

2013

Beyond HBase 0.94
Stablility and usability is still the core focus
- more tests, testing framework , integration tests
But features will always continue...
- RPC redux
- Dynamic configuration
- Request , IO and locality based load balancing...

Conclusion

Apache HBase has come a long way.
- Use case driven development
HBase 0.92 comming very soon
- most stable release to date
Contributors and committers drive development
- consumers cant dictate the road map
- individuals and organizations solve their problems(they have their own users ... and jobs to keep)

Questions?

Secondary indexing

I'm not working for it. but .... HFile stores timestamp range?
yes

how to access the timestamp data

metadata

✔ 2:00pm-2:50pm Hadoop Hadoop 0.23 Arun Murthy, Hortonworks

Hadoop World 2011: Apache Hadoop 0.23 - Arun Murthy, Horton Works

View more presentations from Cloudera, Inc.

Apache Hadoop is the de-facto Big Data platform for data storage and processing. The current stable, production release of Hadoop is “hadoop-0.20″. The Apache Hadoop community is preparing to release “hadoop-0.23″ with several major improvements including HDFS Federation and NextGen MapReduce. In this session, Arun Murthy, who is the Apache Hadoop Release Master for “hadoop.next”, will discuss the details of the major improvements in “hadoop-0.23″.

Releases so far

0.20 is still the basis of all current, stable, Hadoop distributions.
0.20.205
security + append -> HBase
#release pace is going slow.

hadoop0.23

First stable release off apaceh hadoop iin over 30 monthes

HDFS - federation

Significant scaling
separation of namespace mgmt and Block mgmt
Suresh Srin....

MR - YARN

NextGen Hadoop Data processing framework

Performance

2x + accross the board
HDFS read/write
- CRC32
- fadvise
- Shortcut for local reads
MR
- Unlock a lot of improvements from terasort record
- shullfe 30%+
- small jobs - Uber AM
Todd Lipcon - Wed 10am

HDFS - NN HA

The famous SPOF
well on the way to fix in hadoop 0.23
Suresh Srinvas , Aaron Myers , Tue 2:15

More ...

HDFS write pipeline improvements for HBase
Build - full Mavenization
EditLogs re-write, HDFS1713....

Deployment goals

Clusters of 6,000 machines
- each machine with 16+ core 48/96GB RAM, 24TB/36TB disks
- 200+ PB per cluster
- 100,000+ concurrent tasks
- 10,000 concurrent jobs
Yahoo: 50,000+ machines

What does it take to get there?

Testing lots of it
Benchmarks
Integration testing
- HBase, Pig, Hive, Oozie....
Deployment discipline

Testing

Why is it hard?
- MR is , effectively, very wide API
- add streaming
- add pipes
- Oh, Pig/Hive ....
Functional test
- Nightly
- Nearly 1,000 functional test for MR alone
- several hundred for Pig/Hive
Scale tests
- Simulation
Longevity tests
- Stress test

Benchmarks

Benchmarks every part of the HDFS& MR pipeline - HDFS read/write throughput
- NN operation
- Scan , Shuffle ,Sort
Gridmixv3 #
- Run production traces in test clusters
- Thousands of jobs
- Stress mode v/x replay mode

Integration Testing

Several projects in the ecosystem
- HBase, Pig, Hive, Oozie...
Cycle
- Functional
- Scale
- Rinse, repeat

Deployment

Alpha/test (early UAT)
- Starting nov 2011
- small scale (500-800 nodes

Alpha

- jan, 2012
- majority of users
- 2,000 nodes per cluster > 10,000 nodes in all

Beta

misnomer: 100s of PB, millions of user application
sighificantly wide variety of applications and load
4,000+ nodes per cluster > 20,000 nodes in all
late Q1, 2012

Production

well, its production
mid to late 2012
twitter: @acmurthy

✔ 3:20pm-4:10pm Practical HBase Ravi Veeramchaneni, Informatica

Hadoop World 2011: Practical HBase - Ravi Veeramchaneni, Informatica

View more presentations from Cloudera, Inc.

Many developers have experience in working on relational databases using SQL. The transition to No-SQL data stores, however, is challenging and often time confusing. This session will share experiences of using HBase from Hardware selection/deployment to design, implementation and tuning of HBase. At the end of the session, audience will be in a better position to make right choices on Hardware selection, Schema design and tuning HBase to their needs.

Talk about...

what we can look into HBase
Use case

Why HBase?

Hadoop Benefits
stores and process large amounts of data
scales 100s and 1000s of nodes
fast 1TB sort in 62s 1 PB in 16.25h
Availability....
But ...
not so good or does not support
random access updating the data and/or file
apps that require low latency access to data
does not to support lots of small files....

Featuring HBase

HBase Scales runs on top of Hadoop
HBase provides fast table scans for time ranges and fast key based lookups
HBase storees null values for free
unstructured/semi-structured data
built-in version management

to solve big data problems

sparse data un or semi-structured data
cost effectively scalable
versioned data
atomic read write update
Linear distribution of data across the DN
rows are stored in byte-lexographic sorted order...

Navteq's Usecase

Contents is - constantly growing
- sparse and unstructured
- provided in multiple data formats
- ingested , processed and delivered in transactional and batch mode
Content Breadth
- 100s of millions of content records
- 100s of content supplies + community input

Content Processing High level overview #

audio, video contents

HBase @ Navteq

started in 2009
- 8node , VMware sandbox cluster
- switched to CDH
Early 2010, HBase 0.20.x CDH2
- 10node physical sandbox cluster
Current HBase 0.90.3
- moved to CDH3u1 release ....

MEasured Business Value

Scalability & deployment
- Handlining spikes are addressed by simply adding nodes
- from 15 to 30 to 60 nodes and more , as datga grows...
Speed to market
- By suporting real time transactions
- batch updates are handle more efficiently ...
Faster supplier on boarding...
#not complete ROI, software licence cost

HBase & Zookeeper

ZK - distributed coordination service

HBase depends on ZK and authorizes ZK to manage the state

HBase hosts key ...

Design Consideration

DB/Schema design
- Transaction to column oriented or flat schema
Understand your access pattern
Row-key design / implementation
- Sequential key
  - Suffers from distribution of load but uses the block caches
  - Can be addressed by pre-spretting the regions
Randomize keys to get better distribution
- Achieved throughput hashing on Key Attributes - SHA1 or MD5
- Suffers range scans
Too many column family is not good.
- Compactions

Serialization

Avro didnt work well - deliveration issue
Developed configurable....
Secondary indexes
- were using ITHBase and IHBase from contrib - doesn't work well
- redesighed schema without need for index
- we still need it through
Performance
- Tunable parameter
  - Hadoop, HBase, OS, JVM, NW
Scalability

Hadoop/HBase processes

Maters
- NN, JT, Zookeeper, HBase Master
Workers
- DN, TT, Region server
- on top of it
  - Hive , Pig, Java MR, Streaming

HW/Deployment considerations

HW
- DN 24G RAM, 8 cores, 4x1TB
- 6 mappers and 6 processors per node (16 mappers, 4 reducers)
- memory allocation by process
  - DN 1GB, 2GB
  - TT - 1GB,2GB
  - map task 6x1GB
  - reduce task 6x1GB
  - region server - 8GB
  - total allocation - 24GB, 64GB
Deployment
- Dont run ZK on DN node, have a separate ZK quorum
- dont run HMaster on NN node.
- avoid HMaster's SPOF
Configuration / Tuning
Configurating HBase
configuration is the key
many moving parts

OS

open files limit
vm.swapiness to lower or 0

HDFS

increase xcelvers to 2047
set socket timeout to 0

HBase

need more memory
....
Per cluster
- Turn off block cache if the hit ratio is less
Per table
- Memstore flush size
- Max file size
Per CF
- Compression
- Bloom filters
Per RS
- Amount of heap in each RS to reserve for all memstores
- Memstore flush size
Per SF

....

- write optimization
  - hbase.regionserver.global.memstore.upper...
Security
- Still in flux, need robust RBAC
Reliability

Desired Fetures

Better operational tools for using Hadoop and HBase
support for secondary indexes
Full-text indexes and searching
Data replication for HA and DR
Security at table , CF and Row level
documentation -> Oreilly book released. #

QA.

Why don't adopt Cassandra , MongoDB?

Simply Scalability

and Natively support Hadoop(HDFS)

✔ 4:20pm-5:10pm The Powerful Marriage of R and Hadoop - David Champagne, Revolution Analytics

Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, Revolution Analytics

View more presentations from Cloudera, Inc.

When two of the most powerful innovations in modern analytics come together, the result is revolutionary.
This session will cover:
An overview of R, the Open Source programming language used by more than 2 million users that was specifically developed for statistical analysis and data visualization.
The ways that R and Hadoop have been integrated.
A use case that provides real-world experience.
A look at how enterprises can take advantage of both of these industry-leading technologies.

About Revolution difference between R and revolution R enterprize

Why R and Hadoop?

R a statistical programming language
NEed for more than counts and averages analyze all the data

Motivation for this projcet

easy for the R programming to interact with the Hadoop data stores and write MR programs
Run R on a massively distributed system without having understand the underlying...

R and Hadoop - The R packages

rhdfs - R and HDFS
fhbase - R and HBase
rmr - R and MR

rhdfs

manipulate HDFS directly to ..
File namupulations
File read/write

rhbase

Manupurate Hbase table and their content Uses Thrift
Table manipuration
Row read/write

Writing MR programs in R (rmr)

for programmers
- a way to access big data sets
- a simple way to parallel programs
- very R-like , building on the functional characteristics of R Just a library
for developers
- much simpler than writing Java
- not as simple as Hive. Pig at what they do, but more general
- great for prototyping , can transition to production -- optimize instead of rewriting lower risk , always executable

rmr MR function

input - input folder
output - output folder
map - R function used as map
reduce - R functioin used as reduce...
other advanced for programmers