➤ Recap from my viewpoint.
"Hadoop for Enterprise".
This phrase was the first thing that came to my mind when I looked back on Hadoop Summit 2012.
There is no doubt that the Data*1 business has already crossed the Chasm*2, and Hadoop is working as its core.
This time I mostly attended sessions in the "Future of Apache Hadoop" and "Deployment and Operations" categories, so my notes may not fit that theme very well. If you still want to see them, I have already posted my notes for Day1 here and Day2 here.
➤ Other impressions of mine.
➤➤ 1: Growing scale.
- According to "Status of Hadoop 0.23 Operations at Yahoo!"*4, Yahoo!'s Hadoop is still scaling. Interestingly, it is not one big cluster: they have 20+ clusters of 1k+ nodes each.
➤➤ 2: Many are Pig users.
- I don't know whether it was because Yahoo! hosted the event, but this time I heard about Pig users in many sessions.
➤➤ 3: Less energetic than Hadoop World.
- Each session was 40 minutes, so the presentations did not go into much detail and tended to be abstract*5. That made it hard for me to ask questions.
- I don't know why, but judging from the Twitter hashtag #hadoopsummit, not many people were tweeting. The Facebook page was not used much either.
- The slides are still not up on their site, and I don't know when they will be. They said "over the coming days", though.
For everyone at #hadoopsummit who has asked, all sessions were recorded and will be posted to URL over the coming days.
- I also attended Hadoop World last November, and compared with it, this conference felt less energetic to me.
➤➤ One more: Facebook group for this conference.
- I want to say thank you to everyone who joined this group.
- If I may say so, my only regret is that no non-Japanese attendees joined this group.
#I am sorry about that myself. I was too introverted and hesitated to speak up and ask non-Japanese people to join this group.
- After the conference I found this LinkedIn event, and it has so many attendees. I didn't have a LinkedIn account at the time, so next time I want to try the same thing on LinkedIn.
➤ What I think I need to learn and investigate further after this conference.
Hortonworks Data Platform (HDP)
Hortonworks Data Platform (HDP) is a 100% open source data management platform based on Apache Hadoop. It allows you to load, store, process and manage data in virtually any format and at any scale. As the foundation for the next generation enterprise data architecture, HDP includes all of the necessary components to begin uncovering business insights from the quickly growing streams of data flowing into and throughout your business. Hortonworks Data Platform is ideal for organizations that want to combine the power and cost-effectiveness of Apache Hadoop with the advanced services required for enterprise deployments. It is also ideal for solution providers that wish to integrate or extend their solutions with an open and extensible Apache Hadoop-based platform. Hortonworks Data Platform will be available for download beginning on Friday, June 15th. ...
Apache Ambari
Apache Ambari is a web-based tool for installing, managing, and monitoring Apache Hadoop clusters. The set of Hadoop components that are currently supported by Ambari includes:
- Apache HBase
- Apache HCatalog
- Apache Hadoop HDFS
- Apache Hive
- Apache Hadoop MapReduce
- Apache Oozie
- Apache Pig
- Apache Sqoop
- Apache Templeton
- Apache Zookeeper
...
Apache HCatalog
Apache HCatalog is a table and storage management service for data created using Apache Hadoop. This includes:
- Providing a shared schema and data type mechanism.
- Providing a table abstraction so that users need not be concerned with where or how their data is stored.
- Providing interoperability across data processing tools such as Pig, Map Reduce, and Hive.
...
YARN
MapReduce NextGen aka YARN aka MRv2
The new architecture introduced in hadoop-0.23, divides the two major functions of the JobTracker: resource management and job life-cycle management into separate components. The new ResourceManager manages the global assignment of compute resources to applications and the per-application ApplicationMaster manages the application's scheduling and coordination. An application is either a single job in the sense of classic MapReduce jobs or a DAG of such jobs. The ResourceManager and per-machine NodeManager daemon, which manages the user processes on that machine, form the computation fabric. The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks. ...
backup-hadoop-and-hive/README.txt at master · TAwarehouse/backup-hadoop-and-hive · GitHub
This project provides a method for backing up a hadoop cluster. There are two components that need to be backed up: the contents of the hadoop filesystem (hdfs), and the hive DDL. The latter is what provides a way to string together the files kept in hdfs and access them via HQL. The BackupHdfs class deals with the first task. It traverses the entire hdfs filesystem, ordering all found files by timestamp. It then copies them (just like "hadoop fs -copyToLocal") to the local filesystem. Each file is checksum-verified after the copy to ensure integrity. BackupHdfs has options for ignoring files earlier than a given timestamp, which is needed for incremental backups. ...
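The flow described in the README excerpt above (walk the tree, order files by timestamp, skip anything older than a cutoff, copy, then checksum-verify each copy) can be sketched roughly in Python. This is only a sketch under assumptions, not the project's BackupHdfs class: it walks a local directory as a stand-in for HDFS, and the names `backup_tree`, `file_checksum` and `min_mtime` are hypothetical ones chosen for illustration.

```python
import hashlib
import os
import shutil


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_tree(source_root, dest_root, min_mtime=0.0):
    """Copy files modified at or after min_mtime, oldest first.

    A local directory stands in for HDFS here (an assumption of this sketch).
    Returns the relative paths that were copied, in copy order.
    """
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(full_path)
            # Skipping older files is what makes an incremental backup possible.
            if mtime >= min_mtime:
                candidates.append((mtime, full_path))

    candidates.sort()  # order all found files by timestamp

    copied = []
    for _mtime, full_path in candidates:
        relative = os.path.relpath(full_path, source_root)
        target = os.path.join(dest_root, relative)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy2(full_path, target)
        # Verify the copy by comparing checksums of source and destination.
        if file_checksum(full_path) != file_checksum(target):
            raise IOError("checksum mismatch after copying " + relative)
        copied.append(relative)
    return copied
```

For an incremental run, one would pass the timestamp of the previous backup as `min_mtime`, so that only files changed since then are copied again.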
Apache OODT
It's metadata for middleware (and vice versa):
- Transparent access to distributed resources
- Data discovery and query optimization
- Distributed processing and virtual archives
But it's not just for science! It's also a software architecture:
- Models for information representation
- Solutions to knowledge capture problems
- Unification of technology, data, and metadata
...
Folding into Hive · hbutani/SQLWindowing Wiki · GitHub
Currently the Windowing Engine runs on top of Hive. An obvious question is if this functionality can be folded into Hive. There are undeniable reasons for doing this, briefly: end users would want this functionality inside Hive for reasons of consistent behavior, support etc. use of a consistent expression language. For e.g. reuse of Hive functions in Windowing clauses. Implementation wise: Windowing is orders of magnitude simpler than Hive, and can benefit from using equivalent components that are in Hive. More on this in the next section. Avoid the trap of constantly chasing changes in the Hive code base. Folding in Table function mechanics may open up optimizations not possible with the approach today. Here, we initially summarize how the Windowing Engine works and map it to concepts/components in Hive. Then we list a multi-stage process of moving Windowing & Table functionality into Hive. I am no expert on Hive, so have erred on the side of being non intrusive. There is good chance that there are better approaches to some of the later steps; open to comments/suggestions from the community. ...
Vertica
The Vertica Analytics Platform delivers:
- Real-time insight into your data allowing you to consume, analyze, and make informed decisions at the speed of business.
- Fastest time to value making it possible to monetize your data in a matter of minutes, not days.
- Maximized performance meaning you get the most out of your analytic infrastructure investment.
...
➤ Other people's blog posts related to Hadoop Summit 2012.
➤ Appendix: Links to each of my notes from Hadoop Summit 2012.
Day1:
- 08:30am - 10:05am Keynote & Plenary Sessions in the Main Ballroom
- 10:30am - 11:10am Big Data Architectures in the AWS Cloud
- 11:25am - 12:05pm HDFS - What is New and Future
- 01:30pm - 02:10pm HMS:Scalable and flexible configuration management system for Hadoop stack
- 02:25pm - 03:05pm Big Data Challenges at NASA
- 03:35pm - 04:15pm Infrastructure around Hadoop - backups, failover, configuration, and monitoring
- 04:30pm - 05:10pm Hadoop and Vertica: The Data Analytics Platform at Twitter
Day2:
- 8:30am - 10:05am Keynote - Geoffrey Moore and Plenary Sessions in the Main Ballroom
- 10:30am - 11:10am The Future of HCatalog.
- 11:25am - 12:05pm Analytical Queries with Hive: SQL Windowing and Table Functions
- 1:30pm - 2:10pm Status of Hadoop 0.23 Operations at Yahoo!
- 2:25pm - 3:05pm Network reference architecture for Hadoop – validated and tested approach to define a reference network design for Hadoop
- 3:35pm - 4:15pm PayPal Behavioral Analytics on Hadoop
- 4:30pm - 5:10pm Writing New Application Frameworks on Apache Hadoop Yarn
✔ Related posts.
*1:I don't want to use the word BIGDATA
*2:[http://en.wikipedia.org/wiki/File:Technology-Adoption-Lifecycle.png:title=Technology Adoption Lifecycle Model]
*3:Geoffrey Moore, author of “Crossing the Chasm” and “Escape Velocity”
*4:this speaker has already moved to LinkedIn, though
*5:and most of the speakers flipped through their slides and spoke too fast