Object Storage for Big Unstructured Data
There are two kinds of Big Data: Big Data (for analytics) and Big Unstructured Data

Big Data is big, but it also causes a lot of confusion. The term is used for anything related to storage these days, so people no longer know exactly what it means. Is it Hadoop? Is it analytics? It doesn’t need to be that complicated, though. There are two kinds of Big Data: Big Data (for analytics) and Big Unstructured Data.

Big Data for analytics is a paradigm that became popular in the previous decade. Much of the innovation came out of research projects. New technology enabled researchers in many different domains to capture data in ways they had never been able to before. In agriculture, for example, ploughs were fitted with sensors that would send little bits of information to a central system (over satellite). Every couple of feet these sensors would measure what’s in the ground (minerals, for example), how humid the ground is, and so on. Based on that, large agriculture companies would then be able to make better decisions on where to grow which crop.

The problem was that the traditional systems for storing data (relational databases) were no longer adequate for this massive amount of small data items. Systems like MapReduce and Hadoop were created as an alternative, storing and processing these massive volumes of small files as concatenated “Big” files. Big Data was born: Big Data for semi-structured data.

Today we are seeing a similar trend with unstructured data. Studies show that data storage requirements will grow by a factor of 30 over the next decade. 80% of that data consists of large files: office documents, movies, music, pictures. As with relational databases in the previous decade, traditional storage (file systems) is not the best way to store this data. File systems will not scale sufficiently and will actually become obsolete as applications take over the role of the file system.

A nice example is what Google Picasa does for us: in the old days we would store pictures nicely organized in a file system (hopefully with some backups). One folder per year, one per month in each year, one per holiday or party. Today, we just dump all the pictures in one folder and Picasa will sort them for us based on date, location, face recognition (!) or other metadata. With an intelligent query, we can display the right pictures very fast, much faster than browsing the file system. We don’t even have to worry about backups as we can store copies in the cloud automatically.

The new paradigm that will help us store these massive amounts of unstructured data is Object Storage. Object Storage systems are uniformly scalable pools of storage that are accessible through a REST interface. Files (objects) are dumped into the pool and an identifier is kept to locate each object when it is needed. Applications that are designed to run on top of object storage use these identifiers through the REST protocol. A good analogy is valet parking versus self-parking. When you self-park you have to remember the lot, the floor, the aisle, etc. (file system); with valet parking you get a receipt when you hand over your keys, and you later use that receipt to get your car back (object storage).
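The valet analogy can be sketched in a few lines of Python. This is only an illustration, not any vendor's API: the class name `ObjectStore`, its methods, and the in-memory dictionary are all assumptions made for the example. A real object store would expose the same operations over HTTP (typically PUT, GET and DELETE requests on the object's identifier) and would keep the data on many disks.

```python
import uuid


class ObjectStore:
    """Hypothetical flat pool of objects addressed only by identifier (no folders)."""

    def __init__(self):
        # identifier -> (object bytes, metadata); stands in for the storage pool
        self._pool = {}

    def put(self, data, metadata=None):
        """Drop an object into the pool and hand back its 'receipt'."""
        object_id = uuid.uuid4().hex
        self._pool[object_id] = (bytes(data), dict(metadata or {}))
        return object_id

    def get(self, object_id):
        """Present the receipt to get the object back."""
        data, _metadata = self._pool[object_id]
        return data

    def delete(self, object_id):
        """Remove the object from the pool."""
        del self._pool[object_id]


store = ObjectStore()
receipt = store.put(b"holiday picture bytes", {"year": "2011"})
picture = store.get(receipt)  # no lot, floor or aisle to remember
```

The application never learns where the object lives; it only keeps the identifier, which is what lets the pool grow or rebalance without the application noticing.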

So what is needed to build an object storage system? Basically just lots of disks, a REST API and a way to provide durability. Durability could be provided with traditional techniques like RAID, but the problem is that RAID requires a huge amount of overhead to provide acceptable availability. The more data we store, the more painful it is to need 200% overhead, as some systems do. The smarter way to provide durability for object storage is erasure encoding.

Erasure encoding stores objects as equations, which are spread over the entire storage pool: data objects are split up into sub-blocks, from which equations are calculated. According to the availability policy, a number of extra equations (the overhead) is calculated, and the equations are spread over as many disks as possible, also policy-defined. As a result, when a disk breaks, the system still has sufficient equations to restore the original data block, as long as no more disks fail than the policy allows. After a disk failure, the system can recalculate the lost equations as a background task to bring the number of available equations back to a healthy level. A pioneer of this technology is Amplidata, which uses low-power Atom processors in its hardware to reduce power costs. Because the entire system (all storage nodes) can recalculate missing equations as a background task, Amplidata figured out it was not necessary to use the high-end nodes that RAID systems need (to speed up restores and avoid performance losses).
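To make the idea of "storing equations" concrete, here is a minimal sketch of one textbook erasure code. It is not Amplidata's algorithm, which the article does not describe; the function names, the prime 257 and the parameters `k` (sub-blocks per stripe) and `n` (shares written, one per disk) are assumptions chosen for illustration. Each stripe of `k` bytes defines a polynomial, every share holds one point on that polynomial, and any `k` surviving points are enough to solve for the original bytes.

```python
PRIME = 257  # smallest prime above 255, so every byte value fits in the field


def _evaluate(points, x):
    """Evaluate at x the unique polynomial through `points`, modulo PRIME."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % PRIME
                den = den * (xi - xj) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total


def encode(data, k, n):
    """Split data into stripes of k bytes and return n shares, keyed 1..n."""
    if not 1 <= k <= n < PRIME:
        raise ValueError("need 1 <= k <= n < 257")
    padded = data + bytes(-len(data) % k)  # zero-pad to a multiple of k
    shares = {x: [] for x in range(1, n + 1)}
    for start in range(0, len(padded), k):
        stripe = padded[start:start + k]
        points = list(enumerate(stripe, start=1))  # (x, byte) for x = 1..k
        for x in shares:
            shares[x].append(_evaluate(points, x))
    return shares


def decode(shares, k, length):
    """Rebuild the original bytes from any k surviving shares."""
    if len(shares) < k:
        raise ValueError("not enough shares left to restore the data")
    xs = sorted(shares)[:k]
    out = bytearray()
    for s in range(len(shares[xs[0]])):
        points = [(x, shares[x][s]) for x in xs]
        out.extend(_evaluate(points, x) for x in range(1, k + 1))
    return bytes(out[:length])


data = b"Big Unstructured Data"
shares = encode(data, k=4, n=6)   # 4 data shares plus 2 extra "equations"
del shares[2], shares[5]          # two disks break
restored = decode(shares, k=4, length=len(data))
assert restored == data
```

With `k=4` and `n=6` the data survives the loss of any two shares. Rebuilding a lost share is the same calculation: evaluate the polynomial at the missing position, which is the kind of work that can run in the background on any node.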

Apart from providing a more efficient and more scalable way to store data, erasure-coding-based object storage can save up to 70% on the overall TCO thanks to reduced raw storage needs and reduced power needs (less hardware and low-power devices save on power and cooling). Also, uniformly scalable storage systems with an automated healing mechanism drastically reduce management effort and cost.
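The reduced raw storage need is simple arithmetic. The sketch below compares the 200% overhead mentioned earlier (which corresponds to keeping three full copies of every object) with an erasure coding policy. The 16-out-of-20 policy and the 1,000 TB figure are assumptions picked for illustration, not numbers from any product, and the function names are hypothetical.

```python
def replication_overhead(copies):
    """Extra raw capacity, in percent, when every object is stored `copies` times."""
    return (copies - 1) * 100.0


def erasure_overhead(data_blocks, total_blocks):
    """Extra raw capacity, in percent, when data_blocks are expanded to total_blocks."""
    return (total_blocks - data_blocks) / data_blocks * 100.0


def raw_capacity_needed(usable_tb, overhead_percent):
    """Raw disk capacity (TB) required to hold usable_tb of data."""
    return usable_tb * (1 + overhead_percent / 100.0)


three_copies = replication_overhead(3)      # 200.0 percent
policy_16_20 = erasure_overhead(16, 20)     # 25.0 percent, tolerates 4 lost blocks

print(raw_capacity_needed(1000, three_copies))  # 3000.0 TB of raw disk
print(raw_capacity_needed(1000, policy_16_20))  # 1250.0 TB of raw disk
```

Fewer raw terabytes also means fewer disks to power and cool, which is where the rest of the savings in the paragraph above would come from.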

So what are the use cases for object storage? As data needs grow, object storage will become the storage paradigm of choice in more and more environments, but already today we see the need in a number of situations:

Building live archives
Object storage enables companies to re-activate their data. Currently, most companies see data more as a burden than anything else: they assume it will never be used again, yet it needs to be archived for a whole lot of reasons. But this data actually has a lot of value. With live archives, employees have faster access to older data and can put those valuable resources to use. With traditional storage it would not be feasible to build disk-based archives for this purpose, as the overhead would make them too costly.

Online applications
Most data-intensive online (cloud) applications are built on public cloud storage services such as Amazon S3, which are early implementations of Object Storage. The benefits for the application providers are plenty: a simple programming interface, low cost and fast time to market. As their data sets grow, those companies might move to private Object Storage implementations to reduce costs even more.

Media and entertainment
Traditionally, the M&E industry has been very much file-oriented, but we’re seeing a growing interest in object storage, both to optimize efficiency and reduce costs and because this industry is already hitting the limits of its file systems.

These are just a few examples of Object Storage implementations for Big Unstructured Data. Object Storage was not built to replace any of the current storage architectures. Just as NAS filers were designed in the 1990s because block storage (SAN, designed when databases were king) was not optimized for unstructured data, Object Storage will find its place next to those two for Big Unstructured Data.


About Tom Leyden
Tom Leyden is Director of Alliances and Marketing at Amplidata, a Belgian Object Storage Innovator. He has 15 years’ experience at technology ventures (from startup to acquisition) and innovation-oriented technology enterprises, including 4 years in Cloud Computing. Through his collaboration with Belgium-based technology incubator Incubaid, Leyden has been involved in several successful startups, including data deduplication pioneer DataCenter Technologies and Q-layer, which designed the first Cloud Computing IAAS platform and was acquired by Sun Microsystems.
