Where is Hadoop? Just a few years ago, it was everywhere. Internet heavy hitters such as Facebook and Yahoo boasted about the size of their Hadoop clusters and applications. It was easy to imagine the day when Hadoop would be the one indispensable tool for anyone who dealt in large volumes of data.
And now? It isn’t exactly hanging out with Waldo and Carmen Sandiego in some forgotten corner of the Netherworld, but Hadoop is no longer the talk of the town. New tools have come along, and the Big Data giants have turned their attention elsewhere. Yesterday’s trend is today’s old news. (Hear that, Docker fans? We may have more to say on Docker-as-the-new-Hadoop in a future post.)
Hadoop is an extremely useful framework for tackling the kinds of problems that it was designed to solve — typically, managing and processing enormous sets of data distributed across clusters of off-the-shelf systems. The key word here is “enormous” — quantities of data (typically in the terabyte or petabyte range) that are simply too large to be handled effectively by more traditional techniques. If, for example, you are trying to store and work with a set of 5TB to 20TB files on a cluster of computers each of which has a 3TB hard drive, no single file even fits on one node’s disk; Hadoop can tackle the job, but a typical SQL-based system using the cluster’s native file system probably will not be up to it.
Hadoop manages to handle such large blocks of data by using its own distributed file system (HDFS), which sits on top of the host machines’ native file systems and splits each file into blocks that are spread (and replicated) across the cluster’s nodes. It also uses a distributed processing system built on the MapReduce model, which speeds up data manipulation by moving the computation to the nodes that already hold the relevant chunks of data, rather than shipping the data around the cluster. In other words, it’s the go-to framework for dealing with genuinely Big Data.
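To make the model concrete, here is a minimal, purely local sketch of the map/shuffle/reduce flow using a word count, the canonical example. This is not Hadoop code — a real job would go through the Hadoop API or Hadoop Streaming, and the framework itself would handle the shuffle and the distribution across nodes — so the function names and the in-memory “shuffle” below are stand-ins for illustration only.

```python
from collections import defaultdict

# Map step: each mapper sees one chunk of the input (in Hadoop, an HDFS block
# on its local node) and emits (key, value) pairs -- here, (word, 1).
def map_chunk(chunk):
    for word in chunk.split():
        yield (word.lower(), 1)

# Shuffle step: the framework groups all emitted values by key across every
# mapper's output. Here a plain dict imitates that grouping.
def shuffle(mapped_pairs):
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

# Reduce step: each reducer receives one key and all of its values, and
# combines them into a final result -- here, a total count per word.
def reduce_key(key, values):
    return (key, sum(values))

if __name__ == "__main__":
    # Two "chunks" standing in for two HDFS blocks on two different nodes.
    chunks = ["big data is big", "data about data"]

    mapped = [pair for chunk in chunks for pair in map_chunk(chunk)]
    grouped = shuffle(mapped)
    counts = dict(reduce_key(k, v) for k, v in grouped.items())

    print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

The point of the model is that the map step runs where the data already lives, so only the much smaller intermediate (key, value) pairs have to cross the network during the shuffle.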
This kind of super-size scaling requires tradeoffs. The methodical divide-and-conquer approach that MapReduce uses to handle multi-terabyte blocks of data is likely to be slow and clumsy, compared to index-based SQL queries, when applied to data files in the hundreds of gigabytes or smaller. Standard data management tools are generally designed to work very efficiently with data files of a “reasonable” size. More often than not, this means files that can be stored and accessed by the native file system. If well-written SQL/NoSQL queries and RAID (or single hard drive) storage will do the job, Hadoop is probably overkill. It’s one of those tools that’s indispensable when you genuinely need it, but likely to get in the way when you don’t.
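For comparison, this is the kind of access pattern that wins at “reasonable” sizes: an indexed lookup touches only the rows it needs, where a MapReduce job would scan the whole data set to answer the same question. The sketch below uses an in-memory SQLite table purely for illustration; the table and column names are made up.

```python
import sqlite3

# A throwaway in-memory table standing in for a modest, single-machine data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "login"), (2, "login"), (1, "purchase"), (3, "login")],
)

# The index lets the query jump straight to user 1's rows instead of scanning
# the entire table -- the opposite of a MapReduce-style full pass over the data.
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
rows = conn.execute(
    "SELECT action FROM events WHERE user_id = ?", (1,)
).fetchall()

print(rows)  # [('login',), ('purchase',)]
conn.close()
```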
And as often happens (in part for reasons outlined below) when a new and impressive tool comes along, people not only adopt it — they over-adopt it. It becomes trendy, and soon people who should know better are using it in ways that are not really appropriate. As the trend fades (often accompanied by a backlash), they set it aside, and move on (or back) to tools which are more suitable for the job. By now, Hadoop has gone through most, if not all, of this cycle.
So where is Hadoop? It’s being used in situations where it is the best, if not the only, way to handle massive blocks of data, and it will probably continue to be used in such situations for some time. But you are now less likely than you would have been a few years ago to find it in situations where the size of the data doesn’t justify its use. In other words, it’s still around, but it’s just not the latest trend any more.
None of this is surprising. Hadoop, after all, targeted a rather specialized market segment from the beginning. Anyone watching from the sidelines might have been forgiven for assuming that it would succeed largely within the confines of its niche — that it would be used to handle genuinely Big Data, and not much else.
“Big Data” itself is not a neutral term. It doesn’t simply refer to blocks of data which are so large that they require special handling; if that were all that “Big Data” meant, it would just be a way of describing a particular kind of technical challenge, and nothing more. But, as even a cursory online search of the term “Big Data” (48,000,000+ hits) suggests, that is clearly not the case. “Big Data” is a Big Buzzword, and there’s a good reason for that. Big Data is marketable. If you have it, you can (presumably) mine it for valuable information, and if you deal in it, you are a Big Player, potentially worth Big Money. Big Data is status.
The kind of status that Big Data represents is in turn a marketable commodity, and not just another form of symbolic feel-good status. Whether you’re lining up venture capital or preparing for an IPO, it’s important to convince potential investors that your company has (or will produce) something of significant value. Back at the peak of the original dot-com boom, a bit of razzle-dazzle and a little hand-waving were usually enough to prime the pump. But in these relatively cautious times, investors want to see something that they understand and recognize as valuable. If you’re handling Big Data, then almost by definition, you’re in possession of a valuable asset.
How do you show technically sophisticated investors that you’re dealing in Big Data? At the very least, you need to be using recognizable Big Data tools — and what could be more recognizable than Hadoop? If being seen using Hadoop means that you’re handling Big Data, which in turn means that you have the potential to bring in Big Money, which in turn will bring in Big Investors — then you’ll use Hadoop, whether your data needs it or not.
This may sound cynical, but it’s really no more cynical than leasing a better-looking-than-necessary office suite, or spending a little extra on the decor for the executive boardroom. If marketing sometimes leads to the use of tools which are inappropriate or even counterproductive, that doesn’t mean that the people making the decisions are technically ignorant (although they may be) or malicious (which they are very unlikely to be); it simply means that their priorities are based on business considerations (attracting investment money), and not technical efficiency.
This may be the take-home point for anyone who is involved in IT, in development, and particularly in DevOps, where the choice of tools can be crucial and it is not always easy to separate what is important from what is merely trendy. A technical tool isn’t only a technical tool, and its value may not lie only in its technical utility. A company that habitually makes its technical choices based on what is currently trendy is headed for trouble, but there may be times when the visible use of a technically inappropriate tool has genuine (and even necessary) marketing value. It isn’t always easy to recognize the difference between mere trendiness and good marketing (let alone to know how to deal with it), but it is important to keep the distinction in mind.