Big Data: Problem and Solutions
Big Data is a concept that has more to do with data processing than with data size in absolute terms. We know we are dealing with Big Data when traditional ways of processing the data no longer work in a reasonable amount of time. “Traditional” here means the computing paradigms most commonly used in the enterprise world, such as relational databases. To process this kind of dataset, we can take several approaches:
- Use a different set of commercial, ready-made software packages designed for Big Data. You could say that these tools are themselves becoming traditional, so this definition will have to be revised in a few years, but data size is also growing very quickly (Facebook data grew from 20 petabytes in 2010 to 30 petabytes in 2011, and continues to increase at a rate of 10 TB per day).
- Develop our own specialized software with parallel and distributed computing (like MapReduce)
- Just add more processing power to the current software configuration and buy time until the next problem appears (this depends on the budget)
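As a rough illustration of the second approach, here is a minimal sketch of the MapReduce idea applied to a word count. It is not how any particular framework implements it: the names `map_chunk`, `merge_counts` and `word_count` are made up for this example, and a local thread pool stands in for the worker machines of a real cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map phase: count the words in one independent chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce phase: merge two partial results into one."""
    return left + right


def word_count(lines, chunk_size=2, workers=4):
    # Split the input into chunks that can be processed independently.
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    # Assumption: threads stand in for the separate machines of a real cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Combine the partial counts into the final result.
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    sample = [
        "big data is big",
        "data keeps growing",
        "budgets do not",
    ]
    print(word_count(sample))
```

The point of the split is that each chunk can be counted without knowing anything about the others, so the map phase can be spread over as many workers as the budget allows, and only the small partial results need to be merged.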
The Budget Variable
The third point can of course be combined with the first two, although it can’t be applied in very complex situations. Still, it’s no news that a slow database server is more often replaced with a faster machine than fixed by refactoring the entire application. If you have a number of powerful machines, you don’t need to waste time improving small details of your code or database. Well, that’s not true. That is the really lazy way.
The truth is that there are times when a new machine is not an option. Small companies with low budgets, large companies with tight budgets: there are many reasons why this can happen.
Given limited hardware, a diligent software developer and a diligent database administrator can do wonders in optimization. If the application is large enough, there is always room for optimizing. And best of all, when new hardware does arrive, the application will perform even better and will be more scalable.
The opposite case, a large and growing dataset with a bad structure or a badly optimized application that has merely been upgraded with more powerful hardware, can be a huge problem in production environments.
There are two versions of the quote about lazy people and difficult tasks. The first is apparently by Walter Chrysler: “Whenever there is a hard job to be done I assign it to a lazy man; he is sure to find an easy way of doing it.” The second is attributed to Bill Gates: “I will always choose a lazy person to do a difficult job. Because he will find an easy way to do it.” Maybe Bill Gates applied his own advice and found that the easiest way to come up with a memorable quote is to copy another one.
I’m sorry, but they were both wrong. In the Big Data world, I would assign the hardest task to a stingy person who can solve it with the minimum amount of hardware possible.
The Big Data People: Anything but Lazy
To apply the second point (and also the first one), you will need diligent developers and diligent systems engineers. I have seen firsthand that the larger the system is, the less information you will find on the Internet to help you resolve an outage that is affecting thousands of users.
You will probably face problems that have never happened before, and you will need fast thinkers. And regarding Big Data and large scale systems, there is hardly ever an “easy way” to solve a problem.
Not surprisingly, in a parallel, distributed environment, the technologies that perform best are the most difficult to learn. Strongly typed languages are the better choice if your project will contain hundreds of thousands or millions of lines of code. Acquiring all of this knowledge is not typical of a lazy person.
This doesn’t mean there is no truth in the quote. We all know computer science enthusiasts who like to make things more complicated than they are. Sometimes they want to use a cool new library or feature, or apply complicated design patterns to a simple task.
But all of this disappears in a low-budget environment. When there are tight deadlines and limited resources, hardworking people will give you the best solution.