Tuesday, June 9, 2015

Big Data Processing


Big data is a somewhat nebulous term that describes data that can’t be processed by traditional data processing techniques, such as an RDBMS-based application running on a single machine. Big data isn’t necessarily a large volume of data. It can be data that is generated at a high velocity. Big data can also be data that has a lot of variety, such as unstructured data. 

Hadoop and Other Tools

Apache Hadoop, a framework for the distributed processing of large data sets across clusters of machines, is probably the best known tool in this space. Besides providing a powerful MapReduce implementation and a reliable distributed file system, the Hadoop Distributed File System (HDFS), Hadoop anchors an ecosystem of big data tools built on top of it, including the following, to name a few (a minimal MapReduce example follows the list):

■ Apache HBase is a distributed database for large tables.
■ Apache Hive is a data warehouse infrastructure that allows ad hoc SQL-like queries over data stored in HDFS.
■ Apache Pig is a high-level platform for creating MapReduce programs.
■ Apache Mahout is a machine learning and data mining library.
■ Apache Crunch and Cascading are both frameworks for creating MapReduce pipelines.
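
To make the MapReduce model concrete, here is a sketch of the classic word-count job written against the standard Hadoop MapReduce Java API (input and output paths are taken from the command line; this is the canonical example rather than code tied to any of the tools above). Note how the mapper and reducer reuse their Writable objects instead of allocating new ones on every call:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  // The Text and IntWritable instances are reused across calls to map().
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner cuts shuffle traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}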

Although these tools are powerful, they also add overhead that won’t pay off unless your data set is really big.

There are many configuration settings you can tune on a Hadoop cluster, and you can always add more nodes if your application is not processing your data as fast as you need it to. However, keep in mind that nothing will have a bigger impact on your big data application's performance than making your own code faster.
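
To give a sense of what such tuning looks like, here is a small driver sketch that sets a few standard Hadoop 2.x job properties; the class name is made up and the values are arbitrary and would need to be tuned for a real cluster and workload:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Compress intermediate map output to cut shuffle I/O and network traffic.
    conf.setBoolean("mapreduce.map.output.compress", true);
    // A larger map-side sort buffer (in MB) means fewer spills to disk.
    conf.setInt("mapreduce.task.io.sort.mb", 256);

    Job job = Job.getInstance(conf, "tuned job");
    // Number of reduce tasks; a common starting point is close to the
    // cluster's total reduce container capacity.
    job.setNumReduceTasks(16);
    // ... set mapper/reducer classes and input/output paths as usual ...
  }
}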

The point is that every microsecond counts. Choose the fastest Java data structures for your problem, use caching where possible, avoid unnecessary object instantiation, use efficient String manipulation, and apply your Java programming skills to produce the fastest code you can.
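
As a small, self-contained illustration of the String-manipulation point (the loop size and class name are made up for the example), building a large String with repeated concatenation allocates a new String on every iteration, while a single reused StringBuilder does the same work with one growing buffer:

public class StringConcatDemo {
  public static void main(String[] args) {
    int n = 100_000;

    // Slow: each += creates a new String (plus a hidden StringBuilder),
    // so the loop copies O(n^2) characters in total.
    long start = System.nanoTime();
    String slow = "";
    for (int i = 0; i < n; i++) {
      slow += i;
    }
    System.out.printf("concatenation: %d ms (%d chars)%n",
        (System.nanoTime() - start) / 1_000_000, slow.length());

    // Fast: one StringBuilder is reused for the whole loop, so the work is O(n).
    start = System.nanoTime();
    StringBuilder sb = new StringBuilder(n * 6);  // pre-size to avoid resizing
    for (int i = 0; i < n; i++) {
      sb.append(i);
    }
    String fast = sb.toString();
    System.out.printf("StringBuilder: %d ms (%d chars)%n",
        (System.nanoTime() - start) / 1_000_000, fast.length());
  }
}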

Big data technology, such as Apache Hadoop, tackles the problems of volume and velocity by scaling horizontally using fault-tolerant software, which tends to be cheaper and more scalable than the more traditional approach of vertically scaling very reliable hardware.  Apache Hadoop deals with variety by using storage formats that support both unstructured and structured data. Machine learning (ML) algorithms are commonly used to process big data.


Resources:
Hadoop: The Definitive Guide: http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
Hadoop Tutorials: https://developer.yahoo.com/hadoop/tutorial/
