Overview of Big Data Tools
What is Big Data?
Big Data refers to very large data sets, or data arriving at high velocity, in structured or unstructured form. Every business captures data, and this data needs to be stored for later analysis. It is usually kept in data marts (dedicated to a single line of business) or data warehouses (collections of data marts). Analytical tools can then interpret the data, giving us insight into critical aspects of the business. Businesses need these insights to dig into specific areas of their operations, understand performance, and formulate future strategies and growth areas so they remain competitive.
Why Big Data?
Every organization generates large volumes of data from its business. This data describes many facets of the business, so it needs to be stored where it can be analyzed for insight into how the business is functioning. Big Data is kept in a dedicated data store such as HDFS (Hadoop Distributed File System) or a NoSQL database. Storing the data in a separate store enables faster, more timely analysis and thereby helps organizations make quick decisions.
How to make the leap to Big Data?
The biggest question in front of organizations is: how do we make big data work?
The hype and the range of big data technology options make finding the right answer hard. The goal must be to design and build a big data environment that is low cost and low complexity. It should be stable, highly integrated, scalable and available for the whole organization to consume.
Some questions that organizations should ask before beginning the big data journey:
1: Which data points to collect?
2: Data sources: machine logs, sensors, websites, mobile sites and apps, social media and many other sources
3: How do we collect the data? Build an open, secure and scalable API framework for ingestion
4: Processing tools: Does the data need processing before being fed into the data pipeline? What processing tools do we use?
5: Storage: What kind of storage technologies do we need?
The sections below list some of the most common tools and software systems for big data:
Big Data Processing Tools:
a. Hadoop- This is an open source product from Apache. It is scalable, reliable software that can process large distributed data sets. It can be deployed on a single host or scaled out to several thousand hosts.
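Hadoop's core programming model is MapReduce. The word count below is only a pure-Python sketch of that model's three phases (map, shuffle, reduce) over in-memory data, not Hadoop's actual API; Hadoop runs these same phases in parallel across a cluster.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big insights", "data drives insights"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
# counts["big"] == 2, counts["drives"] == 1
```

In real Hadoop, the shuffle also sorts and partitions keys across reducers; the toy version only groups them.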
b. Storm- This is an open source product from Apache. It is used to reliably process unbounded streaming data in real time. Storm integrates seamlessly with your existing database and queuing technologies.
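Storm structures a topology as spouts (sources of tuples) and bolts (processing steps). The plain-Python sketch below imitates that shape with a finite stand-in source; real Storm runs spouts and bolts in parallel across a cluster and handles genuinely unbounded streams.

```python
from collections import Counter

def sentence_spout():
    """Spout: a source of tuples (a finite stand-in for an unbounded stream)."""
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield sentence

def split_bolt(sentence):
    """Bolt: split each incoming sentence tuple into word tuples."""
    return sentence.split()

def count_bolt(words, counts):
    """Bolt: maintain a running count, updated one tuple at a time."""
    counts.update(words)
    return counts

running = Counter()
for tup in sentence_spout():
    running = count_bolt(split_bolt(tup), running)
```

The key property mirrored here is tuple-at-a-time processing: state is updated as each tuple arrives rather than after a batch completes.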
c. Spark- This is an open source product from Apache. Spark is used for fast processing of large data sets; its in-memory processing makes it far faster than disk-based engines, and it also supports near-real-time workloads. Spark SQL supports structured querying of large data sets.
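Spark's RDD API chains lazy transformations (map, filter) that only run when an action (collect) is called, keeping intermediate data in memory. The MiniRDD class below is a toy single-machine stand-in for that pattern, not PySpark itself.

```python
class MiniRDD:
    """Toy stand-in for a Spark RDD: transformations are lazy generators;
    an action (collect) materializes the result from in-memory data."""
    def __init__(self, data):
        self._data = data

    def map(self, fn):
        return MiniRDD(fn(x) for x in self._data)

    def filter(self, pred):
        return MiniRDD(x for x in self._data if pred(x))

    def collect(self):
        return list(self._data)

result = (MiniRDD(range(1, 6))
          .map(lambda x: x * x)       # lazy: nothing computed yet
          .filter(lambda x: x % 2 == 0)
          .collect())                  # action: pipeline executes here
# result == [4, 16]
```

Real Spark adds partitioning, fault tolerance via lineage, and cluster scheduling on top of this transformation/action model.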
Big Data Storage Tools:
a. Cassandra- Open source, scalable, highly available, fault-tolerant, and able to store data in the range of hundreds of terabytes to tens of petabytes. Several third-party companies offer professional support for Cassandra.
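Cassandra gets its scalability and fault tolerance by hashing each partition key onto a ring of nodes and replicating to the following nodes. The sketch below shows that placement idea for a hypothetical three-node cluster; real Cassandra uses Murmur3 token ranges and configurable replication strategies rather than this simple modulo scheme.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster
REPLICATION_FACTOR = 2

def replicas_for(partition_key):
    """Hash the partition key to pick a primary node, then place
    replicas on the next nodes around the ring (simplified)."""
    digest = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    start = digest % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

placement = replicas_for("user:42")  # two distinct nodes hold this partition
```

Because placement is computed from the key, any node can route a read or write without a central coordinator, which is what makes the design highly available.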
b. Hive- Open source platform built on top of Apache Hadoop. It stores data in HDFS. It provides a SQL interface for querying, aggregating and managing large distributed data sets.
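Hive's query language, HiveQL, is close to standard SQL. The snippet below uses Python's built-in sqlite3 module purely to illustrate the kind of aggregate query Hive serves; Hive itself would compile a similar statement into distributed jobs over files in HDFS.

```python
import sqlite3

# sqlite3 is a local stand-in here; the table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (url TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("/home", 120), ("/about", 30), ("/home", 80)])

# A GROUP BY aggregation, the bread-and-butter Hive workload.
rows = conn.execute(
    "SELECT url, SUM(views) FROM page_views GROUP BY url ORDER BY url"
).fetchall()
# rows == [("/about", 30), ("/home", 200)]
```

The point of Hive is that analysts can write exactly this style of SQL while the engine handles distribution and fault tolerance underneath.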
c. Greenplum- Open source, massively parallel data platform that provides powerful and rapid analytics on data sources ranging in size from a few gigabytes to petabytes.
d. HBase- HBase is an open source, distributed, versioned, column-oriented store modeled after Google's Bigtable. HBase provides random, real-time read/write access to your Big Data and supports very large tables -- billions of rows by millions of columns -- atop clusters of commodity hardware. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
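HBase's data model is essentially a sparse, sorted map from a row key to column-family:qualifier cells. The toy class below sketches that model in memory; real HBase adds versioning by timestamp, region splitting, and distribution across region servers.

```python
from collections import defaultdict

class MiniColumnStore:
    """Toy sketch of an HBase-style table: row key -> {"family:qualifier": value}.
    Illustrative only; not the HBase client API."""
    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Random write: set one cell for a row."""
        self._rows[row_key][column] = value

    def get(self, row_key, column):
        """Random read: fetch one cell, or None if absent (sparse rows)."""
        return self._rows.get(row_key, {}).get(column)

table = MiniColumnStore()
table.put("user#1001", "info:name", "Ada")
table.put("user#1001", "metrics:logins", 7)
```

Note that rows need not share columns: sparseness is what lets HBase tables grow to millions of columns without wasting space.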
Big Data Analysis Tools:
a. Presto- Open source distributed SQL engine that can run interactive analytic queries on data sources ranging in size from a few gigabytes to petabytes.
b. Drill- Open source SQL query engine for Hadoop, NoSQL and cloud storage. Drill is designed to query large semi-structured and evolving data sets using SQL constructs. It works on relational and non-relational stores such as MongoDB and HDFS, and a single query can join data from multiple data sources.
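Drill's distinguishing feature is joining heterogeneous sources in one query. The plain-Python sketch below mimics that idea by joining records shaped like CSV rows with records shaped like JSON documents; the data and field names are hypothetical, and Drill would express this as a single SQL JOIN across storage plugins.

```python
# Two heterogeneous sources: rows from a CSV-style file and
# documents from a JSON store (both are in-memory stand-ins).
csv_orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 50},
    {"order_id": 2, "customer_id": "c2", "amount": 75},
]
json_customers = [
    {"customer_id": "c1", "name": "Acme"},
    {"customer_id": "c2", "name": "Globex"},
]

# Build a lookup on the join key, then join order rows to customer docs.
names = {c["customer_id"]: c["name"] for c in json_customers}
joined = [
    {"order_id": o["order_id"],
     "customer": names[o["customer_id"]],
     "amount": o["amount"]}
    for o in csv_orders
]
```

In Drill the equivalent would be one SQL statement referencing both sources by path, with no schema declared up front.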
c. Pig- Open source platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
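A typical Pig script LOADs records, GROUPs them by a key, and computes an aggregate FOREACH group. The plain-Python dataflow below mirrors those steps on a tiny hypothetical data set; Pig would compile each step into parallel jobs over HDFS data.

```python
from itertools import groupby

# Records a Pig script might LOAD: (department, salary) tuples.
salaries = [("eng", 100), ("eng", 120), ("sales", 80)]

# GROUP salaries BY department (groupby requires sorted input).
salaries.sort(key=lambda record: record[0])

# FOREACH group, GENERATE the average salary.
averages = {}
for dept, rows in groupby(salaries, key=lambda record: record[0]):
    values = [salary for _, salary in rows]
    averages[dept] = sum(values) / len(values)
# averages == {"eng": 110.0, "sales": 80.0}
```

Each step here corresponds to a line of Pig Latin, which is the sense in which Pig programs are "amenable to substantial parallelization": every step is a bulk operation over the whole data set.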
d. SpagoBI- Open source Business Intelligence suite. It offers structured reporting, data mining, ad-hoc analysis, ETL integration and data visualization capabilities.
e. KNIME- Open source product to discover and mine useful insights from your data. KNIME offers broad functionality thanks to its many open source and community extensions.