Apache Hadoop is an open-source platform, which can be used to store and distribute huge data sets on clusters of computers. The clusters are built from commodity hardware, and Hadoop services offer data storage, data access, data processing, security, data governance, and business operations.
History of Hadoop
The ancestry of Hadoop is in Google File System, which was first published back in October of 2003. This had further spawned another release from Google, which is MapReduce. This was a much-simplified approach to data processing in bigger clusters. The development was first started in the project of Apache Nutch but was then shifted to a new Hadoop subproject back in January of 2006.
The primary version Hadoop 0.1.0 came out in April 2006, which continued to evolve over time with many contributors making it richer and richer. The name ‘Hadoop’ came from the name of the toy elephant of one of the founders.
In 2011, the Hadoop team key player Rob Bearden joined hands with Yahoo! And established Hortonworks which also included other members of the Hadoop team as Alan Gates, Devaraj Das, Arun Murthy, Mahadev Konar, Sanjay Radia, Owen O’Malley, and Suresh Srinivas, etc. This paved the way to the current phase of Hadoop’s journey.
Some of the major reasons why organizations use Hadoop are its ability to store and manage a vast amount of data in both structured and unstructured format and also its analytical features. Hadoop can not only handle piles of data reliably, flexible, and quickly, but can do the same at a lower cost also. Some other major advantages of Hadoop are as follows:
- Performance and scalability – The distributed data management process of spreading tasks across the node in the cluster helps a Hadoop system to store and handle data in huge amount, i.e., in petabytes.
- Reliability – It is true that computing clusters could be more prone to face the failure of individual nodes. However, Hadoop is a resilient application as when any individual cluster fails; data gets replicated and automatically re-directed to other remaining nodes in the cluster.
- Flexibility – Compared to the relational database models, there is no need to create any structured schema to store data. It is possible to store data of any format including structured, semi-structured, or unstructured format and then parse the schema to the data while reading it.
- Cost Effective – Compared to the costly proprietary software, Hadoop is open-source and runs on low-cost commodity, which reduced the overall cost of operation.
The Hadoop ecosystem
As we have known the history and major advantages of Hadoop as above, next, we may try to understand more about the Hadoop Ecosystem. This also is an essential aspect to know before working with Hadoop.
As Remote DBA.com points out, the Hadoop ecosystem is not a programming language or a service, but it’s a common platform of the framework, which can take up big data challenges. You may consider Hadoop as a suite that encompasses a wide range of services including sourcing, storing, maintaining, and analyzing data sets. Let’s now discuss the nature of services in Hadoop. For this, let’s first have a look at some of the major Hadoop components, which when combined with the Hadoop ecosystem, will attain unmatched functionalities.
- HDFS – Hadoop Distributed File System
It is the core component and backbone of the Hadoop ecosystem. HDFS enables storing of various types of data sets. HDFS help create a great degree of abstraction over the available resources with which we can see the entire HDFS as a single unit.
HDFS stores the incoming data across the nodes in the cluster and maintains proper log files about the data. There are two core components for HDFS as:
- NameNode, which is the major node which doesn’t store actual data. It does contain metadata like log files or simply the table of content. So, this requires only less storage, but higher computational resources.
- DataNode, which is the actual node where data is stored and hence need higher storage resources. DataNodes are actually the commodity hardware as individual desktops or laptops as a part of the distributed environment. This model makes Hadoop so cost-effective.
While writing data, you actually communicate with NameNode. After then, it sends requests internally to the client and store as well as replicate data to various DataNodes.
You can find YARN as Hadoop Ecosystem’s brain. It is responsible for processing all your activities by allocating resources and scheduling the tasks. Yarn also has two components as:
- ResourceManager as the primary node for processing, which received the requests and then passes it to the concerned NodeManagers.
- NodeManager, which are installed on each DataNode, is the processing centers where the requests are executed.
MapReduce is actually the software framework that helps in writing the applications which process the large data sets with the help of parallel and distributed algorithms in a Hadoop ecosystem. The two functions of MapReduce are:
- Map() function, which can perform actions as filtering, sorting, and grouping, etc.
- Reduce() function which can aggregate and summarize the results of map function.
The result of a Map function is actually a key-value pair that functions as the input for the Reduce function.
- APACHE HIVE
Facebook actually created HIVE for those who are well versed with SQL. So, using HIVE in the Hadoop environment will make them feel comfortable. IT is basically a component for data warehousing that can perform actions like reading, writing, and handling a large set of data in a distributed environment with the help of an SQL-like interface.
- The HIVE query language is known as HQL (HIVE + SQL), which is almost the same as SQL.
- HIVE has two basic components as Hive Command Line which is used to execute the commands and the JDBC (Java Database Connectivity) driver which is to establish connections from the data storage.
The major advantage of HIVE is that it’s largely scalable. It can ideally serve the purposes of processing large data sets and also processing data in real time. HQL supports all the primitive SQL data types. You can also use the predefined functions or can make custom tailored functions to meet specific needs.
Some of the other major components which desire a mention in the Hadoop ecosystem are Apache Mahout, Apache Spark framework, Apache HBase, Apache Drill, Apache Zookeeper, Apache Oozie, Apache Flume, Apache Sqoop, Apache Ambari, etc.