What is Hadoop Hive?
Apache Hive is a tool used in big data analytics. It runs on top of Hadoop and is an open source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hive queries are written in a language that is very similar to SQL. Hive was initially developed by Facebook; the Apache Software Foundation later took it up and developed it further as open source under the name Apache Hive. It is used by many companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Apache Hive Features:
- It is designed for OnLine Analytical Processing (OLAP), not OnLine Transaction Processing (OLTP)
- Hive stores the schema in a database and the processed data in the Hadoop Distributed File System (HDFS)
- It provides an SQL-like query language called HiveQL or HQL
- It is fast, extensible, familiar and scalable.
What does Apache Hive provide you?
It provides you with an interface through which you can query very large datasets. Based on your queries, you can perform whatever analytics operations you need.
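To give a feel for what such a query looks like, here is a minimal sketch. It is not Hive itself: it uses Python's built-in sqlite3 module as a local stand-in so the example runs without a cluster, and the `page_views` table and its rows are made up for illustration. The SELECT statement uses only basic SQL, which should look much the same when written in Hive's query language.

```python
import sqlite3

# Hypothetical table name and columns, for illustration only.
# The statement below is plain SQL; a Hive query would look very similar.
QUERY = """
SELECT country, COUNT(*) AS views
FROM page_views
GROUP BY country
ORDER BY views DESC
"""


def run_demo():
    """Run the query against an in-memory SQLite database (a stand-in, not Hive)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_views (user_id INTEGER, country TEXT)")
    conn.executemany(
        "INSERT INTO page_views VALUES (?, ?)",
        [(1, "US"), (2, "IN"), (3, "US"), (4, "US"), (5, "IN"), (6, "DE")],
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for country, views in run_demo():
        print(country, views)
```

In a real deployment the same kind of statement would be sent to Hive, which would run it over files in HDFS instead of a local database.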
What is Hadoop Hive Architecture?
It makes your job easy because it is very similar to SQL. You can analyze huge datasets and run many kinds of queries using Hive. The Hive architecture basically comprises Hive clients, Hive services, and the layers where processing and resource management take place. A Hive client may be a Thrift application, a JDBC application, or an ODBC application. JDBC and ODBC applications go through the Hive JDBC and ODBC drivers. These drivers connect to the Hive services to perform any kind of operation, using the metastore provided by the Hive architecture. Processing and resource management are handled by MapReduce, YARN, and related components.
First, the data we have is stored in HDFS. Since the data is stored there, there will naturally be some resources holding this dataset. Built-in MapReduce programs handle the processing, and YARN performs cluster management and the other activities relating to resource management. All these details of distributed storage, processing, and resource management are hidden from the Hive client or Hive user. On top of these two layers, the distributed storage system and the processing and resource management system, run the Hive services, which comprise the Hive server, the Hive driver, and so on.
These are the services that run on top of these two layers. They handle service requests from the Hive clients and return the results to them. A client can be any JDBC or ODBC application, or a Hive Thrift client.
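The layering described above can be summarized in a small sketch. The list simply restates the layers from this section, top to bottom. The helper builds a JDBC connection string of the form commonly used for HiveServer2 (`jdbc:hive2://host:port/database`, with 10000 as the usual default port). The host name `hive.example.com` is a placeholder, and the function name is made up for this example.

```python
# Layers of the Hive architecture as described in this section, top to bottom.
LAYERS = [
    ("Hive clients", ["Thrift application", "JDBC application", "ODBC application"]),
    ("Hive services", ["Hive server", "Hive driver", "metastore"]),
    ("Processing and resource management", ["MapReduce", "YARN"]),
    ("Distributed storage", ["HDFS"]),
]


def hive_jdbc_url(host, port=10000, database="default"):
    """Build a HiveServer2-style JDBC URL (hypothetical helper for illustration)."""
    return f"jdbc:hive2://{host}:{port}/{database}"


if __name__ == "__main__":
    for name, parts in LAYERS:
        print(f"{name}: {', '.join(parts)}")
    # hive.example.com is a placeholder host, not a real server.
    print(hive_jdbc_url("hive.example.com"))
```

A JDBC client only needs a URL like this one; everything below the Hive services layer stays hidden from it.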