I have always wondered how someone with SQL knowledge can work in a Big Data environment, or to be more precise, in an HDFS environment. Can we use our SQL skills to analyze the data? Do we have to learn Java or MapReduce to do the analysis? After a few years of asking these questions, I decided to give it a try. So if you have similar questions, read on and watch the video to find out more. There is good news for SQL programmers.
Apache Hive
Let's first figure out what Apache Hive is. The following description is taken from the official Apache Hive website.
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
There are two key points here:
- Data resides in distributed storage
- Data is manipulated using SQL
As the name implies, in HDFS (Hadoop Distributed File System) the data is stored in a distributed manner. Hive can be used to read, write, and manage large datasets residing in that distributed storage. And how do we do it? Hive uses a query language called HQL (Hive Query Language). The beauty of HQL is that it closely resembles SQL. I know you are excited to learn more. Let's get into it.
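To see just how familiar HQL feels, here is a minimal sketch. The `employees` table and its columns are hypothetical examples for illustration, not data from the video:

```sql
-- Create a Hive table over comma-separated data in HDFS.
-- Table name and columns are hypothetical placeholders.
CREATE TABLE employees (
  id INT,
  name STRING,
  department STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Plain, familiar SQL: average salary per department.
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
```

If you can read that SELECT without blinking, you already have most of what you need.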
What Software Do I Need?
To get a taste of Apache Hive, we need two pieces of software. Both are free to download and use.
- Oracle Virtualbox
- Hortonworks Sandbox (HDP 2.5)
Once both are installed, you can follow along with the video to write HQL and get a taste of Apache Hive. If you are a SQL programmer, I am pretty sure you will feel right at home after watching it. Yes, your SQL skills can be used in a Big Data / HDFS environment for analytics. You can download the data file and code used in the video from the GitHub link.
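If you want to point Hive at your own copy of the data, you can load the downloaded file into the table. A minimal sketch, assuming the file has already been uploaded to HDFS; the path below is a hypothetical placeholder, not the actual path from the GitHub repo:

```sql
-- Hypothetical HDFS path; replace it with wherever you uploaded the file.
-- LOAD DATA INPATH moves the file from its HDFS location into the table's storage.
LOAD DATA INPATH '/user/maria_dev/employees.csv' INTO TABLE employees;

-- Quick sanity check that the rows loaded.
SELECT * FROM employees LIMIT 10;
```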
I hope you found the video helpful. Let me know your thoughts!!
Questions? Comments? Suggestions? Let us know!! Like / Subscribe / Follow for more updates.