Since Hadoop came out, there have been various commercial and/or open-source attempts to expose some compatibility with SQL. Obviously by posting here I am not expecting an unbiased answer.
I am seeking an SQL-on-Hadoop offering which provides low-latency querying and supports the most common CRUD operations, including the basics along these lines:
SELECT * FROM,
UPDATE Table SET C1=2 WHERE,
DELETE FROM, and
DROP TABLE.
Transactional support would also be nice, but is not a must-have.
I want a full replacement for the more traditional RDBMS, one which can
scale from 1 node to a serious Hadoop cluster.
Python is my language of choice for interfacing, though there does seem to be a Python JDBC wrapper.
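To make the requirement concrete, here is a minimal sketch of the kind of round trip I would want to run from Python. It uses the standard-library `sqlite3` module purely as a stand-in, since it follows the Python DB-API 2.0; the assumption is that a driver for any of the engines below would expose a similar `connect`/`cursor`/`execute` interface, though the parameter placeholder style (`?` here) differs between drivers. The table name `t`, its columns, and the helper `run_crud_smoke_test` are hypothetical.

```python
import sqlite3


def run_crud_smoke_test(conn):
    """Run the statement types listed above through a DB-API 2.0 connection.

    `conn` is assumed to be any DB-API connection whose driver accepts
    qmark-style ("?") parameters; sqlite3 does, other drivers may not.
    """
    cur = conn.cursor()
    # Hypothetical table used only for this illustration.
    cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, c1 INTEGER)")
    cur.executemany(
        "INSERT INTO t (id, c1) VALUES (?, ?)",
        [(1, 0), (2, 0), (3, 0)],
    )
    # The statements the question cares about:
    cur.execute("UPDATE t SET c1 = 2 WHERE id = ?", (1,))
    cur.execute("DELETE FROM t WHERE id = ?", (3,))
    cur.execute("SELECT id, c1 FROM t ORDER BY id")
    rows = cur.fetchall()
    cur.execute("DROP TABLE t")
    conn.commit()
    return rows


# Stand-in connection: an in-memory SQLite database, not a Hadoop engine.
conn = sqlite3.connect(":memory:")
rows = run_crud_smoke_test(conn)
print(rows)
conn.close()
```

An engine that lacks UPDATE or DELETE would presumably fail at the corresponding `execute` call, which is the behaviour I am trying to avoid.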
Here is what I've found thus far:
- Apache Hive (SQL-like, with interactive SQL thanks to the Stinger initiative)
- Apache Drill (ANSI SQL support)
- Apache Spark (Spark SQL, queries only, add data via Hive, RDD or Parquet)
- Apache Phoenix (built atop Apache HBase, lacks full transaction support, relational operators and some built-in functions)
- Cloudera Impala (significant HiveQL support,
some SQL language support, no support for indexes on its tables,
importantly missing DELETE, UPDATE and INTERSECT; amongst others)
- Presto from Facebook (can query Hive, Cassandra, relational DBs, etc. It doesn't seem to be designed for low-latency responses across small clusters, or to support
UPDATE operations. It is optimized for data warehousing or analytics¹)
- SQL-Hadoop via MapR community edition (seems to be a packaging of Hive, HP Vertica, SparkSQL, Drill and a native ODBC wrapper)
- Apache Kylin from eBay (provides an SQL interface and multi-dimensional analysis [OLAP],
"… offers ANSI SQL on Hadoop and supports most ANSI SQL query
functions". It depends on HDFS, MapReduce, Hive and HBase, and seems
targeted at very large data-sets, though it maintains low query latency)
- Apache Tajo (ANSI/ISO SQL standard compliance with JDBC driver support [benchmarks against Hive and Impala])
- Cascading's Lingual²
("Lingual provides JDBC Drivers, a SQL command shell, and a catalog
manager for publishing files [or any resource] as schemas and tables.")
Which—from this list or elsewhere—would you recommend, and why?
Thanks for all suggestions.