phoenix-user mailing list archives

From Samuel Marks <samuelma...@gmail.com>
Subject Re: Which [open-source] SQL engine atop Hadoop?
Date Sat, 31 Jan 2015 01:36:29 GMT
Thanks. Doesn't Drill have SQL:2003 support?

Phoenix does seem good; my main reason for not jumping to it immediately is
its additional degree of indirection (HBase, which, IIRC, Splice also has).

Although most of these are analytical databases, that doesn't necessitate
high latency. Low latency may require a decent cluster, though, which is
definitely worth considering (I need to scale from 1 node to many).
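
As a rough illustration of the CRUD baseline in the original message
quoted below, here is a minimal Python sketch that probes an engine through
a DB-API 2.0 connection. Assumptions: sqlite3 is only a stand-in so the
sketch runs anywhere (it is not an SQL-on-Hadoop engine), and the table
`t`, its columns, and the `probe` helper are hypothetical names.

```python
import sqlite3

# Hypothetical table/column names; one statement per baseline operation.
CRUD_BASELINE = [
    ("CREATE TABLE", "CREATE TABLE t (id INTEGER PRIMARY KEY, c1 INTEGER)"),
    ("INSERT INTO", "INSERT INTO t (id, c1) VALUES (1, 1)"),
    ("SELECT * FROM", "SELECT * FROM t"),
    ("UPDATE ... SET ... WHERE", "UPDATE t SET c1 = 2 WHERE id = 1"),
    ("DELETE FROM", "DELETE FROM t WHERE id = 1"),
    ("DROP TABLE", "DROP TABLE t"),
]


def probe(conn):
    """Run each baseline statement; record whether the engine accepted it."""
    results = {}
    cur = conn.cursor()
    for name, sql in CRUD_BASELINE:
        try:
            cur.execute(sql)
            # Drain result sets so a later statement is not blocked by them.
            if cur.description is not None:
                cur.fetchall()
            results[name] = True
        except Exception:
            # DB-API error classes are driver-specific, so catch broadly here.
            results[name] = False
    return results


if __name__ == "__main__":
    # Assumption: an in-memory SQLite database stands in for the real engine.
    for name, ok in probe(sqlite3.connect(":memory:")).items():
        print(f"{name}: {'supported' if ok else 'rejected'}")
```

Pointing `probe` at another engine would only mean passing a different
DB-API connection. As far as I know, Phoenix expresses both inserts and
updates as UPSERT INTO ... VALUES, so those two statements would need
rewording there.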

Cheers,

Samuel Marks
http://linkedin.com/in/samuelmarks
On 31 Jan 2015 03:57, "Vladimir Rodionov" <vladrodionov@gmail.com> wrote:

> Or SpliceDB (not open-source, though); it provides full TX and ANSI
> SQL-99 support, and can run TPC-C/TPC-H in full.
>
> Disclaimer: I work for Splice Machine.
>
> -Vlad
>
> On Fri, Jan 30, 2015 at 8:25 AM, Vladimir Rodionov <vladrodionov@gmail.com>
> wrote:
>
>> I think Phoenix is the only option you have. All other products (projects)
>> are analytical databases (or OLAP). If you need record-level operation
>> support and indexes: Phoenix.
>>
>> -Vlad
>>
>> On Fri, Jan 30, 2015 at 3:26 AM, Samuel Marks <samuelmarks@gmail.com>
>> wrote:
>>
>>> Since Hadoop came out, there have been various commercial and/or
>>> open-source attempts to expose some compatibility with SQL. Obviously,
>>> by posting here I am not expecting an unbiased answer.
>>>
>>> I am seeking an SQL-on-Hadoop offering which provides low-latency
>>> querying and supports the most common CRUD operations, including [the
>>> basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT * FROM,
>>> UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE. Transactional
>>> support would also be nice, but is not a must-have.
>>>
>>> Essentially I want a full replacement for the more traditional RDBMS,
>>> one which can scale from 1 node to a serious Hadoop cluster.
>>>
>>> Python is my language of choice for interfacing; however, there does
>>> seem to be a Python JDBC wrapper.
>>>
>>> Here is what I've found thus far:
>>>
>>>    - Apache Hive <https://hive.apache.org> (SQL-like, with interactive
>>>    SQL thanks to the Stinger initiative)
>>>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>    - Apache Spark <https://spark.apache.org> (Spark SQL
>>>    <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
>>>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>    or Parquet <http://parquet.io/>)
>>>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
>>>    <http://hbase.apache.org>, lacks full transaction
>>>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>>>    operators <http://en.wikipedia.org/wiki/Relational_operators> and
>>>    some built-in functions)
>>>    - Cloudera Impala
>>>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>    (significant HiveQL support, some SQL language support, no support for
>>>    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
>>>    amongst others)
>>>    - Presto <https://github.com/facebook/presto> from Facebook (can
>>>    query Hive, Cassandra <http://cassandra.apache.org>, relational DBs,
>>>    etc. Doesn't seem to be designed for low-latency responses across small
>>>    clusters, or support UPDATE operations. It is optimized for data
>>>    warehousing or analytics¹
>>>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>>>    community edition <https://www.mapr.com/products/hadoop-download>
>>>    (seems to be a packaging of Hive, HP Vertica
>>>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
>>>    Drill and a native ODBC wrapper
>>>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>    - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
>>>    interface and multi-dimensional analysis [OLAP
>>>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop
>>>    and supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
>>>    Hive and HBase; and seems targeted at very large data-sets though maintains
>>>    low query latency)
>>>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>>>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
>>>    support [benchmarks against Hive and Impala
>>>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
>>>    ])
>>>    - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>>>    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
>>>    provides JDBC Drivers, a SQL command shell, and a catalog manager for
>>>    publishing files [or any resource] as schemas and tables.")
>>>
>>> Which—from this list or elsewhere—would you recommend, and why?
>>> Thanks for all suggestions,
>>>
>>> Samuel Marks
>>> http://linkedin.com/in/samuelmarks
>>>
>>
>>
>
