phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Amit Mudgal <amud...@splicemachine.com>
Subject Re: Issues while Running Apache Phoenix against TPC-H data
Date Tue, 16 Aug 2016 10:59:17 GMT
Hi Teams,

Apologies for the late reply but i was trying to upload the data in LINEITEM table and my
experience was not very good with the older version of phoenix 4.7 but we did have a beefy
cluster as pointed by my colleague earlier. 

After the jobs got completed i have seen some erratic behavior. If we drop the then we can
see partial data in the LINEITEM table below. Also it took me 7 hours to load the LINEITEM
table file, I wanted to check if thats the time it took for you as well ?

0: jdbc:phoenix:stl-colo-srv073.> select count(*) from TPCH.LINEITEM;
+------------------------------------------+
|                 COUNT(1)                 |
+------------------------------------------+
| 0                                        |
+------------------------------------------+
1 row selected (1.648 seconds)
0: jdbc:phoenix:stl-colo-srv073.> drop index L_SHIPDATE_IDX on TPCH.LINEITEM;
No rows affected (6.769 seconds)
0: jdbc:phoenix:stl-colo-srv073.> select count(*) from TPCH.LINEITEM;
+------------------------------------------+
|                 COUNT(1)                 |
+------------------------------------------+
| 5038442                                  |
+------------------------------------------+
1 row selected (122.127 seconds)

Could you please let us know if there are any TPCH benchmarks you have stored with system
architecture details as well ?

To answer some questions posed :

1. We are loading the data via HDFS directly using the phoenix-client’s CsvBulk loader tool.
We monitor the jobs later to see its completion. Do you recommend direct csv upload compared
to hdfs upload , is there a preference for one against the other ?
2. There were no deleted rows from HBASE.
3. With the new cluster, as per your suggestion i did run the update stats and for the LINEITEM
table but did not get the desired result.

0: jdbc:phoenix:stl-colo-srv073> update statistics TPCH.LINEITEM  SET
. . . . . . . . . . . . . . . . . . . . . . .> "phoenix.stats.guidepost.width"=100000000;
No rows affected (122.254 seconds)
0: jdbc:phoenix:stl-colo-srv073> select count(*) from TPCH.LINEITEM;
+------------------------------------------+
|                 COUNT(1)                 |
+------------------------------------------+
| 5038442                                  |
+------------------------------------------+
1 row selected (122.137 seconds)
0: jdbc:phoenix:stl-colo-srv073> 

Did you also face issues while loading the csv files ?
4. We have automatic compression enabled and also did manual compression on the individual
tables from base shell.
5. What all queries worked for you on the TPCH data and which ones didn’t based on the comment
that most of the TPCH queries work on phoenix.

I would greatly appreciate any input on this. 

Many thanks 

Amit




> On Aug 15, 2016, at 9:58 AM, James Taylor <jamestaylor@apache.org> wrote:
> 
> Hi Amit,
> Couple more performance tips on top of what Mujtaba already mentioned:
> - Use the latest Phoenix (4.8.0). There are some great performance
> enhancements in here, especially around usage of DISTINCT. We've also got
> some new encoding schemes to reduce table sizes in our encodecolumns branch
> which will make it into master in a few weeks.
> - Specify the UPDATE_CACHE_FREQUENCY property in your CREATE TABLE and
> CREATE INDEX calls. This will greatly reduce the amount of RPC traffic. See
> here for more info[1].
> - Make sure to create secondary indexes to prevent full table scans
> - Declare tables as IMMUTABLE_ROWS=true if they're write-once/append-only.
> - Depending on the use case/query patterns and the schema involved, you may
> want to use multiple column families. This prevents having to scan data
> that you don't need. More info on how to declare column families can be
> found here[2].
> - Consider using the SMALL, SERIAL, NO_CACHE, and SKIP_SCAN hints as
> necessary. We haven't yet enhanced our optimizer to do this automatically,
> so some experience in understanding what HBase is doing under the covers
> will help. For example, for point lookups, use the SMALL and SERIAL hint.
> For large scans, use the NO_CACHE hint. For low cardinality columns in the
> PK, try the SKIP_SCAN hint. For more on these hints, see here[3].
> 
> Thanks,
> James
> 
> [1] https://phoenix.apache.org/#Altering
> [2] https://phoenix.apache.org/language/index.html#create_table
> [3] https://phoenix.apache.org/language/index.html#select_statement
> 
> On Fri, Aug 12, 2016 at 2:57 PM, Mujtaba Chohan <mujtaba@apache.org> wrote:
> 
>> Hi Amit,
>> 
>> * What's the heap size of each of your region servers?
>> * Do you see huge amount of disk reads when you do a select count(*) from
>> tpch.lineitem? If yes then try setting snappy compression on your table
>> followed by major compaction
>> * Were there any deleted rows in this table? What's the row count via HBase
>> shell?
>> * What's the schema of your table? How did you load your data?
>> * Update statistics with 100M guidepost width and check explain plan after
>> this async task finishes to see if this shows approximately correct row
>> count. update statistics TPCH.LINEITEM  SET
>> "phoenix.stats.guidepost.width"=100000000;
>> 
>> I get the following number with 600M rows (uncompressed - default phoenix
>> fast_diff encoded) in TPCH.LINEITEM on a 8 node HBase 0.98.20 cluster with
>> 12G heap/12+12 core (virtual+physical). Data is fetched from OS page cache.
>> 
>> select count(*) from lineitem;
>> +------------+
>> |  COUNT(1)  |
>> +------------+
>> | 600037902  |
>> +------------+
>> 1 row selected (*57.012 seconds*)
>> 
>> select l_returnflag, l_linestatus,sum(l_quantity) as
>> sum_qty,sum(l_extendedprice) as sum_base_price,sum(l_extendedprice * (1 -
>> l_discount)) as sum_disc_price,sum(l_extendedprice * (1 - l_discount) * (1
>> + l_tax)) as sum_charge,avg(l_quantity) as avg_qty,avg(l_extendedprice) as
>> avg_price,avg(l_discount) as avg_disc,count(*) as count_order from lineitem
>> where l_shipdate <= current_date()- 90 group by l_returnflag,l_linestatus
>> order by l_returnflag,l_linestatus;
>> +---------------+---------------+----------------+----------
>> ----------+----------------------+-----------------------+--
>> --------+-------------+-----------+--------------+
>> | L_RETURNFLAG  | L_LINESTATUS  |    SUM_QTY     |   SUM_BASE_PRICE   |
>> SUM_DISC_PRICE    |      SUM_CHARGE       | AVG_QTY  |  AVG_PRICE  |
>> AVG_DISC  | COUNT_ORDER  |
>> +---------------+---------------+----------------+----------
>> ----------+----------------------+-----------------------+--
>> --------+-------------+-----------+--------------+
>> | A             | F             | 3775127758     | 5660776097194.45   |
>> 5377736398183.9374   | 5592847429515.927026  | 25.4993  | 38236.1169  |
>> 0.05      | 148047881    |
>> | N             | F             | 98553062       | 147771098385.98    |
>> 140384965965.0348    | 145999793032.775829   | 25.5015  | 38237.1993  |
>> 0.0499    | 3864590      |
>> | N             | O             | 7651423419     | 11473321691083.27  |
>> 10899667121317.2466  | 11335664103186.27932  | 25.4998  | 38236.9914  |
>> 0.0499    | 300058170    |
>> | R             | F             | 3.77572497E+9  | 5661603032745.34   |
>> 5378513563915.4097   | 5593662252666.916161  | 25.5     | 38236.6972  |
>> 0.05      | 148067261    |
>> +---------------+---------------+----------------+----------
>> ----------+----------------------+-----------------------+--
>> --------+-------------+-----------+--------------+
>> 4 rows selected (*146.677 seconds*)
>> 
>> explain select count(*) from lineitem ;
>> +-----------------------------------------------------------
>> ----------------------------------+
>> |
>> PLAN                                             |
>> +-----------------------------------------------------------
>> ----------------------------------+
>> | CLIENT 4204-CHUNK 589522747 ROWS 409200095701 BYTES PARALLEL 1-WAY FULL
>> SCAN OVER LINEITEM  |
>> |     SERVER FILTER BY FIRST KEY
>> ONLY                                                         |
>> |     SERVER AGGREGATE INTO SINGLE
>> ROW                                                        |
>> +-----------------------------------------------------------
>> ----------------------------------+
>> 
>> DDL: CREATE TABLE LINEITEM(L_ORDERKEY INTEGER not null, L_PARTKEY  INTEGER,
>> L_SUPPKEY  INTEGER , L_LINENUMBER  INTEGER not null, L_QUANTITY
>> DECIMAL(15,2), L_EXTENDEDPRICE DECIMAL(15,2), L_DISCOUNT DECIMAL(15,2),
>> L_TAX DECIMAL(15,2), L_RETURNFLAG  CHAR(1), L_LINESTATUS  CHAR(1),
>> L_SHIPDATE DATE, L_COMMITDATE DATE, L_RECEIPTDATE DATE, L_SHIPINSTRUCT
>> CHAR(25), L_SHIPMODE CHAR(10), L_COMMENT VARCHAR(44) constraint pk primary
>> key(l_orderkey, l_linenumber));
>> 
>> Raw data: do curl -O http://static.druid.io/data/benchmarks/tpch
>> /100/lineitem.tbl.$i.gz ; done
>> 
>> //mujtaba
>> 
>> 
>> 
>> 
>> 
>> 
>> On Thu, Aug 11, 2016 at 1:33 PM, Amit Mudgal <amudgal@splicemachine.com>
>> wrote:
>> 
>>> 
>>> Hi team,
>>> 
>>> I was evaluating Apache Phoenix against the TPC-H data based on the
>>> presentation given at Hadoop summit in june stating that most TPC-H
>> queries
>>> should run.
>>> Here is the setup details i have in my local environment :
>>> 
>>> 1. One master node and 3 region servers with 3.6 TB Disks space, 62.9 GB
>>> memory with 24 CPU cores (OS: centos-release-6-8.el6.centos.12.3.x86_64
>> )
>>> 2. I am running the phoenix parcel (4.7.0) on Cloudera
>> 5.7.2-1.cdh5.7.2.p0.
>>> 18.
>>> 
>>> The data got uploaded and a compaction was manually triggered on hbase.
>>> There were 2 problems we were trying to find the answer to :
>>> 
>>> 1. While doing explain plan on standard TPCH data on LINEITEM table
>>> provided it shows 8,649,179,394 rows but there are only 600,000,000
>> records
>>> uploaded.
>>> 
>>> 0: jdbc:phoenix> explain select * from TPCH.LINEITEM where L_SUPPKEY =
>>> 768951;
>>> +-----------------------------------------------------------
>>> ----------------------------------------------------+
>>> |                                                     PLAN
>>>                                      |
>>> +-----------------------------------------------------------
>>> ----------------------------------------------------+
>>> | CLIENT 1458-CHUNK 8649179394 ROWS 424044167376 BYTES PARALLEL 1-WAY
>>> ROUND ROBIN FULL SCAN OVER TPCH.LINEITEM  |
>>> |     SERVER FILTER BY L_SUPPKEY = 768951
>>>                                      |
>>> +-----------------------------------------------------------
>>> ----------------------------------------------------+
>>> 2 rows selected (3.036 seconds)
>>> 
>>> I could not do a count(*) on the table due to the fact that it always
>>> failed for me with the error code Error: Operation timed out.
>>> (state=TIM01,code=6000)
>>> 
>>> 2. Secondly, I was not able to also run a simple query01 published by
>> TPCH
>>> as it times out regularly:
>>> 
>>> 
>>> 0: jdbc:phoenix:stl-colo-srv050> select l_returnflag,
>>> l_linestatus,sum(l_quantity) as sum_qty,sum(l_extendedprice) as
>>> sum_base_price,sum(l_extendedprice * (1 - l_discount)) as
>>> sum_disc_price,sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as
>>> sum_charge,avg(l_quantity) as avg_qty,avg(l_extendedprice) as
>>> avg_price,avg(l_discount) as avg_disc,count(*) as count_order from
>>> TPCH.lineitem where l_shipdate <= current_date()- 90 group by
>>> l_returnflag,l_linestatus order by l_returnflag,l_linestatus
>>> . . . . . . . . . . . . . . . . . . . . . . .> ;
>>> Error: Operation timed out. (state=TIM01,code=6000)
>>> java.sql.SQLTimeoutException: Operation timed out.
>>> at org.apache.phoenix.exception.SQLExceptionCode$14.
>>> newException(SQLExceptionCode.java:359)
>>> at org.apache.phoenix.exception.SQLExceptionInfo.buildException(
>>> SQLExceptionInfo.java:145)
>>> at org.apache.phoenix.iterate.BaseResultIterators.getIterators(
>>> BaseResultIterators.java:728)
>>> at org.apache.phoenix.iterate.BaseResultIterators.getIterators(
>>> BaseResultIterators.java:638)
>>> at org.apache.phoenix.iterate.MergeSortResultIterator.getMinHeap(
>>> MergeSortResultIterator.java:72)
>>> at org.apache.phoenix.iterate.MergeSortResultIterator.minIterator(
>>> MergeSortResultIterator.java:93)
>>> at org.apache.phoenix.iterate.MergeSortResultIterator.next(
>>> MergeSortResultIterator.java:58)
>>> at org.apache.phoenix.iterate.BaseGroupedAggregatingResultIterator.next(
>>> BaseGroupedAggregatingResultIterator.java:64)
>>> at org.apache.phoenix.jdbc.PhoenixResultSet.next(
>>> PhoenixResultSet.java:778)
>>> at sqlline.BufferedRows.<init>(BufferedRows.java:37)
>>> at sqlline.SqlLine.print(SqlLine.java:1650)
>>> at sqlline.Commands.execute(Commands.java:833)
>>> at sqlline.Commands.sql(Commands.java:732)
>>> at sqlline.SqlLine.dispatch(SqlLine.java:808)
>>> at sqlline.SqlLine.begin(SqlLine.java:681)
>>> at sqlline.SqlLine.start(SqlLine.java:398)
>>> at sqlline.SqlLine.main(SqlLine.java:292)
>>> 0: jdbc:phoenix:stl-colo-srv050>
>>> 
>>> On firing smaller queries like attaching a limit the data comes in fine :
>>> 
>>> 
>>> 0: jdbc:phoenix:stl-colo-srv052> select * from TPCH.LINEITEM limit 10;
>>> +-------------+------------+------------+---------------+---
>>> ----------+------------------+-------------+--------+-------
>>> --------+---------------+--------------------------+--------
>>> ------------------+---------+
>>> | L_ORDERKEY  | L_PARTKEY  | L_SUPPKEY  | L_LINENUMBER  | L_QUANTITY  |
>>> L_EXTENDEDPRICE  | L_DISCOUNT  | L_TAX  | L_RETURNFLAG  | L_LINESTATUS  |
>>>    L_SHIPDATE        |       L_COMMITDATE       |      L_ |
>>> +-------------+------------+------------+---------------+---
>>> ----------+------------------+-------------+--------+-------
>>> --------+---------------+--------------------------+--------
>>> ------------------+---------+
>>> | 1           | 15518935   | 768951     | 1             | 17          |
>> 33203.72
>>>        | 0.04        | 0.02   | N             | O             |
>> 1996-03-13
>>> 00:00:00.000  | 1996-02-12 00:00:00.000  | 1996-03 |
>>> | 1           | 6730908    | 730909     | 2             | 36          |
>> 69788.52
>>>        | 0.09        | 0.06   | N             | O             |
>> 1996-04-12
>>> 00:00:00.000  | 1996-02-28 00:00:00.000  | 1996-04 |
>>> | 1           | 6369978    | 369979     | 3             | 8           |
>> 16381.28
>>>        | 0.1         | 0.02   | N             | O             |
>> 1996-01-29
>>> 00:00:00.000  | 1996-03-05 00:00:00.000  | 1996-01 |
>>> | 1           | 213150     | 463151     | 4             | 28          |
>> 29767.92
>>>        | 0.09        | 0.06   | N             | O             |
>> 1996-04-21
>>> 00:00:00.000  | 1996-03-30 00:00:00.000  | 1996-05 |
>>> | 1           | 2402664    | 152671     | 5             | 24          |
>> 37596.96
>>>        | 0.1         | 0.04   | N             | O             |
>> 1996-03-30
>>> 00:00:00.000  | 1996-03-14 00:00:00.000  | 1996-04 |
>>> | 1           | 1563445    | 63448      | 6             | 32          |
>> 48267.84
>>>        | 0.07        | 0.02   | N             | O             |
>> 1996-01-30
>>> 00:00:00.000  | 1996-02-07 00:00:00.000  | 1996-02 |
>>> | 2           | 10616973   | 116994     | 1             | 38          |
>> 71798.72
>>>        | 0           | 0.05   | N             | O             |
>> 1997-01-28
>>> 00:00:00.000  | 1997-01-14 00:00:00.000  | 1997-02 |
>>> | 3           | 429697     | 179698     | 1             | 45          |
>> 73200.15
>>>        | 0.06        | 0      | R             | F             |
>> 1994-02-02
>>> 00:00:00.000  | 1994-01-04 00:00:00.000  | 1994-02 |
>>> | 3           | 1903543    | 653547     | 2             | 49          |
>> 75776.05
>>>        | 0.1         | 0      | R             | F             |
>> 1993-11-09
>>> 00:00:00.000  | 1993-12-20 00:00:00.000  | 1993-11 |
>>> | 3           | 12844823   | 344848     | 3             | 27          |
>> 47713.86
>>>        | 0.06        | 0.07   | A             | F             |
>> 1994-01-16
>>> 00:00:00.000  | 1993-11-22 00:00:00.000  | 1994-01 |
>>> +-------------+------------+------------+---------------+---
>>> ----------+------------------+-------------+--------+-------
>>> --------+---------------+--------------------------+--------
>>> ------------------+---------+
>>> 10 rows selected (0.603 seconds)
>>> 0: jdbc:phoenix:stl-colo-srv052>
>>> 
>>> 
>>> I am sure i am doing something wrong here and would greatly appreciate if
>>> you could please point me to the same.
>>> 
>>> Thanks Again
>>> 
>>> Amit
>>> 
>>> 
>>> 
>> 


Mime
View raw message