phoenix-user mailing list archives

From James Taylor <jamestay...@apache.org>
Subject Re: group by problem
Date Wed, 08 Apr 2015 21:08:18 GMT
Any JIRA yet, Marek? It'd be good to get to the bottom of this. Your
workaround is not going to perform nearly as well as using TRUNC on
the date.
Thanks,
James
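
For context, a minimal sketch of the two forms being compared, based on the
queries quoted below and using Marek's table and column names:

  -- groups directly on the truncated date value
  select trunc(hu_ts,'DAY'), hu_ho_id, hu_stream_id, sum(hu_usage)
  from se_dwh.homes_usage_hour
  group by trunc(hu_ts,'DAY'), hu_ho_id, hu_stream_id;

  -- workaround: groups on a string built by formatting every row
  select to_char(hu_ts,'yyyy-MM-dd'), hu_ho_id, hu_stream_id, sum(hu_usage)
  from se_dwh.homes_usage_hour
  group by to_char(hu_ts,'yyyy-MM-dd'), hu_ho_id, hu_stream_id;

The TRUNC form keeps the group-by key as a date, while the TO_CHAR form has
to format each row into a string first, which is presumably why it is
expected to be slower.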

On Tue, Apr 7, 2015 at 8:53 AM, James Taylor <jamestaylor@apache.org> wrote:
> Yes, please open a JIRA and attach that CSV (or ideally the smallest subset
> that exhibits the problem).
> Thanks,
> James
>
>
> On Tuesday, April 7, 2015, Marek Wiewiorka <marek.wiewiorka@gmail.com>
> wrote:
>>
>> Hi James - shall I still open a JIRA for that?
>> Thanks!
>> Marek
>>
>> 2015-04-06 22:48 GMT+02:00 Marek Wiewiorka <marek.wiewiorka@gmail.com>:
>>>
>>> psql from a csv file:
>>> ./psql.py dwh:2181:/hbase-unsecure -t SE_DWH.HOMES_USAGE_HOUR
>>> /mnt/spark/export/usage_convert.txt/usage_merged.csv
>>>
>>> Here is a sample:
>>> 2015-03-19 23:59:59,6854,0,2.167
>>> 2015-03-19 22:59:59,6854,0,2.421
>>> 2015-03-19 21:59:59,6854,0,2.738
>>> 2015-03-19 20:59:59,6854,0,0.9490000000000001
>>> 2015-03-19 19:59:59,6854,0,0.748
>>> 2015-03-19 18:59:59,6854,0,2.76
>>> 2015-03-19 17:59:59,6854,0,1.801
>>> 2015-03-19 16:59:59,6854,0,0.661
>>> 2015-03-19 15:59:59,6854,0,1.082
>>> 2015-03-19 14:59:59,6854,0,1.303
>>>
>>>
>>> M.
>>>
>>> 2015-04-06 22:38 GMT+02:00 James Taylor <jamestaylor@apache.org>:
>>>>
>>>> How did you input the data?
>>>>
>>>> On Mon, Apr 6, 2015 at 1:27 PM, Marek Wiewiorka
>>>> <marek.wiewiorka@gmail.com> wrote:
>>>> > Oh, I'm sorry, I forgot to attach them:
>>>> > DDL of my table:
>>>> > create table se_dwh.homes_usage_hour (
>>>> >   hu_ts time not null,
>>>> >   hu_ho_id integer not null,
>>>> >   hu_stream_id integer not null,
>>>> >   hu_usage double
>>>> >   constraint pk PRIMARY KEY (hu_ts, hu_ho_id, hu_stream_id)
>>>> > );
>>>> >
>>>> > Phoenix: 4.3.0
>>>> >
>>>> > Thanks,
>>>> > Marek
>>>> >
>>>> >
>>>> > 2015-04-06 22:25 GMT+02:00 James Taylor <jamestaylor@apache.org>:
>>>> >>
>>>> >> Hi Marek,
>>>> >> How did you input the data and what does your CREATE TABLE/VIEW
>>>> >> statement look like? What version of Phoenix and HBase are you using?
>>>> >> Thanks,
>>>> >> James
>>>> >>
>>>> >> On Monday, April 6, 2015, Marek Wiewiorka <marek.wiewiorka@gmail.com>
>>>> >> wrote:
>>>> >>>
>>>> >>> Hi James - sure, here is the result of your query (limited to 5 rows):
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> +-------------------------+----------+--------------+---------------------+
>>>> >>> |     TO_CHAR(HU_TS)      | HU_HO_ID | HU_STREAM_ID |      HU_USAGE       |
>>>> >>> +-------------------------+----------+--------------+---------------------+
>>>> >>> | 2015-03-19 00:59:59.000 | 4720     | 0            | 0.287               |
>>>> >>> | 2015-03-19 00:59:59.000 | 6854     | 0            | 3.6189999999999998  |
>>>> >>> | 2015-03-19 01:59:59.000 | 4720     | 0            | 0.323               |
>>>> >>> | 2015-03-19 01:59:59.000 | 6854     | 0            | 2.556               |
>>>> >>> | 2015-03-19 02:59:59.000 | 4720     | 0            | 0.37                |
>>>> >>> +-------------------------+----------+--------------+---------------------+
>>>> >>>
>>>> >>> Ok - I will try to prepare a sample csv with DDL/DML.
>>>> >>>
>>>> >>> Thanks again,
>>>> >>> Marek
>>>> >>>
>>>> >>>
>>>> >>> 2015-04-06 22:06 GMT+02:00 James Taylor <jamestaylor@apache.org>:
>>>> >>>>
>>>> >>>> Hi Marek,
>>>> >>>> Thanks for the additional information. If you could answer my
>>>> >>>> earlier questions, that would be helpful.
>>>> >>>>
>>>> >>>> If you can't repro with a simple test case, then how about
>>>> >>>> attaching a csv dump of some of your data (the smallest amount
>>>> >>>> of data that repros the issue) to a JIRA along with the CREATE
>>>> >>>> TABLE statement and the query?
>>>> >>>>
>>>> >>>> Thanks,
>>>> >>>> James
>>>> >>>>
>>>> >>>> On Mon, Apr 6, 2015 at 12:49 PM, Marek Wiewiorka
>>>> >>>> <marek.wiewiorka@gmail.com> wrote:
>>>> >>>> > Hi Guys - thanks for your messages.
>>>> >>>> >
>>>> >>>> > I did another round of testing and found that if I use to_char
>>>> >>>> > instead of the trunc function, I'm getting the expected result:
>>>> >>>> >
>>>> >>>> > select to_char(dt,'yyyy-MM-dd'),id,stream,sum(usage) from usage
>>>> >>>> > where id in (4720,6854)
>>>> >>>> > group by to_char(dt,'yyyy-MM-dd'),stream,id;
>>>> >>>> >
>>>> >>>> >
>>>> >>>> > +-------------+------+--------+---------------------+
>>>> >>>> > | TO_CHAR(DT) |  ID  | STREAM |     SUM(USAGE)      |
>>>> >>>> > +-------------+------+--------+---------------------+
>>>> >>>> > | 2015-03-19  | 4720 | 0      | 8.405999999999999   |
>>>> >>>> > | 2015-03-19  | 6854 | 0      | 28.339000000000006  |
>>>> >>>> > +-------------+------+--------+---------------------+
>>>> >>>> >
>>>> >>>> > The same query with trunc returns 48 rows (24 rows per id):
>>>> >>>> >
>>>> >>>> > select trunc(dt,'DAY') as day,id,stream,sum(usage) from usage
>>>> >>>> > where id in (4720,6854)
>>>> >>>> > group by trunc(dt,'DAY'),stream,id;
>>>> >>>> > e.g.:
>>>> >>>> > ....
>>>> >>>> > | 2015-03-19 00:00:00.000 | 4720 | 0 | 0.406               |
>>>> >>>> > | 2015-03-19 00:00:00.000 | 6854 | 0 | 0.9490000000000001  |
>>>> >>>> > | 2015-03-19 00:00:00.000 | 4720 | 0 | 0.332               |
>>>> >>>> > | 2015-03-19 00:00:00.000 | 6854 | 0 | 2.738               |
>>>> >>>> > | 2015-03-19 00:00:00.000 | 4720 | 0 | 0.33499999999999996 |
>>>> >>>> > | 2015-03-19 00:00:00.000 | 6854 | 0 | 2.421               |
>>>> >>>> > | 2015-03-19 00:00:00.000 | 4720 | 0 | 0.45599999999999996 |
>>>> >>>> > | 2015-03-19 00:00:00.000 | 6854 | 0 | 2.167               |
>>>> >>>> > +-------------------------+------+---+---------------------+
>>>> >>>> >
>>>> >>>> > So for some reason grouping by trunc is not working... but using
>>>> >>>> > to_char is.
>>>> >>>> > This applies to the table loaded with psql from the CSV file.
>>>> >>>> >
>>>> >>>> > When I tried to create a sample table and populate it with
>>>> >>>> > upserts, everything worked as expected in both cases:
>>>> >>>> >
>>>> >>>> > create table usage (
>>>> >>>> >   dt time not null,
>>>> >>>> >   id integer not null,
>>>> >>>> >   stream integer not null,
>>>> >>>> >   usage double
>>>> >>>> >   constraint pk PRIMARY KEY (dt, id, stream)
>>>> >>>> > );
>>>> >>>> > UPSERT INTO usage VALUES('2015-04-01 10:00:00',100,0,2.0);
>>>> >>>> > UPSERT INTO usage VALUES('2015-04-01 11:00:00',100,0,3.0);
>>>> >>>> >
>>>> >>>> > UPSERT INTO usage VALUES('2015-04-01 12:00:00',200,0,4.0);
>>>> >>>> > UPSERT INTO usage VALUES('2015-04-01 13:00:00',200,0,6.0);
>>>> >>>> >
>>>> >>>> > 0: jdbc:phoenix:dwh.sayene.com:2181:/hbase-un> select
>>>> >>>> >   trunc(dt,'DAY') as day,id,stream,sum(usage) from usage
>>>> >>>> >   where id in (100,200)
>>>> >>>> >   group by trunc(dt,'DAY'),stream,id;
>>>> >>>> >
>>>> >>>> >
>>>> >>>> > +-------------------------+------+--------+------------+
>>>> >>>> > |           DAY           |  ID  | STREAM | SUM(USAGE) |
>>>> >>>> > +-------------------------+------+--------+------------+
>>>> >>>> > | 2015-04-01 00:00:00.000 | 100  | 0      | 5.0        |
>>>> >>>> > | 2015-04-01 00:00:00.000 | 200  | 0      | 10.0       |
>>>> >>>> > +-------------------------+------+--------+------------+
>>>> >>>> >
>>>> >>>> > 0: jdbc:phoenix:dwh.sayene.com:2181:/hbase-un> select
>>>> >>>> >   to_char(dt,'yyyy-MM-dd'),id,stream,sum(usage) from usage
>>>> >>>> >   where id in (100,200)
>>>> >>>> >   group by to_char(dt,'yyyy-MM-dd'),stream,id;
>>>> >>>> >
>>>> >>>> >
>>>> >>>> > +-------------+------+--------+------------+
>>>> >>>> > | TO_CHAR(DT) |  ID  | STREAM | SUM(USAGE) |
>>>> >>>> > +-------------+------+--------+------------+
>>>> >>>> > | 2015-04-01  | 100  | 0      | 5.0        |
>>>> >>>> > | 2015-04-01  | 200  | 0      | 10.0       |
>>>> >>>> > +-------------+------+--------+------------+
>>>> >>>> > 2 rows selected (1.49 seconds)
>>>> >>>> >
>>>> >>>> >
>>>> >>>> > Shall I open a jira for that?
>>>> >>>> >
>>>> >>>> > Regards,
>>>> >>>> > Marek
>>>> >>>> >
>>>> >>>> > 2015-04-06 20:16 GMT+02:00 James Taylor <jamestaylor@apache.org>:
>>>> >>>> >>
>>>> >>>> >> Hi Marek,
>>>> >>>> >> How did you input the data and what does your CREATE TABLE/VIEW
>>>> >>>> >> statement look like? What version of Phoenix and HBase are you
>>>> >>>> >> using?
>>>> >>>> >>
>>>> >>>> >> Also, would you mind running the following query and letting us
>>>> >>>> >> know the output?
>>>> >>>> >>
>>>> >>>> >> select to_char(hu_ts,'yyyy-MM-dd HH:mm:ss.SSS'),hu_ho_id,hu_stream_id,hu_usage
>>>> >>>> >> from se_dwh.homes_usage_hour
>>>> >>>> >> where hu_ho_id in (4720,6854);
>>>> >>>> >>
>>>> >>>> >> Thanks,
>>>> >>>> >> James
>>>> >>>> >>
>>>> >>>> >> On Mon, Apr 6, 2015 at 10:34 AM, Gabriel Reid
>>>> >>>> >> <gabriel.reid@gmail.com>
>>>> >>>> >> wrote:
>>>> >>>> >> > That certainly looks like a bug. Would it be possible to make
>>>> >>>> >> > a small reproducible test case and, if possible, log this in
>>>> >>>> >> > the Phoenix JIRA (https://issues.apache.org/jira/browse/PHOENIX)?
>>>> >>>> >> >
>>>> >>>> >> > - Gabriel
>>>> >>>> >> >
>>>> >>>> >> > On Mon, Apr 6, 2015 at 6:10 PM Marek Wiewiorka
>>>> >>>> >> > <marek.wiewiorka@gmail.com>
>>>> >>>> >> > wrote:
>>>> >>>> >> >>
>>>> >>>> >> >> Hi All,
>>>> >>>> >> >> I came across a weird situation while running a query with
>>>> >>>> >> >> group by.
>>>> >>>> >> >> I executed 2 queries:
>>>> >>>> >> >>
>>>> >>>> >> >> 1)
>>>> >>>> >> >> select trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id,sum(hu_usage)
>>>> >>>> >> >> from se_dwh.homes_usage_hour
>>>> >>>> >> >> where hu_ho_id in (4720,6854)
>>>> >>>> >> >> group by trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id;
>>>> >>>> >> >>
>>>> >>>> >> >> | 2015-03-19 00:00:00.000 | 4720 | 0 | 0.45599999999999996 |
>>>> >>>> >> >> | 2015-03-19 00:00:00.000 | 6854 | 0 | 2.167               |
>>>> >>>> >> >>
>>>> >>>> >> >>
>>>> >>>> >> >> 2)
>>>> >>>> >> >>
>>>> >>>> >> >> select trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id,sum(hu_usage)
>>>> >>>> >> >> from se_dwh.homes_usage_hour
>>>> >>>> >> >> where hu_ho_id in (4720)
>>>> >>>> >> >> group by trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id;
>>>> >>>> >> >>
>>>> >>>> >> >> | 2015-03-19 00:00:00.000 | 4720 | 0 | 8.405999999999999 |
>>>> >>>> >> >>
>>>> >>>> >> >>
>>>> >>>> >> >> The only difference is that in the first case I included 2
>>>> >>>> >> >> ids (4720, 6854) and in the other one only 1 (4720).
>>>> >>>> >> >> The result for hu_ho_id=4720 should be the same in both cases,
>>>> >>>> >> >> but it isn't (0.45 vs 8.4).
>>>> >>>> >> >> The second result (8.4) is correct.
>>>> >>>> >> >>
>>>> >>>> >> >> Am I doing something wrong?
>>>> >>>> >> >>
>>>> >>>> >> >> Regards,
>>>> >>>> >> >> Marek
>>>> >>>> >> >>
>>>> >>>> >> >
>>>> >>>> >
>>>> >>>> >
>>>> >>>
>>>> >>>
>>>> >
>>>
>>>
>>
>
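
The minimal repro requested for the JIRA would presumably combine the pieces
already posted in this thread: the DDL, a trimmed copy of the CSV loaded with
psql.py, and the failing TRUNC query. A sketch, assuming the same table,
ZooKeeper quorum, and file name Marek used:

  create table se_dwh.homes_usage_hour (
    hu_ts time not null,
    hu_ho_id integer not null,
    hu_stream_id integer not null,
    hu_usage double
    constraint pk PRIMARY KEY (hu_ts, hu_ho_id, hu_stream_id)
  );

  -- load a small subset of usage_merged.csv (a few rows per hu_ho_id):
  -- ./psql.py dwh:2181:/hbase-unsecure -t SE_DWH.HOMES_USAGE_HOUR usage_merged.csv

  select trunc(hu_ts,'DAY',1), hu_ho_id, hu_stream_id, sum(hu_usage)
  from se_dwh.homes_usage_hour
  where hu_ho_id in (4720,6854)
  group by trunc(hu_ts,'DAY',1), hu_ho_id, hu_stream_id;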
