phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marek Wiewiorka <marek.wiewio...@gmail.com>
Subject Re: group by problem
Date Tue, 07 Apr 2015 07:23:13 GMT
Hi James - shall I still open a JIRA for that?
Thanks!
Marek

2015-04-06 22:48 GMT+02:00 Marek Wiewiorka <marek.wiewiorka@gmail.com>:

> psql from a csv file:
> ./psql.py dwh:2181:/hbase-unsecure -t SE_DWH.HOMES_USAGE_HOUR
> /mnt/spark/export/usage_convert.txt/usage_merged.csv
>
> Here is a sample:
> 2015-03-19 23:59:59,6854,0,2.167
> 2015-03-19 22:59:59,6854,0,2.421
> 2015-03-19 21:59:59,6854,0,2.738
> 2015-03-19 20:59:59,6854,0,0.9490000000000001
> 2015-03-19 19:59:59,6854,0,0.748
> 2015-03-19 18:59:59,6854,0,2.76
> 2015-03-19 17:59:59,6854,0,1.801
> 2015-03-19 16:59:59,6854,0,0.661
> 2015-03-19 15:59:59,6854,0,1.082
> 2015-03-19 14:59:59,6854,0,1.303
>
>
> M.
>
> 2015-04-06 22:38 GMT+02:00 James Taylor <jamestaylor@apache.org>:
>
>> How did you input the data?
>>
>> On Mon, Apr 6, 2015 at 1:27 PM, Marek Wiewiorka
>> <marek.wiewiorka@gmail.com> wrote:
>> > Oh I'm sorry I forgot to attach them:
>> > DDL of my table:
>> > create table se_dwh.homes_usage_hour (hu_ts time not null ,hu_ho_id
>> integer
>> > not null ,hu_stream_id integer not null, hu_usage double constraint pk
>> > PRIMARY KEY(hu_ts,hu_ho_id,hu_stream_id) );
>> >
>> > Phoenix: 4.3.0
>> >
>> > Thanks,
>> > Marek
>> >
>> >
>> > 2015-04-06 22:25 GMT+02:00 James Taylor <jamestaylor@apache.org>:
>> >>
>> >> Hi Marek,
>> >> How did you input the data and what does your CREATE TABLE/VIEW
>> >> statement look like? What version of Phoenix and HBase are you using?
>> >> Thanks,
>> >> James
>> >>
>> >> On Monday, April 6, 2015, Marek Wiewiorka <marek.wiewiorka@gmail.com>
>> >> wrote:
>> >>>
>> >>> Hi James - sure here is the result of your query (limited to 5 rows):
>> >>>
>> >>>
>> >>>
>> +------------------------------------------+------------------------------------------+------------------------------------------+------------------------------------------+
>> >>> |              TO_CHAR(HU_TS)              |                 HU_HO_ID
>> >>> |               HU_STREAM_ID               |                 HU_USAGE
>> >>> |
>> >>>
>> >>>
>> +------------------------------------------+------------------------------------------+------------------------------------------+------------------------------------------+
>> >>> | 2015-03-19 00:59:59.000                  | 4720
>> >>> | 0                                        | 0.287
>> >>> |
>> >>> | 2015-03-19 00:59:59.000                  | 6854
>> >>> | 0                                        | 3.6189999999999998
>> >>> |
>> >>> | 2015-03-19 01:59:59.000                  | 4720
>> >>> | 0                                        | 0.323
>> >>> |
>> >>> | 2015-03-19 01:59:59.000                  | 6854
>> >>> | 0                                        | 2.556
>> >>> |
>> >>> | 2015-03-19 02:59:59.000                  | 4720
>> >>> | 0                                        | 0.37
>> >>> |
>> >>>
>> >>>
>> +------------------------------------------+------------------------------------------+------------------------------------------+------------------------------------------+
>> >>>
>> >>> Ok - I will try to prepare a sample csv with DDL/DML.
>> >>>
>> >>> Thanks again,
>> >>> Marek
>> >>>
>> >>>
>> >>> 2015-04-06 22:06 GMT+02:00 James Taylor <jamestaylor@apache.org>:
>> >>>>
>> >>>> Hi Marek,
>> >>>> Thanks for the additional information. If you could answer my earlier
>> >>>> questions, that would be helpful.
>> >>>>
>> >>>> If you can't repro with a simple test case, then how about attaching
>> a
>> >>>> csv dump of some of your data (the smallest amount of data that
>> repros
>> >>>> the issue) to a JIRA along with the CREATE TABLE statement and the
>> >>>> query?
>> >>>>
>> >>>> Thanks,
>> >>>> James
>> >>>>
>> >>>> On Mon, Apr 6, 2015 at 12:49 PM, Marek Wiewiorka
>> >>>> <marek.wiewiorka@gmail.com> wrote:
>> >>>> > Hi Guys - thanks for your messages.
>> >>>> >
>> >>>> > I did another round of testing I found that if I use to_char
>> instead
>> >>>> > of
>> >>>> > trunc function I'm getting the expected result:
>> >>>> >
>> >>>> > select to_char(dt,'yyyy-MM-dd'),id,stream,sum(usage) from usage
>> where
>> >>>> > id in
>> >>>> > (4720,6854) group by to_char(dt,'yyyy-MM-dd'),stream,id;
>> >>>> >
>> >>>> >
>> +------------------------------------------+------------------------------------------+------------------------------------------+------------------------------------------+
>> >>>> > |               TO_CHAR(DT)                |              
     ID
>> >>>> > |                  STREAM                  |
>> SUM(USAGE)
>> >>>> > |
>> >>>> >
>> >>>> >
>> +------------------------------------------+------------------------------------------+------------------------------------------+------------------------------------------+
>> >>>> > | 2015-03-19                               | 4720
>> >>>> > | 0                                        | 8.405999999999999
>> >>>> > |
>> >>>> > | 2015-03-19                               | 6854
>> >>>> > | 0                                        | 28.339000000000006
>> >>>> > |
>> >>>> >
>> >>>> >
>> +------------------------------------------+------------------------------------------+------------------------------------------+------------------------------------------+
>> >>>> >
>> >>>> > the same query with trunc returns 48 rows (24 rows per id)
>> >>>> > select trunc(dt,'DAY') as day,id,stream,sum(usage) from usage
>> where id
>> >>>> > in
>> >>>> > (4720,6854) group by trunc(dt,'DAY'),stream,id;
>> >>>> > e.g.:
>> >>>> > ....
>> >>>> > | 2015-03-19 00:00:00.000 | 4720
>>    |
>> >>>> > 0
>> >>>> > | 0.406                                    |
>> >>>> > | 2015-03-19 00:00:00.000 | 6854
>>    |
>> >>>> > 0
>> >>>> > | 0.9490000000000001                       |
>> >>>> > | 2015-03-19 00:00:00.000 | 4720
>>    |
>> >>>> > 0
>> >>>> > | 0.332                                    |
>> >>>> > | 2015-03-19 00:00:00.000 | 6854
>>    |
>> >>>> > 0
>> >>>> > | 2.738                                    |
>> >>>> > | 2015-03-19 00:00:00.000 | 4720
>>    |
>> >>>> > 0
>> >>>> > | 0.33499999999999996                      |
>> >>>> > | 2015-03-19 00:00:00.000 | 6854
>>    |
>> >>>> > 0
>> >>>> > | 2.421                                    |
>> >>>> > | 2015-03-19 00:00:00.000 | 4720
>>    |
>> >>>> > 0
>> >>>> > | 0.45599999999999996                      |
>> >>>> > | 2015-03-19 00:00:00.000 | 6854
>>    |
>> >>>> > 0
>> >>>> > | 2.167                                    |
>> >>>> >
>> >>>> >
>> +-------------------------+------------------------------------------+------------------------------------------+------------------------------------------+
>> >>>> >
>> >>>> > so for some reason grouping by trunc is not working...but using
>> >>>> > to_char is.
>> >>>> > This applies for table loaded using psql from csv file.
>> >>>> >
>> >>>> > When tried to create a sample table and populate it with upserts
>> >>>> > everything
>> >>>> > worked as expected in both cases:
>> >>>> >
>> >>>> > create table usage (dt time not null ,id integer not null,stream
>> >>>> > integer not
>> >>>> > null, usage double constraint pk PRIMARY KEY(dt,id,stream ));
>> >>>> > UPSERT INTO usage VALUES('2015-04-01 10:00:00',100,0,2.0);
>> >>>> > UPSERT INTO usage VALUES('2015-04-01 11:00:00',100,0,3.0);
>> >>>> >
>> >>>> > UPSERT INTO usage VALUES('2015-04-01 12:00:00',200,0,4.0);
>> >>>> > UPSERT INTO usage VALUES('2015-04-01 13:00:00',200,0,6.0);
>> >>>> >
>> >>>> > 0: jdbc:phoenix:dwh.sayene.com:2181:/hbase-un> select
>> trunc(dt,'DAY')
>> >>>> > as
>> >>>> > day,id,stream,sum(usage) from usage where id in (100,200) group
by
>> >>>> > trunc(dt,'DAY'),stream,id;
>> >>>> >
>> >>>> >
>> +-------------------------+------------------------------------------+------------------------------------------+------------------------------------------+
>> >>>> > |           DAY           |                    ID
>>   |
>> >>>> > STREAM                  |                SUM(USAGE)
>> |
>> >>>> >
>> >>>> >
>> +-------------------------+------------------------------------------+------------------------------------------+------------------------------------------+
>> >>>> > | 2015-04-01 00:00:00.000 | 100
>>   |
>> >>>> > 0
>> >>>> > | 5.0                                      |
>> >>>> > | 2015-04-01 00:00:00.000 | 200
>>   |
>> >>>> > 0
>> >>>> > | 10.0                                     |
>> >>>> >
>> >>>> >
>> +-------------------------+------------------------------------------+------------------------------------------+------------------------------------------+
>> >>>> >
>> >>>> > 0: jdbc:phoenix:dwh.sayene.com:2181:/hbase-un> select
>> >>>> > to_char(dt,'yyyy-MM-dd'),id,stream,sum(usage) from usage where
id
>> in
>> >>>> > (100,200) group by to_char(dt,'yyyy-MM-dd'),stream,id;
>> >>>> >
>> >>>> >
>> +------------------------------------------+------------------------------------------+------------------------------------------+------------------------------------------+
>> >>>> > |               TO_CHAR(DT)                |              
     ID
>> >>>> > |                  STREAM                  |
>> SUM(USAGE)
>> >>>> > |
>> >>>> >
>> >>>> >
>> +------------------------------------------+------------------------------------------+------------------------------------------+------------------------------------------+
>> >>>> > | 2015-04-01                               | 100
>> >>>> > | 0                                        | 5.0
>> >>>> > |
>> >>>> > | 2015-04-01                               | 200
>> >>>> > | 0                                        | 10.0
>> >>>> > |
>> >>>> >
>> >>>> >
>> +------------------------------------------+------------------------------------------+------------------------------------------+------------------------------------------+
>> >>>> > 2 rows selected (1.49 seconds)
>> >>>> >
>> >>>> >
>> >>>> > Shall I open a jira for that?
>> >>>> >
>> >>>> > Regards,
>> >>>> > Marek
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > 2015-04-06 20:16 GMT+02:00 James Taylor <jamestaylor@apache.org>:
>> >>>> >>
>> >>>> >> Hi Marek,
>> >>>> >> How did you input the data and what does your CREATE TABLE/VIEW
>> >>>> >> statement look like? What version of Phoenix and HBase
are you
>> using?
>> >>>> >>
>> >>>> >> Also, would you mind running the following query and letting
us
>> know
>> >>>> >> the
>> >>>> >> output?
>> >>>> >>
>> >>>> >> select to_char(hu_ts,'yyyy-MM-dd
>> >>>> >> HH:mm:ss.SSS'),hu_ho_id,hu_stream_id,hu_usage
>> >>>> >> from se_dwh.homes_usage_hour
>> >>>> >> where hu_ho_id in (4720,6854);
>> >>>> >>
>> >>>> >> Thanks,
>> >>>> >> James
>> >>>> >>
>> >>>> >> On Mon, Apr 6, 2015 at 10:34 AM, Gabriel Reid
>> >>>> >> <gabriel.reid@gmail.com>
>> >>>> >> wrote:
>> >>>> >> > That certainly looks like a bug. Would it be possible
to make a
>> >>>> >> > small
>> >>>> >> > reproducible test case and if possible, log this in
the Phoenix
>> >>>> >> > JIRA
>> >>>> >> > (https://issues.apache.org/jira/browse/PHOENIX) ?
>> >>>> >> >
>> >>>> >> > - Gabriel
>> >>>> >> >
>> >>>> >> > On Mon, Apr 6, 2015 at 6:10 PM Marek Wiewiorka
>> >>>> >> > <marek.wiewiorka@gmail.com>
>> >>>> >> > wrote:
>> >>>> >> >>
>> >>>> >> >> Hi All,
>> >>>> >> >> I came across a weird situation while running
a query with
>> group
>> >>>> >> >> by.
>> >>>> >> >> I executed 2 queries:
>> >>>> >> >>
>> >>>> >> >> 1)
>> >>>> >> >> select trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id,sum(hu_usage)
>> >>>> >> >> from
>> >>>> >> >> se_dwh.homes_usage_hour where hu_ho_id in (4720,6854)
group by
>> >>>> >> >> trunc(hu_ts,'DAY',1),hu_ho_id ,hu_stream_id;
>> >>>> >> >>
>> >>>> >> >> | 2015-03-19 00:00:00.000 | 4720
>> >>>> >> >> |
>> >>>> >> >> 0
>> >>>> >> >> | 0.45599999999999996                      |
>> >>>> >> >> | 2015-03-19 00:00:00.000 | 6854
>> >>>> >> >> |
>> >>>> >> >> 0
>> >>>> >> >> | 2.167                                    |
>> >>>> >> >>
>> >>>> >> >>
>> >>>> >> >> 2)
>> >>>> >> >>
>> >>>> >> >> select trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id,sum(hu_usage)
>> >>>> >> >> from
>> >>>> >> >> se_dwh.homes_usage_hour where hu_ho_id in (4720)
group by
>> >>>> >> >> trunc(hu_ts,'DAY',1),hu_ho_id ,hu_stream_id;
>> >>>> >> >>
>> >>>> >> >> | 2015-03-19 00:00:00.000 | 4720
>> >>>> >> >> |
>> >>>> >> >> 0
>> >>>> >> >> | 8.405999999999999                        |
>> >>>> >> >>
>> >>>> >> >>
>> >>>> >> >> The only difference is that in the first case
I included 2
>> >>>> >> >> ids(4720,6854)
>> >>>> >> >> and in the other one only 1 (4720).
>> >>>> >> >> The result for hu_ho_id=4720 should be the same
in both case
>> but
>> >>>> >> >> it
>> >>>> >> >> isn't
>> >>>> >> >> (0.45 vs 8.4).
>> >>>> >> >> The second result(8.4) is correct.
>> >>>> >> >>
>> >>>> >> >> Am I doing something wrong?
>> >>>> >> >>
>> >>>> >> >> Regards,
>> >>>> >> >> Marek
>> >>>> >> >>
>> >>>> >> >
>> >>>> >
>> >>>> >
>> >>>
>> >>>
>> >
>>
>
>

Mime
View raw message