phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From James Taylor <jamestay...@apache.org>
Subject Re: Problem in finding the largest value of an indexed column
Date Fri, 10 Jul 2015 06:36:08 GMT
Sounds like something else is going wrong. Can you adapt your test by
setting the MAX_FILESIZE very low for your table (so that it splits after 4
or 5 rows are added) and package it up as a unit test?

On Thu, Jul 9, 2015 at 1:44 PM, Yufan Liu <yliu33@kent.edu> wrote:

> Just got a chance to revisit this issue: I have rebuilt the index and it
> still returns the unexpected result. By using the test case, I tried to
> insert enough rows to make it auto-split and it reproduces the problem too.
> It seems it still has trouble returning last row sorted by first component
> of primary key on split tables. Maybe there is another issue than
> PHOENIX-2096? The phoenix I am using is pulled from latest 4.x-HBase-0.98
> branch which includes the patch of PHOENIX-2096.
>
> 2015-07-02 19:55 GMT-07:00 James Taylor <jamestaylor@apache.org>:
>
>> On further investigation, I believe it should have been working before. I
>> did a bit of cleanup and attached a new patch to PHOENIX-2096, but this
>> would only prevent a merge sort when one is not required (basically
>> improving performance).
>>
>> Maybe your index is invalid? You can try rebuilding with this command:
>> https://phoenix.apache.org/language/index.html#alter_index
>>
>> On Thu, Jul 2, 2015 at 5:26 PM, Yufan Liu <yliu33@kent.edu> wrote:
>>
>>> The query on test dataset is returning the expected result with the
>>> patch. But on the original dataset (10million rows, 6 regions), it still
>>> return the same unexpected result, I will dig more into this. Thank you,
>>> James!
>>>
>>> 2015-07-02 9:58 GMT-07:00 Yufan Liu <yliu33@kent.edu>:
>>>
>>>> Sure, let me have a try
>>>>
>>>> 2015-07-02 9:46 GMT-07:00 James Taylor <jamestaylor@apache.org>:
>>>>
>>>>> Thanks, Yufan. I found an issue and filed PHOENIX-2096 with a patch.
>>>>> Would you mind confirming that this fixes the issue you're seeing?
>>>>>
>>>>>     James
>>>>>
>>>>> On Thu, Jul 2, 2015 at 9:45 AM, Yufan Liu <yliu33@kent.edu> wrote:
>>>>>
>>>>>> I'm using 4.4.0-HBase-0.98
>>>>>>
>>>>>> 2015-07-01 22:31 GMT-07:00 James Taylor <jamestaylor@apache.org>:
>>>>>>
>>>>>>> Yufan,
>>>>>>> What version of Phoenix are you using?
>>>>>>> Thanks,
>>>>>>> James
>>>>>>>
>>>>>>> On Wed, Jul 1, 2015 at 2:34 PM, Yufan Liu <yliu33@kent.edu>
wrote:
>>>>>>>
>>>>>>>> When I made more tests, I find that this problem happens
after
>>>>>>>> table got split.
>>>>>>>>
>>>>>>>> Here is the DDL I use to create table and index:
>>>>>>>> CREATE TABLE IF NOT EXISTS t1 (
>>>>>>>> uid BIGINT NOT NULL,
>>>>>>>> timestamp BIGINT NOT NULL,
>>>>>>>> eventName VARCHAR
>>>>>>>> CONSTRAINT my_pk PRIMARY KEY (uid,  timestamp))
>>>>>>>> COMPRESSION='SNAPPY';
>>>>>>>>
>>>>>>>> CREATE INDEX timestamp_index ON t1 (timestamp) INCLUDE (eventName)
>>>>>>>>
>>>>>>>> Attach is the sample data I used for test. It has about 4000
rows,
>>>>>>>> when the timestamp_index table has one region, the query
returns correct
>>>>>>>> result: 1433334048443, but when I manually split it into
4 regions (use
>>>>>>>> hbase tool), it returns 1433333024961.
>>>>>>>>
>>>>>>>> Let know if you find anything. Thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>> 2015-07-01 11:27 GMT-07:00 James Taylor <jamestaylor@apache.org>:
>>>>>>>>
>>>>>>>>> If you could put a complete test (including your DDL
and upsert of
>>>>>>>>> data), that would be much appreciated.
>>>>>>>>> Thanks,
>>>>>>>>> James
>>>>>>>>>
>>>>>>>>> On Wed, Jul 1, 2015 at 11:20 AM, Yufan Liu <yliu33@kent.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I have tried to use query "SELECT timestamp FROM
t1 ORDER BY
>>>>>>>>>> timestamp DESC NULLS LAST LIMIT 1". But it still
returns the same
>>>>>>>>>> unexpected result. There seems to be some internal
problems related.
>>>>>>>>>>
>>>>>>>>>> 2015-06-30 18:03 GMT-07:00 James Taylor <jamestaylor@apache.org>:
>>>>>>>>>>
>>>>>>>>>>> Yes, reverse scan will be leveraged when possible.
Make you use
>>>>>>>>>>> NULLS LAST in your ORDER BY as rows are ordered
with nulls first.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 30, 2015 at 5:25 PM, Yufan Liu <yliu33@kent.edu>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I used the HBase reverse scan to find the
last row on the index
>>>>>>>>>>>> table. It returned the expected result. I
would like to know is Phoenix's
>>>>>>>>>>>> "ORDER BY"
>>>>>>>>>>>> and "DESC" implemented based on HBase reverse
scan?
>>>>>>>>>>>>
>>>>>>>>>>>> 2015-06-26 17:25 GMT-07:00 Yufan Liu <yliu33@kent.edu>:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you anyway, Michael!
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2015-06-26 17:21 GMT-07:00 Michael McAllister
<
>>>>>>>>>>>>> mmcallister@homeaway.com>:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>  OK, I’m a Phoenix newbie, so that
was the extent of the
>>>>>>>>>>>>>> advice I could give you. There are
people here far more experienced than I
>>>>>>>>>>>>>> am who should be able to give you
deeper advice. Have a great weekend!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mike
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *From:* Yufan Liu [mailto:yliu33@kent.edu]
>>>>>>>>>>>>>> *Sent:* Friday, June 26, 2015 7:19
PM
>>>>>>>>>>>>>> *To:* user@phoenix.apache.org
>>>>>>>>>>>>>> *Subject:* Re: Problem in finding
the largest value of an
>>>>>>>>>>>>>> indexed column
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the advice, for the first
one, it's "CLIENT
>>>>>>>>>>>>>> 67-CHUNK PARALLEL 1-WAY FULL SCAN
OVER TIMESTAMP_INDEX; SERVER FILTER BY
>>>>>>>>>>>>>> FIRST KEY ONLY; SERVER AGGREGATE
INTO SINGLE ROW" which is as expected. For
>>>>>>>>>>>>>> the second one, it's "CLIENT 67-CHUNK
SERIAL 1-WAY REVERSE FULL SCAN OVER
>>>>>>>>>>>>>> TIMESTAMP_INDEX; SERVER FILTER BY
FIRST KEY ONLY; SERVER 1 ROW LIMIT" which
>>>>>>>>>>>>>> looks correct, but still returns
the unexpected result.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2015-06-26 16:59 GMT-07:00 Michael
McAllister <
>>>>>>>>>>>>>> mmcallister@homeaway.com>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yufan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Have you tried using the EXPLAIN
command to see what plan is
>>>>>>>>>>>>>> being used to access the data?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Michael McAllister
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Staff Data Warehouse Engineer | Decision
Systems
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> mmcallister@HomeAway.com | C: 512.423.7447
| skype:
>>>>>>>>>>>>>> michael.mcallister.ha <zimmkate@hotmail.com>
| webex:
>>>>>>>>>>>>>> https://h.a/mikewebex
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [image: Description: Description:
cid:3410354473_30269081]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This electronic communication (including
any attachment) is
>>>>>>>>>>>>>> confidential.  If you are not an
intended recipient of this communication,
>>>>>>>>>>>>>> please be advised that any disclosure,
dissemination, distribution, copying
>>>>>>>>>>>>>> or other use of this communication
or any attachment is strictly
>>>>>>>>>>>>>> prohibited.  If you have received
this communication in error, please
>>>>>>>>>>>>>> notify the sender immediately by
reply e-mail and promptly destroy all
>>>>>>>>>>>>>> electronic and printed copies of
this communication and any attachment.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *From:* Yufan Liu [mailto:yliu33@kent.edu]
>>>>>>>>>>>>>> *Sent:* Friday, June 26, 2015 6:31
PM
>>>>>>>>>>>>>> *To:* user@phoenix.apache.org
>>>>>>>>>>>>>> *Subject:* Problem in finding the
largest value of an
>>>>>>>>>>>>>> indexed column
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We have created a table (eg, t1),
and a global index of one
>>>>>>>>>>>>>> numeric column of t1 (eg, timestamp).
Now we want to find the largest value
>>>>>>>>>>>>>> of timestamp, we have tried two approaches:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. select max(timestamp) from t1;
This query takes forever to
>>>>>>>>>>>>>> finish, so I think it maybe doing
a full table scan/comparison .
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. select timestamp from t1 order
by timestamp desc limit 1;
>>>>>>>>>>>>>> This query finished fast, but the
result it returns is far from the largest
>>>>>>>>>>>>>> value. It seems it just return the
largest value for a certain range of
>>>>>>>>>>>>>> data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Did anyone else encounter this issue/have
any suggestion?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yufan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> best,
>>>>>>>>>>>>>> Yufan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> best,
>>>>>>>>>>>>> Yufan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> best,
>>>>>>>>>>>> Yufan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> best,
>>>>>>>>>> Yufan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> best,
>>>>>>>> Yufan
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> best,
>>>>>> Yufan
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> best,
>>>> Yufan
>>>>
>>>>
>>>
>>>
>>> --
>>> best,
>>> Yufan
>>>
>>>
>>
>
>
> --
> best,
> Yufan
>
>

Mime
View raw message