phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maryann Xue <maryann....@gmail.com>
Subject Re: Error when using aggregates with correlated subqueries
Date Sun, 26 Apr 2015 17:33:54 GMT
Thank you in advance! Just let me know.


Thanks,
Maryann

On Sunday, April 26, 2015, khaleel mershad <khellom007@gmail.com> wrote:

>
> Thanks for the explanation. I see what you mean. I will also try testing a
> query that contains nested correlated subqueries with multiple level
> outer reference, and with using IN instead of EXISTS and I will see if it
> produces an error.
>
>
>
> Best,
> Khaleel
>
> On Sun, Apr 26, 2015 at 7:14 PM, Maryann Xue <maryann.xue@gmail.com
> <javascript:_e(%7B%7D,'cvml','maryann.xue@gmail.com');>> wrote:
>
>> Hi Khaleel,
>>
>> Thanks for looking into the problem! But there IS a bug with EXISTS
>> having an aggregate function, say "max(H."timestamp") as used in your very
>> first query that did not work. Otherwise, if subquery is SELECT * or SELECT
>> column1 or SELECT substr(column1), it's fine coz there is no aggregate.
>>
>> And interestingly the query with W."rowId" failed because it referenced
>> to the outmost query that had EXISTS. Otherwise if using H.
>> rowId", it only had to deal with "SELECT * from H" which was a comparison
>> subquery and so it worked fine.
>>
>> Anyway, I will verify if there is any problem with multiple level outer
>> reference in correlated subqueries, as an independent issue without EXISTS.
>>
>> Given that our EXISTS support is currently incomplete, you may want to
>> rewrite your EXISTS subqueries with equivalent IN subqueries, which would
>> most likely just work.
>>
>>
>> Thanks,
>> Maryann
>>
>>
>> On Sun, Apr 26, 2015 at 11:54 AM, khaleel mershad <khellom007@gmail.com
>> <javascript:_e(%7B%7D,'cvml','khellom007@gmail.com');>> wrote:
>>
>>>
>>> I don't think that the bug is with EXISTS. As you can see with the query
>>> from my last email, it contains a nested correlated subquery which contains
>>> an aggregate within *EXISTS* and it worked after replacing the
>>> reference to the outer table (W."rowId") with a reference to the inner
>>> subquery table (H."rowId") within the aggregate subquery. In other words,
>>> the following query generates an error:
>>>
>>> select distinct W."rowId", W."md5_current", W."size_current" from "Wiki"
>>> AS W
>>> where EXISTS (select * from "History" AS H
>>>                         where
>>> (REGEXP_SUBSTR(SUBSTR(H."rowId",19),'[^:]+') = W."rowId")
>>>                         AND H."timestamp" = (select MAX(H2."timestamp")
>>> from "History" AS H2
>>>                                                           where (*REGEXP_SUBSTR(SUBSTR(H2."rowId",19),'[^:]+')
>>> =*
>>> *
>>> W."rowId"*)
>>>                                                           AND
>>> (H2."status" = 'approved')) AND (TO_NUMBER(H."value") > 1000))
>>> AND NOT EXISTS (select * from "History" AS H3
>>>                               where
>>> (REGEXP_SUBSTR(SUBSTR(H3."rowId",19),'[^:]+') = W."rowId")
>>>                                 AND (H3."status" = 'pending') AND
>>> (TO_NUMBER(H3."value") < 1000));
>>>
>>>
>>> But if I replace *W."rowId"* with REGEXP_SUBSTR(SUBSTR(*H."rowId"*,19),'[^:]+'),
>>> the query works fine (which is the query from my last email). So I think
>>> the problem is not with EXISTS, but with the fact that a reference to a
>>> table from the outer query (W in my query) can be done up to a maximum one
>>> nested level, and not more than that. In other words, referring to a table
>>> from the outer query from within a subquery that is in the second or more
>>> nested level will generate an error.
>>>
>>>
>>>
>>> Best,
>>> Khaleel
>>>
>>>
>>>
>>> On Sun, Apr 26, 2015 at 6:33 PM, Maryann Xue <maryann.xue@gmail.com
>>> <javascript:_e(%7B%7D,'cvml','maryann.xue@gmail.com');>> wrote:
>>>
>>>> Glad that your queries worked. Please do let us know if any further
>>>> problems.
>>>>
>>>> The bug is with EXISTS. Right now Phoenix does not handle aggregate
>>>> function calls correctly within an EXISTS subquery or EXISTS nested
>>>> subquery. I have opened a JIRA:
>>>> https://issues.apache.org/jira/browse/PHOENIX-1923.
>>>>
>>>>
>>>> Thanks,
>>>> Maryann
>>>>
>>>>
>>>> On Sun, Apr 26, 2015 at 8:54 AM, khaleel mershad <khellom007@gmail.com
>>>> <javascript:_e(%7B%7D,'cvml','khellom007@gmail.com');>> wrote:
>>>>
>>>>>
>>>>> Dear Maryann,
>>>>>
>>>>> I tried your query and it worked. I also executed a more complex query
>>>>> which I need in my testing, which is:
>>>>>
>>>>> select distinct W."rowId", W."md5_current", W."size_current" from
>>>>> "Wiki" AS W
>>>>> where EXISTS (select * from "History" AS H
>>>>>                         where
>>>>> (REGEXP_SUBSTR(SUBSTR(H."rowId",19),'[^:]+') = W."rowId")
>>>>>                         AND H."timestamp" = (select
>>>>> MAX(H2."timestamp") from "History" AS H2
>>>>>                                                           where
>>>>> (REGEXP_SUBSTR(SUBSTR(H2."rowId",19),'[^:]+') =
>>>>>
>>>>> REGEXP_SUBSTR(SUBSTR(H."rowId",19),'[^:]+'))
>>>>>                                                           AND
>>>>> (H2."status" = 'approved')) AND (TO_NUMBER(H."value") > 1000))
>>>>> AND NOT EXISTS (select * from "History" AS H3
>>>>>                               where
>>>>> (REGEXP_SUBSTR(SUBSTR(H3."rowId",19),'[^:]+') = W."rowId")
>>>>>                                 AND (H3."status" = 'pending') AND
>>>>> (TO_NUMBER(H3."value") < 1000));
>>>>>
>>>>> and it also worked after I understood your fix of the original query.
>>>>>
>>>>> So the trick here is that we can use the reference to the outer query
>>>>> within the next subquery level only, and not up to two levels as I was
>>>>> doing? Maybe this limitation exists because Phoenix joins the tables
from
>>>>> the outer and the inner correlated query, but it can perform this join
up
>>>>> to one level only?
>>>>>
>>>>>
>>>>>
>>>>> Best,
>>>>> Khaleel
>>>>>
>>>>>
>>>>> On Sat, Apr 25, 2015 at 8:11 PM, Maryann Xue <maryann.xue@gmail.com
>>>>> <javascript:_e(%7B%7D,'cvml','maryann.xue@gmail.com');>> wrote:
>>>>>
>>>>>> Hi Khaleel,
>>>>>>
>>>>>> Mind if you try the following query? I think it's the same semantics
>>>>>> as you meant for your query.
>>>>>>
>>>>>> select distinct W."rowId", W."md5_current", W."size_current" from
>>>>>> "Wiki" AS W
>>>>>> where
>>>>>> W."rowId" in (select H"rowId" from "History" AS H
>>>>>>                          where H."timestamp" =
>>>>>>                                 (select MAX(H2."timestamp") from
>>>>>> "History" AS H2 where H2."rowId" = H."rowId")
>>>>>>                          AND
>>>>>>  H."status" = 'approved')
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Maryann
>>>>>>
>>>>>> On Fri, Apr 24, 2015 at 5:25 PM, khaleel mershad <
>>>>>> khellom007@gmail.com
>>>>>> <javascript:_e(%7B%7D,'cvml','khellom007@gmail.com');>>
wrote:
>>>>>>
>>>>>>> Dear Maryann,
>>>>>>>
>>>>>>> Thanks for your question. You are right: the query that I was
>>>>>>> writing wasn't the correct one for my purpose. The query that
will satisfy
>>>>>>> my request would be:
>>>>>>>
>>>>>>> select distinct W."rowId", W."md5_current", W."size_current"
from
>>>>>>> "Wiki" AS W
>>>>>>> where
>>>>>>> EXISTS (select * from "History" AS H where (H."rowId" = W."rowId")
>>>>>>> AND
>>>>>>> H."timestamp" = (select MAX(H2."timestamp") from "History" AS
H2
>>>>>>> where (H2."rowId" = W."rowId")) AND
>>>>>>>  (H."status" = 'approved') )
>>>>>>>
>>>>>>> In this query I specify that I need to select the exact version
>>>>>>> which has the Maximum timestamp among all versions of the same
data item by
>>>>>>> using MAX(H2."timestamp") within the inner subquery within EXISTS.
>>>>>>>
>>>>>>> However I tried such query and it still produces the same error
as
>>>>>>> the old query, which is: (Aggregate may not contain columns not
in
>>>>>>> GROUP BY.)
>>>>>>>
>>>>>>> Thank you for your help.
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>> Khaleel
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 23, 2015 at 5:49 PM, Maryann Xue <maryann.xue@gmail.com
>>>>>>> <javascript:_e(%7B%7D,'cvml','maryann.xue@gmail.com');>>
wrote:
>>>>>>>
>>>>>>>> Hi Khaleel,
>>>>>>>>
>>>>>>>> Thanks for the explanation! But my question was since this
is an
>>>>>>>> EXISTS, I assume if there ever is a H."timestamp" with an
'approved'
>>>>>>>> status, the EXISTS will return true regardless of whether
you are testing
>>>>>>>> H."timestamp" or Max(H."timestamp"). Is that correct? or
have I missed
>>>>>>>> something?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Maryann
>>>>>>>>
>>>>>>>> On Thu, Apr 23, 2015 at 8:49 AM, khaleel mershad <
>>>>>>>> khellom007@gmail.com
>>>>>>>> <javascript:_e(%7B%7D,'cvml','khellom007@gmail.com');>>
wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hello Maryann,
>>>>>>>>>
>>>>>>>>> Thanks very much for your reply. Hopefully this bug gets
fixed in
>>>>>>>>> the next release so that I can continue working with
this part in my
>>>>>>>>> research project. Thanks for keeping me posted.
>>>>>>>>>
>>>>>>>>> With respect to your question, I am using the "History"
table as a
>>>>>>>>> data store of all versions of a certain data item. When
I say Max(
>>>>>>>>> H."timestamp") inside the query, I am selecting the latest
>>>>>>>>> version (most recent) that is "approved" (which is checked
using the
>>>>>>>>> condition H."status" = 'approved')
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Khaleel
>>>>>>>>>
>>>>>>>>> On Thu, Apr 23, 2015 at 6:22 AM, Maryann Xue <
>>>>>>>>> maryann.xue@gmail.com
>>>>>>>>> <javascript:_e(%7B%7D,'cvml','maryann.xue@gmail.com');>>
wrote:
>>>>>>>>>
>>>>>>>>>> Hi Khaleel,
>>>>>>>>>>
>>>>>>>>>> Thanks a lot for reporting the problem, which looks
like a bug. I
>>>>>>>>>> will file a JIRA and keep you posted.
>>>>>>>>>>
>>>>>>>>>> One question though, why would we use MAX(H."timestamp")
instead
>>>>>>>>>> of H."timestamp"? What difference would it make?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Maryann
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 22, 2015 at 9:45 AM, khaleel mershad
<
>>>>>>>>>> khellom007@gmail.com
>>>>>>>>>> <javascript:_e(%7B%7D,'cvml','khellom007@gmail.com');>>
wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I tried executing the following query in Phoenix:
>>>>>>>>>>> select distinct W."rowId", W."md5_current", W."size_current"
>>>>>>>>>>> from "Wiki" AS W where EXISTS( select MAX(H."timestamp")
from "History" AS
>>>>>>>>>>> H where (H."rowId" = W."rowId") AND (H."status"
= 'approved') );
>>>>>>>>>>>
>>>>>>>>>>> I got the following error:
>>>>>>>>>>> ERROR 1018 (42Y27): Aggregate may not contain
columns not in
>>>>>>>>>>> GROUP BY. (state=42Y27,code=1018)
>>>>>>>>>>>
>>>>>>>>>>> If I replace W."rowId" (in the where clause)
with a constant
>>>>>>>>>>> value (for example: '3587'), the query works
fine. Also, if I replace the
>>>>>>>>>>> aggregate MAX(H."timestamp") with H."timestamp",
the query also works fine.
>>>>>>>>>>>
>>>>>>>>>>> So it seems that Phoenix generates error when
using a reference
>>>>>>>>>>> to an outer query while using an aggregate within
the inner query.
>>>>>>>>>>>
>>>>>>>>>>> Any solutions?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> Khaleel Mershad, Ph.D.
>>>>>>>>>>> Research Associate
>>>>>>>>>>>
>>>>>>>>>>> American University of Beirut
>>>>>>>>>>> Department of Electrical and Computer Engineering
>>>>>>>>>>> Bliss Street, Beirut, Lebanon
>>>>>>>>>>> email: kwm03@aub.edu.lb
>>>>>>>>>>> <javascript:_e(%7B%7D,'cvml','kwm03@aub.edu.lb');>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message