phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maryann Xue <maryann....@gmail.com>
Subject Re: Error when using aggregates with correlated subqueries
Date Sun, 26 Apr 2015 16:14:29 GMT
Hi Khaleel,

Thanks for looking into the problem! But there IS a bug with EXISTS having
an aggregate function, say "max(H."timestamp") as used in your very first
query that did not work. Otherwise, if subquery is SELECT * or SELECT
column1 or SELECT substr(column1), it's fine coz there is no aggregate.

And interestingly the query with W."rowId" failed because it referenced to
the outmost query that had EXISTS. Otherwise if using H.
rowId", it only had to deal with "SELECT * from H" which was a comparison
subquery and so it worked fine.

Anyway, I will verify if there is any problem with multiple level outer
reference in correlated subqueries, as an independent issue without EXISTS.

Given that our EXISTS support is currently incomplete, you may want to
rewrite your EXISTS subqueries with equivalent IN subqueries, which would
most likely just work.


Thanks,
Maryann


On Sun, Apr 26, 2015 at 11:54 AM, khaleel mershad <khellom007@gmail.com>
wrote:

>
> I don't think that the bug is with EXISTS. As you can see with the query
> from my last email, it contains a nested correlated subquery which contains
> an aggregate within *EXISTS* and it worked after replacing the reference
> to the outer table (W."rowId") with a reference to the inner subquery table
> (H."rowId") within the aggregate subquery. In other words, the following
> query generates an error:
>
> select distinct W."rowId", W."md5_current", W."size_current" from "Wiki"
> AS W
> where EXISTS (select * from "History" AS H
>                         where (REGEXP_SUBSTR(SUBSTR(H."rowId",19),'[^:]+')
> = W."rowId")
>                         AND H."timestamp" = (select MAX(H2."timestamp")
> from "History" AS H2
>                                                           where (*REGEXP_SUBSTR(SUBSTR(H2."rowId",19),'[^:]+')
> =*
> *
> W."rowId"*)
>                                                           AND (H2."status"
> = 'approved')) AND (TO_NUMBER(H."value") > 1000))
> AND NOT EXISTS (select * from "History" AS H3
>                               where
> (REGEXP_SUBSTR(SUBSTR(H3."rowId",19),'[^:]+') = W."rowId")
>                                 AND (H3."status" = 'pending') AND
> (TO_NUMBER(H3."value") < 1000));
>
>
> But if I replace *W."rowId"* with REGEXP_SUBSTR(SUBSTR(*H."rowId"*,19),'[^:]+'),
> the query works fine (which is the query from my last email). So I think
> the problem is not with EXISTS, but with the fact that a reference to a
> table from the outer query (W in my query) can be done up to a maximum one
> nested level, and not more than that. In other words, referring to a table
> from the outer query from within a subquery that is in the second or more
> nested level will generate an error.
>
>
>
> Best,
> Khaleel
>
>
>
> On Sun, Apr 26, 2015 at 6:33 PM, Maryann Xue <maryann.xue@gmail.com>
> wrote:
>
>> Glad that your queries worked. Please do let us know if any further
>> problems.
>>
>> The bug is with EXISTS. Right now Phoenix does not handle aggregate
>> function calls correctly within an EXISTS subquery or EXISTS nested
>> subquery. I have opened a JIRA:
>> https://issues.apache.org/jira/browse/PHOENIX-1923.
>>
>>
>> Thanks,
>> Maryann
>>
>>
>> On Sun, Apr 26, 2015 at 8:54 AM, khaleel mershad <khellom007@gmail.com>
>> wrote:
>>
>>>
>>> Dear Maryann,
>>>
>>> I tried your query and it worked. I also executed a more complex query
>>> which I need in my testing, which is:
>>>
>>> select distinct W."rowId", W."md5_current", W."size_current" from "Wiki"
>>> AS W
>>> where EXISTS (select * from "History" AS H
>>>                         where
>>> (REGEXP_SUBSTR(SUBSTR(H."rowId",19),'[^:]+') = W."rowId")
>>>                         AND H."timestamp" = (select MAX(H2."timestamp")
>>> from "History" AS H2
>>>                                                           where
>>> (REGEXP_SUBSTR(SUBSTR(H2."rowId",19),'[^:]+') =
>>>
>>> REGEXP_SUBSTR(SUBSTR(H."rowId",19),'[^:]+'))
>>>                                                           AND
>>> (H2."status" = 'approved')) AND (TO_NUMBER(H."value") > 1000))
>>> AND NOT EXISTS (select * from "History" AS H3
>>>                               where
>>> (REGEXP_SUBSTR(SUBSTR(H3."rowId",19),'[^:]+') = W."rowId")
>>>                                 AND (H3."status" = 'pending') AND
>>> (TO_NUMBER(H3."value") < 1000));
>>>
>>> and it also worked after I understood your fix of the original query.
>>>
>>> So the trick here is that we can use the reference to the outer query
>>> within the next subquery level only, and not up to two levels as I was
>>> doing? Maybe this limitation exists because Phoenix joins the tables from
>>> the outer and the inner correlated query, but it can perform this join up
>>> to one level only?
>>>
>>>
>>>
>>> Best,
>>> Khaleel
>>>
>>>
>>> On Sat, Apr 25, 2015 at 8:11 PM, Maryann Xue <maryann.xue@gmail.com>
>>> wrote:
>>>
>>>> Hi Khaleel,
>>>>
>>>> Mind if you try the following query? I think it's the same semantics as
>>>> you meant for your query.
>>>>
>>>> select distinct W."rowId", W."md5_current", W."size_current" from
>>>> "Wiki" AS W
>>>> where
>>>> W."rowId" in (select H"rowId" from "History" AS H
>>>>                          where H."timestamp" =
>>>>                                 (select MAX(H2."timestamp") from
>>>> "History" AS H2 where H2."rowId" = H."rowId")
>>>>                          AND
>>>>  H."status" = 'approved')
>>>>
>>>>
>>>>
>>>> Thanks,
>>>> Maryann
>>>>
>>>> On Fri, Apr 24, 2015 at 5:25 PM, khaleel mershad <khellom007@gmail.com>
>>>> wrote:
>>>>
>>>>> Dear Maryann,
>>>>>
>>>>> Thanks for your question. You are right: the query that I was writing
>>>>> wasn't the correct one for my purpose. The query that will satisfy my
>>>>> request would be:
>>>>>
>>>>> select distinct W."rowId", W."md5_current", W."size_current" from
>>>>> "Wiki" AS W
>>>>> where
>>>>> EXISTS (select * from "History" AS H where (H."rowId" = W."rowId") AND
>>>>> H."timestamp" = (select MAX(H2."timestamp") from "History" AS H2
>>>>> where (H2."rowId" = W."rowId")) AND
>>>>>  (H."status" = 'approved') )
>>>>>
>>>>> In this query I specify that I need to select the exact version which
>>>>> has the Maximum timestamp among all versions of the same data item by
using
>>>>> MAX(H2."timestamp") within the inner subquery within EXISTS.
>>>>>
>>>>> However I tried such query and it still produces the same error as the
>>>>> old query, which is: (Aggregate may not contain columns not in GROUP
>>>>> BY.)
>>>>>
>>>>> Thank you for your help.
>>>>>
>>>>>
>>>>> Best,
>>>>> Khaleel
>>>>>
>>>>>
>>>>> On Thu, Apr 23, 2015 at 5:49 PM, Maryann Xue <maryann.xue@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Khaleel,
>>>>>>
>>>>>> Thanks for the explanation! But my question was since this is an
>>>>>> EXISTS, I assume if there ever is a H."timestamp" with an 'approved'
>>>>>> status, the EXISTS will return true regardless of whether you are
testing
>>>>>> H."timestamp" or Max(H."timestamp"). Is that correct? or have I missed
>>>>>> something?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Maryann
>>>>>>
>>>>>> On Thu, Apr 23, 2015 at 8:49 AM, khaleel mershad <
>>>>>> khellom007@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hello Maryann,
>>>>>>>
>>>>>>> Thanks very much for your reply. Hopefully this bug gets fixed
in
>>>>>>> the next release so that I can continue working with this part
in my
>>>>>>> research project. Thanks for keeping me posted.
>>>>>>>
>>>>>>> With respect to your question, I am using the "History" table
as a
>>>>>>> data store of all versions of a certain data item. When I say
Max(
>>>>>>> H."timestamp") inside the query, I am selecting the latest version
>>>>>>> (most recent) that is "approved" (which is checked using the
condition H."status"
>>>>>>> = 'approved')
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Khaleel
>>>>>>>
>>>>>>> On Thu, Apr 23, 2015 at 6:22 AM, Maryann Xue <maryann.xue@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Khaleel,
>>>>>>>>
>>>>>>>> Thanks a lot for reporting the problem, which looks like
a bug. I
>>>>>>>> will file a JIRA and keep you posted.
>>>>>>>>
>>>>>>>> One question though, why would we use MAX(H."timestamp")
instead
>>>>>>>> of H."timestamp"? What difference would it make?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Maryann
>>>>>>>>
>>>>>>>> On Wed, Apr 22, 2015 at 9:45 AM, khaleel mershad <
>>>>>>>> khellom007@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I tried executing the following query in Phoenix:
>>>>>>>>> select distinct W."rowId", W."md5_current", W."size_current"
from
>>>>>>>>> "Wiki" AS W where EXISTS( select MAX(H."timestamp") from
"History" AS H
>>>>>>>>> where (H."rowId" = W."rowId") AND (H."status" = 'approved')
);
>>>>>>>>>
>>>>>>>>> I got the following error:
>>>>>>>>> ERROR 1018 (42Y27): Aggregate may not contain columns
not in GROUP
>>>>>>>>> BY. (state=42Y27,code=1018)
>>>>>>>>>
>>>>>>>>> If I replace W."rowId" (in the where clause) with a constant
value
>>>>>>>>> (for example: '3587'), the query works fine. Also, if
I replace the
>>>>>>>>> aggregate MAX(H."timestamp") with H."timestamp", the
query also works fine.
>>>>>>>>>
>>>>>>>>> So it seems that Phoenix generates error when using a
reference to
>>>>>>>>> an outer query while using an aggregate within the inner
query.
>>>>>>>>>
>>>>>>>>> Any solutions?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Khaleel Mershad, Ph.D.
>>>>>>>>> Research Associate
>>>>>>>>>
>>>>>>>>> American University of Beirut
>>>>>>>>> Department of Electrical and Computer Engineering
>>>>>>>>> Bliss Street, Beirut, Lebanon
>>>>>>>>> email: kwm03@aub.edu.lb
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message