phoenix-user mailing list archives

From Jaanai Zhang <cloud.pos...@gmail.com>
Subject Re: "upsert select" with "limit" clause
Date Thu, 20 Dec 2018 06:35:56 GMT
Shawn,

I have done some tests on version 4.14.1-HBase-1.4. The detailed
information is as follows:

CREATE TABLE test (id VARCHAR PRIMARY KEY, c1 varchar, c2 varchar)
SALT_BUCKETS = 10;

explain select * from test where c1 = 'xx' limit 5 offset 100;

CREATE TABLE test1 (id VARCHAR PRIMARY KEY, c1 varchar, c2 varchar)
SALT_BUCKETS = 10;

explain upsert into test1 select * from test limit 10;



0: jdbc:phoenix:thin:url=http://localhost:876> explain upsert into test1 select * from test limit 10;

+-------------------------------------------------------------------------------------+-----------------+----------------+--------------+
|                                        PLAN                                         | EST_BYTES_READ  | EST_ROWS_READ  | EST_INFO_TS  |
+-------------------------------------------------------------------------------------+-----------------+----------------+--------------+
| UPSERT SELECT                                                                       | 2040            | 10             | 0            |
| CLIENT 10-CHUNK 10 ROWS 2040 BYTES *SERIAL* 10-WAY ROUND ROBIN FULL SCAN OVER TEST  | 2040            | 10             | 0            |
|     SERVER 10 ROW LIMIT                                                             | 2040            | 10             | 0            |
| CLIENT 10 ROW LIMIT                                                                 | 2040            | 10             | 0            |
+-------------------------------------------------------------------------------------+-----------------+----------------+--------------+
4 rows selected (0.028 seconds)

0: jdbc:phoenix:thin:url=http://localhost:876> explain upsert into test1 select * from test;

+------------------------------------------------------------------+-----------------+----------------+--------------+
|                               PLAN                               | EST_BYTES_READ  | EST_ROWS_READ  | EST_INFO_TS  |
+------------------------------------------------------------------+-----------------+----------------+--------------+
| UPSERT SELECT                                                    | null            | null           | null         |
| CLIENT 10-CHUNK PARALLEL 10-WAY ROUND ROBIN FULL SCAN OVER TEST  | null            | null           | null         |
+------------------------------------------------------------------+-----------------+----------------+--------------+
2 rows selected (0.033 seconds)


I notice that the UPSERT produces serial scans when the LIMIT clause is present.
What is your Phoenix version?   @Vincent FYI

----------------------------------------
   Jaanai Zhang
   Best regards!



Vincent Poon <vincentpoon@apache.org> wrote on Thu, Dec 20, 2018 at 6:04 AM:

> Shawn,
>
> Took a quick look, I think what is happening is the UPSERT is done
> serially when you have LIMIT.
> Parallel scans are issued for the SELECT, which is why the explain plan
> shows PARALLEL, but then the results are concatenated via a single
> LimitingResultIterator, in order to apply the CLIENT LIMIT.
> The upsert then reads from that iterator and does the mutations in batches.
>
> To insert in parallel, we would need some sort of shared state between the
> writing threads to ensure we respect the limit, and I don't think we
> currently have something like that.
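
A minimal, hypothetical sketch of the concatenate-then-limit pattern Vincent
describes (the class and method names below are illustrative only, not
Phoenix's actual LimitingResultIterator): because the remaining-rows counter
lives in one single-threaded iterator, the downstream UPSERT necessarily
drains rows serially.

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Results from several parallel scans are concatenated into one iterator,
// and the LIMIT is enforced at that single choke point.
public class LimitingConcatIterator<T> implements Iterator<T> {
    private final Iterator<Iterator<T>> sources; // one iterator per parallel scan
    private Iterator<T> current = null;
    private long remaining;                      // rows still allowed by LIMIT

    public LimitingConcatIterator(List<Iterator<T>> scanResults, long limit) {
        this.sources = scanResults.iterator();
        this.remaining = limit;
    }

    @Override
    public boolean hasNext() {
        if (remaining <= 0) {
            return false; // LIMIT reached: stop, regardless of remaining scans
        }
        while ((current == null || !current.hasNext()) && sources.hasNext()) {
            current = sources.next(); // move on to the next scan's results
        }
        return current != null && current.hasNext();
    }

    @Override
    public T next() {
        // Single-threaded bookkeeping makes the limit trivial to respect;
        // callers are expected to check hasNext() first.
        remaining--;
        return current.next();
    }

    public static void main(String[] args) {
        // Pretend these are results streaming back from three parallel scans.
        List<Iterator<String>> scans = Arrays.asList(
                Arrays.asList("r1", "r2").iterator(),
                Arrays.asList("r3", "r4").iterator(),
                Arrays.asList("r5", "r6").iterator());
        LimitingConcatIterator<String> it = new LimitingConcatIterator<>(scans, 3);
        while (it.hasNext()) {
            System.out.println(it.next()); // prints r1, r2, r3
        }
    }
}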
>
> Vincent
>
> On Tue, Dec 18, 2018 at 2:31 PM Vincent Poon <vincentpoon@apache.org>
> wrote:
>
>>
>> Shawn, that sounds like a bug, I would file a JIRA.
>>
>> On Tue, Dec 18, 2018 at 12:33 PM Shawn Li <shawnlijob@gmail.com> wrote:
>>
>>> Hi Vincent & William,
>>>
>>>
>>>
>>> Below are the explain plans; both show PARALLEL execution in the plan:
>>>
>>>
>>>
>>> explain upsert into table1 select * from table2;
>>>
>>>
>>>
>>> | UPSERT SELECT                                                                                  |
>>> | CLIENT 27-CHUNK 915799 ROWS 2831155510 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2 |
>>>
>>>
>>>
>>> explain upsert into table1 select * from table2 limit 2000000;
>>>
>>>
>>>
>>> | UPSERT SELECT                                                                                      |
>>> | CLIENT 27-CHUNK 36000000 ROWS 481140000000 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2 |
>>> |     SERVER 2000000 ROW LIMIT                                                                       |
>>> | CLIENT 2000000 ROW LIMIT                                                                           |
>>>
>>> Thanks,
>>>
>>> Shawn
>>>
>>> On Tue, Dec 18, 2018, 13:30 Vincent Poon <vincentpoon@apache.org> wrote:
>>>
>>>> Shawn,
>>>>
>>>> Can you do an "explain" to show what your two statements are doing?
>>>> That might give some clues.  Perhaps one is able to be run on the server
>>>> for some reason and the other is not.
>>>> Otherwise, I don't see why one would be substantially slower than the
>>>> other.
>>>>
>>>> Vincent
>>>>
>>>> On Mon, Dec 17, 2018 at 9:14 PM Shawn Li <shawnlijob@gmail.com> wrote:
>>>>
>>>>> Hi Jonathan,
>>>>>
>>>>> The single threaded on one side sounds logical to me. Hopefully
>>>>> Vincent can confirm it.
>>>>>
>>>>> Thanks,
>>>>> Shawn
>>>>>
>>>>> On Mon, Dec 17, 2018 at 9:25 PM Jonathan Leech <jonathaz@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> My guess is that, in order to enforce the limit, it's effectively
>>>>>> single-threaded in either the select or the upsert.
>>>>>>
>>>>>> On Dec 17, 2018, at 6:43 PM, Shawn Li <shawnlijob@gmail.com> wrote:
>>>>>>
>>>>>> Hi Vincent,
>>>>>>
>>>>>> Thanks for explaining. That makes much more sense now, and it explains
>>>>>> the high memory usage without the "limit" clause. Because "upsert
>>>>>> select" without "limit" upserts much more quickly, the memory usage on
>>>>>> the client machine is much higher than with "limit".
>>>>>>
>>>>>> So back to the other question. Can you explain the underlying Phoenix
>>>>>> implementation of "upsert select" with "limit"? Why is it slower than
>>>>>> without "limit" when inserting a huge number of rows (2M), as in
>>>>>> "upsert into table2 select * from table1 limit 2,000,000;"? This is
>>>>>> much slower than inserting the whole table (upsert into table2 select
>>>>>> * from table1;).
>>>>>>
>>>>>> Thanks,
>>>>>> Xiang
>>>>>>
>>>>>>
>>>>>> On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon <vincentpoon@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Shawn,
>>>>>>> Your query *upsert into table2 select * from table1;* would not be
>>>>>>> run on the server - the source and target tables are different. It
>>>>>>> would have to be something like:
>>>>>>> *upsert into table1 select * from table1;*
>>>>>>>
>>>>>>> If you want to run a server-side upsert select on a target table that
>>>>>>> is different from the source table, you need to set
>>>>>>> "phoenix.client.enable.server.upsert.select" to true on your client.
>>>>>>> There are some other restrictions: the table can't have any global
>>>>>>> indexes, and the statement can't have a join or a where subquery. We
>>>>>>> need to update the documentation with this information.
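
A minimal sketch of how that flag might be supplied, assuming a plain thick
JDBC client that accepts Phoenix/HBase config keys as connection properties
(setting it in the client's hbase-site.xml is the other usual route; the
JDBC URL and table names are placeholders from this thread):

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class ServerUpsertSelectExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Opt in to server-side upsert select across different tables, per
        // Vincent's note above. Remember the caveats: no global indexes on
        // the target, and no join or where subquery in the statement.
        props.setProperty("phoenix.client.enable.server.upsert.select", "true");

        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost", props)) {
            conn.setAutoCommit(true); // auto commit is required for server-side execution
            conn.createStatement().executeUpdate(
                    "UPSERT INTO table2 SELECT * FROM table1");
        }
    }
}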
>>>>>>>
>>>>>>> The reason there are all these hurdles is that it's generally not
>>>>>>> recommended to do server-side upsert select across different tables,
>>>>>>> because that means you're doing cross-regionserver RPCs (e.g. read
>>>>>>> data from a region of sourcetable, and write to a region of
>>>>>>> targettable on a different regionserver), potentially tying up
>>>>>>> handlers in an unpredictable way.
>>>>>>>
>>>>>>> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li <shawnlijob@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Jaanai,
>>>>>>>>
>>>>>>>> According to the Phoenix website, "If auto commit is on, and both a)
>>>>>>>> the target table matches the source table, and b) the select performs
>>>>>>>> no aggregation, then the population of the target table will be done
>>>>>>>> completely on the server-side (with constraint violations logged, but
>>>>>>>> otherwise ignored). Otherwise, data is buffered on the client and, if
>>>>>>>> auto commit is on, committed in row batches as specified by the
>>>>>>>> UpsertBatchSize connection property (or the
>>>>>>>> phoenix.mutate.upsertBatchSize HBase config property which defaults
>>>>>>>> to 10000 rows)"
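
As a hedged illustration of the client-side buffered path that passage
describes, using the UpsertBatchSize connection property it names (the JDBC
URL, batch size, and table names are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class UpsertBatchSizeExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Connection property named in the docs passage above: controls how
        // many rows are committed per batch when the upsert select falls
        // back to the client-side buffered path.
        props.setProperty("UpsertBatchSize", "1000");

        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost", props)) {
            conn.setAutoCommit(true); // rows are then committed in batches
            conn.createStatement().executeUpdate(
                    "UPSERT INTO table2 SELECT * FROM table1");
        }
    }
}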
>>>>>>>>
>>>>>>>> And our sql statement is just: *upsert into table2 select * from
>>>>>>>> table1;* which should match the first case, so all operations should
>>>>>>>> be on the server side. But the memory usage on the client machine is
>>>>>>>> higher than with the "upsert select limit" clause. The memory usage
>>>>>>>> was checked by running the 'top' command under Linux, so we are sure
>>>>>>>> it is caused by "upsert select" in Phoenix and not something else,
>>>>>>>> and we can't explain why there is such high memory usage on the
>>>>>>>> client/gateway machine when all operations are supposed to happen on
>>>>>>>> the server side.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Shawn
>>>>>>>>
>>>>>>>> On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <
>>>>>>>> cloud.poster@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Shawn,
>>>>>>>>>
>>>>>>>>> The UPSERT SELECT will run in a coprocessor only if it has no limit
>>>>>>>>> clause, queries a single table, performs no aggregation, uses no
>>>>>>>>> sequences, and auto commit is on. Please check your SQL ... and you
>>>>>>>>> can also check whether some resources have not been released.
>>>>>>>>>
>>>>>>>>> ----------------------------------------
>>>>>>>>>    Jaanai Zhang
>>>>>>>>>    Best regards!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Shawn Li <shawnlijob@gmail.com> wrote on Thu, Dec 13, 2018 at 12:10 PM:
>>>>>>>>>
>>>>>>>>>> Hi Jaanai,
>>>>>>>>>>
>>>>>>>>>> Thanks for sharing your thoughts. The behavior you describe is
>>>>>>>>>> correct on the HBase region server side: the memory usage for the
>>>>>>>>>> blockcache and memstore will be high under such high throughput.
>>>>>>>>>> But our Phoenix client is on a gateway machine (no HBase region
>>>>>>>>>> server or any other HBase service on it), so we are not sure how
>>>>>>>>>> to explain such high memory usage for upsert select without the
>>>>>>>>>> "limit" clause. The memory usage looks as if all select results
>>>>>>>>>> are sent to the client machine, cached in the client machine's
>>>>>>>>>> memory, and then inserted back into the target table, which is not
>>>>>>>>>> the behavior that should happen; all of this should be done on the
>>>>>>>>>> server side since the table schemas are exactly the same. By the
>>>>>>>>>> way, this happens on both Phoenix 4.7 and Phoenix 4.14.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Shawn
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang <
>>>>>>>>>> cloud.poster@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Shawn,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The upsert without limit will read the source table and write
>>>>>>>>>>> the target table on the server side. I think the higher memory
>>>>>>>>>>> usage is caused by the scan cache and memstore under the higher
>>>>>>>>>>> throughput.
>>>>>>>>>>>
>>>>>>>>>>> ----------------------------------------
>>>>>>>>>>>    Jaanai Zhang
>>>>>>>>>>>    Best regards!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Shawn Li <shawnlijob@gmail.com> wrote on Thu, Dec 13, 2018 at 10:13 AM:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Vincent,
>>>>>>>>>>>>
>>>>>>>>>>>> So you are describing that with limit the results are sent to
>>>>>>>>>>>> the client side and then written back to the server; this might
>>>>>>>>>>>> explain why upsert with limit is slower than without limit. But
>>>>>>>>>>>> it doesn't seem to explain the memory usage. The memory usage on
>>>>>>>>>>>> the client machine is 8 GB (without "limit") vs 2 GB (with
>>>>>>>>>>>> "limit"), and sometimes upsert without "limit" can even reach
>>>>>>>>>>>> 20 GB for a big table.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Shawn
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon <vincentpoon@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think it's done client-side if you have LIMIT. If you have
>>>>>>>>>>>>> e.g. LIMIT 1000, it would be incorrect for each regionserver to
>>>>>>>>>>>>> upsert 1000 if you have more than one regionserver. So instead
>>>>>>>>>>>>> the results are sent back to the client, where the LIMIT is
>>>>>>>>>>>>> applied, and then written back to the server in the UPSERT.
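
In plain JDBC terms, the flow Vincent describes is roughly equivalent to the
sketch below. This is a simplification for illustration only: Phoenix does
this internally, and the column names (id, c1, c2), JDBC URL, and batch size
here are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ClientSideLimitUpsertSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
            conn.setAutoCommit(false);
            // 1. Parallel scans feed this SELECT, but the LIMIT is applied
            //    as the single client-side iterator is drained.
            ResultSet rs = conn.createStatement().executeQuery(
                    "SELECT id, c1, c2 FROM table1 LIMIT 1000");
            PreparedStatement upsert = conn.prepareStatement(
                    "UPSERT INTO table2 VALUES (?, ?, ?)");
            int pending = 0;
            while (rs.next()) {
                // 2. Each limited row is written back to the server...
                upsert.setString(1, rs.getString(1));
                upsert.setString(2, rs.getString(2));
                upsert.setString(3, rs.getString(3));
                upsert.executeUpdate();
                // 3. ...and the mutations are flushed in batches.
                if (++pending % 500 == 0) {
                    conn.commit();
                }
            }
            conn.commit(); // flush the final partial batch
        }
    }
}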
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <shawnlijob@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Vincent,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The table creation statement is similar to the one below. We
>>>>>>>>>>>>>> have about 200 fields. The table is mutable and doesn't have
>>>>>>>>>>>>>> any indexes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> CREATE TABLE IF NOT EXISTS us_population (
>>>>>>>>>>>>>>       state CHAR(2) NOT NULL,
>>>>>>>>>>>>>>       city VARCHAR,
>>>>>>>>>>>>>>       population BIGINT,
>>>>>>>>>>>>>>       …
>>>>>>>>>>>>>>       CONSTRAINT my_pk PRIMARY KEY (state));
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Shawn
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon <vincentpoon@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For #2, can you provide the table definition and the
>>>>>>>>>>>>>>> statement used? e.g. Is the table immutable, or does it have
>>>>>>>>>>>>>>> indexes?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Dec 11, 2018 at 6:08 PM Shawn/Xiang Li <shawnlijob@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1.       Want to check what is running under the hood for
>>>>>>>>>>>>>>>> the limit clause used in the following upsert statement (is
>>>>>>>>>>>>>>>> any coprocessor involved behind the scenes?):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *upsert into table2 select * from table1 limit 3000000;*
>>>>>>>>>>>>>>>> (table1 and table2 have the same schema)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>               The above statement runs a lot slower than the
>>>>>>>>>>>>>>>> one without the “limit” clause shown below, even though it
>>>>>>>>>>>>>>>> upserts less data:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *upsert into table2 select * from table1;*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2.       We also observe that memory usage is pretty high
>>>>>>>>>>>>>>>> without the limit clause (8 GB vs 2 GB); sometimes for a big
>>>>>>>>>>>>>>>> table it can reach 20 GB without the limit clause. According
>>>>>>>>>>>>>>>> to the Phoenix website's description of upsert select, “If
>>>>>>>>>>>>>>>> auto commit is on, and both a) the target table matches the
>>>>>>>>>>>>>>>> source table, and b) the select performs no aggregation,
>>>>>>>>>>>>>>>> then the population of the target table will be done
>>>>>>>>>>>>>>>> completely on the server-side (with constraint violations
>>>>>>>>>>>>>>>> logged, but otherwise ignored).”
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>                My question is: if everything is done on the
>>>>>>>>>>>>>>>> server side, how come we have such high memory usage on the
>>>>>>>>>>>>>>>> client machine?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Shawn
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
