phoenix-user mailing list archives

From Shawn Li <shawnli...@gmail.com>
Subject Re: "upsert select" with "limit" clause
Date Thu, 20 Dec 2018 18:41:09 GMT
Hi Vincent,

Thanks for checking the source code. You mentioned that with LIMIT the
UPSERT uses a serial insert; can I interpret that as a single-threaded
process? And are you saying that UPSERT without LIMIT is faster because it
uses parallel, multi-threaded inserts? If so, that explains both the speed
and the memory usage we see.

But the speed of UPSERT with LIMIT is still pretty slow for us: the insert
rate is about 300 rows/sec vs 3000 rows/sec without LIMIT. Bulk CSV data
loading via psql is also single-threaded (per the Phoenix site:
"Single-threaded client loading tool for CSV formatted data via the psql
<https://phoenix.apache.org/download.html#Loading-Data> command"), yet its
insert rate is 3000-5000 rows/sec for us. Is there any other reason UPSERT
with LIMIT is so slow?
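
(For reference, a typical single-threaded psql.py CSV load looks like the
following; the table name, zookeeper quorum, and file path here are
placeholders:)

    bin/psql.py -t TABLE1 zk1,zk2,zk3 /data/table1.csv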

Thanks,
Shawn

On Wed, Dec 19, 2018 at 5:04 PM Vincent Poon <vincentpoon@apache.org> wrote:

> Shawn,
>
> Took a quick look - I think what is happening is that the UPSERT is done
> serially when you have LIMIT.
> Parallel scans are issued for the SELECT, which is why the explain plan
> shows PARALLEL, but then the results are concatenated via a single
> LimitingResultIterator, in order to apply the CLIENT LIMIT.
> The upsert then reads from that iterator and does the mutations in batches.
>
> To insert in parallel, we would need some sort of shared state between the
> writing threads to ensure we respect the limit, and I don't think we
> currently have something like that.
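>
> (A minimal sketch of that pattern, assuming nothing about Phoenix's actual
> class beyond the behavior described above; the point is that the limit
> counter forces a single point of consumption:)
>
> import java.util.Iterator;
>
> // Illustration only: results from the parallel scans funnel through one
> // limiting iterator, so the consumer (the UPSERT) is single-threaded.
> class LimitingIterator<T> implements Iterator<T> {
>     private final Iterator<T> delegate; // concatenated parallel scan results
>     private long remaining;             // rows still allowed by the LIMIT
>
>     LimitingIterator(Iterator<T> delegate, long limit) {
>         this.delegate = delegate;
>         this.remaining = limit;
>     }
>
>     public boolean hasNext() {
>         return remaining > 0 && delegate.hasNext();
>     }
>
>     public T next() {
>         remaining--;            // single point of truth for the limit...
>         return delegate.next(); // ...so only one consumer can drain safely
>     }
> }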
>
> Vincent
>
> On Tue, Dec 18, 2018 at 2:31 PM Vincent Poon <vincentpoon@apache.org>
> wrote:
>
>>
>> Shawn, that sounds like a bug, I would file a JIRA.
>>
>> On Tue, Dec 18, 2018 at 12:33 PM Shawn Li <shawnlijob@gmail.com> wrote:
>>
>>> Hi Vincent & William,
>>>
>>>
>>>
>>> Below are the explain plans; both show PARALLEL execution:
>>>
>>> explain upsert into table1 select * from table2;
>>>
>>> UPSERT SELECT
>>> CLIENT 27-CHUNK 915799 ROWS 2831155510 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2
>>>
>>> explain upsert into table1 select * from table2 limit 2000000;
>>>
>>> UPSERT SELECT
>>> CLIENT 27-CHUNK 36000000 ROWS 481140000000 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2
>>>     SERVER 2000000 ROW LIMIT
>>> CLIENT 2000000 ROW LIMIT
>>>
>>> Thanks,
>>>
>>> Shawn
>>>
>>> On Tue, Dec 18, 2018, 13:30 Vincent Poon <vincentpoon@apache.org> wrote:
>>>
>>>> Shawn,
>>>>
>>>> Can you do an "explain" to show what your two statements are doing?
>>>> That might give some clues.  Perhaps one is able to be run on the server
>>>> for some reason and the other is not.
>>>> Otherwise, I don't see why one would be substantially slower than the
>>>> other.
>>>>
>>>> Vincent
>>>>
>>>> On Mon, Dec 17, 2018 at 9:14 PM Shawn Li <shawnlijob@gmail.com> wrote:
>>>>
>>>>> Hi Jonathan,
>>>>>
>>>>> Single-threading on one side sounds logical to me. Hopefully Vincent
>>>>> can confirm it.
>>>>>
>>>>> Thanks,
>>>>> Shawn
>>>>>
>>>>> On Mon, Dec 17, 2018 at 9:25 PM Jonathan Leech <jonathaz@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> My guess is that, in order to enforce the limit, it's effectively
>>>>>> single-threaded in either the select or the upsert.
>>>>>>
>>>>>> On Dec 17, 2018, at 6:43 PM, Shawn Li <shawnlijob@gmail.com> wrote:
>>>>>>
>>>>>> Hi Vincent,
>>>>>>
>>>>>> Thanks for explaining. That makes much more sense now, and it explains
>>>>>> the high memory usage without the "limit" clause: because "upsert
>>>>>> select" without "limit" upserts much more quickly, the memory usage on
>>>>>> the client machine is much higher than with "limit".
>>>>>>
>>>>>> So back to the other question: can you explain Phoenix's underlying
>>>>>> implementation of "upsert select" with "limit"? Why is it slower than
>>>>>> the version without "limit" when inserting a huge number of rows (2M),
>>>>>> e.g. "upsert into table2 select * from table1 limit 2000000;"? This is
>>>>>> much slower than inserting the whole table (upsert into table2 select *
>>>>>> from table1;).
>>>>>>
>>>>>> Thanks,
>>>>>> Xiang
>>>>>>
>>>>>>
>>>>>> On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon <vincentpoon@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Shawn,
>>>>>>> Your query *upsert into table2 select * from table1;* would not be
>>>>>>> run on the server - the source and target table are different. It
>>>>>>> would have to be something like:
>>>>>>> *upsert into table1 select * from table1;*
>>>>>>>
>>>>>>> If you want to run a server-side upsert select on a target table that
>>>>>>> is different from the source table, you need to set
>>>>>>> "phoenix.client.enable.server.upsert.select" to true on your client.
>>>>>>> There are some other restrictions: the table can't have any global
>>>>>>> indexes, and the statement can't have a join or where subquery. We
>>>>>>> need to update the documentation with this information.
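>>>>>>>
>>>>>>> (A minimal sketch of setting that property from a JDBC client; the
>>>>>>> class name and zookeeper quorum below are placeholders, and the
>>>>>>> property name is the one given above:)
>>>>>>>
>>>>>>> import java.sql.Connection;
>>>>>>> import java.sql.DriverManager;
>>>>>>> import java.util.Properties;
>>>>>>>
>>>>>>> public class ServerUpsertSelectExample {
>>>>>>>     public static void main(String[] args) throws Exception {
>>>>>>>         Properties props = new Properties();
>>>>>>>         // Opt in to server-side UPSERT SELECT across tables.
>>>>>>>         props.setProperty(
>>>>>>>             "phoenix.client.enable.server.upsert.select", "true");
>>>>>>>         // Placeholder zookeeper quorum in the JDBC URL.
>>>>>>>         try (Connection conn = DriverManager.getConnection(
>>>>>>>                 "jdbc:phoenix:zk1,zk2,zk3", props)) {
>>>>>>>             conn.setAutoCommit(true); // server-side path needs auto commit
>>>>>>>             conn.createStatement().executeUpdate(
>>>>>>>                 "upsert into table1 select * from table2");
>>>>>>>         }
>>>>>>>     }
>>>>>>> }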
>>>>>>>
>>>>>>> The reason there are all these hurdles is that it's generally not
>>>>>>> recommended to do a server-side upsert select across different tables,
>>>>>>> because that means you're doing cross-regionserver RPCs (e.g. reading
>>>>>>> data from a region of sourcetable and writing to a region of
>>>>>>> targettable on a different regionserver), potentially tying up
>>>>>>> handlers in an unpredictable way.
>>>>>>>
>>>>>>> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li <shawnlijob@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Jaanai,
>>>>>>>>
>>>>>>>> According to the Phoenix website, "If auto commit is on, and both a)
>>>>>>>> the target table matches the source table, and b) the select performs
>>>>>>>> no aggregation, then the population of the target table will be done
>>>>>>>> completely on the server-side (with constraint violations logged, but
>>>>>>>> otherwise ignored). Otherwise, data is buffered on the client and, if
>>>>>>>> auto commit is on, committed in row batches as specified by the
>>>>>>>> UpsertBatchSize connection property (or the
>>>>>>>> phoenix.mutate.upsertBatchSize HBase config property which defaults
>>>>>>>> to 10000 rows)"
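>>>>>>>>
>>>>>>>> (As an aside, a hedged sketch of tuning that batch size from a JDBC
>>>>>>>> client, assuming standard JDBC; the class name and quorum are
>>>>>>>> placeholders, and the property name is taken as quoted above:)
>>>>>>>>
>>>>>>>> import java.sql.Connection;
>>>>>>>> import java.sql.DriverManager;
>>>>>>>> import java.util.Properties;
>>>>>>>>
>>>>>>>> public class UpsertBatchSizeExample {
>>>>>>>>     public static void main(String[] args) throws Exception {
>>>>>>>>         Properties props = new Properties();
>>>>>>>>         // Batch size named in the doc text above; 10000 is the default.
>>>>>>>>         props.setProperty("UpsertBatchSize", "10000");
>>>>>>>>         try (Connection conn = DriverManager.getConnection(
>>>>>>>>                 "jdbc:phoenix:zkhost", props)) {
>>>>>>>>             conn.setAutoCommit(true); // commit in UpsertBatchSize batches
>>>>>>>>             conn.createStatement().executeUpdate(
>>>>>>>>                 "upsert into table2 select * from table1");
>>>>>>>>         }
>>>>>>>>     }
>>>>>>>> }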
>>>>>>>>
>>>>>>>> And our sql statement is just *upsert into table2 select * from
>>>>>>>> table1;* which should match the first case, so all operations should
>>>>>>>> happen on the server side. But the memory usage on the client machine
>>>>>>>> is higher than with the "upsert select ... limit" version. The memory
>>>>>>>> usage was checked by running the 'top' command under Linux, so we are
>>>>>>>> sure it is caused by the "upsert select" in Phoenix and not by
>>>>>>>> anything else, and we can't explain such high memory usage on the
>>>>>>>> client/gateway machine when all operations are supposed to happen on
>>>>>>>> the server side.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Shawn
>>>>>>>>
>>>>>>>> On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <cloud.poster@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Shawn,
>>>>>>>>>
>>>>>>>>> The UPSERT SELECT will run in a coprocessor only if it has no limit
>>>>>>>>> clause, queries only one table, performs no aggregation, uses no
>>>>>>>>> sequences, and auto commit is on. Please check your SQL ... and you
>>>>>>>>> can also check whether some resources have not been released.
>>>>>>>>>
>>>>>>>>> ----------------------------------------
>>>>>>>>>    Jaanai Zhang
>>>>>>>>>    Best regards!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Shawn Li <shawnlijob@gmail.com> wrote on Thu, Dec 13, 2018 at 12:10 PM:
>>>>>>>>>
>>>>>>>>>> Hi Jaanai,
>>>>>>>>>>
>>>>>>>>>> Thanks for sharing your thoughts. The behavior you describe is
>>>>>>>>>> correct on the HBase region server side: memory usage for the
>>>>>>>>>> blockcache and memstore will be high under such high throughput.
>>>>>>>>>> But our Phoenix client is on a gateway machine (no HBase region
>>>>>>>>>> server or any other HBase service sitting on it), so we are not
>>>>>>>>>> sure how to explain such high memory usage for upsert select
>>>>>>>>>> without the "limit" clause. The memory behavior looks as if all
>>>>>>>>>> select results were sent to the client machine, cached in its
>>>>>>>>>> memory, and then inserted back into the target table, which is not
>>>>>>>>>> what should happen: all of this should be done on the server side,
>>>>>>>>>> as the table schemas are exactly the same. By the way, this happens
>>>>>>>>>> on both Phoenix 4.7 and Phoenix 4.14.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Shawn
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang <cloud.poster@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Shawn,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The upsert without limit will read the source table and write the
>>>>>>>>>>> target table on the server side. I think the higher memory usage
>>>>>>>>>>> is caused by the scan cache and memstore under the higher
>>>>>>>>>>> throughput.
>>>>>>>>>>>
>>>>>>>>>>> ----------------------------------------
>>>>>>>>>>>    Jaanai Zhang
>>>>>>>>>>>    Best regards!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Shawn Li <shawnlijob@gmail.com> wrote on Thu, Dec 13, 2018 at 10:13 AM:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Vincent,
>>>>>>>>>>>>
>>>>>>>>>>>> So you're describing that with limit, results are sent to the
>>>>>>>>>>>> client side and then written back to the server; this might
>>>>>>>>>>>> explain why upsert with limit is slower than without limit. But
>>>>>>>>>>>> it doesn't seem to explain the memory usage: the memory usage on
>>>>>>>>>>>> the client machine is 8gb (without "limit") vs 2gb (with
>>>>>>>>>>>> "limit"), and upsert without "limit" can sometimes even reach
>>>>>>>>>>>> 20gb for a big table.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Shawn
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon <vincentpoon@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think it's done client-side if you have LIMIT. If you have
>>>>>>>>>>>>> e.g. LIMIT 1000, it would be incorrect for each regionserver to
>>>>>>>>>>>>> upsert 1000 rows if you have more than one regionserver. So
>>>>>>>>>>>>> instead, results are sent back to the client, where the LIMIT is
>>>>>>>>>>>>> applied, and then written back to the server in the UPSERT.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <shawnlijob@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Vincent,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The table creation statement is similar to the one below. We
>>>>>>>>>>>>>> have about 200 fields. The table is mutable and doesn't have
>>>>>>>>>>>>>> any indexes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> CREATE TABLE IF NOT EXISTS us_population (
>>>>>>>>>>>>>>       state CHAR(2) NOT NULL,
>>>>>>>>>>>>>>       city VARCHAR,
>>>>>>>>>>>>>>       population BIGINT,
>>>>>>>>>>>>>>       …
>>>>>>>>>>>>>>       CONSTRAINT my_pk PRIMARY KEY (state));
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Shawn
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon <vincentpoon@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For #2, can you provide the table definition and the statement
>>>>>>>>>>>>>>> used? E.g. is the table immutable, or does it have indexes?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Dec 11, 2018 at 6:08 PM Shawn/Xiang Li <shawnlijob@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1.       I want to check what runs underneath the limit
>>>>>>>>>>>>>>>> clause in the following upsert statement (does it involve any
>>>>>>>>>>>>>>>> coprocessor behind the scenes?):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *upsert into table2 select * from table1 limit 3000000;*
>>>>>>>>>>>>>>>> (table1 and table2 have the same schema)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>               The above statement runs a lot slower than the
>>>>>>>>>>>>>>>> version without the "limit" clause shown below, even though
>>>>>>>>>>>>>>>> it upserts less data:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *upsert into table2 select * from table1;*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2.       We also observe that memory usage is pretty high
>>>>>>>>>>>>>>>> without the limit clause (8gb vs 2gb); for a big table it can
>>>>>>>>>>>>>>>> sometimes reach 20gb. According to the Phoenix website's
>>>>>>>>>>>>>>>> description of upsert select: "If auto commit is on, and both
>>>>>>>>>>>>>>>> a) the target table matches the source table, and b) the
>>>>>>>>>>>>>>>> select performs no aggregation, then the population of the
>>>>>>>>>>>>>>>> target table will be done completely on the server-side (with
>>>>>>>>>>>>>>>> constraint violations logged, but otherwise ignored)."
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>                My question is: if everything is done on the
>>>>>>>>>>>>>>>> server side, how come we have such high memory usage on the
>>>>>>>>>>>>>>>> client machine?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Shawn
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
