phoenix-user mailing list archives

From James Taylor <jamestay...@apache.org>
Subject Re: Re: Local index related data bulkload
Date Fri, 12 Sep 2014 05:36:08 GMT
Hi Sun,
You make a good point. Immutable and local vs global are orthogonal. We
could support local immutable indexes as well as global immutable indexes.
Would you mind filing a JIRA on this?
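To make that concrete, here is a rough sketch of what the combination could
look like (a hypothetical EVENTS table and a localhost quorum; note that as
of this thread, a CREATE LOCAL INDEX is maintained as mutable regardless of
the table's IMMUTABLE_ROWS setting, which is exactly what the JIRA would
change):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LocalImmutableIndexSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement()) {
            // Write-once/append-only table: IMMUTABLE_ROWS=true tells Phoenix
            // that its indexes need no incremental maintenance.
            stmt.execute("CREATE TABLE EVENTS ("
                + " HOST VARCHAR NOT NULL,"
                + " TS DATE NOT NULL,"
                + " VAL DOUBLE,"
                + " CONSTRAINT PK PRIMARY KEY (HOST, TS)"
                + ") IMMUTABLE_ROWS=true");
            // A local index on the same table: today this is still treated as
            // mutable; the JIRA above would let it inherit the immutability.
            stmt.execute("CREATE LOCAL INDEX EVENTS_VAL_IDX ON EVENTS (VAL)");
        }
    }
}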

In your experience, is the index maintenance a bottleneck for you if you
create completely covered immutable indexes? What's your mix of reads vs
writes for your use case?

Thanks,
James

On Thu, Sep 11, 2014 at 7:32 PM, sunfl@certusnet.com.cn <
sunfl@certusnet.com.cn> wrote:

> Hi, James
> Thanks for your reply. We understand the difference and application
> scenarios for IMMUTABLE INDEX and MUTABLE INDEX.
> The main reason we want to use local indexing is its faster-write
> characteristic, as we are trying to increase our data loading speed and
> performance. Another consideration is that local indexing does not require
> including additional covered columns when specifying queries,
> which also fits our requirements.
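> For illustration, a sketch of what we mean (a hypothetical METRICS table
> with made-up column names; the two CREATE statements are shown on one
> table purely for comparison):
>
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.Statement;
>
> public class CoveredVsLocalSketch {
>     public static void main(String[] args) throws Exception {
>         try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
>              Statement stmt = conn.createStatement()) {
>             stmt.execute("CREATE TABLE METRICS ("
>                 + " HOST VARCHAR NOT NULL,"
>                 + " TS DATE NOT NULL,"
>                 + " VAL DOUBLE,"
>                 + " INFO VARCHAR,"
>                 + " CONSTRAINT PK PRIMARY KEY (HOST, TS)"
>                 + ")");
>             // Global index: INFO must be covered via INCLUDE, or the
>             // optimizer will not (without a hint) use the index for
>             // "SELECT INFO ... WHERE VAL = ?".
>             stmt.execute("CREATE INDEX METRICS_GIDX ON METRICS (VAL) INCLUDE (INFO)");
>             // Local index: no INCLUDE needed; uncovered columns such as INFO
>             // are fetched from the co-located data region at query time.
>             stmt.execute("CREATE LOCAL INDEX METRICS_LIDX ON METRICS (VAL)");
>         }
>     }
> }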
> James, is there any possibility that a local index can be created as an
> immutable index? We do not quite understand the design of
> local indexing and why a local index must be created as a mutable index
> by default. Noting that HBase and Cassandra are often used to process
> time-series data, immutable indexes may be more efficient in some
> situations. Those are just some of our considerations. Are there any options
> to select when using a local index as an immutable index? Correct us if the
> design has constraints that force the default behavior.
>
> Thanks,
> Sun
>
>
> *From:* James Taylor <jamestaylor@apache.org>
> *Date:* 2014-09-12 09:57
> *To:* user <user@phoenix.apache.org>
> *Subject:* Re: RE: Local index related data bulkload
> Hi Sun,
> Yes, that explains it. With immutable indexes, there is no index
> maintenance required, so there's no processing at all on the server side.
> If your data is write-once/append-only, then immutable indexes are about as
> efficient as you'll get. Any reason why you'd want to change them to local
> indexes? Local indexes are an alternative to global indexes for *mutable*
> data.
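> To illustrate the write path, here is a minimal sketch (reusing the
> hypothetical EVENTS table from earlier on this page): with
> IMMUTABLE_ROWS=true, the client itself generates the index rows at
> commit time, so the server does no index work at all.
>
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.PreparedStatement;
>
> public class AppendOnlyWriteSketch {
>     public static void main(String[] args) throws Exception {
>         try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
>             conn.setAutoCommit(false);
>             try (PreparedStatement ps = conn.prepareStatement(
>                     "UPSERT INTO EVENTS (HOST, TS, VAL) VALUES (?, CURRENT_DATE(), ?)")) {
>                 for (int i = 0; i < 1000; i++) {
>                     ps.setString(1, "host-" + (i % 10));
>                     ps.setDouble(2, Math.random());
>                     ps.executeUpdate();
>                 }
>             }
>             // For an immutable table, the client computes the index rows and
>             // writes them alongside the data rows here; there is no
>             // read-before-write and no server-side index maintenance.
>             conn.commit();
>         }
>     }
> }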
> Thanks,
> James
>
> On Thu, Sep 11, 2014 at 6:51 PM, sunfl@certusnet.com.cn <
> sunfl@certusnet.com.cn> wrote:
>
>> Hi, Rajeshbabu
>> Much appreciated for your kind reply and explanation. Exactly, we
>> created only one local index for the table.
>>
>> We have one question: as far as we understand, for local indexing the
>> index data may already be prepared at client upsert time? Maybe there is
>> no need to scan and search on the specified region server? We did not have
>> this much trouble in the case of global index loading (no matter whether
>> one index or several indexes were involved in the load).
>>
>> Another question: the global indexes we created are immutable indexes
>> (set with IMMUTABLE_ROWS=true), while local indexes are mutable by default.
>> Does this difference account for much of the performance gap?
>>
>> Best thanks,
>> Sun
>>
>> ------------------------------
>>
>> *From:* rajeshbabu chintaguntla <rajeshbabu.chintaguntla@huawei.com>
>> *Sent:* 2014-09-11 23:45
>> *To:* user@phoenix.apache.org
>> *Subject:* RE: Re: Local index related data bulkload
>> Hi Sun,
>> The code snippet (*PhoenixIndexBuilder#batchStarted*) you have pointed
>> out is not specific to local indexing; it is generic for any index. The
>> main idea of the method is to keep the rows to be indexed in the block
>> cache, so that whenever we scan the rows while preparing index updates we
>> can get them from the cache.
>> // The entire purpose of this method impl is to get the existing rows
>> // for the table rows being indexed into the block cache, as the index
>> // maintenance code does a point scan per row
>>
>> This gives good performance when a table has more than one index. One
>> more thing: with the psql tool we do upserts in batches, and each batch
>> has 1000 updates by default (if you don't specify a value for
>> phoenix.mutate.batchSize). Suppose all the rows are different; then we
>> scan the region until we cache all 1000 records. That's why
>> hasMore = scanner.nextRaw(results);     //Here....  might be taking
>> more time.
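>> For example, a smaller batch can be configured through connection
>> properties (a sketch; the value 200 is just a placeholder to tune for
>> your workload):
>>
>> import java.sql.Connection;
>> import java.sql.DriverManager;
>> import java.util.Properties;
>>
>> public class BatchSizeSketch {
>>     public static void main(String[] args) throws Exception {
>>         Properties props = new Properties();
>>         // Lower the default of 1000 rows per batch; each batch then needs
>>         // a shorter cache-warming scan in batchStarted.
>>         props.setProperty("phoenix.mutate.batchSize", "200");
>>         try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost", props)) {
>>             conn.setAutoCommit(true);
>>             // With auto-commit on, psql-style loads and UPSERT SELECT /
>>             // DELETE statements are committed in batches of 200 rows.
>>         }
>>     }
>> }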
>> Can you tell me how many indexes you have created? One improvement we can
>> make here: if we have only one index, we can skip the scan in
>> *PhoenixIndexBuilder#batchStarted*; a rough sketch follows below.
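>> Roughly like this (an untested sketch; the rest of the method, shown at
>> the bottom of this thread, stays the same):
>>
>> // Inside PhoenixIndexBuilder#batchStarted, after the loop that collects
>> // the keys and the IndexMaintainers:
>> if (maintainers.size() <= 1) {
>>     // With a single index each cached row is read only once, so the
>>     // pre-warming scan does not pay for itself; let the per-row point
>>     // scans in the index maintenance code read directly.
>>     return;
>> }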
>>
>> @James, currently we are scanning the data region while preparing index
>> updates. Why don't we prepare them without scanning the data region, if we
>> can get all the index column data from the hooks?
>>
>>
>> bq. If someone has successfully loaded data through CsvBulkload
>> using Spark and HDFS, please give us your kind suggestions.
>> Please refer to "http://phoenix.apache.org/bulk_dataload.html#Loading via
>> MapReduce" to run the bulk load from HDFS. Here we can pass the index
>> table to build via the --index-table parameter.
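>> For example (a sketch; the table names, input path, and quorum are
>> placeholders for your own values):
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.hbase.HBaseConfiguration;
>> import org.apache.hadoop.util.ToolRunner;
>> import org.apache.phoenix.mapreduce.CsvBulkLoadTool;
>>
>> public class BulkLoadDriver {
>>     public static void main(String[] args) throws Exception {
>>         Configuration conf = HBaseConfiguration.create();
>>         int exit = ToolRunner.run(conf, new CsvBulkLoadTool(), new String[] {
>>             "--table", "EXAMPLE",           // data table to load
>>             "--index-table", "EXAMPLE_IDX", // index to build during the load
>>             "--input", "/data/example.csv", // CSV file(s) on HDFS
>>             "--zookeeper", "zk1:2181"       // your ZooKeeper quorum
>>         });
>>         System.exit(exit);
>>     }
>> }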
>>  But currently there is a problem with local indexing. I will raise an
>> issue and work on it.
>>
>>
>>  Thanks,
>>  Rajeshbabu.
>>
>>   ------------------------------
>> *From:* sunfl@certusnet.com.cn [sunfl@certusnet.com.cn]
>> *Sent:* Thursday, September 11, 2014 6:34 AM
>> *To:* user
>> *Subject:* Re: Re: Local index related data bulkload
>>
>> Many thanks.
>>
>>  ------------------------------
>>
>>
>>     *From:* rajesh babu Chintaguntla <chrajeshbabu32@gmail.com>
>> *Date:* 2014-09-10 21:09
>> *To:* user@phoenix.apache.org
>> *Subject:* Re: Local index related data bulkload
>> Hi Sun, I don't have access to the code right now. Tomorrow morning I
>> will check and let you know.
>>
>>  Thanks,
>> Rajeshbabu
>>
>> On Wednesday, September 10, 2014, sunfl@certusnet.com.cn <
>> sunfl@certusnet.com.cn> wrote:
>>
>>> Any suggestions?
>>>
>>>  ------------------------------
>>>
>>> *From:* sunfl@certusnet.com.cn
>>> *Sent:* 2014-09-09 14:24
>>> *To:* user
>>> *Subject:* Re: Local index related data bulkload
>>> BTW, the stack trace shows that our job's performance
>>> bottleneck mainly lies in the following code:
>>> region.startRegionOperation();
>>> try {
>>>     boolean hasMore;
>>>     do {
>>>         List<Cell> results = Lists.newArrayList();
>>>         // Results are potentially returned even when the return value of
>>>         // s.next is false since this is an indication of whether or not
>>>         // there are more values after the ones returned
>>>         hasMore = scanner.nextRaw(results);     // Here....
>>>     } while (hasMore);
>>> } finally {
>>>     try {
>>>         scanner.close();
>>>     } finally {
>>>         region.closeRegionOperation();
>>>     }
>>> }
>>>
>>>  ------------------------------
>>>
>>> *From:* sunfl@certusnet.com.cn
>>> *Sent:* 2014-09-09 14:18
>>> *To:* user
>>> *Cc:* rajeshbabu chintaguntla
>>> *Subject:* Local index related data bulkload
>>> Hi all and Rajeshbabu,
>>> Recently our job has run into severe problems when trying to load data
>>> with local indexes into Phoenix. The data load performance looks very bad
>>> compared with our previous data loading with global indexes. That seems
>>> quite absurd, because the Phoenix local index targets scenarios with heavy
>>> writes and space constraints, which is exactly our application.
>>> Observing the stack trace during the job run, we found the following info:
>>> [stack trace screenshot not preserved in the archive]
>>>
>>> We then looked at org.apache.phoenix.index.PhoenixIndexBuilder and
>>> commented out the batchStarted method. After recompiling Phoenix and
>>> restarting the cluster, our job's loading performance improved
>>> significantly. The code for the batchStarted method follows below.
>>> Here are my questions:
>>> 1. Can the committers of this code explain its concrete functionality,
>>> especially as it concerns local index data loading?
>>> 2. If we modify this code (e.g. comment out the method as we did), is
>>> there any potential impact on how Phoenix works?
>>> 3. More helpfully, can anyone share code for completing a bulk load with
>>> local indexes while the data files are stored in HDFS?
>>> I know that CsvBulkload can do index-related data upserting while the
>>> map-reduce bulkload does not support that. Maybe our job is better suited
>>> to the map-reduce bulkload? So if someone has successfully loaded data
>>> through CsvBulkload using Spark and HDFS, please give us your kind
>>> suggestions.
>>>
>>>  Best Regards,
>>> Sun
>>>
>>> /**
>>>  * Index builder for covered-columns index that ties into phoenix for
>>>  * faster use.
>>>  */
>>> public class PhoenixIndexBuilder extends CoveredColumnsIndexBuilder {
>>>
>>>     @Override
>>>     public void batchStarted(MiniBatchOperationInProgress<Mutation> miniBatchOp)
>>>             throws IOException {
>>>         // The entire purpose of this method impl is to get the existing
>>>         // rows for the table rows being indexed into the block cache, as
>>>         // the index maintenance code does a point scan per row
>>>         List<KeyRange> keys = Lists.newArrayListWithExpectedSize(miniBatchOp.size());
>>>         List<IndexMaintainer> maintainers = new ArrayList<IndexMaintainer>();
>>>         for (int i = 0; i < miniBatchOp.size(); i++) {
>>>             Mutation m = miniBatchOp.getOperation(i);
>>>             keys.add(PDataType.VARBINARY.getKeyRange(m.getRow()));
>>>             maintainers.addAll(getCodec().getIndexMaintainers(m.getAttributesMap()));
>>>         }
>>>         Scan scan = IndexManagementUtil.newLocalStateScan(maintainers);
>>>         ScanRanges scanRanges = ScanRanges.create(Collections.singletonList(keys),
>>>             SchemaUtil.VAR_BINARY_SCHEMA);
>>>         scanRanges.setScanStartStopRow(scan);
>>>         scan.setFilter(scanRanges.getSkipScanFilter());
>>>         HRegion region = this.env.getRegion();
>>>         RegionScanner scanner = region.getScanner(scan);
>>>         // Run through the scanner using internal nextRaw method
>>>         region.startRegionOperation();
>>>         try {
>>>             boolean hasMore;
>>>             do {
>>>                 List<Cell> results = Lists.newArrayList();
>>>                 // Results are potentially returned even when the return
>>>                 // value of s.next is false since this is an indication of
>>>                 // whether or not there are more values after the ones
>>>                 // returned
>>>                 hasMore = scanner.nextRaw(results);
>>>             } while (hasMore);
>>>         } finally {
>>>             try {
>>>                 scanner.close();
>>>             } finally {
>>>                 region.closeRegionOperation();
>>>             }
>>>         }
>>>     }
>>> }
>>> ------------------------------
>>>
>>>
>>>
>
