phoenix-user mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: Question about IndexTool
Date Wed, 16 Sep 2015 11:40:05 GMT
The call to

    HFileOutputFormat.configureIncrementalLoad(job, htable)

in IndexTool configures the job to use a Reducer that performs the
sorting of the KeyValues.
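
For reference, here's roughly what that call wires up, sketched out (not
the exact HBase code -- class locations and details vary by version, so
treat this as an approximation):

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    public class IncrementalLoadSketch {
        // Approximately what configureIncrementalLoad(job, htable) does:
        static void configureSorting(Job job) {
            job.setOutputKeyClass(ImmutableBytesWritable.class);
            job.setOutputValueClass(KeyValue.class);
            // Sorts the KeyValues for each row before the HFile writer sees them
            job.setReducerClass(KeyValueSortReducer.class);
            // Routes row keys by the table's region boundaries, so the shuffle's
            // sort-by-key yields a total order across all reducers
            job.setPartitionerClass(TotalOrderPartitioner.class);
            // configureIncrementalLoad also writes a partitions file computed
            // from the table's region start keys and sets the number of
            // reducers to the region count.
        }
    }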

The KeyValues written to an HFile do indeed need to be sorted, so I
would guess that you'll need to implement the equivalent of a reducer
with total-order sorting in Spark in order to accomplish the same
thing.
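
In Spark's Java API that might look something like this sketch
(hypothetical names; it assumes one KeyValue per record and that conf
already carries the HFile output settings that configureIncrementalLoad
would normally apply):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
    import org.apache.spark.api.java.JavaPairRDD;

    public class SparkHFileSketch {
        static void writeHFiles(JavaPairRDD<ImmutableBytesWritable, KeyValue> keyValues,
                                Configuration conf, String outputPath) {
            // Note: ImmutableBytesWritable isn't java.io.Serializable, so in
            // practice you'd register it with Kryo for the shuffle. KeyValues
            // sharing a row key would additionally need to be in
            // family/qualifier order before reaching the HFile writer.
            keyValues
                // Total-order sort by row key across all partitions -- the Spark
                // equivalent of the MR shuffle-and-sort the reducer relies on
                .sortByKey(true)
                .saveAsNewAPIHadoopFile(
                    outputPath,
                    ImmutableBytesWritable.class,
                    KeyValue.class,
                    HFileOutputFormat2.class,
                    conf);
        }
    }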

- Gabriel


On Wed, Sep 16, 2015 at 10:32 AM, Yiannis Gkoufas <johngouf85@gmail.com> wrote:
> Hi Gabriel,
>
> thanks a lot for the reply. I noticed myself afterwards that it rolls back
> every upsert after extracting the KeyValues.
> Basically I am trying to replicate the same job in Spark, and I cannot
> work out where in the existing IndexTool source code it is guaranteed
> that the row keys written to the HFiles are in the correct order.
> I have been getting the error "Added a key not lexically larger than
> previous key".
>
> Thanks a lot!
>
>
> On 15 September 2015 at 19:46, Gabriel Reid <gabriel.reid@gmail.com> wrote:
>>
>> The upsert statements in the MR jobs are used to convert data into the
>> appropriate encoding for writing to an HFile -- the data doesn't actually
>> get pushed to Phoenix from within the MR job. Instead, the created KeyValues
>> are extracted from the "output" of the upsert statement, and the statement
>> is rolled back within the MR job. The extracted KeyValues are then written
>> to the HFile.
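>>
>> A sketch of that extract-and-rollback pattern (hypothetical table and
>> values; PhoenixRuntime.getUncommittedDataIterator is the relevant hook):
>>
>>     import java.sql.Connection;
>>     import java.sql.DriverManager;
>>     import java.util.Iterator;
>>     import java.util.List;
>>     import org.apache.hadoop.hbase.KeyValue;
>>     import org.apache.hadoop.hbase.util.Pair;
>>     import org.apache.phoenix.util.PhoenixRuntime;
>>
>>     public class ExtractKeyValuesSketch {
>>         public static void main(String[] args) throws Exception {
>>             Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
>>             conn.setAutoCommit(false); // keep the mutations client-side
>>             conn.createStatement().executeUpdate(
>>                 "UPSERT INTO T (ID, V) VALUES (1, 'a')");
>>             // Pull the KeyValues Phoenix built for the upsert, without committing
>>             Iterator<Pair<byte[], List<KeyValue>>> it =
>>                 PhoenixRuntime.getUncommittedDataIterator(conn);
>>             while (it.hasNext()) {
>>                 for (KeyValue kv : it.next().getSecond()) {
>>                     // in the mapper these would go to context.write(...)
>>                     System.out.println(kv);
>>                 }
>>             }
>>             conn.rollback(); // discard the uncommitted mutations
>>             conn.close();
>>         }
>>     }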
>>
>> - Gabriel
>>
>> On Tue, Sep 15, 2015 at 2:12 PM Yiannis Gkoufas <johngouf85@gmail.com>
>> wrote:
>>>
>>> Hi there,
>>>
>>> I was going through the code related to index creation via MapReduce job
>>> (IndexTool) and I have some questions.
>>> If I am not mistaken, for a global secondary index Phoenix creates a new
>>> HBase table whose row key is built from the value of the column you want
>>> to index (from the original table), and which stores the column values
>>> listed in your INCLUDE clause.
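>>> For example (hypothetical table and index names, Phoenix via JDBC):
>>>
>>>     import java.sql.Connection;
>>>     import java.sql.DriverManager;
>>>
>>>     public class CreateIndexExample {
>>>         public static void main(String[] args) throws Exception {
>>>             try (Connection conn =
>>>                     DriverManager.getConnection("jdbc:phoenix:localhost")) {
>>>                 conn.createStatement().execute(
>>>                     "CREATE TABLE IF NOT EXISTS METRICS ("
>>>                     + "HOST VARCHAR PRIMARY KEY, CPU DOUBLE, MEM DOUBLE)");
>>>                 // Global index keyed on CPU (plus the original row key),
>>>                 // with MEM carried along as a covered (INCLUDE) column
>>>                 conn.createStatement().execute(
>>>                     "CREATE INDEX METRICS_IDX ON METRICS (CPU) INCLUDE (MEM)");
>>>             }
>>>         }
>>>     }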
>>> In PhoenixIndexImportMapper I can see that an UPSERT statement is
>>> executed, but HFiles are also written.
>>> My question is the following: why is the UPSERT statement needed if the
>>> table containing the secondary index will be populated from the HFiles
>>> that are written?
>>>
>>> Thanks a lot
>
>
