phoenix-user mailing list archives

From James Taylor <jamestay...@apache.org>
Subject Re: MapReduce bulk load into Phoenix table
Date Tue, 13 Jan 2015 18:22:47 GMT
Hi Constantin,
1000-1500 rows per sec? Using our performance.py script, on my Mac
laptop, I'm seeing 27,000 rows per sec (Phoenix 4.2.2 with HBase
0.98.9).
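For reference, a typical invocation of that script (ZooKeeper on
localhost and 100,000 rows here are only example arguments) is
something like:

    bin/performance.py localhost 100000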

If you want to realistically measure performance, I'd recommend doing
so on a real cluster. If you'll really only have a single machine,
then you're probably better off using something like MySQL. Using the
map-reduce based CSV loader on a single node is not going to speed
anything up. For a cluster it can make a difference, though. See
http://phoenix.apache.org/phoenix_mr.html
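A typical invocation of that loader (the table name, input path and
client jar version below are placeholders) looks roughly like:

    hadoop jar phoenix-4.2.2-client.jar \
        org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        --table EXAMPLE --input /data/example.csv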

FYI, Phoenix indexes are only maintained if you go through Phoenix APIs.

Thanks,
James


On Tue, Jan 13, 2015 at 2:45 AM, Vaclav Loffelmann
<vaclav.loffelmann@socialbakers.com> wrote:
> I think the easiest way to determine whether indexes are maintained
> when inserting directly into HBase is to test it. If index maintenance
> is done by region observer coprocessors, it should work. (I'll run
> some tests as soon as I have time.)
>
> I don't see any problem with different columns across rows. Define
> the view the same way you would define the table. Null values are not
> stored in HBase, hence there's no overhead.
>
> I'm afraid there isn't any (publicly available) code showing how to do
> that, but it is very straightforward.
> If you use a composite primary key, concatenate the results of the
> PDataType.TYPE.toBytes() calls to form the rowkey. Use the same logic
> for the values. The data types are defined as enum constants in
> org.apache.phoenix.schema.PDataType.
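> A rough sketch of that rowkey construction (a hypothetical composite
> key of VARCHAR + UNSIGNED_LONG, Phoenix 4.x API):
>
>     import org.apache.hadoop.hbase.client.Put;
>     import org.apache.hadoop.hbase.util.Bytes;
>     import org.apache.phoenix.query.QueryConstants;
>     import org.apache.phoenix.schema.PDataType;
>     import org.apache.phoenix.util.ByteUtil;
>
>     // Hypothetical composite PK: (HOST VARCHAR, EVENT_TIME UNSIGNED_LONG)
>     byte[] host = PDataType.VARCHAR.toBytes("host-1");
>     byte[] eventTime = PDataType.UNSIGNED_LONG.toBytes(1421172167000L);
>     // A variable-length leading key column is followed by a zero-byte
>     // separator before the next key column.
>     byte[] rowKey = ByteUtil.concat(host,
>         QueryConstants.SEPARATOR_BYTE_ARRAY, eventTime);
>
>     // Values use the same serialization; "0" is Phoenix's default
>     // column family, "COUNTRY" a hypothetical column qualifier.
>     Put put = new Put(rowKey);
>     put.add(Bytes.toBytes("0"), Bytes.toBytes("COUNTRY"),
>         PDataType.VARCHAR.toBytes("DE"));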
>
> Good luck,
> Vaclav;
>
> On 01/13/2015 10:58 AM, Ciureanu, Constantin (GfK) wrote:
>> Thank you Vaclav,
>>
>> I have just started today to write some code :) for an MR job that
>> will load data into HBase + Phoenix. Previously I wrote an
>> application that loads data using Phoenix JDBC (slow), but I also
>> have experience with HBase, so I can understand and write code to
>> load data directly there.
>>
>> If doing so, I'm also worried about:
>> - maintaining (some existing) Phoenix indexes (if any) - perhaps this
>> still works if the (same) coprocessors trigger at insert time, but I
>> don't know how it works behind the scenes.
>> - having a Phoenix view around the HBase table would "solve" the
>> above problem (so there's no index whatsoever) but would create a lot
>> of other problems (my table has a limited number of common columns
>> and the rest differ too much from row to row - in total I have
>> hundreds of possible columns).
>>
>> So - to make things faster for me - is there any good piece of code
>> on the internet showing how to map my data types to Phoenix data
>> types and use the results for a regular HBase bulk load?
>>
>> Regards, Constantin
>>
>> -----Original Message----- From: Vaclav Loffelmann
>> [mailto:vaclav.loffelmann@socialbakers.com] Sent: Tuesday, January
>> 13, 2015 10:30 AM To: user@phoenix.apache.org Subject: Re:
>> MapReduce bulk load into Phoenix table
>>
>> Hi, our daily usage is to import raw data directly into HBase, but
>> mapped to Phoenix data types. For querying we use a Phoenix view on
>> top of that HBase table.
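>> Such a view is declared with column-family-qualified columns on top
>> of the existing table; a minimal sketch (table and column names are
>> hypothetical) looks like:
>>
>>     CREATE VIEW "events" (
>>         pk VARCHAR PRIMARY KEY,
>>         "d"."country" VARCHAR,
>>         "d"."clicks" UNSIGNED_LONG
>>     );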
>>
>> Then you should hit the bottleneck of HBase itself. It should be
>> 10 to 30+ times faster than your current solution, depending on
>> hardware of course.
>>
>> I'd prefer this solution for stream writes.
>>
>> Vaclav
>>
>> On 01/13/2015 10:12 AM, Ciureanu, Constantin (GfK) wrote:
>>> Hello all,
>>
>>> (Due to the slow speed of Phoenix JDBC - ~1000-1500 rows/sec on a
>>> single machine) I am also reading up on loading data into Phoenix
>>> via MapReduce.
>>
>>> So far I understood that the Key + List<KeyValue> pairs to be
>>> inserted into the HBase table are obtained via a "dummy" Phoenix
>>> connection - those rows are then written to HFiles, and after the
>>> MR job finishes the HFiles are bulk loaded into HBase as usual.
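>>> As far as I can tell, the core of that trick (Phoenix 4.x; the
>>> table and values here are only an example) is roughly:
>>>
>>>     import java.sql.Connection;
>>>     import java.sql.DriverManager;
>>>     import java.util.Iterator;
>>>     import java.util.List;
>>>     import org.apache.hadoop.hbase.KeyValue;
>>>     import org.apache.hadoop.hbase.util.Pair;
>>>     import org.apache.phoenix.util.PhoenixRuntime;
>>>
>>>     Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
>>>     conn.setAutoCommit(false);
>>>     conn.createStatement().executeUpdate(
>>>         "UPSERT INTO EXAMPLE (ID, NAME) VALUES (1, 'foo')");
>>>     // Don't commit; instead drain the pending mutations as KeyValues,
>>>     // which the MR job then writes out as HFiles.
>>>     Iterator<Pair<byte[], List<KeyValue>>> it =
>>>         PhoenixRuntime.getUncommittedDataIterator(conn);
>>>     while (it.hasNext()) {
>>>         Pair<byte[], List<KeyValue>> kvs = it.next();
>>>         // kvs.getFirst() is the table name, kvs.getSecond() the KeyValues
>>>     }
>>>     conn.rollback();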
>>
>>> My question: is there any better / faster approach? I assume this
>>> cannot reach the maximum possible speed for loading data into a
>>> Phoenix / HBase table.
>>
>>> Also I would like to find a better / newer sample code than this one:
>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.phoenix/phoenix/4.0.0-incubating/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java#CsvToKeyValueMapper.loadPreUpsertProcessor%28org.apache.hadoop.conf.Configuration%29
>>
>>> Thank you, Constantin
>>
>>
