phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From James Taylor <jamestay...@apache.org>
Subject Re: create schema on write
Date Fri, 03 Jun 2016 16:28:27 GMT
Sure, Steve. Let's say you have a base table you define for your JSON
object like this:

-- Create sequence that doles out IDs when a new JSON type
-- is processed
CREATE SEQUENCE JSON_ID_SEQ;

-- Create base table for JSON
CREATE TABLE json (json_id BIGINT PRIMARY KEY,
    created_date DATE
    raw_json VARCHAR)
    APPEND_ONLY_SCHEMA = true,
    AUTO_PARTITION_SEQ = 'JSON_ID_SEQ';

Then as each json object is processed, you'd pull out the object name/type
from the message, each field (as you're doing now), and dynamically create
a view with the fields  as columns:

CREATE VIEW IF NOT EXISTS <json_type_name> (
    <field 1> <field 1 type>, ...)
    AS SELECT * FROM json
    APPEND_ONLY_SCHEMA = true;

This would then assign a json_id when a new json_type_name is encountered
and otherwise add any new columns to the json_type_name VIEW. By declaring
it as APPEND_ONLY_SCHEMA we can optimize the RPC traffic by doing no RPCs
if all field names are already know by the metadata cached on the client,
and otherwise doing a single RPC to create or alter the view with any new
field names.

The nice thing about this too is that the metadata is introspectable
through the standard JDBC metadata APIs - you can look up the fields and
their types based on the <json_type_name>.

Note that the APPEND_ONLY_SCHEMA and AUTO_PARTITION_SEQ and 4.8 features,
but you can do the same pre-4.8 with a few more RPCs by catching the
TableAlreadyExistsException and adding each field through ALTER VIEW calls.

This is how we've modeled time-series data in support of Argus[1], not as
JSON in this case, but as tags and a metric value.

HTH.  Thanks,
James

[1] https://github.com/SalesforceEng/Argus

On Fri, Jun 3, 2016 at 8:41 AM, Steve Terrell <sterrell@oculus360.us> wrote:

> James, I don't understand how to use IF NOT EXISTS to track the dynamic
> columns.  Could you elaborate?  Thanks!
>
> On Fri, Jun 3, 2016 at 10:36 AM, James Taylor <jamestaylor@apache.org>
> wrote:
>
>> That's pretty slick, Steve. Another variant along the same lines would be
>> to create/alter a view over a base table where you use the IF NOT EXISTS.
>> In this way, you let Phoenix track the dynamic columns for you. We have a
>> new feature in 4.8 for declaring a table as an APPEND_ONLY_SCHEMA and we'll
>> minimize the RPCs for ensuring the meta data is up to date. It works well
>> for dynamically tracking schema as data is processed, a pretty common
>> pattern.
>>
>> Thanks,
>> James
>>
>>
>> On Friday, June 3, 2016, Steve Terrell <sterrell@oculus360.us> wrote:
>>
>>> I have a similar situation.  I have records with varying fields that I
>>> wanted to access individually and also as a group.
>>>
>>> My actual records are JSON objects, so they look like like this:
>>> {"field1": value1, "field2": value2, …}
>>> To make matter harder, the fields are also varying types:  ints,
>>> strings, boolean, floats.
>>>
>>> What I ended up doing that has worked really well is make each field in
>>> the JSON record a dynamic field in a Phoenix table.  And, to preserve the
>>> integrity of the original record, I have one static string field in which
>>> the entire original JSON goes into.
>>>
>>> Because of the dynamic fields, I have to execute an UPSERT for each row
>>> instead of some kind of batch loading.  But, I routinely load millions of
>>> rows like this, sometimes with 20+ fields, and it has never failed on an
>>> upsert and runs fast enough.
>>>
>>> A few of the fields are actually always in each record, so I use those
>>> for my primary keys,
>>>
>>> The end result is great!  I can query by any field, quickly pull entire
>>> records from the key fields, and any JSON object can be loaded.
>>>
>>> In my code that translates the JSON into UPSERT statements I also keep a
>>> running tab of field names, their type, and number of occurrences.  Then a
>>> the end of my millions of UPSERTS I populate a simple table that just lists
>>> all the fields/types/counts so that there is no mystery about what fields
>>> are available when I want to query the big table.  I did this because I
>>> don't always know what field names are in the JSON, and I don't know how to
>>> get a list of dynamic field names from Phoenix SQL.
>>>
>>> Hope this helps,
>>>     Steve
>>>
>>> On Thu, Jun 2, 2016 at 9:28 PM, Stephen Wilcoxon <wilcoxon@gmail.com>
>>> wrote:
>>>
>>>> The simple solution is to transform the multi-valued field into a
>>>> separate linked table with multiple rows per "id".  If this is what you
>>>> mean by "exploding the data", why do you not want to do so?
>>>>
>>>> On Thu, Jun 2, 2016 at 7:11 PM, Rahul Jain <rjain7@gmail.com> wrote:
>>>>
>>>>>
>>>>> Folks,
>>>>>
>>>>> I used this question earlier on the wrong list; posting on apache
>>>>> phoenix user group on the advise of James Taylor:
>>>>>
>>>>> We have a use case of migration from legacy database to hbase/phoenix.
>>>>> I'd want to hear your thoughts on how to manage the  given multi-
>>>>> valued field:
>>>>>
>>>>> Field name : namesList
>>>>>
>>>>> Contains: comma separated list of strings.
>>>>>
>>>>>  e.g. "Jack, Richard, Fred"
>>>>>
>>>>> The field does not have an inherent limit on the number of strings,
>>>>> however for successful migration we need to support unto 100 strings
in the
>>>>> field
>>>>>
>>>>> This field is currently indexed in our legacy DB technology where the
>>>>> user is able to query by any of the contained strings.
>>>>>
>>>>> e.g. query:
>>>>>
>>>>> select * from myTable where namesList LIKE '%Fred%';
>>>>>
>>>>> Is there any way Phoenix supports generating multiple index values from
>>>>> a single column in a row ? We are looking for a method where we don't
>>>>> explicitly specify individual offsets for indexing, (nameList[0],
>>>>> nameList[1] etc) , rather say a custom functional index equivalent
>>>>> generates the indexes as a list and each of the index in the list is
>>>>> persisted individually.
>>>>>
>>>>> Effectively our goal is to avoid full table scans when doing queries
>>>>> that match values from multi valued fields such as example above. We
>>>>> do want to avoid exploding the data however possible.
>>>>>
>>>>> Any comments / suggestions are welcome.
>>>>>
>>>>> Thank you !
>>>>>
>>>>> Rahul
>>>>>
>>>>
>>>>
>>>
>

Mime
View raw message