madlib-user mailing list archives

From Frank McQuillan <fmcquil...@pivotal.io>
Subject Re: MADlib LDA :(
Date Fri, 08 Jan 2016 19:59:50 GMT
So to close on this thread:

MADlib LDA term_frequency function bugs
https://issues.apache.org/jira/browse/MADLIB-933
is a valid request to remove the hard-coded column names and to make the
vocabulary table a regular table instead of a temp table.

https://issues.apache.org/jira/browse/MADLIB-934
is marked as "won't fix", since INT4 is used by design for memory management
reasons.

On Fri, Dec 4, 2015 at 2:58 PM, Srivatsan Ramanujam <sramanujam@pivotal.io>
wrote:

> Hi Rahul,
> I've updated the second ticket as discussed:
> https://issues.apache.org/jira/browse/MADLIB-934
>
> Thanks
>
>
> On Fri, Dec 4, 2015 at 2:25 PM, Srivatsan Ramanujam <sramanujam@pivotal.io>
> wrote:
>
>> Great, submitted https://issues.apache.org/jira/browse/MADLIB-934 and
>> https://issues.apache.org/jira/browse/MADLIB-933
>>
>> On Fri, Dec 4, 2015 at 2:10 PM, Rahul Iyer <riyer@pivotal.io> wrote:
>>
>>> We have a new Apache JIRA instance:
>>> https://issues.apache.org/jira/browse/MADLIB/
>>> You'll need a login for this (suggest you keep this same as apache id,
>>> if you have one).
>>>
>>> Looks like we're doing an assert(num of cols == 3) - so it's because of
>>> the additional column. IMO that's a horrible check and should be removed.
>>> Please add an issue for this as well and I'll get rid of it.
>>>
>>> On Fri, Dec 4, 2015 at 1:55 PM, Srivatsan Ramanujam <
>>> sramanujam@pivotal.io> wrote:
>>>
>>>> Thanks for the response Rahul.
>>>>
>>>> By JIRA, are you referring to the internal JIRA, or do we have something
>>>> else now that it's in Apache incubation?
>>>>
>>>> For #3, I have to check, but I had essentially created my own input
>>>> table with 4 columns: "docid", "wordid", "count" and a fourth
>>>> column "word" (corresponding to the raw token). Of these, the type of the
>>>> "count" column was bigint and not int. I am not sure what prompted the
>>>> lda_train function to throw an error; it said the input table did not
>>>> contain docid, wordid and count columns. I did not check whether it was
>>>> because of the data type mismatch of the count column or because of
>>>> the additional column I had. Can you confirm which one it is?
>>>>
>>>>
>>>> On Fri, Dec 4, 2015 at 12:12 PM, Rahul Iyer <riyer@pivotal.io> wrote:
>>>>
>>>>> Hi Vatsan
>>>>>
>>>>> Thanks for the feedback!
>>>>>
>>>>> Points 1 and 2 are bugs and not design choices - the fixes are minor
>>>>> and have been completed. Before adding them to the repo, I would prefer
>>>>> that you create a JIRA
>>>>> <https://issues.apache.org/jira/browse/MADLIB/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel>
>>>>> so that we have a record of the problem.
>>>>>
>>>>> Point 3: If I understand correctly, you're looking to use the LDA
>>>>> function without providing a vocabulary table. The LDA interface was not
>>>>> changed when we added the term_frequency function - lda_train() does not
>>>>> require a vocab table and can be called directly using your own term
>>>>> frequency table.
>>>>>
>>>>> Note: lda_train() still has a limitation of hard-coded names for the
>>>>> input table columns - I would recommend adding another JIRA to remove
>>>>> that limitation.
>>>>>
>>>>> Best,
>>>>> Rahul
>>>>>
>>>>>
>>>>> On Fri, Dec 4, 2015 at 9:46 AM, Srivatsan Ramanujam <
>>>>> sramanujam@pivotal.io> wrote:
>>>>>
>>>>>> Not sure if we reviewed this implementation's interface before, but it
>>>>>> has a couple of annoyances:
>>>>>>
>>>>>>    1. madlib.term_frequency() function (
>>>>>>    http://doc.madlib.net/latest/group__grp__text__utilities.html)
>>>>>>    takes the docid column and words column as inputs, but this just fools
>>>>>>    us into thinking that we can name our columns whatever we want, because
>>>>>>    it complains if the columns are not actually named "docid" and "words"!
>>>>>>    2. Secondly, it takes an output table as well as input (ex:
>>>>>>    documents_tf), but it creates a temp table for the vocabulary
>>>>>>    (therefore I can't specify a schema name like vatsan.documents_tf).
>>>>>>    This is annoying for two reasons:
>>>>>>       1. The user can't immediately sense what's with the
>>>>>>       vocabulary table and why it is a temp table while the documents_tf
>>>>>>       table itself is not.
>>>>>>       2. If I have a real-world dataset for LDA, my models are going
>>>>>>       to run for quite some time. I may even terminate one session and
>>>>>>       run the LDA model in another session; this would mean the vocabulary
>>>>>>       temp table won't be available in the other session (or would have
>>>>>>       gotten dropped).
>>>>>>    3. Can I really create my own input table for LDA (one that has
>>>>>>    docid, wordid, count)? If so, should I also create a vocabulary table
>>>>>>    (does madlib look for this in the same schema as the input table)? It
>>>>>>    would be good to provide this functionality as well, because at times
>>>>>>    we'd want to do our own stemming/lemmatization and frequency filtering
>>>>>>    of tokens before passing it as input to the LDA. While the current
>>>>>>    implementation is an improvement over the previous one (where the user
>>>>>>    had to do everything from scratch), it has introduced some
>>>>>>    inconveniences as well.
>>>>>>
>>>>>> Please clarify.
>>>>>>
>>>>>> Thanks
>>>>>> Vatsan
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> ____________________________________
>>>>>>
>>>>>> Srivatsan Ramanujam | Data Science
>>>>>> Pivotal HQ - Palo Alto, CA
>>>>>> Mobile: 650-483-5630
>>>>>> ____________________________________
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>
>
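
For anyone preparing their own term frequency table as discussed in point 3 above, here is a minimal sketch of the (docid, wordid, count) shape in plain Python. It does not call MADlib at all. The helper name build_term_frequency, the zero-based word ids, and the 32-bit bound check are assumptions made for illustration, not something confirmed in this thread; check the MADlib docs for what lda_train() actually expects.

```python
from collections import Counter

# Largest value a 32-bit signed integer (INT4) can hold. Checking counts
# against it is an assumption based on the INT4 remark in this thread.
INT4_MAX = 2**31 - 1


def build_term_frequency(docs):
    """Build a vocabulary and (docid, wordid, count) rows.

    docs maps a docid to a list of tokens that the caller has already
    stemmed and filtered. This is a hypothetical helper, not a MADlib API.
    """
    # Assign word ids in sorted order so the mapping is reproducible.
    words = sorted({tok for tokens in docs.values() for tok in tokens})
    vocabulary = {word: wordid for wordid, word in enumerate(words)}

    rows = []
    for docid in sorted(docs):
        counts = Counter(docs[docid])
        for word in sorted(counts):
            count = counts[word]
            if count > INT4_MAX:
                raise ValueError("count does not fit in a 32-bit integer")
            rows.append((docid, vocabulary[word], count))
    return vocabulary, rows


docs = {
    1: ["topic", "model", "topic"],
    2: ["model", "table"],
}
vocabulary, rows = build_term_frequency(docs)
for row in rows:
    print(row)
```

Keeping the vocabulary mapping alongside the rows (and loading both into regular, non-temp tables) avoids the cross-session problem raised in point 2.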
