madlib-user mailing list archives

From Rahul Iyer <>
Subject Re: MADlib LDA :(
Date Fri, 04 Dec 2015 22:10:07 GMT
We have a new Apache JIRA instance:
You'll need a login for this (I suggest keeping it the same as your Apache
ID, if you have one).

Looks like we're doing an assert(num of cols == 3) - so it's because of the
additional column. IMO that's a horrible check and should be removed.
Please add an issue for this as well and I'll get rid of it.
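
For illustration, here is a minimal sketch of the kind of check that could replace a strict column-count assert: it looks for the required columns by name and ignores any extras. This is not MADlib's actual validation code; the function name and the case-insensitive comparison are assumptions made for the example.

```python
# Hypothetical helper, not part of MADlib: validate an LDA input table by
# column name instead of asserting that it has exactly three columns.
REQUIRED_COLUMNS = ("docid", "wordid", "count")


def missing_required_columns(table_columns, required=REQUIRED_COLUMNS):
    """Return the required column names absent from table_columns, in order.

    Names are compared case-insensitively (an assumption for this sketch).
    Extra columns in the table are ignored.
    """
    present = {name.lower() for name in table_columns}
    return [name for name in required if name not in present]


# A four-column table like the one described in this thread has nothing
# missing, even though a "number of columns == 3" check would reject it.
print(missing_required_columns(["docid", "wordid", "count", "word"]))

# A table that lacks wordid and count is reported by name.
print(missing_required_columns(["docid", "word"]))
```

Reporting the missing names also gives a more precise error message than "the input table does not contain docid, wordid and count columns".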

On Fri, Dec 4, 2015 at 1:55 PM, Srivatsan Ramanujam <> wrote:

> Thanks for the response Rahul.
> By JIRA, are you referring to the internal JIRA, or do we have something
> else now that the project is in Apache incubation?
> For #3, I have to check, but I had essentially created my own input table
> with 4 columns: "docid", "wordid", "count", and a fourth column "word"
> (the raw token). The "count" column was of type bigint, not int. I am not
> sure what prompted lda_train to throw an error; it said the input table
> did not contain docid, wordid and count columns. I did not check whether
> that was due to the data type mismatch of the count column or to the
> additional column I had. Can you confirm which one it is?
> On Fri, Dec 4, 2015 at 12:12 PM, Rahul Iyer <> wrote:
>> Hi Vatsan
>> Thanks for the feedback!
>> Points 1 and 2 are bugs and not design choices - the fixes are minor and
>> have been completed. Before adding them to the repo, I would prefer that
>> you create a JIRA
>> <>
>> so that we have a record of the problem.
>> Point 3: If I understand correctly, you're looking to use the LDA
>> function without providing a vocabulary table. The LDA interface was not
>> changed when we added the term_frequency function - lda_train() does not
>> require a vocab table and can be called directly using your own term
>> frequency table.
>> Note: lda_train() still has the limitation of hard-coded names for the
>> input table columns - I would recommend filing another JIRA to remove
>> that limitation.
>> Best,
>> Rahul
>> On Fri, Dec 4, 2015 at 9:46 AM, Srivatsan Ramanujam <> wrote:
>>> Not sure if we reviewed this implementation's interface before, but it
>>> has a couple of annoyances:
>>>    1. The madlib.term_frequency() function takes the docid column and
>>>    words column as inputs, but this just fools us into thinking that we
>>>    could name our columns whatever we want, because it complains if the
>>>    columns are not actually named "docid" and "words"!
>>>    2. Secondly, it takes an output table as well as an input table (ex:
>>>    documents_tf), but it creates a temp table for the vocabulary
>>>    (therefore I can't specify a schema name like vatsan.documents_tf). This is
>>>    annoying for two reasons:
>>>       1. The user can't immediately sense what's up with the vocabulary
>>>       table and why it is a temp table while the documents_tf table itself
>>>       is not.
>>>       2. If I have a real-world dataset for LDA, my models are going to
>>>       run for quite some time. I may even terminate one session and run the LDA
>>>       model in another session, which would mean the vocabulary temp table won't
>>>       be available in the other session (or would have gotten dropped).
>>>    3. Can I really create my own input table for LDA (one that has
>>>    docid, wordid, count)? If so, should I also create a vocabulary table (does
>>>    madlib look for this in the same schema as the input table)? It would be
>>>    good to provide this functionality as well, because at times we'd want to
>>>    do our own stemming/lemmatization and frequency filtering of tokens before
>>>    passing them as input to the LDA. While the current implementation is an
>>>    improvement over the previous one (where the user had to do everything from
>>>    scratch), it has introduced some inconveniences as well.
>>> Please clarify.
>>> Thanks
>>> Vatsan
>>> --
>>> ____________________________________
>>> Srivatsan Ramanujam | Data Science
>>> Pivotal HQ - Palo Alto, CA
>>> Mobile: 650-483-5630
>>> ____________________________________
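
Regarding point 3 in the quoted thread (preparing your own term frequency table after custom stemming and filtering), the preparation step could look roughly like the sketch below. It only builds (docid, wordid, count) rows and a word-to-wordid mapping in plain Python; it does not call MADlib. The function name, the sample documents, and the choice to number wordids from 0 in sorted order are assumptions made for the example, so check the MADlib LDA documentation for what lda_train() actually expects.

```python
from collections import Counter


def build_term_frequency(docs):
    """Build LDA-style term frequency rows from pre-processed tokens.

    docs maps docid -> list of tokens (already stemmed/filtered by the user).
    Returns (vocabulary, rows), where vocabulary maps word -> wordid and
    rows is a list of (docid, wordid, count) tuples.
    Assumption for this sketch: wordids are assigned from 0 in sorted order.
    """
    words = sorted({tok for tokens in docs.values() for tok in tokens})
    vocabulary = {word: wordid for wordid, word in enumerate(words)}

    rows = []
    for docid in sorted(docs):
        counts = Counter(docs[docid])
        for word in sorted(counts):
            rows.append((docid, vocabulary[word], counts[word]))
    return vocabulary, rows


# Hypothetical sample data: two tiny documents.
docs = {1: ["topic", "model", "topic"], 2: ["model", "data"]}
vocabulary, rows = build_term_frequency(docs)
for row in rows:
    print(row)
```

Keeping the vocabulary mapping in a regular (non-temp) table alongside these rows would also sidestep the session problem raised in point 2.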
