madlib-user mailing list archives

From Rahul Iyer <ri...@pivotal.io>
Subject Re: MADlib LDA :(
Date Fri, 04 Dec 2015 20:12:15 GMT
Hi Vatsan

Thanks for the feedback!

Points 1 and 2 are bugs, not design choices. The fixes are minor and have
been completed. Before adding them to the repo, I would prefer that you
create a JIRA
<https://issues.apache.org/jira/browse/MADLIB/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel>
so that we have a record of the problem.

Point 3: If I understand correctly, you're looking to use the LDA function
without providing a vocabulary table. The LDA interface was not changed
when we added the term_frequency function - lda_train() does not require a
vocab table and can be called directly using your own term frequency table.
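
As a rough illustration of what "your own term frequency table" could look
like, here is a minimal Python sketch that turns already-stemmed tokens into
(docid, wordid, count) rows plus a vocabulary you can persist yourself. The
column layout is the one named in the question below; the function name
build_lda_input, the min_count filter, and wordids starting at 0 are
assumptions made for this example, not part of MADlib.

```python
from collections import Counter


def build_lda_input(docs, min_count=1):
    """Build a vocabulary and (docid, wordid, count) rows.

    docs maps an integer docid to a list of already-stemmed tokens.
    Hypothetical helper for illustration only; it is not a MADlib function.
    """
    # Corpus-wide token counts, used for frequency filtering.
    totals = Counter(tok for toks in docs.values() for tok in toks)
    kept = sorted(tok for tok, n in totals.items() if n >= min_count)

    # Assumption: wordids are consecutive integers starting at 0.
    vocab = {tok: wordid for wordid, tok in enumerate(kept)}

    rows = []
    for docid in sorted(docs):
        # Per-document counts, restricted to tokens that survived filtering.
        counts = Counter(tok for tok in docs[docid] if tok in vocab)
        for tok in sorted(counts):
            rows.append((docid, vocab[tok], counts[tok]))
    return vocab, rows


docs = {0: ["topic", "model", "topic"], 1: ["model", "table"]}
vocab, rows = build_lda_input(docs)
```

Loading the rows into a regular (non-temp) table keeps both the term
frequencies and the vocabulary available across sessions.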

Note: lda_train() still has a limitation of hard-coded names for the input
table columns. I would recommend filing another JIRA to remove that
limitation.

Best,
Rahul


On Fri, Dec 4, 2015 at 9:46 AM, Srivatsan Ramanujam <sramanujam@pivotal.io>
wrote:

> Not sure if we reviewed this implementation's interface before, but it has
> a couple of annoyances:
>
>    1. madlib.term_frequency() function (
>    http://doc.madlib.net/latest/group__grp__text__utilities.html) takes
>    the docid column and words columns as inputs, but this just fools us into
>    thinking that we could name our columns as whatever we want, coz it
>    complains if the columns are not actually named "docid" and "words"!
>    2. Secondly, it takes an output table as well as input (ex:
>    documents_tf), but it creates a temp table for the vocabulary
>    (therefore i can't specify a schema name like vatsan.documents_tf). This is
>    annoying for two reasons:
>       1. The user can't immediately sense what the vocabulary table is
>       for, or why it is a temp table while the documents_tf table itself is not.
>       2. If i have a real world dataset for LDA, my models are going to
>       run for quite some time. I may even terminate one session and run the LDA
>       model in another session, which would mean the vocabulary temp table won't
>       be available in the other session (or would have gotten dropped)
>    3. Can i really create my own input table for LDA (one that has docid,
>    wordid, count)? If so, should i also create a vocabulary table (does madlib
>    look for this in the same schema as the input table)? It would be good to
>    provide this functionality as well, because at times we'd want to do our
>    own stemming/lemmatization and frequency filtering of tokens, before
>    passing it as input to the LDA. While the current implementation is an
>    improvement over the previous one (where the user had to do everything from
>    scratch), it has introduced some inconveniences as well.
>
> Please clarify.
>
> Thanks
> Vatsan
>
>
> --
>
> ____________________________________
>
> Srivatsan Ramanujam | Data Science
> Pivotal HQ - Palo Alto, CA
> Mobile: 650-483-5630
> ____________________________________
>
