madlib-user mailing list archives

From Markus Paaso <markus.pa...@gmail.com>
Subject Re: LDA output format
Date Sat, 23 Sep 2017 14:24:24 GMT
Thanks Frank!

I will keep track of this issue.


Regards, Markus

2017-09-22 23:32 GMT+03:00 Frank McQuillan <fmcquillan@pivotal.io>:

> Markus,
>
> Thanks for the extra explanation.
>
> I agree that having topic_assignment as a dense vector but words as a
> sparse vector is not good practice; I am not sure why it was done that
> way in the first place.  They should be consistent.
>
> The reason sparse vectors are used, I believe, is to keep the LDA model
> state as small as possible, since it is possible to hit the 1 GB PostgreSQL
> field limit in certain real-world use cases.
>
> So I think what you are doing seems reasonable.  I don’t have a simpler
> way to do it at this point, given how LDA is currently built.
>
> I have updated this JIRA
> https://issues.apache.org/jira/browse/MADLIB-1160
> proposing:
>
> 1) use consistent indexes (start at 1)
> 2) use consistent array types (sparse)
>
> which hopefully will help some.
>
> Frank
>
>
> On Fri, Sep 22, 2017 at 12:18 AM, Markus Paaso <markus.paaso@gmail.com>
> wrote:
>
>> Hi Frank,
>>
>> Thanks for the response.
>> The solution you suggest doesn't fully answer my question.
>>
>> I don't want to join the input table with the top-k wordids from each
>> topic, as this is not what lda_predict is doing. Am I right?
>> I want to explain the results of madlib.lda_predict exactly.
>> I want to explain why lda_predict gives some topic distribution for some
>> set of wordids.
>>
>> I understand your approach but unfortunately it doesn't fulfill my needs.
>> I have found a solution for my needs and was just asking if you know a
>> better way to handle the results of lda_predict (or the data in
>> *output_data_table *of lda_train).
>>
>> So my question was more about how to construct the wordid array from the
>> 'counts' and 'words' columns of the madlib.lda_predict results, or e.g.
>> 'my_outdata' in your example.
>> Now I am doing it with (counts::text || ':' ||
>> words::text)::madlib.svec::float[]
>>
>> The thing I don't understand about the *output_data_table* format is why
>> *topic_assignment* is a dense vector but *words* is a sparse vector.
>> I would suggest changing the *words* format into a dense vector, to
>> remove the need to play with the madlib.svec object when explaining the
>> lda_predict results exactly.
>> The *counts* column would then no longer be needed in *output_data_table*.
>>
>>
>> Regards,
>> Markus Paaso
>>
>>
>> 2017-09-21 22:25 GMT+03:00 Frank McQuillan <fmcquillan@pivotal.io>:
>>
>>> Markus,
>>>
>>> Sorry for the delay in response.  Below is another approach that
>>> involves a JOIN between term frequency and an LDA output table.  It is a
>>> bit simpler at least for me.
>>>
>>> Regarding the indexing, you are correct that it is inconsistent.  It
>>> seems the term frequency module uses 0-based indexing (e.g., docid)
>>> whereas LDA uses 1-based indexing (e.g., topicid).  I created a JIRA to
>>> address this
>>> https://issues.apache.org/jira/browse/MADLIB-1160
>>> Probably the best way is to standardize on 1-based indexing.
>>>
>>> ----
>>>
>>> 1) Create data set
>>>
>>> DROP TABLE IF EXISTS documents;
>>> CREATE TABLE documents(docid INT4, contents TEXT);
>>> INSERT INTO documents VALUES
>>> (0, 'Statistical topic models are a class of Bayesian latent variable
>>> models, originally developed for analyzing the semantic content of large
>>> document corpora.'),
>>> (1, 'By the late 1960s, the balance between pitching and hitting had
>>> swung in favor of the pitchers. In 1968 Carl Yastrzemski won the American
>>> League batting title with an average of just .301, the lowest in history.'),
>>> (2, 'Machine learning is closely related to and often overlaps with
>>> computational statistics; a discipline that also specializes in
>>> prediction-making. It has strong ties to mathematical optimization, which
>>> deliver methods, theory and application domains to the field.'),
>>> (3, 'California''s diverse geography ranges from the Sierra Nevada in
>>> the east to the Pacific Coast in the west, from the Redwood–Douglas fir
>>> forests of the northwest, to the Mojave Desert areas in the southeast. The
>>> center of the state is dominated by the Central Valley, a major
>>> agricultural area. ');
>>>
>>>
>>> 2) Convert docs to a list of words
>>>
>>> ALTER TABLE documents ADD COLUMN words TEXT[];
>>> UPDATE documents SET words = regexp_split_to_array(lower(contents),
>>> E'[\\s+\\.\\,]');
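As a side note, that character class splits on every single whitespace character, `+`, `.`, or `,`, so consecutive delimiters produce empty-string tokens, which plausibly explains why the term frequency output assigns a wordid to the empty string. A small illustrative sketch in Python (not MADlib code) mirroring the split:

```python
import re

# First document from the example data set above.
text = ('Statistical topic models are a class of Bayesian latent variable '
        'models, originally developed for analyzing the semantic content of '
        'large document corpora.')

# Same character class as the SQL regexp_split_to_array call:
# split on any single whitespace character, '+', '.', or ','.
tokens = re.split(r'[\s+.,]', text.lower())

print(tokens[:6])  # -> ['statistical', 'topic', 'models', 'are', 'a', 'class']
print('' in tokens)  # consecutive delimiters (e.g. ", ") yield empty tokens
```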
>>>
>>>
>>> 3) Term frequency
>>>
>>> DROP TABLE IF EXISTS my_training, my_training_vocabulary;
>>> SELECT madlib.term_frequency('documents', 'docid', 'words',
>>> 'my_training', TRUE);
SELECT * FROM my_training ORDER BY docid LIMIT 20;
>>>  docid | wordid | count
>>> -------+--------+-------
>>>      0 |     57 |     1
>>>      0 |     86 |     1
>>>      0 |      4 |     1
>>>      0 |     55 |     1
>>>      0 |     69 |     2
>>>      0 |     81 |     1
>>>      0 |     30 |     1
>>>      0 |     33 |     1
>>>      0 |     36 |     1
>>>      0 |     43 |     1
>>>      0 |     25 |     1
>>>      0 |     65 |     2
>>>      0 |     72 |     1
>>>      0 |      9 |     1
>>>      0 |      0 |     2
>>>      0 |     29 |     1
>>>      0 |     18 |     1
>>>      0 |     12 |     1
>>>      0 |     96 |     1
>>>      0 |     91 |     1
>>> (20 rows)
>>>
>>>
>>> 4) Run LDA
>>>
>>> DROP TABLE IF EXISTS my_model, my_outdata;
>>> SELECT madlib.lda_train( 'my_training',   -- Data table
>>>                          'my_model',      -- Model table
>>>                          'my_outdata',    -- Output table
>>>                          104,             -- Vocab size
>>>                          5,               -- Num topics
>>>                          10,              -- Num iterations
>>>                          5,               -- Alpha (Dirichlet param)
>>>                          0.01             -- Beta (Dirichlet param)
>>>                        );
>>>
>>>
>>> 5) Get topic description by top-k words
>>>
>>> DROP TABLE IF EXISTS my_topic_desc;
>>> SELECT madlib.lda_get_topic_desc( 'my_model',
>>>                                   'my_training_vocabulary',
>>>                                   'my_topic_desc',
>>>                                   15);
>>>
>>> 6) Join topic description with term frequency
>>>
>>> SELECT docid, wordid, topicid FROM my_topic_desc JOIN my_training USING
>>> (wordid) WHERE docid=3 ORDER BY docid, wordid;
>>>
>>> docid | wordid | topicid
>>> -------+--------+---------
>>>      3 |      0 |       3
>>>      3 |      4 |       2
>>>      3 |      5 |       3
>>>      3 |     13 |       2
>>>      3 |     14 |       5
>>>      3 |     20 |       5
>>>      3 |     21 |       2
>>>      3 |     23 |       2
>>>      3 |     24 |       3
>>>      3 |     27 |       2
>>>      3 |     32 |       3
>>>      3 |     35 |       3
>>>      3 |     38 |       5
>>>      3 |     39 |       5
>>> etc
>>>
>>> Frank
>>>
>>>
>>>
>>> On Tue, Aug 29, 2017 at 11:34 PM, Markus Paaso <markus.paaso@gmail.com>
>>> wrote:
>>>
>>>> Hi Frank,
>>>>
>>>> I want to explain the LDA results for a single document (in this case
>>>> for docid = 6) by binding a topicid to each wordid in the document.
>>>> The SQL query below gives exactly what I want, but I am not sure if it
>>>> is the most efficient way to build docid-wordid-topicid triples.
>>>>
>>>> SELECT docid, unnest((counts::text || ':' ||
>>>> words::text)::madlib.svec::float[]) AS wordid,
>>>> unnest(topic_assignment) + 1 AS topicid FROM lda_output WHERE docid = 6;
>>>>
>>>> I have trained LDA with 'lda_output' as the output_data_table argument
>>>> in madlib.lda_train.
>>>>
>>>>
>>>> Regards, Markus
>>>>
>>>> 2017-08-28 23:19 GMT+03:00 Frank McQuillan <fmcquillan@pivotal.io>:
>>>>
>>>>> Markus,
>>>>>
>>>>> Please see example 4 in the user docs
>>>>> http://madlib.apache.org/docs/latest/group__grp__lda.html#examples
>>>>> which provides helper functions for learning more about the learned
>>>>> model.
>>>>>
>>>>> -- The topic description by top-k words
>>>>> DROP TABLE IF EXISTS my_topic_desc;
>>>>> SELECT madlib.lda_get_topic_desc( 'my_model',
>>>>>                                   'my_training_vocabulary',
>>>>>                                   'my_topic_desc',
>>>>>                                   15);
>>>>> SELECT * FROM my_topic_desc ORDER BY topicid, prob DESC;
>>>>>
>>>>> produces:
>>>>>
>>>>> topicid | wordid |        prob        |       word
>>>>> ---------+--------+--------------------+-------------------
>>>>>        1 |     69 |  0.181900726392252 | of
>>>>>        1 |     52 | 0.0608353510895884 | is
>>>>>        1 |     65 | 0.0608353510895884 | models
>>>>>        1 |     30 | 0.0305690072639225 | corpora
>>>>>        1 |      1 | 0.0305690072639225 | 1960s
>>>>>        1 |     57 | 0.0305690072639225 | latent
>>>>>
>>>>> Please let us know if this is of use, or if you are looking for
>>>>> something else.
>>>>>
>>>>> Frank
>>>>>
>>>>>
>>>>> On Fri, Aug 11, 2017 at 6:45 AM, Markus Paaso <markus.paaso@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I found a working but quite awkward way to form docid-wordid-topicid
>>>>>> triples with a single SQL query:
>>>>>>
>>>>>> SELECT docid, unnest((counts::text || ':' ||
>>>>>> words::text)::madlib.svec::float[]) AS wordid,
>>>>>> unnest(topic_assignment) + 1 AS topicid FROM lda_output WHERE docid = 6;
>>>>>>
>>>>>> Output:
>>>>>>
>>>>>>  docid | wordid | topicid
>>>>>> -------+--------+---------
>>>>>>      6 |   7386 |       3
>>>>>>      6 |  42021 |      17
>>>>>>      6 |  42021 |      17
>>>>>>      6 |   7705 |      12
>>>>>>      6 | 105334 |      16
>>>>>>      6 |  18083 |       3
>>>>>>      6 |  89364 |       3
>>>>>>      6 |  31073 |       3
>>>>>>      6 |  28934 |       3
>>>>>>      6 |  28934 |      16
>>>>>>      6 |  56286 |      16
>>>>>>      6 |  61921 |       3
>>>>>>      6 |  61921 |       3
>>>>>>      6 |  59142 |      17
>>>>>>      6 |  33364 |       3
>>>>>>      6 |  79035 |      17
>>>>>>      6 |  37792 |      11
>>>>>>      6 |  91823 |      11
>>>>>>      6 |  30422 |       3
>>>>>>      6 |  94672 |      17
>>>>>>      6 |  62107 |       3
>>>>>>      6 |  94673 |       2
>>>>>>      6 |  62080 |      16
>>>>>>      6 | 101046 |      17
>>>>>>      6 |   4379 |       8
>>>>>>      6 |   4379 |       8
>>>>>>      6 |   4379 |       8
>>>>>>      6 |   4379 |       8
>>>>>>      6 |   4379 |       8
>>>>>>      6 |  26503 |      12
>>>>>>      6 |  61105 |       3
>>>>>>      6 |  19193 |       3
>>>>>>      6 |  28929 |       3
>>>>>>
>>>>>>
>>>>>> Is there any simpler way to do that?
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Markus Paaso
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2017-08-11 15:23 GMT+03:00 Markus Paaso <markus.paaso@gmail.com>:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am having some problems reading the LDA output.
>>>>>>>
>>>>>>>
>>>>>>> Please see this row of madlib.lda_train output:
>>>>>>>
>>>>>>> docid            | 6
>>>>>>> wordcount        | 33
>>>>>>> words            | {7386,42021,7705,105334,18083,89364,31073,28934,
>>>>>>>                     56286,61921,59142,33364,79035,37792,91823,30422,
>>>>>>>                     94672,62107,94673,62080,101046,4379,26503,61105,
>>>>>>>                     19193,28929}
>>>>>>> counts           | {1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,5,1,1,1,1}
>>>>>>> topic_count      | {0,1,13,0,0,0,0,5,0,0,2,2,0,0,0,4,6,0,0,0}
>>>>>>> topic_assignment | {2,16,16,11,15,2,2,2,2,15,15,2,2,16,2,16,10,10,2,16,
>>>>>>>                     2,1,15,16,7,7,7,7,7,11,2,2,2}
>>>>>>>
>>>>>>>
>>>>>>> It's hard to tell which topic ids are assigned to which word ids when
>>>>>>> the *words* array has a different length than the *topic_assignment*
>>>>>>> array.
>>>>>>> It would be nice if the *words* array were the same length as the
>>>>>>> *topic_assignment* array.
>>>>>>>
>>>>>>> 1. What kind of SQL query would give a result with wordid-topicid
>>>>>>> pairs?
>>>>>>> I tried to match them by hand but failed for wordid 28934. I wonder
>>>>>>> if a repeating wordid can have different topic assignments in the
>>>>>>> same document?
>>>>>>>
>>>>>>> wordid | topicid
>>>>>>> ----------------
>>>>>>> 7386   | 2
>>>>>>> 42021  | 16
>>>>>>> 7705   | 11
>>>>>>> 105334 | 15
>>>>>>> 18083  | 2
>>>>>>> 89364  | 2
>>>>>>> 31073  | 2
>>>>>>> 28934  | 2 OR 15 ?
>>>>>>> 56286  | 15
>>>>>>> 61921  | 2
>>>>>>> 59142  | 16
>>>>>>> 33364  | 2
>>>>>>> 79035  | 16
>>>>>>> 37792  | 10
>>>>>>> 91823  | 10
>>>>>>> 30422  | 2
>>>>>>> 94672  | 16
>>>>>>> 62107  | 2
>>>>>>> 94673  | 1
>>>>>>> 62080  | 15
>>>>>>> 101046 | 16
>>>>>>> 4379   | 7
>>>>>>> 26503  | 11
>>>>>>> 61105  | 2
>>>>>>> 19193  | 2
>>>>>>> 28929  | 2
>>>>>>>
>>>>>>>
>>>>>>> 2. Why is *topic_assignment* using zero-based indexing while the
>>>>>>> other results use one-based indexing?
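The alignment asked about here can be reproduced outside SQL. This illustrative Python sketch (not MADlib code) uses the row printed above: expanding *words* by *counts* (run-length decoding) yields a dense list exactly as long as *topic_assignment*, and zipping the two shows that a repeated wordid such as 28934 can indeed receive different topics within one document:

```python
# Data from the output_data_table row for docid 6 shown above.
words = [7386, 42021, 7705, 105334, 18083, 89364, 31073, 28934, 56286,
         61921, 59142, 33364, 79035, 37792, 91823, 30422, 94672, 62107,
         94673, 62080, 101046, 4379, 26503, 61105, 19193, 28929]
counts = [1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1]
topic_assignment = [2, 16, 16, 11, 15, 2, 2, 2, 2, 15, 15, 2,
                    2, 16, 2, 16, 10, 10, 2, 16, 2, 1, 15, 16,
                    7, 7, 7, 7, 7, 11, 2, 2, 2]

# Run-length expansion: repeat each wordid `count` times, preserving order.
# This mimics the (counts::text || ':' || words::text)::madlib.svec cast.
dense_words = [w for w, c in zip(words, counts) for _ in range(c)]
assert len(dense_words) == len(topic_assignment)  # both equal wordcount (33)

# Position-wise pairing of wordid with its (zero-based) topic assignment.
pairs = list(zip(dense_words, topic_assignment))

# wordid 28934 occurs twice and is assigned two different topics.
print(sorted(t for w, t in pairs if w == 28934))  # -> [2, 15]
```

This matches the manual table above (28934 | 2 OR 15): both assignments are correct, one per occurrence of the word.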
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Markus Paaso
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Markus Paaso
>>>>>> Tel: +358504067849
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Markus Paaso
>>>> Tel: +358504067849
>>>>
>>>
>>>
>>
>>
>> --
>> Markus Paaso
>> Tel: +358504067849
>>
>
