madlib-user mailing list archives

From Frank McQuillan <fmcquil...@pivotal.io>
Subject Re: LDA output format
Date Mon, 28 Aug 2017 20:19:18 GMT
Markus,

Please see example 4 in the user docs
http://madlib.apache.org/docs/latest/group__grp__lda.html#examples
which provides helper functions for learning more about the learned model.

-- The topic description by top-k words
DROP TABLE IF EXISTS my_topic_desc;
SELECT madlib.lda_get_topic_desc( 'my_model',
                                  'my_training_vocabulary',
                                  'my_topic_desc',
                                  15);
SELECT * FROM my_topic_desc ORDER BY topicid, prob DESC;

produces:

 topicid | wordid |        prob        |       word
---------+--------+--------------------+-------------------
       1 |     69 |  0.181900726392252 | of
       1 |     52 | 0.0608353510895884 | is
       1 |     65 | 0.0608353510895884 | models
       1 |     30 | 0.0305690072639225 | corpora
       1 |      1 | 0.0305690072639225 | 1960s
       1 |     57 | 0.0305690072639225 | latent
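
On the word id / topic id pairing asked about in the quoted messages below:
in the quoted row, topic_assignment has one entry per word occurrence (33
for docid 6), while words and counts are run-length compressed (26 entries
whose counts sum to 33). Repeating each word id by its count lines the two
arrays up. The following is a minimal Python sketch of that expansion,
outside the database, using the arrays from the quoted row. The name
expand_pairs is a hypothetical helper, not a MADlib function, and the +1
offset simply mirrors the quoted query.

```python
def expand_pairs(words, counts, topic_assignment, offset=1):
    """Pair every word occurrence with its topic id.

    Hypothetical helper for illustration only. `offset` shifts the
    zero-based topic ids to one-based ids, as the quoted query does.
    """
    # Repeat each word id `count` times so there is one entry per occurrence.
    expanded = [w for w, c in zip(words, counts) for _ in range(c)]
    if len(expanded) != len(topic_assignment):
        raise ValueError("sum(counts) must equal len(topic_assignment)")
    return [(w, t + offset) for w, t in zip(expanded, topic_assignment)]


# Arrays copied from the quoted lda_train output row for docid 6.
words = [7386, 42021, 7705, 105334, 18083, 89364, 31073, 28934, 56286,
         61921, 59142, 33364, 79035, 37792, 91823, 30422, 94672, 62107,
         94673, 62080, 101046, 4379, 26503, 61105, 19193, 28929]
counts = [1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          5, 1, 1, 1, 1]
topic_assignment = [2, 16, 16, 11, 15, 2, 2, 2, 2, 15, 15, 2, 2, 16, 2, 16,
                    10, 10, 2, 16, 2, 1, 15, 16, 7, 7, 7, 7, 7, 11, 2, 2, 2]

pairs = expand_pairs(words, counts, topic_assignment)

# The two occurrences of word id 28934 carry different topic ids.
print([p for p in pairs if p[0] == 28934])
```

For this row the two occurrences of word id 28934 end up with different
topic ids, so a repeated word id can apparently be assigned to more than one
topic within the same document.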

Please let us know whether this is useful or whether you are looking for
something else.

Frank


On Fri, Aug 11, 2017 at 6:45 AM, Markus Paaso <markus.paaso@gmail.com>
wrote:

> Hi,
>
> I found a working but rather awkward way to produce docid-wordid-topicid
> rows with a single SQL query:
>
> SELECT docid,
>        unnest((counts::text || ':' || words::text)::madlib.svec::float[]) AS wordid,
>        unnest(topic_assignment) + 1 AS topicid
> FROM lda_output
> WHERE docid = 6;
>
> Output:
>
>  docid | wordid | topicid
> -------+--------+---------
>      6 |   7386 |       3
>      6 |  42021 |      17
>      6 |  42021 |      17
>      6 |   7705 |      12
>      6 | 105334 |      16
>      6 |  18083 |       3
>      6 |  89364 |       3
>      6 |  31073 |       3
>      6 |  28934 |       3
>      6 |  28934 |      16
>      6 |  56286 |      16
>      6 |  61921 |       3
>      6 |  61921 |       3
>      6 |  59142 |      17
>      6 |  33364 |       3
>      6 |  79035 |      17
>      6 |  37792 |      11
>      6 |  91823 |      11
>      6 |  30422 |       3
>      6 |  94672 |      17
>      6 |  62107 |       3
>      6 |  94673 |       2
>      6 |  62080 |      16
>      6 | 101046 |      17
>      6 |   4379 |       8
>      6 |   4379 |       8
>      6 |   4379 |       8
>      6 |   4379 |       8
>      6 |   4379 |       8
>      6 |  26503 |      12
>      6 |  61105 |       3
>      6 |  19193 |       3
>      6 |  28929 |       3
>
>
> Is there any simpler way to do that?
>
>
> Regards,
> Markus Paaso
>
>
>
> 2017-08-11 15:23 GMT+03:00 Markus Paaso <markus.paaso@gmail.com>:
>
>> Hi,
>>
>> I am having some problems reading the LDA output.
>>
>>
>> Please see this row of madlib.lda_train output:
>>
>> docid            | 6
>> wordcount        | 33
>> words            | {7386,42021,7705,105334,18083,89364,31073,28934,56286,61921,59142,33364,79035,37792,91823,30422,94672,62107,94673,62080,101046,4379,26503,61105,19193,28929}
>> counts           | {1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,5,1,1,1,1}
>> topic_count      | {0,1,13,0,0,0,0,5,0,0,2,2,0,0,0,4,6,0,0,0}
>> topic_assignment | {2,16,16,11,15,2,2,2,2,15,15,2,2,16,2,16,10,10,2,16,2,1,15,16,7,7,7,7,7,11,2,2,2}
>>
>>
>> It's hard to tell which word id each topic id is assigned to, because the
>> *words* array has a different length than the *topic_assignment* array.
>> It would be nice if the *words* array had the same length as the
>> *topic_assignment* array.
>>
>> 1. What kind of SQL query would give a result with wordid-topicid pairs?
>> I tried to match them by hand but failed for wordid 28934. Can a repeated
>> wordid have different topic assignments within the same document?
>>
>> wordid | topicid
>> ----------------
>> 7386   | 2
>> 42021  | 16
>> 7705   | 11
>> 105334 | 15
>> 18083  | 2
>> 89364  | 2
>> 31073  | 2
>> 28934  | 2 OR 15 ?
>> 56286  | 15
>> 61921  | 2
>> 59142  | 16
>> 33364  | 2
>> 79035  | 16
>> 37792  | 10
>> 91823  | 10
>> 30422  | 2
>> 94672  | 16
>> 62107  | 2
>> 94673  | 1
>> 62080  | 15
>> 101046 | 16
>> 4379   | 7
>> 26503  | 11
>> 61105  | 2
>> 19193  | 2
>> 28929  | 2
>>
>>
>> 2. Why does *topic_assignment* use zero-based indexing while the other
>> results use one-based indexing?
>>
>>
>>
>> Regards,
>> Markus Paaso
>>
>
>
>
> --
> Markus Paaso
> Tel: +358504067849
>
