madlib-user mailing list archives

From Markus Paaso <markus.pa...@gmail.com>
Subject Re: LDA output format
Date Wed, 30 Aug 2017 06:34:43 GMT
Hi Frank,

I want to explain the LDA results for a single document (in this case
docid = 6) by attaching a topicid to each wordid occurrence in the document.
The SQL query below gives exactly what I want, but I am not sure it is
the most efficient way to build docid-wordid-topicid triples.

SELECT docid,
       unnest((counts::text || ':' || words::text)::madlib.svec::float[]) AS wordid,
       unnest(topic_assignment) + 1 AS topicid
FROM lda_output
WHERE docid = 6;

I have trained LDA with 'lda_output' as the output_data_table argument in
madlib.lda_train.
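
To make clear what the query is pairing up, here is a minimal sketch of the same logic in plain Python (not MADlib). It assumes that the svec cast expands counts:words by repeating each word id "count" times, which is consistent with the query output quoted further down in this thread. The arrays are copied from the docid = 6 row in my first message; the function name expand_triples is only illustrative.

```python
# Illustration only: plain Python, not MADlib. The arrays are copied from the
# docid = 6 row of lda_output quoted later in this thread.
docid = 6
words = [7386, 42021, 7705, 105334, 18083, 89364, 31073, 28934, 56286,
         61921, 59142, 33364, 79035, 37792, 91823, 30422, 94672, 62107,
         94673, 62080, 101046, 4379, 26503, 61105, 19193, 28929]
counts = [1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          5, 1, 1, 1, 1]
topic_count = [0, 1, 13, 0, 0, 0, 0, 5, 0, 0, 2, 2, 0, 0, 0, 4, 6, 0, 0, 0]
topic_assignment = [2, 16, 16, 11, 15, 2, 2, 2, 2, 15, 15, 2, 2, 16, 2, 16,
                    10, 10, 2, 16, 2, 1, 15, 16, 7, 7, 7, 7, 7, 11, 2, 2, 2]


def expand_triples(docid, words, counts, topic_assignment):
    """Pair every word occurrence with its topic id, shifted to 1-based."""
    # Repeat each word id by its count (assumed equivalent of the svec cast).
    expanded = [w for w, c in zip(words, counts) for _ in range(c)]
    if len(expanded) != len(topic_assignment):
        raise ValueError("sum(counts) must equal len(topic_assignment)")
    # topic_assignment is zero-based in this row, so add 1 as the query does.
    return [(docid, w, t + 1) for w, t in zip(expanded, topic_assignment)]


triples = expand_triples(docid, words, counts, topic_assignment)

# Two occurrences of the same word id can carry different topics.
topics_28934 = [t for _, w, t in triples if w == 28934]

# In this row, topic_count[k] is the number of tokens whose zero-based
# topic_assignment value is k.
histogram = [topic_assignment.count(k) for k in range(len(topic_count))]
```

For this row the expansion yields 33 triples (the wordcount), and the two occurrences of wordid 28934 end up with topic ids 3 and 16, so a repeated word can indeed get different topics within one document.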


Regards, Markus

2017-08-28 23:19 GMT+03:00 Frank McQuillan <fmcquillan@pivotal.io>:

> Markus,
>
> Please see example 4 in the user docs
> http://madlib.apache.org/docs/latest/group__grp__lda.html#examples
> which provides helper functions for learning more about the learned model.
>
>
> -- The topic description by top-k words
> DROP TABLE IF EXISTS my_topic_desc;
> SELECT madlib.lda_get_topic_desc( 'my_model',
>                                   'my_training_vocabulary',
>                                   'my_topic_desc',
>                                   15);
> select * from my_topic_desc order by topicid, prob DESC;
>
> produces:
>
>  topicid | wordid |        prob        |       word
> ---------+--------+--------------------+-------------------
>        1 |     69 |  0.181900726392252 | of
>        1 |     52 | 0.0608353510895884 | is
>        1 |     65 | 0.0608353510895884 | models
>        1 |     30 | 0.0305690072639225 | corpora
>        1 |      1 | 0.0305690072639225 | 1960s
>        1 |     57 | 0.0305690072639225 | latent
>
> Please let us know if this is of use, or you are looking for something
> else?
>
> Frank
>
>
> On Fri, Aug 11, 2017 at 6:45 AM, Markus Paaso <markus.paaso@gmail.com>
> wrote:
>
>> Hi,
>>
>> I found a working but quite awkward way to form docid-wordid-topicid
>> pairing with a single SQL query:
>>
>> SELECT docid,
>>        unnest((counts::text || ':' || words::text)::madlib.svec::float[]) AS wordid,
>>        unnest(topic_assignment) + 1 AS topicid
>> FROM lda_output
>> WHERE docid = 6;
>>
>> Output:
>>
>>  docid | wordid | topicid
>> -------+--------+---------
>>      6 |   7386 |       3
>>      6 |  42021 |      17
>>      6 |  42021 |      17
>>      6 |   7705 |      12
>>      6 | 105334 |      16
>>      6 |  18083 |       3
>>      6 |  89364 |       3
>>      6 |  31073 |       3
>>      6 |  28934 |       3
>>      6 |  28934 |      16
>>      6 |  56286 |      16
>>      6 |  61921 |       3
>>      6 |  61921 |       3
>>      6 |  59142 |      17
>>      6 |  33364 |       3
>>      6 |  79035 |      17
>>      6 |  37792 |      11
>>      6 |  91823 |      11
>>      6 |  30422 |       3
>>      6 |  94672 |      17
>>      6 |  62107 |       3
>>      6 |  94673 |       2
>>      6 |  62080 |      16
>>      6 | 101046 |      17
>>      6 |   4379 |       8
>>      6 |   4379 |       8
>>      6 |   4379 |       8
>>      6 |   4379 |       8
>>      6 |   4379 |       8
>>      6 |  26503 |      12
>>      6 |  61105 |       3
>>      6 |  19193 |       3
>>      6 |  28929 |       3
>>
>>
>> Is there any simpler way to do that?
>>
>>
>> Regards,
>> Markus Paaso
>>
>>
>>
>> 2017-08-11 15:23 GMT+03:00 Markus Paaso <markus.paaso@gmail.com>:
>>
>>> Hi,
>>>
>>> I am having some problems reading the LDA output.
>>>
>>>
>>> Please see this row of madlib.lda_train output:
>>>
>>> docid            | 6
>>> wordcount        | 33
>>> words            | {7386,42021,7705,105334,18083,89364,31073,28934,56286,61921,59142,33364,79035,37792,91823,30422,94672,62107,94673,62080,101046,4379,26503,61105,19193,28929}
>>> counts           | {1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,5,1,1,1,1}
>>> topic_count      | {0,1,13,0,0,0,0,5,0,0,2,2,0,0,0,4,6,0,0,0}
>>> topic_assignment | {2,16,16,11,15,2,2,2,2,15,15,2,2,16,2,16,10,10,2,16,2,1,15,16,7,7,7,7,7,11,2,2,2}
>>>
>>>
>>> It's hard to tell which word id each topic id is assigned to when the
>>> *words* array has a different length than the *topic_assignment* array.
>>> It would be nice if the *words* array had the same length as the
>>> *topic_assignment* array.
>>>
>>> 1. What kind of SQL query would give a result with wordid - topicid
>>> pairs?
>>> I tried to match them by hand but failed for wordid 28934. I wonder if
>>> a repeated wordid can have different topic assignments in the same document?
>>>
>>> wordid | topicid
>>> ----------------
>>> 7386   | 2
>>> 42021  | 16
>>> 7705   | 11
>>> 105334 | 15
>>> 18083  | 2
>>> 89364  | 2
>>> 31073  | 2
>>> 28934  | 2 OR 15 ?
>>> 56286  | 15
>>> 61921  | 2
>>> 59142  | 16
>>> 33364  | 2
>>> 79035  | 16
>>> 37792  | 10
>>> 91823  | 10
>>> 30422  | 2
>>> 94672  | 16
>>> 62107  | 2
>>> 94673  | 1
>>> 62080  | 15
>>> 101046 | 16
>>> 4379   | 7
>>> 26503  | 11
>>> 61105  | 2
>>> 19193  | 2
>>> 28929  | 2
>>>
>>>
>>> 2. Why does *topic_assignment* use zero-based indexing while the other
>>> results use one-based indexing?
>>>
>>>
>>>
>>> Regards,
>>> Markus Paaso
>>>
>>
>>
>>
>> --
>> Markus Paaso
>> Tel: +358504067849
>>
>
>


-- 
Markus Paaso
Tel: +358504067849
