madlib-user mailing list archives

From Markus Paaso <markus.pa...@gmail.com>
Subject Re: LDA output format
Date Fri, 11 Aug 2017 13:45:27 GMT
Hi,

I found a working but rather awkward way to produce docid-wordid-topicid
rows with a single SQL query:

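-- Build an svec from the text form 'counts:words' and cast it to a dense
-- array, so that each word id is repeated counts[i] times.
SELECT docid,
       unnest((counts::text || ':' || words::text)::madlib.svec::float[]) AS wordid,
       unnest(topic_assignment) + 1 AS topicid
FROM lda_output
WHERE docid = 6;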

Output:

 docid | wordid | topicid
-------+--------+---------
     6 |   7386 |       3
     6 |  42021 |      17
     6 |  42021 |      17
     6 |   7705 |      12
     6 | 105334 |      16
     6 |  18083 |       3
     6 |  89364 |       3
     6 |  31073 |       3
     6 |  28934 |       3
     6 |  28934 |      16
     6 |  56286 |      16
     6 |  61921 |       3
     6 |  61921 |       3
     6 |  59142 |      17
     6 |  33364 |       3
     6 |  79035 |      17
     6 |  37792 |      11
     6 |  91823 |      11
     6 |  30422 |       3
     6 |  94672 |      17
     6 |  62107 |       3
     6 |  94673 |       2
     6 |  62080 |      16
     6 | 101046 |      17
     6 |   4379 |       8
     6 |   4379 |       8
     6 |   4379 |       8
     6 |   4379 |       8
     6 |   4379 |       8
     6 |  26503 |      12
     6 |  61105 |       3
     6 |  19193 |       3
     6 |  28929 |       3
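
For readers who want to check the pairing outside the database, here is a minimal Python sketch of what the query above appears to do. It assumes that topic_assignment holds one entry per word occurrence, in the order obtained by repeating each words[i] counts[i] times. The helper name expand_pairs is hypothetical and not part of MADlib; the data is copied from the lda_train row for docid 6 quoted below.

```python
# Values copied from the lda_train output row for docid = 6 in this thread.
words = [7386, 42021, 7705, 105334, 18083, 89364, 31073, 28934, 56286,
         61921, 59142, 33364, 79035, 37792, 91823, 30422, 94672, 62107,
         94673, 62080, 101046, 4379, 26503, 61105, 19193, 28929]
counts = [1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          5, 1, 1, 1, 1]
topic_assignment = [2, 16, 16, 11, 15, 2, 2, 2, 2, 15, 15, 2, 2, 16, 2, 16,
                    10, 10, 2, 16, 2, 1, 15, 16, 7, 7, 7, 7, 7, 11, 2, 2, 2]


def expand_pairs(words, counts, topic_assignment):
    """Return (wordid, topicid) pairs with 1-based topic ids.

    Hypothetical helper: repeats each word id by its count, then zips the
    expanded list with the per-occurrence topic assignments.
    """
    expanded = [w for w, c in zip(words, counts) for _ in range(c)]
    if len(expanded) != len(topic_assignment):
        raise ValueError("sum(counts) must equal len(topic_assignment)")
    # topic_assignment is zero-based in the output row; shift to one-based.
    return [(w, t + 1) for w, t in zip(expanded, topic_assignment)]


pairs = expand_pairs(words, counts, topic_assignment)
for wordid, topicid in pairs:
    print(wordid, topicid)
```

Under that assumption the sketch yields 33 pairs, and the two occurrences of wordid 28934 come out with topic ids 3 and 16, which agrees with the query output above.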


Is there any simpler way to do that?


Regards,
Markus Paaso



2017-08-11 15:23 GMT+03:00 Markus Paaso <markus.paaso@gmail.com>:

> Hi,
>
> I am having some problems reading the LDA output.
>
>
> Please see this row of madlib.lda_train output:
>
> docid            | 6
> wordcount        | 33
> words            | {7386,42021,7705,105334,18083,
> 89364,31073,28934,56286,61921,59142,33364,79035,37792,91823,
> 30422,94672,62107,94673,62080,101046,4379,26503,61105,19193,28929}
> counts           | {1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,5,1,1,1,1}
> topic_count      | {0,1,13,0,0,0,0,5,0,0,2,2,0,0,0,4,6,0,0,0}
> topic_assignment | {2,16,16,11,15,2,2,2,2,15,15,
> 2,2,16,2,16,10,10,2,16,2,1,15,16,7,7,7,7,7,11,2,2,2}
>
>
> It's hard to tell which word id each topic id is assigned to, because the
> *words* array has a different length than the *topic_assignment* array.
> It would be nice if the *words* array were the same length as the
> *topic_assignment* array.
>
> 1. What kind of SQL query would give a result with wordid - topicid pairs?
> I tried to match them by hand but failed for wordid 28934. I wonder whether a
> repeated wordid can have different topic assignments within the same document?
>
> wordid | topicid
> ----------------
> 7386   | 2
> 42021  | 16
> 7705   | 11
> 105334 | 15
> 18083  | 2
> 89364  | 2
> 31073  | 2
> 28934  | 2 OR 15 ?
> 56286  | 15
> 61921  | 2
> 59142  | 16
> 33364  | 2
> 79035  | 16
> 37792  | 10
> 91823  | 10
> 30422  | 2
> 94672  | 16
> 62107  | 2
> 94673  | 1
> 62080  | 15
> 101046 | 16
> 4379   | 7
> 26503  | 11
> 61105  | 2
> 19193  | 2
> 28929  | 2
>
>
> 2. Why does *topic_assignment* use zero-based indexing while the other
> results use one-based indexing?
>
>
>
> Regards,
> Markus Paaso
>



-- 
Markus Paaso
Tel: +358504067849
