madlib-user mailing list archives

From Markus Paaso <markus.pa...@gmail.com>
Subject LDA output format
Date Fri, 11 Aug 2017 12:23:00 GMT
Hi,

I am having some problems reading the LDA output.


Please see this row of madlib.lda_train output:

docid            | 6
wordcount        | 33
words            | {7386,42021,7705,105334,18083,89364,31073,28934,56286,61921,59142,33364,79035,37792,91823,30422,94672,62107,94673,62080,101046,4379,26503,61105,19193,28929}
counts           | {1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,5,1,1,1,1}
topic_count      | {0,1,13,0,0,0,0,5,0,0,2,2,0,0,0,4,6,0,0,0}
topic_assignment | {2,16,16,11,15,2,2,2,2,15,15,2,2,16,2,16,10,10,2,16,2,1,15,16,7,7,7,7,7,11,2,2,2}


It is hard to tell which word id each topic id is assigned to, because the
*words* array has a different length than the *topic_assignment* array.
It would be nice if the *words* array had the same length as the
*topic_assignment* array.

1. What kind of SQL query would give a result with wordid - topicid pairs?
I tried to match them by hand but failed for wordid 28934. Can a
repeated wordid have different topic assignments within the same document?

wordid | topicid
----------------
7386   | 2
42021  | 16
7705   | 11
105334 | 15
18083  | 2
89364  | 2
31073  | 2
28934  | 2 OR 15 ?
56286  | 15
61921  | 2
59142  | 16
33364  | 2
79035  | 16
37792  | 10
91823  | 10
30422  | 2
94672  | 16
62107  | 2
94673  | 1
62080  | 15
101046 | 16
4379   | 7
26503  | 11
61105  | 2
19193  | 2
28929  | 2
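
A minimal Python sketch of the hand matching above, under one unconfirmed
assumption: that *topic_assignment* holds one topic id per token, in the
order of the *words* array with each word id repeated *counts* times. The
helper name pair_words_with_topics is made up for illustration and is not
part of MADlib.

```python
from itertools import chain, repeat

# Values copied from the lda_train output row for docid 6.
words = [7386, 42021, 7705, 105334, 18083, 89364, 31073, 28934, 56286,
         61921, 59142, 33364, 79035, 37792, 91823, 30422, 94672, 62107,
         94673, 62080, 101046, 4379, 26503, 61105, 19193, 28929]
counts = [1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          5, 1, 1, 1, 1]
topic_assignment = [2, 16, 16, 11, 15, 2, 2, 2, 2, 15, 15, 2, 2, 16, 2, 16,
                    10, 10, 2, 16, 2, 1, 15, 16, 7, 7, 7, 7, 7, 11, 2, 2, 2]


def pair_words_with_topics(words, counts, topic_assignment):
    """Hypothetical helper: expand words by counts, then zip with topics."""
    # Repeat each word id by its count so there is one entry per token.
    expanded = list(chain.from_iterable(
        repeat(word, count) for word, count in zip(words, counts)))
    # The assumption only makes sense if both sequences have equal length.
    if len(expanded) != len(topic_assignment):
        raise ValueError("expanded words and topic_assignment differ in length")
    return list(zip(expanded, topic_assignment))


pairs = pair_words_with_topics(words, counts, topic_assignment)
for wordid, topicid in pairs:
    print(wordid, topicid)
```

If the assumption holds, the two occurrences of wordid 28934 are paired with
topics 2 and 15, so a repeated word can carry different topics in one document.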


2. Why does *topic_assignment* use zero-based indexing while the other
results use one-based indexing?
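
The zero-based reading can be checked against the row shown earlier. The
sketch below tallies the values in *topic_assignment* and compares the tally
with *topic_count* under both conventions; it only inspects this one row and
assumes nothing about MADlib internals.

```python
from collections import Counter

# Values copied from the lda_train output row for docid 6.
topic_count = [0, 1, 13, 0, 0, 0, 0, 5, 0, 0, 2, 2, 0, 0, 0, 4, 6, 0, 0, 0]
topic_assignment = [2, 16, 16, 11, 15, 2, 2, 2, 2, 15, 15, 2, 2, 16, 2, 16,
                    10, 10, 2, 16, 2, 1, 15, 16, 7, 7, 7, 7, 7, 11, 2, 2, 2]

# How many tokens were assigned to each topic id.
tally = Counter(topic_assignment)
num_topics = len(topic_count)

# Zero-based: topic id k corresponds to the k-th array element counting from 0.
zero_based = [tally.get(k, 0) for k in range(num_topics)]
# One-based: topic id k corresponds to the k-th array element counting from 1.
one_based = [tally.get(k, 0) for k in range(1, num_topics + 1)]

print("zero-based matches topic_count:", zero_based == topic_count)
print("one-based matches topic_count:", one_based == topic_count)
```

For this row only the zero-based tally reproduces *topic_count*.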



Regards,
Markus Paaso
