madlib-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Frank McQuillan <fmcquil...@pivotal.io>
Subject Re: LDA output format
Date Thu, 21 Sep 2017 19:25:06 GMT
Markus,

Sorry for the delay in response.  Below is another approach that involves a
JOIN between term frequency and an LDA output table.  It is a bit simpler
at least for me.

Regarding the indexing, you are correct that it is inconsistent.  It seems
term frequency module uses 0 based indexing (e.g., docid) whereas LDA uses
1 based indexing (e.g., topicid).  I created a JIRA to address this
https://issues.apache.org/jira/browse/MADLIB-1160
probably the best way is to standardize on 1 based indexing.

----

1) Create data set

DROP TABLE IF EXISTS documents;
CREATE TABLE documents(docid INT4, contents TEXT);
INSERT INTO documents VALUES
(0, 'Statistical topic models are a class of Bayesian latent variable
models, originally developed for analyzing the semantic content of large
document corpora.'),
(1, 'By the late 1960s, the balance between pitching and hitting had swung
in favor of the pitchers. In 1968 Carl Yastrzemski won the American League
batting title with an average of just .301, the lowest in history.'),
(2, 'Machine learning is closely related to and often overlaps with
computational statistics; a discipline that also specializes in
prediction-making. It has strong ties to mathematical optimization, which
deliver methods, theory and application domains to the field.'),
(3, 'California''s diverse geography ranges from the Sierra Nevada in the
east to the Pacific Coast in the west, from the Redwood–Douglas fir forests
of the northwest, to the Mojave Desert areas in the southeast. The center
of the state is dominated by the Central Valley, a major agricultural area.
');


2) Convert docs to a list of words

ALTER TABLE documents ADD COLUMN words TEXT[];
UPDATE documents SET words = regexp_split_to_array(lower(contents),
E'[\\s+\\.\\,]');


3) Term frequency

DROP TABLE IF EXISTS my_training, my_training_vocabulary;
SELECT madlib.term_frequency('documents', 'docid', 'words', 'my_training',
TRUE);
SELECT * FROM my_training order by docid limit 20;
 docid | wordid | count
-------+--------+-------
     0 |     57 |     1
     0 |     86 |     1
     0 |      4 |     1
     0 |     55 |     1
     0 |     69 |     2
     0 |     81 |     1
     0 |     30 |     1
     0 |     33 |     1
     0 |     36 |     1
     0 |     43 |     1
     0 |     25 |     1
     0 |     65 |     2
     0 |     72 |     1
     0 |      9 |     1
     0 |      0 |     2
     0 |     29 |     1
     0 |     18 |     1
     0 |     12 |     1
     0 |     96 |     1
     0 |     91 |     1
(20 rows)


4) Run LDA

DROP TABLE IF EXISTS my_model, my_outdata;
SELECT madlib.lda_train( 'my_training',   -- Data table
                         'my_model',      -- Model table
                         'my_outdata',    -- Output table
                         104,             -- Vocab size
                         5,               -- Num topics
                         10,              -- Num iterations
                         5,               -- Alpha (Dirichlet param)
                         0.01             -- Beta (Dirichlet param)
                       );


5) Get topic description by top-k words

DROP TABLE IF EXISTS my_topic_desc;
SELECT madlib.lda_get_topic_desc( 'my_model',
                                  'my_training_vocabulary',
                                  'my_topic_desc',
                                  15);

6) Join topic description with term frequency
SELECT docid, wordid, topicid FROM my_topic_desc JOIN my_training USING
(wordid) WHERE docid=3 ORDER BY docid, wordid;

docid | wordid | topicid
-------+--------+---------
     3 |      0 |       3
     3 |      4 |       2
     3 |      5 |       3
     3 |     13 |       2
     3 |     14 |       5
     3 |     20 |       5
     3 |     21 |       2
     3 |     23 |       2
     3 |     24 |       3
     3 |     27 |       2
     3 |     32 |       3
     3 |     35 |       3
     3 |     38 |       5
     3 |     39 |       5
etc

Frank



On Tue, Aug 29, 2017 at 11:34 PM, Markus Paaso <markus.paaso@gmail.com>
wrote:

> Hi Frank,
>
> I want to explain the LDA results for a single document (in this case for
> docid = 6) by binding topicid into each wordid in the document.
> The SQL query below gives exactly what I want but I am not sure if that is
> the most effective way to build docid-wordid-topicid triples.
>
> SELECT docid, unnest((counts::text || ':' || words::text)::madlib.svec::float[])
> AS wordid, unnest(topic_assignment) + 1 AS topicid FROM lda_output WHERE
> docid = 6;
>
> I have trained LDA with 'lda_output' as the output_data_table argument in
> madlib.lda_train.
>
>
> Regards, Markus
>
> 2017-08-28 23:19 GMT+03:00 Frank McQuillan <fmcquillan@pivotal.io>:
>
>> Markus,
>>
>> Please see example 4 in the user docs
>> http://madlib.apache.org/docs/latest/group__grp__lda.html#examples
>> which provides helper functions for learning more about the learned
>> model.
>>
>> -- The topic description by top-k words
>> DROP TABLE IF EXISTS my_topic_desc;
>> SELECT madlib.lda_get_topic_desc( 'my_model',
>>                                   'my_training_vocabulary',
>>                                   'my_topic_desc',
>>                                   15);
>> select * from my_topic_desc order by topicid, prob DESC;
>>
>> produces:
>>
>> topicid | wordid |        prob        |       word
>> ---------+--------+--------------------+-------------------
>>        1 |     69 |  0.181900726392252 | of
>>        1 |     52 | 0.0608353510895884 | is
>>        1 |     65 | 0.0608353510895884 | models
>>        1 |     30 | 0.0305690072639225 | corpora
>>        1 |      1 | 0.0305690072639225 | 1960s
>>        1 |     57 | 0.0305690072639225 | latent
>>
>> Please let us know if this is of use, or you are looking for something
>> else?
>>
>> Frank
>>
>>
>> On Fri, Aug 11, 2017 at 6:45 AM, Markus Paaso <markus.paaso@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I found a working but quite awkward way to form docid-wordid-topicid
>>> pairing with a single SQL query:
>>>
>>> SELECT docid, unnest((counts::text || ':' ||
>>> words::text)::madlib.svec::float[]) AS wordid, unnest(topic_assignment)
>>> + 1 AS topicid FROM lda_output WHERE docid = 6;
>>>
>>> Output:
>>>
>>>  docid | wordid | topicid
>>> -------+--------+---------
>>>      6 |   7386 |       3
>>>      6 |  42021 |      17
>>>      6 |  42021 |      17
>>>      6 |   7705 |      12
>>>      6 | 105334 |      16
>>>      6 |  18083 |       3
>>>      6 |  89364 |       3
>>>      6 |  31073 |       3
>>>      6 |  28934 |       3
>>>      6 |  28934 |      16
>>>      6 |  56286 |      16
>>>      6 |  61921 |       3
>>>      6 |  61921 |       3
>>>      6 |  59142 |      17
>>>      6 |  33364 |       3
>>>      6 |  79035 |      17
>>>      6 |  37792 |      11
>>>      6 |  91823 |      11
>>>      6 |  30422 |       3
>>>      6 |  94672 |      17
>>>      6 |  62107 |       3
>>>      6 |  94673 |       2
>>>      6 |  62080 |      16
>>>      6 | 101046 |      17
>>>      6 |   4379 |       8
>>>      6 |   4379 |       8
>>>      6 |   4379 |       8
>>>      6 |   4379 |       8
>>>      6 |   4379 |       8
>>>      6 |  26503 |      12
>>>      6 |  61105 |       3
>>>      6 |  19193 |       3
>>>      6 |  28929 |       3
>>>
>>>
>>> Is there any simpler way to do that?
>>>
>>>
>>> Regards,
>>> Markus Paaso
>>>
>>>
>>>
>>> 2017-08-11 15:23 GMT+03:00 Markus Paaso <markus.paaso@gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> I am having some problems reading the LDA output.
>>>>
>>>>
>>>> Please see this row of madlib.lda_train output:
>>>>
>>>> docid            | 6
>>>> wordcount        | 33
>>>> words            | {7386,42021,7705,105334,18083,
>>>> 89364,31073,28934,56286,61921,59142,33364,79035,37792,91823,
>>>> 30422,94672,62107,94673,62080,101046, 4379,26503,61105,19193,28929}
>>>> counts           | {1,2,1,1,1,1,1,2,1,2,1,1,1,1,1
>>>> ,1,1,1,1,1,1,5,1,1,1,1}
>>>> topic_count      | {0,1,13,0,0,0,0,5,0,0,2,2,0,0,0,4,6,0,0,0}
>>>> topic_assignment | {2,16,16,11,15,2,2,2,2,15,15,2
>>>> ,2,16,2,16,10,10,2,16,2,1,15,16,7,7,7,7,7,11,2,2,2}
>>>>
>>>>
>>>> It's hard to find which word ids are topic ids assigned to given when
>>>> *words* array have different length than *topic_assignment* array.
>>>> It would be nice if *words* array was same length than
>>>> *topic_assignment* array
>>>>
>>>> 1. What kind of SQL query would give a result with wordid - topicid
>>>> pairs?
>>>> I tried to match them by hand but failed for wordid: 28934. I wonder if
>>>> a repeating wordid can have different topic assignments in a same document?
>>>>
>>>> wordid | topicid
>>>> ----------------
>>>> 7386   | 2
>>>> 42021  | 16
>>>> 7705   | 11
>>>> 105334 | 15
>>>> 18083  | 2
>>>> 89364  | 2
>>>> 31073  | 2
>>>> 28934  | 2 OR 15 ?
>>>> 56286  | 15
>>>> 61921  | 2
>>>> 59142  | 16
>>>> 33364  | 2
>>>> 79035  | 16
>>>> 37792  | 10
>>>> 91823  | 10
>>>> 30422  | 2
>>>> 94672  | 16
>>>> 62107  | 2
>>>> 94673  | 1
>>>> 62080  | 15
>>>> 101046 | 16
>>>> 4379   | 7
>>>> 26503  | 11
>>>> 61105  | 2
>>>> 19193  | 2
>>>> 28929  | 2
>>>>
>>>>
>>>> 2. Why is the *topic_assignment* using zero based indexing while other
>>>> results use one base indexing?
>>>>
>>>>
>>>>
>>>> Regards,
>>>> Markus Paaso
>>>>
>>>
>>>
>>>
>>> --
>>> Markus Paaso
>>> Tel: +358504067849 <+358%2050%204067849>
>>>
>>
>>
>
>
> --
> Markus Paaso
> Tel: +358504067849 <+358%2050%204067849>
>

Mime
View raw message