madlib-user mailing list archives

From Markus Paaso <markus.pa...@gmail.com>
Subject Re: LDA output format
Date Fri, 22 Sep 2017 07:18:37 GMT
Hi Frank,

Thanks for the response.
The solution you suggest doesn't fully answer my question.

I don't want to join the input table with the top-k wordids from each topic,
as that is not what lda_predict does. Am I right?
I want to explain the results of madlib.lda_predict exactly, i.e. why
lda_predict gives a particular topic distribution for a given set of wordids.

I understand your approach, but unfortunately it doesn't fulfill my needs.
I have found a solution that works for me and was just asking if you know a
better way to handle the results of lda_predict (or the data in the
*output_data_table* of lda_train).

So my question was more about how to construct the wordid array from the
'counts' and 'words' columns of the madlib.lda_predict results, or e.g.
'my_outdata' in your example.
Now I am doing it with:

(counts::text || ':' || words::text)::madlib.svec::float[]
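
In full, against the 'lda_output' table from my earlier mail, the query is
the same approach just written out:

SELECT docid,
       unnest((counts::text || ':' || words::text)::madlib.svec::float[]) AS wordid,
       unnest(topic_assignment) + 1 AS topicid
FROM lda_output
WHERE docid = 6;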

The thing I don't understand about the *output_data_table* format is why
*topic_assignment* is a dense vector while *words* is a sparse vector.
I would suggest changing the *words* format to a dense vector, to remove the
need to play with the madlib.svec object when explaining the lda_predict
results exactly.
The *counts* column would then no longer be needed in *output_data_table*.
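
To illustrate, if *words* were dense (one entry per token, aligned with
*topic_assignment*), the pairing query would, as far as I can see, reduce to
something like:

SELECT docid,
       unnest(words) AS wordid,
       unnest(topic_assignment) + 1 AS topicid
FROM lda_output
WHERE docid = 6;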


Regards,
Markus Paaso






2017-09-21 22:25 GMT+03:00 Frank McQuillan <fmcquillan@pivotal.io>:

> Markus,
>
> Sorry for the delay in response.  Below is another approach that involves
> a JOIN between term frequency and an LDA output table.  It is a bit simpler
> at least for me.
>
> Regarding the indexing, you are correct that it is inconsistent.  It seems
> the term frequency module uses 0-based indexing (e.g., docid) whereas LDA
> uses 1-based indexing (e.g., topicid).  I created a JIRA to address this:
> https://issues.apache.org/jira/browse/MADLIB-1160
> Probably the best way is to standardize on 1-based indexing.
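>
> In the meantime, a sketch of one way to line the two up is to shift the
> 0-based topic_assignment values by one before comparing them against a
> 1-based column such as topicid, e.g. (using the my_outdata table created
> in step 4 below):
>
> SELECT docid, unnest(topic_assignment) + 1 AS topicid
> FROM my_outdata;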
>
> ----
>
> 1) Create data set
>
> DROP TABLE IF EXISTS documents;
> CREATE TABLE documents(docid INT4, contents TEXT);
> INSERT INTO documents VALUES
> (0, 'Statistical topic models are a class of Bayesian latent variable
> models, originally developed for analyzing the semantic content of large
> document corpora.'),
> (1, 'By the late 1960s, the balance between pitching and hitting had swung
> in favor of the pitchers. In 1968 Carl Yastrzemski won the American League
> batting title with an average of just .301, the lowest in history.'),
> (2, 'Machine learning is closely related to and often overlaps with
> computational statistics; a discipline that also specializes in
> prediction-making. It has strong ties to mathematical optimization, which
> deliver methods, theory and application domains to the field.'),
> (3, 'California''s diverse geography ranges from the Sierra Nevada in the
> east to the Pacific Coast in the west, from the Redwood–Douglas fir forests
> of the northwest, to the Mojave Desert areas in the southeast. The center
> of the state is dominated by the Central Valley, a major agricultural area.
> ');
>
>
> 2) Convert docs to a list of words
>
> ALTER TABLE documents ADD COLUMN words TEXT[];
> UPDATE documents SET words = regexp_split_to_array(lower(contents),
> E'[\\s+\\.\\,]');
>
>
> 3) Term frequency
>
> DROP TABLE IF EXISTS my_training, my_training_vocabulary;
> SELECT madlib.term_frequency('documents', 'docid', 'words',
> 'my_training', TRUE);
> SELECT * FROM my_training order by docid limit 20;
>  docid | wordid | count
> -------+--------+-------
>      0 |     57 |     1
>      0 |     86 |     1
>      0 |      4 |     1
>      0 |     55 |     1
>      0 |     69 |     2
>      0 |     81 |     1
>      0 |     30 |     1
>      0 |     33 |     1
>      0 |     36 |     1
>      0 |     43 |     1
>      0 |     25 |     1
>      0 |     65 |     2
>      0 |     72 |     1
>      0 |      9 |     1
>      0 |      0 |     2
>      0 |     29 |     1
>      0 |     18 |     1
>      0 |     12 |     1
>      0 |     96 |     1
>      0 |     91 |     1
> (20 rows)
>
>
> 4) Run LDA
>
> DROP TABLE IF EXISTS my_model, my_outdata;
> SELECT madlib.lda_train( 'my_training',   -- Data table
>                          'my_model',      -- Model table
>                          'my_outdata',    -- Output table
>                          104,             -- Vocab size
>                          5,               -- Num topics
>                          10,              -- Num iterations
>                          5,               -- Alpha (Dirichlet param)
>                          0.01             -- Beta (Dirichlet param)
>                        );
>
>
> 5) Get topic description by top-k words
>
> DROP TABLE IF EXISTS my_topic_desc;
> SELECT madlib.lda_get_topic_desc( 'my_model',
>                                   'my_training_vocabulary',
>                                   'my_topic_desc',
>                                   15);
>
> 6) Join topic description with term frequency
>
> SELECT docid, wordid, topicid FROM my_topic_desc JOIN my_training USING
> (wordid) WHERE docid=3 ORDER BY docid, wordid;
>
> docid | wordid | topicid
> -------+--------+---------
>      3 |      0 |       3
>      3 |      4 |       2
>      3 |      5 |       3
>      3 |     13 |       2
>      3 |     14 |       5
>      3 |     20 |       5
>      3 |     21 |       2
>      3 |     23 |       2
>      3 |     24 |       3
>      3 |     27 |       2
>      3 |     32 |       3
>      3 |     35 |       3
>      3 |     38 |       5
>      3 |     39 |       5
> etc
>
> Frank
>
>
>
> On Tue, Aug 29, 2017 at 11:34 PM, Markus Paaso <markus.paaso@gmail.com>
> wrote:
>
>> Hi Frank,
>>
>> I want to explain the LDA results for a single document (in this case for
>> docid = 6) by binding a topicid to each wordid in the document.
>> The SQL query below gives exactly what I want, but I am not sure if it
>> is the most effective way to build docid-wordid-topicid triples.
>>
>> SELECT docid, unnest((counts::text || ':' ||
>> words::text)::madlib.svec::float[]) AS wordid, unnest(topic_assignment)
>> + 1 AS topicid FROM lda_output WHERE docid = 6;
>>
>> I have trained LDA with 'lda_output' as the output_data_table argument
>> in madlib.lda_train.
>>
>>
>> Regards, Markus
>>
>> 2017-08-28 23:19 GMT+03:00 Frank McQuillan <fmcquillan@pivotal.io>:
>>
>>> Markus,
>>>
>>> Please see example 4 in the user docs
>>> http://madlib.apache.org/docs/latest/group__grp__lda.html#examples
>>> which provides helper functions for learning more about the learned
>>> model.
>>>
>>> -- The topic description by top-k words
>>> DROP TABLE IF EXISTS my_topic_desc;
>>> SELECT madlib.lda_get_topic_desc( 'my_model',
>>>                                   'my_training_vocabulary',
>>>                                   'my_topic_desc',
>>>                                   15);
>>> select * from my_topic_desc order by topicid, prob DESC;
>>>
>>> produces:
>>>
>>> topicid | wordid |        prob        |       word
>>> ---------+--------+--------------------+-------------------
>>>        1 |     69 |  0.181900726392252 | of
>>>        1 |     52 | 0.0608353510895884 | is
>>>        1 |     65 | 0.0608353510895884 | models
>>>        1 |     30 | 0.0305690072639225 | corpora
>>>        1 |      1 | 0.0305690072639225 | 1960s
>>>        1 |     57 | 0.0305690072639225 | latent
>>>
>>> Please let us know if this is of use, or if you are looking for something
>>> else.
>>>
>>> Frank
>>>
>>>
>>> On Fri, Aug 11, 2017 at 6:45 AM, Markus Paaso <markus.paaso@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I found a working but quite awkward way to form the docid-wordid-topicid
>>>> pairing with a single SQL query:
>>>>
>>>> SELECT docid, unnest((counts::text || ':' ||
>>>> words::text)::madlib.svec::float[]) AS wordid,
>>>> unnest(topic_assignment) + 1 AS topicid FROM lda_output WHERE docid = 6;
>>>>
>>>> Output:
>>>>
>>>>  docid | wordid | topicid
>>>> -------+--------+---------
>>>>      6 |   7386 |       3
>>>>      6 |  42021 |      17
>>>>      6 |  42021 |      17
>>>>      6 |   7705 |      12
>>>>      6 | 105334 |      16
>>>>      6 |  18083 |       3
>>>>      6 |  89364 |       3
>>>>      6 |  31073 |       3
>>>>      6 |  28934 |       3
>>>>      6 |  28934 |      16
>>>>      6 |  56286 |      16
>>>>      6 |  61921 |       3
>>>>      6 |  61921 |       3
>>>>      6 |  59142 |      17
>>>>      6 |  33364 |       3
>>>>      6 |  79035 |      17
>>>>      6 |  37792 |      11
>>>>      6 |  91823 |      11
>>>>      6 |  30422 |       3
>>>>      6 |  94672 |      17
>>>>      6 |  62107 |       3
>>>>      6 |  94673 |       2
>>>>      6 |  62080 |      16
>>>>      6 | 101046 |      17
>>>>      6 |   4379 |       8
>>>>      6 |   4379 |       8
>>>>      6 |   4379 |       8
>>>>      6 |   4379 |       8
>>>>      6 |   4379 |       8
>>>>      6 |  26503 |      12
>>>>      6 |  61105 |       3
>>>>      6 |  19193 |       3
>>>>      6 |  28929 |       3
>>>>
>>>>
>>>> Is there any simpler way to do that?
>>>>
>>>>
>>>> Regards,
>>>> Markus Paaso
>>>>
>>>>
>>>>
>>>> 2017-08-11 15:23 GMT+03:00 Markus Paaso <markus.paaso@gmail.com>:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am having some problems reading the LDA output.
>>>>>
>>>>>
>>>>> Please see this row of madlib.lda_train output:
>>>>>
>>>>> docid            | 6
>>>>> wordcount        | 33
>>>>> words            | {7386,42021,7705,105334,18083,89364,31073,28934,56286,61921,59142,33364,79035,37792,91823,30422,94672,62107,94673,62080,101046,4379,26503,61105,19193,28929}
>>>>> counts           | {1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,5,1,1,1,1}
>>>>> topic_count      | {0,1,13,0,0,0,0,5,0,0,2,2,0,0,0,4,6,0,0,0}
>>>>> topic_assignment | {2,16,16,11,15,2,2,2,2,15,15,2,2,16,2,16,10,10,2,16,2,1,15,16,7,7,7,7,7,11,2,2,2}
>>>>>
>>>>>
>>>>> It's hard to tell which topic ids are assigned to which word ids when the
>>>>> *words* array has a different length than the *topic_assignment* array.
>>>>> It would be nice if the *words* array were the same length as the
>>>>> *topic_assignment* array.
>>>>>
>>>>> 1. What kind of SQL query would give a result with wordid - topicid
>>>>> pairs?
>>>>> I tried to match them by hand but failed for wordid 28934. I wonder
>>>>> if a repeating wordid can have different topic assignments in the same
>>>>> document?
>>>>>
>>>>> wordid | topicid
>>>>> ----------------
>>>>> 7386   | 2
>>>>> 42021  | 16
>>>>> 7705   | 11
>>>>> 105334 | 15
>>>>> 18083  | 2
>>>>> 89364  | 2
>>>>> 31073  | 2
>>>>> 28934  | 2 OR 15 ?
>>>>> 56286  | 15
>>>>> 61921  | 2
>>>>> 59142  | 16
>>>>> 33364  | 2
>>>>> 79035  | 16
>>>>> 37792  | 10
>>>>> 91823  | 10
>>>>> 30422  | 2
>>>>> 94672  | 16
>>>>> 62107  | 2
>>>>> 94673  | 1
>>>>> 62080  | 15
>>>>> 101046 | 16
>>>>> 4379   | 7
>>>>> 26503  | 11
>>>>> 61105  | 2
>>>>> 19193  | 2
>>>>> 28929  | 2
>>>>>
>>>>>
>>>>> 2. Why is *topic_assignment* using zero-based indexing while
>>>>> other results use one-based indexing?
>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> Markus Paaso
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Markus Paaso
>>>> Tel: +358504067849
>>>>
>>>
>>>
>>
>>
>> --
>> Markus Paaso
>> Tel: +358504067849
>>
>
>


-- 
Markus Paaso
Tel: +358504067849
