madlib-user mailing list archives

From Frank McQuillan <fmcquil...@pivotal.io>
Subject Re: LDA output format
Date Fri, 22 Sep 2017 20:32:30 GMT
Markus,

Thanks for the extra explanation.

I agree that having topic_assignment as a dense vector but words as a
sparse vector is not good practice; I am not sure why it was done that way
in the first place.  They should be consistent.

I believe the reason sparse vectors are used is to keep the LDA model state
as small as possible, since it is possible to hit the 1 GB PostgreSQL field
size limit in certain real-world use cases.

So I think what you are doing seems reasonable.  I don’t have a simpler way
to do it at this point, given the current way LDA is built.
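
For the archive, the reconciliation in question can be sketched outside SQL.
This is plain Python (not MADlib code), with made-up wordids, showing the
run-length expansion that lines the sparse words/counts pair up with the
dense topic_assignment:

```python
# Sketch only (not MADlib code): reconcile the sparse words/counts pair
# with the dense topic_assignment array by run-length expansion.
def expand_words(words, counts):
    """Repeat each wordid by its count, yielding an array the same
    length as topic_assignment."""
    dense = []
    for wordid, count in zip(words, counts):
        dense.extend([wordid] * count)
    return dense

# Hypothetical row: 3 distinct words, 5 tokens in total.
words = [12, 7, 30]
counts = [1, 3, 1]
topic_assignment = [2, 0, 0, 1, 2]

pairs = list(zip(expand_words(words, counts), topic_assignment))
print(pairs)  # [(12, 2), (7, 0), (7, 0), (7, 1), (30, 2)]
```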

I have updated this JIRA
https://issues.apache.org/jira/browse/MADLIB-1160
proposing:

1) use consistent indexes (start at 1)
2) use consistent array types (sparse)

which hopefully will help some.

Frank


On Fri, Sep 22, 2017 at 12:18 AM, Markus Paaso <markus.paaso@gmail.com>
wrote:

> Hi Frank,
>
> Thanks for the response.
> The solution you suggest doesn't fully answer my question.
>
> I don't want to join the input table with the top-k wordids from each
> topic, as this is not what lda_predict does. Am I right?
> I want to explain the results of madlib.lda_predict exactly:
> why lda_predict gives a particular topic distribution for a particular
> set of wordids.
>
> I understand your approach but unfortunately it doesn't fulfill my needs.
> I have found a solution for my needs and was just asking if you know a
> better way to handle the results of lda_predict (or the data in
> *output_data_table* of lda_train).
>
> So my question was more about how to construct wordid array from 'counts'
> and 'words' columns of madlib.lda_predict results or e.g. 'my_outdata' in
> your example.
> Now I am doing it with (counts::text || ':' || words::text)::madlib.svec::
> float[]
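
The madlib.svec text form is run-length encoded as '{runs}:{values}', which
is why that concatenate-and-cast trick works. Below is a rough Python
sketch of what the cast computes; the helper name is illustrative, not
MADlib's:

```python
# Sketch of what (counts::text || ':' || words::text)::madlib.svec::float[]
# computes: madlib.svec text form is '{runs}:{values}' (run-length
# encoded), and casting to float[] repeats each value by its run length.
def svec_to_array(svec_text):
    runs_part, values_part = svec_text.split(':')
    runs = [int(x) for x in runs_part.strip('{}').split(',')]
    values = [float(x) for x in values_part.strip('{}').split(',')]
    out = []
    for run, value in zip(runs, values):
        out.extend([value] * run)
    return out

print(svec_to_array('{1,2,1}:{7386,42021,7705}'))
# [7386.0, 42021.0, 42021.0, 7705.0]
```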
>
> The thing I don't understand about the *output_data_table* format is why
> *topic_assignment* is a dense vector but *words* is a sparse vector.
> I would suggest changing the *words* format into a dense vector, to
> remove the need to play with the madlib.svec object when explaining the
> lda_predict results exactly.
> Then the *counts* column would no longer be needed in *output_data_table*.
>
>
> Regards,
> Markus Paaso
>
>
>
>
>
>
> 2017-09-21 22:25 GMT+03:00 Frank McQuillan <fmcquillan@pivotal.io>:
>
>> Markus,
>>
>> Sorry for the delay in response.  Below is another approach that involves
>> a JOIN between term frequency and an LDA output table.  It is a bit simpler
>> at least for me.
>>
>> Regarding the indexing, you are correct that it is inconsistent.  The
>> term frequency module seems to use 0-based indexing (e.g., docid) whereas
>> LDA uses 1-based indexing (e.g., topicid).  I created a JIRA to address this:
>> https://issues.apache.org/jira/browse/MADLIB-1160
>> Probably the best way is to standardize on 1-based indexing.
>>
>> ----
>>
>> 1) Create data set
>>
>> DROP TABLE IF EXISTS documents;
>> CREATE TABLE documents(docid INT4, contents TEXT);
>> INSERT INTO documents VALUES
>> (0, 'Statistical topic models are a class of Bayesian latent variable
>> models, originally developed for analyzing the semantic content of large
>> document corpora.'),
>> (1, 'By the late 1960s, the balance between pitching and hitting had
>> swung in favor of the pitchers. In 1968 Carl Yastrzemski won the American
>> League batting title with an average of just .301, the lowest in history.'),
>> (2, 'Machine learning is closely related to and often overlaps with
>> computational statistics; a discipline that also specializes in
>> prediction-making. It has strong ties to mathematical optimization, which
>> deliver methods, theory and application domains to the field.'),
>> (3, 'California''s diverse geography ranges from the Sierra Nevada in the
>> east to the Pacific Coast in the west, from the Redwood–Douglas fir forests
>> of the northwest, to the Mojave Desert areas in the southeast. The center
>> of the state is dominated by the Central Valley, a major agricultural area.
>> ');
>>
>>
>> 2) Convert docs to a list of words
>>
>> ALTER TABLE documents ADD COLUMN words TEXT[];
>> UPDATE documents SET words = regexp_split_to_array(lower(contents),
>> E'[\\s+\\.\\,]');
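
A side note on that regex: the character class splits on whitespace, '+',
'.', and ',', and leaves empty-string tokens where delimiters are adjacent.
A Python mirror (function name is illustrative) makes this easy to check:

```python
import re

# Mirrors regexp_split_to_array(lower(contents), E'[\\s+\\.\\,]'):
# split on whitespace, '+', '.', or ','.  Adjacent delimiters (e.g. a
# comma followed by a space) produce empty-string tokens.
def tokenize(contents):
    return re.split(r'[\s+.,]', contents.lower())

print(tokenize('Statistical topic models, for corpora.'))
# ['statistical', 'topic', 'models', '', 'for', 'corpora', '']
```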
>>
>>
>> 3) Term frequency
>>
>> DROP TABLE IF EXISTS my_training, my_training_vocabulary;
>> SELECT madlib.term_frequency('documents', 'docid', 'words',
>> 'my_training', TRUE);
>> SELECT * FROM my_training order by docid limit 20;
>>  docid | wordid | count
>> -------+--------+-------
>>      0 |     57 |     1
>>      0 |     86 |     1
>>      0 |      4 |     1
>>      0 |     55 |     1
>>      0 |     69 |     2
>>      0 |     81 |     1
>>      0 |     30 |     1
>>      0 |     33 |     1
>>      0 |     36 |     1
>>      0 |     43 |     1
>>      0 |     25 |     1
>>      0 |     65 |     2
>>      0 |     72 |     1
>>      0 |      9 |     1
>>      0 |      0 |     2
>>      0 |     29 |     1
>>      0 |     18 |     1
>>      0 |     12 |     1
>>      0 |     96 |     1
>>      0 |     91 |     1
>> (20 rows)
>>
>>
>> 4) Run LDA
>>
>> DROP TABLE IF EXISTS my_model, my_outdata;
>> SELECT madlib.lda_train( 'my_training',   -- Data table
>>                          'my_model',      -- Model table
>>                          'my_outdata',    -- Output table
>>                          104,             -- Vocab size
>>                          5,               -- Num topics
>>                          10,              -- Num iterations
>>                          5,               -- Alpha (Dirichlet param)
>>                          0.01             -- Beta (Dirichlet param)
>>                        );
>>
>>
>> 5) Get topic description by top-k words
>>
>> DROP TABLE IF EXISTS my_topic_desc;
>> SELECT madlib.lda_get_topic_desc( 'my_model',
>>                                   'my_training_vocabulary',
>>                                   'my_topic_desc',
>>                                   15);
>>
>> 6) Join topic description with term frequency
>> SELECT docid, wordid, topicid FROM my_topic_desc JOIN my_training USING
>> (wordid) WHERE docid=3 ORDER BY docid, wordid;
>>
>> docid | wordid | topicid
>> -------+--------+---------
>>      3 |      0 |       3
>>      3 |      4 |       2
>>      3 |      5 |       3
>>      3 |     13 |       2
>>      3 |     14 |       5
>>      3 |     20 |       5
>>      3 |     21 |       2
>>      3 |     23 |       2
>>      3 |     24 |       3
>>      3 |     27 |       2
>>      3 |     32 |       3
>>      3 |     35 |       3
>>      3 |     38 |       5
>>      3 |     39 |       5
>> etc
>>
>> Frank
>>
>>
>>
>> On Tue, Aug 29, 2017 at 11:34 PM, Markus Paaso <markus.paaso@gmail.com>
>> wrote:
>>
>>> Hi Frank,
>>>
>>> I want to explain the LDA results for a single document (in this case
>>> for docid = 6) by binding topicid into each wordid in the document.
>>> The SQL query below gives exactly what I want but I am not sure if that
>>> is the most effective way to build docid-wordid-topicid triples.
>>>
>>> SELECT docid, unnest((counts::text || ':' ||
>>> words::text)::madlib.svec::float[]) AS wordid, unnest(topic_assignment)
>>> + 1 AS topicid FROM lda_output WHERE docid = 6;
>>>
>>> I have trained LDA with 'lda_output' as the output_data_table argument
>>> in madlib.lda_train.
>>>
>>>
>>> Regards, Markus
>>>
>>> 2017-08-28 23:19 GMT+03:00 Frank McQuillan <fmcquillan@pivotal.io>:
>>>
>>>> Markus,
>>>>
>>>> Please see example 4 in the user docs
>>>> http://madlib.apache.org/docs/latest/group__grp__lda.html#examples
>>>> which provides helper functions for learning more about the learned
>>>> model.
>>>>
>>>> -- The topic description by top-k words
>>>> DROP TABLE IF EXISTS my_topic_desc;
>>>> SELECT madlib.lda_get_topic_desc( 'my_model',
>>>>                                   'my_training_vocabulary',
>>>>                                   'my_topic_desc',
>>>>                                   15);
>>>> select * from my_topic_desc order by topicid, prob DESC;
>>>>
>>>> produces:
>>>>
>>>> topicid | wordid |        prob        |       word
>>>> ---------+--------+--------------------+-------------------
>>>>        1 |     69 |  0.181900726392252 | of
>>>>        1 |     52 | 0.0608353510895884 | is
>>>>        1 |     65 | 0.0608353510895884 | models
>>>>        1 |     30 | 0.0305690072639225 | corpora
>>>>        1 |      1 | 0.0305690072639225 | 1960s
>>>>        1 |     57 | 0.0305690072639225 | latent
>>>>
>>>> Please let us know if this is of use, or you are looking for something
>>>> else?
>>>>
>>>> Frank
>>>>
>>>>
>>>> On Fri, Aug 11, 2017 at 6:45 AM, Markus Paaso <markus.paaso@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I found a working but quite awkward way to form docid-wordid-topicid
>>>>> pairing with a single SQL query:
>>>>>
>>>>> SELECT docid, unnest((counts::text || ':' ||
>>>>> words::text)::madlib.svec::float[]) AS wordid,
>>>>> unnest(topic_assignment) + 1 AS topicid FROM lda_output WHERE docid = 6;
>>>>>
>>>>> Output:
>>>>>
>>>>>  docid | wordid | topicid
>>>>> -------+--------+---------
>>>>>      6 |   7386 |       3
>>>>>      6 |  42021 |      17
>>>>>      6 |  42021 |      17
>>>>>      6 |   7705 |      12
>>>>>      6 | 105334 |      16
>>>>>      6 |  18083 |       3
>>>>>      6 |  89364 |       3
>>>>>      6 |  31073 |       3
>>>>>      6 |  28934 |       3
>>>>>      6 |  28934 |      16
>>>>>      6 |  56286 |      16
>>>>>      6 |  61921 |       3
>>>>>      6 |  61921 |       3
>>>>>      6 |  59142 |      17
>>>>>      6 |  33364 |       3
>>>>>      6 |  79035 |      17
>>>>>      6 |  37792 |      11
>>>>>      6 |  91823 |      11
>>>>>      6 |  30422 |       3
>>>>>      6 |  94672 |      17
>>>>>      6 |  62107 |       3
>>>>>      6 |  94673 |       2
>>>>>      6 |  62080 |      16
>>>>>      6 | 101046 |      17
>>>>>      6 |   4379 |       8
>>>>>      6 |   4379 |       8
>>>>>      6 |   4379 |       8
>>>>>      6 |   4379 |       8
>>>>>      6 |   4379 |       8
>>>>>      6 |  26503 |      12
>>>>>      6 |  61105 |       3
>>>>>      6 |  19193 |       3
>>>>>      6 |  28929 |       3
>>>>>
>>>>>
>>>>> Is there any simpler way to do that?
>>>>>
>>>>>
>>>>> Regards,
>>>>> Markus Paaso
>>>>>
>>>>>
>>>>>
>>>>> 2017-08-11 15:23 GMT+03:00 Markus Paaso <markus.paaso@gmail.com>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am having some problems reading the LDA output.
>>>>>>
>>>>>>
>>>>>> Please see this row of madlib.lda_train output:
>>>>>>
>>>>>> docid            | 6
>>>>>> wordcount        | 33
>>>>>> words            | {7386,42021,7705,105334,18083,
>>>>>> 89364,31073,28934,56286,61921,59142,33364,79035,37792,91823,
>>>>>> 30422,94672,62107,94673,62080,101046, 4379,26503,61105,19193,28929}
>>>>>> counts           | {1,2,1,1,1,1,1,2,1,2,1,1,1,1,1
>>>>>> ,1,1,1,1,1,1,5,1,1,1,1}
>>>>>> topic_count      | {0,1,13,0,0,0,0,5,0,0,2,2,0,0,0,4,6,0,0,0}
>>>>>> topic_assignment | {2,16,16,11,15,2,2,2,2,15,15,2
>>>>>> ,2,16,2,16,10,10,2,16,2,1,15,16,7,7,7,7,7,11,2,2,2}
>>>>>>
>>>>>>
>>>>>> It's hard to tell which topic ids are assigned to which word ids
>>>>>> when the *words* array has a different length than the
>>>>>> *topic_assignment* array.
>>>>>> It would be nice if the *words* array were the same length as the
>>>>>> *topic_assignment* array.
>>>>>>
>>>>>> 1. What kind of SQL query would give a result with wordid - topicid
>>>>>> pairs?
>>>>>> I tried to match them by hand but failed for wordid 28934. I wonder
>>>>>> if a repeating wordid can have different topic assignments in the
>>>>>> same document?
>>>>>>
>>>>>> wordid | topicid
>>>>>> ----------------
>>>>>> 7386   | 2
>>>>>> 42021  | 16
>>>>>> 7705   | 11
>>>>>> 105334 | 15
>>>>>> 18083  | 2
>>>>>> 89364  | 2
>>>>>> 31073  | 2
>>>>>> 28934  | 2 OR 15 ?
>>>>>> 56286  | 15
>>>>>> 61921  | 2
>>>>>> 59142  | 16
>>>>>> 33364  | 2
>>>>>> 79035  | 16
>>>>>> 37792  | 10
>>>>>> 91823  | 10
>>>>>> 30422  | 2
>>>>>> 94672  | 16
>>>>>> 62107  | 2
>>>>>> 94673  | 1
>>>>>> 62080  | 15
>>>>>> 101046 | 16
>>>>>> 4379   | 7
>>>>>> 26503  | 11
>>>>>> 61105  | 2
>>>>>> 19193  | 2
>>>>>> 28929  | 2
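
Assuming the run-length expansion of *words* by *counts* lines up
position-by-position with *topic_assignment*, the pairing can be checked
mechanically in plain Python with the docid=6 row above:

```python
# Data copied from the docid=6 row of the lda_train output table above.
words = [7386, 42021, 7705, 105334, 18083, 89364, 31073, 28934, 56286,
         61921, 59142, 33364, 79035, 37792, 91823, 30422, 94672, 62107,
         94673, 62080, 101046, 4379, 26503, 61105, 19193, 28929]
counts = [1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1]
topic_assignment = [2, 16, 16, 11, 15, 2, 2, 2, 2, 15, 15, 2, 2, 16, 2,
                    16, 10, 10, 2, 16, 2, 1, 15, 16, 7, 7, 7, 7, 7, 11,
                    2, 2, 2]

# Expand each wordid by its count, then zip with topic_assignment.
dense = [w for w, c in zip(words, counts) for _ in range(c)]
assert len(dense) == len(topic_assignment) == 33  # matches wordcount

pairs = [(w, t) for w, t in zip(dense, topic_assignment) if w == 28934]
print(pairs)  # [(28934, 2), (28934, 15)]
```

Under that assumption, wordid 28934 does get two different topics (2 and
15) in the same document, which is consistent with the repeated wordids
carrying different topicids in the lda_predict output earlier in the thread.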
>>>>>>
>>>>>>
>>>>>> 2. Why does *topic_assignment* use zero-based indexing while the
>>>>>> other results use one-based indexing?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Markus Paaso
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Markus Paaso
>>>>> Tel: +358504067849 <+358%2050%204067849>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Markus Paaso
>>> Tel: +358504067849 <+358%2050%204067849>
>>>
>>
>>
>
>
> --
> Markus Paaso
> Tel: +358504067849 <+358%2050%204067849>
>
