xmlgraphics-fop-dev mailing list archives

From Vincent Hennebert <vhenneb...@gmail.com>
Subject Re: Choosing a better threshold in line breaking
Date Tue, 28 Oct 2008 12:53:56 GMT
Hi Dario,

This is an interesting study. There is probably room for improvement, as
an adjustment ratio of 20, even if this is the last resort, is really
high. More below.

Dario Laera wrote:
> Hi all,
> 
> I found the reason why breaking paragraphs into short lines is really
> slow and memory hungry: the threshold of the adjustment ratio, set to 20
> at the last try, is too high and makes EVERY legal breakpoint a feasible
> breakpoint too. A check should be performed to avoid this situation and
> then choose a better threshold.
> 
> For example, with a two-column A4 page layout the line width is
> ~140000. The glue stretchability, as far as I can see in the TextLM
> class, is often set to "3 * LineLayoutManager.DEFAULT_SPACE_WIDTH",
> which is equal to ~10000. When you compute the adj ratio for a line
> that has just one glue you get r = 140000/10000 = 14, which is lower
> than the threshold = 20.0, so an active node is added.
> 
> A better threshold can be chosen as follows: let idealDifference be a
> reasonable size we choose as a good threshold. We can assume "3 *
> LineLayoutManager.DEFAULT_SPACE_WIDTH" as the default stretchability and
> compute a better threshold this way:
> 
>     idealRatio = idealDifference / (3 * LineLayoutManager.DEFAULT_SPACE_WIDTH);
> 
> and bound that value:
> 
>     1.0 <= idealRatio <= 20.
> 
> How to choose idealDifference? A naive solution, but probably not so
> bad, can be:
> 
>     idealDifference = iLineWidth / 2;
> 
> A more sophisticated, maybe overly sophisticated, solution can choose
> it by looking at the average box length: we can see how many average
> boxes fit on a line (wordsPerLine) and compute:
> 
>     avgWord = avgBox + LineLayoutManager.DEFAULT_SPACE_WIDTH;
>     idealDifference = iLineWidth - (avgWord * (wordsPerLine / 2));
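
To make the numbers in the quoted proposal concrete, here is a minimal sketch of the worst-case ratio and of the naive bounded threshold. The constants are the approximate values quoted above; the function names are illustrative only and are not FOP APIs.

```python
# Approximate values quoted in the mail, in FOP's internal units.
LINE_WIDTH = 140_000       # line width for a two-column A4 layout
DEFAULT_STRETCH = 10_000   # roughly 3 * LineLayoutManager.DEFAULT_SPACE_WIDTH


def adjustment_ratio(shortfall, total_stretch):
    """Factor by which the available stretch must be scaled to fill the line."""
    return shortfall / total_stretch


def ideal_ratio(line_width, stretch=DEFAULT_STRETCH, lo=1.0, hi=20.0):
    """Naive proposal: idealDifference = lineWidth / 2, clamped to [lo, hi]."""
    ideal_difference = line_width / 2
    return max(lo, min(hi, ideal_difference / stretch))


# Worst case from the mail: a nearly empty line with a single glue.
print(adjustment_ratio(LINE_WIDTH, DEFAULT_STRETCH))  # 14.0
# Threshold the naive proposal would select for the same line width.
print(ideal_ratio(LINE_WIDTH))                        # 7.0
```

With these figures the proposed threshold (7) would reject the nearly empty line that the current threshold of 20 accepts.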

I’m not sure I’m following you here. What’s the value of wordsPerLine?
Is it set manually to a value that’s considered to be a reasonable one?
Because if it’s computed automatically, the formula can be simplified:
    wordsPerLine = lineWidth / avgWord, so
    idealDifference = lineWidth - lineWidth / 2
                    = lineWidth / 2
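
A quick numerical check of that simplification, as a sketch only: the average word widths below are hypothetical, and the 140000 line width is the figure quoted earlier in the thread. It also shows what happens if wordsPerLine is rounded down to a whole number, which the mail does not specify.

```python
import math


def ideal_difference(line_width, avg_word, words_per_line):
    """Formula from the quoted proposal."""
    return line_width - avg_word * (words_per_line / 2)


line_width = 140_000
for avg_word in (6_000, 9_000, 13_000):  # hypothetical average word widths
    # wordsPerLine computed as a real number: collapses to lineWidth / 2.
    exact = ideal_difference(line_width, avg_word, line_width / avg_word)
    # wordsPerLine rounded down: at most avgWord / 2 above lineWidth / 2.
    whole = ideal_difference(line_width, avg_word, line_width // avg_word)
    print(avg_word, math.isclose(exact, line_width / 2), whole)
```

Either way the result stays within half an average word of lineWidth / 2, so the average box length adds little information.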

Anyway, the adjustment ratio is already a notion that is independent of
the line width; that’s precisely the purpose of a ratio. In the case of
left-justified mode, the only available stretchability is due to the
space at the end of the line; the question is to determine up to how
much we accept that space to be...
Ok, by writing that I think I know what you mean now :-) But the issue
should probably be considered the other way around: the problem is not
so much the adjustment ratio as the amount of space allowed at the end
of the line. In the case of narrow columns, that “3 times the width of
a space character” is too big WRT the line width. Instead of having
a fixed value, it should be changed into a small proportion of the line
width.
Originally, that 3 * space-width value was probably chosen for
“normal” line widths, that is, lines containing an optimal number of
words. I’ve read somewhere that the optimal number of letters per line
is 60. Taking the Times font, the average width of lowercase letters is
459, so the optimal line width is roughly 459*60 = 27540. The width of
the space character is 250, so 3 times a space character at the end of
a line makes 750, or 2.7% of that line. So let’s go for an elastic space
of 3% of the line width, and then we can always choose the same
adjustment ratio; the number of active nodes would be “automatically”
limited, whatever the line width.
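
The arithmetic above, and the reason a proportional elastic space makes the ratio independent of the line width, can be sketched as follows. The function name and the `fill` parameter (the fraction of the line occupied by text) are assumptions made for illustration, not anything from FOP.

```python
import math

AVG_LOWERCASE_WIDTH = 459  # Times, as quoted in the mail
SPACE_WIDTH = 250          # width of the space character, same units
OPTIMAL_LETTERS = 60       # letters per line

optimal_line_width = AVG_LOWERCASE_WIDTH * OPTIMAL_LETTERS  # 27540
end_stretch = 3 * SPACE_WIDTH                                # 750
print(round(100 * end_stretch / optimal_line_width, 1))      # 2.7 (percent)


def ratio_for_fill(line_width, fill, percent=3):
    """Adjustment ratio of a left-aligned line whose text covers `fill` of it,
    when the elastic end-of-line space is `percent` % of the line width."""
    stretch = line_width * percent / 100
    shortfall = line_width * (1 - fill)
    return shortfall / stretch


# The same fill gives the same ratio for a normal and a narrow-column width.
print(math.isclose(ratio_for_fill(27_540, 0.9), ratio_for_fill(140_000, 0.9)))
```

Because both the shortfall and the stretch scale with the line width, a single threshold then rejects the same proportion of empty space at any width.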

> > Do you mean that this last try is /always/ performed (even when we 
> > already have a set of feasible breaks)?
> 
> It's not always performed (so it's formally correct), but in my tests
> it's rarely avoided, more precisely just once, with the file
> "my_franklin_rep-jus.fo", which is composed of many paragraphs in 1
> column with justified text. What I think (obviously, I may be wrong, as
> has been proved in other mails ;) is that another intermediate try, with
> a judicious threshold, can be performed, leading to the same final
> result but with much better performance, if this intermediate try
> doesn't fail like the previous one.
> Anyway, I always run my tests with hyphenation enabled; I should try
> disabling it to see whether the second try is run with threshold=5 and
> whether it succeeds.

The two-column case is not surprising: the columns are too narrow, which
makes line-breaking particularly challenging. The one-column
left-justified case surprises me a bit, however. I would have expected
that text could be broken without even needing hyphenation. I find it
a bit ironic that justifying text is actually easier for the
line-breaking algorithm...
At any rate, that adjustment ratio of 20 for the last run is surely too
much. It can probably be reduced to 5. Actually, I’m not even sure
a third run with a high adjustment ratio is desirable. Maybe we should
simply re-run the algorithm in forcing mode, and accept the underfull
lines that will be introduced.
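
A rough sketch of that alternative: two ordinary runs, then a forced run instead of a third run at a high threshold. Everything here is hypothetical: `try_break` stands in for the breaking algorithm and is not a FOP method, the first-run threshold of 1.0 is an assumption (only the second-run value of 5 is mentioned in this thread), and reusing the last settings for the forced run is also an assumption.

```python
def break_paragraph(try_break, passes=((1.0, False), (5.0, True))):
    """Run the breaker with increasingly permissive settings.

    try_break(threshold, hyphenate, force) is a hypothetical callable that
    returns a list of breaks, or None when no feasible set was found.
    """
    for threshold, hyphenate in passes:
        result = try_break(threshold, hyphenate, False)
        if result is not None:
            return result
    # Last resort: same settings as the final pass, but in forcing mode,
    # accepting whatever underfull lines that produces.
    threshold, hyphenate = passes[-1]
    return try_break(threshold, hyphenate, True)
```

The forced run never needs a threshold of 20, so the number of active nodes stays bounded by the second-run threshold.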

If you could run statistics on more real-life documents (how often the
first run without hyphenation is sufficient, how often the third run is
required, for justified and left-aligned text, single- and two-column
layouts on A4 paper, etc.), that would be fantastic.


Thanks,
Vincent
