ant-dev mailing list archives

From Kev Jackson
Subject Re: sandbox gendoc
Date Mon, 16 Jan 2006 04:09:18 GMT

>Java does regex just fine, albeit more verbose (when is Java not verbose ;-),
>but my main point is that you already have (Java) tools that allow you to
>have an XML view of the existing HTML manual (tagsoup, etc...). Leave
>the parsing to these tools, and concentrate on transforming the
>"loose" HTML schema into a more structured XML, probably using XSL as
>the language rather than scripting. By adding a little more structure
>to the HTML with <div>s, the XML view of the HTML could be complete
>enough for robust transformation to XML, and perhaps even robust
>enough so that the HTML remains as the official "source" document of
>the manual (but stripped of all formatting, which would be added later
>in the XML processing pipeline). The main advantage of this would be
>that editing HTML using an HTML editor for manual editing can be
>easier/nicer and kinda wysiwyg, compared to editing the transformed
I'm using libraries - I'm not writing my own HTML tokenizer :).

>>I'm aiming for a proof of concept script (for echo task) sometime in
>>the next week (if work doesn't get in the way too much).  After that
>>I'll see how easy a refactoring job will be for making it generic.
>From above, you can see that I envision the possibility of the HTML
>manual to remain, so it's all the more important that the transform is
This suggests that the HTML manual is the "one true source" for the 
manual, and all other versions are derived from it through some processing.

>Talk about the tokenizer being too greedy makes me uneasy ;-) Leave
>the parsing to an existing parsing tool, and just manipulate the
>structure of the document once it's been "reformatted" to a SAX event
>stream. In this form, it feeds easily and naturally to an XSL
>transform pipeline.
>That's my view of the whole thing anyway ;-) --DD
From this discussion, my understanding is:

1 - Better to use Java + libs - presumably so that an Ant task can be
derived from it (Ant creating its own manual would be rather nice, I'd
have to say)
2 - Conversion util must be robust
3 - Conversion util will be long-lived
4 - Modification of existing HTML to make it easier for conversion util
would probably be a good idea
5 - Structure of XML is as yet undecided - DocBook with RELAX NG has now
been suggested as an alternative to a bespoke XML format
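
To make points 1, 2 and 4 concrete, here is a rough sketch of the
"parser -> SAX events -> XSL" pipeline, not a proposal for the actual
implementation. Everything named in it is an assumption for
illustration: the class name, the stylesheet, the <task>/<description>
target elements and the class="description" marker on the <div>. It
uses the JDK's own SAX parser, so it only accepts well-formed input;
for the real manual a lenient XMLReader such as TagSoup's
org.ccil.cowan.tagsoup.Parser would presumably take its place, and the
stylesheet might then need to match namespaced XHTML elements.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class ManualTransformSketch {

    // Hypothetical stylesheet: the <h2> becomes the task name and the
    // <div class="description"> becomes a <description> element.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<task name='{//h2}'>"
      + "<description>"
      + "<xsl:value-of select=\"normalize-space(//div[@class='description'])\"/>"
      + "</description>"
      + "</task>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Parses the page with an XMLReader and feeds the resulting SAX
    // event stream straight into the XSL transform.
    static String transform(String html) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Assumption: a lenient HTML parser would be swapped in here.
        XMLReader reader = factory.newSAXParser().getXMLReader();

        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));

        StringWriter out = new StringWriter();
        transformer.transform(
            new SAXSource(reader, new InputSource(new StringReader(html))),
            new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><h2>Echo</h2>"
            + "<div class=\"description\">Echoes a message.</div>"
            + "</body></html>";
        System.out.println(transform(page));
    }
}
```

The only structure the stylesheet relies on is the extra <div>, which
is the sort of small change to the existing HTML that point 4 is about.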

