>Java does regex just fine, albeit more verbose (when is Java not verbose ;-),
>but my main point is that you already have (Java) tools that allow you to
>have an XML view of the existing HTML manual (tagsoup, etc...). Leave
>the parsing to these tools, and concentrate on transforming the
>"loose" HTML schema into a more structured XML, probably using XSL as
>the language rather than scripting. By adding a little more structure
>to the HTML with <div>s, the XML view of the HTML could be complete
>enough for robust transformation to XML, and perhaps even robust
>enough so that the HTML remains as the official "source" document of
>the manual (but stripped of all formatting, which would be added later
>in the XML processing pipeline). The main advantage of this would be
>that editing HTML using an HTML editor for manual editing can be
>easier/nicer and kinda wysiwyg, compared to editing the transformed
>XML.
>
>
>
I'm using libraries - I'm not writing my own HTML tokenizer :).
>>I'm aiming for a proof of concept script (for echo task) sometime in
>>the next week (if work doesn't get in the way too much). After that
>>I'll see how easy a refactoring job will be for making it generic.
>>
>>
>
>From above, you can see that I envision the possibility of the HTML
>manual remaining, so it's all the more important that the transform is
>robust.
>
>
>
This suggests that the HTML manual is the "one true source" for the
manual, and all other versions are derived from it through some processing.
>Talk of the tokenizer being too greedy makes me uneasy ;-) Leave
>the parsing to existing parsing tools, and just manipulate the
>structure of the document once it's been "reformatted" to a SAX event
>stream. In this form, it feeds easily and naturally to an XSL
>transform pipeline.
>
>That's my view of the whole thing anyway ;-) --DD
>
>
>
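To check that I've understood the pipeline described above, here is a minimal sketch of it in Java: an XMLReader turns the HTML into a SAX event stream, which is fed straight into an XSL transform. Everything specific in it is my own assumption for illustration: the class name, the `<div class="task">` markup, the `<manual>`/`<task>` target elements and the stylesheet itself. The stock JAXP parser stands in for TagSoup here, so the sample input has to be well-formed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class ManualTransformSketch {

    // Hypothetical stylesheet: turns each <div class="task"> into a <task> element.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<manual><xsl:apply-templates select=\"//div[@class='task']\"/></manual>"
      + "</xsl:template>"
      + "<xsl:template match=\"div[@class='task']\">"
      + "<task name='{h2}'><description><xsl:value-of select='p'/></description></task>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Stand-in reader: the JDK's own SAX parser (accepts well-formed markup only).
    static XMLReader defaultReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    // The reader produces SAX events from the markup; the transformer consumes them.
    static String transform(XMLReader reader, String html, String xsl) throws Exception {
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(html)));
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xsl)));
        StringWriter out = new StringWriter();
        transformer.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><div class='task'><h2>echo</h2>"
                    + "<p>Echoes a message.</p></div></body></html>";
        System.out.println(transform(defaultReader(), html, XSL));
    }
}
```

With TagSoup on the classpath, `defaultReader()` could presumably be swapped for `new org.ccil.cowan.tagsoup.Parser()`, which also implements XMLReader and copes with the "loose" HTML. As far as I know TagSoup puts elements in the XHTML namespace by default, so the match patterns would then need a namespace prefix.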
From this discussion my understanding is:
1 - Better to use Java + libs - presumably so that an Ant task can be
derived from it (Ant creating its own manual would be rather nice, I'd
have to say)
2 - Conversion util must be robust
3 - Conversion util will be long-lived
4 - Modification of existing HTML to make it easier for conversion util
would probably be a good idea
5 - Structure of XML is as yet undecided - DocBook with RELAX NG has now
been suggested as an alternative to a bespoke XML format
Kev