Hi all,


Story time:
So we built a script that reads from the REST API of a web app, writes an .xml file, and uploads it, zipped, to an SFTP endpoint.
To write the .xml we used a textbook-like [1] approach [2] to build a nice, horrifying, XML-ish document,
with what amounts to unvalidated user input in some of the text nodes.

To no one's surprise (at this point), now, 2 years later, the receiving party complains that they find illegal '0x8' characters in the uploads, which they cannot parse.


Turns out XML [3] and HTML [4] each have their own opinion about which characters are allowed in their documents.
But at least they agree that most control characters (0x0 - 0x8, 0xB, 0xE - 0x1F) are bad.
They differ on 0xC (illegal in XML, but ASCII whitespace in HTML) and 0xD (legal in XML, but excluded from HTML numeric character references).
Some others are at least 'discouraged' (0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, ...).
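
To make the XML side concrete, here is a minimal sketch that checks a string against the Char production quoted in [3] and reports where the offending code points sit. The helper names isXmlChar and findIllegalXmlChars are made up for this mail; they are not part of Groovy or MarkupBuilder.

```groovy
// Sketch only: isXmlChar and findIllegalXmlChars are hypothetical helpers, not Groovy API.

// True if the code point matches the XML 1.0 "Char" production quoted in [3].
boolean isXmlChar(int cp) {
    cp == 0x9 || cp == 0xA || cp == 0xD ||
        (cp >= 0x20 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF)
}

// Returns a map of "UTF-16 index -> code point" for every character XML 1.0 forbids.
Map<Integer, Integer> findIllegalXmlChars(String text) {
    def hits = [:]
    int i = 0
    while (i < text.length()) {
        int cp = text.codePointAt(i)           // reads a full code point, even beyond the BMP
        if (!isXmlChar(cp)) hits[i] = cp       // a lone surrogate also ends up here
        i += Character.charCount(cp)
    }
    hits
}

// The backspace from our uploads is found at index 2.
println findIllegalXmlChars('ab\bcd')
```

Something like this could run over the user input before it reaches the builder, at least to log what would otherwise end up in the document.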


Now, the "MarkupBuilder" is first and foremost called "MarkupBuilder", so one could argue that it handles the markup part just fine and that's all it should do.

On the other hand, the class proclaims itself in the javadoc [5] to be "for creating XML or HTML markup".
And the documentation [1] also kind of markets it for that purpose.
(And maybe it's a bad look to be able to write invalid .xml?)


So here is the question to you:
1) Is the MarkupBuilder's behavior okay as-is?
2) If not, what should the behavior be?

3) Is this historically a 'done discussion', and are we unwilling to open up that can of worms again?
(What was the previous consensus?)



Going a bit further with this, personally, I could imagine:
* by default sanitizing the output of MarkupBuilder to a compatible subset of characters for _both_ formats
* having some config option to switch to 'xml', 'html' or 'off' mode for "character set validation"
* dealing with invalid characters by replacing them with the \uFFFD (�) replacement character
  (as one comment on Jeff Atwood's answer [6] suggested)

That is probably the most far-reaching change I can picture.
But I'm eager to hear some of your opinions.
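
To make the idea above a bit more tangible, here is a rough sketch of what such a sanitizing step could look like if it were applied to text before handing it to the builder. Everything here is an assumption on my side: the function names, the mode strings ('both', 'xml', 'html', 'off') and the choice to apply the rule quoted in [4] to plain text as well. None of this exists in MarkupBuilder today.

```groovy
import groovy.xml.MarkupBuilder

// Sketch only: xmlOk, htmlOk and sanitize are hypothetical names, not MarkupBuilder API.

// XML 1.0 "Char" production, see [3].
boolean xmlOk(int cp) {
    cp == 0x9 || cp == 0xA || cp == 0xD ||
        (cp >= 0x20 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF)
}

// Rule quoted in [4]: no CR, no noncharacters, no controls other than ASCII whitespace.
// Assumption: lone surrogates are rejected too, since UTF-8 cannot encode them anyway.
boolean htmlOk(int cp) {
    boolean asciiWhitespace = cp in [0x9, 0xA, 0xC, 0xD, 0x20]
    boolean control = cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F)
    boolean nonchar = (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE
    boolean surrogate = cp >= 0xD800 && cp <= 0xDFFF
    cp != 0xD && !nonchar && !surrogate && !(control && !asciiWhitespace)
}

// Replaces every code point the selected mode rejects with U+FFFD.
String sanitize(String text, String mode = 'both') {
    if (mode == 'off') return text
    def out = new StringBuilder()
    int i = 0
    while (i < text.length()) {
        int cp = text.codePointAt(i)
        boolean ok = (mode == 'html' || xmlOk(cp)) && (mode == 'xml' || htmlOk(cp))
        out.appendCodePoint(ok ? cp : 0xFFFD)
        i += Character.charCount(cp)
    }
    out.toString()
}

// Usage: sanitize the user input before it becomes a text node.
def writer = new StringWriter()
new MarkupBuilder(writer).note(sanitize('a\bb'))
println writer
```

In the default 'both' mode a character survives only if both rule sets accept it, which is the "compatible subset" from the first bullet. Whether this belongs inside MarkupBuilder or in a separate helper is exactly the open question.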

Any thoughts / arguments / things I've missed so far?
Any chance of finding some kind of consensus on the matter?


Best,
Simon


[1] https://groovy-lang.org/processing-xml.html#_markupbuilder
[2]
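    import groovy.xml.MarkupBuilder

    // Runs the given closure against a MarkupBuilder and returns the document as a String.
    private toXmlFile(body) {
        def writer = new StringWriter()
        def xml = new MarkupBuilder(writer)

        // Text nodes are written as passed in; control characters such as 0x8 are not filtered.
        body(xml)

        '<?xml version="1.0" encoding="UTF-8"?>' + "\n" + writer.toString() + "\n"
    }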

[3] https://www.w3.org/TR/xml/#NT-Char
"Consequently, XML processors MUST accept any character in the range specified for Char.
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */"
(Note: this is a "positive definition" and could be amended at some point to include more characters.)

[4] https://html.spec.whatwg.org/#character-references
"The numeric character reference forms described above are allowed to reference any code point excluding U+000D CR, noncharacters, and controls other than ASCII whitespace."

[5] https://docs.groovy-lang.org/latest/html/api/groovy/xml/MarkupBuilder.html
[6] https://stackoverflow.com/questions/397250/unicode-regex-invalid-xml-characters/961504#961504