trjhtr

XML Parser written in Java

Question

Accepts a String (or a char array) as an input and stores the data in a tree structure.

For example, given the input:

<foo>
 <bar>baz</bar>
 <qux fox="jump"/>
</foo>

Output will be:

XMLElement{elementName='foo', 
 children=[XMLElement{elementName='bar', elementValue='baz'}, 
 XMLElement{elementName='qux', attributes=[ElementAttribute{name='fox', value='jump'}]}]}

I would like to hear your criticism on design principles (SRP, DRY, KISS, etc..), readability (naming of variables, methods) and maintainability (code structure, methods) of the code you see.

Already noted in the comments of the code, but:

XML provided as input must not contain any XML comments.

Mixed data such as: <MyElement>Some <b>Mixed</b> Data</MyElement> is not supported.

Without further ado, let's jump into the code..

Entity classes

XMLElement.java

package xml2json2;

import java.util.ArrayList;
import java.util.List;

public class XMLElement {

    private String elementName; // can not be null
    private String elementValue = "";

    private List<ElementAttribute> attributes = new ArrayList<>(); // can be empty
    private List<XMLElement> children = new ArrayList<>(); // can be empty

    public String getElementName() {
        return elementName;
    }

    public void setElementName(String elementName) {
        this.elementName = elementName;
    }

    public String getElementValue() {
        return elementValue;
    }

    public void setElementValue(String elementValue) {
        this.elementValue = elementValue;
    }

    public List<ElementAttribute> getAttributes() {
        return attributes;
    }

    public List<XMLElement> getChildren() {
        return children;
    }

    @Override
    public String toString() {
        final StringBuffer sb = new StringBuffer("XMLElement{");
        sb.append("elementName='").append(elementName).append('\'');
        if (!elementValue.equals("")) {
            sb.append(", elementValue='").append(elementValue).append('\'');
        }
        if (attributes.size() != 0) {
            sb.append(", attributes=").append(attributes);
        }
        if (children.size() != 0) {
            sb.append(", children=").append(children);
        }
        sb.append('}');
        return sb.toString();
    }
}

ElementAttribute.java

package xml2json2;

public class ElementAttribute {

    private String name;
    private String value;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getValue() {
        return value;
    }

    public void setValue(String value) {
        this.value = value;
    }

    @Override
    public String toString() {
        final StringBuffer sb = new StringBuffer("ElementAttribute{");
        sb.append("name='").append(name).append('\'');
        sb.append(", value='").append(value).append('\'');
        sb.append('}');
        return sb.toString();
    }
}

Processor

XMLElementTreeBuilderImpl.java

package xml2json2;

// References:
// XML Spec : https://www.liquid-technologies.com/XML
// Regex : https://regexone.com

/*
 This tree builder does not support elements with mixed data such as: <MyElement>Some <b>Mixed</b> Data</MyElement>.
 Mixed data can contain text and child elements within the containing element. This is typically only used to mark up data (HTML etc).
 It's typically only used to hold mark-up/formatted text entered by a person,
 it is typically not the best choice for storing machine readable data as it adds significant complexity to the parser.
 */

/*
 XML to be processed must not contain any comments!
 */

public class XMLElementTreeBuilderImpl {

    private char[] xmlArray;
    private int currentIndex = 0;


    // This class has only 2 public methods:
    public XMLElement buildTreeFromXML(String xml) {
        return buildTreeFromXML(xml.toCharArray());
    }

    public XMLElement buildTreeFromXML(char[] arr) {
        this.xmlArray = arr;
        XMLElement root = nodeFromStringRecursively();
        return root;
    }

    // Everything else below here is private, i.e. inner workings of the class..
    private XMLElement nodeFromStringRecursively() {
        final XMLElement xmlElement = new XMLElement();

        clearWhiteSpace();

        if (tagStart()) { // A new XML Element is starting..
            currentIndex++;
            final String elementName = parseStartingTagElement(); // finishes element name..
            xmlElement.setElementName(elementName);
        }

        clearWhiteSpace();

        // We have not closed our tag yet..
        // At this point we might have attributes.. Lets add them if they exist..
        while (isLetter()) {
            addAttribute(xmlElement);
            clearWhiteSpace();
        }

        // At this point we will have one of the following in current index:
        // [/] -> Self closing tag..
        // [>] -> Tag ending - (Data or children or starting or immediately followed by an ending tag..)

        if (selfClosingTagEnd()) {
            return xmlElement;
        }

        // At this point we are sure this element was not a self closing element..
        currentIndex++; // skipping the tag close character, i.e. '>'

        // At this point we are facing one of the following cases:
        // Assume our starting tag was <foo> for the examples..
        // 1 - [</] : Immediate tag end. "</foo>"
        // 2 - [\s\w]+[</] : Any whitespace or any alphanumeric character, one or more repetitions, followed by tag end. "sample</foo>"
        // 3 - [\s]*(<![CDATA[...]]>)[\s]*[</] : Zero or more white space, followed by CDATA. followed by zero or more white space. "<![CDATA[...]]></foo>"
        // 4 - [\s]*[<]+ : Zero or more white space, followed by one or more child start..

        int currentCase = currentCase();

        switch (currentCase) {
            case 1: // Immediate closing tag, no data to set, no children to add.. Do nothing.
                break;
            case 2:
                setData(xmlElement);
                break;
            case 3:
                setCData(xmlElement);
            case 4:
                while (currentCase() == 4) { // Add children recursively.
                    final XMLElement childToken = nodeFromStringRecursively();
                    xmlElement.getChildren().add(childToken);
                }
        }
        walkClosingTag();
        return xmlElement;
    }

    private String parseStartingTagElement() {
        final StringBuilder elementNameBuilder = new StringBuilder();
        while (!isWhiteSpace() && !selfClosingTagEnd() && !tagEnd()) {
            elementNameBuilder.append(charAtCurrentIndex());
            currentIndex++;
        }
        final String elementName = elementNameBuilder.toString();
        return elementName;
    }

    private void addAttribute(XMLElement xmlElement) {
        // Attribute name..
        final StringBuilder attributeNameBuilder = new StringBuilder();
        while (!isWhiteSpace() && charAtCurrentIndex() != '=') {
            attributeNameBuilder.append(charAtCurrentIndex());
            currentIndex++;
        }

        // Everything in between that is not much of interest to us..
        clearWhiteSpace();
        currentIndex++; // Passing the '='
        clearWhiteSpace();
        currentIndex++; // Passing the '"'

        // Attribute value..
        final StringBuilder attributeValueBuilder = new StringBuilder();
        while (charAtCurrentIndex() != '"') {
            attributeValueBuilder.append(charAtCurrentIndex());
            currentIndex++;
        }
        currentIndex++; // Passing the final '"'
        clearWhiteSpace();

        // Build the attribute object and..
        final ElementAttribute elementAttribute = new ElementAttribute();
        elementAttribute.setName(attributeNameBuilder.toString());
        elementAttribute.setValue(attributeValueBuilder.toString());

        // ..add the attribute to the xmlElement
        xmlElement.getAttributes().add(elementAttribute);
    }

    private int currentCase() {
        if (endTagStart()) {
            return 1;
        }
        if (cDataStart()) {
            return 3;
        }
        if (tagStart() && !endTagStart()) {
            return 4;
        }
        // Here we will look forward, so we need to keep track of where we actually started..
        int currentIndexRollBackPoint = currentIndex;
        while (!endTagStart() && !cDataStart() && !tagStart()) {
            currentIndex++;
            if (endTagStart()) {
                currentIndex = currentIndexRollBackPoint;
                return 2;
            }
            if (cDataStart()) {
                currentIndex = currentIndexRollBackPoint;
                return 3;
            }
            if (tagStart() && !endTagStart()) {
                currentIndex = currentIndexRollBackPoint;
                return 4;
            }
        }

        throw new UnsupportedOperationException("Encountered an unsupported XML.");
    }

    private void setData(XMLElement xmlElement) {
        final StringBuilder dataBuilder = new StringBuilder();
        while (!tagStart()) {
            dataBuilder.append(charAtCurrentIndex());
            currentIndex++;
        }
        String data = dataBuilder.toString();

        data = data.replaceAll("&lt;", "<");
        data = data.replaceAll("&gt;", ">");
        data = data.replaceAll("&quot;", "\"");
        data = data.replaceAll("&apos;", "'");
        data = data.replaceAll("&amp;", "&");


        xmlElement.setElementValue(data);
    }

    private void setCData(XMLElement xmlElement) {
        final StringBuilder cdataBuilder = new StringBuilder();
        while (!endTagStart()) {
            cdataBuilder.append(charAtCurrentIndex());
            currentIndex++;
        }
        String cdata = cdataBuilder.toString();
        cdata = cdata.trim();
        // cutting 9 chars because: <![CDATA[
        cdata = cdata.substring(9, cdata.indexOf(']'));
        xmlElement.setElementValue(cdata);

    }

    private void walkClosingTag() {
        while (!tagEnd()) {
            currentIndex++;
        }
        currentIndex++;
    }

    // Convenience methods
    private void clearWhiteSpace() {
        while (isWhiteSpace()) {
            currentIndex++;
        }
    }

    private boolean isLetter() {
        return Character.isLetter(charAtCurrentIndex());
    }

    private boolean isWhiteSpace() {
        return Character.isWhitespace(charAtCurrentIndex());
    }

    private boolean tagStart() {
        return charAtCurrentIndex() == '<';
    }

    private boolean tagEnd() {
        return charAtCurrentIndex() == '>';
    }

    private boolean endTagStart() {
        return charAtCurrentIndex() == '<' && charAtNextIndex() == '/';
    }

    private boolean selfClosingTagEnd() {
        return charAtCurrentIndex() == '/' && charAtNextIndex() == '>';
    }

    private boolean cDataStart() {
        return charAtCurrentIndex() == '<' && charAtNextIndex() == '!' && xmlArray[currentIndex + 2] == '[';
    }

    private char charAtCurrentIndex() {
        return xmlArray[currentIndex];
    }

    private char charAtNextIndex() {
        return xmlArray[currentIndex + 1];
    }
}

Unit Tests

package xml2json2;

import java.util.List;

public class TreeFromXMLBuilderImplTest {

    private static XMLElementTreeBuilderImpl treeFromXMLBuilder;

    public static void main(String[] args) {
        selfClosingTagWithoutSpace();
        selfClosingTagWithSpace();
        selfClosingTagWithNewLine();
        emptyElementNoSpace();
        emptyElementWithSpace();
        emptyElementWithNewLine();
        selfClosingTagWithAttributeNoSpace();
        selfClosingTagWithAttributeWithSpace();
        selfClosingTagWithMultipleAttributes();
        xmlElementWithData();
        xmlElementWithAttributeAndWithData();
        xmlElementWithChild();
        sampleXMLNote();
        sampleXmlWithGrandChildren();
        withCharacterData();
        dataWithPreDefinedEntities();
    }

    private static void selfClosingTagWithoutSpace() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo/>");
        assert xmlElement.getElementName().equals("foo") : "was : " + xmlElement.getElementName();
    }

    private static void selfClosingTagWithSpace() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo />");
        assert xmlElement.getElementName().equals("foo") : "was : " + xmlElement.getElementName();
    }

    private static void selfClosingTagWithNewLine() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo\n\n/>");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
    }

    private static void emptyElementNoSpace() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo></foo>");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
    }

    private static void emptyElementWithSpace() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo></foo >");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
    }

    private static void emptyElementWithNewLine() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo></foo \n\n\n>");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
    }

    private static void selfClosingTagWithAttributeNoSpace() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo bar=\"baz\"/>");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
        final ElementAttribute attribute = xmlElement.getAttributes().iterator().next();
        assert attribute.getName().equals("bar") : "was: " + attribute.getName();
        assert attribute.getValue().equals("baz") : "was: " + attribute.getValue();
    }

    private static void selfClosingTagWithAttributeWithSpace() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo bar = \"baz\" />");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
        final ElementAttribute attribute = xmlElement.getAttributes().iterator().next();
        assert attribute.getName().equals("bar") : "was: " + attribute.getName();
        assert attribute.getValue().equals("baz") : "was: " + attribute.getValue();
    }

    private static void selfClosingTagWithMultipleAttributes() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo bar=\"baz\" qux=\"booze\"/>");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
        ElementAttribute attribute = xmlElement.getAttributes().get(0);
        assert attribute.getName().equals("bar") : "was: " + attribute.getName();
        assert attribute.getValue().equals("baz") : "was: " + attribute.getValue();
        attribute = xmlElement.getAttributes().get(1);
        assert attribute.getName().equals("qux") : "was: " + attribute.getName();
        assert attribute.getValue().equals("booze") : "was: " + attribute.getValue();
    }

    private static void xmlElementWithData() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo>bar</foo>");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
        assert xmlElement.getElementValue().equals("bar") : "was: " + xmlElement.getElementValue();
    }

    private static void xmlElementWithAttributeAndWithData() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo baz = \"baz\" > bar </foo>");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
        assert xmlElement.getElementValue().equals(" bar ") : "was: " + xmlElement.getElementValue();
        ElementAttribute attribute = xmlElement.getAttributes().get(0);
        assert attribute.getName().equals("baz") : "was: " + attribute.getName();
        assert attribute.getValue().equals("baz") : "was: " + attribute.getValue();
    }

    private static void xmlElementWithChild() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo><bar></bar><tar>rat</tar><baz/></foo>");
        assert xmlElement.getElementName().equals("foo");
        assert xmlElement.getAttributes().isEmpty();
        assert xmlElement.getChildren().size() == 3;
        assert xmlElement.getChildren().get(0).getElementName().equals("bar");
        assert xmlElement.getChildren().get(0).getElementValue().equals("");
        assert xmlElement.getChildren().get(1).getElementName().equals("tar");
        assert xmlElement.getChildren().get(1).getElementValue().equals("rat");
        assert xmlElement.getChildren().get(2).getElementName().equals("baz");
        assert xmlElement.getChildren().get(2).getElementValue().equals("");
    }

    private static void sampleXMLNote() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        String note =
                "<note>\n" +
                "<to>Tove</to>\n" +
                "<from>Jani</from>\n" +
                "<heading>Reminder</heading>\n" +
                "<body>Don't forget me this weekend!</body>\n" +
                "</note>"
                ;
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML(note);
        // For visual inspection..
        System.out.println(xmlElement);
    }

    /*
    <foo>
        <bar>
            <baz>test</baz>
        </bar>
        <qux att="tta">
            <fox>jumped</fox>
        </qux>
    </foo>
    */
    private static void sampleXmlWithGrandChildren() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        String sampleWithGrandChildren = "<foo><bar><baz>test</baz></bar><qux att=\"tta\"><fox>jumped</fox></qux></foo>";
        final XMLElement foo = treeFromXMLBuilder.buildTreeFromXML(sampleWithGrandChildren);
        assert foo.getElementName().equals("foo");
        final List<XMLElement> children = foo.getChildren();
        assert children.size() == 2; // bar and qux
        final XMLElement bar = children.get(0);
        assert bar.getElementName().equals("bar");
        assert bar.getElementValue().equals("");
        final List<XMLElement> barChildren = bar.getChildren();
        assert barChildren.size() == 1;
        final XMLElement baz = barChildren.get(0);
        assert baz.getElementName().equals("baz");
        assert baz.getElementValue().equals("test");
        final XMLElement qux = children.get(1);
        assert qux.getAttributes().size() == 1;
        assert qux.getAttributes().get(0).getName().equals("att");
        assert qux.getAttributes().get(0).getValue().equals("tta");
        final List<XMLElement> quxChildren = qux.getChildren();
        assert quxChildren.size() == 1;
        final XMLElement fox = quxChildren.get(0);
        assert fox.getElementName().equals("fox");
        assert fox.getElementValue().equals("jumped");
        // System.out.println(sampleWithGrandChildren);
    }

    private static void withCharacterData() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final String sampleXMLWithCData = "<foo><![CDATA[ This must be preserved!!! ]]></foo>";
        final XMLElement root = treeFromXMLBuilder.buildTreeFromXML(sampleXMLWithCData);
        assert root.getElementName().equals("foo");
        assert root.getElementValue().equals(" This must be preserved!!! ") : "was: " + root.getElementValue();
    }

    private static void dataWithPreDefinedEntities() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final String withCharacterData = "<foo>&lt;&gt;&quot;&apos;&amp;</foo>";
        final XMLElement root = treeFromXMLBuilder.buildTreeFromXML(withCharacterData);
        assert root.getElementValue().equals("<>\"'&");
    }
}

I would feel remiss if I did not recommend that you use the JDOM XML model instead of writing your own parser. Note that I am a maintainer of the JDOM project. JDOM is not a parser (it uses Xerces by default), but it is an in-memory model. Document doc = new SAXBuilder().build(new StringReader(mystringvar)); will parse an XML document in a string.... – May 8 at 2:01

Stingy 1,888212 · Accepted Answer · 2018-05-08 01:17:43Z

Just for the sake of completeness, there are things that this parser doesn't support other than comments and elements with mixed content, such as processing instructions or character references (e.g. &#38; or &#x26; instead of &amp;, i.e. references to Unicode code points rather than entities). Processing instructions are meant to carry information relevant only to the application receiving the XML document and are not part of the data stored by the XML document, but the parser should recognize them nonetheless. And since you support references to the five predefined entities (i.e. &amp;, &lt; and so on), it would seem natural also to support character references.
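To illustrate what supporting character references could look like, here is a minimal sketch. The class `CharRefDecoder` and its `decode` method are hypothetical and not part of the reviewed code; the sketch only resolves decimal and hexadecimal references and does not check that the resulting code point is a legal XML character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the reviewed parser.
final class CharRefDecoder {

    // Matches &#xHHHH; (hexadecimal, group 1) or &#DDDD; (decimal, group 2).
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    // Replaces every character reference in 'text' with the code point it names.
    static String decode(String text) {
        Matcher m = CHAR_REF.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start()); // copy the text before the reference
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            out.appendCodePoint(codePoint);    // also handles supplementary code points
            last = m.end();
        }
        out.append(text, last, text.length()); // copy the remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("a &#38; b &#x26; c")); // prints "a & b & c"
    }
}
```

Predefined entities such as &amp;amp; would still go through the existing replacement step; decoding character references first and entities second (or the reverse) changes the result for inputs like &amp;amp;#38;, so a real parser should resolve both in a single pass.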

The parser also doesn't read the prolog, which consists of an optional XML declaration and a likewise optional document type declaration, although, admittedly, the XML declaration only contains information specific to the process of parsing the document itself (such as the XML version, or the character encoding), and the document type declaration defines such things as the structure of the document and entities to be referenced by an entity reference, so it might not make sense for the parser that parses the XML data to also parse the prolog.

So now about what you have implemented:

There are several problems with this parser, the biggest of which seems to be that the parser takes it for granted that the document is well-formed and produces valid output even if the syntax of the XML document is invalid. For example, the parser does not check whether the name of an end tag matches the name of the corresponding start tag. Or when parsing attributes in a start tag, it does not check whether the character assumed to be "=" is indeed "=" when there's whitespace between the attribute name and the "=" sign. Likewise, the character assumed to be the opening quotation mark of the attribute value could as well be any other character. This means that the parser would treat <foo attributeName x ybar"> as equivalent to <foo attributeName = "bar">. It gets even worse with CDATA sections, because for all your parser cares, the input could contain garbage syntax like <![jklöä°some character data]]>, and it will treat it as if it were <![CDATA[some character data]]>.
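One way to avoid silently accepting such input is to replace every blind `currentIndex++` with a check. The following is only a sketch under assumptions: `StrictCursor`, `expect` and `requireMatchingNames` are hypothetical names, and the exception type is an arbitrary choice.

```java
// Hypothetical sketch: a cursor that fails fast instead of skipping characters blindly.
final class StrictCursor {

    private final char[] input;
    private int index = 0;

    StrictCursor(String xml) {
        this.input = xml.toCharArray();
    }

    // Consumes the expected character, or rejects the document if something else is there.
    void expect(char expected) {
        if (index >= input.length || input[index] != expected) {
            throw new IllegalArgumentException(
                    "Expected '" + expected + "' at index " + index);
        }
        index++;
    }

    // Rejects an end tag whose name differs from the name of the start tag.
    static void requireMatchingNames(String startName, String endName) {
        if (!startName.equals(endName)) {
            throw new IllegalArgumentException(
                    "End tag </" + endName + "> does not match <" + startName + ">");
        }
    }
}
```

In `addAttribute`, for instance, the two increments commented as passing the '=' and the '"' would become `expect('=')` and `expect('"')`, and `walkClosingTag` would read the end tag name and pass it to `requireMatchingNames`.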

Another problem is the set of character categories on which you base the decision of how to continue with the parsing process. For example, your method isWhiteSpace() checks the conditions described in the documentation of the method Character.isWhitespace(char). But this is not what counts as whitespace according to the XML specification. The XML specification only counts four Unicode characters as whitespace:
- U+0020 SPACE
- U+0009 CHARACTER TABULATION
- U+000D CARRIAGE RETURN
- U+000A LINE FEED
So if an XML document contained a tag like <foo>, but with a U+2003 EM SPACE inserted between foo and >, then your parser would consider it legal XML syntax, when in fact the em space would be illegal here, because it is neither whitespace nor a legal element name character.
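A whitespace check restricted to those four characters could look like the sketch below; the class name `XmlChars` and the method name `isXmlWhitespace` are made up for illustration.

```java
// Hypothetical helper, not part of the reviewed parser.
final class XmlChars {

    // Only SPACE, TAB, CARRIAGE RETURN and LINE FEED count as XML whitespace.
    static boolean isXmlWhitespace(char c) {
        return c == '\u0020' || c == '\t' || c == '\r' || c == '\n';
    }

    public static void main(String[] args) {
        char emSpace = '\u2003';
        System.out.println(Character.isWhitespace(emSpace)); // true
        System.out.println(isXmlWhitespace(emSpace));        // false
    }
}
```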

Similarly, you are not acknowledging the fact that an attribute name does not necessarily have to begin with a letter. It could also begin with a colon, an underscore, or some other exotic character that is not covered by Character.isLetter(), for example U+02EA MODIFIER LETTER YIN DEPARTING TONE MARK, "˪" (whatever that is).
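As a sketch of an alternative to Character.isLetter(), the check below uses the same code point ranges as the nameStartCharacter expression in this answer. `XmlNames`, `isNameStartChar` and `between` are hypothetical names; the method takes an int so that supplementary code points can be tested as well.

```java
// Hypothetical helper, not part of the reviewed parser.
final class XmlNames {

    private static boolean between(int cp, int low, int high) {
        return cp >= low && cp <= high;
    }

    // True if the code point may start an element or attribute name.
    static boolean isNameStartChar(int cp) {
        return cp == ':' || cp == '_'
                || between(cp, 'A', 'Z') || between(cp, 'a', 'z')
                || between(cp, 0xC0, 0xD6) || between(cp, 0xD8, 0xF6)
                || between(cp, 0xF8, 0x2FF) || between(cp, 0x370, 0x37D)
                || between(cp, 0x37F, 0x1FFF) || between(cp, 0x200C, 0x200D)
                || between(cp, 0x2070, 0x218F) || between(cp, 0x2C00, 0x2FEF)
                || between(cp, 0x3001, 0xD7FF) || between(cp, 0xF900, 0xFDCF)
                || between(cp, 0xFDF0, 0xFFFD) || between(cp, 0x10000, 0xEFFFF);
    }
}
```

With this, `isNameStartChar(':')` and `isNameStartChar(0x02EA)` are both true, while a digit or a hyphen is rejected as a first character.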

So how to rectify these issues? Since the XML specification so nicely provides regular expressions, you can simply imitate these regular expressions in the parser. While it would probably not be possible to parse the whole document with a single regular expression due to the possibility of nested tags with the same name, or siblings with the same name, you can at least write regular expressions for small units, like tags, which can also be composed of multiple regular expressions for even smaller, reusable units (like whitespace, name characters etc.). For example:
```
String whitespace = "(?:[\\x20\\x09\\x0d\\x0a]+)";
String nameStartCharacter = "(?:[:A-Z_a-z\\xc0-\\xd6\\xd8-\\xf6\\xf8-\\u02ff" +
        "\\u0370-\\u037d\\u037f-\\u1fff\\u200C-\\u200D\\u2070-\\u218F" +
        "\\u2C00-\\u2FEF\\u3001-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFFD" +
        "\\x{10000}-\\x{EFFFF}])";
String nameCharacter = "(?:" + nameStartCharacter + "|[-.0-9\\xb7\\u0300-\\u036F\\u203F-\\u2040])";
String name = "(?:" + nameStartCharacter + nameCharacter + "*)";
String endTag = "(?:</(?<name>" + name + ")" + whitespace + "?>)";
```
And to use endTag:
```
Matcher endTagMatcher = Pattern.compile(endTag).matcher("</test>");
System.out.println(endTagMatcher.matches()); // true
System.out.println(endTagMatcher.group("name")); // "test"
```
Note that I wrapped each regular expression inside a non-capturing group ((?:X)), so that appending a quantifier to it will always work on the whole expression (for instance, if nameCharacter were not wrapped in a group, the quantifier * appended to it in name would only apply to the character class to the right of | in nameCharacter).

Of course, you cannot use regular expressions with a char[] directly, so you would have to find a way around that. Maybe a CharBuffer can be of use, since it implements CharSequence, and unlike String.subSequence(int, int) and StringBuilder.subSequence(int, int), which create a new String, CharBuffer.subSequence(int, int) does not copy the char data but reads/writes through to the original CharBuffer.
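The CharBuffer idea can be sketched as follows. This is an illustration under assumptions, not a drop-in replacement: `CharBufferRegexDemo` and `endTagNameAt` are hypothetical names, and the name pattern is deliberately reduced to an ASCII subset to keep the example short.

```java
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: matching a regular expression against a char[] without copying it.
final class CharBufferRegexDemo {

    // Simplified end tag pattern (ASCII names only), for illustration.
    private static final Pattern END_TAG = Pattern.compile(
            "</(?<name>[A-Za-z_:][-A-Za-z0-9_:.]*)[\\x20\\x09\\x0d\\x0a]*>");

    // Returns the end tag name starting at 'index', or null if no end tag starts there.
    static String endTagNameAt(char[] xml, int index) {
        CharBuffer view = CharBuffer.wrap(xml); // wraps the array, no copy is made
        Matcher m = END_TAG.matcher(view);
        m.region(index, xml.length);            // only look at the part from 'index' on
        return m.lookingAt() ? m.group("name") : null;
    }

    public static void main(String[] args) {
        char[] xml = "<a>x</a >".toCharArray();
        System.out.println(endTagNameAt(xml, 4)); // prints "a"
    }
}
```

Using `region` together with `lookingAt` keeps the parser's `currentIndex` as the single source of truth for the position, so no sub-buffers need to be created at all.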

Your parser has some bugs:
- It does not consume the final /> part of an empty element tag, so any elements that follow an empty element will not be interpreted correctly (try it with "<root><foo /><bar></bar></root>"; the program will die from lack of memory since it will be trapped in the loop while (currentCase() == 4)).
- A closing bracket ] might be part of a CDATA section and does not necessarily end it. Only ]]> is guaranteed to terminate a CDATA section.
- An element with a CDATA section might still contain other character data not part of the CDATA section. In fact, an element can even contain multiple CDATA sections. To quote the relevant section in the XML specification:
  
  CDATA sections may occur anywhere character data may occur;
  
  Your parser will fail with elements that contain other character data (or CDATA sections) in addition to one CDATA section.
- References may not only occur in character data, but also in attribute values.
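The two CDATA points above can be addressed together by reading character content in a loop that treats each CDATA section as one piece of the text and only ends a section at ]]>. The sketch below assumes a hypothetical class `CharacterContentReader` working on a String; it does not resolve entity or character references, and it stops at the first tag it encounters.

```java
// Hypothetical sketch: collecting character data and any number of CDATA sections.
final class CharacterContentReader {

    private static final String CDATA_START = "<![CDATA[";
    private static final String CDATA_END = "]]>";

    // Reads from index 'from' until the next markup that is not a CDATA section.
    static String readContent(String xml, int from) {
        StringBuilder out = new StringBuilder();
        int i = from;
        while (i < xml.length()) {
            if (xml.startsWith(CDATA_START, i)) {
                // Only "]]>" terminates the section; a single ']' is ordinary content.
                int end = xml.indexOf(CDATA_END, i + CDATA_START.length());
                if (end < 0) {
                    throw new IllegalArgumentException(
                            "Unterminated CDATA section at index " + i);
                }
                out.append(xml, i + CDATA_START.length(), end);
                i = end + CDATA_END.length();
            } else if (xml.charAt(i) == '<') {
                break; // a start or end tag: the character content is over
            } else {
                out.append(xml.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<foo>a<![CDATA[b]c]]>d<![CDATA[<e>]]></foo>";
        System.out.println(readContent(xml, 5)); // prints "ab]cd<e>"
    }
}
```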

Finally, some stylistic suggestions:
- In this loop:
```
while (!endTagStart() && !cDataStart() && !tagStart()) {
    currentIndex++;
    if (endTagStart()) {
        currentIndex = currentIndexRollBackPoint;
        return 2;
    }
    if (cDataStart()) {
        currentIndex = currentIndexRollBackPoint;
        return 3;
    }
    if (tagStart() && !endTagStart()) {
        currentIndex = currentIndexRollBackPoint;
        return 4;
    }
}
```
  The termination condition is completely pointless, because the loop will always be terminated from within. Usually, I'm in favor of using concrete termination conditions instead of something like while(true), but here, the termination condition will never evaluate to false and therefore does not fulfill any purpose, so I think that, in this case, the code would be easier to read if you simply changed it to while (true).
- Since your parser doesn't support elements with mixed content, you could make XMLElement an abstract class and make two subclasses, one for character data elements, and another for elements with child elements. That way, a character data element will not have a meaningless field children, and an element with children will not have the meaningless field value.
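The subclass suggestion could be sketched like this. All names (`XmlNode`, `TextElement`, `ParentElement`) are hypothetical, and attributes are left out to keep the example short; they would naturally live in the abstract base class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the suggested hierarchy; attributes omitted for brevity.
abstract class XmlNode {

    private final String name;

    XmlNode(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

// An element that holds character data and therefore has no children field.
final class TextElement extends XmlNode {

    private final String value;

    TextElement(String name, String value) {
        super(name);
        this.value = value;
    }

    String getValue() {
        return value;
    }
}

// An element that holds child elements and therefore has no value field.
final class ParentElement extends XmlNode {

    private final List<XmlNode> children = new ArrayList<>();

    ParentElement(String name) {
        super(name);
    }

    void addChild(XmlNode child) {
        children.add(child);
    }

    List<XmlNode> getChildren() {
        return Collections.unmodifiableList(children);
    }
}
```

Building the example from the question would then read `ParentElement foo = new ParentElement("foo"); foo.addChild(new TextElement("bar", "baz"));`.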

Simply beautiful, thank you.
– Koray Tugay, May 8 at 12:35
