XML Parser written in Java

The parser accepts a String (or a char array) as input and stores the data in a tree structure.



For example, given the input:



<foo>
<bar>baz</bar>
<qux fox="jump"/>
</foo>


Output will be:



XMLElement{elementName='foo', 
children=[XMLElement{elementName='bar', elementValue='baz'},
XMLElement{elementName='qux', attributes=[ElementAttribute{name='fox', value='jump'}]}]}



I would like to hear your criticism on design principles (SRP, DRY, KISS, etc.), readability (naming of variables and methods) and maintainability (code structure, methods) of the code you see.



Already noted in the comments of the code, but:



  • XML provided as input must not contain any XML comments.

  • Mixed data such as: <MyElement>Some <b>Mixed</b> Data</MyElement> is not supported.
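
A minimal sketch of one way to live with the first restriction: strip comments from the input before handing it to the builder. The class name XmlCommentStripper and its method are hypothetical and are not part of the code posted here. This naive pattern would also remove comment-like text inside a CDATA section, so treat it as an illustration only.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the posted parser.
final class XmlCommentStripper {

    // Matches <!-- ... --> non-greedily; DOTALL lets '.' span line breaks.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    private XmlCommentStripper() {
    }

    /** Returns the input with every XML comment removed. */
    static String stripComments(String xml) {
        return COMMENT.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        String xml = "<foo><!-- a note --><bar>baz</bar></foo>";
        System.out.println(stripComments(xml)); // prints <foo><bar>baz</bar></foo>
    }
}
```

The cleaned string could then be passed to buildTreeFromXML as usual.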

Without further ado, let's jump into the code.



Entity classes



XMLElement.java



package xml2json2;

import java.util.ArrayList;
import java.util.List;

public class XMLElement {

    private String elementName; // can not be null
    private String elementValue = "";

    private List<ElementAttribute> attributes = new ArrayList<>(); // can be empty
    private List<XMLElement> children = new ArrayList<>(); // can be empty

    public String getElementName() {
        return elementName;
    }

    public void setElementName(String elementName) {
        this.elementName = elementName;
    }

    public String getElementValue() {
        return elementValue;
    }

    public void setElementValue(String elementValue) {
        this.elementValue = elementValue;
    }

    public List<ElementAttribute> getAttributes() {
        return attributes;
    }

    public List<XMLElement> getChildren() {
        return children;
    }

    @Override
    public String toString() {
        final StringBuffer sb = new StringBuffer("XMLElement{");
        sb.append("elementName='").append(elementName).append('\'');
        if (!elementValue.equals("")) {
            sb.append(", elementValue='").append(elementValue).append('\'');
        }
        if (attributes.size() != 0) {
            sb.append(", attributes=").append(attributes);
        }
        if (children.size() != 0) {
            sb.append(", children=").append(children);
        }
        sb.append('}');
        return sb.toString();
    }
}


ElementAttribute.java



package xml2json2;

public class ElementAttribute {

    private String name;
    private String value;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getValue() {
        return value;
    }

    public void setValue(String value) {
        this.value = value;
    }

    @Override
    public String toString() {
        final StringBuffer sb = new StringBuffer("ElementAttribute{");
        sb.append("name='").append(name).append('\'');
        sb.append(", value='").append(value).append('\'');
        sb.append('}');
        return sb.toString();
    }
}


Processor



XMLElementTreeBuilderImpl.java



package xml2json2;

// References:
// XML Spec : https://www.liquid-technologies.com/XML
// Regex : https://regexone.com

/*
This tree builder does not support elements with mixed data such as: <MyElement>Some <b>Mixed</b> Data</MyElement>.
Mixed data can contain text and child elements within the containing element. This is typically only used to mark up data (HTML etc).
It is typically only used to hold mark-up/formatted text entered by a person;
it is typically not the best choice for storing machine-readable data, as it adds significant complexity to the parser.
*/

/*
XML to be processed must not contain any comments!
*/

public class XMLElementTreeBuilderImpl {

    private char[] xmlArray;
    private int currentIndex = 0;

    // This class has only 2 public methods:
    public XMLElement buildTreeFromXML(String xml) {
        return buildTreeFromXML(xml.toCharArray());
    }

    public XMLElement buildTreeFromXML(char[] arr) {
        this.xmlArray = arr;
        XMLElement root = nodeFromStringRecursively();
        return root;
    }

    // Everything else below here is private, i.e. inner workings of the class..
    private XMLElement nodeFromStringRecursively() {
        final XMLElement xmlElement = new XMLElement();

        clearWhiteSpace();

        if (tagStart()) { // A new XML Element is starting..
            currentIndex++;
            final String elementName = parseStartingTagElement(); // finishes element name..
            xmlElement.setElementName(elementName);
        }

        clearWhiteSpace();

        // We have not closed our tag yet..
        // At this point we might have attributes.. Lets add them if they exist..
        while (isLetter()) {
            addAttribute(xmlElement);
            clearWhiteSpace();
        }

        // At this point we will have one of the following in current index:
        // [/] -> Self closing tag..
        // [>] -> Tag ending - (Data or children or starting or immediately followed by an ending tag..)

        if (selfClosingTagEnd()) {
            return xmlElement;
        }

        // At this point we are sure this element was not a self closing element..
        currentIndex++; // skipping the tag close character, i.e. '>'

        // At this point we are facing one of the following cases:
        // Assume our starting tag was <foo> for the examples..
        // 1 - [</] : Immediate tag end. "</foo>"
        // 2 - [\s\w]+[</] : Any whitespace or any alphanumeric character, one or more repetitions, followed by tag end. "sample</foo>"
        // 3 - [\s]*(<![CDATA[...]]>)[\s]*[</] : Zero or more white space, followed by CDATA. followed by zero or more white space. "<![CDATA[...]]></foo>"
        // 4 - [\s]*[<]+ : Zero or more white space, followed by one or more child start..

        int currentCase = currentCase();

        switch (currentCase) {
            case 1: // Immediate closing tag, no data to set, no children to add.. Do nothing.
                break;
            case 2:
                setData(xmlElement);
                break;
            case 3:
                setCData(xmlElement);
            case 4:
                while (currentCase() == 4) { // Add children recursively.
                    final XMLElement childToken = nodeFromStringRecursively();
                    xmlElement.getChildren().add(childToken);
                }
        }

        walkClosingTag();
        return xmlElement;
    }

    private String parseStartingTagElement() {
        final StringBuilder elementNameBuilder = new StringBuilder();
        while (!isWhiteSpace() && !selfClosingTagEnd() && !tagEnd()) {
            elementNameBuilder.append(charAtCurrentIndex());
            currentIndex++;
        }
        final String elementName = elementNameBuilder.toString();
        return elementName;
    }

    private void addAttribute(XMLElement xmlElement) {
        // Attribute name..
        final StringBuilder attributeNameBuilder = new StringBuilder();
        while (!isWhiteSpace() && charAtCurrentIndex() != '=') {
            attributeNameBuilder.append(charAtCurrentIndex());
            currentIndex++;
        }

        // Everything in between that is not much of interest to us..
        clearWhiteSpace();
        currentIndex++; // Passing the '='
        clearWhiteSpace();
        currentIndex++; // Passing the '"'

        // Attribute value..
        final StringBuilder attributeValueBuilder = new StringBuilder();
        while (charAtCurrentIndex() != '"') {
            attributeValueBuilder.append(charAtCurrentIndex());
            currentIndex++;
        }
        currentIndex++; // Passing the final '"'
        clearWhiteSpace();

        // Build the attribute object and..
        final ElementAttribute elementAttribute = new ElementAttribute();
        elementAttribute.setName(attributeNameBuilder.toString());
        elementAttribute.setValue(attributeValueBuilder.toString());

        // ..add the attribute to the xmlElement
        xmlElement.getAttributes().add(elementAttribute);
    }

    private int currentCase() {
        if (endTagStart()) {
            return 1;
        }
        if (cDataStart()) {
            return 3;
        }
        if (tagStart() && !endTagStart()) {
            return 4;
        }
        // Here we will look forward, so we need to keep track of where we actually started..
        int currentIndexRollBackPoint = currentIndex;
        while (!endTagStart() && !cDataStart() && !tagStart()) {
            currentIndex++;
            if (endTagStart()) {
                currentIndex = currentIndexRollBackPoint;
                return 2;
            }
            if (cDataStart()) {
                currentIndex = currentIndexRollBackPoint;
                return 3;
            }
            if (tagStart() && !endTagStart()) {
                currentIndex = currentIndexRollBackPoint;
                return 4;
            }
        }

        throw new UnsupportedOperationException("Encountered an unsupported XML.");
    }

    private void setData(XMLElement xmlElement) {
        final StringBuilder dataBuilder = new StringBuilder();
        while (!tagStart()) {
            dataBuilder.append(charAtCurrentIndex());
            currentIndex++;
        }
        String data = dataBuilder.toString();

        data = data.replaceAll("&lt;", "<");
        data = data.replaceAll("&gt;", ">");
        data = data.replaceAll("&quot;", "\"");
        data = data.replaceAll("&apos;", "'");
        data = data.replaceAll("&amp;", "&");

        xmlElement.setElementValue(data);
    }

    private void setCData(XMLElement xmlElement) {
        final StringBuilder cdataBuilder = new StringBuilder();
        while (!endTagStart()) {
            cdataBuilder.append(charAtCurrentIndex());
            currentIndex++;
        }
        String cdata = cdataBuilder.toString();
        cdata = cdata.trim();
        // cutting 9 chars because: <![CDATA[
        cdata = cdata.substring(9, cdata.indexOf(']'));
        xmlElement.setElementValue(cdata);
    }

    private void walkClosingTag() {
        while (!tagEnd()) {
            currentIndex++;
        }
        currentIndex++;
    }

    // Convenience methods
    private void clearWhiteSpace() {
        while (isWhiteSpace()) {
            currentIndex++;
        }
    }

    private boolean isLetter() {
        return Character.isLetter(charAtCurrentIndex());
    }

    private boolean isWhiteSpace() {
        return Character.isWhitespace(charAtCurrentIndex());
    }

    private boolean tagStart() {
        return charAtCurrentIndex() == '<';
    }

    private boolean tagEnd() {
        return charAtCurrentIndex() == '>';
    }

    private boolean endTagStart() {
        return charAtCurrentIndex() == '<' && charAtNextIndex() == '/';
    }

    private boolean selfClosingTagEnd() {
        return charAtCurrentIndex() == '/' && charAtNextIndex() == '>';
    }

    private boolean cDataStart() {
        return charAtCurrentIndex() == '<' && charAtNextIndex() == '!' && xmlArray[currentIndex + 2] == '[';
    }

    private char charAtCurrentIndex() {
        return xmlArray[currentIndex];
    }

    private char charAtNextIndex() {
        return xmlArray[currentIndex + 1];
    }
}
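
The order of the entity replacements in setData matters: "&amp;" has to be decoded last, otherwise an escaped ampersand followed by "lt;" would be decoded twice. The sketch below isolates that step to make the point visible. The class name EntityDecoder is hypothetical and not part of the posted code, and it uses String.replace (literal matching) where the posted code uses replaceAll (regex matching); for these five fixed strings both give the same result.

```java
// Hypothetical helper that isolates the entity-decoding step of setData.
final class EntityDecoder {

    private EntityDecoder() {
    }

    /** Decodes the five predefined XML entities; "&amp;" is handled last. */
    static String decode(String text) {
        return text
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&"); // last, so "&amp;lt;" yields "&lt;" and not "<"
    }

    public static void main(String[] args) {
        System.out.println(decode("&lt;&gt;&quot;&apos;&amp;")); // prints <>"'&
        System.out.println(decode("&amp;lt;"));                  // prints &lt;
    }
}
```

If "&amp;" were decoded first, the second input would turn into "&lt;" and then into "<", which is not what the document author wrote.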




Unit Tests



package xml2json2;

import java.util.List;

public class TreeFromXMLBuilderImplTest {

    private static XMLElementTreeBuilderImpl treeFromXMLBuilder;

    public static void main(String[] args) {
        selfClosingTagWithoutSpace();
        selfClosingTagWithSpace();
        selfClosingTagWithNewLine();
        emptyElementNoSpace();
        emptyElementWithSpace();
        emptyElementWithNewLine();
        selfClosingTagWithAttributeNoSpace();
        selfClosingTagWithAttributeWithSpace();
        selfClosingTagWithMultipleAttributes();
        xmlElementWithData();
        xmlElementWithAttributeAndWithData();
        xmlElementWithChild();
        sampleXMLNote();
        sampleXmlWithGrandChildren();
        withCharacterData();
        dataWithPreDefinedEntities();
    }

    private static void selfClosingTagWithoutSpace() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo/>");
        assert xmlElement.getElementName().equals("foo") : "was : " + xmlElement.getElementName();
    }

    private static void selfClosingTagWithSpace() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo />");
        assert xmlElement.getElementName().equals("foo") : "was : " + xmlElement.getElementName();
    }

    private static void selfClosingTagWithNewLine() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo\n\n/>");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
    }

    private static void emptyElementNoSpace() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo></foo>");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
    }

    private static void emptyElementWithSpace() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo></foo >");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
    }

    private static void emptyElementWithNewLine() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo></foo \n\n\n>");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
    }

    private static void selfClosingTagWithAttributeNoSpace() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo bar=\"baz\"/>");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
        final ElementAttribute attribute = xmlElement.getAttributes().iterator().next();
        assert attribute.getName().equals("bar") : "was: " + attribute.getName();
        assert attribute.getValue().equals("baz") : "was: " + attribute.getValue();
    }

    private static void selfClosingTagWithAttributeWithSpace() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo bar = \"baz\" />");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
        final ElementAttribute attribute = xmlElement.getAttributes().iterator().next();
        assert attribute.getName().equals("bar") : "was: " + attribute.getName();
        assert attribute.getValue().equals("baz") : "was: " + attribute.getValue();
    }

    private static void selfClosingTagWithMultipleAttributes() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo bar=\"baz\" qux=\"booze\"/>");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
        ElementAttribute attribute = xmlElement.getAttributes().get(0);
        assert attribute.getName().equals("bar") : "was: " + attribute.getName();
        assert attribute.getValue().equals("baz") : "was: " + attribute.getValue();
        attribute = xmlElement.getAttributes().get(1);
        assert attribute.getName().equals("qux") : "was: " + attribute.getName();
        assert attribute.getValue().equals("booze") : "was: " + attribute.getValue();
    }

    private static void xmlElementWithData() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo>bar</foo>");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
        assert xmlElement.getElementValue().equals("bar") : "was: " + xmlElement.getElementValue();
    }

    private static void xmlElementWithAttributeAndWithData() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo baz = \"baz\" > bar </foo>");
        assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
        assert xmlElement.getElementValue().equals(" bar ") : "was: " + xmlElement.getElementValue();
        ElementAttribute attribute = xmlElement.getAttributes().get(0);
        assert attribute.getName().equals("baz") : "was: " + attribute.getName();
        assert attribute.getValue().equals("baz") : "was: " + attribute.getValue();
    }

    private static void xmlElementWithChild() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo><bar></bar><tar>rat</tar><baz/></foo>");
        assert xmlElement.getElementName().equals("foo");
        assert xmlElement.getAttributes().isEmpty();
        assert xmlElement.getChildren().size() == 3;
        assert xmlElement.getChildren().get(0).getElementName().equals("bar");
        assert xmlElement.getChildren().get(0).getElementValue().equals("");
        assert xmlElement.getChildren().get(1).getElementName().equals("tar");
        assert xmlElement.getChildren().get(1).getElementValue().equals("rat");
        assert xmlElement.getChildren().get(2).getElementName().equals("baz");
        assert xmlElement.getChildren().get(2).getElementValue().equals("");
    }

    private static void sampleXMLNote() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        String note =
                "<note>\n" +
                "<to>Tove</to>\n" +
                "<from>Jani</from>\n" +
                "<heading>Reminder</heading>\n" +
                "<body>Don't forget me this weekend!</body>\n" +
                "</note>"
                ;
        final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML(note);
        // For visual inspection..
        System.out.println(xmlElement);
    }

    /*
    <foo>
        <bar>
            <baz>test</baz>
        </bar>
        <qux att="tta">
            <fox>jumped</fox>
        </qux>
    </foo>
    */
    private static void sampleXmlWithGrandChildren() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        String sampleWithGrandChildren = "<foo><bar><baz>test</baz></bar><qux att=\"tta\"><fox>jumped</fox></qux></foo>";
        final XMLElement foo = treeFromXMLBuilder.buildTreeFromXML(sampleWithGrandChildren);
        assert foo.getElementName().equals("foo");
        final List<XMLElement> children = foo.getChildren();
        assert children.size() == 2; // bar and qux
        final XMLElement bar = children.get(0);
        assert bar.getElementName().equals("bar");
        assert bar.getElementValue().equals("");
        final List<XMLElement> barChildren = bar.getChildren();
        assert barChildren.size() == 1;
        final XMLElement baz = barChildren.get(0);
        assert baz.getElementName().equals("baz");
        assert baz.getElementValue().equals("test");
        final XMLElement qux = children.get(1);
        assert qux.getAttributes().size() == 1;
        assert qux.getAttributes().get(0).getName().equals("att");
        assert qux.getAttributes().get(0).getValue().equals("tta");
        final List<XMLElement> quxChildren = qux.getChildren();
        assert quxChildren.size() == 1;
        final XMLElement fox = quxChildren.get(0);
        assert fox.getElementName().equals("fox");
        assert fox.getElementValue().equals("jumped");
        // System.out.println(sampleWithGrandChildren);
    }

    private static void withCharacterData() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final String sampleXMLWithCData = "<foo><![CDATA[ This must be preserved!!! ]]></foo>";
        final XMLElement root = treeFromXMLBuilder.buildTreeFromXML(sampleXMLWithCData);
        assert root.getElementName().equals("foo");
        assert root.getElementValue().equals(" This must be preserved!!! ") : "was: " + root.getElementValue();
    }

    private static void dataWithPreDefinedEntities() {
        treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
        final String withCharacterData = "<foo>&lt;&gt;&quot;&apos;&amp;</foo>";
        final XMLElement root = treeFromXMLBuilder.buildTreeFromXML(withCharacterData);
        assert root.getElementValue().equals("<>\"'&");
    }
}








  • I would feel remiss if I did not recommend that you use the JDOM XML model instead of writing your own parser. Note that I am a maintainer of the JDOM project. JDOM is not a parser (it uses Xerces by default), but it is an in-memory model. Document doc = new SAXBuilder().build(new StringReader(mystringvar)); will parse an XML document in a string....
    – rolfl♦
    May 8 at 2:01
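
For comparison with the suggestion above, here is a minimal sketch that parses the sample input with the DOM parser bundled in the JDK (javax.xml.parsers) instead of JDOM, which is a separate library and is not used here. The class name JdkDomExample and its parse method are hypothetical; only the JDK calls are real.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class; a stand-in for the JDOM approach, JDK only.
final class JdkDomExample {

    private JdkDomExample() {
    }

    /** Parses an XML string into a DOM document with the JDK's built-in parser. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Wrapped so callers do not have to deal with checked exceptions.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<foo><bar>baz</bar><qux fox=\"jump\"/></foo>");
        Element root = doc.getDocumentElement();
        System.out.println(root.getTagName()); // prints foo
        Element qux = (Element) root.getElementsByTagName("qux").item(0);
        System.out.println(qux.getAttribute("fox")); // prints jump
    }
}
```

Unlike the hand-written builder, this route also copes with comments and mixed content, at the cost of a more general (and more verbose) tree model.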
















up vote
3
down vote

favorite












Accepts a String (or a char array) as an input and stores the data in a tree structure.



For example, given the input:



<foo>
<bar>baz</bar>
<qux fox="jump"/>
</foo>


Output will be:



XMLElementelementName='foo', 
children=[XMLElementelementName='bar', elementValue='baz',
XMLElementelementName='qux', attributes=[ElementAttributename='fox', value='jump']



I would like to hear your criticism on design principles (SRP, DRY, KISS, etc..), readability (naming of variables, methods) and maintainability (code structure, methods) of the code you see.



Already notes in the comments of the code but:



  • XML provided as input must not contain any XML comments.

  • Mixed data such as: <MyElement>Some <b>Mixed</b> Data</MyElement> is not supported.

Without further ado, let's jump into the code..



Entity classes



XMLElement.java



package xml2json2;

import java.util.ArrayList;
import java.util.List;

public class XMLElement

private String elementName; // can not be null
private String elementValue = "";

private List<ElementAttribute> attributes = new ArrayList<>(); // can be empty
private List<XMLElement> children = new ArrayList<>(); // can be empty

public String getElementName()
return elementName;


public void setElementName(String elementName)
this.elementName = elementName;


public String getElementValue()
return elementValue;


public void setElementValue(String elementValue)
this.elementValue = elementValue;


public List<ElementAttribute> getAttributes()
return attributes;



public List<XMLElement> getChildren()
return children;


@Override
public String toString()
final StringBuffer sb = new StringBuffer("XMLElement");
sb.append("elementName='").append(elementName).append(''');
if (!elementValue.equals(""))
sb.append(", elementValue='").append(elementValue).append(''');

if (attributes.size() != 0)
sb.append(", attributes=").append(attributes);

if (children.size() != 0)
sb.append(", children=").append(children);

sb.append('');
return sb.toString();




ElementAttribute.java



package xml2json2;

public class ElementAttribute

private String name;
private String value;

public String getName()
return name;


public void setName(String name)
this.name = name;


public String getValue()
return value;


public void setValue(String value)
this.value = value;


@Override
public String toString()
final StringBuffer sb = new StringBuffer("ElementAttribute");
sb.append("name='").append(name).append(''');
sb.append(", value='").append(value).append(''');
sb.append('');
return sb.toString();




Processor



XMLElementTreeBuilderImpl.java



package xml2json2;

// References:
// XML Spec : https://www.liquid-technologies.com/XML
// Regex : https://regexone.com

/*
This tree builder does not support elements with mixed data such as: <MyElement>Some <b>Mixed</b> Data</MyElement>.
Mixed data can contain text and child elements within the containing element. This is typically only used to mark up data (HTML etc).
Its typically only used to hold mark-up/formatted text entered by a person,
it is typically not he best choice for storing machine readable data as adds significant complexity to the parser.
*/

/*
XML to be processed must not contain any comments!
*/

public class XMLElementTreeBuilderImpl

private char xmlArray;
private int currentIndex = 0;


// This class has only 2 public methods:
public XMLElement buildTreeFromXML(String xml)
return buildTreeFromXML(xml.toCharArray());


public XMLElement buildTreeFromXML(char arr)
this.xmlArray = arr;
XMLElement root = nodeFromStringRecursively();
return root;


// Everything else below here is private, i.e. inner workings of the class..
private XMLElement nodeFromStringRecursively()
final XMLElement xmlElement = new XMLElement();

clearWhiteSpace();

if (tagStart()) // A new XML Element is starting..
currentIndex++;
final String elementName = parseStartingTagElement(); // finishes element name..
xmlElement.setElementName(elementName);


clearWhiteSpace();

// We have not closed our tag yet..
// At this point we might have attributes.. Lets add them if they exist..
while (isLetter())
addAttribute(xmlElement);
clearWhiteSpace();


// At this point we will have one of the following in current index:
// [/] -> Self closing tag..
// [>] -> Tag ending - (Data or children or starting or immediately followed by an ending tag..)

if (selfClosingTagEnd())
return xmlElement;


// At this point we are sure this element was not a self closing element..
currentIndex++; // skipping the tag close character, i.e. '>'

// At this point we are facing one of the following cases:
// Assume our starting tag was <foo> for the examples..
// 1 - [</] : Immediate tag end. "</foo>"
// 2 - [sw]+[</] : Any whitespace or any alphanumeric character, one or more repetitions, followed by tag end. "sample</foo>"
// 3 - [s]*(<![CDATA[...]]>)[s]*[</] : Zero or more white space, followed by CDATA. followed by zero or more white space. "<![CDATA[...]]></foo>
// 4 - [s]*[<]+ : Zero or more white space, followed by one or more child start..

int currentCase = currentCase();

switch (currentCase)
case 1: // Immediate closing tag, no data to set, no children to add.. Do nothing.
break;
case 2:
setData(xmlElement);
break;
case 3:
setCData(xmlElement);
case 4:
while (currentCase() == 4) // Add children recursively.
final XMLElement childToken = nodeFromStringRecursively();
xmlElement.getChildren().add(childToken);


walkClosingTag();
return xmlElement;


private String parseStartingTagElement()
final StringBuilder elementNameBuilder = new StringBuilder();
while (!isWhiteSpace() && !selfClosingTagEnd() && !tagEnd())
elementNameBuilder.append(charAtCurrentIndex());
currentIndex++;

final String elementName = elementNameBuilder.toString();
return elementName;


private void addAttribute(XMLElement xmlElement)
// Attribute name..
final StringBuilder attributeNameBuilder = new StringBuilder();
while (!isWhiteSpace() && charAtCurrentIndex() != '=')
attributeNameBuilder.append(charAtCurrentIndex());
currentIndex++;


// Everything in between that is not much of interest to us..
clearWhiteSpace();
currentIndex++; // Passing the '='
clearWhiteSpace();
currentIndex++; // Passing the '"'

// Attribute value..
final StringBuilder attributeValueBuilder = new StringBuilder();
while (charAtCurrentIndex() != '"')
attributeValueBuilder.append(charAtCurrentIndex());
currentIndex++;

currentIndex++; // Passing the final '"'
clearWhiteSpace();

// Build the attribute object and..
final ElementAttribute elementAttribute = new ElementAttribute();
elementAttribute.setName(attributeNameBuilder.toString());
elementAttribute.setValue(attributeValueBuilder.toString());

// ..add the attribute to the xmlElement
xmlElement.getAttributes().add(elementAttribute);


private int currentCase()
if (endTagStart())
return 1;

if (cDataStart())
return 3;

if (tagStart() && !endTagStart())
return 4;

// Here we will look forward, so we need to keep track of where we actually started..
int currentIndexRollBackPoint = currentIndex;
while (!endTagStart() && !cDataStart() && !tagStart())
currentIndex++;
if (endTagStart())
currentIndex = currentIndexRollBackPoint;
return 2;

if (cDataStart())
currentIndex = currentIndexRollBackPoint;
return 3;

if (tagStart() && !endTagStart())
currentIndex = currentIndexRollBackPoint;
return 4;



throw new UnsupportedOperationException("Encountered an unsupported XML.");


private void setData(XMLElement xmlElement)
final StringBuilder dataBuilder = new StringBuilder();
while (!tagStart())
dataBuilder.append(charAtCurrentIndex());
currentIndex++;

String data = dataBuilder.toString();

data = data.replaceAll("&lt;", "<");
data = data.replaceAll("&gt;", ">");
data = data.replaceAll("&quot;", """);
data = data.replaceAll("&apos;", "'");
data = data.replaceAll("&amp;", "&");


xmlElement.setElementValue(data);


private void setCData(XMLElement xmlElement)
final StringBuilder cdataBuilder = new StringBuilder();
while (!endTagStart())
cdataBuilder.append(charAtCurrentIndex());
currentIndex++;

String cdata = cdataBuilder.toString();
cdata = cdata.trim();
// cutting 9 chars because: <![CDATA[
cdata = cdata.substring(9, cdata.indexOf(']'));
xmlElement.setElementValue(cdata);



private void walkClosingTag()
while (!tagEnd())
currentIndex++;

currentIndex++;


// Convenience methods
private void clearWhiteSpace()
while (isWhiteSpace())
currentIndex++;



private boolean isLetter()
return Character.isLetter(charAtCurrentIndex());


private boolean isWhiteSpace()
return Character.isWhitespace(charAtCurrentIndex());


private boolean tagStart()
return charAtCurrentIndex() == '<';


private boolean tagEnd()
return charAtCurrentIndex() == '>';


private boolean endTagStart()
return charAtCurrentIndex() == '<' && charAtNextIndex() == '/';


private boolean selfClosingTagEnd()
return charAtCurrentIndex() == '/' && charAtNextIndex() == '>';


private boolean cDataStart()
return charAtCurrentIndex() == '<' && charAtNextIndex() == '!' && xmlArray[currentIndex + 2] == '[';


private char charAtCurrentIndex()
return xmlArray[currentIndex];


private char charAtNextIndex()
return xmlArray[currentIndex + 1];




Unit Tests



package xml2json2;

import java.util.List;

public class TreeFromXMLBuilderImplTest

private static XMLElementTreeBuilderImpl treeFromXMLBuilder;

public static void main(String args)
selfClosingTagWithoutSpace();
selfClosingTagWithSpace();
selfClosingTagWithNewLine();
emptyElementNoSpace();
emptyElementWithSpace();
emptyElementWithNewLine();
selfClosingTagWithAttributeNoSpace();
selfClosingTagWithAttributeWithSpace();
selfClosingTagWithMultipleAttributes();
xmlElementWithData();
xmlElementWithAttributeAndWithData();
xmlElementWithChild();
sampleXMLNote();
sampleXmlWithGrandChildren();
withCharacterData();
dataWithPreDefinedEntities();


private static void selfClosingTagWithoutSpace()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo/>");
assert xmlElement.getElementName().equals("foo") : "was : " + xmlElement.getElementName();


private static void selfClosingTagWithSpace()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo />");
assert xmlElement.getElementName().equals("foo") : "was : " + xmlElement.getElementName();


private static void selfClosingTagWithNewLine()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foonn/>");
assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();


private static void emptyElementNoSpace()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo></foo>");
assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();


private static void emptyElementWithSpace()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo></foo >");
assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();


private static void emptyElementWithNewLine()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo></foo nnn>");
assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();


private static void selfClosingTagWithAttributeNoSpace()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo bar="baz"/>");
assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
final ElementAttribute attribute = xmlElement.getAttributes().iterator().next();
assert attribute.getName().equals("bar") : "was: " + attribute.getName();
assert attribute.getValue().equals("baz") : "was: " + attribute.getValue();


private static void selfClosingTagWithAttributeWithSpace()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo bar = "baz" />");
assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
final ElementAttribute attribute = xmlElement.getAttributes().iterator().next();
assert attribute.getName().equals("bar") : "was: " + attribute.getName();
assert attribute.getValue().equals("baz") : "was: " + attribute.getValue();


private static void selfClosingTagWithMultipleAttributes()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo bar="baz" qux="booze"/>");
assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
ElementAttribute attribute = xmlElement.getAttributes().get(0);
assert attribute.getName().equals("bar") : "was: " + attribute.getName();
assert attribute.getValue().equals("baz") : "was: " + attribute.getValue();
attribute = xmlElement.getAttributes().get(1);
assert attribute.getName().equals("qux") : "was: " + attribute.getName();
assert attribute.getValue().equals("booze") : "was: " + attribute.getValue();


private static void xmlElementWithData()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo>bar</foo>");
assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
assert xmlElement.getElementValue().equals("bar") : "was: " + xmlElement.getElementValue();


private static void xmlElementWithAttributeAndWithData() {
    treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
    final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo baz = \"baz\" > bar </foo>");
    assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
    assert xmlElement.getElementValue().equals(" bar ") : "was: " + xmlElement.getElementValue();
    ElementAttribute attribute = xmlElement.getAttributes().get(0);
    assert attribute.getName().equals("baz") : "was: " + attribute.getName();
    assert attribute.getValue().equals("baz") : "was: " + attribute.getValue();
}

private static void xmlElementWithChild() {
    treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
    final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo><bar></bar><tar>rat</tar><baz/></foo>");
    assert xmlElement.getElementName().equals("foo");
    assert xmlElement.getAttributes().isEmpty();
    assert xmlElement.getChildren().size() == 3;
    assert xmlElement.getChildren().get(0).getElementName().equals("bar");
    assert xmlElement.getChildren().get(0).getElementValue().equals("");
    assert xmlElement.getChildren().get(1).getElementName().equals("tar");
    assert xmlElement.getChildren().get(1).getElementValue().equals("rat");
    assert xmlElement.getChildren().get(2).getElementName().equals("baz");
    assert xmlElement.getChildren().get(2).getElementValue().equals("");
}

private static void sampleXMLNote() {
    treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
    String note =
            "<note>\n" +
            "<to>Tove</to>\n" +
            "<from>Jani</from>\n" +
            "<heading>Reminder</heading>\n" +
            "<body>Don't forget me this weekend!</body>\n" +
            "</note>";
    final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML(note);
    // For visual inspection..
    System.out.println(xmlElement);
}

/*
<foo>
<bar>
<baz>test</baz>
</bar>
<qux att="tta">
<fox>jumped</fox>
</qux>
</foo>
*/
private static void sampleXmlWithGrandChildren() {
    treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
    String sampleWithGrandChildren = "<foo><bar><baz>test</baz></bar><qux att=\"tta\"><fox>jumped</fox></qux></foo>";
    final XMLElement foo = treeFromXMLBuilder.buildTreeFromXML(sampleWithGrandChildren);
    assert foo.getElementName().equals("foo");
    final List<XMLElement> children = foo.getChildren();
    assert children.size() == 2; // bar and qux
    final XMLElement bar = children.get(0);
    assert bar.getElementName().equals("bar");
    assert bar.getElementValue().equals("");
    final List<XMLElement> barChildren = bar.getChildren();
    assert barChildren.size() == 1;
    final XMLElement baz = barChildren.get(0);
    assert baz.getElementName().equals("baz");
    assert baz.getElementValue().equals("test");
    final XMLElement qux = children.get(1);
    assert qux.getAttributes().size() == 1;
    assert qux.getAttributes().get(0).getName().equals("att");
    assert qux.getAttributes().get(0).getValue().equals("tta");
    final List<XMLElement> quxChildren = qux.getChildren();
    assert quxChildren.size() == 1;
    final XMLElement fox = quxChildren.get(0);
    assert fox.getElementName().equals("fox");
    assert fox.getElementValue().equals("jumped");
    // System.out.println(sampleWithGrandChildren);
}

private static void withCharacterData() {
    treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
    final String sampleXMLWithCData = "<foo><![CDATA[ This must be preserved!!! ]]></foo>";
    final XMLElement root = treeFromXMLBuilder.buildTreeFromXML(sampleXMLWithCData);
    assert root.getElementName().equals("foo");
    assert root.getElementValue().equals(" This must be preserved!!! ") : "was: " + root.getElementValue();
}

private static void dataWithPreDefinedEntities() {
    treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
    final String withCharacterData = "<foo>&lt;&gt;&quot;&apos;&amp;</foo>";
    final XMLElement root = treeFromXMLBuilder.buildTreeFromXML(withCharacterData);
    assert root.getElementValue().equals("<>\"'&");
}




























  • I would feel remiss if I did not recommend that you use the JDOM XML model instead of writing your own parser. Note that I am a maintainer of the JDOM project. JDOM is not a parser (it uses Xerces by default), but it is an in-memory model. Document doc = new SAXBuilder().build(new StringReader(mystringvar)); will parse an XML document in a string....
    – rolfl♦
    May 8 at 2:01
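
For comparison, here is a minimal sketch of the "use an existing parser" route from the comment above. It is not JDOM: to stay dependency-free it swaps in the DOM parser that ships with the JDK (javax.xml.parsers), so the types differ from the JDOM snippet quoted in the comment. The class name DomParseSketch and the helper firstChild are hypothetical names made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; uses only the JDK's built-in DOM API.
class DomParseSketch {

    // Parses an XML string into an in-memory DOM tree.
    // Checked parser exceptions are wrapped so callers need no throws clause.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Returns the first direct child element with the given tag name, or null.
    static Element firstChild(Element parent, String name) {
        NodeList nodes = parent.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            if (node.getNodeType() == Node.ELEMENT_NODE && node.getNodeName().equals(name)) {
                return (Element) node;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Same sample document as in the question.
        Document doc = parse("<foo><bar>baz</bar><qux fox=\"jump\"/></foo>");
        Element root = doc.getDocumentElement();
        System.out.println(root.getTagName());
        System.out.println(firstChild(root, "bar").getTextContent());
        System.out.println(firstChild(root, "qux").getAttribute("fox"));
    }
}
```

Unlike the hand-written builder in the question, a standard parser like this also accepts comments and mixed content, which are the two limitations listed in the question.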




































Accepts a String (or a char array) as an input and stores the data in a tree structure.



For example, given the input:



<foo>
<bar>baz</bar>
<qux fox="jump"/>
</foo>


Output will be:



XMLElementelementName='foo', 
children=[XMLElementelementName='bar', elementValue='baz',
XMLElementelementName='qux', attributes=[ElementAttributename='fox', value='jump']



I would like to hear your criticism on design principles (SRP, DRY, KISS, etc..), readability (naming of variables, methods) and maintainability (code structure, methods) of the code you see.



Already notes in the comments of the code but:



  • XML provided as input must not contain any XML comments.

  • Mixed data such as: <MyElement>Some <b>Mixed</b> Data</MyElement> is not supported.

Without further ado, let's jump into the code..



Entity classes



XMLElement.java



package xml2json2;

import java.util.ArrayList;
import java.util.List;

public class XMLElement

private String elementName; // can not be null
private String elementValue = "";

private List<ElementAttribute> attributes = new ArrayList<>(); // can be empty
private List<XMLElement> children = new ArrayList<>(); // can be empty

public String getElementName()
return elementName;


public void setElementName(String elementName)
this.elementName = elementName;


public String getElementValue()
return elementValue;


public void setElementValue(String elementValue)
this.elementValue = elementValue;


public List<ElementAttribute> getAttributes()
return attributes;



public List<XMLElement> getChildren()
return children;


@Override
public String toString()
final StringBuffer sb = new StringBuffer("XMLElement");
sb.append("elementName='").append(elementName).append(''');
if (!elementValue.equals(""))
sb.append(", elementValue='").append(elementValue).append(''');

if (attributes.size() != 0)
sb.append(", attributes=").append(attributes);

if (children.size() != 0)
sb.append(", children=").append(children);

sb.append('');
return sb.toString();




ElementAttribute.java



package xml2json2;

public class ElementAttribute

private String name;
private String value;

public String getName()
return name;


public void setName(String name)
this.name = name;


public String getValue()
return value;


public void setValue(String value)
this.value = value;


@Override
public String toString()
final StringBuffer sb = new StringBuffer("ElementAttribute");
sb.append("name='").append(name).append(''');
sb.append(", value='").append(value).append(''');
sb.append('');
return sb.toString();




Processor



XMLElementTreeBuilderImpl.java



package xml2json2;

// References:
// XML Spec : https://www.liquid-technologies.com/XML
// Regex : https://regexone.com

/*
This tree builder does not support elements with mixed data such as: <MyElement>Some <b>Mixed</b> Data</MyElement>.
Mixed data can contain text and child elements within the containing element. This is typically only used to mark up data (HTML etc).
Its typically only used to hold mark-up/formatted text entered by a person,
it is typically not he best choice for storing machine readable data as adds significant complexity to the parser.
*/

/*
XML to be processed must not contain any comments!
*/

public class XMLElementTreeBuilderImpl

private char xmlArray;
private int currentIndex = 0;


// This class has only 2 public methods:
public XMLElement buildTreeFromXML(String xml)
return buildTreeFromXML(xml.toCharArray());


public XMLElement buildTreeFromXML(char arr)
this.xmlArray = arr;
XMLElement root = nodeFromStringRecursively();
return root;


// Everything else below here is private, i.e. inner workings of the class..
private XMLElement nodeFromStringRecursively()
final XMLElement xmlElement = new XMLElement();

clearWhiteSpace();

if (tagStart()) // A new XML Element is starting..
currentIndex++;
final String elementName = parseStartingTagElement(); // finishes element name..
xmlElement.setElementName(elementName);


clearWhiteSpace();

// We have not closed our tag yet..
// At this point we might have attributes.. Lets add them if they exist..
while (isLetter())
addAttribute(xmlElement);
clearWhiteSpace();


// At this point we will have one of the following in current index:
// [/] -> Self closing tag..
// [>] -> Tag ending - (Data or children or starting or immediately followed by an ending tag..)

if (selfClosingTagEnd())
return xmlElement;


// At this point we are sure this element was not a self closing element..
currentIndex++; // skipping the tag close character, i.e. '>'

// At this point we are facing one of the following cases:
// Assume our starting tag was <foo> for the examples..
// 1 - [</] : Immediate tag end. "</foo>"
// 2 - [sw]+[</] : Any whitespace or any alphanumeric character, one or more repetitions, followed by tag end. "sample</foo>"
// 3 - [s]*(<![CDATA[...]]>)[s]*[</] : Zero or more white space, followed by CDATA. followed by zero or more white space. "<![CDATA[...]]></foo>
// 4 - [s]*[<]+ : Zero or more white space, followed by one or more child start..

int currentCase = currentCase();

switch (currentCase)
case 1: // Immediate closing tag, no data to set, no children to add.. Do nothing.
break;
case 2:
setData(xmlElement);
break;
case 3:
setCData(xmlElement);
case 4:
while (currentCase() == 4) // Add children recursively.
final XMLElement childToken = nodeFromStringRecursively();
xmlElement.getChildren().add(childToken);


walkClosingTag();
return xmlElement;


private String parseStartingTagElement()
final StringBuilder elementNameBuilder = new StringBuilder();
while (!isWhiteSpace() && !selfClosingTagEnd() && !tagEnd())
elementNameBuilder.append(charAtCurrentIndex());
currentIndex++;

final String elementName = elementNameBuilder.toString();
return elementName;


private void addAttribute(XMLElement xmlElement)
// Attribute name..
final StringBuilder attributeNameBuilder = new StringBuilder();
while (!isWhiteSpace() && charAtCurrentIndex() != '=')
attributeNameBuilder.append(charAtCurrentIndex());
currentIndex++;


// Everything in between that is not much of interest to us..
clearWhiteSpace();
currentIndex++; // Passing the '='
clearWhiteSpace();
currentIndex++; // Passing the '"'

// Attribute value..
final StringBuilder attributeValueBuilder = new StringBuilder();
while (charAtCurrentIndex() != '"')
attributeValueBuilder.append(charAtCurrentIndex());
currentIndex++;

currentIndex++; // Passing the final '"'
clearWhiteSpace();

// Build the attribute object and..
final ElementAttribute elementAttribute = new ElementAttribute();
elementAttribute.setName(attributeNameBuilder.toString());
elementAttribute.setValue(attributeValueBuilder.toString());

// ..add the attribute to the xmlElement
xmlElement.getAttributes().add(elementAttribute);


private int currentCase()
if (endTagStart())
return 1;

if (cDataStart())
return 3;

if (tagStart() && !endTagStart())
return 4;

// Here we will look forward, so we need to keep track of where we actually started..
int currentIndexRollBackPoint = currentIndex;
while (!endTagStart() && !cDataStart() && !tagStart())
currentIndex++;
if (endTagStart())
currentIndex = currentIndexRollBackPoint;
return 2;

if (cDataStart())
currentIndex = currentIndexRollBackPoint;
return 3;

if (tagStart() && !endTagStart())
currentIndex = currentIndexRollBackPoint;
return 4;



throw new UnsupportedOperationException("Encountered an unsupported XML.");


private void setData(XMLElement xmlElement)
final StringBuilder dataBuilder = new StringBuilder();
while (!tagStart())
dataBuilder.append(charAtCurrentIndex());
currentIndex++;

String data = dataBuilder.toString();

data = data.replaceAll("&lt;", "<");
data = data.replaceAll("&gt;", ">");
data = data.replaceAll("&quot;", """);
data = data.replaceAll("&apos;", "'");
data = data.replaceAll("&amp;", "&");


xmlElement.setElementValue(data);


private void setCData(XMLElement xmlElement)
final StringBuilder cdataBuilder = new StringBuilder();
while (!endTagStart())
cdataBuilder.append(charAtCurrentIndex());
currentIndex++;

String cdata = cdataBuilder.toString();
cdata = cdata.trim();
// cutting 9 chars because: <![CDATA[
cdata = cdata.substring(9, cdata.indexOf(']'));
xmlElement.setElementValue(cdata);



private void walkClosingTag()
while (!tagEnd())
currentIndex++;

currentIndex++;


// Convenience methods
private void clearWhiteSpace()
while (isWhiteSpace())
currentIndex++;



private boolean isLetter()
return Character.isLetter(charAtCurrentIndex());


private boolean isWhiteSpace()
return Character.isWhitespace(charAtCurrentIndex());


private boolean tagStart()
return charAtCurrentIndex() == '<';


private boolean tagEnd()
return charAtCurrentIndex() == '>';


private boolean endTagStart()
return charAtCurrentIndex() == '<' && charAtNextIndex() == '/';


private boolean selfClosingTagEnd()
return charAtCurrentIndex() == '/' && charAtNextIndex() == '>';


private boolean cDataStart()
return charAtCurrentIndex() == '<' && charAtNextIndex() == '!' && xmlArray[currentIndex + 2] == '[';


private char charAtCurrentIndex()
return xmlArray[currentIndex];


private char charAtNextIndex()
return xmlArray[currentIndex + 1];




Unit Tests



package xml2json2;

import java.util.List;

public class TreeFromXMLBuilderImplTest

private static XMLElementTreeBuilderImpl treeFromXMLBuilder;

public static void main(String args)
selfClosingTagWithoutSpace();
selfClosingTagWithSpace();
selfClosingTagWithNewLine();
emptyElementNoSpace();
emptyElementWithSpace();
emptyElementWithNewLine();
selfClosingTagWithAttributeNoSpace();
selfClosingTagWithAttributeWithSpace();
selfClosingTagWithMultipleAttributes();
xmlElementWithData();
xmlElementWithAttributeAndWithData();
xmlElementWithChild();
sampleXMLNote();
sampleXmlWithGrandChildren();
withCharacterData();
dataWithPreDefinedEntities();


private static void selfClosingTagWithoutSpace()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo/>");
assert xmlElement.getElementName().equals("foo") : "was : " + xmlElement.getElementName();


private static void selfClosingTagWithSpace()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo />");
assert xmlElement.getElementName().equals("foo") : "was : " + xmlElement.getElementName();


private static void selfClosingTagWithNewLine()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo\n\n/>");
assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();


private static void emptyElementNoSpace()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo></foo>");
assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();


private static void emptyElementWithSpace()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo></foo >");
assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();


private static void emptyElementWithNewLine()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo></foo \n\n\n>");
assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();


private static void selfClosingTagWithAttributeNoSpace()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo bar=\"baz\"/>");
assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
final ElementAttribute attribute = xmlElement.getAttributes().iterator().next();
assert attribute.getName().equals("bar") : "was: " + attribute.getName();
assert attribute.getValue().equals("baz") : "was: " + attribute.getValue();


private static void selfClosingTagWithAttributeWithSpace()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo bar = \"baz\" />");
assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
final ElementAttribute attribute = xmlElement.getAttributes().iterator().next();
assert attribute.getName().equals("bar") : "was: " + attribute.getName();
assert attribute.getValue().equals("baz") : "was: " + attribute.getValue();


private static void selfClosingTagWithMultipleAttributes()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo bar=\"baz\" qux=\"booze\"/>");
assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
ElementAttribute attribute = xmlElement.getAttributes().get(0);
assert attribute.getName().equals("bar") : "was: " + attribute.getName();
assert attribute.getValue().equals("baz") : "was: " + attribute.getValue();
attribute = xmlElement.getAttributes().get(1);
assert attribute.getName().equals("qux") : "was: " + attribute.getName();
assert attribute.getValue().equals("booze") : "was: " + attribute.getValue();


private static void xmlElementWithData()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo>bar</foo>");
assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
assert xmlElement.getElementValue().equals("bar") : "was: " + xmlElement.getElementValue();


private static void xmlElementWithAttributeAndWithData()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo baz = \"baz\" > bar </foo>");
assert xmlElement.getElementName().equals("foo") : "was: " + xmlElement.getElementName();
assert xmlElement.getElementValue().equals(" bar ") : "was: " + xmlElement.getElementValue();
ElementAttribute attribute = xmlElement.getAttributes().get(0);
assert attribute.getName().equals("baz") : "was: " + attribute.getName();
assert attribute.getValue().equals("baz") : "was: " + attribute.getValue();


private static void xmlElementWithChild()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML("<foo><bar></bar><tar>rat</tar><baz/></foo>");
assert xmlElement.getElementName().equals("foo");
assert xmlElement.getAttributes().isEmpty();
assert xmlElement.getChildren().size() == 3;
assert xmlElement.getChildren().get(0).getElementName().equals("bar");
assert xmlElement.getChildren().get(0).getElementValue().equals("");
assert xmlElement.getChildren().get(1).getElementName().equals("tar");
assert xmlElement.getChildren().get(1).getElementValue().equals("rat");
assert xmlElement.getChildren().get(2).getElementName().equals("baz");
assert xmlElement.getChildren().get(2).getElementValue().equals("");


private static void sampleXMLNote()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
String note =
"<note>\n" +
"<to>Tove</to>\n" +
"<from>Jani</from>\n" +
"<heading>Reminder</heading>\n" +
"<body>Don't forget me this weekend!</body>\n" +
"</note>"
;
final XMLElement xmlElement = treeFromXMLBuilder.buildTreeFromXML(note);
// For visual inspection..
System.out.println(xmlElement);


/*
<foo>
<bar>
<baz>test</baz>
</bar>
<qux att="tta">
<fox>jumped</fox>
</qux>
</foo>
*/
private static void sampleXmlWithGrandChildren()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
String sampleWithGrandChildren = "<foo><bar><baz>test</baz></bar><qux att=\"tta\"><fox>jumped</fox></qux></foo>";
final XMLElement foo = treeFromXMLBuilder.buildTreeFromXML(sampleWithGrandChildren);
assert foo.getElementName().equals("foo");
final List<XMLElement> children = foo.getChildren();
assert children.size() == 2; // bar and qux
final XMLElement bar = children.get(0);
assert bar.getElementName().equals("bar");
assert bar.getElementValue().equals("");
final List<XMLElement> barChildren = bar.getChildren();
assert barChildren.size() == 1;
final XMLElement baz = barChildren.get(0);
assert baz.getElementName().equals("baz");
assert baz.getElementValue().equals("test");
final XMLElement qux = children.get(1);
assert qux.getAttributes().size() == 1;
assert qux.getAttributes().get(0).getName().equals("att");
assert qux.getAttributes().get(0).getValue().equals("tta");
final List<XMLElement> quxChildren = qux.getChildren();
assert quxChildren.size() == 1;
final XMLElement fox = quxChildren.get(0);
assert fox.getElementName().equals("fox");
assert fox.getElementValue().equals("jumped");
// System.out.println(sampleWithGrandChildren);


private static void withCharacterData()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final String sampleXMLWithCData = "<foo><![CDATA[ This must be preserved!!! ]]></foo>";
final XMLElement root = treeFromXMLBuilder.buildTreeFromXML(sampleXMLWithCData);
assert root.getElementName().equals("foo");
assert root.getElementValue().equals(" This must be preserved!!! ") : "was: " + root.getElementValue();


private static void dataWithPreDefinedEntities()
treeFromXMLBuilder = new XMLElementTreeBuilderImpl();
final String withCharacterData = "<foo>&lt;&gt;&quot;&apos;&amp;</foo>";
final XMLElement root = treeFromXMLBuilder.buildTreeFromXML(withCharacterData);
assert root.getElementValue().equals("<>\"'&");










asked Apr 30 at 1:01 by Koray Tugay
edited May 7 at 16:17 by mdfst13











  • I would feel remiss if I did not recommend that you use the JDOM XML model instead of writing your own parser. Note that I am a maintainer of the JDOM project. JDOM is not a parser (it uses Xerces by default), but it is an in-memory model. Document doc = new SAXBuilder().build(new StringReader(mystringvar)); will parse an XML document in a string....
    – rolfl♦
    May 8 at 2:01


























1 Answer












Just for the sake of completeness, there are things that this parser doesn't support other than comments and elements with mixed content, such as processing instructions or character references (e.g. &#38; or &#x26; instead of &amp;, i.e. references to Unicode code points rather than entities). Processing instructions are meant to carry information relevant only to the application receiving the XML document and are not part of the data stored by the XML document, but the parser should recognize them nonetheless. And since you support references to the five predefined entities (i.e. &amp;, &lt; and so on), it would seem natural also to support character references.
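To illustrate how character references could sit next to the predefined entities, here is a minimal sketch. The class name ReferenceDecoder and its method decode are hypothetical and not part of the reviewed code, and the sketch does not check that a code point is a legal XML character.

```java
// Hypothetical helper, not part of the reviewed parser.
final class ReferenceDecoder {

    /** Replaces predefined entity references and character references in the given text. */
    static String decode(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c != '&') {
                out.append(c);
                i++;
                continue;
            }
            int end = text.indexOf(';', i);
            if (end < 0) {
                throw new IllegalArgumentException("Unterminated reference at index " + i);
            }
            String body = text.substring(i + 1, end);
            if (body.startsWith("#x")) {
                // Hexadecimal character reference, e.g. &#x26;
                out.appendCodePoint(Integer.parseInt(body.substring(2), 16));
            } else if (body.startsWith("#")) {
                // Decimal character reference, e.g. &#38;
                out.appendCodePoint(Integer.parseInt(body.substring(1)));
            } else {
                out.append(predefined(body));
            }
            i = end + 1;
        }
        return out.toString();
    }

    /** Maps the five predefined entity names to their characters. */
    private static char predefined(String name) {
        switch (name) {
            case "amp":  return '&';
            case "lt":   return '<';
            case "gt":   return '>';
            case "quot": return '"';
            case "apos": return '\'';
            default: throw new IllegalArgumentException("Unknown entity: " + name);
        }
    }
}
```

Malformed numbers surface as a NumberFormatException, which is itself an IllegalArgumentException, so a caller only needs to handle one exception type.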



The parser also doesn't read the prolog, which consists of an optional XML declaration and a likewise optional document type declaration, although, admittedly, the XML declaration only contains information specific to the process of parsing the document itself (such as the XML version, or the character encoding), and the document type declaration defines such things as the structure of the document and entities to be referenced by an entity reference, so it might not make sense for the parser that parses the XML data to also parse the prolog.



So now about what you have implemented:



  • There are several problems with this parser, the biggest of which seems to be that the parser takes it for granted that the document is well-formed and produces valid output even if the syntax of the XML document is invalid. For example, the parser does not check whether the name of an end tag matches the name of the corresponding start tag. Or when parsing attributes in a start tag, it does not check whether the character assumed to be "=" is indeed "=" when there's whitespace between the attribute name and the "=" sign. Likewise, the character assumed to be the opening quotation mark of the attribute value could as well be any other character. This means that the parser would treat <foo attributeName x ybar"> as equivalent to <foo attributeName = "bar">. It gets even worse with CDATA sections, because for all your parser cares, the input could contain garbage syntax like <![jklöä°some character data]]>, and it will treat it as if it were <![CDATA[some character data]]>.
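One way to stop accepting such input is to verify every character the parser currently takes on faith. The following is only a sketch under assumptions: the class StrictAttributeReader and its expect helper are hypothetical names, it reads a single attribute, and it only accepts double-quoted values (XML also allows single quotes).

```java
// Hypothetical sketch: reads one attribute of the form name = "value" and fails on anything else.
final class StrictAttributeReader {
    private final char[] xml;
    private int index;

    StrictAttributeReader(String input) {
        this.xml = input.toCharArray();
    }

    /** Consumes the expected character or fails with the position of the mismatch. */
    private void expect(char expected) {
        if (index >= xml.length || xml[index] != expected) {
            String found = index < xml.length ? "'" + xml[index] + "'" : "end of input";
            throw new IllegalStateException(
                    "Expected '" + expected + "' at index " + index + " but found " + found);
        }
        index++;
    }

    /** Whitespace as defined by the XML specification (four characters only). */
    private static boolean isXmlWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    private void skipWhitespace() {
        while (index < xml.length && isXmlWhitespace(xml[index])) {
            index++;
        }
    }

    /** Returns a two-element array holding the attribute name and value. */
    String[] readAttribute() {
        int nameStart = index;
        while (index < xml.length && !isXmlWhitespace(xml[index]) && xml[index] != '=') {
            index++;
        }
        String name = new String(xml, nameStart, index - nameStart);
        if (name.isEmpty()) {
            throw new IllegalStateException("Missing attribute name at index " + nameStart);
        }
        skipWhitespace();
        expect('=');   // no longer assumed
        skipWhitespace();
        expect('"');   // no longer assumed
        int valueStart = index;
        while (index < xml.length && xml[index] != '"') {
            index++;
        }
        String value = new String(xml, valueStart, index - valueStart);
        expect('"');
        return new String[] {name, value};
    }
}
```

With this, attributeName = "bar" is read normally, while attributeName x ybar" is rejected at the position of the x. The same expect idea can be applied to the CDATA keyword, and a comparison of the end tag name with the start tag name closes the remaining gap.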



  • Another problem is the set of character categories on which you base the decision of how to continue with the parsing process. For example, your method isWhiteSpace() checks the conditions described in the documentation of the method Character.isWhitespace(char). But this is not what counts as whitespace according to the XML specification. The XML specification only counts four Unicode characters as whitespace:



    • U+0020 SPACE

    • U+0009 CHARACTER TABULATION

    • U+000D CARRIAGE RETURN

    • U+000A LINE FEED

    So if an XML document contained a tag like <foo>, but with a U+2003 EM SPACE inserted between foo and >, then your parser would consider it legal XML syntax, when in fact the em-space would be illegal here, because it is neither whitespace nor a legal element name character.



    Similarly, you are not acknowledging the fact that an attribute name does not necessarily have to begin with a letter. It could also begin with a colon, an underscore, or some other exotic character that is not covered by Character.isLetter(), for example U+02EA MODIFIER LETTER YIN DEPARTING TONE MARK, "˪" (whatever that is).



    So how to rectify these issues? Since the XML specification so nicely provides regular expressions, you can simply imitate these regular expressions in the parser. While it would probably not be possible to parse the whole document with a single regular expression due to the possibility of nested tags with the same name, or siblings with the same name, you can at least write regular expressions for small units, like tags, which can also be composed of multiple regular expressions for even smaller, reusable units (like whitespace, name characters etc.). For example:



    String whitespace = "(?:[\\x20\\x09\\x0d\\x0a]+)";
    String nameStartCharacter = "(?:[:A-Z_a-z\\xc0-\\xd6\\xd8-\\xf6\\xf8-\\u02ff" +
    "\\u0370-\\u037d\\u037f-\\u1fff\\u200C-\\u200D\\u2070-\\u218F" +
    "\\u2C00-\\u2FEF\\u3001-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFFD" +
    "\\x{10000}-\\x{EFFFF}])";
    String nameCharacter = "(?:" + nameStartCharacter + "|[-.0-9\\xb7\\u0300-\\u036F\\u203F-\\u2040])";
    String name = "(?:" + nameStartCharacter + nameCharacter + "*)";
    String endTag = "(?:</(?<name>" + name + ")" + whitespace + "?>)";


    And to use endTag:



    Matcher endTagMatcher = Pattern.compile(endTag).matcher("</test>");
    System.out.println(endTagMatcher.matches()); // true
    System.out.println(endTagMatcher.group("name")); // "test"


    Note that I wrapped each regular expression inside a non-capturing group ((?:X)), so that appending a quantifier to it will always work on the whole expression (for instance, if nameCharacter were not wrapped in a group, the quantifier * appended to it in name would only apply to the character class to the right of | in nameCharacter).



    Of course, you cannot use regular expressions with a char[] directly, so you would have to find a way around that. Maybe a CharBuffer can be of use, since it implements CharSequence, and unlike String.subSequence(int, int) and StringBuilder.subSequence(int, int), which create a new String, CharBuffer.subSequence(int, int) does not copy the char data but reads/writes through to the original CharBuffer.
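A minimal sketch of the CharBuffer idea, assuming the parser keeps its char[] and current index as it does now. The class CharBufferMatching and the method endTagNameAt are hypothetical, and the name pattern is deliberately reduced to ASCII to keep the example short; the full character classes from above would be substituted in practice.

```java
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: matching a regular expression against a char[] at a given offset.
final class CharBufferMatching {

    // Simplified, ASCII-only stand-in for the endTag expression.
    private static final Pattern END_TAG =
            Pattern.compile("</(?<name>[A-Za-z_:][-A-Za-z0-9_:.]*)[\\x20\\x09\\x0d\\x0a]*>");

    /** Returns the name of the end tag starting at offset, or null if there is none. */
    static String endTagNameAt(char[] xml, int offset) {
        CharBuffer view = CharBuffer.wrap(xml);   // wraps the array, no copy is made
        Matcher matcher = END_TAG.matcher(view);
        matcher.region(offset, xml.length);       // restrict matching to [offset, end)
        return matcher.lookingAt() ? matcher.group("name") : null;
    }
}
```

lookingAt() anchors the match at the start of the region without requiring it to reach the end, and matcher.end() would give the index at which parsing continues.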




  • Your parser has some bugs:



    • It does not consume the final /> part of an empty element tag, so any elements that follow an empty element will not be interpreted correctly (try it with "<root><foo /><bar></bar></root>"; the program will die from lack of memory since it will be trapped in the loop while (currentCase() == 4)).


    • A closing bracket ] might be part of a CDATA section and does not necessarily end it. Only ]]> is guaranteed to terminate a CDATA section.



    • An element with a CDATA section might still contain other character data not part of the CDATA section. In fact, an element can even contain multiple CDATA sections. To quote the relevant section in the XML specification:




      CDATA sections may occur anywhere character data may occur;




      Your parser will fail with elements that contain other character data (or CDATA sections) in addition to one CDATA section.



    • References may not only occur in character data, but also in attribute values.
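The CDATA points above can be sketched as follows. This is only an illustration under assumptions: the class CharacterDataReader and the method readText are hypothetical, the input is a String rather than the char array your parser uses, and references in the plain character data are not decoded here.

```java
// Hypothetical sketch: collects character data and any number of CDATA sections
// until the next tag that is not a CDATA section.
final class CharacterDataReader {
    private static final String CDATA_START = "<![CDATA[";
    private static final String CDATA_END = "]]>";

    static String readText(String xml, int from) {
        StringBuilder text = new StringBuilder();
        int i = from;
        while (i < xml.length()) {
            if (xml.startsWith(CDATA_START, i)) {
                // Only "]]>" terminates the section; a single ']' is ordinary content.
                int end = xml.indexOf(CDATA_END, i + CDATA_START.length());
                if (end < 0) {
                    throw new IllegalArgumentException("Unterminated CDATA section at index " + i);
                }
                text.append(xml, i + CDATA_START.length(), end);
                i = end + CDATA_END.length();
            } else if (xml.charAt(i) == '<') {
                break; // start or end tag: the text content is finished
            } else {
                text.append(xml.charAt(i));
                i++;
            }
        }
        return text.toString();
    }
}
```

Because startsWith checks the whole keyword, garbage such as <![jklöä°...]]> is no longer mistaken for a CDATA section; it simply ends the text, and the tag parser can then reject it.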




  • Finally, some stylistic suggestions:




    • In this loop:



      while (!endTagStart() && !cDataStart() && !tagStart()) {
          currentIndex++;
          if (endTagStart()) {
              currentIndex = currentIndexRollBackPoint;
              return 2;
          }
          if (cDataStart()) {
              currentIndex = currentIndexRollBackPoint;
              return 3;
          }
          if (tagStart() && !endTagStart()) {
              currentIndex = currentIndexRollBackPoint;
              return 4;
          }
      }


      The termination condition is completely pointless, because the loop will always be terminated from within. Usually, I'm in favor of using concrete termination conditions instead of something like while(true), but here, the termination condition will never evaluate to false and therefore does not fulfill any purpose, so I think that, in this case, the code would be easier to read if you simply changed it to while (true).



    • Since your parser doesn't support elements with mixed content, you could make XMLElement an abstract class and make two subclasses, one for character data elements, and another for elements with child elements. That way, a character data element will not have a meaningless field children, and an element with children will not have the meaningless field value.
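A rough sketch of that class hierarchy, with hypothetical names (Element, TextElement, ParentElement) chosen so they do not clash with your existing XMLElement; attributes are left out for brevity and would live in the abstract base class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical base class: everything that both kinds of element share.
abstract class Element {
    private final String name;

    protected Element(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

// An element that only holds character data, so it has no children field.
final class TextElement extends Element {
    private final String value;

    TextElement(String name, String value) {
        super(name);
        this.value = value;
    }

    String getValue() {
        return value;
    }
}

// An element that only holds child elements, so it has no value field.
final class ParentElement extends Element {
    private final List<Element> children = new ArrayList<>();

    ParentElement(String name) {
        super(name);
    }

    void addChild(Element child) {
        children.add(child);
    }

    List<Element> getChildren() {
        return Collections.unmodifiableList(children);
    }
}
```

The trade-off is that callers have to distinguish the two subclasses, for example with instanceof or a visitor, so whether this is worth it depends on how the tree is consumed.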




























  • Simply beautiful, thank you.
    – Koray Tugay
    May 8 at 12:35










Your Answer




StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);








 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f193234%2fxml-parser-written-in-java%23new-answer', 'question_page');

);

Post as a guest






























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
2
down vote



accepted
+50












Just for the sake of completeness, there are things that this parser doesn't support other than comments and elements with mixed content, such as processing instructions or character references (e.g. & or & instead of &amp;, i.e. references to unicode code points rather than entities). Processing instructions are meant to carry information relevant only to the application receiving the XML document and are not part of the data stored by the XML document, but the parser should recognize them nonetheless. And since you support references to the five predefined entities (i.e. &amp;, &lt; and so on), it would seem natural also to support character references.



The parser also doesn't read the prolog, which consists of an optional XML declaration and a likewise optional document type declaration, although, admittely, the XML declaration only contains information specific to the process of parsing the document itself (such as the XML version, or the character encoding), and the document type declaration defines such things as the stucture of the document and entities to be referenced by an entity reference, so it might not make sense for the parser that parses the XML data to also parse the prolog.



So now about what you have implemented:



  • There are several problems with this parser, the biggest of which seems to be that the parser takes it for granted that the document is well-formed and produces valid output even if the syntax of the XML document is invalid. For example, the parser does not check whether the name of an end tag matches the name of the corresponding start tag. Or when parsing attributes in a start tag, it does not check whether the character assumed to be "=" is indeed "=" when there's whitespace between the attribute name and the "=" sign. Likewise, the character assumed to be the opening quotation mark of the attribute value could as well be any other character. This means that the parser would treat <foo attributeName x ybar"> as equivalent to <foo attributeName = "bar">. It gets even worse with CDATA sections, because for all your parser cares, the input could contain garbage syntax like <![jklöä°some character data]]>, and it will treat it as if it were <![CDATA[some character data]]>.



  • Another problem are the character categories on which you base the decision of how to continue with the parsing process. For example, your method isWhiteSpace() checks the conditions described in the documentation of the method Character.isWhitespace(char). But this is not what counts as whitespace according to the XML specification. The XML specification only counts four Unicode characters as whitespace:



    • U+0020 SPACE

    • U+0009 CHARACTER TABULATION

    • U+000D CARRIAGE RETURN

    • U+000A LINE FEED

    So if an XML document contained a tag like <foo>, but with an U+2003 EM SPACE inserted between foo and >, then your parser would consider it legal XML syntax, when in fact the em-space would be illegal here, because it is neither whitespace nor a legal element name character.



    Similarly, you are not acknowledging the fact that an attribute name does not necessarily have to begin with a letter. It could also begin with a colon, an underscore, or some other exotic character that is not covered by Character.isLetter(), for example U+02EA MODIFIER LETTER YIN DEPARTING TONE MARK, "˪" (whatever that is).



    So how to rectify these issues? Since the XML specification so nicely provides regular expressions, you can simply imitate these regular expressions in the parser. While it would probably not be possible to parse the whole document with a single regular expression due to the possibility of nested tags with the same name, or siblings with the same name, you can at least write regular expressions for small units, like tags, which can also be composed of multiple regular expressions for even smaller, reusable units (like whitespace, name characters etc.). For example:



    String whitespace = "(?:[\x20\x09\x0d\x0a]+)";
    String nameStartCharacter = "(?:[:A-Z_a-z\xc0-\xd6\xd8-\xf6\xf8-\u02ff" +
    "\u0370-\u037d\u037f-\u1fff\u200C-\u200D\u2070-\u218F" +
    "\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD" +
    "\x10000-\xEFFFF])";
    String nameCharacter = "(?:" + nameStartCharacter + "|[-.0-9\xb7\u0300-\u036F\u203F-\u2040])";
    String name = "(?:" + nameStartCharacter + nameCharacter + "*)";
    String endTag = "(?:</(?<name>" + name + ")" + whitespace + "?>)";


    And to use endTag:



    Matcher endTagMatcher = Pattern.compile(endTag).matcher("</test>");
    System.out.println(endTagMatcher.matches()); // true
    System.out.println(endTagMatcher.group("name")); // "test"


    Note that I wrapped each regular expression inside a non-capturing group ((?:X)), so that appending a quantifier to it will always work on the whole expression (for instance, if nameCharacter were not wrapped in a group, the quantifier * appended to it in name would only apply to the character class to the right of | in nameCharacter).



    Of course, you can not use regular expessions with a char directly, so you would have to find a way around that. Maybe a CharBuffer can be of use, since it implements CharSequence, and unlike String.subSequence(int, int) and StringBuilder.subSequence(int, int), which create a new String, CharBuffer.subSequence(int, int) does not copy the char data but reads/writes through to the original CharBuffer.




  • Your parser has some bugs:



    • It does not consume the final /> part of an empty element tag, so any elements that follow an empty element will not be interpreted correctly (try it with "<root><foo /><bar></bar></root>"; the program will die from lack of memory since it will be trapped in the loop while (currentCase() == 4)).


    • A closing bracket ] might be part of a CDATA section and does not necessarily end it. Only ]]> is guaranteed to terminate a CDATA section.



    • An element with a CDATA section might still contain other character data not part of the CDATA section. In fact, an element can even contain multiple CDATA sections. To quote the relevant section in the XML specification:




      CDATA sections may occur anywhere character data may occur;




      Your parser will fail with elements that contain other character data (or CDATA sections) in addition to one CDATA section.



    • References may not only occur in character data, but also in attribute values.




  • Finally, some stylistic suggestions:




    • In this loop:



      while (!endTagStart() && !cDataStart() && !tagStart()) 
      currentIndex++;
      if (endTagStart())
      currentIndex = currentIndexRollBackPoint;
      return 2;

      if (cDataStart())
      currentIndex = currentIndexRollBackPoint;
      return 3;

      if (tagStart() && !endTagStart())
      currentIndex = currentIndexRollBackPoint;
      return 4;




      The termination condition is completely pointless, because the loop will always be terminated from within. Usually, I'm in favor of using concrete termination conditions instead of something like while(true), but here, the termination condition will never evaluate to false and therefore does not fulfill any purpose, so I think that, in this case, the code would be easier to read if you simply changed it to while (true).



    • Since your parser doesn't support elements with mixed content, you could make XMLElement an abstract class and make two subclasses, one for character data elements, and another for elements with child elements. That way, a character data element will not have a meaningless field children, and an element with children will not have the meaningless field value.







share|improve this answer





















  • Simply beautiful, thank you.
    – Koray Tugay
    May 8 at 12:35














up vote
2
down vote



accepted
+50












Just for the sake of completeness, there are things that this parser doesn't support other than comments and elements with mixed content, such as processing instructions or character references (e.g. & or & instead of &amp;, i.e. references to unicode code points rather than entities). Processing instructions are meant to carry information relevant only to the application receiving the XML document and are not part of the data stored by the XML document, but the parser should recognize them nonetheless. And since you support references to the five predefined entities (i.e. &amp;, &lt; and so on), it would seem natural also to support character references.



The parser also doesn't read the prolog, which consists of an optional XML declaration and a likewise optional document type declaration, although, admittely, the XML declaration only contains information specific to the process of parsing the document itself (such as the XML version, or the character encoding), and the document type declaration defines such things as the stucture of the document and entities to be referenced by an entity reference, so it might not make sense for the parser that parses the XML data to also parse the prolog.



So now about what you have implemented:



  • There are several problems with this parser, the biggest of which seems to be that the parser takes it for granted that the document is well-formed and produces valid output even if the syntax of the XML document is invalid. For example, the parser does not check whether the name of an end tag matches the name of the corresponding start tag. Or when parsing attributes in a start tag, it does not check whether the character assumed to be "=" is indeed "=" when there's whitespace between the attribute name and the "=" sign. Likewise, the character assumed to be the opening quotation mark of the attribute value could as well be any other character. This means that the parser would treat <foo attributeName x ybar"> as equivalent to <foo attributeName = "bar">. It gets even worse with CDATA sections, because for all your parser cares, the input could contain garbage syntax like <![jklöä°some character data]]>, and it will treat it as if it were <![CDATA[some character data]]>.



  • Another problem are the character categories on which you base the decision of how to continue with the parsing process. For example, your method isWhiteSpace() checks the conditions described in the documentation of the method Character.isWhitespace(char). But this is not what counts as whitespace according to the XML specification. The XML specification only counts four Unicode characters as whitespace:



    • U+0020 SPACE

    • U+0009 CHARACTER TABULATION

    • U+000D CARRIAGE RETURN

    • U+000A LINE FEED

    So if an XML document contained a tag like <foo>, but with an U+2003 EM SPACE inserted between foo and >, then your parser would consider it legal XML syntax, when in fact the em-space would be illegal here, because it is neither whitespace nor a legal element name character.



    Similarly, you are not acknowledging the fact that an attribute name does not necessarily have to begin with a letter. It could also begin with a colon, an underscore, or some other exotic character that is not covered by Character.isLetter(), for example U+02EA MODIFIER LETTER YIN DEPARTING TONE MARK, "˪" (whatever that is).



    So how to rectify these issues? Since the XML specification so nicely provides regular expressions, you can simply imitate these regular expressions in the parser. While it would probably not be possible to parse the whole document with a single regular expression due to the possibility of nested tags with the same name, or siblings with the same name, you can at least write regular expressions for small units, like tags, which can also be composed of multiple regular expressions for even smaller, reusable units (like whitespace, name characters etc.). For example:



    String whitespace = "(?:[\x20\x09\x0d\x0a]+)";
    String nameStartCharacter = "(?:[:A-Z_a-z\xc0-\xd6\xd8-\xf6\xf8-\u02ff" +
    "\u0370-\u037d\u037f-\u1fff\u200C-\u200D\u2070-\u218F" +
    "\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD" +
    "\x10000-\xEFFFF])";
    String nameCharacter = "(?:" + nameStartCharacter + "|[-.0-9\xb7\u0300-\u036F\u203F-\u2040])";
    String name = "(?:" + nameStartCharacter + nameCharacter + "*)";
    String endTag = "(?:</(?<name>" + name + ")" + whitespace + "?>)";


    And to use endTag:



    Matcher endTagMatcher = Pattern.compile(endTag).matcher("</test>");
    System.out.println(endTagMatcher.matches()); // true
    System.out.println(endTagMatcher.group("name")); // "test"


    Note that I wrapped each regular expression inside a non-capturing group ((?:X)), so that appending a quantifier to it will always work on the whole expression (for instance, if nameCharacter were not wrapped in a group, the quantifier * appended to it in name would only apply to the character class to the right of | in nameCharacter).



    Of course, you can not use regular expessions with a char directly, so you would have to find a way around that. Maybe a CharBuffer can be of use, since it implements CharSequence, and unlike String.subSequence(int, int) and StringBuilder.subSequence(int, int), which create a new String, CharBuffer.subSequence(int, int) does not copy the char data but reads/writes through to the original CharBuffer.




  • Your parser has some bugs:



    • It does not consume the final /> part of an empty element tag, so any elements that follow an empty element will not be interpreted correctly (try it with "<root><foo /><bar></bar></root>"; the program will die from lack of memory since it will be trapped in the loop while (currentCase() == 4)).


    • A closing bracket ] might be part of a CDATA section and does not necessarily end it. Only ]]> is guaranteed to terminate a CDATA section.



    • An element with a CDATA section might still contain other character data not part of the CDATA section. In fact, an element can even contain multiple CDATA sections. To quote the relevant section in the XML specification:




      CDATA sections may occur anywhere character data may occur;




      Your parser will fail with elements that contain other character data (or CDATA sections) in addition to one CDATA section.



    • References may not only occur in character data, but also in attribute values.
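
The two CDATA points above could be handled along the following lines. This is only a sketch with made-up names, working on a String for brevity. It assumes the content consists of character data and CDATA sections only, and it does not decode references.

```java
final class CharacterDataReader {
    private static final String CDATA_START = "<![CDATA[";
    private static final String CDATA_END = "]]>";

    /**
     * Collects plain character data and any number of CDATA sections, starting at
     * index from and stopping at the first '<' that does not open a CDATA section.
     */
    static String readCharacterData(String xml, int from) {
        StringBuilder text = new StringBuilder();
        int i = from;
        while (i < xml.length()) {
            if (xml.startsWith(CDATA_START, i)) {
                int bodyStart = i + CDATA_START.length();
                // Only the full "]]>" sequence ends the section; a lone ']' does not.
                int bodyEnd = xml.indexOf(CDATA_END, bodyStart);
                if (bodyEnd < 0) {
                    throw new IllegalArgumentException("Unterminated CDATA section at index " + i);
                }
                text.append(xml, bodyStart, bodyEnd); // CDATA content is taken literally
                i = bodyEnd + CDATA_END.length();
            } else if (xml.charAt(i) == '<') {
                break; // a start tag or end tag follows
            } else {
                text.append(xml.charAt(i));
                i++;
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String content = "a<![CDATA[b]c]]>d<![CDATA[<e>]]></x>";
        System.out.println(readCharacterData(content, 0)); // ab]cd<e>
    }
}
```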




  • Finally, some stylistic suggestions:




    • In this loop:



      while (!endTagStart() && !cDataStart() && !tagStart()) {
          currentIndex++;
          if (endTagStart()) {
              currentIndex = currentIndexRollBackPoint;
              return 2;
          }
          if (cDataStart()) {
              currentIndex = currentIndexRollBackPoint;
              return 3;
          }
          if (tagStart() && !endTagStart()) {
              currentIndex = currentIndexRollBackPoint;
              return 4;
          }
      }




      The termination condition is completely pointless, because the loop will always be terminated from within. Usually, I'm in favor of using concrete termination conditions instead of something like while(true), but here, the termination condition will never evaluate to false and therefore does not fulfill any purpose, so I think that, in this case, the code would be easier to read if you simply changed it to while (true).



    • Since your parser doesn't support elements with mixed content, you could make XMLElement an abstract class and make two subclasses, one for character data elements, and another for elements with child elements. That way, a character data element will not have a meaningless field children, and an element with children will not have the meaningless field value.
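
A rough sketch of what such a hierarchy might look like follows. The subclass names `TextElement` and `ParentElement` are invented here, and attributes are left out to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Common base: every element has a name, which must not be null.
abstract class XMLElement {
    private final String elementName;

    protected XMLElement(String elementName) {
        this.elementName = Objects.requireNonNull(elementName, "elementName");
    }

    String getElementName() {
        return elementName;
    }
}

// Hypothetical subclass for elements that contain only character data.
final class TextElement extends XMLElement {
    private final String elementValue;

    TextElement(String elementName, String elementValue) {
        super(elementName);
        this.elementValue = elementValue;
    }

    String getElementValue() {
        return elementValue;
    }
}

// Hypothetical subclass for elements that contain only child elements.
final class ParentElement extends XMLElement {
    private final List<XMLElement> children = new ArrayList<>();

    ParentElement(String elementName) {
        super(elementName);
    }

    void addChild(XMLElement child) {
        children.add(Objects.requireNonNull(child, "child"));
    }

    List<XMLElement> getChildren() {
        return Collections.unmodifiableList(children);
    }
}
```

With this split, code that walks the tree has to distinguish the two kinds of element explicitly (for example with instanceof), which is the price for not carrying unused fields.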







  • Simply beautiful, thank you.
    – Koray Tugay
    May 8 at 12:35












answered May 8 at 1:17









Stingy













 
