Project 03: HTML Management
This project focuses on implementing a class to manage and validate HTML tags using stacks and queues. You’ll write code to detect and fix invalid HTML tag nesting, similar to how web browsers handle malformed HTML.
You don’t need HTML knowledge to complete this assignment. The focus is on Java collections like stacks and queues.
There are no B tests on this project. Part 1 corresponds to the C tests and part 2 to the A tests.
What is HTML?
HTML (Hypertext Markup Language) uses tags to format text. A tag consists of a named element between less-than (<) and greater-than (>) symbols. For example:
<b>bold text</b>
<i>italic text</i>
<b><i>bold and italic</i></b>
There are four types of tags you’ll need to handle:
- Opening tags like <b> that start formatting
- Closing tags like </b> that end formatting
- Self-closing tags like <br/> or <img /> that don’t enclose text
- Comment tags like <!-- any text -->
Additionally, tags can have attributes, like <img src="cat.jpg" width="100">.
The HTMLTag Class
You’ll be working with HTMLTag objects that have these key methods:
- HTMLTag(String element, HTMLTagType type) - Constructor that takes the tag’s contents and type
- isOpening() - Returns true if this is an opening tag like <p>
- isClosing() - Returns true if this is a closing tag like </p>
- isSelfClosing() - Returns true if this is a self-closing tag like <br/>
- matches(other) - Returns true if the other tag matches this one (e.g. <p> matches </p>)
- getMatching() - Returns a new matching tag of opposite type (e.g. <p> returns </p>)
- equals(other) - Returns true if the tags are equal (same type, contents, attributes, etc.)
- toString() - Returns a string representation like "</p>"
- getElement() - Returns the element name of the tag (e.g. “p” for <p>)
Note: The element parameter in the constructor refers to the contents inside the tag, excluding:
- The angle brackets <>
- Any starting / (for closing tags)
- Any ending / (for self-closing tags)
- The <!-- and --> delimiters for comments
- Any whitespace before or after these special characters
For example:
"<p>" → element is "p"
"</p>" → element is "p"
"<br />" → element is "br"
"<!-- comment -->" → element is "comment"
"< p >" → element is "p"
Part 1: HTML Manager Implementation
In this part you’ll create a class called HTMLManager that stores HTML tags and can fix invalid nesting. The class must use a Queue to store the tags internally.
HTML Tag Validation
Tags must follow certain rules to be valid:
- Every opening tag must have a matching closing tag
- Inner tags must be closed before outer tags are closed
- Self-closing tags don’t need to be closed
Examples of invalid HTML:
<b><i>this is invalid</b></i> <!-- Wrong: i tag should close first -->
<b>unclosed tag <!-- Wrong: missing </b> -->
</p>no opening tag</p> <!-- Wrong: closing tag with no opener -->
Valid HTML:
<p>
This <b>word</b> is bold.
<br/> <!-- Self-closing tag is fine -->
<b><i>Both</i></b> formats. <!-- Proper nesting -->
</p>
Constructor and Basic Methods
public HTMLManager(Queue<HTMLTag> page)
Takes in a Queue of HTMLTags representing an HTML page. If the queue is null, an IllegalArgumentException should be thrown.
Otherwise, this constructor creates and stores a deep copy of the queue page.
A deep copy means that the data structure is not a reference to the original one; instead, this constructor should make a new Queue and add all the original elements of page to it.
The given queue page must be unchanged after this constructor returns.
public void add(HTMLTag tag)
Adds an HTMLTag to the end of the stored tags. If the tag is null, an IllegalArgumentException should be thrown.
public List<HTMLTag> getTags()
Returns a deeply copied List of all stored tags in order.
A deep copy means that the List returned must be a new one that has the same contents as your internal field (rather than a reference to the internal field itself), to avoid exposing internal state.
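The copying described for the constructor, add, and getTags() can be sketched as follows. This is a minimal illustration, not the required implementation: CopySketch and getItems are hypothetical names, the type parameter T stands in for HTMLTag, and the use of LinkedList and ArrayList copy constructors is an assumption (your course may expect an explicit loop instead).

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for HTMLManager's storage; T plays the role of HTMLTag.
class CopySketch<T> {
    private final Queue<T> items;

    CopySketch(Queue<T> source) {
        if (source == null) {
            throw new IllegalArgumentException("source queue must not be null");
        }
        // The copy constructor iterates over source without removing anything,
        // so the caller's queue is left unchanged.
        this.items = new LinkedList<>(source);
    }

    void add(T item) {
        if (item == null) {
            throw new IllegalArgumentException("item must not be null");
        }
        items.add(item); // appended at the end of the queue
    }

    List<T> getItems() {
        // A fresh list: callers can modify it without touching the internal queue.
        return new ArrayList<>(items);
    }
}
```

Because both directions copy, changes to the caller's queue after construction, or to a returned list, do not affect the stored tags.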
public String toString()
Returns a pretty-printed string representation of the HTML with the following requirements:
- Each nested tag increases indentation by 2 spaces
- Self-closing and content tags don’t affect indentation
- Comment tags appear on their own line
- Opening tags start a new line
- Content appears after its opening tag
- Closing tags appear on a new line if they close a tag with nested content
Example:
<div>
  <p>
    Hello
    <br/>
    <!-- comment -->
    <b>
      World
    </b>
  </p>
</div>
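The indentation rule (2 spaces per level of nesting) can be illustrated with a simplified sketch. It makes several assumptions that are not part of the assignment: tags are plain strings classified by their text rather than HTMLTag objects, every token is placed on its own line, and the rules about content or closing tags sharing a line are not modeled. PrettyPrintSketch and render are hypothetical names.

```java
import java.util.List;

// Simplified, string-based illustration of depth-based indentation.
class PrettyPrintSketch {
    static String render(List<String> tokens) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        for (String token : tokens) {
            boolean comment = token.startsWith("<!--");
            boolean closing = token.startsWith("</");
            boolean selfClosing = token.startsWith("<") && token.endsWith("/>");
            boolean opening = token.startsWith("<") && !comment && !closing && !selfClosing;

            // A closing tag lines up with its opener, so step back before printing.
            if (closing) {
                depth = Math.max(0, depth - 1);
            }
            out.append("  ".repeat(depth)).append(token).append('\n');
            // Everything after an opening tag is nested one level deeper.
            if (opening) {
                depth++;
            }
        }
        return out.toString();
    }
}
```

Self-closing tags, comments, and text fall through without changing the depth, which is the behavior the requirements describe for them.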
The fixHTML Method
public void fixHTML()
Fixes invalid HTML tag nesting using a stack-based algorithm:
- Create a Stack to track unclosed opening tags
- Create a new Queue for the fixed output
- Process each tag from the stored queue:
  - For self-closing tags: add directly to output
  - For comment/content tags: add directly to output (you may find isNotTag() helpful to identify these tags)
  - For opening tags: add to output and push onto stack
  - For closing tags:
    - If stack is empty: discard tag (no matching opener)
    - If matches top of stack: pop stack and add to output
    - If doesn’t match: add closing tags for items on stack until matching tag found, then process current tag
- After all tags processed, add closing tags for any remaining items on stack
Examples:
- Invalid nesting fixed:
  Input:  [<b>, <i>, </b>, </i>]
  Output: [<b>, <i>, </i>, </b>]

  Input:  [<p>, <b>, </p>]
  Output: [<p>, <b>, </b>, </p>]

  Input:  [</b>, <i>]
  Output: [<i>, </i>] // Discarded </b> with no opener
- Self-closing tags preserved:
  Input:  [<p>, <br/>, </p>]
  Output: [<p>, <br/>, </p>]
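The stack-based algorithm above can be sketched on plain strings. This is an illustration under stated assumptions, not the reference solution: FixNestingSketch and its helpers are hypothetical stand-ins for HTMLTag's isOpening(), isClosing(), and getMatching(), tags are assumed to have no attributes, and a closing tag whose opener is nowhere on the stack is discarded after the stack has been emptied (one literal reading of the "until matching tag found" step).

```java
import java.util.LinkedList;
import java.util.Queue;
import java.util.Stack;

// String-based sketch of the nesting fix; real code would use HTMLTag objects.
class FixNestingSketch {
    static boolean isComment(String t) { return t.startsWith("<!--"); }
    static boolean isClosing(String t) { return t.startsWith("</"); }
    static boolean isSelfClosing(String t) { return t.startsWith("<") && t.endsWith("/>"); }
    static boolean isOpening(String t) {
        return t.startsWith("<") && !isComment(t) && !isClosing(t) && !isSelfClosing(t);
    }
    // Assumes attribute-free tags: "<b>" becomes "</b>".
    static String closerFor(String opening) { return "</" + opening.substring(1); }

    static Queue<String> fix(Queue<String> tags) {
        Stack<String> open = new Stack<>();      // unclosed opening tags
        Queue<String> out = new LinkedList<>();  // fixed output
        for (String tag : tags) {
            if (isOpening(tag)) {
                out.add(tag);
                open.push(tag);
            } else if (isClosing(tag)) {
                // Close inner tags until the matching opener is on top of the stack.
                while (!open.isEmpty() && !closerFor(open.peek()).equals(tag)) {
                    out.add(closerFor(open.pop()));
                }
                if (!open.isEmpty()) {
                    open.pop();
                    out.add(tag);
                }
                // Otherwise the stack is empty: discard the tag, it has no opener.
            } else {
                out.add(tag); // self-closing, comment, or content
            }
        }
        // Close whatever is still open, innermost first.
        while (!open.isEmpty()) {
            out.add(closerFor(open.pop()));
        }
        return out;
    }
}
```

Tracing the first example: when </b> arrives, <i> is on top of the stack, so </i> is emitted first; the later </i> then finds an empty stack and is dropped.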
Task 1. Implement the HTMLManager class methods in the order explained in this guide.
Part 2: HTML Parser Implementation
The HTMLParser Class
You’ll work with the HTMLParser class that parses raw HTML content into HTMLTag objects. The class maintains an invariant that the current page content always starts with a potential HTML tag or text content; any already processed content has been removed. This means that any time your findCommentTag, findNormalTag, or findContent functions are called, they should try to match the start of the current page with whatever type of HTMLTag you are constructing.
Reading the following section about the regex pattern given to you will almost certainly be necessary to complete this assignment.
In the HTMLParser class, you are given a regex Pattern called TAG_PATTERN, which is ^<\\s*(?<closing>/)?\\s*(?<tagData>[^>]*[^/> ])\\s*(?<selfclosing>/)?\\s*> (written here as it appears in a Java string literal, with doubled backslashes). How this regex works will be discussed in lecture, and you can go to Prof. Blank’s OH for a deeper understanding if you’re curious! Below, we show some of the syntax necessary for using this Pattern effectively:
- Matcher m = TAG_PATTERN.matcher(text); will create a Matcher which can be used to apply your Pattern to text, which is a String here.
- m.find() will “search” for the pattern in the text (this must be done before the following operations). Additionally, it will return false if no match is found.
- m.start() will return the index where the match begins.
- m.group(groupName) will return the contents of the named group groupName in the match that was found. If it was an optional group and was not matched, this will return null. The named groups in the given regex are closing, tagData, and selfclosing.
- m.group(0) will return the entire string that was matched.
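The Matcher calls listed above can be tried out in isolation. The sketch below recompiles the same pattern in a hypothetical class, TagPatternDemo, and uses a hypothetical helper, describe, to classify the tag at the start of a string. The "kind:tagData" output format is only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Demonstrates find(), group(name), and null for unmatched optional groups.
class TagPatternDemo {
    static final Pattern TAG_PATTERN = Pattern.compile(
        "^<\\s*(?<closing>/)?\\s*(?<tagData>[^>]*[^/> ])\\s*(?<selfclosing>/)?\\s*>");

    // Returns "kind:tagData" for a tag at the start of text, or null if there is none.
    static String describe(String text) {
        Matcher m = TAG_PATTERN.matcher(text);
        if (!m.find()) {
            return null; // the ^ anchor means only the start of text can match
        }
        String kind;
        if (m.group("closing") != null) {
            kind = "closing";
        } else if (m.group("selfclosing") != null) {
            kind = "self-closing";
        } else {
            kind = "opening";
        }
        return kind + ":" + m.group("tagData");
    }
}
```

For instance, describe("< / p >rest") yields "closing:p" and describe("<br />") yields "self-closing:br", while describe("Hello<p>") yields null because the text does not start with a tag.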
public HTMLTag findNormalTag()
Returns a new “normal” (opening, closing, or self-closing) HTMLTag corresponding to the first tag remaining in the page if there is one.
If no valid tag is found at the beginning of the current page, returns null.
To do this, you should:
- Extract the content of the tag and construct the HTMLTag with it as the element and the correct tag type
- Update the prevTag field to the empty string for non-opening tags, and the element name for opening tags
- Remove the matched tag from the current page content
Hints:
- You should make sure to ignore all excess whitespace within tags, meaning whitespace between <, /, and >. For example, < / p > is just a closing tag with element “p”. Note this does not include other whitespace, such as whitespace between attributes.
- You should use getElement() to get the element name of a tag for the purposes of setting prevTag
- You will find TAG_PATTERN helpful for this method.
Example:
// If page content is "<p class="main">Hello</p>"
findNormalTag() // Returns HTMLTag for element `p class="main"` and updates prevTag using getElement()
// Page content becomes "Hello</p>"
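The three steps (extract, update prevTag, consume the match) can be sketched without the real HTMLTag class. Everything here is an assumption made for illustration: NormalTagSketch, its page and prevTag fields, and next() are hypothetical; next() returns the tag contents as a String instead of constructing an HTMLTag; and the element name is approximated as the first whitespace-separated token, where the real code would call getElement().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical mini-parser holding the unprocessed page and the previous tag name.
class NormalTagSketch {
    static final Pattern TAG_PATTERN = Pattern.compile(
        "^<\\s*(?<closing>/)?\\s*(?<tagData>[^>]*[^/> ])\\s*(?<selfclosing>/)?\\s*>");

    String page;          // unprocessed remainder of the page
    String prevTag = "";  // element name of the last opening tag, else ""

    NormalTagSketch(String page) {
        this.page = page;
    }

    // Returns the tag contents (what would be passed as the element), or null.
    String next() {
        Matcher m = TAG_PATTERN.matcher(page);
        if (!m.find()) {
            return null; // page does not start with a tag; nothing is consumed
        }
        String tagData = m.group("tagData");
        boolean opening = m.group("closing") == null && m.group("selfclosing") == null;
        // Assumption: the element name is the first token of the tag contents.
        prevTag = opening ? tagData.split("\\s+")[0] : "";
        page = page.substring(m.end()); // drop the matched tag from the page
        return tagData;
    }
}
```

Starting from the page <p class="main">Hello</p>, one call to next() returns p class="main", sets prevTag to "p", and leaves Hello</p> as the page.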
public HTMLTag findCommentTag()
Returns a new “comment” HTMLTag corresponding to the first tag remaining in the page if there is one.
If no valid comment tag is found at the beginning of the current page, returns null.
To do this, you should:
- Extract the content of the comment tag and construct the HTMLTag with it as the element and the correct tag type
- Update the prevTag field to the empty string
- Remove the entire comment tag, through its closing -->, from the current page content
Hints:
- You should make sure to preserve the contents of the comment exactly as it is, removing only the opening <!-- and closing -->
- You may find writing your own regex pattern here helpful, but you do not have to.
Example:
// If page content is "<!-- header -->content"
findCommentTag() // Returns comment tag with " header " content
// Page content becomes "content"
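One way to locate the comment boundaries without a regex is String.indexOf. The sketch below is an illustration only: CommentSketch and splitComment are hypothetical names, and the method returns the comment text and the remaining page as a two-element array rather than building an HTMLTag or updating parser fields.

```java
// Splits a leading comment off a page using plain string searches.
class CommentSketch {
    // Returns {commentText, remainingPage}, or null if the page does not
    // start with a complete <!-- ... --> comment.
    static String[] splitComment(String page) {
        final String open = "<!--";
        final String close = "-->";
        if (!page.startsWith(open)) {
            return null;
        }
        // Search for the terminator only after the opening delimiter.
        int end = page.indexOf(close, open.length());
        if (end < 0) {
            return null; // unterminated comment
        }
        String text = page.substring(open.length(), end); // whitespace preserved
        String rest = page.substring(end + close.length());
        return new String[] { text, rest };
    }
}
```

With the page <!-- header -->content, the first element is " header " (spaces intact) and the second is "content".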
public HTMLTag findContent()
Returns a new “content” HTMLTag corresponding to the non-tag text between an innermost opening tag and the corresponding closing tag of the first tag remaining in the page, if there is one.
If no valid closing tag is found, returns null.
To do this, you should:
- If prevTag is a script tag, search for the next closing script tag and extract all content up to it
- Otherwise, extract content up to the next closing tag
- Construct the HTMLTag with the extracted content as the element and the correct tag type
- Update the prevTag field to the empty string
- Remove the extracted content from the current page content
Hints:
- You should make sure to preserve the text of the content exactly as it is, up to the next closing tag or closing script tag as described above.
- You may find writing your own regex pattern here helpful, but you do not have to.
Example:
// If page content is "Hello<br>"
findContent() // Returns content tag with "Hello"
// Page content becomes "<br>"
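The two branches (script versus everything else) can be sketched as a boundary search. The assumptions matter here: ContentSketch, CLOSING_SCRIPT, and splitContent are hypothetical; outside of script, the sketch stops at the next < character, which reproduces the example above (content ends at <br>) but is looser than "next closing tag" as worded in the steps, so adjust the boundary to whatever your tests require. Inside script, a < may appear in the code itself, so only a closing script tag ends the content.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Finds where leading text content ends, with a special case for script bodies.
class ContentSketch {
    // Tolerates whitespace inside the tag, e.g. "< / script >".
    static final Pattern CLOSING_SCRIPT = Pattern.compile("<\\s*/\\s*script\\s*>");

    // Returns {content, remainingPage}, or null if no boundary is found.
    static String[] splitContent(String page, String prevTag) {
        int end;
        if ("script".equals(prevTag)) {
            Matcher m = CLOSING_SCRIPT.matcher(page);
            end = m.find() ? m.start() : -1;
        } else {
            end = page.indexOf('<'); // assumption: any tag start ends the content
        }
        if (end < 0) {
            return null;
        }
        return new String[] { page.substring(0, end), page.substring(end) };
    }
}
```

The boundary tag itself stays in the remaining page, so the next parsing step still sees it at the start.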
Task 2. Implement the HTMLParser class methods in the order explained in this guide.