Project 03: HTML Management

This project focuses on implementing a class to manage and validate HTML tags using stacks and queues. You’ll write code to detect and fix invalid HTML tag nesting, similar to how web browsers handle malformed HTML.

You don’t need HTML knowledge to complete this assignment. The focus is on Java collections like stacks and queues.

There are no B tests on this project. Part 1 corresponds to the C tests and part 2 to the A tests.

What is HTML?

HTML (Hypertext Markup Language) uses tags to format text. A tag consists of a named element between less-than (<) and greater-than (>) symbols. For example:

<b>bold text</b>
<i>italic text</i>
<b><i>bold and italic</i></b>

There are four types of tags you’ll need to handle:

Opening tags like <b> that start formatting
Closing tags like </b> that end formatting
Self-closing tags like <br /> or <img /> that don’t enclose text
Comment tags like <!-- comment -->

Additionally, tags can have attributes like <img src="cat.jpg" width="100">.

The `HTMLTag` Class

You’ll be working with HTMLTag objects that have these key methods:

HTMLTag(String element, HTMLTagType type) - Constructor that takes the tag’s contents and type
isOpening() - Returns true if this is an opening tag like <p>
isClosing() - Returns true if this is a closing tag like </p>
isSelfClosing() - Returns true if this is a self-closing tag like <br />
matches(other) - Returns true if the other tag matches this one (e.g. <p> matches </p>)
getMatching() - Returns a new matching tag of opposite type (e.g. <p> returns </p>)
equals(other) - Returns true if the tags are equal (same type, contents, attributes etc)
toString() - Returns string representation like "<p>"
getElement() - Returns the element name of the tag (e.g. “p” for <p>)

Note: The element parameter in the constructor refers to the contents inside the tag, excluding:

The angle brackets < and >
Any starting / (for closing tags)
Any ending / (for self-closing tags)
The <!-- and --> delimiters for comments
Any whitespace before or after these special characters

For example:

"<p>" → element is "p"
"</p>" → element is "p"
"<br />" → element is "br"
"<!-- comment -->" → element is "comment"
"<  p  >" → element is "p"

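As a rough illustration of the exclusions listed above, the following sketch reduces the raw text of a single tag to its element string. The class ElementSketch and the helper elementOf are hypothetical names used only for this illustration; they are not part of the provided HTMLTag class, and the sketch assumes its input is one complete tag that starts with < and ends with >.

```java
class ElementSketch {
    // Hypothetical helper (not part of HTMLTag): strips the special characters
    // listed above and returns what would be passed as the element.
    static String elementOf(String rawTag) {
        String s = rawTag.trim();
        if (s.length() >= 7 && s.startsWith("<!--") && s.endsWith("-->")) {
            // Comment: drop the <!-- and --> delimiters and the whitespace next to them.
            return s.substring(4, s.length() - 3).trim();
        }
        // Drop the angle brackets and the whitespace next to them.
        s = s.substring(1, s.length() - 1).trim();
        if (s.startsWith("/")) {
            s = s.substring(1).trim();                 // leading / of a closing tag
        }
        if (s.endsWith("/")) {
            s = s.substring(0, s.length() - 1).trim(); // trailing / of a self-closing tag
        }
        return s;
    }
}
```

Calling ElementSketch.elementOf on each of the five example strings above yields the element shown next to it.
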
Part 1: HTML Manager Implementation

In this part you’ll create a class called HTMLManager that stores HTML tags and can fix invalid nesting. The class must use a Queue to store the tags internally.

HTML Tag Validation

Tags must follow certain rules to be valid:

Every opening tag must have a matching closing tag
Inner tags must be closed before outer tags are closed
Self-closing tags don’t need to be closed

Examples of invalid HTML:

<b><i>this is invalid</b></i> <!-- Wrong: i tag should close first -->
<b>unclosed tag <!-- Wrong: missing </b> -->
</p>no opening tag</p> <!-- Wrong: closing tag with no opener -->

Valid HTML:

<p>
This <b>word</b> is bold.
<br/> <!-- Self-closing tag is fine -->
<b><i>Both</i></b> formats. <!-- Proper nesting -->
</p>

Constructor and Basic Methods

public HTMLManager(Queue<HTMLTag> page)

Takes in a Queue of HTMLTags representing an HTML page. If the queue is null, an IllegalArgumentException should be thrown. Otherwise, this constructor creates and stores a deep copy of the queue page.

A deep copy means that the data structure is not a reference to the original one; instead, this constructor should make a new Queue and add all the original elements of page to it. The given queue page must be unchanged after this constructor.

public void add(HTMLTag tag)

Adds an HTMLTag to the end of the stored tags. If the tag is null, an IllegalArgumentException should be thrown.

public List<HTMLTag> getTags()

Returns a deeply copied List of all stored tags in order.

A deep copy means that the List returned must be a new one that has the same contents of your internal field (rather than returning a reference to the internal field) to avoid exposing internal state.
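The copying behavior described for the constructor, add, and getTags can be sketched with a small generic container. This is only an illustration under stated assumptions: CopySketch and getItems are hypothetical names, and the element type is left generic so the example does not depend on the provided HTMLTag class. As in the description above, "copy" here means a new collection holding the same elements.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for a class that stores a private copy of a queue.
class CopySketch<T> {
    private final Queue<T> items;

    CopySketch(Queue<T> source) {
        if (source == null) {
            throw new IllegalArgumentException("queue must not be null");
        }
        // The copy constructor reads the source in FIFO order and leaves it unchanged.
        this.items = new LinkedList<>(source);
    }

    void add(T item) {
        if (item == null) {
            throw new IllegalArgumentException("item must not be null");
        }
        items.add(item); // appended at the end of the stored queue
    }

    List<T> getItems() {
        // A fresh list: callers can modify it without touching the internal queue.
        return new ArrayList<>(items);
    }
}
```

Because both the constructor and getItems build new collections, adding to the stored queue does not change the caller's original queue, and clearing a returned list does not change the stored queue.
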

public String toString()

Returns a pretty-printed string representation of the HTML with the following requirements:

Each nested tag increases indentation by 2 spaces
Self-closing and content tags don’t affect indentation
Comment tags appear on their own line
Opening tags start a new line
Content appears after its opening tag
Closing tags appear on a new line if they close a tag with nested content

Example:

<div>
  <p>
    Hello
    <br/>
    <!-- comment -->
    <b>
      World
    </b>
  </p>
</div>
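
One way to think about the indentation rule is a depth counter that goes up after an opening tag and down before a closing tag. The sketch below is a simplification, not the required implementation: it uses plain strings instead of HTMLTag objects, the names IndentSketch and prettyPrint are hypothetical, and it assumes every token goes on its own line (which is enough to reproduce the example above but does not cover every rule in the list).

```java
import java.util.List;

class IndentSketch {
    // Hypothetical helper: prints one token per line, indented 2 spaces per open tag.
    static String prettyPrint(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        int depth = 0;
        for (String t : tokens) {
            boolean closing = t.startsWith("</");
            boolean comment = t.startsWith("<!--");
            boolean selfClosing = t.endsWith("/>");
            boolean opening = t.startsWith("<") && !closing && !comment && !selfClosing;
            if (closing && depth > 0) {
                depth--; // a closing tag lines up with its opening tag
            }
            for (int i = 0; i < depth; i++) {
                sb.append("  ");
            }
            sb.append(t).append("\n");
            if (opening) {
                depth++; // everything after an opening tag is nested one level deeper
            }
        }
        return sb.toString();
    }
}
```

Self-closing tags, comments, and content leave the counter untouched, which matches the rule that they don’t affect indentation.
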

The `fixHTML` Method

public void fixHTML()

Fixes invalid HTML tag nesting using a stack-based algorithm:

Create a Stack to track unclosed opening tags
Create a new Queue for the fixed output
Process each tag from the stored queue:
- For self-closing tags: add directly to output
- For comment/content tags: add directly to output (You may find isNotTag() helpful to identify these tags)
- For opening tags: add to output and push onto stack
- For closing tags:
  - If stack is empty: discard tag (no matching opener)
  - If matches top of stack: pop stack and add to output
  - If doesn’t match: add closing tags for items on stack until matching tag found, then process current tag
After all tags processed, add closing tags for any remaining items on stack

Examples:

Invalid Nesting Fixed:

  Input: [<b>, <i>, </b>, </i>]
  Output: [<b>, <i>, </i>, </b>]

  Input: [<p>, <b>, </p>]
  Output: [<p>, <b>, </b>, </p>]

  Input: [</b>, <i>]
  Output: [<i>, </i>] // Discarded </b> with no opener

Self-closing tags Preserved:

  Input: [<p>, <br/>, </p>]
  Output: [<p>, <br/>, </p>]
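
The steps and examples above can be traced with a simplified sketch that uses plain strings in place of HTMLTag objects. Everything here is an assumption made for illustration: the class NestingSketch and its helpers are hypothetical, tags are assumed to have no attributes, and a closing tag whose opener is nowhere on the stack is treated as closing everything open and then being discarded. With the provided class you would rely on methods such as isOpening(), matches(), and getMatching() instead of string checks.

```java
import java.util.LinkedList;
import java.util.Queue;
import java.util.Stack;

class NestingSketch {
    static boolean isClosing(String t) {
        return t.startsWith("</");
    }

    static boolean isOpening(String t) {
        return t.startsWith("<") && !t.startsWith("</")
                && !t.startsWith("<!--") && !t.endsWith("/>");
    }

    // Element name of an attribute-free tag, e.g. "<b>" or "</b>" gives "b".
    static String nameOf(String t) {
        return t.replaceAll("[</>\\s]", "");
    }

    static Queue<String> fix(Queue<String> tags) {
        Stack<String> open = new Stack<>();       // names of unclosed opening tags
        Queue<String> fixed = new LinkedList<>(); // corrected output
        for (String tag : tags) {
            if (isClosing(tag)) {
                String name = nameOf(tag);
                // Close inner tags that are still open until the match is on top.
                while (!open.isEmpty() && !open.peek().equals(name)) {
                    fixed.add("</" + open.pop() + ">");
                }
                if (!open.isEmpty()) {
                    open.pop();
                    fixed.add(tag);
                }
                // Otherwise the stack is empty: discard the closing tag.
            } else {
                fixed.add(tag); // opening, self-closing, comment, or content
                if (isOpening(tag)) {
                    open.push(nameOf(tag));
                }
            }
        }
        // Close whatever is still open, innermost first.
        while (!open.isEmpty()) {
            fixed.add("</" + open.pop() + ">");
        }
        return fixed;
    }
}
```

Running NestingSketch.fix on the four example inputs above produces the listed outputs. In the first example the final </i> is discarded because the stack is already empty when it is reached.
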

Task 1. Implement the HTMLManager class methods in the order explained in this guide.

Part 2: HTML Parser Implementation

The `HTMLParser` Class

You’ll work with the HTMLParser class that parses raw HTML content into HTMLTag objects. The class maintains an invariant that the current page content always starts with a potential HTML tag or text content - any already processed content has been removed. This means that any time your findCommentTag, findNormalTag or findContent functions are called, they should try to match the start of the current page with whatever type of HTMLTag you are constructing.

Reading the following section about the Regex pattern given to you will almost definitely be necessary in order for you to successfully complete this assignment.

In the HTMLParser class, you are given a regex Pattern called TAG_PATTERN, which is ^<\\s*(?<closing>/)?\\s*(?<tagData>[^>]*[^/> ])\\s*(?<selfclosing>/)?\\s*>. How this regex works will be discussed in lecture, and you can go to Prof. Blank’s OH for a deeper understanding if you’re curious! Below, we show some of the syntax necessary for using this Pattern effectively:

Matcher m = TAG_PATTERN.matcher(text); will create a Matcher which can be used to apply your Pattern to text, which is a String here.
m.find() will “search” for the pattern in the text (this must be done before the following operations). It returns false if no match is found.
m.start() will return the index at which the match begins.
m.group(groupName) will return the contents of the named group groupName in the match that was found. If it was an optional group and was not matched, this will return null. The named groups in the given regex are closing, tagData, and selfclosing.
m.group(0) will return the entire string that was matched.
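
The calls listed above can be tried out in isolation with a short sketch. The pattern string is copied from the text above; the class TagPatternDemo and the method describe are hypothetical names used only for this demonstration and are not part of the provided HTMLParser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagPatternDemo {
    // Same pattern text as quoted above, compiled locally for the demonstration.
    static final Pattern TAG_PATTERN = Pattern.compile(
        "^<\\s*(?<closing>/)?\\s*(?<tagData>[^>]*[^/> ])\\s*(?<selfclosing>/)?\\s*>");

    // Hypothetical helper: summarizes what each Matcher call reports,
    // or returns null when the text does not start with a tag.
    static String describe(String text) {
        Matcher m = TAG_PATTERN.matcher(text);
        if (!m.find()) {
            return null; // find() must succeed before start() or group() may be called
        }
        return "start=" + m.start()
            + " closing=" + m.group("closing")
            + " tagData=" + m.group("tagData")
            + " selfclosing=" + m.group("selfclosing")
            + " whole=" + m.group(0);
    }
}
```

For the input < / p >rest, this reports closing=/ and tagData=p, with selfclosing=null because that optional group did not participate in the match. For Hello</p>, describe returns null: the pattern is anchored with ^, so it only matches a tag at the very start of the text.
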

public HTMLTag findNormalTag()

Returns a new “normal” (opening, closing, or self-closing) HTMLTag corresponding to the first tag remaining in the page if there is one. If no valid tag is found at the beginning of the current page, returns null.

To do this, you should:

Extract the content of the tag and construct the HTMLTag with it as the element and the correct tag type
Update the prevTag field to the empty string for non-opening tags, and the element name for opening tags
Remove the matched tag from the current page content

Hints:

You should make sure to ignore all excess whitespace within tags, meaning whitespace between <, /, and >. For example, < / p > is just a closing tag with element “p”. This does not include other whitespace, such as the whitespace between attributes.
You should use getElement() to get the element name of a tag for the purposes of setting prevTag
You will find TAG_PATTERN helpful for this method.

Example:

// If page content is "<p class="main">Hello</p>"
findNormalTag() // Returns HTMLTag for element `p class="main"` and updates prevTag using getElement()
                // Page content becomes "Hello</p>"

public HTMLTag findCommentTag()

Returns a new “comment” HTMLTag corresponding to the first tag remaining in the page if there is one. If no valid tag is found at the beginning of the current page, returns null.

To do this, you should:

Extract the content of the comment tag and construct the HTMLTag with it as the element and the correct tag type
Update prevTag field to the empty string
Remove the entire comment tag, through its closing -->, from the current page content

Hints:

You should make sure to preserve the contents of the comment exactly as it is, removing only the opening <!-- and the closing -->
You may find writing your own regex pattern here helpful, but you do not have to.

Example:

// If page content is "<!-- header -->content"
findCommentTag() // Returns comment tag with " header " content
                 // Page content becomes "content"
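
A regex is one option here; plain string searching is another. The sketch below takes the second route and is only an illustration: the class CommentSketch, its page field, and the method takeComment are hypothetical, and it returns the comment text as a String rather than constructing an HTMLTag.

```java
class CommentSketch {
    String page; // hypothetical field: the text that has not been parsed yet

    CommentSketch(String page) {
        this.page = page;
    }

    // Returns the text between <!-- and -->, or null if the page does not
    // start with a complete comment. On success the comment is consumed.
    String takeComment() {
        if (!page.startsWith("<!--")) {
            return null;
        }
        int end = page.indexOf("-->", 4); // search after the opening delimiter
        if (end < 0) {
            return null; // unterminated comment: leave the page untouched
        }
        String body = page.substring(4, end); // kept exactly as written
        page = page.substring(end + 3);       // drop everything through -->
        return body;
    }
}
```

With the page content from the example above, takeComment returns " header " including both spaces, and the remaining page is "content".
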

public HTMLTag findContent()

Returns a new “content” HTMLTag for the non-tag text at the start of the current page, that is, the text between the innermost opening tag and its corresponding closing tag. If no valid closing tag is found, returns null.

To do this, you should:

If prevTag is a script tag, then search for the next closing script tag and extract all content up to it
Otherwise, extract content up to next closing tag
Construct the HTMLTag with the extracted content as the element and the correct tag type
Update prevTag field to the empty string
Remove the extracted content from the current page content

Hints:

You should make sure to preserve the contents of the content exactly as it is, up to the next closing tag or closing script tag as described above.
You may find writing your own regex pattern here helpful, but you do not have to.

Example:

// If page content is "Hello</br>"
findContent() // Returns content tag with "Hello"
              // Page content becomes "</br>"
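
The script special case is easier to see side by side with the general case. The following sketch is a literal reading of the steps above, under stated assumptions: ContentSketch, its fields, the two patterns, and takeContent are hypothetical names; the method returns the extracted text as a String rather than an HTMLTag; and the script comparison is case-sensitive.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentSketch {
    // Hypothetical patterns: the start of any closing tag, and a full closing script tag.
    static final Pattern ANY_CLOSING = Pattern.compile("<\\s*/");
    static final Pattern SCRIPT_CLOSING = Pattern.compile("<\\s*/\\s*script\\s*>");

    String page;    // hypothetical field: the text that has not been parsed yet
    String prevTag; // hypothetical field: element name of the last opening tag, or ""

    ContentSketch(String page, String prevTag) {
        this.page = page;
        this.prevTag = prevTag;
    }

    // Returns the text before the next relevant closing tag, or null if there is none.
    String takeContent() {
        Pattern stop = "script".equals(prevTag) ? SCRIPT_CLOSING : ANY_CLOSING;
        Matcher m = stop.matcher(page);
        if (!m.find()) {
            return null; // no closing tag: leave page and prevTag untouched
        }
        String content = page.substring(0, m.start()); // kept exactly as written
        page = page.substring(m.start());              // the closing tag stays in the page
        prevTag = "";
        return content;
    }
}
```

Inside a script, text such as a </ b would stop the general pattern too early, which is why the sketch switches to the stricter closing-script pattern when prevTag is "script".
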

Task 2. Implement the HTMLParser class methods in the order explained in this guide.
