CS 2 (Winter 2025) Project 03: HTML Management

This project focuses on implementing a class to manage and validate HTML tags using stacks and queues. You’ll write code to detect and fix invalid HTML tag nesting, similar to how web browsers handle malformed HTML.

What is HTML?

HTML (Hypertext Markup Language) uses tags to format text. A tag consists of a named element between less-than (<) and greater-than (>) symbols. For example:

<b>bold text</b>
<i>italic text</i>
<b><i>bold and italic</i></b>

There are four types of tags you’ll need to handle:

  1. Opening tags like <b> that start formatting
  2. Closing tags like </b> that end formatting
  3. Self-closing tags like <br/> or <img /> that don’t enclose text
  4. Comment tags like <!-- any text -->

Additionally, tags can have attributes like <img src="cat.jpg" width="100">.

The HTMLTag Class

You’ll be working with HTMLTag objects that have these key methods:

Note: The element parameter in the constructor refers to the contents inside the tag, excluding:

For example:

"<p>"  element is "p"
"</p>"  element is "p"
"<br />"  element is "br"
"<!-- comment -->"  element is "comment"
"<  p  >"  element is "p"

Part 1: HTML Manager Implementation

In this part you’ll create a class called HTMLManager that stores HTML tags and can fix invalid nesting. The class must use a Queue to store the tags internally.

HTML Tag Validation

Tags must follow certain rules to be valid:

Examples of invalid HTML:

<b><i>this is invalid</b></i> <!-- Wrong: i tag should close first -->
<b>unclosed tag <!-- Wrong: missing </b> -->
</p>no opening tag</p> <!-- Wrong: closing tag with no opener -->

Valid HTML:

<p>
This <b>word</b> is bold.
<br/> <!-- Self-closing tag is fine -->
<b><i>Both</i></b> formats. <!-- Proper nesting -->
</p>

Constructor and Basic Methods

public HTMLManager(Queue<HTMLTag> page)

Takes in a Queue of HTMLTags representing an HTML page. If the queue is null, an IllegalArgumentException should be thrown. Otherwise, this constructor creates and stores a deep copy of the queue page.


A deep copy means that the data structure is not a reference to the original one; instead, this constructor should make a new Queue and add all the original elements of page to it. The given queue page must be unchanged after this constructor.

public void add(HTMLTag tag)

Adds an HTMLTag to the end of the stored tags. If the tag is null, an IllegalArgumentException should be thrown.

public List<HTMLTag> getTags()

Returns a deeply copied List of all stored tags in order.


A deep copy means that the List returned must be a new one that has the same contents as your internal field (rather than a reference to the internal field itself), to avoid exposing internal state.
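
The constructor, add, and getTags contracts above can be sketched as follows. This is only an illustration under stated assumptions: ManagerSketch is a hypothetical generic stand-in for HTMLManager (so the code compiles without the provided HTMLTag class), and LinkedList is just one possible Queue implementation.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for HTMLManager; T plays the role of HTMLTag.
class ManagerSketch<T> {
    private final Queue<T> tags;

    ManagerSketch(Queue<T> page) {
        if (page == null) {
            throw new IllegalArgumentException("page must not be null");
        }
        // The copy constructor iterates over page without removing anything,
        // so the caller's queue is left unchanged.
        this.tags = new LinkedList<>(page);
    }

    void add(T tag) {
        if (tag == null) {
            throw new IllegalArgumentException("tag must not be null");
        }
        tags.add(tag); // Queue.add appends at the tail
    }

    List<T> getTags() {
        // A fresh list: callers can mutate it without touching the field.
        return new ArrayList<>(tags);
    }
}
```

With this sketch, clearing the list returned by getTags() or adding to the queue originally passed to the constructor has no effect on the stored tags.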

public String toString()

Returns a pretty-printed string representation of the HTML with the following requirements:

  • Each nested tag increases indentation by 2 spaces
  • Self-closing and content tags don’t affect indentation
  • Comment tags appear on their own line
  • Opening tags start a new line
  • Content appears after its opening tag
  • Closing tags appear on a new line if they close a tag with nested content

Example:

<div>
  <p>
    Hello
    <br/>
    <!-- comment -->
    <b>
      World
    </b>
  </p>
</div>
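
The indentation rules can be sketched roughly as below. This is a simplified illustration, not the required implementation: it places every tag on its own line, which reproduces the example above but does not model the rules about when content or closing tags share a line. The Kind enum, the Tag record, and the render method are hypothetical stand-ins, since the real HTMLTag methods are not shown here.

```java
import java.util.List;

class PrettyPrintSketch {
    enum Kind { OPEN, CLOSE, SELF_CLOSING, COMMENT, CONTENT }

    // Hypothetical stand-in for HTMLTag: a kind plus the text to print.
    record Tag(Kind kind, String text) {}

    static String render(List<Tag> tags) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        for (Tag tag : tags) {
            if (tag.kind() == Kind.CLOSE) {
                // Dedent before printing so the closer lines up with its opener.
                depth = Math.max(0, depth - 1);
            }
            out.append("  ".repeat(depth)).append(tag.text()).append('\n');
            if (tag.kind() == Kind.OPEN) {
                depth++; // everything after an opener is nested one level deeper
            }
        }
        return out.toString();
    }
}
```

Only opening and closing tags change the depth counter; self-closing, comment, and content tags are printed at the current depth.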

The fixHTML Method

public void fixHTML()

Fixes invalid HTML tag nesting using a stack-based algorithm:

  1. Create a Stack to track unclosed opening tags
  2. Create a new Queue for the fixed output
  3. Process each tag from the stored queue:
    • For self-closing tags: add directly to output
    • For comment/content tags: add directly to output (You may find isNotTag() helpful to identify these tags)
    • For opening tags: add to output and push onto stack
    • For closing tags:
      • If the stack is empty: discard the tag (no matching opener)
      • If it matches the top of the stack: pop the stack and add the tag to output
      • If it doesn’t match: pop items off the stack, adding a closing tag for each one to output, until the matching opening tag is on top; then process the current tag again
  4. After all tags are processed, add closing tags for any remaining items on the stack
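
The steps above can be sketched as follows. This is one possible reading rather than the reference solution: the Kind enum and Tag record are hypothetical stand-ins for HTMLTag, and the sketch returns a new queue instead of updating a field. It also assumes that a closing tag whose opener is nowhere on the stack closes everything on the stack and is then discarded, a case the steps do not spell out.

```java
import java.util.LinkedList;
import java.util.Queue;
import java.util.Stack;

class FixSketch {
    enum Kind { OPEN, CLOSE, SELF_CLOSING, NOT_TAG }

    // Hypothetical stand-in for HTMLTag: a kind plus an element name.
    record Tag(Kind kind, String element) {}

    static Queue<Tag> fix(Queue<Tag> tags) {
        Stack<Tag> unclosed = new Stack<>();      // opening tags not yet closed
        Queue<Tag> output = new LinkedList<>();   // the repaired sequence
        for (Tag tag : tags) {
            switch (tag.kind()) {
                case SELF_CLOSING, NOT_TAG -> output.add(tag);
                case OPEN -> {
                    output.add(tag);
                    unclosed.push(tag);
                }
                case CLOSE -> {
                    // Close every unclosed tag sitting above the matching opener.
                    while (!unclosed.isEmpty()
                            && !unclosed.peek().element().equals(tag.element())) {
                        output.add(new Tag(Kind.CLOSE, unclosed.pop().element()));
                    }
                    if (!unclosed.isEmpty()) {
                        unclosed.pop();   // matching opener found
                        output.add(tag);
                    }
                    // Otherwise the stack is empty: discard the closing tag.
                }
            }
        }
        // Close whatever is still open at the end of the page.
        while (!unclosed.isEmpty()) {
            output.add(new Tag(Kind.CLOSE, unclosed.pop().element()));
        }
        return output;
    }
}
```

Because unmatched tags are closed in the order they are popped, the most recently opened tag is always closed first, which is what keeps the output properly nested.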

Examples:

  • Invalid Nesting Fixed:
    • Input: [<b>, <i>, </b>, </i>]
      Output: [<b>, <i>, </i>, </b>]
    • Input: [<p>, <b>, </p>]
      Output: [<p>, <b>, </b>, </p>]
    • Input: [</b>, <i>]
      Output: [<i>, </i>] // Discarded </b> with no opener
  • Self-closing tags Preserved:
    • Input: [<p>, <br/>, </p>]
      Output: [<p>, <br/>, </p>]

Part 2: HTML Parser Implementation

The HTMLParser Class

You’ll work with the HTMLParser class, which parses raw HTML content into HTMLTag objects. The class maintains an invariant: the current page content always starts with a potential HTML tag or text content, because any already-processed content has been removed. This means that whenever your findCommentTag, findNormalTag, or findContent methods are called, they should try to match the start of the current page against whatever type of HTMLTag you are constructing.

In the HTMLParser class, you are given a regex Pattern called TAG_PATTERN, which is ^<\\s*(?<closing>/)?\\s*(?<tagData>[^>]*[^/> ])\\s*(?<selfclosing>/)?\\s*> (written here as it appears in a Java string literal, so each \\ is a single backslash in the regex itself). How this regex works will be discussed in lecture, and you can go to Prof. Blank’s OH for a deeper understanding if you’re curious! Below, we will show some of the syntax necessary for effectively using this Pattern:
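
As a minimal sketch of the relevant java.util.regex syntax, the class below compiles the same regex and reads its named groups. TagPatternDemo and describe are hypothetical names used only for illustration, and the sketch assumes the pattern is compiled without extra flags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagPatternDemo {
    // Same regex as TAG_PATTERN; the doubled backslashes are Java string escapes.
    static final Pattern TAG_PATTERN = Pattern.compile(
        "^<\\s*(?<closing>/)?\\s*(?<tagData>[^>]*[^/> ])\\s*(?<selfclosing>/)?\\s*>");

    // Describes the tag at the start of page, or returns null if there is none.
    static String describe(String page) {
        Matcher m = TAG_PATTERN.matcher(page);
        if (!m.find()) {
            return null; // the ^ anchor means only the very start of page can match
        }
        // An optional named group that did not take part in the match is null.
        String kind = m.group("closing") != null ? "closing"
                    : m.group("selfclosing") != null ? "self-closing"
                    : "opening";
        // m.end() is the index just past '>', so substring(m.end()) is the rest.
        return kind + " [" + m.group("tagData") + "] rest=[" + page.substring(m.end()) + "]";
    }
}
```

For instance, describe("< / p >") yields closing [p] rest=[], describe("<br />x") yields self-closing [br] rest=[x], and describe("Hello<br>") yields null because the page does not start with a tag.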

public HTMLTag findNormalTag()

Returns a new “normal” (opening, closing, or self-closing) HTMLTag corresponding to the first tag remaining in the page if there is one. If no valid tag is found at the beginning of the current page, returns null.


To do this, you should:

  • Extract the content of the tag and construct the HTMLTag with it as the element and the correct tag type
  • Update the prevTag field to the empty string for non-opening tags, and the element name for opening tags
  • Remove the matched tag from the current page content

Hints:

  1. You should make sure to ignore all excess whitespace within tags, meaning whitespace next to <, /, and >. For example, < / p > is just a closing tag with element “p”. Note that this does not include other whitespace inside the tag, such as the spaces between attributes.
  2. You should use getElement() to get the element name of a tag for the purposes of setting prevTag
  3. You will find TAG_PATTERN helpful for this method.

Example:

// If page content is "<p class="main">Hello</p>"
findNormalTag() // Returns HTMLTag for element `p class="main"` and updates prevTag using getElement()
                // Page content becomes "Hello</p>"
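
The three steps for findNormalTag might fit together as in the sketch below. It rests on several assumptions: NormalTagSketch, Kind, and Tag are hypothetical stand-ins for HTMLParser and HTMLTag, the page and prevTag fields are modelled as plain Strings, and getElement() is assumed to return the tag name that precedes any attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NormalTagSketch {
    enum Kind { OPENING, CLOSING, SELF_CLOSING }

    // Hypothetical stand-in for HTMLTag.
    record Tag(String contents, Kind kind) {
        // Assumed behaviour of getElement(): the name before any attributes.
        String getElement() {
            return contents.split("\\s+")[0];
        }
    }

    static final Pattern TAG_PATTERN = Pattern.compile(
        "^<\\s*(?<closing>/)?\\s*(?<tagData>[^>]*[^/> ])\\s*(?<selfclosing>/)?\\s*>");

    String page;          // unprocessed remainder of the page
    String prevTag = "";  // element name of the last opening tag, else ""

    NormalTagSketch(String page) {
        this.page = page;
    }

    Tag findNormalTag() {
        Matcher m = TAG_PATTERN.matcher(page);
        if (!m.find()) {
            return null; // no tag at the start; page and prevTag stay untouched
        }
        Kind kind = m.group("closing") != null ? Kind.CLOSING
                  : m.group("selfclosing") != null ? Kind.SELF_CLOSING
                  : Kind.OPENING;
        Tag tag = new Tag(m.group("tagData"), kind);
        prevTag = (kind == Kind.OPENING) ? tag.getElement() : "";
        page = page.substring(m.end()); // drop the matched tag from the page
        return tag;
    }
}
```

Run on the example above, the sketch returns a tag whose contents are p class="main", sets prevTag to p, and leaves Hello</p> as the page.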

public HTMLTag findCommentTag()

Returns a new “comment” HTMLTag corresponding to the first tag remaining in the page if there is one. If no valid tag is found at the beginning of the current page, returns null.


To do this, you should:

  • Extract the content of the comment tag and construct the HTMLTag with it as the element and the correct tag type
  • Update prevTag field to the empty string
  • Remove the matched comment tag, through its closing -->, from the current page content

Hints:

  1. You should make sure to preserve the contents of the comment exactly as it is, removing only the opening <!-- and closing -->
  2. You may find writing your own regex pattern here helpful, but you do not have to.

Example:

// If page content is "<!-- header -->content"
findCommentTag() // Returns comment tag with " header " content
                 // Page content becomes "content"
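
One way to approach findCommentTag with a regex of your own is sketched below. The COMMENT pattern, the CommentSketch class, and the findCommentBody method are all hypothetical, and the method returns the comment body as a String in place of constructing a real HTMLTag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentSketch {
    // Hypothetical pattern: lazily capture everything between "<!--" and the
    // first "-->". DOTALL lets the comment body span several lines.
    static final Pattern COMMENT =
        Pattern.compile("^<!--(?<body>.*?)-->", Pattern.DOTALL);

    String page;
    String prevTag;

    CommentSketch(String page, String prevTag) {
        this.page = page;
        this.prevTag = prevTag;
    }

    // Returns the comment body exactly as written, or null if the page
    // does not start with a complete comment.
    String findCommentBody() {
        Matcher m = COMMENT.matcher(page);
        if (!m.find()) {
            return null;
        }
        String body = m.group("body"); // surrounding spaces are preserved
        prevTag = "";
        page = page.substring(m.end()); // remove the whole comment
        return body;
    }
}
```

The lazy quantifier .*? matters here: a greedy .* would run to the last --> on the page and swallow any comments and tags in between.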

public HTMLTag findContent()

Returns a new “content” HTMLTag corresponding to the non-tag text at the start of the remaining page, that is, the text between the innermost opening tag and its corresponding closing tag, if there is any. If no valid closing tag is found, returns null.


To do this, you should:

  • If prevTag is a script tag, then search for the next closing script tag and extract all content up to it
  • Otherwise, extract content up to next closing tag

  • Construct the HTMLTag with the extracted content as the element and the correct tag type
  • Update prevTag field to the empty string
  • Remove the extracted content from the current page content

Hints:

  1. You should make sure to preserve the contents of the content exactly as it is, up to the next closing tag or closing script tag as described above.
  2. You may find writing your own regex pattern here helpful, but you do not have to.

Example:

// If page content is "Hello<br>"
findContent() // Returns content tag with "Hello"
              // Page content becomes "<br>"
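
A rough sketch of findContent without regex follows, under explicit assumptions: ContentSketch and findContentText are hypothetical names, the method returns the extracted text in place of a real HTMLTag, ordinary content is taken to end at the next < (which matches the example above), and script content is taken to end at the literal text </script, ignoring any extra whitespace inside the closing tag.

```java
class ContentSketch {
    String page;
    String prevTag;

    ContentSketch(String page, String prevTag) {
        this.page = page;
        this.prevTag = prevTag;
    }

    // Returns the leading text content, or null if there is nothing to extract.
    String findContentText() {
        int end;
        if (prevTag.equalsIgnoreCase("script")) {
            // Inside a script, '<' may be ordinary code, so look for the closer.
            end = page.indexOf("</script");
        } else {
            end = page.indexOf('<');
        }
        if (end <= 0) {
            return null; // no terminating tag, or the page starts with one
        }
        String content = page.substring(0, end); // preserved exactly as written
        prevTag = "";
        page = page.substring(end); // the terminating tag stays on the page
        return content;
    }
}
```

Treating an empty match as null is a design choice of this sketch: returning an empty content tag would leave the page unchanged, so a caller looping over the page could fail to make progress.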