CS 2 (Winter 2026) Project04: JSON and Webcrawler

The first part of this project focuses on implementing a class to parse and dump (i.e. read and write) JSON using recursion.

What is JSON?

JSON (JavaScript Object Notation) is a text-based format used to transmit data. It is widely used because it is easy for both humans and machines to read.

There are three main JSON types you’ll need to handle: Objects, Lists, and Values. Values come in five varieties: four data types (Booleans, Doubles, Integers, and Strings) plus a special null value. We have already given you classes, named appropriately, that you should use to store each of these types.

Each class implements the JSON interface, so they inherit the method dump and the static method parse, both of which you will be writing in this project.

JSON syntax

JSON values are written as the value itself. If the value is a string, it must also be surrounded by double quotes, and any special characters must be escaped (for example, a newline is written as the two-character sequence “\n”). Because there are many such characters, we have given you two methods to easily convert to and from these formats, which will be discussed later in the guide.

JSON Lists are denoted by brackets that contain some number of JSON types separated by commas. Example: [1,2,3,4,5]

JSON Objects are denoted by curly brackets that contain some number of members separated by commas. Each member consists of a string, formatted in the same way we would a JSONStringValue, followed by a colon followed by a JSON type. Example: {"name":"Caltech","city":"Pasadena","state":"California","zip code":91125}

Whitespace is often used to make larger JSON files more easily readable for humans, and thus is allowed between any pieces of the above syntax. Such whitespace should be ignored when parsing.

Helper Methods and ParsableString Class

We’ve provided you with a JSONHelpers interface, which has the following methods:

We’ve also provided you with a ParsableString class. This will allow you to easily keep track of what index has been parsed up to so far. All methods to alter this index will also automatically account for whitespace for you. This class has the below methods:

Part 1: JSON Implementation

In this part, you will implement two major methods in the JSON interface that can dump and parse JSON elements respectively. Both of these methods must use recursion to accomplish their goals.

In last week’s project you wrote a toString method for HTMLManager and several helper methods for HTMLParser, each of which used stacks and queues. This time, to create JSON’s dump and parse methods you will instead use recursion.

Both dump and parse MUST be written recursively when indicated to receive any credit.

public String dump(int numIndents)

Returns a pretty-print string representation of the JSON input. More specific requirements are included in the boxes for Task 1.1 and Task 1.2.


You should recurse on each JSON element to decide what formatting should be used when representing it as a string. Because dump is an instance method, we’ve already given you headers for it in each JSON class. Note that the dump methods you implement take a single parameter, numIndents. Because of inheritance, calling json.dump(numIndents) on any instance of a JSON class will call that specific class’s implementation of dump.
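To illustrate the inheritance dispatch and the recursive structure, here is a minimal sketch using simplified stand-in classes (MiniJSON, MiniString, and MiniList are illustrative names only; the real project classes have their own names, fields, and requirements):

```java
import java.util.List;

// Stand-in for the project's JSON interface.
interface MiniJSON {
    String dump(int numIndents);
}

class MiniString implements MiniJSON {
    private final String value;
    MiniString(String value) { this.value = value; }
    // Base case: a string value dumps as itself, quoted.
    public String dump(int numIndents) { return "\"" + value + "\""; }
}

class MiniList implements MiniJSON {
    private final List<MiniJSON> elements;
    MiniList(List<MiniJSON> elements) { this.elements = elements; }
    // Recursive case: dump each element with one extra indent level.
    public String dump(int numIndents) {
        StringBuilder sb = new StringBuilder("[\n");
        String indent = "  ".repeat(numIndents + 1);
        for (int i = 0; i < elements.size(); i++) {
            // Inheritance picks the right dump for each element's actual type.
            sb.append(indent).append(elements.get(i).dump(numIndents + 1));
            if (i < elements.size() - 1) sb.append(",");
            sb.append("\n");
        }
        sb.append("  ".repeat(numIndents)).append("]");
        return sb.toString();
    }
}

public class DumpSketch {
    public static void main(String[] args) {
        MiniJSON json = new MiniList(List.of(new MiniString("a"), new MiniString("b")));
        System.out.println(json.dump(0));
    }
}
```

The key idea is that MiniList.dump never inspects its elements’ types: it simply calls dump on each one and lets dynamic dispatch do the rest.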


Example: A JSONList containing JSONStringValue “Pasadena” and a JSONObject mapping “metro” to the JSONDoubleValue 5.0 and “car” to the JSONDoubleValue 2.5 would be dumped as:

[
  "Pasadena",
  {
    "car": 2.5,
    "metro": 5
  }
]

public static JSON parse(String inputStr)

Takes in a String representation of a JSON file, and returns the outermost JSON object. You can think of parse() as the “reverse operation” of dump(): instead of taking Java objects and turning them into JSON text, we want to turn JSON text into Java objects. If the input is not a valid JSON file, throws a JSONParsingException. The index you pass in when creating/throwing the exception does not matter (it will not be tested).

If the input is valid, this method returns an object of type JSON, the outermost JSON instance in the input.


Unlike dump, there is one single parse method in the JSON class for you to implement.

You should recurse on each JSON element contained in the input string to decide what type is being read and how it should be decoded by your program. Your recursion should happen in order, similar to HTMLParser; that is, when using ParsableString you should never need to move the index backwards.

You should write helper functions to accomplish this - several headers have already been provided for you.

Hint: You should return an instance of a JSON class in each recursive step.
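To make the shape of this recursion concrete, here is a self-contained sketch of a recursive-descent parser. It handles only integers and nested lists of integers, and uses a plain index field in place of the project’s ParsableString (whose API differs), but it shows how each recursive step returns a parsed element while a single forward-moving index advances through the input:

```java
import java.util.ArrayList;
import java.util.List;

public class ParseSketch {
    private final String input;
    private int index = 0; // next unparsed character; only ever moves forward

    public ParseSketch(String input) { this.input = input; }

    private void skipWhitespace() {
        while (index < input.length() && Character.isWhitespace(input.charAt(index))) index++;
    }

    // Dispatch on the next character: '[' starts a list (recursive case),
    // a digit starts an integer (base case).
    public Object parseElement() {
        skipWhitespace();
        char c = input.charAt(index);
        if (c == '[') return parseList();
        return parseInt();
    }

    private List<Object> parseList() {
        index++; // consume '['
        List<Object> result = new ArrayList<>();
        skipWhitespace();
        if (input.charAt(index) == ']') { index++; return result; }
        while (true) {
            result.add(parseElement()); // recurse: element may itself be a list
            skipWhitespace();
            char c = input.charAt(index++);
            if (c == ']') return result;
            if (c != ',') throw new IllegalArgumentException("expected ',' or ']' at " + (index - 1));
        }
    }

    private Integer parseInt() {
        int start = index;
        while (index < input.length() && Character.isDigit(input.charAt(index))) index++;
        return Integer.parseInt(input.substring(start, index));
    }

    public static void main(String[] args) {
        System.out.println(new ParseSketch("[1, [2, 3], 4]").parseElement());
    }
}
```

Your real parse must of course handle all the JSON types and throw JSONParsingException on malformed input, but the dispatch-then-recurse structure is the same.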

If you’re stuck on some failing test cases, below are a couple of hints about potential edge cases for parse().

I’m using .indexOf() to find the end of the element I’m parsing, but it’s not grabbing the correct element?

Using .indexOf() to find the end of the current element will often return an incorrect index if objects of the same type are nested within the current object. For example, if you used .indexOf(']') to find the end of a JSONList, then the list

[
    "a", 
    [
        "cat",
        "dog"
    ],
    "b"
]

will cause .indexOf() to find the inner closing bracket before the outer one, missing the element "b" at the end of the list. Instead, recursively parse each element of the list (or Object) one at a time until you reach the end of the list (or Object); such cases are then handled automatically, provided you implement your recursion (and your error handling for malformed input) carefully. Note that trying to find the end index of an item in a List (or of a key or value in an Object) using indexOf runs into the same problems described above, and parsing each element recursively addresses this as well.
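A two-line demonstration of the pitfall, using plain Java strings:

```java
// Shows why indexOf finds the wrong bracket when lists are nested.
public class IndexOfPitfall {
    public static void main(String[] args) {
        String nested = "[\"a\", [\"cat\", \"dog\"], \"b\"]";
        int close = nested.indexOf(']');
        // indexOf returns the FIRST ']', which closes the INNER list, so
        // everything after it ("b" and the real closing bracket) is lost.
        System.out.println(nested.substring(0, close + 1));
        // prints: ["a", ["cat", "dog"]
    }
}
```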

The tests for parsing too many ]’s and }’s are failing, but my code finds an equal number of opening and closing braces?

The failure of this test often indicates that your parse() method is not checking that the string is fully parsed before returning the JSON value. For example, parsing the JSONList

[
    "cat"
]]

should cause a JSONParsingException to be thrown. However, if you have not checked that the string is fully parsed before returning from parse(), your code might return a JSONList containing "cat" instead of detecting that the string is not valid JSON.
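The check itself is simple; here is a sketch assuming you can ask your cursor how far parsing has advanced (the project’s ParsableString plays this role with its own API):

```java
public class TrailingInputCheck {
    // Returns true iff only whitespace remains after position 'pos'.
    public static boolean fullyParsed(String input, int pos) {
        return input.substring(pos).isBlank();
    }

    public static void main(String[] args) {
        String good = "[\"cat\"]";
        String bad  = "[\"cat\"]]";
        // Suppose the outermost list was parsed up through its closing ']',
        // i.e. through index 6 in both strings:
        int parsedUpTo = 7;
        System.out.println(fullyParsed(good, parsedUpTo)); // true: nothing left
        System.out.println(fullyParsed(bad, parsedUpTo));  // false: stray ']'
    }
}
```

In the false case, parse() should throw a JSONParsingException rather than return the already-parsed value.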

Part 2: WebCrawler

What is a WebCrawler?

A web crawler is a program designed to browse the internet and index webpages, traveling from page to page using the links on each page. Our web crawler will also scrape these webpages for their textual content, which will be useful in next week’s project, where you will construct a search engine.

Implementation

public static JSONObject crawl(String link, int maxDepth)

Given a starting link, visits all webpages of distance at most maxDepth - 1 away. To do this, you should define a recursive method that, given a link, parses that link’s HTML and recursively visits the links it contains, building the nested object that is returned, as described below.

  • This method returns a JSONObject which is constructed as follows. For each visited link:
    • Stores it as a key in the JSONObject. The value of that key should be another JSONObject. This inner object should contain:
      • A key labeled “content”, with the value equal to the content of the link/page (see below for definition of content).
      • A key labeled “links”, with the value being another JSONObject. The key, value pairs for this innermost object should be as follows:
        • Each key is a link found on this webpage, and the corresponding value is the number of occurrences of this link. There is an example below of what such a JSONObject would look like that may be helpful!
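The recursive shape of crawl can be sketched with plain Maps standing in for the project’s JSONObject, and a hypothetical fetchLinks method in place of real HTML fetching and parsing (the inner "content"/"links" structure is omitted here to keep the sketch short):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CrawlSketch {
    // Hypothetical stand-in: the real project downloads and parses HTML.
    static List<String> fetchLinks(String url) {
        Map<String, List<String>> fakeWeb = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("a"),
            "c", List.of());
        return fakeWeb.getOrDefault(url, List.of());
    }

    // Maps each visited page to a map of (link found on that page -> count).
    static void crawl(String url, int depth, Map<String, Map<String, Integer>> result) {
        if (depth <= 0 || result.containsKey(url)) return; // too deep, or already visited
        Map<String, Integer> linkCounts = new HashMap<>();
        for (String link : fetchLinks(url)) {
            if (!link.equals(url)) {                 // ignore self-links
                linkCounts.merge(link, 1, Integer::sum); // count occurrences
            }
        }
        result.put(url, linkCounts);
        for (String link : linkCounts.keySet()) {
            crawl(link, depth - 1, result);          // recurse one level deeper
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> db = new HashMap<>();
        crawl("a", 2, db);
        System.out.println(db.keySet()); // "a" plus the pages one hop away
    }
}
```

Note how the depth parameter decreases by one on each recursive call, so pages more than maxDepth - 1 hops from the start are never visited, and how checking the result map first prevents infinite loops on link cycles.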

Note that there are 3 main types of links you should look out for:

  • Absolute URLs - an entire link. These always start with “https://”. Example: “https://wikipedia.org/wiki/Horse”
  • Relative URLs - a link that points to a file within a website. These always start with a “/”. Example: If you are on https://wikipedia.org/wiki/Horse, a relative URL of “/wiki/Duck” would reroute you to https://wikipedia.org/wiki/Duck
  • Folder-Relative URLs - a link that points to a file relative to the current page’s folder. These can start with anything else. Example: If you are on https://wikipedia.org/wiki/Horse, a folder-relative URL of “Duck” would reroute you to https://wikipedia.org/wiki/Duck.

Each of these URL types should be converted to Absolute URL format before you put it in the JSONObject that is returned.
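The conversion can be sketched as follows; the method name and the exact string handling here are illustrative, not the project’s required helpers:

```java
public class UrlResolve {
    // Converts a link found on 'currentPage' to absolute form.
    public static String toAbsolute(String currentPage, String link) {
        if (link.startsWith("https://")) {
            return link; // already absolute
        }
        // Site root: everything up to the first '/' after "https://host"
        int hostEnd = currentPage.indexOf('/', "https://".length());
        String root = (hostEnd == -1) ? currentPage : currentPage.substring(0, hostEnd);
        if (link.startsWith("/")) {
            return root + link; // relative URL: resolve against the site root
        }
        // Folder-relative URL: resolve against the current page's folder
        String folder = currentPage.substring(0, currentPage.lastIndexOf('/') + 1);
        return folder + link;
    }

    public static void main(String[] args) {
        String page = "https://wikipedia.org/wiki/Horse";
        System.out.println(toAbsolute(page, "/wiki/Duck")); // https://wikipedia.org/wiki/Duck
        System.out.println(toAbsolute(page, "Duck"));       // https://wikipedia.org/wiki/Duck
    }
}
```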


The value of “content” for a page should be constructed using the following rules:

  • It should contain the values of all CONTENT type tags concatenated together, with whitespace in between.
  • Pre-existing whitespace in the CONTENT tags’ values should not be trimmed.
  • CONTENT tags which are inside of <script> or <style> blocks should be ignored, and not included in the “content”.
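These three rules can be sketched with plain strings standing in for the project’s tag objects (the real implementation works with HTMLTag instances, not raw strings):

```java
import java.util.List;

public class ContentSketch {
    // Joins text pieces with spaces, skipping anything inside <script>/<style>.
    public static String content(List<String> tags) {
        StringBuilder sb = new StringBuilder();
        int ignoreDepth = 0; // > 0 while inside a <script> or <style> block
        for (String tag : tags) {
            if (tag.equals("<script>") || tag.equals("<style>")) ignoreDepth++;
            else if (tag.equals("</script>") || tag.equals("</style>")) ignoreDepth--;
            else if (!tag.startsWith("<") && ignoreDepth == 0) {
                if (sb.length() > 0) sb.append(" ");
                sb.append(tag); // keep the text exactly as-is (no trimming)
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(content(List.of(
            "hello", "<script>", "var x = 1;", "</script>", "world")));
        // prints: hello world
    }
}
```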

Links should be handled according to the following rules:

  • Self-links (links in a page that refer to the page itself) should be ignored.
  • Only links stored inside tags of the form <a href="link"> should be handled.
  • We strongly recommend you use the provided HREF regular expression to detect the links inside of <a> tags.
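Extracting the link from an <a> tag with a regular expression looks like this; note that the pattern below is a simplified stand-in, and you should use the HREF pattern the project actually provides:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefSketch {
    // Simplified stand-in: captures whatever sits between href=" and the
    // closing quote as group 1.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    public static String extractLink(String aTag) {
        Matcher m = HREF.matcher(aTag);
        return m.find() ? m.group(1) : null; // group 1 is the captured link
    }

    public static void main(String[] args) {
        System.out.println(extractLink("<a href=\"https://wikipedia.org/wiki/Duck\">"));
        // prints: https://wikipedia.org/wiki/Duck
    }
}
```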

Example:

We are given an input of https://Wikipedia.org/wiki/Chicken with maxDepth 2.

https://Wikipedia.org/wiki/Chicken has text “Gallus gallus domesticus” and links to both https://Wikipedia.org/wiki/Pet and https://Wikipedia.org/wiki/Hemoglobin. https://Wikipedia.org/wiki/Pet has text “companion animal” and links to https://Wikipedia.org/wiki/Chicken and https://Wikipedia.org/wiki/Koi. https://Wikipedia.org/wiki/Hemoglobin has text “blood” and a link to https://Wikipedia.org/wiki/Hyperbolic_functions.

If these are the only links we found on the webpages we searched, our output would be:

{
  "https://Wikipedia.org/wiki/Chicken":
  {
    "content": "Gallus gallus domesticus",
    "links":
    {
      "https://Wikipedia.org/wiki/Pet": 1,
      "https://Wikipedia.org/wiki/Hemoglobin": 1
    }
  },
  "https://Wikipedia.org/wiki/Pet": 
  {
    "content": "companion animal",
    "links":
    {
      "https://Wikipedia.org/wiki/Chicken": 1,
      "https://Wikipedia.org/wiki/Koi": 1
    }
  },
  "https://Wikipedia.org/wiki/Hemoglobin":
  {
    "content": "blood",
    "links":
    {
      "https://Wikipedia.org/wiki/Hyperbolic_functions": 1
    }
  }
}

Note that we don’t explore the links to the Wikipedia pages “Koi” or “Hyperbolic_functions” because they are greater than maxDepth - 1 = 1 pages away from our starting Wikipedia page of “Chicken”.

Helper Methods

Additionally, we have given you two more helper methods in WebCrawler:

These methods will be necessary for your implementation of crawl: after obtaining a Queue of tags, you can process the Queue to find the tags that correspond to links. For instance, if a link to webpage B appears in webpage A, then once you get a Queue of webpage A’s tags, the tag <a href="link to B"> will be in that Queue. Note that the actual link to B is a substring of the tag <a href="link to B">, which you can extract using the HREF regular expression.
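Draining such a queue and collecting the links might look like the following sketch; the queue-of-strings representation and the pattern are stand-ins for the project’s own tag objects and HREF constant:

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagQueueSketch {
    // Simplified stand-in for the project's HREF regular expression.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Removes every tag from the queue, keeping the links found in <a> tags.
    public static List<String> linksIn(Queue<String> tags) {
        List<String> links = new ArrayList<>();
        while (!tags.isEmpty()) {
            Matcher m = HREF.matcher(tags.remove());
            if (m.find()) links.add(m.group(1)); // extract the captured link
        }
        return links;
    }

    public static void main(String[] args) {
        Queue<String> tags = new LinkedList<>(List.of(
            "<p>", "<a href=\"/wiki/Duck\">", "text", "<a href=\"Horse\">"));
        System.out.println(linksIn(tags)); // [/wiki/Duck, Horse]
    }
}
```

Remember that each extracted link must still be converted to absolute form before being counted in the returned JSONObject.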

Result

Congratulations! You’ve successfully created a database of links and webpages using your webcrawler and your own implementation of JSON. If you look at the tests, you will see that we dump out the databases you’ve created (using your dump() method). Next week, you will see what we can do with these dumped-out databases.