You cannot use ternary operators in this course. Doing so will result in receiving no credit for the project.
Project 04: JSON and Webcrawler
The first part of this project focuses on implementing a class to parse and dump (i.e. read and write) JSON using recursion.
You don’t need JSON knowledge to complete this assignment. The focus is on implementing recursion.
There are no Gitlab tests for this project, only local tests. We will rerun the local tests after you submit and verify for correctness.
The repository might take a while to clone! This is perfectly normal, and is because we have a large amount of data to work with in the tests.
What is JSON?
JSON (JavaScript Object Notation) is a text-based format used to transmit data. It’s frequently used due to its easy-to-read nature for both humans and machines. For more information and a visualization of what a JSON object looks like, you can check out this link.
There are three main JSON types you’ll need to handle: Objects, Lists, and Values. Values come in five varieties: four data types (Booleans, Doubles, Integers, and Strings), along with a special null value. We have already given you code for classes that you should use to store each of these, named appropriately.
Each class implements the JSON interface, so they inherit the method dump and the static method parse, both of which you will be writing in this project.
JSON syntax
JSON values are written as the value itself. If the value is a string, the string must also be surrounded by quotation marks, and any special characters must be escaped, i.e. physically written out (for example, a newline is written out as the two-character sequence \n). Because there are so many of these characters, we have given you two methods to easily convert to and from these different formats, which will be discussed later in the guide.
JSON Lists are denoted by brackets that contain some number of JSON types separated by commas.
Example:
[1,2,3,4,5]
JSON Objects are denoted by curly brackets that contain some number of members separated by commas.
Each member consists of a string, formatted in the same way we would a JSONStringValue, followed by a colon followed by a JSON type.
Example:
{"name":"Caltech","city":"Pasadena","state":"California","zip code":91125}
Whitespace is often used to make larger JSON files more easily readable for humans, and thus is allowed between any pieces of the above syntax. Such whitespace should be ignored when parsing.
Helper Methods and ParsableString Class
We’ve provided you with a JSONHelpers interface, which has the following methods:
- toFormattedString(String s) - Given a String, returns it as formatted for a JSON file.
- tryParseInteger(String s) - Given a String, returns the parsed integer if it’s an integer and null otherwise.
- tryParseDouble(String s) - Given a String, returns the parsed double if the provided value can be parsed as a double (this includes integers) and null otherwise.
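To make the behavior of the number helpers concrete, here is a standalone sketch of how methods like these could be implemented with try/catch. These are hypothetical stand-ins, not the provided code - use the actual JSONHelpers methods in your project. Note also that Double.parseDouble accepts inputs such as "NaN" and "Infinity" that are not legal JSON, so the provided helper may be stricter than this sketch.

```java
public class JSONHelpersSketch {
    // Hypothetical stand-in for tryParseInteger: returns the parsed int,
    // or null if s cannot be parsed as an integer.
    static Integer tryParseInteger(String s) {
        try { return Integer.parseInt(s); }
        catch (NumberFormatException e) { return null; }
    }

    // Hypothetical stand-in for tryParseDouble: returns the parsed double
    // (integers also parse successfully), or null otherwise. Unlike JSON,
    // Double.parseDouble also accepts "NaN", "Infinity", etc.
    static Double tryParseDouble(String s) {
        try { return Double.parseDouble(s); }
        catch (NumberFormatException e) { return null; }
    }

    public static void main(String[] args) {
        System.out.println(tryParseInteger("42"));   // 42
        System.out.println(tryParseInteger("4.2"));  // null
        System.out.println(tryParseDouble("4.2"));   // 4.2
    }
}
```

The null return is what makes these useful in parsing: you can attempt the integer interpretation first and fall back to the double interpretation only if the result is null.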
We’ve also provided you with a ParsableString class.
This will allow you to easily keep track of what index has been parsed up to so far.
All methods to alter this index will also automatically account for whitespace for you.
This class has the below methods:
- incrementIndex(int i) - Moves the index forwards i characters, adjusting for whitespace.
- parseInternalString() - If the index is currently at the start of a String, parses and returns that String, moving the index forwards accordingly, adjusting for whitespace. Throws a JSONParsingException if it’s not at the start of a String.
- getNumber() - If the index is currently at an integer or double, returns the entirety of that number as a string, moving the index forwards accordingly, adjusting for whitespace. Throws a JSONParsingException if it’s not at the start of an integer or double.
Part 1: JSON Implementation
In this part, you will implement two major methods in the JSON interface that can dump and parse JSON elements respectively.
Both of these methods must use recursion to accomplish their goals.
In last week’s project you wrote a toString method for HTMLManager and several helper methods for HTMLParser, each of which used stacks and queues.
This time, to create JSON’s dump and parse methods you will instead use recursion.
Both dump and parse MUST be written recursively when indicated to receive any credit.
public String dump(int numIndents)
Returns a pretty-print string representation of the JSON input. More specific requirements are included in the boxes for Task 1.1 and Task 1.2.
You should recurse on each JSON element to decide what formatting should be used when representing it as a string.
Since dump is an instance method, we’ve already given you headers for it in each JSON class.
Note that the dump methods you implement will have a single parameter, numIndents.
Because of inheritance, calling json.dump(numIndents) on any instance of a JSON class will call that specific class’s implementation of dump.
Example:
A JSONList containing JSONStringValue “Pasadena” and a JSONObject mapping “metro” to the JSONDoubleValue 5.0 and “car” to the JSONDoubleValue 2.5 would be dumped as:
[
  "Pasadena",
  {
    "car": 2.5,
    "metro": 5
  }
]
Task 1.1.
Implement dump for values. This version of the method does not have to be recursive.
Notes:
- In order to get the actual value stored inside a JSONValue, you may want to look at the JSONValue interface in JSONValue.java.
- For JSONDoubleValues, to ensure that a string such as 20000008 is dumped as such instead of as 2.0000008E7, you must use the BigDecimal class and its method toPlainString.
- For JSONStringValues, these should output in quotations the string they contain, with proper JSON formatting. You may want to look at the methods in the JSONHelpers class, and see if there is one you can use.
- The remaining JSONValue subtypes should return their values as strings.
- You may ignore (or implement) the numIndents parameter of dump for JSONValue subtypes, depending on whether you wish to use it in later sections. We will not be testing this.
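The BigDecimal note can be seen in a small standalone snippet. Exactly how you construct the BigDecimal is up to you; this sketch uses BigDecimal.valueOf, which round-trips through Double.toString, and contrasts it with the plain Double.toString output.

```java
import java.math.BigDecimal;

public class DoubleDumpDemo {
    public static void main(String[] args) {
        double d = 20000008.0;
        // Plain Double.toString uses scientific notation for large values:
        System.out.println(Double.toString(d));                    // 2.0000008E7
        // toPlainString avoids the exponent:
        System.out.println(BigDecimal.valueOf(d).toPlainString()); // 20000008
        System.out.println(BigDecimal.valueOf(2.5).toPlainString()); // 2.5
        // Caution: the new BigDecimal(double) constructor exposes the raw
        // binary value of the double, so new BigDecimal(0.1) prints a
        // ~55-digit expansion, while BigDecimal.valueOf(0.1) prints "0.1".
    }
}
```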
Task 1.2.
Implement dump for lists and objects. These versions of the method must be recursive.
Hint: the repeat() function may be useful for indentation (for, say, if you wanted multiple indents in a row)!
Notes:
- JSONLists should be formatted as

[
  JSON element,
  JSON element
]

repeating for each element.

- The indentation is two spaces.
- numIndents represents the correct number of indents for printing.
- JSONObjects should be formatted as

{
  "key": JSON element,
  "key": JSON element
}

repeating for each key-value pair, with the keys appearing in alphabetical order.

- The indentation is two spaces.
- numIndents represents the correct number of indents for printing.
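As a standalone illustration of the recursion, here is a minimal sketch with made-up Mini* class names (not the project’s real JSONList/JSONObject classes, which store real JSON values rather than only strings). Each dump call indents its children by numIndents + 1 and lets dynamic dispatch pick the right implementation; a TreeMap keeps object keys alphabetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

interface MiniJSON { String dump(int numIndents); }

class MiniString implements MiniJSON {
    private final String value;
    MiniString(String value) { this.value = value; }
    public String dump(int numIndents) { return "\"" + value + "\""; }
}

class MiniList implements MiniJSON {
    private final List<MiniJSON> elements;
    MiniList(List<MiniJSON> elements) { this.elements = elements; }
    public String dump(int numIndents) {
        String inner = "  ".repeat(numIndents + 1);  // two-space indentation per level
        StringBuilder sb = new StringBuilder("[\n");
        for (int i = 0; i < elements.size(); i++) {
            sb.append(inner).append(elements.get(i).dump(numIndents + 1)); // recurse
            sb.append(i < elements.size() - 1 ? ",\n" : "\n");
        }
        return sb.append("  ".repeat(numIndents)).append("]").toString();
    }
}

class MiniObject implements MiniJSON {
    private final TreeMap<String, MiniJSON> members;  // TreeMap keeps keys alphabetical
    MiniObject(Map<String, MiniJSON> members) { this.members = new TreeMap<>(members); }
    public String dump(int numIndents) {
        String inner = "  ".repeat(numIndents + 1);
        StringBuilder sb = new StringBuilder("{\n");
        int i = 0;
        for (Map.Entry<String, MiniJSON> e : members.entrySet()) {
            sb.append(inner).append("\"").append(e.getKey()).append("\": ")
              .append(e.getValue().dump(numIndents + 1));  // recurse on the value
            sb.append(++i < members.size() ? ",\n" : "\n");
        }
        return sb.append("  ".repeat(numIndents)).append("}").toString();
    }
}

public class DumpSketch {
    public static void main(String[] args) {
        MiniJSON doc = new MiniList(List.of(
            new MiniString("Pasadena"),
            new MiniObject(Map.of("metro", new MiniString("5"),
                                  "car", new MiniString("2.5")))));
        System.out.println(doc.dump(0));
    }
}
```

Note how only the opening bracket is printed without indentation: the caller has already placed the element at the right indent, while the closing bracket is printed at numIndents so it lines up with the line that opened the block.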
public static JSON parse(String inputStr)
Takes in a String representation of a JSON file, and returns the outermost JSON object. You can think of parse() as the “reverse operation” of dump() - instead of taking Java objects and turning them into JSON text, we want to turn JSON text into Java objects.
If the input is not a valid JSON file, throws a JSONParsingException. The index you pass in when creating/throwing the exception does not matter (it will not be tested on).
If the input is valid, this method returns an object of type JSON, the outermost JSON instance in the input.
Unlike dump, there is one single parse method in the JSON class for you to implement.
You should recurse on each JSON element contained in the input string to decide what type is being read and how it should be decoded by your program.
Your recursion should happen in order, similar to HTMLParser, i.e., using ParsableString you should never need to move the index backwards.
You should write helper functions to accomplish this - several headers have already been provided for you.
Hint: You should return an instance of a JSON class in each recursive step.
Task 2.1. Implement parse for values.
Note: Reminder that null is a special JSON value. To account for it, the String "null" should be parsed as the null value in Java.
Task 2.2. Implement parse for lists.
Task 2.3. Implement parse for objects.
If you’re stuck on some failing test cases, below are a couple of hints about potential edge cases for parse().
Using .indexOf() to find the end of the current element will often return an incorrect position if objects of the same type are nested within the current object. For example, if you used .indexOf("]") to find the end of a JSONList, then the list
[
  "a",
  [
    "cat",
    "dog"
  ],
  "b"
]
will cause .indexOf() to find the inner closing bracket before the outer one, missing the element "b" at the end of the list. Instead, recursively parse each element of the list (or Object) one at a time until you reach the end of the list (or Object); such cases are then handled automatically, provided you implement your recursion (and your error handling for malformed input) carefully. Note that trying to find the end index of an item in a List (or of a key or value in an Object) using .indexOf() runs into the same problems, and parsing each element recursively addresses this as well.
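To see why element-by-element recursion sidesteps the .indexOf() problem, here is a standalone sketch that parses only nested lists of quoted strings. It is not the project’s ParsableString API - the forward-only int cursor merely stands in for ParsableString’s index - and real JSON parsing must handle more types, escape sequences, and malformed input.

```java
import java.util.ArrayList;
import java.util.List;

// Recursive descent over nested lists of quoted strings, tracking a single
// forward-only cursor. No searching ahead with indexOf is ever needed:
// each recursive call consumes exactly one element and leaves the cursor
// just past it.
public class ParseSketch {
    private final String s;
    private int i = 0;  // the cursor never moves backwards
    ParseSketch(String s) { this.s = s; }

    private void skipWhitespace() {
        while (i < s.length() && Character.isWhitespace(s.charAt(i))) i++;
    }

    Object parseElement() {
        skipWhitespace();
        if (s.charAt(i) == '[') return parseList();
        if (s.charAt(i) == '"') return parseString();
        throw new IllegalArgumentException("bad input at index " + i);
    }

    List<Object> parseList() {
        i++;  // consume '['
        List<Object> out = new ArrayList<>();
        skipWhitespace();
        while (s.charAt(i) != ']') {
            out.add(parseElement());  // recursion handles nesting automatically
            skipWhitespace();
            if (s.charAt(i) == ',') { i++; skipWhitespace(); }
        }
        i++;  // consume ']'
        return out;
    }

    String parseString() {
        int start = ++i;  // skip the opening quote (no escape handling here)
        while (s.charAt(i) != '"') i++;
        return s.substring(start, i++);  // skip the closing quote
    }

    public static void main(String[] args) {
        ParseSketch p = new ParseSketch("[\"a\", [\"cat\", \"dog\"], \"b\"]");
        System.out.println(p.parseElement());  // [a, [cat, dog], b]
    }
}
```

On the nested input above, the inner call to parseList consumes "[\"cat\", \"dog\"]" entirely before the outer loop resumes, so the outer list correctly sees "b" as its final element.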
Tests involving ]’s and }’s are failing, but my code finds an equal number of opening and closing braces?
The failure of this test often indicates that your parse() method is not checking the string is fully parsed before returning the JSON value. For example, parsing the JSONList
[
  "cat"
]]
should cause a JSONParsingException to be thrown. However, if you have not checked that the string is fully parsed before returning from parse(), then your code might be returning the JSONList("cat") object instead of detecting that the string is not valid JSON.
Part 2: WebCrawler
This part will require you to analyze HTML tags. Please copy your HTMLManager and HTMLParser implementations from project03 into the appropriate files! If you did not complete the HTMLParser from project03, you will not be able to complete this project’s A tests. If you did not complete project03’s A tests, we highly recommend you go back and finish them so you can attempt these ones.
What is a WebCrawler?
A web crawler is a program designed to browse the internet, and index webpages, traveling from page to page using the links on each page. Our web crawler will also be scraping these webpages for their textual content, which will be useful in next week’s project where you will construct a search engine.
How do links work in HTML? A link to another webpage is written in HTML as <a href="link url">text</a>, where the actual webpage link is stored after the href=. The tag element a indicates a link.
Implementation
public static JSONObject crawl(String link, int maxDepth)
Given a starting link, visits all webpages of distance at most maxDepth - 1 away. To do this, you should define a recursive method that, given a link, will parse that link’s HTML and will recursively visit links in the HTML, to create a nested object that is returned as described below.
- This method returns a JSONObject which is constructed as follows. For each visited link:
  - Store it as a key in the JSONObject. The value of that key should be another JSONObject. This inner object should contain:
    - A key labeled “content”, with the value equal to the content of the link/page (see below for the definition of content).
    - A key labeled “links”, with the value being another JSONObject. The key, value pairs for this innermost object should be as follows:
      - Each key is a link found on this webpage, and the corresponding value is the number of occurrences of this link.

There is an example below of what such a JSONObject would look like that may be helpful!
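The depth-limited recursion itself can be sketched independently of any HTML handling. This standalone example uses a made-up in-memory link graph in place of getPageText, and records only which pages were visited and which links each contained; in the real crawl you would build the nested JSONObject instead.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CrawlSketch {
    // Fake "web": each page maps to the links found on it (all names made up).
    static final Map<String, List<String>> WEB = Map.of(
        "A", List.of("B", "C"),
        "B", List.of("A"),
        "C", List.of("D"),
        "D", List.of());

    // Visit page, then recursively visit its links with one less depth.
    // The result map doubles as the visited set, so revisits (e.g. B -> A)
    // and pages beyond the depth limit are skipped.
    static void crawl(String page, int depth, Map<String, List<String>> result) {
        if (depth <= 0 || result.containsKey(page)) return;
        List<String> links = WEB.get(page);
        result.put(page, links);
        for (String link : links) crawl(link, depth - 1, result);
    }

    public static void main(String[] args) {
        Map<String, List<String>> result = new TreeMap<>();
        crawl("A", 2, result);
        System.out.println(result.keySet());  // [A, B, C]
    }
}
```

With maxDepth 2, page D is never visited: it is distance 2 from the start, which exceeds maxDepth - 1 = 1, matching the Wikipedia example later in this handout.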
Note that there are 3 main types of links you should look out for:
- Absolute URLs - an entire link. These always start with "https://". Example: "https://wikipedia.org/wiki/Horse"
- Relative URLs - a link that points to a file within a website. These always start with a "/". Example: if you are on https://wikipedia.org/wiki/Horse, a relative URL of "/wiki/Duck" would reroute you to https://wikipedia.org/wiki/Duck
- Folder-relative URLs - a link that points to a file relative to the current page’s folder. These can start with anything. Example: if you are on https://wikipedia.org/wiki/Horse, a folder-relative URL of "Duck" would reroute you to https://wikipedia.org/wiki/Duck.
Each of these URL types should be converted to Absolute URL format before you put it in the JSONObject that is returned.
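Here is a sketch of that conversion. The helper name toAbsolute is hypothetical, and the sketch assumes the current page’s URL is already absolute and contains a path; real code should also consult the provided ignoreHref/ignoreLink helpers before converting.

```java
public class UrlSketch {
    // Convert an href found on the page at `current` into absolute form.
    // Assumes `current` starts with "https://" and contains a path.
    static String toAbsolute(String current, String href) {
        if (href.startsWith("https://")) {
            return href;  // already absolute
        }
        if (href.startsWith("/")) {
            // Relative URL: site root + href, e.g. "https://wikipedia.org" + "/wiki/Duck"
            int pathStart = current.indexOf('/', "https://".length());
            return current.substring(0, pathStart) + href;
        }
        // Folder-relative URL: current folder + href,
        // e.g. "https://wikipedia.org/wiki/" + "Duck"
        return current.substring(0, current.lastIndexOf('/') + 1) + href;
    }

    public static void main(String[] args) {
        String page = "https://wikipedia.org/wiki/Horse";
        System.out.println(toAbsolute(page, "/wiki/Duck"));  // https://wikipedia.org/wiki/Duck
        System.out.println(toAbsolute(page, "Duck"));        // https://wikipedia.org/wiki/Duck
    }
}
```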
The value of “content” for a page should be constructed using the following rules:
- It should contain the values of all
CONTENTtype tags concatenated together, with whitespaces in between. - Pre-existing whitespace in the
CONTENTtags’ values should not be trimmed -
CONTENTtags which are inside of<script>or<style>blocks should be ignored, and not included in the “content”.
Links should be handled according to the following rules:
- Self-links should be ignored (the links in a page which refer to that page itself).
- Only links stored inside tags of the form <a href="link"> should be handled.
- We strongly recommend you use the provided HREF regular expression to detect the links inside of a tags.
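For illustration only, this is what extracting an href with a regular expression looks like. The pattern below is a simplified stand-in that only handles double-quoted hrefs; use the provided HREF regular expression in your actual code rather than this one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefSketch {
    // Simplified stand-in pattern: captures whatever sits between the quotes
    // after href=. The project's provided HREF regex should be used instead.
    static final Pattern HREF = Pattern.compile("href\\s*=\\s*\"([^\"]*)\"");

    public static void main(String[] args) {
        Matcher m = HREF.matcher("<a href=\"/wiki/Duck\">");
        if (m.find()) {
            System.out.println(m.group(1));  // /wiki/Duck
        }
    }
}
```

The captured group is the raw href text, which would then be screened with ignoreHref and converted to absolute form before being counted.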
Example:
We are given an input of Wikipedia.org/wiki/Chicken with maxDepth 2.
https://Wikipedia.org/wiki/Chicken has text “Gallus gallus domesticus” and links to both https://Wikipedia.org/wiki/Pet and https://Wikipedia.org/wiki/Hemoglobin. https://Wikipedia.org/wiki/Pet has text “companion animal” and links to https://Wikipedia.org/wiki/Chicken and https://Wikipedia.org/wiki/Koi. https://Wikipedia.org/wiki/Hemoglobin has text “blood” and a link to https://Wikipedia.org/wiki/Hyperbolic_functions.
If these are the only links we found on the webpages we searched, our output would be:
{
  "https://Wikipedia.org/wiki/Chicken":
  {
    "content": "Gallus gallus domesticus",
    "links":
    {
      "https://Wikipedia.org/wiki/Pet": 1,
      "https://Wikipedia.org/wiki/Hemoglobin": 1
    }
  },
  "https://Wikipedia.org/wiki/Pet":
  {
    "content": "companion animal",
    "links":
    {
      "https://Wikipedia.org/wiki/Chicken": 1,
      "https://Wikipedia.org/wiki/Koi": 1
    }
  },
  "https://Wikipedia.org/wiki/Hemoglobin":
  {
    "content": "blood",
    "links":
    {
      "https://Wikipedia.org/wiki/Hyperbolic_functions": 1
    }
  }
}
Note that we don’t explore the links to the Wikipedia pages “Koi” or “Hyperbolic_functions” because they are greater than maxDepth - 1 = 1 pages away from our starting Wikipedia page of “Chicken”.
Helper Methods
Additionally, we have given you two more helper methods in WebCrawler:
- public static String getPageText(String page): given the link for a webpage, this method returns the contents of the webpage in raw HTML. You may want to use this in conjunction with HTMLParser to get the HTMLTags for a link.
- public static boolean ignoreHref(String s): given a potential link (the text following an "href="), returns true if the contents are not related to a webpage (and should therefore be ignored by crawl).
- public static boolean ignoreLink(String link): given a String formatted as a link, returns true if the contents point to an invalid webpage (and should therefore be ignored by crawl).
These methods will be necessary for your implementation of crawl: after obtaining a Queue of tags, you can process the Queue to find tags that correspond to links. For instance, if the link to webpage B appears in webpage A, then once you get a Queue of tags in webpage A, the tag <a href="link to B"> will be in that Queue.
Note that the actual link to B is a substring of the String "<a href="link to B">", which you want to extract using the HREF regular expression.
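Once links are extracted and converted to absolute form, counting occurrences for the innermost "links" object is a standard map operation, sketched here with made-up link data.

```java
import java.util.Map;
import java.util.TreeMap;

public class LinkCountSketch {
    // Tally occurrences of each (already-absolute) link found on one page.
    // A TreeMap just makes the printed order deterministic; any Map works.
    static Map<String, Integer> countLinks(String[] found) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String link : found) {
            counts.merge(link, 1, Integer::sum);  // add 1, starting from 0
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] found = { "https://a.org/x", "https://a.org/y", "https://a.org/x" };
        System.out.println(countLinks(found));  // {https://a.org/x=2, https://a.org/y=1}
    }
}
```

In crawl, each (link, count) pair from such a map would become a key and a JSONIntegerValue in the page’s "links" JSONObject.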
Do not interrupt the first run of the A tests! If you do so, the largeData.zip file may not completely get unzipped. If you do accidentally interrupt the first run of the A tests, you can delete the first (incomplete) unzipped largeData folder, then rerun the A tests.
The first run of the A tests will likely take a while; this is perfectly normal and not a cause for concern.
Task 3. Implement WebCrawler.
Note: if you are failing some tests, check your findContent in HTMLParser. In HTMLParser, we expected findContent to return null if there is no tag in the current page; in WebCrawler, however, we want findContent to return a content tag containing the whole page (and to set the page to the empty string).
Result
Congratulations! You’ve successfully created a database of links and webpages using your webcrawler and your own implementation of JSON. If you look at the tests, you will see that in the tests, we dump out the databases you’ve created (using your dump() method). Next week, you will see what we can do with these dumped-out databases.
