For most of the remaining projects in CS 2 (including this one), you will be allowed to work in groups of two. You may discuss high-level ideas with students outside your group, but any viewing, sharing, copying, or dictating of code is forbidden. Dean’s Tutors may not read your code.

Goals and Outcomes

In this project, you will implement one data structure (called a trie) and two versions of an autocomplete algorithm (one that uses Java’s HashMap and one that uses your Trie). Your autocomplete algorithms will then be used to be the backend of a system that complete the names of movies in a search field.

By the end of this project, you will…

be able to run an efficient autocomplete algorithm on a variety of datasets
understand how to implement a trie data structure and some algorithms that correspond to them
have gained a better understanding of API design, as it relates to making projects usable with a variety of clients

Connection to Lecture

This project covers lecture08 (trees) and lecture09 (tries). You will likely find the code from these two lectures to be very useful in the later parts of this project.

Overview and Implementation Strategy

In the previous project, you wrote three implementations of the IDeque interface. This time, you will continue our mission of implementing all the data structures ourselves by writing a trie implementation of the IDictionary interface. In addition to building a trie data structure, you will write an autocomplete algorithm which can switch between using a backing java.util.HashMap and your very own TrieMap.

Just to be super clear: the only data structures in java.util that you are allowed to use in this project are java.util.HashMap and java.util.HashSet.

Deques Again

The first thing you should do for this project is copy over your deque implementations from project03. You may choose either partner’s implementation of each of the data structures, but, at a minimum, you must copy over someone’s ArrayDeque and someone’s LinkedDeque. You should place them in the edu.caltech.cs2.datastructures package.

If neither partner has both IDeque’s working, e-mail Adam immediately.

`HashMovieAutoCompleter` Requirements

First, we will implement a movie autocompleter algorithm backed by a HashMap. Our autocomplete implementation will be case-insensitive, and will suggest a list of movie names, given a set of words. Consider the movie database that includes the following movies:

The Avengers
The Avengers Age of Ultron
Captain America
Black Panther
Transformers Age of Extinction

Now consider the autocomplete query "age of". We want to match on any set of consecutive words in a movie title; so, we should suggest ["The Avengers Age of Ultron", "Transformers Age of Extinction"].

However, if the query is "of age", we should not suggest anything, because none of the movies have “of” and “age” in that order. Similarly, if our autocomplete query is "the", then we should suggest ["The Avengers", "The Avengers Age of Ultron"].

In our first AutoCompleter, HashMovieAutoCompleter, the backing data structure will be a java.util.HashMap which maps each normalized movie title to an IDeques of “suffixes” that can be formed from that title.

For example, the above database would have the following mappings:

"the avengers" \(\mapsto\) ["the avengers", "avengers"]
"the avengers age of ultron" \(\mapsto\) ["the avengers age of ultron", "avengers age of ultron", "age of ultron", "of ultron", "ultron"]
"captain america" \(\mapsto\) ["captain america", "america"]
"black panther" \(\mapsto\) ["black panther", "panther"]
"transformers age of extinction" \(\mapsto\) ["transformers age of extinction", "age of extinction", "of extinction", "extinction"]

public static void populateTitles()

Populates the autocompleter titles map with all suffixes for each movie title.

You will find the pre-populated field ID_MAP which maps movie titles to their IMDB ids useful. Do NOT attempt to open the data files directly; this has already been done for you in the super class AbstractMovieAutoCompleter.

public static IDeque<String> complete(String term)

Given a search term (i.e., words separated by spaces), returns an IDeque of movie titles that are possible completions of the search term. Make sure there are no duplicate values in the result.

The general algorithm here is to check if the search term is a prefix of any of the entries in the map. If so, it is a valid completion. Note that “this!” does not match “this” because symbols are significant.

Why am I looking for prefixes of suffixes?

Imagine you want to search for “The Avengers: Age of Ultron” in the Netflix search bar. You would expect “Age of Ultron” to match, even though it isn’t the start of the title. Each suffix represents the start of a new word in the title, so by using suffixes, we allow you to start searching from any word.

However, you also don’t want to have to type in the whole search term. “Age of” should also match. This is where the prefixes come in: they are the starts of a suffix that you could be searching for. Together, this is a search for prefixes of suffixes.

Why doesn’t “this!” match “this”?

In our algorithm, we are searching by words where punctuation is significant. So, “this!” has an extra exclamation point which does not match the term where there isn’t punctuation.

Task 0. Implement the populateTitles and complete methods in HashMovieAutoCompleter.java.

Running the AutoCompleter

If all your D tests pass, you should be able to run the “Movie Autocomplete” target of the project. A browser window should automatically open with the movie finder interface. The algorithm you just wrote backs the search for completions. We recommend trying a query like “the “ which we expect to have lots of results. You will likely notice that the interface is noticably slow! This is all because of the data structure we chose. In the next part of the project, we will write a new data structure, called a trie, which is more suited to this type of application.

`Trie` Requirements

A Trie is a dictionary which maps “strings” to some type. Unlike other trees that we’ve seen previously, each node only contains a single “letter”. For example, here is a trie that represents the words {adam, add, app, bad, bag, bags, beds, bee, cab}:

trie1

You should be familiar with using a HashMap and a TreeMap at this point. So, we’ll start with a comparison to those.

Comparing `ITrieMap` to `HashMap` and `TreeMap`

It helps to compare it with dictionaries you’ve already seen: HashMap and TreeMap. Each of these types of maps takes two generic parameters K (for the “key” type) and V (for the “value” type). It is important to understand that Tries are NOT a general purpose data structure. There is an extra restriction on the “key” type; namely, it must be made up of characters (it’s tempting to think of the Character type here, but really, we mean any alphabet–chars, alphabetic letters, bytes, etc.). Example of types that Tries are good for include: String, byte[], List<E>. Examples of types that Tries cannot be used for include int and Dictionary<E>.

In our implementation of Tries, we encode this restriction in the generic type parameters:

A: An “alphabet type”. For a String, it would be Character. For a byte[], it would be Byte.
K: A “key type”. We insist that all the “key types” extend Iterable<A> which encodes exactly the restriction that there is an underlying alphabet.
V: A “value type”. There are no special restrictions on this type.

Because the key type is iterable, there is a clean way to take apart keys (i.e., you can use a foreach loop), but we also need a way to put keys back together. To do this, we use Java’s anonymous functions.

For our purposes, what this means is that the constructor will look like this:

public ITrieMap(Function<IDeque<A>, K> collector) {
    ...
}

What is this “collector” for?

If you look at the collector function’s declaration, you will see it is listed as Function<IDeque<A>, K>. Essentially, what this means is that it has the functionality to take an IDeque of “letters” in your alphabet and return a key that is essentially the letters concatenated together. So if we have an IDeque called acc that contains ["h", "e", "l", "l", "o"], and we called this.collector.apply(acc), it would return the string "hello".

You will find the collector very useful in your keys() method.

And, eventually, when we create a trie to use for autocompleting, the instantiation will look like this:

private static ITrieMap<String, IDeque<String>, IDeque<String>> titles =
    new TrieMap<>((IDeque<String> s) -> s);

`TrieMap`

A TrieMap is an implementation of a trie where the “pointers” are made up of a HashMap. An array would work as well, but you should think about why that might not be a good idea if the alphabet size is large (like 5 or more). Consider the following ITrieMap:

trie2

We could manually construct this as a TrieMap as follows:

this.root = new TrieNode();
this.root.pointers.put('a', new TrieNode("hello"));
this.root.pointers.get('a').pointers.put('d', new TrieNode());
this.root.pointers.get('a').pointers.get('d').pointers.put('d', new TrieNode("1 + 1"));
this.root.pointers.get('a').pointers.put('p', new TrieNode());
this.root.pointers.get('a').pointers.get('p').pointers.put('p', new TrieNode("for a phone"));

Notice that the pointers variables in each of the nodes are just standard HashMaps!

You will implement all the standard IDictionary operations. For each of these, read the comments for IDictionary for the particulars. There are two methods (getCompletions and findPrefix) which are special for ITrieMaps and have additional information/restrictions:

isPrefix(K k) should return true iff k is a prefix of some key in the trie. For example, if “add” were a key in the trie, then:
isPrefix("”) = isPrefix("a”) = isPrefix("ad”) = isPrefix("add”) = true.
This method is arguably one of the major reasons to use a ITrieMap over another implementation of Dictionary. (We saw a similar trade-off between HashMap (faster) and TreeMap (ordered) a few weeks ago.) Unlike in a normal IDictionary, it is possible (and in fact, easy) to implement this method.
getCompletions(K prefix) should return an IDeque filled with the values of all descendants of the prefix. For example, in the example above:
getCompletions("") = getCompletions("a") = ["hello", "1 + 1", "for a phone"]

getCompletions, keys, and values are the hardest methods to write here. You will likely want to use an “accumulator” argument which “collects” the completions/keys/values as you recurse down the tree. The only place you will need to use the “collector” function is in keys.

Your implementations of get, put, isPrefix, and containsKey must have time complexity \(\Theta(d)\) where \(d\) is the number of letters in the key argument of these methods. These methods work on the entire key (the whole “string” of “letters”); make sure to only remove/add/find the exact key asked for. You will also have to implement other methods like keys and values, but the runtime of these methods may be worse. Notably, we are not implementing the remove method right now–it’s not needed for the application; so, we’ll implement it later in the A tests.

Task 1. Just like last time, the interface files will be very important. Carefully read the interfaces/IDictionary interface.

Task 2. Implement the put, size, and clear methods in TrieMap.

What is the size method counting?

The size function should return the number of key/value pairs, not the number of nodes in the trie (which might be much larger)!!!!

Task 3. Implement the get, containsKey, and isPrefix methods.

Task 4. Implement the values, containsValue, keys, and iterator methods. The iterator should go over the keys.

Task 5. Implement the getCompletions method.

`TrieMovieAutoCompleter` Requirements

Next, we’ll implement an autocompleter that uses our trie. The specification is identical to the D part, except the main dictionary will be a trie which maps suffixes to IDeques of titles instead of a HashMap that goes in the opposite direction. Furthermore, you will find the getCompletions method you wrote to be very useful when implementing complete().

Task 6. Implement the populateTitles and complete methods in TrieMovieAutoCompleter.java.

After you’ve implemented this, go to Main.java and swap out line 28. You should see a significant speed-up from the HashMap version!

`TrieMap` `remove()`

Finally, to wrap things up, go ahead and implement the remove method for your trie. remove(K k) should delete k from the trie as well as all of the nodes that become unnecessary. One implementation of remove (called lazy deletion) would be to find k in the map and set its value to null (since null is not a valid value in the map). You may not implement remove as lazy deletion. Instead, you must ensure that all leaves of your trie have values.

remove MUST be written recursively to recieve any credit.

Task 7. Implement the remove method.

Goals and Outcomes

Connection to Lecture

Overview and Implementation Strategy

Deques Again

HashMovieAutoCompleter Requirements

Running the AutoCompleter

Trie Requirements

Comparing ITrieMap to HashMap and TreeMap

TrieMap