CS 2 (Winter 2024) Project 04: Movie Finder

This project focuses on implementing the trie data structure, creating an extensible API, and designing performance improvements to tries to better fit a specific use-case.

Goals and Outcomes

In this project, you will implement one data structure (called a trie) and two versions of an autocomplete algorithm (one that uses Java’s HashMap and one that uses your Trie). Your autocomplete algorithms will then be used to be the backend of a system that complete the names of movies in a search field.

By the end of this project, you will…

Connection to Lecture

This project covers lecture08 (trees) and lecture09 (tries). You will likely find the code from these two lectures to be very useful in the later parts of this project.

Overview and Implementation Strategy

In the previous project, you wrote three implementations of the IDeque interface. This time, you will continue our mission of implementing all the data structures ourselves by writing a trie implementation of the IDictionary interface. In addition to building a trie data structure, you will write an autocomplete algorithm which can switch between using a backing java.util.HashMap and your very own TrieMap.

Deques Again

HashMovieAutoCompleter Requirements

First, we will implement a movie autocompleter algorithm backed by a HashMap. Our autocomplete implementation will be case-insensitive, and will suggest a list of movie names, given a set of words. Consider the movie database that includes the following movies:

Now consider the autocomplete query "age of". We want to match on any set of consecutive words in a movie title; so, we should suggest ["The Avengers Age of Ultron", "Transformers Age of Extinction"].

However, if the query is "of age", we should not suggest anything, because none of the movies have “of” and “age” in that order. Similarly, if our autocomplete query is "the", then we should suggest ["The Avengers", "The Avengers Age of Ultron"].

In our first AutoCompleter, HashMovieAutoCompleter, the backing data structure will be a java.util.HashMap which maps each normalized movie title to an IDeques of “suffixes” that can be formed from that title.

For example, the above database would have the following mappings:

public static void populateTitles()

Populates the autocompleter titles map with all suffixes for each movie title.


You will find the pre-populated field ID_MAP which maps movie titles to their IMDB ids useful. Do NOT attempt to open the data files directly; this has already been done for you in the super class AbstractMovieAutoCompleter.

public static IDeque<String> complete(String term)

Given a search term (i.e., words separated by spaces), returns an IDeque of movie titles that are possible completions of the search term. Make sure there are no duplicate values in the result.


The general algorithm here is to check if the search term is a prefix of any of the entries in the map. If so, it is a valid completion. Note that “this!” does not match “this” because symbols are significant.

Why am I looking for prefixes of suffixes?

Imagine you want to search for “The Avengers: Age of Ultron” in the Netflix search bar. You would expect “Age of Ultron” to match, even though it isn’t the start of the title. Each suffix represents the start of a new word in the title, so by using suffixes, we allow you to start searching from any word.

However, you also don’t want to have to type in the whole search term. “Age of” should also match. This is where the prefixes come in: they are the starts of a suffix that you could be searching for. Together, this is a search for prefixes of suffixes.

Why doesn’t “this!” match “this”?

In our algorithm, we are searching by words where punctuation is significant. So, “this!” has an extra exclamation point which does not match the term where there isn’t punctuation.

Running the AutoCompleter

If all your D tests pass, you should be able to run the “Movie Autocomplete” target of the project. A browser window should automatically open with the movie finder interface. The algorithm you just wrote backs the search for completions. We recommend trying a query like “the “ which we expect to have lots of results. You will likely notice that the interface is noticably slow! This is all because of the data structure we chose. In the next part of the project, we will write a new data structure, called a trie, which is more suited to this type of application.

Trie Requirements

A Trie is a dictionary which maps “strings” to some type. Unlike other trees that we’ve seen previously, each node only contains a single “letter”. For example, here is a trie that represents the words {adam, add, app, bad, bag, bags, beds, bee, cab}:

trie1

You should be familiar with using a HashMap and a TreeMap at this point. So, we’ll start with a comparison to those.

Comparing ITrieMap to HashMap and TreeMap

It helps to compare it with dictionaries you’ve already seen: HashMap and TreeMap. Each of these types of maps takes two generic parameters K (for the “key” type) and V (for the “value” type). It is important to understand that Tries are NOT a general purpose data structure. There is an extra restriction on the “key” type; namely, it must be made up of characters (it’s tempting to think of the Character type here, but really, we mean any alphabet–chars, alphabetic letters, bytes, etc.). Example of types that Tries are good for include: String, byte[], List<E>. Examples of types that Tries cannot be used for include int and Dictionary<E>.

In our implementation of Tries, we encode this restriction in the generic type parameters:

Because the key type is iterable, there is a clean way to take apart keys (i.e., you can use a foreach loop), but we also need a way to put keys back together. To do this, we use Java’s anonymous functions.

For our purposes, what this means is that the constructor will look like this:

public ITrieMap(Function<IDeque<A>, K> collector) {
    ...
}
What is this “collector” for?

If you look at the collector function’s declaration, you will see it is listed as Function<IDeque<A>, K>. Essentially, what this means is that it has the functionality to take an IDeque of “letters” in your alphabet and return a key that is essentially the letters concatenated together. So if we have an IDeque called acc that contains ["h", "e", "l", "l", "o"], and we called this.collector.apply(acc), it would return the string "hello".

You will find the collector very useful in your keys() method.

And, eventually, when we create a trie to use for autocompleting, the instantiation will look like this:

private static ITrieMap<String, IDeque<String>, IDeque<String>> titles =
    new TrieMap<>((IDeque<String> s) -> s);

TrieMap

A TrieMap is an implementation of a trie where the “pointers” are made up of a HashMap. An array would work as well, but you should think about why that might not be a good idea if the alphabet size is large (like 5 or more). Consider the following ITrieMap:

trie2

We could manually construct this as a TrieMap as follows:

this.root = new TrieNode();
this.root.pointers.put('a', new TrieNode("hello"));
this.root.pointers.get('a').pointers.put('d', new TrieNode());
this.root.pointers.get('a').pointers.get('d').pointers.put('d', new TrieNode("1 + 1"));
this.root.pointers.get('a').pointers.put('p', new TrieNode());
this.root.pointers.get('a').pointers.get('p').pointers.put('p', new TrieNode("for a phone"));

Notice that the pointers variables in each of the nodes are just standard HashMaps!

You will implement all the standard IDictionary operations. For each of these, read the comments for IDictionary for the particulars. There are two methods (getCompletions and findPrefix) which are special for ITrieMaps and have additional information/restrictions:

Your implementations of get, put, isPrefix, and containsKey must have time complexity \(\Theta(d)\) where \(d\) is the number of letters in the key argument of these methods. These methods work on the entire key (the whole “string” of “letters”); make sure to only remove/add/find the exact key asked for. You will also have to implement other methods like keys and values, but the runtime of these methods may be worse. Notably, we are not implementing the remove method right now–it’s not needed for the application; so, we’ll implement it later in the A tests.

What is the size method counting?

The size function should return the number of key/value pairs, not the number of nodes in the trie (which might be much larger)!!!!

TrieMovieAutoCompleter Requirements

Next, we’ll implement an autocompleter that uses our trie. The specification is identical to the D part, except the main dictionary will be a trie which maps suffixes to IDeques of titles instead of a HashMap that goes in the opposite direction. Furthermore, you will find the getCompletions method you wrote to be very useful when implementing complete().

After you’ve implemented this, go to Main.java and swap out line 28. You should see a significant speed-up from the HashMap version!

TrieMap remove()

Finally, to wrap things up, go ahead and implement the remove method for your trie. remove(K k) should delete k from the trie as well as all of the nodes that become unnecessary. One implementation of remove (called lazy deletion) would be to find k in the map and set its value to null (since null is not a valid value in the map). You may not implement remove as lazy deletion. Instead, you must ensure that all leaves of your trie have values.

remove MUST be written recursively to recieve any credit.