For most of the remaining projects in CS 2 (including this one), you will be allowed to work in groups of two. You may discuss high-level ideas with students outside your group, but any viewing, sharing, copying, or dictating of code is forbidden. Dean’s Tutors may not read your code.
Goals and Outcomes
In this project, you will implement one data structure (called a trie) and two versions of an autocomplete algorithm (one that uses Java’s HashMap
and one that uses
your Trie
). Your autocomplete algorithms will then be used to be the backend of a system that complete the names of movies in a search field.
By the end of this project, you will…
- be able to run an efficient autocomplete algorithm on a variety of datasets
- understand how to implement a trie data structure and some algorithms that correspond to them
- have gained a better understanding of API design, as it relates to making projects usable with a variety of clients
Connection to Lecture
This project covers lecture08 (trees) and lecture09 (tries). You will likely find the code from these two lectures to be very useful in the later parts of this project.
Overview and Implementation Strategy
In the previous project, you wrote three implementations of the IDeque
interface. This time, you will continue our mission of implementing all the data structures ourselves by
writing a trie implementation of the IDictionary
interface. In addition to building a trie data structure, you will write an autocomplete algorithm which can switch between
using a backing java.util.HashMap
and your very own TrieMap
.
Just to be super clear: the only data structures in java.util
that you are allowed to use in this project are java.util.HashMap
and java.util.HashSet
.
Deques Again
The first thing you should do for this project is copy over your deque implementations from project03. You may choose either partner’s implementation of each of the data structures, but,
at a minimum, you must copy over someone’s ArrayDeque
and someone’s LinkedDeque
. You should place them in the edu.caltech.cs2.datastructures
package.
If neither partner has both IDeque
’s working, e-mail Adam immediately.
HashMovieAutoCompleter
Requirements
First, we will implement a movie autocompleter algorithm backed by a HashMap
. Our autocomplete implementation will be case-insensitive, and will suggest a list of movie names,
given a set of words. Consider the movie database that includes the following movies:
- The Avengers
- The Avengers Age of Ultron
- Captain America
- Black Panther
- Transformers Age of Extinction
Now consider the autocomplete query "age of"
. We want to match on any set of consecutive words in a movie title; so, we should suggest
["The Avengers Age of Ultron", "Transformers Age of Extinction"]
.
However, if the query is "of age"
, we should not suggest anything, because none of the movies have “of” and “age” in that order.
Similarly, if our autocomplete query is "the"
, then we should suggest ["The Avengers", "The Avengers Age of Ultron"]
.
In our first AutoCompleter, HashMovieAutoCompleter
, the backing data structure will be a java.util.HashMap
which maps each normalized movie title
to an IDeque
s of “suffixes” that can be formed from that title.
For example, the above database would have the following mappings:
-
"the avengers"
\(\mapsto\)["the avengers", "avengers"]
-
"the avengers age of ultron"
\(\mapsto\)["the avengers age of ultron", "avengers age of ultron", "age of ultron", "of ultron", "ultron"]
-
"captain america"
\(\mapsto\)["captain america", "america"]
-
"black panther"
\(\mapsto\)["black panther", "panther"]
-
"transformers age of extinction"
\(\mapsto\)["transformers age of extinction", "age of extinction", "of extinction", "extinction"]
public static void populateTitles()
Populates the autocompleter titles
map with all suffixes for each movie title.
You will find the pre-populated field ID_MAP
which maps movie titles to their IMDB ids useful. Do NOT attempt to open the data files directly; this has already
been done for you in the super class AbstractMovieAutoCompleter
.
public static IDeque<String> complete(String term)
Given a search term (i.e., words separated by spaces), returns an IDeque
of movie titles that are possible completions of the search term
. Make sure there are
no duplicate values in the result.
The general algorithm here is to check if the search term
is a prefix of any of the entries in the map. If so, it is a valid completion.
Note that “this!” does not match “this” because symbols are significant.
Imagine you want to search for “The Avengers: Age of Ultron” in the Netflix search bar. You would expect “Age of Ultron” to match, even though it isn’t the start of the title. Each suffix represents the start of a new word in the title, so by using suffixes, we allow you to start searching from any word.
However, you also don’t want to have to type in the whole search term. “Age of” should also match. This is where the prefixes come in: they are the starts of a suffix that you could be searching for. Together, this is a search for prefixes of suffixes.
In our algorithm, we are searching by words where punctuation is significant. So, “this!” has an extra exclamation point which does not match the term where there isn’t punctuation.
Task 0.
Implement the populateTitles
and complete
methods in HashMovieAutoCompleter.java
.
Running the AutoCompleter
If all your D
tests pass, you should be able to run the “Movie Autocomplete” target of the project. A browser window should automatically open with the movie finder interface.
The algorithm you just wrote backs the search for completions. We recommend trying a query like “the “ which we expect to have lots of results. You will likely notice that the interface
is noticably slow! This is all because of the data structure we chose. In the next part of the project, we will write a new data structure, called a trie, which is more suited to this
type of application.
Trie
Requirements
A Trie
is a dictionary which maps “strings” to some type. Unlike other trees that we’ve seen previously, each node only contains a single “letter”.
For example, here is a trie that represents the words {adam, add, app, bad, bag, bags, beds, bee, cab}
:
You should be familiar with using a HashMap
and a TreeMap
at this point. So, we’ll start with a comparison to those.
Comparing ITrieMap
to HashMap
and TreeMap
It helps to compare it with dictionaries you’ve already seen: HashMap
and TreeMap
. Each of these types of maps takes
two generic parameters K
(for the “key” type) and V
(for the “value” type). It is important to understand that
Trie
s are NOT a general purpose data structure. There is an extra restriction on the “key” type; namely, it must be made
up of characters (it’s tempting to think of the Character
type here, but really, we mean any alphabet–char
s, alphabetic letters, bytes
, etc.).
Example of types that Trie
s are good for include: String
, byte[]
, List<E>
. Examples of types
that Trie
s cannot be used for include int
and Dictionary<E>
.
In our implementation of Trie
s, we encode this restriction in the generic type parameters:
-
A
: An “alphabet type”. For aString
, it would beCharacter
. For abyte[]
, it would beByte
. -
K
: A “key type”. We insist that all the “key types” extendIterable<A>
which encodes exactly the restriction that there is an underlying alphabet. -
V
: A “value type”. There are no special restrictions on this type.
Because the key type is iterable, there is a clean way to take apart keys (i.e., you can use a foreach loop), but we also need a way to put keys back together. To do this, we use Java’s anonymous functions.
For our purposes, what this means is that the constructor will look like this:
public ITrieMap(Function<IDeque<A>, K> collector) {
...
}
If you look at the collector function’s declaration, you will see it is listed as Function<IDeque<A>, K>
. Essentially, what this means is that it has the functionality
to take an IDeque
of “letters” in your alphabet and return a key that is essentially the letters concatenated together.
So if we have an IDeque
called acc
that contains ["h", "e", "l", "l", "o"]
, and
we called this.collector.apply(acc)
, it would return the string "hello"
.
You will find the collector very useful in your keys()
method.
And, eventually, when we create a trie to use for autocompleting, the instantiation will look like this:
private static ITrieMap<String, IDeque<String>, IDeque<String>> titles =
new TrieMap<>((IDeque<String> s) -> s);
TrieMap
A TrieMap
is an implementation of a trie where the “pointers” are made up of a HashMap
. An array would work as well, but you should think about why that might not be
a good idea if the alphabet size is large (like 5 or more). Consider the following ITrieMap
:
We could manually construct this as a TrieMap
as follows:
this.root = new TrieNode();
this.root.pointers.put('a', new TrieNode("hello"));
this.root.pointers.get('a').pointers.put('d', new TrieNode());
this.root.pointers.get('a').pointers.get('d').pointers.put('d', new TrieNode("1 + 1"));
this.root.pointers.get('a').pointers.put('p', new TrieNode());
this.root.pointers.get('a').pointers.get('p').pointers.put('p', new TrieNode("for a phone"));
Notice that the pointers
variables in each of the nodes are just standard HashMaps
!
You will implement all the standard IDictionary
operations. For each of these, read the comments for IDictionary
for the
particulars. There are two methods (getCompletions
and findPrefix
) which are special for ITrieMap
s and have additional information/restrictions:
-
isPrefix(K k)
should returntrue
iffk
is a prefix of some key in the trie. For example, if “add” were a key in the trie, then:
isPrefix("
”)
=isPrefix("
a”)
=isPrefix("
ad”)
=isPrefix("
add”)
=true
.
This method is arguably one of the major reasons to use aITrieMap
over another implementation ofDictionary
. (We saw a similar trade-off betweenHashMap
(faster) andTreeMap
(ordered) a few weeks ago.) Unlike in a normalIDictionary
, it is possible (and in fact, easy) to implement this method. -
getCompletions(K prefix)
should return anIDeque
filled with the values of all descendants of the prefix. For example, in the example above:
getCompletions("")
=getCompletions("a")
=["hello", "1 + 1", "for a phone"]
getCompletions
, keys
, and values
are the hardest methods to write here. You will likely want to use an “accumulator” argument
which “collects” the completions/keys/values as you recurse down the tree. The only place you will need to use the “collector” function is in keys
.
Your implementations of get
, put
, isPrefix
, and containsKey
must have time complexity \(\Theta(d)\) where \(d\) is the number
of letters in the key argument of these methods. These methods work on the entire key (the whole “string” of “letters”); make sure to only remove/add/find
the exact key asked for. You will also have to implement other methods like keys
and values
, but the runtime of these methods may be worse.
Notably, we are not implementing the remove
method right now–it’s not needed for the application; so, we’ll implement it later in the A
tests.
Task 1.
Just like last time, the interface files will be very important. Carefully read the interfaces/IDictionary
interface.
Task 2.
Implement the put
, size
, and clear
methods in TrieMap
.
size
method counting?
The size function should return the number of key/value pairs, not the number of nodes in the trie (which might be much larger)!!!!
Task 3.
Implement the get
, containsKey
, and isPrefix
methods.
Task 4.
Implement the values
, containsValue
, keys
, and iterator
methods. The iterator
should go over the keys
.
Task 5.
Implement the getCompletions
method.
TrieMovieAutoCompleter
Requirements
Next, we’ll implement an autocompleter that uses our trie. The specification is identical to the D
part, except the main dictionary will be a trie which maps suffixes to
IDeques
of titles instead of a HashMap
that goes in the opposite direction. Furthermore, you will find the getCompletions
method you wrote to be very useful
when implementing complete()
.
Task 6.
Implement the populateTitles
and complete
methods in TrieMovieAutoCompleter.java
.
After you’ve implemented this, go to Main.java
and swap out line 28. You should see a significant speed-up from the HashMap
version!
TrieMap
remove()
Finally, to wrap things up, go ahead and implement the remove
method for your trie. remove(K k)
should delete k
from the trie
as well as all of the nodes that become unnecessary. One implementation of remove
(called lazy deletion) would be to find
k
in the map and set its value to null
(since null
is not a valid value in the map). You may
not implement remove
as lazy deletion. Instead, you must ensure that all leaves of your trie have values.
remove
MUST be written recursively to recieve any credit.
Task 7.
Implement the remove
method.