All projects from this one forward will be partner projects. Yay! You may discuss high-level ideas with other groups, but any viewing, sharing, copying, or dictating of code is forbidden. Dean’s Tutors may not read your code.
Goals and Outcomes
In this project, you will implement the algorithms behind word suggestion, autocomplete, and spell check for your own chat client.
By the end of this project, you will…
- have implemented a heap
- have implemented a dynamic programming algorithm
- understand how simple data structures can be used in combination to accomplish fairly complex tasks
Overview
You will be using the data structures you have implemented so far along with a heap to drive word suggestion, spelling correction, and autocompletion. These algorithms are very similar to the ones smartphones use for these features, and you will see that they perform reasonably well.
Connection to Lecture
This project covers lecture14 (heaps) and lecture16 (memoization). You will likely find the code from these lectures to be very useful in this project.
Always Forward, Never Backward
This project is the culmination of the sequence of projects we’ve done in the past few weeks. As such, it relies on those projects (mostly) working. In particular, we now absolutely expect you to have two working IDeque implementations (ArrayDeque and LinkedDeque). If your TrieMap doesn’t work, autocomplete won’t work, but that’s okay. However, ChainingHashTable and MoveToFrontDictionary will be necessary in all of this week’s parts. If you have not completed ChainingHashTable, you should do that first. Once you get it working, we will give you half credit for the B tests from the previous week. Please contact Adam to make sure you get this credit.
Finally, you will need to port forward your completed NGram class from the previous week as well.
Implementing an IPriorityQueue: MinFourHeap
Your MinFourHeap should be an implementation of the heap data structure we’ve discussed in class. It should be an array-based implementation that starts at index 0. Unlike the implementation discussed in lecture, it should be a four-heap (not a two-heap); in other words, each node should have four children, not two. All operations should have the efficiencies we’ve discussed in class.
All elements of your MinFourHeap will be of type PQElement, which has a data field and a priority field. The priority field is how your MinFourHeap should order elements, but the data field is how you should test equality of elements.
We are asking you to implement decreaseKey and increaseKey, which operate on a key but need access to its index in the heap. This means one of your fields should be an IDictionary that maps keys to their current indices in the heap. You do not, however, have to implement an iterator for your MinFourHeap.
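Before diving in, it may help to see the array layout concretely. The following is an illustrative sketch (not required code) of the index arithmetic for a 0-indexed four-heap:

```java
// In a four-heap stored in an array starting at index 0, the children of
// node i live at indices 4i+1 through 4i+4, and the parent of node i
// lives at (i-1)/4 (integer division).
private static int parent(int i) {
    return (i - 1) / 4;
}

private static int child(int i, int c) {
    // c ranges over 0..3 to select one of the four children.
    return 4 * i + c + 1;
}
```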
Task 0. Implement MinFourHeap as described above using the array method from class.
TopKSort
Now that we have a working heap, we can implement TopKSort. The naive way to sort the top \(k\) suggestions using a heap would be to enqueue all of the unsorted elements into the heap, dequeue them all, and then take the top \(k\) elements from that result. But this algorithm ignores the fact that we only need the top \(k\) results. The runtime is therefore \(O(n\log n)\), where \(n\) is the number of elements to sort. With some very slight modifications, you should be able to get this down to \(O(n\log k)\). As a hint, use a heap, but never put more than \(k\) elements into it. Implement this algorithm in sorts.TopKSort.
public static void sort(PQElement[] array, int K)
Sorts the provided array by the priorities, places the top k items in the first k spots, and sets the remaining elements to null.
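To make the hint concrete, here is one possible shape of the algorithm. This is a sketch, not the only correct solution; it assumes MinFourHeap exposes enqueue, dequeue, peek, and size, and that PQElement’s priority is a public numeric field:

```java
// Sketch of the O(n log k) strategy. Maintain a min-heap of at most K
// elements; it always holds the K largest priorities seen so far.
public static void sort(PQElement[] array, int K) {
    MinFourHeap heap = new MinFourHeap();
    for (PQElement element : array) {
        if (heap.size() < K) {
            heap.enqueue(element);
        } else if (K > 0 && element.priority > heap.peek().priority) {
            // element beats the smallest of the current top K; swap it in.
            heap.dequeue();
            heap.enqueue(element);
        }
    }
    // The heap dequeues smallest-first, so fill slots K-1 down to 0 and
    // null out everything past the first K.
    for (int i = array.length - 1; i >= 0; i--) {
        array[i] = (i < K && heap.size() > 0) ? heap.dequeue() : null;
    }
}
```

Since the heap never holds more than \(k\) elements, each heap operation costs \(O(\log k)\), giving \(O(n\log k)\) overall.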
Task 1. Implement TopKSort as described above in the sort method.
NGramMap
Ultimately, our goal is to implement three features that you (probably) use every day on your phone: autocorrect, autocomplete, and word suggestion. Here’s a quick description of what these actually do:
- Autocorrect fixes misspellings in words. The interesting part is that it’s context sensitive, which means it uses what you have previously typed rather than just randomly choosing a replacement word. In this part, you will finish an implementation of NGramMap, which backs the context; in the Spelling Correction part below, you will implement the core algorithm behind this, which is called editDistance.
- Autocomplete suggests and finishes a word as you’re typing it. You’ve already implemented the fundamental data structure you need for this in project04: the trie.
- Word Suggestion provides several choices for a next word you might want to type (they show up above the keyboard on Android and iOS). This algorithm is also context sensitive (obviously) and will rely on your NGramMap.
Since two of these algorithms rely on NGramMap, we’ll begin by describing what it does and what’s left for you to implement.
More concretely, NGramMap maps NGrams to IDictionarys that map words to counts of how many times that word occurs immediately after the NGram. Consider the following “input file”:
Sample Input File
Not in a box. Not with a fox. Not in a house. Not with a mouse.
The key set of the outer map will contain all of the \(2\)-grams in the file. That is, it will be
{". not", "a box", "a fox", "a house", "a mouse", "box .", "fox .", "house .", "in a", "not in", "not with", "with a"}
Notice that the input is “standardized” by splitting words from symbols and converting all input to lowercase. Also, note that “mouse .” does not appear in the outer map; the reason for this is that there is nothing after it to include!
The “top level” maps to another dictionary whose keys are the possible words following that \(n\)-gram. So, for example, the keys of the dictionary that “with a” maps to are {"mouse", "fox"}, because “mouse” and “fox” are the only two words that follow the \(2\)-gram “with a” in the original text.
Finally, the values of the inner dictionary are a count of how many times that word followed that \(n\)-gram. So, for example, we have:
- "not in"={a=2}, because “a” follows “not in” twice
- "with a"={mouse=1, fox=1}, because “mouse” and “fox” each appear only once after “with a”
The entire output looks like:
". not"={in=1, with=2}, "a box"={.=1}, "a fox"={.=1}, "a house"={.=1}, "a mouse"={.=1}, "box ."={not=1}, "fox ."={not=1}, "house ."={not=1}, "in a"={box=1, house=1}, "not in"={a=2}, "not with"={a=2}, "with a"={mouse=1, fox=1}
The order of the entries does not matter (remember, dictionaries are not ordered), but the contents do.
NGramMap supports using different “outer” and “inner” IDictionary types. (The outer type is the map from NGrams to words; the inner type is the map from words to counts.) To make this easier, NGramMap takes an outer map and an “initializer” in its constructor representing these types. For example, to use outer = ChainingHashDictionary and inner = MoveToFrontDictionary, we would write:
new NGramMap<>(
new ChainingHashDictionary<>(MoveToFrontDictionary::new),
MoveToFrontDictionary::new
)
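As an aside, MoveToFrontDictionary::new here is a Java method reference; the NGramMap constructor can store it (for instance, as a java.util.function.Supplier) and call it whenever it needs a fresh inner dictionary for a previously unseen \(n\)-gram. Below is a hedged sketch of that pattern; the class and member names are illustrative, not the actual NGramMap internals:

```java
import java.util.function.Supplier;

// Illustrative only: shows how a stored "initializer" can produce a new
// inner dictionary on demand. Not the real NGramMap implementation.
class InitializerPatternDemo<K, V> {
    private final Supplier<IDictionary<K, V>> newInner;

    InitializerPatternDemo(Supplier<IDictionary<K, V>> newInner) {
        this.newInner = newInner;
    }

    // Called the first time an n-gram is seen, to create its inner map.
    IDictionary<K, V> freshInner() {
        return newInner.get();
    }
}
```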
One more important implementation detail is that instead of using type String for the words, we use type IterableString. The reason for this should be clear: we’d like to use TrieMap if possible!
Now that you know what NGramMap is supposed to do, implement the following two methods:
public static PQElement<String>[] getCountsAfter(NGram ngram)
Returns an array of PQElements representing words (in the data field) and the number of times each word was seen after ngram (in the priority field). There is no guarantee on the ordering of the array.
public static String[] getWordsAfter(NGram ngram, int k)
Returns an array of the \(k\) most likely words to follow ngram. You should use your TopKSort somewhere in this method.
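As a hint for how these two methods fit together, here is a sketch of getWordsAfter under some assumptions: it is written as an instance method, and it relies on getCountsAfter and TopKSort.sort behaving exactly as specified above.

```java
// Sketch only; your actual signature should match the project's spec.
public String[] getWordsAfter(NGram ngram, int k) {
    PQElement<String>[] counts = getCountsAfter(ngram);
    TopKSort.sort(counts, k);   // top k now occupy slots 0..k-1, rest are null
    String[] words = new String[Math.min(k, counts.length)];
    for (int i = 0; i < words.length && counts[i] != null; i++) {
        words[i] = counts[i].data;
    }
    return words;
}
```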
Task 2. Implement getCountsAfter and getWordsAfter as described above.
Spelling Correction
Now that you’ve finished the NGramMap, you have all of the necessary data structures to implement spell check. All that remains is implementing the spelling correction algorithm itself. Here’s a high-level overview of what the algorithm does (a code sketch follows the list):
Inputs: \(c_w\), the \(n\)-gram preceding the misspelled word; \(w\), the misspelled word.
- Initialize a string \(x\) to null. \(x\) will eventually hold the best possible correction.
- Use the word suggestor to generate words that could follow \(c_w\). For each suggestion \(s\):
- Compute the edit distance between \(s\) and \(w\).
- If the edit distance is both less than 2 and less than the edit distance between \(x\) and \(w\), then replace \(x\) with \(s\).
- If \(x\) is still null, we haven’t found a suitable correction. In this case:
- Generate a list of all valid words that are edit distance 2 away from \(w\)
- Select the word in this list that has the smallest edit distance from \(w\) to be \(x\)
- Return \(x\).
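Here is the promised sketch of these steps in code. This control flow is already provided for you in the project; the helpers suggestionsFor and validWordsWithin2 are hypothetical stand-ins for the provided machinery, shown only to make the steps concrete.

```java
// Paraphrase of the algorithm above; nothing here is the project's real API.
static String correct(NGram cw, String w) {
    String best = null;                 // x in the description above
    int bestDist = Integer.MAX_VALUE;
    // Try each context-sensitive suggestion, keeping the closest one
    // that is within edit distance 2 of the misspelled word.
    for (String s : suggestionsFor(cw)) {
        int dist = editDistance(s, w);
        if (dist < 2 && dist < bestDist) {
            best = s;
            bestDist = dist;
        }
    }
    // No suitable suggestion: fall back to valid words at edit
    // distance 2 from w and keep the closest one.
    if (best == null) {
        for (String s : validWordsWithin2(w)) {
            int dist = editDistance(s, w);
            if (dist < bestDist) {
                best = s;
                bestDist = dist;
            }
        }
    }
    return best;
}

// Hypothetical stand-ins for machinery the project already provides.
static java.util.List<String> suggestionsFor(NGram cw) { throw new UnsupportedOperationException(); }
static java.util.List<String> validWordsWithin2(String w) { throw new UnsupportedOperationException(); }
```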
We realize this algorithm is non-trivial, but we have already implemented a lot of it for you; the only portion you need to complete is the edit distance computation. We describe the edit distance problem and algorithm below.
Edit Distance
Let \(a\) and \(b\) be two strings of lengths \(m\) and \(n\) respectively. What is the smallest number of edits needed to transform \(a\) into \(b\)?
We define an “edit” as one of three operations:
- insertion: we insert a character into the string. (Ex: apple \(\rightarrow\) applde)
- deletion: we delete a character from the string. (Ex: apple \(\rightarrow\) appe)
- substitution: we remove a character from the string and substitute a different one in its place. (Ex: apple \(\rightarrow\) aprle)
Let \(d(i, j)\) be the edit distance between the first \(i\) characters of \(a\) and the first \(j\) characters of \(b\). Then, we are looking for \(d(m, n)\). Like we did in lecture, we can transform this problem into a recurrence based on various cases:
\[\begin{align*} d(i, 0) &= i \\ d(0, j) &= j \\ d(i, j) &= \begin{cases} d(i-1, j-1) & \text{if } a_i = b_j \\ \min \begin{cases} d(i-1, j) + 1 \\ d(i, j-1) + 1 \\ d(i-1, j-1) + 1 \end{cases} & \text{if } a_i \neq b_j \end{cases} \end{align*}\]
In the first case, we are trying to compute the edit distance between the first \(i\) characters of \(a\) and an empty string; the fastest way to transform between these is to just delete all \(i\) characters. The second case is analogous.
In the last case, we want to transform the first \(i\) characters of \(a\) into the first \(j\) characters of \(b\).
- If the last letters of both substrings are the same, then we can just lop off the last character of each and recurse on what remains.
- If the last letters of the substrings are different, then the edit distance is equal to the minimum of an additional insertion, deletion, or substitution performed after the recursive edit distance is computed.
N.B.: Naively implemented, this algorithm is exponential. Each recursive call not involving an empty string makes three recursive calls, so the run time would be an astoundingly horrible \(O(3^n)\). However, using memoization, we can vastly improve this runtime. This is because many of these subproblems overlap: there are many ways to arrive at the same pair of substrings of \(a\) and \(b\), and our naive implementation will recompute the edit distance each time. By storing these edit distances in a 2D array, you should be able to avoid this unnecessary redundancy.
public static int editDistance(String a, String b)
Finds the edit distance between the two strings. This should be done recursively and with memoization.
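One way to structure this is sketched below; it follows the recurrence directly, and the private helper and its memo table are our own scaffolding rather than required structure. Note that \(a_i\) in the recurrence corresponds to a.charAt(i - 1) in Java.

```java
// memo[i][j] caches d(i, j); -1 marks "not yet computed".
public static int editDistance(String a, String b) {
    int[][] memo = new int[a.length() + 1][b.length() + 1];
    for (int[] row : memo) {
        java.util.Arrays.fill(row, -1);
    }
    return editDistance(a, b, a.length(), b.length(), memo);
}

private static int editDistance(String a, String b, int i, int j, int[][] memo) {
    if (i == 0) return j;   // d(0, j) = j
    if (j == 0) return i;   // d(i, 0) = i
    if (memo[i][j] != -1) return memo[i][j];
    int result;
    if (a.charAt(i - 1) == b.charAt(j - 1)) {
        // Last characters match: no edit needed for them.
        result = editDistance(a, b, i - 1, j - 1, memo);
    } else {
        int insert = editDistance(a, b, i, j - 1, memo) + 1;
        int delete = editDistance(a, b, i - 1, j, memo) + 1;
        int substitute = editDistance(a, b, i - 1, j - 1, memo) + 1;
        result = Math.min(insert, Math.min(delete, substitute));
    }
    memo[i][j] = result;
    return result;
}
```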
Task 3. Implement editDistance as described above.
This is the moment you didn’t know you were waiting for. At this point, you have completed word suggestion and spelling correction. Autocomplete is achieved using the TrieMap you implemented earlier. You can run Chat.java to see an alpha version (!!!) of BeaverChat that uses all of these features.
OpenEnded Questions
Task 4. Answer all the OpenEnded questions below this task. These will be graded entirely on completion/effort, but they will really help us improve the course. Then, click the “Submit OpenEndeds” button below.