CS 2 (Winter 2021) Project 06: BeaverChat

This project combines the data structures we’ve written over the last few weeks with a heap and some new algorithms to implement spellcheck, autocomplete, and word suggestion.

Goals and Outcomes

In this project, you will implement the algorithms behind word suggestion, autocomplete, and spell check for your own chat client.

By the end of this project, you will…

Overview

You will be using the data structures you have implemented so far along with a heap to drive word suggestion, spelling correction, and autocompletion. These algorithms are very similar to the ones smartphones use for these problems, and you will see that they do relatively well.

Connection to Lecture

This project covers Lecture 14 (heaps) and Lecture 16 (memoization). You will likely find the code from these lectures very useful in this project.

Always Forward, Never Backward

This project is the culmination of the sequence of projects we’ve done in the past few weeks. As such, it relies on these projects (mostly) working. In particular, we now absolutely expect you to have two working IDeque implementations (ArrayDeque and LinkedDeque). If your TrieMap doesn’t work, autocomplete won’t work, but that’s okay.

However, ChainingHashTable and MoveToFrontDictionary will be necessary in all the parts this week. If you have not completed ChainingHashTable, you should do that first. Once you get it working, we will give you half credit for the B tests from the previous week. Please contact Adam to make sure you get this credit.

Finally, you will need to port forward your completed NGram class from the previous week as well.

Implementing an IPriorityQueue: MinFourHeap

Your MinFourHeap should be an implementation of the heap data structure we’ve discussed in class. It should be an array-based implementation which starts at index 0. Unlike the implementation discussed in lecture, it should be a four-heap (not a two-heap). In other words, each node should have four children, not two. All operations should have the efficiencies we’ve discussed in class.
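For reference, here is the index arithmetic for a 0-indexed four-heap (a small sketch; the helper names are ours, not required):

// For a four-heap stored in an array with the root at index 0:
private static int parent(int i) {
    return (i - 1) / 4;        // e.g., the parent of indices 1 through 4 is 0
}

private static int child(int i, int c) {
    return 4 * i + c + 1;      // c ranges over 0..3, selecting one of the four children
}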

All elements of your MinFourHeap will be of type PQElement which has a data field and a priority field. The priority field is how your MinFourHeap should order elements, but the data field is how you should test equality of elements.

We are asking you to implement decreaseKey and increaseKey, which operate on a key but need access to its index in the array. This means one of your fields should be an IDictionary that maps keys to their current indices in the heap. You do not, however, have to implement an iterator for your MinFourHeap.
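To see how that dictionary comes into play, here is a minimal sketch of decreaseKey, assuming fields named data, size, and keyToIndex and a percolateUp helper (all of these names are our own placeholders, not required):

// Assumed fields (names are illustrative):
//   PQElement<E>[] data;                 // the heap, rooted at index 0
//   int size;
//   IDictionary<E, Integer> keyToIndex;  // element data -> current index
public void decreaseKey(PQElement<E> key) {
    Integer idx = keyToIndex.get(key.data);
    if (idx == null) {
        throw new IllegalArgumentException("key is not in the heap");
    }
    data[idx] = key;       // overwrite with the new, smaller priority
    percolateUp(idx);      // restore the heap property; percolateUp must
                           // update keyToIndex on every swap it performs
}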

TopKSort

Now that we have a working heap, we can implement TopKSort. The naive way to find the top \(k\) suggestions using a heap would be to enqueue all of the unsorted elements into the heap, dequeue them all, and then take the top \(k\) elements from that result. This runs in \(O(n\log n)\) time, where \(n\) is the number of elements to sort, because it ignores the fact that we only need the top \(k\) results. With some very slight modifications, you should be able to get this down to \(O(n\log k)\). As a hint, use a heap, but never put more than \(k\) elements into it. Implement this algorithm in sorts.TopKSort.

public static void sort(PQElement[] array, int K)

Sorts the provided array by priority, places the top \(k\) items in the first \(k\) spots, and sets the remaining elements to null.
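Here is one possible shape of the \(O(n\log k)\) approach, assuming your IPriorityQueue exposes enqueue, dequeue, peek, and size (a sketch, not the required implementation; it also assumes the data values in the array are distinct, since the heap tests equality on data):

public static void sort(PQElement[] array, int K) {
    int k = Math.min(K, array.length);
    IPriorityQueue heap = new MinFourHeap();        // min-heap on priority
    for (PQElement e : array) {
        if (heap.size() < k) {
            heap.enqueue(e);                        // fill the heap up to k elements
        } else if (k > 0 && e.priority > heap.peek().priority) {
            heap.dequeue();                         // evict the smallest of the current top k
            heap.enqueue(e);
        }
    }
    // Dequeueing yields the top k in increasing order; write them backwards
    // so the largest priority ends up at index 0.
    for (int i = k - 1; i >= 0; i--) {
        array[i] = (PQElement) heap.dequeue();
    }
    for (int i = k; i < array.length; i++) {
        array[i] = null;                            // clear everything past the top k
    }
}

Each element is enqueued and dequeued at most once on a heap that never exceeds \(k\) elements, which is where the \(O(n\log k)\) bound comes from.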

NGramMap

Ultimately, our goal is to implement three features that you (probably) use every day on your phone: autocorrect, autocomplete, and word suggestion. Here’s a quick description of what these actually do:

- Word suggestion predicts the most likely next word from the words you have typed so far.
- Autocomplete suggests complete words that start with the prefix you have typed so far.
- Autocorrect (spelling correction) suggests likely replacements for a misspelled word.

Since two of these algorithms rely on NGramMap, we’ll begin by describing what it does and what’s left for you to implement. More concretely, NGramMap maps NGrams to IDictionarys that map words to counts of how many times that word occurs immediately after the NGram. Consider the following “input file”:

Sample Input File

Not in a box. Not with a fox. Not in a house. Not with a mouse.

The key set of the outer map will contain all of the \(2\)-grams in the file. That is, it will be

{". not", "a box", "a fox", "a house", "a mouse", "box .", "fox .", "house .", "in a", "not in", "not with", "with a"}

Notice that the input is “standardized” by splitting words from symbols and converting all input to lowercase. Also, note that “mouse .” does not appear in the outer map; the reason for this is that there is nothing after it to include!

Each “top level” key maps to another dictionary whose keys are the possible words following that \(n\)-gram. So, for example, the keys of the dictionary that “with a” maps to are {"mouse", "fox"}, because “mouse” and “fox” are the only two words that follow the \(2\)-gram “with a” in the original text.

Finally, the values of the inner dictionary are a count of how many times that word followed that \(n\)-gram. So, for example, the \(2\)-gram “. not” maps to {in=1, with=2}, because “in” follows “. not” once and “with” follows it twice in the text.

The entire output looks like:

". not"={in=1, with=2}, "a box"={.=1}, "a fox"={.=1}, "a house"{.=1}, "a mouse"={.=1}, "box ."={not=1}, "fox ."={not=1}, "house ."={not=1}, "in a"={box=1, house=1}, "not in"={a=2}, "not with"={a=2}, "with a"={fox=1}, "with a"={mouse=1}

The order of the entries does not matter (remember, dictionaries are not ordered), but the contents do.
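To make the construction concrete, the counting step amounts to something like the following (a sketch: updateCount is a hypothetical helper, outer and newInner are our names for the constructor arguments, and the real class keys words by IterableString as described below, which we simplify to String here):

// Record one occurrence of `word` immediately after `ngram`.
private void updateCount(NGram ngram, String word) {
    IDictionary<String, Integer> counts = outer.get(ngram);
    if (counts == null) {
        counts = newInner.get();      // build a fresh inner dictionary,
        outer.put(ngram, counts);     // e.g., a new MoveToFrontDictionary
    }
    Integer old = counts.get(word);
    counts.put(word, old == null ? 1 : old + 1);
}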

NGramMap supports using different “outer” and “inner” IDictionary types. (The outer type maps NGrams to the inner dictionaries; the inner type maps words to counts.) To make this easier, NGramMap takes an outer map and an “initializer” in its constructor representing these types. For example, to use outer = ChainingHashDictionary and inner = MoveToFrontDictionary, we would write:

new NGramMap<>(
    new ChainingHashDictionary<>(MoveToFrontDictionary::new),
    MoveToFrontDictionary::new
)

One more important implementation detail is that instead of using type String for the words, we use type IterableString. The reason for this should be clear: we’d like to use TrieMap if possible!

Now that you know what NGramMap is supposed to do, implement the following two methods:

public PQElement<String>[] getCountsAfter(NGram ngram)

Returns an array of PQElements representing words (in the data field) and the number of times each word was seen after ngram (in the priority field). There is no guarantee on the ordering of the array.

public String[] getWordsAfter(NGram ngram, int k)

Returns an array of the \(k\) most likely words to follow ngram.

You should use your TopKSort somewhere in this method.
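Putting these together, getWordsAfter can lean directly on getCountsAfter and your TopKSort. Roughly (a sketch; exact null and bounds handling is up to you):

public String[] getWordsAfter(NGram ngram, int k) {
    PQElement<String>[] counts = getCountsAfter(ngram);   // unsorted (word, count) pairs
    TopKSort.sort(counts, k);                             // top k now occupy the front
    int n = Math.min(k, counts.length);
    String[] words = new String[n];
    for (int i = 0; i < n; i++) {
        words[i] = counts[i].data;                        // keep just the words
    }
    return words;
}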

Spelling Correction

Now that you’ve finished the NGramMap, you have all of the necessary data structures to implement spell check. All that remains is implementing the spelling correction algorithm itself. Here’s a high level overview of what the algorithm does:

Inputs: \(c_w\), the \(n\)-gram preceding the misspelled word, and \(w\), the misspelled word.

We realize this algorithm is non-trivial, but we have already implemented most of it for you; the only portion you need to complete is the edit distance computation. We describe the edit distance problem and algorithm below.

Edit Distance

Let \(a\) and \(b\) be two strings of lengths \(m\) and \(n\) respectively. What is the smallest number of edits needed to transform \(a\) into \(b\)?

We define an “edit” as one of three operations:

- inserting a single character,
- deleting a single character, or
- substituting one character for another.

Let \(d(i, j)\) be the edit distance between the first \(i\) characters of \(a\) and the first \(j\) characters of \(b\). Then, we are looking for \(d(m, n)\). As we did in lecture, we can transform this problem into a recurrence based on the possible cases:

\[\begin{align*} d(i, 0) &= i \\ d(0, j) &= j \\ d(i, j) &= \begin{cases} d(i-1, j-1) & a_i = b_j \\ \min \begin{cases} d(i-1, j) + 1 \\ d(i, j-1) + 1 \\ d(i-1, j-1) + 1 \end{cases} & a_i \neq b_j \end{cases} \end{align*}\]

In the first case, we are trying to compute the edit distance between the first \(i\) characters of \(a\) and an empty string; the fastest way to transform between these is to just delete all \(i\) characters. The second case is analogous.

In the last case, we want to transform the first \(i\) characters of \(a\) into the first \(j\) characters of \(b\). If the last characters match (\(a_i = b_j\)), they cost nothing, and the answer is the edit distance between the remaining prefixes, \(d(i-1, j-1)\). Otherwise, we pay one edit and take the cheapest of three options: deleting \(a_i\) (leaving \(d(i-1, j)\)), inserting \(b_j\) (leaving \(d(i, j-1)\)), or substituting \(a_i\) with \(b_j\) (leaving \(d(i-1, j-1)\)). For example, the edit distance between “kitten” and “sitting” is 3: substitute “k” with “s”, substitute “e” with “i”, and insert “g”.

N.B.: Naively implemented, this algorithm is exponential. Each recursive call not involving an empty string makes three further recursive calls, so the runtime would be an astoundingly horrible \(O(3^n)\). However, using memoization, we can vastly improve this runtime. Many of these subproblems overlap: there are many ways to arrive at the same pair of prefixes of \(a\) and \(b\), and the naive implementation recomputes the edit distance each time. By storing these edit distances in a 2D array, you should be able to avoid this redundant work.

public static int editDistance(String a, String b)

Finds the edit distance between the two strings. This should be done recursively and with memoization.
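A minimal memoized implementation of the recurrence might look like this (a sketch; we use Integer[][] so that null marks subproblems that have not been computed yet):

public static int editDistance(String a, String b) {
    Integer[][] memo = new Integer[a.length() + 1][b.length() + 1];
    return editDistance(a, b, a.length(), b.length(), memo);
}

private static int editDistance(String a, String b, int i, int j, Integer[][] memo) {
    if (i == 0) return j;                        // insert all j remaining characters
    if (j == 0) return i;                        // delete all i remaining characters
    if (memo[i][j] != null) return memo[i][j];   // reuse a previously computed answer
    int result;
    if (a.charAt(i - 1) == b.charAt(j - 1)) {
        result = editDistance(a, b, i - 1, j - 1, memo);   // last characters match: free
    } else {
        result = 1 + Math.min(editDistance(a, b, i - 1, j, memo),      // delete
                     Math.min(editDistance(a, b, i, j - 1, memo),      // insert
                              editDistance(a, b, i - 1, j - 1, memo))); // substitute
    }
    memo[i][j] = result;
    return result;
}

With memoization, each of the \((m+1)(n+1)\) subproblems is computed at most once, giving an \(O(mn)\) runtime.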

This is the moment you didn’t know you were waiting for. At this point, you have completed word suggestion and spelling correction. Autocomplete is achieved using the TrieMap you implemented earlier. You can run Chat.java to see an alpha version (!!!) of BeaverChat that uses all of these features.

OpenEnded Questions

Submit your responses to the open-ended questions for credit.