CS 2 (Winter 2024) Project 05: Markov Text Generation

This project focuses on implementing various data structures (hashtable, BST) and algorithms. These will be used to create a Markov Model.

Goals and Outcomes

In this project, you will build a Markov Model using various data structures.

By the end of this project, you will have…

Connection to Lecture

This project covers lecture08 (trees) and lecture12/lecture13 (hash tables). You will likely find the code from these lectures to be very useful in this project.

Overview and Implementation Strategy

This week, the test “grade-levels” are a little weird. Instead of having what would normally be C-B-A, we have B-A. The B tests are incredibly important to the learning outcomes of the course which is why they’re rated “B” instead of “C”.

In previous projects, you wrote several “list-like” data structures (the IDeques) and one IDictionary implementation (the trie). This time, you will continue our mission of implementing all the data structures ourselves by writing several more IDictionary implementations. This time, we will use the data structures to back Markov text generation, which “mimics” the style of a provided corpus.

Looking Back to Move Forward

NGram: A client of your data structures

Data structures store…data. Different data structures require the presence of different operations.

In Java, we implement these operations directly in the class that represents the key type. For this project, we will mostly use the NGram type. An NGram is a list of \(n\) words appearing in order in a text. Before implementing your data structures, it is imperative that equality, hashcode, and comparisons all work. Otherwise, you will run into weird errors with the tests that are not due to your data structures.

What multiplier should I use for Horner’s method?

As discussed in class, 31 and 37 are good multiplier to use for Horner’s method because they’re both prime.

How should I represent each String when computing my hash function?

Hint: in Java, Strings have a built-in hashCode() function that you can call!

I’ve done the above, but my hashCode method still fails!

Our tests require that you multiply the first item in the sequence by 31 (or 37) to the nth power, and so on down to 0. Using the reverse order (e.g., 31^0*a + … 31^n*z, as at the above link) is technically correct, but not currently supported.

MoveToFrontDictionary: Another Dictionary

In this part, you will implement MoveToFrontDictionary, a new type of Dictionary.

MoveToFrontDictionary is a type of linked list where new items are inserted at the front of the list, and an existing item gets moved to the front whenever it is referenced (e.g., when get is called). Although it has \(\mathcal{O}(n)\) worst-case time operations, it has a very good amortized analysis. We will not discuss this data structure in class. MoveToFrontDictionary should only rely on equality testing of keys.

ChainingHashDictionary: Another Another Dictionary

In this part, you will implement ChainingHashDictionary. Your hash table must use separate chaining–not probing. Furthermore, you must make the type of chain generic. In particular, you should be able to use any dictionary implementation as the type inside the buckets. To do this, you should take in a Supplier<IDictionary<K, V>> in the constructor and call chain.get() whenever you want to create a chain.

What is this Supplier thing?

As discussed in class, the buckets of a hash table must use some data structure to deal with collisions. The “Supplier” is a constructor which indicates which data structure to use (e.g., MoveToFrontDictionary, BSTDictionary, etc.).

Your ChainingHashDictionary should rehash as appropriate (use an appropriate load factor as discussed in class), and its capacity should always be a prime number. Your ChainingHashDictionary should be able to work with the provided corpora which means there shouldn’t be a hard cap on how much it can grow; though, it doesn’t have to use primes past 500000. Recall that all Hash Tables rely on a reasonable definition of hash function as well as equality testing.

Should I calculate primes in my constructor?

You should not calculate primes in your constructor! That would be far too slow. Instead, hard-code a list of primes in a constant at the top of your class!

At some point, you will want to test various types of chains in your ChainingHashDictionary. It is confusing to do this initially; so, we have provided some examples in the MarkovTextGenerator class.

BSTDictionary: Another Another Another Dictionary

In this part, you will implement BSTDictionary. put, get, and containsKey should have average \(\mathcal{O}(\lg(n))\) behavior, as discussed in class.

remove is pretty tricky. You should look at the slides we posted that describe how it should work.

remove MUST be written recursively to recieve any credit.

Recall that all binary search trees rely on a reasonable definition of comparison, in Java, this is the compareTo method.

I’m getting a StackOverflow Error?

You’re likely using too many stack frames, consider condensing your compareTo() and equals() into a single compareTo().

NGrams and Generating Text

An NGram is a list of \(n\) words appearing in order in a text. They are often used in textual analysis to see how frequent patterns are. We have written a library which uses one of your IDictionarys to generate text that sounds like the author of an original text. You can check it out by running MarkovTextGenerator. We recommend using the reddit.corpus corpus which can be downloaded here.