This will be the last solo project in CS 2. The reason this project is solo is because the material in this project is the most important part of CS 2. You may discuss high-level ideas with other students, but any viewing, sharing, copying, or dictating of code is forbidden. Dean’s Tutors may not read your code.
Goals and Outcomes
In this project, you will build a Markov Model using various data structures.
By the end of this project, you will…
- have implemented a hashtable and binary search tree
- have written a word suggester which predicts what the next word might be given some number of previous words
Connection to Lecture
This project covers lecture08 (trees) and lecture12/lecture13 (hash tables). You will likely find the code from these lectures to be very useful in this project.
Overview and Implementation Strategy
This week, the test “grade-levels” are a little weird. Instead of having what would normally be C-B-A, we have B-A-A+. You should think of the A+ level as “extra credit”. The B tests are incredibly important to the learning outcomes of the course which is why they’re rated “B” instead of “C”.
In previous projects, you wrote several “list-like” data structures (the IDeque
s) and one IDictionary
implementation (the trie).
This time, you will continue our mission of implementing all the data structures ourselves by
writing several more IDictionary
implementations. This time, we will use the data structures to back Markov text generation, which
“mimics” the style of a provided corpus.
Just to be super clear: you may not use ANY data structures in java.util
in this project. The whole point is to implement
your own versions of the fundamental data structures.
Looking Back to Move Forward
This time, you should copy forward at least one IDeque
implementation and your TrieMap
(if it passed the C tests or better in project04).
You should place these data structures in the edu.caltech.cs2.datastructures
package.
In the case where you do not have any working IDeque
implementations at this point, contact Adam immediately.
NGram
: A client of your data structures
Data structures store…data. Different data structures require the presence of different operations.
- A dictionary made out of a list requires equality testing on the key type.
- A dictionary made out of a hash table requires equality testing and hashcode calculation on the key type.
- A dictionary made out of a tree requires equality testing and comparison on the key type.
In Java, we implement these operations directly in the class that represents the key type. For this project, we will mostly use the NGram
type. An NGram
is a list of \(n\) words appearing in order in a text.
Before implementing your data structures, it is imperative that equality, hashcode, and comparisons all work. Otherwise, you will run into weird
errors with the tests that are not due to your data structures.
Task 0.
Implement the equals
method in NGram
. Make sure to compare the sizes of the NGram
s before bothering to loop through them.
If you look in IterableString
, you can see an example of a working equals
method. Note that, unlike the equals
method
in IterableString
, you can’t just delegate to the internal data structure (because your IQueue
implementations also don’t have an equals
method
implemented). You may not use toString
to implement equals
; we expect you to build it from scratch.
Your equals
function must not modify the elements in the data
field or their order.
Task 1.
Implement the hashCode
method in NGram
in the way discussed in class. DO NOT USE THE pow
FUNCTION! To make this efficient, you might want to use Horner’s Rule. Your hashCode
function must not modify the elements in the data
field or their order.
Task 2.
Implement the compareTo
method in NGram
. You may not use toString
to implement compareTo
; we expect you to build it from scratch.
Your compareTo
function must not modify the elements in the data
field or their order.
MoveToFrontDictionary
: Another Dictionary
In this part, you will implement MoveToFrontDictionary
, a new type of Dictionary
.
MoveToFrontDictionary
is a type of linked list where new items are inserted at the front of the list, and an existing item gets
moved to the front whenever it is referenced. Although it has \(\mathcal{O}(n)\) worst-case time operations, it has a very good amortized analysis.
We will not discuss this data structure in class. MoveToFrontDictionary
should only rely on equality testing of keys.
In the past, students have tried to use doubly-linked lists for this data structure. DO NOT DO THIS. It is unnecessary and makes debugging far harder than it needs to be.
Task 3.
Implement MoveToFrontDictionary
as described above.
ChainingHashDictionary
: Another Another Dictionary
In this part, you will implement ChainingHashDictionary
. Your hash table must use separate chaining–not probing. Furthermore, you must make the type of chain
generic. In particular, you should be able to use any dictionary implementation as the type inside the buckets. To do this, you should take in a Supplier<IDictionary<K, V>>
in the constructor and call chain.get()
whenever you want to create a chain.
Your ChainingHashDictionary
should rehash as appropriate (use an appropriate load factor as discussed in class), and its capacity should always be a prime number.
Your ChainingHashDictionary
should be able to work with the provided corpora which means there shouldn’t be a hard cap on how much it can grow; though, it doesn’t
have to use primes past 400000. Recall that all Hash Tables rely on a reasonable definition of hash function as well as equality testing.
At some point, you will want to test various types of chains in your ChainingHashDictionary
. It is confusing to do this initially; so, we have provided some examples
in the MarkovTextGenerator
class.
Task 4.
Implement ChainingHashDictionary
as described above.
BSTDictionary
: Another Another Another Dictionary
In this part, you will implement BSTDictionary
. put
, get
, and containsKey
should have average \(\mathcal{O}(\lg(n))\) behavior, as discussed in class.
remove
is pretty tricky. You should look at the slides we posted that describe how it should work.
remove
MUST be written recursively to recieve any credit.
Recall that all binary search trees rely on a reasonable definition of comparison, in Java, this is the compareTo
method.
Task 5.
Implement BSTDictionary
as described above.
AVLTreeDictionary
: Another Another Another Another Dictionary
Warning: This part is difficult, and it will count as “A+” tests (which will only add 3% to your final grade on the project).
In this part, you will implement the put
method of AVLTreeDictionary
. Most of the methods in AVLTreeDictionary
are inherited from BSTDictionary
, except
put
and remove
. We do not think remove
is instructive enough to ask you to write it though. Be careful to not duplicate code. Additionally, if your rotation
code is repetitive, you will not get credit for this part (even if the tests pass).
NGram
s and Generating Text
An NGram
is a list of \(n\) words appearing in order in a text. They are often used in textual analysis to see how frequent patterns
are. We have written a library which uses one of your IDictionary
s to generate text that sounds like the author of an original text.
You can check it out by running MarkovTextGenerator
. We recommend using the reddit.corpus
corpus which can be downloaded
here.
OpenEnded Questions
Task 6. Answer all the OpenEnded questions below this task. These will be graded entirely on completion/effort, but they will really help us improve the course. Then, click the “Submit OpenEndeds” button below.