Assignment 1: Hash Maps And Tweets (72 Points)
Chris Tralie
Table of Contents
Overview / Logistics
The purpose of this assignment is to implement one of the most important data structures in all of CS: a hash map, which is actually quite simply just an array of singly-linked lists. This will enable us to implement two abstract data types (ADTs): the Set
and the Map
(also known as a dictionary in python), which both have tons of applications. Before you proceed, review the notes at this link, which cover the background for hash tables.
Learning Objectives
- Implement a singly-linked list in python
- Implement hash tables using pythonic object oriented paradigms
- Use hash table to fulfill set and map ADTs, including exception handling
- Apply the dictionary API to natural language processing
Data / Starter Code
Click here to download the starter code for this assignment. Actually, the starter code is quite minimal; it mainly has a dataset you'll be using in the last task.
What To Submit
Submit your files wizard.py
and hash.py
to canvas, as well as any other .py files and notebooks you made. Also submit answers to the following questions on Canvas
- The name of your buddy, if you chose to work with one.
- Any other concerns that you have. For instance, if you have a bug that you were unable to solve but you made progress, write about it. The more you articulate the problem the more partial credit you will receive (fine to leave this blank)
Programming Tasks
To explain the data structures you'll be implementing, let's first look at python's built in versions of them. First, we'll examine a set, which is a unordered collection of objects without repetition
This prints
Notice how the numbers are not in the same order we put them in (since the set order is arbitrary), and that even though we added 2 twice, it only showed up once (no repetitions).
Actually, the set is quite flexible, and we can add anything to it, as long as it has a hash code. For example, we can add
and then we'll see
But then if we try to do
then we get
This is because a list can change, so it's impossible to generate a unique hash code for it, since one of the properties of hash codes is that they remain fixed, so they can only work with "immutable" objects.
Part 1: Hash Codes And Object Equality in Python (6 Points)
Let's now consider a more interesting class that's been provided with the assignment. I've created a Wizard
class in wizard.py
that mirrors the Harry Potter hashing class exercise. Let's run the following code from within some .py file or notebook in the same directory as wizard.py
:
but then this prints out
It seems like the set's not working, because it's not supposed to have more than one copy of the same object! But let's drill down and look at the hash codes. We can do this by calling the built-in method hash
in python:
When I run this, I get the following on my computer (but you will get something different every time you run this, which will become clear in a moment):
What on earth are these numbers, and how did python come up with them? Let's try one more thing. If we run the built-in id()
method, it will tell us the memory address of the object passed to it:
which yields the following
As it turns out, since we haven't told python how to make a hash code for a wizard, it ends up using the memory address // 16.
However, even so, there are some collisions here: the only unique hash codes are 8785000196838
and 8785000196229
. So why are we still seeing 5 unique objects in the set? Well, python doesn't actually know that they're equal; if we don't tell python how to compare objects, it will default to comparing them by their memory address. Since each of the 5 objects has a different memory address, they are considered unique.
Your Task
Write code in wizard.py
to override the default behavior of object equality and hashing for the Wizard
class:
-
Implement the
__hash__
method, which takes no parameters (other thanself
), and which returns the month*31 + day -
Implement the
__eq__
method, which takes one parameter (in addition toself
) which is object to which to compare it. ReturnTrue
if the name, month, day, and year are equal between this object and the other object, andFalse
otherwise.
If this works properly, the hash codes for all of the Snape wizards above should be 317, and when you run
It should print True
. Finally, you should only see one Snape wizard in the set if you add them in the loop.
Part 2: HashSet (24 Points)
Now that we've convinced ourselves that the built-in python set
class works properly, let's implement our own verison of a set from scratch using a hash table.
Your Task
Create a singly-linked list class and/or linked node class that wraps around a key
object, which is what will be stored in the set. Create a class HashSet
that aggregates and array of linked lists, and which has the following instance methods (that all have self
as a first parameter)
-
(4 Points)
__init__
: The construtor should take one parameter specifying the number of buckets to use, and you should initialize an array of empty linked lists with that number of buckets -
(4 Points)
__contains__
: A method that takes one parameter, thekey
, and which returnsTrue
if the key is in the linked list andFalse
otherwise. When you've finished implementing this method, then sayingShould print out
True
. Note the "syntactic sugar" with thein
keyword, which you can use to make your code look nicer. Under the hood, sayingx in y
in python is equivalent to callingy.__contains__(x)
-
(4 Points)
add
: Takes one parameter and adds that object to the set. You should compute the hash code of the object, find its appropriate bucket index, and add it if it doesn't already exist (we don't want duplicate objects). -
(4 Points)
remove(obj)
: Removeobj
from the hash table if it exists in the table, or throw aKeyError
if the object isn't in the table. You can do this with the following code: -
(4 Points)
__len__
: A method with no parameters (other thanself
) which returns how many elements are in the set. This should run in constant time; you should not have to loop through all of the linked lists to count. Instead, store a member variable that keeps track of the number of objects as they are added and deleted so you can simply return that member variable.Note also the syntactic sugar; if we say
len(s)
, this is equivalent to sayings.__len__()
-
(4 Points)
keys
: A method with no parameters that returns all of the elements in the set as a python list.
If you're not sure where to get started, have a look at the needle in a haystack notes, as well as some starter code for linked lists that we made in class.
Tips
The first thing you should do in the __contains__
, add
, and remove
methods is to compute the hash code for the object that's being passed in and %
it by the number of buckets to figure out which bucket you should start looking in. We will assume here that the objects passed along have implemented __hash__
, so that calling hash(obj)
will give the hash code. Python already has built-in hash codes for ints, strings, floats, etc.
If this is all working properly, then the following code:
should print something like this (note that the order is arbitrary):
Part 3: HashMap (17 Points)
Sometimes, we may not know everything about a wizard, but we do know partial information like their name. In this case, it's more appropriate to use a map ADT instead of a set ADT to look up more info from the partial info we have. In this ADT, they key is something hashable that we look up in the hash table, but it is glued to something called a value which is the associated thing we're trying to look up. Python calls its implementation of a map a dictionary (in Java it's a Map
, in C++ STL it's a map
, in PHP it's an called an "associative array", ...). Here's how we might set up some information in a dictionary:
after which we'll see the following output
Note that a python dictionary has a quite convenient syntax; it looks just like an array, but we index it with keys instead of 0-indexed integers. We will use built-in python methods to replicate this syntactic sugar in our implementation of HashMap
Your task
Create a class HashMap
which, like the HashSet
class, aggregates an array of linked lists. You'll just need to tweak the node class in your linked list to hold a value in addition to a key. For those who had CS 174 with me, it's similar to the last assignment I gave in that class. Below are the instance methods you'll need to implement:
-
(2 Points)
__init__
: The construtor should take one parameter specifying the number of buckets to use, and you should initialize an array of empty linked lists with that number of buckets -
(2 Points)
__contains__
: A method that takes one parameter, thekey
, and which returnsTrue
if the key is in the linked list andFalse
otherwise. This should be incredibly similar to the__contains__
method inHashSet
that you did before. -
(3 Points)
__setitem__
: A method that takes two parameters, thekey
and thevalue
that should be associated to that key.- If the key doesn't exist in the map, add the key/value pair.
- If the key already exists in the map, update its value to be the value that's being passed here
add
method inHashSet
that you did before. -
(3 Points)
__getitem__
: A method that takes one parameter, thekey
, and which returns the value associated to the key, or which raises aKeyError
if that key doesn't exist in the map. This should be fairly similar in structure to the__contains__
method -
(3 Points)
__delitem__
: A method that takes one parameter, thekey
, and which deletes the key/value pair associated to this key, or which raises aKeyError
if that key doesn't exist in the map. This should be fairly similar in structure to theremove
method in yourHashSet
class -
(2 Points)
__len__
: A method with no parameters (other thanself
) which returns how many elements are in the set. As in theHashSet
, this should run in constant time -
(2 Points)
keys
: A method with no parameters that returns all of the keys in the map as a python list.
Tips
If you replace the example above with m = HashMap(10)
instead of m = {}
, you should get exactly the same behavior. Just make sure your keys are getting evenly distributed throughout the buckets
Part 4: Rebalancing (10 Points)
In order for __contains__
, __getitem__
, __setitem__
, and __delitem__
to work in nearly constant time on average, we need to make sure the number of buckets is on the same order as the number of elements stored in the map. For example, if we're storing 100 key/value pairs, we should have around 100 buckets, and if we're storing a million key/value pairs, we should have around a million buckets.
This may seem wasteful since many buckets will be empty, but at worst, we're still have within a constant factor of the the storage of an array with N elements; we now have N pointers to buckets, plus the N elements and the overhead of the linked nodes. What we gain on average is that each bucket only has 1 element in it. Furthermore, if the hash codes are sufficiently staistically random, then most buckets will not have much more than 1 element in them.
In what follows, you will implement a scheme to ensure that the buckets stay balanced in this way, while not incuring much additional computation on average.
Your Task
Modify the constructor to take two parameters: the initial number of buckets and a "load factor." Then, modify the __setitem__
method so that if the number of elements in the map goes beyond load_factor * # buckets
, then you should double the number of buckets and re-add every key/value pair that was in there before to the new buckets. This is exactly like how a C++ vector, Java ArrayList, and python list []
work to keep list accesses and adding to the end of the list constant amortized time.
Tips
To keep your code organized, you should make an internal helper instance method to do the rebalancing, which you call from within __setitem__
at the appropriate time.
Make sure the rebalancing doesn't break your code. If everything is working properly and you start off with an initial capacity of 10
and a load factor of 0.75
, then you should see 160
buckets after running the following code.
Part 5: Tweet Wrangling (15 Points)
Now that you have an industrial strength hash map implementation, it's time to use it in an application. Provided with the starter code are all of Donald Trump's tweets up to 2021. You can load them as follows:
Each tweet is a dictionary. For instance, the first tweet, tweets[0]
, looks like this:
Your Task
In this exercise, your goal will be to use the HashMap
class you made to print out the top 200 words that Trump used across all his tweets. This will be a stress test for your implementation, as there are over 100k unique words that pop up throughout all of his tweets. To loop through all of the words, use the following code
If you've done this correctly, here are the first 120 you should see:
1 : the 2 : to 3 : and 4 : a 5 : of 6 : is 7 : in 8 : for 9 : i 10 : rt 11 : on 12 : you 13 : @realdonaldtrump 14 : will 15 : be 16 : are 17 : that 18 : great 19 : with 20 : we |
21 : our 22 : have 23 : it 24 : at 25 : this 26 : he 27 : they 28 : trump 29 : was 30 : my 31 : &, 32 : has 33 : not 34 : by 35 : all 36 : thank 37 : president 38 : just 39 : - 40 : your |
41 : as 42 : so 43 : from 44 : very 45 : who 46 : people 47 : his 48 : no 49 : but 50 : do 51 : what 52 : new 53 : would 54 : about 55 : if 56 : get 57 : an 58 : more 59 : out 60 : should |
61 : like 62 : now 63 : their 64 : big 65 : than 66 : can 67 : or 68 : never 69 : make 70 : been 71 : one 72 : up 73 : me 74 : when 75 : america 76 : many 77 : good 78 : only 79 : going 80 : how |
81 : time 82 : democrats 83 : want 84 : obama 85 : american 86 : donald 87 : there 88 : news 89 : country 90 : vote 91 : much 92 : over 93 : even 94 : why 95 : were 96 : & 97 : back 98 : must 99 : see 100 : us |
101 : fake 102 : am 103 : need 104 : being 105 : had 106 : @realdonaldtrump: 107 : u.s. 108 : love 109 : best 110 : last 111 : because 112 : think 113 : really 114 : she 115 : run 116 : doing 117 : go 118 : did 119 : after 120 : yo! |
Hints
The following algorithm will do this task efficiently:
-
Create a
HashMap
object where the key is a word and the associated value is the number of times that word appears across all tweets -
Once this is filled in, grab all of the keys into a list, and create a parallel list of the associated counts
-
Use
np.argsort
to find the indices of the words with the highest counts (we will talk more about sorting algorithms in the third unit of the course, but we'll take this as a given for now). For instance,Will yield
[2, 0, 1]
P.S.
Hopefully now it makes sense why they're called "hash tags"!