Assignment 1: Hash Maps And Tweets (72 Points)

Chris Tralie

Table of Contents

Overview / Logistics

The purpose of this assignment is to implement one of the most important data structures in all of CS: a hash map, which is actually quite simply just an array of singly-linked lists. This will enable us to implement two abstract data types (ADTs): the Set and the Map (also known as a dictionary in python), which both have tons of applications. Before you proceed, review the notes at this link, which cover the background for hash tables.

Learning Objectives

  • Implement a singly-linked list in python
  • Implement hash tables using pythonic object oriented paradigms
  • Use hash table to fulfill set and map ADTs, including exception handling
  • Apply the dictionary API to natural language processing

Data / Starter Code

Click here to download the starter code for this assignment. Actually, the starter code is quite minimal; it mainly has a dataset you'll be using in the last task.

What To Submit

Submit your files wizard.py and hash.py to canvas, as well as any other .py files and notebooks you made. Also submit answers to the following questions on Canvas

  1. The name of your buddy, if you chose to work with one.
  2. Any other concerns that you have. For instance, if you have a bug that you were unable to solve but you made progress, write about it. The more you articulate the problem the more partial credit you will receive (fine to leave this blank)

Programming Tasks

To explain the data structures you'll be implementing, let's first look at python's built in versions of them. First, we'll examine a set, which is a unordered collection of objects without repetition

This prints

Notice how the numbers are not in the same order we put them in (since the set order is arbitrary), and that even though we added 2 twice, it only showed up once (no repetitions).

Actually, the set is quite flexible, and we can add anything to it, as long as it has a hash code. For example, we can add

and then we'll see

But then if we try to do

then we get

This is because a list can change, so it's impossible to generate a unique hash code for it, since one of the properties of hash codes is that they remain fixed, so they can only work with "immutable" objects.

Part 1: Hash Codes And Object Equality in Python (6 Points)

Let's now consider a more interesting class that's been provided with the assignment. I've created a Wizard class in wizard.py that mirrors the Harry Potter hashing class exercise. Let's run the following code from within some .py file or notebook in the same directory as wizard.py:

but then this prints out

It seems like the set's not working, because it's not supposed to have more than one copy of the same object! But let's drill down and look at the hash codes. We can do this by calling the built-in method hash in python:

When I run this, I get the following on my computer (but you will get something different every time you run this, which will become clear in a moment):

What on earth are these numbers, and how did python come up with them? Let's try one more thing. If we run the built-in id() method, it will tell us the memory address of the object passed to it:

which yields the following

As it turns out, since we haven't told python how to make a hash code for a wizard, it ends up using the memory address // 16.

However, even so, there are some collisions here: the only unique hash codes are 8785000196838 and 8785000196229. So why are we still seeing 5 unique objects in the set? Well, python doesn't actually know that they're equal; if we don't tell python how to compare objects, it will default to comparing them by their memory address. Since each of the 5 objects has a different memory address, they are considered unique.

Your Task

Write code in wizard.py to override the default behavior of object equality and hashing for the Wizard class:

  1. Implement the __hash__ method, which takes no parameters (other than self), and which returns the month*31 + day
  2. Implement the __eq__ method, which takes one parameter (in addition to self) which is object to which to compare it. Return True if the name, month, day, and year are equal between this object and the other object, and False otherwise.

If this works properly, the hash codes for all of the Snape wizards above should be 317, and when you run

It should print True. Finally, you should only see one Snape wizard in the set if you add them in the loop.


Part 2: HashSet (24 Points)

Now that we've convinced ourselves that the built-in python set class works properly, let's implement our own verison of a set from scratch using a hash table.

Your Task

Create a singly-linked list class and/or linked node class that wraps around a key object, which is what will be stored in the set. Create a class HashSet that aggregates and array of linked lists, and which has the following instance methods (that all have self as a first parameter)

  • (4 Points) __init__: The construtor should take one parameter specifying the number of buckets to use, and you should initialize an array of empty linked lists with that number of buckets

  • (4 Points) __contains__: A method that takes one parameter, the key, and which returns True if the key is in the linked list and False otherwise. When you've finished implementing this method, then saying

    Should print out True. Note the "syntactic sugar" with the in keyword, which you can use to make your code look nicer. Under the hood, saying x in y in python is equivalent to calling y.__contains__(x)

  • (4 Points) add: Takes one parameter and adds that object to the set. You should compute the hash code of the object, find its appropriate bucket index, and add it if it doesn't already exist (we don't want duplicate objects).

  • (4 Points) remove(obj): Remove obj from the hash table if it exists in the table, or throw a KeyError if the object isn't in the table. You can do this with the following code:

  • (4 Points) __len__: A method with no parameters (other than self) which returns how many elements are in the set. This should run in constant time; you should not have to loop through all of the linked lists to count. Instead, store a member variable that keeps track of the number of objects as they are added and deleted so you can simply return that member variable.

    Note also the syntactic sugar; if we say len(s), this is equivalent to saying s.__len__()

  • (4 Points) keys: A method with no parameters that returns all of the elements in the set as a python list.

If you're not sure where to get started, have a look at the needle in a haystack notes, as well as some starter code for linked lists that we made in class.

Tips

The first thing you should do in the __contains__, add, and remove methods is to compute the hash code for the object that's being passed in and % it by the number of buckets to figure out which bucket you should start looking in. We will assume here that the objects passed along have implemented __hash__, so that calling hash(obj) will give the hash code. Python already has built-in hash codes for ints, strings, floats, etc.

If this is all working properly, then the following code:

should print something like this (note that the order is arbitrary):


Part 3: HashMap (17 Points)

Sometimes, we may not know everything about a wizard, but we do know partial information like their name. In this case, it's more appropriate to use a map ADT instead of a set ADT to look up more info from the partial info we have. In this ADT, they key is something hashable that we look up in the hash table, but it is glued to something called a value which is the associated thing we're trying to look up. Python calls its implementation of a map a dictionary (in Java it's a Map, in C++ STL it's a map, in PHP it's an called an "associative array", ...). Here's how we might set up some information in a dictionary:

after which we'll see the following output

Note that a python dictionary has a quite convenient syntax; it looks just like an array, but we index it with keys instead of 0-indexed integers. We will use built-in python methods to replicate this syntactic sugar in our implementation of HashMap

Your task

Create a class HashMap which, like the HashSet class, aggregates an array of linked lists. You'll just need to tweak the node class in your linked list to hold a value in addition to a key. For those who had CS 174 with me, it's similar to the last assignment I gave in that class. Below are the instance methods you'll need to implement:

  • (2 Points) __init__: The construtor should take one parameter specifying the number of buckets to use, and you should initialize an array of empty linked lists with that number of buckets

  • (2 Points) __contains__: A method that takes one parameter, the key, and which returns True if the key is in the linked list and False otherwise. This should be incredibly similar to the __contains__ method in HashSet that you did before.

  • (3 Points) __setitem__: A method that takes two parameters, the key and the value that should be associated to that key.

    • If the key doesn't exist in the map, add the key/value pair.
    • If the key already exists in the map, update its value to be the value that's being passed here
    This should be fairly similar to the add method in HashSet that you did before.

  • (3 Points) __getitem__: A method that takes one parameter, the key, and which returns the value associated to the key, or which raises a KeyError if that key doesn't exist in the map. This should be fairly similar in structure to the __contains__ method

  • (3 Points) __delitem__: A method that takes one parameter, the key, and which deletes the key/value pair associated to this key, or which raises a KeyError if that key doesn't exist in the map. This should be fairly similar in structure to the remove method in your HashSet class

  • (2 Points) __len__: A method with no parameters (other than self) which returns how many elements are in the set. As in the HashSet, this should run in constant time

  • (2 Points) keys: A method with no parameters that returns all of the keys in the map as a python list.

Tips

If you replace the example above with m = HashMap(10) instead of m = {}, you should get exactly the same behavior. Just make sure your keys are getting evenly distributed throughout the buckets


Part 4: Rebalancing (10 Points)

In order for __contains__, __getitem__, __setitem__, and __delitem__ to work in nearly constant time on average, we need to make sure the number of buckets is on the same order as the number of elements stored in the map. For example, if we're storing 100 key/value pairs, we should have around 100 buckets, and if we're storing a million key/value pairs, we should have around a million buckets.

This may seem wasteful since many buckets will be empty, but at worst, we're still have within a constant factor of the the storage of an array with N elements; we now have N pointers to buckets, plus the N elements and the overhead of the linked nodes. What we gain on average is that each bucket only has 1 element in it. Furthermore, if the hash codes are sufficiently staistically random, then most buckets will not have much more than 1 element in them.

In what follows, you will implement a scheme to ensure that the buckets stay balanced in this way, while not incuring much additional computation on average.

Your Task

Modify the constructor to take two parameters: the initial number of buckets and a "load factor." Then, modify the __setitem__ method so that if the number of elements in the map goes beyond load_factor * # buckets, then you should double the number of buckets and re-add every key/value pair that was in there before to the new buckets. This is exactly like how a C++ vector, Java ArrayList, and python list [] work to keep list accesses and adding to the end of the list constant amortized time.

Tips

To keep your code organized, you should make an internal helper instance method to do the rebalancing, which you call from within __setitem__ at the appropriate time.

Make sure the rebalancing doesn't break your code. If everything is working properly and you start off with an initial capacity of 10 and a load factor of 0.75, then you should see 160 buckets after running the following code.


Part 5: Tweet Wrangling (15 Points)

Now that you have an industrial strength hash map implementation, it's time to use it in an application. Provided with the starter code are all of Donald Trump's tweets up to 2021. You can load them as follows:

Each tweet is a dictionary. For instance, the first tweet, tweets[0], looks like this:

Your Task

In this exercise, your goal will be to use the HashMap class you made to print out the top 200 words that Trump used across all his tweets. This will be a stress test for your implementation, as there are over 100k unique words that pop up throughout all of his tweets. To loop through all of the words, use the following code

If you've done this correctly, here are the first 120 you should see:

1 : the
2 : to
3 : and
4 : a
5 : of
6 : is
7 : in
8 : for
9 : i
10 : rt
11 : on
12 : you
13 : @realdonaldtrump
14 : will
15 : be
16 : are
17 : that
18 : great
19 : with
20 : we
21 : our
22 : have
23 : it
24 : at
25 : this
26 : he
27 : they
28 : trump
29 : was
30 : my
31 : &,
32 : has
33 : not
34 : by
35 : all
36 : thank
37 : president
38 : just
39 : -
40 : your
41 : as
42 : so
43 : from
44 : very
45 : who
46 : people
47 : his
48 : no
49 : but
50 : do
51 : what
52 : new
53 : would
54 : about
55 : if
56 : get
57 : an
58 : more
59 : out
60 : should
61 : like
62 : now
63 : their
64 : big
65 : than
66 : can
67 : or
68 : never
69 : make
70 : been
71 : one
72 : up
73 : me
74 : when
75 : america
76 : many
77 : good
78 : only
79 : going
80 : how
81 : time
82 : democrats
83 : want
84 : obama
85 : american
86 : donald
87 : there
88 : news
89 : country
90 : vote
91 : much
92 : over
93 : even
94 : why
95 : were
96 : &
97 : back
98 : must
99 : see
100 : us
101 : fake
102 : am
103 : need
104 : being
105 : had
106 : @realdonaldtrump:
107 : u.s.
108 : love
109 : best
110 : last
111 : because
112 : think
113 : really
114 : she
115 : run
116 : doing
117 : go
118 : did
119 : after
120 : yo!

Hints

The following algorithm will do this task efficiently:

  1. Create a HashMap object where the key is a word and the associated value is the number of times that word appears across all tweets

  2. Once this is filled in, grab all of the keys into a list, and create a parallel list of the associated counts

  3. Use np.argsort to find the indices of the words with the highest counts (we will talk more about sorting algorithms in the third unit of the course, but we'll take this as a given for now). For instance,

    Will yield [2, 0, 1]

P.S.

Hopefully now it makes sense why they're called "hash tags"!