University of Arizona, Department of Computer Science

CSc 120: Writer Bot Hash Table

Background

This program is behaviorally identical to the Writer Bot program from Assignment 9, with these exceptions:

  1. The program will be altered to use a hash table ADT rather than a Python built-in dictionary, as specified in Programming Requirements.
  2. The input will include the table size for the hash table, as specified under Input Format.
  3. Some error checking is required as specified under Errors.

Restrictions

  1. For the long problem, your code should follow the style guidelines for the class. You must follow the updated guidelines for commenting classes.
  2. You may not use concepts and/or short-hand syntax not yet covered in class. The restrictions include the following:

Markov Chain Algorithm

As in Assignment 9.

Definitions

As in Assignment 9.

Expected Behavior

Write a program, in a file writer_bot_ht.py, that generates random text from a given source text. Your program should behave as follows:

  1. Use input() (without arguments) to read the name of the the source file sfile. Do not prompt the user for input. Do not hard-code the file name into your program.

  2. Use input() (without arguments) to read the size of the the hash table M. Do not prompt the user for input.

  3. Use input() (without arguments) to read in the prefix size n. Do not prompt the user for input.

  4. Use input() (without arguments) to read in the number of words to be generated for the random text. Do not prompt the user for input.

  5. Read sfile and build the Markov chain table of prefixes to suffixes according to the description above.

  6. Construct the randomly generated text according to the Markov chain algorithm. Construct a list to hold the words of the generated text.

  7. Print out the generated text list accoring to the Output format below.

Input Format

As in Assignment 9 and repeated here.

Each line of the input file is a sequence of characters separated by whitespace. The file may consists of any number of lines with any number of words on each line.

Output Format

As in Assignment 9 and repeated here.

Print out the list of generated text ten words per line. Any extra words will be printed on the last line. For example, if the generated text has only nine words, the output will consist of one line of nine words. If the text has 109 words, the output will consist of eleven lines of output, the first ten lines having ten words and the last line having nine.

Programming Requirements

  1. The example discussed above shows a table for prefixes of size two. Your program must work for a prefix of arbitrary size n.

  2. Instead of using a Python dictionary to build the table mapping prefixes to suffixes, implement a hash table ADT. Previously, tuples were used to represent the prefixes. In this assignment, you will use strings to represent a prefix. For example, instead of using the tuple
        ('Half, 'a')
        
    you will use the string
        'Half a'
        
    Since the prefixes will be the keys in the hash table, you will use the hash function specified in the Hash Function section to hash a string representing a prefix to an integer. You are required to use the given hash function. Implement the following class, Hashtable:
    class Hashtable
    An object of this class is a hash table that uses linear probing to handle collisions. It should contain (at least) the following attributes:
    • _pairs: the underlying Python list implementing the hash table.
    • _size: the size of the hash table.

    The class must implement (at least) the following methods:

    • __init__(self, size): initializes the object's attributes as follows: _pairs is set to a list of size size; _size is set to size.
    • put(self, key, value): hashes key and inserts the key/value pair in the hash table. Use linear probing with a decrement of 1 to resolve collisions.
    • get(self, key): looks up key in the hash table and if found, returns the corresonding value. If not found, it returns None.
    • __contains__(self, key): looks up key in the hash table and if found returns True and otherwise returns False.
    • __str__(self) returns a string representation of a Hashtable object.

    Note: Do not use append() or list concatenation in the put() method. (You will receive a deduction if you do!) The code that uses the hashtable ATD will modify a key's corresponding value when needed.

  3. As before, a prefix may have one or more suffixes. You must use a list to represent the possible suffixes. When a new suffix is encountered for an existing prefix, you must append the new suffix to the end of the list. This is important for matching the autograder output: the order in which suffixes are stored in the list will affect the choices made during text generation and will impact the output. For example, suppose that 'Half a' hashes to integer i and that 'a bee' hashes to integer j. The following shows the contents of the hash table at those indices for the Eric the Bee example:

        hash table at index i is ['Half a', [ 'bee' ]]
        ...
        hash table at index j is ['a bee', [ 'philosophically', 'be', 'Due' ]]
    
  4. In addition, during text generation, when a prefix has more than one suffix, the suffix will be randomly chosen from the list. You will use the Python random number generator as in Assignment 9 to do this. As in that assigment, in order for your output to match the tester and grading scripts, you must seed the random number generator. To do this, define the following constant at the top of your program:
       SEED = 8
    
  5. You must define the constant NONWORD, which must be a word that cannot exist in the original text. In Assignment 9, a space was used for this, however, we need spaces to delineate the words of a prefix. Consequently, define NONWORD as the single character @ as follows:
       NONWORD = '@' 
    
  6. As you can imagine, when generating the output for larger text, it is not useful to print out the random text one word at a time. During the text generation phase, create a list to hold the words of the generated text. When the text generation is complete, print the output as specified in the Output format section.

Hash Function

The hash function covered in lecture that hashes a string to an integer multiples the position of each character in the string by its ord value. That is straightforward but not robust enough for use in practice. A better approach is to compute a polynomial whose coefficients are the ord values of the individual characters in the string. Using Horner's rule to compute the polynomial, and using 31 for the value of x, gives us the following hash function:

    def _hash(self, key):
        p = 0
        for c in key:
            p = 31*p + ord(c)
        return p % self._size
  

Errors

The following are errors:

  1. The input value n for the prefix size is less than one.

    Program behavior: Use normal program logic to detect this (i.e., if statements). Give the following error message:

    Error message: "ERROR: specified prefix size is less than one"

  2. The input value for the size of the generated text is less than one.

    Program behavior: Use normal program logic to detect this (i.e., if statements). Give the following error message:

    Error message: "ERROR: specified size of the generated text is less than one"

Terminate the program after the errors above are reported. You may use Python's system library. First, include the following import statement at the top of your program:

import sys
After printing the appropriate error message, exit your program like so:
sys.exit(0)

Examples

Some examples of generating random text from different source texts are shown here.

Reference

The Markov chain algorithm is used to solve a variety of problems. Using it for random text generation has been described in many places, most notably in The Practice of Programming, by Kernighan and Pike, which can be found here.