## Problem Set 5, CS 301, Spring 2012

### Regular Problems

1. Using Java syntax, give a regular expression for valid US phone numbers. Remember such things as that there may or may not be an area code, and it may or may not have parentheses around it, and that there may or may not be a hyphen in a couple of places in the phone number. This problem does not have a perfect solution.
2. Give decision procedures for the following two problems:
1. Given two DFAs M1 and M2, does one accept the reverse of the other's language?
2. Given two DFAs M1 and M2, does the the language of M1 contain fewer strings than the language of M2? (Note: if both languages are infinite, then the answer to the question should be "No".)
3. Consider the following grammar G:

Show a parse tree produced by G for both of the following strings:
1. ababba
2. aababbab
4. Give a context-free grammar for each of the following languages:
1. Balanced Delimiters: Any string of properly balanced delimeters over the 3 pairs of delimeters: ), (, ], [, }, {.
5. For parts 2 and 4 of the immediately preceding problem, prove that that language is not regular. (It's easy for 4; it's very tricky but doable for 2. Don't worry if you don't get 2.)
6. If true, argue why; if false, give a counterexample: An infinite union of regular languages must be regular.

### Programming assignment using Java Regex

You are to write a short Java program that uses regular expressions to determine whether a string, representing the DNA of the short arm of Chromosome 4 of a human, indicates Huntington's Disease.

If the pattern CAG is repeated too many times, this indicates Huntington's Disease. See the Wikipedia article on Huntington's Disease.

Your program will take one command line argument that will give the name of a file that contains text consisting of letters from the 4-letter alphabet of DNA: C, G, A, and T, separated by arbitrary amounts of white space. You are to return the maximum number of consecutive repeats of the pattern CAG, plus arbitrary white space, that occurs anywhere in the DNA text. You will want to test your program mostly with some fairly short inputs, but be prepared for a string of length up to about 15,000,000. (Human Chromosome 4 has about 186,000,000 base pairs, but your program would take too long to run on 186,000,000 characters. Indeed, it will probably be noticeably slow even on a string of "only" 10,000,000 characters, although it should be instantaneous on a string of 1000 characters.)

Please name your file and class Huntingtons.java and submit it on Blackboard. Note that the folder will close at 10:50AM sharp on Friday.

If there are at most 180 repeats, then give the number. If there are more than 180 repeats, then simply return "> 180".

The meaning of the number of repeats of this pattern:

< 28: Normal

28–35: Intermediate level, no symptoms

36–39: At risk for Huntington's Disease.

40–180: Huntington's Disease.

> 180: Not a human being.

Some ProgrammingIssues for this problem:

-- Main.sloan - 2012-02-12

Topic attachments
I Attachment Action Size Date Who Comment
java Huntingtons.java manage 1.3 K 2012-02-17 - 16:12 UnknownUser Changes to file.

 Copyright 2013 The Board of Trustees of the University of Illinois.webmaster@cs.uic.edu WISESTHelping Women Faculty AdvanceFunded by NSF