Locating Patterns in Nucleotide Sequences

To submit a request, you need to give PatScan:

Nucleotide Sequence choice

PatScan currently uses a subset of Release 60 of the EMBL Nucleotide Sequence Database maintained by EBI.

Several sequence sets are offered, including some that are not maintained by EBI.

Pattern

There are many options that can be employed in constructing a pattern to scan for. We suggest you consider the following simple examples:

  1. p1=8...9 3...8 ~p1
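    This pattern consists of three "pattern units" separated by spaces:

    	p1=8...9     match 8 or 9 characters (call it p1)
    	3...8        match 3 to 8 characters
    	~p1          match the reverse complement of p1

    Thus, this pattern would match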

    	cgtaaccaa ggttaacc ttggttacg 
    

    Now for a short aside: PatScan searches only one strand unless you also ask for a search against the complementary strand. With a pattern of the sort we just used, there is no need to search the opposite strand, because the reverse complement of any match is itself a match. Normally, however, you will want to search both the sequence and the opposite strand (i.e., the reverse complement of the sequence), so you should usually ask for this option when you scan nucleotide sequences.
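
To make the first example and the aside about strands concrete, here is a minimal Python sketch. It is not PatScan's own code: the helper names `reverse_complement` and `is_hairpin` are hypothetical, and the sketch assumes the candidate has already been split into its three fields.

```python
# Illustrative sketch only; this is not how PatScan is implemented.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def is_hairpin(stem, loop, tail, stem_len=(8, 9), loop_len=(3, 8)):
    """Check three already-split fields against p1=8...9 3...8 ~p1."""
    return (stem_len[0] <= len(stem) <= stem_len[1]
            and loop_len[0] <= len(loop) <= loop_len[1]
            and tail == reverse_complement(stem))

# The example match from the text, split into its three pattern units.
fields = ("cgtaaccaa", "ggttaacc", "ttggttacg")
print(is_hairpin(*fields))

# Reverse-complementing the whole match gives another match of the pattern.
whole = "".join(fields)
opposite = reverse_complement(whole)
print(is_hairpin(opposite[:9], opposite[9:17], opposite[17:]))
```

Both checks print `True`. In general the opposite strand of a `p1 ... ~p1` match shows the same stem with the loop reverse-complemented, so it fits the pattern again; that is why scanning the second strand adds nothing for this kind of pattern.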

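    Let us stop now and ask "What additional features would one need to really find the kinds of loop structures that characterize tRNAs, rRNAs, and so forth?" Two immediately come to mind: non-standard rules for pairing in reverse complements, and tolerance for some number of mismatches and bulges.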

  2. r1={au,ua,gc,cg,gu,ug,ga,ag}
    p1=2...3 0...4 p2=2...5 1...5 r1~p2 4...4 ~p1
    
    Let us first show you how to handle "non-standard rules for pairing in reverse complements". The example is shown as two lines. You may use as many lines as you like in forming a pattern, although you can only break a pattern at points where a space would be legal.

    The first "pattern unit" on the first line does not actually match anything. It defines a set of "pairing rules" in which the standard pairings {au,ua,gc,cg} are allowed, as well as the nonstandard pairings {gu,ug,ga,ag}.

    The second line consists of seven pattern units which may be interpreted as follows:

    	p1=2...3     match 2 or 3 characters (call it p1)
    	0...4        match 0 to 4 characters
    	p2=2...5     match 2 to 5 characters (call it p2)
    	1...5        match 1 to 5 characters
    	r1~p2        match the reverse complement of p2,
                         using the pairing rule r1 {au,ua,gc,cg,gu,ug,ga,ag} 
    	4...4        match 4 characters        
    	~p1          match the reverse complement of p1
                         allowing only G-C, C-G, A-T, and T-A pairs
    
    Thus, r1~p2 means "match the reverse complement of p2 using rule r1".
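
The effect of a pairing rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pairs_with` is hypothetical, each rule entry is read as "character in p2, character in the matched string", and the rule sets are copied from the text above.

```python
# Illustrative sketch only; pairs_with is a hypothetical helper, not a PatScan function.
STANDARD = {"au", "ua", "gc", "cg"}
R1 = STANDARD | {"gu", "ug", "ga", "ag"}

def pairs_with(p, candidate, rule):
    """True if candidate pairs with p, read in reverse, under the given rule."""
    if len(p) != len(candidate):
        return False
    # The last character of p pairs with the first character of candidate, and so on.
    return all((a + b) in rule for a, b in zip(reversed(p), candidate))

print(pairs_with("gga", "ucc", STANDARD))  # exact reverse complement
print(pairs_with("gga", "uuc", STANDARD))  # contains a g-u pair
print(pairs_with("gga", "uuc", R1))        # the same pair is allowed by r1
```

The three calls print `True`, `False`, and `True`: the G-U pair is rejected by the standard rule but accepted once `r1` is in force.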

  3. p1=10...10 3...8 ~p1[1,2,1]
    
    Now let us consider the issue of tolerating mismatches and bulges.
    In this example, the third pattern unit must match the 10 characters of the reverse complement of p1, allowing:

    one "mismatch"
        a "mismatch" is a pairing other than G-C, C-G, A-T, or T-A,
    two "deletions"
        a "deletion" is a character that occurs in p1 but has been deleted from the string matched by ~p1, and
    one "insertion"
        an "insertion" is a character that occurs in the string matched by ~p1 but for which no corresponding character occurs in p1.

    In this case, the pattern would match

    	ACGTACGTAC GGGGGGGG GCGTTACCT
    
    which is, you must admit, a fairly weak loop.

    It is common to allow mismatches, but you will find yourself using insertions and deletions much more rarely. You should note that allowing mismatches, insertions, and deletions does force the program to try many additional possible pairings, so it does slow things down a bit.
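
The extra work caused by mismatches, insertions, and deletions can be seen in a small Python sketch. It is a brute-force search written for illustration, not PatScan's algorithm; the names `matches_within` and `search` are hypothetical, and the definitions of deletion and insertion follow the ones given above.

```python
# Illustrative sketch only; a memoized exhaustive search, not PatScan's algorithm.
from functools import lru_cache

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def matches_within(target, candidate, mismatches, deletions, insertions):
    """True if candidate equals target within the given error budget."""
    @lru_cache(maxsize=None)
    def search(i, j, m, d, n):
        if i == len(target) and j == len(candidate):
            return True
        if i < len(target) and j < len(candidate):
            if target[i] == candidate[j]:
                if search(i + 1, j + 1, m, d, n):
                    return True
            elif m > 0 and search(i + 1, j + 1, m - 1, d, n):
                return True  # spend one mismatch
        if i < len(target) and d > 0 and search(i + 1, j, m, d - 1, n):
            return True      # target character deleted from candidate
        if j < len(candidate) and n > 0 and search(i, j + 1, m, d, n - 1):
            return True      # extra character inserted in candidate
        return False
    return search(0, 0, mismatches, deletions, insertions)

target = reverse_complement("ACGTACGTAC")   # GTACGTACGT
print(matches_within(target, "GCGTTACCT", 1, 2, 1))
print(matches_within(target, "GCGTTACCT", 0, 0, 0))
```

The first call prints `True` (the example match needs one mismatch, two deletions, and one insertion) and the second prints `False`. Every allowed error adds branches to the search, which is why such patterns scan more slowly.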

    NOTE: You cannot leave a space before the qualifier. Thus,

    	p1=10...10 3...8 ~p1 [1,2,1]
    
    is invalid.

  4. p1=6...6 3...8 p1
    
    Find an exact 6-character repeat whose two copies are separated by 3 to 8 characters.

  5. p1=6...6 3...8 p1[1,0,0]
    
    Same as above, allowing one mismatch in the repeated string.

  6. p1=3...3 p1[1,0,0] p1[1,0,0] p1[1,0,0] 
    
    Match 12 characters: a 3-character sequence occurring 4 times, with up to 1 mismatch in each of the 2nd, 3rd, and 4th copies.
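
As a rough illustration of example 6, the following Python sketch checks a 12-character string against the first 3-character unit, allowing one mismatch per later copy. The function names are hypothetical and the code only mirrors the description above; it does not scan a longer sequence.

```python
# Illustrative sketch only; checks one candidate string, does not scan a sequence.
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def is_tandem_repeat(seq, unit_len=3, copies=4, max_mismatch=1):
    """Mimic p1=3...3 p1[1,0,0] p1[1,0,0] p1[1,0,0] on a whole string."""
    if len(seq) != unit_len * copies:
        return False
    unit = seq[:unit_len]
    later_copies = (seq[k:k + unit_len] for k in range(unit_len, len(seq), unit_len))
    return all(hamming(unit, copy) <= max_mismatch for copy in later_copies)

print(is_tandem_repeat("ACGACTACGAAG"))  # ACG ACT ACG AAG
print(is_tandem_repeat("ACGTTTACGACG"))  # ACG TTT ACG ACG
```

The first string is accepted because each later copy differs from `ACG` in at most one position; the second is rejected because `TTT` differs in all three.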

  7. p1=4...8 0...3 p2=6...8 p1 0...3 p2
    
    This would match things like:

    	ATCT G TCTTTT ATCT TG TCTTTT
    

    Occasionally, one wishes to match a specific, known sequence. In such a case, you can just give the sequence (along with an optional statement of the allowable mismatches, insertions, and deletions).

  8. p1=6...8 GAGA ~p1
    
    Match a hairpin with GAGA as the loop.

  9. RRRRYYYY
    
    Match 4 purines followed by 4 pyrimidines.

  10. TATAA[1,0,0]
    
    Match TATAA, allowing 1 mismatch.

There are a number of other useful types of pattern units, but this should be enough to get you started. For more information see Rules for Forming Patterns.
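
For readers who think in regular expressions, the literal patterns in examples 9 and 10 can be approximated in Python as follows. This is a sketch under assumptions: only the two ambiguity codes used above are handled (R for the purines A and G, Y for the pyrimidines C and T), sequences are taken to be upper-case DNA, and the helper names are hypothetical.

```python
# Illustrative sketch only; handles just the R and Y codes used in example 9.
import re

AMBIGUITY = {"R": "[AG]", "Y": "[CT]"}

def to_regex(pattern):
    """Translate a literal pattern with R/Y codes into a regular expression."""
    return "".join(AMBIGUITY.get(ch, re.escape(ch)) for ch in pattern)

def within_mismatches(pattern, candidate, max_mismatch):
    """Mimic a literal pattern with [max_mismatch,0,0] on an equal-length string."""
    if len(pattern) != len(candidate):
        return False
    return sum(a != b for a, b in zip(pattern, candidate)) <= max_mismatch

purines_then_pyrimidines = re.compile(to_regex("RRRRYYYY"))
print(bool(purines_then_pyrimidines.fullmatch("AGGACTTC")))
print(bool(purines_then_pyrimidines.fullmatch("AGGAATTC")))
print(within_mismatches("TATAA", "TAGAA", 1))
```

The three calls print `True`, `False`, and `True`: the second string has a purine in the fifth position, and `TAGAA` differs from `TATAA` in exactly one position.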

Note that a complex pattern can often match several overlapping regions of a sequence. Only the first of these is reported, because after a successful match the matching algorithm resumes at the first character past the matched substring.
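
Python's `re.finditer` resumes after each match in the same way, so it offers a convenient analogy for this reporting behaviour. The sketch below uses a toy pattern and a toy sequence; it says nothing about how PatScan itself is implemented.

```python
# Analogy only: re.finditer also continues past the end of each match.
import re

text = "AAAA"

# Non-overlapping scan: after matching at 0..2 the search resumes at position 2.
reported = [m.span() for m in re.finditer("AA", text)]

# A lookahead reveals every position where a match could start, overlaps included.
possible = [(m.start(1), m.end(1)) for m in re.finditer("(?=(AA))", text)]

print(reported)  # [(0, 2), (2, 4)]
print(possible)  # [(0, 2), (1, 3), (2, 4)]
```

The candidate at positions 1 to 3 overlaps an earlier hit and is therefore never reported by the non-overlapping scan.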

Note that searches are queued, so results may take some time: often just a few minutes, but sometimes a few hours.

Results

Results should look something like:

	embl|M27249:[142,162]      :  aaaaaaga  aatca    tctttttt 
	embl|M35517:[343,363]      :  aaaaaaga  aatca    tctttttt 
	embl|V00101:[241,261]      :  aaaaaaga  aatca    tctttttt 
	embl|X07796:[343,363]      :  aaaaaaga  aatca    tctttttt 
	embl|X56679:[1562,1587]    :  aaaaaagac ccttaggg gtctttttt
	embl|M24537:[3334,3359]    :  aaaaaagcc cactagag ggctttttt
	embl|M98822:[15,40]        :  aaaaaagcc cactagag ggctttttt
	embl|D29985:[5641,5666]    :  aaaaaagcg cccttggg cgctttttt
	embl|X73124:[83254,83277]  :  aaaaaccc  tttttaaa gggttttt 
	embl|M12501:[611,636]      :  aaaaagact tggaaaca agtcttttt
	.
	.
	.
	embl|M77837:[623,644]      :  tttttaaa  ggtaca   tttaaaaa 
	embl|M16192:[1522,1542]    :  tttttata  ataat    tataaaaa 
	embl|L08822:[930,953]      :  tttttctg  tgctgaaa cagaaaaa 
	embl|L25604:[3323,3347]    :  ttttttgaa ataaaac  ttcaaaaaa
	embl|M97391:[3519,3543]    :  ttttttgaa gttttgt  ttcaaaaaa
	COMPLETED REQUEST

Each line gives an EMBL accession number, followed by the positions in the EMBL entry that were matched by the pattern and then the matched text, split into one field per pattern unit. The returned hits are sorted on the first field matched by the pattern.
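
If you want to post-process the hits, a line of this output can be picked apart with a short Python sketch like the one below. The regular expression and the `parse_hit` helper are assumptions based only on the sample shown above, not a documented output format; in the sample, the positions appear to be inclusive, since the span length equals the total length of the matched fields.

```python
# Illustrative sketch only; the line format is inferred from the sample output.
import re

HIT_LINE = re.compile(r"^(\w+)\|(\w+):\[(\d+),(\d+)\]\s*:\s*(.*)$")

def parse_hit(line):
    """Return a dict describing one hit line, or None for any other line."""
    m = HIT_LINE.match(line.strip())
    if m is None:
        return None
    database, accession, start, end, matched = m.groups()
    return {
        "database": database,
        "accession": accession,
        "start": int(start),
        "end": int(end),
        "fields": matched.split(),
    }

hit = parse_hit("embl|M27249:[142,162]      :  aaaaaaga  aatca    tctttttt ")
print(hit["accession"], hit["start"], hit["end"], hit["fields"])

# Consistency check: inclusive span length versus total length of the fields.
span = hit["end"] - hit["start"] + 1
print(span == sum(len(field) for field in hit["fields"]))

print(parse_hit("COMPLETED REQUEST"))  # None: not a hit line
```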

So, try out some patterns and see what you get. Note that we limit the maximum number of reported hits; you can override the maximum, but we suggest that you only do so once you know that you really want to see a truly large number of matches.