PatScan currently uses a subset of Release 60 of the EMBL Nucleotide Sequence Database maintained by EBI.
The choices offered are:
There are many options that can be employed in constructing a pattern to scan for. We suggest you consider the following simple examples:
p1=8...9 3...8 ~p1
This pattern consists of three "pattern units" separated by spaces:
p1=8...9 means "match 8 to 9 characters and call them p1".
3...8 means "match 3 to 8 characters".
~p1 means "match the reverse complement of p1".
Thus, this pattern would match
cgtaaccaa ggttaacc ttggttacg
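The match above can be checked by hand with a few lines of Python. This is only a sketch of the idea, not PatScan's implementation; the names `reverse_complement` and `is_hairpin` are hypothetical, and it assumes lowercase DNA with standard Watson-Crick pairing.

```python
# Watson-Crick complements for lowercase DNA (an assumption of this sketch).
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def reverse_complement(seq):
    """Return the reverse complement of a lowercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_hairpin(stem, loop, tail):
    """Check three pre-split pieces against the pattern p1=8...9 3...8 ~p1."""
    return (
        8 <= len(stem) <= 9          # p1=8...9
        and 3 <= len(loop) <= 8      # 3...8
        and tail == reverse_complement(stem)  # ~p1
    )


print(reverse_complement("cgtaaccaa"))                    # ttggttacg
print(is_hairpin("cgtaaccaa", "ggttaacc", "ttggttacg"))   # True
```

Note that the sketch takes the three pieces already split apart; a real scanner also has to try every start position and every allowed length.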
Now for a short aside: PatScan will search only one strand, unless you ask for searches against the complementary strand, as well. With a pattern of the sort we just used, there is no need to search the opposite strand. However, it is normally the case that you will wish to search both the sequence and the opposite strand (i.e., the reverse complement of the sequence). You usually should ask for this option when you scan nucleotide sequences.
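To illustrate what searching both strands means, here is a minimal sketch that looks for a literal motif on the sequence and on its reverse complement, and reports opposite-strand hits in forward-strand coordinates. The function name `find_both_strands`, the `+`/`-` strand labels, and the 1-based inclusive coordinates are assumptions of this example; PatScan's own reporting may differ.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Return the reverse complement of a lowercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_both_strands(sequence, motif):
    """Return (strand, start, end) hits, 1-based inclusive, forward coordinates."""
    hits = []
    n, m = len(sequence), len(motif)

    # Forward strand: report positions directly.
    start = sequence.find(motif)
    while start != -1:
        hits.append(("+", start + 1, start + m))
        start = sequence.find(motif, start + 1)

    # Opposite strand: search the reverse complement, then map the
    # 0-based offset back onto the forward sequence.
    rc = reverse_complement(sequence)
    start = rc.find(motif)
    while start != -1:
        hits.append(("-", n - start - m + 1, n - start))
        start = rc.find(motif, start + 1)
    return hits


print(find_both_strands("ccgattacaggtgtaatcaa", "gattaca"))
# [('+', 3, 9), ('-', 12, 18)]
```

A hairpin pattern such as p1=8...9 3...8 ~p1 does not need the second pass, because the reverse complement of a hairpin is itself a hairpin at the same position.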
Let us stop now and ask "What additional features would one need to really find the kinds of loop structures that characterize tRNAs, rRNAs, and so forth?" Two immediately come to mind: non-standard rules for pairing in reverse complements, and tolerance for mismatches and bulges.
r1={au,ua,gc,cg,gu,ug,ga,ag}
p1=2...3 0...4 p2=2...5 1...5 r1~p2 4...4 ~p1
Let us first show you how to handle "non-standard rules for
pairing in reverse complements". The example is shown as two lines.
You may use as many lines as you like in
forming a pattern, although you can only break a pattern at points
where a space would be legal.
The first "pattern unit" on the first line does not actually match
anything. It defines a set of "pairing rules" in which standard
pairings {au,ua,gc,cg} are allowed, as well as the nonstandard pairings
{gu,ug,ga,ag}.
The second line consists of seven pattern units which may be interpreted as follows:
p1=2...3   match 2 or 3 characters (call it p1)
0...4      match 0 to 4 characters
p2=2...5   match 2 to 5 characters (call it p2)
1...5      match 1 to 5 characters
r1~p2      match the reverse complement of p2,
           using the pairing rule r1 {au,ua,gc,cg,gu,ug,ga,ag}
4...4      match 4 characters
~p1        match the reverse complement of p1,
           allowing only G-C, C-G, A-T, and T-A pairs
Thus, r1~p2 means "match the reverse complement of p2 using rule r1".
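A pairing rule can be pictured as a set of allowed two-letter pairs. The sketch below tests whether one string is the "reverse complement" of another under such a rule. The names `STANDARD`, `R1`, and `pairs_under_rule` are hypothetical, and reading each pair as (base from the named unit, base from the matched string) is an assumption; the rule r1 is symmetric, so the order does not affect this example.

```python
# Pairing rules written as sets of two-letter pairs.
STANDARD = {"au", "ua", "gc", "cg"}
R1 = STANDARD | {"gu", "ug", "ga", "ag"}


def pairs_under_rule(left, right, rule):
    """True if `right`, read backwards, pairs base by base with `left`."""
    if len(left) != len(right):
        return False
    return all(a + b in rule for a, b in zip(left, reversed(right)))


# g-u, g-c, c-g: allowed by r1, but the g-u pair fails the standard rule.
print(pairs_under_rule("ggc", "gcu", R1))        # True
print(pairs_under_rule("ggc", "gcu", STANDARD))  # False
```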
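Now let us consider the issue of tolerating mismatches and bulges. A pattern unit may carry a qualifier in square brackets that gives the number of mismatches, deletions, and insertions to tolerate, in that order:
p1=10...10 3...8 ~p1[1,2,1]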
In this case, the pattern would match
ACGTACGTAC GGGGGGGG GCGTTACCT
which is, you must admit, a fairly weak loop.
It is common to allow mismatches, but you will find yourself using insertions and deletions much more rarely. You should note that allowing mismatches, insertions, and deletions does force the program to try many additional possible pairings, so it does slow things down a bit.
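The extra work can be seen in a small sketch of tolerant matching. It is not PatScan's algorithm, just a plain recursive search with memoization; the name `matches_within` is hypothetical. It assumes the qualifier order is mismatches, deletions, insertions, that a deletion is a character of the expected string missing from the candidate, and that an insertion is an extra character in the candidate.

```python
from functools import lru_cache

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def matches_within(target, candidate, mismatches, deletions, insertions):
    """True if `candidate` equals `target` within the given edit budget."""

    @lru_cache(maxsize=None)
    def walk(i, j, mis, dele, ins):
        if i == len(target) and j == len(candidate):
            return True
        if i < len(target) and j < len(candidate):
            if target[i] == candidate[j]:
                if walk(i + 1, j + 1, mis, dele, ins):
                    return True
            elif mis > 0 and walk(i + 1, j + 1, mis - 1, dele, ins):
                return True
        # Deletion: skip a target character.
        if i < len(target) and dele > 0 and walk(i + 1, j, mis, dele - 1, ins):
            return True
        # Insertion: skip a candidate character.
        if j < len(candidate) and ins > 0 and walk(i, j + 1, mis, dele, ins - 1):
            return True
        return False

    return walk(0, 0, mismatches, deletions, insertions)


expected = reverse_complement("ACGTACGTAC")  # GTACGTACGT
print(matches_within(expected, "GCGTTACCT", 1, 2, 1))  # True
print(matches_within(expected, "GCGTTACCT", 0, 0, 0))  # False
```

Each nonzero budget adds branches to the search, which is why exact matching is the cheapest case.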
NOTE: You cannot leave a space before the qualifier. Thus,
p1=10...10 3...8 ~p1 [1,2,1]
is invalid.
p1=6...6 3...8 p1
Find an exact 6-character repeat separated by 3 to 8 characters.
p1=6...6 3...8 p1[1,0,0]
Same as above, allowing one mismatch in the repeated string.
p1=3...3 p1[1,0,0] p1[1,0,0] p1[1,0,0]
Match 12 characters made up of a 3-character sequence occurring 4 times, with up to 1 mismatch in each of the 2nd, 3rd, and 4th copies.
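The last repeat pattern can be sketched as a check on a 12-character window: every later copy is compared against the first 3 characters (p1), not against its neighbor. The function names here are hypothetical, and only mismatches are handled, as in the [1,0,0] qualifier.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_tandem_repeat(window, unit_len=3, copies=4, max_mismatches=1):
    """Check a window against p1=3...3 p1[1,0,0] p1[1,0,0] p1[1,0,0]."""
    if len(window) != unit_len * copies:
        return False
    unit = window[:unit_len]  # this is p1
    later_copies = (
        window[k * unit_len:(k + 1) * unit_len] for k in range(1, copies)
    )
    return all(hamming(unit, copy) <= max_mismatches for copy in later_copies)


print(is_tandem_repeat("ACGACGACTACG"))  # True: ACG ACG ACT ACG
print(is_tandem_repeat("ACGTTTACGACG"))  # False: TTT differs from ACG in 3 places
```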
p1=4...8 0...3 p2=6...8 p1 0...3 p2
This would match things like:
ATCT G TCTTT ATCT TG TCTTT
Occasionally, one wishes to match a specific, known sequence. In such a case, you can just give the sequence (along with an optional statement of the allowable mismatches, insertions, and deletions).
p1=6...8 GAGA ~p1
Match a hairpin with GAGA as the loop.
RRRRYYYY
Match 4 purines followed by 4 pyrimidines.
TATAA[1,0,0]
Match TATAA, allowing 1 mismatch.
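The purine/pyrimidine example can be mimicked with an ordinary regular expression by expanding each ambiguity code into a character class. This is a sketch only: the table below covers just the plain bases plus R (purine: A or G) and Y (pyrimidine: C or T), the name `to_regex` is hypothetical, and PatScan's own handling of ambiguity codes may differ.

```python
import re

# Minimal ambiguity table for this example.
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",  # purine
    "Y": "[CT]",  # pyrimidine
}


def to_regex(pattern):
    """Translate a literal pattern with R/Y codes into a compiled regex."""
    return re.compile("".join(CODES[ch] for ch in pattern))


purines_then_pyrimidines = to_regex("RRRRYYYY")
print(bool(purines_then_pyrimidines.fullmatch("AGGACTTC")))  # True
print(bool(purines_then_pyrimidines.fullmatch("AGGAATTC")))  # False
```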
Note that complex patterns could often match against a number of overlapping areas of a sequence: only the first would be reported (after a successful match, the matching algorithm picks up at the first character past the matched substring).
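The reporting rule just described (take the first match, then resume at the first character past it) can be sketched with Python's `re` module standing in for the pattern matcher. The function name `report_hits` and the 1-based inclusive coordinates are assumptions of this example. The regex `(..)\1` plays the role of a small repeat pattern: any two characters followed by the same two characters.

```python
import re


def report_hits(sequence, regex):
    """Return (start, end, text) for non-overlapping matches, left to right."""
    hits = []
    pos = 0
    while pos <= len(sequence):
        match = regex.search(sequence, pos)
        if match is None:
            break
        hits.append((match.start() + 1, match.end(), match.group()))
        # Resume just past the matched substring; step one character
        # forward on an empty match so the loop always advances.
        pos = match.end() if match.end() > match.start() else match.end() + 1
    return hits


# "ababab" contains the repeat at offsets 0, 1, and 2, but only the first
# is reported; the search resumes at offset 4, where nothing is left to match.
print(report_hits("ababab", re.compile(r"(..)\1")))
# [(1, 4, 'abab')]
```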
Note that searches may take some time since they are queued
(sometimes for a few hours, but results can often be obtained in just a few minutes).
embl|M27249:[142,162] : aaaaaaga aatca tctttttt
embl|M35517:[343,363] : aaaaaaga aatca tctttttt
embl|V00101:[241,261] : aaaaaaga aatca tctttttt
embl|X07796:[343,363] : aaaaaaga aatca tctttttt
embl|X56679:[1562,1587] : aaaaaagac ccttaggg gtctttttt
embl|M24537:[3334,3359] : aaaaaagcc cactagag ggctttttt
embl|M98822:[15,40] : aaaaaagcc cactagag ggctttttt
embl|D29985:[5641,5666] : aaaaaagcg cccttggg cgctttttt
embl|X73124:[83254,83277] : aaaaaccc tttttaaa gggttttt
embl|M12501:[611,636] : aaaaagact tggaaaca agtcttttt
. . .
embl|M77837:[623,644] : tttttaaa ggtaca tttaaaaa
embl|M16192:[1522,1542] : tttttata ataat tataaaaa
embl|L08822:[930,953] : tttttctg tgctgaaa cagaaaaa
embl|L25604:[3323,3347] : ttttttgaa ataaaac ttcaaaaaa
embl|M97391:[3519,3543] : ttttttgaa gttttgt ttcaaaaaa
COMPLETED REQUEST
Each line gives an EMBL accession number, followed by the positions in the EMBL entry that were matched by the pattern. The returned hits are sorted on the first field matched by the pattern.
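If you want to post-process the results, a hit line in the format shown above can be pulled apart with a short script. This is a sketch that assumes exactly the layout of the sample output (database, accession, bracketed positions, then the matched fields separated by spaces); the names `HIT_LINE` and `parse_hit` are hypothetical.

```python
import re

# database|accession:[start,end] : field field field
HIT_LINE = re.compile(r"^(\w+)\|(\w+):\[(\d+),(\d+)\]\s*:\s*(.+)$")


def parse_hit(line):
    """Parse one hit line into a dict, or return None for other lines."""
    match = HIT_LINE.match(line.strip())
    if match is None:
        return None
    database, accession, start, end, fields = match.groups()
    return {
        "database": database,
        "accession": accession,
        "start": int(start),
        "end": int(end),
        "fields": fields.split(),  # one string per pattern unit
    }


hit = parse_hit("embl|M27249:[142,162] : aaaaaaga aatca tctttttt")
print(hit["accession"], hit["start"], hit["end"])  # M27249 142 162

# The reported span covers exactly the matched characters.
span = abs(hit["end"] - hit["start"]) + 1
print(span == sum(len(field) for field in hit["fields"]))  # True
```

Lines that are not hits, such as the trailing COMPLETED REQUEST marker, come back as `None` and can simply be skipped.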
So, try out some patterns and see what you get. Note that we limit the maximum number of reported hits; you can override the maximum, but we suggest that you only do so once you know that you really want to see a truly large number of matches.