Welcome to tacg Version 2.34
Document Version 2.34 - October 30, 1997
Thanks to several helpful souls, several bugs (including one significant
bug for DEC Alphas) have been pointed out and fixed and some suggested
interface changes have been made. The latest release (2.34) addresses
these. Please see Changes since 2.28 for the
bug reports, fixes.
The Web interface that uses most of Ver 2's abilities is ready!
Try it out here!
This documentation is subject to change - compare the version # above to
your local copy, if you got one with your source code or binary
distribution. Text in italics is to be believed only with many grains of
salt.
This page provides information about tacg Ver
2.x, describes some of the philosophy behind its design, provides
links to more documentation, and where it can be got (both ANSI C source
code and precompiled binaries for some popular systems).
Table of Contents
Briefly:
- Transparent handling of degenerate sequence input
with adaptive switching in default mode.
- Pattern-matching with errors, from command-line or
input file.
- Proximity matching, with upstream/downstream,
less-than/greater-than distance constraints.
- Data output by 'bins' in 3 formats to allow external
plotting or analysis of results.
- ORF analysis
- More concise output.
- More standardization of command usage.
- Renaming some flags for more logical association.
- A real man page in 'man/nroff' format (included in the distribution pkg)
and in HTML-ized form.
- Making the program quiet by default so that it no longer spews any
stderr in normal usage to make it play better with external
editors/pagers, especially NEdit.
- Significant code cleanup and more dynamically allocated internal data
structures, especially in handling patterns.
- A moderated listserv for users of tacg to ask questions, request
bug-fixes, suggest improvements, and to hear the answers to others'
questions. This is likely to be very low traffic - mostly, it's a
channel so that I have a way to broadcast bugfixes and notices of
improvements to all tacg users. Sign up HERE by typing
ONLY subscribe tacg firstname lastname in the BODY of the
message, where "firstname" should be YOUR 1st name and "lastname"
should be YOUR lastname. Signatures may confuse the listserver, so
try not to include them. Each listserv messages contains
instructions about how to modify your subscription or remove yourself
from the list entirely, so it's not a one-way trip to listserv hell.
- More error checking.
- LOTS of bug fixes. (How did this thing ever run before?)
- Added more bugs, some known.
tacg is a character-based, command line tool
for unix-like operating systems for pattern-matching nucleic acids. It
was designed for restriction enzyme analysis of DNA, but has been extended
to other types of matching. It now handles degenerate sequence input in a
variety of matching approaches, as well as patterns with errors.
It was designed to be a command-line tool which reads DNA sequence on stdin
and spits data to stdout, reporting errors and diagnostics to stderr, sort
of like a 'grep' for DNA. Like grep, it's capabilities have grown so that
now the author has to keep calling up the help page to figure out which
flags (now ~30) mean what.
tacg itself is NOT a GUI application in any sense. However, it's existance
as a strictly command-line tool lends itself well to Webification and I've
made it available as a cgi program.. Also, by
it's strict command line nature, it can parasitize the interface of a
number of GUI programs that allow passing selections made within their
windows to external programs. This is particularly effective with a free,
Xwindow editor called NEdit. With minimal reconfiguration, NEdit can
become a fairly good biosequence editor and analysis tool a la Strider for
Xwindows. See the entry for NEdit below.
This is also one of the first examples of a new kind of software payment:
auto-citation-ware. If you use it, please allow it to spit back about 100
bytes of info that tells me how it is being used, although you can
disallow it from the command line or by recompilation
(see "The Odd one: UDP Reporting" below).
A real example, quickly.. say the Genbank entry for the lambda genome.
tail +1411 lamcg.gb | tacg -w75 -n6 -F2 -l -c -g10 -L -t3 -O145,25 | gzip - > lambda.map.gz
Translation: chop off the top 1411 lines of 'lamcg.gb' (the comments)
and pipe the following sequence to tacg, returning
the analysis:
- 90 characters wide (-w75)
- on all 6+ cutters (-n6)
- giving me the sorted fragment sizes (-F2) of those Enz's that match
- a ladder map (-l)
- sort tables by number of hits and thence alphabetically (-c)
- gel simulation with a 10 bp cutoff (-g10)
- linear restriction map (-L)
- 1 letter, 3 frame translation (-t3)
- the ORF table for frames 1,4,5 at a minimum ORF size of 25 AAs (-O145,25)
- and pipe the output thru 'gzip' to a file called lambda.map.gz.
Here's the resulting output (114KB text, gzipped - you'll need
a gzip-compatible app to view it, such as Stuffit Expander, Stuffit For
Windows, Winzip, or even ... gzip).
The lambda genome used to generate this map is also included in the tacg distribution
directory (called lamcg.gb.gz)
tacg first reads ASCII characters from stdin, determines whether there are
IUPAC characters in the stream and, depending on the -D flag option, EITHER
strips out anything not an 'a', 'c', 't', 'u', or 'g', OR allows all IUPAC
characters (but strips all others, including numbers, spaces, tabs,
linefeeds, etc), and places that input sequence in a buffer.
What it does then depends on what other flags are set. The 1st writ
function and probably the most popular is Restriction Enzyme Analysis.
Flags indicating this function cause tacg to read in a series of named
patterns (either from the command line, or more typically, from a
GCG-formatted REBASE file) and match them against the sequence already read
in. These matches are noted, then some basic math is done with them and
the output is sent to stdout, in the form of tables of cut sites, fragments
generated, various maps and simulations.
The number of recognition patterns was originally hard-coded, but is now
dynamically allocated so 1000s of patterns can be searched for at once
(I've generated ~7000 without problems), as long as they are in the simple
format required. I am working on a simple perl script to fracture
TRANSFAC into a number of REBASE-formatted files, by source.
A similar function matches (possibly degenerate) input with patterns that
contain errors (as well as simple degeneracies). This means that if the
pattern is gartc with 1 error, it could match nartc,
gnrtc, gantc, garnc, or gartn. Of course, the
1st and last sequences decay to artc and gart, as a leading
or lagging 'n' makes no difference in the pattern. NB: See the note on how subsequence patterns are
currently handled.
A related function filters the data from the above searches by parameter
that you give it, restricting the number of reports by
proximity.
A new function allows Open Reading Frames to be
extracted. This function is not very sophisticated, but does present the
data in a way that allows it to be analyzed by other programs (each ORF is
written in a FASTA format so that you can use the resulting file for other
types of searching). I suspect this function will be updated quite
frequently as users request more features.
Enzyme selection can be filtered by a variety of criteria; these criteria
affect what is printed as output (if only 3 enzymes match your selection
criteria, only those 3 will show up in most analyses. They can be chosen
by 'magnitude' of recognition site (4, 5, 6 cutters, etc), by overhang
generated, by minimum/maximum times they cut, etc.
The entire feature set is described in more detail below, but it includes:
- Enzyme cut-frequency summary tables,
- Tables of site data (sorted or not)
- Tables of fragment data (sorted or not; ordered or not)
- Linear map with names stacked efficiently with/without
translations in 1,3, or 6 frames, from 1 of 8 Codon Usage tables
- Pseudo-graphical ladder maps a la GCG
- Summary map of infrequent cutters
- Gel simulations (using a simple log fit).
- Tables of ORFs for any frames, with offsets in bases, AAs, and MolWt. [new]
- Entering patterns from the commandline [new]
- Searching for patterns with errors [new]
- Proximity matching - reports only sites that match distance criteria
(up/downstream, less/more than a distance, within a range of distances) [new]
- Other match data output formats
tacg V2 is available via FTP in ANSI C
source code and
precompiled binaries for several
architectures. Instructions are included with each package. Note that
the precompiled binaries do not include the source code; be sure to pick up
both pkgs if you want both. If I can figure out how the REDHAT Package
Manager works, I'll make up an RPM for the Linux distribution.
tacg is a command line program (see Simplicity,
below) but as such, it lends itself well to Webification as a cgi program.
I made a Web
interface to tacg Ver 1.7x (still active) that allows you use most of
its abilities.
The Web interface that uses most of Ver 2's abilities
is ready!
Try it out here!
Another page that makes tacg's pattern-searching abilities easier to
use will be up soon as well.
The only substantial crippling is that it currently has a cutoff at
200,000 bases (up from 100K in Ver 1.7x) as the output for that size of
analysis starts to get very large for some analyses). If you routinely
use more sequence than this, it's probably worth your time to install it
locally.
In my experience, if the analyses provided are sufficient, interacting with
tacgi/tacg via it's web interface (on an 100MHz R4000 SGI) over 2 LAN hops
is both faster and easier than using GCG's native Xwindows interface (on a
200MHz R4400 SGI). There may be some subjective feeling involved here,
but objectively, it's faster by 10X (on a machine that's 1/2 as fast :) )
Another example of how tacg has been incorporated into a Web interface is
provided by the Biologist's
Workbench at NCSA, for which it provides restriction enzyme
capabilities.
The Biologist's Workbench has also been licensed to MDL for
commercial users, tacg included.
If you think that the web interface might be useful to you or your
institution and would like to set it up locally, you're more than welcome.
It requires Tom Boutell's cgic
library (for which it won
Best Technical Application - Yowza! ,
the associated tacgi.c code (originally by Fred Criscuolo ) and the HTML code
for the entry
form. Just mail me for all the
details, or if you're adventurous, you can try it yourself. The code is in
the same FTP tree as tacg.
tacgi2, the CGI that supports tacg2 is now operating. After I tweak
it, and polish the docs a bit more, it will also be made available via
FTP. If you're interested in getting it, you might consider subscribing
to the above-mentioned tacg listserv, where it (and associated
bugs/fixes/patches, etc will be announced).
Yup. You have your choice of a REAL manpage (included with both the source code and the binary
pkgs), or a
PREFORMATTED man page or a
Webified man page.
In answer to: "Why does the world need another restriction mapping
program, especially one as crude as this?" or Why didn't you just use (or
modify XXX?", I enumerate:
- Well, in the 1st place, crude is not always bad :)... 'grep' could be
classified as crude and it's one of the most useful and used programs in
the world. It's free, simple, flexible, chainable, and very fast; it provides
the desired info in a useable form with a minimum of mucking about.
- There are a variety of programs available for Windows and (especially) the
Mac for doing these kinds of analysis, but there's little along the same
lines for unix, especially Linux, which is an extremely attractive OS for
this kind of work.
- Many available programs, even the commercial ones, start to come undone
at sequence lengths quite a bit smaller than those now being published.
Dynamic memory allocation relieves this restriction, so I decided to write
one that incorporated this feature.
- Even some of my favorite programs would refuse to analyze degenerate DNA
sequence, making the generation of useable plasmid maps from the usual bits of
known sequence, polylinker, and hand-drawn maps impossible or extremely
unpleasant.
- While I have had access to GCG for quite some time, I have been
increasingly aggravated at it's speed, output, cost, and the fact that it
doesn't run on Linux. While tacg's code is sufficiently arcane that
precious few biologists will attempt to add functionality to it, they may
suggest additional functions that I would certainly consider implementing.
- There is also the challenge of creating a program that, like grep has
significant utility in a relatively small footprint.
- There was also the fun of it - this has been a hobby of mine and it's
also rewarding to return to the Internet community something of (hopeful)
value as a way of saying 'thank you' for all the useful stuff I've taken
advantage of.
However, if you want alternatives, here are a few:
- There's the stalwart (if ugly)
GCG, but it's
$ware and does not (yet?) run on Linux. It also remains essentially a
collection of unfriendly command-line programs (hey, nice glass house..)
sheltered under a common but still not too intuitive GUI. The GUI is
strangely warped - it appears to be very similar to a HTML forms-based GUI,
but with neither the benefit of HTML, nor many of the advantages of a
'native' GUI. The controls are surprisingly crudely laid out (much like the
tacg Web form), which is odd for a 'native' app; I would have expected a
much more compact and intuitive interface.
- Webcutter
2 is a Web-only cgi (no command-line version, no source code) that Max Heiman put together at
about the same time that I was writing tacgi. I've tried it a few times
and while it may be useful, I found it too slow and the output a bit
info-sparse. It does have some features that tacg lacks (coloration of
selected Restriction Enzymes (REs), Sites that could be introduced by Silent
Mutagenesis, grabbing sequences direct from Genbank to restrict), but for
my own use they weren't that compelling. Nevertheless, if anyone thinks
that a feature of WebCutter would make tacg considerably more usable, I'd
be interested in hearing about it.
- Shankar Subramanian's
Biologist's Workbench is an excellent example of how a Web Interface
can make coherent a great number of the free software tools of the net.
Now that you mention it, tacg is one of its included tools :).
The only exception I take to Web interfaces
(including my own) is the response time/uncertainty of the connection and
the inability to retain state easily.
But it is a great way to take advantage of programs that you'd never be
able to take the time to get working and working reasonably easily and a
well thought-out interface can do a lot to reduce the problem of retained
state.
- Steve Smith's 'GDE' has been
compiled for
Linux, but requires a fair amount of dedication to put and hold it all
together; last time I looked it was available only an OpenLook interface,
which is an increasing disadvantage. Steve is now at GCG (I think - the
new versions of GCG seem to show the GDE influence.) so there probably
won't be a huge amount of work going into it.
- Brian Fristensky's
estimable suite of programs BIRCH
(Blatant Plug: of which tacg is also a part) is pretty good and his
comments and suggestions have resulted in a a number of the improvements
seen here, especially the Degenerate Sequence matching.
- MSI has just released an app called Gene Explorer, which includes
quite a bit of functionality as a loss-leader to attract researchers to
their higher-end products. Haven't had enough experience with it to
comment further.
- James Tisdall's 'DNA Workbench' is
designed along the same idea as tacg, but
requires perl and is therfore interpreted.
Despite perl's many syntactic and other advantages, I wanted something even
simpler (a standalone app) and therefore easier to install, faster, and
more complete.
- Don Gilbert's
multiplatform, GUI ' SeqPup' is the
closest to an ideal setup in terms of design and interface (and I hope to
make this code compatible with it), but it also requires a certain amount
of dedication to install it and the current Restriction Enzyme module is
quite slow - unusable for sequences longer than a few thousand bases.
Anyone interested in interface design for bioinformatics is well-advised to
check out and follow Don's contributions. His have consistently
led the field in terms of usability, Internet use, cohesiveness, and
inventiveness.
- On some systems, you can run
Executor, Macintosh Application Environment, SoftWindows, and Virtual
PC and thus have access to the Mac and Windows programs, but that's
also somewhat costly, slow, and complex.
tacg was designed along the same lines as other small unix
utilities - a tool that does a small set of things, does them reasonably well,
and can be chained to other utilities or used in conjuction with them to
extract the information you need without too much fuss. It especially
benefits from Don Gilbert's readseq utility as
it has no sequence conversion ability of it's own, although last year James Knight announced the
availability of the SEQIO package
designed to add this capability to programs; I thought I'd
incorporate it into Ver 2, but I haven't done it yet :(. But it's next on
the list, especially as tacg has no database-parsing abilities of its own
and that is where tacg seems to be headed.
Alternatively, you can use the unix utilities
head or tail to chop off the headers of
most sequence formats and feed the rest to
tacg.
(It also had to be simple because of my lack of fluency in C - something
that will be obvious to anyone looking thru the code, but the side
advantage is that the code IS reasonably simple.)
To that end, it REQUIRES only 1 file - the executable (tacg, which can be renamed anything). Most people
will want to use the restriction enzyme database file (rebase.data;
included with the distributions, which is simply the GCG-formatted version of
Rich Robert's REBASE), but it is not strictly needed - you can enter
patterns from the command line in Ver 2. The codon usage file
(codon.prefs) that used to be required has been eliminated as that data is
now compiled into the program. The code is still in place, however and it
would be trivial to reverse the process.
The 'rebase.data' file will be found in any of 3 places automatically: the
current directory ($PWD), the home directory ($HOME), or the tacg lib directory ($TACGLIB). The program searches
these paths and gives up only if the file can't be found in any of these
places. 'rebase.data' is ASCII text and can be edited and modified by the
user, if required. I'd suggest placing 'rebase.data' in the TACGLIB directory
and keeping modified versions of these files, if required, in your home
directory which is searched first.
The output of this program uses only alphanumeric characters so that all of
its output can be viewed on a vanilla vt100-like terminal, although you can
do more useful things if you're using an X display or have imported the
output into a Word Processor (some of the output can best be viewed using
very small fonts or in multiple columns on a page).
The different output sections have been prefixed with "==" so that you can
easily jump to the different sections if you're viewing the output with an
editor or pager such as less. In the WWW version, the
different sections have been labeled with HTML tags so that clicking the
Table of Contents Entry will drop you to the corresponding results section.
For the Web interface, if you're using a Mac, printing to a Laserwriter-type
device, you might try the '105' width, with the 'Small' font, then use Page
Setup to select Landscape output, 2-up (2 pages fitted to one page).
If you're using MS Windows, there are probably similar output
selections that you can use. You can probably discern my love for Windows.
If you're using a unix box, save the output in whatever
width you want and pass it thru a text-to-postscript filter to
generate nicely formatted output. Some that I've used are:
- lptops - included on SGI's; probably available somewhere for other platforms.
- nenscript
- (Free implementation of Adobe's enscript)
- genscript aka GNU enscript - Free,
most features, currently supported, no good reason to use any other...
Use it thusly:
genscript -1GZ -r -fCourier7 -p- [filename] | lpr -P[PSprinter]
Most Xterminals allow different font selections and window re-sizing, so
you can probably view the results adequately that way as well.
The program is written in vanilla ANSI C. After being directed to the
correct libraries, it compiles with few complaints on:
- Silicon Graphics/IRIX 5.3, 6.2 (R3000, R4X00, R5000, R10K) (cc)
- Sun SPARC/SunOS (4.1.3) (gcc)
- Sun SPARC/Solaris (gcc, cc)
- DEC Alpha/DEC Unix aka OSF/1 (V3.0/347) (cc, gcc)
- Convex C3480/ConvexOS (cc, gcc)
- HP-UX/Exemplar OS (single CPU of an SPP2000) (cc, gcc)
- NeXT boxes (Black)/NeXTSTEP (3.2) (except the UDP code) (gcc)
- Intel PCs running Linux (Slackware, RedHat, Caldera) (gcc)
- DEC Alpha Linux (gcc)
- MkLinux on PowerMacs (gcc)
- Probably anything with gcc.
On those systems where both the native commercial compiler (cc) and gcc have
been available, gcc has consistently outperformed cc. YMMV.
I suspect that it will require some work to run on VMS, but I don't know.
Others will be made available as I find systems on which to compile them,
or as others contribute binaries.
tacg executes almost entirely integer operations, with the surprising
result that it runs faster on a PPro than on many RISC workstations of
similar or even higher clock speed.
The program uses a hashtable-lookup of the restriction enzyme recognition
sites, so that only about half of the sequence is checked any further than
the initial hash. Depending on the degeneracy of the input sequence, the
kind of output you request and the i/o of the machine, the program
processes:
Speed* | Hardware | OS | Compiler, flags |
450 | Sparc Ultra1/??MHz | Sun OS v5.5 | gcc -O2 |
313 | SPP/HP-PA | SPP-UX 5.1/HP-UX 10.1 | gcc -O2 |
165 | early DEC Alpha | OSF/1 | gcc -O2 |
140 | R4000/100 Indigo2 | IRIX 5.3 | cc -O2 -mips2 |
290 | R4400/200 Indigo2 | IRIX 5.3 | cc -O2 -mips2 |
454 | R10000/175 Indigo2 | IRIX 6.2 | cc -O2 -mips2 |
330 | DECAlphaStation/233 | RH Linux 2.0.29 | gcc -O2 |
68 | i486/66 | RH Linux 2.0.18 | gcc -O2 -m486 |
515 | Intel PPro/200 | RHLinux 2.0.27 | gcc -O2 -486 |
300 | PMac8500/PPC604/120 | Mklinux 2.0.27 | gcc -O2 |
* in Kb/sec, crudely adjusted for cpu usage via the `time` shell command,
using 215 enzymes, digesting 4.638 MB, for only the summary listing (-s).
Requesting the full analysis (-n4 -sSlc -F3 -g100) takes about twice this
time. The full Linear Map doesn't take much more CPU time, but
considerably more wall time due to the i/o (output is 10x input).
NB: This is about the slowest performance you can expect. Because of the size
of the input sequence, the program has to go thru several rounds of memory
allocation. Shorter sequences, more typical of cloning projects, complete
essentially immediately even on slow machines.
Here's a table of how fast the same analysis completes on various
sequence lengths (for a PPro/200/96M/PCI-SCSI and a 486/66/32M/VESA-SCSI):
Sequence
Length | Seconds(PPro) | Bases/s(PPro) | Seconds(486) | Bases/s(486) |
100 | .04 | 2.5k | .28 | 360 |
1,000 | .05 | 20k | .3 | 3.3k |
10,000 | .07 | 143k | .4 | 24.4k |
100,000 | .2 | 500k | 1.4 | 71.4k |
1,000,000 | 1.6 | 617k | 11.3 | 88.5k |
4,630,000 | 9.1 | 509k | 68 | 68k |
10,000,000 | 22.7 | 492k | 261 | 38k |
From earlier, crude analyses it appears to be about 7-15 times faster than
the equivalent routines in the GCG 'map' program, but I haven't checked
recently.
tacg uses dynamic memory allocation for most data structures so that while
there are a few hard-coded limitations (mostly in output format), it easily
handles sequences into the millions of bases. You may have to specify a
very wide output with the '-w' flag to allow the tic labels to fit, but the
program will work. As a practical rule of thumb, the program needs about 2
MB of free memory for itself, and then, depending on what results you've
asked for, as much as 3 times the input sequence length to hold all the
intermediate results before they are barfed to stdout. For the E Coli
genome (4.7Mbases), tacg memory usage tops out at about 13 MB, so on a 16
MB machine, you'll probably be doing some swapping in this case. On a 32
MB Linux PC, I had no swapping on the above 4.7 Mbase analysis, but quite
a bit with 10MB (about 40MB total required, according to 'top'). YMMV.
Inspired by Christian Marck's elegant DNA Strider, I tried for a similiar
output format, changing a few things I didn't like, adding a few things I
wished he had. I'm open to suggestions for enhancements, interface changes,
etc.
Kvetch or it won't change.
tacg will take any ASCII data as input, ignore anything
that isn't a,c,g, u, or t (or IUPAC degeneracies - see below) and subject that sequence
to analysis. Because of this, numbering, spacing, line width, etc, of the
input file should not be a problem.
IUPAC Degeneracies are those characters which denote uncertainty in the
sequence; most are familiar with y (pyrimidines - c or t)
and r (purines - a or g), but there are several
others that can be useful. The entire set is:
Base | Name | Bases Represented | Complementary Base |
A | Adenine | A | T |
T | Thymidine | T | A |
U | Uridine(RNA only) | U | A |
G | Guanidine | G | C |
C | Cytidine | C | G |
Y | pYrimidine | C T | R |
R | puRine | A G | Y |
S | Strong(3Hbonds) | G C | W |
W | Weak(2Hbonds) | A T | S |
K | Keto | T/U G | M |
M | aMino | A C | K |
B | not A | C G T | V |
D | not C | A G T | H |
H | not G | A C T | D |
V | not T/U | A C G | B |
N | Unknown | A C G T | N |
However, it will NOT accept (at least reliably) binary file
formats such as those from DNA Strider, MacVector, NBI, GeneWorks, etc. Most
commercial programs however have an ASCII export function that should work.
For the WWW interface, anything that you can cut and paste into the window and
looks like sequence ought to be fine.
In the Web form, there is a cutoff of 200,000 characters (not bases) as
input; currently this includes extraneous characters such as headers,
numbers, spaces, etc, so if you think you are being unfairly denied
service, check how many bytes your file actually is. The command-line
version is limited only by your machine's RAM and your patience.
tacg V2 has been modified to handle degenerate input sequence as well as
the previous gatc-only sequences. tacg V2 matches sequence using the fast,
hexamer hashtable look-ahead algorithm that DNA Strider uses (my
implementation) until it hits a degeneracy in the leading hexamer
window. At this point it switches to a more extensive (but slower)
matching approach until the sequence degeneracy is cleared whereupon it
switches back to the faster approach. This allows tacg to cut sequence
that is only slightly degenerate (such as many genomic sequences)
essentially as fast as nondegenerate sequence.
Although tacg behaves as above in the default mode, it can be forced
into several other modes if desired. Check the -D man page entry for more
detail on the different modes.
Restriction Enzymes can be selected either by:
Explicit pick from all the active entries from the
latest REBASE.
I'm not particularly happy with the way this is currently implemented. I've also tried picks
from very long lists, but they are also cumbersome and not particularly easy
to use. While the current implementation gives you the choice all enzymes in
a logical layout it's too slow on many machines and interrupts the flow of the
form. I've also tried it with hyperlinks to the recognition sequence and
suppliers (an HTMLized version of the 'rebase.data' file). Suggestions are
welcome.
(In the command-line version, enzymes can be picked explicitly via the -r
flag (-r NameA,NameB,NameC... where NameX is the case INsensitive name of the
Enzyme or pattern you wish to use in the GCG-formatted REBASE file,
either 'rebase.data' or another specified by the -R flag (-R Alt.rebase.file).)
or
Selected recursively by characteristic, based on:
Magnitude of recognition sequence
The 'magnitude' of the recognition sequence depends
on the number of defined bases that make up the site. Degenerate bases
can also contribute:
acgt each count '1' magnitude point
yrwsmk each count '1/2' magnitude point
bdhu each count '1/4' magnitude point
n doesn't count at all
(tgca=4, tgyrca=5, tgcnnngca=6, etc)
Overhang of resulting ends (5',
3', blunt)
This refers to the stretch of unpaired bases at the border of the cut
created by offset breaks in the phosphate backbone. Enzymes can leave:
5' overhangs - ie. BamHI(G'GATC_C)
v BamHI
5'...tagG GATC_Ccga...3' -> 5'...tagG GATCCcga...3'
3'...atcC CTAG Ggct...5' -> 3'...atcCCTAG Ggct...5'
^
3' overhangs - ie. BbeI(G_GCGC'C)
v BbeI
5'...tagG_GCGC'Ccga...3' -> 5'...tagGGCGC Ccga...3'
3'...atcC CGCG Ggct...5' -> 3'...atcC CGCGGgct...5'
^
Blunt (no overhang) ... You get the idea...
Minimum, Maximum times they cut the
sequence
This is pretty self explanatory, but..
'Minimum' indicates the
minimum # of times an enzyme HAS TO cut
before it's
included in the output
(the default is 1).
'Maximum' indicates the maximum # of times an enzyme
CAN
cut before it's excluded from the output
(no default maximum).
Alternative REBASE database files to
use
This option allows you to specify an alternative REBASE file to use in the
matching, perhaps edited to contain only cheap enzymes or otherwise
customized for your lab/site. Ver2 has been modified to use dynamic memory
for
the allocation of the REBASE entries, so there is no compiled-in limit to
the number that can be read in. It starts at 300 entries, which is enough
for most Restriction Enzyme databases, such as REBASE, but if you format
TRANSFAC or TFD into REBASE form, it will easily process them (about 3000
entries). This RE struct is also used to hold the degenerate patterns
created from a pattern prototype with errors and I've used it to generate
up to ~ 6000 entries.
The Web version allows you to choose among:
- The entire REBASE, using the default commenting scheme to provide a
reasonable list from which to choose.
- Only Restriction Enzymes (REs) from NEB.
- Only REs from Promega
- Only REs from Stratagene
- Only REs from Amersham
- Only CHEAP REs (>= 40 Units/$)
- The Complete Transcription Factor Database - A compact version
of the TFD, including almost all of the ~2000 TFs but lacking on-line
referencing, like Dan Prestridge's SignalScan does
( email him for info).
In the Web version, the "Select by Characteristics" filtering also
applies to the Transcription Factors (TFs), although it was meant for
Restriction Enzymes (REs). You can use it to filter too-frequent TFs from
the analysis.
If you want to use tacg for TF analysis, you will have
much more flexibility running it from the command line, as you can request
only those TFs in which you're interested, as well as making up your own
file of TFs.
Suggestions welcome...
tacg' Linear Map Mode:
- produces linear restriction maps, with enzyme names nested compactly,
readably.
- is routinely tested to more 4.6 Mbases, has been tested to 14 Mbases
- handles linear/circular topologies, subsequences.
- shows EXACT cutting position (not just the start of the recognition
sequence - minor quibble with Strider and other programs)
- can show same-page translation
- in 1/3/6 frames
- in 1 or 3 letter codes
- using one of several Codon Usage Tables
- can scale output (reasonably intelligently) in widths up to ~230
characters.
- shows numeric indices for both the whole sequence (4635, below) and the
subsequence (181, below) if specified.
Bsu36I
Sau3AI
DpnI
AlwI NlaIV
BstYI Tsp4CI
BbvI BamHI AlwI MseI MaeIII SfaNI
\ \ \ \ \ \ \ \ \\
181 gagaaacaacaatggatcctaaggttagaacattgttaaaagttactgttgaagatgctt 240
4635 ctctttgttgttacctaggattccaatcttgtaacaattttcaatgacaacttctacgaa 4694
^ * ^ * ^ * ^ * ^ * ^ *
E K Q Q W I L R L E H C Z K L L L K M L
tacg can search for Open Reading Frames (ORFs) in any of the 6 frames
in one pass with a user-specified minimum ORF size. It is not a
sophisticated search algorithm; the rules are:
- It assumes that the sequence starts in a ORF - this may cause
ORFs to be reported without starting M's and BUG!! may lead to
short ORFs being reported in short input sequences. (If it has not hit a
STOP before it gets to the end of the input sequence, it will report the
cruft that it assumed was an aleady started ORF. Yes, I'll fix it.
- If not already in an ORF, a Met codon starts an ORF and a STOP
codon ends it.
- If the ORF is smaller than the specified size when it hits a STOP, it's
discarded.
- Internal Met's are not noted in any other way, as they are in some
other apps such as Lasergene's Protean (good graphic design there),
although this is an easy addition.
- If there is an unidentified AA (because of degeneracies in the
DNA) the AA is translated as an 'X' and given the averaged MolWt (118.89 Da),
but the ORF is allowed to continue (see output below).
Very little analysis is done (so far). The beginning and ending Offsets in
both base pairs and Amino Acids are given so that it can be found on a
linear map (offsets for frames in the opposite strand are calculated so
that they refer to the current top strand). The size in both Amino Acids
and KDa is calculated. Perhaps most usefully, the results are writ to
stdout in FASTA format, with the stats as the ID line and the AAs as an
unbroken line so that it can be examined by a more sophisticated pattern
matcher such as grep, agrep, perl, etc (see Using
tacg with Other Programs
Programming Note: The data structure that holds the ORF information is
dynamically allocated so that all the ORF info is available until the end
of the program. As well this allows very large sequences to be scanned in
one sweep - my aging 486/66/32M will process the E coli genome, searching
all the ORFs for a degenerate 10 AA motif with 'agrep' in under 5 minutes.
Here is some sample output:
% ./tacg -O156,60 < dpp.mel.seq
== ORF Analysis for Frame 1: 7 ORF(s) > 60 AAs
F# ORF# Begin(bp/AAs) End(bp/AAs) #AAs MWt(KDa)
> 1 1 2581 / 860 2922 / 974 114 13155.40
MWTVLWHAENFMTSFADHQASGAEAPNKKHSPRLAHNRPRFCLLMASLERFSGCNTRNFVQRQHNSPKDSLNFDQLSDWVAAAPQHHRNWLAEKLELHPLLNGPIRNPQDFTDG
> 1 2 3091 / 1030 3282 / 1094 64 7107.97
MLNKNDKTQPRIVICNLCTCTLNHLVAPVLLLDAPPAPILFMPLSVAHCCCCCHTHTHIRTTQK
> 1 3 5929 / 1976 6216 / 2072 96 9692.80
MEVDAGSGMLGAGQVGYGLGWVLAPLRFPGHADAADGARGATTPADKAVRVQLPTPAREASAQATKLAKMQRDQGMSGGVLARMGLRLGLCGINCA
> 1 4 7522 / 2507 7704 / 2568 61 7430.89
MYIYLFFACVQYQTLMFNKSGKNLYCAYKNHQNHKGYALPYEPTYLTRSQQTLHFESIHWV
> 1 5 7870 / 2623 8160 / 2720 97 10690.71
MIYVWDVGECVSVCWRRLGGWGAERADRAGRSDPAESSSKTTTAAAIMIHSRNVGNGCLTILFFFSIFGSRRRRSCAGYEIHTTQILTHTATVAFTT
> 1 6 11071 / 3690 11277 / 3759 69 7770.53
MWNKLARASQRFSRSFFSSLRATISAALFRGQRMMGTLPRVTNRAKSESTMESIRASISSSRVLARLMA
> 1 7 12406 / 4135 12768 / 4256 121 14055.17
MILKEEHPHQSIETAANAARQAQVRWRMAHLKALSRTRTPAHGNCCGRVVSKNHFFKHSRAFLWFLLCNLVMNADAFAHSQLLINVQNQVRNRKKKWFKTLKSQQRKTQQRQTATLAALPI
== ORF Analysis for Frame 5: 5 ORF(s) > 60 AAs
F# ORF# Begin(bp/AAs) End(bp/AAs) #AAs MWt(KDa)
> 5 1 11241 / 3747 10885 / 3627 120 13410.25
MEALMDSIVDSLFALFVTLGNVPIIRCPRNSAAEMVARKLEKKLRENLWDARANLFHMDATQAGGGVFSFQRPVLLLLDRNMDLATPLHHTWSYQALVHDVLDLGLNLVYVEDETASAGK
> 5 2 8007 / 2669 7708 / 2568 101 10973.69
MAAAVVVLLDDSAGSDRPARSARSAPQPPNLRQHTDTHSPTSHTYIIILTTLNKAEQIPKFEEPDSNSQSGSRTIPQNEVPARQRRRSSDTASALALGWDL
> 5 3 2304 / 768 2113 / 703 65 7230.86
MLAQHQIEIKVYPPFRSIPCYVCFLPGQNNSALFWALGATQKVMIKLCGDPVEGTRLRGYPSTSS
> 5 4 1734 / 578 1492 / 496 82 9039.19
MPISWHKKMRQPCPHSPAITEGTGLCIIFNEDICQQKCSGVLFMAFKAKPIFITTATAGQRKTTRDAVVVVVVLIIDTSYAN
> 5 5 762 / 254 583 / 193 61 6683.20
MLCPSILPFTLKHKFMLSVHFFQCTTKLTPCGAESATRRRSAKGKGAHGQGTGCATDWLTR
== ORF Analysis for Frame 6: 5 ORF(s) > 60 AAs
F# ORF# Begin(bp/AAs) End(bp/AAs) #AAs MWt(KDa)
> 6 1 11181 / 3727 10997 / 3664 63 7224.82
MCPSSAVRGTVPLKWWHASWRRSCARTFGMRVPICSTWMPRKPAAECSAFRDLCSCCWTGTWI
> 6 2 10764 / 3588 9728 / 3241 347 39136.07
MTHKGSPFPTVAEAIQEELESYRNSEEEIKRLKTSMGIEGESDIAFSLVNDTTARLTNAVNSLPQLMEKKRLIDMHTKIATAILNFIKARRLDSFFEIEEKVMSKQTLDRPLLDLLRDGEFGQAEDKLRLYI
IYFICAQQLPESEQERLKEALQAAGCDLTALAYVQRWKGIMNRSPSISQATQYEGGGTKTVSMFTKLVSQGSSFVMEGVKNLVVKRHVSFSFQIQIQVLYQNCLQNLPVTKITEQVMECRSNAETDDYLYLD
PKLLKGGEVLPKNRAPFQDAVVFMVGGGNYIEYQNLVDFIKQKQTSNVQRRIIYGASTLTNARQFLKELSALGGEIQSPTATS
> 6 3 6774 / 2258 6530 / 2175 83 9280.03
MQGNCNRLAGEWTQKGQPRMREDAALPARNPTFLGLHYPGLGRFFVRPTAPNGHQQTAANSFIKHKQIPWTCSALTPHAAFEH
> 6 4 5901 / 1967 5687 / 1894 73 8163.18
MVLFLRVVCFSPVDNKVQPTCGYRTDRFRVLVALVLGASRWLTLRANDAKQQIKGAIRKIAARDVLLNCSSKA
> 6 5 1647 / 549 1310 / 435 114 13523.17
MKTSVSKNVRAFCLWHLRQNRFLLPQRQQGKGKPHATLSSLSSSSSLTHRMQIKSVSQPDHKVAAPDRRICRFQCWERGDPWRFRSRIPDTDTETRLEVKRTRPIRRDLWHESS
==-------------------------- End of Analysis --------------------------==
Tabular Output
All these output formats, as well as the Linear Map and Summaries are affected
by the width parameter (-w); if you can manage the wider output, it's
more efficient and easier to view.
This is no longer printed by default; you have to explicitly request all
output. This was to make tacg more amenable to inclusion into other
programs as a shell command - see
Using tacg with Other Programs.
This output includes info on EVERY active (uncommented) enzyme/pattern in
the rebase filethat has passed all the filtering flags (-n, -o, -m/M).
Restriction Enzymes that DO NOT CUT in this sequence:
BbeI EheI FseI KasI NarI NheI NotI
PacI PaeR7I SalI SfiI SpeI SwaI XhoI
Total Number of Cuts per Restriction Enzyme:
AatII 5 BsiYI 130 EcoNI 5 MluI 7 SalI 0
AccI 5 BsmI 30 EcoO109I 2 MmeI 8 SapI 7
AflII 2 BsmAI 26 EcoRI 3 MnlI 184 SauI 1
AflIII 13 Bsp120I 1 EcoRII 49 MscI 17 Sau96I 61
AgeI 12 Bsp1286I 26 EcoRV 14 MseI 106 ScaI 4
AluI 89 BspEI 22 EheI 0 MspI 278 ScrFI 145
[remainder omitted]
(for enzymes that pass the filtering flags)
All of the following output formats (except the Infrequent Cutters Map) can
have the REs listed by order of their names in the database or
sorted by # of
cuts and thence alphabetically, like Strider, using the '-c' flag.
In the next example output, the first example is sorted by 'order in database',
the second is sorted by '# cuts, then alphabetically'.
sorted by 'order in database'
== Cut Sites by Restriction Enzyme
AatII G_ACGT'C - 2 Cut(s)
1118 12143
AccI GT'mk_AC - 16 Cut(s)
632 3008 3073 3079 5393 5816 6881 7974 8815 9978 11962
14136 16857 18087 20067 20222
AceIII CAGCTCnnnnnnn'nnnn_ - 8 Cut(s)
2127 5417 11681 11750 15151 17252 17584 21795
AciI C'CG_C - 48 Cut(s)
567 850 1512 2455 2931 3300 3339 3490 4842 5468 5600
5907 6174 6318 7533 8123 8807 9086 9745 10482 10805 11626
11669 11896 12050 12214 12404 12417 12675 13337 13646 14217 14881
15949 16013 16668 17060 17089 17101 17132 18035 18140 18820 18896
19781 21146 21536 21958
AflII C'TTAA_G - 5 Cut(s)
1292 4185 11172 18334 20663
[remainder omitted]
== Cut Sites by Restriction Enzyme
sorted by '# cuts, then alphabetically'
ApaI G_GGCC'C - 1 Cut(s)
21979
ApaLI G'TGCA_C - 1 Cut(s)
10676
Bpu10I CC'TnA_GC - 1 Cut(s)
844
Bpu1102I GC'TnA_GC - 1 Cut(s)
13369
NarI GG'CG_CC - 1 Cut(s)
16014
NgoAIV G'CCGG_C - 1 Cut(s)
15606
[remainder omitted]
(unsorted by fragment size, sorted or both; 'vertical' ordering can also be
sorted as in the above example.
** SORTED Fragment Sizes by Restriction Enzyme **
AatII G_ACGT'C - 5 Fragment(s)
1849 3449 3731 4289 5110 14062
AccI GT'mk_AC - 5 Fragment(s)
639 1187 2192 3574 11828 13070
AflII C'TTAA_G - 2 Fragment(s)
6078 6541 19871
AflIII A'CryG_T - 13 Fragment(s)
35 170 459 493 956 1268 1712 1913 2360 2419 4091
4920 5733 5961
[remainder omitted]
- overhangs indicated: 5'(\), 3'(/) blunt(|) cutters
- if more than 1 cut maps to a space, the number of cuts is shown
numerically (see AluI, below)
- total # of cuts indicated at end of line).
- width-sensitive, so you can increase accuracy by choosing a wider
output. See also the
complete output of the lambda analysis.
Ladder Map of Restriction Enzyme Cut Sites:
900 1800 2700 3600 4500 5400 6300 7200 8100 9000
: : : : : : : : : :
AccI ----------------\-----------------\-------------------- 2
AciI -----------------\-----------\--\---\\------\--\\------ 8
AflII ---------\-------------\---------\--\----\---\--------- 6
AhdI ---------------------------/--------------------------- 1
: : : : : : : : : :
AluI --|---|||--23||||2||-|2--2---|2|232||-|-|22--2-|3|-||-- 50
AlwI ------2---\-----------\--2-------------\2-------------- 9
Alw26I ----------------------------------------------------\-- 1
AlwNI -----------------------/--------------/--/------------- 3
: : : : : : : : : :
(a la Strider) of enzymes that cut less than X (a small number of times) -
(In Ver 2 this has been changed to be length-sensitive; X = 2 until the
sequence increases to beyond 10,000, then X increases proportional to the
log of the increased size). This output option is not called explicitly,
but is appended to the Ladder map (-l).
It should be obvious, but each label is of the form:
Name@Cut_Site
== Summary of Enzymes that cut ** 2 ** times or less:
XcmI@9286
XcmI@2466 UbaEI@9054
Sse8647I@1801 SexAI@9288 UbaDI@13309
SmaI@2882 PstI@10119 PstI@15573
Psp5II@1801 Pfl1108I@10898 SgrAI@16556
NruI@2248 HgiEII@10152 PflMI@18025
NcoI@2848 SexAI@8621 EcoNI@12615 NruI@17146
BstXI@2962 EcoNI@10850 NgoAIV@15606 PmlI@20299
BstEII@4076 EagI@12054 NarI@16014 Pfl1108I@20364
BstEII@1183 Bsu36I@10980 HgiEII@18039 MluI@22230
BstDSI@2848 NcoI@7724 Bpu1102I@13369 PacI@18544 MluI@22201
Bpu10I@844 PshAI@5399 ApaLI@10676 EagI@16150 BstXI@20204
AatII@1118 BstDSI@7724 AatII@12143 Bsu36I@16688 ApaI@21979
||| | | | | | ||| | || | | | || || | | | ||
-------------------------------------------------------------------
: : : : : : : : : : :
2000 4000 6000 8000 10000 12000 14000 16000 18000 20000 22000
- This option shows approximately how different digests would look if run
on an agarose gel. Turn the page 90°, blur your eyes slightly, and
it'll look like a massive agarose gel... really. Useful for selecting
restriction enzymes for best separation of fragments...
- It uses a straight log10() approximation, but the addition of
math to simulate different percentages of agarose/polyacryamide would be
trivial. If anyone has a favorite approximation, I'd be happy to include
it.
- It uses the same representation as the ladder
map, with single fragments represented as '|', multiple fragments that
cannot be resolved as a digit showing how many map to that space.
- the flag value sets the LOW CUTOFF - fragments of this size or lower
are excluded from the simulation (perhaps they are not seen with
ethidium staining) The value must be an integer multiple of 10 (10,
100, 1000, etc).
Below is a section of gel output generated with the width set to '-w75';
the sequence length was 22380 bp (so the gel is automatically extended to
30 Kb).
From examining the UDP data so
far, this is the option that is used incorrectly most of the time. To
obtain the results below (a 10 bp cutoff), it should be used thusly:
tacg -g10 [other flags] < input.file
If you wanted a 1000 bp cutoff, change it to:
tacg -g1000 [other flags] < input.file
== Pseudo-Gel Map of Digestions:
10 100 1000 10000
. . . . . . .... . . . . . .... . . . . . .... . .
AatII | || 2
AccI | | | | | | |2| 222 | 16
AceIII | | | 2 || | | 8
AciI || || || | | | |22 || 33||4 |227||3| |2 48
. . . . . . .... . . . . . .... . . . . . .... . .
AflII | | | | 2 5
AflIII | | | | | | 2 2 | | 11
AhdI 3 | | | | | 7
AluI | 3 ||3 3|222 |||2223 3334|2352|2|42||| 2|2 74
. . . . . . .... . . . . . .... . . . . . .... . .
AlwI 5 | || | ||| ||||2|| | 3| || 26
Alw26I | | 2 | | | |2|| 3 32||3| |4|2 | | 35
AlwNI | | | 2 | | | 7
ApaI | 1
[rest deleted]
The pattern matching used in tacg has been extended to a more general
approach for nucleic acids so that it is applicable for searching for
degenerate patterns with a specified number of errors (in possibly degenerate
sequences).
The original tacg allowed searching for nucleic acid patterns with the
standard IUPAC degeneracies, but V2 allows you to extend this capability
by allowing searching with errors.
Quite often when you are searching for binding sites for
transcription patterns, you wish to search for not only degenerate sites
but degenerate sites with 'a bit of slop'. You can use tacg to do this
in 2 ways, using the '-p' flag:
Say you had defined a site (called NPR) on a promoter by footprinting as
'gtcagggcgaat' and you
wanted to search another implicated region for this sequence but you
wanted to allow 2 errors to be allowed, so that you could find a sequence
that would, for example match as follows:
...gtcagggcgaat...
x x
...gtcggggcgatt...
An appropriate flag/option sequence to obtain such a match would be:
tacg -pNPR,gtcagggcgaat,2 -Sl -w75 <input.file |less
Translation: search input.file for the NPR sequence
gtcagggcgaat, allowing at most 2 errors
(-pNPR,gtcagggcgaat,2), returning the Site data and a ladder map
(-Sl) formatted 75 characters wide (-w75) and pipe the
results directly to the pager program 'less' (|less)
oryou can specify that the sequence is always
supposed to be searched 'with slop' via the addition of a single character
in the
standard GCG-formatted REBASE file entry:
Here is the standard GCG REBASE format line:
NPR 6 gtcagggcgaat 0 !NPR is searched with 0 errors
Here is my hack to the standard GCG format line:
NPR 6 gtcagggcgaat 0 2 !NPR is always searched with 2 errors
The algorithm to do this is reasonably brute force: an 'n'
scan of the sequence is done, replacing every base of the sequence with an
'n'. If there is more than 1 error allowed, then each successive
generation is 'n'-scanned. At each iteration, each generated pattern is
compared with the previous ones so that only novel patterns are actually
stored. This approach dramatically decreases the number of patterns
searched for, but can consume substantial CPU resources in the validation
for long patterns with many errors allowed. There's a tweak to the
algorithm that will also substantially decrease this time, but I haven't
implemented it yet. I'll see if anyone uses it 1st.
This is another option designed specifically for Transcription Factor (TF)
analyses. (see the man page section for details) In many cases, you are interested in a particular juxtaposition of
2 or more TFs; this option allows you to specify up to 10 sets of
relationships between TFs and extract only those which match your
criteria. The output has been organized so that you can see both the raw data (All) and the
filtered data (Matched) in the same output stanza (below):
== Proximity Matches:
lef(1E)/IHF(0E) * 1 matches * found from search parameters: lef,200,IHF
57309/57116
== Ladder Map of Enzyme Sites:
# hits
10000 20000 30000 40000 50000 60000 70000 80000 90000100000
: : : : : : : : : :
lef |--|-|-|463222222||||25|2|243---|324|222222--32223|-||3-3|-- 98 (All)
lef ----------------------------------|------------------------- 1 (Matched)
IHF ----------------------------------|------------------------- 1 (Matched)
IHF ----------------------------|-----|------------------------- 2 (All)
: : : : : : : : : :
This option (-Gbin_width,X|Y|L) was
included to allow the results of the matching to be
exported to other programs for further analysis. Especially where there
is a high number of matches, it may be of more use to plot or analyze these
data using an external program, whether it be a spreadsheet, plotting app,
or custom-writ program. I don't presume to know what you might want to do
with the data and so to make tacg play nicely with other apps in the best
unix tradition, I've tried to address some of the simpler data output formats that
might be of most use. Output is possible in 2 rectangular formats
(X and Y; one is the diagonal flip of the other) as well as
in a long form (L)that may be better suited for simple, custom-writ
programs that just want to read x,y pairs. The very slick gnuplot can read the multiple columns of
the Y form with only a very small amount of editing or filtering of the
output. An example of gnuplot's processing of some tacg-generated data is shown
as well. See also the gnuplot stanza in the section below.
tacg is obviously not the only molbio program you need to do your work.
There are many analyses tacg just cannot do; to that end, it at least
regurgitates your data in a re-digestible form so that you can masticate it
further elsewhere. If you can program a bit, for quick and dirty programs
you probably can't do better than perl
(but if you can program a bit, you already know that). If you can't
program, importing your data into a spreadsheet and manipulating it cell by
cell is certainly viable with the sophistication of modern spreadsheets.
In terms of working with your data in the context of tacg, there are lots
of other programs that make input easier, and output more
readable/browsable. Here are some that I use on a regualr basis:
- NEdit
A well-designed, actively supported, free GUI editor for Xwindows on
unix and VMS. Besides its own features, NEdit allows you to seamlessly pipe
selected text to external programs (...like tacg) and display the output
in a NEdit window, and even add menu items of external programs
(...like tacg), effectively making the NEdit/tacg combination a decent
GUI biosequence analysis tool.
I've included a few of my own shell commands and other NEdit settings in
the .nedit file accompanying the distribution. Either copy your .nedit
to a safe place and start NEdit with mine, or edit the tasty bits of
mine into yours. If you're just browsing, here's my
.nedit.
While the NEdit/tacg combo doesn't allow quite the same flexibility as
Strider, it does allow you to use an exceptional editor to modify or
annotate your sequence in a very natural way and also allows you to
submit parts of that sequence for some of the usual analyses. On slow
machines, (~486/66) there is a perceptible pause while the data is
passed to tacg and tacg's output is redirected back into nedit, but the
window does appear; at least you don't have to go searching for it as
you usually do in GCG.
Coming soon especially for this approach will be the output of some of
tacg's internal subroutines such as sequence manipulation (reversal,
complement, reverse complement).
-
joe I can't stand vi or emacs so when I have to text-edit thru a term
window, I use joe. Now a standard part of linux, it compiles on just
about everything and while it isn't as powerful as emacs, it's intuitive
and doesn't force you to be bimodal as does vi. In order to use it with
tacg, you can (as with vi and emacs, use it's shell
command to pass-thru whatever you've selected to an external command:
Example:
- ^KB to mark the beginning of a block
- ^KK to mark the end of the block
- ^K/ to pipe the selected, hilited text to whatever program you
wish, such as tacg -n4 -sSl (to get summary, Sites and
the ladder map imported into joe) so that you can move around and
edit the results from within joe.
- less
My favorite pager - slick design, fast, memory
miser. Like joe, it can pass selected text (but by lines only) to an
external app with the '|from mark X to here' command.
Example:
- go to the start of the selected text
- hit 'm', then a key to mark the current position (say 'a')
- move to the end of the text
- hit '|', then the same indicator key as above ('a'), then the command
you wish to pipe it to (|a tacg -n4 -sSl)
- readseq
The standard for biological sequence format conversion, from that
biocomputing code wiz Don Gilbert. You can quibble with the interface a
bit, but it's a lifesaver. If only GCG could convert sequence like
readseq...
- gnuplot
gnuplot is a another of those many unix utilities that seems crude and
clunky until you play with it a while. And after a bit, things fall
into place and you wonder how you got along without it. Using tacg's
-G output with Y format, it is easy to plot the
distribution of A/T rich regions (using a pattern entry called
'at-rich') in the E coli genome, along with the distribution of an IHF
site (pattern called 'ihf') allowing 1 error. While they should both be
further reduced with a sliding window, you can see that it is reasonably
easy to visualize primary data from tacg.
Example:
- agrep
A really useful tool that works very much like the other members of the
'grep' family but with a few notable differences. It's VERY fast, and
like tacg, you can use it to search for patterns with errors. I was
originally going to use it as the basis for tacg's searching with
errors, but by the time I started to understand the bit operations that
it uses, I had already implemented a workable scheme using the code I
had already written, which was more applicable to the kind of output I
wanted. I might switch later - it's a very cool algorithm (underlies
Jim Knight's grepseq as well as the Glimpse and Harvest search tools).
The technical papers describing it (in the same FTP tree) are a model of
readability.
- Genscript (Gnu enscript)
This is an ASCII to postscript converter that allows you to scale and
rotate text on the printed page. For many tacg analyses, it's convenient
to have very wide output and genscript allows you to squeeze the text you
want in the space you have.
- Alpha
Not a unix tool at all, but a hair-raisingly, spine-tinglingly good
shareware text editor for the Mac from Pete Keleher. Often compared
with the commercial BBedit. The reason I use it is because it works the
way I work - almost everything I'd like in an editor is there and if it
isn't, Alpha is mostly written in tcl so that it can be extended to do
whatever you want. If I was going to try a native Mac port of tacg, I'd
try it 1st as an add-in to Alpha, similar to the shell commands added to
NEdit, described above.
Well, to make a short story shorter, you can't. tacg V2 has no built-in
functions to scan databases, although it is certainly robust enough to do
so. In order for tacg to deal with databases, you have to supply
the scripts to feed sequences into tacg, one by one. tacg has a minimal
separation line to delineate between analyses, but nothing more than that.
Two things would be very handy for database handling: the
ability to deal transparently with gzipped/compressed data and the ability
to handle different database formats in a reasonable manner. The 1st is
available via the zlib
library; the second, via the SEQIO library.
These (along with the inevitable bug patches) are the next goal for tacg.
From the docs, they both appear to be comprehensible; we'll see....
tacg was also designed to track it's own use
and spread - something in which I'm also interested. To that end, the binaries
have been compiled with code (udping.c) that spits a small amount of data back to me
at each usage. To wit:
- the IP number of the hosting machine
- the cpu type
- Operating System/Version #
- what flags were used in calling the program
- the number of bases processed.
It does NOT return host or domain names, user names, or actual sequence.
The exact data (minus IP number which is inherent in the UDP protocol)
that is returned is shown on stderr (usually the screen) at each use. ie.
Returned: [hw=IP22 os=IRIX osver=5.3]
[TACG Version 1.5F]
/usr/local/bin/tacg -n5 -o5 -F2 -slL -g100 <100000
bp>
If your machine is not connected to a network, is behind a firewall, or there is
a problem in getting the data to my machine, there is no
penalty, as it's UDP output (unverified).
This allows me to find out what OSs it's been ported to, what the popular
flags are and which ones are used little if at all. I understand that
some people may be queasy about this function, especially if they haven't seen
the source code (which is freely available), so this function
can be disabled from the command line (-q) and can easily be eliminated
when recompiling (see tacg.h in the source code). If you do this, I'd
appreciate an email letting me know what you think of it otherwise.
I'd also be interested in hearing suggested refinements and critiques of
this process.
Major
- Thanks to Claude Aflalo, a bug was fixed that stopped the correct
placement of tics very close to the origin in the Summary Map (and
stopped the program).
- The routine for reading REBASE files now checks for both Carriage
Returns and Line Feeds to enable uploads from Macs and PCs.
- Thanks to Catherine Letondal, numerous bugs were fixed, including bad
casts that crashed the DEC Unix versions whenever log10() was called.
- In the -O flag, a stupid mistake prevented frames 4, 5, and 6 from
being translated correctly (thanks to Kirk Fry for pointing this out).
Fixed.
Minor
- Ashok Aiyar suggested that since tacg is pretty terse, it emit a
message when no other output has been requested. Incorporated.
- Catherine Letondal (CL) pointed out a number of glitches in the man page,
HTML documentation, and wording of a number of sections. Most of these
have been incorporated.
- CL also caught an 'off-by-one' error in analyses where the sequence was
defined as circular.
- While fixing the -O flag bug, I realized that the offsets listed in
the output were less than straightforward, so I fixed them as well, adding
a Beginning offset and Ending offset. Frames 4, 5, 6 have their offsets
manipulated so that they refer to the current top strand instead of the
previous bottom strand (so that index numbers are easier to find on the
linear map.
Major
- Added Degenerate sequence analyses, with 5 modes
of searching, including adaptive switching of the search algorithm, based
on sequence input. The 'fast' searching of nondegenerate sequence will
digest/search about 500kbases/sec on a 200MHz PPro. The 'slow' search type
is about 10 times slower. The default method runs using the 'fast' method
unless it hits degeneracies in a small window leading the current position.
In that case, the degeneracies are processed with the 'slow' method, and
then once the window is cleared of degeneracies, tacg speeds back up on
non-degenerate sequence. Thus, other things being equal, search speed is
inversely proportional to number of degeneracies, although long runs of
'n's will not slow it down.
- Added Searching with Errors. Using the
'-p' flag, you can denote both the pattern and number of errors you want to
allow. This also works with the -P (Proximity) flag (see below) and can be
applied to any existing pattern by appending a single term to its REBASE
entry. For each error allowed, this involves making an 'n' scan across the
sequence length. For each level of error, each sequence thus generated is
checked against the previously validated sequences to check for
redundancies and is only added to the list if it is novel. This
backchecking results in some time being lost in the startup, but it
enormously reduces the number of sequences to be checked and thus increases
the speed at which the input sequence is scanned by a similar amount.
- Added a method to extract 'match data by bins'
(of bases) so that much of the internal data is available for external
analysis or plotting - one form of output is of the type preferred by
gnuplot. Another is useable by most spreadsheets, such as NeXS, Applix,
or even Excel BP .
- Added -p flag, allowing patterns to be entered from the command
line. The patterns so entered are logged (with the date) to 'tacg.pattern'
in the current directory in the same form as the GCG-formatted REBASE that
is the default for tacg.
- Added -O flag, allowing simple Open Reading
Frame analysis. Allows any of the six frames to be translated; can specify
the minimum ORF length to be allowed. Output is currently only DNA and
Amino Acids offsets and MolWt, but adding more analyses should be
relatively easy.
- Added -P flag to allow Proximity matching.
This also works with the variable error rate described above. This allows
multiple relationships to be defined and only the sites that pass the
filter are reported. These relationships can be composed by terms that
restrict it by 'upstream vs downstream', 'less than' or 'greater than' a
distance, 'within a range of distances', or logical combinations of these
terms.
- Fixed (or substantially fixed) the previous 30bp minimum sequence
requirement. It's now down to 4bp, allowing analysis of polylinkers and
primers. If you need analysis on fewer than 4bp, this is probably the
wrong program for you.
- Major internal re-write of major data structures to streamline and
compact code. Substantial memory savings for Enzyme struct 'RE' which is
also now dynamically allocated, so there are no longer any hard limits on
how many patterns can be used. Several major logical bugs were
rooted out and addressed. Probably an equal number were created..
-
Minor
- Assorted Interface tweaks
- The number of Summary Cuts is now calculated based on length of
sequence, rather than being static.
-
Major
- Addition of the '-c' flag which sorts tabular data 1st by # of cuts and
then alphabetically. Makes output run from smallest number of cuts to largest
so that you can group the REs for those cuts in you're interested.
- Addition of the '-r' option that allows the explicit selection of REs
by name (up to 15 without recompilation, but you can #define any number
you want). This was primarily for the Web version but has been
incorporated into the commandline version as well.
- Addition of the '-H' option that helps out the Web version by
generating some crude HTML tags for formatting and building the Table
of Contents of the Results. As of now, it's not very useful outside of
the Web version, but maybe it'll evolve to generate standalone HTML output
- Fixed a bad bug in the sequence input routine (GetSeq.c) that would crash
the Webified program if it was taking sequence input from a Mac or PC
that did not terminate lines in a Unix-style.
- Documentation is now mostly HTML, so it can be read via Web Browser for
easier movement.
- Addition of Web interface via Tom Boutell's cgic library and and Fred
Criscuolo's help. This has driven many of the above changes and helped
in uncovering (and generating) many bugs.
Minor
- Included the '==' prefix to the major output stanzas so that you
can immediately jump to them with a searching pager such as 'less' or 'more'.
- Tidied and spiffed the HTML page and re-ordered the flags in the
help section and man page to be ~alphabetical.
- Many smaller bug fixes and cosmetic changes.
- The fast gets faster. tacg on a PPro/200 is faster than on a similarly
clocked SGI Indigo2/R4400, and even faster than an 233MHz Alpha (?!).
- Fixed some nonfatal, but bothersome messages from SetFlags().
- For Web interface, made up Vendor-specific versions of the REBASE
file. As well, Dana Macelis of NEB helped with the generation of a
price field so I could offer a 'Cheap' file.
Major
- the command line handling has been completely re-written, using the
getopt() function, so the flags are considerably less sensitive to
spacing and order. "-n 6" or "-n6" are now equally acceptable
and flags without variables can be grouped "-Ls" (but not "-Lsn6" -
that must be entered "-Ls -n6"
- All output is to stdout, instead of to named files in the cases of the
ladder and gel map.
- All output is dependent on the width setting (-w#), rather than having
different width settings for different options
- the linear map is now produced only on request (-L), reducing the
amount of output under normal circumstances.
- the enzymes displayed on the map are filtered to represent only those
selected by the various flags; previously all enzymes were displayed. This
was the result of one vocal request; I'd be interested in hearing
supporting or opposing views.
Minor
Several bugs and much seemingly unconnected weirdness disappeared
when I used getopt().
Major
- tacg, unless recompiled without it, spits
back about 100 bytes of information about it's use (the result of `uname`,
the program version, the command line flags, and the sequence length) to
enable me to track how and how often it is being used. If you are
uncomfortable with this trait, you may disable it from the command line
('-q', see above).
- noted above; tacg has no sequence format
conversion abilities, it will happily eat and analyze the acgt's from the
header (or your thesis) if fed them.
- both main() and functions were written as single pass code; little effort
was made to free() memory grabbed in the analytical stages. If you
plan on using any of this code for re-entrant programs, you'll have
to be *very* wary of this. As if... High priority to fix..(still).
- translation in 6 frames assumes circular sequence regardless of '-f'
flag, so that the last amino acids in frames 5 and 6 in the 1st output
block are obviously incorrect if you are assuming linear sequence. This is
still incorrect as of V2, but is also reasonably obvious.
- Related to above, if you want translation of circular DNA, the
translation routines do not yet support translation through the
endpoints, so if there is an ORF in progress when it hits the end of the
sequence, it will not continue the translation across the boundary (nor
in the reverse direction).
This occurred to me just prior to release of V2, and it will be one of
the 1st things addressed. This bugfix may also fix the bug mentioned
directly above.
- if you are running tacg on a machine that
was set up on a TCP/IP network and then disconnected it from the network
(took it home or pulled the ethernet card), the UDP reporting code will
cause the program to hang at the end. If this is the case, either use the
quiet flag (-q), recompile the code with the REPORT #define set to be
quiet, or hit [Cntrl+C] to terminate it.
Minor
- the code is still laden with debugging statements, many of which are
still operative if you specify the -V (Verbose) flag. Previous to Ver
2, tacg was quite chatty while in progress, but is now silent by
default, unless a major error occurs or you request the output. This
allows tacg to be used by other programs (Nedit, for example) which trap
stderr and open windows to display it.
- in the linear map, although the enzyme names stack efficiently, they
sometimes appear to stack inappropriately. This is a result of the
order in which they were encountered; a site that has a long offset
to the cut site, but encountered in the opposite strand, can map
before but still be higher in the labels.
- NOT a bug but a slight difference in accuracy. tacg finds all the sites
that Strider does and a few more. Strider is apparently confused by
some degenerate sites that overlap with themselves. I've only seen it
with an enzyme called GdiII. There is also a surprising bug in Strider 1.2
that causes long, non-palindromic patterns to be found correctly in the
top strand, but ignored in the bottom strand. I found this because a
(corrected) bug produced the same error in tacg in a few cases.
- relatedly, because of the algorithm used to find sites, and the offset
recognition sequence of some enzymes, tacg
will find cut sites for the same enzyme in opposite strands that map to
the same nucleotide, resulting in fragments of zero length. After some
thought, I've decided to consider this a feature, not a bug ;) better to
know all the possible sites than not.
- the tic markers in the last block of linear map do not stop with the
sequence. V low priority.
- when searching for patterns with errors,
some patterns generated are shorter than the parent pattern. Because of
the way this has been implmented, a hit will often be a subsequence of
another hit (aagctg is subsequence of gaagctg, but will be
noted as a separate hit. As in the case above, I've decided to call
this a feature, not a bug (so far). Just be careful...
- some flags are ignored; if you decline a linear map, translation
format requests will be ignored because that is the only way for
them to be written at this point. As the protein analysis code appears,
this may change.
- in the process of writing this, I've come to realize that there are
many cases where the code is fragile, wrong-headed, inefficient, and/or
crude beyond imagining. In many cases, I'm planning to fix this; in
others, it doesn't impact the speed (hashing), I'll probably leave it alone.
I am extremely interested in hearing about new bugs, suggestions for new
features, and overall comments on the program.
- Incorporation of a sequence parser for the input, either from
'readseq' or 'SEQIO', so that it does not need a separate program to do so,
although because of the organization of many sequence formats, many files
can be simply be 'tail'ed to remove the header and the rest of the sequence
is digested smoothly, filtering out numbers, spaces, newlines, etc. Crude,
but effective. But perhaps not - I like the simplicity of it this way -
arguments?
- An earlier version of this code has already been converted to a GUI
with tcl/tk. Depending on feedback and time constraints, I'd like to
convert the latest version to a GUI, probably using FORMS or other
C-based GUI toolkit (the Vibrant toolkit?),
and in fact some of the things I'm planning can really only be done using a
full GUI program. We'll see... tcl/tk is questionable. For a programmer,
it's very nice and flexible, but it's not really standard on all unix
machines yet and the setup for mere mortals is horrendous. If tacg goes to a GUI, it'll probably be as a
standalone application, but don't hold your breath.
- Even without a GUI, I plan on converting some of the output to
graphics, so that more information can be put on one page. Circular maps
are difficult to implement using only alphabetic characters. Color can be
used to great effect in some presentations, although I'm a firm believer in
Tufte's maxim: Maximum info, Minimum ink. On hold to see how the Web
interface can be used.
- Assuming significant usage, analysis of returned data to put together
a picture of how the program spreads and how it's being used.
Cheers
Harry
COPYRIGHT NOTICE
Copyright (c) 1996,1997 by Harry Mangalam at Univ. of California, Irvine
Permission to use, copy, modify, and distribute this software and its
documentation is hereby granted, subject to the following restrictions
and understandings:
- Any copy of this software or any copy of software derived
from it must include this copyright notice in full.
- All materials or software developed as a consequence of the
use of this software or software derived from it must duly
acknowledge such use, in accordance with the usual standards of
acknowledging credit in academic research.
- The software may be used by anyone for any purpose, except
that its redistribution for profit or inclusion in other software
sold for profit requires the express, written permission of the
author. Note that the written permission of the author is not
required for any use of the software that may result in profit,
and the author waives any rights to any part of profit gained by
any use other than its redistribution for profit.
In plain English, the only thing that requires the express,
written permission is if you want to sell software that
includes or uses this software. Any use other than selling
software that uses this software is fine, and any profit made
from that use is yours to keep.
- This software is provided AS IS with no warranties of any
kind. The author shall have no liability with respect to the
infringement of copyrights, trade secrets or any patents by
this software or any part thereof. In no event will the
author be liable for any lost revenue or profits or other
special, indirect and consequential damages.