Welcome to tacg Version 2.34

Document Version 2.34 - October 30, 1997
Thanks to several helpful souls, several bugs (including one significant bug for DEC Alphas) have been pointed out and fixed and some suggested interface changes have been made. The latest release (2.34) addresses these. Please see Changes since 2.28 for the bug reports, fixes.

The Web interface that uses most of Ver 2's abilities is ready! Try it out here!

This documentation is subject to change - compare the version # above to your local copy, if you got one with your source code or binary distribution. Text in italics is to be believed only with many grains of salt.

This page provides information about tacg Ver 2.x, describes some of the philosophy behind its design, provides links to more documentation, and where it can be got (both ANSI C source code and precompiled binaries for some popular systems).


Table of Contents


What's new with Version 2?

Briefly:


What is it?

tacg is a character-based, command line tool for unix-like operating systems for pattern-matching nucleic acids. It was designed for restriction enzyme analysis of DNA, but has been extended to other types of matching. It now handles degenerate sequence input in a variety of matching approaches, as well as patterns with errors.
It was designed to be a command-line tool which reads DNA sequence on stdin and spits data to stdout, reporting errors and diagnostics to stderr, sort of like a 'grep' for DNA. Like grep, it's capabilities have grown so that now the author has to keep calling up the help page to figure out which flags (now ~30) mean what.

tacg itself is NOT a GUI application in any sense. However, it's existance as a strictly command-line tool lends itself well to Webification and I've made it available as a cgi program.. Also, by it's strict command line nature, it can parasitize the interface of a number of GUI programs that allow passing selections made within their windows to external programs. This is particularly effective with a free, Xwindow editor called NEdit. With minimal reconfiguration, NEdit can become a fairly good biosequence editor and analysis tool a la Strider for Xwindows. See the entry for NEdit below.

This is also one of the first examples of a new kind of software payment: auto-citation-ware. If you use it, please allow it to spit back about 100 bytes of info that tells me how it is being used, although you can disallow it from the command line or by recompilation (see "The Odd one: UDP Reporting" below).

A real example, quickly.. say the Genbank entry for the lambda genome.

tail +1411 lamcg.gb | tacg -w75 -n6 -F2 -l -c -g10 -L -t3 -O145,25 | gzip - > lambda.map.gz

Translation: chop off the top 1411 lines of 'lamcg.gb' (the comments) and pipe the following sequence to tacg, returning the analysis:

Here's the resulting output (114KB text, gzipped - you'll need a gzip-compatible app to view it, such as Stuffit Expander, Stuffit For Windows, Winzip, or even ... gzip).

The lambda genome used to generate this map is also included in the tacg distribution directory (called lamcg.gb.gz)


What does it do?

tacg first reads ASCII characters from stdin, determines whether there are IUPAC characters in the stream and, depending on the -D flag option, EITHER strips out anything not an 'a', 'c', 't', 'u', or 'g', OR allows all IUPAC characters (but strips all others, including numbers, spaces, tabs, linefeeds, etc), and places that input sequence in a buffer.

What it does then depends on what other flags are set. The 1st writ function and probably the most popular is Restriction Enzyme Analysis. Flags indicating this function cause tacg to read in a series of named patterns (either from the command line, or more typically, from a GCG-formatted REBASE file) and match them against the sequence already read in. These matches are noted, then some basic math is done with them and the output is sent to stdout, in the form of tables of cut sites, fragments generated, various maps and simulations.

The number of recognition patterns was originally hard-coded, but is now dynamically allocated so 1000s of patterns can be searched for at once (I've generated ~7000 without problems), as long as they are in the simple format required. I am working on a simple perl script to fracture TRANSFAC into a number of REBASE-formatted files, by source.

A similar function matches (possibly degenerate) input with patterns that contain errors (as well as simple degeneracies). This means that if the pattern is gartc with 1 error, it could match nartc, gnrtc, gantc, garnc, or gartn. Of course, the 1st and last sequences decay to artc and gart, as a leading or lagging 'n' makes no difference in the pattern. NB: See the note on how subsequence patterns are currently handled.

A related function filters the data from the above searches by parameter that you give it, restricting the number of reports by proximity.

A new function allows Open Reading Frames to be extracted. This function is not very sophisticated, but does present the data in a way that allows it to be analyzed by other programs (each ORF is written in a FASTA format so that you can use the resulting file for other types of searching). I suspect this function will be updated quite frequently as users request more features.

Enzyme selection can be filtered by a variety of criteria; these criteria affect what is printed as output (if only 3 enzymes match your selection criteria, only those 3 will show up in most analyses. They can be chosen by 'magnitude' of recognition site (4, 5, 6 cutters, etc), by overhang generated, by minimum/maximum times they cut, etc.

The entire feature set is described in more detail below, but it includes:


Where can I get it?

tacg V2 is available via FTP in ANSI C source code and precompiled binaries for several architectures. Instructions are included with each package. Note that the precompiled binaries do not include the source code; be sure to pick up both pkgs if you want both. If I can figure out how the REDHAT Package Manager works, I'll make up an RPM for the Linux distribution.

Is there a GUI or Web Interface?

tacg is a command line program (see Simplicity, below) but as such, it lends itself well to Webification as a cgi program. I made a Web interface to tacg Ver 1.7x (still active) that allows you use most of its abilities.
The Web interface that uses most of Ver 2's abilities is ready! Try it out here!
Another page that makes tacg's pattern-searching abilities easier to use will be up soon as well.

The only substantial crippling is that it currently has a cutoff at 200,000 bases (up from 100K in Ver 1.7x) as the output for that size of analysis starts to get very large for some analyses). If you routinely use more sequence than this, it's probably worth your time to install it locally.

In my experience, if the analyses provided are sufficient, interacting with tacgi/tacg via it's web interface (on an 100MHz R4000 SGI) over 2 LAN hops is both faster and easier than using GCG's native Xwindows interface (on a 200MHz R4400 SGI). There may be some subjective feeling involved here, but objectively, it's faster by 10X (on a machine that's 1/2 as fast :) )

Another example of how tacg has been incorporated into a Web interface is provided by the Biologist's Workbench at NCSA, for which it provides restriction enzyme capabilities.
The Biologist's Workbench has also been licensed to MDL for commercial users, tacg included.

If you think that the web interface might be useful to you or your institution and would like to set it up locally, you're more than welcome. It requires Tom Boutell's cgic library (for which it won Best Technical Application - Yowza! , the associated tacgi.c code (originally by Fred Criscuolo ) and the HTML code for the entry form. Just mail me for all the details, or if you're adventurous, you can try it yourself. The code is in the same FTP tree as tacg.
tacgi2, the CGI that supports tacg2 is now operating. After I tweak it, and polish the docs a bit more, it will also be made available via FTP. If you're interested in getting it, you might consider subscribing to the above-mentioned tacg listserv, where it (and associated bugs/fixes/patches, etc will be announced).


Is there a man page I can look at?

Yup. You have your choice of a REAL manpage (included with both the source code and the binary pkgs), or a PREFORMATTED man page or a Webified man page.

Design Criteria

Simplicity

In answer to: "Why does the world need another restriction mapping program, especially one as crude as this?" or Why didn't you just use (or modify XXX?", I enumerate:
  1. Well, in the 1st place, crude is not always bad :)... 'grep' could be classified as crude and it's one of the most useful and used programs in the world. It's free, simple, flexible, chainable, and very fast; it provides the desired info in a useable form with a minimum of mucking about.
  2. There are a variety of programs available for Windows and (especially) the Mac for doing these kinds of analysis, but there's little along the same lines for unix, especially Linux, which is an extremely attractive OS for this kind of work.
  3. Many available programs, even the commercial ones, start to come undone at sequence lengths quite a bit smaller than those now being published. Dynamic memory allocation relieves this restriction, so I decided to write one that incorporated this feature.
  4. Even some of my favorite programs would refuse to analyze degenerate DNA sequence, making the generation of useable plasmid maps from the usual bits of known sequence, polylinker, and hand-drawn maps impossible or extremely unpleasant.
  5. While I have had access to GCG for quite some time, I have been increasingly aggravated at it's speed, output, cost, and the fact that it doesn't run on Linux. While tacg's code is sufficiently arcane that precious few biologists will attempt to add functionality to it, they may suggest additional functions that I would certainly consider implementing.
  6. There is also the challenge of creating a program that, like grep has significant utility in a relatively small footprint.
  7. There was also the fun of it - this has been a hobby of mine and it's also rewarding to return to the Internet community something of (hopeful) value as a way of saying 'thank you' for all the useful stuff I've taken advantage of.

However, if you want alternatives, here are a few:

  1. There's the stalwart (if ugly) GCG, but it's $ware and does not (yet?) run on Linux. It also remains essentially a collection of unfriendly command-line programs (hey, nice glass house..) sheltered under a common but still not too intuitive GUI. The GUI is strangely warped - it appears to be very similar to a HTML forms-based GUI, but with neither the benefit of HTML, nor many of the advantages of a 'native' GUI. The controls are surprisingly crudely laid out (much like the tacg Web form), which is odd for a 'native' app; I would have expected a much more compact and intuitive interface.
  2. Webcutter 2 is a Web-only cgi (no command-line version, no source code) that Max Heiman put together at about the same time that I was writing tacgi. I've tried it a few times and while it may be useful, I found it too slow and the output a bit info-sparse. It does have some features that tacg lacks (coloration of selected Restriction Enzymes (REs), Sites that could be introduced by Silent Mutagenesis, grabbing sequences direct from Genbank to restrict), but for my own use they weren't that compelling. Nevertheless, if anyone thinks that a feature of WebCutter would make tacg considerably more usable, I'd be interested in hearing about it.
  3. Shankar Subramanian's Biologist's Workbench is an excellent example of how a Web Interface can make coherent a great number of the free software tools of the net. Now that you mention it, tacg is one of its included tools :).
    The only exception I take to Web interfaces (including my own) is the response time/uncertainty of the connection and the inability to retain state easily. But it is a great way to take advantage of programs that you'd never be able to take the time to get working and working reasonably easily and a well thought-out interface can do a lot to reduce the problem of retained state.
  4. Steve Smith's 'GDE' has been compiled for Linux, but requires a fair amount of dedication to put and hold it all together; last time I looked it was available only an OpenLook interface, which is an increasing disadvantage. Steve is now at GCG (I think - the new versions of GCG seem to show the GDE influence.) so there probably won't be a huge amount of work going into it.
  5. Brian Fristensky's estimable suite of programs BIRCH (Blatant Plug: of which tacg is also a part) is pretty good and his comments and suggestions have resulted in a a number of the improvements seen here, especially the Degenerate Sequence matching.
  6. MSI has just released an app called Gene Explorer, which includes quite a bit of functionality as a loss-leader to attract researchers to their higher-end products. Haven't had enough experience with it to comment further.
  7. James Tisdall's 'DNA Workbench' is designed along the same idea as tacg, but requires perl and is therfore interpreted. Despite perl's many syntactic and other advantages, I wanted something even simpler (a standalone app) and therefore easier to install, faster, and more complete.
  8. Don Gilbert's multiplatform, GUI ' SeqPup' is the closest to an ideal setup in terms of design and interface (and I hope to make this code compatible with it), but it also requires a certain amount of dedication to install it and the current Restriction Enzyme module is quite slow - unusable for sequences longer than a few thousand bases.
    Anyone interested in interface design for bioinformatics is well-advised to check out and follow Don's contributions. His have consistently led the field in terms of usability, Internet use, cohesiveness, and inventiveness.
  9. On some systems, you can run Executor, Macintosh Application Environment, SoftWindows, and Virtual PC and thus have access to the Mac and Windows programs, but that's also somewhat costly, slow, and complex.
tacg was designed along the same lines as other small unix utilities - a tool that does a small set of things, does them reasonably well, and can be chained to other utilities or used in conjuction with them to extract the information you need without too much fuss. It especially benefits from Don Gilbert's readseq utility as it has no sequence conversion ability of it's own, although last year James Knight announced the availability of the SEQIO package designed to add this capability to programs; I thought I'd incorporate it into Ver 2, but I haven't done it yet :(. But it's next on the list, especially as tacg has no database-parsing abilities of its own and that is where tacg seems to be headed.
Alternatively, you can use the unix utilities head or tail to chop off the headers of most sequence formats and feed the rest to tacg.

(It also had to be simple because of my lack of fluency in C - something that will be obvious to anyone looking thru the code, but the side advantage is that the code IS reasonably simple.)

To that end, it REQUIRES only 1 file - the executable (tacg, which can be renamed anything). Most people will want to use the restriction enzyme database file (rebase.data; included with the distributions, which is simply the GCG-formatted version of Rich Robert's REBASE), but it is not strictly needed - you can enter patterns from the command line in Ver 2. The codon usage file (codon.prefs) that used to be required has been eliminated as that data is now compiled into the program. The code is still in place, however and it would be trivial to reverse the process.

The 'rebase.data' file will be found in any of 3 places automatically: the current directory ($PWD), the home directory ($HOME), or the tacg lib directory ($TACGLIB). The program searches these paths and gives up only if the file can't be found in any of these places. 'rebase.data' is ASCII text and can be edited and modified by the user, if required. I'd suggest placing 'rebase.data' in the TACGLIB directory and keeping modified versions of these files, if required, in your home directory which is searched first.

Program Output

The output of this program uses only alphanumeric characters so that all of its output can be viewed on a vanilla vt100-like terminal, although you can do more useful things if you're using an X display or have imported the output into a Word Processor (some of the output can best be viewed using very small fonts or in multiple columns on a page).

The different output sections have been prefixed with "==" so that you can easily jump to the different sections if you're viewing the output with an editor or pager such as less. In the WWW version, the different sections have been labeled with HTML tags so that clicking the Table of Contents Entry will drop you to the corresponding results section.

For the Web interface, if you're using a Mac, printing to a Laserwriter-type device, you might try the '105' width, with the 'Small' font, then use Page Setup to select Landscape output, 2-up (2 pages fitted to one page).

If you're using MS Windows, there are probably similar output selections that you can use. You can probably discern my love for Windows.

If you're using a unix box, save the output in whatever width you want and pass it thru a text-to-postscript filter to generate nicely formatted output. Some that I've used are:

Use it thusly:
genscript -1GZ -r -fCourier7 -p- [filename] | lpr -P[PSprinter]

Most Xterminals allow different font selections and window re-sizing, so you can probably view the results adequately that way as well.


High Portability

The program is written in vanilla ANSI C. After being directed to the correct libraries, it compiles with few complaints on: On those systems where both the native commercial compiler (cc) and gcc have been available, gcc has consistently outperformed cc. YMMV.
I suspect that it will require some work to run on VMS, but I don't know. Others will be made available as I find systems on which to compile them, or as others contribute binaries.

tacg executes almost entirely integer operations, with the surprising result that it runs faster on a PPro than on many RISC workstations of similar or even higher clock speed.


Speed and Capacity

The program uses a hashtable-lookup of the restriction enzyme recognition sites, so that only about half of the sequence is checked any further than the initial hash. Depending on the degeneracy of the input sequence, the kind of output you request and the i/o of the machine, the program processes:
Speed*HardwareOSCompiler, flags
450Sparc Ultra1/??MHz Sun OS v5.5 gcc -O2
313SPP/HP-PASPP-UX 5.1/HP-UX 10.1 gcc -O2
165early DEC AlphaOSF/1gcc -O2
140R4000/100 Indigo2IRIX 5.3cc -O2 -mips2
290R4400/200 Indigo2 IRIX 5.3cc -O2 -mips2
454R10000/175 Indigo2 IRIX 6.2cc -O2 -mips2
330DECAlphaStation/233RH Linux 2.0.29gcc -O2
68 i486/66RH Linux 2.0.18gcc -O2 -m486
515Intel PPro/200RHLinux 2.0.27gcc -O2 -486
300PMac8500/PPC604/120Mklinux 2.0.27gcc -O2

* in Kb/sec, crudely adjusted for cpu usage via the `time` shell command, using 215 enzymes, digesting 4.638 MB, for only the summary listing (-s). Requesting the full analysis (-n4 -sSlc -F3 -g100) takes about twice this time. The full Linear Map doesn't take much more CPU time, but considerably more wall time due to the i/o (output is 10x input).

NB: This is about the slowest performance you can expect. Because of the size of the input sequence, the program has to go thru several rounds of memory allocation. Shorter sequences, more typical of cloning projects, complete essentially immediately even on slow machines.

Here's a table of how fast the same analysis completes on various sequence lengths (for a PPro/200/96M/PCI-SCSI and a 486/66/32M/VESA-SCSI):
Sequence LengthSeconds(PPro)Bases/s(PPro)Seconds(486)Bases/s(486)
100.042.5k .28 360
1,000.0520k .3 3.3k
10,000.07143k .4 24.4k
100,000.2500k 1.4 71.4k
1,000,0001.6617k 11.3 88.5k
4,630,0009.1509k 68 68k
10,000,00022.7492k 261 38k


From earlier, crude analyses it appears to be about 7-15 times faster than the equivalent routines in the GCG 'map' program, but I haven't checked recently.

Memory Usage

tacg uses dynamic memory allocation for most data structures so that while there are a few hard-coded limitations (mostly in output format), it easily handles sequences into the millions of bases. You may have to specify a very wide output with the '-w' flag to allow the tic labels to fit, but the program will work. As a practical rule of thumb, the program needs about 2 MB of free memory for itself, and then, depending on what results you've asked for, as much as 3 times the input sequence length to hold all the intermediate results before they are barfed to stdout. For the E Coli genome (4.7Mbases), tacg memory usage tops out at about 13 MB, so on a 16 MB machine, you'll probably be doing some swapping in this case. On a 32 MB Linux PC, I had no swapping on the above 4.7 Mbase analysis, but quite a bit with 10MB (about 40MB total required, according to 'top'). YMMV.

Feature Set

Inspired by Christian Marck's elegant DNA Strider, I tried for a similiar output format, changing a few things I didn't like, adding a few things I wished he had. I'm open to suggestions for enhancements, interface changes, etc.
Kvetch or it won't change.

Sequence Input

tacg will take any ASCII data as input, ignore anything that isn't a,c,g, u, or t (or IUPAC degeneracies - see below) and subject that sequence to analysis. Because of this, numbering, spacing, line width, etc, of the input file should not be a problem.

IUPAC Degeneracies

IUPAC Degeneracies are those characters which denote uncertainty in the sequence; most are familiar with y (pyrimidines - c or t) and r (purines - a or g), but there are several others that can be useful. The entire set is:
BaseNameBases RepresentedComplementary Base
AAdenineAT
TThymidineTA
UUridine(RNA only)U A
GGuanidineGC
CCytidineCG
YpYrimidineC TR
RpuRineA GY
SStrong(3Hbonds)G CW
WWeak(2Hbonds)A TS
KKetoT/U GM
MaMinoA CK
Bnot AC G TV
Dnot CA G TH
Hnot GA C TD
Vnot T/UA C GB
NUnknownA C G TN

However, it will NOT accept (at least reliably) binary file formats such as those from DNA Strider, MacVector, NBI, GeneWorks, etc. Most commercial programs however have an ASCII export function that should work. For the WWW interface, anything that you can cut and paste into the window and looks like sequence ought to be fine.

In the Web form, there is a cutoff of 200,000 characters (not bases) as input; currently this includes extraneous characters such as headers, numbers, spaces, etc, so if you think you are being unfairly denied service, check how many bytes your file actually is. The command-line version is limited only by your machine's RAM and your patience.


Sequence Handling [New]

tacg V2 has been modified to handle degenerate input sequence as well as the previous gatc-only sequences. tacg V2 matches sequence using the fast, hexamer hashtable look-ahead algorithm that DNA Strider uses (my implementation) until it hits a degeneracy in the leading hexamer window. At this point it switches to a more extensive (but slower) matching approach until the sequence degeneracy is cleared whereupon it switches back to the faster approach. This allows tacg to cut sequence that is only slightly degenerate (such as many genomic sequences) essentially as fast as nondegenerate sequence.

Although tacg behaves as above in the default mode, it can be forced into several other modes if desired. Check the -D man page entry for more detail on the different modes.

Enzyme Selection

Restriction Enzymes can be selected either by:

Explicit pick from all the active entries from the latest REBASE.

I'm not particularly happy with the way this is currently implemented. I've also tried picks from very long lists, but they are also cumbersome and not particularly easy to use. While the current implementation gives you the choice all enzymes in a logical layout it's too slow on many machines and interrupts the flow of the form. I've also tried it with hyperlinks to the recognition sequence and suppliers (an HTMLized version of the 'rebase.data' file). Suggestions are welcome.
(In the command-line version, enzymes can be picked explicitly via the -r flag (-r NameA,NameB,NameC... where NameX is the case INsensitive name of the Enzyme or pattern you wish to use in the GCG-formatted REBASE file, either 'rebase.data' or another specified by the -R flag (-R Alt.rebase.file).)

or

Selected recursively by characteristic, based on:

Magnitude of recognition sequence
The 'magnitude' of the recognition sequence depends on the number of defined bases that make up the site. Degenerate bases can also contribute:

acgt    each count   '1'  magnitude point 
yrwsmk  each count  '1/2' magnitude point 
bdhu    each count  '1/4' magnitude point 
n       doesn't count at all

(tgca=4, tgyrca=5, tgcnnngca=6, etc)
Overhang of resulting ends (5', 3', blunt)
This refers to the stretch of unpaired bases at the border of the cut created by offset breaks in the phosphate backbone. Enzymes can leave:
5' overhangs - ie. BamHI(G'GATC_C) 
          v                   BamHI
 5'...tagG GATC_Ccga...3'      ->      5'...tagG       GATCCcga...3'
 3'...atcC CTAG Ggct...5'      ->      3'...atcCCTAG       Ggct...5'
               ^
 
 3' overhangs - ie. BbeI(G_GCGC'C)     
               v              BbeI
 5'...tagG_GCGC'Ccga...3'      ->      5'...tagGGCGC       Ccga...3'
 3'...atcC CGCG Ggct...5'      ->      3'...atcC       CGCGGgct...5'
          ^  
              
 Blunt (no overhang) ... You get the idea...

Minimum, Maximum times they cut the sequence
This is pretty self explanatory, but..
'Minimum' indicates the minimum # of times an enzyme HAS TO cut before it's included in the output (the default is 1).
'Maximum' indicates the maximum # of times an enzyme CAN cut before it's excluded from the output (no default maximum).

Alternative REBASE database files to use
This option allows you to specify an alternative REBASE file to use in the matching, perhaps edited to contain only cheap enzymes or otherwise customized for your lab/site. Ver2 has been modified to use dynamic memory for the allocation of the REBASE entries, so there is no compiled-in limit to the number that can be read in. It starts at 300 entries, which is enough for most Restriction Enzyme databases, such as REBASE, but if you format TRANSFAC or TFD into REBASE form, it will easily process them (about 3000 entries). This RE struct is also used to hold the degenerate patterns created from a pattern prototype with errors and I've used it to generate up to ~ 6000 entries.

The Web version allows you to choose among:


Linear Restriction Map

tacg' Linear Map Mode:

                            Bsu36I
                       Sau3AI
                         DpnI
                   AlwI  NlaIV
                       BstYI                            Tsp4CI
              BbvI     BamHI   AlwI          MseI MaeIII SfaNI
              \    \   \ \  \  \             \    \     \\
    181   gagaaacaacaatggatcctaaggttagaacattgttaaaagttactgttgaagatgctt    240
   4635   ctctttgttgttacctaggattccaatcttgtaacaattttcaatgacaacttctacgaa   4694
              ^    *    ^    *    ^    *    ^    *    ^    *    ^    *
          E  K  Q  Q  W  I  L  R  L  E  H  C  Z  K  L  L  L  K  M  L

Simple ORF AnalysisNEW

tacg can search for Open Reading Frames (ORFs) in any of the 6 frames in one pass with a user-specified minimum ORF size. It is not a sophisticated search algorithm; the rules are: Very little analysis is done (so far). The beginning and ending Offsets in both base pairs and Amino Acids are given so that it can be found on a linear map (offsets for frames in the opposite strand are calculated so that they refer to the current top strand). The size in both Amino Acids and KDa is calculated. Perhaps most usefully, the results are writ to stdout in FASTA format, with the stats as the ID line and the AAs as an unbroken line so that it can be examined by a more sophisticated pattern matcher such as grep, agrep, perl, etc (see Using tacg with Other Programs

Programming Note: The data structure that holds the ORF information is dynamically allocated so that all the ORF info is available until the end of the program. As well this allows very large sequences to be scanned in one sweep - my aging 486/66/32M will process the E coli genome, searching all the ORFs for a degenerate 10 AA motif with 'agrep' in under 5 minutes. Here is some sample output:


% ./tacg -O156,60  < dpp.mel.seq

== ORF Analysis for Frame 1:   7 ORF(s) > 60 AAs
 F#  ORF#   Begin(bp/AAs)     End(bp/AAs)   #AAs    MWt(KDa)
> 1     1    2581 /   860    2922 /   974    114    13155.40
MWTVLWHAENFMTSFADHQASGAEAPNKKHSPRLAHNRPRFCLLMASLERFSGCNTRNFVQRQHNSPKDSLNFDQLSDWVAAAPQHHRNWLAEKLELHPLLNGPIRNPQDFTDG
> 1     2    3091 /  1030    3282 /  1094     64     7107.97
MLNKNDKTQPRIVICNLCTCTLNHLVAPVLLLDAPPAPILFMPLSVAHCCCCCHTHTHIRTTQK
> 1     3    5929 /  1976    6216 /  2072     96     9692.80
MEVDAGSGMLGAGQVGYGLGWVLAPLRFPGHADAADGARGATTPADKAVRVQLPTPAREASAQATKLAKMQRDQGMSGGVLARMGLRLGLCGINCA
> 1     4    7522 /  2507    7704 /  2568     61     7430.89
MYIYLFFACVQYQTLMFNKSGKNLYCAYKNHQNHKGYALPYEPTYLTRSQQTLHFESIHWV
> 1     5    7870 /  2623    8160 /  2720     97    10690.71
MIYVWDVGECVSVCWRRLGGWGAERADRAGRSDPAESSSKTTTAAAIMIHSRNVGNGCLTILFFFSIFGSRRRRSCAGYEIHTTQILTHTATVAFTT
> 1     6   11071 /  3690   11277 /  3759     69     7770.53
MWNKLARASQRFSRSFFSSLRATISAALFRGQRMMGTLPRVTNRAKSESTMESIRASISSSRVLARLMA
> 1     7   12406 /  4135   12768 /  4256    121    14055.17
MILKEEHPHQSIETAANAARQAQVRWRMAHLKALSRTRTPAHGNCCGRVVSKNHFFKHSRAFLWFLLCNLVMNADAFAHSQLLINVQNQVRNRKKKWFKTLKSQQRKTQQRQTATLAALPI

== ORF Analysis for Frame 5:   5 ORF(s) > 60 AAs
 F#  ORF#   Begin(bp/AAs)     End(bp/AAs)   #AAs    MWt(KDa)
> 5     1   11241 /  3747   10885 /  3627    120    13410.25
MEALMDSIVDSLFALFVTLGNVPIIRCPRNSAAEMVARKLEKKLRENLWDARANLFHMDATQAGGGVFSFQRPVLLLLDRNMDLATPLHHTWSYQALVHDVLDLGLNLVYVEDETASAGK
> 5     2    8007 /  2669    7708 /  2568    101    10973.69
MAAAVVVLLDDSAGSDRPARSARSAPQPPNLRQHTDTHSPTSHTYIIILTTLNKAEQIPKFEEPDSNSQSGSRTIPQNEVPARQRRRSSDTASALALGWDL
> 5     3    2304 /   768    2113 /   703     65     7230.86
MLAQHQIEIKVYPPFRSIPCYVCFLPGQNNSALFWALGATQKVMIKLCGDPVEGTRLRGYPSTSS
> 5     4    1734 /   578    1492 /   496     82     9039.19
MPISWHKKMRQPCPHSPAITEGTGLCIIFNEDICQQKCSGVLFMAFKAKPIFITTATAGQRKTTRDAVVVVVVLIIDTSYAN
> 5     5     762 /   254     583 /   193     61     6683.20
MLCPSILPFTLKHKFMLSVHFFQCTTKLTPCGAESATRRRSAKGKGAHGQGTGCATDWLTR

== ORF Analysis for Frame 6:   5 ORF(s) > 60 AAs
 F#  ORF#   Begin(bp/AAs)     End(bp/AAs)   #AAs    MWt(KDa)
> 6     1   11181 /  3727   10997 /  3664     63     7224.82
MCPSSAVRGTVPLKWWHASWRRSCARTFGMRVPICSTWMPRKPAAECSAFRDLCSCCWTGTWI
> 6     2   10764 /  3588    9728 /  3241    347    39136.07
MTHKGSPFPTVAEAIQEELESYRNSEEEIKRLKTSMGIEGESDIAFSLVNDTTARLTNAVNSLPQLMEKKRLIDMHTKIATAILNFIKARRLDSFFEIEEKVMSKQTLDRPLLDLLRDGEFGQAEDKLRLYI
IYFICAQQLPESEQERLKEALQAAGCDLTALAYVQRWKGIMNRSPSISQATQYEGGGTKTVSMFTKLVSQGSSFVMEGVKNLVVKRHVSFSFQIQIQVLYQNCLQNLPVTKITEQVMECRSNAETDDYLYLD
PKLLKGGEVLPKNRAPFQDAVVFMVGGGNYIEYQNLVDFIKQKQTSNVQRRIIYGASTLTNARQFLKELSALGGEIQSPTATS
> 6     3    6774 /  2258    6530 /  2175     83     9280.03
MQGNCNRLAGEWTQKGQPRMREDAALPARNPTFLGLHYPGLGRFFVRPTAPNGHQQTAANSFIKHKQIPWTCSALTPHAAFEH
> 6     4    5901 /  1967    5687 /  1894     73     8163.18
MVLFLRVVCFSPVDNKVQPTCGYRTDRFRVLVALVLGASRWLTLRANDAKQQIKGAIRKIAARDVLLNCSSKA
> 6     5    1647 /   549    1310 /   435    114    13523.17
MKTSVSKNVRAFCLWHLRQNRFLLPQRQQGKGKPHATLSSLSSSSSLTHRMQIKSVSQPDHKVAAPDRRICRFQCWERGDPWRFRSRIPDTDTETRLEVKRTRPIRRDLWHESS

==--------------------------  End of Analysis  --------------------------==

Tabular Output

All these output formats, as well as the Linear Map and Summaries are affected by the width parameter (-w); if you can manage the wider output, it's more efficient and easier to view.

Summary of All Cuts

This is no longer printed by default; you have to explicitly request all output. This was to make tacg more amenable to inclusion into other programs as a shell command - see Using tacg with Other Programs. This output includes info on EVERY active (uncommented) enzyme/pattern in the rebase filethat has passed all the filtering flags (-n, -o, -m/M).
Restriction Enzymes that DO NOT CUT in this sequence:

      BbeI      EheI      FseI      KasI      NarI      NheI      NotI
      PacI    PaeR7I      SalI      SfiI      SpeI      SwaI      XhoI

 Total Number of Cuts per Restriction Enzyme:

     AatII    5     BsiYI  130     EcoNI    5      MluI    7      SalI    0
      AccI    5      BsmI   30  EcoO109I    2      MmeI    8      SapI    7
     AflII    2     BsmAI   26     EcoRI    3      MnlI  184      SauI    1
    AflIII   13   Bsp120I    1    EcoRII   49      MscI   17    Sau96I   61
      AgeI   12  Bsp1286I   26     EcoRV   14      MseI  106      ScaI    4
      AluI   89     BspEI   22      EheI    0      MspI  278     ScrFI  145
                               [remainder omitted]

Tables of Cutting Sites

(for enzymes that pass the filtering flags)
All of the following output formats (except the Infrequent Cutters Map) can have the REs listed by order of their names in the database or sorted by # of cuts and thence alphabetically, like Strider, using the '-c' flag.
In the next example output, the first example is sorted by 'order in database', the second is sorted by '# cuts, then alphabetically'.
sorted by 'order in database'
  ==  Cut Sites by Restriction Enzyme 

AatII       G_ACGT'C - 2 Cut(s)
   1118  12143

AccI        GT'mk_AC - 16 Cut(s)
    632   3008   3073   3079   5393   5816   6881   7974   8815   9978  11962
  14136  16857  18087  20067  20222

AceIII      CAGCTCnnnnnnn'nnnn_ - 8 Cut(s)
   2127   5417  11681  11750  15151  17252  17584  21795

AciI        C'CG_C - 48 Cut(s)
    567    850   1512   2455   2931   3300   3339   3490   4842   5468   5600
   5907   6174   6318   7533   8123   8807   9086   9745  10482  10805  11626
  11669  11896  12050  12214  12404  12417  12675  13337  13646  14217  14881
  15949  16013  16668  17060  17089  17101  17132  18035  18140  18820  18896
  19781  21146  21536  21958

AflII       C'TTAA_G - 5 Cut(s)
   1292   4185  11172  18334  20663

                               [remainder omitted]
== Cut Sites by Restriction Enzyme
sorted by '# cuts, then alphabetically'
ApaI        G_GGCC'C - 1 Cut(s)
  21979

ApaLI       G'TGCA_C - 1 Cut(s)
  10676

Bpu10I      CC'TnA_GC - 1 Cut(s)
    844

Bpu1102I    GC'TnA_GC - 1 Cut(s)
  13369

NarI        GG'CG_CC - 1 Cut(s)
  16014

NgoAIV      G'CCGG_C - 1 Cut(s)
  15606
                               [remainder omitted]

Tables of Fragment Sizes

(unsorted by fragment size, sorted or both; 'vertical' ordering can also be sorted as in the above example.
  **  SORTED Fragment Sizes by Restriction Enzyme **

AatII       G_ACGT'C - 5 Fragment(s)
   1849   3449   3731   4289   5110  14062

AccI        GT'mk_AC - 5 Fragment(s)
    639   1187   2192   3574  11828  13070

AflII       C'TTAA_G - 2 Fragment(s)
   6078   6541  19871

AflIII      A'CryG_T - 13 Fragment(s)
     35    170    459    493    956   1268   1712   1913   2360   2419   4091
   4920   5733   5961
                               [remainder omitted]

Ladder Map


  Ladder Map of Restriction Enzyme Cut Sites:

            900 1800 2700 3600 4500 5400 6300 7200 8100 9000
              :    :    :    :    :    :    :    :    :    :
     AccI ----------------\-----------------\--------------------         2
     AciI -----------------\-----------\--\---\\------\--\\------         8
    AflII ---------\-------------\---------\--\----\---\---------         6
     AhdI ---------------------------/---------------------------         1
              :    :    :    :    :    :    :    :    :    :
     AluI --|---|||--23||||2||-|2--2---|2|232||-|-|22--2-|3|-||--        50
     AlwI ------2---\-----------\--2-------------\2--------------         9
   Alw26I ----------------------------------------------------\--         1
    AlwNI -----------------------/--------------/--/-------------         3
              :    :    :    :    :    :    :    :    :    :

Infrequent Cutters Map

(a la Strider) of enzymes that cut less than X (a small number of times) - (In Ver 2 this has been changed to be length-sensitive; X = 2 until the sequence increases to beyond 10,000, then X increases proportional to the log of the increased size). This output option is not called explicitly, but is appended to the Ladder map (-l).
It should be obvious, but each label is of the form:
Name@Cut_Site
                                                                                     
 == Summary of Enzymes that cut ** 2 ** times or less:
                           XcmI@9286
      XcmI@2466           UbaEI@9054
    Sse8647I@1801          SexAI@9288  UbaDI@13309
        SmaI@2882            PstI@10119       PstI@15573
    Psp5II@1801                 Pfl1108I@10898   SgrAI@16556
      NruI@2248              HgiEII@10152            PflMI@18025
        NcoI@2848        SexAI@8621  EcoNI@12615  NruI@17146
        BstXI@2962              EcoNI@10850   NgoAIV@15606  PmlI@20299
           BstEII@4076             EagI@12054  NarI@16014   Pfl1108I@20364
   BstEII@1183                  Bsu36I@10980         HgiEII@18039 MluI@22230
        BstDSI@2848   NcoI@7724        Bpu1102I@13369  PacI@18544 MluI@22201
  Bpu10I@844   PshAI@5399      ApaLI@10676     EagI@16150   BstXI@20204
  AatII@1118          BstDSI@7724  AatII@12143   Bsu36I@16688    ApaI@21979
  ||| | |  |   |      |  ||| | ||  | | |      || ||  | |    |    ||
-------------------------------------------------------------------
     :     :     :     :     :     :     :     :     :     :     :
  2000  4000  6000  8000 10000 12000 14000 16000 18000 20000 22000

Pseudo Gel Map

Below is a section of gel output generated with the width set to '-w75'; the sequence length was 22380 bp (so the gel is automatically extended to 30 Kb).

From examining the UDP data so far, this is the option that is used incorrectly most of the time. To obtain the results below (a 10 bp cutoff), it should be used thusly:
tacg -g10 [other flags] < input.file
If you wanted a 1000 bp cutoff, change it to:
tacg -g1000 [other flags] < input.file



 == Pseudo-Gel Map of Digestions:

      10                   100                  1000                 10000
       .      .  .  . . . ....      .  .  . . . ....      .  .  . . . ....      .  .
 AatII                                               |                    ||          2
  AccI |                 |        |        |   |  | |2|   222 |                      16
AceIII                    |              |     |           2   || |   |               8
  AciI   ||      || ||   | |  | |22 || 33||4 |227||3| |2                             48
       .      .  .  . . . ....      .  .  . . . ....      .  .  . . . ....      .  .
 AflII                                                |  |  | |        2              5
AflIII           |               |    |  |          |   | 2 2     |   |              11
  AhdI      3                                  |      |     |         |     |         7
  AluI |     3 ||3  3|222 |||2223 3334|2352|2|42|||  2|2                             74
       .      .  .  . . . ....      .  .  . . . ....      .  .  . . . ....      .  .
  AlwI   5                     | ||   |  |||  ||||2|| |  3|   ||                     26
Alw26I |           |  2      | |  | |2|| 3 32||3|   |4|2 |    |                      35
 AlwNI                      |             |        |    2  |          |  |            7
  ApaI                                     |                                          1
                         [rest deleted]

Pattern Matching NEW

The pattern matching used in tacg has been extended to a more general approach for nucleic acids so that it is applicable for searching for degenerate patterns with a specified number of errors (in possibly degenerate sequences).
Pattern Matching with Errors NEW
The original tacg allowed searching for nucleic acid patterns with the standard IUPAC degeneracies, but V2 allows you to extend this capability by allowing searching with errors.

Quite often when you are searching for binding sites for transcription patterns, you wish to search for not only degenerate sites but degenerate sites with 'a bit of slop'. You can use tacg to do this in 2 ways, using the '-p' flag:
Say you had defined a site (called NPR) on a promoter by footprinting as 'gtcagggcgaat' and you wanted to search another implicated region for this sequence but you wanted to allow 2 errors to be allowed, so that you could find a sequence that would, for example match as follows:

               ...gtcagggcgaat...
                     x      x 
               ...gtcggggcgatt...
An appropriate flag/option sequence to obtain such a match would be:

tacg -pNPR,gtcagggcgaat,2 -Sl -w75 <input.file |less

Translation: search input.file for the NPR sequence gtcagggcgaat, allowing at most 2 errors (-pNPR,gtcagggcgaat,2), returning the Site data and a ladder map (-Sl) formatted 75 characters wide (-w75) and pipe the results directly to the pager program 'less' (|less)

oryou can specify that the sequence is always supposed to be searched 'with slop' via the addition of a single character in the standard GCG-formatted REBASE file entry:

Here is the standard GCG REBASE format line:

NPR  6   gtcagggcgaat   0  !NPR is searched with 0 errors
Here is my hack to the standard GCG format line:
NPR  6   gtcagggcgaat   0  2 !NPR is always searched with 2 errors
The algorithm to do this is reasonably brute force: an 'n' scan of the sequence is done, replacing every base of the sequence with an 'n'. If there is more than 1 error allowed, then each successive generation is 'n'-scanned. At each iteration, each generated pattern is compared with the previous ones so that only novel patterns are actually stored. This approach dramatically decreases the number of patterns searched for, but can consume substantial CPU resources in the validation for long patterns with many errors allowed. There's a tweak to the algorithm that will also substantially decrease this time, but I haven't implemented it yet. I'll see if anyone uses it 1st.
Proximity Matching NEW
This is another option designed specifically for Transcription Factor (TF) analyses. (see the man page section for details) In many cases, you are interested in a particular juxtaposition of 2 or more TFs; this option allows you to specify up to 10 sets of relationships between TFs and extract only those which match your criteria. The output has been organized so that you can see both the raw data (All) and the filtered data (Matched) in the same output stanza (below):
== Proximity Matches:

lef(1E)/IHF(0E) * 1 matches * found from search parameters: lef,200,IHF
   57309/57116     


 == Ladder Map of Enzyme Sites:  
                                                                          # hits

           10000 20000 30000 40000 50000 60000 70000 80000 90000100000
               :     :     :     :     :     :     :     :     :     : 

       lef |--|-|-|463222222||||25|2|243---|324|222222--32223|-||3-3|--   98 (All)
       lef ----------------------------------|-------------------------    1 (Matched)
       IHF ----------------------------------|-------------------------    1 (Matched)
       IHF ----------------------------|-----|-------------------------    2 (All)

               :     :     :     :     :     :     :     :     :     :
Pattern Match Data Output by 'Bins' NEW
This option (-Gbin_width,X|Y|L) was included to allow the results of the matching to be exported to other programs for further analysis. Especially where there is a high number of matches, it may be of more use to plot or analyze these data using an external program, whether it be a spreadsheet, plotting app, or custom-writ program. I don't presume to know what you might want to do with the data and so to make tacg play nicely with other apps in the best unix tradition, I've tried to address some of the simpler data output formats that might be of most use. Output is possible in 2 rectangular formats (X and Y; one is the diagonal flip of the other) as well as in a long form (L)that may be better suited for simple, custom-writ programs that just want to read x,y pairs. The very slick gnuplot can read the multiple columns of the Y form with only a very small amount of editing or filtering of the output. An example of gnuplot's processing of some tacg-generated data is shown as well. See also the gnuplot stanza in the section below.

Using tacg with Other Programs

tacg is obviously not the only molbio program you need to do your work. There are many analyses tacg just cannot do; to that end, it at least regurgitates your data in a re-digestible form so that you can masticate it further elsewhere. If you can program a bit, for quick and dirty programs you probably can't do better than perl (but if you can program a bit, you already know that). If you can't program, importing your data into a spreadsheet and manipulating it cell by cell is certainly viable with the sophistication of modern spreadsheets.

In terms of working with your data in the context of tacg, there are lots of other programs that make input easier, and output more readable/browsable. Here are some that I use on a regualr basis:

Using tacg to Scan Databases

Well, to make a short story shorter, you can't. tacg V2 has no built-in functions to scan databases, although it is certainly robust enough to do so. In order for tacg to deal with databases, you have to supply the scripts to feed sequences into tacg, one by one. tacg has a minimal separation line to delineate between analyses, but nothing more than that.

Two things would be very handy for database handling: the ability to deal transparently with gzipped/compressed data and the ability to handle different database formats in a reasonable manner. The 1st is available via the zlib library; the second, via the SEQIO library. These (along with the inevitable bug patches) are the next goal for tacg. From the docs, they both appear to be comprehensible; we'll see....

The Odd one: UDP Reporting

tacg was also designed to track it's own use and spread - something in which I'm also interested. To that end, the binaries have been compiled with code (udping.c) that spits a small amount of data back to me at each usage. To wit: It does NOT return host or domain names, user names, or actual sequence. The exact data (minus IP number which is inherent in the UDP protocol) that is returned is shown on stderr (usually the screen) at each use. ie.

Returned: [hw=IP22 os=IRIX osver=5.3] [TACG Version 1.5F] /usr/local/bin/tacg -n5 -o5 -F2 -slL -g100 <100000 bp>

If your machine is not connected to a network, is behind a firewall, or there is a problem in getting the data to my machine, there is no penalty, as it's UDP output (unverified). This allows me to find out what OSs it's been ported to, what the popular flags are and which ones are used little if at all. I understand that some people may be queasy about this function, especially if they haven't seen the source code (which is freely available), so this function can be disabled from the command line (-q) and can easily be eliminated when recompiling (see tacg.h in the source code). If you do this, I'd appreciate an email letting me know what you think of it otherwise. I'd also be interested in hearing suggested refinements and critiques of this process.


Changes and Fixes since Ver 2.28

Major Minor

Changes and Fixes since Ver 1.71

Major Minor

Changes and Fixes since Ver 1.5

  • Major

    Minor


    Changes and Fixes since Ver 1.34

    Major

    Minor Several bugs and much seemingly unconnected weirdness disappeared when I used getopt().


    Bugs and Oddities

    Major Minor I am extremely interested in hearing about new bugs, suggestions for new features, and overall comments on the program.

    To Do List

    Cheers Harry

    ...And The Small Print Taketh Away...

    COPYRIGHT NOTICE

    Copyright (c) 1996,1997 by Harry Mangalam at Univ. of California, Irvine

    Permission to use, copy, modify, and distribute this software and its documentation is hereby granted, subject to the following restrictions and understandings:

    1. Any copy of this software or any copy of software derived from it must include this copyright notice in full.
    2. All materials or software developed as a consequence of the use of this software or software derived from it must duly acknowledge such use, in accordance with the usual standards of acknowledging credit in academic research.
    3. The software may be used by anyone for any purpose, except that its redistribution for profit or inclusion in other software sold for profit requires the express, written permission of the author. Note that the written permission of the author is not required for any use of the software that may result in profit, and the author waives any rights to any part of profit gained by any use other than its redistribution for profit.
      In plain English, the only thing that requires the express, written permission is if you want to sell software that includes or uses this software. Any use other than selling software that uses this software is fine, and any profit made from that use is yours to keep.
    4. This software is provided AS IS with no warranties of any kind. The author shall have no liability with respect to the infringement of copyrights, trade secrets or any patents by this software or any part thereof. In no event will the author be liable for any lost revenue or profits or other special, indirect and consequential damages.