HyperCopy - About
NNTP Downloading
Copyright (C) by Wen-Jhy Sheen since Jun 1996
HyperCopy's NNTP downloading is designed to achieve two goals: first, to download and decode miscellaneous files posted to USENET newsgroups efficiently and automatically; second, to back up new posts in USENET newsgroups. This version of NNTP downloading has been tested on many popular newsgroups containing thousands of files.
The main features and the downloading strategy that HyperCopy uses are as follows:
- There are two modes for NNTP downloading: auto-decoding, and saving only (no decoding). They are selected on the command line with -DEC+ (default) and -DEC- respectively.
- In both auto-decoding and saving mode, hcopy first scans the newsgroup, analyzes the results, and then downloads the post bodies. The scanning process consists of two steps. The first is to read the message ID (MSGID) of each post; each MSGID is unique across the whole of USENET, and the cost of this step is relatively low. The second is to download the post headers (not including the post body) when necessary.
To avoid duplicate and useless scanning, hcopy scans from the last (newest) post of the newsgroup back to the first. Hcopy remembers every MSGID it has seen within the last 30 days.
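The 30-day memory of MSGIDs can be pictured with the minimal Python sketch below. It is not hcopy's actual code: the class name `SeenMsgIds`, its methods, and the in-memory dictionary are assumptions made for illustration, and how hcopy really stores MSGIDs is not described here.

```python
import time

RETENTION_SECONDS = 30 * 24 * 60 * 60  # 30 days, as stated in the text


class SeenMsgIds:
    """Hypothetical record of MSGIDs and the time each was last seen."""

    def __init__(self):
        self._last_seen = {}  # MSGID -> timestamp in seconds

    def remember(self, msgid, now=None):
        """Record that msgid was seen at time `now` (default: current time)."""
        self._last_seen[msgid] = time.time() if now is None else now

    def is_old(self, msgid, now=None):
        """True if msgid was seen within the retention window."""
        now = time.time() if now is None else now
        stamp = self._last_seen.get(msgid)
        return stamp is not None and now - stamp <= RETENTION_SECONDS

    def purge(self, now=None):
        """Forget MSGIDs that were last seen more than 30 days ago."""
        now = time.time() if now is None else now
        self._last_seen = {
            msgid: stamp
            for msgid, stamp in self._last_seen.items()
            if now - stamp <= RETENTION_SECONDS
        }
```

The optional `now` argument exists only so the expiry behaviour can be exercised without waiting 30 days.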
During the scan:
- If hcopy has already seen the MSGID of a post, that post is assumed to have been scanned before and is ignored. Such a post is said to be old; otherwise, it is new.
- If hcopy encounters more than 20 old posts, it assumes that all remaining posts are old as well and stops scanning. This is a reasonable guess, because the scan runs from the newest post to the oldest.
- If the option -MIN is specified on the command line, hcopy ignores any post that does not have enough rows. Hcopy also remembers the MSGID of such a post so that it is not scanned again.
- A post that passes the above tests is a new post with enough rows. Hcopy then downloads its headers for further analysis.
- If the -X or -Y option is specified on the command line, hcopy tries to parse the file name from the post headers. If a file name is found and it does not satisfy the filtering rules given by -X or -Y, hcopy ignores that post and remembers its MSGID.
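The scan rules above can be summarized in a short Python sketch. This is an illustration under stated assumptions, not hcopy's implementation: the function `scan`, the tuple layout of `posts`, the `"filename"` header key, and the `name_ok` callback standing in for the -X and -Y rules are all hypothetical. The sketch also assumes that the limit of 20 counts all old posts met during the scan; the text does not say whether they must be consecutive.

```python
OLD_POST_LIMIT = 20  # from the text: stop after more than 20 old posts


def scan(posts, seen, min_rows=None, name_ok=None):
    """Return the headers of the posts whose bodies should be downloaded.

    posts    -- iterable of (msgid, rows, fetch_headers) tuples, newest first
    seen     -- set of MSGIDs already known; updated in place
    min_rows -- stand-in for -MIN (None disables the test)
    name_ok  -- stand-in for the -X / -Y rules: file name -> bool
    """
    selected = []
    old_count = 0
    for msgid, rows, fetch_headers in posts:
        if msgid in seen:
            old_count += 1
            if old_count > OLD_POST_LIMIT:
                break  # assume every remaining (older) post is old too
            continue
        if min_rows is not None and rows < min_rows:
            seen.add(msgid)  # too short: remember it so it is skipped next time
            continue
        headers = fetch_headers()  # the costly step, only for candidates
        filename = headers.get("filename")
        if name_ok is not None and filename is not None and not name_ok(filename):
            seen.add(msgid)  # filtered out by name: remember and skip
            continue
        selected.append(headers)
    return selected
```

Posts that are selected are not added to `seen` here, since their MSGIDs are remembered only after the body has been downloaded successfully.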
After the scan, hcopy sorts the post headers by their Subject and From fields to determine the order in which post bodies are downloaded. The reason for sorting is to bring together the fragments of a single file that may have been encoded into several posts with non-consecutive post numbers. After sorting, the fragment posts of the same file can be decoded correctly.
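As a rough illustration of why sorting helps, the Python sketch below orders header records by Subject and then From. The function name `download_order` and the dictionary layout are assumptions; hcopy's real comparison rules are not documented here. Note that a plain string sort like this one would place "(10/12)" before "(2/12)", so a real tool may need to treat part numbers specially.

```python
def download_order(headers):
    """Sort header dicts by (Subject, From) so related fragments are adjacent."""
    return sorted(
        headers,
        key=lambda h: (h.get("Subject", ""), h.get("From", "")),
    )


# Fragments of b.zip arrive with non-consecutive post numbers (101 and 105).
example = [
    {"Subject": "b.zip (2/2)", "From": "x@y", "num": 105},
    {"Subject": "a.jpg (1/1)", "From": "z@y", "num": 104},
    {"Subject": "b.zip (1/2)", "From": "x@y", "num": 101},
]
ordered = download_order(example)  # a.jpg first, then b.zip parts 1 and 2
```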
Then, hcopy will download the post bodies and decode (or save) them. After the body of a post is downloaded successfully, hcopy will remember its MSGID.
- If -DEC+ is enabled (default), hcopy analyzes the headers and body of each post to determine its file name and encoding method, and then decodes it. The supported encoding schemes are UU and Base64. If a file is UU-encoded across several posts, hcopy tries to decode them into a single file where possible.
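The following Python sketch shows one way a UU-encoded file spread over several post bodies could be reassembled, plus the Base64 case. It is only a sketch: the function names `decode_uu` and `decode_base64` are hypothetical, the bodies are assumed to arrive already sorted, and the `suspect` flag is an assumed way to signal a probably incorrect decoding. It relies on the standard `binascii.a2b_uu` and `base64.b64decode` functions.

```python
import base64
import binascii


def decode_uu(fragments):
    """Decode one UU-encoded file from post bodies given in fragment order.

    Returns (filename, data, suspect); suspect is True when a line could
    not be decoded or no "end" line was found.
    """
    name, data = None, bytearray()
    inside = suspect = False
    for body in fragments:
        for line in body.splitlines():
            if not inside:
                parts = line.split(" ", 2)  # expected: "begin <mode> <name>"
                if len(parts) == 3 and parts[0] == "begin":
                    name, inside = parts[2], True
                continue
            if line == "end":
                return name, bytes(data), suspect
            if not line:
                continue  # blank line between fragments
            try:
                data += binascii.a2b_uu(line)
            except ValueError:  # binascii.Error is a ValueError subclass
                suspect = True  # undecodable line: result may be damaged
    return name, bytes(data), True  # no "end" line seen


def decode_base64(body):
    """Decode a Base64 body; line breaks in the text are ignored."""
    return base64.b64decode(body)
```

Lines before the "begin" line are skipped, which lets ordinary text precede the encoded data in the first fragment.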
Hcopy tries to find the original name of each file. If it fails, the name "hhhhhhhh.tmp" is used instead, where hhhhhhhh is a random hexadecimal number. If a file name is found but does not satisfy the filtering rules given by the -X or -Y options, hcopy ignores the file.
When the bodies of all new posts have been downloaded and decoded, hcopy generates a file description list, sorted by file name, which records the file name, length, and subject of each file. If a file was probably decoded incorrectly, a '?' character is appended to its file length.
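The naming rule can be sketched in Python as follows. The helper names `fallback_name` and `output_name`, and the `name_ok` callback standing in for the -X and -Y rules, are hypothetical. The sketch assumes eight lowercase hexadecimal digits and assumes the filter is applied only to names that were actually found; neither detail is confirmed by the text.

```python
import secrets


def fallback_name():
    """Return a name of the form hhhhhhhh.tmp with 8 random hex digits."""
    return secrets.token_hex(4) + ".tmp"


def output_name(parsed_name, name_ok=None):
    """Pick the output file name, or None if the file should be ignored.

    parsed_name -- name recovered from the post, or None if none was found
    name_ok     -- stand-in for the -X / -Y rules: file name -> bool
    """
    if parsed_name is None:
        return fallback_name()
    if name_ok is not None and not name_ok(parsed_name):
        return None  # name found, but rejected by the filtering rules
    return parsed_name
```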
If an old description list already exists in the target directory, hcopy merges the new list with the old one and sorts the result by file name.
The option -FL=<file> can be used to specify the file name of the description list. If not specified, the default one is "~files.lst".
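Merging the description lists might look like the Python sketch below. The on-disk layout of "~files.lst" is not specified here, so the tab-separated line format is purely an assumption, as are the function names. The sketch also assumes that a new entry replaces an old entry with the same file name.

```python
def format_entry(name, length, subject, suspect=False):
    """Build one list line; '?' after the length marks a doubtful decode."""
    return f"{name}\t{length}{'?' if suspect else ''}\t{subject}"


def merge_lists(old_lines, new_lines):
    """Merge old and new description lines and sort them by file name."""
    entries = {}
    for line in list(old_lines) + list(new_lines):
        line = line.rstrip("\n")
        if line:
            # Key on the file name; later (newer) lines overwrite older ones.
            entries[line.split("\t", 1)[0]] = line
    return [entries[name] for name in sorted(entries)]
```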
- When -DEC- is specified, hcopy saves all new posts into a single file, retaining their original format. A row containing only a "." character is inserted between two posts. The name of the output file can be given as local_path on the command line. If local_path is an existing directory, or ends in a '/' or '\' character, the default file name "group_name.msg" is used.
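The saving mode can be illustrated with the Python sketch below. The function names are hypothetical, and the sketch assumes that "group_name" in the default file name stands for the name of the newsgroup being downloaded; if it is meant literally, the helper would simply return a fixed name instead.

```python
import os


def resolve_save_path(local_path, group_name):
    """Pick the output file for saving mode from local_path."""
    default_name = group_name + ".msg"
    if local_path.endswith(("/", "\\")):
        return local_path + default_name  # path explicitly names a directory
    if os.path.isdir(local_path):
        return os.path.join(local_path, default_name)
    return local_path  # treated as the output file name itself


def join_posts(posts):
    """Concatenate posts, putting a row with only "." between two posts."""
    if not posts:
        return ""
    return "\n.\n".join(post.rstrip("\n") for post in posts) + "\n"
```

For example, under these assumptions `resolve_save_path("news/", "alt.test")` yields `news/alt.test.msg`.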