DOCUMENTATION FOR MAP

******************************************

CONTENTS

I. PREREQUISITES

II. INSTALLATION

III. RUNNING MAP

IV. INPUT DATA FILES

V. OUTPUT FILES

VI. COMMAND LINE OPTIONS




I. PREREQUISITES

 1. 1 GB RAM or more
	 
 2. 32 bit or 64 bit GNU/Linux 
	
 3. GCC 4.0 or later with the Standard C++ Library 
	
 4. GNU make (optional) 
	

II. INSTALLATION

 1. Download MAP source code from http://mech.ctb.pku.edu.cn/MAP/
	 
 2. Extract the archive, e.g. tar -xzf MAP.tar.gz 
	
 3. Switch to the directory you extracted MAP to, e.g. cd MAP 
	
 4. Build the executable: type make 
	
 5. MAP is now installed. Run the MAP assembler with: ./assemble [OPTION] [VALUE] ... 
	
 6. Optionally, remove the source code afterwards: make remove
 
 
	
III. RUNNING MAP

 Once MAP is successfully installed, you can run it directly from the command line:
 
 >assemble [OPTION] [VALUE] ...
 
 For example:
 
 >assemble -p amd.d -s amd007.fasta -q amd007.fasta.qual -m amd007.fasta.mate -o amd007
 
 
 
IV. INPUT DATA FILES
 
 MAP accepts input sequence files in two formats: the FASTA format and the FRG format,
 
 which was designed as input for the Celera Assembler.
 
 Input sequence files are specified after the option -s; multiple sequence files can be given,
 
 separated by commas. MAP identifies the format of each input sequence file from its filename:
 
 a file whose name has the suffix ".frg" is recognized as an FRG format file,
 
 and any other file is treated as a FASTA format file.
 
 The FRG format is described at
 
 http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=FRG_Files
 
 If FRG format files are given, MAP reads everything it needs from them:
 
 sequences, base quality scores, and mate pair information, if any.
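 
 The suffix rule described above can be sketched in a few lines of Python. This is only an
 illustration, not MAP's own code: the function names are hypothetical, and the case-sensitive
 suffix comparison is an assumption, since this document does not say how case is handled.

```python
def detect_format(path):
    """Guess the input format from the filename suffix.

    Assumption: the comparison is case-sensitive ("reads.FRG" counts as FASTA).
    """
    return "FRG" if path.endswith(".frg") else "FASTA"


def split_inputs(option_value):
    """Split a comma-separated file list and label each file with its format."""
    return [(p, detect_format(p)) for p in option_value.split(",") if p]


if __name__ == "__main__":
    for name, fmt in split_inputs("amd007.fasta,extra.frg"):
        print(name, fmt)
```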
 
 
 MAP also accepts sequence files in FASTA format. If a FASTA file is specified,
 
 you may optionally specify a corresponding quality file after the option -q.
 
 MAP uses the base quality scores in the overlap calculation and in the consensus stage.
 
 If multiple FASTA sequence files are specified, multiple quality files can be specified as well,
 
 separated by commas. Note that the name of the quality file corresponding to a sequence file
 
 must be the name of the FASTA file with ".qual" appended.
 
 The format of the .qual file is similar to that of the corresponding FASTA sequence file. For each read
 
 there should be a header line identical to that in the FASTA file, followed by one or more
 
 lines giving the quality of each base. Quality values should be integers between 0 and 99
 
 (inclusive), separated by spaces. The total number of quality values for each read
 
 must match the number of bases for that read in the FASTA file. The quality scores should be Phred
 
 quality scores, defined by q = -10 log_10(p), where p is the probability
 
 of an error in the base call and q is the quality score of that base.
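 
 Before running MAP, the constraints above can be checked with a short script. The following Python
 sketch is not part of MAP; the function names are hypothetical, and it assumes that reads appear
 in the same order in the FASTA file and in the .qual file, which this document does not state.

```python
def read_records(lines):
    """Yield (header, body_lines) for FASTA-style text (header lines start with '>')."""
    header, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, body
            header, body = line, []
        elif line.strip() and header is not None:
            body.append(line.strip())
    if header is not None:
        yield header, body


def is_valid_quality(token):
    """True if token is an integer between 0 and 99 (inclusive)."""
    return token.isascii() and token.isdigit() and int(token) <= 99


def check_qual(fasta_lines, qual_lines):
    """Return a list of problems; an empty list means the two files agree."""
    problems = []
    seqs = list(read_records(fasta_lines))
    quals = list(read_records(qual_lines))
    if len(seqs) != len(quals):
        problems.append("different number of reads")
    # Assumption: reads are listed in the same order in both files.
    for (seq_header, seq_body), (qual_header, qual_body) in zip(seqs, quals):
        if seq_header != qual_header:
            problems.append(f"header mismatch: {seq_header!r} vs {qual_header!r}")
            continue
        n_bases = sum(len(part) for part in seq_body)
        values = " ".join(qual_body).split()
        if not all(is_valid_quality(v) for v in values):
            problems.append(f"{seq_header}: quality values must be integers in 0..99")
        if len(values) != n_bases:
            problems.append(f"{seq_header}: {len(values)} quality values for {n_bases} bases")
    return problems
```

 To check real files, pass open file objects, e.g. check_qual(open("amd007.fasta"), open("amd007.fasta.qual")).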
 
 You may also specify a constant quality score with the option -d; this value is assigned to all bases
 
 of FASTA files that have no corresponding quality file. The default value is 23,
 
 which indicates a mean sequencing error rate of about 0.005.
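 
 The relation between the default score of 23 and an error rate of about 0.005 follows directly from
 the Phred formula q = -10 log_10(p). A minimal Python sketch of the conversion in both directions
 (the function names are illustrative only):

```python
import math


def phred_to_error(q):
    """Error probability p for a Phred quality score q: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality score q for an error probability p: q = -10 log_10(p)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    print(phred_to_error(23))     # close to 0.005
    print(error_to_phred(0.005))  # close to 23
```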
 
 
 Mate pair information should be provided in a file whose name is the name of the FASTA file with 
 
 ".mate" appended. Mate pair files are specified with the option -m. Again, multiple mate files can be
 
 separated by commas. Each line of a mate pair file describes one mate pair and consists of four strings.
 
 The first two are the ids of the reads; the last two are the distribution parameters (integers)
 
 of the mate pair length, or insert length: the mean and the standard deviation. The id of each read
 
 is the string that follows the ">" in the header line of the read in the FASTA file and ends at
 
 the first blank space. Although we strongly recommend providing mate pair files so that MAP can
 
 perform at its best, MAP can also run without mate pair information. Mate pair files may therefore
 
 be missing for some or all of the sequence files.
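 
 The read id rule and the four-field mate pair line described above can be sketched in Python as
 follows. This is an illustration rather than MAP's parser: the function names are hypothetical, and
 the fields of a mate pair line are assumed to be separated by whitespace, since this document only
 says that each line consists of four strings.

```python
def read_id(header_line):
    """Return the read id: the text after '>' up to the first blank space."""
    return header_line.rstrip("\n")[1:].split(" ", 1)[0]


def parse_mate_line(line):
    """Parse one mate pair line into (id1, id2, mean, standard_deviation).

    Assumption: the four fields are whitespace-separated.
    """
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    id1, id2, mean, std = fields
    return id1, id2, int(mean), int(std)


if __name__ == "__main__":
    print(read_id(">read1 some description"))
    print(parse_mate_line("read1 read2 3000 300"))
```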
 
 

V. OUTPUT FILES 

 MAP generates four output files. The output prefix can be specified with the option -o (by default,
 
 the prefix is "assembly"). The first output file, with the suffix ".contigs", gives the contig consensus
 
 sequences in FASTA format. The second output file, with the suffix ".contiginfo", gives the read map of each contig.
 
 Each contig begins with a header identical to that in the ".contigs" file, followed by one line per read
 
 giving the map information (id, strand, starting position, ending position) of that read in the contig.
 
 The third output file, with the suffix ".singlets", gives the reads that were not assembled into contigs.
 
 The last file, with the suffix ".stat", gives some statistics of the final assembly.
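 
 For scripts that post-process an assembly, the expected output file names can be derived from the
 prefix. The Python sketch below assumes that each suffix is appended directly to the prefix given
 with -o; the function name is hypothetical.

```python
def output_files(prefix="assembly"):
    """List the four output file names for a given output prefix.

    Assumption: the file name is the prefix followed directly by the suffix.
    """
    suffixes = (".contigs", ".contiginfo", ".singlets", ".stat")
    return [prefix + suffix for suffix in suffixes]


if __name__ == "__main__":
    for name in output_files("amd007"):
        print(name)
```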
 
 
 
VI. COMMAND LINE OPTIONS
 
 -s		Sequence file(s) in FASTA format or FRG format, separated by commas
 -q		Quality file(s), separated by commas
 -m		Mate pair file(s), separated by commas
 -o [string]	Output prefix (by default "assembly")
 -k [integer]	Kmer length (by default 17)
 			Before calculating overlaps, MAP selects pairs of reads that share kmers as candidate overlapping pairs.
 -n [integer]   Number of kmer records written to the temporary files at a time to free memory (by default 10000000)
                         This parameter is used while MAP reads the kmers of all the reads and records the position of each kmer
                         in its read. The number of kmers grows with the number of reads and can require a large amount of memory,
                         so kmers are kept in temporary files to reduce the memory demand at the cost of a longer runtime.
 -l [integer]	Minimal overlap length (by default 30)
 -d [integer]	Constant quality score for FASTA files without a quality file (by default 23)
 -e [float]     Maximal allowed overlap error rate (by default 0.05)
 			MAP uses this value in the overlap calculation. Providing an accurate mean error rate improves the accuracy of
 			overlap identification between reads. Generally, a higher error rate increases the rate of false positive overlaps,
 			while a lower error rate decreases the sensitivity for correct overlaps. 
 -t [integer]   Maximum number of threads (by default 1)
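 
 To illustrate the role of the kmer length given with -k, the Python sketch below selects candidate
 pairs of reads that share at least one kmer. It is a simplified illustration, not MAP's
 implementation: it keeps everything in memory, considers the forward strand only, and uses
 hypothetical names throughout.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=17):
    """Return the set of read id pairs that share at least one kmer.

    reads: dict mapping read id -> sequence string.
    Simplification: reverse complements are ignored.
    """
    index = defaultdict(set)  # kmer -> ids of the reads that contain it
    for read_id, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)

    pairs = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    reads = {"r1": "ACGTACGTAA", "r2": "GTACGTAACC", "r3": "TTTTTTTTTT"}
    print(candidate_pairs(reads, k=5))
```

 A larger k makes shared kmers less likely to occur by chance, so fewer pairs are proposed; a smaller k
 proposes more pairs, each of which must then be examined in the overlap calculation.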
 
