Micro-Inversion Detector (MID)

MID is a tool that detects microinversions (MIs) by mapping initially unmapped short reads back onto the reference genome sequence. The input is an unmapped BAM file. The output consists of two files: output_i, which holds the detailed alignment of each unmapped read that contains an MI, and o_inv, which lists the unique MIs.

Download MID source code: MID.tar.gz (updated on 12/25/2015)

MANUAL

PREREQUISITES

64 bit GNU/Linux
GCC 4.0 with Standard C++ Library
Python 2.7
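
The prerequisites above can be checked before installing anything. The following is a minimal sketch and not part of MID: the names find_on_path and check_prerequisites are made up for illustration, and the script only reports what it finds rather than enforcing the versions listed.

```python
import os
import platform
import sys


def find_on_path(program, search_path=None):
    """Return the full path of an executable found on PATH, or None."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def check_prerequisites():
    """Collect basic facts about the host (hypothetical helper, not part of MID)."""
    return {
        "is_linux": platform.system() == "Linux",
        # True when the running Python interpreter is a 64-bit build.
        "python_is_64bit": sys.maxsize > 2 ** 32,
        "python_version": "%d.%d" % sys.version_info[:2],
        # None when no gcc executable is found on PATH.
        "gcc_path": find_on_path("gcc"),
    }


if __name__ == "__main__":
    for key, value in sorted(check_prerequisites().items()):
        print("%s: %s" % (key, value))
```

The same find_on_path helper could be pointed at any other executable the installation needs.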

USAGE

Download the MID source code (MID.tar.gz) from http://cqb.pku.edu.cn/ZhuLab/MID
Install bowtie from http://sourceforge.net/projects/bowtie-bio/files/bowtie
or download here Bowtie
The bowtie executable must be on the system's $PATH environment variable
Get UCSC hg19.fa and the pre-built bowtie index for UCSC hg19 from
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
http://bowtie-bio.sourceforge.net/manual.shtml
Or download here hg19.fa Bowtieindex
Extract the index with
$ tar -xzvf index.tar.gz
Download the Cython, pysam and Biopython modules for Python from
https://pypi.python.org/pypi/Cython/
http://code.google.com/p/pysam/downloads/list
http://biopython.org/DIST
Or download here cython pysam-0.7 Biopython
Install the modules with
$ tar -xzvf Cython-0.17.4.tar.gz
$ cd Cython-0.17.4
$ python setup.py install
$ cd ..
$ tar -xzvf pysam-0.7.tar.gz
$ cd pysam-0.7
$ python setup.py install
$ cd ..
$ tar -xzvf biopython-1.66.tar.gz
$ cd biopython-1.66
$ python setup.py install
$ cd ..
Extract MID.tar.gz by $ tar -xzvf MID.tar.gz
Run the program from the command line
$ python MID.py -a unmapped -r hg19.fa -i index -v erranchor -p parallel -s anchor -k kmer -m matchnum -e errkmer -g mergenum -c cutsize
[Option]
-a/--unmapped unmapped BAM file of a 1000 Genomes Project sample (e.g., HG01880.unmapped.ILLUMINA.bwa.ACB.low_coverage.20120522.bam)
-r/--reference reference sequence (e.g., hg19.fa)
-i/--index bowtie index (e.g., hg19)
-v/--erranchor error number in the anchors (default: 1)
-p/--parallel number of alignment threads (default: 1)
-s/--anchor length of anchors (default: 18)
-k/--kmer length of kmers (default: 14)
-m/--matchnum number of matching serial (default: 5)
-e/--errkmer error number in each kmer (default: 2)
-g/--mergenum deviation for merging two subsequences (default: 3)
-c/--cutsize length of cutting size (default: 0)
To compile the files yourself, first extract MID.tar.gz as in step (5), then remove the previous executable files and rebuild with
$ make clean
$ make
Remove all the files of MID by $ make remove
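
The run command from the steps above can also be assembled programmatically. The sketch below is an illustration and not part of MID: build_mid_command and DEFAULTS are hypothetical names, the flags and default values are copied from the option list, and the BAM, reference and index values are whatever the caller passes in.

```python
# Short flags and default values as given in the option list above.
DEFAULTS = [
    ("-v", 1),   # error number in the anchors
    ("-p", 1),   # number of alignment threads
    ("-s", 18),  # length of anchors
    ("-k", 14),  # length of kmers
    ("-m", 5),   # number of matching serial
    ("-e", 2),   # error number in each kmer
    ("-g", 3),   # deviation for merging two subsequences
    ("-c", 0),   # length of cutting size
]


def build_mid_command(unmapped_bam, reference, index, overrides=None):
    """Return the MID.py command as a list of arguments (hypothetical helper)."""
    overrides = overrides or {}
    unknown = set(overrides) - set(flag for flag, _ in DEFAULTS)
    if unknown:
        raise ValueError("unknown option(s): %s" % ", ".join(sorted(unknown)))
    command = ["python", "MID.py", "-a", unmapped_bam, "-r", reference, "-i", index]
    for flag, default in DEFAULTS:
        # Use the caller's value when given, otherwise the documented default.
        command += [flag, str(overrides.get(flag, default))]
    return command


if __name__ == "__main__":
    # "sample.unmapped.bam" is a placeholder file name.
    print(" ".join(build_mid_command("sample.unmapped.bam", "hg19.fa", "hg19")))
```

The returned list can be handed to subprocess.call, which avoids shell quoting problems with long BAM file names.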

EXAMPLE

For HG01880 from the 1000 Genomes Project, the command line would be:
$ python MID.py -a HG01880.unmapped.ILLUMINA.bwa.ACB.low_coverage.20120522.bam -r hg19.fa -i hg19 -v 1 -p 1 -s 18 -k 14 -m 5 -e 2 -g 3 -c 0
The input file (unmapped BAM file) and output files (output_i, o_inv) are available.
Extract input file by $ tar -xzvf input.tar.gz
Extract output files by $ tar -xzvf output.tar.gz

MAF FORMAT

The format of each read in the output file "output_i" is

[Image: example record from output_i]

The first line is the name of the short read. The second line, starting with “s”, is the reference sequence for the read, and the third and fourth lines are the alignments on the forward and reverse strands. Each “s” line has seven columns:
Column 1: the letter “s”, which marks an alignment line
Column 2: the chromosome and species of the reference sequence, or the name of the read
Column 3: the start position of the sequence that follows
Column 4: the length of the aligned sequence
Column 5: the strand to which the sequence is aligned (“+” for the forward strand, “-” for the reverse strand)
Column 6: the size of the entire source sequence
Column 7: the aligned sequence
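
The column layout described above maps onto a small parser. The sketch below is an illustration only: parse_s_line and SLine are hypothetical names, the sample line is made up rather than taken from real MID output, and the code assumes that columns are separated by whitespace.

```python
from collections import namedtuple

# Field names follow the column descriptions above; the names are illustrative.
SLine = namedtuple(
    "SLine",
    ["source", "start", "length", "strand", "source_size", "sequence"],
)


def parse_s_line(line):
    """Split one whitespace-separated "s" line into typed fields."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "s":
        raise ValueError("not a 7-column 's' line: %r" % (line,))
    if fields[4] not in ("+", "-"):
        raise ValueError("unexpected strand: %r" % (fields[4],))
    return SLine(
        source=fields[1],
        start=int(fields[2]),
        length=int(fields[3]),
        strand=fields[4],
        source_size=int(fields[5]),
        sequence=fields[6],
    )


if __name__ == "__main__":
    # Made-up sample line, not real MID output.
    sample = "s hg19.chr1 1000 12 + 249250621 ACGTACGTACGT"
    print(parse_s_line(sample))
```

Lines that do not start with “s”, such as the read name line, are rejected with a ValueError, so a caller would filter on the first column before parsing.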