Micro-Inversion Detector (MID)

MID is a tool for detecting microinversions (MIs) by mapping initially unmapped short reads back onto a reference genome sequence. The input is an unmapped BAM file. The output consists of two files: output_i, which contains the detailed alignment of each unmapped read with MIs, and o_inv, which lists the unique MIs.


  • Download MID source code: MID.tar.gz (updated 12/25/2015)

MANUAL

PREREQUISITES

  • 64 bit GNU/Linux
  • GCC 4.0 with Standard C++ Library
  • Python 2.7

USAGE

  1. Download the MID source code (MID.tar.gz) from http://cqb.pku.edu.cn/ZhuLab/MID
  2. Install bowtie from http://sourceforge.net/projects/bowtie-bio/files/bowtie
    Or download here: Bowtie
    The bowtie executable must be on the system's $PATH.
  3. Get UCSC hg19.fa and the pre-built bowtie index for UCSC hg19 from
    http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
    http://bowtie-bio.sourceforge.net/manual.shtml
    Or download here: hg19.fa, Bowtie index
    Extract the index by
    $ tar -xzvf index.tar.gz
  4. Download the Cython, pysam, and Biopython modules for Python from
    https://pypi.python.org/pypi/Cython/
    http://code.google.com/p/pysam/downloads/list
    http://biopython.org/DIST
    Or download here: Cython, pysam-0.7, Biopython
    Install the modules by
    $ tar -xzvf Cython-0.17.4.tar.gz
    $ cd Cython-0.17.4
    $ python setup.py install
    $ cd ..
    $ tar -xzvf pysam-0.7.tar.gz
    $ cd pysam-0.7
    $ python setup.py install
    $ cd ..
    $ tar -xzvf biopython-1.66.tar.gz
    $ cd biopython-1.66
    $ python setup.py install
    $ cd ..
  5. Extract MID.tar.gz by $ tar -xzvf MID.tar.gz
    Run the program from the command line:
    $ python MID.py -a unmapped -r hg19.fa -i index -v erranchor -p parallel -s anchor -k kmer -m matchnum -e errkmer -g mergenum -c cutsize
    [Options]
    -a/--unmapped unmapped BAM file of a 1000 Genomes Project sample (e.g., HG01880.unmapped.ILLUMINA.bwa.ACB.low_coverage.20120522.bam)
    -r/--reference reference sequence (e.g., hg19.fa)
    -i/--index bowtie index (e.g., hg19)
    -v/--erranchor number of errors allowed in the anchors (default: 1)
    -p/--parallel number of alignment threads (default: 1)
    -s/--anchor length of anchors (default: 18)
    -k/--kmer length of kmers (default: 14)
    -m/--matchnum number of matching serial (default: 5)
    -e/--errkmer number of errors allowed in each kmer (default: 2)
    -g/--mergenum deviation allowed when merging two subsequences (default: 3)
    -c/--cutsize length of cutting size (default: 0)
  6. To compile the files yourself, first extract MID.tar.gz as in step 5, then remove the existing executables and rebuild by
    $ make clean
    $ make
    To remove all MID files, run $ make remove
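
The command line in step 5 can also be assembled from a script, which helps when the same settings are reused across many samples. The following is a minimal sketch, not part of MID: build_mid_command and DEFAULTS are hypothetical names, the short flags and default values are copied from the option list above, and the interpreter name "python" is assumed to point at Python 2.7.

```python
# Sketch only: build the MID.py argument list from the documented options.
# build_mid_command and DEFAULTS are illustrative helpers, not shipped with MID.

# Short flags and defaults as listed in the [Options] section of the manual.
DEFAULTS = [
    ("-v", 1),   # --erranchor
    ("-p", 1),   # --parallel
    ("-s", 18),  # --anchor
    ("-k", 14),  # --kmer
    ("-m", 5),   # --matchnum
    ("-e", 2),   # --errkmer
    ("-g", 3),   # --mergenum
    ("-c", 0),   # --cutsize
]


def build_mid_command(unmapped_bam, reference, index, overrides=None,
                      python="python"):
    """Return the MID.py command as a list of strings.

    overrides maps a short flag (for example "-p") to a replacement value.
    """
    overrides = overrides or {}
    known = set(flag for flag, _ in DEFAULTS)
    unknown = sorted(set(overrides) - known)
    if unknown:
        raise ValueError("unknown option(s): %s" % ", ".join(unknown))

    cmd = [python, "MID.py", "-a", unmapped_bam, "-r", reference, "-i", index]
    for flag, default in DEFAULTS:
        cmd += [flag, str(overrides.get(flag, default))]
    return cmd


# Print the command with four alignment threads instead of the default one.
print(" ".join(build_mid_command(
    "HG01880.unmapped.ILLUMINA.bwa.ACB.low_coverage.20120522.bam",
    "hg19.fa", "hg19", overrides={"-p": 4})))
```

The returned list can be passed to subprocess.check_call from the directory that contains MID.py. Keeping the command as a list avoids shell quoting problems with long BAM file names.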

EXAMPLE

  • For HG01880 from the 1000 Genomes Project, the command line would be:
    $ python MID.py -a HG01880.unmapped.ILLUMINA.bwa.ACB.low_coverage.20120522.bam -r hg19.fa -i hg19 -v 1 -p 1 -s 18 -k 14 -m 5 -e 2 -g 3 -c 0
  • The input file (unmapped BAM file) and the output files (output_i, o_inv) are available for download.
    Extract the input file by $ tar -xzvf input.tar.gz
    Extract the output files by $ tar -xzvf output.tar.gz
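
Before launching a run such as the HG01880 example, it can save time to confirm that the setup from the USAGE steps is in place. The sketch below is an assumption-laden helper, not part of MID: find_on_path, missing_index_files and preflight are hypothetical names, and the index check assumes the standard bowtie 1 small-index file suffixes (.1.ebwt through .4.ebwt, .rev.1.ebwt, .rev.2.ebwt).

```python
import os

# Sketch only: pre-flight checks for a MID run. None of these helpers
# are provided by MID itself.

# File suffixes of a bowtie 1 small index (assumed; large indexes differ).
EBWT_SUFFIXES = [".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt",
                 ".rev.1.ebwt", ".rev.2.ebwt"]


def find_on_path(program, path=None):
    """Return the full path of an executable found on PATH, else None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_index_files(index_prefix):
    """List the expected bowtie index files that do not exist."""
    expected = [index_prefix + suffix for suffix in EBWT_SUFFIXES]
    return [name for name in expected if not os.path.isfile(name)]


def preflight(unmapped_bam, reference, index_prefix):
    """Return a list of human-readable problems; empty means ready to run."""
    problems = []
    if find_on_path("bowtie") is None:
        problems.append("bowtie not found on PATH")
    for label, name in (("unmapped BAM", unmapped_bam),
                        ("reference", reference)):
        if not os.path.isfile(name):
            problems.append("%s file not found: %s" % (label, name))
    for name in missing_index_files(index_prefix):
        problems.append("bowtie index file not found: %s" % name)
    return problems


for problem in preflight(
        "HG01880.unmapped.ILLUMINA.bwa.ACB.low_coverage.20120522.bam",
        "hg19.fa", "hg19"):
    print(problem)
```

The PATH search is written by hand rather than with shutil.which so that the same file runs under Python 2.7, which MID requires.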

MAF FORMAT

Each read in the output file "output_i" is reported as four lines:

  • The first line is the name of the short read.
  • The second line, starting with “s”, is the reference sequence of the read.
  • The third and fourth lines are the alignments on the forward and reverse strands.

The columns of an “s” line are:

  1. “s”, which marks an alignment line.
  2. The species and chromosome of the reference sequence, or the name of the read, respectively.
  3. The starting point of the following sequence.
  4. The length of the aligned sequence.
  5. The strand to which the following sequence is aligned (“+” for the forward strand, “-” for the reverse strand).
  6. The size of the entire source sequence.
  7. The aligned sequence.
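
The seven columns of an “s” line described above map naturally onto a small record type. The following is a minimal sketch of a parser, assuming the columns are separated by whitespace; SLine and parse_s_line are hypothetical names, and the sample line is invented for illustration rather than taken from real MID output.

```python
from collections import namedtuple

# Sketch only: parse one "s" line into named fields.
# Field names are illustrative; MID does not define them.
SLine = namedtuple(
    "SLine", ["source", "start", "size", "strand", "src_size", "text"])


def parse_s_line(line):
    """Split a whitespace-separated "s" line into an SLine record."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "s":
        raise ValueError("not an 's' line: %r" % line)
    _, source, start, size, strand, src_size, text = fields
    if strand not in ("+", "-"):
        raise ValueError("unexpected strand: %r" % strand)
    return SLine(source, int(start), int(size), strand, int(src_size), text)


# Invented example line (not real MID output).
record = parse_s_line("s hg19.chr1 1000 10 + 249250621 ACGTACGTAC")
print(record.source, record.start, record.strand, record.text)
```

Converting the start, length and source-size columns to integers at parse time makes later coordinate arithmetic less error-prone than working with the raw strings.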