Micro-Inversion Detector (MID)

MID is a tool that detects microinversions (MIs) by mapping initially unmapped short reads back onto the reference genome sequence. The input is an unmapped BAM file. The output consists of two kinds of files: detailed alignments of each unmapped read that contains an MI (output_i) and a list of unique MIs (o_inv).


  • Download MID source code: MID.tar.gz (updated 12/25/2015)


    MANUAL

    PREREQUISITES

    • 64 bit GNU/Linux

    • GCC 4.0 with Standard C++ Library

    • Python 2.7

    USAGE

    1. Download the MID source code (MID.tar.gz) from http://cqb.pku.edu.cn/ZhuLab/MID

    2. Install bowtie from http://sourceforge.net/projects/bowtie-bio/files/bowtie
      or download here Bowtie
      The bowtie executable must be on the system's $PATH environment variable

    3. Get UCSC hg19.fa and the pre-built bowtie index for UCSC hg19 from
      http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
      http://bowtie-bio.sourceforge.net/manual.shtml
      Or download here hg19.fa Bowtieindex
      Extract the index with
      $ tar -xzvf index.tar.gz

    4. Download the Cython, pysam and Biopython modules for Python from
      https://pypi.python.org/pypi/Cython/
      http://code.google.com/p/pysam/downloads/list
      http://biopython.org/DIST
      Or download here cython pysam-0.7 Biopython
      Install the modules with (each archive is assumed to sit in the same directory, so return to it after every install)
      $ tar -xzvf Cython-0.17.4.tar.gz
      $ cd Cython-0.17.4
      $ python setup.py install
      $ cd ..
      $ tar -xzvf pysam-0.7.tar.gz
      $ cd pysam-0.7
      $ python setup.py install
      $ cd ..
      $ tar -xzvf biopython-1.66.tar.gz
      $ cd biopython-1.66
      $ python setup.py install

    5. Extract MID.tar.gz with $ tar -xzvf MID.tar.gz
      Run the program from the command line
      $ python MID.py -a unmapped -r hg19.fa -i index -v erranchor -p parallel -s anchor -k kmer -m matchnum -e errkmer -g mergenum -c cutsize
      [Option]
      -a/--unmapped unmapped BAM file of 1000 Genomes Project sample (e.g., HG01880.unmapped.ILLUMINA.bwa.ACB.low_coverage.20120522.bam)
      -r/--reference reference sequence (e.g., hg19.fa)
      -i/--index bowtie index (e.g., hg19)
      -v/--erranchor error number in the anchors (default: 1)
      -p/--parallel number of alignment threads (default: 1)
      -s/--anchor length of anchors (default: 18)
      -k/--kmer length of kmers (default: 14)
      -m/--matchnum number of matching serial (default: 5)
      -e/--errkmer error number in each kmer (default: 2)
      -g/--mergenum deviation for merging two subsequences (default: 3)
      -c/--cutsize length of cutting size (default: 0)

    6. To compile the files yourself, first extract MID.tar.gz as in step 5, then remove the existing executables and rebuild with
      $ make clean
      $ make
      To remove all MID files, run $ make remove
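
Steps 2 and 4 can be checked before the first run. The following is a minimal sketch, not part of MID: the helper names find_on_path and missing_modules are hypothetical, and the import names Cython, pysam and Bio are the ones those packages normally install under. It only reports whether bowtie is on PATH and whether the three modules can be imported.

```python
import os


def find_on_path(program, path=None):
    """Return the full path of `program` if it is an executable file in one
    of the PATH directories, otherwise None. (Hypothetical helper.)"""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Step 2: bowtie must be reachable through $PATH.
    if find_on_path("bowtie") is None:
        print("bowtie was not found on PATH (see step 2)")
    # Step 4: Cython, pysam and Biopython (imported as Bio) must be installed.
    for name in missing_modules(["Cython", "pysam", "Bio"]):
        print("Python module %s is not importable (see step 4)" % name)
```

The script prints nothing when all prerequisites are found. It avoids newer standard-library helpers so that it should also run under the Python 2.7 interpreter listed in the prerequisites.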

    EXAMPLE

    • For HG01880 from 1000 Genomes Project, the command line would be:
      $ python MID.py -a HG01880.unmapped.ILLUMINA.bwa.ACB.low_coverage.20120522.bam -r hg19.fa -i hg19 -v 1 -p 1 -s 18 -k 14 -m 5 -e 2 -g 3 -c 0

    • The input file (unmapped BAM file) and output files (output_i, o_inv) are available.
      Extract input file by $ tar -xzvf input.tar.gz
      Extract output files by $ tar -xzvf output.tar.gz
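
The example command for HG01880 is simply the required arguments followed by every option at its documented default. As an illustration, the sketch below assembles that argument list in Python. The function build_mid_command and the DEFAULTS table are hypothetical conveniences, not part of MID; the flags and default values are copied from the [Option] list in step 5.

```python
# Option defaults copied from the [Option] list in step 5.
DEFAULTS = [
    ("-v", 1),   # erranchor: error number in the anchors
    ("-p", 1),   # parallel: number of alignment threads
    ("-s", 18),  # anchor: length of anchors
    ("-k", 14),  # kmer: length of kmers
    ("-m", 5),   # matchnum
    ("-e", 2),   # errkmer: error number in each kmer
    ("-g", 3),   # mergenum: deviation for merging two subsequences
    ("-c", 0),   # cutsize
]


def build_mid_command(unmapped_bam, reference, index, **overrides):
    """Return the MID.py command as a list of arguments.

    `overrides` are keyed by the short flag letter, e.g. p=4 for four
    alignment threads. (Hypothetical helper, not shipped with MID.)
    """
    known = set(flag.lstrip("-") for flag, _ in DEFAULTS)
    unknown = set(overrides) - known
    if unknown:
        raise ValueError("unknown option(s): %s" % ", ".join(sorted(unknown)))
    cmd = ["python", "MID.py", "-a", unmapped_bam, "-r", reference, "-i", index]
    for flag, default in DEFAULTS:
        value = overrides.get(flag.lstrip("-"), default)
        cmd.extend([flag, str(value)])
    return cmd
```

Joining the result of build_mid_command("HG01880.unmapped.ILLUMINA.bwa.ACB.low_coverage.20120522.bam", "hg19.fa", "hg19") with spaces reproduces the HG01880 command line shown in this section. The list form can also be passed directly to subprocess.call.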

    MAF FORMAT

    The format of each read in output file "output_i" is

    example

    The first line is the name of the short read, the second line (starting with “s”) is the reference sequence of the read, and the third and fourth lines are the alignments on the forward and reverse strands. Each “s” line has seven columns:

    • Column 1: “s”, which marks an alignment line.
    • Column 2: the chromosome and species of the reference sequence, or the name of the read, respectively.
    • Column 3: the starting point of the sequence in the last column.
    • Column 4: the length of the aligned sequence.
    • Column 5: the strand to which the sequence is aligned (“+” for the forward strand, “-” for the reverse strand).
    • Column 6: the size of the entire source sequence.
    • Column 7: the aligned sequence.
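
The seven-column layout of an “s” line can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the columns are taken to be whitespace-separated, the names SLine and parse_s_line are hypothetical, and the sample line in the usage note is made up for illustration rather than taken from a real output_i file.

```python
from collections import namedtuple

# One field per column after the leading "s", in the order described above.
SLine = namedtuple("SLine", ["src", "start", "size", "strand", "src_size", "text"])


def parse_s_line(line):
    """Parse one whitespace-separated "s" line into an SLine record."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "s":
        raise ValueError("not an 's' line: %r" % line)
    _, src, start, size, strand, src_size, text = fields
    if strand not in ("+", "-"):
        raise ValueError("unexpected strand %r" % strand)
    # Columns 3, 4 and 6 are numeric; the others are kept as strings.
    return SLine(src, int(start), int(size), strand, int(src_size), text)
```

For instance, parse_s_line("s hg19.chr1 1000 8 + 249250621 ACGTACGT") yields a record whose strand field is "+" and whose start field is the integer 1000. Lines that do not begin with “s”, such as the read-name line, raise ValueError and can be handled separately by the caller.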