Welcome to SeqSpider!
1. Introduction
2. Requirements
3. Installation
4. Usage
5. Updates
6. Download
7. FAQs
1. Introduction
Deep sequencing has spurred genome-wide mapping of transcription factor binding sites, histone modifications and DNA methylation. However, due to experimental variations, there are still no computational tools that can universally accept and seamlessly integrate different types (discrete/real/profile) of deep sequencing data from different sources to predict regulatory networks. We have developed a Bayesian network inference algorithm, 'SeqSpider', to address this bottleneck. SeqSpider is the first Bayesian network algorithm that enables learning from tag distributions, a unique feature of ChIP-Seq and bisulfite sequencing data. Combined with a profile clustering method for noise removal, it enables ab initio identification of interactions from multiple sources of heterogeneous data. Using datasets of two human embryonic stem cell lines from three laboratories, SeqSpider correctly predicted the interactions between DNA methylation, histone modifications, gene expression, transcription factors and chromatin modification complexes, as well as their underlying motif interactions. Furthermore, the inferred network model predicts an intriguing enhancer-promoter interaction mechanism, in which H3K4me3 serves as a signal relaying hub for information propagation among different epigenetic modifications and regulatory domains. For details, please refer to the paper: Liu, Y., N. Qiao, et al. (2013). "A novel Bayesian network inference algorithm for integrative analysis of heterogeneous deep sequencing data." Cell Res.
Anyone can use the source code, documentation or executable file of SeqSpider free of charge for non-commercial purposes. For commercial use, please contact the author.
2. Requirements
Linux x64 http://www.ubuntu.com/
Perl http://www.perl.org/
3. Installation
Installing SeqSpider only requires unpacking the "seqspider.zip" file on any Linux platform and adding the resulting directory to the PATH.
4. Usage
(1) SeqSpider.pl
The program "SeqSpider.pl" builds a Bayesian network of TFs, histone modifications and DNA methylation. It takes as input the raw deep sequencing reads (usually
in BED format files) and a REFFLAT file, which gives the TSS positions of the genes and can be downloaded from the UCSC Genome Browser. The following example uses the
REFFLAT file "data/hg18/refFlat.txt" and several BED format files under the "data" directory as input. SuperKmeans groups the TSS profiles into 1000 clusters, and the
output file is "test_matrix.txt".
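To make the BED input more concrete, here is a minimal Python sketch of how the columns of such a file can be read. It relies on FAQ 4 below, which states that the fourth column is not used and the fifth column holds the raw count for the region. The function name `bed_counts` and the sample lines are made up for illustration; this is not how "scan_bed_for_TSS.pl" is implemented.

```python
# Minimal sketch: read BED-style lines and keep the fifth column as a raw count.
# bed_counts and the sample lines are hypothetical, not part of SeqSpider.

def bed_counts(lines):
    """Return (chrom, start, end, count) tuples from tab-separated BED lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and optional track/browser/comment header lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Column 4 (name) is ignored; column 5 is taken as the raw count.
        count = float(fields[4])
        records.append((chrom, start, end, count))
    return records


sample = [
    "chr1\t1000\t1200\tregionA\t12\n",
    "chr1\t5000\t5200\tregionB\t3\n",
]
for record in bed_counts(sample):
    print(record)
```

A quick check like this can help confirm that your own BED files carry a numeric fifth column before running the full pipeline.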
Usage: perl SeqSpider.pl --help
--refSeq      list of target elements in refseq format,
              refFlat.txt in hg18 is included in the package (default).
              If your bed files are based on other genome or other assembly
              of human, you should download corresponding file from UCSC.
--methyFiles  REQUIRED, use comma "," to delimit
              multiple inputs, wildcard "*" supported,
              eg: -methy "data1/*.bed,data2/*.bed"
--shift       shift size towards 3' end of short reads (default:0)
--debug       to debug the pipeline
--output      FILENAME ( "TSS_matrix.txt")
--reg_factor  reg_factor
--percent     percent of each group (default 0.9)
--SKcluster   do super K mean cluster, with n groups
--SKtryN      SK Trial_num, default 20
--Help        Show this message
Example usage:
perl SeqSpider.pl --refSeq data/hg18/refFlat.txt --methyFiles "data/*.bed" --output test_matrix.txt --SKcluster 1000
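When BED files live in several directories, the help text says to join the inputs with commas in a single --methyFiles value. The Python sketch below assembles such a command line. The helper name `build_seqspider_command` and the directories "data1" and "data2" are illustrative assumptions; only the option names come from the help text above.

```python
import shlex


def build_seqspider_command(bed_patterns, ref_flat, output, n_clusters):
    """Assemble the argument list for SeqSpider.pl (hypothetical helper)."""
    return [
        "perl", "SeqSpider.pl",
        "--refSeq", ref_flat,
        # Several inputs are joined with commas into one --methyFiles value.
        "--methyFiles", ",".join(bed_patterns),
        "--output", output,
        "--SKcluster", str(n_clusters),
    ]


cmd = build_seqspider_command(
    ["data1/*.bed", "data2/*.bed"],
    "data/hg18/refFlat.txt",
    "test_matrix.txt",
    1000,
)
# shlex.join quotes the wildcard value, so a shell would pass it on unexpanded,
# as in the quoted example above.
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list `cmd` can be passed to `subprocess.run` from the SeqSpider directory.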
(2) exeABCD.pl
The program "exeABCD.pl" builds a Bayesian network from a matrix file. The matrix file is tab-separated; each row represents a gene and each column a node (regulator). If a node is
represented by a vector (for example, a regulator divided into 10 bins around the TSS region), all columns of that vector should share the same header, as shown in the example file
"data/human_ESC_regulaters.tsv".
Usage: perl exeABCD.pl --help
--input          FILENAME ( "*_matrix.txt")
--reg_factor     reg_factor (default 3)
--is_wise        is_wise for nips (default 2)
--percent        percent of each group (default 0.9) or exact gene numbers if >2
--if_normalized  whether to do column-wise normalization (default 1)
--pseudo_count   used in normalization (default 1; for ChIP-seq data, 1 is better)
--SKcluster      do super K mean cluster, with n groups
--SKmaxIter      max iteration for SKmeans, default 400
--SKtryN         SK Trial_num, default 20
--mix            set data types for each node (default: estimated from the column
                 names of the input file). For example, "10 10 8 1" means there are
                 29 columns in the input data: the first two nodes are 10-dimensional
                 vector data, each with its columns sharing one name; the third node
                 is 8-dimensional vector data; and the last node is continuous data.
                 In addition, you can manually add more discretized data, in the form
                 "10 10 8 1 0 0" (31 columns in total).
--help
Example usage:
perl exeABCD.pl --input data/test_matrix.txt -SKcluster 1000
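The relationship between the matrix header and the --mix value can be sketched in Python as follows. This is only an illustration of the rule stated in the help text (consecutive columns with the same name form one vector node); it is not the estimation code used by "exeABCD.pl". The column names are made up, the header is assumed to list node columns only, and each "0" entry is assumed to occupy a single column, which is consistent with the 31-column total quoted above.

```python
from itertools import groupby


def mix_from_header(column_names):
    """Sizes of runs of consecutive identical column names, e.g. [10, 10, 8, 1]."""
    return [len(list(run)) for _, run in groupby(column_names)]


def total_columns(mix):
    """Column count implied by a mix spec; a 0 entry is assumed to take one column."""
    return sum(max(size, 1) for size in mix)


# Hypothetical header: two 10-bin nodes, one 8-bin node and one continuous node.
header = ["H3K4me3"] * 10 + ["H3K27me3"] * 10 + ["DNAme"] * 8 + ["Expression"]
mix = mix_from_header(header)

print(" ".join(map(str, mix)))      # 10 10 8 1
print(total_columns(mix))           # 29
print(total_columns(mix + [0, 0]))  # 31
```

Note that under this rule two adjacent nodes with identical names would be merged into one, so distinct nodes need distinct headers.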
5. Updates
version 1.01
1. Fixed the file "data/test_matrix.txt".
2. Changed the file "exeABCD.pl" to a link to avoid a bad library reference.
3. Changed the usage of "exeABCD.pl" to "perl exeABCD.pl --input data/test_matrix.txt -SKcluster 1000".
6. Download
SeqSpider can be downloaded from this link: Download
7. FAQs
We welcome your questions. Please feel free to contact us if anything is unclear.
1) How can I reproduce the hESC BN described in the paper?
Run the command "sh test.sh"
2) How many nodes are supported by SeqSpider?
Currently, SeqSpider supports inferring BNs with fewer than 100 nodes.
3) How should I cite SeqSpider?
SeqSpider users, please cite the following paper: Liu, Y., N. Qiao, et al. (2013). "A novel Bayesian network inference algorithm for integrative analysis of heterogeneous deep sequencing data." Cell Res.
4) For the *.bed files located at (/data/*.bed), what are the values in the fourth and fifth columns used for?
The fourth column in the test data is not used. The fifth column holds the raw count for the given region and is used as the signal at TSS sites. Please check the script
"scan_bed_for_TSS.pl" in the tools directory for more information. Normalization is handled in the later learning process.
5) What does SeqSpider output?
SeqSpider outputs 2 files. The first is a "SIF" file named "*_overlap.sif", which contains the edge information and can be opened in a text editor or Excel, or imported into
Cytoscape for network visualization. The second is a ROC file named "*Roc_D.txt", which contains the ROC curve for evaluating network stability and can be opened in
Excel.
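For readers who prefer to inspect the SIF file programmatically, the Python sketch below separates directed from undirected edges using the "reg" and "noreg" labels explained in FAQ 6. It assumes the usual Cytoscape SIF layout of one "source label target" triple per line and assumes that a "reg" edge points from the first node to the last; the node names and the function `read_sif_edges` are made up for illustration.

```python
def read_sif_edges(lines):
    """Split SIF lines into directed ("reg") and undirected ("noreg") edges."""
    directed, undirected = [], []
    for line in lines:
        # Whitespace splitting also tolerates a trailing space after the label.
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank lines and lines naming a single node
        source, label, targets = fields[0], fields[1], fields[2:]
        for target in targets:
            if label == "reg":
                # Assumed orientation: source -> target.
                directed.append((source, target))
            elif label == "noreg":
                # Store undirected edges in sorted order so (a, b) equals (b, a).
                undirected.append(tuple(sorted((source, target))))
    return directed, undirected


sample = [
    "H3K4me3\treg\tExpression\n",
    "H3K27me3\tnoreg\tDNAme\n",
]
directed, undirected = read_sif_edges(sample)
print(directed)
print(undirected)
```

If node names in your file contain spaces, split on tabs instead of on arbitrary whitespace.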
6) What do "reg" and "noreg" mean in the "SIF" file?
"reg" labels an edge that has a direction; "noreg" labels an edge that has no direction.
7) How can I get the dashed edges in Figure S20?
The method for recovering feedback edges (the dashed edges in Figure S20) is presented in the Suppl. text, page 9. Because that program is not yet user-friendly, we have not
wrapped it in the Perl scripts.