(mtxassemble.pl: last updated on February 1, 2016)
(README: last updated on February 25, 2016)

####################################################
TESTED:
Linux
MacOS X
Windows 7 (a Perl interpreter, such as Strawberry Perl (www.strawberryperl.com), must already be installed)
#####################################################


####################################################
Example Usages:

Help:
perl mtxassemble.pl -help

Nonoverlapping domain predictions
perl mtxassemble.pl -dir test -fasta rgs.fa -hmm Pfam-A.hmm -overlap 0 -blE 1.0 -domE 1.0

Overlapping domain predictions with 5% threshold
perl mtxassemble.pl -dir test -fasta rgs.fa -hmm Pfam-A.hmm -overlap 5 -blE 1.0 -domE 1.0

Overlapping domain predictions with no threshold
perl mtxassemble.pl -dir test -fasta rgs.fa -hmm Pfam-A.hmm -overlap 0 -blE 1.0 -domE 1.0
####################################################


####################################################
PREPARATION: (DO THIS BEFORE YOU RUN THE PIPELINE)

Working directory MUST contain the following five scripts (included in the package):
	reformat_fasta.pl
	filter_hmmer.pl
	domainseqs.pl
	filter_blast.pl
	make_matrices.pl

Create a directory and copy the input FASTA sequence file into it. ALL output files will also be saved in this directory. This directory ('test' in the examples above) is specified with the -dir option.

"reformat_fasta.pl" can be skipped (comment out line 73 in mtxassemble.pl) if the input FASTA sequence file already gives each sequence on a single line (in addition to the header line). In this case, the input file must be named "protein_reformatted.fa".
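
The single-line format described above can be illustrated with a short shell sketch. The function name one_line_fasta and the demo records are made up for illustration; this is not the code of reformat_fasta.pl, only one way to produce a file with one header line and one sequence line per protein.

```shell
#!/bin/sh
# Hypothetical helper (not part of the package): join wrapped sequence
# lines so that every record is one header line plus one sequence line.
one_line_fasta() {   # usage: one_line_fasta FILE
    awk '
        /^>/ { if (seq != "") print seq; print; seq = ""; next }
             { gsub(/[ \t\r]/, ""); seq = seq $0 }
        END  { if (seq != "") print seq }
    ' "$1"
}

# Demo on a small made-up input file.
tmp=$(mktemp)
printf '>seq1\nMKV\nLAA\n>seq2\nGGH\n' > "$tmp"
one_line_fasta "$tmp"
rm -f "$tmp"
```

Redirecting the output of such a helper to a file named protein_reformatted.fa would give a file in the layout the pipeline expects when the reformatting step is skipped.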
	
hmmscan from HMMER3 version 3.1 or newer (tested with version 3.1) must be installed on the system. HMMER3 is available from http://hmmer.org. hmmscan is used to search a domain profile database such as Pfam. The Pfam profile database (e.g. Pfam-A.hmm) can be obtained from http://pfam.xfam.org. As an example, the Pfam database file (release 27.0), Pfam-A.hmm, is included in this package. After the profile database file is downloaded and unzipped, it must be prepared with hmmpress before hmmscan can search it. See the HMMER User's Guide for details.
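
As a rough sketch of the hmmpress step: hmmpress writes four binary index files next to the profile file (extensions .h3m, .h3i, .h3f and .h3p), so their presence can be used to tell whether a database has already been prepared. The helper name is_pressed is hypothetical, and the file name Pfam-A.hmm is taken from the examples above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if all four hmmpress index files
# exist next to the given profile database file.
is_pressed() {   # usage: is_pressed PROFILE_FILE
    for ext in h3m h3i h3f h3p; do
        [ -f "$1.$ext" ] || return 1
    done
    return 0
}

db=Pfam-A.hmm
if is_pressed "$db"; then
    echo "$db: already pressed"
elif [ -f "$db" ] && command -v hmmpress >/dev/null 2>&1; then
    hmmpress "$db"    # creates $db.h3m, $db.h3i, $db.h3f, $db.h3p
else
    echo "$db: not pressed (profile file or hmmpress not found)"
fi
```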

The BLAST programs makeblastdb and blastp must be installed on the system. The BLAST program suite can be downloaded from http://blast.ncbi.nlm.nih.gov/

For both hmmscan and the BLAST programs, if the executables are not in the executable search path ($PATH environment variable), the paths to the HMMER and BLAST executables can be specified as $HMMDIR and $BLASTDIR in mtxassemble.pl (see lines 18-19).
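
Before running the pipeline, a quick check like the following can show whether the three external programs are reachable through $PATH. The function name missing_tools is made up for this sketch; it relies only on the standard shell built-in "command -v".

```shell
#!/bin/sh
# Hypothetical helper: print every named program that cannot be found
# on $PATH and return non-zero if at least one is missing.
missing_tools() {   # usage: missing_tools PROGRAM...
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not on PATH: $tool"
            status=1
        fi
    done
    return $status
}

if missing_tools hmmscan makeblastdb blastp; then
    echo "all required programs found"
fi
```

If a program is reported as missing, either add its directory to $PATH or set $HMMDIR and $BLASTDIR in mtxassemble.pl as described above.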

####################################################


####################################################
INPUTS and OPTIONS:
	-dir		the directory where the input FASTA file is located and where the output files are to be directed
	-fasta		the FASTA file containing the protein sequences
	-hmm		the pHMM to be searched by hmmscan (should already be prepared by hmmpress)
	-overlap	the overlap threshold for domain predictions (percentage)
	-blE		the BLAST E-value threshold
	-domE		the HMMER E-value threshold


####################################################
OUTPUT FILES: all the files mentioned below will be written to the output directory chosen by the "-dir" option

SUBDIRECTORY - "Preprocess"
1. protein_reformatted.fa
--------------------------
Reformatted FASTA file where each protein takes up two lines, a header line and a sequence line

2. hmmer_out.txt
--------------------------
Output (full) of HMMER3's hmmscan program

3. hmmer_out_domain.txt
--------------------------
Output (domain table) of HMMER3's hmmscan program

4. hmmer_out_domain_filtered.txt
--------------------------
Filtered output of the hmmscan program, filtered according to the desired HMMER E-value threshold (filters the i-Evalue)

5. domain_seqs.fa
--------------------------
FASTA file containing the entire set of domain sequences found on the proteins. Domain ID is added to the protein ID in each sequence header

6. blast_report.tab
--------------------------
BLAST report with domain sequence file as query and protein sequence file as database

7. blast_report.tabf
--------------------------
Filtered BLAST report containing only the top hit of each domain sequence against each protein
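
The two filtering steps (files 4 and 7) can be sketched in a few lines of awk. This is not the code of filter_hmmer.pl or filter_blast.pl. It assumes that hmmer_out_domain.txt uses hmmscan's per-domain table layout, where the i-Evalue is column 13, and that blast_report.tab is tab-delimited with the query in column 1 and the subject in column 2, with the best hit of each pair listed first. The function names and demo rows are made up. Dropping the "#" comment lines is a choice of this sketch.

```shell
#!/bin/sh
# Keep domain-table rows whose i-Evalue (column 13) is at or below the
# threshold. Comment lines starting with "#" are dropped.
filter_domtbl() {   # usage: filter_domtbl FILE THRESHOLD
    awk -v thr="$2" '!/^#/ && ($13 + 0) <= (thr + 0)' "$1"
}

# Keep only the first row seen for each (query, subject) pair.
top_hit_per_pair() {   # usage: top_hit_per_pair FILE
    awk '!seen[$1 SUBSEP $2]++' "$1"
}

# Demo with made-up rows.
dir=$(mktemp -d)
printf '%s\n' \
  '# made-up domain table rows for illustration' \
  'RGS PF00615.2 118 protA - 300 1e-30 100.0 0.1 1 1 1e-33 2e-30 99.0 0.1 1 118 50 167 48 170 0.98 Regulator' \
  'PDZ PF00595.1 82 protA - 300 0.5 10.0 0.0 1 1 0.2 3.5 8.0 0.0 1 80 200 270 198 275 0.70 PDZ' \
  > "$dir/domains.txt"
filter_domtbl "$dir/domains.txt" 1.0     # keeps only the RGS row

printf 'dom1\tprotA\t100.0\t118\t0\t0\t1\t118\t50\t167\t1e-60\t200\n'  > "$dir/blast.tab"
printf 'dom1\tprotA\t40.0\t60\t30\t2\t5\t64\t210\t270\t0.01\t30\n'    >> "$dir/blast.tab"
printf 'dom1\tprotB\t55.0\t110\t45\t3\t1\t110\t20\t130\t1e-20\t90\n'  >> "$dir/blast.tab"
top_hit_per_pair "$dir/blast.tab"        # keeps the first and third rows
rm -rf "$dir"
```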


SUBDIRECTORY - "Matrices"
1. protein_index.txt
--------------------------
Protein index file
Example: each line gives information for one protein
P1: 	 sp|Q9CQV8|1433B_MOUSE
protein 1 (P1) corresponds to the protein with identifier sp|Q9CQV8|1433B_MOUSE

2. no_domain_info.txt
--------------------------
Includes protein identifiers for the proteins that had no domain information found by HMMER (these must be eliminated from the set before solving game theory optimization problems)

3. domains_present.txt
--------------------------
Lists the identifiers of all of the domains found on the proteins

4. #_mtx.txt
--------------------------
Similarity table for protein P# 
Example: each line starts with a domain identifier followed by the E-value when this domain is searched against each protein (the ordering of the proteins is the same as given in the protein index file)
PF00615.2	1	0.02	2870	0.10
entry of 1 is always assigned to the reference column and entry 2780 means absence of the domain
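
To read these files together, a small sketch like the one below can look up a protein number in protein_index.txt and label the columns of a #_mtx.txt row with P1, P2, and so on. The function names are hypothetical, the second index line is made up, and the sketch assumes that protein identifiers contain no whitespace and that column i+1 of a matrix row belongs to protein Pi (the ordering of the index file).

```shell
#!/bin/sh
# Print the identifier stored for protein number N in the index file.
protein_id() {   # usage: protein_id INDEX_FILE N
    awk -v key="P$2:" '$1 == key { print $2; exit }' "$1"
}

# Print one line per matrix entry: domain, protein label, value.
label_mtx() {   # usage: label_mtx MTX_FILE
    awk '{ for (i = 2; i <= NF; i++) print $1, "P" (i - 1), $i }' "$1"
}

# Demo using the sample lines from this README plus one made-up index line.
dir=$(mktemp -d)
printf 'P1: \t sp|Q9CQV8|1433B_MOUSE\nP2: \t example_protein_2\n' > "$dir/protein_index.txt"
printf 'PF00615.2\t1\t0.02\t2870\t0.10\n' > "$dir/1_mtx.txt"
protein_id "$dir/protein_index.txt" 1
label_mtx "$dir/1_mtx.txt"
rm -rf "$dir"
```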

####################################################
