ACC/Mean transformation
=======================

Input:

Sequences transformed with 5PC-scores or any other values. Each entry's data
must be listed on a single line (no line breaks). Each line starts with the
entry ID, followed by the given number of values for each amino acid. For 5
values, for example:

    ID aa1var1 aa1var2 aa1var3 aa1var4 aa1var5 aa2var1 aa2var2 aa2var3 ...

Output types:

Auto/crosscovariance - calculated as

    (1/n)Sigma[X(i,t) - X^(i)][X(j,t+k) - X^(j)]

where n is the sequence length (number of amino acids), i and j index the
variables (i, j = 1, ..., m), t is the amino acid position (t = 1, ..., n-k),
k is the lag size (k = 0, ..., k_max, where k_max is the maximum lag size),
and X^(i) or X^(j) is the mean of the variable X(i) or X(j), calculated as
(1/n)Sigma[X(i,t)] over t = 1, ..., n. This is based on the implementation
in R/S-plus. Note that k starts from 0.

Standardized auto/crosscovariance (auto/crosscorrelation) - calculated as
C(k)/C(0), where C(k) is the auto/crosscovariance at lag size k as shown
above, and C(0) is calculated as sqrt{Sigma[Var(i)*Var(j)]}.

Wold's ACC - calculated following the original method of Wold et al. (1993,
Anal. Chim. Acta 277:239), which is

    {1/(n - k)}Sigma[X(i,t)X(j,t+k)]

where n is the sequence length (number of amino acids), i and j index the
variables (i, j = 1, ..., m), t is the amino acid position (t = 1, ..., n-k),
and k is the lag size (k = 1, ..., k_max, where k_max is the maximum lag
size). Note that k starts from 1.

Mean - calculated as (1/n)Sigma(PCi), where n is the sequence length (number
of amino acids) and i = 1, ..., m (for m variables, 5 for PC5).

Output file format:

[PLS format] This format can be used as the input file for the PLS analysis
using R and the provided model file, PLS-ACC.Rdata. The first column contains
the entry ID, followed by the ACC values acc(i, j, k), where i, j = 1, ..., m
for m scores and k is the lag size.
The order of acc values with m = 2 and a maximum lag size of 3 (k = 1, 2, 3)
is as follows:

    acc(1,1,1), acc(1,1,2), acc(1,1,3), acc(1,2,1), acc(1,2,2), ...,
    acc(2,2,1), acc(2,2,2), acc(2,2,3)

[SVM_light format] This format can be used as the input file for SVM-light.
The format is shown below. Note that the leading '0' is not the entry name;
it identifies the type of each sequence (1 for a positive sample, -1 for a
negative sample, or 0 for an unknown sample). In this output, all sequences
are assigned '0'. The following example is for acc values with m = 2 and
k = 1, 2, 3:

    0 1:acc(1,1,1) 2:acc(1,2,1) 3:acc(2,1,1) 4:acc(2,2,1) 5:acc(1,1,2) 6:acc(1,2,2) 7:acc(2,1,2) 8:acc(2,2,2) 9:acc(1,1,3) 10:acc(1,2,3) 11:acc(2,1,3) 12:acc(2,2,3)

[TAB-delimited flat table] This is a simple table format with the sequence
ID in the first column. The order of acc values is the same as for the
SVM-light format.

Standardized variables:

This option outputs standardized variables without performing the ACC
transformation. The standardization is performed as

    [X(i,j) - X^(i)]/rms(i)

where i is the variable index, j is the amino acid position, X^(i) is the
mean of the variable X(i), and rms(i) is calculated as

    rms(i) = sqrt{Sigma[X(i,j) - X^(i)]^2/(n-1)}

[TAB-delimited 3-way table] Each amino acid is converted to its
corresponding five scores. The format is shown below. The first column is
the sequence ID, the second column is the amino acid position, and the
remaining columns are the five standardized scores.

    At1g11000.1  1   0.8672771   0.22258   -1.562412  -0.4718294  -0.1297437
    At1g11000.1  2  -0.2388994   1.011137  -0.9719389 -2.042766    1.008086
    At1g11000.1  3   0.05577879  1.442711  -1.010034  -0.1761236  -2.597505

The output uses N lines for an entry with N amino acids. This format is
convenient for use with the R/S 'sapply' function (e.g., to obtain the
column means).
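
For readers who want to check values by hand, the formulas in this document
(auto/crosscovariance, Wold's ACC, and the standardization) and the two
output orderings can be sketched in Python roughly as follows. This is only
an illustrative sketch, not the code used by the tool: all function names
and the entry ID "P1" are hypothetical, variable indices are zero-based in
the calculation functions, and the PLS ordering assumes that the elided part
of the PLS example continues with i outermost, then j, then k.

```python
import math


def parse_entry(line, m):
    """Split 'ID v1 v2 ...' into (ID, rows); each row holds the m values of one residue."""
    fields = line.split()
    entry_id = fields[0]
    values = [float(v) for v in fields[1:]]
    if not values or len(values) % m != 0:
        raise ValueError("expected a positive multiple of m values after the ID")
    return entry_id, [values[p:p + m] for p in range(0, len(values), m)]


def means(rows):
    """Per-variable mean over all n positions: (1/n)Sigma[X(i,t)]."""
    n = len(rows)
    return [sum(row[i] for row in rows) / n for i in range(len(rows[0]))]


def cross_covariance(rows, i, j, k):
    """(1/n)Sigma[X(i,t) - X^(i)][X(j,t+k) - X^(j)] for t = 1..n-k (i, j zero-based)."""
    n = len(rows)
    mu = means(rows)
    total = sum((rows[t][i] - mu[i]) * (rows[t + k][j] - mu[j]) for t in range(n - k))
    return total / n


def wold_acc(rows, i, j, k):
    """{1/(n - k)}Sigma[X(i,t)X(j,t+k)] for t = 1..n-k (i, j zero-based, k >= 1)."""
    n = len(rows)
    return sum(rows[t][i] * rows[t + k][j] for t in range(n - k)) / (n - k)


def standardize(rows):
    """[X(i,j) - X^(i)]/rms(i), with rms(i) using the (n-1) denominator.

    Assumes n > 1 and that no variable is constant (rms(i) != 0).
    """
    n = len(rows)
    mu = means(rows)
    m = len(mu)
    rms = [math.sqrt(sum((row[i] - mu[i]) ** 2 for row in rows) / (n - 1))
           for i in range(m)]
    return [[(row[i] - mu[i]) / rms[i] for i in range(m)] for row in rows]


def pls_order(m, k_max):
    """(i, j, k) labels in the assumed PLS order: k varies fastest, i slowest."""
    return [(i, j, k) for i in range(1, m + 1)
            for j in range(1, m + 1) for k in range(1, k_max + 1)]


def svm_order(m, k_max):
    """(i, j, k) labels in the SVM-light / flat-table order: j varies fastest, k slowest."""
    return [(i, j, k) for k in range(1, k_max + 1)
            for i in range(1, m + 1) for j in range(1, m + 1)]


# Small usage example: a hypothetical entry with 3 residues and m = 2 values each.
entry_id, rows = parse_entry("P1 1 2 2 4 3 9", m=2)
print(entry_id, cross_covariance(rows, 0, 1, 1), wold_acc(rows, 0, 1, 1))
print(svm_order(2, 3))
```

The sketch recomputes the means on every call to keep each function
self-contained. For long sequences or large k_max, the means would be better
computed once and passed in.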