Strope, P. K. and Moriyama, E. N. (2007)
"Simple Alignment-free Methods for Protein Classification: A Case Study from G-Protein Coupled Receptors."
Genomics 89: 602-612
— Datasets used in the study —
Each dataset is available in up to three formats:
a list of accession numbers, sequences in FASTA format, and sequences in SwissProt format (SwissProt files are not provided for the Non-GPCR datasets).
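As a quick sanity check after downloading a FASTA file, the number of records can be compared with the entry count listed for that dataset. The following is a minimal sketch only: the function name `read_fasta`, the toy sequences, and the header text are hypothetical and are not taken from the study's files.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            # Sequence lines may be wrapped; join them into one string.
            records[header] += line
    return records


# Toy input standing in for a downloaded file (hypothetical entries).
example = """>seq1 hypothetical entry
MNGTEGPNFY
VPFSNKTGVV
>seq2 hypothetical entry
MDVLSPGQGN
""".splitlines()

records = read_fasta(example)
expected_entries = 2  # for a real file, use the count listed for the dataset
print(len(records) == expected_entries)
```

For a real file, pass an open file handle instead of the toy list, for example `read_fasta(open(path))`, where `path` is wherever the download was saved.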
[Datasets used for within- & between-class tests]
See Table 2 for the dataset description.
- Class A training dataset: 200 entries (AC#, FASTA, SP)
- Class A test dataset: 200 entries (AC#, FASTA, SP)
- Non-Class A training dataset: 81 entries (AC#, FASTA, SP)
- Non-Class A test dataset: 81 entries (AC#, FASTA, SP)
- Non-GPCR training dataset: 210 entries (AC#, FASTA)
- Non-GPCR test dataset: 210 entries (AC#, FASTA)
(For the Non-GPCR datasets, the current IDs as well as the original ones are listed.
See the note below.)
[Datasets used for Class A analysis]
See Table 4 for the dataset description.
- AR1 dataset: 63 entries (AC#, FASTA, SP)
- AR2 dataset: 63 entries (AC#, FASTA, SP)
- PE1 dataset: 69 entries (AC#, FASTA, SP)
- PE2 dataset: 70 entries (AC#, FASTA, SP)
- OL1 dataset: 154 entries (AC#, FASTA, SP)
- OL2 dataset: 155 entries (AC#, FASTA, SP)
- N1 dataset: 158 entries (AC#, FASTA, SP)
- N2 dataset: 158 entries (AC#, FASTA, SP)
NOTE: The sequence data used in this study were originally obtained in 2004 from GPCRDB (for positives)
and SwissProt (for negatives).
Some sequences may have been changed in these databases since then. For the most recent version, see each database:
- GPCRDB: Information system for G protein-coupled receptors (GPCRs)
- Swiss-Prot: Protein knowledgebase