Digital processing of a nucleotide sequence requires it to be mapped to a numerical sequence in which the choice of nucleotide to numeric mapping affects how well its biological properties can be preserved and reflected from nucleotide domain to numerical domain. Digital spectral analysis of nucleotide sequences unfolds a period3 power spectral value which is more prominent in an exon sequence as compared to that of an intron sequence. The success of a period3 based exon and intron classification depends on the choice of a threshold value. The main purposes of this article are to introduce novel codes for 1sequence numerical representations for spectral analysis and compare them to existing codes to determine appropriate representation, and to introduce novel thresholding methods for more accurate period3 based exon and intron classification of an unknown sequence. The main findings of this study are summarized as follows: Among sixteen 1sequence numerical representations, the KQuaternary Code I offers an attractive performance. A windowed 1sequence numerical representation (with window length of 9, 15, and 24 bases) offers a possible speed gain over nonwindowed 4sequence Voss representation which increases as sequence length increases. A winner threshold value (chosen from the best among two defined threshold values and one other threshold value) offers a top precision for classifying an unknown sequence of specified fixed lengths. An interpolated winner threshold value applicable to an unknown and arbitrary length sequence can be estimated from the winner threshold values of fixed length sequences with a comparable performance. In general, precision increases as sequence length increases. The study contributes an effective spectral analysis of nucleotide sequences to better reveal embedded properties, and has potential applications in improved genome annotation.
Keywords
DNA sequencenumerical representationnucleotide to numeric mappingexon and intron sequencescoding and noncoding sequencesthreshold valuethresholdingexon and intron classificationperiod3spectral analysisdiscrete Fourier transformgene detectiongenome annotation
1. Introduction
It is known that the coding regions of a nucleotide sequence exhibit a period3 spectral property because of the presence of codon structure [1], and this property can be used to identify regions of interest. A nucleotide sequence is a discrete sequence consisting of nucleotides C, G, A, and T in which digital spectral analysis can be used to reveal its hidden periodicities, spectral features, and genome structure. Digital spectral analysis is usually carried out by the discrete Fourier transform (DFT) which is a digital signal processing (DSP) technique. Genomic signal processing (GSP) involves the processing and analysis of a digital signal in the form of a numerical sequence mapped from a nucleotide sequence [1, 2]. A general description on the biological aspects of this article can be found in [1, 2]. To perform digital spectral analysis, each nucleotide of a nucleotide sequence has to be converted to a numerical value through a mapping. Such a mapping is called numerical representation and its choice affects how well the biological properties of a nucleotide sequence can be revealed in numerical domain. A nucleotide sequence can be numerically represented in the form of Rsequence, a survey on mappings with R ≥ 1 is given in [3, 4]. 1sequence numerical representation (with R = 1) is the most compact form of mapping in which one nucleotide is mapped to one fixed numerical value to form a single sequence. 1sequence numerical representations and their relative performances are studied in this article for unknown exon and intron sequence classification with an improved accuracy.
In this article, nine 1sequence numerical representations (Codes 19) [5–16] are identified though a literature search, which can be grouped under positiveintegervalue, realvalue, and complexvalue numerical representations. In addition, seven 1sequence complexvalue numerical representations (Codes 1016) are derived. In this article, all these sixteen numerical representations and the Voss representation [17] are compared based on the genomes of twelve organisms (including the human).
Genome annotation involves a process of identifying the locations of the coding regions (and associated genes) in a genome. Coding regions exhibit the period3 property in the spectral domain which is less apparent in sequences other than exon sequences, and can therefore be used to detect exon sequences and to distinguish exon regions from intron regions in genomic sequences. The performance depends on an appropriate choice of a period3 threshold value to classify between an exon sequence and an intron sequence. In [17, 18], the cumulative distributions of coding and noncoding sequences are used to determine a period3 threshold value. In this article, two threshold value determination methods are defined. Given a set of exon and intron sequences of a fixed length, all the three thresholding values are used to determine the winner threshold value applicable to any nucleotide sequence of the same length. Also, given a number of exon and intron sequence sets of different lengths, the winner threshold values can be interpolated to yield an interpolated winner threshold value applicable to an unknown nucleotide sequence of an arbitrary length. Some commonly used symbols (or abbreviations) are explained in Table 1.
Table 1
List of symbols
Symbol
Description
Symbol
Description
C
Cytosine
G
Guanine
A
Adenine
T
Thymine
DNA
Deoxyribonucleic acid
OG
Organism
DSP
Digital signal processing
GSP
Genomic signal processing
DFT
Discrete Fourier transform
FFT
Fast Fourier transform
SL
Sequence length (bases)
WL
Window length (bases)
WRS
Window rightshift (bases)
NW
Number of windows
CDP3
Cumulative distribution period3
CCDP3
Complementary CDP3
PSV
Power spectral value
P_{3}
Period3 power spectral value
T_{
m
}
Mid threshold value
T_{
p
}
Proportional threshold value
T_{
c
}
Cumulative distribution threshold value
T_{
4
}
Fixed threshold value (= 4.0)
T_{
w
}
Winner threshold value
T_{
i
}
Interpolated winner threshold value
meanP_{
3e
}
Mean of the period3 values obtained from specified exon sequences
sdP_{
3e
}
Standard deviation of the period3 values obtained from specified exon sequences
meanP_{
3i
}
Mean of the period3 values obtained from specified intron sequences
sdP_{
3i
}
Standard deviation of the period3 values obtained from specified intron sequences
Note: T_{
m
}(P_{3}), T_{
p
}(P_{3}), T_{
c
}(P_{3}), T_{
w
}(P_{3}), T_{
i
}(P_{3}) are threshold values computed from P_{3}.
2. Methods and results
2.1. Numerical representation
There are a number of numerical representations for nucleotide sequences. In this article, we shall focus on direct and simple numerical representations which satisfy the following requirements: (a) Single sequence mapping for a nucleotide sequence; (b) Fixed value mapping for each nucleotide; and (c) Accessible to digital signal processing analysis. Sixteen 1sequence numerical representations (Codes 116) which satisfy these three requirements are defined in Table 2.
Table 2
List of sixteen numerical representation codes
Name
Code
C
G
A
T
1
Integer Number
1
3
2
0
2
Single Galois Indicator
1
3
0
2
3
Paired Nucleotide Atomic Number
42
62
62
42
4
Atomic Number
58
78
70
66
5
Molecular Mass
110
150
134
125
6
EIIP
0.1340
0.0806
0.1260
0.1335
7
Paired Numeric
1
1
1
1
8
Real Number
0.5
0.5
1.5
1.5
9
Complex Number
1j
1+j
1+j
1j
10
KTwinPair Code
1
1
j
j
11
KBipolarPair Code I
1
1
j
j
12
KBipolarPair Code II
1
1
j
j
13
KQuaternary Code I
1
j
1
j
14
KQuaternary Code II
1
j
j
1
15
KQuaternary Code III
j
1
1
j
16
KQuaternary Code IV
j
1
j
1
It is known that exons are rich in nucleotides C and G and introns are rich in nucleotides A and T [8, 19]. For ease of viewing, CG and AT are paired when defining each of the sixteen numerical representations. The first group of numerical representations consists of five positiveintegervalue numerical representations (as Codes 15 in Table 2) which shall be described as follows: The Integer Number representation [5, 6] is obtained by mapping numerals {1, 3, 2, 0} respectively to the four nucleotides as C = 1, G = 3, A = 2, and T = 0. The Single Galois Indicator representation [7, 8] maps the CGAT nucleotides to a Galois field of four, GF(4), which is formed by assigning numerical values to the nucleotides as C = 1, G = 3, A = 0, and T = 2 in a nucleotide sequence. This representation suggests that C < G and A < T. In the Paired Nucleotide Atomic Number representation [9], the paired nucleotides are assigned with atomic numbers as A, G = 62 and C, T = 42 respectively. In the Atomic Number representation [9], a numerical sequence is formed by assigning atomic numbers to each nucleotide as C = 58, G = 78, A = 70, and T = 66 in a nucleotide sequence. The Molecular Mass representation [10] of a nucleotide sequence is formed by mapping the four nucleotides to their molecular masses as C = 110, G = 150, A = 134, and T = 125, respectively.
The second group of numerical representations consists of three realvalue numerical representations (as Codes 68 in Table 2) which can be described as: The ElectronIon Interaction Pseudopotential (EIIP) represents the distribution of the free electrons energies along a nucleotide sequence. In the EIIP representation, a single EIIP indicator sequence [11, 12] is formed by substituting the EIIP of the nucleotides as C = 0.1340, G = 0.0806, A = 0.1260, and T = 0.1335 in a nucleotide sequence. In the Paired Numeric representation [6, 8], nucleotides (CG, AT) are to be paired in a complementary manner and values of 1 and +1 are to be used to denote, respectively, CG and AT nucleotide pairs. In the Real Number representation [6, 8, 13], the nucleotide mappings are C = 0.5, G = 0.5, A = 1.5, and T = 1.5, which bears complementary property. The third group consists of one complexvalue numerical representation (as Code 9 in Table 2) called the Complex Number representation [2, 5, 6, 8, 14–16], it reflects the complementary nature of CG and AT pairs by mapping nucleotides as C = 1j, G = 1+j, A = 1+j, and T = 1j.
In addition to the above nine 1sequence numerical representations, seven 1sequence numerical representations (listed as Codes 1016 in Table 2) are derived in which each nucleotide of a sequence is mapped to either a single realvalue element (± 1) or a single imaginaryvalue element (± j). Each of the Codes 1016, namely, the KTwinPair Code, the KBipolarPair Codes I and II, and the KQuaternary Codes IIV has its equivalent numerical representations. We define a numerical representation R 2 to be an equivalent numerical representation of R 1 if R 2 gives rise to the same power spectrum as that of R 1. An equivalent numerical representation can be obtained by multiplying the numerical represented value of each of the four bases [C, G, A, T] of a nucleotide sequence by the same constant which includes any of 1, and ± j. In particular, a complementary numerical representation obtained by inverting the signs of all the four bases [C, G, A, T] in a numerical representation is an equivalent numerical representation. Other forms of equivalent numerical representation exist, for example, the KTwinPair Code [C, G, A, T] = [1, 1, j, j] has an equivalent form [C, G, A, T] = [j, j, 1, 1]. The KQuaternary Code III (Code 15) [C, G, A, T] = [j, 1, 1, j] is identical to the pentanary code in [20] (with [20] specifies that x[n] = 0 for an unknown nucleotide at position n) which was derived for the threedimensional DNA walk for graphical representation of nucleotide sequences. The KBipolarPair Code II (Code 12) [C, G, A, T] = [1, 1, j, j] has an equivalent numerical representation of [C,G,A,T] = [j, j, 1, 1] adopted in [21]. The KQuaternary Code IV (Code 16) [C, G, A, T] = [j, 1, j, 1] has a complementary numerical representation of [C, G, A, T] = [j, 1, j, 1] mentioned in [22]. In [23], complex numerical representations with simultaneous nonzero real part and nonzero imaginary part were adopted which are different from those of the Codes 1016 in which either a real value or an imaginary value is used for each nucleotide to numeric mapping. It should be noted that the numerical representations in [21–23] were formulated for sequence alignment but not for period3 power spectral analysis.
The Voss representation (i.e., the Code 17) is a commonly used numerical representation for period3 spectral classification of exon and intron sequences [1, 2, 17, 24, 25]. The Voss representation is a 4sequence representation (R = 4) in which each of the four nucleotides of a nucleotide sequence is represented by a separate numerical sequence (as Csequence, Gsequence, Asequence, and Tsequence) such that the n^{th} position of the Csequence is coded by 1 if the n^{th} nucleotide of the sequence is C, otherwise it is coded by 0. In a similar manner, the coding procedure just described applies to the remaining three numerical sequences. A threshold value can be determined from the cumulative distributions of the signaltonoise ratio of the peak at f = 1/3 for each set of exon and intron sequences [17]. Using the Voss representation, two known threshold values (T_{
c
}, T_{4}) are adopted for comparison in this article. The T_{
c
} thresholding is determined from the exon and intron cumulative distribution functions [17] and the T_{4} thresholding is set to a fixed value four [17]. The Codes 116 are to be compared to the Voss representation (Code 17) in Section 2.4.
2.2. Spectral analysis
The purpose of the spectral analysis of a numerically represented nucleotide sequence is to compute its period3 spectral component located at a frequency equal to 2π/3 in the DFT spectrum. Given a numerical sequence, x[l] for l = 1 to L, its finitelength DFT sequence, X[k] for k = 1 to L, and its inverse DFT, x[l] for l = 1 to L, are defined by
In this article, a rectangular windowing of length N bases is used and each consecutive window is rightshifted by three bases with an overlap window length of N6 bases between two adjacent windows. The rectangular windowing is adopted to avoid distortion on the cumulative DFT spectrum. We define the mean cumulative DFT spectrum, X_{
T
} [k], of all the windowed sequences, X_{
m
} [k] for m = 1 to M, as
The period3 power spectral value (or period3 value) P_{3} can be obtained from the normalized power spectrum of a numerically represented sequence at k = N/3+1 as
${P}_{3}={S}_{n}\left[\frac{N}{3}+1\right]$
(7)
In all computer simulations, for a window length of WL equals to a sequence length of SL, one complete DFT analysis is adopted. For WL less than SL, each consecutive window is rightshifted by three bases with an overlap window length of "WL6" bases between two adjacent windows. The period3 values P_{3} computed by Equation 7 for 4583 exon sequences and that for 4583 intron sequences (both of sequence length 1250 bases, window length 24 bases, and window rightshift 3 bases) obtained by the KQuaternary Code I are plotted in Figures 1 and 2. From these figures, it can be observed that the period3 values of exon sequences are higher than those of the intron sequences. This is an important property which will be used to classify exon sequences from intron sequences.
The log_{10} power spectrum S_{
n
} [k] for 1 ≤ k ≤ N of an exon sequence and that of an intron sequence (both of sequence length 1250 bases, window length 24 bases, and window rightshift 3 bases) obtained by the KQuaternary Code I are plotted in Figures 3 and 4. As shown in Figure 3 (or Figure 4), the exon (or intron) period3 value at 2π/3 defined by Equation 7 has in fact been able to reveal a prominent power spectral value; also, the exon period3 value is higher than the intron period3 value. The DC power spectral value is usually the highest. Besides period3 and DC power spectral value, power spectral values of other periodicities are clearly shown in Figures 3 and 4.
2.3. Winner threshold value
In this section, the statistics of the period3 values computed from a training set of exon sequences and intron sequences are used to define the threshold values of two novel thresholding methods to classify an untrained sequence to be either an exon sequence or an intron sequence. As explained in Table 1, meanP_{
3e
} and sdP_{
3e
} represent, respectively, the mean and standard deviation of the period3 values obtained from the exon sequences of a training set; and meanP_{
3i
} and sdP_{
3i
} represent, respectively, the mean and standard deviation of the period3 values obtained from the intron sequences of the same training set. We define two threshold values called the mid threshold value, T_{
m
} , and the proportional threshold value, T_{
p
} , as
In both definitions, the exon cluster is centred at meanP_{
3e
} and the intron cluster is centred at meanP_{
3i
} . In Equation 8, the mid threshold value is determined by the midpoint between the exon cluster and the intron cluster, whereas in Equation 9, the proportional threshold value is determined in proportion to the standard deviations of the two clusters.
Besides the above two methods for determining a threshold value, the crossover point of the cumulative distribution of all the exon period3 values, F(P_{3e}), and the complementary cumulative distribution of all the intron period3 values, F_{
c
} (P_{3i}), of a set of exon and intron training sequences can be used to determine a threshold value. We define such a cumulative distribution threshold value, T_{
c
} , as
Figure 5 shows a plot of F(P_{3e}) and F_{
c
} (P_{3i}) versus corresponding exon/intron period3 power spectral value. Each of the three threshold values T_{
m
}, T_{
p
} , and T_{
c
} can be used to classify an unknown nucleotide sequence. For illustration, let T_{
t
} be any of T_{
m
}, T_{
p
} , and T_{
c
} , if an unknown nucleotide sequence has a computed period3 power spectral value greater than or equal T_{
t
} , the unknown nucleotide sequence is classified as an exon sequence; otherwise it is classified as an intron sequence.
The performance is measured objectively in terms of the exon classification (%) which is the percentage of correct classification of exon sequences, the intron classification (%) which is the percentage of correct classification of intron sequences, and the precision (%). These three terms are defined in Equations 1113 as
We define a winner threshold value, T_{
w
} , as the threshold value chosen among T_{
m
}, T_{
p
} , and T_{
c
} that yields the top classification (or the highest precision).
2.4. Classification and speed performances of codes 117
In this article, all the exon and intron sequences of the human and eleven model organisms were downloaded from the UCSC Genomes [26–29] as inputs to be processed by Matlab programs.
2.4.1. Classification performance of short sequences
The short sequence group I listed in Table 3 downloaded from the UCSC Human genome [26–29] consists of the thirteen short sequence sets ranging from 50 bases to 650 bases (at intervals of 50 bases) as [50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650]. For each of the thirteen short sequence sets, the training sequence numbers are 1 to 10005 and the testing sequence numbers are 10006 to 13340 with a train/test ratio of three for both exon and intron sequences within a set. Each of the thirteen sequence sets is trained and tested using a combination of sixteen numerical representations, four window lengths, and four threshold values (T_{
m
}, T_{
p
}, T_{
c
} , and T_{
4
} ). Figures 6(ab) and 7 (ac) plots the top classifications obtained using each of the thirteen untrained subsets of 3335 exon sequences and 3335 intron sequences of the short sequence group I. Figure 6 (a, top) displays a total of 13 × 4 precision values obtained using the thirteen sequence sets in thirteen combined columns with each combined column consists of four subcolumns of precision values. Each of the four subcolumn precision values shown in Figure 6 (a, top) corresponds to a top classification with its code index, window length index, threshold index, and threshold value displayed, respectively, in Figures 6 (b, bottom) and Figure 7 (ac). In each of the five subplots in Figures 6 (ab) and 7 (ac), the first subcolumn corresponds to the top classification obtained by the top performer among the Codes 1 to 16, the second subcolumn corresponds to the top classification obtained by the Code 10, the third subcolumn corresponds to the top classification obtained by the Code 13, and the fourth subcolumn corresponds to the top classification obtained by the Code 17. The Codes 10 and 13 are found to be among the top performers and therefore they are displayed in two separate subcolumns. As the Codes 116 are to be compared to the Code 17 (described in Section 2.1), therefore the Code 17 is also displayed as one subcolumn.
Table 3
UCSC Human genome consisting of 2 short sequence groups.
The classification performance shown in Figures 6(ab) and 7 (ac) and summarized in Table 4 indicate that for the thirteen short sequence sets: (a) The top performer for 8 of these thirteen sequence sets is the Code 13, followed by the Code 17 as the top performer for the remaining 5 sequence sets (with the Code 13 ranks 2^{nd} in 4 cases and the Code 10 ranks 2^{nd} in 1 case, each of these 5 cases is also shown in Table 4 on the row following that of the top performer). (b) The window length of 9 bases yields 7 top classifications; the window length of 15 bases yields 2 top classifications; and for the window length equals to the sequence length, 4 top classifications are obtained. (c) The T_{
c
} thresholding yields 8 top classifications; followed by the T_{
p
} thresholding which yields 5 top classifications. In general, top classifications can often be achieved by the Code 13, using WL = 9 bases and T_{
p
} thresholding for SL = 50 to 450 bases; and using WL = 9 bases and T_{
c
} thresholding for SL = 500 to 650 bases.
Table 4
Code (Figure 6b, bottom), WL (bases) (Figure 7a, top), threshold method (Figure 7b, middle), and precision (%) (Figure 6a, top) of top classifications of 13 short sequence sets
SL
Code
WL
Threshold
Precision
50
17
9
T_{
c
}
67.5412
50
13
9
T_{
p
}

100
13
9
T_{
p
}
73.2234
150
17
150
T_{
c
}
76.8066
150
10
150
T_{
m
}

200
17
200
T_{
c
}
77.8411
200
13
9
T_{
c
}

250
13
9
T_{
p
}
79.1004
300
17
300
T_{
c
}
83.8231
300
13
300
T_{
p
}

350
13
9
T_{
p
}
84.1829
400
13
9
T_{
p
}
84.2579
450
13
15
T_{
p
}
84.3628
500
13
9
T_{
c
}
86.2519
550
17
550
T_{
c
}
89.1604
550
13
9
T_{
c
}

600
13
9
T_{
c
}
89.7151
650
13
15
T_{
c
}
91.2744
2.4.2. Classification performance of long sequences
The long sequence group I listed in Table 5 downloaded from the UCSC Human genome [26–29] consists of the thirteen long sequence sets ranging from 650 bases to 1250 bases (at intervals of 50 bases) as [650, 700, 750, 800, 850, 900, 950, 1000, 1050, 1100, 1150, 1200, 1250]. For each of the thirteen long sequence sets, the training sequence numbers are 1 to 3437 and the testing sequence numbers are 3438 to 4583 with a train/test ratio of three for both exon and intron sequences within a set. Each of the thirteen sequence sets is trained and tested using a combination of sixteen numerical representations, four window lengths, and four threshold values (T_{
m
}, T_{
p
}, T_{
c
} , and T_{
4
} ). The top classification results of Codes 116 are compared to those obtained by the Code 17 [17] (as described in Section 2.1). Figures 8(ab) and 9 (ac) plots the top classifications obtained using each of the thirteen untrained subsets consisting of 1146 exon sequences and 1146 intron sequences of the long sequence group I. The notations and explanations of Figures 6(ab) and 7 (ac) regarding the short sequence set I in Table 3 apply to Figures 8(ab) and 9 (ac) which display the corresponding results of the long sequence set I in Table 5.
Table 5
UCSC Human genome consisting of 2 long sequence groups
The classification performance shown in Figures 8(ab) and 9 (ac) and summarized in Table 6 indicates that for the thirteen long sequence sets: (a) The top performer for the thirteen sequence sets is the Code 13, except for SL = 1100 bases in which the Code 1 (with WL = 24 bases and T_{
4
} thresholding) is marginally higher than that of the Code 13 (with WL = 9 bases and T_{
p
} thresholding as shown in Table 6 on the row following that of the top performer). (b) The window length of 9 bases yields 5 top classifications; the window length of 15 bases yields 4 top classifications; and for window length equals to 24 bases, 3 top classifications are obtained. (c) The T_{
p
} thresholding yields 6 top classifications; followed by the T_{
4
} thresholding which yields 4 top classifications; and the T_{
c
} thresholding which yields 3 top classifications. In general, top classifications are achievable often using a combination of the Code 13, WL = 9 bases, and T_{
p
} thresholding.
Table 6
Code (Figure 8b, bottom), WL (bases) (Figure 9a, top), threshold method (Figure 9b, middle), and precision (%) (Figure 8a, top) of top classifications of 13 long sequence sets
SL
Code
WL
Threshold
Precision
650
13
15
T_{
c
}
90.1396
700
13
9
T_{
c
}
90.4014
750
13
24
T_{
4
}
91.2304
800
13
15
T_{
4
}
92.8447
850
13
15
T_{
4
}
93.2810
900
13
9
T_{
p
}
92.4084
950
13
9
T_{
p
}
92.4520
1000
13
24
T_{
p
}
91.7103
1050
13
15
T_{
p
}
92.1902
1100
1
24
T_{
4
}
93.5428
1100
13
9
T_{
p
}

1150
13
9
T_{
p
}
94.6771
1200
13
1200
T_{
c
}
94.9825
1250
13
9
T_{
p
}
94.9825
2.4.3. Speed performance
In this article, the DFTbased spectral analysis of a numerical sequence is computed using the FFT. In general, the computational complexity of a 4sequence numerical representation (such as the Voss representation [17]) is always four times that of a 1sequence numerical representation if either windowing or nonwindowing is used in the comparison. The FFT computational complexity for a numerical sequence of length L is L × log_{2}(L) and for a windowed sequence of WL = N is N × log_{2}(N). As each consecutive window is rightshifted by three bases, the number of windows NW = 1 + fix((LN)/3) where fix(X) rounds the elements of X to the nearest integer. For a nonwindowed 4sequence numerical representation of SL = L, the computational complexity is 4L × log_{2}(L) whereas for a windowed 1sequence numerical representation of SL = L, the computational complexity is NW × N × log_{2}(N). The ratio of 4L × log_{2}(L) over NW × N × log_{2}(N) gives a speed improvement ratio. A plot of the speed improvement ratio against SL for WL = [9,15,24] bases is shown in Figure 10. As seen from Figure 10, it can be observed that a windowed 1sequence representation with WL = [9,15,24] bases offers a possible speed advantage over a nonwindowed 4sequence representation and the speed improvements increases as SL increases. Also shown in Figure 10, for WL = SL, N = L, the speed improvement ratio is four.
2.5. Interpolated winner threshold value
With the use of the winner threshold value described in Section 2.3, a top precision for an unknown nucleotide sequence of length equals to any of the thirteen long or short sequence sets can be obtained. However, if the length of an unknown nucleotide sequence is neither any of the thirteen short sequence sets nor any of the thirteen long sequence set, an interpolated winner threshold value can be determined as described in this section. For illustration, let us consider the KQuaternary Code I but the interpolated winner threshold methodology also applies to other numerical representations. For a set of exon and intron training sequences, the winner threshold value, T_{
w
} (P_{3}), is first chosen among the three threshold values, T_{
m
} (P_{3}), T_{
p
} (P_{3}), and T_{
c
} (P_{3}) that yields a top precision during testing. For each window length, the winner threshold value, T_{
w
} (P_{3}), versus sequence length of the set of thirteen short sequences with SL = 50 to 650 bases is plotted in Figure 11 and similarly in Figure 12 for the set of thirteen long sequences with SL = 650 to 1250 bases. From Figures 11 and 12, given an unknown nucleotide sequence of arbitrary length (within the length of trained sequences), an interpolated winner threshold value, T_{
i
} (P_{3}), at any of the four specified window lengths can be obtained. Using an interpolated winner threshold value, T_{
i
} (P_{3}), if the unknown nucleotide sequence has a period3 power spectral value greater than or equal T_{
i
} (P_{3}), the unknown nucleotide sequence is classified as an exon sequence; otherwise it is classified as an intron sequence.
2.5.1. Short sequences
For each WL, an interpolated threshold value for an arbitrary sequence length can be estimated using cubic spline interpolation from the corresponding discrete winner threshold values shown in Figure 11. For each of the four WL, the interpolated threshold values and the corresponding precisions obtained on testing each of the twelve untrained sequence sets (under the short sequence group II of Table 3) of sequence lengths [90, 120, 180, 210, 270, 330, 390, 420, 480, 510, 570, 630] bases, are displayed in Figures 13 and 14 using thin red bars. For comparison, the winner threshold values and the corresponding precisions of the thirteen short sequence sets for each of the four WL are displayed in Figures 13 and 14 using thick blue bars. From Figures 13 and 14, it can be observed that the interpolated winner threshold values and their precisions perform well and follow the general trends.
2.5.2. Long sequences
In a similar manner to the short sequence sets, for each WL, an interpolated winner threshold value for an arbitrary sequence length can be determined using cubic spline interpolation from the corresponding discrete winner threshold values of the thirteen long sequence sets shown in Figure 12. For each of the four WL, twelve untrained sequence sets (under the long sequence group II of Table 5) of sequence lengths [690, 720, 780, 810, 870, 930, 990, 1020, 1080, 1110, 1170, 1230] bases are tested, the interpolated winner threshold values and the corresponding precisions obtained are displayed in Figures 15 and 16 using thin red bars. For comparison, the winner threshold values and the corresponding precisions for each of the four WL of the thirteen long sequence sets are displayed in Figures 15 and 16 using thick blue bars. From Figures 15 and 16, it can be observed that the interpolated winner threshold values and their precisions work well and follow the general trends.
2.6. Classification performances of 12 organisms
To demonstrate the efficiency of the numerical representations and thresholding methods when applied to other model organisms, the UCSC genomes [26–29] of twelve organisms (including the human) listed in Table 7 are tested. Each of the genomes of the twelve organisms is trained and tested with identical SL of 300 bases, four window lengths, and four threshold values, the results are compared based on the best of the Codes 116 (dark blue), the Code 10 (light blue), the Code 13 (yellow), and the Code 17 (red) [17] (as described in Section 2.1) as shown in Figures 1718. For the twelve organisms, Figures 17(ab) plot precision (a, top) and the corresponding code index (b, bottom); and Figure 18(ac) plot window length index (a, top), threshold index (b, middle), and threshold value (c, bottom) of the top classifications obtained. The results summarized in Table 8 indicate that top classifications are obtained by the Code 13 in seven organisms; followed by the Code 17 with four top classifications (with the Code 13 ranks 2^{nd} in 2 cases and the Code 10 ranks 2^{nd} in 1 case, each of these 3 cases is shown in Table 8 on the row following that of the top performer); and the Code 6 with one top classification. The results also indicate that the Code 13 with (a) WL = 24 bases and T_{
p
} thresholding is the top choice for organisms 1, 3, 4; (b) WL = 24 bases and T_{
4
} thresholding is the top choice for organism 7; (c) WL = 15 bases and T_{
c
} thresholding is the top choice for organisms 11 and 12; and (d) WL = 300 bases and T_{
m
} thresholding is the top choice for organism 2.
Table 7
UCSC genome of 12 organisms
OG
Clade
Genome
Assembly
Track
Table
Type
Number
1
Mammal
Human
Feb. 2009 (GRCh37/hg19)
UCSC Genes
knownGene
Exon
Intron
35685
532489
2
Mammal
Mouse
July 2007 (NCBI37/mm9)
UCSC Genes
knownGene
Exon
Intron
27452
43041
3
Mammal
Pig
Nov. 2009 (SGSC Sscrofa9.2/susScr2)
Ensembl Genes
ensGene
Exon
Intron
11499
113946
4
Vertebrate
Chicken
May 2006 (WUGSC 2.1/galGal3)
Ensembl Genes
ensGene
Exon
Intron
12045
57770
5
Vertebrate
Zebrafish
Jul. 2010 (Zv9/danRer7)
RefSeq Genes
refGene
Exon
Intron
7838
78776
6
Deuterostome
C. intestinalis
Mar. 2005 (JGI 2.1/ci2)
Ensembl Genes
ensGene
Exon
Intron
7360
86885
7
Deuterostome
S. purpuratus
Sep. 2006 (Baylor 2.1/strPur2)
Other RefSeq
xenoRefGene
Exon
Intron
4888
193451
8
Insect
D. melanogaster
Apr. 2006 (BDGP R5/dm3)
RefSeq Genes
refGene
Exon
Intron
37925
35403
9
Insect
A. mellifera
Jan. 2005 (Baylor 2.0/apiMel2)
Ensembl Genes
ensGene
Exon
Intron
21827
47605
10
Nematode
C. elegans
May 2008 (WS190/ce6)
RefSeq Genes
refGene
Exon
Intron
25360
37359
11
Nematode
C. japonica
Mar. 2008 (WUGSC 3.0.2/caeJap1)
Other RefSeq
xenoRefGene
Exon
Intron
7978
30958
12
Other
Sea hare
Sept. 2008 (Broad 2.0/aplCal1)
Other RefSeq
xenoRefGene
Exon
Intron
6914
431792
Group: Genes and Gene Prediction Tracks. SL = 300 bases.
Table 8
Code index (Figure 17b, bottom), WL (bases) (Figure 18a, top), threshold method (Figure 18b, middle), and precision (%) (Figure 17a, top) of top classifications of 12 organisms (SL = 300 bases)
OG
Code
WL
Threshold
Precision
1
13
24
T_{
p
}
80.8281
2
13
300
T_{
m
}
82.9433
3
13
24
T_{
p
}
83.6184
4
13
24
T_{
p
}
80.1530
5
17
300
T_{
c
}
83.3933
5
13
300
T_{
p
}

6
6
300
T_{
p
}
82.6283
7
13
24
T_{
4
}
94.9145
8
17
300
T_{
c
}
93.2943
8
13
9
T_{
p
}

9
17
300
T_{
c
}
88.7039
9
10
15
T_{
c
}

10
17
300
T_{
c
}
87.3087
11
13
15
T_{
c
}
82.3132
12
13
15
T_{
c
}
91.8992
3. Discussion
In this article, the ability of the KQuaternary Code I (the Code 13) through the use of the discrete Fourier transform to capture the periodicities of an exon sequence or an intron sequence across the entire spectrum at a resolution defined by the window length is shown in Figures 3 and 4. Such a spectral tracking ability is shared among all of the sixteen numerical representations. As seen from Figures 3 and 4, there are three prominent peaks located at frequency equal to 0, 2π/3, and 4π/3. The peak at 2π/3 corresponds to the period3 power spectral value, and the peak at 0 corresponds to the power spectral value at DC which usually exhibits the highest value within a spectrum. The power spectral values at other frequencies are lower and different but all serve to reflect their actual power spectral properties across the spectrum. For the Codes 916, their numerical represented sequences x[n] are complex; therefore, DFT spectrum shows nonsymmetrical peaks at frequency equal to 2π/3 and 4π/3. However, for the Codes 18 in which their numerical represented sequences x[n] are real and therefore symmetrical peaks at frequency equal to 2π/3 and 4π/3 are obtained from DFT analysis.
The interpolated winner threshold values, T_{
i
} (P_{3}), and their corresponding precisions shown in Figures 13 and 14 for short sequences, and in Figures 15 and 16 for long sequences of the human genome indicate that T_{
i
} (P_{3}) obtained using cubic spline interpolation from either Figure 11 or Figure 12 can yield a similar precision as compared to that of the winner threshold value, T_{
w
} (P_{3}), of its nearby SL. It should be noted that each of T_{
m
} (P_{3}), T_{
p
} (P_{3}), and T_{
c
} (P_{3}) required to determine T_{
w
} (P_{3}) has to be computed directly from the training portion of each short/long sequence set whereas T_{
i
} (P_{3}) requires minimal computation but can achieve a comparable precision.
Given an unknown human nucleotide sequence of an arbitrary length L, if L is equal to the length of any of the thirteen short sequence sets or any of the thirteen long sequence sets, the choice of a suitable set of code, WL, and threshold can be obtained as a tablelookup from either Table 4 or 6. If L is not equal to the length of any of these thirteen short or long sequence sets, the closest SL, its code, and WL can also be obtained from either Table 4 or 6. For the latter case, once the code and WL are determined, its threshold value can then be determined using the interpolated winner threshold described in Section 2.5. Besides the human genome, the methodologies described in this article can be applied to the genome of other model organisms as verified by the results shown in Figures 17(ab) and 18 (ac) and Tables 7 and 8.
4. Conclusions
In this article, two methods for determining threshold values have been defined, and together with the cumulative distribution threshold value, determine the winner threshold value for classifying an unknown nucleotide sequence of a fixed length. An interpolated winner threshold value has also been introduced to classify an unknown nucleotide sequence of an arbitrary length with a comparable performance to that obtained by the winner threshold value of its nearby SL (in classifying an unknown nucleotide sequence of a fixed length). In general, precision increases as sequence length increases, and classification performance depends on a suitable choice of numerical representation and window length. Sixteen 1sequence numerical representations have been presented and compared to classify untrained exon and intron sequences in the spectral domain, in which the KQuaternary Code I yields attractive performance. When comparing each of the sixteen windowed 1sequence numerical representations using WL = [9,15,24] bases to a nonwindowed 4sequence numerical representation (such as the Voss representation), the speed improvement ratio increases as SL increases which favours long nucleotide sequence analysis. The results obtained indicate the methodologies introduced in this article for exon and intron sequence classification are applicable to the genomes of the human and other model organisms. Overall, the study has developed novel methodologies in numerical representation for improved nucleotide to numeric mapping, spectral analysis for effective period3 spectral value computation, and thresholding for more accurate classification of unknown exon and intron sequences of fixed and arbitrary length.
Declarations
Authors’ Affiliations
(1)
Department of Electrical and Computer Engineering, University of Windsor
(2)
Faculty of Medicine, University of Ottawa
(3)
School of Medicine, Queen's University
References
Vaidyanathan PP: Genomics and proteomics: A signal processor's tour. IEEE Circ. Syst. Mag. 4th Q 2004, 629.
Anastassiou D: Genomic signal processing. IEEE Signal Process. Mag. 2001, 18: 820.View Article
Kwan HK, Arniker SB: Numerical representation of DNA sequences. 307310.
Arniker SB, Kwan HK: Graphical representation of DNA sequences. 311314.
Cristea PD: in BIOS’02: Genetic Signal Representation and Analysis , SPIE International Conference on Biomedical Optics Symposium, Molecular Analysis and Informatics, San Jose, CA, USA, 2124 January 2002. Proceedings of Society of PhotoOptical Instrumentation Engineers (SPIE) Conference, vol. 4623 (SPIE, 2002). 7784.
Akhtar M, Epps J, Ambikairajah E: On DNA numerical representations for period3 based exon prediction. 4 pages in Proceedings of IEEE Workshop on Genomic Signal Processing and Statistics (GENSIPS), Tuusula, Finland, 1012 June 2007
Rosen GL: Signal Processing for Biologicallyinspired Gradient Source Localization and DNA Sequence Analysis. PhD dissertation, Georgia Institute of Technology, Atlanta; 2006.
Akhtar M, Epps J, Ambikairajah E: Signal processing in sequence analysis: Advances in eukaryotic gene prediction. IEEE J Sel Top Signal Process 2008, 2: 310321.View Article
Holden T, Subramaniam R, Sullivan R, Cheng E, Sneider C, Tremberger G Jr, Flamholz A, Leiberman DH, Cheung TD: ATCG nucleotide fluctuation of Deinococcus radiodurans radiation genes. In Instruments, Methods, and Missions for Astrobiology X. Proceedings of Society of PhotoOptical 2 Instrumentation Engineers (SPIE) Conference, vol. 6694 (SPIE, 12 September 2007) Edited by: B Hoover, GV Levin, AY Rozanov, and PCW Davies. 669417166941710.
Stanley HE, Buldyrev SV, Goldberger AL, Goldberger ZD, Havlin S, Ossadnik SM, Peng CK, Simmons M: Statistical mechanics in biology: how ubiquitous are longrange correlations? Physica A 1994, 205: 214253. 10.1016/03784371(94)905029View Article
Cosic I: Macromolecular Bioactivity: is it resonant interaction between macromolecules? Theory and applications. IEEE Trans Biomed Eng 1994, 41: 11011114. 10.1109/10.335859View Article
Chakravarthy N, Spanias A, Lasemidis LD, Tsakalis K: Autoregressive modeling and feature analysis of DNA sequences. EURASIP Journal of Genomic Signal Processing 2004, 1: 1328.View Article
Cristea PD: Conversion of nucleotides sequences into genomic signals. J Cell Mol Med 2002, 6: 279303. 10.1111/j.15824934.2002.tb00196.xView Article
Cristea PD: Representation and analysis of DNA sequences. In Genomic Signal Processing and Statistics, EURASIP Book Series in Signal Processing and Communications, volume 2 (Hindawi Publishing Corporation, 2005) Edited by: ER Dougherty, I Shmulevich, J Chen, ZJ Wang. 1565.
Brodzik AK, Peters O: Symbolbalanced Quaternionic periodicity transform for latent pattern detection in DNA sequences. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol 5, Philadelphia, USA, March 2005 373376.
Tiwari S, Ramachandran S, Bhattacharya A, Bhattacharya S, Ramaswamy R: Prediction of probable genes by Fourier analysis of genomic sequences. Bioinformatics (CABIOS) 1997, 13(3):263270. 10.1093/bioinformatics/13.3.263View Article
Datta S, Asif A: A fast DFT based gene prediction algorithm for identification of protein coding regions. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol 5, Philadelphia, USA, March 2005 653656.
Kwan BYM, Kwan JYY, Kwan HK, Atwal R, Shen OT: Wavelet analysis of the genome of the model plant Arabidopsis thaliana. In 4 pages in Proceedings of TENCON, Hong Kong, China, 2006. ;
Berger JA, Mitra SK, Carli M, Neri A: New approaches to genome sequence analysis based on digital signal processing. 4 pages in Proceedings of IEEE Workshop on Genomic Signal Processing and Statistics (GENSIPS), Raleigh, North Carolina, October 2002
Cheever EA, Searls DB, Karunaratne W, Overton GC: Using signal processing techniques for DNA sequence comparison. 173174.
Rajasekaran S, Nick H, Pardalos PM, Sahni S, Shaw G: Efficient algorithms for local alignment search. J Comb Optim 2001, 5: 117124. 10.1023/A:1009893719470MathSciNetView Article
Avenio GD, Grigioni M, Orefici G, Creti R: SWIFT (sequencewide investigation with Fourier transform): a software tool for identifying proteins of a given class from the unannotated genome sequence. Bioinformatics 2005, 21(13):29432949. 10.1093/bioinformatics/bti468View Article
Yin CC, Yau SST: Prediction of protein coding regions by the 3base periodicity analysis of a DNA sequence. J Theor Biol 2007, 247: 687694. 10.1016/j.jtbi.2007.03.038MathSciNetView Article
Tuqan J, Rushdi A: A DSP approach for finding the codon bias in DNA sequences. IEEE J Sel Topics Signal Process 2008, 2(3):343356.View Article
Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ: The UCSC Table Browser data retrieval tool. Nucleic Acids Res 2004, (32, Database issue):D493496. 10.1093/nar/gkh103
Goecks J, Nekrutenko A, Taylor J, The Galaxy Team: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology 2010., 11(8): Article R86 10.1186/gb2010118r86
Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J: Galaxy: a webbased genome analysis tool for experimentalists. Curr Protoc Mol Biol 2010, Chapter 19(Unit 19.10):121.
Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A: Galaxy: a platform for interactive largescale genome analysis. Genome Res 2005, 15(10):14511455. 10.1101/gr.4086505View Article
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.