James Long
jlong@arsc.edu
Arctic Region Supercomputing Center
Cray's Cray Bioinformatics Library (CBL) is a low level set of library routines written in Fortran and Cray assembly language (most are callable from C) that implement some common nucleotide/protein sequence manipulations typical in a bioinformatics setting. The original CBL was coded and optimized on a Cray SV1 vector machine, and the National Cancer Institute used it to produce a comprehensive map of short tandem repeats [1]. The CBL will continue to grow as additional biological computational primitives are identified and implemented.
The routines in CBL are designed to facilitate performance by operating on compressed data whenever possible. In the case of nucleotide data, for example, it is sufficient to represent each of the four nucleotides with only two bits, and thus a 64-bit word can contain a sequence of 32 nucleotides instead of the normal 8. A CBL search routine can then take advantage of the compression by comparing whole words, one containing all or part of a compressed query, and the other containing part of a compressed database. Since query and database are effectively 1/4 in size, significant performance is realized. In addition to 2-bit compression, CBL supports 4 bit and 5 bit levels for larger alphabets.
The Portable CBL v1.0, consisting entirely of integer operations written in C, implements the computational primitives in a generic fashion with little regard to specific hardware. It does, however, try to use loops that will vectorize, not only so that it can be compared to the proprietary CBL on the Cray SV1ex, but also to run effectively on other platforms like the SX6 and X1. In addition, vectorizable routines run faster on non-vector hardware in many cases. Issues surrounding the implementation of the portable version are addressed in this paper.
Two factors dominate portability for the CBL. The first factor to consider is machine register length, either 32- or 64-bit. The second factor to consider is whether a machine is big-endian or little-endian. The portable CBL is coded to run on any combination of these two factors:
1) 32-bit, big-endian: most unix workstations 2) 32-bit, little-endian: Intel IA-32 3) 64-bit, big-endian: most supercomputers 4) 64-bit, little-endian: Intel IA-641
Both factors are discussed below.
pattern output array of packed words containing the repeated pattern and the number of times it was repeated, in this format: ------------------------------------------ |00| pattern (32 bits) | count (30 bits) | ------------------------------------------
On a 32-bit box, changing the return type of "pattern" from
long *pattern
to
long long *pattern
preserves the same 64-bit format for the data as on a 64-bit big-endian machine. See the next section for the little-endian format.
Big-endian:
|<--------word0-------->|<--------word1-------->|<--etc-->| byte0 byte1 byte2 byte3 byte4 byte5 byte6 byte7 a c g t a null
Little-endian:
|<--etc-->|<--------word1-------->|<--------word0-------->| byte7 byte6 byte5 byte4 byte3 byte2 byte1 byte0 null a t g c a
Memory on both machines contain the same nucleotide string acgta. Big-endian machines read left-to-right, like English, while little-endian machines read right-to-left, like Hebrew2. Let's see what the following code prints in each case:
main() { char *c = "acgta"; unsigned int i; i = ((unsigned int *) c)[0] >> 8; printf("%s\n",(char *) &i); }
First, this code casts the character pointer to an unsigned int pointer so that c[0] now contains 4 bytes, i.e. the first four letters of the string. It then shifts the rightmost letter off (8 bits), and prints the resulting string by treating the address of the unsigned int as a character pointer.
On a big-endian machine this code prints absolutely nothing. It's easy to see why in the picture, the t in byte3 gets shifted off to the right, and a null gets shifted in from the left3. The null is the first byte of the new string, and therefore nothing prints.
On a little-endian machine, however, the string cgt is printed. Looking at the little-endian picture above, it is the a in byte0 that gets shifted off this time. When a null gets shifted in from the left, we are left with the string cgt.
When acgta is compressed, it becomes apparent why byte-order is a significant factor affecting portability [4]. The 2-bit representation for each of the four nucleotide letters is taken directly from their ASCII code in a case insensitive manner:
A 0100 0001 a 0110 0001 C 0100 0011 c 0110 0011 G 0100 0111 g 0110 0111 T 0101 0100 t 0111 0100 ^^ || these two columns contain the desired 2-bit case-insensitive code
After stripping out the unnecessary bits, it's clear that the result should be shifted to the left in the big-endian case, and to the right in the little-endian case:
Big-endian
|<--------------word0-------------->|<--------------word1-------------->| |byte0---|byte1---|byte2---|byte3---|byte4---|byte5---|byte6---|byte7---| 01100001 01100011 01100111 01110100 01100001 00000000 <------8-bit ascii a c g t a null 00011110 00000000 00000000 00000000 <------ compressed 2-bit string a c g t a null padded zeros4
Little-endian
|<--------------word1-------------->|<--------------word0-------------->| |byte7---|byte6---|byte5---|byte4---|byte3---|byte2---|byte1---|byte0---| 8-bit ascii -----> 00000000 01100001 01110100 01100111 01100011 01100001 null a t g c a compressed 2-bit string ----------> 00000000 00000000 00000000 10110100 padded zeros null a t g c a
The format of the compressed information returned via the pattern output array discussed in the section on register length was shown for the big-endian case:
------------------------------------------ |00| pattern (32 bits) | count (30 bits) | ------------------------------------------
The little-endian format becomes:
------------------------------------------ | count (30 bits) | pattern (32 bits) |00| ------------------------------------------
In conclusion, word- and byte-order for little endian is the mirror image of big-endian, but bit order is not. The 2-bit, 4-bit, and 5-bit representations for individual compressed letters from an alphabet read from left-to-right for both endians. This is true in general for any n-bit type within a word, i.e. a 32-bit int in big-endian has the same bit order as in little-endian.
Testing philosophy is in the style of Extreme Programming (XP), where the test is written before the production code. While this author does not subscribe to all tenets of XP (i.e. two programmers sitting together at one machine), designing the test code first has proved useful. For CBL, a test code is comprised of a simple skeleton into which the production code is inserted so that its output may be compared to the output of a simpler-but-slower version. Outputs are tested across word boundaries and at edges. Writing the test before the code helps to delineate code requirements up-front, potentially reducing the number of changes that might need to be made later. Additionally, writing the slower version first helps to clarify the issues for the production version.
The test suite included with the Portable CBL distribution consists of four source files that exercise seven of the routines. An installation configure script is edited to set parameters for the target platform, which when run creates two header files and a makefile. Typing "make" builds the library, and "make test" builds and runs the four tests.
A typical test code contains an outer loop that, on each iteration, creates a random database string of increasing length and compresses it. The compressed string never gets more than a few words long, but all possible parameter combinations and word boundary conditions are tested by the slow version of the library on the uncompressed data, whose outputs are then compared against the compressed data of the production version. The DEBUG_INFO flag in the configure file can be set for debugging info if any test fails, while flag BMARK turns on the benchmarking portion of each test, producing the data in the Section 4 charts below.
The Portable CBL distribution incorporates a benchmark suite inside the test
suite. The benchmark codes are implemented by wrapping two loops around each CBL
routine. The outer loop begins with a database length of 256 bytes before compression, doubling
it on each iteration to a final length of 33,554,432. The inner loop repeats
the call a sufficient number of times to accumulate timing data while
exercising the various parameters whenever possible, and prints out the
accumulated user and system time for each iteration of the outer loop. The
total time for all iterations of the outer loop is summed on various platforms
in the following charts for each routine, where the proprietary (native) CBL
v1.2 on the ARSC SV1ex appears in the first column, followed by a direct
comparison with the Portable CBL v1.0 timings in SSP mode on the same platform.
The remaining columns contain Portable CBL timings in the following
order5:
500 MHz SX6
800 MHz X1, compiled with "-h command" to ensure SSP mode, VLEN = 128
800 MHz X1, MSP mode, VLEN = 128
800 MHz X1, MSP mode, VLEN = 512
1.3 GHz IBM P4, 32 MB shared L3, 64-bit mode
1.4 GHz AMD, 256 KB cache, intel 7.1 compiler
1.4 GHz AMD, 256 KB cache, gcc 3.2.2 compiler
1.4 GHz AMD, 256 KB cache, Portland Group Workstation 4.1 compiler
Various compiler optimizations were tried, and the charts below reflect the best time obtained for that routine.
Fig.1: Translate Nucleotides to Amino Acids
cb_amino_translate_ascii employs a 64-element static unsigned long array as a lookup table to translate groups of 3 nucleotides that are compressed in 2-bit mode into amino acids in 8-bit ASCII. The algorithm consists solely of masks, shifts, and lookups for all three reading frames of the database. While this simple strategy works well on non-vector hardware, the performance on vector hardware is mediocre.
Fig.2: Compress/uncompress Nucleotides or Amino Acids
cb_compress/uncompress consists largely of mask and shift operations, performing well on all platforms.
Fig.3: Copy a Contiguous Sequence of Memory Bits
cb_copy_bits performs only a few register operations before moving data back to memory, essentially making this routine a memory bandwidth measure for a platform.
Fig.4: Count A, C, T, G, and N Characters in a String
cb_countn_ascii performs well, except on the SV1ex.
Fig.5: Find Short Tandem Repeats in a Nucleotide String
cb_repeatn algorithm needs additional work for vector platforms, but has excellent performance on low-end hardware.
Fig.6: Reverse Complement Compressed Nucleotide Data
cb_revcompl starts at the end of the database, shifting bits into a new word before a bit reversal within the word followed by a bitwise complement. Needs tuning on low-end hardware.
Fig.7: Gap-Free Nucleotide Search Allowing Mismatches
cb_searchn screens candidates by counting mismatches for only a fraction of each candidate database string, hoping to reject many without having to count all mismatches. Surviving candidates are saved until there are enough to process a VECTLEN number of them, set at 128 for non-vector hardware. Needs bigger cache on low-end hardware.
Fig.8: Driving IBM P4 Out-of-Cache
IBM P4 processors have a shared 32 MB L3 cache, large enough to contain all of the databases in the benchmark suite. Compare actual runtimes with ideal runtimes computed by successively doubling the uncompressed 32 MB time for each database.
Table 1: Integer Benchmarks for figures above (seconds) 500 800 MHz X1 1.3 1.4 GHz MHz SSP MSP MSP GHz icc gcc pgcc CBL Function CRAY ARSC SX6 128 128 512 IBM AMD AMD AMD ============ ==== ==== === === === === === === === === cb_amino_translate_ascii: 39 108 77 158 94 88 44 63 87 91 cb_compress/uncompress: 31 54 55 58 46 42 48 64 70 78 cb_copy_bits: 15 26 1 5 1 3 12 45 45 50 cb_countn_ascii: 18 84 11 24 7 4 29 59 62 63 cb_repeatn: 41 143 83 138 139 124 33 48 49 52 cb_revcompl: 19 81 15 34 25 19 46 148 138 156 cb_searchn: 20 84 28 58 49 41 64 129 167 166 IBM P4: 23,49,99,198,396 26,54,110,222,442 8,18,43,90,178 14,29,61,123,246 17,35,69,138,277 24,50,105,220,448 32,65,131,262,524
The primary objective of the Portable CBL is to run on as many platforms as possible, and so it was written with little regard to proprietary architectural considerations, save the caveat that main work loops were made vectorizable if possible. This caveat allows direct comparison to the native CBL running on the ARSC SV1ex, and allows the routines to run in reasonable times on the SX6 and X1. Additionally, vectorizable loops aid performance on non-vector hardware by keeping pipelines loaded in many cases.
Portable CBL runtimes on the ARSC SV1ex are about 2x to 4x of those for the native version. The native version was used as a standard by which to compare the algorithms used in the portable version, and in most cases the vectorized algorithms are the ones used on non-vector hardware. In those cases where the vectorized version is not used, a serial version of the same algorithm is employed because it is found to run faster, usually due to locality considerations. Preprocessing directives control code choice.
The Cray SX6 runtimes for the Portable CBL are all within 2x of the times for the native version on the SV1ex, taking about twice as much time for cb_amino_translate_ascii, cb_compress/uncompress, and cb_repeatn. Times for cb_countn_ascii and cb_revcompl, however, are slightly faster, while cb_copy_bits is 14x faster, essentially a measure of the superior memory bandwidth of the SX6. The time of about one second for this routine was matched only by the X1 in MSP mode.
Comparison of the portable times on the SX6 to the portable times on the SV1ex reveals superior performance on the SX6. Big improvements include the cb_searchn routine, arguably one of the most important in the library, running 3x faster, cb_revcompl running 6x faster, and cb_countn_ascii running 8x faster.
On the Cray X1, three CBL functions benefit significantly from MSP mode. cb_amino_translate_ascii has a run time almost half that of the X1 in SSP mode, but still twice that of the native version on the SV1ex. cb_copy_bits runs significantly faster, with a nice exhibition of high memory bandwidth similar to the SX6. cb_countn_ascii shows two good improvements, once when going from SSP to MSP with VECTLEN=128, and again while remaining in MSP mode and going from VECTLEN=128 to VECTLEN=512. All but one routine experienced some gain by increasing the chunk size for the MSP. For portable timing comparisons, the X1 times are largely comparable to those of the SX6, with the SX6 performing better overall.
For non-vector machines, the vectorizable loops facilitate loaded pipelines with very good results on high-end hardware, and acceptable results on low-end hardware. On the 1.3 GHz IBM power4 with 32 MB shared L3 cache, where the database always fits into L3 for this benchmark, the portable CBL runtimes are roughly comparable in most cases with the native version on the SV1ex, the most notable exception being the searchn routine, where the portable version is 3 times slower. As mentioned earlier, the search routine is important, and Cray's effort here is outstanding. Driving the P4 out of cache with databases up to 512 MB reveals a 40% deviation from the ideal case for the cb_copy_bits routine, consistent with this routine's role as a measure of memory bandwidth. Other routines did not fare as badly, the next worst being cb_revcompl with about a 15% degradation. Note that only columns labeled 8 & 9 in Figure 8 are out-of-cache for benchmarks on compressed data.
Under linux on 1.4 GHz AMD hardware w/256 KB cache, runtimes are 1.5x to 8x of those for the native version on the SV1ex. The Intel v7.1 compiler (icc) produces the best results, remarkably so for the important searchn routine, while gcc 3.2.2 surprisingly outperforms Portland Group's Workstation 4.1 (pgcc). Intel and AMD have always had good integer performance, and it remains to be seen what numbers might come from a faster processor like a 2 GHz Xeon MP with a 2 MB cache. The very good numbers for low-end came on cb_repeat, running in about the same time as the native version, and was in the ball park for many of the other routines. Performance dropped considerably, however, on cb_revcompl and the important cb_searchn. Overall, the performance of the Portable CBL on low-end hardware is quite good, and the price/performance ratio is attractive.
Given that the Portable CBL does perform acceptably well on low-end hardware, it is hoped that the library will provide the framework via open source for others to not only optimize the code for specific hardware, but to extend the library with additional primitives and promote its use as a standard.
1 IA-64 can access either endian data [2].
2
Here is how awkward it would look if we mentally tried
to make the little-endian words look like English:
|<--------word0-------->|<--------word1-------->|<--etc-->| byte3 byte2 byte1 byte0 byte7 byte6 byte5 byte4 t g c a null a
3 signed words allow the possibility of 1's being shifted in.
4
Note that an a is 00, composed of the same zero bits used to
pad
the word. Thus when decompressing, one must know the
number
of letters that were compressed.
5
Thanks to Shawn Houston of ARSC for use of his linux box,
and Ed Kornkven of ARSC for running the X1 numbers.
Special thanks to Bill Long of Cray, Inc. for hints.
cb_amino_translate_ascii - translate nucleotides to amino acids cb_compress - compresses nucleotide or amino acid ASCII data cb_copy_bits - copy contiguous sequence of memory bits cb_countn_ascii - counts A, C, T, G, and N characters in a string cb_fasta_convert - restructure the memory image of a FASTA file cb_free - frees memory allocated with cb_malloc in Cray version - simply calls free() in portable version cb_irand - generates an array of random bits cb_malloc - allocate block aligned memory region in Cray version - simply calls malloc() in portable version cb_read_fasta - loads data from a FASTA file into memory arrays cb_repeatn - find short tandem repeats in a nucleotide string cb_revcompl - reverse complements compressed nucleotide data cb_searchn - gap-free nucleotide search allowing mismatches cb_uncompress - uncompress nucleotide or amino acid data to ASCII cb_version - returns the version number of libcbl
The native v1.0 CBL contains the following routines in fortran only: cb_isort - unsigned integer radix sort cb_sort - multi-pass sort routine for compressed data cb_swa_fw - compute Smith-Waterman cell scores with ASCII input and full word output Smith-Waterman is planned for v1.1 of the portable CBL. Routines to deal with proprietary Cray SSD hardware (in both C and fortran): cb_copy_from_ssd cb_copy_to_ssd cb_largest_ssdid cb_ssd_errno cb_ssd_free cb_ssd_initThe native CBL is now at version 1.2, with the following additions/changes:
cb_isort - Fortran-only restriction is removed cb_isort1 - new - cb_isort without the index array cb_cghistn - new - histograms of cg density in a string cb_swa_fw - The Fortran-only restriction is removed. Also, the "full word" refers to the cell size in the scoring matrix, not the output, which is a pair of aligned character strings. cb_swn_fw - new - same as cb_swa_fw, except with 2-bit nucleotide input cb_swn4_fw - new - same as cb_swa_fw, except with 4-bit nucleotide input cb_block_zero - new - blocks out a region of memory using special block transfer SV1ex hardware cb_nmer - new - creates up to 64-bit-length short sequences from each starting point in the input string.