Recently, I have been working with an ARMv5 device that was released in 2008: the NXP LPC3240, an SoC built around the ARM926EJ-S core. The specific device I'm using has a VFPv2 unit; however, since the VFP is optional on the ARM926EJ-S, most distros/images are built with no floating-point support. From the standpoint of binary distributions this makes the most sense: if you want to supply a binary that runs on the largest number of devices, build for the lowest common denominator. But when building your own distro/images from source using OpenEmbedded, you have the flexibility to tweak the parameters of your build to suit the specifics of your hardware.
Nowadays, a user has 3 choices when it comes to VFP on the ARM926EJ-S:
- soft: floating-point emulated in software (no hardware floating-point)
- softfp: enable hardware floating-point but have floating-point parameters passed in integer registers (i.e. use the soft calling conventions)
- hard: enable floating-point and have floating-point parameters passed in floating-point registers (i.e. use FPU-specific calling conventions)
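To make the difference concrete, here's a quick experiment you can try yourself (my sketch, not part of the original write-up): put a trivial function that multiplies two doubles into a file, say mul.c, make sure your OpenEmbedded cross-compiler is on your PATH (I'm using the arm-oe-linux-gnueabi- prefix to match the objdump example later; your prefix may differ depending on your build configuration), and compile it once per ABI:

$ arm-oe-linux-gnueabi-gcc -mcpu=arm926ej-s -mfloat-abi=soft -O2 -c mul.c -o mul-soft.o
$ arm-oe-linux-gnueabi-gcc -mcpu=arm926ej-s -mfpu=vfp -mfloat-abi=softfp -O2 -c mul.c -o mul-softfp.o
$ arm-oe-linux-gnueabi-gcc -mcpu=arm926ej-s -mfpu=vfp -mfloat-abi=hard -O2 -c mul.c -o mul-hard.o

$ arm-oe-linux-gnueabi-objdump -d mul-soft.o     # calls __aeabi_dmul; arguments passed in r0-r3
$ arm-oe-linux-gnueabi-objdump -d mul-softfp.o   # vmul.f64, but arguments still arrive in r0-r3
$ arm-oe-linux-gnueabi-objdump -d mul-hard.o     # vmul.f64, with arguments arriving in d0/d1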
By default in OpenEmbedded, including tune-arm926ejs.inc sets DEFAULTTUNE to "armv5te", which disables VFP. By tweaking DEFAULTTUNE in your machine.conf file (or local.conf) you can try out all the options. Personally, when setting DEFAULTTUNE, I also like to tweak TUNE_CCARGS.
To try out the different options, set the following parameters:
- soft:
DEFAULTTUNE = "armv5te"
TUNE_CCARGS = "-mcpu=arm926ej-s -marm"
- softfp:
DEFAULTTUNE = "armv5te-vfp"
TUNE_CCARGS = "-mcpu=arm926ej-s -mfpu=vfp -mfloat-abi=softfp -marm"
- hard:
DEFAULTTUNE = "armv5tehf-vfp"
TUNE_CCARGS = "-mcpu=arm926ej-s -mfpu=vfp -mfloat-abi=hard -marm"
STD BENCHMARK DISCLAIMER: when it comes to benchmarks it's always important to remember that they are synthetic. That is: they are programs created to measure the performance of some artificial work-load of their choosing. If you want to know how the performance of your program will change under different settings, the only real way to determine that is to build and test your specific program under the different settings. It's also worth pointing out that during the era when benchmark programs were a really hot topic (late 90's-ish?) many vendors would tailor their hardware towards the popular benchmark programs of the time, skewing the results dramatically. In other words, a specific piece of hardware would be tuned to run a specific benchmark really well, but "real" workloads wouldn't see much improvement. Therefore YMMV.
For this experiment I created three images; each one built using one of the three floating-point tunings given above but all containing the same contents and the same versions of all the contents. I then loaded each of the images on my hardware in turn, so I could run the benchmark programs to generate performance data.
As of the time these images were built (Oct 11, 2019), the HEAD revision of openembedded-core was 59938780e7e776d87146002ea939b185f8704408 and the HEAD revision of meta-openembedded/meta-oe was fd1a0c9210b162ccb147e933984c755d32899efc. At that time the compiler being used was gcc-9.2, and the versions of the various components were: glibc:2.30, bash:5.0, dhrystone:2.1, linpack:1.0, nbench:2.2.3, sysbench:0.4.12, and whetstone:1.2.
First Impressions
One of the first interesting things to note is the size of the various binaries (sizes in bytes):

| binary | soft | softfp | hard |
|---|---:|---:|---:|
| whetstone | 33,172 | 20,236 | 20,444 |
| dhrystone | 13,752 | 9,660 | 9,660 |
| sysbench | 81,268 | 77,176 | 77,176 |
| linpack | 13,744 | 9,652 | 9,652 |
| nbench | 47,308 | 43,216 | 43,216 |
Disassembling the binaries shows why:

$ arm-oe-linux-gnueabi-objdump -d whetstone

While the softfp and hard programs are filled with VFP instructions (e.g. vldr, vmul.f64, vsub.f64, etc.), the soft program contains calls to various __aeabi_* functions and __adddf3. These functions come from libgcc, a low-level support library provided by the gcc people to supply operations the compiler needs but which aren't available from the hardware or the standard C library, such as software emulation of floating-point (see here for more info). Interestingly, the code of these functions is linked into the executable itself (and not as a shared library). As you can imagine, emulating floating-point operations in software takes a lot of code!
If you have floating-point hardware, taking advantage of it will shrink the size of your executables (if they use floating-point math).
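If you want to tell which flavour a given binary was built as, two quick checks work well (again, my suggestion rather than something from the original post): the ARM build attributes record the floating-point configuration, and the presence of double-precision __aeabi_ helpers gives away a soft-float build:

$ arm-oe-linux-gnueabi-readelf -A whetstone                          # hard-float builds report "Tag_ABI_VFP_args: VFP registers"
$ arm-oe-linux-gnueabi-objdump -d whetstone | grep -c '__aeabi_d'    # soft builds are full of these helpers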
Whetstone
whetstone is a benchmark program whose primary purpose is to measure floating-point performance. In each image I ran the whetstone program 5 times, timing each run with time, and had it run 1,000,000 loops:

# time whetstone 1000000

The averages of each test are as follows. Higher MIPS is better, lower time is better:
| | soft | softfp | hard |
|---|---:|---:|---:|
| MIPS | 100.16 | 1872.84 | 1872.84 |
| duration [s] | 998.4 | 53.4 | 53.4 |
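For anyone repeating this, the five timed runs are easy to script; something along these lines does the job (how the original runs were driven isn't stated, so treat this as a convenience sketch):

# for i in 1 2 3 4 5; do time whetstone 1000000; done

The same pattern works for the dhrystone and sysbench runs below.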
Dhrystone
dhrystone is a benchmark used to evaluate integer performance. In each image I ran the dhrystone program 5 times, timing each run, and performing 1,000,000 iterations per run:

# time echo 1000000 | dhry

The averages are as follows. Higher dhry/sec is better, lower time is better:
| | soft | softfp | hard |
|---|---:|---:|---:|
| dhry/sec | 432527.22 | 431037.7 | 429554.58 |
| duration [s] | 2.3 | 2.3 | 2.3 |
Sysbench (cpu)
sysbench is a benchmark which includes a bunch of sub-benchmarks, one of which is the "cpu" test. On each image I ran the cpu test 5 times, capping the run time at 300 [s]. The cpu test calculates prime numbers, measuring something called "events" and recording the run time per event.

# time sysbench --max-time=300 --test=cpu run
| | soft | softfp | hard |
|---|---:|---:|---:|
| events | 1157.2 | 2951.6 | 2951 |
| duration/event [ms] | 259.29 | 101.638 | 101.662 |
As a final test, on each image I ran the cpu test just once without a time limitation, to see how much time it would otherwise take.
# time sysbench --test=cpu run
| | soft | softfp | hard |
|---|---:|---:|---:|
| events | 10000 | 10000 | 10000 |
| test duration | 43m0.50s | 16m56.499s | 16m56.777s |
Linpack
linpack is a benchmark testing a computer's ability to perform numerical linear algebra. The program takes one required parameter: the size of the array to use. If you pass "200", it will calculate a 200x200 array. As it runs, it decides how many repetitions to perform, basing the repetition count on its measured performance. For each repetition it records how much time it took. When it finishes a set of repetitions, it calculates a KFLOPS count, then starts over with a different repetition count.

For each image I ran the program once with "200" and once with "500". With no hardware floating-point support, on a 200x200 array it starts with 1 repetition, then tries 2, then 4, 8, etc. With hardware floating-point on a 200x200 array it starts with 8 repetitions, then 16, 32, etc. On a 200x200 array the repetition counts common to all images are 8, 16, and 32. On a 500x500 array the repetition counts common to all images are 1 and 2.
The program never terminates; it keeps increasing the repetition count and going until explicitly killed.
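In practice that means each run has to be cut off by hand (e.g. Ctrl-C) once the repetition counts of interest have been covered; alternatively, if the coreutils timeout utility happens to be in your image (an assumption on my part, it often isn't in minimal builds), you can bound a run like this:

# echo 200 | timeout 600 linpack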
# echo 200 | linpack
| reps | time/rep (soft) | KFLOPS (soft) | time/rep (softfp) | KFLOPS (softfp) | time/rep (hard) | KFLOPS (hard) |
|---:|---:|---:|---:|---:|---:|---:|
| 8 | 4.3 | 2718.669 | 0.64 | 18553.356 | 0.62 | 19223.389 |
| 16 | 8.6 | 2718.614 | 1.29 | 18552.917 | 1.25 | 19214.278 |
| 32 | 17.2 | 2718.792 | 2.58 | 18552.361 | 2.49 | 19212.128 |
# echo 500 | linpack
| reps | time/rep (soft) | KFLOPS (soft) | time/rep (softfp) | KFLOPS (softfp) | time/rep (hard) | KFLOPS (hard) |
|---:|---:|---:|---:|---:|---:|---:|
| 1 | 8.1 | 2674.928 | 1.38 | 15876.865 | 1.38 | 15883.324 |
| 2 | 16.17 | 2674.871 | 2.74 | 15878.365 | 2.74 | 15882.516 |
nbench
nbench (aka BYTEmark) runs a bunch of sub-tests (including: numerical sort, string sort, bitfield, fp emulation, fourier, assignment, IDEA, huffman, neural net, and LU decomposition) then generates both an integer index and a floating-point index. These indices are relative to what were considered capable machines of the time (mid-1990's).

This benchmark was run twice on each image; the averaged results are:

| | soft | softfp | hard |
|---|---:|---:|---:|
| integer idx | 1.054 | 1.1095 | 1.109 |
| fp idx | 0.1 | 0.961 | 0.979 |
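For completeness: nbench needs no arguments for a default run, so each pass amounts to invoking the binary and waiting for the indices to be printed (the binary name below is an assumption based on the package name; the original post doesn't show the command):

# nbench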
Conclusions
Since software floating-point emulation gets linked statically into C programs, using hardware floating-point makes binaries smaller for programs that perform floating-point calculations. Enabling hardware floating-point in such programs also improves the performance of floating-point operations noticeably. Interestingly, it appears as though integer performance is ever so slightly impacted in the hard case relative to softfp. Therefore it would seem that if your workload is entirely floating-point, go with hard; if it mixes floating-point with considerable integer calculation, softfp might be best.

As always, test your own application to know which mode is best in your scenario.