21 Oct 2019

OE Floating-Point Options for ARMv5 (ARM926EJ-S)

One of the (many) things I enjoy about OpenEmbedded is how easy it is to try out different configurations. Want to switch from sysvinit to systemd? Change the config, re-build, and there's your new image to test. Want to switch from busybox to coreutils? Change the config, re-build, and there's your new image.

Recently, I have been working with an ARMv5 device that was released in 2008: the NXP LPC3240 which is based on the ARM926EJ-S SoC. The specific device I'm using has a VFPv2 unit, however since the VFP was optional on the ARM926EJ-S, most distros/images are built with no floating-point support. From the standpoint of binary distributions, this makes the most sense: if you want to supply a binary to run on the most number of devices, build for the lowest common denominator. But when building your own distro/images from source using OpenEmbedded, you have the flexibility to tweak the parameters of your build to suit the specifics of your hardware.

Nowadays, a user has 3 choices when it comes to VFP on the ARM926EJ-S:
  1. soft: floating-point emulated in software (no hardware floating-point)
  2. softfp: enable hardware floating-point but have floating-point parameters passed in integer registers (i.e. use the soft calling conventions)
  3. hard: enable floating-point and have floating-point parameters passed in floating-point registers (i.e. use FPU-specific calling conventions)
The naming of option 2 (softfp) is unfortunate. To me, saying "soft floating-point" implies the floating-point is being emulated in software. However, its name was meant to contrast its calling convention with that of hard floating-point, not to imply the floating-point is being emulated in software.

By default in OpenEmbedded, specifying tune-arm926ejs.inc sets the DEFAULTTUNE to "armv5te" which disables VFP. By tweaking DEFAULTTUNE in your machine.conf file (or local.conf) you can try out all the options. Personally, when setting DEFAULTTUNE, I also like to tweak TUNE_CCARGS.

To try out the different options, set the following parameters:
  1. soft:
    DEFAULTTUNE = "armv5te"
    TUNE_CCARGS = "-mcpu=arm926ej-s -marm"
  2. softfp:
    DEFAULTTUNE = "armv5te-vfp"
    TUNE_CCARGS = "-mcpu=arm926ej-s -mfpu=vfp -mfloat-abi=softfp -marm"
  3. hard:
    DEFAULTTUNE = "armv5tehf-vfp"
    TUNE_CCARGS = "-mcpu=arm926ej-s -mfpu=vfp -mfloat-abi=hard -marm"
The meta-openembedded/meta-oe layer provides a number of recipes for benchmark applications. Interesting performance benchmark programs include: whetstone, dhrystone, linpack, nbench, and the "cpu" test of sysbench.

STD BENCHMARK DISCLAIMER: when it comes to benchmarks it's always important to remember that they are synthetic. That is: they are programs created to measure the performance of some artificial work-load of their choosing. If you want to know how the performance of your program will change under different settings, the only real way to determine that is to build and test your specific program under the different settings. It's also worth pointing out that during the era when benchmark programs were a really hot topic (late 90's-ish?) many vendors would tailor their hardware towards the popular benchmark programs of the time, skewing the results dramatically. In other words, a specific piece of hardware would be tuned to run a specific benchmark really well, but "real" workloads wouldn't see much improvement. Therefore YMMV.

For this experiment I created three images; each one built using one of the three floating-point tunings given above but all containing the same contents and the same versions of all the contents. I then loaded each of the images on my hardware in turn, so I could run the benchmark programs to generate performance data.

As of the time these images were built (Oct 11, 2019), the HEAD revision of openembedded-core was 59938780e7e776d87146002ea939b185f8704408 and the head revision of meta-openembedded/meta-oe was fd1a0c9210b162ccb147e933984c755d32899efc. At that time, the compiler being used was gcc-9.2, and the versions of various components are: glibc:2.30, bash:5.0, dhrystone:2.1, linpack:1.0, nbench:2.2.3, sysbench:0.4.12, and whetstone:1.2.

First Impressions

One of the first interesting things to note is the size of the various binaries:


soft softfp hard




whetstone 33,172 20,236 20,444




dhrystone 13,752 9,660 9,660




sysbench 81,268 77,176 77,176




linpack 13,744 9,652 9,652




nbench 47,308 43,216 43,216

    Looking at the disassembly of each of these binaries, it's not hard to see why this is. Disassembling the binaries is as simple as:
    $ arm-oe-linux-gnueabi-objdump -d whetstone
    While the softfp and hard programs are filled with VFP instructions (e.g. vldr, vmul.f64, vsub.f64, etc.) the soft program contains calls to various __aeabi_* functions and __adddf3. These functions come from libgcc, a library written by the gcc people to help shore up things that are missing from standard C libraries (such as software emulation of floating-point, see here for more info). Interestingly, the code of these functions is linked into the executable itself (and not as a shared library). As you can imagine, emulating floating-point operations in software is going to take a lot of code!

    If you have floating-point hardware, taking advantage of it will shrink the size of your executables (if they use floating-point math).

    Whetstone

    whetstone is a benchmark program whose primary purpose is to measure floating-point performance. In each image I ran the whetstone program 5 times, timing each run with time, and had it run 1,000,000 loops:
    # time whetstone 1000000
    The averages of each test are as follows. Higher MIPS is better, lower time is better:

    soft softfp hard
    MIPS duration [s] MIPS duration [s] MIPS duration [s]
    100.16 998.4 1872.84 53.4 1872.84 53.4

    Dhrystone

    dhrystone is a benchmark used to evaluate integer performance. In each image I ran the whetstone program 5 times, timing each run, and performing 1,000,000 iterations per run:
    # time echo 1000000 | dhry
    The averages are as follows. Higher dhry/sec is better, lower time is better:

    soft softfp hard
    dhry/sec duration [ms] dhry/sec duration [ms] dhry/sec duration [ms]
    432527.22 2.3 431037.7 2.3 429554.58 2.3

    Sysbench (cpu)

    sysbench is a benchmark which includes a bunch of sub-benchmarks, one of which is the "cpu" test. On each image I ran the cpu test 5 times, capping the run-time to 300[s]. The benchmark appears to perform prime factorization, measuring something called "events", and recording run time per event.
    # time sysbench --max-time=300 --test=cpu run
    soft softfp hard
    events duration/event [ms] events duration/event [ms] events duration/event [ms]
    1157.2 259.29 2951.6 101.638 2951 101.662


    As a final test, on each image I ran the cpu test just once without a time limitation, to see how much time it would otherwise take.
    # time sysbench --test=cpu run
    soft softfp hard
    events test duration events test duration events test duration
    10000 43m0.50s 10000 16m56.499s 10000 16m56.777

    Linpack

    linpack is a benchmark testing a computer's ability to perform numerical linear algebra. The program takes one required parameter: the size of the array to use. If you pass "200", it will calculate a 200x200 array. As it runs, it determines how many repetitions to perform, it bases the repetitions on its performance. For each repetition it records how much time it took. When it's done a set of repetitions, it calculates a KFLOPS count, then starts over with a different repetition count.

    For each image I ran the program once with "200" and once with "500". With no hardware floating point support calculating a 200x200 array it starts with 1 repetition, then tries 2, then 4, 8, etc. With hardware floating-point on a 200x200 array it starts with 8 repetitions, then 16, 32, etc. On a 200x200 array the repetition counts common to all images are 8, 16, and 32. On a 500x500 array the repetition counts common to all images are 1 and 2.

    The program never terminates; it keeps increasing the repetition count and going until explicitly killed.
    # echo 200 | linpack
    soft softfp hard
    reps time/rep KFLOPS reps time/rep KFLOPS reps time/rep KFLOPS
    8 4.3 2718.669 8 0.64 18553.356 8 0.62 19223.389
    16 8.6 2718.614 16 1.29 18552.917 16 1.25 19214.278
    32 17.2 2718.792 32 2.58 18552.361 32 2.49 19212.128
    # echo 500 | linpack
    soft softfp hard
    reps time/rep KFLOPS reps time/rep KFLOPS reps time/rep KFLOPS
    1 8.1 2674.928 1 1.38 15876.865 1 1.38 15883.324
    2 16.17 2674.871 2 2.74 15878.365 2 2.74 15882.516

    nbench

    nbench (aka BYTEmark) runs a bunch of sub-tests (including: numerical sort, string sort, bitfield, fp emulation, fourier, assignment, IDEA, huffman, neural net, and LU decomposition) then generates both an integer index and a floating-point index. These indices are relative to what were considered capable machines of the time (mid-1990's).

    This benchmark was run twice on each image, the averaged results are:

    soft softfp hard


    integer idx fp idx integer idx fp idx integer idx fp idx


    1.054 0.1 1.1095 0.961 1.109 0.979

    Conclusions

    Since software floating-point emulation gets added statically to C programs, using hardware floating point makes binaries smaller in programs that perform floating-point calculations. Enabling floating-point in such programs also improves the performance of floating-point operations noticeably. Interestingly, it appears as though integer performance is ever so slightly impacted in the hard case relative to softfp. Therefore it would seem to be that if your entire work-load is floating-point, then go with hard, otherwise if there is both floating-point and considerable integer calculations, softfp might be best.

    As always, test your own application to know which mode is best in your scenario.

    No comments: