21 Oct 2019

OE Floating-Point Options for ARMv5 (ARM926EJ-S)

One of the (many) things I enjoy about OpenEmbedded is how easy it is to try out different configurations. Want to switch from sysvinit to systemd? Change the config, re-build, and there's your new image to test. Want to switch from busybox to coreutils? Change the config, re-build, and there's your new image.

Recently, I have been working with an ARMv5 device that was released in 2008: the NXP LPC3240, which is built around an ARM926EJ-S core. The specific device I'm using has a VFPv2 unit; however, since the VFP was optional on the ARM926EJ-S, most distros/images are built with no hardware floating-point support. From the standpoint of binary distributions, this makes the most sense: if you want to supply a binary that runs on the greatest number of devices, build for the lowest common denominator. But when building your own distro/images from source using OpenEmbedded, you have the flexibility to tweak the parameters of your build to suit the specifics of your hardware.

Nowadays, a user has 3 choices when it comes to VFP on the ARM926EJ-S:
  1. soft: floating-point emulated in software (no hardware floating-point)
  2. softfp: enable hardware floating-point but have floating-point parameters passed in integer registers (i.e. use the soft calling conventions)
  3. hard: enable floating-point and have floating-point parameters passed in floating-point registers (i.e. use FPU-specific calling conventions)
The naming of option 2 (softfp) is unfortunate. To me, saying "soft floating-point" implies the floating-point is being emulated in software. However, its name was meant to contrast its calling convention with that of hard floating-point, not to imply the floating-point is being emulated in software.

By default in OpenEmbedded, including tune-arm926ejs.inc sets DEFAULTTUNE to "armv5te", which disables the VFP. By tweaking DEFAULTTUNE in your machine.conf file (or local.conf) you can try out all the options. Personally, when setting DEFAULTTUNE, I also like to tweak TUNE_CCARGS.

To try out the different options, set the following parameters:
  1. soft:
    DEFAULTTUNE = "armv5te"
    TUNE_CCARGS = "-mcpu=arm926ej-s -marm"
  2. softfp:
    DEFAULTTUNE = "armv5te-vfp"
    TUNE_CCARGS = "-mcpu=arm926ej-s -mfpu=vfp -mfloat-abi=softfp -marm"
  3. hard:
    DEFAULTTUNE = "armv5tehf-vfp"
    TUNE_CCARGS = "-mcpu=arm926ej-s -mfpu=vfp -mfloat-abi=hard -marm"
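For instance, a minimal local.conf fragment for the hard-float case might look like the following sketch (the MACHINE name here is hypothetical; DEFAULTTUNE and TUNE_CCARGS are the standard OE-core variables):

```
# local.conf fragment (sketch): select the hard-float tune for an
# ARM926EJ-S machine. The MACHINE name below is hypothetical.
MACHINE = "lpc3240-board"
DEFAULTTUNE = "armv5tehf-vfp"
TUNE_CCARGS = "-mcpu=arm926ej-s -mfpu=vfp -mfloat-abi=hard -marm"
```

After changing the tune, re-build and re-deploy the image as usual.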
The meta-openembedded/meta-oe layer provides a number of recipes for benchmark applications. Interesting performance benchmark programs include: whetstone, dhrystone, linpack, nbench, and the "cpu" test of sysbench.

STD BENCHMARK DISCLAIMER: when it comes to benchmarks it's always important to remember that they are synthetic. That is: they are programs created to measure the performance of some artificial work-load of their choosing. If you want to know how the performance of your program will change under different settings, the only real way to determine that is to build and test your specific program under the different settings. It's also worth pointing out that during the era when benchmark programs were a really hot topic (late 90's-ish?) many vendors would tailor their hardware towards the popular benchmark programs of the time, skewing the results dramatically. In other words, a specific piece of hardware would be tuned to run a specific benchmark really well, but "real" workloads wouldn't see much improvement. Therefore YMMV.

For this experiment I created three images; each one built using one of the three floating-point tunings given above but all containing the same contents and the same versions of all the contents. I then loaded each of the images on my hardware in turn, so I could run the benchmark programs to generate performance data.

As of the time these images were built (Oct 11, 2019), the HEAD revision of openembedded-core was 59938780e7e776d87146002ea939b185f8704408 and the head revision of meta-openembedded/meta-oe was fd1a0c9210b162ccb147e933984c755d32899efc. At that time, the compiler being used was gcc-9.2, and the versions of various components are: glibc:2.30, bash:5.0, dhrystone:2.1, linpack:1.0, nbench:2.2.3, sysbench:0.4.12, and whetstone:1.2.

First Impressions

One of the first interesting things to note is the size of the various binaries:


              soft     softfp   hard
  whetstone   33,172   20,236   20,444
  dhrystone   13,752    9,660    9,660
  sysbench    81,268   77,176   77,176
  linpack     13,744    9,652    9,652
  nbench      47,308   43,216   43,216

Looking at the disassembly of each of these binaries, it's not hard to see why. Disassembling a binary is as simple as:
    $ arm-oe-linux-gnueabi-objdump -d whetstone
While the softfp and hard programs are filled with VFP instructions (e.g. vldr, vmul.f64, vsub.f64, etc.), the soft program contains calls to various __aeabi_* functions and __adddf3. These functions come from libgcc, GCC's low-level runtime support library, which provides routines the target hardware can't perform natively (such as software emulation of floating-point). Interestingly, the code for these functions is linked into the executable itself (not as a shared library). As you can imagine, emulating floating-point operations in software takes a lot of code!

If you have floating-point hardware, taking advantage of it will shrink the size of your executables (provided they use floating-point math).

Whetstone

whetstone is a benchmark program whose primary purpose is to measure floating-point performance. In each image I ran the whetstone program 5 times, timing each run with time, and had it run 1,000,000 loops:
    # time whetstone 1000000
The averages of each test are as follows. Higher MIPS is better, lower time is better:

  soft                    softfp                  hard
  MIPS     duration [s]   MIPS     duration [s]   MIPS     duration [s]
  100.16   998.4          1872.84  53.4           1872.84  53.4
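As a quick sanity check on those numbers: the classic C whetstone derives its MIPS figure from the loop count and elapsed time (assuming the usual formula from the netlib source), and the figures above are self-consistent; the small gap from the reported 1872.84 is just rounding in the averaged times:

```python
# Sanity check of the whetstone results above. Assumes the classic
# formula: MIPS = (100 * loops) / (1000 * elapsed_seconds).
def whetstone_mips(loops, seconds):
    return (100.0 * loops) / (1000.0 * seconds)

loops = 1_000_000
print(round(whetstone_mips(loops, 998.4), 2))  # soft: ~100.16 MIPS
print(round(whetstone_mips(loops, 53.4), 2))   # softfp/hard: ~1872.66 MIPS
```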

Dhrystone

dhrystone is a benchmark used to evaluate integer performance. In each image I ran the dhrystone program 5 times, timing each run, and performing 1,000,000 iterations per run:
    # time echo 1000000 | dhry
The averages are as follows. Higher dhry/sec is better, lower time is better:

  soft                       softfp                    hard
  dhry/sec    duration [s]   dhry/sec   duration [s]   dhry/sec    duration [s]
  432527.22   2.3            431037.7   2.3            429554.58   2.3
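The throughput and duration columns cross-check nicely: at roughly 430,000 dhrystones per second, 1,000,000 iterations take about 2.3 seconds on every image:

```python
# Cross-check: elapsed time = iterations / throughput (dhry/sec)
iterations = 1_000_000
for rate in (432527.22, 431037.7, 429554.58):
    print(round(iterations / rate, 2))  # ~2.31 to 2.33 seconds
```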

Sysbench (cpu)

sysbench is a benchmark which includes a bunch of sub-benchmarks, one of which is the "cpu" test. On each image I ran the cpu test 5 times, capping the run-time at 300 s. The benchmark appears to perform prime factorization, measuring something called "events" and recording the run time per event.
    # time sysbench --max-time=300 --test=cpu run
  soft                          softfp                        hard
  events   duration/event [ms]  events   duration/event [ms]  events   duration/event [ms]
  1157.2   259.29               2951.6   101.638              2951     101.662


As a final test, on each image I ran the cpu test just once, without a time limit, to see how long it would take:
    # time sysbench --test=cpu run
  soft                     softfp                   hard
  events   test duration   events   test duration   events   test duration
  10000    43m0.50s        10000    16m56.499s      10000    16m56.777s
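Those wall-clock times follow from the per-event durations in the capped runs: 10,000 events multiplied by the measured duration per event predicts the totals closely for the hardware builds (soft is a little off, since its per-event average came from the 300 s-capped runs):

```python
# Predict full-run wall-clock time from the per-event durations (ms)
# measured in the time-capped runs above.
events = 10_000
for label, ms in (("soft", 259.29), ("softfp", 101.638), ("hard", 101.662)):
    total = events * ms / 1000.0  # seconds
    print(f"{label}: {int(total // 60)}m{total % 60:.1f}s")
```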

Linpack

linpack is a benchmark testing a computer's ability to perform numerical linear algebra. The program takes one required parameter: the size of the array to use. If you pass "200", it will calculate a 200x200 array. As it runs, it decides how many repetitions to perform based on its performance. For each repetition it records how much time it took. When it has finished a set of repetitions, it calculates a KFLOPS count, then starts over with a different repetition count.

For each image I ran the program once with "200" and once with "500". With no hardware floating-point support, on a 200x200 array it starts with 1 repetition, then tries 2, then 4, 8, etc. With hardware floating-point, on a 200x200 array it starts with 8 repetitions, then 16, 32, etc. On a 200x200 array the repetition counts common to all images are 8, 16, and 32; on a 500x500 array they are 1 and 2.

The program never terminates; it keeps increasing the repetition count until explicitly killed.
    # echo 200 | linpack
  soft                          softfp                        hard
  reps   time/rep   KFLOPS      reps   time/rep   KFLOPS      reps   time/rep   KFLOPS
  8      4.3        2718.669    8      0.64       18553.356   8      0.62       19223.389
  16     8.6        2718.614    16     1.29       18552.917   16     1.25       19214.278
  32     17.2       2718.792    32     2.58       18552.361   32     2.49       19212.128
    # echo 500 | linpack
  soft                          softfp                        hard
  reps   time/rep   KFLOPS      reps   time/rep   KFLOPS      reps   time/rep   KFLOPS
  1      8.1        2674.928    1      1.38       15876.865   1      1.38       15883.324
  2      16.17      2674.871    2      2.74       15878.365   2      2.74       15882.516

nbench

nbench (aka BYTEmark) runs a bunch of sub-tests (including: numerical sort, string sort, bitfield, fp emulation, fourier, assignment, IDEA, huffman, neural net, and LU decomposition) then generates both an integer index and a floating-point index. These indices are relative to what were considered capable machines of the time (mid-1990s).

This benchmark was run twice on each image; the averaged results are:

  soft                    softfp                  hard
  integer idx   fp idx    integer idx   fp idx    integer idx   fp idx
  1.054         0.1       1.1095        0.961     1.109         0.979

Conclusions

Since software floating-point emulation is linked statically into C programs, using hardware floating-point makes binaries smaller (in programs that perform floating-point calculations). Enabling it also improves floating-point performance dramatically. Interestingly, integer performance appears to be ever so slightly lower in the hard case relative to softfp. So if your workload is entirely floating-point, go with hard; if it mixes floating-point with considerable integer calculation, softfp might be best.

As always, test your own application to know which mode is best in your scenario.