DoodleTs: arm926ej-s

Showing posts with label arm926ej-s. Show all posts

28 Apr 2020

The LPC32xx Project - introduction

For the last while I've been working on an extremely exciting project at work; certainly one of the best jobs to come my way. I've been lucky in my career to have worked on some really exciting things, and this project is certainly one of them.

We have a customer who has a range of hardware from some very old stuff to some very new stuff. They want the same versions of bootloader, kernel, and userspace running on all of them. Additionally they want, for each device, A/B updates, each of which is to run in its own container, and all built with OpenEmbedded/Yocto.

I'm not at all worried about finding support for the newish stuff. It's actually the oldest hardware that needs the most attention. For example, look at the mailing lists for the Linux kernel, U-Boot, or anything graphics-related and you'll find hundreds of patches from many developers every day all working on hardware so cutting edge, some of it isn't even available for purchase yet. But many fewer people are making sure the old stuff is still working. At best a developer will make sure their changes won't cause support for an older device to stop compiling, but sometimes that's not enough.

Earlier in my career I had an opportunity to do board bring-up. The company I was working for had just created their own custom board. It was based around a variant of the AMCC PowerPC 440 SoC. Our board, thankfully, was very closely modelled on the reference board for the specific SoC we were using, but with two key differences:

we were using a brand of SDRAM that differed from the one on the reference board (and therefore the timing parameters needed tweaking), and
whereas the reference board was running at a middle-of-the-road clocking, we wanted to run our board at the highest clock rate possible

Although upstream U-Boot already had support for the reference board, it was my job to update it to work on our board by completing these two tasks. Before starting this job, I hadn't even heard the phrase "board bring-up", but I was hooked! If I could have, I would have plotted a career from that time on that would have included a lot more board bring-up activities! But you take what you can get, and there are some other exciting things to do other than board bring-up.

What I enjoy most about board bring-up is it provides an opportunity to get deep down into the details of how an SoC works. We're all aware of SDRAM, and flash, and DMA, and all of the dozens of other pieces that fit together to make an SoC. But it's not every day one gets the opportunity to examine these things at the register level. Getting these two tasks working was very exciting for me; definitely a highlight.

Here I am, years later, and I find myself doing board bring-up again. However this time it's not with cutting-edge hardware, but rather with really old hardware: the NXP LPC32xx SoC. I realize it's strange for me to claim to be doing "board bring-up" for an SoC that clearly already has support in U-Boot and the Linux kernel. However the fact is the support for this device was added decades ago and (especially in the case of U-Boot) has bitrotted quite badly in the interim.

U-Boot is a very current and exciting project! It's not unusual for the daily patch count to run well into the hundreds. There are always new boards and SoCs in need of support, plus there are ongoing projects to improve the underlying structure of the code, its build system, and its test/ci infrastructure. With a code base moving this fast, an older device can quickly fall out of step with the rest of the project.

I've been busily working on this project for a while and enjoying every minute. I'll be blogging more about it!

21 Oct 2019

OE Floating-Point Options for ARMv5 (ARM926EJ-S)

One of the (many) things I enjoy about OpenEmbedded is how easy it is to try out different configurations. Want to switch from sysvinit to systemd? Change the config, re-build, and there's your new image to test. Want to switch from busybox to coreutils? Change the config, re-build, and there's your new image.

Recently, I have been working with an ARMv5 device that was released in 2008: the NXP LPC3240 which is based on the ARM926EJ-S SoC. The specific device I'm using has a VFPv2 unit, however since the VFP was optional on the ARM926EJ-S, most distros/images are built with no floating-point support. From the standpoint of binary distributions, this makes the most sense: if you want to supply a binary to run on the most number of devices, build for the lowest common denominator. But when building your own distro/images from source using OpenEmbedded, you have the flexibility to tweak the parameters of your build to suit the specifics of your hardware.

Nowadays, a user has 3 choices when it comes to VFP on the ARM926EJ-S:

soft: floating-point emulated in software (no hardware floating-point)
softfp: enable hardware floating-point but have floating-point parameters passed in integer registers (i.e. use the soft calling conventions)
hard: enable floating-point and have floating-point parameters passed in floating-point registers (i.e. use FPU-specific calling conventions)

The naming of option 2 (softfp) is unfortunate. To me, saying "soft floating-point" implies the floating-point is being emulated in software. However, its name was meant to contrast its calling convention with that of hard floating-point, not to imply the floating-point is being emulated in software.

By default in OpenEmbedded, specifying tune-arm926ejs.inc sets the DEFAULTTUNE to "armv5te" which disables VFP. By tweaking DEFAULTTUNE in your machine.conf file (or local.conf) you can try out all the options. Personally, when setting DEFAULTTUNE, I also like to tweak TUNE_CCARGS.

To try out the different options, set the following parameters:

soft:
DEFAULTTUNE = "armv5te"
TUNE_CCARGS = "-mcpu=arm926ej-s -marm"
softfp:
DEFAULTTUNE = "armv5te-vfp"
TUNE_CCARGS = "-mcpu=arm926ej-s -mfpu=vfp -mfloat-abi=softfp -marm"
hard:
DEFAULTTUNE = "armv5tehf-vfp"
TUNE_CCARGS = "-mcpu=arm926ej-s -mfpu=vfp -mfloat-abi=hard -marm"

The meta-openembedded/meta-oe layer provides a number of recipes for benchmark applications. Interesting performance benchmark programs include: whetstone, dhrystone, linpack, nbench, and the "cpu" test of sysbench.

STD BENCHMARK DISCLAIMER: when it comes to benchmarks it's always important to remember that they are synthetic. That is: they are programs created to measure the performance of some artificial work-load of their choosing. If you want to know how the performance of your program will change under different settings, the only real way to determine that is to build and test your specific program under the different settings. It's also worth pointing out that during the era when benchmark programs were a really hot topic (late 90's-ish?) many vendors would tailor their hardware towards the popular benchmark programs of the time, skewing the results dramatically. In other words, a specific piece of hardware would be tuned to run a specific benchmark really well, but "real" workloads wouldn't see much improvement. Therefore YMMV.

For this experiment I created three images; each one built using one of the three floating-point tunings given above but all containing the same contents and the same versions of all the contents. I then loaded each of the images on my hardware in turn, so I could run the benchmark programs to generate performance data.

As of the time these images were built (Oct 11, 2019), the HEAD revision of openembedded-core was 59938780e7e776d87146002ea939b185f8704408 and the head revision of meta-openembedded/meta-oe was fd1a0c9210b162ccb147e933984c755d32899efc. At that time, the compiler being used was gcc-9.2, and the versions of various components are: glibc:2.30, bash:5.0, dhrystone:2.1, linpack:1.0, nbench:2.2.3, sysbench:0.4.12, and whetstone:1.2.

First Impressions

One of the first interesting things to note is the size of the various binaries:

	soft	softfp	hard
whetstone	33,172	20,236	20,444
dhrystone	13,752	9,660	9,660
sysbench	81,268	77,176	77,176
linpack	13,744	9,652	9,652
nbench	47,308	43,216	43,216

Looking at the disassembly of each of these binaries, it's not hard to see why this is. Disassembling the binaries is as simple as:

$ arm-oe-linux-gnueabi-objdump -d whetstone

While the softfp and hard programs are filled with VFP instructions (e.g. vldr, vmul.f64, vsub.f64, etc.) the soft program contains calls to various __aeabi_* functions and __adddf3. These functions come from libgcc, a library written by the gcc people to help shore up things that are missing from standard C libraries (such as software emulation of floating-point, see here for more info). Interestingly, the code of these functions is linked into the executable itself (and not as a shared library). As you can imagine, emulating floating-point operations in software is going to take a lot of code!

If you have floating-point hardware, taking advantage of it will shrink the size of your executables (if they use floating-point math).

Whetstone

whetstone is a benchmark program whose primary purpose is to measure floating-point performance. In each image I ran the whetstone program 5 times, timing each run with time, and had it run 1,000,000 loops:

# time whetstone 1000000

The averages of each test are as follows. Higher MIPS is better, lower time is better:

soft		softfp		hard
MIPS	duration [s]	MIPS	duration [s]	MIPS	duration [s]
100.16	998.4	1872.84	53.4	1872.84	53.4

Dhrystone

dhrystone is a benchmark used to evaluate integer performance. In each image I ran the whetstone program 5 times, timing each run, and performing 1,000,000 iterations per run:

# time echo 1000000 | dhry

The averages are as follows. Higher dhry/sec is better, lower time is better:

soft		softfp		hard
dhry/sec	duration [ms]	dhry/sec	duration [ms]	dhry/sec	duration [ms]
432527.22	2.3	431037.7	2.3	429554.58	2.3

Sysbench (cpu)

sysbench is a benchmark which includes a bunch of sub-benchmarks, one of which is the "cpu" test. On each image I ran the cpu test 5 times, capping the run-time to 300[s]. The benchmark appears to perform prime factorization, measuring something called "events", and recording run time per event.

# time sysbench --max-time=300 --test=cpu run

soft		softfp		hard
events	duration/event [ms]	events	duration/event [ms]	events	duration/event [ms]
1157.2	259.29	2951.6	101.638	2951	101.662

As a final test, on each image I ran the cpu test just once without a time limitation, to see how much time it would otherwise take.

# time sysbench --test=cpu run

soft		softfp		hard
events	test duration	events	test duration	events	test duration
10000	43m0.50s	10000	16m56.499s	10000	16m56.777

Linpack

linpack is a benchmark testing a computer's ability to perform numerical linear algebra. The program takes one required parameter: the size of the array to use. If you pass "200", it will calculate a 200x200 array. As it runs, it determines how many repetitions to perform, it bases the repetitions on its performance. For each repetition it records how much time it took. When it's done a set of repetitions, it calculates a KFLOPS count, then starts over with a different repetition count.

For each image I ran the program once with "200" and once with "500". With no hardware floating point support calculating a 200x200 array it starts with 1 repetition, then tries 2, then 4, 8, etc. With hardware floating-point on a 200x200 array it starts with 8 repetitions, then 16, 32, etc. On a 200x200 array the repetition counts common to all images are 8, 16, and 32. On a 500x500 array the repetition counts common to all images are 1 and 2.

The program never terminates; it keeps increasing the repetition count and going until explicitly killed.

# echo 200 | linpack

soft			softfp			hard
reps	time/rep	KFLOPS	reps	time/rep	KFLOPS	reps	time/rep	KFLOPS
8	4.3	2718.669	8	0.64	18553.356	8	0.62	19223.389
16	8.6	2718.614	16	1.29	18552.917	16	1.25	19214.278
32	17.2	2718.792	32	2.58	18552.361	32	2.49	19212.128

# echo 500 | linpack

soft			softfp			hard
reps	time/rep	KFLOPS	reps	time/rep	KFLOPS	reps	time/rep	KFLOPS
1	8.1	2674.928	1	1.38	15876.865	1	1.38	15883.324
2	16.17	2674.871	2	2.74	15878.365	2	2.74	15882.516

nbench

nbench (aka BYTEmark) runs a bunch of sub-tests (including: numerical sort, string sort, bitfield, fp emulation, fourier, assignment, IDEA, huffman, neural net, and LU decomposition) then generates both an integer index and a floating-point index. These indices are relative to what were considered capable machines of the time (mid-1990's).

This benchmark was run twice on each image, the averaged results are:

soft		softfp		hard
integer idx	fp idx	integer idx	fp idx	integer idx	fp idx
1.054	0.1	1.1095	0.961	1.109	0.979

Conclusions

Since software floating-point emulation gets added statically to C programs, using hardware floating point makes binaries smaller in programs that perform floating-point calculations. Enabling floating-point in such programs also improves the performance of floating-point operations noticeably. Interestingly, it appears as though integer performance is ever so slightly impacted in the hard case relative to softfp. Therefore it would seem to be that if your entire work-load is floating-point, then go with hard, otherwise if there is both floating-point and considerable integer calculations, softfp might be best.

As always, test your own application to know which mode is best in your scenario.

16 Sept 2019

Board Bring-Up: An Introduction to gdb, JTAG, and OpenOCD

Lately I've been working on getting a recent-ish U-Boot (2018.07) and Linux kernel (5.0.19) running on an SoC that was released back in 2008: the NXP LPC3240 which is based on the ARM926EJ-S processor (NOTE: the ARM926EJ-S was released in 2001).

Although I had U-Boot working well enough to load and boot Linux, the moment the Linux kernel started printing its boot progress, my console was filled with the sort of garbage that tells an embedded developer they've got the wrong baud rate. Double-checking, and even triple-checking, of the baud rate values, however, showed that every place where it was configured, it had been correctly set to 115200 8N1.

Having a working console is the basis from which the rest of a software developer's board bring-up activities take place! If you can compile a kernel, load it on the board, and get it to print anything, legibly, to the console, then you're already in a really good position. But if there's no working connection via the console, it means more low-level work is needed.

Going down the hierarchy (from easier to harder), if the console isn't working, then you'll need to see if JTAG is a possibility. If a JTAG isn't available, then you'll need to look for an LED to blink. Blinking an LED to debug one's work during board bring-up isn't uncommon, but it can be a lot more painful. With nothing but (perhaps) a single LED, it can be hard (though strictly not impossible) to communicate something as simple as: "the value at 0x4000 4064 is 0x0008 097e, and I've reached <this> point in the code". Thankfully for me, this particular board has a working JTAG, and there is support for this SoC in OpenOCD.

JTAG is a very large specification and has a lot of use-cases. For my purposes, JTAG consists of:

extra logic that is added to a chip which implements a very specific state machine
a bunch of extra pins (at least 4, but some designs add more) with which to interface to this internal state machine from outside the chip
a set of commands (in the state machine) that can be executed by toggling bits on the external pin interface

These commands let you do things such as push individual bits into, or get individual bits out of, a particular device's JTAG scan chain. Depending on how the scan chain is implemented, this could translate to activities such as the ability to read/write registers, and read/write arbitrary values in the memory space (which includes things like peripherals, configuration, etc).

Most development hosts don't have random GPIO lines available for interfacing, therefore a dongle of some sort is needed to go between the desktop machine and the target board's multi-wire JTAG interface. In days past, these dongles would be connected to the development host via serial or parallel interfaces; nowadays they're mostly USB.

Armed with a JTAG dongle, in theory it would be possible to start interacting with the target board directly via JTAG commands. However, this could be very tedious as the JTAG commands are very primitive (i.e. having to follow the state machine precisely, and work 1 bit at a time). One of the more common arrangements is to use gdb, which permits the user to perform higher-level actions (i.e. set a breakpoint, read a given 32-bit memory address, list the register contents, etc) and let the software deal with the details. Note, however, gdb itself does not know how to "speak" JTAG nor does it know how to interact with a JTAG dongle. gdb does, however, speak its own command language called the remote gdbserver protocol. It is OpenOCD which acts as the interpreter between the remote gdbserver protocol on the one hand (e.g. over a network port), and JTAG commands for the target on the other (e.g. over USB to the dongle) marshalling all the data back and forth between the two.

With the target board powered off, plug the JTAG dongle's pins into the board's JTAG connector; connect the development host to the JTAG dongle via USB.

Power on the target board.

Run OpenOCD on the development host. In my specific case the command I invoke is:

$ openocd -f interface/ftdi/olimex-arm-usb-ocd-h.cfg -f board/phytec_lpc3250.cfg

It is important to note that openocd runs as a daemon, and as such, once invoked, does not terminate until explicitly killed. In particular, this command is run in its own terminal, and simply left running until my debugging session is done. All other work that will be done, needs to be performed in other terminals. Perhaps you're thinking: "I'll just run it in the background using an ampersand". That would work, however: as it runs and interacts with gdb and the board, openocd will print out useful information to the terminal. Therefore giving it its own terminal and letting it run independently while keeping it visible is often quite useful. It's always someplace visible on my desktop while debugging.

OpenOCD needs to know what dongle I'm using (it supports a number of JTAG dongles) and it needs to know the board or SoC to which it is connecting (it has support for many SoCs and boards). Implicit in the choice of dongle is the communication protocol (here USB) and dongle characteristics (properties, product ID, etc). By specifying a target board or SoC, you're letting OpenOCD know things such as how to initialize its connection, the register size, what speed to use, details about how the device needs to be reset, and so on.

More recently, some development boards come with built-in debug circuitry, including a USB connector, already designed into the target board itself. In these cases the JTAG dongle isn't needed. One simply needs to connect the target board directly to the development host via a single USB cable, and start up OpenOCD (and gdb) giving only one piece of information: the board's name. All other details are implied.

Running on a GNU/Linux system, gdb works best with ELF executables. gdb can be coerced into working with raw binaries, but when presented with an ELF file, it is provided with a lot more of the data it needs to do its job. But neither the Linux kernel nor U-Boot are ELF binaries. As part of their default build processes, however, both the Linux kernel and U-Boot build systems generate ELF output in addition to the parts that are actually run. A U-Boot build will produce, for example, u-boot.bin, which is the actual U-Boot binary that is stored wherever the bootloader needs to be placed. But in addition to this, a file called u-boot is produced which is its ELF counterpart. Similarly for the Linux kernel, the kernel itself might be found in arch/arm/Image, but its ELF counterpart is vmlinux.

If you want to debug a Linux kernel via JTAG using gdb, simply invoke:

$ arm-oe-linux-gnueabi-gdb vmlinux

Since the target is an ARM board and my host is an x86 board, I need to invoke the cross-gdb program, not the native one (otherwise it won't be able to make sense of the binary instructions). Since I do so much of my work using OpenEmbedded as a basis, when working independently on U-Boot and the kernel, I simply have OpenEmbedded generate an SDK targeting this particular board, and use it for all my work. When invoking this cross-debugger, I simply provide it with the path to, and the name of, the ELF file containing the kernel I have compiled.

By default openocd listens on port 6666 for tcl connections, port 4444 for telnet connections, and port 3333 for gdb connections. In order to create the link between gdb and openocd, once gdb is up and running you'll need to link them together by issuing the target remote or target extended-remote command:

(gdb) target extended-remote :3333

Of course if you've told openocd to listen to a different port, you'll need to make the necessary adjustments to the connection.

Congratulations! You're now debugging the Linux kernel on your target board via JTAG using gdb! No serial console required!

In my particular case, although I knew the Linux kernel was doing something, I wasn't sure what exactly was going on since the baud rate via my serial console was messed up. Using this setup I was able to dump the kernel's ring buffer, allowing me to see exactly what the kernel was doing and providing me with valuable debugging information of its boot:

(gdb) x /2000bs __log_buf