9 Jun 2020

NAND Flash Basics

Introduction

NAND flash is a type of non-volatile data storage technology, and is often used on embedded devices much the same way a hard-drive would be used on a desktop machine.

NAND flash is built on cells. The original flash devices stored one bit of information per cell, and were called single-level cell (SLC) technology. Later, two bits were stored per cell, so this became the (unfortunately named) multi-level cell (MLC) technology. When the technology came along to store three bits per cell, it was named triple-level cell (TLC) technology. As you can see: "single" means one, "multiple" means two, and "triple" means three. According to Wikipedia there now exist (or are in development) quad-level cell (QLC) devices with 4 bits per cell and penta-level cell (PLC) devices with 5.

Unlike traditional hard-drives (i.e. built on spinning platters), the individual cells used to store bits in NAND flash cannot be twiddled indefinitely. New or newly-erased flash shows up as all bits set to 1. The act of writing data to a freshly erased NAND device is simply the process of changing the necessary bits from 1s to 0s. The moment the data you want to write needs to flip a 0 back to a 1, an erase cycle is needed. If, coincidentally, every write you ever performed only required flipping bits from 1s to 0s, you would never need to perform an erase cycle. Flipping bits from 1s to 0s comes "for free", but erasing bits from 0s back to 1s comes at a cost: each time a bit is erased (switched from a 0 to a 1) the oxide layer of the cell is degraded. Over time it will stop being possible to erase a cell.
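
This write-vs-erase asymmetry is easy to model. Here's a toy sketch (my own illustration, not any real driver API) of the rule a NAND byte obeys:

    /* Toy model of NAND program semantics: programming can only clear
     * bits (1 -> 0); any write that needs a 0 -> 1 transition requires
     * erasing the whole block first. */
    #include <stdbool.h>
    #include <stdint.h>

    /* Try to program 'data' over a byte whose current contents are
     * '*stored'; returns false if the write would require an erase. */
    static bool nand_program_byte(uint8_t *stored, uint8_t data)
    {
        if ((data & ~*stored) != 0)
            return false;   /* needs a 0 -> 1 flip: erase required */
        *stored &= data;    /* clearing bits is "free" */
        return true;
    }

Starting from an erased byte (0xff) you can program any value; programming that same byte a second time only succeeds if the new value does nothing but clear additional bits.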
 
Every device comes with a claim of being able to support a given minimum number of program/erase (P/E) cycles, but the maximum number is never known. In general SLC NAND supports a higher number of P/E cycles than MLC, and MLC supports more P/E cycles than TLC. The drop-off in P/E endurance between the different technologies is quite significant: roughly an order of magnitude at each step. You'll have to consult the datasheet for the actual endurance of any specific device, but in general SLC devices are good for ~50,000 to 100,000 P/E cycles, MLC ~1,000 to 10,000, and TLC under ~1,000 P/E cycles.

Internally, cells are combined to form bytes, bytes are combined to form pages, pages are gathered into blocks, and blocks are combined to form planes.

From outside a NAND device, data can only be accessed one page at a time. If you want to read one byte of data, the entire page on which that byte is stored must be retrieved, then the specific byte of interest can be accessed. The same is true for writing: you can only write data to a NAND device one page at a time. More recent devices will often break a single page up into fixed, equal-sized sub-pages, which can help when fetching or writing data. Erasing, however, needs to be performed a whole block at a time. Fancier devices allow pages in separate planes to be read or written simultaneously. In general: pages are for reading/writing, blocks are for erasing, and operations can occur simultaneously on separate planes. You'll need to consult the datasheet for any specific device's behaviour.

The sizes of pages, blocks, and planes on a NAND device vary by device. A typical device might have 2048-byte pages, split a page into 4 sub-pages, combine 64 pages to form a block, and use roughly 1024 blocks for each plane.
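
To make the hierarchy concrete, here's a small sketch using the example geometry above (real values come from the datasheet, and real devices often interleave planes on low-order block bits rather than splitting the device into contiguous halves):

    /* Map a linear byte offset to its place in the example geometry. */
    #include <stdint.h>

    #define PAGE_SIZE        2048u   /* bytes per page   */
    #define PAGES_PER_BLOCK  64u     /* pages per block  */
    #define BLOCKS_PER_PLANE 1024u   /* blocks per plane */

    struct nand_addr {
        uint32_t plane;
        uint32_t block;    /* block within its plane */
        uint32_t page;     /* page within its block  */
        uint32_t column;   /* byte within its page   */
    };

    static struct nand_addr nand_offset_to_addr(uint64_t offset)
    {
        struct nand_addr a;

        a.column = offset % PAGE_SIZE;
        a.page   = (offset / PAGE_SIZE) % PAGES_PER_BLOCK;
        a.block  = (offset / (PAGE_SIZE * PAGES_PER_BLOCK)) % BLOCKS_PER_PLANE;
        a.plane  = (uint32_t)(offset /
                   ((uint64_t)PAGE_SIZE * PAGES_PER_BLOCK * BLOCKS_PER_PLANE));
        return a;
    }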

Although storing more bits per cell improves storage density, there are trade-offs, such as slower access times and reduced life expectancy of the device. So while TLC is newer than MLC and has some advantages over it, TLC doesn't displace MLC; MLC is still available. The same holds true for SLC: even though it is "older" technology, it is faster and offers an order of magnitude (or better) more P/E cycles than the "newer" technologies. Therefore SLC is still used quite extensively when the circumstances call for improved reliability or speed over size or cost. As you would expect, SLC flash is the most expensive, with the price dropping with each increase in bits per cell.

ECC and OOB

In order to combat the strange situation of a device that will (by design) fail over time, in practice error-correcting codes (ECCs) are calculated for the data stored on the device. In fact, if you read the fine print in the datasheet regarding a device's P/E endurance you'll often find that these endurance claims assume ECC is being used. But if you're going to calculate an ECC for a chunk of data, where will you store this value? There's not much point in generating an ECC for a chunk of data if it isn't stored with the data so it can be checked later on. As such, each NAND flash device comes with extra storage space in which to store these ECC calculations (or any other book-keeping data you'd like to track). So if you buy (for example) a 512MB NAND flash device, you might be given a device that has 512MB+8MB of storage. This extra area is referred to as the out-of-band or OOB area.

Typically ECCs are calculated per page of NAND data. When you ask a device for a page of data (the smallest amount of data you can request from a NAND device), in addition to that page's worth of data you will also be given the data from that page's OOB area. The same is true for writing: for each page you give the device, you also have to provide the data for that page's OOB area.
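
Under Linux this page+OOB pairing is visible even from userspace, via the MTD character devices. Here's a minimal sketch, assuming a device with 2048-byte pages and a 64-byte OOB area showing up as /dev/mtd0 (adjust sizes and paths for your hardware):

    /* Read one page and its OOB area through Linux's MTD layer. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <mtd/mtd-user.h>

    int main(void)
    {
        unsigned char page[2048], oob[64];
        struct mtd_oob_buf req = { .start = 0, .length = sizeof(oob), .ptr = oob };
        int fd = open("/dev/mtd0", O_RDONLY);

        if (fd < 0)
            return 1;
        /* the main data is read at the page's offset... */
        if (pread(fd, page, sizeof(page), 0) != (ssize_t)sizeof(page))
            return 1;
        /* ...while the OOB data for the same page has its own ioctl */
        if (ioctl(fd, MEMREADOOB, &req) < 0)
            return 1;
        printf("data[0]=0x%02x oob[0]=0x%02x\n", page[0], oob[0]);
        close(fd);
        return 0;
    }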

By the way, the answer is "yes it is" (if you're wondering whether or not the OOB area is susceptible to the same degradation as the rest of the data on a NAND device). One could add OOB areas to OOB areas ad infinitum, but one level is considered sufficient. A failure in either the main data area or its OOB area is a failure for that page+OOB combination.

Raw vs FTL

Getting the most out of your NAND device requires not just ECC checks, but the ability to mark blocks as bad, support for caching, tolerance for sudden power loss, wear-levelling, and other techniques that help it perform optimally and correctly. These algorithms are so necessary that many devices that use NAND flash internally will put a microcontroller and code between the user and the memory. These are called FTL or managed devices.

FTL stands for flash translation layer, and it forms a level of indirection between what the user requests and what actually happens to the memory. For example, the user might continuously read/modify/update one specific page over and over. Knowing that hammering on only one page would cause that page to wear out much faster than the rest of the device, the FTL maps the user's repeated requests to modify one logical page onto a different physical page with every request. This lets the user think they are reading and writing the same page over and over while in fact those requests are being spread over the entire device so as to not wear out any one part of the chip prematurely. This is called wear-levelling.
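
At its core this is just a logical-to-physical map. The following toy sketch (my own simplification; real FTLs add ECC, caching, bad-block handling, garbage collection, and power-loss safety) shows the idea:

    /* Toy wear-levelling: every write to the same logical page lands
     * on a fresh physical page, and a map tracks the current copy. */
    #include <stdint.h>

    #define NUM_PAGES 1024u

    static uint32_t l2p[NUM_PAGES];   /* logical -> physical map     */
    static uint32_t next_free;        /* naive free-page "allocator" */

    static void ftl_write(uint32_t logical, const void *data)
    {
        uint32_t physical = next_free++ % NUM_PAGES;

        /* nand_program_page(physical, data);  <- actual device write */
        (void)data;
        l2p[logical] = physical;   /* remember where the data now lives */
        /* the previous physical page is now stale; a garbage collector
         * would eventually erase its block and reclaim it */
    }

Even though the user "writes logical page 7" a thousand times, the thousand physical writes get spread across the whole device.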

Managed devices will often present a different interface to the outside world than what one would expect from raw NAND. Therefore when working with a device that uses NAND technology, it's important to know whether you're dealing with a raw NAND chip, or one that is managed.

Most managed devices present themselves as hard-drives, so using them under Linux is simply a matter of having Linux treat the device the same as it would any other hard-drive: you partition it, format it with your favourite filesystem, mount it, then use it like normal.

If you're using a raw NAND device, then your best bet is to make use of raw NAND-handling software that is already available in order to deal with NAND's quirks. Under Linux, the MTD subsystem provides a uniform, though raw, interface to NAND. On top of the MTD subsystem you could use JFFS2, but it has been mostly superseded by UBI and UBIFS.

Unlike JFFS2, UBIFS can't sit directly on top of the MTD subsystem; it needs to sit inside a UBI container. It is UBI that has all the fancy logic and algorithms for managing the physical memory: providing fast boot times, handling power interruptions, wear-levelling, and so forth. UBIFS is a NAND-aware filesystem designed to sit on top of UBI. A single UBI container can hold one or more volumes, each of which can contain its own UBIFS filesystem.
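
As a concrete sketch of the layering (assuming your raw NAND shows up as mtd0, and using a volume name of my own choosing), attaching UBI and mounting a UBIFS filesystem with the mtd-utils tools looks roughly like:

    $ ubiattach /dev/ubi_ctrl -m 0
    $ ubimkvol /dev/ubi0 -N rootfs -s 200MiB
    $ mount -t ubifs ubi0:rootfs /mnt

The ubiattach step puts the UBI container on top of the raw MTD device, ubimkvol carves a volume out of the container, and the mount puts a UBIFS filesystem on that volume.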

External Interface and ONFI

Interacting with NAND has been simplified and standardized thanks to the efforts of ONFI, the Open NAND Flash Interface. All NAND devices have either an 8- or 16-bit parallel I/O bus, plus a number of standardized control lines. Telling the NAND device what you want to do is simply a matter of some combination of giving it a command, providing it with an "address", then reading or writing the data. The I/O bus is multiplexed between commands, "addresses", and data. The NAND device distinguishes between these pieces of information based on the cycle and on the values of the logic levels on the control lines.

I put "address" in quotes because data in a NAND device is not addressed the same way it would be in, for example, RAM. As I've mentioned a couple of times, you can only interface with a NAND device one page at a time. Telling the NAND device which page you want is a matter of specifying its column, its page address within a block, its block number, and which plane it's on. The bits specifying this information are packed together and sent to the device as 3, 4, or 5 (depending on the device) 8-bit words (regardless of whether the device has an 8- or 16-bit I/O interface), one per I/O cycle, during the "address" phase.
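
As a sketch of how that packing might look for a device with 2048-byte pages and five address cycles (the exact bit layout varies; the datasheet always has a table for this):

    /* Pack a NAND "address" into 5 x 8-bit address cycles.  The column
     * (byte within the page) needs 12 bits for a 2048(+OOB)-byte page;
     * the row (page-within-block, block, and plane bits concatenated)
     * fills the remaining three cycles. */
    #include <stdint.h>

    static void nand_pack_address(uint16_t column, uint32_t row,
                                  uint8_t cyc[5])
    {
        cyc[0] = column & 0xff;          /* column, low byte  */
        cyc[1] = (column >> 8) & 0x0f;   /* column, high bits */
        cyc[2] = row & 0xff;             /* row, bits 7..0    */
        cyc[3] = (row >> 8) & 0xff;      /* row, bits 15..8   */
        cyc[4] = (row >> 16) & 0xff;     /* row, bits 23..16  */
    }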

ONFI also specifies the bit patterns of the various commands that can be issued to a NAND device. In this way Linux's MTD software (for example) doesn't need to be chip-specific with regards to the command definitions.

Timing Charts

Like most pieces of silicon, NAND devices don't operate infinitely quickly; they certainly don't operate at the speed of the bus connecting your SoC to the NAND device. As such, one of the most important pieces of information contained in your device's datasheet is the table specifying minimum or maximum timings of various operations.

This table is usually found in a section called "AC Characteristics" and includes timing information for around three dozen parameters. For example, the Address Latch Enable setup time is given as tALS. Sometimes the timing is specified as a minimum amount of time one needs to wait for an event, other times as a maximum. Each parameter has an associated unit, usually nanoseconds, but sometimes microseconds.

Some parts of a NAND's datasheet aren't as important as others from a software point of view. But when working with a NAND chip at a low level, the timing information is certainly one of the more important sections.

SoCs and NAND Controllers

It would be pretty rare to see a microcontroller connected to a NAND device directly using nothing but GPIO lines. Part of the difficulty in controlling a NAND device directly would be getting the timing both correct and efficient. As such, most SoCs include a dedicated NAND Controller.

The job of the NAND Controller is to handle the interaction between the SoC and the NAND, freeing the SoC from the lowest-level details of handling the NAND; it acts as a sort of buffer between the two. The SoC creates a request by loading the controller's registers with the correct values, and it's the controller's job to twiddle the various control and I/O lines in the correct sequence, at the right times.

An SoC's NAND Controller will often incorporate logic for handling ECC calculations and manipulating portions of the OOB areas as appropriate. For example, I mentioned earlier that when providing data to the NAND, one must also supply the OOB area. In some cases the software only needs to provide the data, and the controller will calculate the ECC and supply the OOB data to the NAND device itself. The reverse also applies when reading: the controller can be instructed to check the ECC, and if an error is detected it will either correct the data itself (if it can), or set flags to let the user know an issue was found (or both). In any case, the user simply receives a page of corrected data.

The NAND Controller can only do its job properly if it is configured properly. On any SoC with a NAND Controller, a portion of the controller's registers gives the user a place to specify the configuration and timing parameters of the specific NAND device being used.

Configuration usually involves telling the NAND Controller the bus width (8 or 16), the page size, whether or not sub-pages are used, how many bytes to use when specifying the "address" (3, 4, or 5), and various other things.

For timing, the datasheet for the NAND device will always specify timing in absolute, "wall clock" values (e.g. 25[ns]), whereas the NAND Controller only knows how to count clock ticks. Therefore not only do you need to know the clock rate of the bus to which the NAND device is connected (which is almost guaranteed to not be the same as the clock rate of the CPU itself), but these values will need to be adjusted any time the clock rate changes (e.g. in low-power or power-saving modes). The number of clock ticks the controller waits can only be specified as an integer. Knowing your clock rate, you'll need to figure out how many ticks are required to get at least that much delay, rounding up. For example, a given timing parameter might specify a minimum delay of 25[ns]; at a clock rate of 130[MHz] this translates to 3.25 clocks. But since the controller can't count a quarter of a clock, this value needs to be rounded up to 4. At this clock rate 4 clocks actually gives a delay of 30.8[ns], but we can't specify 3, otherwise the controller won't wait long enough for the NAND device, and errors will result.
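
The conversion itself is a one-liner, but the round-up matters. A minimal sketch:

    /* Convert a datasheet minimum delay in [ns] into controller clock
     * ticks, rounding up so we never wait less than the device needs. */
    #include <stdint.h>

    static uint32_t ns_to_ticks(uint32_t t_ns, uint32_t clk_hz)
    {
        /* ticks = ceil(t_ns * clk_hz / 1e9); 64-bit to avoid overflow */
        return (uint32_t)(((uint64_t)t_ns * clk_hz + 999999999u)
                          / 1000000000u);
    }

With the numbers from the example above, ns_to_ticks(25, 130000000) returns 4.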

Unfortunately it's rare to find a controller that has a 1:1 mapping between the timing parameters provided in the NAND device's datasheet and the timing parameters required by the NAND Controller. For reasons that can only be described as masochistic, the NAND Controller will almost always want timing values derived from calculations that, if you're lucky, are based on values found in the NAND device's datasheet. For example, a typical device's datasheet will (thankfully) provide a timing parameter called tRHZ. But instead of asking for this value directly, the NAND Controller might say: I need NAND_TA, and NAND_TA must satisfy:

((RD_HIGH - RD_LOW)/HCLK) + (NAND_TA/HCLK) ≥ tRHZ

RD_HIGH and RD_LOW are other timing parameters the controller wants, which you've already calculated in a manner similar to the above; you must re-arrange the inequality to isolate NAND_TA. Thankfully tRHZ is found in the datasheet; sometimes the controller will request a parameter that isn't in the datasheet, and you're left trying to figure out how to combine the parameters the datasheet does give you into the value the controller wants.
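
Rearranged, the inequality says NAND_TA/HCLK ≥ tRHZ - (RD_HIGH - RD_LOW)/HCLK; in other words, NAND_TA only needs to cover whatever part of tRHZ the RD_HIGH - RD_LOW interval doesn't already provide. A sketch of that calculation (the parameter names follow the example above, not any particular SoC; ns_to_ticks() is the helper from the previous sketch):

    /* Isolate NAND_TA (in ticks) from the inequality above.  RD_HIGH
     * and RD_LOW are already in ticks; tRHZ comes from the datasheet
     * in [ns]. */
    static uint32_t calc_nand_ta(uint32_t trhz_ns, uint32_t rd_high,
                                 uint32_t rd_low, uint32_t hclk_hz)
    {
        uint32_t trhz_ticks = ns_to_ticks(trhz_ns, hclk_hz);
        uint32_t covered = rd_high - rd_low; /* delay already provided */

        return (trhz_ticks > covered) ? trhz_ticks - covered : 0;
    }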

Also, the above calculations depend on your ability to figure out the clock rate, which requires an understanding of the clocking and PLL mechanisms of your SoC; that isn't trivial either.

Conclusion

NAND flash is an interesting technology, with its own advantages and quirks. To get NAND working on a specific device requires an understanding of the details of the specific NAND device you're using, as well as understanding the capabilities and limitations of your SoC.

28 Apr 2020

The LPC32xx Project - introduction

For the last while I've been working on an extremely exciting project at work; certainly one of the best jobs to come my way. I've been lucky in my career to have worked on some really exciting things, and this project is certainly one of them.

We have a customer who has a range of hardware from some very old stuff to some very new stuff. They want the same versions of bootloader, kernel, and userspace running on all of them. Additionally they want, for each device, A/B updates, each of which is to run in its own container, and all built with OpenEmbedded/Yocto.

I'm not at all worried about finding support for the newish stuff. It's actually the oldest hardware that needs the most attention. For example, look at the mailing lists for the Linux kernel, U-Boot, or anything graphics-related and you'll find hundreds of patches from many developers every day, all working on hardware so cutting edge some of it isn't even available for purchase yet. But far fewer people are making sure the old stuff is still working. At best a developer will make sure their changes don't cause support for an older device to stop compiling, but sometimes that's not enough.

Earlier in my career I had an opportunity to do board bring-up. The company I was working for had just created their own custom board. It was based around a variant of the AMCC PowerPC 440 SoC. Our board, thankfully, was very closely modelled on the reference board for the specific SoC we were using, but with two key differences:
  1. we were using a brand of SDRAM that differed from the one on the reference board (and therefore the timing parameters needed tweaking), and
  2. whereas the reference board was running at a middle-of-the-road clocking, we wanted to run our board at the highest clock rate possible
Although upstream U-Boot already had support for the reference board, it was my job to update it to work on our board by completing these two tasks. Before starting this job, I hadn't even heard the phrase "board bring-up", but I was hooked! If I could have, I would have plotted a career from that time on that would have included a lot more board bring-up activities! But you take what you can get, and there are some other exciting things to do other than board bring-up.

What I enjoy most about board bring-up is it provides an opportunity to get deep down into the details of how an SoC works. We're all aware of SDRAM, and flash, and DMA, and all of the dozens of other pieces that fit together to make an SoC. But it's not every day one gets the opportunity to examine these things at the register level. Getting these two tasks working was very exciting for me; definitely a highlight.

Here I am, years later, and I find myself doing board bring-up again. However this time it's not with cutting-edge hardware, but rather with really old hardware: the NXP LPC32xx SoC. I realize it's strange for me to claim to be doing "board bring-up" for an SoC that clearly already has support in U-Boot and the Linux kernel. However, the fact is the support for this device was added many years ago and (especially in the case of U-Boot) has bitrotted quite badly in the interim.

U-Boot is a very current and exciting project! It's not unusual for the daily patch count to run well into the hundreds. There are always new boards and SoCs in need of support, plus there are ongoing projects to improve the underlying structure of the code, its build system, and its test/CI infrastructure. With a code base moving this fast, an older device can quickly fall out of step with the rest of the project.

I've been busily working on this project for a while and enjoying every minute. I'll be blogging more about it!

21 Oct 2019

OE Floating-Point Options for ARMv5 (ARM926EJ-S)

One of the (many) things I enjoy about OpenEmbedded is how easy it is to try out different configurations. Want to switch from sysvinit to systemd? Change the config, re-build, and there's your new image to test. Want to switch from busybox to coreutils? Change the config, re-build, and there's your new image.

Recently, I have been working with an ARMv5 device that was released in 2008: the NXP LPC3240, an SoC built around the ARM926EJ-S core. The specific device I'm using has a VFPv2 unit; however, since the VFP was optional on the ARM926EJ-S, most distros/images are built with no hardware floating-point support. From the standpoint of binary distributions, this makes the most sense: if you want to supply a binary that runs on the greatest number of devices, build for the lowest common denominator. But when building your own distro/images from source using OpenEmbedded, you have the flexibility to tweak the parameters of your build to suit the specifics of your hardware.

Nowadays, a user has 3 choices when it comes to VFP on the ARM926EJ-S:
  1. soft: floating-point emulated in software (no hardware floating-point)
  2. softfp: enable hardware floating-point but have floating-point parameters passed in integer registers (i.e. use the soft calling conventions)
  3. hard: enable floating-point and have floating-point parameters passed in floating-point registers (i.e. use FPU-specific calling conventions)
The naming of option 2 (softfp) is unfortunate. To me, saying "soft floating-point" implies the floating-point is being emulated in software. However, its name was meant to contrast its calling convention with that of hard floating-point, not to imply the floating-point is being emulated in software.

By default in OpenEmbedded, including tune-arm926ejs.inc sets DEFAULTTUNE to "armv5te", which disables VFP. By tweaking DEFAULTTUNE in your machine.conf file (or local.conf) you can try out all the options. Personally, when setting DEFAULTTUNE, I also like to tweak TUNE_CCARGS.

To try out the different options, set the following parameters:
  1. soft:
    DEFAULTTUNE = "armv5te"
    TUNE_CCARGS = "-mcpu=arm926ej-s -marm"
  2. softfp:
    DEFAULTTUNE = "armv5te-vfp"
    TUNE_CCARGS = "-mcpu=arm926ej-s -mfpu=vfp -mfloat-abi=softfp -marm"
  3. hard:
    DEFAULTTUNE = "armv5tehf-vfp"
    TUNE_CCARGS = "-mcpu=arm926ej-s -mfpu=vfp -mfloat-abi=hard -marm"
The meta-openembedded/meta-oe layer provides a number of recipes for benchmark applications. Interesting performance benchmark programs include: whetstone, dhrystone, linpack, nbench, and the "cpu" test of sysbench.

STD BENCHMARK DISCLAIMER: when it comes to benchmarks it's always important to remember that they are synthetic. That is: they are programs created to measure the performance of some artificial work-load of their choosing. If you want to know how the performance of your program will change under different settings, the only real way to determine that is to build and test your specific program under the different settings. It's also worth pointing out that during the era when benchmark programs were a really hot topic (late 90's-ish?) many vendors would tailor their hardware towards the popular benchmark programs of the time, skewing the results dramatically. In other words, a specific piece of hardware would be tuned to run a specific benchmark really well, but "real" workloads wouldn't see much improvement. Therefore YMMV.

For this experiment I created three images; each one built using one of the three floating-point tunings given above but all containing the same contents and the same versions of all the contents. I then loaded each of the images on my hardware in turn, so I could run the benchmark programs to generate performance data.

As of the time these images were built (Oct 11, 2019), the HEAD revision of openembedded-core was 59938780e7e776d87146002ea939b185f8704408 and the HEAD revision of meta-openembedded/meta-oe was fd1a0c9210b162ccb147e933984c755d32899efc. At that time the compiler being used was gcc-9.2, and the versions of the various components were: glibc:2.30, bash:5.0, dhrystone:2.1, linpack:1.0, nbench:2.2.3, sysbench:0.4.12, and whetstone:1.2.

First Impressions

One of the first interesting things to note is the size (in bytes) of the various binaries:

                 soft    softfp      hard
    whetstone  33,172    20,236    20,444
    dhrystone  13,752     9,660     9,660
    sysbench   81,268    77,176    77,176
    linpack    13,744     9,652     9,652
    nbench     47,308    43,216    43,216

Looking at the disassembly of each of these binaries, it's not hard to see why this is. Disassembling the binaries is as simple as:

    $ arm-oe-linux-gnueabi-objdump -d whetstone

While the softfp and hard programs are filled with VFP instructions (e.g. vldr, vmul.f64, vsub.f64, etc.) the soft program contains calls to various __aeabi_* functions and __adddf3. These functions come from libgcc, a library written by the gcc people to help shore up things that are missing from standard C libraries (such as software emulation of floating-point). Interestingly, the code for these functions is linked into the executable itself (and not as a shared library). As you can imagine, emulating floating-point operations in software takes a lot of code!

If you have floating-point hardware, taking advantage of it will shrink the size of your executables (if they use floating-point math).

Whetstone

whetstone is a benchmark program whose primary purpose is to measure floating-point performance. In each image I ran the whetstone program 5 times, timing each run with time, and had it run 1,000,000 loops:

    # time whetstone 1000000

The averages of the runs are as follows. Higher MIPS is better, lower duration is better:

                 MIPS    duration [s]
    soft       100.16           998.4
    softfp    1872.84            53.4
    hard      1872.84            53.4

Dhrystone

dhrystone is a benchmark used to evaluate integer performance. In each image I ran the dhrystone program 5 times, timing each run, and performing 1,000,000 iterations per run:

    # time echo 1000000 | dhry

The averages are as follows. Higher dhry/sec is better, lower duration is better:

              dhry/sec    duration [s]
    soft     432527.22             2.3
    softfp   431037.70             2.3
    hard     429554.58             2.3

Sysbench (cpu)

sysbench is a benchmark which includes a bunch of sub-benchmarks, one of which is the "cpu" test. On each image I ran the cpu test 5 times, capping the run-time at 300[s]. The benchmark appears to perform prime factorization, measuring something called "events" and recording the run time per event:

    # time sysbench --max-time=300 --test=cpu run

              events    duration/event [ms]
    soft      1157.2                 259.29
    softfp    2951.6                101.638
    hard      2951                  101.662

As a final test, on each image I ran the cpu test just once without a time limitation, to see how much time it would otherwise take:

    # time sysbench --test=cpu run

              events    test duration
    soft       10000         43m0.50s
    softfp     10000       16m56.499s
    hard       10000       16m56.777s

Linpack

linpack is a benchmark testing a computer's ability to perform numerical linear algebra. The program takes one required parameter: the size of the array to use. If you pass "200", it will calculate a 200x200 array. As it runs, it determines how many repetitions to perform, basing the count on its measured performance. For each repetition it records how much time it took. When it's done a set of repetitions, it calculates a KFLOPS count, then starts over with a larger repetition count.

For each image I ran the program once with "200" and once with "500". With no hardware floating-point support, calculating a 200x200 array starts with 1 repetition, then tries 2, then 4, 8, etc. With hardware floating-point, a 200x200 array starts with 8 repetitions, then 16, 32, etc. On a 200x200 array the repetition counts common to all images are 8, 16, and 32. On a 500x500 array the repetition counts common to all images are 1 and 2.

The program never terminates; it keeps increasing the repetition count and going until explicitly killed.

    # echo 200 | linpack

            ------ soft ------    ----- softfp -----    ------ hard ------
    reps    time/rep    KFLOPS    time/rep     KFLOPS    time/rep    KFLOPS
      8         4.3   2718.669        0.64  18553.356       0.62  19223.389
     16         8.6   2718.614        1.29  18552.917       1.25  19214.278
     32        17.2   2718.792        2.58  18552.361       2.49  19212.128

    # echo 500 | linpack

            ------ soft ------    ----- softfp -----    ------ hard ------
    reps    time/rep    KFLOPS    time/rep     KFLOPS    time/rep    KFLOPS
      1         8.1   2674.928        1.38  15876.865       1.38  15883.324
      2        16.17  2674.871        2.74  15878.365       2.74  15882.516

nbench

nbench (aka BYTEmark) runs a bunch of sub-tests (including: numerical sort, string sort, bitfield, fp emulation, fourier, assignment, IDEA, huffman, neural net, and LU decomposition) then generates both an integer index and a floating-point index. These indices are relative to what were considered capable machines of the time (mid-1990s).

This benchmark was run twice on each image; the averaged results are:

              integer idx    fp idx
    soft          1.054       0.1
    softfp        1.1095      0.961
    hard          1.109       0.979

Conclusions

Since software floating-point emulation gets linked statically into C programs, using hardware floating-point makes binaries smaller in programs that perform floating-point calculations. Enabling hardware floating-point in such programs also improves the performance of floating-point operations noticeably. Interestingly, it appears as though integer performance is ever so slightly impacted in the hard case relative to softfp. Therefore it would seem that if your work-load is entirely floating-point, go with hard; if there is both floating-point and considerable integer calculation, softfp might be best.

As always, test your own application to know which mode is best in your scenario.

16 Sep 2019

Board Bring-Up: An Introduction to gdb, JTAG, and OpenOCD

Lately I've been working on getting a recent-ish U-Boot (2018.07) and Linux kernel (5.0.19) running on an SoC that was released back in 2008: the NXP LPC3240, which is based on the ARM926EJ-S processor (NOTE: the ARM926EJ-S was released in 2001).

Although I had U-Boot working well enough to load and boot Linux, the moment the Linux kernel started printing its boot progress, my console was filled with the sort of garbage that tells an embedded developer they've got the wrong baud rate. Double-checking, and even triple-checking, the baud rate values, however, showed that every place it was configured, it had been correctly set to 115200 8N1.

Having a working console is the basis from which the rest of a software developer's board bring-up activities take place! If you can compile a kernel, load it on the board, and get it to print anything, legibly, to the console, then you're already in a really good position. But if there's no working connection via the console, more low-level work is needed.

Going down the hierarchy (from easier to harder): if the console isn't working, you'll need to see if JTAG is a possibility. If JTAG isn't available, you'll need to look for an LED to blink. Blinking an LED to debug one's work during board bring-up isn't uncommon, but it can be a lot more painful. With nothing but (perhaps) a single LED, it can be hard (though not strictly impossible) to communicate something as simple as: "the value at 0x4000 4064 is 0x0008 097e, and I've reached <this> point in the code". Thankfully for me, this particular board has a working JTAG, and there is support for this SoC in OpenOCD.

JTAG is a very large specification and has a lot of use-cases. For my purposes, JTAG consists of:
• extra logic that is added to a chip which implements a very specific state machine
• a bunch of extra pins (at least 4, but some designs add more) with which to interface to this internal state machine from outside the chip
• a set of commands (in the state machine) that can be executed by toggling bits on the external pin interface
These commands let you do things such as push individual bits into, or get individual bits out of, a particular device's JTAG scan chain. Depending on how the scan chain is implemented, this could translate to activities such as the ability to read/write registers, and read/write arbitrary values in the memory space (which includes things like peripherals, configuration, etc).
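
To give a feel for how primitive that pin-level interface is, here's a toy sketch of clocking single bits through the four basic JTAG lines, using hypothetical gpio_set()/gpio_get() helpers (the pin names are real JTAG signals; the helpers and timing details are illustrative only):

    /* Clock one bit through a JTAG scan chain by hand.  Per the JTAG
     * spec the target samples TMS and TDI on TCK's rising edge, and
     * updates TDO on the falling edge. */
    #include <stdbool.h>

    enum jtag_pin { TCK, TMS, TDI, TDO };

    extern void gpio_set(enum jtag_pin pin, bool level);  /* assumed */
    extern bool gpio_get(enum jtag_pin pin);              /* assumed */

    static bool jtag_clock_bit(bool tms, bool tdi)
    {
        bool tdo = gpio_get(TDO);  /* valid since the last falling edge */

        gpio_set(TMS, tms);        /* walks the JTAG state machine      */
        gpio_set(TDI, tdi);        /* the bit being shifted in          */
        gpio_set(TCK, true);       /* rising edge: target samples TMS/TDI */
        gpio_set(TCK, false);      /* falling edge: target updates TDO  */
        return tdo;
    }

Reading or writing even a single 32-bit register this way takes dozens of calls like this, which is why nobody does it by hand.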

Most development hosts don't have random GPIO lines available for interfacing, therefore a dongle of some sort is needed to go between the desktop machine and the target board's multi-wire JTAG interface. In days past, these dongles were connected to the development host via serial or parallel interfaces; nowadays they're mostly USB.

Armed with a JTAG dongle, in theory it would be possible to start interacting with the target board directly via JTAG commands. However, this would be very tedious since the JTAG commands are so primitive (i.e. having to follow the state machine precisely, and working 1 bit at a time). One of the more common arrangements is to use gdb, which permits the user to perform higher-level actions (i.e. set a breakpoint, read a given 32-bit memory address, list the register contents, etc) and lets the software deal with the details. Note, however, gdb itself does not know how to "speak" JTAG, nor does it know how to interact with a JTAG dongle. gdb does, however, speak its own command language: the GDB remote serial protocol. It is OpenOCD which acts as the interpreter between the remote serial protocol on the one hand (e.g. over a network port), and JTAG commands for the target on the other (e.g. over USB to the dongle), marshalling all the data back and forth between the two.

With the target board powered off, plug the JTAG dongle's pins into the board's JTAG connector; connect the development host to the JTAG dongle via USB.

Power on the target board.

Run OpenOCD on the development host. In my specific case the command I invoke is:

    $ openocd -f interface/ftdi/olimex-arm-usb-ocd-h.cfg -f board/phytec_lpc3250.cfg

It is important to note that openocd runs as a daemon and, once invoked, does not terminate until explicitly killed. In particular, this command is run in its own terminal, and simply left running until my debugging session is done. All other work needs to be performed in other terminals. Perhaps you're thinking: "I'll just run it in the background using an ampersand". That would work; however, as it runs and interacts with gdb and the board, openocd prints useful information to the terminal. Therefore giving it its own terminal and letting it run independently, while keeping it visible, is often quite useful. It's always someplace visible on my desktop while debugging.

OpenOCD needs to know what dongle I'm using (it supports a number of JTAG dongles) and it needs to know the board or SoC to which it is connecting (it has support for many SoCs and boards). Implicit in the choice of dongle is the communication protocol (here USB) and the dongle's characteristics (properties, product ID, etc). By specifying a target board or SoC, you're letting OpenOCD know things such as how to initialize its connection, the register size, what speed to use, details about how the device needs to be reset, and so on.

More recently, some development boards come with built-in debug circuitry, including a USB connector, already designed into the target board itself. In these cases the JTAG dongle isn't needed. One simply connects the target board directly to the development host via a single USB cable, and starts up OpenOCD (and gdb) giving only one piece of information: the board's name. All other details are implied.

Running on a GNU/Linux system, gdb works best with ELF executables. gdb can be coerced into working with raw binaries, but when presented with an ELF file, it is provided with a lot more of the data it needs to do its job. Neither the Linux kernel nor U-Boot, as deployed, are ELF binaries. As part of their default build processes, however, both the Linux kernel and U-Boot build systems generate ELF output in addition to the parts that are actually run. A U-Boot build will produce, for example, u-boot.bin, which is the actual U-Boot binary that is stored wherever the bootloader needs to be placed; in addition, a file called u-boot is produced which is its ELF counterpart. Similarly for the Linux kernel: the kernel itself might be found in arch/arm/boot/Image, but its ELF counterpart is vmlinux.

If you want to debug a Linux kernel via JTAG using gdb, simply invoke:

    $ arm-oe-linux-gnueabi-gdb vmlinux

Since the target is an ARM board and my host is an x86 machine, I need to invoke the cross-gdb program, not the native one (otherwise it won't be able to make sense of the binary instructions). Since I do so much of my work using OpenEmbedded as a basis, when working independently on U-Boot and the kernel I simply have OpenEmbedded generate an SDK targeting this particular board, and use it for all my work. When invoking this cross-debugger, I simply provide it with the path to, and the name of, the ELF file containing the kernel I have compiled.

By default openocd listens on port 6666 for tcl connections, port 4444 for telnet connections, and port 3333 for gdb connections. In order to create the link between gdb and openocd, once gdb is up and running you'll need to link them together by issuing the target remote or target extended-remote command:

    (gdb) target extended-remote :3333

Of course, if you've told openocd to listen on a different port, you'll need to make the necessary adjustments to the connection.

Congratulations! You're now debugging the Linux kernel on your target board via JTAG using gdb! No serial console required!

In my particular case, although I knew the Linux kernel was doing something, I wasn't sure what exactly was going on since the baud rate of my serial console was messed up. Using this setup I was able to dump the kernel's ring buffer, allowing me to see exactly what the kernel was doing and providing me with valuable debugging information about its boot:

    (gdb) x /2000bs __log_buf


27 Jun 2019

Verizon Struggles to Understand How Email Works

Email has been around for longer than I've been alive! But apparently, 48 years on, it remains too complicated for even a telecommunications company such as Verizon to understand.

On June 1st I get the following email in my inbox:

    Your updated email address needs to be verified.

    To protect your privacy and ensure that we're sending important information to the right place, click below to verify your email address.

Turns out someone has just signed up for a Verizon account and given my email address instead of his own. No problem, I'll just not click on the link and everything should be fine, right?

Lol... NOT!

In the last 3 weeks I've received 11 emails from Verizon... letting me know my new phone is on its way (and verifying my account and address information), confirming my order, providing details of my next bill and plan, asking me to fill out a survey (let's just say they didn't get top marks in that one!), etc... and I never clicked the link!

It's a good thing they sent out that initial "address verification email". Wouldn't want all that personal information going to the wrong person, eh?

At the bottom of every email they've sent, there's always an "unsubscribe" link. Great, I'll just click on that... Oh wait, I can't unsubscribe by clicking the link. I have to sign in to my Verizon account before I can unsubscribe. Is that even legal? I thought unsubscribing was supposed to be a one-click thing in the USA?

So I figure maybe I'll get a bit creative: I'll ask for a password reset, the system will send me a link, I'll click the link, and I'll be able to change the email address? Nope. Can't do that either. "For security reasons you need to provide the secret PIN that was used when the account was created in order to reset the account". Oh that's nice, at least that part of their system works.

Oh, here's my solution: on their website, under support, is a messaging app that I can use to contact a customer service rep. I'll use that, chat with a rep, and have them remove my email address from this account. Nope. Can't do that either; the app asks that I log in to my account before I can chat with a customer service rep from the website.

Looking through the emails that I've received so far, I find the name, address, email, and phone of the Verizon customer service rep who signed this person up. Oh, this should be the ticket! I'll email her, let her know what's up, then she can use her insider magic to erase my email from this customer's account. Wow! I must have been on something when I thought that was going to fix anything. She outright refused to help. Her reply was "I will reach out to <customer> and ask them to correct the email address". Really?! That's your solution?! The person who didn't know what an email address was in the first place is who we're relying on now to fix this? The person who has no clue what his email address is (or, apparently, what an email address is to begin with) is the genius who's going to get us out of this mess? If he had a clue to begin with, we wouldn't be here, would we?! When I, politely, point this out to her, she then asks if I know <customer>'s email address so she can change it to that. WTF??! How do you expect me to know the correct email address of some random whack-job on the Internet? I'm so stupid, I should have just said "yes" and given her some other random email address (like, maybe, her own). Then this would be solved (from my point of view). And if she is capable of changing it (should I have given her some random email address), why can't she just delete mine without asking <genius> to do it? Why would she be able to solve the problem had I provided a reply to her ridiculous question, but can't fix the problem otherwise?

So tonight I decided to call Verizon customer support itself and get this sorted out. Spoiler alert: it's still not fixed. First off, the customer-support dial-in system is very adamant that I provide my Verizon phone number and PIN in order to let me do almost anything. In fact, one of the top-level menu items is "if you're not an existing customer" (so this gets me out of having to have a Verizon account) but then if you pick option #6 on the very next menu (for "other") it asks for your Verizon phone number and PIN!! So I have to call back again and pretend I'm not a current customer but that I want to become one. This finally lets me talk to a person (the "sales" lineup is never busy). I explain the issue. He's very nice and all, but insists that there is no way for him (or anyone else) to change the account information on an account without knowing the PIN of that account.

I understand the point. Verizon (like most companies) doesn't trust their own employees (especially the ones at the lower echelons) and therefore has a system in place such that customer service reps can't log themselves into random accounts and mess around with the data. That sounds all fine and good.

But in 2019, as sophisticated (or whatever word you wish to use) as Verizon's system is, there's no contingency for the scenario whereby a customer puts in the wrong email address, other than to wait and hope for the customer to fix the error themselves? Nobody anywhere who was part of designing Verizon's systems ever considered the possibility that random users might (accidentally perhaps?) put in the wrong email address, and therefore provided a mechanism to remove such an email address from their system? It just never occurred to them that this might happen?

Worse yet is the fact that the original "email verification link" is apparently pointless. Regardless of whether the link is clicked or not, if Verizon has an email to send to a customer, the address on file is used whether or not it has been verified.

It seems like a pretty basic oversight. If you're going to have a path whereby the system sends out verification emails to verify the email address a random person put into the system, there should be a little more thought put into what should happen should the link never get clicked (maybe the address could delete itself after a short period?). Or at the very least, there should be a mechanism whereby someone within Verizon can delete an email address from an account (especially if it hasn't been verified). Or even less than that, the "unsubscribe" links at the bottom of the emails should allow a person to unsubscribe without having to log into an account and provide a PIN (especially in the case where the email address has not yet been verified).

2 Feb 2019

LoRa - first steps

It took all of (maybe?) an hour to set up two Adafruit Feather M0 with RFM95 LoRa Radio devices and have them ping each other using the simple getting-started guide and default Arduino code. Yes, Arduino, boo hiss, I agree. But it was a very simple and easy way to perform a quick test which helps answer a few basic feasibility questions.

As some of you know, we own a farm, which presents lots of amazing opportunities for electronics projects: remote sensing, remote control, recording, etc. It would be great to know if someone cranks up the heat in the tack room, then leaves without turning it back down again. It would be great to know if someone accidentally leaves a light on, or a door open, somewhere in the barn. It would be great to be alerted if the electric fence goes down. It would be fantastic to be able to track water temperature in numerous places throughout our outdoor wood boiler HVAC system and correlate that with ambient room temperatures and outdoor temperature. It would be even more amazing to be able to track property-wide and area-specific electricity usage and water usage. And perhaps even consider some HVAC control projects too! Then there's motion sensing, detecting cars/people coming and going, gate operations/accessibility, wildlife/herd tracking, mailbox alerts, ..., it's quite a list!

But before I can even start to dream too much, I need to look at a lot of mundane things and figure out a whole bunch of details. For example: how do I communicate with things over the length (685m) and width (190m) of our property (~30 acres)? What's the best way to communicate with things in the barn? Does everything need to be plugged in, or are batteries feasible?

One of the challenges that might not be readily obvious to most is that the barn is mostly wrapped in metal. Trying to do wireless things in, around, through, and past an all-metal-wrapped barn is not straight-forward. Even our house has a metal roof. Another challenge is the fact our house is made of field-stone, and has roughly 17"-thick concrete/stone walls! Try getting WiFi out of an underground basement through a 17" concrete/stone wall!

I'm sure to most people it's obvious WiFi isn't a solution. Maybe sections of the property could be covered by WiFi, but it's certainly not the solution everywhere. And even at that, trying to cover an outdoor area in WiFi requires outdoor antennas and WiFi extenders (which are not cheap, and can be difficult to get working together). Not to mention: WiFi is hard on battery-operated devices. Obviously Bluetooth isn't going to cut it either. So that eliminates all those Espressif ESP8266/ESP32 and BT/BTLE devices. A traditionally popular option would be Zigbee, but I get the feeling its popularity is waning. The rising star today for "IoT things" seems to be LoRa, so I wanted to give that a try. Ideally, though, I'd like to try Zigbee too, so I can evaluate it and LoRa side-by-side.

But how well is LoRa going to work on my property? Sure, we hear all sorts of amazing numbers describing the theoretical LoRa range, but these results always come with provisos. How well is LoRa going to work from my basement? Through my house's thick walls? Past the all-metal barn? Over the hills? And through the forest?

Then on top of LoRa itself is this whole LoRaWAN stuff and The Things Network... (whatever those things are).

Above the radio we then have to consider microcontrollers. I wouldn't want to wake up one day to find that I had grown overly biased in my preference for one microcontroller over all others. But having worked with 8-bit PICs, 8-bit AVRs, and 8051s, I have to say: those 32-bit Cortex-Ms from ARM are pretty sweet! Maybe I'll consider using a PIC here or there just to improve my microprocessor breadth, but they won't be a top priority. Another up-and-coming microcontroller that I'll want to experiment with would be one of those smaller RISC-V designs, such as the FE310.

On top of the microcontroller goes the code. As I said above, the Arduino environment is cute for some quick prototyping, but ideally I'd prefer to be closer to the hardware. Popular choices in the maker community include MicroPython and Adafruit's CircuitPython. Those are okay choices and both have their place, but only if you're fond of "single control-loop" designs. Through these projects I'm hoping to explore MicroPython, CircuitPython, and, yes, even Arduino stuff, but ultimately I'd like to spend most of my time with things like FreeRTOS, Zephyr, mbed, and libopencm3. Any others I should consider?

Above the "firmware" comes higher-level software, such as messaging. I'm guessing MQTT is the only sensible choice here?

I'm still not done. Another item that needs serious attention is all the various hardware choices: hardware form-factors, batteries, weatherproof enclosures, .... If every item is going to be a one-off design, then I can try a bunch of different boards, batteries, enclosures, and form-factors to see which ones work better than others. But if I want to build up an ecosystem of devices all built on the same known platform, then I need to consider standardizing on some of these options.

I like what Adafruit has done with their Feather line of development boards. They're standardized, breadboard-able, have LiPo connectors and charging hardware onboard, and have an ever-growing ecosystem of daughter-boards (FeatherWings). What's nice about the Feather ecosystem is how the user has a choice of microcontroller for the baseboard itself. I think it would be fair to call Adafruit's Feather ecosystem a form-factor for the Internet of Things. Are there any others worth considering?

I started out saying how it didn't take very long to get two of these boards sending messages to each other. Although my research had told me that it should work easily, I was still very amazed when I took one of the boards, plugged in a LiPo battery, put everything inside a weatherproof enclosure, brought it to the barn, and returned to my desk to find they were still communicating! The barn is about 80m (~260') away, and my office is underground, behind a 17"-thick concrete/stone wall! I tweaked the code a little, but didn't make any changes to the radio operation other than to set the frequency. I'm using just a plain, simple 3" wire soldered to the "antenna" pad. Wow!

And with this little experiment, I've (finally!) started down the path of (hopefully!) many fun electronics farm projects! I now know I can at least communicate from my desk to the barn over LoRa using a simple 3" wire antenna and two tiny Feather boards.

4 Sep 2018

OE Hands-On, September 13 2018

Back in April I gave a talk about OpenEmbedded/Yocto at my local RaspberryPi Meetup:

    https://www.meetup.com/Raspberry-Pi/events/gbdwdpyxgbqb/
    slides: https://www.slideshare.net/TrevorWoerner/using-openembedded-93867379

That talk went well, and participants were eager to try it themselves. Therefore we've arranged for the upcoming Toronto Raspberry Pi Meetup, September 13 2018, to be a hands-on session with OpenEmbedded/Yocto! Bring your Raspberry Pi, and associated equipment[1], with you to the meeting and I'll help you work on generating your own distros/images!

Admission is free, but space is limited, so please sign up at:

    https://www.meetup.com/Raspberry-Pi/events/gbdwdpyxmbrb/

[1] If you want to participate, you need to bring, at a minimum, your Raspberry Pi (any of rpi0, rpi1 (original), rpi2, rpi3 (any), or cm3), its power supply, and a microSD card. If you want to verify anything is working you'll need either a serial console cable, or a device to plug into your device's HDMI port. I'll bring some spare serial console cables with me. If you use an HDMI device, you might also want to bring a USB keyboard and mouse.