Everyone knows that doing an OE build can take a bit of time (there are good reasons for this being true), so it follows that performing an OE build in a VM will take even longer. When you do a build "natively" you can potentially use all of the computer's free memory and all of its processing power, whereas a VM running on that same machine only gets whatever portion of the processing power and memory you allocate to it (in these tests, half).
The question I wanted to answer was: if I performed a build "natively" but only used half the memory+cpu of a computer, how would that compare to a build in a VM that thought it was using the full computer's resources but had been constrained to only half the resources due to the use of the VM?
When performing an OE build there are two variables you can set that allow you to restrict the amount of cpu resources that are used: PARALLEL_MAKE and BB_NUMBER_THREADS. But using these variables isn't really the same as building on a machine with half the processing resources; the initial parse, for example, will use all available cpu resources (it can't know how much you want to restrict the build until it has actually completed parsing all the configuration files). Plus, there are no OE variables you can tweak to say "only use this much memory during the build".
So in order to better measure a VM's performance we need to find a better way to perform a restricted "native" build (other than just tweaking some configuration parameters). Answer: cgroups.
To be honest, I was hoping to demonstrate that, by using qemu+kvm with virtio drivers everywhere I could (disk, network, etc.), the performance of a VM would at least approach that of a "native" build restricted via cgroups to the same amount of resources as was available to the VM. My findings, however, didn't bear that out.
First I'll present my results (since that's really what you want to see), then I'll describe my test procedure (for you to pick holes in ;-) ). I ran my tests on two computers and repeated each test 5 times.
First I ran a build on the "raw/native" computer using all its resources (C=1):
Run | Computer 1 | Computer 2 |
1 | 00:24:19 | 01:02:42 |
2 | 00:25:03 | 01:02:19 |
3 | 00:24:51 | 01:02:34 |
4 | 00:24:55 | 01:02:21 |
5 | 00:25:00 | 01:02:57 |
avg | 00:24:50 | 01:02:35 |
I'm using "C" to represent the CPU resources; "1" meaning "all CPUs", i.e. using a value of "oe.utils.cpu_count()" for both PARALLEL_MAKE and BB_NUMBER_THREADS. A value of "0.5" means I've adjusted these parameters to be half of what "oe.utils.cpu_count()" would give.
Then I ran the same build again on the "native" machine, but this time using BB_NUMBER_THREADS/PARALLEL_MAKE to only use half the cpus/threads (C=0.5):
Run | Computer 1 | Computer 2 |
1 | 00:24:20 | 00:57:10 |
2 | 00:24:34 | 00:57:32 |
3 | 00:25:11 | 00:57:08 |
4 | 00:24:39 | 01:03:31 |
5 | 00:24:31 | 01:01:50 |
avg | 00:24:39 | 00:59:26 |
Counter-intuitively, when restricting the resources, the builds performed ever-so-slightly better than when allowing the build to use all of the computer's resources. Perhaps these builds aren't CPU-bound, and this set just happened to come out slightly better than the "full resources" builds. Or perhaps, for this workload, these Intel CPUs aren't really able to make much use of CPU threads and it's the CPU cores that count. (??)
Then I performed the same set of builds on the "native" computers, but after having restricted their resources via cgroups.
C=1 (restricted by half via cgroups, same for memory):
Run | Computer 1 | Computer 2 |
1 | 00:28:05 | 01:05:57 |
2 | 00:28:18 | 01:05:33 |
3 | 00:28:21 | 01:05:46 |
4 | 00:28:14 | 01:05:22 |
5 | 00:27:37 | 01:05:37 |
avg | 00:28:07 | 01:05:39 |
C=0.5 (CPUs and memory restricted to half via cgroups as above, with BB_NUMBER_THREADS/PARALLEL_MAKE additionally set to half of cpu_count()):
Run | Computer 1 | Computer 2 |
1 | 00:27:20 | 01:04:08 |
2 | 00:27:15 | 01:07:39 |
3 | 00:27:08 | 01:07:42 |
4 | 00:27:17 | 01:07:57 |
5 | 00:27:13 | 01:08:16 |
avg | 00:27:15 | 01:07:08 |
So there's obviously a difference between restricting a build's resources via BB_NUMBER_THREADS/PARALLEL_MAKE versus setting hard limits using cgroups. That's not too surprising. But, again, there's very little difference between using the full "cpu_count()" number of CPU resources and using half that amount; in fact, for Computer 1 the build time improved slightly.
Now here's the part where I used a VM running under qemu+kvm. I had been hoping these times would be comparable to the times I obtained when restricting the build via cgroups, but that wasn't the case.
C=1 (restricted by half via VM, same for memory):
Run | Computer 1 | Computer 2 |
1 | 00:41:36 | 01:45:42 |
2 | 00:41:22 | 01:47:52 |
3 | 00:41:41 | 01:44:31 |
4 | 00:41:16 | 01:50:25 |
5 | 00:41:12 | 01:41:41 |
avg | 00:41:25 | 01:46:02 |
C=0.5 (CPUs and memory restricted to half via the VM as above, with BB_NUMBER_THREADS/PARALLEL_MAKE additionally set to half of cpu_count()):
Run | Computer 1 | Computer 2 |
1 | 00:42:02 | 01:30:23 |
2 | 00:42:07 | 01:34:43 |
3 | 00:43:05 | 01:31:14 |
4 | 00:43:14 | 01:36:46 |
5 | 00:42:12 | 01:42:05 |
avg | 00:42:32 | 01:35:02 |
Analysis
Using the first build as a reference (letting the build use all the resources it wants on a "raw/native" machine):
- constraining the build to use half the machine's resources via cgroups results in build times that are from 4.90% to 13.22% slower.
- performing the same build in a qemu+kvm VM results in build times that are from 59.90% to 72.55% slower.
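As a worked example of where these numbers come from: Computer 1's cgroup-restricted C=1 average of 00:28:07 (1687 seconds) versus its unrestricted native C=1 average of 00:24:50 (1490 seconds) gives 1687 / 1490 ≈ 1.1322, i.e. the 13.22% figure above.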
Specifics
- bitbake core-image-minimal
- fido release, e4f3cf8950106bd420e09f463f11c4e607462126, 2138 tasks
- DISTRO=poky (meta-poky)
- a "-c fetchall" was performed initially, then all directories but "conf" were deleted and the timed build was performed with "BB_NO_NETWORK=1"
- between each build everything would be deleted except for the "conf" directory; therefore no sstate or tmp or cache etc
- /tmp implemented on a tmpfs
- because the VMs had limited disk space, all builds were performed with "INHERIT+=rm_work"
- to help load/manage the VMs I use a set of scripts I created here: https://github.com/twoerner/qemu_scripts
- in the VMs, both the "Download" directory and the source/recipes are mounted and shared from the host (using virtio)
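To give a concrete picture of the VM setup, here is an illustrative sketch of such a qemu+kvm launch; the image name, paths, and the 4-CPU/8GB split (half of an 8-CPU/16GB host) are placeholders, and the actual invocations come from the scripts linked above. It uses a virtio disk, virtio networking, and a virtio/9p share for the downloads directory:

$ qemu-system-x86_64 -enable-kvm -cpu host \
      -smp 4 -m 8G \
      -drive file=builder.qcow2,if=virtio \
      -netdev user,id=net0 -device virtio-net-pci,netdev=net0 \
      -virtfs local,path=/path/to/downloads,mount_tag=downloads,security_model=mapped-xattr

Inside the guest, the shared directory can then be mounted with something like:

$ mount -t 9p -o trans=virtio,version=9p2000.L downloads /path/to/guest/downloads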
To restrict a build using cgroups I created a cgroup named "oebuild" in both the "cpuset" and "memory" controllers and placed my shell in them.
So, for example, say I'm running a shell, bash, and its PID is 1234. As root, go to /sys/fs/cgroup and:
- mkdir cpuset/oebuild
- mkdir memory/oebuild
If your system has 8 CPUs and 16GB of RAM (adjust your numbers accordingly):
- echo "0-3" > cpuset/oebuild/cpuset.cpus
- echo 0 > cpuset/oebuild/cpuset.mems
- echo 8G > memory/oebuild/memory.limit_in_bytes
And finally ("1234" is the PID of the shell I want to put in the "oebuild" cgroup; I then use this shell to run the build):
- echo 1234 > cpuset/oebuild/tasks
- echo 1234 > memory/oebuild/tasks
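As an aside, if the root shell doing this setup is also the shell you'll run the build from, you can let it substitute its own PID rather than looking it up:
- echo $$ > cpuset/oebuild/tasks
- echo $$ > memory/oebuild/tasks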
To confirm your shell is now running in this cgroup, run the following command in your shell:
$ cat /proc/self/cgroup
10:hugetlb:/
9:perf_event:/
8:blkio:/
7:net_cls,net_prio:/
6:freezer:/
5:devices:/
4:memory:/oebuild
3:cpu,cpuacct:/
2:cpuset:/oebuild
1:name=systemd:/user.slice/user-1000.slice/session-625.scope
Here you can see that, for this process (i.e. /proc/self), both the "memory" and "cpuset" hierarchies are constrained by the oebuild cgroup.
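You can also verify that the limits themselves took effect; assuming the 8-CPU/16GB example above:
$ cat /sys/fs/cgroup/cpuset/oebuild/cpuset.cpus
0-3
$ cat /sys/fs/cgroup/memory/oebuild/memory.limit_in_bytes
8589934592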