The question I wanted to answer was: if I performed a build "natively" but only used half the memory+cpu of a computer, how would that compare to a build in a VM that thought it was using the full computer's resources but had been constrained to only half the resources due to the use of the VM?
When performing an OE build there are two variables you can set that allow you to restrict the amount of cpu resources that are used: PARALLEL_MAKE and BB_NUMBER_THREADS. But using these variables isn't really the same as building on a machine with half the processing resources; the initial parse, for example, will use all available cpu resources (it can't know how much you want to restrict the build until it has actually completed parsing all the configuration files). Plus, there are no OE variables you can tweak to say "only use this much memory during the build".
So in order to better measure a VM's performance we need to find a better way to perform a restricted "native" build (other than just tweaking some configuration parameters). Answer: cgroups.
To be honest, I was hoping to demonstrate that if I used qemu+kvm with virtio drivers everywhere I could (disk, network, etc.), the performance of a VM would at least approach that of a "native" build restricted via cgroups to the same amount of resources as was available to the VM. My findings, however, didn't bear that out.
First I'll present my results (since that's really what you want to see), then I'll describe my test procedure (for you to pick holes in ;-) ). I tried my tests on two computers and ran each test 5 times.
First I ran a build on the "raw/native" computer using all its resources (C=1):
Run | Computer 1 | Computer 2 |
---|---|---|
1 | 00:24:19 | 01:02:42 |
2 | 00:25:03 | 01:02:19 |
3 | 00:24:51 | 01:02:34 |
4 | 00:24:55 | 01:02:21 |
5 | 00:25:00 | 01:02:57 |
avg | 00:24:50 | 01:02:35 |
I'm using "C" to represent the CPU resources; "1" meaning "all CPUs", i.e. using a value of "oe.utils.cpu_count()" for both PARALLEL_MAKE and BB_NUMBER_THREADS. A value of "0.5" means I've adjusted these parameters to be half of what "oe.utils.cpu_count()" would give.
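As a rough illustration of the two settings, the helper below prints the corresponding local.conf lines for a given divisor. It is only a sketch: `emit_cpu_conf` is a made-up name, and it writes literal numbers taken from `nproc` rather than an `oe.utils.cpu_count()` expression, so it assumes the configuration is generated on the machine that runs the build.

```shell
# Print local.conf lines that use a fraction of the host's CPUs.
# $1 = divisor (1 for C=1, 2 for C=0.5)
# $2 = CPU count (optional, defaults to the output of nproc)
emit_cpu_conf() {
    divisor=$1
    cpus=${2:-$(nproc)}
    n=$(( cpus / divisor ))
    # Never go below one thread, even on a single-CPU machine.
    [ "$n" -ge 1 ] || n=1
    printf 'BB_NUMBER_THREADS = "%s"\n' "$n"
    printf 'PARALLEL_MAKE = "-j %s"\n' "$n"
}
```

For a C=0.5 build this could be used as `emit_cpu_conf 2 >> conf/local.conf`.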
Then I ran the same build again on the "native" machine but this time using BB_NUMBER_THREADS/PARALLEL_MAKE to only use half the cpus/threads (C=0.5):
Run | Computer 1 | Computer 2 |
---|---|---|
1 | 00:24:20 | 00:57:10 |
2 | 00:24:34 | 00:57:32 |
3 | 00:25:11 | 00:57:08 |
4 | 00:24:39 | 01:03:31 |
5 | 00:24:31 | 01:01:50 |
avg | 00:24:39 | 00:59:26 |
Counter-intuitively, when restricting the resources, the builds performed ever-so-slightly better than when allowing the build to use all of the computer's resources. Perhaps these builds aren't CPU-bound, and this set just happened to come out slightly better than the "full resources" builds. Or perhaps (for this workload) these Intel CPUs aren't really able to make much use of CPU threads, and it's CPU cores that count. (??)
Then I performed the same set of builds on the "native" computers, but after having restricted their resources via cgroups.
C=1 (restricted by half via cgroups, same for memory):
Run | Computer 1 | Computer 2 |
---|---|---|
1 | 00:28:05 | 01:05:57 |
2 | 00:28:18 | 01:05:33 |
3 | 00:28:21 | 01:05:46 |
4 | 00:28:14 | 01:05:22 |
5 | 00:27:37 | 01:05:37 |
avg | 00:28:07 | 01:05:39 |
C=0.5 (further restricted by half (again) via cgroups, same for memory):
Run | Computer 1 | Computer 2 |
---|---|---|
1 | 00:27:20 | 01:04:08 |
2 | 00:27:15 | 01:07:39 |
3 | 00:27:08 | 01:07:42 |
4 | 00:27:17 | 01:07:57 |
5 | 00:27:13 | 01:08:16 |
avg | 00:27:15 | 01:07:08 |
So there's obviously a difference between restricting a build's resources via BB_NUMBER_THREADS/PARALLEL_MAKE versus setting hard limits using cgroups. That's not too surprising. But again there's very little difference between using "cpu_count()" number of CPU resources versus using half that amount. In fact, for Computer 1 the build time improved slightly.
Now here's the part where I used a VM running under qemu+kvm. I had been hoping these times would be comparable to the times I obtained when restricting the build via cgroups, but that wasn't the case.
C=1 (restricted by half via VM, same for memory):
Run | Computer 1 | Computer 2 |
---|---|---|
1 | 00:41:36 | 01:45:42 |
2 | 00:41:22 | 01:47:52 |
3 | 00:41:41 | 01:44:31 |
4 | 00:41:16 | 01:50:25 |
5 | 00:41:12 | 01:41:41 |
avg | 00:41:25 | 01:46:02 |
C=0.5 (further restricted again by half via VM, same for memory):
Run | Computer 1 | Computer 2 |
---|---|---|
1 | 00:42:02 | 01:30:23 |
2 | 00:42:07 | 01:34:43 |
3 | 00:43:05 | 01:31:14 |
4 | 00:43:14 | 01:36:46 |
5 | 00:42:12 | 01:42:05 |
avg | 00:42:32 | 01:35:02 |
Analysis
Using the first build as a reference (letting the build use all the resources it wants on a "raw/native" machine):
- constraining the build to use half the machine's resources via cgroups results in build times that are from 4.90% to 13.22% slower.
- performing the same build in a qemu+kvm VM results in build times that are from 59.90% to 72.55% slower.
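The percentages above can be checked with a few lines of shell. This is only a sketch: `to_seconds` and `slowdown_pct` are helper names made up for illustration, and it assumes each constrained average is compared against the native average for the same C setting, which is the pairing that reproduces the quoted extremes.

```shell
# Convert an HH:MM:SS time into a number of seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print how much slower $2 (constrained build) is than $1 (reference
# build), as a percentage with two decimals.
slowdown_pct() {
    ref=$(to_seconds "$1")
    cur=$(to_seconds "$2")
    awk -v r="$ref" -v c="$cur" 'BEGIN { printf "%.2f\n", (c / r - 1) * 100 }'
}

# Averages taken from the tables above (reference first, constrained second).
slowdown_pct 01:02:35 01:05:39   # Computer 2, C=1, cgroups   -> 4.90
slowdown_pct 00:24:50 00:28:07   # Computer 1, C=1, cgroups   -> 13.22
slowdown_pct 00:59:26 01:35:02   # Computer 2, C=0.5, VM      -> 59.90
slowdown_pct 00:24:39 00:42:32   # Computer 1, C=0.5, VM      -> 72.55
```

The remaining combinations (for example Computer 1 at C=1 in the VM) fall inside these ranges when computed the same way.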
Specifics
- bitbake core-image-minimal
- fido release, e4f3cf8950106bd420e09f463f11c4e607462126, 2138 tasks
- DISTRO=poky (meta-poky)
- a "-c fetchall" was performed initially, then all directories but "conf" were deleted and the timed build was performed with "BB_NO_NETWORK=1"
- between each build everything would be deleted except for the "conf" directory; therefore no sstate or tmp or cache etc
- /tmp implemented on a tmpfs
- because the VMs had limited disk space, all builds were performed with "INHERIT+=rm_work"
- to help load/manage the VMs I use a set of scripts I created here: https://github.com/twoerner/qemu_scripts
- in the VMs, both the "Download" directory and the source/recipes are mounted and shared from the host (using virtio)
So, for example, say I'm running a shell, bash, and its PID is 1234. As root, go to /sys/fs/cgroup and:
- mkdir cpuset/oebuild
- mkdir memory/oebuild
- echo "0-3" > cpuset/oebuild/cpuset.cpus
- echo 0 > cpuset/oebuild/cpuset.mems
- echo 8G > memory/oebuild/memory.limit_in_bytes
- echo 1234 > cpuset/oebuild/tasks
- echo 1234 > memory/oebuild/tasks
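The steps above can be collected into one small function. Treat it as a sketch rather than a tested tool: `setup_oebuild_cgroup` and the `CGROUP_ROOT` override are names introduced here for illustration, and it assumes the same cgroup v1 layout (separate "cpuset" and "memory" hierarchies) used in the list above.

```shell
# Create the "oebuild" cgroup and move a process into it.
# $1 = PID to constrain, $2 = CPU list (e.g. 0-3), $3 = memory limit (e.g. 8G)
# CGROUP_ROOT is a hypothetical override, mainly useful for dry runs.
setup_oebuild_cgroup() {
    cg_root=${CGROUP_ROOT:-/sys/fs/cgroup}
    cg_pid=$1
    cg_cpus=$2
    cg_mem=$3

    mkdir -p "$cg_root/cpuset/oebuild" "$cg_root/memory/oebuild"

    # cpuset.cpus and cpuset.mems must be filled in before any task
    # can be attached to the cpuset cgroup.
    echo "$cg_cpus" > "$cg_root/cpuset/oebuild/cpuset.cpus"
    echo 0 > "$cg_root/cpuset/oebuild/cpuset.mems"
    echo "$cg_mem" > "$cg_root/memory/oebuild/memory.limit_in_bytes"

    # Attach the process; children it starts afterwards inherit the cgroup.
    echo "$cg_pid" > "$cg_root/cpuset/oebuild/tasks"
    echo "$cg_pid" > "$cg_root/memory/oebuild/tasks"
}
```

Run as root, `setup_oebuild_cgroup 1234 0-3 8G` would perform the same steps as the list; the build is then started from the constrained shell.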
Afterwards, checking from that shell shows that both the "memory" and "cpuset" cgroups are constrained by the oebuild cgroup for this process (i.e. /proc/self):
$ cat /proc/self/cgroup
10:hugetlb:/
9:perf_event:/
8:blkio:/
7:net_cls,net_prio:/
6:freezer:/
5:devices:/
4:memory:/oebuild
3:cpu,cpuacct:/
2:cpuset:/oebuild
1:name=systemd:/user.slice/user-1000.slice/session-625.scope
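If the check needs to be scripted (say, to refuse to start a timed build from an unconstrained shell), something like the following could work. It is a sketch only: `in_cgroup` is a made-up helper, and it assumes the "ID:controllers:path" line format shown in the output above.

```shell
# Succeed if the given controller is attached to the given cgroup path.
# $1 = controller name (e.g. memory), $2 = cgroup path (e.g. /oebuild)
# $3 = file to inspect (optional, defaults to /proc/self/cgroup)
in_cgroup() {
    chk_file=${3:-/proc/self/cgroup}
    awk -F: -v c="$1" -v g="$2" '
        {
            # The second field may list several controllers, e.g. "cpu,cpuacct".
            n = split($2, names, ",")
            for (i = 1; i <= n; i++)
                if (names[i] == c && $3 == g)
                    found = 1
        }
        END { exit !found }
    ' "$chk_file"
}
```

For example, `in_cgroup memory /oebuild && in_cgroup cpuset /oebuild || echo "not constrained"` prints a warning when either controller is missing.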