mbedTLS vs BoringSSL on ARM

Goals and assumptions

The goal is to choose the most suitable TLS library that can be statically linked with an application. The application will run on a modern mobile operating system and a variety of ARM CPUs. I’m interested in the client side of TLS only. The ideal library ensures the best security, implements algorithms optimized for speed and compiles to a reasonably small binary. Additionally, I assume I control both sides of the connection, meaning I’m free to choose the cipher(s) used for both symmetric and asymmetric encryption (without using PSK). I also have some requirements regarding licenses and being open source.

I’ve identified two libraries which seem to meet those requirements:

mbedTLS - a library formerly known as PolarSSL. It makes it fairly easy for developers to include cryptographic and TLS capabilities in embedded products. It is highly configurable, so TLS functionality can be added with a very small code footprint. It is currently maintained by ARM.
BoringSSL - a fork of OpenSSL maintained and used by Google. It is the default TLS library on Android (starting from version M) and in Chrome, and it is also used on Cloudflare systems. It has the advantage of originating from OpenSSL, which means the code base has received a lot of review and testing.

Testing application

The test application is a simple C program which compiles and links against the library under test and runs on an ARMv8 platform running Android. The app is composed of a client and a server. As I’m only interested in the client side of TLS, the server always uses the same library (it’s based on BoringSSL). The server is configured to support only TLSv1.2 (as 1.3 is not yet supported by mbedTLS [16]). To start the server, the user provides an argument which specifies the certificate type to be used (RSA, ECDSA or EdDSA based). Once running, it always enforces the same cipher suite - for example, in the case of RSA it will be ECDHE key agreement with RSA signature and AES-256 in GCM (AEAD) mode.

The client application is the one which I want to benchmark. I have implemented one client which uses the mbedTLS API and links with that library, and a similar one for BoringSSL. The client always establishes the TCP connection in blocking mode (for simplicity). It implements 3 different tests:

Handshake: the client opens a TCP connection and performs many handshakes without closing the connection. Performance of this test depends on the key type used for certificate signing and on the key agreement algorithm (as well as the elliptic curve used), hence this test is performed multiple times, once for each certificate type
Write: the client opens a TCP connection and sends a few hundred megabytes of data. This test is done mostly to assess performance of symmetric encryption
Read: the client opens a TCP connection and sends a request to the server, which sends back a few hundred megabytes of data. This test is done mostly to assess performance of symmetric decryption
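
To make the handshake test more concrete, here is a minimal sketch of how a fixed number of operations can be timed from inside a client. It is not the code used for the measurements in this post; `time_iterations` and `count_call` are hypothetical names, and `count_call` is only a stand-in for "perform one handshake".

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
#endif
#include <time.h>

/* Hypothetical helper: runs op(arg) `iterations` times and returns the
 * elapsed wall-clock time in seconds, or -1.0 if any iteration fails. */
double time_iterations(int (*op)(void *arg), void *arg, int iterations)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++) {
        if (op(arg) != 0)
            return -1.0;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Stand-in for one TLS handshake: it only counts how often it was called. */
int count_call(void *arg)
{
    int *counter = arg;
    (*counter)++;
    return 0;
}
```

In a real client the callback would wrap the library's handshake call, and the same helper could be reused for the read and write tests. A monotonic clock is used so that wall-clock adjustments during a run do not distort the result.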

Details regarding testing environment

Software version

Library	Commit
BoringSSL	`eb7c3008`
mbedTLS	`4ca9a457`

Compiler and environment settings

Name	Setting
Compiler	aarch64-linux-android-clang5.0 (as Google is deprecating `gcc`)
ABI	arm64-v8a
NDK version	16b
Android Native API level	27
Android Build type	Release

Testing platform

The hardware platform used for testing is a HiKey620 development board. It is powered by the Kirin 620 SoC (8 x ARM Cortex-A53) from HiSilicon and runs Android 8 from AOSP (see build details in Appendix B). Details about the board can be found here and here.

Details of the environment used:
```
Linux localhost 4.9.29-g23875fc #1 SMP PREEMPT Tue Jul 4 14:25:00 CEST 2017 aarch64
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
```
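
The `Features` line above is what the kernel reports in /proc/cpuinfo. As a sketch of how a program could check for the same crypto extensions at run time, the code below decodes the AArch64 `AT_HWCAP` word. The bit positions are the ones the Linux kernel uses for arm64; the `A64_CAP_*` names and helper functions are my own and purely illustrative, and this is not how either library performs its detection.

```c
#include <stdio.h>
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */
#endif

/* AArch64 AT_HWCAP bits as defined by the Linux kernel. Private names are
 * used so they cannot clash with the HWCAP_* macros from system headers. */
#define A64_CAP_AES   (1UL << 3)
#define A64_CAP_PMULL (1UL << 4)
#define A64_CAP_SHA1  (1UL << 5)
#define A64_CAP_SHA2  (1UL << 6)

/* AES-GCM benefits from both the AES and the PMULL (GHASH) instructions. */
int has_aes_gcm_accel(unsigned long hwcap)
{
    return (hwcap & A64_CAP_AES) != 0 && (hwcap & A64_CAP_PMULL) != 0;
}

int has_sha256_accel(unsigned long hwcap)
{
    return (hwcap & A64_CAP_SHA2) != 0;
}

/* Returns the capability word, or 0 when not built for Linux on AArch64. */
unsigned long read_hwcap(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return getauxval(AT_HWCAP);
#else
    return 0;
#endif
}

void print_crypto_caps(void)
{
    unsigned long hwcap = read_hwcap();
    printf("AES-GCM acceleration: %s\n", has_aes_gcm_accel(hwcap) ? "yes" : "no");
    printf("SHA-256 acceleration: %s\n", has_sha256_accel(hwcap) ? "yes" : "no");
}
```

On the board used here, with `aes`, `pmull` and `sha2` all listed, both helpers should report that acceleration is available.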

Preparation step

The following script is used to set up the platform for benchmarking. The most important step is to fix the CPU frequency so that it is not adjusted automatically by mechanisms like EAS [11].

```
# Number of CPUs on the board
NUM_CPU=8
# CPU scaling governor
GOVERNOR=userspace
# Requested CPU frequency (in kHz)
MAX_FREQ=1200000

adb root
adb remount
# Prevent system from suspending
adb shell "echo temporary > /sys/power/wake_lock"
# Probably useful only on qcom, but anyway...
adb shell stop thermal-engine
adb shell stop mpdecision

# Bring every core online and pin it to the requested frequency
for ID in $(seq 0 $((NUM_CPU-1)))
do
    adb shell "echo 1 > /sys/devices/system/cpu/cpu${ID}/online"
    adb shell "echo ${GOVERNOR} > /sys/devices/system/cpu/cpu${ID}/cpufreq/scaling_governor"
    adb shell "echo ${MAX_FREQ} > /sys/devices/system/cpu/cpu${ID}/cpufreq/scaling_setspeed"
done

# Print the resulting state to verify that the settings were applied
for ID in $(seq 0 $((NUM_CPU-1)))
do
    adb shell "cat /sys/devices/system/cpu/cpu${ID}/online"
    adb shell "cat /sys/devices/system/cpu/cpu${ID}/cpufreq/scaling_governor"
    adb shell "cat /sys/devices/system/cpu/cpu${ID}/cpufreq/scaling_cur_freq"
done
```

Binary size reduction

mbedTLS

mbedTLS makes it possible to select the features of the TLS library at compile time. A configuration template is available in the config.h file and is managed by defining or disabling a number of preprocessor symbols (look for MBEDTLS_CONFIG_FILE for more details). This is an easy way for developers to include cryptographic and (optional) SSL/TLS capabilities in their products with a minimal code footprint. It is an interesting feature for memory-constrained devices (for example microcontrollers).

mbedTLS compilation produces 3 separate libraries - the crypto, ssl and x509 libraries. Compilation also outputs a number of test binaries.

As a first step I have applied the obvious size optimization provided by the compiler (-Os) and stripped all the symbols (they can be stored in a separate file if needed). I also passed the -ffunction-sections and -fdata-sections options to the compiler. This causes the compiler to place each function or data item into its own section. Then, thanks to -Wl,--gc-sections, the linker is able to keep only those sections which are actually used, which makes the resulting binary much smaller (one can add -Wl,--print-gc-sections to see the removed sections). This optimization may produce unexpected results, so I strongly advise reading the documentation and getting familiar with its details.

In a second step I have changed the config.h file and removed capabilities which are not needed by our client application, leaving only the following:

TLS client side code
TLS v1.2
AES-GCM used as a symmetric cipher
RSA, ECDSA and ECDH with curve P-256
SHA-256 and SHA-512
Code for key pre-sharing has been removed
Some additional features required by the client code
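
As an illustration, a reduced configuration along those lines could look roughly like the fragment below. This is a sketch and not the exact config.h used for the measurements: the symbol names are from the mbedTLS 2.x configuration template, but which supporting modules are required is my assumption. A real configuration file would normally also end by including mbedtls/check_config.h, which reports missing dependencies at compile time.

```c
/* Sketch of a reduced mbedTLS 2.x configuration (assumed, not the original). */

/* TLS client side only, TLS v1.2 only */
#define MBEDTLS_SSL_TLS_C
#define MBEDTLS_SSL_CLI_C
#define MBEDTLS_SSL_PROTO_TLS1_2

/* Key exchange: ECDHE with RSA or ECDSA certificates, no PSK */
#define MBEDTLS_KEY_EXCHANGE_ECDHE_RSA_ENABLED
#define MBEDTLS_KEY_EXCHANGE_ECDHE_ECDSA_ENABLED

/* Symmetric cipher: AES-GCM */
#define MBEDTLS_AES_C
#define MBEDTLS_GCM_C
#define MBEDTLS_CIPHER_C

/* Asymmetric crypto: RSA, ECDSA and ECDH on P-256 */
#define MBEDTLS_BIGNUM_C
#define MBEDTLS_RSA_C
#define MBEDTLS_PKCS1_V15
#define MBEDTLS_ECP_C
#define MBEDTLS_ECP_DP_SECP256R1_ENABLED
#define MBEDTLS_ECDH_C
#define MBEDTLS_ECDSA_C

/* Hashes */
#define MBEDTLS_MD_C
#define MBEDTLS_SHA256_C
#define MBEDTLS_SHA512_C

/* Supporting modules: certificate parsing and random number generation */
#define MBEDTLS_ASN1_PARSE_C
#define MBEDTLS_ASN1_WRITE_C
#define MBEDTLS_OID_C
#define MBEDTLS_PK_C
#define MBEDTLS_PK_PARSE_C
#define MBEDTLS_X509_USE_C
#define MBEDTLS_X509_CRT_PARSE_C
#define MBEDTLS_ENTROPY_C
#define MBEDTLS_CTR_DRBG_C
```

Dropping the two RSA-related lines (MBEDTLS_RSA_C, MBEDTLS_PKCS1_V15) and the ECDHE-RSA key exchange would correspond to removing RSA support entirely.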

In a third step I’ve removed RSA support. On the one hand RSA isn’t actually necessary, as I control both sides of the connection; on the other hand it’s interesting to see how much the binary size gets reduced.

Step	Optim	`libmbedx509.a`	`libmbedtls.a`	`libmbedcrypto.a`	Test app
0	Initial size (with -O2)	96K	260K	604K	464K
1	Removal of data and function sections, strip, -Os	68K	132K	380K	272K
2	Disabling not needed capabilities	52K	52K	236K	128K
3	Disabling RSA	40K	48K	208K	108K

The test client has shrunk by more than a factor of 4 (464K to 108K), to a really small size. Further reductions are possible (see [8] for ideas), but at this point I’m satisfied with the size and I don’t think it can be changed much. Removing RSA saves only 20KB in the test client (128K to 108K), so I’ve decided to keep RSA and pay that small penalty.

It’s also worth noting that 48KB for a TLSv1.2 implementation (libmbedtls.a) is a really small footprint. This is very interesting for small devices which implement most of the needed crypto in hardware.

BoringSSL

A similar experiment to the one above has been done with BoringSSL. This library doesn’t offer as many configuration possibilities as mbedTLS, but it provides some.

In the first step I’ve applied exactly the same compiler flags as for mbedTLS (-Os, symbol strip, indexing of data and function sections).

In a second step I’ve applied the OPENSSL_SMALL=1 configuration option. This selects algorithm implementations which are optimized for size rather than for speed (see [12] for more details).

In a third step I’ve tried to remove the assembly implementations. For some algorithms this causes a huge performance degradation, as some optimizations are written in assembly and hardware acceleration needs “glue” code which is also written in assembly. Nevertheless, it is an interesting step when comparing against mbedTLS, as it doesn’t currently have any such optimizations (see [6]).

BoringSSL provides the concept of crypto buffers, which can be used instead of some functions from the memory-hungry X509 and ASN.1 implementation. This feature, together with indexing of data and function sections (done in the first step), greatly reduces binary size. I have used it in step 4. In step 5 I go a bit further - the need for X509 and ASN.1 can be completely removed, assuming the user provides a custom certificate verification function. My client doesn’t implement such a function, but it shouldn’t be very complicated to implement, and its code size won’t change the final binary size much. Hence it’s interesting to see the result of this size reduction.
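
To give an idea of how little code a custom verification could need when both sides of the connection are under control, here is a sketch of the simplest possible policy: accept the server only if its leaf certificate is byte-for-byte identical to a copy embedded in the client. This is my own illustration, not something implemented or measured in this post, and `cert_matches_pin` is a hypothetical name. As far as I know, in BoringSSL the certificate bytes would come from the crypto buffer API (CRYPTO_BUFFER_data / CRYPTO_BUFFER_len) and the check would be called from a callback registered with SSL_CTX_set_custom_verify, but that wiring is left out so the snippet has no library dependency.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 when the DER encoding of the presented certificate is identical
 * to the pinned copy, 0 otherwise. Certificates are public data, so a plain
 * memcmp() is sufficient here. */
int cert_matches_pin(const uint8_t *der, size_t der_len,
                     const uint8_t *pin, size_t pin_len)
{
    if (der == NULL || pin == NULL)
        return 0;
    if (pin_len == 0 || der_len != pin_len)
        return 0;
    return memcmp(der, pin, pin_len) == 0;
}
```

Exact-match pinning ignores expiry and revocation and forces a client update whenever the server certificate changes, so it is only a starting point. It does show, however, that verification without X509 and ASN.1 parsing is feasible.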

In the last step I’ve tried a more aggressive change: commenting out (with preprocessor symbols) the RSA and DH implementations.

Step	Optim	`libcrypto.a`	`libssl.a`	Test App
0	Initial size	12.4M	7.6M	7.5M
1	Removal of data and function sections, strip, -Os	1244K	356K	796K
2	OPENSSL_SMALL=1	1200K	356K	756K
3	OPENSSL_NO_ASM	1184K	356K	736K
4	BoringSSL crypto buffers	1184K	356K	700K
5	Complete elimination of X.509 and ASN.1 code	1184K	356K	392K
6	Disabling RSA and DH	1144K	352K	356K

I’m positively surprised that it is possible to remove the X509 and ASN.1 code; it gives a really small library. At the moment I don’t want to implement my own certificate verification function, and I want to perform certificate verification during performance benchmarking. Still, it’s worth noting that at fairly small cost the BoringSSL client can be cut almost in half (700K to 392K), to a binary a bit more than 3 times bigger than the one produced with mbedTLS, which is quite interesting.

Removing the assembly hits performance a lot, so I will keep it. Removing RSA and DH gives a 36KB smaller binary, but it introduces a very high maintenance cost: it will be hard and error-prone to reapply those changes after updating the library to a newer version. As a side note, OpenSSL has a switch which removes RSA (OPENSSL_NO_RSA); FWIW it might be possible to port this code to BoringSSL.

Finally, for my further analysis I’ll apply steps 1, 2 and 4 (and I again encourage applying step 5).

Notes on size reduction

Something that wasn’t tried is Link Time Optimization, which may produce a smaller binary (see [3], [4] and [5]). It might be interesting to see how the results differ when using this feature instead of, or together with, section indexing.
I’ve also calculated the size of the shared libraries for BoringSSL - libcrypto.so: 1072KB; libssl.so: 276KB
mbedTLS doesn’t implement hardware acceleration, so performance won’t be as good as for BoringSSL. I wonder if it would make sense to take the extremely small SSL implementation from mbedTLS and use crypto from BoringSSL.

Performance comparison

Results from tools provided by the libraries

Both libraries provide tools for benchmarking. This subsection compares results reported by those tools. I compare default compilation against binary I got after applying tricks which reduce size of client application.

mbedTLS: default vs reduced

mbedTLS provides a tool for performance benchmarking called benchmark. The table below shows results for the most interesting algorithms (for results of all algorithms see Appendix C).

Algo	Reduced	Default (-O2)
SHA-256	46809 KiB/s	52044 KiB/s
AES-GCM-128	16399 KiB/s	16398 KiB/s
AES-GCM-256	14287 KiB/s	14286 KiB/s
RSA-2048	652 public/s	653 public/s
RSA-2048	17 private/s	17 private/s
RSA-4096	168 public/s	168 public/s
RSA-4096	3 private/s	3 private/s
ECDSA-secp256r1	189 sign/s	195 sign/s
ECDHE-secp256r1	57 handshake/s	60 handshake/s
ECDH-secp256r1	77 handshake/s	81 handshake/s
ECDHE-Curve25519	41 handshake/s	41 handshake/s
ECDH-Curve25519	80 handshake/s	82 handshake/s

One thing to notice is that (for the algorithms above) there is not much difference between -Os and -O2, as -Os enables all -O2 optimizations that do not typically increase code size. It’s also worth noticing the performance difference between static and ephemeral ECDH. It seems quite odd, and the root cause should probably be studied further.

BoringSSL: default vs reduced

Performance results are provided by the bssl speed tool from BoringSSL. The table below shows the most interesting algorithms (for results of all algorithms see Appendix C).

Operation	Reduced	Default (-O2)
RSA 2048 signing	(59.5 ops/sec)	(108.1 ops/sec)
RSA 2048 verify	(2377.5 ops/sec)	(4078.3 ops/sec)
RSA 4096 signing	(8.3 ops/sec)	(14.9 ops/sec)
RSA 4096 verify	(668.0 ops/sec)	(1088.4 ops/sec)
AES-128-GCM (1350 bytes)	(17675.4 ops/sec): 23.9 MB/s	(291430.3 ops/sec): 393.4 MB/s
AES-256-GCM (1350 bytes)	(14792.6 ops/sec): 20.0 MB/s	(254718.5 ops/sec): 343.9 MB/s
ChaCha20-Poly1305 (1350 bytes)	(33108.8 ops/sec): 44.7 MB/s	(67622.8 ops/sec): 91.3 MB/s
SHA-256 (8192 bytes)	(6824.7 ops/sec): 55.9 MB/s	(63214.7 ops/sec): 517.9 MB/s
SHA-512 (8192 bytes)	(14014.7 ops/sec): 114.8 MB/s	(14759.6 ops/sec): 120.9 MB/s
RNG (8192 bytes)	(4058.7 ops/sec): 33.2 MB/s	(55705.4 ops/sec): 456.3 MB/s
ECDH P-256 operations	(594.7 ops/sec)	(642.8 ops/sec)
ECDSA P-256 signing	(1396.6 ops/sec)	(1738.5 ops/sec)
ECDSA P-256 verify	(672.1 ops/sec)	(704.2 ops/sec)

Comparing `mbedTLS` and `BoringSSL` based client

Default compilation

These results are as close as possible to the best performance we should expect on ARMv8 when using BoringSSL as a client.

Performance:

Test	mbedTLS	BoringSSL
Handshakes - RSA_2048 (x200)	0m21.69s	0m03.12s
Handshakes - ECDSA_256 (x200)	0m24.34s	0m01.38s
Write - ECDSA_256 (AES-GCM)	0m16.28s	0m03.94s
Read - ECDSA_256 (AES-GCM)	0m17.49s	0m03.92s

I could find the following reasons for the difference in performance:

BoringSSL contains support for the ARMv8 crypto extensions implemented in hardware (AES, PMULL, SHA256), which mbedTLS doesn’t support yet [6]. BoringSSL also uses vector instructions (NEON) for some algorithms; NEON can be found on both ARMv7 (optional) and ARMv8 (mandatory). The algorithms used in this test do not use NEON. However, ChaCha20-Poly1305 does use NEON, which matters because it could speed things up on ARMv7-based devices. Those devices do not offer hardware-accelerated AES, so AES is much slower on them. The ChaCha20-Poly1305 implementation is only available in BoringSSL. One more comment on hardware support - it is discovered at runtime, and BoringSSL will fall back to a software implementation (or to NEON and then software) in case the CPU doesn’t support the required extension.
The BoringSSL client supports the X25519 curve. mbedTLS, on the other hand, doesn’t support this curve in TLS (it supports it only as a primitive [10]). In the test above mbedTLS used NIST P-384. Arithmetic on X25519 is much more efficient than on P-384. It’s obviously wrong to compare two different curves - one of the tests below enforces usage of P-256.
It seems mbedTLS does more I/O - it sends more TCP packets than BoringSSL:

- exchanged TCP packets were generally bigger (for example ClientHello: 470B for mbedTLS and 213B for BoringSSL)
- mbedTLS sends “Client Key Exchange” and “Change Cipher Spec” in separate TCP packets, which is not the case for BoringSSL

According to the mbedTLS forum, every TLS message is sent using the send BIO callback. With the default implementation every message is sent separately. One could supply a custom send callback that concatenates messages and sends them as one TCP packet. Nevertheless, this wasn’t done during this analysis.
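
For illustration, such a coalescing callback could be sketched as follows. This was not implemented or measured here. The callback signatures have the same shape as the send/recv callbacks that mbedTLS accepts through mbedtls_ssl_set_bio(); everything else (`coalescing_bio`, `mem_sink` and the function names) is hypothetical. Outgoing data is collected in a buffer and written out only when the buffer is full or when the library is about to wait for data from the peer.

```c
#include <stddef.h>
#include <string.h>

typedef int (*io_send_fn)(void *ctx, const unsigned char *buf, size_t len);
typedef int (*io_recv_fn)(void *ctx, unsigned char *buf, size_t len);

/* Hypothetical wrapper state around the real (blocking) socket callbacks. */
typedef struct {
    io_send_fn raw_send;
    io_recv_fn raw_recv;
    void *raw_ctx;
    unsigned char pending[4096];
    size_t used;
} coalescing_bio;

/* Writes out everything collected so far. Returns 0 or a negative error. */
int coalescing_flush(coalescing_bio *bio)
{
    size_t off = 0;
    while (off < bio->used) {
        int n = bio->raw_send(bio->raw_ctx, bio->pending + off, bio->used - off);
        if (n <= 0) {
            /* Keep the unsent tail so the caller may retry later. */
            memmove(bio->pending, bio->pending + off, bio->used - off);
            bio->used -= off;
            return n < 0 ? n : -1;
        }
        off += (size_t)n;
    }
    bio->used = 0;
    return 0;
}

/* Send callback: collects data instead of sending it immediately. */
int coalescing_send(void *ctx, const unsigned char *buf, size_t len)
{
    coalescing_bio *bio = ctx;
    if (len > sizeof bio->pending - bio->used) {
        int rc = coalescing_flush(bio);
        if (rc < 0)
            return rc;
    }
    if (len > sizeof bio->pending)
        return bio->raw_send(bio->raw_ctx, buf, len);  /* too big to buffer */
    memcpy(bio->pending + bio->used, buf, len);
    bio->used += len;
    return (int)len;
}

/* Receive callback: flushes first, because the peer cannot answer data
 * that is still sitting in our buffer. */
int coalescing_recv(void *ctx, unsigned char *buf, size_t len)
{
    coalescing_bio *bio = ctx;
    int rc = coalescing_flush(bio);
    if (rc < 0)
        return rc;
    return bio->raw_recv(bio->raw_ctx, buf, len);
}

/* Stand-in for a socket so the sketch can be exercised without a network. */
typedef struct { int calls; size_t bytes; } mem_sink;

int mem_sink_send(void *ctx, const unsigned char *buf, size_t len)
{
    mem_sink *sink = ctx;
    (void)buf;
    sink->calls++;
    sink->bytes += len;
    return (int)len;
}

int mem_sink_recv(void *ctx, unsigned char *buf, size_t len)
{
    (void)ctx; (void)buf; (void)len;
    return 0;  /* behaves like a closed connection */
}
```

With this design, "Client Key Exchange" and "Change Cipher Spec" would leave in a single write, since nothing is sent until the client starts waiting for the server's reply. The catch is that application data written without a following read stays in the buffer until coalescing_flush() is called explicitly.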

The following two tests build the libraries and TLS clients with different profiles, to eliminate as many of the differences described above as possible.

Software implementation only

For this test I’ve built BoringSSL client which uses only crypto implemented in software and doesn’t use hardware acceleration. Those results should help to understand how BoringSSL will behave on CPUs which don’t provide such features.

Test	mbedTLS	BoringSSL
Handshakes - RSA_2048 (x200)	0m20.89s	0m04.89s
Handshakes - ECDSA_256 (x200)	0m23.80s	0m01.66s
Write - ECDSA_256 (AES-GCM)	0m16.41s	0m13.79s
Read - ECDSA_256 (AES-GCM)	0m17.51s	0m13.72s

Ok, so mostly symmetric encryption is affected.

Enforcing usage of NIST P-256 curve for ECDHE

This test enforces usage of the NIST P-256 curve. This mostly affects handshake time and eliminates some of the differences seen in the first performance test.

Test	mbedTLS	BoringSSL
Handshakes - RSA_2048 (x200)	0m16.88s	0m03.56s
Handshakes - ECDSA_256 (x200)	0m20.26s	0m01.89s

Other things

BoringSSL seems to be the better choice, so let’s see what else it offers.

Using EdDSA with X25519 for ECDHE

Along the way I found out that BoringSSL offers the possibility to use Ed25519 with TLSv1.2. The results below show the differences when performing 500 handshakes with Ed25519, ECDSA/P-256 and RSA/2048. The CA certificate is still RSA/2048 (the same as used for the other tests).

Performance:

Handshake x500	TLS handsh.	PubKey	Sign size	Degradation
Handshakes - Ed25519	0m02.72s	256 bits	512 bits	(baseline)
Handshakes - ECDSA	0m03.47s	256 bits	512 bits	27.6%
Handshakes - RSA	0m07.83s	2048 bits	2048 bits	187.9%
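
The Degradation column is easiest to read as "how much longer than the Ed25519 baseline", i.e. (t - t_base) / t_base. That reading is my assumption about how the column was computed. A small sketch of the calculation, including a parser for the time format used in the tables (`parse_elapsed` and `slowdown_percent` are hypothetical helper names):

```c
#include <stdio.h>

/* Parses a time such as "0m21.69s" into seconds.
 * Returns -1.0 if the text does not have that form. */
double parse_elapsed(const char *text)
{
    int minutes;
    double seconds;
    char unit;

    if (sscanf(text, "%dm%lf%c", &minutes, &seconds, &unit) != 3 || unit != 's')
        return -1.0;
    return minutes * 60.0 + seconds;
}

/* Relative increase of `measured` over `baseline`, in percent. */
double slowdown_percent(double baseline, double measured)
{
    return (measured - baseline) / baseline * 100.0;
}
```

For example, slowdown_percent(parse_elapsed("0m02.72s"), parse_elapsed("0m03.47s")) gives the ECDSA figure relative to Ed25519.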

It’s worth noticing that Ed25519 and ECDSA offer same security level and RSA/2048 is a bit weaker. Nevertheless, Ed25519 certificates are not yet very popular.

TLS 1.3 & 0-RTT

Only BoringSSL supports TLS 1.3; at the moment it implements the latest draft of the standard (draft 28). The gains from using TLS v1.3 (and 0-RTT) are well described in [13].

Out of scope / left for further analysis:

A few points that were not checked:

32-bit (armeabi-v7a) code may be smaller and will still run on ARM64. Thumb mode (a variable-length instruction set) will produce even more compact code. Thumb mode is the default setting in the NDK
Something I haven’t checked is power consumption, which is important for a mobile application. It’s not complicated but requires specific hardware (see [18] and [19]). I assumed that whatever executes in less time will consume less power. This assumption should be verified, as it’s probably not always true.
Implementing hardware acceleration in mbedTLS is an obvious improvement which should be considered. See here for more details. It is also a highly time-consuming task.
mbedTLS supports so-called “alternative” implementations. One idea for using them would be to swap the existing ECC implementation for either a smaller or a faster one (for a smaller implementation I would recommend uECC, which can be as small as 4KB). Another option could be to use the small SSL implementation from mbedTLS and a fast crypto implementation from BoringSSL or NaCl [17].
mbedTLS has a configuration option called MBEDTLS_SSL_MAX_CONTENT_LEN which determines the size of the internal I/O buffers. Playing with this value may help improve performance or reduce size.
Performance of ChaCha20-Poly1305 on ARMv7

Conclusion

My preference goes to BoringSSL for the following reasons:

It offers much better performance on ARM
It offers more features, like TLSv1.3 and Curve25519
It compiles to a binary of reasonable size. The smallest possible resulting library is 3 times bigger than the one based on mbedTLS, but the overall result is still just 350KB. The difference between the smallest possible mbedTLS-based client and the BoringSSL one is just 248KB. Let’s say the library is linked into each and every application on the phone. Assuming the user has 100 apps on the phone, the difference in size is 24MB, which nowadays is negligible. Also, according to a report by Statista [14], on average users have 27 apps installed on the phone (which is less an argument and more an interesting piece of information).
BoringSSL is the default TLS library on Android and is a Google product. It means that there is a lot of interest in making it even more secure and fast.
Recently BoringSSL received formally verified implementations of Curve25519 and P-256 (see [15])

It seems the two libraries have very different design goals. mbedTLS is made for resource-constrained embedded systems, which face challenges in terms of memory availability. Embedded platforms often do not exceed 256KB of RAM, often don’t have memory management units and cannot support virtual memory; as a result dynamic allocation is avoided. I believe that for such systems mbedTLS is unbeatable and a great choice.

BoringSSL doesn’t seem to have a similar design goal. It seems to be designed for devices which offer more RAM and storage space and in general have a very different profile than resource-constrained embedded systems. Mobile devices offer all those features and it would be a huge mistake not to make use of them.

When thinking about software design, there is a great difference between aiming for “reasonably small” and “smallest possible binary size” - those are basically two different goals.

Finally

I would like to thank Ron E. from the mbedTLS team for answering all my questions.

UPDATE: Recently one of my co-workers implemented a performance improvement for ARM64. It is a small change which gives a good speedup - see more details [here](https://github.com/ARMmbed/mbedtls/pull/1964).

Footnotes

[0] Android NDK: reducing binary sizes: https://blog.algolia.com/android-ndk-how-to-reduce-libs-size/
[1] to check: https://stackoverflow.com/questions/6771905/how-to-decrease-the-size-of-generated-binaries
[2] C/C++ reducing size http://ptspts.blogspot.co.uk/2013/12/how-to-make-smaller-c-and-c-binaries.html
[3] “Link time optimization” in https://www.iecc.com/linker/linker11.html
[4] LTO GCC: https://gcc.gnu.org/onlinedocs/gccint/LTO-Overview.html
[5] LTO LLVM: https://llvm.org/docs/LinkTimeOptimization.html
[6] https://github.com/ARMmbed/mbedtls/pull/1424
[7] “Link time garbage collection” in https://www.iecc.com/linker/linker11.html
[8] https://github.com/android-ndk/ndk/issues/436
[9] https://tls.mbed.org/kb/how-to/reduce-mbedtls-memory-and-storage-footprint
[10] https://github.com/ARMmbed/mbedtls/issues/941
[11] https://wiki.linaro.org/WorkingGroups/PowerManagement/Resources/EAS
[12] https://boringssl.googlesource.com/boringssl/+/HEAD/BUILDING.md
[13] https://blog.cloudflare.com/introducing-0-rtt/
[14] https://www.apptentive.com/blog/2017/06/22/how-many-mobile-apps-are-actually-used/
[15] https://boringssl.googlesource.com/boringssl/+/HEAD/third_party/fiat/
[16] https://tls.mbed.org/discussions/feature-request/any-plans-for-tls-1-3-support
[17] https://eprint.iacr.org/2018/354/20180418:202819
[18] https://source.android.com/devices/tech/power/component
[19] https://developer.arm.com/products/software-development-tools/ds-5-development-studio/streamline/arm-energy-probe