## Goals and assumptions

The goal is to choose the most suitable TLS library that can be statically linked with an application. The application will run on a modern mobile operating system and a variety of ARM CPUs. I’m interested in the client side of TLS only. The ideal library ensures the best security, implements algorithms optimized for speed, and compiles to a reasonably small binary. Additionally, I assume I control both sides of the connection, meaning I’m free to choose the cipher(s) used for both symmetric and asymmetric encryption (without using PSK). I also have some requirements regarding licences and the library being open source.

I’ve identified two libraries which seem to meet those requirements:

• mbedTLS - a library formerly known as PolarSSL. It makes it fairly easy for developers to include cryptographic and TLS capabilities in embedded products. It is highly configurable, so TLS functionality can be added with a very small code footprint. It is currently maintained by ARM.
• BoringSSL - a fork of OpenSSL maintained and used by Google. It is the default TLS library in Android (starting from version M) and Chrome, and it is also used on Cloudflare systems. It has the advantage of originating from OpenSSL, which means the code has received a lot of review and testing.

## Testing application

The test application is a simple C program which compiles and links against the library under test and runs on an ARMv8 platform running Android. The app is composed of a client and a server. As I’m only interested in the client side of TLS, the server always uses the same library (it’s based on BoringSSL). The server is configured to support only TLSv1.2 (as 1.3 is not yet supported by mbedTLS [16]). To start the server, the user provides an argument which specifies the certificate type to be used (RSA, ECDSA or EdDSA based). Once running, it always enforces the same cipher suite - for example, in the case of RSA it will be ECDHE key agreement with an RSA signature and AES-256 in GCM AEAD mode.

The client application is the one I want to benchmark. I have implemented one client which uses the mbedTLS API and links with that library, and a similar one for BoringSSL. The client always establishes a TCP connection in blocking mode (for simplicity). It implements 3 different tests:

• Handshake: during this test the client opens a TCP connection and performs many handshakes without closing the connection. Performance of this test depends on the key type used for certificate signing and on the algorithm used to agree on the symmetric key (as well as the elliptic curve used), hence this test is performed multiple times, once for each certificate type
• Write: the client opens a TCP connection and sends a few hundred megabytes of data. This test is done mostly to assess the performance of symmetric encryption
• Read: the client opens a TCP connection and sends a request to the server, which sends back a few hundred megabytes of data. This test is done mostly to assess the performance of symmetric decryption

### Details regarding testing environment

• Software version

| Library | Commit |
|---|---|
| BoringSSL | eb7c3008 |
| mbedTLS | 4ca9a457 |

• Compiler and environment settings

| Name | Setting |
|---|---|
| Compiler | aarch64-linux-android-clang5.0 (as Google is deprecating gcc) |
| ABI | arm64-v8a |
| NDK version | 16b |
| Android Native API level | 27 |
| Android Build type | Release |

• Testing platform

The hardware platform used for testing is a HiKey620 development board. It is powered by the Kirin 620 SoC (8 x ARM Cortex-A53) from HiSilicon. It is running Android 8 from AOSP (see build details in Appendix B here). Details about the board can be found here and here.

Details of the environment used:

    Linux localhost 4.9.29-g23875fc #1 SMP PREEMPT Tue Jul 4 14:25:00 CEST 2017 aarch64
    Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32


## Preparation step

The following script is used to set up the platform for benchmarking. The most important step is to fix the CPU frequency so that it is not auto-regulated by things like EAS [11].

    # Number of CPUs on the board
    NUM_CPU=8
    # CPU scaling governor
    GOVERNOR=userspace
    # Requested CPU frequency
    MAX_FREQ=1200000

    # Prevent system from suspending
    adb shell "echo temporary > /sys/power/wake_lock"
    # Probably useful only on qcom, but anyway...

    # Bring every core online and pin it to the requested frequency
    for ID in $(seq 0 $((NUM_CPU-1)))
    do
        adb shell "echo 1 > /sys/devices/system/cpu/cpu${ID}/online"
        adb shell "echo ${GOVERNOR} > /sys/devices/system/cpu/cpu${ID}/cpufreq/scaling_governor"
        adb shell "echo ${MAX_FREQ} > /sys/devices/system/cpu/cpu${ID}/cpufreq/scaling_setspeed"
    done

    # Read the settings back to verify they were applied
    for ID in $(seq 0 $((NUM_CPU-1)))
    do
        adb shell "cat /sys/devices/system/cpu/cpu${ID}/online"
        adb shell "cat /sys/devices/system/cpu/cpu${ID}/cpufreq/scaling_governor"
        adb shell "cat /sys/devices/system/cpu/cpu${ID}/cpufreq/scaling_cur_freq"
    done

## Binary size reduction

### mbedTLS

mbedTLS makes it possible to select the features of the TLS library at compile time. A configuration template is available in the config.h file, and features are enabled or disabled by defining or undefining a number of preprocessor symbols (look for MBEDTLS_CONFIG_FILE for more details). This is an easy way for developers to include cryptographic and (optionally) SSL/TLS capabilities in their products with a minimal code footprint. It is an interesting feature for memory-constrained devices (for example microcontrollers).

Compiling mbedTLS produces 3 separate libraries - the crypto, ssl and x509 libraries. The build also outputs a number of test binaries.

As a first step I applied the obvious size optimizations provided by the compiler (-Os) and stripped all symbols (they can be stored in a separate file if needed). I also passed the -ffunction-sections and -fdata-sections options to the compiler, which places each function or data item into its own section. Then, thanks to -Wl,--gc-sections, the linker can discard the sections that are never referenced, which makes the resulting binary much smaller (add -Wl,--print-gc-sections to see which sections were removed). This optimization may produce unexpected results, so I strongly advise reading the documentation and getting familiar with its details.
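The flags from this first step can be collected as in the sketch below. The variable names and the commented-out build commands are illustrative assumptions (file names are placeholders), not the exact commands used for the measurements.

```sh
# Compiler flags: optimize for size, one section per function / data item
CFLAGS="-Os -ffunction-sections -fdata-sections"
# Linker flags: discard sections that nothing references
LDFLAGS="-Wl,--gc-sections"

echo "CFLAGS=${CFLAGS}"
echo "LDFLAGS=${LDFLAGS}"

# Hypothetical usage (client.c and the output name are placeholders):
#   clang ${CFLAGS} -c client.c -o client.o
#   clang ${LDFLAGS} client.o -lmbedtls -lmbedx509 -lmbedcrypto -o client
#   strip client
```

Adding -Wl,--print-gc-sections to LDFLAGS lists the discarded sections, which is a quick way to check that the section splitting actually took effect.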

In a second step I changed the config.h file and removed the capabilities which are not needed by our client application, leaving only the following:

• TLS client side code
• TLS v1.2
• AES-GCM used as a symmetric cipher
• RSA, ECDSA and ECDH with curve P-256
• SHA-256 and SHA-512
• Code for key pre-sharing has been removed
• Some additional features required by the client code

In a third step I removed support for RSA. On the one hand it isn’t actually necessary, as I control both sides of the connection, and on the other hand it’s interesting to see how much the binary size gets reduced.

| Step | Optimization | libmbedx509.a | libmbedtls.a | libmbedcrypto.a | Test app |
|---|---|---|---|---|---|
| 0 | Initial size (with -O2) | 96K | 260K | 604K | 464K |
| 1 | Removal of data and function sections, strip, -Os | 68K | 132K | 380K | 272K |
| 2 | Disabling not needed capabilities | 52K | 52K | 236K | 128K |
| 3 | Disabling RSA | 40K | 48K | 208K | 108K |

The test client has shrunk more than 4 times, to a very small size indeed. Further reductions are possible (see [8] for ideas); nevertheless at this point I’m satisfied with the size and I don’t think it is possible to change it much. Also, removing RSA reduces the size of the test app by only 20KB, so I’ve decided to keep RSA and pay a little penalty.
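The figures quoted here follow directly from the "Test app" column of the table above. A small sketch of the arithmetic (the variable names are mine):

```sh
# Sizes of the mbedTLS test app in KB, taken from the table above
INITIAL=464      # step 0
NO_UNUSED=128    # step 2
NO_RSA=108       # step 3

# Overall reduction factor (step 0 versus step 3)
FACTOR=$(awk -v a="$INITIAL" -v b="$NO_RSA" 'BEGIN { printf "%.2f", a / b }')
# Additional saving from dropping RSA (step 2 versus step 3)
RSA_SAVING=$((NO_UNUSED - NO_RSA))

echo "reduction factor: ${FACTOR}x"
echo "saved by removing RSA: ${RSA_SAVING} KB"
```

This gives a factor of about 4.30 overall, with RSA accounting for 20KB of the test app.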

It’s also worth noting that 48KB for a TLSv1.2 implementation is a really small footprint. This is very interesting for small devices which implement most of the needed crypto in hardware.

### BoringSSL

I ran a similar experiment with BoringSSL. This library doesn’t offer as many configuration possibilities as mbedTLS, but it does provide some.

In the first step I applied exactly the same compiler flags as for mbedTLS (-Os, symbol stripping, separate data and function sections).

In a second step I applied the OPENSSL_SMALL=1 configuration option. This tells the build to use algorithm implementations which are optimized for size rather than for speed (see [12] for more details).

In a third step I tried removing the assembly implementations. For some algorithms this causes a huge performance degradation, because some optimizations are written in assembly and hardware acceleration needs “glue” code which is also written in assembly. Nevertheless, it is an interesting step when comparing against mbedTLS, which currently doesn’t have any such optimizations (see [6]).
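Steps 2 and 3 map onto CMake options when configuring BoringSSL. The sketch below only assembles and prints a possible command line. It is an assumption based on the OPENSSL_SMALL and OPENSSL_NO_ASM options in BoringSSL’s build files and on the NDK’s CMake toolchain file, not the exact invocation used for the measurements; the NDK path and the NO_ASM switch variable are placeholders.

```sh
# Placeholder NDK location (hypothetical path, adjust to your setup)
ANDROID_NDK="${ANDROID_NDK:-/path/to/android-ndk-r16b}"

# Target settings matching the environment table (arm64-v8a, API level 27)
CMAKE_ARGS="-DANDROID_ABI=arm64-v8a"
CMAKE_ARGS="${CMAKE_ARGS} -DANDROID_NATIVE_API_LEVEL=27"
CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_TOOLCHAIN_FILE=${ANDROID_NDK}/build/cmake/android.toolchain.cmake"
CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_BUILD_TYPE=Release"

# Step 2: prefer size-optimized implementations
CMAKE_ARGS="${CMAKE_ARGS} -DOPENSSL_SMALL=1"

# Step 3 (optional, costs a lot of performance): drop the assembly code
if [ "${NO_ASM:-0}" = "1" ]; then
    CMAKE_ARGS="${CMAKE_ARGS} -DOPENSSL_NO_ASM=1"
fi

# Print the command instead of running it
echo "cmake ${CMAKE_ARGS} .."
```

Keeping step 3 behind a switch makes it easy to build both variants and compare size against performance.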

BoringSSL provides the concept of crypto buffers, which can be used instead of some functions from the memory-hungry X509 and ASN.1 implementation. This feature, together with separate data and function sections (done in the first step), greatly reduces binary size. I used it in step 4. In step 5 I go a bit further - the need for X509 and ASN.1 can be completely removed, assuming the user provides a custom certificate verification function. My client doesn’t implement such a function, but it shouldn’t be very complicated to write, and its code size won’t change the final binary size much. Hence it’s interesting to see the result of this size reduction.

In the last step I tried more aggressive changes: commenting out (with preprocessor symbols) the RSA and DH implementations.

| Step | Optimization | libcrypto.a | libssl.a | Test app |
|---|---|---|---|---|
| 0 | Initial size | 12.4M | 7.6M | 7.5M |
| 1 | Removal of data and function sections, strip, -Os | 1244K | 356K | 796K |
| 2 | OPENSSL_SMALL=1 | 1200K | 356K | 756K |
| 3 | OPENSSL_NO_ASM | 1184K | 356K | 736K |
| 4 | BoringSSL crypto buffers | 1184K | 356K | 700K |
| 5 | Complete elimination of X.509 and ASN.1 code | 1184K | 356K | 392K |
| 6 | Disabling RSA and DH | 1144K | 352K | 356K |

I’m positively surprised that it is possible to remove the X509 and ASN.1 code; it gives you a really small library. At the moment I don’t want to implement my own certificate verification function, and I want to perform certificate verification during performance benchmarking. But it’s worth noting that at fairly small cost the BoringSSL-based binary can be reduced almost by half, to a size that’s a bit more than 3 times bigger than the one produced with mbedTLS, which is quite interesting.

Removing the assembly hits performance a lot, so I will keep it. Removing RSA and DH gives a 36KB smaller binary, but it introduces a very high maintenance cost - it will be hard and error-prone to re-apply those changes after updating the library to a newer version. As a side note, OpenSSL has a switch which removes RSA (OPENSSL_NO_RSA); FWIW, this code could perhaps be ported to BoringSSL.

Finally, for further analysis I’ll apply steps 1, 2 and 4 (and, again, I’d encourage applying step 5 as well).

## Notes on size reduction

• Something that wasn’t tried is Link Time Optimization, which may produce a smaller binary (see [3], [4] and [5]).
• It might be interesting to see how the results differ when using this feature instead of, or together with, separate data and function sections
• I’ve also calculated the size of the shared libraries for BoringSSL - libcrypto.so: 1072KB; libssl.so: 276KB
• mbedTLS doesn’t implement hardware acceleration, so its performance won’t be as good as BoringSSL’s. I wonder if it would make sense to take the extremely small SSL implementation from mbedTLS and use the crypto from BoringSSL.

## Performance comparison

### Results from tools provided by the libraries

Both libraries provide tools for benchmarking. This subsection compares the results reported by those tools. I compare the default compilation against the binary I got after applying the tricks which reduce the size of the client application.

#### mbedTLS: default vs reduced

mbedTLS provides a tool for performance benchmarking called benchmark. The table below shows results for the most interesting algorithms (for results of all algorithms see Appendix C here).

| Algorithm | Reduced | Default (-O2) |
|---|---|---|
| SHA-256 | 46809 KiB/s | 52044 KiB/s |
| AES-GCM-128 | 16399 KiB/s | 16398 KiB/s |
| AES-GCM-256 | 14287 KiB/s | 14286 KiB/s |
| RSA-2048 | 652 public/s | 653 public/s |
| RSA-2048 | 17 private/s | 17 private/s |
| RSA-4096 | 168 public/s | 168 public/s |
| RSA-4096 | 3 private/s | 3 private/s |
| ECDSA-secp256r1 | 189 sign/s | 195 sign/s |
| ECDHE-secp256r1 | 57 handshake/s | 60 handshake/s |
| ECDH-secp256r1 | 77 handshake/s | 81 handshake/s |
| ECDHE-Curve25519 | 41 handshake/s | 41 handshake/s |
| ECDH-Curve25519 | 80 handshake/s | 82 handshake/s |

One thing to notice is that (for the algorithms above) there is not much difference between -Os and -O2, as -Os enables all -O2 optimizations that do not typically increase code size. It’s also worth noticing the performance difference between static and ephemeral ECDH. It looks quite odd, and the root cause should probably be studied further.

#### BoringSSL: default vs reduced

Performance results are provided by the bssl speed tool from BoringSSL. The table below shows the most interesting algorithms (for results of all algorithms see Appendix C here).

| Operation | Reduced | Default (-O2) |
|---|---|---|
| RSA 2048 signing | 59.5 ops/sec | 108.1 ops/sec |
| RSA 2048 verify | 2377.5 ops/sec | 4078.3 ops/sec |
| RSA 4096 signing | 8.3 ops/sec | 14.9 ops/sec |
| RSA 4096 verify | 668.0 ops/sec | 1088.4 ops/sec |
| AES-128-GCM (1350 bytes) | 17675.4 ops/sec, 23.9 MB/s | 291430.3 ops/sec, 393.4 MB/s |
| AES-256-GCM (1350 bytes) | 14792.6 ops/sec, 20.0 MB/s | 254718.5 ops/sec, 343.9 MB/s |
| ChaCha20-Poly1305 (1350 bytes) | 33108.8 ops/sec, 44.7 MB/s | 67622.8 ops/sec, 91.3 MB/s |
| SHA-256 (8192 bytes) | 6824.7 ops/sec, 55.9 MB/s | 63214.7 ops/sec, 517.9 MB/s |
| SHA-512 (8192 bytes) | 14014.7 ops/sec, 114.8 MB/s | 14759.6 ops/sec, 120.9 MB/s |
| RNG (8192 bytes) | 4058.7 ops/sec, 33.2 MB/s | 55705.4 ops/sec, 456.3 MB/s |
| ECDH P-256 operations | 594.7 ops/sec | 642.8 ops/sec |
| ECDSA P-256 signing | 1396.6 ops/sec | 1738.5 ops/sec |
| ECDSA P-256 verify | 672.1 ops/sec | 704.2 ops/sec |

### Comparing mbedTLS and BoringSSL based client

#### Default compilation

These results are as close as we can expect to get to the best possible performance on ARMv8 when using BoringSSL as a client.

• Performance:

| Test | mbedTLS | BoringSSL |
|---|---|---|
| Handshakes - RSA_2048 (x200) | 0m21.69s | 0m03.12s |
| Handshakes - ECDSA_256 (x200) | 0m24.34s | 0m01.38s |
| Write - ECDSA_256 (AES-GCM) | 0m16.28s | 0m03.94s |
| Read - ECDSA_256 (AES-GCM) | 0m17.49s | 0m03.92s |

I could find the following reasons for the difference in performance:

1. BoringSSL supports the ARMv8 crypto extensions implemented in hardware (AES, PMULL, SHA256), which mbedTLS doesn’t support yet [6]. BoringSSL also uses vector instructions (NEON) for some algorithms; NEON can be found on both ARMv7 (optional) and ARMv8 (mandatory). The algorithms used in this test do not use NEON. However, ChaCha20-Poly1305 does use NEON, which matters for ARMv7-based devices: they do not offer hardware-accelerated AES, so AES is much slower on them. Of the two libraries, only BoringSSL provides a ChaCha20-Poly1305 implementation. One more comment on hardware support - it is detected at runtime, and BoringSSL falls back to a software implementation (or to NEON and then software) if the CPU doesn’t support the required extension.

2. The BoringSSL client supports the X25519 curve. mbedTLS, on the other hand, doesn’t support this curve in TLS (it supports it only as a primitive [10]). In the test above mbedTLS used NIST P-384. Arithmetic on X25519 is much more efficient than on P-384. It’s obviously wrong to compare two different curves - one of the tests below enforces usage of P-256.

3. It seems mbedTLS does more I/O - it sends more TCP packets than BoringSSL:

• the exchanged TCP packets were generally bigger (for example ClientHello: 470B for mbedTLS and 213B for BoringSSL)
• mbedTLS sends “Client Key Exchange” and “Change Cipher Spec” in separate TCP packets, which is not the case for BoringSSL

According to the mbedTLS forum, every TLS message is sent using the send BIO callback, and with the default implementation every message is sent separately. One could supply a custom send callback that concatenates messages where possible and sends them as one TCP packet. Nevertheless, this wasn’t done during this analysis.

The following two tests build the libraries and TLS clients with different profiles, hopefully eliminating as many of the differences described above as possible.

• Software implementation only

For this test I built a BoringSSL client which uses only crypto implemented in software and doesn’t use hardware acceleration. These results should help to understand how BoringSSL will behave on CPUs which don’t provide such features.

| Test | mbedTLS | BoringSSL |
|---|---|---|
| Handshakes - RSA_2048 (x200) | 0m20.89s | 0m04.89s |
| Handshakes - ECDSA_256 (x200) | 0m23.80s | 0m01.66s |
| Write - ECDSA_256 (AES-GCM) | 0m16.41s | 0m13.79s |
| Read - ECDSA_256 (AES-GCM) | 0m17.51s | 0m13.72s |

Ok, so mostly symmetric encryption is affected.

• Enforcing usage of NIST P-256 curve for ECDHE

This test enforces usage of the NIST P-256 curve. This mostly affects handshake time and eliminates some of the differences seen in the first performance test.

| Test | mbedTLS | BoringSSL |
|---|---|---|
| Handshakes - RSA_2048 (x200) | 0m16.88s | 0m03.56s |
| Handshakes - ECDSA_256 (x200) | 0m20.26s | 0m01.89s |

### Other things

BoringSSL seems to be the better choice, so let’s see what else it offers.

#### Using EdDSA with X25519 for ECDHE

Along the way I found out that BoringSSL makes it possible to use Ed25519 with TLSv1.2. The results below show the differences in performing 500 handshakes with Ed25519, ECDSA/P-256 and RSA/2048. The CA certificate is still RSA/2048 (the same as used for the other tests).

• Performance:

| Handshakes (x500) | TLS handshake time | Public key size | Signature size | Degradation |
|---|---|---|---|---|
| Handshakes - Ed25519 | 0m02.72s | 256 bits | 512 bits | baseline |
| Handshakes - ECDSA | 0m03.47s | 256 bits | 512 bits | 27.6% |
| Handshakes - RSA | 0m07.83s | 2048 bits | 2048 bits | 187.9% |

It’s worth noticing that Ed25519 and ECDSA/P-256 offer the same security level, while RSA/2048 is a bit weaker. Nevertheless, Ed25519 certificates are not yet very popular.
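The degradation column can be recomputed from the handshake times as the slowdown relative to Ed25519, (t - t_ref) / t_ref. A small sketch of that calculation (the function name is mine):

```sh
# Time for 500 handshakes, in seconds (from the table above)
ED25519=2.72
ECDSA=3.47
RSA=7.83

# Relative slowdown versus a reference time, in percent
slowdown() {
    awk -v t="$1" -v ref="$2" 'BEGIN { printf "%.1f", (t - ref) / ref * 100 }'
}

ECDSA_PCT=$(slowdown "$ECDSA" "$ED25519")
RSA_PCT=$(slowdown "$RSA" "$ED25519")

echo "ECDSA: +${ECDSA_PCT}%"
echo "RSA:   +${RSA_PCT}%"
```

With this formula ECDSA comes out at 27.6% and RSA at 187.9%; put differently, the RSA handshakes take about 2.88 times as long as the Ed25519 ones.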

#### TLS 1.3 & 0-RTT

Only BoringSSL supports TLS 1.3; at the moment it implements the latest draft of the standard (draft 28). The gains from using TLS v1.3 (and 0-RTT) are well described in [13].

## Out of scope / left for further analysis:

A few points that were not checked:

• 32-bit (armeabi-v7a) code may be smaller and will still run on ARM64. Thumb mode (a variable-length instruction set) will produce even more compact code. Thumb mode is the default setting in the NDK
• Something I haven’t checked is power consumption, which is important for a mobile application. It’s not complicated but requires specific hardware (see [18] and [19]). I assumed that whatever executes in less time will consume less power. But this assumption should be verified, as it’s probably not always true.
• Implementing hardware acceleration in mbedTLS is an obvious improvement which should be considered. See here for more details. It is also a highly time-consuming task.
• mbedTLS supports so-called “alternative” implementations. One idea would be to swap the existing ECC implementation for either a smaller or a faster one (for a smaller implementation I would recommend uECC, which can be as small as 4KB). Another option could be to use the small SSL implementation from mbedTLS and a fast crypto implementation from BoringSSL or NaCl [17].
• mbedTLS has a configuration option called MBEDTLS_SSL_MAX_CONTENT_LEN which determines the size of the internal I/O buffer. Playing with this value may help improve performance or reduce size.
• Performance of ChaCha20-Poly1305 on ARMv7

## Conclusion

My preference goes to BoringSSL for the following reasons:

• It offers much better performance on ARM
• It offers more features like TLSv1.3 and Curve25519
• It compiles to a binary of reasonable size. The smallest possible resulting library is 3 times bigger than the one based on mbedTLS, but the overall result is just 350KB. The difference between the smallest possible mbedTLS-based client and the BoringSSL one is just 248KB. Let’s say the library is linked into each and every application on the phone. Assuming the user has 100 apps on the phone, the difference in size is 24MB, which nowadays is negligible. Also, according to a report by Statista [14], users have on average 27 apps installed on their phone (which is less of an argument and more an interesting piece of information).
• BoringSSL is the default TLS library on Android and is a Google product. It means that there is a lot of interest in making it even more secure and fast.
• Recently BoringSSL received formally verified implementation of Curve25519 and P-256 (see [15])
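The size argument above can be checked with a few lines of arithmetic, using the smallest test app sizes from the two size tables. The 100-app figure is the same hypothetical used in the list; the variable names are mine.

```sh
BORINGSSL_KB=356   # smallest BoringSSL test app (step 6)
MBEDTLS_KB=108     # smallest mbedTLS test app (step 3)
APPS=100           # hypothetical number of apps each linking the library

# Per-app difference and total across all apps
DIFF_KB=$((BORINGSSL_KB - MBEDTLS_KB))
TOTAL_KB=$((DIFF_KB * APPS))
TOTAL_MB=$(awk -v kb="$TOTAL_KB" 'BEGIN { printf "%.1f", kb / 1024 }')

echo "difference per app: ${DIFF_KB} KB"
echo "difference for ${APPS} apps: ${TOTAL_MB} MB"
```

That is 248KB per app and roughly 24MB for 100 apps.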

It seems the two libraries have very different design goals. mbedTLS is made for resource-constrained embedded systems, which face challenges in terms of memory availability. Embedded platforms often have no more than 256KB of RAM, often don’t have a memory management unit and cannot support virtual memory; as a result dynamic allocation is avoided. I believe that for such systems mbedTLS is unbeatable and a great choice.

BoringSSL doesn’t seem to have a similar design goal. It seems to be designed for devices which offer more RAM and storage space, and which in general have a much different profile than resource-constrained embedded systems. Mobile devices offer all of that, and it would be a huge mistake not to make use of it.

When thinking about software design, there is a great difference between aiming for a “reasonably small” and the “smallest possible” binary size - those are basically two different goals.

## Finally

I would like to thank Ron E. from the mbedTLS team for answering all my questions.

UPDATE: Recently one of my co-workers implemented a performance improvement for ARM64. It is a small change which gives a good speedup - see more details [here](https://github.com/ARMmbed/mbedtls/pull/1964).