
Friday Q&A 2015-07-03: Address Sanitizer


Outside of Swift 2 news, one of the exciting announcements to come out of WWDC was that clang's Address Sanitizer is now available directly in Xcode 7. Today I'm going to discuss what it is, how it works, and how to use it, a topic suggested by Konstantin Gonikman.

A Most Precarious Situation
C is a great programming language in many ways. The fact that it's still going strong after more than four decades is a testament to that. It's not the first (or second) programming language I learned, but it is the one that first made me truly understand what was going on in these mysterious computer machines, and it's the first language I learned that I still use.

C is also a frighteningly dangerous programming language that's responsible for many woes in the world, and allows for the casual creation of bugs so ridiculous that most other languages can't even express them.

A major problem is memory safety. C has none. Code like this will compile and likely run without a problem:

    char *ptr = malloc(5);
    ptr[12] = 0;

This just allocates an array of five bytes, then writes into the 13th byte of that array, silently corrupting whatever happens to be in memory at that location. Probably nothing. (On Apple platforms, malloc always allocates at least 16 bytes even if you asked for less, so this should always work there. Do not write code that relies on this fact.) Maybe something unimportant. Maybe something important.

More sane languages keep track of array sizes and validate indexes before allowing an operation to go through. The equivalent Java code, for example, will reliably throw an exception. When you can count on this, it makes debugging mysterious problems a lot easier. For example, if a variable should contain the value 4, but actually contains the value 5, you know that some piece of code that modifies that variable is to blame. (At least until you reach that stage of debugging where you start looking carefully at the compiler.) In C, you can't assume this. It could be a piece of code that deliberately modifies that variable, or it could be a piece of code that accidentally modifies that variable by using a bad pointer or a bad index.

A whole cottage industry has sprung up to produce solutions to this problem. Clang's static analyzer, for example, can detect certain types of memory safety problems in source code. Programs like Valgrind detect unsafe memory accesses at runtime.

Address Sanitizer is another one of these solutions. It uses a new approach which has some advantages and disadvantages, but it can be a valuable tool for discovering problems in your code.

Memory Access Validation
Many of these tools work by validating memory access at runtime. The theory is that if you can validate accesses as they happen by comparing them against the memory actually allocated by the program, these bugs can be discovered as they happen, rather than being discovered through their side effects long after.

Ideally, every pointer would include data about the size and location of the overall memory region it belongs to, and each access could be validated against that. There's no particular reason a C compiler couldn't be built that does that, but the extra metadata attached to each pointer would make its programs incompatible with code compiled by normal C compilers. That means system libraries couldn't easily be used, and that would severely limit the code that could be tested with such a system.
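
To make this concrete, here's a sketch of what such a "fat pointer" might look like. The struct and its names are hypothetical, since no mainstream C compiler actually works this way:

    #include <assert.h>
    #include <stddef.h>

    // Hypothetical "fat pointer" carrying bounds metadata. Nothing
    // compiled against ordinary C pointers could exchange pointers
    // with this code, which is exactly the problem described above.
    typedef struct {
        char *base;   // start of the underlying allocation
        size_t size;  // size of the underlying allocation
        char *ptr;    // the actual pointer value
    } FatPointer;

    char FatPointerRead(FatPointer fp, size_t index) {
        // Validate the access against the recorded bounds.
        assert(fp.ptr + index >= fp.base &&
               fp.ptr + index < fp.base + fp.size);
        return fp.ptr[index];
    }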

Valgrind approaches this problem by running the entire program in an emulator. This allows it to work with binaries produced by a normal C compiler without any changes. It then analyzes the program as it runs and keeps track of the state of each chunk of memory as the program manipulates it. This allows it to work with essentially any program without modifications, and system libraries too. This comes with a huge speed penalty, which can make it impractical to run on performance-sensitive code. This approach also requires deep understanding of the semantics of each system call on the platform so that their changes to memory can be appropriately tracked, which requires tight integration with the hosting OS. As a result of that, Valgrind support on the Mac has been hit-or-miss over the years, and as of this writing it does not support 10.10.

Guard Malloc takes advantage of the CPU's built-in memory checking facilities. It replaces the standard malloc function with one that marks the memory off the end of each allocation as unreadable and unwriteable. When the program attempts to access memory off the end, the program traps predictably.

The problem with this is that hardware memory protection is relatively coarse. Memory can only be marked as readable or unreadable with page-level granularity, and a memory page on any modern system is at least 4kB in size. That means that each allocation uses at least 8kB of memory: one page for the allocation itself, and one forbidden page off the end, even if the allocation is just a few bytes. It also means that small overruns may not be detected. Memory needs to be allocated on a 16-byte boundary in order to preserve the guarantees of standard malloc, so any allocation which isn't a multiple of 16 bytes will have a few bytes off the end which aren't marked as forbidden.
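
Here's a rough sketch of a guard allocator along these lines. It's the general idea only, not Guard Malloc's actual implementation, and it assumes 4kB pages:

    #include <stddef.h>
    #include <sys/mman.h>

    void *GuardAllocate(size_t size) {
        size_t pageSize = 4096; // assumed for illustration
        // Round up to a 16-byte boundary to preserve malloc's
        // alignment guarantees.
        size_t rounded = (size + 15) & ~(size_t)15;
        size_t dataPages = (rounded + pageSize - 1) / pageSize;
        char *region = mmap(NULL, (dataPages + 1) * pageSize,
                            PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANON, -1, 0);
        if (region == MAP_FAILED) return NULL;
        // Forbid all access to the trailing guard page.
        mprotect(region + dataPages * pageSize, pageSize, PROT_NONE);
        // Place the allocation so it ends right at the guard page.
        // Any slack bytes from the 16-byte rounding sit off the end,
        // unprotected: those are the overruns this scheme misses.
        return region + dataPages * pageSize - rounded;
    }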

Address Sanitizer attempts to make this concept of forbidden memory more granular. It's essentially a slower but more practical, software-based version of what Guard Malloc does.

Tracking Forbidden Memory
If hardware memory protection can't be used, then it must be tracked in software. Since extra data can't be passed around with a pointer, it must be tracked in some sort of global table. This table needs to be fast to read and fast to modify.

Address Sanitizer uses a simple but brilliant approach. It reserves a fixed section within the process's address space called the shadow memory. In Address Sanitizer terms, a byte that is marked as forbidden is "poisoned," and the shadow memory tracks which bytes are poisoned. A simple formula translates each address within the process's address space into a spot in the shadow memory. Each eight-byte chunk of regular memory maps to a byte of shadow memory, which tracks the poison state of those eight bytes.

Since eight bytes of memory map to eight bits of shadow memory, it would be natural to think that each byte's poison state is tracked by one bit in the shadow memory. However, Address Sanitizer actually keeps a single integer in the shadow memory byte. It's assumed that all poisoned memory within an eight-byte chunk is contiguous and at the end, so the shadow byte contains the number of unpoisoned bytes within the chunk. A value of 0 indicates that the whole chunk is unpoisoned, 1 indicates that only the first byte is unpoisoned (the last seven are poisoned), 2 indicates that only the first two bytes are unpoisoned, and so on up to 7, which indicates that only the last byte is poisoned. For the case where all eight bytes are poisoned, the value is negative. This allows for simpler computations when checking accesses against the shadow memory. Allocations are never that close together to begin with, so the assumption that poisoned bytes are contiguous and at the end doesn't cause any trouble.
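
Here's a sketch of the check that the instrumentation performs before a one-byte access, following the encoding just described. The shadow offset is a platform-specific constant chosen by the tool; the value below is only a placeholder:

    #include <stdint.h>

    #define kShadowOffset 0x100000000000ULL // hypothetical placeholder

    int AccessIsPoisoned(uintptr_t addr) {
        // Each eight-byte chunk of memory maps to one shadow byte.
        int8_t shadowValue = *(int8_t *)((addr >> 3) + kShadowOffset);
        if (shadowValue == 0)
            return 0; // the whole chunk is unpoisoned
        // Otherwise the shadow byte holds the number of unpoisoned
        // leading bytes; negative means the whole chunk is poisoned.
        return (int8_t)(addr & 7) >= shadowValue;
    }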

With this table structure in place, Address Sanitizer generates extra code in the program to check every read and write through a pointer, and throw an error if the memory in question is poisoned. This is the advantage of being integrated into the compiler and not merely existing as an external library or runtime environment: every pointer access can be reliably identified and the appropriate checks added into the machine code.

Compiler integration also allows neat tricks like the ability to poison and guard local and global variables, not just heap allocations. Locals and globals are allocated with a bit of extra padding in between them, and the padding is poisoned to catch any overflows. This is something that Guard Malloc can't do, and that Valgrind has difficulty with.

Compiler integration has downsides as well. In particular, Address Sanitizer can't catch bad memory accesses in system libraries. It is compatible with system libraries, in that you can turn on Address Sanitizer, build a program that links against Cocoa (for example) and have it work, but it won't catch bad memory accesses performed by Cocoa, or performed by your code on memory allocated by Cocoa.

Address Sanitizer also helps to catch use-after-free errors. When memory is freed, it's all marked as poisoned, so subsequent accesses will be trapped. Use-after-free errors are particularly nasty when the memory is reused for a new allocation first, because then you corrupt unrelated bits of data. Address Sanitizer defends against this by placing newly freed memory into a recycling queue that keeps it unallocated for a while before it can be reused.
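
A program as simple as this will trigger a report. Built with -fsanitize=address, the access is flagged as a heap-use-after-free, complete with stack traces for the bad access, the free, and the original allocation:

    #include <stdlib.h>

    int main(int argc, char **argv) {
        char *ptr = malloc(16);
        free(ptr);
        return ptr[0]; // read after free
    }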

Adding a check for every single pointer access carries substantial overhead, of course. It depends heavily on just what your code is doing, since different types of code may access pointer contents much more or less frequently. On average, expect a roughly 2-5x slowdown. This is significant, but usually not enough to make a program unusable.

How to Use
With Xcode 7, using Address Sanitizer is simple. When compiling from the command line, add -fsanitize=address to the clang invocation. Here's a program that exercises it:

    #include <stdlib.h>

    void Write(char *ptr, size_t index, char value) {
        ptr[index] = value;
    }

    int main(int argc, char **argv) {
        char *ptr = malloc(12);
        Write(ptr, 12, 42); // writes one byte past the end of the 12-byte allocation
    }

Compiling and running with Address Sanitizer:

    $ clang -fsanitize=address test.c
    $ ./a.out

It quickly crashes and produces a bunch of output:

    ==18186==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60200000df9c at pc 0x000101025efc bp 0x7fff5ebda8a0 sp 0x7fff5ebda898
    WRITE of size 1 at 0x60200000df9c thread T0
        #0 0x101025efb in Write (/Users/mikeash/Dropbox/shell/asan/./a.out+0x100000efb)
        #1 0x101025f46 in main (/Users/mikeash/Dropbox/shell/asan/./a.out+0x100000f46)
        #2 0x7fff940025c8 in start (/usr/lib/system/libdyld.dylib+0x35c8)
        #3 0x0  (<unknown module>)

    0x60200000df9c is located 0 bytes to the right of 12-byte region [0x60200000df90,0x60200000df9c)
    allocated by thread T0 here:
        #0 0x101070960 in wrap_malloc (/Applications/Xcode-beta.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/7.0.0/lib/darwin/libclang_rt.asan_osx_dynamic.dylib+0x42960)
        #1 0x101025f2d in main (/Users/mikeash/Dropbox/shell/asan/./a.out+0x100000f2d)
        #2 0x7fff940025c8 in start (/usr/lib/system/libdyld.dylib+0x35c8)
        #3 0x0  (<unknown module>)

    SUMMARY: AddressSanitizer: heap-buffer-overflow ??:0 Write
    0x1c0400001ba0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x1c0400001bb0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x1c0400001bc0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x1c0400001bd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x1c0400001be0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    =>0x1c0400001bf0: fa fa 00[04]fa fa 00 06 fa fa 00 00 fa fa 00 04
    0x1c0400001c00: fa fa 00 06 fa fa 00 07 fa fa 00 fa fa fa 00 00
    0x1c0400001c10: fa fa 00 00 fa fa 00 00 fa fa 00 00 fa fa 00 00
    0x1c0400001c20: fa fa 00 00 fa fa 00 00 fa fa 00 00 fa fa 00 00
    0x1c0400001c30: fa fa 00 00 fa fa 00 00 fa fa 00 00 fa fa 00 00
    0x1c0400001c40: fa fa 00 00 fa fa 00 00 fa fa 00 00 fa fa 00 00
    Shadow byte legend (one shadow byte represents 8 application bytes):
    Addressable:           00
    Partially addressable: 01 02 03 04 05 06 07
    Heap left redzone:       fa
    Heap right redzone:      fb
    Freed heap region:       fd
    Stack left redzone:      f1
    Stack mid redzone:       f2
    Stack right redzone:     f3
    Stack partial redzone:   f4
    Stack after return:      f5
    Stack use after scope:   f8
    Global redzone:          f9
    Global init order:       f6
    Poisoned by user:        f7
    Container overflow:      fc
    Array cookie:            ac
    Intra object redzone:    bb
    ASan internal:           fe
    Left alloca redzone:     ca
    Right alloca redzone:    cb
    ==18186==ABORTING
    Abort trap: 6

This is a wealth of information, and in a real-world scenario it would be enormously helpful in tracking down the problem. It not only shows where the bad write occurred, but where the memory was originally allocated, and a bunch of extra data besides.

Using Address Sanitizer from within Xcode is just as easy: edit your scheme, click the Diagnostics tab, and check the box labeled "Enable Address Sanitizer." Then just build and run as usual, and watch the diagnostics roll in.

Bonus Feature: Undefined Behavior Sanitizer
Bad memory accesses are just one of the many entertaining undefined behaviors offered by C. Clang offers another sanitizer which catches many instances of undefined behavior. Here's an example program:

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int value = 1;
        for(int x = 0; x < atoi(argv[1]); x++) {
            value *= 10;
            printf("%d\n", value);
        }
    }

Let's run it:

    $ clang undefined.c
    $ ./a.out 15
    10
    100
    1000
    10000
    100000
    1000000
    10000000
    100000000
    1000000000
    1215752192
    -727379968
    1316134912
    276447232
    -1530494976

That got a little weird at the end. No surprise: signed integer overflow is undefined behavior in C. It would be great to catch that instead of just producing bad data. Undefined behavior sanitizer to the rescue! This is enabled by passing -fsanitize=undefined-trap -fsanitize-undefined-trap-on-error:

    $ clang -fsanitize=undefined-trap -fsanitize-undefined-trap-on-error undefined.c
    $ ./a.out 15
    10
    100
    1000
    10000
    100000
    1000000
    10000000
    100000000
    1000000000
    Illegal instruction: 4

This doesn't give any additional information like Address Sanitizer does, but it does stop execution right at the point where the undefined behavior occurred and the problem can easily be inspected in the debugger.

The undefined behavior sanitizer is not integrated with Xcode at the moment. You can enable it for your app by adding the above compiler flags to your project's build settings directly.

Conclusion
Address Sanitizer is a great piece of technology that can catch a lot of problematic errors in C code. It's not perfect and it won't find all errors, but even so it provides some extremely useful diagnostics. I highly recommend that you try it on your code base and see what it finds. The results might surprise you.

That's it for today. Come back next time for more gooey goodness. Friday Q&A is driven by reader suggestions, as always, so if you have a topic you'd like to see me discuss here, please send it in!


Friday Q&A 2015-07-17: When to Use Swift Structs and Classes


One of the persistent topics of discussion in the world of Swift has been the question of when to use classes and when to use structs. I thought I'd contribute my own version of things today.

Values Versus References
The answer is actually really simple: use structs when you need value semantics, and use classes when you need reference semantics. That's it!

Come back next week for....

Wait
What?

That Doesn't Answer It
What do you mean? It's right there.

Yes, But...
What?

What Are Value and Reference Semantics?
Oh, I see. Maybe I should talk about that, then.

And How They Relate to struct and class
Right.

It all comes down to data and where it's stored. We store stuff in local variables, parameters, properties, and globals. There are fundamentally two different ways to store that stuff in all these places.

With value semantics, the data exists directly in the storage location. With reference semantics, the data exists elsewhere, and the storage location stores a reference to it. This difference isn't necessarily apparent when you access the data. Where it makes itself known is when you copy the storage. With value semantics, you get a new copy of the data. With reference semantics, you get a new copy of the reference to the same data.

This is all really abstract. Let's look at an example. To remove the question of Swift from the picture for a moment, let's look at an Objective-C example:

    @interface SomeClass : NSObject
    @property int number;
    @end
    @implementation SomeClass
    @end

    struct SomeStruct {
        int number;
    };

    SomeClass *reference = [[SomeClass alloc] init];
    reference.number = 42;
    SomeClass *reference2 = reference;
    reference.number = 43;
    NSLog(@"The number in reference2 is %d", reference2.number);

    struct SomeStruct value = {};
    value.number = 42;
    struct SomeStruct value2 = value;
    value.number = 43;
    NSLog(@"The number in value2 is %d", value2.number);

This prints:

    The number in reference2 is 43
    The number in value2 is 42

Why the difference?

The code SomeClass *reference = [[SomeClass alloc] init] creates a new instance of SomeClass in memory, then puts a reference to that instance in the variable. The code reference2 = reference places a reference to that same object into the new variable. Then reference.number = 43 modifies the number stored in the object both variables now point to. The result is that when the log prints the value from the object, it prints 43.

The code struct SomeStruct value = {} creates a new instance of SomeStruct in the variable. The code value2 = value copies that instance into the second variable. Each variable contains a separate chunk of data. The code value.number = 43 only modifies the one in value, and when the log prints the number from value2 it still prints 42.

This example maps directly to Swift:

    class SomeClass {
        var number: Int = 0
    }

    struct SomeStruct {
        var number: Int = 0
    }

    var reference = SomeClass()
    reference.number = 42
    var reference2 = reference
    reference.number = 43
    print("The number in reference2 is \(reference2.number)")

    var value = SomeStruct()
    value.number = 42
    var value2 = value
    value.number = 43
    print("The number in value2 is \(value2.number)")

As before, this prints:

    The number in reference2 is 43
    The number in value2 is 42

Experience With Value Types
Value types aren't new. But for a lot of people they feel new. What's the deal?

structs aren't used that often in most Objective-C code. We occasionally touch them in the form of CGRect and CGPoint and friends, but rarely make our own. For one thing, they aren't very functional. It's really difficult to correctly store references to objects in a struct in Objective-C, especially when using ARC.

Lots of other languages don't have anything like struct at all. Many languages like Python and JavaScript where "everything is an object" just have reference types. If you've come to Swift from a language like that, the concept might be even more foreign to you.

But wait! There's one area where almost every language uses value types: numbers! The following behavior shouldn't surprise any programmer with more than a few weeks of experience, regardless of the language:

    var x = 42
    var x2 = x
    x++
    print("x=\(x) x2=\(x2)")
    // prints: x=43 x2=42

This is so obvious and natural to us that we don't even realize that it acts differently, but it's right there in front of us. You've been working with value types for as long as you've been programming, even if you didn't realize it.

Lots of languages actually implement numbers as reference types, because they're hard-core on the "everything is an object" philosophy. However, they're immutable types, and the difference between a value type and an immutable reference type is hard to detect. They act like value types act, even if they might not be implemented that way.

This is a big part of understanding value and reference types. The distinction only matters, in terms of language semantics, when mutating data. If your data is immutable, then the value/reference distinction disappears, or at least turns into a mere question of performance rather than semantics.

This even shows up in Objective-C with tagged pointers. An object stored within the pointer value, as happens with a tagged pointer, is a value type. Copying the storage copies the object. This difference isn't apparent, because the Objective-C libraries are careful to only put immutable types in tagged pointers. Some NSNumbers are reference types and some are value types, but it doesn't make a difference.

Making the Choice
Now that we know how value types work, how do you make the choice for your own data types?

The fundamental difference between the two is what happens when you use = on them. Value types get copied, and reference types just get another reference.

Thus the fundamental question to ask when deciding which one to use is: does it make sense to copy this type? Is copying an operation you want to make easy, and use often?

Let's look at some extreme, obvious examples first. Integers are obviously copyable. They should be value types. Network sockets can't be sensibly copied. They should be reference types. Points, as in x, y pairs, are copyable. They should be value types. A controller that represents a disk can't be sensibly copied. That should be a reference type.

Some types can be copied but it may not be something you want to happen all the time. This suggests that they should be reference types. For example, a button on the screen can conceptually be copied. The copy will not be quite identical to the original. A click on the copy will not activate the original. The copy will not occupy the same location on the screen. If you pass the button around or put it into a new variable you'll probably want to refer to the original button, and you'd only want to make a copy when it's explicitly requested. That means that your button type should be a reference type.

View and window controllers are a similar example. They might be copyable, conceivably, but it's almost never what you'd want to do. They should be reference types.

What about model types? You might have a User type representing a user on your system, or a Crime type representing an action taken by a User. These are pretty copyable, so they should probably be value types. However, you probably want updates to a User's Crime made in one place in your program to be visible to other parts of the program. This suggests that your Users should be managed by some sort of user controller which would be a reference type.

Collections are an interesting case. These include things like arrays and dictionaries, as well as strings. Are they copyable? Obviously. Is copying something you want to happen easily and often? That's less clear.

Most languages say "no" to this and make their collections reference types. This is true in Objective-C and Java and Python and JavaScript and almost every other language I can think of. (One major exception is C++ with STL collection types, but C++ is the raving lunatic of the language world which does everything strangely.)

Swift said "yes," which means that types like Array and Dictionary and String are structs rather than classes. They get copied on assignment, and on passing them as parameters. This is an entirely sensible choice as long as the copy is cheap, which Swift tries very hard to accomplish.

Nesting Types
There are four possible combinations when nesting value and reference types. Life gets interesting with just one of them.

If you have a reference type which contains another reference type, nothing much interesting happens. Anything which has a reference to either the inner or outer value can manipulate it, as usual. Everyone will see any changes made.

If you have a value type which contains another value type, this effectively just makes the value bigger. The inner value is part of the outer value. If you put the outer value into some new storage, it all gets copied, including the inner value. If you put the inner value into some new storage, it gets copied.

A reference type which contains a value type effectively makes the referenced value bigger. Anyone with a reference to the outer value can manipulate the whole thing, including the nested value. Changes to the nested value are visible to everyone with a reference to the outer value. If you put the inner value into some new storage, it gets copied there.

A value type which contains a reference type is not so simple. You can effectively break value semantics without being obvious that you're doing it. This can be good or bad, depending on how you do it. When you put a reference type inside a value type, then the outer value is copied when you place it into new storage, but the copy has a reference to the same nested object as the original. Here's an example:

    class Inner {
        var value = 42
    }

    struct Outer {
        var value = 42
        var inner = Inner()
    }

    var outer = Outer()
    var outer2 = outer
    outer.value = 43
    outer.inner.value = 43
    print("outer2.value=\(outer2.value) outer2.inner.value=\(outer2.inner.value)")

This prints:

    outer2.value=42 outer2.inner.value=43

While outer2 gets a copy of value, it only copies the reference to inner, and so the two structs end up sharing the same instance of Inner. Thus an update to outer.inner.value affects outer2.inner.value. Yikes!

This behavior can be really handy. When used with care, it allows you to create structs which perform a copy on write, to allow efficient implementations of value semantics that don't copy a ton of data everywhere. This is how Swift's collections work, and you can build your own as well. For more information on how to do that, see Let's Build Swift.Array.

It can also be extremely dangerous. For example, let's say you're making a Person type. It's a model type that's sensibly copyable, so it can be a struct. In a fit of nostalgia, you decide to use NSString for the Person's name:

    struct Person {
        var name: NSString
    }

Then you build up a couple of Persons, constructing the name from parts:

    let name = NSMutableString()
    name.appendString("Bob")
    name.appendString(" ")
    name.appendString("Josephsonson")
    let bob = Person(name: name)

    name.appendString(", Jr.")
    let bobjr = Person(name: name)

Print them out:

    print(bob.name)
    print(bobjr.name)

This produces:

    Bob Josephsonson, Jr.
    Bob Josephsonson, Jr.

Eek!

What happened? Unlike Swift's String type, NSString is a reference type. It's immutable, but it has a mutable subtype, NSMutableString. When bob was created, it created a reference to the string held in name. When that string was subsequently mutated, the mutation was visible through bob. Note that this effectively mutates bob even though it's a value type stored in a let binding. It's not really mutating bob, merely mutating a value that bob holds a reference to, but since that value is part of bob's data, in a semantic sense, it looks like a mutation of bob.

This sort of thing happens in Objective-C all the time. Every Objective-C programmer with some experience gets in the habit of sprinkling defensive copies all over the place. Since an NSString might actually be an NSMutableString, you define properties as copy, or write explicit copy calls in your initializers, to avoid a catastrophe. The same goes for the various Cocoa collections.
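
The classic defensive pattern looks like this. The copy attribute makes the synthesized setter store a copy of the incoming string, so a caller's NSMutableString can't mutate your data behind your back:

    @interface Person : NSObject
    // copy, not strong: the setter stores an immutable snapshot, so
    // later mutations of a caller's NSMutableString don't reach us.
    @property (copy) NSString *name;
    @end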

In Swift, the solution here is simpler: use value types rather than reference types. In this case, make name be a String. There is then no worry about inadvertently sharing references.

In other cases, the solution may be less simple. For example, you may create a struct containing a view, which is a reference type, and can't be changed to a value type. This is probably a good indication that your type shouldn't be a struct, since you can't make it maintain value semantics anyway.

Conclusion
Value types are copied whenever you move them around, whereas reference types just get new references to the same underlying object. That means that mutations to reference types are visible to everything that has a reference, whereas mutations to value types only affect the storage you're mutating. When choosing which kind of type to make, consider how suitable your type is for copying, and lean towards a value type for types that are inherently copyable. Finally, beware of embedding reference types in value types, as terrible things can happen if you're not careful.

That wraps things up for today, for real this time. Come back next time for more fun. Friday Q&A is driven by reader suggestions, so if you have an idea for a topic you'd like to see covered, please send it in!

Friday Q&A 2015-07-31: Tagged Pointer Strings


Tagged pointers are an interesting technology used to increase performance and reduce memory usage. As of OS X 10.10, NSString got the tagged pointer treatment, and today I'm going to take a look at how it works, a topic suggested by Ken Ferry.

Recap
Objects are aligned in memory, such that their address is always at least a multiple of the pointer size, and in practice typically a multiple of 16. Object pointers are stored as a full 64-bit integer, but this alignment means that some of the bits will always be zero.

Tagged pointers take advantage of this fact to give special meaning to object pointers where those bits are not zero. In Apple's 64-bit Objective-C implementation, object pointers with the least significant bit set to one (which is to say, odd numbers) are considered tagged pointers. Instead of doing the standard isa dereference to figure out the class, the next three bits are considered as an index into a tagged class table. This index is used to look up the class of the tagged pointer. The remaining 60 bits are then left up to the tagged class to use as they please.

A simple use for this would be to make suitable NSNumber instances be tagged pointers, with the extra 60 bits used to hold the numeric value. The bottom bit would be set to 1. The next three bits would be set to the appropriate index for the NSNumber tagged class. The following 60 bits could then hold, for example, any integer value that fits within that space.
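
In code, pulling such a pointer apart would look something like this. It's a sketch of the layout just described, nothing more; the real runtime's tagged pointer internals are private and subject to change:

    #include <stdint.h>

    int IsTaggedPointer(uintptr_t ptr) {
        return (ptr & 1) == 1; // odd "pointers" are tagged
    }

    unsigned TaggedClassIndex(uintptr_t ptr) {
        return (ptr >> 1) & 0x7; // next three bits index the class table
    }

    uintptr_t TaggedPayload(uintptr_t ptr) {
        return ptr >> 4; // remaining 60 bits belong to the class
    }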

From the outside, such a pointer would look and act like any other object. It responds to messages like any other object, because objc_msgSend knows about tagged pointers. If you ask it for its integerValue, it will extract the data from those 60 bits and return it to you. However, you've saved a memory allocation, a pointer indirection on every access, and reference counting can be a no-op since there's no memory to free. For commonly used classes, this can make for a substantial performance improvement.

NSString doesn't seem like a good candidate for tagged pointers, since it's of variable length and can be much longer than the 60 bits in a tagged pointer. However, a tagged pointer class can coexist with a normal class, using tagged pointers for some values and normal pointers for others. For example, with NSNumber, any integer larger than 2^60 - 1 won't fit in the tagged pointer, and instead needs to be stored in a normal NSNumber object allocated in memory. As long as the code that creates the objects is written appropriately, this all works fine.

NSString could do the same thing. For strings that fit inside 60 bits, it could create a tagged pointer. Other strings would be placed in normal objects. This assumes that small strings are used often enough for this to be a noticeable performance gain. Is that actually the case in real-world code? It appears Apple thinks it is, since they went ahead and implemented it.

Possible Implementations
Before we look at how Apple did things, let's take a moment to think about how we might implement tagged string storage. The basics are simple: set the bottom bit to one, set the next three bits to the appropriate tagged class index, and set the remaining bits to whatever. How to use those remaining bits is the big question. We want to take the maximum advantage of the 60 bits available to us.

A Cocoa string is conceptually a sequence of Unicode code points. There are 1,112,064 valid Unicode code points, so one code point takes 21 bits to represent. That means we could fit two code points into these 60 bits, with 18 bits wasted. We could borrow some of those extra bits to hold a length, so that a tagged string could be zero, one, or two code points. Being limited to only two code points doesn't seem very useful, though.

The NSString API is actually implemented in terms of UTF-16, not raw Unicode code points. UTF-16 represents Unicode as a sequence of 16-bit values. The most common code points, which occupy the Basic Multilingual Plane or BMP, fit into a single 16-bit value, while code points above 65,535 require two. We could fit three 16-bit values into the 60 bits available, with 12 bits left over. Borrowing some bits for the length would allow us to represent 0-3 UTF-16 code units. This would allow up to three code points in the BMP, and one code point beyond the BMP (plus, optionally, one code point within it). Being limited to three is still pretty tight, though.

Most strings in an app are probably ASCII. Even if the app is localized into a non-ASCII language, strings are used for far more than just displaying UI. They're used for URL components, file extensions, object keys, property list values, and much more. The UTF-8 encoding is an ASCII-compatible encoding that encodes each ASCII character as a single byte, while using up to four bytes for other Unicode code points. We can fit seven bytes in the 60 bits allotted to us, with 4 bits left over to use as a length. This allows our tagged strings to hold seven ASCII characters, or somewhat fewer non-ASCII characters, depending on exactly what they are.

If we're optimizing for ASCII, we might as well drop full Unicode support altogether. Strings containing non-ASCII characters can use real objects, after all. ASCII is a seven-bit encoding, so what if we allot only 7 bits per character? That lets us store up to eight ASCII characters in the 60 bits available, plus 4 bits left over for the length. This is starting to sound useful. There are probably a lot of strings in an app which are pure ASCII and contain eight characters or fewer.

Can we take it further? The full ASCII range contains a lot of stuff that isn't frequently used. There are a bunch of control characters, for example, and unusual symbols. Alphanumerics make up most of what's used. What can we squeeze into 6 bits?

6 bits results in 64 possible values. There are 26 letters in the ASCII alphabet. Including both upper and lower case brings it up to 52. Including all digits 0-9 brings it up to 62. There are two spots to spare, which we might give to, say, space and period. There are probably a lot of strings which only contain these characters. At 6 bits each, we can fit ten in our 60 bits of storage! But wait! We don't have any leftover bits to store the length. So either we store nine characters plus a length, or we remove one of the 64 possible values (I nominate the space to be the one to go) and use zero as a terminator for strings shorter than ten characters.

How about 5 bits? This isn't totally ludicrous. There are probably a lot of strings which are just lowercase, for example. 5 bits gives 32 possible values. If you include the whole lowercase alphabet, there are 6 extra values, which you could allot to the more common uppercase letters, or some symbols, or digits, or some mix. If you find that some of these other possibilities are more common, you could even remove some of the less common lowercase letters, like q. 5 bits per character gives eleven characters if we save room for length, or twelve if we borrow a symbol and use a terminator.

Can we take it further? 5 bits is about as far as you can reasonably go with a fixed-size alphabet. You could switch to a variable length encoding, for example using a Huffman code. This would allow the letter e, which is common, to be encoded with fewer bits than the letter q. This might allow for as little as 1 bit per character, in the unlikely case that your string is all es. This would come at the cost of more complex and presumably slower code.

Which approach did Apple take? Let's find out.

Exercising Tagged Strings
Here's a bit of code that creates a tagged string and prints its pointer value:

    NSString *a = @"a";
    NSString *b = [[a mutableCopy] copy];
    NSLog(@"%p %p %@", a, b, object_getClass(b));

The mutableCopy/copy dance is necessary for two reasons. First, although a string like @"a" could be stored as a tagged pointer, constant strings are never tagged pointers. Constant strings must remain binary compatible across OS releases, but the internal details of tagged pointers are not guaranteed. This works fine as long as tagged pointers are always generated by Apple's code at runtime, but it would break down if the compiler embedded them in your binary, as would be the case with a constant string. Thus we need to copy the constant string to get a tagged pointer.

The mutableCopy is necessary because NSString is too clever for us, and knows that a copy of an immutable string is a pointless operation, and returns the original string as the "copy." Constant strings are immutable, so the result of [a copy] is just a. A mutable copy forces it to make an actual copy, and then making an immutable copy of the result is enough to convince the system to give us a tagged pointer string.

Note that you must never depend on these details in your own code! The circumstances in which the NSString code decides to give you a tagged pointer could change at any time and if you write code that somehow relies on it, that code will eventually break. Fortunately, you have to go out of your way to do that, and all normal, sensible code will work fine, blissfully ignorant of tagged anything.

Here's what the above code prints on my computer:

    0x10ba41038 0x6115 NSTaggedPointerString

You can see the original pointer first, a nice round number indicative of an object pointer. The copy is the second value, and its taggedness is abundantly clear. It's an odd number, which means it can't be a valid object pointer. It's also a small number, well within the unmapped and unmappable 4GB zero page at the beginning of the 64-bit Mac address space, making it doubly impossible to be an object pointer.

What can we deduce from this value of 0x6115? We know that the bottom four bits are part of the tagged pointer mechanism itself. The lowest nybble 5 is 0101 in binary. The bottom one bit indicates that it's a tagged pointer. The next three bits indicate the tagged class. Here those bits are 010, indicating that the tagged string class occupies index 2. Not that there's much to do with that information.

The 61 that leads the value is suggestive. 61 in hexadecimal happens to be the ASCII value of the lowercase letter a, which is exactly what the string contains. It looks like there's a straight ASCII encoding in use. Convenient!

The class name makes it obvious what this class is for, and gives us a good starting point when it comes to looking into the actual code that implements this feature. We'll get to that shortly, but let's do some more outside inspection first.

Here's a loop that builds up strings of the form abcdef... and prints them out one by one until it stops getting tagged pointers back:

    NSMutableString *mutable = [NSMutableString string];
    NSString *immutable;
    char c = 'a';
    do {
        [mutable appendFormat: @"%c", c++];
        immutable = [mutable copy];
        NSLog(@"0x%016lx %@ %@", immutable, immutable, object_getClass(immutable));
    } while(((uintptr_t)immutable & 1) == 1);

The first iteration prints:

    0x0000000000006115 a NSTaggedPointerString

This matches what we saw above. Note that I'm printing out the full pointer with all leading zeroes to make things a bit more clear when comparing subsequent iterations. Let's compare with the second iteration:

    0x0000000000626125 ab NSTaggedPointerString

The bottom four bits didn't change, as we'd expect. That 5 will remain constant, always indicating that this is a tagged pointer of type NSTaggedPointerString.

The original 61 stays where it was, joined now by a 62. 62 is, of course, the ASCII code for b, so we can see that this encoding is an eight-bit encoding that uses ASCII. The four bits just before the terminal 5 changed from 1 to 2, suggesting that this might be the length. Subsequent iterations confirm this:

    0x0000000063626135 abc NSTaggedPointerString
    0x0000006463626145 abcd NSTaggedPointerString
    0x0000656463626155 abcde NSTaggedPointerString
    0x0066656463626165 abcdef NSTaggedPointerString
    0x6766656463626175 abcdefg NSTaggedPointerString

Presumably that's the end of it. The tagged pointer is full, and the next iteration will allocate an object and terminate the loop. Right? Wrong!

    0x0022038a01169585 abcdefgh NSTaggedPointerString
    0x0880e28045a54195 abcdefghi NSTaggedPointerString
    0x00007fd275800030 abcdefghij __NSCFString

The loop goes through two more iterations before terminating. The length section continues to increase, but the rest of the tagged pointer turns into gibberish. What's going on? Let's turn to the implementation to find out.

Disassembly
The NSTaggedPointerString class lives in the CoreFoundation framework. It seems like it ought to live in Foundation, but a lot of the core Objective-C classes have been moved to CoreFoundation these days as Apple slowly gives up on making CoreFoundation an independent entity.

Let's start by looking at the implementation of -[NSTaggedPointerString length]:

    push       rbp
    mov        rbp, rsp
    shr        rdi, 0x4
    and        rdi, 0xf
    mov        rax, rdi
    pop        rbp
    ret

Hopper provides this handy decompilation to go along with it:

    unsigned long long -[NSTaggedPointerString length](void * self, void * _cmd) {
        rax = self >> 0x4 & 0xf;
        return rax;
    }

In short, to obtain the length, extract bits 4-7 and return them. This confirms what we observed above, that the four bits just before the terminal 5 indicate the length of the string.

The other primitive method for an NSString subclass is characterAtIndex:. I'll skip the lengthy disassembly and go straight to Hopper's decompiled output, which is pretty readable:

    unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long arg2) {
        rsi = _cmd;
        rdi = self;
        r13 = arg2;
        r8 = ___stack_chk_guard;
        var_30 = *r8;
        r12 = rdi >> 0x4 & 0xf;
        if (r12 >= 0x8) {
                rbx = rdi >> 0x8;
                rcx = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
                rdx = r12;
                if (r12 < 0xa) {
                        do {
                                *(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x3f) + rcx);
                                rdx = rdx - 0x1;
                                rbx = rbx >> 0x6;
                        } while (rdx != 0x0);
                }
                else {
                        do {
                                *(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x1f) + rcx);
                                rdx = rdx - 0x1;
                                rbx = rbx >> 0x5;
                        } while (rdx != 0x0);
                }
        }
        if (r12 <= r13) {
                rbx = r8;
                ___CFExceptionProem(rdi, rsi);
                [NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
                r8 = rbx;
        }
        rax = *(int8_t *)(rbp + r13 + 0xffffffffffffffc0) & 0xff;
        if (*r8 != var_30) {
                rax = __stack_chk_fail();
        }
        return rax;
    }

Let's clean this up a little. The first three lines are just Hopper telling us which registers get which arguments. Let's go ahead and replace all uses of rsi with _cmd and rdi with self. arg2 is actually the index parameter, so let's replace all uses of r13 with index. And we'll get rid of the __stack_chk stuff, as it's just a security hardening thing and not relevant to the actual workings of the method. Here's what the code looks like when cleaned up in this way:

    unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
        r12 = self >> 0x4 & 0xf;
        if (r12 >= 0x8) {
                rbx = self >> 0x8;
                rcx = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
                rdx = r12;
                if (r12 < 0xa) {
                        do {
                                *(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x3f) + rcx);
                                rdx = rdx - 0x1;
                                rbx = rbx >> 0x6;
                        } while (rdx != 0x0);
                }
                else {
                        do {
                                *(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x1f) + rcx);
                                rdx = rdx - 0x1;
                                rbx = rbx >> 0x5;
                        } while (rdx != 0x0);
                }
        }
        if (r12 <= index) {
                rbx = r8;
                ___CFExceptionProem(self, _cmd);
                [NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
                r8 = rbx;
        }
        rax = *(int8_t *)(rbp + index + 0xffffffffffffffc0) & 0xff;
        return rax;
    }

Right before the first if statement is this line:

    r12 = self >> 0x4 & 0xf

We can recognize this as the same length extraction code that we saw in the implementation of -length above. Let's go ahead and replace r12 with length throughout:

    unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
        length = self >> 0x4 & 0xf;
        if (length >= 0x8) {
                rbx = self >> 0x8;
                rcx = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
                rdx = length;
                if (length < 0xa) {
                        do {
                                *(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x3f) + rcx);
                                rdx = rdx - 0x1;
                                rbx = rbx >> 0x6;
                        } while (rdx != 0x0);
                }
                else {
                        do {
                                *(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x1f) + rcx);
                                rdx = rdx - 0x1;
                                rbx = rbx >> 0x5;
                        } while (rdx != 0x0);
                }
        }
        if (length <= index) {
                rbx = r8;
                ___CFExceptionProem(self, _cmd);
                [NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
                r8 = rbx;
        }
        rax = *(int8_t *)(rbp + index + 0xffffffffffffffc0) & 0xff;
        return rax;
    }

Looking inside the if statement, the first line shifts self right by 8 bits. The bottom 8 bits are bookkeeping: the tagged pointer indicator, and the string length. The remainder is then, we presume, the actual data. Let's replace rbx with stringData to make that more clear. The next line appears to put some sort of lookup table into rcx, so let's replace rcx with table. Finally, rdx gets a copy of the value of length. It looks like it gets used as some sort of cursor later, so let's replace rdx with cursor. Here's what we have now:

    unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
        length = self >> 0x4 & 0xf;
        if (length >= 0x8) {
                stringData = self >> 0x8;
                table = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
                cursor = length;
                if (length < 0xa) {
                        do {
                                *(int8_t *)(rbp + cursor + 0xffffffffffffffbf) = *(int8_t *)((stringData & 0x3f) + table);
                                cursor = cursor - 0x1;
                                stringData = stringData >> 0x6;
                        } while (cursor != 0x0);
                }
                else {
                        do {
                                *(int8_t *)(rbp + cursor + 0xffffffffffffffbf) = *(int8_t *)((stringData & 0x1f) + table);
                                cursor = cursor - 0x1;
                                stringData = stringData >> 0x5;
                        } while (cursor != 0x0);
                }
        }
        if (length <= index) {
                rbx = r8;
                ___CFExceptionProem(self, _cmd);
                [NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
                r8 = rbx;
        }
        rax = *(int8_t *)(rbp + index + 0xffffffffffffffc0) & 0xff;
        return rax;
    }

That's almost everything labeled. One raw register name remains: rbp. That's actually the frame pointer, so the compiler is doing something tricky indexing directly off the frame pointer. Adding the constant 0xffffffffffffffbf is the two's-complement "everything is ultimately an unsigned integer" way to subtract 65. Later on, it subtracts 64. This is all probably the same local variable on the stack. Given the bytewise indexing going on, it's probably a buffer placed on the stack. But it's weird, because there's a code path which does nothing but read from that buffer, without ever writing to it. What's going on?

It turns out that what's going on is Hopper forgot to decompile the else branch of that outer if statement. The relevant assembly looks like this:

    mov        rax, rdi
    shr        rax, 0x8
    mov        qword [ss:rbp+var_40], rax

var_40 is how Hopper shows that offset of 64 in the disassembly. (40 being the hexadecimal version of 64.) Let's call the pointer to this location buffer. The C version of this missing branch would look like:

    *(uint64_t *)buffer = self >> 8

Let's go ahead and insert that, and replace the other places where rbp is used to access buffer with more readable versions of the code, plus add a declaration of buffer to remind us what's going on:

    unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
        int8_t buffer[11];
        length = self >> 0x4 & 0xf;
        if (length >= 0x8) {
                stringData = self >> 0x8;
                table = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
                cursor = length;
                if (length < 0xa) {
                        do {
                                *(int8_t *)(buffer + cursor - 1) = *(int8_t *)((stringData & 0x3f) + table);
                                cursor = cursor - 0x1;
                                stringData = stringData >> 0x6;
                        } while (cursor != 0x0);
                }
                else {
                        do {
                                *(int8_t *)(buffer + cursor - 1) = *(int8_t *)((stringData & 0x1f) + table);
                                cursor = cursor - 0x1;
                                stringData = stringData >> 0x5;
                        } while (cursor != 0x0);
                }
        } else {
            *(uint64_t *)buffer = self >> 8;
        }
        if (length <= index) {
                rbx = r8;
                ___CFExceptionProem(self, _cmd);
                [NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
                r8 = rbx;
        }
        rax = *(int8_t *)(buffer + index) & 0xff;
        return rax;
    }

Getting better. All those crazy pointer manipulation statements are a bit hard to read, though, and they're really just array indexing. Let's fix those up:

    unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
        int8_t buffer[11];
        length = self >> 0x4 & 0xf;
        if (length >= 0x8) {
                stringData = self >> 0x8;
                table = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
                cursor = length;
                if (length < 0xa) {
                        do {
                                buffer[cursor - 1] = table[stringData & 0x3f];
                                cursor = cursor - 0x1;
                                stringData = stringData >> 0x6;
                        } while (cursor != 0x0);
                }
                else {
                        do {
                                buffer[cursor - 1] = table[stringData & 0x1f];
                                cursor = cursor - 0x1;
                                stringData = stringData >> 0x5;
                        } while (cursor != 0x0);
                }
        } else {
            *(uint64_t *)buffer = self >> 8;
        }
        if (length <= index) {
                rbx = r8;
                ___CFExceptionProem(self, _cmd);
                [NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
                r8 = rbx;
        }
        rax = buffer[index];
        return rax;
    }

Now we're getting somewhere.

We can see that there are three cases depending on the length. Length values less than 8 go through that missing else branch that just dumps the value of self, shifted, into buffer. This is the plain ASCII case. Here, index is used to index into the value of self to extract the given byte, which is then returned to the caller. Since ASCII character values match Unicode code points within the ASCII range, there's no additional manipulation needed to make the value come out correctly. We guessed above that the string stored plain ASCII in this case, and this confirms it.

What about the cases where the length is 8 or more? If the length is 8 or more but less than 10 (0xa), then the code enters a loop. This loop extracts the bottom 6 bits of stringData, uses that as an index into table, and then copies that value into buffer. It then shifts stringData down by 6 bits and repeats until it runs through the entire string. This is a six-bit encoding where the mapping from raw six-bit values to ASCII character values is stored in the table. A temporary version of the string is built up in buffer, and the indexing operation at the end then extracts the requested character from it.

What about the case where the length is 10 or more? The code there is almost identical, except that it works five bits at a time instead of six. This is a more compact encoding that would allow the tagged string to store up to 11 characters, but only using an alphabet of 32 values. This will use the first half of table as its truncated alphabet.

Thus we can see that the structure of the tagged pointer strings is:

  1. If the length is between 0 and 7, store the string as raw eight-bit characters.
  2. If the length is 8 or 9, store the string in a six-bit encoding, using the alphabet "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX".
  3. If the length is 10 or 11, store the string in a five-bit encoding, using the alphabet "eilotrm.apdnsIc ufkMShjTRxgC4013".
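
Putting the reverse-engineered format together, here's a small C program that decodes a tagged string pointer value by hand. It's only a sketch following the logic above, but it reproduces the strings from our experiments:

    #include <stdint.h>
    #include <stdio.h>

    // The alphabet recovered from the disassembly above.
    static const char *kTable =
        "eilotrm.apdnsIc ufkMShjTRxgC4013"
        "bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";

    static void DecodeTaggedString(uintptr_t ptr, char *out) {
        uintptr_t length = (ptr >> 4) & 0xf;
        uintptr_t data = ptr >> 8;
        if (length < 8) {
            // Raw eight-bit characters, first character in the low byte.
            for (uintptr_t i = 0; i < length; i++)
                out[i] = (data >> (8 * i)) & 0xff;
        } else if (length < 10) {
            // Six bits per character, last character in the low bits.
            for (uintptr_t i = length; i-- > 0; data >>= 6)
                out[i] = kTable[data & 0x3f];
        } else {
            // Five bits per character, first half of the table only.
            for (uintptr_t i = length; i-- > 0; data >>= 5)
                out[i] = kTable[data & 0x1f];
        }
        out[length] = '\0';
    }

    int main(int argc, char **argv) {
        char buffer[12];
        DecodeTaggedString(0x0022038a01169585, buffer);
        printf("%s\n", buffer); // prints: abcdefgh
        DecodeTaggedString(0x39408eaa1b4846b5, buffer);
        printf("%s\n", buffer); // prints: cdefghijklm
        return 0;
    }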

Let's compare with the data we generated earlier:

    0x0000000000006115 a NSTaggedPointerString
    0x0000000000626125 ab NSTaggedPointerString
    0x0000000063626135 abc NSTaggedPointerString
    0x0000006463626145 abcd NSTaggedPointerString
    0x0000656463626155 abcde NSTaggedPointerString
    0x0066656463626165 abcdef NSTaggedPointerString
    0x6766656463626175 abcdefg NSTaggedPointerString
    0x0022038a01169585 abcdefgh NSTaggedPointerString
    0x0880e28045a54195 abcdefghi NSTaggedPointerString
    0x00007fbad9512010 abcdefghij __NSCFString

The binary expansion of 0x0022038a01169585 minus the bottom eight bits and divided into six-bit chunks is:

    001000 100000 001110 001010 000000 010001 011010 010101

Using these to index into the table, we can see that this does indeed spell out "abcdefgh". Similarly, the binary expansion of 0x0880e28045a54195 minus the bottom eight bits and divided into six-bit chunks is:

    001000 100000 001110 001010 000000 010001 011010 010101 000001

We can see that it's the same string, plus i at the end.

But then it goes off the rails. After this point, it should switch to a five-bit encoding and give us two more strings, but instead it starts allocating objects at length 10. What gives?

The five-bit alphabet is extremely limited, and doesn't include the letter b! That letter must not be common enough to warrant a place in the 32 hallowed characters of the five-bit alphabet. Let's try this again, but instead start the string at c. Here's the output:

    0x0000000000006315 c NSTaggedPointerString
    0x0000000000646325 cd NSTaggedPointerString
    0x0000000065646335 cde NSTaggedPointerString
    0x0000006665646345 cdef NSTaggedPointerString
    0x0000676665646355 cdefg NSTaggedPointerString
    0x0068676665646365 cdefgh NSTaggedPointerString
    0x6968676665646375 cdefghi NSTaggedPointerString
    0x0038a01169505685 cdefghij NSTaggedPointerString
    0x0e28045a54159295 cdefghijk NSTaggedPointerString
    0x01ca047550da42a5 cdefghijkl NSTaggedPointerString
    0x39408eaa1b4846b5 cdefghijklm NSTaggedPointerString
    0x00007fbd6a511760 cdefghijklmn __NSCFString

We now have tagged strings all the way up to a length of 11. The binary expansions of the last two tagged strings are:

    01110 01010 00000 10001 11010 10101 00001 10110 10010 00010
    01110 01010 00000 10001 11010 10101 00001 10110 10010 00010 00110

Exactly what we're expecting.

Creating Tagged Strings
Since we know how tagged strings are encoded, I won't go into much detail on the code that creates them. The code in question is found within a private function called __CFStringCreateImmutableFunnel3 which handles every conceivable string creation case all in one gigantic function. This function is included in the open source release of CoreFoundation available on opensource.apple.com, but don't get excited: the tagged pointer strings code is not included in the open source version.

The code here is essentially the inverse of what's above. If the string's length and content fit what a tagged pointer string can hold, it builds a tagged pointer piece by piece, containing either ASCII, six-bit, or five-bit characters. There's also an inverse of the lookup table: the table seen above as a constant string lives in a global variable called sixBitToCharLookup, and a corresponding table called charToSixBitLookup is used in the Funnel3 function.
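
Since that code isn't public, here's a rough, hypothetical sketch of how such an inverse table could be built, borrowing the names mentioned above:

    #include <stdint.h>
    #include <string.h>

    static const char *sixBitToCharLookup =
        "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
    static int8_t charToSixBitLookup[128];

    static void BuildCharToSixBitLookup(void) {
        // Anything not in the alphabet maps to -1, meaning "not encodable."
        memset(charToSixBitLookup, -1, sizeof(charToSixBitLookup));
        // Map each alphabet character back to its six-bit code. A character
        // also fits the five-bit encoding if its code is less than 32.
        for(int8_t i = 0; i < 64; i++)
            charToSixBitLookup[(uint8_t)sixBitToCharLookup[i]] = i;
    }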

That Weird Table
The full six-bit encoding table is:

    eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX

A natural question here is: why is it in such a strange order?

Because this table is used for both six-bit and five-bit encodings, it makes sense that it wouldn't be entirely in alphabetical order. Characters that are used most frequently should be in the first half, while characters that are used less frequently should go in the second half. This ensures that the maximum number of longer strings can use the five-bit encoding.

However, with that division in place, the order within the individual halves doesn't matter. The halves themselves could be sorted alphabetically, but they're not.

The first few letters in the table are similar to the order in which letters appear in English, sorted by frequency. The most common letter in English is E, then T, then A, O, I, N, and S. E is right at the beginning of the table, and the others are near the beginning. The table appears to be sorted by frequency of use. The discrepancy with English probably arises because short strings in Cocoa apps aren't a random selection of words from English prose, but a more specialized bit of language.

I speculate that Apple originally intended to use a fancier variable-length encoding, probably based on a Huffman code. This turned out to be too difficult, or not worth the effort, or they just ran out of time, and so they dialed it back to the less ambitious version seen above, where strings choose between constant-length encodings using eight, six, or five bits per character. The weird table lives on as a leftover, and a starting point if they decide to go for a variable-length encoding in the future. This is pure guesswork, but that's what it looks like to me.

Conclusion
Tagged pointers are a really cool technology. Strings are an unusual application of it, but it's clear that Apple put a lot of thought into it and they must see a significant benefit. It's interesting to see how it's put together and how they got the most out of the very limited storage available.

That's it for today. Come back next time for more strange explorations of the trans-mundane. Friday Q&A is driven by reader suggestions, so if you have an idea for a subject you'd like to see covered here, please send it in!

Friday Q&A 2015-08-14: An Xcode Plugin for Unsmoothed Text

Getting Xcode to display unsmoothed text in its editor has been an ongoing battle which finally required me to write an Xcode plugin to impose my will. Several readers asked me to discuss how it works, which is what I'm going to do today.

Background
I worship at the church of Monaco 9. It is, in my opinion, the One True Font for all monospaced tasks. This is especially true on old-fashioned non-retina screens, where its carefully designed pixels provide maximum readability in minimal space.

Those carefully designed pixels make it critically important that the text not be antialiased. Font smoothing defeats the entire purpose of Monaco 9.

It was a bit distressing when, a couple of OS versions ago, Xcode started to insist on smoothing Monaco 9 despite my best efforts, because of changes made for retina support. This was annoying when not using a retina display. Fortunately, it was possible to disable font smoothing with a defaults command.

Major trouble started when I got a retina display for my Mac Pro. I use it side by side with a normal non-retina display. Stuff that benefits from font smoothing, like web pages, e-mail, documentation, and cat pictures go on the retina display. Code goes on the regular display. However, the mere presence of the retina display made Xcode insist on font smoothing all over again, and the usual remedies were powerless.

I decided I'd have to get some code into Xcode and hack it from within. I thought about code injection using something like mach_inject or simply abusing lldb, but it turns out that Xcode has a built-in plugin mechanism that works well for this. It's undocumented and not officially supported, but it's not too hard to use.

Getting Started
If you want to make your own Xcode plugin, create a new "Bundle" project in Xcode. This sets up a project that builds a loadable bundle, which is what plugins usually are. Tell it to use .xcplugin as the bundle's extension, and you're on your way to creating something Xcode can load.

Of course, it's not quite so simple. Xcode is very particular about which plugins it will load, and requires some Info.plist keys.

The XC4Compatible key must be set to YES, otherwise Xcode will assume your plugin is utterly ancient and will refuse to load it.

The XCPluginHasUI key should be set to NO for our purposes. This causes Xcode to load the plugin's code immediately, allowing us to start causing trouble. If set to YES, there's presumably some sort of UI that allows it to be engaged manually, but I didn't explore this side of things since it wasn't necessary for my purposes.

The final and most annoying required key is DVTPlugInCompatibilityUUIDs. This is set to an array of strings. Each string is the UUID of an Xcode version that the plugin is compatible with. Each Xcode version has its own compatibility UUID. If your plugin doesn't have the right UUID in its list, Xcode will refuse to load it.

Every single new version of Xcode gets a new UUID. These UUIDs can't be predicted in advance, so there's no way to preload the list in the plugin. This means that every new Xcode release breaks all plugins, and you have to manually go in and put the new UUID into the array. How annoying! If you're lucky, Xcode will dump an error to the Console stating the UUID it expects. Otherwise, you can extract the UUID from Xcode's own Info.plist.

Once you've made those changes, your project will build a plugin that Xcode actually wants to load. You still need to put the plugin in the right place for Xcode to actually find it, of course. That right place is ~/Library/Application Support/Developer/Shared/Xcode/Plug-ins. You could copy it manually every time you build, but that would be extremely annoying. It's much nicer to set up a Copy Files build phase to copy the plugin every time you build.

One more thing is needed for build nirvana, and that's to have your project actually start Xcode when you tell it to run. By default, plugin projects can't run, they can only build, because plugins are inert on their own. If you edit the scheme (option-click the run button in the toolbar) you can set a custom executable. Set this to Xcode, and Xcode will run Xcode when you tell Xcode to run your project. (Are you confused yet?)

For reasons completely unknown to me, Xcode will crash on launch with the default scheme settings. To fix this, make sure to go into Options, go to the very bottom, and disable the "View Debugging: Enable user interface debugging" checkbox. I do not know why this fix works.

To make sure everything is working, add a class to the plugin and implement +load to log some sort of "Hello, world" output. When you run the project, Xcode will launch and display your log output. You have full debugging available, with one copy of Xcode pointing at another. Try not to get confused about which is which.
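
A minimal sketch will do; the class name here is arbitrary:

    #import <Foundation/Foundation.h>

    @interface MAPluginHello : NSObject
    @end

    @implementation MAPluginHello

    // Runs as soon as Xcode loads the plugin bundle.
    + (void)load {
        NSLog(@"Hello from MAPluginHello!");
    }

    @end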

Hacking
Looking at the view hierarchy in the debugger, it's apparent that Xcode's code view is a custom subclass of NSTextView. This is great, because NSTextView is a public, documented class with lots of knobs to turn, and surely one of those knobs would solve the problem. (Not to be confused with the knobs who created the problem in the first place.)

How do we get ahold of the code view instances, though? We're just hanging out in +load with no pointers to anything. We could try to start from NSApp and work our way down, but that's pretty painful. Fortunately this code doesn't need to be production-worthy and we can use big hammers. Instead of trying to find individual instances, why not just modify the NSTextView class itself to do our bidding?

How to modify the class? I could override init, but that might be too early. I could try to make modifications in viewDidMoveToWindow, but even that might be too early. Without knowing exactly which knobs to turn yet, I want to make sure I can experiment without worrying about whether my experiments are being overridden later. I finally decided to use another big hammer and override drawRect:. This means my code will run every time the view is drawn. This is probably way more often than necessary for this tweak. But as long as the tweak can run many times without hurting anything, why not? It'll be a tiny slowdown, and that's it.

How do we override drawRect: for the entire NSTextView class? Import objc/runtime.h and get cracking.

First, we want a convenient reference to the class and selector we're working with:

    Class nstv = [NSTextView class];
    SEL drawRect = @selector(drawRect:);

Then we can get a reference to the method in question:

    Method m = class_getInstanceMethod(nstv, drawRect);

We're going to provide a new implementation for the method. That implementation will call through to the original implementation, the moral equivalent of a call to super, so that the text still gets drawn. To make this happen, we need to save the pointer to the original implementation in a variable:

    IMP oldImplementation = method_getImplementation(m);

For convenience, I'll use the nifty imp_implementationWithBlock API to create the new implementation using a block. This call takes a block whose first argument is self and whose subsequent arguments are the method arguments, and turns it into an IMP which can be used as a method implementation. The runtime takes care of translating between the IMP function pointer and the block:

    IMP newImplementation = imp_implementationWithBlock(^(NSTextView *self, NSRect rect) {

Some sort of magic code is going to go in here to make everything better and destroy font smoothing for eternity, or at least until the next Xcode update:

        // MAGIC GOES HERE

After doing whatever magic, we want to call the original implementation. We do this by casting oldImplementation to a function pointer of the correct type, and then calling it:

        ((void (*)(id, SEL, NSRect))oldImplementation)(self, drawRect, rect);
    });

Now we have an IMP for the new implementation, and we can set it on the method:

    method_setImplementation(m, newImplementation);

That's it! Put this code in +load and our override now runs every time an NSTextView is drawn in Xcode.

What magic code goes in the override, though? With the surrounding code in place, it provides an excellent environment for experimentation. I tried CoreGraphics calls to disable font smoothing, I messed about with fonts, and various other things. I finally discovered that the magic incantation was to enable the use of screen fonts:

        [[self layoutManager] setUsesScreenFonts: YES];

I don't understand why, but apparently Xcode disables this. Re-enabling it in drawRect: fixes the problem. My perfectly pixelated Monaco 9 characters are back!

Conclusion
Building an Xcode plugin is easier than it looked at first glance, and provides a good way to get code into Xcode to fix problems that just can't be fixed in any other way. If you'd like to see the complete project I made with the above code, you can get it on GitHub here:

https://github.com/mikeash/DemoXcodePlugin

That's it for today. Use this knowledge wisely and go in peace, or war, or whatever it is that you do. Come back next time for more entertainment and occasional knowledge. Friday Q&A is driven by reader suggestions, so as always, if you have an idea you'd like to see covered next time or some other time, please send it in!

Friday Q&A 2015-09-04: Let's Build dispatch_queue

Grand Central Dispatch is one of the truly great APIs to come out of Apple in the past few years. In the latest installment of the Let's Build series, I'm going to explore a reimplementation of the most basic features of dispatch_queue, a topic suggested by Rob Rix.

Overview
A dispatch queue is a queue of jobs backed by a global thread pool. Typically, jobs submitted to a queue are executed asynchronously on a background thread. All threads share a single pool of background threads, which makes the system more efficient.
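
For reference, here's the shape of the real GCD API being replicated, in a quick hypothetical example:

    dispatch_queue_t queue = dispatch_queue_create("com.example.worker",
                                                   DISPATCH_QUEUE_SERIAL);
    dispatch_async(queue, ^{
        NSLog(@"Runs on a background thread, after earlier blocks finish");
    });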

That is the essence of the API which I'll replicate. There are a number of fancy features provided by GCD which I'll ignore for the sake of simplicity. For example, the number of threads in the global pool scales up and down with the amount of work to be done and the CPU utilization of the system. If you have a bunch of jobs hitting the CPU hard and you submit another job, GCD will avoid creating another worker thread for it, because it's already running at 100% and another thread will just make things less efficient. I'll skip over this and just use a hardcoded limit on the number of threads. I'll skip other fancy features like target queues and barriers on concurrent queues as well.

The goal is to concentrate on the essence of dispatch queues: they can be serial or concurrent, they can dispatch jobs synchronously or asynchronously, and they're backed by a shared global thread pool.

Code
As usual, the code for today's article is available on GitHub here:

https://github.com/mikeash/MADispatchQueue

If you'd like to follow along as you read, or just want to explore on your own, you can find it all there.

Interface
GCD is a C API. Although GCD objects have turned into Objective-C objects on more recent OS releases, the API remains pure C (plus Apple's blocks extension). This is great for a low-level API, and GCD presents a remarkably clean interface, but for my own purposes I prefer to write my reimplementation in Objective-C.

The Objective-C class is called MADispatchQueue and it only has four calls:

  1. A method for getting a shared global queue. GCD has multiple global queues with different priorities, but we'll just have one for simplicity.
  2. An initializer which can create the queue as either concurrent or serial.
  3. An async dispatch call.
  4. A sync dispatch call.

Here's the interface declaration:

    @interface MADispatchQueue : NSObject

    + (MADispatchQueue *)globalQueue;

    - (id)initSerial: (BOOL)serial;

    - (void)dispatchAsync: (dispatch_block_t)block;
    - (void)dispatchSync: (dispatch_block_t)block;

    @end

The goal, then, is to implement these methods to actually do what they say they do.
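
Put another way, once everything is implemented, a hypothetical caller should be able to write:

    MADispatchQueue *queue = [[MADispatchQueue alloc] initSerial: YES];

    [queue dispatchAsync: ^{
        NSLog(@"Runs in the background, after any previously enqueued blocks");
    }];

    [queue dispatchSync: ^{
        NSLog(@"Finishes before dispatchSync: returns");
    }];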

Thread Pool Interface
The thread pool that backs the queues has a simpler interface. It will do the grunt work of actually running the submitted jobs. The queues will then be responsible for submitting their enqueued jobs at the right time to maintain the queue's guarantees.

The thread pool has a single job: submit some work to be run. Accordingly, its interface has just a single method:

    @interface MAThreadPool : NSObject

    - (void)addBlock: (dispatch_block_t)block;

    @end

Since this is the core, let's implement it first.

Thread Pool Implementation
Let's look at instance variables first. The thread pool is going to be accessed from multiple threads, both internally and externally, and so needs to be thread safe. While GCD goes out of its way to use fast atomic operations whenever possible, for my conceptual rebuilding I'm going to stick with good old-fashioned locks. I need the ability to wait and signal on this lock, not just enforce mutual exclusion, so I'm using an NSCondition rather than a plain NSLock. If you're not familiar with it, NSCondition is basically a lock and a single condition variable wrapped into one:

    NSCondition *_lock;

In order to know when to spin up new worker threads, I need to know how many threads are in the pool, how many are actually busy doing work, and the maximum number of threads I can have:

    NSUInteger _threadCount;
    NSUInteger _activeThreadCount;
    NSUInteger _threadCountLimit;

Finally, there's a list of blocks to execute. This is an NSMutableArray, treated as a queue by appending new blocks to the end and removing them from the front:

    NSMutableArray *_blocks;

Initialization is simple. Initialize the lock, initialize the blocks array, and set the thread count limit to an arbitrarily-chosen 128:

    - (id)init {
        if((self = [super init])) {
            _lock = [[NSCondition alloc] init];
            _blocks = [[NSMutableArray alloc] init];
            _threadCountLimit = 128;
        }
        return self;
    }

The worker threads run a simple infinite loop. As long as the blocks array is empty, it will wait. Once a block is available, it will dequeue it from the array and execute it. When doing so, it will increment the active thread count, then decrement it again when done. Let's get started:

    - (void)workerThreadLoop: (id)ignore {

The first thing it does is acquire the lock. Note that it does this before the loop begins. The reason will become clear at the end of the loop:

        [_lock lock];

Now loop forever:

        while(1) {

If the queue is empty, wait on the lock:

            while([_blocks count] == 0) {
                [_lock wait];
            }

Note that this is done with a loop, not just an if statement. The reason for this is spurious wakeup. In short, wait can potentially return even though nothing signaled, so for correct behavior the condition being checked needs to be reevaluated when wait returns.

Once a block is available, dequeue it:

            dispatch_block_t block = [_blocks firstObject];
            [_blocks removeObjectAtIndex: 0];

Indicate that this thread is now doing something by incrementing the active thread count:

            _activeThreadCount++;

Now it's time to execute the block, but we have to release the lock first, otherwise we won't get any concurrency and we'll have all sorts of entertaining deadlocks:

            [_lock unlock];

With the lock safely relinquished, execute the block:

            block();

With the block done, it's time to decrement the active thread count. This must be done with the lock held to avoid race conditions, and that's the end of the loop:

            [_lock lock];
            _activeThreadCount--;
        }
    }

Now you can see why the lock had to be acquired before entering the loop above. The last act in the loop is to decrement the active thread count, which requires the lock to be held. The first thing at the top of the loop is to check the blocks queue. By performing the first lock outside of the loop, subsequent iterations can use a single lock operation for both operations, rather than locking, unlocking, and immediately locking again.

Now for addBlock:

    - (void)addBlock: (dispatch_block_t)block {

Everything here needs to be done with the lock acquired:

        [_lock lock];

The first task is to add the new block to the blocks queue:

        [_blocks addObject: block];

If there's an idle worker thread ready to take this block, then there isn't much to do. If there aren't enough idle worker threads to handle all the outstanding blocks, and the number of worker threads isn't yet at the limit, then it's time to create a new one:

        NSUInteger idleThreads = _threadCount - _activeThreadCount;
        if([_blocks count] > idleThreads && _threadCount < _threadCountLimit) {
            [NSThread detachNewThreadSelector: @selector(workerThreadLoop:)
                                     toTarget: self
                                   withObject: nil];
            _threadCount++;
        }

Now everything is ready for a worker thread to get started on the block. In case they're all sleeping, wake one up:

        [_lock signal];

Then relinquish the lock and we're done:

        [_lock unlock];
    }
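
As a quick sanity check, a hypothetical caller could exercise the pool like this:

    MAThreadPool *pool = [[MAThreadPool alloc] init];
    for(int i = 0; i < 10; i++) {
        [pool addBlock: ^{
            // Blocks may run concurrently on different worker threads.
            NSLog(@"Block %d on thread %@", i, [NSThread currentThread]);
        }];
    }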

That gives us a thread pool that can spawn worker threads up to a pre-set limit to service blocks as they come in. Now to implement queues with this as the foundation.

Queue Implementation
Like the thread pool, the queue will use a lock to protect its contents. Unlike the thread pool, it doesn't need to do any waiting or signaling, just basic mutual exclusion, so it uses a plain NSLock:

    NSLock *_lock;

Like the thread pool, it maintains a queue of pending blocks in an NSMutableArray:

    NSMutableArray *_pendingBlocks;

The queue knows whether it's serial or concurrent:

    BOOL _serial;

When serial, it also tracks whether it currently has a block running in the thread pool:

    BOOL _serialRunning;

Concurrent queues behave the same whether anything is running or not, so they don't track this.

The global queue is stored in a global variable, as is the underlying shared thread pool. They're both created in +initialize:

    static MADispatchQueue *gGlobalQueue;
    static MAThreadPool *gThreadPool;

    + (void)initialize {
        if(self == [MADispatchQueue class]) {
            gGlobalQueue = [[MADispatchQueue alloc] initSerial: NO];
            gThreadPool = [[MAThreadPool alloc] init];
        }
    }

The +globalQueue method can then just return the variable, since +initialize is guaranteed to have created it:

    + (MADispatchQueue *)globalQueue {
        return gGlobalQueue;
    }

This is just the sort of thing that calls for dispatch_once, but it felt like cheating to use a GCD API when I'm reimplementing a GCD API, even if it's not the same one.

Initializing a queue consists of allocating the lock and the pending blocks queue, and setting the _serial variable:

    - (id)initSerial: (BOOL)serial {
        if ((self = [super init])) {
            _lock = [[NSLock alloc] init];
            _pendingBlocks = [[NSMutableArray alloc] init];
            _serial = serial;
        }
        return self;
    }

Before we get to the remaining public API, there's an underlying method to build which will dispatch a single block on the thread pool, and then potentially call itself to run another block:

    - (void)dispatchOneBlock {

Its entire purpose in life is to run stuff on the thread pool, so it dispatches there:

        [gThreadPool addBlock: ^{

Then it grabs the first block in the queue. Naturally, this must be done with the lock held to avoid catastrophic explosions:

            [_lock lock];
            dispatch_block_t block = [_pendingBlocks firstObject];
            [_pendingBlocks removeObjectAtIndex: 0];
            [_lock unlock];

With the block in hand and the lock relinquished, the block can safely be executed on the background thread:

            block();

If the queue is concurrent, then that's all it needs to do. If it's serial, there's more:

            if(_serial) {

On a serial queue, additional blocks will build up, but can't be invoked until preceding blocks complete. When a block completes here, dispatchOneBlock will see if any other blocks are pending on the queue. If there are, it calls itself to dispatch the next block. If not, it sets the queue's running state back to NO:

                [_lock lock];
                if([_pendingBlocks count] > 0) {
                    [self dispatchOneBlock];
                } else {
                    _serialRunning = NO;
                }
                [_lock unlock];
            }
        }];
    }

With this method, implementing dispatchAsync: is relatively easy. Add the block to the queue of pending blocks, then set state and invoke dispatchOneBlock as appropriate:

    - (void)dispatchAsync: (dispatch_block_t)block {
        [_lock lock];
        [_pendingBlocks addObject: block];

If a serial queue is idle then set its state to running and call dispatchOneBlock to get things moving:

        if(_serial && !_serialRunning) {
            _serialRunning = YES;
            [self dispatchOneBlock];

If the queue is concurrent, then call dispatchOneBlock unconditionally. This makes sure the new block is executed as soon as possible even if another block is already running, since multiple blocks are allowed to run concurrently:

        } else if (!_serial) {
            [self dispatchOneBlock];
        }

If a serial queue is already running then nothing more needs to be done. The existing run of dispatchOneBlock will eventually get to the block that was just added to the queue. Now release the lock:

        [_lock unlock];
    }

On to dispatchSync:. GCD is smart about this and runs the block directly on the calling thread, while stopping other blocks from running on the queue (if it's serial). We won't try to be so smart. Instead, we'll just use dispatchAsync:, and wrap it to wait for execution to complete.

It does this using a local NSCondition, plus a done variable to indicate when the block has completed:

    - (void)dispatchSync: (dispatch_block_t)block {
        NSCondition *condition = [[NSCondition alloc] init];
        __block BOOL done = NO;

Then it dispatches a block asynchronously. This block calls the passed-in one, then sets done and signals the condition:

        [self dispatchAsync: ^{
            block();
            [condition lock];
            done = YES;
            [condition signal];
            [condition unlock];
        }];

Out in the original calling thread, it waits on the condition for done to be set, then returns.

        [condition lock];
        while (!done) {
            [condition wait];
        }
        [condition unlock];
    }

At this point, the block's execution has completed. Success! That's the last bit of API needed for MADispatchQueue.

Conclusion
A global thread pool can be implemented with a queue of worker blocks and a bit of intelligent spawning of threads. Using a shared global thread pool, a basic dispatch queue API can be built that offers basic serial/concurrent and synchronous/asynchronous dispatch. This rebuild lacks many of the nice features of GCD and is certainly much less efficient, but even so it gives a nice glimpse into what the inner workings of such a thing might be like, and shows that it's not really magic after all. (Except for dispatch_once. That's all magic.)

That's it for today. Come back next time for more fun, games, and fun. Friday Q&A is driven by reader ideas so if you have something you'd like to see discussed here next time or in the future, please let me know!

Friday Q&A 2015-09-18: Building a Gear Warning System

Today I'm going to go outside my usual set of topics and talk about a fun side project that might broaden your horizons. Or expose my ignorance. A couple of years ago I set out to build a gear warning system for my glider using an Arduino-type system, and I'm going to talk about how it works and what that code looks like.

What Is This Thing?
I fly gliders. I own a share in a relatively high-performance single seat craft.

A glider's enemy is drag, and drag comes from sticking things out into the air. The glider needs a wheel for takeoff and landing, but that big wheel sticking out the bottom creates a lot of drag while flying. Because of this, a lot of gliders have retractable landing gear, including mine. Once I'm airborne and flying on my own, I pull a handle that raises the wheel into the body of the glider. Before landing, I push the handle back to the original position, which extends the wheel.

Every so often, that last part doesn't happen, and the glider lands on its belly. I've never done this and I hope I never will, but it's something that does happen. If you're fortunate enough to do this on grass, the damage can be pretty minimal, or even none. If you do it on a paved runway, the result is a long and expensive white stripe. This is not a good way to end your day.

To avoid this, it's nice to have something that can warn you if you go to land without lowering the landing gear. That's what I built.

Basic Design
How do you actually detect when a pilot is landing with the gear up? Detecting whether the gear is up is relatively easy. You can install a microswitch somewhere in the retraction mechanism to detect its physical position, and you're all set.

Detecting "landing" is a bit harder. One possibility (which would be pretty fun to build) would be to use a GPS unit and terrain database (or at least an airport database) to detect when you're getting low. That's a lot of expense and complication I didn't need, though, and a nontrivial amount of power consumption as well.

The typical way to do this on a glider is to install a second microswitch that detects the position of the spoilers. Spoilers are big flat control surfaces that extend out the top of the wings to destroy the lift they produce and create drag. Drag is normally the enemy for a glider, but when landing, drag can be your friend. Gliders perform so well that it's extremely difficult to get them to come down at a reasonable rate when you want them to, so you use the spoilers to force a descent.

Spoilers are typically used only for landing. This isn't always true. There are other scenarios where you want to descend rapidly, like if you find yourself above a scattered cloud deck that's closing up, but they're pretty rare. The spoiler position can be used as a pretty good "am I landing?" indicator. There will occasionally be false positives, but they're uncommon and can be tolerated.

The idea, then, is to install two microswitches, and sound the alarm whenever the spoilers are extended and the landing gear is not.

How exactly do you "sound the alarm"? I used a cheap, simple piezo buzzer from RadioShack. It runs on almost no electricity and can be wired directly to a microcontroller's output pin. It's loud enough to be heard in flight (a glider cockpit is a pretty quiet place) without being annoyingly loud if it fires inadvertently.

Hardware
The microswitches and buzzer would suffice on their own, with some wiring. Hook up the switches in series and have them pass current when in the alarm position. Wire the whole thing to some electricity, and the buzzer will buzz at the appropriate time. However, I wanted to use a microcontroller to drive everything for a few reasons:

  1. A constant buzzing is not the most effective way to get someone's attention. Aviation is full of stories that go something like, "What's that weird buzzing noise? Oh well, never mind that now, I have to land. CRUNCH" It happened to a friend of mine with one of these simple setups. A more complicated pattern stands a better chance of getting my attention.
  2. There's a chance of hardware failure causing the system to get stuck in the alarm position. Having the alarm remain on for the entire flight afterwards would be extremely annoying. With a microcontroller, it can shut the warning off after a couple of minutes. In a situation where the warning is real, the pilot only has a couple of minutes to do something before it's too late anyway, so there's no need to keep it on longer than that.
  3. It's a lot more fun.

For the microcontroller, I chose the Digispark USB Development Board. It has a bunch of really nice features for this project:

  1. It's cheap. The official price is $9, and I got it on sale for $6. If you're adventurous, you can pick them up on eBay for under $2.
  2. It's easy. It works with the Arduino IDE which makes programming it really simple. It plugs directly into a USB port, so it doesn't need any special cables. Or even boring cables.
  3. It has an on-board voltage regulator which accepts a wide range of voltages. This allows it to take power directly from my glider's main 12V batteries without needing any extra hardware to convert it.
  4. It's really small, which is useful when I don't have a lot of room for the device.

It has some downsides as well compared to more typical Arduino hardware. Program memory is extremely limited at about 6kB, and it only has 6 IO pins. But it's more than enough for this project.

Software Design
I have the Digispark run a loop that constantly polls the state of the microswitches. I could use interrupts instead, but since it has nothing else to do, a polling loop is easier. Power consumption is extremely low already, so there's no need to try to optimize this by sleeping the CPU.

When not in the alarm state, there's not much to do. The polling loop just ensures the buzzer isn't firing, and keeps checking.

When in the alarm state, it turns the buzzer on and off in a pattern. The pattern is fully programmable and needs to stop as soon as the microswitches change position. To make this happen, the buzzer pattern is stored as a sequence of bits in memory. When the buzzer starts, the current time is recorded. The time passed since that moment is used to look up the corresponding bit in the pattern, and the buzzer state is set to the value of that bit. A quick pulse of four buzzes would be recorded as 0xaa, and a slower sequence of on and off would be 0xff 0x00 0xff 0x00 0xff 0x00 0xff.

What exactly is the state we're looking for to sound the buzzer? Unfortunately, I was using switches that were already installed in the glider, and I couldn't remember how the switches were configured. Were they normally open or normally closed? What position activated them? I live a good distance from the airport, so it was inconvenient to go check. It needs to be configurable!

I thought about just baking it into the code and dragging my laptop out to the airport for the final setup. I thought about some sort of jumper configuration. But finally I settled on another plan.

I don't actually care how the switches are set up. There are four possible states the system can be in, and one of them is the state that sounds the alarm. But I don't care about the details of that state, or about the details of the other three.

I set up the system to watch another input pin. When that pin is pulled low (by shorting it to ground with some spare wire, for example), the system cycles through the four possible alarm states one per second. It saves the current state in the chip's onboard EEPROM, and loads that state when the program starts. To configure it, then, all I have to do is connect everything and put the aircraft into the alarm state while sitting on the ground. If the buzzer sounds, I'm all set. Otherwise, short the configuration pin to ground and wait for the buzzer to buzz. Once it does, remove the wire, and the correct configuration is saved.

Code
Arduino is programmed in plain old C++ with some extra libraries available and some unusual entry points. The code should therefore be pretty easy to follow. Where something different is going on, I'll explain.

I'll start by defining some constants for the various IO pins. Arduino identifies pins by number, but I want better names for them, and the ability to easily change which pin a function is assigned to, in case I change the hardware:

    #define PIN_BUZZER 1
    #define PIN_ALERT_CONFIGURE 0
    #define PIN_GEAR 2
    #define PIN_SPOILER 5

The current configuration is stored in the low two bits of a global variable:

    int alertBits;

The low bit controls the gear, and the high bit controls the spoiler. When the bit is 0 the alert condition is for that pin to be low. When the bit is 1 the alert condition is for that pin to be high. I built two convenience functions to extract the bits and turn them into the Arduino constants LOW and HIGH which are returned from the function that reads a pin:

    int alertWhenGearIs() {
        return alertBits & 1 ? HIGH : LOW;
    }

    int alertWhenSpoilerIs() {
        return alertBits & 2 ? HIGH : LOW;
    }

The configuration is stored in the EEPROM. Data is stored in the EEPROM by address, starting from zero. I define a constant for the address that stores alertBits, although I just chose zero:

    #define ALERT_BITS_EEPROM_ADDRESS 0

When in configuration mode, the system will wait for a one second, then move to the next configuration. To do that, it needs to keep track of when the system entered configuration mode, which is done in a global variable:

    unsigned long alertConfigStartMillis;

On Arduino, an int is only 16 bits, which only gives a range of a little over a minute when storing milliseconds. long is 32 bits, which is about 49 days. Since this value will be populated with the number of milliseconds since startup, it's good to have it in a long. It could work as an int since it's only used to compute a delta for a short period of time, but it's better to use a data type big enough to hold the full value.

There's also a constant for how long to wait before moving to the next configuration. It waits for a thousand milliseconds:

    #define ALERT_CONFIG_DELAY_MILLIS 1000

It's useful to keep track of whether the system is currently buzzing, so that actions can be taken when moving between states. I define two states, and a variable to hold the current one:

    enum State {
        kIdle,
        kBuzzing
    };

    enum State state;

To know which bit to use in the alert pattern, the system needs to know when it went into the buzzing state, so it can compute how long it's been and figure out which bit is the current one. The start time is stored in another global:

    unsigned long buzzStartMillis;

The alert sound is stored as an array of bytes. I'll leave out the actual bytes for now:

    uint8_t alertSound[] = {
        ...
    };

The program needs to know how long each bit should be played for. I selected 62 milliseconds, which makes for about 16 bits per second. This is a good compromise between fine control over the timing and having the alertSound array be really long:

    #define MILLIS_PER_BIT 62

The Arduino environment automatically calls a function called setup when the program starts. This is a good place to do, well, setup:

    void setup() {

The various IO pins need to be configured. This is done by calling the built-in pinMode function and giving it a constant that indicates which mode to use for the pin in question. The buzzer pin is used as an output:

        pinMode(PIN_BUZZER, OUTPUT);

The gear pin is used as an input:

        pinMode(PIN_GEAR, INPUT);

The gear pin is also pulled high by enabling the internal pullup resistor. This is done by calling digitalWrite and setting it to HIGH. When the pin is configured as an output, this would cause the output to be high, but when it's an input it enables the pullup resistor instead:

        digitalWrite(PIN_GEAR, HIGH);

This means that if the switch is open, the input will be high. When the switch is closed, the input will be whatever the other side of the switch is connected to. I connected the switches to ground, which pulls the input low. This means that the wiring to the switches is all connected to ground, which made me slightly more comfortable than having them be powered all the time, although it really doesn't matter.

The spoiler pin is configured in the same way:

        pinMode(PIN_SPOILER, INPUT);
        digitalWrite(PIN_SPOILER, HIGH);

As is the configuration pin:

        pinMode(PIN_ALERT_CONFIGURE, INPUT);
        digitalWrite(PIN_ALERT_CONFIGURE, HIGH);

Finally, alertBits is loaded from the EEPROM. The value will normally be from zero to three, but it will be set to 255 the first time because that's the value for EEPROM locations that have never been written. Values out of the normal range are interpreted as zero. Data can be read from the EEPROM by calling EEPROM.read, which takes an EEPROM address and returns the value currently in it:

        int savedBits = EEPROM.read(ALERT_BITS_EEPROM_ADDRESS);
        alertBits = savedBits < 4 ? savedBits : 0;
    }

After the setup function completes, Arduino repeatedly calls the loop function until the power is cut. Ongoing code is placed here. The ongoing code has two tasks: check the gear and spoiler switches to sound the alarm when appropriate, and check the configuration pin to change the configuration when requested:

    void loop() {
        checkGearSpoiler();
        checkAlertConfig();
    }

Let's look at checkGearSpoiler:

    void checkGearSpoiler() {

The first thing to do here is to read the current state of the switches. This is done by calling the digitalRead function:

        int gear = digitalRead(PIN_GEAR);
        int spoiler = digitalRead(PIN_SPOILER);

Then see if we should be sounding the buzzer. We sound the buzzer when both gear and spoiler are in the alert state:

        int soundBuzzer = (gear == alertWhenGearIs() &&
                           spoiler == alertWhenSpoilerIs());

If we're not sounding the buzzer, make sure state reflects that, and ensure the buzzer is turned off. This is done by using digitalWrite to set the buzzer pin to LOW:

        if(!soundBuzzer) {
            state = kIdle;
            digitalWrite(PIN_BUZZER, LOW);

Otherwise, we're in the alert state and we need to sound the buzzer according to the current position in the pattern.

        } else {

The first thing here is to get the current time. This is used to compute the current position in the buzzer pattern. The built-in function millis returns the number of milliseconds since startup:

            unsigned long now = millis();

If the current state is kIdle then we just activated. Change the state, and set buzzStartMillis to the current time:

            if(state == kIdle) {
                state = kBuzzing;
                buzzStartMillis = now;
            }

To compute the current position in the pattern, we start by computing how much time has passed since we started the buzzer:

            unsigned long delta = now - buzzStartMillis;

The value to write to the buzzer pin (LOW or HIGH) will be stored in this local variable:

            int value;

Compute the current index in the pattern by dividing delta by the number of milliseconds per bit. Note that this is a bitwise index, which will require some massaging to turn into an actual bit extracted from alertSound:

            unsigned long index = delta / MILLIS_PER_BIT;

It's possible we'll run off the end of the alertSound array. We don't want to start reading garbage, so if the index is off the end, we'll just set value to LOW:

            int soundBytes = sizeof(alertSound);
            int soundMax = soundBytes * 8;
            if(index >= soundMax) {
                value = LOW;

Otherwise, we need to extract the bit that corresponds to index. To do this, index has to be broken up into the index of the byte which contains the bit, and the index of the bit within that byte. This is done by dividing by 8 and using the quotient and remainder. Since this is a resource-constrained microcontroller, I did this with bitshifting (division by 8 is the same as >> 3) and masking (taking the remainder of dividing by 8 is the same as & 0x07), even though it surely doesn't matter in this particular case:

            } else {
                int byteIndex = index >> 3;
                int bitIndex = index & 0x07;

With the two indexes in hand, we can then get the byte out of alertSound, then shift and mask to get the bit we're after:

                uint8_t byte = alertSound[byteIndex];
                uint8_t bit = (byte >> bitIndex) & 0x01;

Then value is HIGH if the bit is set, otherwise it's LOW:

                value = bit ? HIGH : LOW;
            }

Finally, set the buzzer pin to whatever value was set to:

            digitalWrite(PIN_BUZZER, value);
        }
    }

That's all we need to play the buzzer pattern! This code runs repeatedly, and each time it retrieves the bit for the current moment in time and then either plays the buzzer or not. As long as it runs frequently, the result will be a nice pattern of buzzes. (And since this chip doesn't have much else to do, it should run very frequently indeed.)

Let's look at checkAlertConfig next:

    void checkAlertConfig() {

The first thing it does is, naturally, get the current state of the alert pin:

        int alertConfig = digitalRead(PIN_ALERT_CONFIGURE);

If the pin is LOW then we're in the configuration state:

        if(alertConfig == LOW) {

As before, the first thing here is to get the current time. This is used to advance to the next configuration state after the designated amount of time has passed:

            unsigned long now = millis();

Then it looks at alertConfigStartMillis, which holds the time when the program entered the configuration state. If it's zero then it just entered the configuration state, so it can be set to the current time:

            if(alertConfigStartMillis == 0) {
                alertConfigStartMillis = now;

Otherwise, compute the amount of time that has passed since the program entered the configuration state:

            } else {
                unsigned long delta = now - alertConfigStartMillis;

Then see if we've been in the configuration state long enough to move to the next state:

                if(delta >= ALERT_CONFIG_DELAY_MILLIS) {

If so, increment alertBits, masking to just the bottom two bits to ensure it doesn't go beyond the 0-3 range:

                    alertBits = (alertBits + 1) & 3;

Then write this value to the EEPROM using EEPROM.write:

                    EEPROM.write(ALERT_BITS_EEPROM_ADDRESS, alertBits);

Finally, start a new configuration cycle by setting alertConfigStartMillis to now. This could cause some slop to accumulate in the timing, since it doesn't account for any extra time beyond when configuration started. But extreme precision isn't very important in this case, since configuration is only done once, and it's all human-driven anyway:

                    alertConfigStartMillis = now;
                }
            }

Finally, if the configuration pin isn't active, ensure that alertConfigStartMillis is zero so the program sees it when it does enter the configuration state:

        } else {
            alertConfigStartMillis = 0;
        }
    }

And that's it!

Buzzer Pattern
For completeness, here is the full buzzer pattern I made for my unit. Since I made it so the pattern plays sixteen bits in one second, that means that two one-byte values make for one second. I formatted the array to put two values on each line, so each line is one second.

I constructed the pattern in a completely unscientific attempt to make something that would catch my attention even if distracted. I thought the key to this would be a lot of variation. It starts off with a rapid on/off pattern for one second. Then it pauses, then it plays a slower on/off pattern. Then there's a solid tone, in an attempt to say "PAY ATTENTION TO ME." Then I got creative and had it spell out GEAR UP and WARNING in Morse Code. I don't know Morse Code, but it stands out well. I end with a solid tone, and then if it still hasn't been fixed by then, the alert falls silent. Here's the full pattern:

    uint8_t alertSound[] = {
        // intermittent buzz for initial alert
        0xaa, 0xaa,

        // pause for a second
        0x00, 0x00,

        // steadier on/off sequence
        0xff, 0x00,
        0xff, 0x00,
        0xff, 0x00,
        0xff, 0x00,

        // pause for a second
        0x00, 0x00,

        // solid tone for two seconds
        0xff, 0xff,
        0xff, 0xff,

        // GEAR UP in morse code
        // --. . .- .-. / ..- .--.
        // 1110 1000  1000 1011
        // 1000 1011  1010 0000
        // 0010 1011  1000 1011
        // 1011 1010
        0xe8, 0x8b,
        0x8b, 0xa0,
        0x2b, 0x8b,
        0xba, 0x00,

        // repeat it a few times
        0xe8, 0x8b,
        0x8b, 0xa0,
        0x2b, 0x8b,
        0xba, 0x00,

        0xe8, 0x8b,
        0x8b, 0xa0,
        0x2b, 0x8b,
        0xba, 0x00,

        0xe8, 0x8b,
        0x8b, 0xa0,
        0x2b, 0x8b,
        0xba, 0x00,

        // WARNING in morse code
        // .-- .- .-. -. .. -. --.
        // 1011 1011  1000 1011
        // 1000 1011  1010 0011
        // 1010 0010  1000 1110
        // 1000 1110  1110 1000
        0xbb, 0x8b,
        0x8b, 0xa3,
        0xa2, 0x8e,
        0x8e, 0xe8,

        // then pause
        0x00, 0x00,

        // repeat that a few times too
        0xbb, 0x8b,
        0x8b, 0xa3,
        0xa2, 0x8e,
        0x8e, 0xe8,
        0x00, 0x00,

        0xbb, 0x8b,
        0x8b, 0xa3,
        0xa2, 0x8e,
        0x8e, 0xe8,
        0x00, 0x00,

        0xbb, 0x8b,
        0x8b, 0xa3,
        0xa2, 0x8e,
        0x8e, 0xe8,
        0x00, 0x00,

        // last ditch solid tone for ten seconds
        0xff, 0xff,
        0xff, 0xff,
        0xff, 0xff,
        0xff, 0xff,
        0xff, 0xff,
        0xff, 0xff,
        0xff, 0xff,
        0xff, 0xff,
        0xff, 0xff,
        0xff, 0xff,
    };

CPU Speed
Power is at a premium in a glider. There's no way to generate electricity on board (some fancy people have solar cells, but I don't), so everything runs off batteries. I currently use a LiFePO4 battery with a capacity of 15Ah at about 13V. This has to power everything on board, including the power-hungry transmitters in the VHF radio and radar transponder.

Compared to that, the power consumed by this processor is probably so small it can be ignored entirely. But it's still nice to get it as low as possible. The Digispark has the ability to reduce the CPU speed, which in turn makes it use substantially less power. The CPU is normally clocked to 16MHz, and reducing it to 1MHz makes it use 16x less power. 1MHz is still plenty fast for the small amount of work this program needs to do.

This isn't done in code, but is just a menu item in the Arduino IDE. When working on a project, you specify what kind of board it's for. With the Digispark, each available speed shows as a separate "board," so there's a Digispark entry for 16MHz, 8MHz, and 1MHz. I just picked the last one, re-uploaded my program, and done!

Conclusion
This was a fun project that was different from my usual Apple-related stuff. There wasn't anything too difficult about it, but it was nice to build something physical and practical. I put the end result in a small box and installed it behind my glider's instrument panel, where it lives right now, mostly doing nothing, and occasionally buzzing at me when I test it.

That's it for today! Come back next time for more exciting fun, probably back in the realm of Apple platforms. Friday Q&A is (mostly) driven by reader suggestions, so if you have an idea in the meantime of a topic you'd like to see covered here, please send it in!

A quick note: I'm going to be traveling for a while soon, so my articles will be on hiatus until I'm back around the end of October. I hope to get one more article posted before I go, but no guarantees. Either way, don't worry, more will come soon!

Friday Q&A 2015-11-06: Why is Swift's String API So Hard?

Welcome to a very delayed edition of Friday Q&A. One of the biggest complaints I see from people using Swift is the String API. It's difficult and obtuse, and people often wish it were more like string APIs in other languages. Today, I'm going to explain just why Swift's String API is designed the way it is (or at least, why I think it is) and why I ultimately think it's the best string API out there in terms of its fundamental design.

What is a String?
Let's build a conceptual foundation before we jump into it. Strings are one of those things we understand implicitly but often don't really think about too deeply. Thinking about them deeply helps understand what's going on.

What is a string, conceptually? The high level view is that a string is some text. "Hello, World" is a string, as are "/Users/mikeash" and "Robert'); DROP TABLE Students;--".

(Incidentally, I think that representing all these different concepts as a single string type is a mistake. Human-readable text, file paths, SQL statements, and others are all conceptually different, and this should be represented as different types at the language level. I think that having different conceptual kinds of strings be distinct types would eliminate a lot of bugs. I'm not aware of any language or standard library that does this, though.)

How is this general concept of "text" represented at the machine level? Well, it depends. There are a ton of different ways to do it.

In many languages, a string is an array of bytes. Giving meaning to those bytes is mostly left up to the program. This is the state of strings in C++ using std::string, in Python 2, in Go, and in many other languages.

C is a weird special case of this. In C, a string is a pointer to a sequence of non-zero bytes, terminated by a zero byte. The basic effect is the same, but C strings can't contain zero bytes, and operations like finding the length of a string require scanning memory.

A lot of newer languages define strings as a sequence of UCS-2 or UTF-16 code units. Java, C#, and JavaScript are all examples of this, as well as Objective-C using Cocoa and NSString. This is mostly a historical accident. When Unicode was first introduced in 1991, it was a pure 16-bit system. Several popular languages were designed around that time, and they used Unicode as the basis of their strings. By the time Unicode broke out of the 16-bit model in 1996, it was too late to change how these languages worked. The UTF-16 encoding allows them to encode the larger numbers as pairs of 16-bit code units, and the basic concept of a string as a sequence of 16-bit code units continues.
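
As an illustration of how UTF-16 escapes the 16-bit model, here's the standard arithmetic for encoding a code point above U+FFFF as a surrogate pair, using the G clef character that appears later in this article:

    // Encoding U+1D11E MUSICAL SYMBOL G CLEF as a UTF-16 surrogate pair.
    uint32_t codePoint = 0x1D11E;
    uint32_t v = codePoint - 0x10000;      // 20-bit value: 0xD11E
    uint16_t high = 0xD800 + (v >> 10);    // high surrogate: 0xD834
    uint16_t low  = 0xDC00 + (v & 0x3FF);  // low surrogate:  0xDD1E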

A variant of this approach is to define a string as a sequence of UTF-8 code units, which are 8-bit quantities. This is overall similar to the UTF-16 approach, but allows for a more compact representation for ASCII strings, and avoids conversion when passing strings to functions that expect C-style strings, as those often accept UTF-8 strings.

Some languages represent strings as a sequence of Unicode code points. Python 3 works this way, as do many C implementations of the built-in wchar_t type.

In short, a string is usually considered to be a sequence of some kind of characters, where a character is typically a byte, a UTF-16 code unit, or a Unicode code point.

Problems
Having a string be a sequence of some "character" type is convenient. You can typically treat them like arrays (often they are arrays), which makes it easy to grab subranges, slice pieces off the beginning or end, delete portions, count elements, etc.

The trouble is that we live in a Unicode universe, and Unicode makes things hard. Let's look at an example string and see how it works out:

    aé∞𝄞

Each Unicode code point has a number (written as U+nnnn) and a human-readable name (written in ALL CAPS for some reason) which make it easier to talk about the individual code points. This particular string consists of:

  • U+0061 LATIN SMALL LETTER A
  • U+0065 LATIN SMALL LETTER E
  • U+0301 COMBINING ACUTE ACCENT
  • U+221E INFINITY
  • U+1D11E MUSICAL SYMBOL G CLEF

Let's remove a "character" from the middle of this string, treating a "character" as a UTF-8 byte, a UTF-16 code unit, or a Unicode code point.

Let's start with UTF-8. Here's what this string looks like as UTF-8:

    61 65 cc 81 e2 88 9e f0 9d 84 9e
    -- -- ----- -------- -----------
    a  e    ´      ∞          𝄞

Let's remove the 3rd "character," which we're treating as the 3rd byte. That produces:

    61 65 81 e2 88 9e f0 9d 84 9e

This string is no longer valid UTF-8. UTF-8 bytes fall into three categories. Bytes of the form 0xxxxxxx, with the top bit set to 0, represent plain ASCII characters and stand alone. Bytes of the form 11xxxxxx denote the start of a multi-byte sequence, whose length is indicated by the number of leading 1 bits. Bytes of the form 10xxxxxx denote the remainder of a multi-byte sequence. The byte cc formed the start of a multi-byte sequence a total of two bytes long, and the byte 81 was the trailing byte of that sequence. By removing cc, the trailing byte 81 is left standing alone. Any validating UTF-8 reader will reject this string. This same problem will occur when removing any of the bytes from the third place onward in this string.
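
To make the three categories concrete, here's a tiny Swift classifier (the function name is invented for this example):

    func utf8ByteKind(byte: UInt8) -> String {
        switch byte {
        case 0b0000_0000...0b0111_1111: return "ASCII"           // 0xxxxxxx
        case 0b1000_0000...0b1011_1111: return "continuation"    // 10xxxxxx
        default:                        return "multi-byte lead" // 11xxxxxx
        }
    }

    let bytes: [UInt8] = [0x61, 0x65, 0xcc, 0x81]
    print(bytes.map(utf8ByteKind)) // ["ASCII", "ASCII", "multi-byte lead", "continuation"]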

How about the second byte? If we remove that, we get:

    61 cc 81 e2 88 9e f0 9d 84 9e
    -- ----- -------- -----------
    a    ´      ∞          𝄞

This is still valid UTF-8, but the result is not what we might expect:

á∞𝄞

To a human, the "second character" of this string is "é." But the second byte is just the "e" without the accent mark. The accent mark is added separately as a "combining character." Removing the second byte of the string just removes the "e" which causes the combining accent mark to attach to the "a" instead.

What if we remove the very first byte? This result, at least, is what we'd expect:

    65 cc 81 e2 88 9e f0 9d 84 9e
    -- ----- -------- -----------
    e    ´      ∞          𝄞

Let's take a look at UTF-16 now. Here's what the string looks like as UTF-16:

    0061 0065 0301 221e d834 dd1e
    ---- ---- ---- ---- ---------
     a    e    ´    ∞       𝄞

Let's try removing the second "character":

    0061 0301 221e d834 dd1e
    ---- ---- ---- ---------
     a    ´    ∞       𝄞

This has the same problem as we had above with UTF-8, deleting only the "e" but not the accent mark, causing it to attach to the "a" instead.

What if we delete the 5th character? We get this sequence:

    0061 0065 0301 221e dd1e

Similar to the problem we had with invalid UTF-8 above, this sequence is no longer valid UTF-16. The sequence d834 dd1e forms a surrogate pair, where two 16-bit units are used to represent a code point beyond the 16-bit limit. Leaving a single piece of the surrogate pair standing alone is invalid. Code that deals with UTF-8 usually rejects this sort of thing outright, but UTF-16 is often more forgiving. For example, Cocoa renders the resulting string as:

    aé∞�
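
Incidentally, the surrogate pair values aren't arbitrary. A code point above U+FFFF is encoded by subtracting 0x10000 and splitting the remaining 20 bits across the two code units. Here's the arithmetic for U+1D11E MUSICAL SYMBOL G CLEF, as a quick Swift check:

    let codePoint: UInt32 = 0x1D11E
    let v = codePoint - 0x10000        // 0xD11E, a 20-bit value
    let high = 0xD800 + (v >> 10)      // top 10 bits give 0xD834
    let low = 0xDC00 + (v & 0x3FF)     // bottom 10 bits give 0xDD1E
    print(String(high, radix: 16), String(low, radix: 16)) // d834 dd1e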

What if the string is a sequence of unicode code points? It would look like this:

    00061 00065 00301 0221E 1D11E
    ----- ----- ----- ----- -----
      a     e     ´     ∞     𝄞

With this representation, we can remove any "character" without producing an invalid string. But the problem with the combining accent mark still occurs. Removing the second character produces:

    00061 00301 0221E 1D11E
    ----- ----- ----- -----
      a     ´     ∞     𝄞

Even with this representation, we're not safe from unintuitive results.

These are hardly artificial concerns, either. English is one of the few languages you can write with pure ASCII, and even then you'll have trouble, unless you feel like applying for a job with your "resume" instead of your résumé. The moment you step beyond ASCII, all these weird things start to appear.

Grapheme Clusters
Unicode has the concept of a grapheme cluster, which is essentially the smallest unit that a human reader would consider to be a "character." For many code points, a grapheme cluster is synonymous with a single code point, but it also extends to include things like combining accent marks. If we break the example string into grapheme clusters, we get something fairly sensible:

    a é ∞ 𝄞

If you remove any single grapheme cluster, you get something that a human would generally consider to be reasonable.
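
Swift can demonstrate this directly, using APIs covered in detail below. Deleting the second Character removes the "é" as a unit, accent mark and all:

    var chars = Array("ae\u{301}∞𝄞".characters)
    chars.removeAtIndex(1) // removes the entire "é" grapheme cluster
    print(String(chars))   // a∞𝄞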

Note that I didn't include any numeric equivalents in this example. That's because, unlike UTF-8, UTF-16, or plain unicode code points, there is no single number that can describe a grapheme cluster in the general case. A grapheme cluster is a sequence of one or more code points. A single grapheme cluster is often one or two code points, but it can be a lot of code points in the case of something like Zalgo. For example, consider this string:

    e⃝⃞⃟⃠⃣⃤⃥⃦⃪⃧꙰꙲꙱

This mess consists of 14 separate code points:

  • U+0065
  • U+20DD
  • U+20DE
  • U+20DF
  • U+20E0
  • U+20E3
  • U+20E4
  • U+20E5
  • U+20E6
  • U+20E7
  • U+20EA
  • U+A670
  • U+A672
  • U+A671

All of these code points form a single grapheme cluster.

Here's an interesting example. Consider a string containing the Swiss flag:

🇨🇭

This one symbol is actually two code points: U+1F1E8 U+1F1ED. What are these code points?

  • U+1F1E8 REGIONAL INDICATOR SYMBOL LETTER C
  • U+1F1ED REGIONAL INDICATOR SYMBOL LETTER H

Rather than include a separate code point for the flag of every country on the planet, Unicode just includes 26 "regional indicator symbols." Add together the indicator for C and the indicator for H and you get the Swiss flag. Combine M and X and you get the Mexican flag. Each flag is a single grapheme cluster, but two code points, four UTF-16 code units, and eight UTF-8 bytes.
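
You can watch all of these sizes diverge from Swift, again using the views described below:

    let flag = "\u{1F1E8}\u{1F1ED}"  // the Swiss flag
    print(flag.characters.count)     // 1 grapheme cluster
    print(flag.unicodeScalars.count) // 2 code points
    print(flag.utf16.count)          // 4 UTF-16 code units
    print(flag.utf8.count)           // 8 UTF-8 bytes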

Implications For Implementations
We've seen that there are a lot of different ways to look at strings, and a lot of different things you can call a "character." A "character" as a grapheme cluster comes closest to what a human thinks of as a "character," but when manipulating a string in code, which definition you want to use will depend on the context. When moving an insertion point in text in response to an arrow key, you probably want to go by grapheme clusters. When measuring a string to ensure that it fits in a 140-character tweet, you want to go by unicode code points. When squeezing a string into an 80-character database table column, you're probably dealing in UTF-8 bytes.

How do you reconcile these when writing a string implementation, and balancing the conflicting requirements of performance, memory consumption, and clean code?

The typical answer is to pick a single canonical representation, then possibly allow conversions for the cases where other representations are needed. For example, NSString uses UTF-16 as its canonical representation. The entire API is built around UTF-16. If you want to deal with UTF-8 or Unicode code points, you can convert to UTF-8 or UTF-32 and then manipulate the result. Those are provided as data objects rather than strings, so it's not as convenient. If you want to deal with grapheme clusters, you can find their boundaries using rangeOfComposedCharacterSequencesForRange:, but it's a lot of work to do anything interesting.

Swift's String type takes a different approach. It has no canonical representation, and instead provides views on various representations of the string. This lets you use whichever representation makes the most sense for the task at hand.

Swift's String API, In Brief
In older versions of Swift, String conformed to CollectionType and presented itself as a collection of Character. As of Swift 2, this is no longer the case, and String mostly presents the various views as the proper way to access it.

This is not entirely the case, though, and String still somewhat favors Character and presents a bit of a collection-like interface:

    public typealias Index = String.CharacterView.Index
    public var startIndex: Index { get }
    public var endIndex: Index { get }
    public subscript (i: Index) -> Character { get }

You can index into a String to get individual Characters, but that's about it. Notably, you can't iterate using the standard for-in syntax.

What is a "character" in Swift's eyes? As we've seen, there are a lot of possibilities. Swift has settled on the grapheme cluster as its idea of a "character." This seems like a good choice, since as we saw above it best matches our human idea of a "character" in a string.

The various views are exposed as properties on String. For example, here's the characters property:

    public var characters: String.CharacterView { get }

CharacterView is a collection of Characters:

    extension String.CharacterView : CollectionType {
        public struct Index ...
        public var startIndex: String.CharacterView.Index { get }
        public var endIndex: String.CharacterView.Index { get }
        public subscript (i: String.CharacterView.Index) -> Character { get }
    }

This looks a lot like the interface of String itself, except it conforms to CollectionType and so gets all of the functionality that provides, like slicing and iteration and mapping and counting. So while this is not allowed:

    for x in "abc" {}

This works fine:

    for x in "abc".characters {}

You can get a string back out of a CharacterView by using an initializer:

    public init(_ characters: String.CharacterView)

You can even get a String from an arbitrary sequence of Characters:

    public init<S : SequenceType where S.Generator.Element == Character>(_ characters: S)
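
This makes grapheme-cluster-level transformations pleasantly direct. For example, reversing by Characters keeps each accented letter intact:

    let reversed = String("ae\u{301}∞𝄞".characters.reverse())
    print(reversed) // 𝄞∞éa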

Working our way down the hierarchy, the next view is the UTF-32 view. Swift calls UTF-32 code units "unicode scalars," since UTF-32 code units correspond exactly to Unicode code points. Here's what the (abbreviated) interface looks like:

    public var unicodeScalars: String.UnicodeScalarView

    public struct UnicodeScalarView : CollectionType, _Reflectable, CustomStringConvertible, CustomDebugStringConvertible {
        public struct Index ...
        public var startIndex: String.UnicodeScalarView.Index { get }
        public var endIndex: String.UnicodeScalarView.Index { get }
        public subscript (position: String.UnicodeScalarView.Index) -> UnicodeScalar { get }
    }

Like CharacterView, there's a String initializer for UnicodeScalarView:

    public init(_ unicodeScalars: String.UnicodeScalarView)

Unfortunately, there's no initializer for an arbitrary sequence of UnicodeScalars, so you have to do a little extra work if you prefer to manipulate, for example, an array of them and then turn them back into a string. There isn't even an initializer for UnicodeScalarView that takes an arbitrary sequence of UnicodeScalars. There is, however, a mutable append function, so you can build a String in three steps:

    var unicodeScalarsView = String.UnicodeScalarView()
    unicodeScalarsView.appendContentsOf(unicodeScalarsArray)
    let unicodeScalarsString = String(unicodeScalarsView)

Next is the UTF-16 view. It looks much like the others:

    public var utf16: String.UTF16View { get }

    public struct UTF16View : CollectionType {
        public struct Index ...
        public var startIndex: String.UTF16View.Index { get }
        public var endIndex: String.UTF16View.Index { get }
        public subscript (i: String.UTF16View.Index) -> CodeUnit { get }
    }

The String initializer for this view is slightly different:

    public init?(_ utf16: String.UTF16View)

Unlike the others, this is a failable initializer. Any sequence of Characters or UnicodeScalars is a valid String, but it's possible to have a sequence of UTF-16 code units that don't form a valid string. This initializer will produce nil if the view's contents aren't valid.

Going from an arbitrary sequence of UTF-16 code units back to a String is pretty obscure. UTF16View has no public initializers and few mutating functions. The solution is to use the global transcode function, which works with the UnicodeCodecType protocol. There are three implementations of this protocol: UTF8, UTF16, and UTF32. The transcode function can be used to convert between them. It's pretty gnarly, though. For the input, it takes a GeneratorType which produces the input, and for the output it takes a function which is called for each unit of output. This can be used to build up a string piece by piece by converting to UTF32, then converting each UTF-32 code unit to a UnicodeScalar and appending it to a String:

    var utf16String = ""
    transcode(UTF16.self, UTF32.self, utf16Array.generate(), { utf16String.append(UnicodeScalar($0)) }, stopOnError: true)

Finally, there's the UTF-8 view. It's what we'd expect from what we've seen so far:

    public var utf8: String.UTF8View { get }

    public struct UTF8View : CollectionType {
        /// A position in a `String.UTF8View`.
        public struct Index ...
        public var startIndex: String.UTF8View.Index { get }
        public var endIndex: String.UTF8View.Index { get }
        public subscript (position: String.UTF8View.Index) -> CodeUnit { get }
    }

There's an initializer for going the other way. Like with UTF16View, the initializer is failable, since a sequence of UTF-8 code units may not be valid:

    public init?(_ utf8: String.UTF8View)

Like before, there's no convenient way to turn an arbitrary sequence of UTF-8 code units into a String. The transcode function can be used here too:

    var utf8String = ""
    transcode(UTF8.self, UTF32.self, utf8Array.generate(), { utf8String.append(UnicodeScalar($0)) }, stopOnError: true)

Since these transcode calls are pretty painful, I wrapped them up in a pair of nice failable initializers:

    extension String {
        init?<Seq: SequenceType where Seq.Generator.Element == UInt16>(utf16: Seq) {
            self.init()

            guard transcode(UTF16.self,
                            UTF32.self,
                            utf16.generate(),
                            { self.append(UnicodeScalar($0)) },
                            stopOnError: true)
                            == false else { return nil }
        }

        init?<Seq: SequenceType where Seq.Generator.Element == UInt8>(utf8: Seq) {
            self.init()

            guard transcode(UTF8.self,
                            UTF32.self,
                            utf8.generate(),
                            { self.append(UnicodeScalar($0)) },
                            stopOnError: true)
                            == false else { return nil }
        }
    }

Now we can create Strings from arbitrary sequences of UTF-16 or UTF-8:

    String(utf16: utf16Array)
    String(utf8: utf8Array)
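
These behave sensibly on bad input, too. A lone surrogate or a truncated multi-byte sequence produces nil, while valid input round-trips:

    print(String(utf16: [0xD834] as [UInt16]))         // nil: half a surrogate pair
    print(String(utf8: [0x61, 0xcc] as [UInt8]))       // nil: truncated sequence
    print(String(utf16: [0xD834, 0xDD1E] as [UInt16])) // Optional("𝄞")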

Indexes
The various views are all indexable collections, but they are very much not arrays. The index types are weird custom structs. This means you can't index views by number:

    // all errors
    string[2]
    string.characters[2]
    string.unicodeScalars[2]
    string.utf16[2]
    string.utf8[2]

Instead, you have to start with either the collection's startIndex or endIndex, then use methods like successor() or advancedBy() to move around:

    // these work
    string[string.startIndex.advancedBy(2)]
    string.characters[string.characters.startIndex.advancedBy(2)]
    string.unicodeScalars[string.unicodeScalars.startIndex.advancedBy(2)]
    string.utf16[string.utf16.startIndex.advancedBy(2)]
    string.utf8[string.utf8.startIndex.advancedBy(2)]

This is not fun. What's going on?

Recall that these are all views on the same underlying data, which is stored in some canonical form within the string object. When you use a view which doesn't match that canonical form, the data has to be converted when accessed.

Recall from above that these various encodings have different sizes and lengths. That means that there's no straightforward way to map a location in one view to a location in another view, because the mapping depends on the underlying data. Consider this string for example:

    AƎ工🄞

Imagine the canonical representation within String is UTF-32. That representation will be an array of 32-bit integers:

    0x00041 0x0018e 0x05de5 0x1f11e

Now imagine we get the UTF-8 view of this data. Conceptually, that data is a sequence of 8-bit integers:

    0x41 0xc6 0x8e 0xe5 0xb7 0xa5 0xf0 0x9f 0x84 0x9e

Here's how this sequence maps back to the original UTF-32:

    | 0x00041 |  0x0018e  |     0x05de5    |       0x1f11e       |
    |         |           |                |                     |
    |  0x41   | 0xc6 0x8e | 0xe5 0xb7 0xa5 | 0xf0 0x9f 0x84 0x9e |

If I ask the UTF-8 view for the value at index 6, it has to scan the UTF-32 array starting from the beginning to figure out where that value is and what it contains.

Obviously, this can be done. Swift provides the necessary functionality, it's just not pretty: string.utf8[string.utf8.startIndex.advancedBy(6)]. Why not make it easier and allow indexing with an integer? The awkwardness is essentially Swift's way of reinforcing the fact that this is an expensive operation. In a world where UTF8View provided subscript(Int), we'd expect these two pieces of code to be pretty much equivalent:

    for c in string.utf8 {
        ...
    }

    for i in 0..<string.utf8.count {
        let c = string.utf8[i]
        ...
    }

They would work the same, but the second one would be drastically slower. The first loop is a nice linear scan, whereas the second loop has to do a linear scan on each iteration, giving the whole loop a quadratic runtime. It's the difference between scanning a million-character string in a tenth of a second, and having that scan take three hours. (Approximate times taken from my 2013 MacBook Pro.)

Let's take another example: simply getting the last character of the string:

    let lastCharacter = string.characters[string.characters.endIndex.predecessor()]

    let lastCharacter = string.characters[string.characters.count - 1]

The first version is fast. It starts at the end of the string, scans backwards briefly to figure out where the last Character starts, and then fetches it. The second version does a full scan of the entire string... twice! It has to scan the entire string to count how many Characters it contains, then scan it again to figure out where the specified numeric index is.

Everything you could do with an API like this can still be done with Swift's API; it's just different, and a little harder. These differences show the programmer that these views are not just arrays and don't perform like arrays. When we see subscript indexing, we assume with pretty good reason that the indexing operation is fast. If String's views supported integer subscripting, it would break that assumption, and make it easy to write really slow code.

Writing Code With String
What does all of this mean when writing code that uses String for practical purposes?

Use the highest-level API you can. If you need to see if a string starts with a certain letter, for example, don't index into the string to retrieve the first character and compare. Use the hasPrefix method, which takes care of the details for you. Don't be afraid to import Foundation and use NSString methods. For example, if you want to remove whitespace at the beginning and end of a String, don't manually iterate and look at the characters, use stringByTrimmingCharactersInSet.
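
For example (hasPrefix is built in, and the trimming method comes from Foundation):

    import Foundation

    let padded = "  Hello, World  "
    print(padded.hasPrefix("  He")) // true
    print(padded.stringByTrimmingCharactersInSet(
        NSCharacterSet.whitespaceAndNewlineCharacterSet())) // "Hello, World"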

If you do need to do character-level manipulation yourself, think carefully about exactly what a "character" means to you in this particular case. Often, the right answer is a grapheme cluster, which is what's represented by the Swift Character type and the characters view.

Whatever you're doing with the text, think about it in terms of a linear scan from the beginning or the end. Operations like asking for the count of characters or seeking into the middle will likely be linear time scans anyway, so it's better if you can arrange your code to do so explicitly. Grab the appropriate view, get its start or end index, then move that index around as you need it using advancedBy() and similar functions.

If you really need random access, or don't mind an efficiency hit and want the convenience of a more straightforward container, you can convert a view into an Array of whatever that view contains. For example, Array(string.characters) will produce an array of the grapheme clusters in that string. This is probably not a very efficient representation and will chew up some extra memory, but it's going to be much easier to work with. You can convert back to a String when done.
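
A quick sketch of that tradeoff:

    let chars = Array("ae\u{301}∞𝄞".characters) // one linear conversion up front
    print(chars[2])                              // ∞, and integer indexing is now cheap
    print(String(chars.dropLast()))              // aé∞, converting back when done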

Conclusion
Swift's String type takes an unusual approach to strings. Many other languages pick one canonical representation for strings and leave you to fend for yourself if you want anything else. Often they simply give up on important questions like "what exactly is a character?" and pick something which is easy to work with in code but which sugar-coats the inherently difficult problems encountered in string processing. Swift doesn't sugar-coat it, but instead shows you the reality of what's going on. This can be difficult, but it's no more difficult than it needs to be.

The String API does have some holes in it, and could use some extra functionality to make life a little easier. In particular, converting from UTF-8 or UTF-16 to a String is unreasonably difficult and annoying. It would be nice to have facilities for initializing a UTF8View and UTF16View from arbitrary sequences of code units, as well as some more useful mutating functions on these views so they can be manipulated directly.

That's it for today. Come back next time for more shenanigans and terror. Friday Q&A is driven by reader ideas, so in the meantime, send me your requests for topics.

Friday Q&A 2015-11-20: Covariance and Contravariance

Subtypes and supertypes are a common part of modern programming. Covariance and contravariance tell us where subtypes are accepted and where supertypes are accepted in place of the original type. This shows up frequently in most of the programming most of us do, but many developers are unaware of the concepts beyond a loose instinctual sense. Today I'm going to discuss it in detail.

Subtypes and Supertypes
We all know what subclassing is. When you create a subclass, you're creating a subtype. To take a classic example, you might subclass Animal to create Cat:

    class Animal {
        ...
    }

    class Cat: Animal {
        ...
    }

This makes Cat a subtype of Animal. That means that all Cats are Animals, but not all Animals are Cats.

Subtypes can typically substitute for supertypes. Any programmer who's been around a little while can see why the first line of this Swift code works and the second doesn't:

    let animal: Animal = Cat()
    let cat: Cat = Animal()

This is true for function types as well:

    func animalF() -> Animal {
        return Animal()
    }

    func catF() -> Cat {
        return Cat()
    }

    let returnsAnimal: () -> Animal = catF
    let returnsCat: () -> Cat = animalF

All of this works in Objective-C too, using blocks. The syntax is much uglier, though, so I decided to stick with Swift.

Note that this does not work:

    func catCatF(inCat: Cat) -> Cat {
        return inCat
    }

    let animalAnimal: Animal -> Animal = catCatF

Confused yet? Not to worry, this whole article is going to explore exactly why the first version works while the second version doesn't, and hit on some more practical stuff along the way.

Overridden Methods
Similar things are at work with overridden methods. Imagine this class:

    class Person {
        func purchaseAnimal() -> Animal
    }

Now let's subclass it, override that method, and change the return type:

    class CrazyCatLady: Person {
        override func purchaseAnimal() -> Cat
    }

Is this legal? Yes. Why?

The Liskov substitution principle is the guiding principle for subclassing. It says, in short, that an instance of a subclass can always be substituted for an instance of its superclass. Anywhere you have an Animal, you can replace it with a Cat. Anywhere you have a Person, you can replace it with a CrazyCatLady.

Here's some code that uses a Person, with explicit type annotations for clarity:

    let person: Person = getAPerson()
    let animal: Animal = person.purchaseAnimal()
    animal.pet()

Imagine that getAPerson returns a CrazyCatLady. Does this code still work? CrazyCatLady.purchaseAnimal will return a Cat. That instance is placed into animal. A Cat is a valid Animal, so it can do everything an Animal can do, including pet. Having CrazyCatLady return Cat is valid.

Let's imagine we want to move the pet operation into Person, so we can have a particular person pet a particular animal:

    class Person {
        func purchaseAnimal() -> Animal
        func pet(animal: Animal)
    }

Naturally, CrazyCatLady only pets cats:

    class CrazyCatLady: Person {
        override func purchaseAnimal() -> Cat
        override func pet(animal: Cat)
    }

Is this legal? No!

To understand why, let's look at some code that uses this method:

    let person: Person = getAPerson()
    let animal: Animal = getAnAnimal()
    person.pet(animal)

Imagine that getAPerson() returns a CrazyCatLady. This line is still good:

    let person: Person = getAPerson()

Imagine that getAnAnimal() returns a Dog, which is a subclass of Animal with decidedly different behaviors from Cat. This line is still good as well:

    let animal: Animal = getAnAnimal()

Now we have a CrazyCatLady in person and a Dog in animal and we do this:

    person.pet(animal)

Kaboom! CrazyCatLady's pet method is expecting a Cat. It has no idea what to do with a Dog. It's probably going to be accessing properties and calling methods that Dog doesn't have.

This code is perfectly legal. It gets a Person, it gets an Animal, then it calls a method on Person that takes an Animal. The problem lies above, when we changed CrazyCatLady.pet to take a Cat. That broke the Liskov substitution principle: no longer can a CrazyCatLady be used anywhere a Person is used!

Thankfully, the compiler has our back. It knows that using a subtype for an overridden method's parameter is not legal, and will refuse to compile this code.

Is it ever legal to use a different type in an overridden method? Yes, actually: you can use a supertype. For example, imagine that Animal subclasses Thing. It would then be legal to override pet to take Thing:

    override func pet(thing: Thing)

This preserves substitutability. If treated as a Person, then this method will always be passed Animals, which are Things.

This is a key rule: function return values can be changed to subtypes, moving down the hierarchy, whereas function parameters can be changed to supertypes, moving up the hierarchy.
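
Here's a compact sketch of both rules, using the Animal and Cat classes from above plus invented Shelter classes (and the hypothetical Thing superclass of Animal mentioned earlier):

    class Shelter {
        func adopt() -> Animal { return Animal() }
        func groom(animal: Animal) {}
    }

    class CatShelter: Shelter {
        // Return type moves down the hierarchy: legal.
        override func adopt() -> Cat { return Cat() }

        // A supertype parameter would also be legal:
        //     override func groom(animal: Thing) {}
        // A subtype parameter is rejected by the compiler:
        //     override func groom(animal: Cat) {}
    }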

Standalone Functions
The subtype/supertype relationship is obvious enough when it comes to classes. It directly follows the class hierarchy. What about functions?

    let f1: A -> B = ...
    let f2: C -> D = f1

When is this legal, and when is it not?

This is basically a miniature version of the Liskov substitution principle. In fact, you can think of functions as little mini-objects with just one method. When you have two different object types, when can you mix them like this? When the original type is a subtype of the destination. And when is a method a subtype of another method? As we saw above, it's when parameters are supertypes and the return value is a subtype.

Applied here, the above code works if A is a supertype of C, and if B is a subtype of D. Put concretely:

    let f1: Animal -> Animal = ...
    let f2: Cat -> Thing = f1

The two parts move in opposite directions. This may not be what you want, but it's the only way it can actually work.

This is another key rule: functions are subtypes of other functions if the parameter types are supertypes and the return types are subtypes.

Properties
Read-only properties are pretty simple. Subclass properties must be subtypes. A read-only property is essentially a function which takes no parameters and returns a value, and the same rules apply.

Read-write properties are also pretty simple. Subclass properties must be the exact same type as the superclass. A read-write property is essentially a pair of functions. The getter is a function with no parameters that returns a value, and the setter is a function with one parameter and no return value:

    var animal: Animal
    // This is like:
    func getAnimal() -> Animal
    func setAnimal(animal: Animal)

As we saw above, function parameters move up while function return types move down. Since both the parameter and the return value are forced to be the same type, that type can't change:

    // This doesn't work:
    override func getAnimal() -> Cat
    override func setAnimal(animal: Cat)

    // Neither does this:
    override func getAnimal() -> Thing
    override func setAnimal(animal: Thing)

Generics
How about generics? Given some type with a generic parameter, when does this work?

    let var1: SomeType<A> = ...
    let var2: SomeType<B> = var1

In theory, it depends on how the generic parameter is used. A generic parameter does nothing on its own, but is used as property types, method parameter types, and method return types.

If the generic parameter was used purely for method return types and read-only properties, then it would work if B were a supertype of A:

    let var1: SomeType<Cat> = ...
    let var2: SomeType<Animal> = var1

If the generic parameter was used purely for method parameter types, then it would work if B were a subtype of A:

    let var1: SomeType<Animal> = ...
    let var2: SomeType<Cat> = var1

If the generic parameter was used both ways, then it would only work if A and B were identical. This is also the case if the generic parameter was used as the type for a read-write property.

That's the theory. It's a bit complex and subtle. That's probably why Swift takes the easy way out. For two generic types to be compatible in Swift, they must have identical generic parameters. Subtypes and supertypes are never allowed, even when the theory says it would be acceptable.
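
To make that concrete, here's a sketch with an invented Box type, where covariance would be theoretically safe but Swift still says no:

    class Box<T> {
        let value: T
        init(_ value: T) { self.value = value }
    }

    let catBox = Box(Cat())
    // T is only ever read out of the box, so this would be safe in theory,
    // but Swift's generics are invariant and the compiler rejects it:
    //     let animalBox: Box<Animal> = catBox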

Objective-C actually does this a bit better. A generic parameter in Objective-C can be annotated with __covariant to indicate that subtypes are acceptable, and __contravariant to indicate that supertypes are acceptable. This can be seen in the interface for NSArray, among others:

    @interface NSArray<__covariant ObjectType> : NSObject ...

Covariance and Contravariance
The astute reader may notice that the title of the article contains these two terms which I have carefully avoided using this whole time. Now that we're firm on the concepts, let's talk about the terminology.

Covariance is when subtypes are accepted. Overridden read-only properties are covariant.

Contravariance is when supertypes are accepted. The parameters of overridden methods are contravariant.

Invariance is when neither supertypes nor subtypes are accepted. Swift generics are invariant.

Bivariance is when both supertypes and subtypes are accepted. I can't think of any examples of bivariance in Objective-C or Swift.

You may find the terminology hard to remember. That's OK! It's not really that important. As long as you understand subtyping, supertyping, and what causes a subtype or supertype to be acceptable in any given place, you can just look up the terminology in the unlikely event that you need it.

Conclusion
Covariance and contravariance determine when a subtype or supertype can be used in place of a type. It most commonly appears when overriding a method and changing the argument or return types, in which case the return type must be a subtype, and the arguments must be supertypes. The guiding principle behind this is Liskov substitution, which means that an instance of a subclass must be usable anywhere an instance of the superclass can be used. Subtype and supertype requirements can be derived from this principle.

That's it for today. Come back for more exciting adventures. Or just come back for exciting adventures; "more" is probably out of place, since covariance is not exciting. In any case, Friday Q&A is driven by reader suggestions, so if you have a suggestion for an article here, please send it in!


Friday Q&A 2015-12-11: Swift Weak References

In case you have been on Mars, in a cave, with your eyes shut and your fingers in your ears, Swift has been open sourced. This makes it convenient to explore one of the more interesting features of Swift's implementation: how weak references work.

Weak References
In a garbage collected or reference counted language, a strong reference is one which keeps the target object alive. A weak reference is one which doesn't. An object can't be destroyed while there are strong references to it, but it can be destroyed while there are weak references to it.

When we say "weak reference," we usually mean a zeroing weak reference. That is, when the target of the weak reference is destroyed, the weak reference becomes nil. It's also possible to have non-zeroing weak references, which trap, crash, or invoke nasal demons. This is what you get when you use unsafe_unretained in Objective-C, or unowned in Swift. (Note that Objective-C gives us the nasal-demons version, while Swift takes care to crash reliably.)

Zeroing weak references are handy to have around, and they're extremely useful in reference counted languages. They allow circular references to exist without creating retain cycles, and without having to manually break back references. They're so useful that I implemented my own version of weak references back before Apple introduced ARC and made language-level weak references available outside of garbage collected code.

How Does It Work?
The typical implementation for zeroing weak references is to keep a list of all the weak references to each object. When a weak reference is created to an object, that reference is added to the list. When that reference is reassigned or goes out of scope, it's removed from the list. When an object is destroyed, all of the references in the list are zeroed. In a multithreaded environment (i.e. all of them these days), the implementation must synchronize obtaining a weak reference and destroying an object to avoid race conditions when one thread releases the last strong reference to an object at the same time another thread tries to load a weak reference to it.

In my implementation, each weak reference is a full-fledged object. The list of weak references is just a set of weak reference objects. This adds some inefficiency because of the extra indirection and memory use, but it's convenient to have the references be full objects.

In Apple's Objective-C implementation, each weak reference is a plain pointer to the target object. Rather than reading and writing the pointers directly, the compiler uses helper functions. When storing to a weak pointer, the store function registers the pointer location as a weak reference to the target. When reading from a weak pointer, the read function integrates with the reference counting system to ensure that it never returns a pointer to an object that's being deallocated.

Zeroing in Action
Let's build a bit of code so we can watch this stuff happen.

We want to be able to dump the contents of an object's memory. This function takes a region of memory, breaks it into pointer-sized chunks, and turns the whole thing into a convenient hex string:

    func contents(ptr: UnsafePointer<Void>, _ length: Int) -> String {
        let wordPtr = UnsafePointer<UInt>(ptr)
        let words = length / sizeof(UInt.self)
        let wordChars = sizeof(UInt.self) * 2

        let buffer = UnsafeBufferPointer<UInt>(start: wordPtr, count: words)
        let wordStrings = buffer.map({ word -> String in
            var wordString = String(word, radix: 16)
            while wordString.characters.count < wordChars {
                wordString = "0" + wordString
            }
            return wordString
        })
        return wordStrings.joinWithSeparator(" ")
    }

The next function creates a dumper function for an object. Call it once with an object, and it returns a function that will dump the content of this object. Internally, it saves an UnsafePointer to the object, rather than using a normal reference. This ensures that it doesn't interact with the language's reference counting system. It also allows us to dump the memory of an object after it has been destroyed, which will come in handy later.

    func dumperFunc(obj: AnyObject) -> (Void -> String) {
        let objString = String(obj)
        let ptr = unsafeBitCast(obj, UnsafePointer<Void>.self)
        let length = class_getInstanceSize(obj.dynamicType)
        return {
            let bytes = contents(ptr, length)
            return "\(objString) \(ptr): \(bytes)"
        }
    }

Here's a class that exists to hold a weak reference so we can inspect it. I added dummy variables on either side to make it clear where the weak reference lives in the memory dump:

    class WeakReferer {
        var dummy1 = 0x1234321012343210
        weak var target: WeakTarget?
        var dummy2: UInt = 0xabcdefabcdefabcd
    }

Let's give it a try! We'll start by creating a referer and dumping it:

    let referer = WeakReferer()
    let refererDump = dumperFunc(referer)
    print(refererDump())

This prints:

    WeakReferer 0x00007f8a3861b920: 0000000107ab24a0 0000000200000004 1234321012343210 0000000000000000 abcdefabcdefabcd

We can see the isa at the beginning, followed by some other internal fields. dummy1 occupies the third chunk, and dummy2 occupies the fifth. We can see that the weak reference in between them, in the fourth chunk, is zero, as expected.

Let's point it at an object now, and see what it looks like. I'll do this inside a do block so we can control when the target goes out of scope and is destroyed:

    do {
        let target = NSObject()
        referer.target = target
        print(target)
        print(refererDump())
    }

This prints:

    <NSObject: 0x7fda6a21c6a0>
    WeakReferer 0x00007fda6a000ad0: 00000001050a44a0 0000000200000004 1234321012343210 00007fda6a21c6a0 abcdefabcdefabcd

As expected, the pointer to the target is stored directly in the weak reference. Let's dump it again after the target is destroyed at the end of the do block:

    print(refererDump())
    WeakReferer 0x00007ffe32300060: 000000010cfb44a0 0000000200000004 1234321012343210 0000000000000000 abcdefabcdefabcd

It gets zeroed out. Perfect!

Just for fun, let's repeat the experiment with a pure Swift object as the target. It's not nice to bring Objective-C into the picture when it's not necessary. Here's a pure Swift target:

    class WeakTarget {}

Let's try it out:

    let referer = WeakReferer()
    let refererDump = dumperFunc(referer)
    print(refererDump())
    do {
        class WeakTarget {}
        let target = WeakTarget()
        referer.target = target
        print(refererDump())
    }
    print(refererDump())

The target starts out zeroed as expected, then gets assigned:

    WeakReferer 0x00007fbe95000270: 00000001071d24a0 0000000200000004 1234321012343210 0000000000000000 abcdefabcdefabcd
    WeakReferer 0x00007fbe95000270: 00000001071d24a0 0000000200000004 1234321012343210 00007fbe95121ce0 abcdefabcdefabcd

Then when the target goes away, the reference should be zeroed:

    WeakReferer 0x00007fbe95000270: 00000001071d24a0 0000000200000004 1234321012343210 00007fbe95121ce0 abcdefabcdefabcd

Oh dear. It didn't get zeroed. Maybe the target didn't get destroyed. Something must be keeping it alive! Let's double-check:

    class WeakTarget {
        deinit { print("WeakTarget deinit") }
    }

Running the code again, we get:

    WeakReferer 0x00007fd29a61fa10: 0000000107ae44a0 0000000200000004 1234321012343210 0000000000000000 abcdefabcdefabcd
    WeakReferer 0x00007fd29a61fa10: 0000000107ae44a0 0000000200000004 1234321012343210 00007fd29a42a920 abcdefabcdefabcd
    WeakTarget deinit
    WeakReferer 0x00007fd29a61fa10: 0000000107ae44a0 0000000200000004 1234321012343210 00007fd29a42a920 abcdefabcdefabcd

So it is going away, but the weak reference isn't being zeroed out. How about that, we found a bug in Swift! It's pretty amazing that it hasn't been fixed after all this time. You'd think somebody would have noticed before now. Let's go ahead and generate a nice crash by accessing the reference, then we can file a bug with the Swift project:

    let referer = WeakReferer()
    let refererDump = dumperFunc(referer)
    print(refererDump())
    do {
        class WeakTarget {
            deinit { print("WeakTarget deinit") }
        }
        let target = WeakTarget()
        referer.target = target
        print(refererDump())
    }
    print(refererDump())
    print(referer.target)

Here comes the crash:

    WeakReferer 0x00007ff7aa20d060: 00000001047a04a0 0000000200000004 1234321012343210 0000000000000000 abcdefabcdefabcd
    WeakReferer 0x00007ff7aa20d060: 00000001047a04a0 0000000200000004 1234321012343210 00007ff7aa2157f0 abcdefabcdefabcd
    WeakTarget deinit
    WeakReferer 0x00007ff7aa20d060: 00000001047a04a0 0000000200000004 1234321012343210 00007ff7aa2157f0 abcdefabcdefabcd
    nil

Oh dear squared! Where's the kaboom? There was supposed to be an Earth-shattering kaboom! The output says everything is working after all, but we can see clearly from the dump that it isn't working at all.

Let's inspect everything really carefully. Here's a revised version of WeakTarget with a dummy variable to make it nicer to dump its contents as well:

    class WeakTarget {
        var dummy = 0x0123456789abcdef

        deinit {
            print("Weak target deinit")
        }
    }

Here's some new code that runs through the same procedure and dumps both objects at every step:

    let referer = WeakReferer()
    let refererDump = dumperFunc(referer)
    print(refererDump())
    let targetDump: Void -> String
    do {
        let target = WeakTarget()
        targetDump = dumperFunc(target)
        print(targetDump())

        referer.target = target

        print(refererDump())
        print(targetDump())
    }
    print(refererDump())
    print(targetDump())
    print(referer.target)
    print(refererDump())
    print(targetDump())

Let's walk through the output. The referer starts out life as before, with a zeroed-out target field:

    WeakReferer 0x00007fe174802520: 000000010faa64a0 0000000200000004 1234321012343210 0000000000000000 abcdefabcdefabcd

The target starts out life as a normal object, with various header fields followed by our dummy field:

    WeakTarget 0x00007fe17341d270: 000000010faa63e0 0000000200000004 0123456789abcdef

Upon assigning to the target field, we can see the pointer value get filled in:

    WeakReferer 0x00007fe174802520: 000000010faa64a0 0000000200000004 1234321012343210 00007fe17341d270 abcdefabcdefabcd

The target is much as before, but one of the header fields went up by 2:

    WeakTarget 0x00007fe17341d270: 000000010faa63e0 0000000400000004 0123456789abcdef

The target gets destroyed as expected:

    Weak target deinit

We see the referer object still has a pointer to the target:

    WeakReferer 0x00007fe174802520: 000000010faa64a0 0000000200000004 1234321012343210 00007fe17341d270 abcdefabcdefabcd

And the target itself still looks very much alive, although a different header field went down by 2 compared to the last time we saw it:

    WeakTarget 0x00007fe17341d270: 000000010faa63e0 0000000200000002 0123456789abcdef

Accessing the target field produces nil even though it wasn't zeroed out:

    nil

Dumping the referer again shows that the mere act of accessing the target field has altered it. Now it's zeroed out:

    WeakReferer 0x00007fe174802520: 000000010faa64a0 0000000200000004 1234321012343210 0000000000000000 abcdefabcdefabcd

The target is now totally obliterated:

    WeakTarget 0x00007fe17341d270: 200007fe17342a04 300007fe17342811 ffffffffffff0002

More and more interesting. We saw header fields incrementing and decrementing a bit; let's see if we can make that happen more:

    let target = WeakTarget()
    let targetDump = dumperFunc(target)
    do {
        print(targetDump())
        weak var a = target
        print(targetDump())
        weak var b = target
        print(targetDump())
        weak var c = target
        print(targetDump())
        weak var d = target
        print(targetDump())
        weak var e = target
        print(targetDump())

        var f = target
        print(targetDump())
        var g = target
        print(targetDump())
        var h = target
        print(targetDump())
        var i = target
        print(targetDump())
        var j = target
        print(targetDump())
        var k = target
        print(targetDump())
    }
    print(targetDump())

This prints:

    WeakTarget 0x00007fd883205df0: 00000001093a4840 0000000200000004 0123456789abcdef
    WeakTarget 0x00007fd883205df0: 00000001093a4840 0000000400000004 0123456789abcdef
    WeakTarget 0x00007fd883205df0: 00000001093a4840 0000000600000004 0123456789abcdef
    WeakTarget 0x00007fd883205df0: 00000001093a4840 0000000800000004 0123456789abcdef
    WeakTarget 0x00007fd883205df0: 00000001093a4840 0000000a00000004 0123456789abcdef
    WeakTarget 0x00007fd883205df0: 00000001093a4840 0000000c00000004 0123456789abcdef
    WeakTarget 0x00007fd883205df0: 00000001093a4840 0000000c00000008 0123456789abcdef
    WeakTarget 0x00007fd883205df0: 00000001093a4840 0000000c0000000c 0123456789abcdef
    WeakTarget 0x00007fd883205df0: 00000001093a4840 0000000c00000010 0123456789abcdef
    WeakTarget 0x00007fd883205df0: 00000001093a4840 0000000c00000014 0123456789abcdef
    WeakTarget 0x00007fd883205df0: 00000001093a4840 0000000c00000018 0123456789abcdef
    WeakTarget 0x00007fd883205df0: 00000001093a4840 0000000c0000001c 0123456789abcdef
    WeakTarget 0x00007fd883205df0: 00000001093a4840 0000000200000004 0123456789abcdef

We can see that the first number in this header field goes up by 2 with every new weak reference. The second number goes up by 4 with every new strong reference.
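
Given those increments, we can decode a header word by hand. Here's a check of the last dump inside the do block, 0000000c0000001c, splitting it into the weak count field (first half) and strong count field (second half) and stripping the flag bits described in the Swift source below:

    let header: UInt64 = 0x0000000c0000001c
    let weakField = UInt32(truncatingBitPattern: header >> 32) // 0xc
    let strongField = UInt32(truncatingBitPattern: header)     // 0x1c
    print(weakField >> 1)   // 6: an initial count of 1 plus the five weak vars
    print(strongField >> 2) // 7: the original strong reference plus f through k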

To recap, here's what we've seen so far:

  • Weak pointers look like regular pointers in memory.
  • When a weak target's deinit runs, the target is not deallocated, and the weak pointer is not zeroed.
  • When the weak pointer is accessed after the target's deinit runs, it is zeroed on access and the weak target is deallocated.
  • The weak target contains a reference count for weak references, separate from the count of strong references.

Swift Code
Now that Swift is open source, we can actually go relate this observed behavior to the source code.

The Swift standard library represents objects allocated on the heap with a HeapObject type located in stdlib/public/SwiftShims/HeapObject.h. It looks like:

    struct HeapObject {
      /// This is always a valid pointer to a metadata object.
      struct HeapMetadata const *metadata;

      SWIFT_HEAPOBJECT_NON_OBJC_MEMBERS;
      // FIXME: allocate two words of metadata on 32-bit platforms

    #ifdef __cplusplus
      HeapObject() = default;

      // Initialize a HeapObject header as appropriate for a newly-allocated object.
      constexpr HeapObject(HeapMetadata const *newMetadata)
        : metadata(newMetadata)
        , refCount(StrongRefCount::Initialized)
        , weakRefCount(WeakRefCount::Initialized)
      { }
    #endif
    };

The metadata field is the Swift equivalent of the isa field in Objective-C, and in fact it's compatible. Then there are these NON_OBJC_MEMBERS defined in a macro:

    #define SWIFT_HEAPOBJECT_NON_OBJC_MEMBERS       \
      StrongRefCount refCount;                      \
      WeakRefCount weakRefCount

Well, look at that! There are our two reference counts.

(Bonus question: why is the strong count first here, while in the dumps above the weak count was first?)

The reference counts are managed by a bunch of functions located in stdlib/public/runtime/HeapObject.cpp. For example, here's swift_retain:

    void swift::swift_retain(HeapObject *object) {
      SWIFT_RETAIN();
      _swift_retain(object);
    }

    static void _swift_retain_(HeapObject *object) {
      _swift_retain_inlined(object);
    }

    auto swift::_swift_retain = _swift_retain_;

There's a bunch of indirection, but it eventually calls through to this inline function in the header:

    static inline void _swift_retain_inlined(HeapObject *object) {
      if (object) {
        object->refCount.increment();
      }
    }

As you'd expect, it increments the reference count. Here's the implementation of increment:

    void increment() {
      __atomic_fetch_add(&refCount, RC_ONE, __ATOMIC_RELAXED);
    }

RC_ONE comes from an enum:

    enum : uint32_t {
      RC_PINNED_FLAG = 0x1,
      RC_DEALLOCATING_FLAG = 0x2,

      RC_FLAGS_COUNT = 2,
      RC_FLAGS_MASK = 3,
      RC_COUNT_MASK = ~RC_FLAGS_MASK,

      RC_ONE = RC_FLAGS_MASK + 1
    };

We can see why the count went up by 4 with each new strong reference. The first two bits of the field are used for flags. Looking back at the dumps, we can see those flags in action. Here's a weak target before and after the last strong reference went away:

    WeakTarget 0x00007fe17341d270: 000000010faa63e0 0000000400000004 0123456789abcdef
    Weak target deinit
    WeakTarget 0x00007fe17341d270: 000000010faa63e0 0000000200000002 0123456789abcdef

The field went from 4, denoting a reference count of 1 and no flags, to 2, denoting a reference count of zero and RC_DEALLOCATING_FLAG set. This post-deinit object is placed in some sort of DEALLOCATING limbo.

(Incidentally, what is RC_PINNED_FLAG for? I poked through the code base and couldn't figure out anything beyond that it indicates a "pinned object," which is already pretty obvious from the name. If you figure it out or have an informed guess, please post a comment.)

Let's check out the weak reference count's implementation, while we're here. It has the same sort of enum:

    enum : uint32_t {
      // There isn't really a flag here.
      // Making weak RC_ONE == strong RC_ONE saves an
      // instruction in allocation on arm64.
      RC_UNUSED_FLAG = 1,

      RC_FLAGS_COUNT = 1,
      RC_FLAGS_MASK = 1,
      RC_COUNT_MASK = ~RC_FLAGS_MASK,

      RC_ONE = RC_FLAGS_MASK + 1
    };

That's where the 2 comes from: there's space reserved for one flag, which is currently unused. Oddly, the comment in this code appears to be incorrect, as RC_ONE here is equal to 2, whereas the strong RC_ONE is equal to 4. I'd guess they were once equal, and then it was changed and the comment wasn't updated. Just goes to show that comments are useless and you shouldn't ever write them.

How does all of this tie in to loading weak references? That's handled by a function called swift_weakLoadStrong:

    HeapObject *swift::swift_weakLoadStrong(WeakReference *ref) {
      auto object = ref->Value;
      if (object == nullptr) return nullptr;
      if (object->refCount.isDeallocating()) {
        swift_weakRelease(object);
        ref->Value = nullptr;
        return nullptr;
      }
      return swift_tryRetain(object);
    }

From this, it's clear how the lazy zeroing works. When loading a weak reference, if the target is deallocating, zero out the reference. Otherwise, try to retain the target, and return it. Digging a bit further, we can see how swift_weakRelease deallocates the object's memory if it's the last reference:

    void swift::swift_weakRelease(HeapObject *object) {
      if (!object) return;

      if (object->weakRefCount.decrementShouldDeallocate()) {
        // Only class objects can be weak-retained and weak-released.
        auto metadata = object->metadata;
        assert(metadata->isClassObject());
        auto classMetadata = static_cast<const ClassMetadata*>(metadata);
        assert(classMetadata->isTypeMetadata());
        swift_slowDealloc(object, classMetadata->getInstanceSize(),
                          classMetadata->getInstanceAlignMask());
      }
    }

(Note: if you're looking at the code in the repository, the naming has changed to use "unowned" instead of "weak" for most cases. The naming above is current as of the latest snapshot as of the time of this writing, but development moves on. You can view the repository as of the 2.2 snapshot to see it as I have it here, or grab the latest but be aware of the naming changes, and possibly implementation changes.)

Putting it All Together
We've seen it all from top to bottom now. What's the high-level view on how Swift weak references actually work?

  1. Weak references are just pointers to the target object.
  2. Weak references are not individually tracked the way they are in Objective-C.
  3. Instead, each Swift object has a weak reference count next to its strong reference count.
  4. Swift decouples object deinitialization from object deallocation. An object can be deinitialized, freeing its external resources, without deallocating the memory occupied by the object itself.
  5. When a Swift object's strong reference count reaches zero while the weak count is still greater than zero, the object is deinitialized but not deallocated.
  6. This means that weak pointers to a deinitialized object are still valid pointers and can be dereferenced without crashing or loading garbage data. They merely point to an object in a zombie state.
  7. When a weak reference is loaded, the runtime checks the target's state. If the target is a zombie, then it zeroes the weak reference, decrements the weak reference count, and returns nil.
  8. When all weak references to a zombie object are zeroed out, the zombie is deallocated.

This design has some interesting consequences compared to Objective-C's approach:

  • There is no list of weak references maintained anywhere. This simplifies code and improves performance.
  • There is no race condition between zeroing a weak reference on one thread, and loading that weak reference on another thread. This means that loading a weak reference and destroying a weakly-referenced object can be done without acquiring locks. This improves performance.
  • Weak references to an object will cause that object's memory to remain allocated even after there are no strong references to it, until all weak references are either loaded or discarded. This temporarily increases memory usage. Note that the effect is small, because while the target object's memory remains allocated, it's only the memory for the instance itself. All external resources (including storage for Array or Dictionary properties) are freed when the last strong reference goes away. A weak reference can cause a single instance to stay allocated, but not a whole tree of objects.
  • Extra memory is required to store the weak reference count on every object. In practice it appears that this is inconsequential on 64-bit. The header fields want to occupy a whole number of pointer-sized chunks, and the strong and weak reference counts share one. If the weak reference count weren't there, the strong reference count would just occupy all 64 bits by itself. It's possible that the strong reference could otherwise be moved into the isa by using a non-pointer isa, but I'm not sure how important that is or how it's going to shake out in the long term. For 32-bit, it looks like the weak count increases object sizes by four bytes. The importance of 32-bit is diminishing by the day, however.
  • Because accessing a weak pointer is so cheap, the same mechanism can be used to implement reliable semantics for unowned. Under the hood, unowned works exactly like weak, except that it fails loudly if the target went away rather than returning nil. In Objective-C, __unsafe_unretained is implemented as a raw pointer with undefined behavior if you access it late because it's supposed to be fast, and loading a weak pointer is somewhat slow.

Conclusion
Swift's weak pointers use an interesting approach that provides correctness, speed, and low memory overhead. By tracking a weak reference count for each object and decoupling object deinitialization from object deallocation, weak references can be resolved both safely and quickly. The availability of the source code for the standard library lets us see exactly what's going on at the source level, instead of groveling through disassemblies and memory dumps as we often do. Of course, as you can see above, it's hard to break that habit fully.

That's it for today. Come back next time for more goodies. That might be a few weeks, as the holidays intervene, but I'm going to shoot for one shortish article before that happens. In any case, keep your suggestions for topics coming in. Friday Q&A is driven by reader ideas, so if you have one you'd like to see covered, let me know!

Friday Q&A 2015-12-25: Swifty Target/Action

Cocoa's target/action system for responding to controls is a great system for Objective-C, but is a bit unnatural to use in Swift. Today, I'm going to explore building a wrapper that allows using a Swift function as the action.

Overview
The target/action system is great for things like menu items which might command many different objects depending on context. For example, the Paste menu item connects to whatever object in the responder chain happens to implement a paste: method at the time.

It's less great when you're setting the target and action in code and there's only ever one target object, as is commonly the case for buttons, text fields, and other such controls. It ends up being an exercise in stringly typed code and tends to be a bit error-prone. It also forces the action to be implemented separately, even when it's simple and would naturally fit inline.

Implementing a pure Swift control that accepted a function as its action would be simple, but for real code we still need to deal with Cocoa. The goal is to adapt NSControl to allow setting a function as the target, while still coexisting with the target/action system.

NSControl doesn't have any good hooks to intercept the sending of the action, so instead we'll just co-opt the existing target/action mechanism. This requires an adapter object to act as the target. When it receives the action, it will then invoke the function passed to it.

First Attempt
Let's get started with some code. Here is an adapter object that holds a function and calls that function when its action method is called:

    class ActionTrampoline: NSObject {
        var action: NSControl -> Void

        init(action: NSControl -> Void) {
            self.action = action
        }

        @objc func action(sender: NSControl) {
            action(sender)
        }
    }

Here's an addition to NSControl that wraps the creation of the trampoline and setting it as the target:

    extension NSControl {
        @nonobjc func setAction(action: NSControl -> Void) {
            let trampoline = ActionTrampoline(action: action)
            self.target = trampoline
            self.action = "action:"
        }
    }

(The @nonobjc annotation allows this to coexist with the Objective-C action property on NSControl. Without it, this method would need a different name.)

Let's try it out:

    let button = NSButton()
    button.setAction({ print("Action from \($0)") })
    button.sendAction(button.action, to: button.target)

Oops, nothing happens.

Extending the Trampoline's Lifetime
This first attempt doesn't work, because target is a weak property. There are no strong references to trampoline to keep it alive, so it's deallocated immediately. Then the call to sendAction sends the action to nil, which does nothing.

We need to extend the trampoline's lifetime. We could return it to the caller and require them to keep it around somewhere, but that would be inconvenient. A better way is to tie the trampoline's lifetime to that of the control. We can accomplish this using associated objects.

We start by defining a key for use with the associated objects API. This is a little less convenient than in Objective-C, because Swift is not so friendly about taking the address of variables, and in fact doesn't guarantee that the pointer produced by using & on a variable will be consistent. Instead of trying to use the address of a global variable, this code just allocates a bit of memory and uses that address:

    let NSControlActionFunctionAssociatedObjectKey = UnsafeMutablePointer<Int8>.alloc(1)

The NSControl extension then uses objc_setAssociatedObject to attach the trampoline to the control. Although the value is never retrieved, simply setting it ensures that it is kept alive as long as the control is alive:

    extension NSControl {
        @nonobjc func setAction(action: NSControl -> Void) {
            let trampoline = ActionTrampoline(action: action)
            self.target = trampoline
            self.action = "action:"

            objc_setAssociatedObject(self, NSControlActionFunctionAssociatedObjectKey, trampoline, .OBJC_ASSOCIATION_RETAIN)
        }
    }

Let's try it again:

    let button = NSButton()
    button.setAction({ print("Action from \($0)") })
    button.sendAction(button.action, to: button.target)

This time it works!

    Action from <NSButton: 0x7fe1019124d0>

Making it Generic
This first version works fine, but the types aren't quite right. The parameter to the function is always NSControl. That means that while the above test code works, this does not:

    button.setAction({ print("Action from \($0.title)") })

This will fail to compile, because NSControl doesn't have a title property. We know that the parameter is actually an NSButton, but the compiler doesn't know that. In Objective-C, we simply declare the proper type in the method and the compiler has to trust us. In Swift, we have to educate the compiler about the types.

We could make setAction generic, something like:

    func setAction<T>(action: T -> Void) { ...

But this requires an explicit type on the function passed in, so $0 would no longer work. You'd have to write something like:

    button.setAction({ (sender: NSButton) in ...

It would be a lot better to have type inference work for us.

Swift's Self type exists for this purpose. Self denotes the actual type of self, like instancetype does for Objective-C. For example:

    extension NSControl {
        func frobnitz() -> Self {
            Swift.print("Frobbing \(self)")
            return self
        }
    }

    button.frobnitz().title = "hello"

Let's use this in the NSControl extension to make it generic:

    extension NSControl {
        @nonobjc func setAction(action: Self -> Void) {

Oops, we can't:

    error: 'Self' is only available in a protocol or as the result of a method in a class; did you mean 'NSControl'?

Fortunately, the error message suggests a way forward: put the method in a protocol. Start with an empty protocol:

    protocol NSControlActionFunctionProtocol {}

Let's change the name of the associated object key to fit its new home, while we're at it:

    let NSControlActionFunctionProtocolAssociatedObjectKey = UnsafeMutablePointer<Int8>.alloc(1)

We'll need a generic version of ActionTrampoline. This is much like the original version, but the implementation of action requires a forced cast, since @objc methods aren't allowed to refer to a generic type:

    class ActionTrampoline<T>: NSObject {
        var action: T -> Void

        init(action: T -> Void) {
            self.action = action
        }

        @objc func action(sender: NSControl) {
            action(sender as! T)
        }
    }

The method implementation is basically the same as before, just with Self instead of NSControl. Constraining the extension to Self: NSControl lets us use all NSControl methods and properties on self, like target and action:

    extension NSControlActionFunctionProtocol where Self: NSControl {
        func setAction(action: Self -> Void) {
            let trampoline = ActionTrampoline(action: action)
            self.target = trampoline
            self.action = "action:"
            objc_setAssociatedObject(self, NSControlActionFunctionProtocolAssociatedObjectKey, trampoline, .OBJC_ASSOCIATION_RETAIN)
        }
    }

Finally, we need to make NSControl conform to this protocol in an extension. Since the protocol itself is empty, the extension can be empty too:

    extension NSControl: NSControlActionFunctionProtocol {}

Let's try it out!

    let button = NSButton()
    button.setAction({ (button: NSButton) in
        print("Action from \(button.title)")
    })
    button.sendAction(button.action, to: button.target)

This prints:

    Action from Button

Success!

UIKit Version
Adapting this code for use in UIKit is easy. UIControl can have multiple targets for a variety of different events, so we just need to allow passing in the events as a parameter, and use addTarget to add the trampoline:

    class ActionTrampoline<T>: NSObject {
        var action: T -> Void

        init(action: T -> Void) {
            self.action = action
        }

        @objc func action(sender: UIControl) {
            print(sender)
            action(sender as! T)
        }
    }

    let NSControlActionFunctionProtocolAssociatedObjectKey = UnsafeMutablePointer<Int8>.alloc(1)

    protocol NSControlActionFunctionProtocol {}
    extension NSControlActionFunctionProtocol where Self: UIControl {
        func addAction(events: UIControlEvents, _ action: Self -> Void) {
            let trampoline = ActionTrampoline(action: action)
            self.addTarget(trampoline, action: "action:", forControlEvents: events)
            objc_setAssociatedObject(self, NSControlActionFunctionProtocolAssociatedObjectKey, trampoline, .OBJC_ASSOCIATION_RETAIN)
        }
    }
    extension UIControl: NSControlActionFunctionProtocol {}

Testing it:

    let button = UIButton()
    button.addAction([.TouchUpInside], { (button: UIButton) in
        print("Action from \(button.titleLabel?.text)")
    })
    button.sendActionsForControlEvents([.TouchUpInside])

    Action from nil

Apparently UIButton doesn't set a title by default like NSButton does. But, success!
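
For a more satisfying result, you can give the button a title first using the standard setTitle(_:forState:) API (this part is incidental to the trampoline itself):

    button.setTitle("Button", forState: .Normal)

With that in place, the test should print something like:

    Action from Optional("Button")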

A Note on Retain Cycles
It's easy to make retain cycles using this call. For example:

    button.addAction({ _ in
        self.doSomething()
    })

If you hold a strong reference to button (note that even if you declare button as weak, you'll indirectly hold a strong reference to it if you hold a strong reference to a view that contains it, or its window) then this will create a cycle that will cause your object to leak.

As is usually the case with cycles, the answer is to capture self either as weak or unowned:

    button.addAction({ [weak self] _ in
        self?.doSomething()
    })

Optional chaining keeps the body easy to read. Or, if you're sure that the action will never, ever be invoked after self is destroyed, use [unowned self] instead to get a reference which doesn't need to be nil-checked, since it will fail loudly if self is ever destroyed early.
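
Using the UIKit variant from above, and assuming this code lives in some class with a doSomething method, that looks like:

    button.addAction([.TouchUpInside], { [unowned self] _ in
        self.doSomething() // traps, rather than silently doing nothing, if self is gone
    })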

Conclusion
Making a Swift-y adapter for Cocoa target/actions is fairly straightforward. Memory management means we have to work a little to keep the trampoline object alive, but associated objects solve that problem. Making a method that has a parameter that's generic on the type of self requires jumping through some hoops, but a protocol extension makes it possible.

That's it for today. I'll be back with more goodies in the new year. Friday Q&A is driven by reader ideas, so if you have any topics you'd like to see covered in 2016 or beyond, please send them in!

Friday Q&A 2016-01-29: Swift Struct Storage

Swift's classes tend to be straightforward for most people new to the language to understand. They work pretty much like classes in any other language. Whether you've come from Objective-C or Java or Ruby, you've worked with something similar. Swift's structs are another matter. They look sort of like classes, but they're value types, and they don't do inheritance, and there's this copy-on-write thing I keep hearing about? Where do they live, anyway, and how do they work? Today, I'm going to take a close look at just how structs get stored and manipulated in memory.

Simple structs
To explore how structs get stored in memory, I built a test program consisting of two files. I compiled the test program with optimizations enabled but without whole-module optimization. By building tests that make calls from one file to the other, I was able to prevent the compiler from inlining everything, providing a clearer picture of where everything gets stored and how the data is passed between functions.

To start out, I created a simple struct with three elements:

    struct ExampleInts {
        var x: Int
        var y: Int
        var z: Int
    }

I created three functions that take an instance of this struct and return one of the fields:

    func getX(parameter: ExampleInts) -> Int {
        return parameter.x
    }

    func getY(parameter: ExampleInts) -> Int {
        return parameter.y
    }

    func getZ(parameter: ExampleInts) -> Int {
        return parameter.z
    }

In the other file, I created an instance of the struct and called each get function:

    func testGets() {
        let s = ExampleInts(x: 1, y: 2, z: 3)
        getX(s)
        getY(s)
        getZ(s)
    }

The compiler generates this code for getX:

    pushq   %rbp
    movq    %rsp, %rbp

    movq    %rdi, %rax

    popq    %rbp
    retq

Consulting our cheat sheet, we recall that arguments are passed sequentially in registers rdi, rsi, rdx, rcx, r8, and r9, and return values are placed in rax. The first two instructions here are just the function prologue, and the last two are the epilogue. The real work being done here is the movq %rdi, %rax, which takes the first parameter and returns it. Let's look at getY:

    pushq   %rbp
    movq    %rsp, %rbp

    movq    %rsi, %rax

    popq    %rbp
    retq

This is almost the same, but it returns the second parameter. How about getZ?

    pushq   %rbp
    movq    %rsp, %rbp

    movq    %rdx, %rax

    popq    %rbp
    retq

Again, almost the same, but it returns the third parameter. From this we can see that the individual struct elements are treated as separate parameters and passed to the functions individually. Picking out an element on the receiving end is a simple matter of picking the right register.
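
In source-level terms, it's as if getX had been declared with each field as its own parameter. This is just a conceptual sketch, not anything the compiler literally produces:

    // Conceptually equivalent to getX: x, y, and z arrive in rdi, rsi,
    // and rdx respectively, and x is moved into rax for the return.
    func getXExpanded(x: Int, _ y: Int, _ z: Int) -> Int {
        return x
    }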

Let's confirm this on the calling end. Here's the generated code for testGets:

    pushq   %rbp
    movq    %rsp, %rbp

    movl    $1, %edi
    movl    $2, %esi
    movl    $3, %edx
    callq   __TF4main4getXFVS_11ExampleIntsSi

    movl    $1, %edi
    movl    $2, %esi
    movl    $3, %edx
    callq   __TF4main4getYFVS_11ExampleIntsSi

    movl    $1, %edi
    movl    $2, %esi
    movl    $3, %edx
    popq    %rbp
    jmp __TF4main4getZFVS_11ExampleIntsSi

We can see it constructing the struct instance directly in the parameter registers. (The edi, esi, and edx registers refer to the lower 32 bits of the rdi, rsi, and rdx registers, respectively.) It doesn't even bother trying to save the values across the calls, but just rebuilds the struct instance each time. Since the compiler knows exactly what the contents are, it can deviate significantly from how the Swift code is written. Note how the call to getZ is generated a bit differently from the other two. Since it's the last thing in the function, the compiler generates it as a tail call, cleaning up the local call frame and setting up getZ to return directly to the function that called testGets.

Let's see what sort of code the compiler generates when it doesn't know the struct contents. Here's a variant on this test function which gets the struct instance from elsewhere:

    func testGets2() {
        let s = getExampleInts()
        getX(s)
        getY(s)
        getZ(s)
    }

getExampleInts just creates the struct instance and returns it, but it's in the other file so the compiler can't see what's going on when optimizing testGets2. Here's that function:

    func getExampleInts() -> ExampleInts {
        return ExampleInts(x: 1, y: 2, z: 3)
    }

What sort of code does testGets2 generate, now that the compiler can't know what the struct contains? Here it is:

    pushq   %rbp
    movq    %rsp, %rbp

    pushq   %r15
    pushq   %r14
    pushq   %rbx
    pushq   %rax

    callq   __TF4main14getExampleIntsFT_VS_11ExampleInts
    movq    %rax, %rbx
    movq    %rdx, %r14
    movq    %rcx, %r15

    movq    %rbx, %rdi
    movq    %r14, %rsi
    movq    %r15, %rdx
    callq   __TF4main4getXFVS_11ExampleIntsSi

    movq    %rbx, %rdi
    movq    %r14, %rsi
    movq    %r15, %rdx
    callq   __TF4main4getYFVS_11ExampleIntsSi

    movq    %rbx, %rdi
    movq    %r14, %rsi
    movq    %r15, %rdx

    addq    $8, %rsp
    popq    %rbx
    popq    %r14
    popq    %r15
    popq    %rbp
    jmp __TF4main4getZFVS_11ExampleIntsSi

Since the compiler can't reconstitute the values at each step, it has to save them. It places the three struct elements into the registers rbx, r14, and r15, then loads the parameter registers from those registers at each call. Those are callee-saved registers, which means that their values are preserved across the calls. And again, the compiler generates a tail call for getZ, with some more extensive cleanup beforehand.

At the top of the function, it calls getExampleInts and then loads values from rax, rdx, and rcx. Apparently the struct values are returned in those registers. Let's look at getExampleInts to confirm:

    pushq   %rbp
    movl    $1, %edi
    movl    $2, %esi
    movl    $3, %edx
    popq    %rbp
    jmp __TFV4main11ExampleIntsCfMS0_FT1xSi1ySi1zSi_S0_

This places the values 1, 2, and 3 into the argument registers, then calls the struct's constructor. Here's the generated code for that constructor:

    pushq   %rbp
    movq    %rsp, %rbp

    movq    %rdx, %rcx
    movq    %rdi, %rax
    movq    %rsi, %rdx

    popq    %rbp
    retq

Sure enough, it returns the three values in rax, rdx, and rcx. The cheat sheet says nothing about returning multiple values in multiple registers. How about the official PDF? It does say that two values can be returned in rax and rdx, but there's no mention of returning a third value in rcx. That's clearly what's happening, though. That's the fun of a new language: it doesn't always have to play by the old rules. If it were interoperating with C code, it would have to follow the standard conventions, but Swift-to-Swift calls can invent new ones.
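
One way to think about this is that returning a small struct works like returning a tuple, with each element assigned its own register. Again, a conceptual sketch rather than anything the compiler emits:

    // Conceptually what getExampleInts returns: three values at once,
    // which come back in rax, rdx, and rcx.
    func getExampleIntsExpanded() -> (Int, Int, Int) {
        return (1, 2, 3)
    }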

How about inout parameters? If they work like we'd do it in C, we'd expect the struct to be laid out in memory and a pointer passed in. Here are two test functions (in two different files, of course):

    func testInout() {
        var s = getExampleInts()
        totalInout(&s)
    }

    func totalInout(inout parameter: ExampleInts) -> Int {
        return parameter.x + parameter.y + parameter.z
    }

Here's the generated code for testInout:

    pushq   %rbp
    movq    %rsp, %rbp
    subq    $32, %rsp

    callq   __TF4main14getExampleIntsFT_VS_11ExampleInts

    movq    %rax, -24(%rbp)
    movq    %rdx, -16(%rbp)
    movq    %rcx, -8(%rbp)
    leaq    -24(%rbp), %rdi
    callq   __TF4main10totalInoutFRVS_11ExampleIntsSi

    addq    $32, %rsp
    popq    %rbp
    retq

In the prologue, it creates a 32-byte stack frame. It then calls getExampleInts, and after the call saves the resulting values into stack slots at offsets -24, -16, and -8. It then calculates a pointer to offset -24, loads that into the rdi parameter register, and calls totalInout. Here's the generated code for that function:

    pushq   %rbp
    movq    %rsp, %rbp
    movq    (%rdi), %rax
    addq    8(%rdi), %rax
    jo  LBB4_3
    addq    16(%rdi), %rax
    jo  LBB4_3
    popq    %rbp
    retq
    LBB4_3:
    ud2

This loads the values by offset from the parameter that's passed in, totaling them up and returning the result in rax. The jo instructions are checking for overflow. If either of the addq instructions produces an overflow, the corresponding jo instruction will jump down to the ud2 instruction, which terminates the program.

We can see that it's exactly as we expected: when passing the struct to an inout parameter, the struct is laid out contiguously in memory and then a pointer to it is passed in.
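
You can make the same shape explicit in source using Swift's pointer APIs. Reusing ExampleInts and getExampleInts from above, this sketch does by hand roughly what the compiler does for the inout call:

    // Hand-written analogue of totalInout: the caller materializes the
    // struct in memory and passes its address.
    func totalPointer(parameter: UnsafeMutablePointer<ExampleInts>) -> Int {
        let s = parameter.memory
        return s.x + s.y + s.z
    }

    var s = getExampleInts()
    withUnsafeMutablePointer(&s) { totalPointer($0) }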

Big structs
What happens if we're dealing with a larger struct, bigger than fits comfortably in registers? Here's a test struct with ten elements:

    struct TenInts {
        var elements = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    }

Here's a get function that constructs an instance and returns it. This is placed in a separate file to avoid inlining:

    func getHuge() -> TenInts {
        return TenInts()
    }

Here's a function that gets an element out of this struct:

    func getHugeElement(parameter: TenInts) -> Int {
        return parameter.elements.5
    }

Finally, a test function that exercises these:

    func testHuge() {
        let s = getHuge()
        getHugeElement(s)
    }

Let's look at the generated code, starting with testHuge:

    pushq   %rbp
    movq    %rsp, %rbp
    subq    $160, %rsp

    leaq    -80(%rbp), %rdi
    callq   __TF4main7getHugeFT_VS_7TenInts

    movups  -80(%rbp), %xmm0
    movups  -64(%rbp), %xmm1
    movups  -48(%rbp), %xmm2
    movups  -32(%rbp), %xmm3
    movups  -16(%rbp), %xmm4
    movups  %xmm0, -160(%rbp)
    movups  %xmm1, -144(%rbp)
    movups  %xmm2, -128(%rbp)
    movups  %xmm3, -112(%rbp)
    movups  %xmm4, -96(%rbp)

    leaq    -160(%rbp), %rdi
    callq   __TF4main14getHugeElementFVS_7TenIntsSi

    addq    $160, %rsp
    popq    %rbp
    retq

This code (excluding the function prologue and epilogue) can be broken into three pieces.

The first piece calculates the address for offset -80 relative to the stack frame, and calls getHuge, passing that address as a parameter. The getHuge function has no parameters in the source code, but it's common to use a hidden parameter to return larger structs. The caller allocates storage for the return value, then passes a pointer to that storage in the hidden parameter. That appears to be what's going on here, with that allocated storage residing on the stack.

The second piece copies the returned struct from stack offset -80 to stack offset -160. It loads pieces of the struct sixteen bytes at a time into five xmm registers, then places the contents of those registers back on the stack starting at offset -160. I'm not clear why the compiler generates this copy rather than using the original value in place. I suspect the optimizer just isn't quite smart enough to realize that it doesn't need the copy.

The third piece calculates the address for stack offset -160 and then calls getHugeElement passing that address as a parameter. In our previous experiment with a three-element struct, it was passed by value in registers. With this larger struct, it's passed by pointer instead.
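
Written out in source-level terms, the convention amounts to something like this sketch. You never write this yourself; it's just what the generated code is doing:

    // The hidden parameter made explicit: the caller supplies storage
    // for the result, and the callee fills it in.
    func getHugeExpanded(result: UnsafeMutablePointer<TenInts>) {
        result.memory = TenInts()
    }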

The generated code for the other functions confirms this: the struct is passed in and out by pointer, and lives on the stack. Here's getHugeElement to start with:

    pushq   %rbp
    movq    %rsp, %rbp
    movq    40(%rdi), %rax
    popq    %rbp
    retq

This loads offset 40 from the parameter passed in. Each element is eight bytes, so offset 40 corresponds to elements.5. The function then returns this value.

Here's getHuge:

    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %rbx
    subq    $88, %rsp

    movq    %rdi, %rbx
    leaq    -88(%rbp), %rdi
    callq   __TFV4main7TenIntsCfMS0_FT_S0_

    movups  -88(%rbp), %xmm0
    movups  -72(%rbp), %xmm1
    movups  -56(%rbp), %xmm2
    movups  -40(%rbp), %xmm3
    movups  -24(%rbp), %xmm4
    movups  %xmm0, (%rbx)
    movups  %xmm1, 16(%rbx)
    movups  %xmm2, 32(%rbx)
    movups  %xmm3, 48(%rbx)
    movups  %xmm4, 64(%rbx)
    movq    %rbx, %rax

    addq    $88, %rsp
    popq    %rbx
    popq    %rbp
    retq

This looks a lot like testHuge above: it allocates stack space, calls a function, in this case, the TenInts constructor function, then copies the return value to its final location. Here, that final location is the pointer passed in as the implicit parameter.

While we're here, let's take a look at the TenInts constructor:

    pushq   %rbp
    movq    %rsp, %rbp

    movq    $1, (%rdi)
    movq    $2, 8(%rdi)
    movq    $3, 16(%rdi)
    movq    $4, 24(%rdi)
    movq    $5, 32(%rdi)
    movq    $6, 40(%rdi)
    movq    $7, 48(%rdi)
    movq    $8, 56(%rdi)
    movq    $9, 64(%rdi)
    movq    $10, 72(%rdi)
    movq    %rdi, %rax

    popq    %rbp
    retq

Like the other functions, this takes a pointer to memory for the new struct as an implicit parameter. It then stores the values 1 through 10 into that memory and returns.

I came across an interesting case while building out these test cases. Here's a test function which makes three calls to getHugeElement instead of just one:

    func testThreeHuge() {
        let s = getHuge()
        getHugeElement(s)
        getHugeElement(s)
        getHugeElement(s)
    }

Here's the generated code:

    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %r15
    pushq   %r14
    pushq   %r13
    pushq   %r12
    pushq   %rbx
    subq    $392, %rsp

    leaq    -120(%rbp), %rdi
    callq   __TF4main7getHugeFT_VS_7TenInts
    movq    -120(%rbp), %rbx
    movq    %rbx, -376(%rbp)
    movq    -112(%rbp), %r8
    movq    %r8, -384(%rbp)
    movq    -104(%rbp), %r9
    movq    %r9, -392(%rbp)
    movq    -96(%rbp), %r10
    movq    %r10, -400(%rbp)
    movq    -88(%rbp), %r11
    movq    %r11, -368(%rbp)
    movq    -80(%rbp), %rax
    movq    -72(%rbp), %rcx
    movq    %rcx, -408(%rbp)
    movq    -64(%rbp), %rdx
    movq    %rdx, -416(%rbp)
    movq    -56(%rbp), %rsi
    movq    %rsi, -424(%rbp)
    movq    -48(%rbp), %rdi

    movq    %rdi, -432(%rbp)
    movq    %rbx, -200(%rbp)
    movq    %rbx, %r14
    movq    %r8, -192(%rbp)
    movq    %r8, %r15
    movq    %r9, -184(%rbp)
    movq    %r9, %r12
    movq    %r10, -176(%rbp)
    movq    %r10, %r13
    movq    %r11, -168(%rbp)
    movq    %rax, -160(%rbp)
    movq    %rax, %rbx
    movq    %rcx, -152(%rbp)
    movq    %rdx, -144(%rbp)
    movq    %rsi, -136(%rbp)
    movq    %rdi, -128(%rbp)
    leaq    -200(%rbp), %rdi
    callq   __TF4main14getHugeElementFVS_7TenIntsSi

    movq    %r14, -280(%rbp)
    movq    %r15, -272(%rbp)
    movq    %r12, -264(%rbp)
    movq    %r13, -256(%rbp)
    movq    -368(%rbp), %rax
    movq    %rax, -248(%rbp)
    movq    %rbx, -240(%rbp)
    movq    -408(%rbp), %r14
    movq    %r14, -232(%rbp)
    movq    -416(%rbp), %r15
    movq    %r15, -224(%rbp)
    movq    -424(%rbp), %r12
    movq    %r12, -216(%rbp)
    movq    -432(%rbp), %r13
    movq    %r13, -208(%rbp)
    leaq    -280(%rbp), %rdi
    callq   __TF4main14getHugeElementFVS_7TenIntsSi

    movq    -376(%rbp), %rax
    movq    %rax, -360(%rbp)
    movq    -384(%rbp), %rax
    movq    %rax, -352(%rbp)
    movq    -392(%rbp), %rax
    movq    %rax, -344(%rbp)
    movq    -400(%rbp), %rax
    movq    %rax, -336(%rbp)
    movq    -368(%rbp), %rax
    movq    %rax, -328(%rbp)
    movq    %rbx, -320(%rbp)
    movq    %r14, -312(%rbp)
    movq    %r15, -304(%rbp)
    movq    %r12, -296(%rbp)
    movq    %r13, -288(%rbp)
    leaq    -360(%rbp), %rdi
    callq   __TF4main14getHugeElementFVS_7TenIntsSi

    addq    $392, %rsp
    popq    %rbx
    popq    %r12
    popq    %r13
    popq    %r14
    popq    %r15
    popq    %rbp
    retq

The structure of this function is similar to the previous version. It calls getHuge, copies the result, then calls getHugeElement three times. For each call, it copies the struct again, presumably to guard against getHugeElement making modifications. What I found really interesting is that the copies are all done one element at a time using integer registers, rather than two elements at a time in xmm registers as testHuge did. I'm not sure what causes the compiler to choose the integer registers here, as it seems like copying two elements at a time with the xmm registers would be more efficient and result in smaller code.

I also experimented with really large structs:

    struct HundredInts {
        var elements = (TenInts(), TenInts(), TenInts(), TenInts(), TenInts(), TenInts(), TenInts(), TenInts(), TenInts(), TenInts())
    }

    struct ThousandInts {
        var elements = (HundredInts(), HundredInts(), HundredInts(), HundredInts(), HundredInts(), HundredInts(), HundredInts(), HundredInts(), HundredInts(), HundredInts())
    }

    func getThousandInts() -> ThousandInts {
        return ThousandInts()
    }

The generated code for getThousandInts is pretty crazy:

    pushq   %rbp
    pushq   %rbx
    subq    $8008, %rsp

    movq    %rdi, %rbx
    leaq    -8008(%rbp), %rdi
    callq   __TFV4main12ThousandIntsCfMS0_FT_S0_
    movq    -8008(%rbp), %rax
    movq    %rax, (%rbx)
    movq    -8000(%rbp), %rax
    movq    %rax, 8(%rbx)
    movq    -7992(%rbp), %rax
    movq    %rax, 16(%rbx)
    movq    -7984(%rbp), %rax
    movq    %rax, 24(%rbx)
    movq    -7976(%rbp), %rax
    movq    %rax, 32(%rbx)
    movq    -7968(%rbp), %rax
    movq    %rax, 40(%rbx)
    movq    -7960(%rbp), %rax
    movq    %rax, 48(%rbx)
    movq    -7952(%rbp), %rax
    movq    %rax, 56(%rbx)
    movq    -7944(%rbp), %rax
    movq    %rax, 64(%rbx)
    movq    -7936(%rbp), %rax
    movq    %rax, 72(%rbx)
    ...
    movq    -104(%rbp), %rax
    movq    %rax, 7904(%rbx)
    movq    -96(%rbp), %rax
    movq    %rax, 7912(%rbx)
    movq    -88(%rbp), %rax
    movups  -80(%rbp), %xmm0
    movups  -64(%rbp), %xmm1
    movups  -48(%rbp), %xmm2
    movups  -32(%rbp), %xmm3
    movq    %rax, 7920(%rbx)
    movq    -16(%rbp), %rax
    movups  %xmm0, 7928(%rbx)
    movups  %xmm1, 7944(%rbx)
    movups  %xmm2, 7960(%rbx)
    movups  %xmm3, 7976(%rbx)
    movq    %rax, 7992(%rbx)
    movq    %rbx, %rax

    addq    $8008, %rsp
    popq    %rbx
    popq    %rbp
    retq

The compiler generates two thousand instructions to copy this struct. This seems like a good place to emit a call to memcpy, but I imagine that optimizing for absurdly gigantic structs isn't a high priority for the compiler team right now.

Class Fields
Let's take a look at what happens when the struct fields are more complicated than simple integers. Here's a simple class, and a struct which contains one:

    class ExampleClass {}
    struct ContainsClass {
        var x: Int
        var y: ExampleClass
        var z: Int
    }

Here's a set of functions (split across two files to defeat inlining) which exercise them:

    func testContainsClass() {
        let s = ContainsClass(x: 1, y: getExampleClass(), z: 3)
        getClassX(s)
        getClassY(s)
        getClassZ(s)
    }

    func getExampleClass() -> ExampleClass {
        return ExampleClass()
    }

    func getClassX(parameter: ContainsClass) -> Int {
        return parameter.x
    }

    func getClassY(parameter: ContainsClass) -> ExampleClass {
        return parameter.y
    }

    func getClassZ(parameter: ContainsClass) -> Int {
        return parameter.z
    }

Let's start by looking at the generated code for the getters. Here's getClassX:

    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %rbx
    pushq   %rax

    movq    %rdi, %rbx
    movq    %rsi, %rdi
    callq   _swift_release
    movq    %rbx, %rax

    addq    $8, %rsp
    popq    %rbx
    popq    %rbp
    retq

The three struct elements will be passed in the first three parameter registers, rdi, rsi, and rdx. This function wants to return the value in rdi by moving it to rax and then returning, but it has to do some bookkeeping first. It appears that the object reference passed in rsi is passed retained, and must be released before the function returns. This code moves rdi into a safe temporary register, rbx, then moves the object reference to rdi and calls swift_release to release it. It then moves the value in rbx to the return register rax and returns from the function.

The code for getClassZ is pretty much the same, except instead of taking the value from rdi, it takes it from rdx:

    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %rbx
    pushq   %rax

    movq    %rdx, %rbx
    movq    %rsi, %rdi
    callq   _swift_release
    movq    %rbx, %rax

    addq    $8, %rsp
    popq    %rbx
    popq    %rbp
    retq

The code for getClassY will be the odd one, since it returns an object reference rather than an integer. Here it is:

    pushq   %rbp
    movq    %rsp, %rbp
    movq    %rsi, %rax
    popq    %rbp
    retq

This is short! It moves the value from rsi, which is the object reference, into rax and returns it. There's no bookkeeping, just a quick shuffling of data. Apparently, the value is passed in retained, but also returned retained, so this code doesn't have to do any memory management at all.

So far we've seen that the code for dealing with this struct is much like the code for dealing with the struct containing three Int fields, except that the object reference field is passed in retained and must be released by the callee. With that in mind, let's look at the generated code for testContainsClass:

    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %r14
    pushq   %rbx

    callq   __TF4main15getExampleClassFT_CS_12ExampleClass
    movq    %rax, %rbx

    movq    %rbx, %rdi
    callq   _swift_retain
    movq    %rax, %r14
    movl    $1, %edi
    movl    $3, %edx
    movq    %rbx, %rsi
    callq   __TF4main9getClassXFVS_13ContainsClassSi

    movq    %r14, %rdi
    callq   _swift_retain
    movl    $1, %edi
    movl    $3, %edx
    movq    %rbx, %rsi
    callq   __TF4main9getClassYFVS_13ContainsClassCS_12ExampleClass
    movq    %rax, %rdi
    callq   _swift_release

    movl    $1, %edi
    movl    $3, %edx
    movq    %rbx, %rsi

    popq    %rbx
    popq    %r14
    popq    %rbp
    jmp __TF4main9getClassZFVS_13ContainsClassSi

The first thing this function does is call getExampleClass to get the ExampleClass instance it stores in the struct. It takes the returned reference and moves it to rbx for safekeeping.

Next, it calls getClassX, and to do so it has to build a copy of the struct in the parameter registers. The two integer fields are easy, but the object field needs to be retained to match what the functions expect. The code calls swift_retain on the value stored in rbx, then places that value in rsi and places 1 and 3 in rdi and rdx to build the complete struct. Finally, it calls getClassX.

The code to call getClassY is nearly the same. However, getClassY returns an object reference which needs to be released. After the call, this code moves the return value into rdi and calls swift_release to take care of its required memory management.

This function calls getClassZ as a tail call, so the code here is a bit different. The object reference came retained from getExampleClass, so it doesn't need to be retained separately for this final call. This code places it into rsi, places 1 and 3 into rdi and rdx again, then cleans up the stack and jumps to getClassZ to make the final call.

Ultimately, there's little change from a struct with all Ints. The only real difference is that copying a struct with an object in it requires retaining that object, and disposing of that struct requires releasing the object.

Conclusion
struct storage in Swift is ultimately pretty straightforward, and much of what we've seen carries over from C's much simpler structs. A struct instance is largely treated as a loose collection of independent values, which can be manipulated collectively when required. Local struct variables might be stored on the stack or the individual pieces might be stored in registers, depending on the size of the struct, the register usage of the rest of the code, and the whims of the compiler. Small structs are passed and returned in registers, while larger structs are passed and returned by reference. structs get copied whenever they're passed and returned. Although you can use structs to implement copy-on-write data types, the base language construct is copied eagerly and more or less blindly.

That's it for today. Come back next time for more daring feats of programming. Friday Q&A is driven by reader ideas, so if you grow bored while waiting for the next installment and have something you'd like to see discussed, send it in!

Friday Q&A 2016-02-19: What Is the Secure Enclave?

The big tech news this week is that the FBI is trying to force Apple to unlock a suspect's iPhone. One of the interesting points around this story is that the iPhone in question is an older one, an iPhone 5c. Newer phones have what Apple calls the Secure Enclave, which some say protects against requests of this nature; even if Apple wanted to break into your phone, they couldn't. Which then brings up an interesting question I've seen a lot of people asking: what exactly is the Secure Enclave, and what role does it play here?

A quick note before I get started: my usual approach to writing articles is to dig all the way down to the bits and bytes and then discuss what's there. This is necessarily somewhat different. By its very nature, the Secure Enclave is inaccessible to mere mortals like myself. Instead, most of the knowledge here comes from the information in Apple's iOS Security Guide, plus some general theory. The intent is to extract the relevant bits from that guide, explain them, and think through the implications. This article must assume Apple's information is accurate, since there's no practical way to check from the outside. The result will only be as good as the product of the guide's accuracy and my own understanding, so reader beware.

Also, this is intended to examine the technical aspects of this case and the Secure Enclave technology. No opinions on the merits of the FBI's request or Apple's response or any other political matters are stated or implied. If you want to discuss the political aspects of this case, there are many other places where you can do so.

With that out of the way, let's get started.

Review
The iPhone in question is protected by a passcode, which isn't stored anywhere on the device. The only way to get in is to brute-force the passcode. The computation needed to verify a passcode is deliberately slow, requiring about 80 milliseconds per attempt. Still, brute force cracking is feasible. For a four-digit passcode, trying 10,000 combinations at 80 milliseconds each would take less than 15 minutes. A six-digit passcode would take about a day.
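
The arithmetic behind those numbers:

    10,000 four-digit passcodes × 80 ms ≈ 800 seconds ≈ 13 minutes
    1,000,000 six-digit passcodes × 80 ms ≈ 80,000 seconds ≈ 22 hours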

This means that passcodes are not terribly secure. Apple mitigates this with additional delays after too many failed attempts. After a few failed attempts, the iPhone will make you wait before you can try again, starting with a one-minute delay, then a five-minute delay, and beyond. This makes a brute force attack impractical.

One might think that you could work around this if you pulled the flash memory out of the iPhone, copied its contents, then tried to crack it on a fast computer. You wouldn't have any software imposing additional delays. As a bonus, the 80 milliseconds needed for each attempt could go a lot faster with a faster computer, and you could run many attempts in parallel. However, this doesn't work. The data encryption is tied to the hardware, so the brute force attack must be run on the original device.

On the older iPhones without the Secure Enclave, there is a weakness in this system. The escalating delays prevent brute force cracking of the passcode, but these delays are just a feature of the phone's OS. The 80 millisecond key derivation is an inherent property of that computation, but the additional minutes or hours delay after too many failed attempts is just the OS code refusing to accept additional input until some time has passed. The FBI wants Apple to build and install a special OS version that doesn't enforce these delays and which allows passcodes to be submitted electronically. This would allow the FBI to crack the passcode within a few minutes or hours. iPhones won't accept OS updates from anyone besides Apple, so this system is secure from outside attackers, but it's not secure against Apple itself.

Note that this is based on the user using a numeric passcode. A good complex password is still secure even on these older iPhones. An eight-character alphanumeric password would take about 550,000 years to try all possible combinations, for example.

Unreadable UIDs
Each iOS CPU is built with a 256-bit unique identifier or UID. This UID is burned into the hardware and not stored anywhere else. The UID is not only inaccessible to the outside world, but it's inaccessible even to the software running at the highest privilege levels on the CPU. Instead, the CPU contains a hardware AES encryption engine, and the only way the UID can be accessed by the hardware is by loading it into the AES engine as a key and using it to encrypt or decrypt data.

Apple uses this hardware to entangle the user's passcode with the device. By setting the device's UID as the AES key and then encrypting the passcode, the result is a random-looking bunch of data which can't be recreated anywhere else, since it depends on both the user's passcode and the secret, unreadable, device-specific UID. This process is repeated over many rounds using the PBKDF2 function, feeding each round's output back into the next round's input, performing the heavy computation needed to force 80 milliseconds of work for each passcode verification attempt.
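
In rough pseudocode, the shape of that derivation looks something like the sketch below. Everything here is invented for illustration; the actual parameters and structure are Apple's private implementation details, and the UID key never leaves the silicon:

    // Stand-in for the hardware AES engine, keyed by the unreadable UID.
    // A faithful implementation can't exist outside the chip; this stub
    // just makes the sketch self-contained.
    func hardwareAESEncryptWithUID(input: [UInt8]) -> [UInt8] {
        return input
    }

    // Iterated entangling of the passcode with the device UID. The many
    // rounds are what make each verification attempt take ~80 ms.
    func deriveKey(passcode: String, rounds: Int) -> [UInt8] {
        var data = Array(passcode.utf8)
        for _ in 0..<rounds {
            data = hardwareAESEncryptWithUID(data)
        }
        return data
    }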

Secure Enclave
The Secure Enclave was introduced with Apple's A7 system on a chip. All iPhones starting with the 5s have it. The 5/5c and below do not. On the iPad side, everything starting from the iPad Mini 2 and the iPad Air have it.

The Secure Enclave is a separate CPU within the A7 (or later) that's responsible for low-level cryptographic operations. It doesn't run iOS or anything resembling iOS, but instead runs a modified L4 microkernel. L4 is intended to run as little code as possible in the kernel, which should theoretically make the system more secure by reducing the amount of potentially buggy code running with elevated privileges. The Secure Enclave uses a secure boot system to ensure that the code it runs can't be modified, and it uses encrypted memory to ensure that the rest of the system can't read or tamper with its data. This effectively forms a little computer within the computer that's difficult to attack.

The Secure Enclave contains its own UID and hardware AES engine. The passcode verification process takes place here, separated from the rest of the system. The Secure Enclave also handles Touch ID fingerprint processing and matching, and authorizing payments for Apple Pay.

The Secure Enclave performs all key management for encrypted files. File encryption applies to nearly all user data. Most system apps use it, and third party apps all use it by default if running on iOS 7 or later. Each encrypted file has a unique key, which is in turn encrypted with another key derived from the device UID and the user's passcode. The main CPU can't read encrypted files on its own. It must request the file's keys from the Secure Enclave, which in turn is unable to provide them without the user's passcode.

The escalating delays for failed passcode attempts are enforced by the Secure Enclave. The main CPU merely submits passcodes and receives the results. The Secure Enclave performs the checks, and if there have been too many failures it will delay performing those checks. The main CPU can't speed things along.

Implications
What does the Secure Enclave mean for the security of the system as a whole?

On most systems, if you can get into the OS kernel then you own the entire system. The kernel can do anything. It can read and write every byte of system memory, it can control all of the hardware, and it's in charge of all of the application code the system runs, which it can subvert at will.

Since the Secure Enclave is a separate CPU mostly cut off from the rest of the system, it isn't under the kernel's control. On an older iPhone, owning the kernel means owning everything done by the system, including the passcode verification process. With the Secure Enclave, no matter who is in control of the main CPU, no matter what code is in the OS running on it, the basic security functions remain intact.

This system essentially allows arbitrary code to be placed in front of cryptographic functions such that this arbitrary code can't be bypassed. It's a bit like a super-sized version of the 80 millisecond computation time for password attempts. That delay is enforced by using a calculation that inherently takes that much time. This means it can't be bypassed, but there are limits on what you can create from the inherent limitations of calculations. For example, you can't add a one-minute delay to the fifth attempt, because raw cryptographic constructs don't have a concept of "fifth attempt." With the Secure Enclave, that one-minute delay can be enforced, since even with the rest of the system subverted, the delay code in the Secure Enclave remains intact.

Software Updates
The iPhone 5c (and other pre-A7 iPhones) can be subjected to a brute force attack by creating a new OS without the artificial delays, loading that onto the device, and then testing passcodes as fast as the hardware can compute. The Secure Enclave prevents this. But what about carrying out the same kind of attack one level further down, by loading new software into the Secure Enclave which eliminates its artificial delays?

Apple's guide contains this discussion of software updates for the Secure Enclave:

"It utilizes its own secure boot and personalized software update separate from the application processor."

That's it! There are no details whatsoever. What is the actual situation? Here, we must enter the realm of speculation, because as far as I can dig up there is no information out there about how Secure Enclave software updates actually work. I see two possibilities.

The first possibility is that the Secure Enclave uses the same sort of software update mechanism as the rest of the device. That is, updates must be signed by Apple, but can be freely applied. This would make the Secure Enclave useless against an attack by Apple itself, since Apple could just create new Secure Enclave software that removes the limitations. The Secure Enclave would still be a useful feature, helping to protect the user if the main OS is exploited by a third party, but it would be irrelevant to the question whether Apple can break into its own devices.

The second possibility is that the Secure Enclave's software update mechanism does something more sophisticated to protect against an attack even from Apple. The whole idea of the Secure Enclave is that it enforces additional rules that can't be bypassed from the outside. This could include additional rules about its own software updates. Given the goal of protecting the user's data, it would make a lot of sense for the Secure Enclave to refuse to apply any software update unless the device has already been unlocked with the user's passcode. For a case where the user has forgotten the passcode and wants to wipe the device and start over, the Secure Enclave could allow updates but delete the master keys protecting the user's data.

Which one is true? For now, we don't know. Apple put in a lot of effort to protect user data, and it would make a lot of sense for them to take the second approach, where updates wipe the device if applied without the user's passcode. This would be fairly easy to implement, and shouldn't affect the usability of the device. Given Apple's public stance on user privacy, I would find it extremely weird if the Secure Enclave's software update mechanism weren't implemented in this way. On the other hand, Tim Cook's public letter implies that all iPhone models are potentially vulnerable, so perhaps they haven't taken this extra step.

When it comes to the matter of law enforcement forcing Apple to attack an iOS device, this is the key question. If Secure Enclave updates are secured even against Apple, then the FBI's ability to make these requests stops at the iPhone 5s. If they're not, then even the latest 6s could potentially be attacked. I am deeply interested in learning the answer to this question.

Conclusion
The Secure Enclave adds an additional line of defense against attacks by implementing core security and cryptography features in a separate CPU within Apple's hardware. This separate CPU runs special software and is walled off from the rest of the system, placing it outside the control of the main OS, including the kernel. The Secure Enclave implements device passcode verification, file encryption, Touch ID recognition, and Apple Pay, and enforces security restrictions such as the escalating delays applied after excessive incorrect passcode attempts.

The iPhone 5c that the FBI is asking Apple to break into predates the Secure Enclave, and so can be subverted by installing a new OS signed by Apple that removes the artificial passcode delays. Whether newer phones can be similarly subverted depends on how the Secure Enclave's software update mechanism is implemented. If software updates erase the master encryption keys when installed without the passcode, then newer iPhones can't be attacked in this way, even by Apple. If updates are allowed without the passcode and without erasing keys, then the Secure Enclave can potentially be subverted just as older iPhones can be. As far as I'm able to determine, whether this is the case remains an open question.

That's it for today! Come back again for more exciting adventures, probably with fewer inaccessible and opaque CPUs. Friday Q&A is driven by reader ideas, so if you have something you'd like to see covered here, please send it in!

Friday Q&A 2016-03-04: Swift Asserts

Asserts are really useful for checking assumptions in code to ensure that errors are caught early and often. Today, I'm going to explore the assert calls available in Swift and how they're implemented, a topic suggested by reader Matthew Young.

I'm not going to spend much time discussing what asserts are in general or where to use them. This article just looks at what's available in Swift and some details of their implementations. If you want to read about how best to use asserts in your code, see my previous article Proper Use of Asserts.

APIs
There are two primary assert functions in the Swift standard library.

The first is creatively named assert. You call it with an expression that's supposed to be true, like so:

    assert(x >= 0) // x can't be negative here

It optionally takes a message that will be printed as part of the failure if the expression is false:

    assert(x >= 0, "x can't be negative here")

assert only functions in non-optimized builds. When optimizations are enabled, the entire thing is compiled out. This is useful for asserting conditions that are so expensive to compute that they would slow down your release builds too much, but which are still useful and important to check when debugging.

Some people prefer to only have asserts in debug builds, under the theory that it's good to have checks when debugging, but it's best not to crash the app out in the real world with real users. However, the error is present regardless of the presence of the assert that checks for it, and if it's not caught right away it's just going to cause havoc down the road. It's much better to fail quickly and obviously when it's practical to do so. That brings us to the next function.

The precondition function is much like assert. Calling it looks the same:

    precondition(x >= 0) // x can't be negative here
    precondition(x >= 0, "x can't be negative here")

The difference is that it performs the check even in optimized builds. This makes it a much better choice for most assertion checks, as long as the check is sufficiently fast.

While precondition remains active in a normal optimized build, it is not active in an "unchecked" optimized build. Unchecked builds are done by specifying -Ounchecked at the command line. These not only remove precondition calls, but also important things like array bounds checks. This is really dangerous, so this option should probably not be used unless you really, really need the performance and there's no other way to achieve it.

One interesting note about unchecked builds is that, while the precondition check is removed, the optimizer will also assume that the condition is always true and use that to optimize the following code. For the above examples, the generated code will no longer check to see if x is negative, but it will compile the code that comes after with the assumption that x is always zero or greater. The same is true of assert.
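
As a concrete illustration, in an unchecked build of this function the compiler is entitled to delete the x < 0 branch entirely:

    func describe(x: Int) -> String {
        precondition(x >= 0)
        // In an -Ounchecked build the check above is removed, but the
        // optimizer still assumes x >= 0, so this branch may be
        // compiled away completely.
        if x < 0 {
            return "negative"
        }
        return "non-negative"
    }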

Each of these functions has a variant without the conditional, which always signals a failure when called. These two variants are assertionFailure and preconditionFailure. This is useful when the condition you're asserting doesn't fit nicely within the call. For example:

    guard case .Thingy(let value) = someEnum else {
        preconditionFailure("This code should only be called with a Thingy.")
    }

The behavior under optimization is similar to the ones with conditions. assertionFailure is compiled out when optimizations are enabled. preconditionFailure remains in optimized builds, but is removed in unchecked optimized builds. In unchecked builds, the optimizer will assume that these functions can never be reached, and will generate code based on that assumption.

Finally, there's fatalError. This function always signals a failure and halts the program, regardless of the optimization level, even in unchecked builds.
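
That makes it the right tool for paths that should be truly impossible. A small example:

    func vertexCount(shape: String) -> Int {
        switch shape {
        case "triangle": return 3
        case "square": return 4
        default: fatalError("unsupported shape: \(shape)") // halts in every build mode
        }
    }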

Logging Caller Info
When you hit an assertion failure, you get a message like this:

    precondition failed: x must be greater than zero: file test.swift, line 6

How does it get the file and line information?

In C, we'd write assert as a macro and use the magic __FILE__ and __LINE__ identifiers to get the info:

    #define assert(condition) do { \
            if(!(condition)) { \
                fprintf(stderr, "Assertion failed %s in file %s line %d\n", #condition, __FILE__, __LINE__); \
                abort(); \
            } \
        } while(0)

These end up being the caller's file and line, because the macro is expanded there. Swift doesn't have macros, so how does it work?

This works in Swift by using default argument values. There are magic identifiers which can be used as the default value for an argument. If the caller doesn't provide an explicit value, then the default value expands to the call location's file and line. Currently, these magic identifiers are __FILE__ and __LINE__, but in the next Swift release they're changing to #file and #line for better consistency with the rest of the language.

To see this in action, we can look at the definition of assert:

    public func assert(
      @autoclosure condition: () -> Bool,
      @autoclosure _ message: () -> String = String(),
      file: StaticString = #file, line: UInt = #line
    )

Normally, you call assert and only pass one or two arguments. The file and line arguments are left as the default, which means that the caller's information is passed in.

You're not required to leave the default values. You can pass in other values if you prefer. You could do this to, for example, lie:

    assert(false, "Guess where!", file: "not here", line: 42)

This produces:

    assertion failed: Guess where!: file not here, line 42

For a more practical use, this allows you to write wrappers that preserve the original call site's information. For example:

    func assertWrapper(
        @autoclosure condition: () -> Bool,
        @autoclosure _ message: () -> String = String(),
        file: StaticString = #file, line: UInt = #line
    ) {
        if !condition() {
            print("Oh no!")
        }
        assert(condition, message, file: file, line: line)
    }

There is one missing piece from the Swift version of assert. In the simple C version above, the expression for the failed assertion is printed by using #condition to get a stringified version of that parameter. Unfortunately, there is no equivalent in Swift, so while Swift can print the file and line number where the failure occurred, it's not able to print the expression that was supposed to be true.

Autoclosures
These functions use the @autoclosure attribute on the condition and message arguments. Why is that?

First, a quick recap in case you're not familiar with what @autoclosure does. The @autoclosure attribute can be applied to an argument of function type which takes no parameters. At the call site, the caller provides an expression for that argument. This expression is then implicitly wrapped in a function, and that function is passed in as the parameter. Here's an example:

    func f(@autoclosure value: () -> Int) {
        print(value())
    }

    f(42)

This is equivalent to:

    func f(value: () -> Int) {
        print(value())
    }

    f({ 42 })

What's the point of passing in an expression as a function? It allows the callee to control when that expression is evaluated. For example, consider the boolean and operator. We could implement this to take two Bool parameters:

    func &&(a: Bool, b: Bool) -> Bool {
        if a {
            if b {
                return true
            }
        }
        return false
    }

This works fine for some things:

    x > 3 && x < 10

However, it's wasteful if the right-hand side operand is expensive to compute:

    x > 3 && expensiveFunction(x) < 10

It can be downright crashy if we assume the right-hand side doesn't execute when the left-hand side is false:

    optional != nil && optional!.value > 3

Like C, Swift's && operator short-circuits. That means that if the left-hand side is false, the right-hand side is never even evaluated. That makes this expression safe with Swift's implementation, but not with ours. @autoclosure lets the function control when the expression is evaluated, to ensure that it's only evaluated when the left-hand side is true:

    func &&(a: Bool, @autoclosure b: () -> Bool) -> Bool {
        if a {
            if b() {
                return true
            }
        }
        return false
    }

Now the semantics match Swift's semantics, because when a is false then b is never called.

How does this apply to the asserts? It's all about performance. The assert's message may be expensive to compute. Imagine:

    assert(widget.valid, "Widget wasn't valid: \(widget.dump())")

You don't want to compute that big string every time through, even when the widget is valid and nothing is going to be printed. By using @autoclosure for the message argument, assert can avoid evaluating the message expression unless the assert actually fails.

The condition itself is also an @autoclosure. Why? Because assert doesn't check the condition in optimized builds. If it doesn't check the condition, there's no point in even evaluating it. Using @autoclosure means that this doesn't slow down optimized builds:

    assert(superExpensiveFunction())

All of the functions in this API use @autoclosure to ensure that the parameters aren't evaluated unless they really need to be. For some reason, even fatalError uses it, even though fatalError executes unconditionally.

Code Removal
These functions are removed from the generated code depending on how your code is compiled. They exist in the Swift standard library, not your code, and the standard library was compiled long before your code was. How does that work?

In C, it's all about the macros. Macros just exist in the header, so their code is compiled at the call site. Even if they're conceptually part of a library, they're actually just dumped straight into your own code. That means they can check for the existence of a DEBUG macro or similar, and produce no code when it's not set. For example:

    #if DEBUG
    #define assert(condition) do { \
            if(!(condition)) { \
                fprintf(stderr, "Assertion failed %s in file %s line %d\n", #condition, __FILE__, __LINE__); \
                abort(); \
            } \
        } while(0)
    #else
    #define assert(condition) (void)0
    #endif

And again, Swift doesn't have macros, so how does this work?

If you look at the definition of these functions in the standard library, you'll see that they're all annotated with @_transparent. This attribute makes the function a little bit macro-like. Every call is inlined at the call site rather than emitted as a call to a separate function. When you write precondition(...) in Swift code, the body of the standard library precondition function gets pulled into your code and treated as if you had copy/pasted it in yourself. That means that it gets compiled under the same conditions as the rest of your code, and the optimizer is able to see all the way into the function body. It can see that nothing happens in assert when optimizations are enabled and remove the entire thing.
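
To get a feel for this, here's a minimal sketch of using the attribute on a function of your own. Keep in mind that the leading underscore marks @_transparent as an unofficial, unsupported attribute, so this is for illustration only:

    // The body of this function is inlined at every call site, so the
    // caller's optimization settings apply to it, just as with assert.
    @_transparent
    public func squared(_ x: Int) -> Int {
        return x * x
    }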

The standard library is a separate library. How can functions from a separate library be inlined into your own code? Coming from a C universe where libraries consist of compiled object code, this makes no sense.

The Swift standard library is provided as a .swiftmodule file, which is a completely different beast from a .dylib or .a file. A .swiftmodule file contains declarations for everything in the module, but it can also contain full implementations. To quote the module format documentation:

The SIL block contains SIL-level implementations that can be imported into a client's SILModule context.

That means that the full bodies of the various assert functions are saved, in an intermediate form, into the standard library module. Those bodies are then available to be inlined wherever you call them. Since they're inlined, they have access to the context in which they're compiled, and the optimizer can remove them entirely when it's warranted.

Conclusion
Swift provides a nice set of assert functions. The assert function and its companion assertionFailure are only active in non-optimized builds, which makes them suitable for checks that are too slow to leave in production code. The precondition and preconditionFailure functions are active in normal optimized builds as well.

These functions use the @autoclosure attribute on their condition and message parameters, which allows them to control when those parameters are evaluated. This prevents custom assert messages from being evaluated every time an assertion is checked, and it prevents assertion conditions from being evaluated when the assertion is disabled in optimized builds.

The assert functions are part of the standard library, but their use of the @_transparent attribute causes the generated intermediate code to be emitted into the module file. When they're called, the entire body is inlined at the call site, which allows the optimizer to remove the call entirely when it's appropriate.

That's it for today! I hope knowing what's going on might encourage you to use more asserts in your code. They can help a lot by making problems show themselves immediately and obviously, rather than causing subtle symptoms long after the initial problem occurred. Come back next time for more exciting ideas. Until then, Friday Q&A is driven by reader ideas, so if you have a topic you'd like to see covered here, please send it in!

Friday Q&A 2016-04-15: Performance Comparisons of Common Operations, 2016 Edition


Back in the mists of time, before Friday Q&A was a thing, I posted some articles running performance tests on common operations and discussing the results. The most recent one was from 2008, running on 10.5 and the original iPhone OS, and it's long past time to do an update.

Previous Articles
If you'd like to compare with decades past, here are the links to the previous articles:

(Note that the name of Apple's mobile OS didn't become "iOS" until 2010.)

Overview
Performance testing can be dangerous. Tests are usually highly artificial, unless you have a specific application with a real-world workload you can test. These particular tests are certainly artificial, and the results may not reflect how things actually perform in your own programs. The idea is just to give you a feel for the rough order of magnitude, not put a precise number on everything.

It's particularly difficult to measure extremely fast operations, like an Objective-C message send or a simple arithmetic operation. Modern CPUs are heavily pipelined and parallel, and the time such an operation takes in isolation may not correspond with the time it takes when in the context of a real program. Adding one of these operations into the middle of other code may not increase the running time of that code at all, if it's sufficiently independent that the CPU can run it in parallel. On the other hand, it could increase the running time a lot if it ties up important resources.

Performance also depends on external factors. Many modern CPUs will run faster when cold, and throttle down as they get hot. Filesystem performance will depend on the storage hardware and the state of the filesystem. Even relative performance can differ.

If something is performance critical, you always want to measure and profile it so you can see exactly what takes time in your code and know where to concentrate your efforts. It can and will surprise you to find out what's actually slow in working code.

All that said, it's still really useful to have a rough idea of how fast various things are compared to each other. It's worth a little effort to avoid writing a ton of data to the filesystem if you don't have to. It's probably not worth a little effort to avoid a single message send. In between, it depends.

Methodology
The code used for these tests is available on GitHub:

https://github.com/mikeash/PerformanceTest

The code is written in Objective-C++, with the core performance measuring code written in C. I don't yet have a good enough handle on how Swift performs to feel like I could do a good job of this in Swift.

The basic technique is simple: run the operation in question in a loop for a few seconds. Divide the total running time by the number of loop iterations to get the time per operation. The number of iterations is hardcoded, and I chose that number by experiment to make the test run for a reasonable amount of time.

I attempt to account for the overhead of the loop itself. This overhead is completely unimportant for the slower operations, but is substantial for the faster ones. To do this, I time an empty loop, then subtract the time per iteration from the times measured for the other tests.
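
The real harness is Objective-C++, but the technique is easy to sketch. Here's a hypothetical Swift rendition (names invented); note that the closure call adds overhead of its own that the real C code avoids:

    import Foundation

    // Time a loop and divide by the iteration count to get time per operation.
    func nanosecondsPerOperation(iterations: Int, _ body: () -> Void) -> Double {
        let start = DispatchTime.now().uptimeNanoseconds
        for _ in 0 ..< iterations {
            body()
        }
        let end = DispatchTime.now().uptimeNanoseconds
        return Double(end - start) / Double(iterations)
    }

    // Subtract the cost of an empty loop to account for loop overhead.
    let overhead = nanosecondsPerOperation(iterations: 100_000_000) {}
    let mallocTime = nanosecondsPerOperation(iterations: 100_000_000) {
        free(malloc(16))
    } - overhead
    print("16-byte malloc/free: \(mallocTime) ns")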

For some tests, the test code appears to get pipelined in with the loop code. This produces amazingly low times for those tests, but the results are false. To compensate for this, all of the fast operations are manually unrolled so that a single loop iteration executes the test ten times, which I hope produces a more realistic result.

The tests are compiled and run without optimizations. This is contrary to what we normally do in the real world, but I think it's the best choice here. For operations which mostly depend on external code, like working with files or decoding JSON, it makes little difference. For short operations like arithmetic or method calls, it's difficult to write a test that doesn't just get optimized away entirely as the compiler realizes that the test doesn't do anything that's externally visible. Optimization will also change how the loop is compiled, making it hard to account for loop overhead.

The Mac tests were run on my 2013 Mac Pro, with a 3.5GHz Xeon E5 running OS X 10.11.4. The iOS tests were run on an iPhone 6s running iOS 9.3.1.

The Mac Tests
Here are the Mac numbers. Each test lists what it tested, how many iterations the test runs, the total time it took to run the test, and the per-operation time. All times are listed with loop overhead subtracted.

Name | Iterations | Total time (sec) | Time per (ns)
16 byte memcpy | 1000000000 | 0.7 | 0.7
C++ virtual method call | 1000000000 | 1.5 | 1.5
IMP-cached message send | 1000000000 | 1.6 | 1.6
Objective-C message send | 1000000000 | 2.6 | 2.6
Floating-point division with integer conversion | 1000000000 | 3.7 | 3.7
Floating-point division | 1000000000 | 3.7 | 3.7
Integer division | 1000000000 | 6.2 | 6.2
ObjC retain and release | 100000000 | 2.3 | 23.2
Autorelease pool push/pop | 100000000 | 2.5 | 25.2
Dispatch_sync | 100000000 | 2.9 | 29.0
16-byte malloc/free | 100000000 | 5.5 | 55.4
Object creation | 10000000 | 1.0 | 101.0
NSInvocation message send | 10000000 | 1.7 | 174.3
16MB malloc/free | 10000000 | 3.2 | 317.1
Dispatch queue create/destroy | 10000000 | 4.1 | 411.2
Simple JSON encode | 1000000 | 1.4 | 1421.0
Simple JSON decode | 1000000 | 2.7 | 2659.5
Simple binary plist decode | 1000000 | 2.7 | 2666.1
NSView create/destroy | 1000000 | 3.3 | 3272.1
Simple XML plist decode | 1000000 | 5.5 | 5481.6
Read 16 byte file | 1000000 | 6.4 | 6449.0
Simple binary plist encode | 1000000 | 8.8 | 8813.2
Dispatch_async and wait | 1000000 | 9.3 | 9343.5
Simple XML plist encode | 1000000 | 9.5 | 9480.9
Zero-second delayed perform | 100000 | 2.0 | 19615.0
pthread create/join | 100000 | 2.8 | 27755.3
1MB memcpy | 100000 | 5.6 | 56310.6
Write 16 byte file | 10000 | 1.7 | 165444.3
Write 16 byte file (atomic) | 10000 | 2.4 | 237907.9
Read 16MB file | 1000 | 3.4 | 3355650.0
NSWindow create/destroy | 1000 | 10.6 | 10590507.9
NSTask process spawn | 100 | 6.7 | 66679149.2
Write 16MB file (atomic) | 30 | 2.8 | 94322686.1
Write 16MB file | 30 | 3.1 | 104137671.1

The first thing that stands out in this table is the first entry in it. The 16-byte memcpy test takes less than a nanosecond per call. Looking at the generated code, the compiler is smart enough to turn the call to memcpy into a sequence of mov instructions, even with optimizations off. This is an interesting lesson: just because you write a function call doesn't mean the compiler has to generate one.

A C++ virtual method call and an ObjC message send with a cached IMP both take about the same amount of time. They're essentially the same operation: an indirect function call through a function pointer.

A normal Objective-C message send is a bit slower, as we'd expect. Still, the speed of objc_msgSend continues to astound me. Considering that it performs a full hash table lookup followed by an indirect jump to the result, the fact that it runs in 2.6 nanoseconds is amazing. That's about 9 CPU cycles. In the 10.5 days it was a dozen or more, so we've seen a nice improvement. To turn this number upside down, if you did nothing but Objective-C message sends, you could do about 400 million of them per second on this computer.

Using NSInvocation to call a method is much slower, as expected. NSInvocation has to construct the message at runtime, doing the work that the compiler does at compile time for each call. Fortunately, NSInvocation is rarely a bottleneck in real programs. It appears to have slowed down since 10.5, with an NSInvocation call taking about twice as much time in this test compared to the old one, even though this test is running on faster hardware.

A retain and release pair take about 23 nanoseconds together. Modifying an object's reference count must be thread safe, so it requires an atomic operation which is relatively expensive when we're down at the nanosecond level counting individual CPU cycles.

Autorelease pools have become quite a bit faster than they used to be. In the old test, creating and destroying an autorelease pool took well over 300ns. Here, it shows up at 25ns. The implementation of autorelease pools has been completely redone and the new implementation is a lot faster, so this is no surprise. Pools used to be instances of the NSAutoreleasePool class, but now they're done using runtime functions which just do some pointer manipulation. At 25ns, you can afford to sprinkle @autoreleasepool anywhere you even suspect you might accumulate some autoreleased objects.

Allocating and freeing 16 bytes costs about the same as it did before, but larger allocations have become significantly faster. Allocating and freeing 16MB took about 4.5 microseconds back in the day, but only took about 300 nanoseconds here. Typical apps do tons of memory allocations, so this is a great improvement.

Objective-C object creation also got a nice speedup, from almost 300ns to about 100ns. Obviously, the typical app creates and destroys a lot of Objective-C objects, so this is really useful. On the flip side, consider that you can send an existing object about 40 messages in the same amount of time it takes to create and destroy a new object, so it's still a significantly more expensive operation, especially considering that most objects will take more time to create and destroy than a simple NSObject instance does.

The dispatch_queue tests show an interesting contrast between the various operations. A dispatch_sync on an uncontended queue is extremely fast, under 30ns. GCD is smart and doesn't do any cross-thread calls for this case, so it ends up just acquiring and then releasing a lock. dispatch_async takes a lot longer, since it has to find a worker thread to use, wake it up, and get the call over to it. Creating and destroying a dispatch_queue is pretty cheap, with a time comparable to creating an Objective-C object. GCD is able to share all of the heavyweight threading stuff, so the individual queues don't contain very much.

I added tests for JSON and property list serialization and deserialization, which I didn't test the last time around. With the rise of the iPhone, these things became a lot more prominent. These tests encode or decode a simple three-element dictionary. As expected, it's relatively slow compared to simple, low-level stuff like message sends, but it's still in the microseconds range. It's interesting that JSON outperforms property lists, even binary property lists, which I expected would be the fastest. This could be because JSON sees more use and so gets more attention, or it might just be that the JSON format is actually faster to parse. Or it might be that testing with a three-element dictionary isn't realistic, and the relative speeds would look different for something larger.

Zero-second delayed performs come in pretty heavyweight, relatively speaking, at about twice the cost of a dispatch_async. Runloops have a lot of work to do, it seems.

Creating a pthread and then waiting for it to terminate is another relatively heavyweight operation, taking a bit under 30 microseconds. We can see why GCD uses a thread pool and tries not to create new threads unless it's necessary. However, this is one test which got a lot faster since the old days. This same test took well over 100 microseconds in the old test.

Creating an NSView instance is fast, at about 3 microseconds. In contrast, creating an NSWindow is much slower, taking about 10 milliseconds. NSView is really a relatively light structure that represents an area of a window, while an NSWindow represents a chunk of pixel buffer in the window server. Creating one involves communicating with the window server to have it create the necessary structures, and it also requires a lot of work to set up all the various internal objects an NSWindow needs, like views for the title bar. You can go crazy with the views, but you might want to go easy on the windows.

File access is, as always, pretty slow. SSDs make it a lot faster, but there's still a ton of stuff going on there. Do it if you have to, try not to do it if you don't have to.

The iOS Tests
Here are the iOS results.

Name | Iterations | Total time (sec) | Time per (ns)
C++ virtual method call | 1000000000 | 0.8 | 0.8
IMP-cached message send | 1000000000 | 1.2 | 1.2
Floating-point division with integer conversion | 1000000000 | 1.5 | 1.5
Integer division | 1000000000 | 2.1 | 2.1
Objective-C message send | 1000000000 | 2.7 | 2.7
Floating-point division | 1000000000 | 3.5 | 3.5
16 byte memcpy | 1000000000 | 5.3 | 5.3
Autorelease pool push/pop | 100000000 | 1.5 | 14.7
ObjC retain and release | 100000000 | 3.7 | 36.9
Dispatch_sync | 100000000 | 7.9 | 79.0
16-byte malloc/free | 100000000 | 8.6 | 86.2
Object creation | 10000000 | 1.2 | 119.8
NSInvocation message send | 10000000 | 2.7 | 268.3
Dispatch queue create/destroy | 10000000 | 6.4 | 636.0
Simple JSON encode | 1000000 | 1.5 | 1464.5
16MB malloc/free | 10000000 | 15.2 | 1524.7
Simple binary plist decode | 1000000 | 2.4 | 2430.0
Simple JSON decode | 1000000 | 2.5 | 2515.9
UIView create/destroy | 1000000 | 3.8 | 3800.7
Simple XML plist decode | 1000000 | 5.5 | 5519.2
Simple binary plist encode | 1000000 | 7.6 | 7617.7
Simple XML plist encode | 1000000 | 10.5 | 10457.4
Dispatch_async and wait | 1000000 | 18.1 | 18096.2
Zero-second delayed perform | 100000 | 2.4 | 24229.2
Read 16 byte file | 1000000 | 27.2 | 27156.1
pthread create/join | 100000 | 3.7 | 37232.0
1MB memcpy | 100000 | 11.7 | 116557.3
Write 16 byte file | 10000 | 20.2 | 2022447.6
Write 16 byte file (atomic) | 10000 | 30.6 | 3055743.8
Read 16MB file | 1000 | 6.2 | 6169527.5
Write 16MB file (atomic) | 30 | 1.6 | 52226907.3
Write 16MB file | 30 | 2.3 | 78285962.9

The most remarkable thing about this is how similar it looks to the Mac results above. Looking back at the old tests, the iPhone was orders of magnitude slower. An Objective-C message send, for example, was about 4.9ns on the Mac, but it took an eternity on the iPhone at nearly 200ns. A simple C++ virtual method call took a bit over a nanosecond on the Mac, but 80ns on the iPhone. A small malloc/free at around 50ns on the Mac took about 2 microseconds on the iPhone.

Comparing the two today, things have clearly changed a lot in the mobile world. Most of these numbers are just slightly worse than the Mac numbers. Some are actually faster! For example, autorelease pools are substantially faster on the iPhone. I guess ARM64 is better at doing the stuff that the autorelease pool code does.

Reading and writing small files stands out as an area where the iPhone is substantially slower. The 16MB file tests are comparable to the Mac, but the iPhone takes nearly ten times longer for the 16-byte file tests. It appears that the iPhone's storage has excellent throughput but suffers somewhat in latency compared to the Mac's.

Conclusion
An excessive focus on performance can interfere with writing good code, but it's good to keep in mind the rough performance of the common operations we perform in our programs. That performance changes as software and hardware improves. The Mac has seen some nice improvements over the years, but the progress on the iPhone is remarkable. In eight years, it's gone from being almost a hundred times slower to being roughly on par with the Mac.

That's it for today. Come back next time for more fun stuff. Friday Q&A is driven by reader suggestions, so if you have a topic you'd like to see covered next time or some other time, please send it in!

Good News, Bad News, and Ugly News


The good news is that I'm officially restarting work on The Complete Friday Q&A: Volume II. I got partway into it a while ago and ran out of steam. The restarted edition includes all posts made since then, making it pretty massive. I can't commit to a specific timeframe, but I hope that it will be a few months at most before I have it out. There may be opportunities for reader involvement in checking and polishing it, so watch this space.

The bad news is that I can't keep up with new blog posts at the same time. It just hasn't worked out. That's part of why I've been quiet lately, and so I'm suspending regular posts for the duration. I may make occasional irregular posts in the meantime (I'm sure WWDC will have something worth discussing) and I'll resume a more regular schedule once the book is done.

The ugly news is... there is no ugly news, that was just a pointless movie reference. Don't panic!


Advanced Swift Workshop in Washington, DC


I will be holding a one-day workshop on advanced Swift programming in the Washington, DC area on December 12th. If you enjoy my articles and want to sharpen your Swift skills, check it out.

I'm going to be discussing the ins and outs of ARC and memory management, reference cycles, enums, generics, designing code to take advantage of enums and generics, pointer APIs, and interfacing with C APIs. I have been building out a set of nifty Xcode playgrounds to illustrate everything, and attendees will receive a copy of them, as well as the presentation slides.

The format will be part lecture, part exercises using the playgrounds, with plenty of opportunity for discussions and personalized help.

If you think you might like to come, take a look at the event on Eventbrite. And if you know others who might like to come, please tell them about it!

In unrelated news, since I'm sure some of you are wondering, Volume II of my book is coming along slowly but surely, and I hope to get it out the door and get back to writing articles before too much longer. Stay tuned!

Advanced Swift Workshop in New York City


I will be holding another one-day workshop on advanced Swift programming in New York City on May 4th. This will be much the same as my previous one in Washington in December, in a new location and with various tweaks and improvements. If you enjoy my articles and want to sharpen your Swift skills, check it out.

I'll discuss the ins and outs of ARC and memory management, reference cycles, enums, generics, designing code to take advantage of enums and generics, pointer APIs, and interfacing with C APIs. Attendees will receive a bunch of Xcode playgrounds that illustrate everything we discuss, as well as the presentation slides.

The format will be part lecture, part exercises using the playgrounds, with plenty of opportunity for discussions and personalized help.

For more information, or to buy tickets, visit the event page. If you know anyone who might like to come, please pass the word!

Another book update, since my last workshop post included one: I've completed my final read-through and am in the middle of fixing the problems that uncovered. Then it'll be ready to go! I don't have a date for it, but it's getting close.

More Advanced Swift Workshop, and Blog and Book Updates


I'm hoping to resume a regular posting schedule soon, and I wanted to give everybody some updates.

First, I'm holding two more Advanced Swift Workshops next month, one in DC on July 13th and one in New York City on July 24th. Click here for the one in DC, and here for the one in New York City. As with the previous ones, we'll be covering various advanced topics on Swift programming, with yours truly presenting and a small group with lots of opportunity for discussion and experimentation.

Second, The Complete Friday Q&A Volume II is nearly ready. Not quite there yet, but the text and layout are done and it's down to some final tweaking and doing the work of actually getting it out there. Stay tuned.

Last, I hope to resume regular posts in the next couple of weeks. There were some interesting things from WWDC that I want to write about, plus my usual routine of crazy topics. In particular, I did a thorough analysis of the latest implementation of objc_msgSend for ARM64 for a talk that I'd like to write up, and I want to write something about Swift's new Codable stuff. As always, I'm driven by reader suggestions, so if you have something from WWDC you'd like to see, or something unrelated you think would be cool, send it in!

Friday Q&A 2017-06-30: Dissecting objc_msgSend on ARM64


We're back! During the week of WWDC, I spoke at CocoaConf Next Door, and one of my talks involved a dissection of objc_msgSend's ARM64 implementation. I thought that turning it into an article would make for a nice return to blogging for Friday Q&A.

Overview
Every Objective-C object has a class, and every Objective-C class has a list of methods. Each method has a selector, a function pointer to the implementation, and some metadata. The job of objc_msgSend is to take the object and selector that's passed in, look up the corresponding method's function pointer, and then jump to that function pointer.

Looking up a method can be extremely complicated. If a method isn't found on a class, then it needs to continue searching in the superclasses. If no method is found at all, then it needs to call into the runtime's message forwarding code. If this is the very first message being sent to a particular class, then it has to call that class's +initialize method.

Looking up a method also needs to be extremely fast in the common case, since it's done for every method call. This, of course, is in conflict with the complicated lookup process.

Objective-C's solution to this conflict is the method cache. Each class has a cache which stores methods as pairs of selectors and function pointers, known in Objective-C as IMPs. They're organized as a hash table so lookups are fast. When looking up a method, the runtime first consults the cache. If the method isn't in the cache, it follows the slow, complicated procedure, and then places the result into the cache so that the next time can be fast.

objc_msgSend is written in assembly. There are two reasons for this: one is that it's not possible to write a function which preserves unknown arguments and jumps to an arbitrary function pointer in C. The language just doesn't have the necessary features to express such a thing. The other reason is that it's extremely important for objc_msgSend to be fast, so every last instruction of it is written by hand so it can go as fast as possible.

Naturally, you don't want to write the whole complicated message lookup procedure in assembly language. It's not necessary, either, because things are going to be slow no matter what the moment you start going through it. The message send code can be divided into two parts: there's the fast path in objc_msgSend itself, which is written in assembly, and the slow path implemented in C. The assembly part looks up the method in the cache and jumps to it if it's found. If the method is not in the cache, then it calls into the C code to handle things.

Therefore, when looking at objc_msgSend itself, it does the following:

  1. Get the class of the object passed in.
  2. Get the method cache of that class.
  3. Use the selector passed in to look up the method in the cache.
  4. If it's not in the cache, call into the C code.
  5. Jump to the IMP for the method.

How does it do all of that? Let's see!

Instruction by Instruction
objc_msgSend has a few different paths it can take depending on circumstances. It has special code for handling things like messages to nil, tagged pointers, and hash table collisions. I'll start by looking at the most common, straight-line case where a message is sent to a non-nil, non-tagged pointer and the method is found in the cache without any need to scan. I'll note the various branching-off points as we go through them, and then once we're done with the common path I'll circle back and look at all of the others.

I'll list each instruction or group of instructions followed by a description of what it does and why. Just remember to look up to find the instruction any given piece of text is discussing.

Each instruction is preceded by its offset from the beginning of the function. These offsets serve as labels, and let you identify jump targets.

ARM64 has 31 integer registers which are 64 bits wide. They're referred to with the notation x0 through x30. It's also possible to access the lower 32 bits of each register as if it were a separate register, using w0 through w30. Registers x0 through x7 are used to pass the first eight parameters to a function. That means that objc_msgSend receives the self parameter in x0 and the selector _cmd parameter in x1.

Let's begin!

    0x0000 cmp     x0, #0x0
    0x0004 b.le    0x6c

This performs a signed comparison of self with 0 and jumps elsewhere if the value is less than or equal to zero. A value of zero is nil, so this handles the special case of messages to nil. This also handles tagged pointers. Tagged pointers on ARM64 are indicated by setting the high bit of the pointer. (This is an interesting contrast with x86-64, where it's the low bit.) If the high bit is set, then the value is negative when interpreted as a signed integer. For the common case of self being a normal pointer, the branch is not taken.

    0x0008 ldr    x13, [x0]

This loads self's isa by loading the 64-bit quantity pointed to by x0, which contains self. The x13 register now contains the isa.

    0x000c and    x16, x13, #0xffffffff8

ARM64 can use non-pointer isas. Traditionally the isa points to the object's class, but non-pointer isa takes advantage of spare bits by cramming some other information into the isa as well. This instruction performs a logical AND to mask off all the extra bits, and leaves the actual class pointer in x16.

    0x0010 ldp    x10, x11, [x16, #0x10]

This is my favorite instruction in objc_msgSend. It loads the class's cache information into x10 and x11. The ldp instruction loads two registers' worth of data from memory into the registers named in the first two arguments. The third argument describes where to load the data, in this case at offset 16 from x16, which is the area of the class which holds the cache information. The cache itself looks like this:

    typedef uint32_t mask_t;

    struct cache_t {
        struct bucket_t *_buckets;
        mask_t _mask;
        mask_t _occupied;
    };

Following the ldp instruction, x10 contains the value of _buckets, and x11 contains _occupied in its high 32 bits, and _mask in its low 32 bits.

_occupied specifies how many entries the hash table contains, and plays no role in objc_msgSend. _mask is important: it describes the size of the hash table as a convenient AND-able mask. Its value is always a power of two minus 1, or in binary terms something that looks like 000000001111111 with a variable number of 1s at the end. This value is needed to figure out the lookup index for a selector, and to wrap around the end when searching the table.

    0x0014 and    w12, w1, w11

This instruction computes the starting hash table index for the selector passed in as _cmd. x1 contains _cmd, so w1 contains the bottom 32 bits of _cmd. w11 contains _mask as mentioned above. This instruction ANDs the two together and places the result into w12. The result is the equivalent of computing _cmd % table_size but without the expensive modulo operation.
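
To make the trick concrete, here's the same computation sketched in Swift, with a made-up table size and selector address:

    let tableSize: UInt64 = 8          // always a power of two
    let mask = tableSize - 1           // binary 0111
    let cmd: UInt64 = 0x19_f2e8_4a30   // hypothetical selector address
    let index = cmd & mask             // same result as cmd % tableSize, no division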

    0x0018 add    x12, x10, x12, lsl #4

The index is not enough. To start loading data from the table, we need the actual address to load from. This instruction computes that address by adding the table index to the table pointer. It shifts the table index left by 4 bits first, which multiplies it by 16, because each table bucket is 16 bytes. x12 now contains the address of the first bucket to search.

    0x001c ldp    x9, x17, [x12]

Our friend ldp makes another appearance. This time it's loading from the pointer in x12, which points to the bucket to search. Each bucket contains a selector and an IMP. x9 now contains the selector for the current bucket, and x17 contains the IMP.

    0x0020 cmp    x9, x1
    0x0024 b.ne   0x2c

These instructions compare the bucket's selector in x9 with _cmd in x1. If they're not equal then this bucket does not contain an entry for the selector we're looking for, and in that case the second instruction jumps to offset 0x2c, which handles non-matching buckets. If the selectors do match, then we've found the entry we're looking for, and execution continues with the next instruction.

    0x0028 br    x17

This performs an unconditional jump to x17, which contains the IMP loaded from the current bucket. From here, execution will continue in the actual implementation of the target method, and this is the end of objc_msgSend's fast path. All of the argument registers have been left undisturbed, so the target method will receive all passed in arguments just as if it had been called directly.

When everything is cached and all the stars align, this path can execute in less than 3 nanoseconds on modern hardware.

That's the fast path, how about the rest of the code? Let's continue with the code for a non-matching bucket.

    0x002c cbz    x9, __objc_msgSend_uncached

x9 contains the selector loaded from the bucket. This instruction compares it with zero and jumps to __objc_msgSend_uncached if it's zero. A zero selector indicates an empty bucket, and an empty bucket means that the search has failed. The target method isn't in the cache, and it's time to fall back to the C code that performs a more comprehensive lookup. __objc_msgSend_uncached handles that. Otherwise, the bucket doesn't match but isn't empty, and the search continues.

    0x0030 cmp    x12, x10
    0x0034 b.eq   0x40

These instructions compare the current bucket address in x12 with the beginning of the hash table in x10. If they match, the code jumps to the part that wraps the search back to the end of the hash table. We haven't seen it yet, but the hash table search being performed here actually runs backwards. The search examines decreasing indexes until it hits the beginning of the table, then it starts over at the end. I'm not sure why it works this way rather than the more common approach of increasing addresses that wrap to the beginning, but it's a safe bet that it's because it ends up being faster this way.

Offset 0x40 handles the wraparound case. Otherwise, execution proceeds to the next instruction.

    0x0038 ldp    x9, x17, [x12, #-0x10]!

Another ldp, once again loading a cache bucket. This time, it loads from offset -0x10 from the address of the current cache bucket. The exclamation point at the end of the address reference is an interesting feature. This indicates a register write-back, which means that the register is updated with the newly computed value. In this case, it's effectively doing x12 -= 16 in addition to loading the new bucket, which makes x12 point to that new bucket.

    0x003c b      0x20

Now that the new bucket is loaded, execution can resume with the code that checks to see if the current bucket is a match. This loops back up to the instruction labeled 0x0020 above, and runs through all of that code again with the new values. If it continues to find non-matching buckets, this code will keep running until it finds a match, an empty bucket, or hits the beginning of the table.

    0x0040 add    x12, x12, w11, uxtw #4

This is the target for when the search wraps. x12 contains a pointer to the current bucket, which in this case is also the first bucket. w11 contains the table mask, which is the size of the table. This adds the two together, while also shifting w11 left by 4 bits, multiplying it by 16. The result is that x12 now points to the end of the table, and the search can resume from there.

    0x0044 ldp    x9, x17, [x12]

The now-familiar ldp loads the new bucket into x9 and x17.

    0x0048 cmp    x9, x1
    0x004c b.ne   0x54
    0x0050 br     x17

This code checks to see if the bucket matches and jumps to the bucket's IMP. It's a duplicate of the code at 0x0020 above.

    0x0054 cbz    x9, __objc_msgSend_uncached

Just like before, if the bucket is empty then it's a cache miss and execution proceeds into the comprehensive lookup code implemented in C.

    0x0058 cmp    x12, x10
    0x005c b.eq   0x68

This checks for wraparound again, and jumps to 0x68 if we've hit the beginning of the table a second time. In this case, it jumps into the comprehensive lookup code implemented in C:

    0x0068 b      __objc_msgSend_uncached

This is something that should never actually happen. The table grows as entries are added to it, and it's never 100% full. Hash tables become inefficient when they're too full because collisions become too common.

Why is this here? A comment in the source code explains:

Clone scanning loop to miss instead of hang when cache is corrupt. The slow path may detect any corruption and halt later.

I doubt that this is common, but evidently the folks at Apple have seen memory corruption which caused the cache to be filled with bad entries, and jumping into the C code improves the diagnostics.

The existence of this check should have minimal impact on code that doesn't suffer from this corruption. Without it, the original loop could be reused, which would save a bit of instruction cache space, but the effect is minimal. This wraparound handler is not the common case anyway. It will only be invoked for selectors that hash near the beginning of the table, and then only if there's a collision and all the prior entries are occupied.

    0x0060 ldp    x9, x17, [x12, #-0x10]!
    0x0064 b      0x48

The remainder of this loop is the same as before. Load the next bucket into x9 and x17, update the bucket pointer in x12, and go back to the top of the loop.

That's the end of the main body of objc_msgSend. What remains are special cases for nil and tagged pointers.

Tagged Pointer Handler
You'll recall that the very first instructions checked for those and jumped to offset 0x6c to handle them. Let's continue from there:

    0x006c b.eq    0xa4

We've arrived here because self is less than or equal to zero. Less than zero indicates a tagged pointer, and zero is nil. The two cases are handled completely differently, so the first thing the code does here is check to see whether self is nil or not. If self is equal to zero then this instruction branches to 0xa4, which is where the nil handler lives. Otherwise, it's a tagged pointer, and execution continues with the next instruction.

Before we move on, let's briefly discuss how tagged pointers work. Tagged pointers support multiple classes. The top four bits of the tagged pointer (on ARM64) indicate which class the "object" is. They are essentially the tagged pointer's isa. Of course, four bits isn't nearly enough to hold a class pointer. Instead, there's a special table which stores the available tagged pointer classes. The class of a tagged pointer "object" is found by looking up the index in that table which corresponds to the top four bits.

This isn't the whole story. Tagged pointers (at least on ARM64) also support extended classes. When the top four bits are all set to 1 the next eight bits are used to index into an extended tagged pointer class table. This allows the runtime to support more tagged pointer classes, at the cost of having less storage for them.

Let's continue.

    0x0070 mov    x10, #-0x1000000000000000

This sets x10 to an integer value with the top four bits set and all other bits set to zero. This will serve as a mask to extract the tag bits from self.

    0x0074 cmp    x0, x10
    0x0078 b.hs   0x90

This checks for an extended tagged pointer. If self is greater than or equal to the value in x10, then that means the top four bits are all set. In that case, branch to 0x90 which will handle extended classes. Otherwise, use the primary tagged pointer table.

    0x007c adrp   x10, _objc_debug_taggedpointer_classes@PAGE
    0x0080 add    x10, x10, _objc_debug_taggedpointer_classes@PAGEOFF

This little song and dance loads the address of _objc_debug_taggedpointer_classes, which is the primary tagged pointer table. ARM64 requires two instructions to load the address of a symbol. This is a standard technique on RISC-like architectures. Pointers on ARM64 are 64 bits wide, and instructions are only 32 bits wide. It's not possible to fit an entire pointer into one instruction.

x86 doesn't suffer from this problem, since it has variable-length instructions. It can just use a 10-byte instruction, where two bytes identify the instruction itself and the target register, and eight bytes hold the pointer value.

On a machine with fixed-length instructions, you load the value in pieces. In this case, only two pieces are needed. The adrp instruction loads the top part of the value, and the add then adds in the bottom part.

    0x0084 lsr    x11, x0, #60

The tagged class index is in the top four bits of x0. To use it as an index, it has to be shifted right by 60 bits so it becomes an integer in the range 0-15. This instruction performs that shift and places the index into x11.

    0x0088 ldr    x16, [x10, x11, lsl #3]

This uses the index in x11 to load the entry from the table that x10 points to. The x16 register now contains the class of this tagged pointer.

    0x008c b      0x10

With the class in x16, we can now branch back to the main code. The code starting with offset 0x10 assumes that the class pointer is loaded into x16 and performs dispatch from there. The tagged pointer handler can therefore just branch back to that code rather than duplicating logic here.

    0x0090 adrp   x10, _objc_debug_taggedpointer_ext_classes@PAGE
    0x0094 add    x10, x10, _objc_debug_taggedpointer_ext_classes@PAGEOFF

The extended tagged class handler looks similar. These two instructions load the pointer to the extended table.

    0x0098 ubfx   x11, x0, #52, #8

This instruction loads the extended class index. It extracts 8 bits starting from bit 52 in self into x11.

    0x009c ldr    x16, [x10, x11, lsl #3]

Just like before, that index is used to look up the class in the table and load it into x16.

    0x00a0 b      0x10

With the class in x16, it can branch back into the main code.

That's nearly everything. All that remains is the nil handler.

nil Handler
Finally we get to the nil handler. Here it is, in its entirety.

    0x00a4 mov    x1, #0x0
    0x00a8 movi   d0, #0000000000000000
    0x00ac movi   d1, #0000000000000000
    0x00b0 movi   d2, #0000000000000000
    0x00b4 movi   d3, #0000000000000000
    0x00b8 ret

The nil handler is completely different from the rest of the code. There's no class lookup or method dispatch. All it does for nil is return 0 to the caller.

This task is a bit complicated by the fact that objc_msgSend doesn't know what kind of return value the caller expects. Is this method returning one integer, or two, or a floating-point value, or nothing at all?

Fortunately, all of the registers used for return values can be safely overwritten even if they're not being used for this particular call's return value. Integer return values are stored in x0 and x1 and floating point return values are stored in vector registers v0 through v3. Multiple registers are used for returning smaller structs.

This code clears x1 and v0 through v3. The d0 through d3 registers refer to the bottom half of the corresponding v registers, and storing into them clears the top half, so the effect of the four movi instructions is to clear those four registers. After doing this, it returns control to the caller.

You might wonder why this code doesn't clear x0. The answer to that is simple: x0 holds self which in this case is nil, so it's already zero! You can save an instruction by not clearing x0 since it already holds the value we want.

What about larger struct returns that don't fit into registers? This requires a little cooperation from the caller. Large struct returns are performed by having the caller allocate enough memory for the return value, and then passing the address of that memory in x8. The function then writes to that memory to return a value. objc_msgSend can't clear this memory, because it doesn't know how big the return value is. To solve this, the compiler generates code which fills the memory with zeroes before calling objc_msgSend.

That's the end of the nil handler, and of objc_msgSend as a whole.

Conclusion
It's always interesting to dive into framework internals. objc_msgSend in particular is a work of art, and delightful to read through.

That's it for today. Come back next time for more squishy goodness. Friday Q&A is driven by reader input, so if you have something you'd like to see discussed here, send it in!

Friday Q&A 2017-07-14: Swift.Codable


One of the interesting additions to Swift 4 is the Codable protocol and the machinery around it. This is a subject near and dear to my heart, and I want to discuss what it is and how it works today.

Serialization
Serializing values to data that can be stored on disk or transmitted over a network is a common need. It's especially common in this age of always-connected mobile apps.

So far, the options for serialization in Apple's ecosystem have been limited:

  1. NSCoding provides intelligent serialization of complex object graphs and works with your own types, but works with a poorly documented serialization format not suitable for cross-platform work, and requires writing code to manually encode and decode your types.
  2. NSPropertyListSerialization and NSJSONSerialization can convert between standard Cocoa types like NSDictionary/NSString and property lists or JSON. JSON in particular is used all over the place for server communication. Since these APIs provide low-level values, you have to write a bunch of code to extract meaning from those values. That code is often ad-hoc and handles bad data poorly.
  3. NSXMLParser and NSXMLDocument are the choice of masochists or people stuck working with systems that use XML. Converting between the basic parsed data and more meaningful model objects is once again up to the programmer.
  4. Finally, there's always the option to build your own from scratch. This is fun, but a lot of work, and error-prone.

These approaches tend to result in a lot of boilerplate code, where you declare a property called foo of type String which is encoded by storing the String stored in foo under the key "foo" and is decoded by retrieving the value for the key "foo", attempting to cast it to a String, storing it into foo on success, or throwing an error on failure. Then you declare a property called bar of type String which....

Naturally, programmers dislike these repetitive tasks. Repetition is what computers are for. We want to be able to just write this:

    struct Whatever {
        var foo: String
        var bar: String
    }

And have it be serializable. It ought to be possible: all the necessary information is already present.

Reflection is a common way to accomplish this. A lot of Objective-C programmers have written code to automatically read and write Objective-C objects to and from JSON objects. The Objective-C runtime provides all of the information you need to do this automatically. For Swift, we can use the Objective-C runtime, or make do with Swift's Mirror and use wacky workarounds to compensate for its inability to mutate properties.

Outside of Apple's ecosystem, this is a common approach in many languages. This has led to various hilarious security bugs over the years.

Reflection is not a particularly good solution to this problem. It's easy to get it wrong and create security bugs. It's less able to use static typing, so more errors happen at runtime rather than compile time. And it tends to be pretty slow, since the code has to be completely general and does lots of string lookups with type metadata.

Swift has taken the approach of compile-time code generation rather than runtime reflection. This means that some of the knowledge has to be built in to the compiler, but the result is fast and takes advantage of static typing, while still remaining easy to use.

Overview
There are a few fundamental protocols that Swift's new encoding system is built around.

The Encodable protocol is used for types which can be encoded. If you conform to this protocol and all stored properties in your type are themselves Encodable, then the compiler will generate an implementation for you. If you don't meet the requirements, or you need special handling, you can implement it yourself.

The Decodable protocol is the companion to the Encodable protocol and denotes types which can be decoded. Like Encodable, the compiler will generate an implementation for you if your stored properties are all Decodable.

Because Encodable and Decodable usually go together, there's another protocol called Codable which is just the two protocols glued together:

    typealias Codable = Decodable & Encodable

These two protocols are really simple. Each one contains just one requirement:

    protocol Encodable {
        func encode(to encoder: Encoder) throws
    }

    protocol Decodable {
        init(from decoder: Decoder) throws
    }

The Encoder and Decoder protocols specify how objects can actually encode and decode themselves. You don't have to worry about these for basic use, since the default implementation of Codable handles all the details for you, but you need to use them if you write your own Codable implementation. These are complex and we'll look at them later.

Finally, there's a CodingKey protocol which is used to denote keys used for encoding and decoding. This adds an extra layer of static type checking to the process compared to using plain strings everywhere. It provides a String, and optionally an Int for positional keys:

    protocol CodingKey {
        var stringValue: String { get }
        init?(stringValue: String)

        var intValue: Int? { get }
        init?(intValue: Int)
    }

Encoders and Decoders
The basic concept of Encoder and Decoder is similar to NSCoder. Objects receive a coder and then call its methods to encode or decode themselves.

The API of NSCoder is straightforward. NSCoder has a bunch of methods like encodeObject:forKey: and encodeInteger:forKey: which objects call to perform their coding. Objects can also use unkeyed methods like encodeObject: and encodeInteger: to do things positionally instead of by key.

Swift's API is more indirect. Encoder doesn't have any methods of its own for encoding values. Instead, it provides containers, and those containers then have methods for encoding values. There's one container for keyed encoding, one for unkeyed encoding, and one for encoding a single value.

This helps make things more explicit and fits better with portable serialization formats. NSCoder only has to work with Apple's encoding format so it just needs to put the same thing out that it got in. Encoder has to work with things like JSON. If an object encodes values with keys, that should produce a JSON dictionary. If it uses unkeyed encoding then that should produce a JSON array. What if the object is empty and encodes no values? With the NSCoder approach, it would have no idea what to output. With Encoder, the object will still request a keyed or unkeyed container and the encoder can figure it out from that.

Decoder works the same way. You don't decode values from it directly, but rather ask for a container, and then decode values from the container. Like Encoder, Decoder provides keyed, unkeyed, and single value containers.

Because of this container design, the Encoder and Decoder protocols themselves are small. They contain a bit of bookkeeping info, and methods for obtaining containers:

    protocol Encoder {
        var codingPath: [CodingKey?] { get }
        var userInfo: [CodingUserInfoKey : Any] { get }

        func container<Key>(keyedBy type: Key.Type)
            -> KeyedEncodingContainer<Key> where Key : CodingKey
        func unkeyedContainer() -> UnkeyedEncodingContainer
        func singleValueContainer() -> SingleValueEncodingContainer
    }

    protocol Decoder {
        var codingPath: [CodingKey?] { get }
        var userInfo: [CodingUserInfoKey : Any] { get }

        func container<Key>(keyedBy type: Key.Type) throws
            -> KeyedDecodingContainer<Key> where Key : CodingKey
        func unkeyedContainer() throws -> UnkeyedDecodingContainer
        func singleValueContainer() throws -> SingleValueDecodingContainer
    }

The complexity is in the container types. You can get pretty far by recursively walking through properties of Codable types, but at some point you need to get down to some raw encodable types which can be directly encoded and decoded. For Codable, those types include the various integer types, Float, Double, Bool, and String. That makes for a whole bunch of really similar encode/decode methods. Unkeyed containers also directly support encoding sequences of the raw encodable types.

Beyond those basic methods, there are a bunch of methods that support exotic use cases. KeyedDecodingContainer has methods called decodeIfPresent which return an optional and return nil for missing keys instead of throwing. The encoding containers have methods for weak encoding, which encodes an object only if something else encodes it too (useful for parent references in a complex graph). There are methods for getting nested containers, which allows you to encode hierarchies. Finally, there are methods for getting a "super" encoder or decoder, which is intended to allow subclasses and superclasses to coexist peacefully when encoding and decoding. The subclass can encode itself directly, and then ask the superclass to encode itself with a "super" encoder, which ensures keys don't conflict.
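
As a small illustration of decodeIfPresent, here's a hypothetical type whose nickname key may legitimately be absent from the data:

    struct Profile: Decodable {
        var name: String
        var nickname: String?

        private enum CodingKeys: CodingKey {
            case name, nickname
        }

        init(from decoder: Decoder) throws {
            let container = try decoder.container(keyedBy: CodingKeys.self)
            name = try container.decode(String.self, forKey: .name)
            // decodeIfPresent returns nil for a missing key instead of throwing.
            nickname = try container.decodeIfPresent(String.self, forKey: .nickname)
        }
    }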

Implementing Codable
Implementing Codable is easy: declare conformance and let the compiler generate it for you.

It's useful to know just what it's doing, though. Let's take a look at what it ends up generating and how you would do it yourself. We'll start with an example Codable type:

    struct Person: Codable {
        var name: String
        var age: Int
        var quest: String
    }

The compiler generates a CodingKeys type nested inside Person. If we did it ourselves, that nested type would look like this:

    private enum CodingKeys: CodingKey {
        case name
        case age
        case quest
    }

The case names match Person's property names. Compiler magic gives each CodingKeys case a string value which matches its case name, which means that the property names are also the keys used for encoding them.

If we need different names, we can easily accomplish this by providing our own CodingKeys with custom raw values. For example, we might write this:

    private enum CodingKeys: String, CodingKey {
        case name = "person_name"
        case age
        case quest
    }

This will cause the name property to be encoded and decoded under person_name. And this is all we have to do. The compiler happily accepts our custom CodingKeys type while still providing a default implementation for the rest of Codable, and that default implementation uses our custom type. You can mix and match customizations with the compiler-provided code.

The compiler also generates an implementation for encode(to:) and init(from:). The implementation of encode(to:) gets a keyed container and then encodes each property in turn:

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)

        try container.encode(name, forKey: .name)
        try container.encode(age, forKey: .age)
        try container.encode(quest, forKey: .quest)
    }

The compiler generates an implementation of init(from:) which mirrors this:

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)

        name = try container.decode(String.self, forKey: .name)
        age = try container.decode(Int.self, forKey: .age)
        quest = try container.decode(String.self, forKey: .quest)
    }

That's all there is to it. Just like with CodingKeys, if you need custom behavior you can implement your own version of one of these methods while letting the compiler generate the rest. Unfortunately, there's no way to customize the behavior for just one property, so you have to write out the whole method even if you want the default behavior for everything else. That's not particularly terrible, though.
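
For example, suppose we wanted a missing age to default to 0 (an invented requirement) while everything else keeps the default behavior. We'd have to write the entire initializer, even though only one line differs:

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)

        name = try container.decode(String.self, forKey: .name)
        age = try container.decodeIfPresent(Int.self, forKey: .age) ?? 0
        quest = try container.decode(String.self, forKey: .quest)
    }

The compiler still provides CodingKeys and encode(to:) for us.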

If you were to do it all by hand, the full implementation of Codable for Person would look like this:

    extension Person {
        private enum CodingKeys: CodingKey {
            case name
            case age
            case quest
        }

        func encode(to encoder: Encoder) throws {
            var container = encoder.container(keyedBy: CodingKeys.self)

            try container.encode(name, forKey: .name)
            try container.encode(age, forKey: .age)
            try container.encode(quest, forKey: .quest)
        }

        init(from decoder: Decoder) throws {
            let container = try decoder.container(keyedBy: CodingKeys.self)

            name = try container.decode(String.self, forKey: .name)
            age = try container.decode(Int.self, forKey: .age)
            quest = try container.decode(String.self, forKey: .quest)
        }
    }
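
As a quick sanity check of the above, here's a round trip through Foundation's JSONEncoder and JSONDecoder (output formatting is the encoder's choice, so key order may vary):

    import Foundation

    let person = Person(name: "Galahad", age: 35, quest: "To seek the Holy Grail")
    let data = try JSONEncoder().encode(person)
    print(String(data: data, encoding: .utf8)!)
    // {"name":"Galahad","age":35,"quest":"To seek the Holy Grail"}
    let decoded = try JSONDecoder().decode(Person.self, from: data)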

Implementing Encoder and Decoder
You may never need to implement your own Encoder or Decoder. Swift provides implementations for JSON and property lists, which take care of the common use cases.

You can implement your own in order to support a custom format. The size of the container protocols means this will take some effort. Fortunately, it's mostly a matter of size, not complexity.

To implement a custom Encoder, you'll need something that implements the Encoder protocol, plus implementations of the container protocols. Implementing the three container protocols involves a lot of repetitive code to implement encoding or decoding methods for all of the various directly encodable types.

How they work is up to you. The Encoder will probably need to store the data being encoded, and the containers will inform the Encoder of the various things they're encoding.
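
As a sketch of that scaffolding, here's a minimal (and invented) Encoder that compiles but stubs out its containers; a real implementation would provide three container types that report back to the encoder:

    class LinesEncoder: Encoder {
        var codingPath: [CodingKey] = []
        var userInfo: [CodingUserInfoKey : Any] = [:]

        // The encoded output accumulates here. The container types would
        // inform the encoder of values by appending to this storage.
        var lines: [String] = []

        func container<Key>(keyedBy type: Key.Type)
            -> KeyedEncodingContainer<Key> where Key : CodingKey {
            // Would wrap a type conforming to KeyedEncodingContainerProtocol.
            fatalError("keyed container not implemented in this sketch")
        }

        func unkeyedContainer() -> UnkeyedEncodingContainer {
            fatalError("unkeyed container not implemented in this sketch")
        }

        func singleValueContainer() -> SingleValueEncodingContainer {
            fatalError("single value container not implemented in this sketch")
        }
    }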

Implementing a custom Decoder is similar. You'll need to implement that protocol plus the container protocols. The decoder will hold the serialized data and the containers will communicate with it to provide the requested values.

I've been experimenting with a custom binary encoder and decoder as a way to learn the protocols, and I hope to present that in a future article as an example of how to do it.

Conclusion
Swift 4's Codable API looks great and ought to simplify a lot of common code. For typical JSON tasks, it's sufficient to declare conformance to Codable in your model types and let the compiler do the rest. When you need to handle things differently, you can implement parts of the protocol yourself, and you can take over all of it if necessary.

The companion Encoder and Decoder protocols are more complex, but justifiably so. Supporting a custom format by implementing your own Encoder and Decoder takes some work, but is mostly a matter of filling in a lot of similar blanks.

That's it for today! Come back again for more exciting serialization-related material, and perhaps even things not related to serialization. Until then, Friday Q&A is driven by reader ideas, so if you have a topic you'd like to see covered here, please send it in!
