==================[ "Deconstructing C++" ]==============

C++ introduced object-oriented programming into the industry
mainstream. It hit a sweet spot by combining the familiar syntax and
memory model of C with the automatic enforcement of API contracts by
the compiler. In addition, it linked cleanly with C libraries---which
meant that they didn't need to be rewritten to be used from C++.

C++ textbooks are full of philosophical justification of its many
features, extolling the virtues of encapsulation, inheritance,
polymorphism, and generic programming. Because of C++'s overwhelming
syntactic complexity, these features look like magic. In reality, they
work very simply, through a combination of two tricks:

 - "name mangling" for methods, which encodes the name of the class
   and the argument types into a "mangled" name of a function symbol,
   which is then handled by the regular C linker. All methods take an
   implicit first argument that is the "this" pointer to the memory
   struct representing the current object that is an instance of the
   class.

 - dynamic function call dispatch, created by the compiler when
   compiling calls to methods. Any object struct contains a pointer
   (called "VPTR") to the table (called "VTABLE") of function pointers
   that point to the class's methods. That way the right method to call
   for a method with a given name and argument type signature (resolved
   statically and ultimately translating to an offset into the method
   table) is looked up at runtime, based on the VPTR and the offset.

We'll see the examples below.

========================[ Readings ]============================

Read Chapters 9 and 10 of the Mitchell textbook about the history of
OOP. You may overlook the details of the ML examples for now---we will
be looking at these same ideas in Haskell shortly.

For a short (and rather opinionated) C++ tutorial, see Thomas
Anderson's "A Quick Introduction to C++",
http://homes.cs.washington.edu/~tom/c++example/

We looked at that code in class, and had to make a few changes to
conform to the modern C++ compiler requirements. The code is in
oo/cxx.tar

==================[ Why look at C++ at all? ]=================

C++ is a huge and complex language; it can be an endless source of
puzzles, such as this one I mentioned today:
http://gotw.ca/gotw/005.htm (also, see
http://stackoverflow.com/questions/9124238/c-inheritance-and-overload-resolution).
These puzzles often turn into interview questions.

As far as I know, most C++ development shops deliberately restrict
themselves to a subset of C++, to avoid creating "accidental puzzles".
For example, the Google C++ Style Guide
(https://google.github.io/styleguide/cppguide.html) prohibits the use
of some features such as exceptions and non-public inheritance, and
discourages others, such as function overloading, a core C++ language
feature.

Still, C++ is one of the most widely used production languages in
industry, and will be for some time. While some of its features,
although long in the language standards, will fall by the wayside,
understanding how C++ implements core OO concepts is important (not
least, to be aware when to use and when not to use C++).

For a long critique of C++, see "C++ Frequently Questioned Answers"
http://yosefk.com/c++fqa/

=====================[ C++ OO internals ]=====================

For C++, encapsulation and (structural) polymorphism are a matter of
compilation. Objects, in practice, are contiguous memory regions in
which the object's data members are laid out at pre-fixed offsets
decided by the compiler---just like C structs, with the addition of a
VPTR.

The difference is that objects have _methods_: functions that refer to
the object's members as if these members were local variables, with a
separate and persistent set of values for each object instance.
These "method" functions are always called "on" (or "with") a
particular object instance, and operate on that instance, leaving all
other instances unchanged.

If you think this sounds similar to closures and local bindings inside
a closure, it really is---except the machine/compilation model is
quite different, and is much closer to a function's stack frame, in
which "slots" corresponding to variables are accessed by their
pre-fixed offsets. It's just that a function's stack frame goes away
when the function exits, whereas an object's underlying memory area
can live on the heap or in a global memory area and persist throughout
the program's lifetime.

Methods are compiled as functions with one extra ("implicit")
argument, the pointer to the object's memory area. This value is
available inside the method as the "this" variable (which should not
be written to inside the function). You'll see it as the x0 argument
in Armv8 disassembly.

-----------------[ Overloading == Name Mangling ]-----------------

To implement overloading, function names of methods (i.e., the names
by which the linker and the rest of the ABI know them) are "mangled":
rewritten to include the name of the "class" (think of it as the
combination of the objects' common layout such as the size of the
memory area and the members' offsets inside it, and the shared
collection of the methods) and the types of the arguments. Two
overloads of the same method thus become two differently named
symbols, and the linker never needs to know they share a source-level
name.

Wikipedia gives a good summary of name mangling:
https://en.wikipedia.org/wiki/Name_mangling (in the olden days, you
had to buy the Annotated C++ Reference Manual to find out about it).
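
To experiment with mangled names without reading disassembly, one can
ask the compiler's runtime to reverse the encoding. The sketch below
assumes GCC or Clang, whose <cxxabi.h> header provides
abi::__cxa_demangle (it is not available with MSVC, which uses a
different mangling scheme). The helper name demangle() is made up for
this illustration and is not part of the class examples.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (GCC and Clang only)
#include <cassert>
#include <cstdlib>    // std::free
#include <string>

// Turn a mangled symbol name back into a readable C++ signature.
// Returns an empty string if the demangler rejects the input.
std::string demangle(const char *mangled) {
    assert(mangled != nullptr);
    int status = 0;
    // With a null output buffer the demangler allocates the result with
    // malloc(), so the caller must release it with free().
    char *readable = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || readable == nullptr) {
        return "";
    }
    std::string result(readable);
    std::free(readable);
    return result;
}
```

For instance, demangle("_ZN4Base3fooEi") gives "Base::foo(int)". Note
that symbol listings on Mac OS X show one more leading underscore
(__ZN4Base3fooEi), which the Mach-O object format prepends to every
symbol; strip it before demangling.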

Look at name mangling in action in the class examples oo/inh.cc and
oo/ovl.cc

Here is what the name mangling looks like for ovl.cc on my Mac OS X,
where g++ is actually the LLVM compiler (not GCC)

$ g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 6.1.0 (clang-602.0.49) (based on LLVM 3.6.0svn)
Target: x86_64-apple-darwin14.3.0
Thread model: posix

$ g++ -Wall -o ovl ovl.cc

// A hacky way of getting the actual names of functions as created by the
// compiler for my Base::foo and Derived::foo, and used by the linker
// and other binary tools.
// https://en.wikipedia.org/wiki/Name_mangling explains some of the
// mangling rules, e.g., '_Z' is the standard prefix for all such names,
// 'N' stands for "Nested" structs and classes, actual names are
// prefixed by their length, like 4Base, 7Derived, 3foo, as a
// way of encoding where they end and the next portion of the
// mangled name begins, and so on. Argument type encodings start after 'E'.
// The 1st answer to
// https://stackoverflow.com/questions/6921295/dual-emission-of-constructor-symbols
// goes into a bit more detail, and covers the multiple names of constructors
// like __ZN4BaseC1Ev and __ZN4BaseC2Ev .

$ gobjdump -d ovl | grep ^00
0000000100000ec0 <_main>:
0000000100000f20 <__ZN4BaseC1Ev>:
0000000100000f40 <__ZN7DerivedC1Ev>:
0000000100000f60 <__ZN7DerivedC2Ev>:
0000000100000fa0 <__ZN4BaseC2Ev>:
0000000100000fe0 <__ZN7Derived3fooEv>:      // Derived::foo()
0000000100001010 <__ZN4Base3fooEi>:         // Base::foo(int)
0000000100001050 <__ZN4Base3fooEPS_>:       // Base::foo(Base*) (S_ is a "substitution", a back-reference to a name already encoded, here Base; P is for pointer)
0000000100001090 <__ZN4Base3fooEPPPPc>:     // Base::foo(char****) (P stands for pointer)
00000001000010d0 <__ZN4Base3fooEP7Derived>: // Base::foo(Derived*)
0000000100001110 <__ZNSt3__1lsINS_11char_traitsIcEEEERNS_13basic_ostreamIcT_EES6_PKc>:
0000000100001160 <__ZNSt3__124__put_character_sequenceIcNS_11char_traitsIcEEEERNS_13basic_ostreamIT_T0_EES7_PKS4_m>:
00000001000015e0 <__ZNSt3__116__pad_and_outputIcNS_11char_traitsIcEEEENS_19ostreambuf_iteratorIT_T0_EES6_PKS4_S8_S8_RNS_8ios_baseES4_>:
0000000100001c90 <___clang_call_terminate>:
0000000100001cb0 <__ZN4Base3fooEv>:         // Base::foo()
0000000100001cdc <__TEXT.__stubs>:
0000000100001d54 <__TEXT.__stub_helper>:
0000000100001df0 :
0000000100001ea0 :
0000000100001f54 <__TEXT.__unwind_info>:

-----------------[ Virtual functions == Vtables ]-------------------
----[ virtual == called with an extra function pointer look-up ]----

A key idea of C++ polymorphism---writing code in such a way that it
works for objects with different internal structure, so long as they
share the same basic names and arguments of their methods (a.k.a. an
API/interface)---is that you should be able to call an object's method
by referring to an object through a pointer to a _base_ class/type,
and yet have the methods actually called be those of the _actual_
object's type.

In other words, you cannot know at compile time which function you'd
need to call---unlike with overloading/name mangling, which can all be
done by the compiler. You need to do something at runtime to dispatch
the call to the right function.
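
The difference between the two kinds of calls can be seen in a few
lines. This is a minimal sketch, not the code from oo/inh.cc: the
method names plain() and dynamic() and the helpers call_plain() and
call_dynamic() are invented here. A non-virtual method is chosen by
the compiler from the static type of the pointer, while a virtual
method is chosen at runtime from the actual object.

```cpp
#include <cassert>
#include <string>

struct Base {
    // Non-virtual: the call target is fixed at compile time.
    std::string plain() const { return "Base"; }
    // Virtual: the call target is looked up at runtime.
    virtual std::string dynamic() const { return "Base"; }
    virtual ~Base() = default;
};

struct Derived : Base {
    // Hides Base::plain, but only for callers that know they hold a Derived.
    std::string plain() const { return "Derived"; }
    // Overrides Base::dynamic for every caller, including those holding a Base*.
    std::string dynamic() const override { return "Derived"; }
};

// Both helpers see only a pointer to the base class.
std::string call_plain(const Base *b) {
    assert(b != nullptr);
    return b->plain();      // always Base::plain
}

std::string call_dynamic(const Base *b) {
    assert(b != nullptr);
    return b->dynamic();    // depends on the actual object behind b
}
```

Given a Derived object d, call_plain(&d) returns "Base" while
call_dynamic(&d) returns "Derived", even though both helpers were
compiled knowing nothing about Derived.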

In C++ this polymorphism is handled via class inheritance: you declare
a base class (which might not ever have object instances), and derived
classes that do the actual---and different, but with the same method
name---work. You then write code using pointers to the base class
throughout, but the methods called are actually the respective methods
of the derived classes.

As a classic example, imagine that you have a base class Shape that
represents a geometric figure like a dot, circle, square, oval, star,
etc. This base class only has the coordinates of the figure's center
and a pointer to the next figure, to make linked lists with. You then
draw all figures on a list by traversing it and calling the method
"draw()", named the same but doing the different respective things for
the dot, circle, square, etc. For example:

    class Shape {
    public:
        int x, y;       // coordinates of the figure's center
        Shape *next;    // NULL if last
        virtual void draw();
        // add "= 0" to make draw() pure virtual and Shape a so-called
        // "abstract class", a class that's impossible to instantiate
        // as such, used only for a common interface definition, and
        // to refer by pointer Shape* to objects of derived classes
    };
    // these members are shared by all figures; everything else is different,
    // e.g., circle has radius, square has sides and rotation angle, etc.

    class Dot : public Shape {
        Color color;
    public:
        void draw() { /* I can draw a pixel */ }
    };

    class Circle : public Shape {
        int radius;
        Color color;
    public:
        void draw() { /* I can draw a circle */ }
    };

    class Square : public Shape {
        int side;
        int rotation_angle;
        Color color;
    public:
        void draw() { /* I can draw a square */ }
    };

    // traverse the list, drawing every figure on it
    Shape *p;
    for( p = firstFigureOnSomeList; p != NULL ; p = p->next )
        p->draw();

We want p->draw() to call the actual draw() of the respective classes
in a linked list of Circles, Squares, Dots, etc.

TADA! C++ allows this if you make the draw() method in Shape
_virtual_. Magic, no?
:)

The way this is implemented is this: unlike normal functions, which
are called directly (via the ARMv8 "bl" instruction) at an address
fixed at link time, virtual functions are called with one extra level
of indirection. Namely, all objects are compiled with an extra hidden
pointer, which can be thought of as another member of Shape, and is
called a "VPTR". This pointer points to a table of methods (functions)
that all Shapes share by name (although not by actual content)---in
our example, just draw().

But, here's the trick: in all Circle instances this VPTR pointer
points to a table that points to the Circle-specific function for
draw. In Squares, their VPTRs point to the Square-specific table with
a Square-specific function, and so on. These respective tables with
their respective function pointers are called VTABLEs for the
respective classes.

That is all, but it's enough to implement "polymorphism". The "call
this address" instruction for virtual functions becomes "load the VPTR
from its fixed offset in the object, load the function's address from
the method's fixed offset in that VTABLE into a register, then call
the function at that address via the ARMv8 "blr" instruction". See the
disassembly of oo/inh.cc for how this works.

This adds an indirection (and thus some performance overhead), but
gives C++ the flexibility of writing "polymorphic" code around base
class pointers.

This may seem contrived, but consider: if you want code to work for
several different types of objects that share the same basic
operations, then you need to (1) somehow describe these common
operations and (2) refer to "any object that allows these operations".
This is the role that the base class and the pointers to this base
class play in the language.

Again, surprisingly, Wikipedia has a reasonable summary with examples:
https://en.wikipedia.org/wiki/Virtual_method_table
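
To make the mechanism concrete, here is a hand-rolled version of the
same idea using only plain structs and function pointers, roughly what
one would write in C. It is a sketch of the principle, not what any
particular compiler emits: real VTABLEs also hold entries such as
type information and destructors, and the names ShapeC, ShapeVtable,
make_circle(), make_square() and draw_all() are invented for this
illustration. The draw functions return a description instead of
drawing, so the result is easy to inspect.

```cpp
#include <cassert>
#include <string>

struct ShapeC;  // forward declaration for the function pointer type

// The "VTABLE": one instance per class, one slot per virtual method.
struct ShapeVtable {
    std::string (*draw)(const ShapeC *self);  // explicit "this" argument
};

// The "object": the hidden VPTR comes first, then the data members.
struct ShapeC {
    const ShapeVtable *vptr;
    int x, y;
    ShapeC *next;
};

// "Derived classes" embed the base as their first member, so a pointer
// to the whole object is also a valid pointer to its ShapeC part.
struct CircleC { ShapeC base; int radius; };
struct SquareC { ShapeC base; int side; };

std::string circle_draw(const ShapeC *self) {
    const CircleC *c = reinterpret_cast<const CircleC *>(self);
    return "circle r=" + std::to_string(c->radius);
}

std::string square_draw(const ShapeC *self) {
    const SquareC *s = reinterpret_cast<const SquareC *>(self);
    return "square side=" + std::to_string(s->side);
}

// One table per "class", shared by all of its instances.
static const ShapeVtable circle_vtable = { circle_draw };
static const ShapeVtable square_vtable = { square_draw };

// "Constructors": their main hidden job is to set the VPTR.
CircleC make_circle(int x, int y, int radius) {
    return CircleC{ ShapeC{ &circle_vtable, x, y, nullptr }, radius };
}

SquareC make_square(int x, int y, int side) {
    return SquareC{ ShapeC{ &square_vtable, x, y, nullptr }, side };
}

// The polymorphic loop: load the VPTR, load the slot, call through it.
std::string draw_all(const ShapeC *first) {
    std::string out;
    for (const ShapeC *p = first; p != nullptr; p = p->next) {
        assert(p->vptr != nullptr);
        out += p->vptr->draw(p);
        out += '\n';
    }
    return out;
}
```

Linking a circle made with make_circle(0, 0, 3) to a square made with
make_square(1, 1, 2) and passing the circle's base to draw_all()
yields one line per figure, each produced by the function its own
table points to. The loop itself never mentions circles or squares.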