Skip to content

Latest commit

 

History

History
1001 lines (827 loc) · 49.3 KB

p0875.md

File metadata and controls

1001 lines (827 loc) · 49.3 KB

Principle: information accumulation

Pull request

Table of contents

Problem

We should have consistent rules describing what information about a program is visible where.

Background

Information in a source file is provided incrementally, with each source utterance providing a small piece of the overall picture. Different languages have different rules for which information is available where.

Single-pass "splat" compilation

In C and other languages of a similar age, single-pass compilation was highly desirable, due to resource limits and performance concerns. In these languages:

  • Information is accumulated top-down, and can only be used lexically after it appears.
  • Most information can be discarded soon after it is provided: function bodies don't need to be kept around once they've been converted to the output format, and no information on local variable or parameter names needs to persist past the end of the variable's scope. However, the types of globals and the contents of type definitions must be retained.
  • The behavior of an entity can be different at different places in the same source file. An early use may fail if it depends on information that's provided later, and in some cases a later use may fail when an earlier use succeeded because the use is invalid in a way that was not visible at the point of an earlier use.

Global consistency

In more modern languages such as C#, Rust, Java, and Swift, there is no lexical information ordering. In these languages:

  • Information is effectively accumulated and processed in separate passes.
  • The language design and implementation ensure that the behavior of an entity is the same everywhere: both before its definition, after its definition, within its definition, and in any other source file in which it was made visible.
  • Dependency cycles between program properties are carefully avoided by the language designers.

The C++ compromise

In C++, a hybrid approach is taken. There is a C-like lexical information ordering rule, but this rule is subverted within classes by -- effectively -- reordering certain parts of a class that appear within the class definition so that they are processed after the class definition. This primarily applies to the bodies of member functions. Here:

  • Information is mostly accumulated top-down, and is accumulated fully top-down after the reordering step.
  • The behavior of a class is the same within member function bodies that are defined inside the class as it is within member function bodies defined lexically after the class.
  • The language designers need to ensure that the bounds of the member function bodies and similar constructs can be determined without parsing them, so that the late-parsed portions can be separated from the early-parsed portions. In C++, this was not done successfully, and there are constructs for which this determination is very hard or impossible.

Separate declarations and definitions

Somewhat separate from the direction of information flow is the ability to separate the information about an entity into multiple distinct regions of source files. In C and C++, entities can be separately declared and defined. As a consequence, these languages need rules to determine whether two declarations declare the same entity.

In C++, especially for templated declarations, these rules can be incredibly complex, and even now, more than 30 years after the introduction of templates in C++, basic questions are not fully answered, and implementations disagree about which declarations declare the same entity in fairly simple examples.

One key benefit of this separation is in reduction of physical dependencies: in order to validate a usage of an entity, we need only see a source file containing a declaration of that entity, and need never consider the source file containing its definition. This both reduces the number of steps required for an incremental rebuild and reduces the input information and processing required for each individual step.

The ability to break physical dependencies is limited to the cases where information can actually be hidden from the users of the entity. For example, if the user actually needs a function body, either because they will evaluate a call to the function during compilation or because they will inline it prior to linking, it cannot be physically isolated from the user of that information. As a consequence, in C++, a programmer must carefully manage which information they put in the source files that are exposed to client code and which information is kept separate.

Another key benefit is that the exported interface of a source file can become more readable, by presenting an interface that contains only the facts that are salient to a user and not the implementation details.

Proposal

Carbon will use a top-down approach with deferred method bodies, with forward declarations.

In any case where the strict global consistency model with the block-scope exception would consider the program to be invalid or would assign it a different meaning, the program is invalid.

Put another way: we accept exactly the cases where those two models both accept and give the same result, and reject all other cases.

Details

Each entity has only one meaning and only one set of properties; that meaning and those properties do not depend on which source file you are within or where in that source file you are. However, not all information will be available everywhere, and it is an error to attempt to use information that is not available -- either because it's in a different source file and not exported or not imported, or because it appears later in the same source file.

For example, a full definition of a class Widget may be available in some parts of the program, with only a forward declaration available in other parts of the program. In this case, any attempt to use Widget where only a forward declaration is available will always either result in the program being invalid or doing the same thing that would happen if the full information were available. This is a stronger guarantee than is provided by C++'s One Definition Rule, which guarantees that every use sees an equivalent definition or no definition, but does not guarantee that every use has the same behavior -- in C++ a function could do one thing if Widget is defined and a different thing if Widget is only forward declared, and that will not be possible in Carbon.

Name lookup only provides earlier-declared names. However, the bodies of class member functions that are defined inside the class are reordered to be separately defined after the class, so all class member names are declared before the class body is seen. Unlike in C++, classes are either incomplete or complete, with no intermediate state. Any attempt to perform name lookup into an incomplete class is an error, even if the name being looked up has already been declared.

class A {
  fn F() {
    // OK: A and B are complete class types here.
    var b: A.B;
  }

  class B {}

  // Error: cannot look up B within incomplete class A.
  var p: A.B*;
}

Forward declarations are permitted in order to allow separation of interface and implementation whenever the developer wishes to do so, and introduce names early in order to break cycles or when the developer does not wish to define entities in topological order. All declarations of an entity are required to be part of the same library, but can be in different source files. If a point within a source file can only see a declaration of an entity and not its definition, some uses of that entity will be invalid that would be valid if the definition were visible. For example, without a class definition, we may be unable to create instances of the class, and without a function definition, we may be unable to use the function to compute a compile-time constant.

Goals

For this proposal, we have the following goals as refinements of the overall Carbon goals:

  • Comprehensibility. Our rules should be understandable, and should minimize surprise and gotchas. Our behavior should be self-consistent, and explainable in only a few sentences.
  • Ergonomics. It should be easy to express common developer desires, without a lot of boilerplate or repetitive code.
  • Readability. Code written using our rules should be as straightforward as possible for Carbon developers to read and reason about.
  • Efficient and simple compilation. It should be relatively straightforward to implement our semantic rules. Implementation heroics shouldn't be required, and the number of special cases required should be minimized.
  • Diagnosability. An implementation should be able to explain coding errors in ways that are easy to understand and are well-correlated with the error and its remedy. Diagnostics should appear in an order and style that guides the developer through logical steps to fix their mistakes.
  • Toolability. Relatively simple tools should be able to understand simpler properties of Carbon code. It should ideally be possible to identify which names can be used in a particular context and what those names mean without full processing. It should ideally be possible to gather useful and mostly complete information about a potentially-invalid source file that is currently being edited, for which it may be desirable to assume there is a "hole" in the source file at the cursor position that will be filled by unknown code.

Choice of alternative

We have a large space of options here, all of which have both benefits and drawbacks. The rationale for our choice is as follows:

The first consideration is whether to provide a globally consistent view of code. Providing such a view would require us to find a compatible build process that can scale to large projects; see the detailed analysis for details. Moreover, this is likely to be a choice that is relatively hard to evolve away from. A non-scalable build process would be an existential threat to the Carbon language under its stated goals, so we set this option aside. This should be revisited if a complete design for a scalable build process with a globally-consistent semantic model is proposed.

Once we step away from strict global consistency, there will be cases where the meaning of an entity differs in different parts of the program. In order for programs to behave consistently and predictably, especially in the presence of templates and generics, we do not want to allow the same operation to have two or more different behaviors depending on which information is available. So we choose to make programs invalid if they would make use of information that is not available.

Even though we have chosen not to provide a globally consistent view, we could still choose to provide a consistent view throughout a package, library, or source file, such as in the context-sensitive local consistency model. However, again for build performance reasons, we want to avoid package-at-a-time or even library-at-a-time compilation, and would like a file-at-a-time compilation strategy. The cost of having different views in different files is substantially reduced given that we have already chosen to have different views at least in different packages. So we allow the known information about an entity to vary between source files, even in the same library.

This implies that we at least need forward declarations for entities that are declared in an api file for a library and defined in an impl file. However, this leaves the question of what happens within a single source file: do entities need to be manually written in dependency order, or is the implementation expected to use information that appears after its point of declaration? Neither option guarantees that an entity will behave consistently wherever it's used, because we already gave up that guarantee for entities visible across source files, and we need to cope with the resulting risk of incoherence anyway.

The simplest option for implementation, and the one most similar to C++, would then be to accumulate information top-down. This is also the choice that is most consistent with the cross-source-file behavior: moving a prefix of a source file into a separate, imported file should generally preserve the validity of the program. Further, this choice gives us the most opportunity for future language evolution, as it is the most restrictive option, and hence it is forward-compatible with the other rules under consideration, as well as supporting paths for evolution that require a top-down view, such as some approaches to metaprogramming.

This choice implies that we need forward declarations, both for declarations in different source files from their definitions and to resolve cycles or cases where the developer cannot or does not wish to write code in topological order. Allowing separate declaration and definition may also be valuable to allow the code author to hide implementation details from readers of the interface, and present a condensed catalog or table of contents for the library. This is also an important aspect in providing a language that is familiar to C++ developers. So we choose to allow separate declaration and definition in general.

Given the above decisions, class member functions could not be meaningfully defined inline under a strict top-down rule, because the class would not yet be complete. We make an ergonomic affordance of reordering the bodies of such functions as if they appear after the class definition. This also improves consistency with C++, which has a similar rule.

Rationale based on Carbon's goals

  • Language tools and ecosystem
    • See "Toolability" goal.
  • Software and language evolution
    • Ensuring that the program interpretation is consistent with an interpretation with full information makes it easier to evolve code, as changing the amount of visible information -- for example, by reordering code or by adding or removing imports -- cannot change the meaning of a program from one valid meaning to another.
    • Selecting a rule that disallows information from flowing backwards keeps open more language evolution paths:
      • We could allow more information to flow backwards.
      • We could expose metaprogramming constructs that can inspect the state of a partial source file without creating inconsistency in our model.
  • Code that is easy to read, understand, and write
    • See "Readability", "Ergonomics", and "Comprehensibility" gaols.
  • Fast and scalable development
    • See "Efficient and simple compilation" and "Diagnosability" goals.
  • Interoperability with and migration from existing C++ code
    • The chosen rule is mostly the same as the corresponding rule in C++, both in accumulating information top-down and in treating inline method bodies as a special case. This should keep migration simple and improve familiarity for experienced C++ developers.

It should be noted that the decision between the model presented here and the leading alternative, which was context-sensitive local consistency, was very close, and came down to minor details. For each of our goals, neither alternative had a significant advantage. The similarity to C++ and evolvability of the alternative in this proposal were ultimately the deciding factors.

Alternatives considered

Below, various alternatives are presented and rated according to the goals for this proposal.

Generally we need to pick a rule for allowing or disallowing separate declaration and definition, and a rule for information flow. These choices are not independent: some information flow rules require separate declaration and definition.

Separate declaration and definition

Allow separate declaration and definition

We could allow two or more declarations of each entity, with at most one declaration providing a definition and the other declarations merely introducing the existence of the entity and its basic properties:

// Forward declaration.
fn F();
// ...
// Another forward declaration of the same function.
fn F();
// One and only definition.
fn F() {}
// Yet another forward declaration.
fn F();

This could be done in either a free-for-all fashion as in the above example, where any number of declarations is permitted, or in a more restricted fashion, where there can be at most one declaration introducing an entity, and, if that declaration is not a definition, at most one separate implementation providing the definition. In this case, the separate definition could contain a syntactic marker indicating that a prior declaration should exist:

// Forward declaration.
fn F();
// ...
// Separate definition, with `impl` marker to indicate that this is just an
// implementation of an entity that was already declared.
impl fn F() {}
Analysis

Comprehensibility: In general, determining whether two declarations declare the same entity may not be straightforward, particularly for a redeclaration of an overloaded function. We have some options, but they all have challenges for comprehensibility or ergonomics of the rule:

  • We could require token-for-token identical declarations, but this may result in ergonomic problems -- developers may wish to leave out information in an interface that isn't relevant for consumers of that interface, or spell the same type or parameter name differently in an implementation.
  • We could allow the declarations to differ so long as they have the same meaning. However, there may be cases, as there are in C++, where it's unclear whether two declarations are "sufficiently similar" so as to declare the same entity.
  • We could require all overloads of a function to be declared together in a group, perhaps with some dedicated syntax, and then match up redeclarations based on position in the group rather than signature. Such an approach is likely to lead to developers, especially those familiar with the C++ rule, being surprised. It will likely also interact poorly with cases where some but not all of the overloads are defined in their first declaration and the rest are defined separately, or where it is desirable for different overloads to be defined in different implementation files.

Ergonomics: Any case where the declaration and definition are separated results in repetitive code. However, this repetition may serve a purpose to the extent that it's presenting an API description or acting as a compilation barrier.

Readability: Readability of code may be substantially improved by separating a declaration of an API from its definition, and permitting an API user to read, in a small localized portion of a source file, only the information they care about, without regard for implementation details.

Efficient and simple compilation: Allowing multiple declarations of an entity introduces some implementation complexity, as the implementation must look up, validate, and link together multiple declarations of the same entity. If redeclarations are permitted across source files -- such as between the interface and implementation of an API, or between multiple implementation files -- then this may also require some stable mechanism for identifying the declaration, such as a name mangling scheme. This may be especially complex in the case of overloaded function templates, which might differ in arbitrary Carbon expressions appearing in the function's declaration. This complexity could be greatly reduced if we require all overloads to be declared together -- at least in the same source file -- as we can then identify them by name and index.

This approach supports physical separation of implementation from interface, potentially leading to build time wins through reduced recompilation and through reducing the amount of information fed into each compilation step.

Diagnosability: Identifying errors where a redeclaration doesn't exactly match a prior declaration is not completely straightforward. This is especially the case when function overloads can be added by the same syntax with which a definition of a prior declaration would be introduced. Even if declarations and definitions use a different syntax, diagnosing a mismatch between a definition of an overloaded function and a prior set of declarations requires some amount of fuzzy matching to infer which one was probably intended.

Toolability: Supporting separate declarations and definitions will present the same kinds of complexity for non-compiler tools as for compilers. Tools will need to be able to reason about multiple declarations providing distinct information about an entity and the effect of that on how the entity can be used at different source locations.

Disallow separate declaration and definition

We could require each entity to be declared in exactly one place, and reject any cases where the same entity is declared more than once.

Analysis

Comprehensibility: This rule is easy to explain and easy to reason about.

Ergonomics: Minimizes the work required by the developer to implement new functionality and maintain existing functionality. Reduces the chance that a change in one place will be forgotten in another.

Readability: Each entity has only one declaration, giving a single canonical location to include documentation, and no question as to where information about the entity might be found. This will likely improve the navigability of source code and the ease with which all the information about an entity can be gathered, and reduce the chance that information in the declaration or definition will be incomplete or that they will be inconsistent with each other.

However, there would be no ability to present a function or class as source code without also including implementation details. Such implementation details cannot be physically separated from the API as presented to clients. The ability to read and understand an interface will depend more heavily on tools that can hide implementation details, such as documentation generators and IDEs with outlining, and implementation techniques such as using interfaces to present the API of a class separate from its implementation.

For developers familiar with C++ in particular, the absence of the ability to separate declaration from definition may create friction, due to the familiarity of that feature and having their thinking about code layout shaped around it. Given that Carbon aims to be familiar to those coming from C++, an absence of forward declarations will work against that goal to some extent, although empirical evidence suggests that the extent to which this is an issue varies widely between existing C++ developers.

Efficient and simple compilation: This approach leads to a simpler compilation strategy. However, in the absence of forward declarations, every compilation referring to an entity needs to have a build dependency on the full definition of that entity, meaning there is no physical separation between the APIs of entities and their definitions. This would impose a potentially-significant build time cost due to increased recompilations when the definition of an entity changes.

There are possibilities to reduce this build time cost. For example, instead of each declaration identifying whether it is public or private to a library, we could add a third visibility level to say that the declaration but not the definition is public, and use a stripping tool as part of the build process to avoid dependencies on non-public definitions from cascading into unnecessary rebuilds.

This approach is also unlikely to be acceptable from an ergonomic standpoint unless declarations within a source file are visible throughout at least the entire library containing that file, although this is strictly an independent choice. Making definitions from a source file visible throughout its library would presumably require a library-at-a-time build strategy, which is also expected to introduce build time costs due to changes to a library requiring more code in that library to be recompiled. This cost could be reduced by using a reactive compilation model, recompiling only the portions of a library that are affected by a code change, at some complexity cost.

Diagnosability: Because each entity can only be declared once, there is no significant challenge in diagnosing redeclaration issues. However, there is still some work required to diagnose conflicts between similar or functionally identical function overloads.

Toolability: Having a unique source location for each entity allows for somewhat simpler tooling. For example, there is no need to distinguish between "jump to declaration" and "jump to definition", or to decide which declaration should be consulted to find documentation comments, parameter names, and so on.

Information flow

Strict top-down

Carbon could accumulate information top-down. We could require that each program utterance is type-checked and fully validated before any later code is considered.

In order to support this and still permit cyclic references between entities, we would need to allow separate declaration and definition.

Analysis

Comprehensibility: This rule is simple to explain, and has no special cases. However, the inability to look at information from later in the source file is likely to result in gotchas:

class Base {
  var n: i32;
}
class Derived extends Base {
  // Returns me.(Base.n), not me.(Derived.n), because the latter has not
  // been declared yet.
  fn Get[me: Self]() -> i32 { return me.n; }
  var n: i32;
}

It might be possible to require a diagnostic in such cases, when we find a declaration that would change the meaning of prior code if it had appeared earlier, but that would result in implementation complexity, and the fact that such cases are rejected would still be a surprise.

Ergonomics: The developer is required to topologically sort their source files in dependency order, manually breaking cycles with forward declarations. Common refactoring tasks such as reorganizing code may require effort or tooling assistance in order to preserve a topological order.

Developers could adapt to this ruleset by beginning each source file with a collection of forward declarations. This would mitigate the need to produce a topological ordering, except within those forward declarations themselves, and other declarations required to provide those forward declarations. For example, a forward declaration of a class member will likely only be possible within a class definition, and the order in which that class definition is given can be relevant to the validity of other class definitions. However, our experience with C++ indicates that this will not be done, and instead an ad-hoc and undisciplined combination of a topological sort and occasional forward declarations will be used. Indeed, including these additional forward declarations would increase the maintenance burden and introduce an additional opportunity for errors due to mismatches between declaration and definition.

Readability: Developers wishing to understand code have the advantage that they need only consider prior code, and there is no possibility that a later source utterance could change the meaning of the code they're reading. However, it is rare to read code top-down, so the effect of this advantage may be modest.

This advantage leads to a significant disadvantage: the behaviour of an entity can be different at different places within a source file. For example, a type can be incomplete in one place and complete in another, or can fail to implement an interface when inspected early and then found to implement that interface later. This can lead to very subtle incoherent behavior.

In practice, the topological ordering constraint tends to lead to good locality of information: helpers for particular functionality are often located near to the functionality. However, this is not a universal advantage, and the topological constraint sometimes leads to internal helpers being ordered immediately before their first use instead of in a more logical position near correlated functionality.

Efficient and simple compilation: This rule is mostly simple and efficient to implement, and even allows single-pass processing of source files.

Diagnosability: Because information is provided top-down, diagnostics can also be provided top-down and in every case the diagnostic will be caused by an error at the given location or earlier. Fixing errors should require little or no backtracking by the developer.

However, an implementation that strictly confines its processing to top-down order and produces diagnostics eagerly cannot deliver diagnostics that react intelligently to contextual cues that appear after the point of the diagnostic. This approach diminishes the ability for an implementation to pinpoint the cause of the error and describe it in a developer-oriented fashion.

Toolability: Limiting information flow to top-down means that tools such as code completion tools need only consider context prior to the cursor, and they can be confident that if all the code prior to the cursor is valid that it can be type-checked and suitable completions offered.

However, in the case where the user wants to refer to a later-declared entity, such tools would not be able to use this strategy. They would need to parse as if there were not a top-down rule in order to find such later-declared entities, and would likely additionally need the ability to add forward declarations or to reorder declarations in order to satisfy the ordering requirement.

Strict global consistency

Carbon could follow an approach of requiring the behavior of every entity to be globally consistent. In this approach, the behavior of every entity would be as if the entire program could be consulted to determine facts about that entity.

In practice, to make this work, we would need to limit where those facts can be declared. For example, we limit implementations of interfaces to appear only in source files that must already be in use wherever the question "does this type implement that interface?" can be asked.

In addition, we need to reject at least the case where some property of the program recursively depends upon itself:

struct X {
  var a: Array(sizeof(X), i8);
}

In order to give globally consistent semantics to, for example, a package name, we would likely need to process all source files comprising a package at the same time. This is likely to encourage packages to be small, whereas we have designed package support on the assumption that the scope of a package matches that of a complete source code repository.

This alternative can be considered either with or without the ability to separate declarations from definitions. If we permit separate declaration and definition, we would likely require declarations of the same entity to appear in the same library; however, that limitation is desirable regardless of which strategy we choose.

Block-scope exception

Applying this rule to local name lookup in block scope does result in some surprises. For example, C# uses this approach, and combined with its disallowance of shadowing of local variables, this confuses some developers. As a variant of this alternative, it would be reasonable and in line with likely programmer expectations to not apply this rule to names in block scope. We call this the block-scope exception.

For example:

var m: i32 = 1;
fn F(b: bool) -> i32 {
  if (b) {
    var n: i32 = 2;
    // With the block-scope exception:
    // * `n` is unambiguously the variable declared on the line above, and
    // * `m` is unambiguously the global.
    //
    // Without the block-scope exception:
    // * depending on the rules we choose for shadowing, `n` might name the
    //   variable above, might be ambiguous, or might be invalid because it
    //   shadows the variable `n` declared below, and
    // * `m` would name the local variable declared below.
    return n + m;
  }
  var n: i32 = 3;
  var m: i32 = 4;
  return n * m;
}
Analysis

Comprehensibility: This rule is simple to explain, and has no special cases, other than perhaps the block scope variant. The disallowance of semantic cycles is likely to be unsurprising as it is a logical necessity in any rule.

Ergonomics: The developer can organize or arrange their code in any way they desire. There is never a requirement to forward-declare or repeat an interface declaration. Refactoring and code reorganization do not require any non-obvious changes, because the same code means the same thing regardless of how it is located relative to other code.

Readability: Reasoning about code is simple in this model, as such reasoning is largely not context-sensitivity. Instead of questioning "what does this do here?" we can instead consider "what does this do?". Some context sensitivity may remain, for example due to access and name bindings differing in different contexts.

However, to developers accustomed to a top-down semantic model, the ability to defer giving key information about an entity -- or even declaring it at all -- until long after it is first used may hinder readability in some circumstances, particularly when reading code top-down.

Efficient and simple compilation: This model forces the compilation process to operate in multiple stages rather than as a single pass.

Some form of cycle detection is necessary if cycles are possible. However, such a mechanism is likely to be necessary for template instantiation too, so this is likely not a novel requirement for Carbon.

Forcing all files within a package to be compiled together in order to provide consistent semantics for the package name may place an undesirable scalability barrier on the build system. This will also tend to push up build latency, especially for incremental builds, by decreasing the granule size of compilation. It may increase the scope of recompilations due to additional physical dependencies unless we implement a mechanism to detect whether a library has changed in an "important" way.

We do not have an existence proof of an efficient build strategy for a Carbon-like language that follows this model. For example, Rust has known build scalability issues, as does Java for large projects unless a tool like ijar is used to automatically create something resembling a C++ header file from a source file. There is a significant risk that such an automated tool would be very hard to create for Carbon: because the definitions of classes and functions can be exposed in various ways through compile-time function evaluation, templates, and metaprogramming, it seems challenging to put any useful bound on which entities can be stripped without restricting the capabilities of client code.

Diagnosability: The implementation is likely to have more contextual information when providing diagnostics, improving their quality. However, the diagnostics may appear in a confusing order: if an early declaration needs information from a later declaration in order to type-check, diagnostics associated with that later declaration may be produced first, or may be interleaved with diagnostics for the earlier declaration, leading to the programmer potentially revisiting the same code multiple times during a compile-edit cycle.

Toolability: This model requires tools to consider the whole file as context, because code may refer to entities that are only introduced later. For an IDE scenario, where the cursor represents a location where an arbitrary chunk of code may be missing, this presents a challenge of determining how to resynchronize the input in order to determine how to interpret the portion of the source file following the cursor.

Sophisticated tooling for a top-down model may wish to inspect the trailing portion of the file anyway, in order to provide a better developer experience, but this complexity would be forced upon tools with this model.

Top-down with minimally deferred type checking

We could follow a top-down approach generally, but defer type-checking some or all top-level entities until we reach the end of that entity. For example, we would check an entire class as a single unit, following the same principles as in the globally-consistent rule, but using only information provided prior to the end of the class definition. This would allow class members to use other members that have not yet been declared, while not permitting a function definition preceding the class definition to use such members.

Following C++'s lead, we would apply this to at least classes and interfaces, but perhaps to nothing else. We may still want to apply the top-down rule to function definitions appearing within a class or interface, as the C# behavior that the scope of a local variable extends backwards before its declaration may be surprising.

Analysis

This approach generally has the properties as strict top-down, except as follows:

Comprehensibility: Slightly reduced due to additional special-casing of top-level declarations. Gotchas would be rarer, but may not be fully eliminated. For example:

import X;
// G from package X, not G from class X below.
fn F() { X.G(); }
class X {
  fn G() {}
}
class Y {
  // G from class X below, not G from package X.
  fn F() { X.G(); }
  class X {
    fn G() {}
  }
}

Ergonomics: Improved within classes, as members can now be named everywhere within the class.

Readability: The ability to read and understand code may be somewhat harmed, as the rules for behavior across top-level declarations aren't the same as the rules for behavior within a top-level declaration. It may be especially jarring that code that is valid within a class definition is not valid at the top level:

class A {
  // OK
  var b: B;
  class B {}
}

// Error, I have not heard of B.
var b: B;
class B {}

However, the behavior of an entity no longer differs at different points within the same declaration of that entity, so the ability to reason about the behavior of an entity is somewhat improved compared to the strict top-down approach. For entities without a separate declaration and definition, the behavior of those entities is the same everywhere.

Efficient and simple compilation: This has the same complications as the strict global consistency approach, except that it does not apply to contexts that span multiple files, so it doesn't have the build time disadvantages of requiring all source files in a library to be built together.

Diagnosability: Diagnostics should be encountered by the implementation in the top-down order of the top-level declaration containing the error, but may in principle appear in any order within that top-level declaration, if there are forward references that require later code to be analyzed before earlier code is.

Diagnostics can be caused by errors appearing both before and after the location reported, but not arbitrarily far after: later top-level declarations cannot affect an earlier diagnostic.

Toolability: Tools that want to correctly model Carbon semantics would be required to deal with incomplete source code in the vicinity of the cursor. This is probably the hardest kind of incomplete source code to handle, as the region of "damaged" code is most likely to be relevant context. In practice, this is likely to present the same difficulties as tooling in the strict global consistency model.

Top-down with deferred method bodies

This is the proposed approach.

We could follow a top-down approach generally, but defer all processing of bodies of class member functions until we have finished processing the class, at which point we would process those member function bodies in the order they appeared within the class. For example, we would reinterpret:

class A {
  fn F[me: Self]() -> i32 { return me.n; }
  fn Make() -> A { return {.n = 5}; }
  var n: i32;
}

exactly as if it were written with the member function bodies out of line:

class A {
  fn F[me: Self]() -> i32;
  fn Make() -> A;
  var n: i32;
}
fn A.F[me: Self]() -> i32 { return me.n; }
fn A.Make() -> A { return {.n = 5}; }

The strict top-down would be applied to this rewritten form of the program.

As in C++, member functions of nested classes would be deferred until the outermost class is complete. This deferral would apply only to the bodies of member functions, unlike in C++ where it also applies to default arguments, default member initializers, and exception specifications, and unlike top-down with minimally deferred type checking, where it also applies to member function signatures and the declarations and definitions of non-function members.

Analysis

This approach generally has the properties as strict top-down, except as follows:

Comprehensibility: Slightly reduced due to additional special-casing of class member functions, but these rules are well-aligned with the rules from C++, which do not seem to suffer from major comprehensibility problems in this area.

Ergonomics: Improved within classes, as members can now be named within member function bodies.

Readability: The ability to read and understand code may be somewhat harmed, as the rules for behavior within a member function body are different from the rules everywhere else in the language. However, the very similar rule in C++ is not known to cause readability problems, so no major concerns are anticipated.

Efficient and simple compilation: The complexity of this approach is not substantially greater than the simple top-down approach. Processing of member function bodies must be deferred, but this can be done either by storing and replaying the tokens, or by forming a parse tree but deferring the semantic analysis and type-checking until the end of the class.

Diagnosability: Diagnostics for errors in member functions may appear after diagnostics for later code in the same class.

Developers may be confused by diagnostics claiming that a class member function body is not available yet and referring to a point in the program text that lexically follows the member function definition. It is likely possible to special-case such diagnostics to explain the nature of the problem.

Toolability: Tools that want to parse incomplete Carbon code would need to cope with member function bodies containing errors. For the most part, skipping a brace-balanced function body should be straightforward, but tools will also need to consider whether they attempt to detect and recover from mismatched braces within member functions.

Context-sensitive local consistency

We could use different behaviors in different contexts, as follows:

  • For contexts that are fundamentally ordered, such as function bodies, a top-down rule is used.
  • For contexts that are defined across multiple source files, such as packages and namespaces, we guarantee consistent behavior within each source file, but the behavior may be inconsistent across source files: different source files may see different sets of names within a package or namespace, depending on what they have imported.
  • For contexts that are defined within a single source file, such as a class or an interface, we guarantee globally consistent behavior.

Compared to the strict global consistency model, this would not guarantee that the contents of a package or namespace are the same everywhere: you only see the names that you declare or import into a package or namespace. Also, a top-down rule is used within function bodies, as that likely better matches programmer expectations.

Compared to the top-down with minimally deferred type checking and top-down with deferred method bodies models, this would not require forward declarations for namespace and package members that are declared or defined later in the same source file.

Analysis

Comprehensibility: This model is probably a little easier to understand than the minimally-deferred type-checking model, because there is no difference between top-level declarations and nested declarations.

Because the behavior of every entity is consistent within a file, and every name lookup is consistent within a scope -- other than the top-down behavior within function scopes -- there are fewer opportunities for surprises than in the top-down approaches, and nearly as few as in the strict global consistency model.

Ergonomics: As with the strict global consistency model, code can be organized as the developer desires, and refactoring doesn't have any surprises due to disrupting a topological order.

Readability: As with the strict global consistency model, readability is improved by not being context-sensitive. However, it is file-sensitive, and a different set of imports may lead to different behavior.

Similarly, as with the strict global consistency model, developers accustomed to the top-down model may find it harder to understand code in which a name can be referenced before it is declared.

Efficient and simple compilation: This model has the same general costs as the top-down with minimal deferred type-checking model. The implementation needs to be able to separate parsing from type-checking, and type-check lazily in order to handle forward references. However, more state needs to be accumulated before type-checking can begin.

This model supports separate compilation of source files, as no attempt is made to ensure consistency across source files, only within a source file.

Diagnosability: The compilation process will be split up into multiple phases, and it is likely that diagnostics from one phase will precede those from another. For example, a parsing error for a later construct may precede a type error for an earlier one.

If an earlier phase fails, it may not be feasible to diagnose later phases. For example, if parsing encounters an unrecoverable failure, type errors for the successfully-parsed portion cannot be generated, because they may depend on constructs appearing later in the same source file. However, this may also mean that better diagnostics are produced for cases such as a missing }, where the syntactic error will precede or prevent diagnostics for semantic errors caused by the misinterpretation of the program resulting from that missing }.

Within a particular phase of processing, diagnostics will generally be produced in a topological order, and the processing can be arranged such that they are produced in top-down order except where a forward reference requires that a later declaration is checked earlier.

Toolability: Largely the same as the strict global consistency model, except that tools only need to use the current source file as context. Also largely the same as the top-down with minimally deferred type checking model.