解释器 Java 代码的主机编译

对于以下文档，我们区分主机编译和客户编译。

主机编译应用于解释器的 Java 实现。如果解释器运行在 HotSpot 上，当 Truffle 解释器作为 Java 应用程序 JIT 编译（或动态编译）时，会应用这种类型的编译。这种编译在生成本地映像期间提前应用。
客户编译应用于客户语言代码。这种类型的编译使用部分求值和 Futamura 投影从 Truffle AST 和字节码中推导出优化后的代码。

本节讨论应用于 Truffle AST 和字节码解释器的特定领域主机编译。

主机内联 #

Truffle 解释器被编写为通过应用第一个 Futamura 投影来支持运行时编译。运行时可编译代码，也常称为部分可求值代码，具有以下特点

它天生就针对高性能设计，因为它还定义了语言运行时编译后的性能。
它被编写为避免递归，因为递归代码不能快速部分求值。
它避免使用复杂的抽象和第三方代码，因为它们通常不是为 PE 设计的。
部分可求值代码的边界由用 @TruffleBoundary 注释的方法、受 CompilerDirectives.transferToInterpreter() 调用支配的块或受 CompilerDirectives.inInterpreter() 调用保护的块可靠地定义。

Truffle 主机内联利用了这些属性，并尽可能地在主机编译期间对运行时可编译代码路径强制内联。一般假设是，对运行时编译重要的代码对解释器执行也很重要。每当检测到 PE 边界时，主机内联阶段不再进行任何内联决策，并将它们推迟到更适合常规 Java 代码的后续内联阶段。

此阶段的源代码可以在 HostInliningPhase 中找到。

在编译用 @HostCompilerDirectives.BytecodeInterpreterSwitch 注释的方法时，会应用 Truffle 主机内联。此类方法的最大节点成本可以使用 -H:TruffleHostInliningByteCodeInterpreterBudget=100000（对于本地映像）和 -Djdk.graal.TruffleHostInliningByteCodeInterpreterBudget=100000（对于 HotSpot）进行配置。如果用 @BytecodeInterpreterSwitch 注释的方法调用了具有相同注释的方法，则只要这两个方法的成本不超过预算，该方法就会直接内联。换句话说，任何此类方法都会被内联阶段视为与根字节码切换方法一样。这允许字节码解释器切换在需要时由多个方法组成。

本地映像在封闭世界分析期间计算运行时编译可访问的所有方法。从 RootNode.execute(...) 可访问的任何方法都将被确定为运行时可编译。对于本地映像，除了字节码解释器切换之外，所有运行时可编译的方法都使用 Truffle 主机内联进行优化。此类内联传递的最大节点成本可以使用 -H:TruffleHostInliningBaseBudget=5000 进行配置。在 HotSpot 上，运行时可编译方法集是未知的。因此，对于 HotSpot 上未注释为字节码解释器切换的方法，我们只能依赖常规的 Java 方法内联。

每当编译单元的最大预算达到时，内联就会停止。相同的预算将用于内联期间子树的探索。如果在预算范围内无法完全探索和内联调用，则不会对单个子树做出任何决定。对于绝大多数运行时可编译方法，不会达到此限制，因为它会受到自然 PE 边界以及对 @Child 节点的 execute 方法的多态调用的限制。如果存在超过预算限制的方法，那么建议通过添加更多 PE 边界来优化此类节点。如果方法超过了限制，则该代码可能在运行时编译中也具有高成本。

调试主机内联 #

此阶段执行的内联决策最好使用 -H:Log=HostInliningPhase,~CanonicalizerPhase,~GraphBuilderPhase（对于本地映像）或 -Djdk.graal.Log=HostInliningPhase,~CanonicalizerPhase,~GraphBuilderPhase（对于 HotSpot）进行调试。您可以使用 -Djdk.graal.LogFile=FILE（适用于两者）将输出重定向到文件。

考虑以下示例，它展示了之前描述的 Truffle 解释器中部分可求值代码的常见模式

class BytecodeNode extends Node {

    @CompilationFinal(dimensions = 1) final byte[] ops;
    @Children final BaseNode[] polymorphic = new BaseNode[]{new SubNode1(), new SubNode2()};
    @Child SubNode1 monomorphic = new SubNode1();

    BytecodeNode(byte[] ops) {
        this.ops = ops;
    }

    @BytecodeInterpreterSwitch
    @ExplodeLoop(kind = LoopExplosionKind.MERGE_EXPLODE)
    public void execute() {
        int bci = 0;
        while (bci < ops.length) {
            switch (ops[bci++]) {
                case 0:
                    // regular operation
                    add(21, 21);
                    break;
                case 1:
                    // complex operation in @TruffleBoundary annotated method
                    truffleBoundary();
                    break;
                case 2:
                    // complex operation protected behind inIntepreter
                    if (CompilerDirectives.inInterpreter()) {
                        protectedByInIntepreter();
                    }
                    break;
                case 3:
                    // complex operation dominated by transferToInterpreter
                    CompilerDirectives.transferToInterpreterAndInvalidate();
                    dominatedByTransferToInterpreter();
                    break;
                case 4:
                    // first level of recursion is inlined
                    recursive(5);
                    break;
                case 5:
                    // can be inlined is still monomorphic (with profile)
                    monomorphic.execute();
                    break;
                case 6:
                    for (int y = 0; y < polymorphic.length; y++) {
                        // can no longer be inlined (no longer monomorphic)
                        polymorphic[y].execute();
                    }
                    break;
                default:
                    // propagates transferToInterpeter from within the call
                    throw CompilerDirectives.shouldNotReachHere();
            }
        }
    }

    private static int add(int a, int b) {
        return a + b;
    }

    private void protectedByInIntepreter() {
    }

    private void dominatedByTransferToInterpreter() {
    }

    private void recursive(int i) {
        if (i == 0) {
            return;
        }
        recursive(i - 1);
    }

    @TruffleBoundary
    private void truffleBoundary() {
    }

    abstract static class BaseNode extends Node {
        abstract int execute();
    }

    static class SubNode1 extends BaseNode {
        @Override
        int execute() {
            return 42;
        }
    }

    static class SubNode2 extends BaseNode {
        @Override
        int execute() {
            return 42;
        }
    }
}

我们可以通过在 Graal 存储库（参见类 HostInliningBytecodeInterpreterExampleTest）中将其作为单元测试运行来运行它，方法是在 graal/compiler 中运行以下命令行

mx unittest  -Djdk.graal.Log=HostInliningPhase,~CanonicalizerPhase,~GraphBuilderPhase -Djdk.graal.Dump=:3  HostInliningBytecodeInterpreterExampleTest

这将打印

[thread:1] scope: main
  [thread:1] scope: main.Testing
  Context: HotSpotMethod<HostInliningBytecodeInterpreterExampleTest$BytecodeNode.execute()>
  Context: StructuredGraph:1{HotSpotMethod<HostInliningBytecodeInterpreterExampleTest$BytecodeNode.execute()>}
      [thread:1] scope: main.Testing.EnterpriseHighTier.HostInliningPhase
      Truffle host inlining completed after 2 rounds. Graph cost changed from 136 to 137 after inlining:
      Root[jdk.graal.compiler.truffle.test.HostInliningBytecodeInterpreterExampleTest$BytecodeNode.execute]
          INLINE jdk.graal.compiler.truffle.test.HostInliningBytecodeInterpreterExampleTest$BytecodeNode.add(int, int)                      [inlined    2, monomorphic false, deopt false, inInterpreter false, propDeopt false, subTreeInvokes    0, subTreeCost    8, incomplete false,  reason null]
          CUTOFF jdk.graal.compiler.truffle.test.HostInliningBytecodeInterpreterExampleTest$BytecodeNode.truffleBoundary()                  [inlined   -1, monomorphic false, deopt false, inInterpreter false, propDeopt false, subTreeInvokes    1, subTreeCost    0, incomplete false,  reason truffle boundary]
          INLINE com.oracle.truffle.api.CompilerDirectives.inInterpreter()                                                                    [inlined    0, monomorphic false, deopt false, inInterpreter false, propDeopt false, subTreeInvokes    0, subTreeCost    6, incomplete false,  reason null]
          CUTOFF jdk.graal.compiler.truffle.test.HostInliningBytecodeInterpreterExampleTest$BytecodeNode.protectedByInIntepreter()          [inlined   -1, monomorphic false, deopt false, inInterpreter  true, propDeopt false, subTreeInvokes    1, subTreeCost    0, incomplete false,  reason protected by inInterpreter()]
          INLINE com.oracle.truffle.api.CompilerDirectives.transferToInterpreterAndInvalidate()                                               [inlined    3, monomorphic false, deopt  true, inInterpreter false, propDeopt false, subTreeInvokes    0, subTreeCost   32, incomplete false,  reason null]
            INLINE com.oracle.truffle.api.CompilerDirectives.inInterpreter()                                                                  [inlined    3, monomorphic false, deopt  true, inInterpreter false, propDeopt false, subTreeInvokes    0, subTreeCost    6, incomplete false,  reason null]
            CUTOFF com.oracle.truffle.runtime.hotspot.AbstractHotSpotTruffleRuntime.traceTransferToInterpreter()                              [inlined   -1, monomorphic false, deopt  true, inInterpreter  true, propDeopt false, subTreeInvokes    0, subTreeCost    0, incomplete false,  reason dominated by transferToInterpreter()]
          CUTOFF jdk.graal.compiler.truffle.test.HostInliningBytecodeInterpreterExampleTest$BytecodeNode.dominatedByTransferToInterpreter() [inlined   -1, monomorphic false, deopt  true, inInterpreter false, propDeopt false, subTreeInvokes    0, subTreeCost    0, incomplete false,  reason dominated by transferToInterpreter()]
          INLINE jdk.graal.compiler.truffle.test.HostInliningBytecodeInterpreterExampleTest$BytecodeNode.recursive(int)                     [inlined    4, monomorphic false, deopt false, inInterpreter false, propDeopt false, subTreeInvokes    1, subTreeCost   20, incomplete false,  reason null]
            CUTOFF jdk.graal.compiler.truffle.test.HostInliningBytecodeInterpreterExampleTest$BytecodeNode.recursive(int)                   [inlined   -1, monomorphic false, deopt false, inInterpreter false, propDeopt false, subTreeInvokes    1, subTreeCost    0, incomplete false,  reason recursive]
          INLINE jdk.graal.compiler.truffle.test.HostInliningBytecodeInterpreterExampleTest$BytecodeNode$SubNode1.execute()                 [inlined    1, monomorphic false, deopt false, inInterpreter false, propDeopt false, subTreeInvokes    0, subTreeCost    6, incomplete false,  reason null]
          CUTOFF jdk.graal.compiler.truffle.test.HostInliningBytecodeInterpreterExampleTest$BytecodeNode$BaseNode.execute()                 [inlined   -1, monomorphic false, deopt false, inInterpreter false, propDeopt false, subTreeInvokes    1, subTreeCost    0, incomplete false,  reason not direct call: no type profile]
          CUTOFF com.oracle.truffle.api.CompilerDirectives.shouldNotReachHere()                                                               [inlined   -1, monomorphic false, deopt false, inInterpreter false, propDeopt  true, subTreeInvokes    0, subTreeCost   98, incomplete false,  reason propagates transferToInterpreter]

请注意，我们还使用了 -Djdk.graal.Dump=:3 选项，它将图形发送到任何运行的 IdealGraphVisualizer 实例以供进一步检查。在 Native Image 上，使用 -H:Dump=:2 -H:MethodFilter=... 来转储给定方法的主机编译图形。

要调试不完整探索的 CUTOFF 决策（具有 incomplete true 的条目），请使用 -Djdk.graal.TruffleHostInliningPrintExplored=true 选项查看日志中的所有不完整子树。

调整主机内联 #

在学习如何调试和跟踪主机内联决策后，该是看看一些调整它的方法了。第一步是确定对解释器性能至关重要的编译单元。为此，Truffle 解释器可以通过将 engine.Compilation 标志设置为 false 来在仅解释器模式下执行。之后，可以使用 Java 性能分析器来识别执行中的热点。有关性能分析的更多详细信息，请参见 Profiling.md 如果您正在寻找有关如何以及何时优化 Truffle 解释器的建议，请参见 Optimizing.md

在识别出热点方法（例如，Truffle 字节码解释器中的字节码调度循环）后，我们可以使用上一节中描述的主机内联日志进行进一步调查。有趣的条目以 CUTOFF 为前缀，并具有解释单个中断原因的 reason。

CUTOFF 条目的常见原因是

dominated by transferToInterpreter() 或 protected by inInterpreter()：这意味着调用是在慢速路径中执行的。主机内联不会对这种调用做出决定，而只是将其标记为 CUTOFF。
target method not inlinable 这发生在无法内联的主机 VM 方法上。通常我们对此无能为力。
Out of budget 我们在内联此方法时用完预算。如果方法的成本过高，就会发生这种情况。

此外，为了避免代码大小爆炸，主机内联具有内置的启发式方法来检测被认为过于复杂而无法内联的调用子树。例如，跟踪可能会打印以下内容

CUTOFF com.oracle.truffle.espresso.nodes.BytecodeNode.putPoolConstant(VirtualFrame, int, char, int)   [inlined   -1, explored    0, monomorphic false, deopt false, inInterpreter false, propDeopt false, graphSize 1132, subTreeCost 5136, invokes    1, subTreeInvokes   12, forced false, incomplete false,  reason call has too many fast-path invokes - too complex, please optimize, see truffle/docs/HostOptimization.md

这表明子树中存在太多快速路径调用（默认情况下为 10），它也会在该数量后停止探索。-Djdk.graal.TruffleHostInliningPrintExplored=true 标志可以用来查看决策的整个子树。以下调用被认为是快速路径调用

目标方法用 @TruffleBoundary 注释的调用。
多态的调用或没有单态性能分析反馈的调用。例如，对子表达式的 execute 方法的调用。
递归调用。
本身过于复杂的调用。例如，具有太多快速路径调用的调用。

以下调用不被认为是快速路径调用

可以使用主机内联启发式方法内联的调用。
在慢速路径中的调用，例如，受 transferToInterpreter() 支配或受 isInterpreter() 保护的任何调用。
由于主机 VM 的限制而无法内联的调用，例如，对 Throwable.fillInStackTrace() 的调用。
不再可访问的调用。

无法完全避免快速路径调用，因为例如，子节点需要在 AST 中执行。理论上可以避免字节码解释器中的所有快速路径调用。在实践中，语言会依赖于 @TruffleBoundary 以便运行时能够实现更复杂的字节码。

在以下各节中，我们将讨论有关如何改进主机解释器代码的技术

优化：使用 @HostCompilerDirectives.InliningCutoff 手动剪切代码路径 #

如上一节所述，启发式方法会自动剪切具有太多调用的子树。优化此方法的一种方法是使用 @InliningCutoff 注释。

考虑以下示例

abstract class AddNode extends Node {

   abstract Object execute(Object a, Object b);

   @Specialization int doInt(int a, int b) { return a + b; }
   
   @Specialization double doDouble(double a, double b) { return a + b; }
   
   @Specialization double doGeneric(Object a, Object b, @Cached LookupAndCallNode callNode) { 
       return callNode.execute("__add", a, b); 
   }
}

在此示例中，特化 doInt 和 doDouble 非常简单，但还有一个 doGeneric 特化，它调用一个复杂的查找链。假设 LookupAndCallNode.execute 是一个非常复杂的方法，具有超过 10 个快速路径子树调用，我们可能无法期望 execute 方法被内联。主机内联目前不支持自动组件分析；尽管可以使用 @InliningCutoff 注释手动指定它

abstract class AddNode extends Node {

   abstract Object execute(Object a, Object b);

   @Specialization int doInt(int a, int b) { return a + b; }
   
   @Specialization double doDouble(double a, double b) { return a + b; }
   
   @HostCompilerDirectives.InliningCutoff
   @Specialization double doGeneric(Object a, Object b, @Cached LookupAndCallNode callNode) { 
       return callNode.execute("__add__", a, b); 
   }
}

更改代码后，主机内联现在可能会决定内联 AddNode 的 execute 方法（如果它适合主机内联预算），但在 doGeneric(...) 方法调用处强制执行 CUTOFF。请查看 javadoc 以了解使用此注释的其他用例。

优化：从部分求值期间折叠的分支中消除重复调用 #

以下是一个示例，其中代码对于使用部分求值进行编译效率很高，但对于主机编译来说并不理想。

@Child HelperNode helperNode;

final boolean negate;
// ....

int execute(int argument) {
	if (negate) {
		return helperNode.execute(-argument);
	} else {
         return helperNode.execute(argument);
	}
}

当使用部分求值编译此代码时，由于negate字段是编译最终的，所以条件保证会折叠成单个情况，因此代码效率很高。在主机优化期间，negate字段不是编译最终的，编译器会将代码内联两次，或者选择不内联execute方法。为了避免这种情况，可以按如下方式重写代码

@Child HelperNode helperNode;

final boolean negate;
// ....

int execute(int argument) {
    int negatedArgument;
    if (negate) {
        negatedArgument = -argument;
    } else {
        negatedArgument = argument;
    }
    return helperNode.execute(negatedArgument);
}

如果使用许多具有相同方法体的特化，则可以通过代码生成间接产生类似的代码模式。主机编译器通常难以自动优化此类模式。

优化：将复杂的慢速路径代码提取到单独的方法中 #