folio — plugin sandbox verification research (v0.1.0-draft)

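§1. Research background

§1.1 Core problem

folio is about to start Phase X3. The plugin composition (1 skill + 4 hooks + 6 scripts + 1 CLI; see plugin-architecture-research.html §7.3) is fixed, but three things have not been verified first-hand: the actual behavior of hooks, the PreToolUse permissionDecision deny bugs (Issue #37210 / #33106), and the throughput of the lint scripts. To remove the risk of a gap between the paper design and reality, a sandbox verification frame for the plugin needs to be designed at the level of functional units (use cases).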

§1.2 Ten sub-questions

#	Sub-question	Owner
Q1	Official Claude Code plugin development workflow (test/dev/sandbox; verification of Issue #37210 etc.)	researcher-1
Q2	Deep dive into test harnesses of similar plugin ecosystems (VS Code / Neovim / JetBrains / MCP / Chrome / Obsidian / Cursor / Continue / Aider)	researcher-2
Q3	AI agent behavior evaluation frameworks (LangChain / OpenAI / Anthropic / AutoGen / CrewAI / LangGraph / Inspect AI / Promptfoo / Helicone)	researcher-3
Q4	Conformance test suite design patterns (W3C WPT / LSP / TypeScript baseline / IETF RFC / OpenAPI Specmatic)	researcher-4
Q5	Sandbox runtimes for building isolated environments (Nix / Docker / git worktree / Devbox / mise / Firecracker / Vagrant / WSL2)	researcher-5
Q6	Property-based / behavioral testing (Hypothesis / fast-check / mutation / Gherkin / EARS→scenario)	merged into researcher-3
Q7	Observation instrumentation (transcript / hook stdio / golden file / snapshot / structured logging)	merged into researcher-4
Q8	Recommended design of the sandbox verification framework for folio	derived by controller synthesis
Q9	Explicit boundary with twill's experiment-verified	derived by controller synthesis
Q10	ADR candidates	derived by controller synthesis

§1.3 Method

Differential research mode (DIFFERENTIAL_MODE=true; assumes existing memory: hash f621abb6 plugin-architecture / hash 96b16f0c twl Hooks/MCP / hash 322df262 Claude Code v2.1 update). Five researchers were spawned in parallel (sonnet model); 67 fetches succeeded (4 failed) and 87 unique sources were aggregated. The critic verdict was PASS (3 WARNINGs + 5 Gaps; details in §10).

§2. Q1 — Claude Code official specification reference (test/dev workflow)

§2.1 Development mode: `--plugin-dir` is the official dev install

Conclusion: there is no --dev flag. The official equivalent of a "development install" is claude --plugin-dir ./my-plugin (verified [1]). The plugin is loaded for the session only (nothing is installed), and that is the security boundary.

Function	Command	Notes
dev install	`claude --plugin-dir ./my-plugin`	No install, session only. Takes precedence over an installed plugin of the same name (cannot override managed settings)
ZIP support (v2.1.128+)	`claude --plugin-dir ./my-plugin.zip`	For testing CI build artifacts
URL	`claude --plugin-url https://example.com/my-plugin.zip`	Dummy URL for illustration [1]
Multiple plugins	`--plugin-dir ./p1 --plugin-dir ./p2`	Several plugins can be tested together
hot reload	`/reload-plugins`	Reloads plugins / skills / agents / hooks / plugin MCP / plugin LSP. Monitors are excluded (session restart required)
Debug	`claude --debug`	Plugin load details / manifest errors / skill, agent, and hook registration status [2]
validate	`claude plugin validate [--strict]`	Syntax/schema check of plugin.json / frontmatter / hooks.json. Same tool as the community marketplace review [1]
token cost	`claude plugin details <name>`	Shows the token cost of each skill

§2.2 Hook test pattern: no official automated framework (deduced)

The official docs recommend manual execution ("Test the script manually") [2]. There is no official automated test framework for hooks. Recommended steps:

chmod +x ./scripts/your-script.sh (check execute permission)
Check the shebang line (#!/bin/bash)
Check that ${CLAUDE_PLUGIN_ROOT} is used
Manually verify stdin / exit code / stdout with echo '{...JSON}' | bash ./hooks/script.sh
Confirm that the hook fires with claude --debug
Event names are case-sensitive (PostToolUse, not postToolUse)
Matcher pattern ("matcher": "Write|Edit")

There are five hook handler types (command / http / mcp_tool / prompt / agent).
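
The manual `echo '{...JSON}' | bash ./hooks/script.sh` check can be automated with a few lines of Python. The sketch below is only an illustration: `run_hook`, `DEMO_HOOK`, and the payload fields are hypothetical names, and the demo hook is a Python stand-in for a real shell script so that the example runs anywhere.

```python
import json
import os
import subprocess
import sys

# Hypothetical stand-in for a real hook script: reads the hook JSON from stdin
# and blocks (exit 2) unless FOLIO_ARCHITECT_CONTEXT is set to a non-empty value.
DEMO_HOOK = r'''
import json, os, sys
payload = json.load(sys.stdin)
if not os.environ.get("FOLIO_ARCHITECT_CONTEXT"):
    sys.stderr.write("FOLIO_ARCHITECT_CONTEXT is not set; blocking " + payload["tool_name"])
    sys.exit(2)
sys.exit(0)
'''


def run_hook(command, payload, extra_env=None, unset=()):
    """Pipe `payload` as JSON to `command`; capture exit code, stdout, stderr."""
    env = dict(os.environ)
    for name in unset:  # remove variables that must start from a clean state
        env.pop(name, None)
    env.update(extra_env or {})
    proc = subprocess.run(
        command,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        env=env,
        timeout=30,
    )
    return {"exit_code": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}


if __name__ == "__main__":
    payload = {"tool_name": "Write", "tool_input": {"file_path": "scratch/specs/example.html"}}
    # For a real hook the command would be e.g. ["bash", "./hooks/script.sh"].
    command = [sys.executable, "-c", DEMO_HOOK]
    print(run_hook(command, payload, unset=("FOLIO_ARCHITECT_CONTEXT",)))
    print(run_hook(command, payload, {"FOLIO_ARCHITECT_CONTEXT": "folio-architect"}))
```

Because the harness only needs stdin, the exit code, and the two output streams, it requires no Claude session and no API key.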

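§2.3 Skill testing = Evaluation-Driven Development (verified)

Officially recommended approach [3]:

Give Claude a representative task without the Skill → observe the failures
Write three or more evaluation scenarios that cover the failure cases
Measure baseline performance
Write the Skill (minimal)
Measure the improvement with the evaluations → iterate

Evaluation JSON format (official sample):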

{
  "skills": ["pdf-processing"],
  "query": "...",
  "files": ["test-files/document.pdf"],
  "expected_behavior": ["...", "..."]
}

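§2.4 Transcript / log capture: no official API, JSONL saved automatically (verified)

Session data is saved automatically as JSONL under ~/.claude/projects/<encoded-project-path>/ [7] (schema details in §5.5)
The unofficial simonw/claude-code-transcripts is reported as broken because of changes to "unofficial and undocumented APIs" [8]
LangSmith integration: traces can reportedly be sent by setting LANGCHAIN_TRACING_V2=true and related variables. Note: this claim comes from researcher-1's search results; no dedicated LangSmith integration page was found in the official Claude Code docs
Errors can be checked in the Errors tab of /plugin (UI)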

§2.5 Marketplace review (verified)

Two official marketplaces [4]:

claude-plugins-official: managed by Anthropic, no application process, listing decided at Anthropic's discretion
claude-community: third-party submissions are listed after review
- running claude plugin validate locally is required
- submission: claude.ai/settings/plugins/submit or platform.claude.com/plugins/submit
- review pipeline = claude plugin validate + automated safety screening (details not public)
- after approval: pinned to a specific commit SHA and reflected in anthropics/claude-plugins-community/marketplace.json by a nightly sync

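§2.6 PreToolUse `permissionDecision: "deny"` bugs (4 issues verified)

All four issues are Closed (two of them as "not planned"):

Issue	Target	Summary	Status
#37210 [5]	Edit tool	A well-formed deny is ignored	Closed / not planned
#33106 [6]	MCP server tools	deny is ignored (works for built-in tools)	Closed / not planned (2026-03-11)
#39344 [7]	settings.deny vs hook	A hook's "ask" silently bypasses deny rules	Closed (2026-03-26), `area:security` label
#18312 [URL unconfirmed]	Tools in the allow list	deny/ask are completely ignored for tools in the Bash allow list	Closed

Gap between the official design and actual behavior: the reports above describe a hook's permission decision being ignored in specific cases, which is why §7.3 asserts on exit code 2 + stderr rather than on the permissionDecision JSON.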

§2.7 Claude Code's separate sandbox mechanism (not plugin-related)

sandbox.filesystem settings + OS-level filesystem/network access restrictions (Bash tool only)
With autoAllowBashIfSandboxed: true (default), Bash inside the sandbox runs without a prompt
Plugin development mode (--plugin-dir) has no mention of a sandbox (unsandboxed)
Monitors are explicitly documented to "run unsandboxed at the same trust level as hooks"
The BoxLite sandbox integration proposal (Issue #15888) was rejected by Anthropic as "not planned" [13]. There is no official plugin isolation sandbox

§3. Q2 — Deep dive into test harnesses of similar plugin ecosystems (9 cases)

§3.1 The 9 cases along 4 axes (runner / scenario / instrumentation / CI)

Case	Runner architecture	Scenario format	Observation	CI integration
VS Code Extension [14][15][16][17]	Separate-process Extension Host + `@vscode/test-electron` + Mocha	`.vscode-test.mjs` (`defineConfig`)	SourceMapStore, Test Explorer UI, exit code	xvfb-run + GitHub Actions
@vscode/test-cli [16][17]	CLI, config-driven, Mocha	`.vscode-test.mjs`	Test extraction in two modes (AST / eval)	`vscode-test` command
Neovim plenary.nvim [19][20][21]	Separate headless nvim instance per file	`*_spec.lua` (Busted BDD)	luassert spy/stub/mock, coroutine async	`nvim --headless -c PlenaryBustedDirectory` exit 0/1
Neovim busted+shim [22]	XDG isolation shim + `-l` flag	`*_spec.lua`	message trace table	`.busted` config
JetBrains IntelliJ [23][24][25]	Two processes (Starter+Driver) + JMX/RMI, two tiers (Light/Heavy)	`testdata/` + JUnit 5	Bundled plugin collects exceptions, error log	Gradle + JUnit 5
MCP Inspector (official) [28]	React UI + Node.js Proxy + stdio/SSE/HTTP	CLI `--cli` + JSON output	Full JSON-RPC visibility	CLI mode
mcp-recorder [30]	Proxy record/replay/verify	`cassette.json` (metadata + interactions) + scenarios.yml	Records request/response/latency	pytest plugin `@mcp_cassette`
MCPSpec [31]	YAML declarative + session recording	YAML test collections (10 assertion types)	schema drift detection	Generated by `mcpspec ci-init`
Obsidian [36][37]	Node.js hidden API + sandbox vault	None (unstructured)	None	Immature
Chrome Extension (MV3) [26][27]	Puppeteer/Playwright + Browser context	Jest `beforeEach`/`afterEach`	`extensionRealms()` / service worker target	xvfb-run (headed mode required)
Continue.dev [33]	Monorepo, 3 tiers (Jest/Vitest/npm)	Standard Jest/Vitest	Runs automatically on every PR	Blocks PR merge
Aider [34]	pytest self-tests + auto-fix loop	pytest	Arbitrary command via `--test-cmd`	pytest CI
pi-test-harness [35]	Real runtime + mocks at 3 boundary points	Playbook DSL (`when`/`calls`/`says`)	Event collection for every step	exit code

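§3.2 Six applicable patterns (for folio, A to F)

Pattern A: Separate-process plugin sandbox (VS Code / JetBrains): the plugin runs in an isolated subprocess separated from the host process and communicates over IPC (JSON-RPC / RMI or equivalent). Application to folio: during plugin verification, start the plugin in a separate process and intercept and record its calls to the folio API
Pattern B: Light / Heavy two tiers (JetBrains): Light: in-memory fixtures, only plugin state is reset, fast. Heavy: full environment rebuild, real file system, slow but realistic. Application to folio: everyday unit verification uses Light (in-memory); integration verification at phase completion uses Heavy (real sandbox)
Pattern C: Record → Replay → Verify (mcp-recorder): interactions with the plugin are recorded as a cassette, and CI replays the cassette without a live plugin. Detects schema drift (automatic detection of tool/API interface changes). Application to folio: turn the plugin API contract into cassettes and verify compatibility automatically on upgrade
Pattern D: Playbook-based scenarios (pi-test-harness): declarative scenarios in the form when(trigger, actions=[calls(...), says(...)]); only the LLM / external process boundary is replaced, and the real runtime runs as-is. Application to folio: describe plugin behavior scenarios in a Playbook-style YAML/DSL
Pattern E: YAML declarative test collections (MCPSpec): action definitions along the lines of action: verify_plugin_contract, 10 assertion types, environment variable injection, tags, parallel execution, and automatic generation of GitHub Actions scripts with ci-init. Application to folio: declare verification rules in YAML and run them with folio verify
Pattern F: XDG / ENV isolation (Neovim shim): point XDG_CONFIG_HOME / XDG_DATA_HOME at a tmp directory for full isolation, and inject or remove the test plugin dynamically through symlinks. Application to folio: point the sandbox vault's data directory at a temporary dir to avoid polluting the production vault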

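§3.3 Patterns common to all cases (must also be adopted by folio)

Process isolation: the system under test always runs in a separate process or isolated environment
Declarative scenarios: test conditions are declared in config files (YAML / .mjs / .lua)
CI gating by exit code: 0=pass / 1=fail blocks the pipeline
Fixture lifecycle management: state is reset in before_each / after_each
Non-intrusive observation: instrumentation without changing production code (SourceMapStore / event collection / cassette)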

§4. Q3 + Q6 — AI agent eval framework + Property/Behavioral testing

§4.1 Inspect AI (UK AISI) — four layers: Task/Dataset/Solver/Scorer (verified)

A Task maps directly onto a per-REQ scenario [38][39][40]
Solver: chains of generate() / chain_of_thought() / self_critique(), plus custom Solvers
Scorer: exact() / includes() / pattern() / model_graded_qa(grader_models=[...]), with majority voting across multiple graders [41]
Agent eval: built-in react() + use_tools() + generate_loop() (loops until tool calls stop)
Multi-agent: handoff() shares the conversation history
Agent Bridge: integration with external frameworks such as AutoGen / LangChain
Sandbox: Docker / K8s / Proxmox isolate the tool execution environment; sandbox().read_file() / sandbox().exec() allow inspection at scoring time
Ships with 200+ prebuilt evals; adopted by METR / Apollo Research

§4.2 Promptfoo — YAML declarations + varied assertions (verified)

Now acquired by OpenAI (MIT license retained) [42][43].

Assertion type	Use	Characteristics
`llm-rubric`	rubric pass/fail	fast, clear
`g-eval`	multi-dimensional CoT	high accuracy, high latency
`factuality`	consistency with a reference	requires ground truth
`select-best`	pairwise comparison	ranks models against each other
`context-faithfulness`	RAG faithfulness	grounded in the retrieved context

Standard judge prompt structure (RCAF): Role / Rubric as Context / Action / Format. Security instructions + scoring rules + JSON output (reason/score/pass) + context
Multi-judge voting: threshold: 0.66 on an assert-set requires 2 of 3 judges to pass
Staged pipeline: deterministic checks (fast) → lightweight judge (always) → high-accuracy judge (conditional)
YAML config: keys in the order description / env / prompts / providers / defaultTest / scenarios / tests, with Nunjucks variable expansion
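
The arithmetic behind the 0.66 threshold is easy to check by hand: 2 of 3 passing judges gives 0.667, which clears the bar, while 1 of 3 gives 0.333. The sketch below shows only that calculation; `judges_pass` is a hypothetical helper and is not Promptfoo's implementation.

```python
def judges_pass(verdicts, threshold=0.66):
    """Return True when the fraction of passing judges reaches the threshold.

    `verdicts` is a sequence of booleans, one per judge.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    passed = sum(1 for verdict in verdicts if verdict)
    return passed / len(verdicts) >= threshold


if __name__ == "__main__":
    print(judges_pass([True, True, False]))   # 2/3 of the judges pass
    print(judges_pass([True, False, False]))  # 1/3 of the judges pass
```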

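§4.3 Countermeasures for the 4 main LLM-as-judge biases (verified)

Bias	Countermeasure
Position bias (preference driven by presentation order)	Swap positions
Verbosity bias (overrating long answers)	State content-based criteria explicitly
Self-preference bias (overrating models of the same family)	Use a judge from a different family
Non-determinism (unstable scores)	temperature=0, average over multiple runs

Few-shot prompting improved GPT-4's consistency from 65% to 77.5% [47]
Calibration: 30-50 diverse examples + human labels + check for 90% agreement + hold-out validation + drift monitoring
DAG structure: split the evaluation into atomic subtasks, with each node making one specific judgment
Application to folio: single-dimension decomposition, assigning a dedicated judge to each REQ, is the best practice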
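
The position-swap countermeasure can be sketched as follows. This is an illustrative helper under stated assumptions, not an API of any framework named above: `judge` is a hypothetical callable that returns "first", "second", or "tie", and a disagreement between the two orderings is treated as a tie.

```python
def swap_consistent_winner(judge, answer_a, answer_b):
    """Ask the judge twice with positions swapped; keep the verdict only if both runs agree.

    Returns "a", "b", or "tie". A judge that simply prefers one position
    produces contradictory verdicts, which collapse to "tie".
    """
    forward = judge(answer_a, answer_b)   # a shown first
    backward = judge(answer_b, answer_a)  # b shown first
    forward_winner = {"first": "a", "second": "b"}.get(forward, "tie")
    backward_winner = {"first": "b", "second": "a"}.get(backward, "tie")
    return forward_winner if forward_winner == backward_winner else "tie"


if __name__ == "__main__":
    def position_biased(first, second):
        return "first"  # always prefers whatever is shown first

    def prefers_longer(first, second):
        if len(first) == len(second):
            return "tie"
        return "first" if len(first) > len(second) else "second"

    print(swap_consistent_winner(position_biased, "short", "a much longer answer"))
    print(swap_consistent_winner(prefers_longer, "short", "a much longer answer"))
```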

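§4.4 Hypothesis — RuleBasedStateMachine (verified)

Pattern for automatically verifying plugin state transitions [50]:

Decorator	Function
`@initialize`	Runs exactly once before the test (plugin registration)
`@rule`	Defines a basic operation (called in arbitrary order; lifecycle operations)
`@precondition`	Checks a condition before a rule runs
`@invariant`	Check that always runs after every step (REQ preservation)
`Bundle`	Named collection of generated values (reused via target/consumes)

Application to folio: model hook state transitions (unregistered → registered → active → error) as a state machine and verify that the REQ invariants hold across the generated sequences of operations.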
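
To make the idea concrete without a third-party dependency, the sketch below swaps Hypothesis for a bounded exhaustive search: it enumerates every operation sequence up to a small length and checks invariants after each step. Hypothesis would instead generate and shrink sequences. The transition table, the operation names, and the invariants are all assumptions made for illustration.

```python
import itertools

# Assumed lifecycle: unregistered -> registered -> active -> error -> registered.
TRANSITIONS = {
    ("unregistered", "register"): "registered",
    ("registered", "activate"): "active",
    ("active", "fail"): "error",
    ("error", "reset"): "registered",
}
OPERATIONS = ("register", "activate", "fail", "reset")
STATES = {"unregistered", "registered", "active", "error"}


class HookLifecycle:
    """Toy model of a hook; operations that are illegal in a state are ignored."""

    def __init__(self):
        self.state = "unregistered"

    def apply(self, operation):
        self.state = TRANSITIONS.get((self.state, operation), self.state)

    def handle_event(self):
        # Only an active hook may handle tool events.
        return self.state == "active"


def check_all_sequences(max_len=5):
    """Check the invariants after every step of every sequence; return sequences checked."""
    checked = 0
    for length in range(1, max_len + 1):
        for ops in itertools.product(OPERATIONS, repeat=length):
            hook = HookLifecycle()
            for index, operation in enumerate(ops):
                hook.apply(operation)
                assert hook.state in STATES, ops
                # Invariant: a hook never becomes active without a prior register.
                if hook.state == "active":
                    assert "register" in ops[: index + 1], ops
                # Invariant: events are handled only while active.
                assert hook.handle_event() == (hook.state == "active"), ops
            checked += 1
    return checked


if __name__ == "__main__":
    print(check_all_sequences(5), "sequences checked")
```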

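§4.5 fast-check — Model-based testing (JS/TS, deduced)

Officially supports "using property-based testing for UIs, APIs, and state machines" [51]
Stateful tests: deterministic model → generate command sequences → execute on the real system → continuously compare the model with the results
fc.commands(...) generates the command sequence and fc.modelRun(...) (or fc.asyncModelRun) executes it against the model and the real system; runs under Jest/Mocha/Vitest
Application to folio: for a TypeScript plugin, property-test the plugin registry's state transitions with fast-check (candidate for Phase X4+)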

§4.6 EARS → Gherkin/BDD 1:1 conversion (verified)

EARS pattern	Syntax	Gherkin mapping
Ubiquitous	`The [system] shall [behavior]`	Given (initial) + When (any) + Then
Event-Driven	`When [trigger], the [system] shall [response]`	Given (precondition) + When (trigger) + Then
State-Driven	`While [state], the [system] shall [behavior]`	Given (state) + When + Then
Unwanted Behavior	`If [error], then the [system] shall [recovery]`	Given + When (error) + Then (recovery)
Optional Feature	`Where [feature enabled], the [system] shall [behavior]`	Given (feature enabled) + When + Then

Key conversion principle: 1 EARS requirement = 1 Gherkin scenario (demonstrated on conductofcode.io [52]). For automatic conversion, RequireKit generates Given-When-Then from EARS with its /generate-bdd command [53][54].

Conversion example for folio REQ-CM-001 (Event-Driven):

EARS: "WHEN AI agent attempts to write to spec_path, the system SHALL verify caller marker env var"
↓
Feature: spec edit caller marker enforcement
  Scenario: caller marker is set correctly → allow
    Given the AI agent has set FOLIO_ARCHITECT_CONTEXT=folio-architect
    When the agent attempts to Write to scratch/specs/example.html
    Then PreToolUse hook SHALL emit exit code 0 (allow)
  Scenario: caller marker missing → deny
    Given FOLIO_ARCHITECT_CONTEXT is unset
    When the agent attempts to Write to scratch/specs/example.html
    Then PreToolUse hook SHALL emit exit code 2 (block) with stderr message
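
The Event-Driven row of the mapping can be mechanized with a single regular expression. The sketch below is a hypothetical helper, not RequireKit's /generate-bdd. It produces only a scenario skeleton, and the Given clause has to be supplied by hand because the EARS sentence does not contain it.

```python
import re

# Event-Driven EARS: "When [trigger], the [system] shall [response]"
EVENT_DRIVEN = re.compile(
    r"^\s*when\s+(?P<trigger>.+?),\s*the\s+(?P<system>.+?)\s+shall\s+(?P<response>.+?)\.?\s*$",
    re.IGNORECASE,
)


def ears_event_to_gherkin(req_id, ears, given="the system is in its default state"):
    """Turn one Event-Driven EARS sentence into a Gherkin scenario skeleton."""
    match = EVENT_DRIVEN.match(ears)
    if match is None:
        raise ValueError(f"{req_id}: not an Event-Driven EARS sentence")
    return "\n".join(
        [
            f"Scenario: {req_id}",
            f"  Given {given}",
            f"  When {match['trigger']}",
            f"  Then the {match['system']} SHALL {match['response']}",
        ]
    )


if __name__ == "__main__":
    print(
        ears_event_to_gherkin(
            "REQ-CM-001",
            "WHEN AI agent attempts to write to spec_path, "
            "the system SHALL verify caller marker env var",
        )
    )
```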

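§4.7 LangChain trajectory eval / others

Trajectory vs output evaluation [44][45]: output evaluation looks only at the final answer and misses failures in internal decisions. Trajectory evaluation covers tool selection, intermediate reasoning, and the full sequence of conversation turns
Scoring Evaluator: custom criteria on a 1-10 scale; A/B testing with PairwiseStringEvalChain
LangGraph [49]: explicit state makes it inspectable (a good match for Hypothesis state machines)
AutoGen: emergent flow makes unit testing hard; integration tests over recorded transcripts are the mainstay
Mutation testing (Stryker / mutmut / LLMorpheus [55]): direct application to folio needs further study because it fits poorly with LLM non-determinism
Anthropic RSP eval [46]: three tiers of automated tests + standardized benchmarks + expert red-teaming (a reference for the folio eval pipeline)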

§5. Q4 + Q7 — Conformance test suite design + Observability instrumentation

§5.1 W3C Web Platform Tests (WPT) scenario file 構造 (verified)

testharness.js-based JS tests (the most common type)
reftest (HTML pair) [56]:
- <link rel="match" href="references/test-ref.html"> links to the reference
- the reference produces an equivalent rendering without using the technology under test
- rel="mismatch" can assert that the renderings differ
- fuzzy matching: <meta name=fuzzy content="maxDifference=15;totalPixels=300">
manifest (MANIFEST.json): generated automatically, updated with wpt manifest
Metadata (expectations) [57]: INI-like .ini files, placed as file.html.ini in a metadata/ directory that mirrors the tests/ structure
- example keys: expected: PASS / disabled: true / fuzzy: ...
- conditional value syntax: Python-like if expressions describe platform-dependent expectations

§5.2 TypeScript compiler test baseline (verified)

Two-directory model [59][60]:

tests/baselines/reference/   ← tracked in VCS (expected values)
tests/baselines/local/       ← generated at run time (.gitignore)

Baseline file extensions: .js (compiler output) / .errors.txt / .types / .symbols
Regression detection flow:
1. diff reference/ against local/
2. any difference → test FAIL
3. if the change is legitimate, promote it to reference with jake baseline-accept
4. jake baseline-accept[soft] accepts new files only (no deletions)
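
The two-directory flow can be sketched in a few lines. This is an illustration under assumptions, not the TypeScript build's implementation: `diff_baselines` and `accept_baselines` are hypothetical names, files are treated as text, and files that exist only in reference/ are not reported.

```python
import difflib
import shutil
import tempfile
from pathlib import Path


def diff_baselines(reference: Path, local: Path):
    """Return {relative_path: unified_diff} for files in local/ that differ from reference/."""
    failures = {}
    for produced in sorted(p for p in local.rglob("*") if p.is_file()):
        rel = produced.relative_to(local).as_posix()
        expected = reference / rel
        old = expected.read_text().splitlines(keepends=True) if expected.exists() else []
        new = produced.read_text().splitlines(keepends=True)
        if old != new:
            failures[rel] = "".join(
                difflib.unified_diff(old, new, f"reference/{rel}", f"local/{rel}")
            )
    return failures


def accept_baselines(reference: Path, local: Path):
    """Promote every file under local/ to reference/ (the baseline-accept step)."""
    for produced in (p for p in local.rglob("*") if p.is_file()):
        target = reference / produced.relative_to(local)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(produced, target)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        reference, local = Path(tmp, "reference"), Path(tmp, "local")
        reference.mkdir()
        local.mkdir()
        (reference / "caller-marker.txt").write_text("exit_code=0\n")
        (local / "caller-marker.txt").write_text("exit_code=2\n")
        for name, diff in diff_baselines(reference, local).items():
            print(f"FAIL {name}\n{diff}")
        accept_baselines(reference, local)
        print("after accept:", diff_baselines(reference, local))
```

A non-empty result from `diff_baselines` corresponds to step 2 (test FAIL), and `accept_baselines` corresponds to step 3.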

§5.3 Specmatic / Arazzo (OpenAPI/AsyncAPI conformance, verified)

The API specification is executed as-is as an "Executable Contract" [61][62]. Scenario format (*.arazzo_inputs.json):

{
  "WorkflowName": {
    "DEFAULT": { "productType": "Electronics", "maxPrice": 1000 },
    "GetProducts.IsArrayEmpty": { "maxPrice": 0 }
  }
}

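A DEFAULT base plus named scenarios that specify only the differences, which are then merged
$failureMessage gives the expected failure message
Verification flow: run the real service against the OpenAPI/AsyncAPI spec, cover happy + unhappy paths, and do closed-loop testing (payloads validated through references inside the spec)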
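
The "DEFAULT base + named scenario differences" idea amounts to a dictionary merge. The sketch below assumes a shallow merge in which the named scenario overrides DEFAULT key by key; whether Specmatic merges exactly this way was not verified, and `resolve_scenario` is a hypothetical name.

```python
def resolve_scenario(inputs, workflow, scenario="DEFAULT"):
    """Merge a named scenario's overrides on top of the workflow's DEFAULT inputs."""
    scenarios = inputs[workflow]
    merged = dict(scenarios.get("DEFAULT", {}))
    if scenario != "DEFAULT":
        merged.update(scenarios[scenario])  # shallow merge: an assumption
    return merged


if __name__ == "__main__":
    inputs = {
        "WorkflowName": {
            "DEFAULT": {"productType": "Electronics", "maxPrice": 1000},
            "GetProducts.IsArrayEmpty": {"maxPrice": 0},
        }
    }
    print(resolve_scenario(inputs, "WorkflowName"))
    print(resolve_scenario(inputs, "WorkflowName", "GetProducts.IsArrayEmpty"))
```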

§5.4 When to use golden files vs snapshot testing (verified)

Terminology: Golden file ≈ Snapshot ≈ Golden master ≈ Known-answer testing (essentially the same concept).

Situation	Recommendation	Example tool
Complex output structure (DOM, AST)	Snapshot testing	Jest `*.snap` [65]
Text/CLI output	Golden file	TypeScript baseline
Non-deterministic values	Mask dynamic values with property matchers	Jest `expect.any(Date)`
Small, focused output	Inline snapshot	insta `assert_snapshot!(value, @"...")` [66]
Large corpus	External `.snap`/`.yml`	pytest-regressions [68]

Four diff/regression detection patterns:

two-directory model (TypeScript): reference/ vs local/
single-file update (Jest): update .snap directly + track in VCS
accept/reject workflow (insta): .snap.new → review → promote to .snap
database model (http-conformance): results into a DB → notebook analysis
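
For golden files of plain text, the "mask dynamic values" row can be approximated by replacing volatile substrings before comparison. The sketch below is a minimal illustration; the two patterns (ISO 8601 timestamps and UUIDs) and the placeholder names are assumptions, and a real runner would likely need more.

```python
import re

# Each entry: (pattern for a volatile value, stable placeholder).
MASKS = [
    (
        re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})"),
        "<TIMESTAMP>",
    ),
    (
        re.compile(
            r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
            re.IGNORECASE,
        ),
        "<UUID>",
    ),
]


def mask_dynamic(text):
    """Replace run-dependent values so two runs produce identical golden text."""
    for pattern, placeholder in MASKS:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    line = '{"timestamp":"2024-01-01T00:00:00Z","uuid":"123e4567-e89b-12d3-a456-426614174000"}'
    print(mask_dynamic(line))
```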

§5.5 Claude Code transcript capture format (verified)

JSONL format (append-only) [70], under ~/.claude/projects/<encoded-project-path>/:

{"type":"user","uuid":"...","parentUuid":"...","timestamp":"2024-01-01T00:00:00Z","sessionId":"abc","cwd":"/path","gitBranch":"main","version":"1.0","message":{"role":"user","content":"Your prompt"}}
{"type":"assistant","uuid":"...","parentUuid":"...","timestamp":"...","message":{"role":"assistant","content":[{"type":"text","text":"..."},{"type":"tool_use","id":"call_1","name":"Bash","input":{"command":"npm test"}}],"usage":{...}}}
{"type":"user","message":{"role":"user","content":[{"type":"tool_result","tool_use_id":"call_1","content":"Test output"}]}}

Universal fields: type / uuid / parentUuid / timestamp / sessionId / cwd / gitBranch / version
Capture via hooks: transcript_path is included in the stdin JSON of every hook event; transcript_path / session_id / cwd are available from SessionStart
Hook stdin/stdout format: stdin JSON (session_id, transcript_path, cwd, tool_name, tool_input), stdout JSON (continue, decision, reason, systemMessage)
exit code: 0=success, 2=blocking (in PreToolUse the action is blocked), anything else=non-blocking error
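
Given the JSONL shape shown above, extracting the tool calls from a transcript takes one pass over the lines. The sketch below is a hypothetical helper that relies only on the fields in the sample records; since the format is not a documented API, treat the field names as assumptions that may change.

```python
import json


def iter_tool_uses(jsonl_text):
    """Yield (tool_name, tool_input) for every tool_use block in assistant records."""
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("type") != "assistant":
            continue
        content = record.get("message", {}).get("content", [])
        if isinstance(content, str):  # plain-text messages carry no tool calls
            continue
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                yield block.get("name"), block.get("input", {})


if __name__ == "__main__":
    transcript = "\n".join(
        [
            json.dumps({"type": "user", "message": {"role": "user", "content": "Your prompt"}}),
            json.dumps(
                {
                    "type": "assistant",
                    "message": {
                        "role": "assistant",
                        "content": [
                            {"type": "text", "text": "Running tests"},
                            {
                                "type": "tool_use",
                                "id": "call_1",
                                "name": "Bash",
                                "input": {"command": "npm test"},
                            },
                        ],
                    },
                }
            ),
        ]
    )
    for name, tool_input in iter_tool_uses(transcript):
        print(name, tool_input)
```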

§5.6 Emitting OpenTelemetry through CLI hooks (verified)

claude_telemetry [72] (hook-based wrapper): captures spans in PreToolUse/PostToolUse, span hierarchy claude.agent.run → tool.*, OTLP export
opentelemetry-instrumentation-claude-agent-sdk [73]: invoke_agent span (CLIENT) → execute_tool spans (INTERNAL), gen_ai.* attributes, distributed trace integration via TRACEPARENT/TRACESTATE, OTEL_LOG_RAW_API_BODIES=1 (v2.1.111+)
OTLP File Exporter [74]: JSON Lines (.jsonl), UTF-8, each line a complete JSON object, \n-delimited. Top-level: TracesData / MetricsData / LogsData (cannot be mixed)

§5.7 Scenario file format comparison (basis for the folio recommendation)

Property	YAML	JSON	Markdown
Human readability	High (comments allowed)	Medium	Highest
Parseability	Strict (needs care)	Fast, stable	Limited
Comments	Yes	No	Yes (prose)
LLM readability	Strong with GPT-family models	Risk of format failures	Best token efficiency
Main uses	Config, OpenAPI	API responses, Smithy	Documentation
Pitfalls	Norway problem, indentation	trailing comma	table parsing

Actual adoption: WPT (INI + HTML) / TypeScript (text + //// fourslash) / Specmatic Arazzo (JSON) / pytest-regressions (YAML) / Jest (custom .snap) / Smithy (YAML/JSON) / h2spec (embedded in Go code) / MCPSpec (YAML).

Recommendation for folio (detailed in §7): scenario definitions = YAML, expected output = JSON Lines / custom text, narrative = Markdown, diff display = text diff.

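§6. Q5 — Building isolated environments + feasibility of sandboxing Claude Code

§6.1 Comparison of 9 methods

Method	Declarative	Hermetic	Startup time	Fit for folio	API access control
Nix flake [75][76]	Highest	Highest (bit-for-bit)	~0.3 s (first run 20 min)	High (steep learning curve)	OS level
Devbox [77]	High (JSON)	High (same as Nix)	~1.5 s (first run 4 min)	Best (prototype stage)	OS level
mise [78]	Medium (TOML)	Medium (tool versions only)	~10ms	For pinning tool versions	None
git worktree [8][79]	None	None (files only)	Instant	Lightest (self-contained in the repo)	None
Docker / Podman [80][84][85]	Medium (Dockerfile)	Medium to high (digest)	ms to seconds (first run 8 min)	Suited to CI	Network control possible
devcontainer.json	Medium	Medium (image drift)	A few seconds	VS Code integration	Container boundary
Firecracker microVM [81]	Low	Highest (KVM)	~125ms	Overkill (hook verification only)	VM boundary
Vagrant / QEMU [82]	Medium (Vagrantfile)	High	Minutes	Overkill	VM boundary
WSL2 [83]	None	None	A few seconds	For Windows	None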

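§6.2 Verdict on "running Claude Code itself in a sandbox"

External access to api.anthropic.com is required for Claude Code to work. A full network cutoff is not possible
The API key (ANTHROPIC_API_KEY) must be injected before the sandbox starts. Adding it afterwards requires a container restart, which loses the conversation context
~/.claude (user-level settings) has no effect inside the sandbox. Only project-level settings (.claude/settings.json) apply
Hook unit tests (verifying script logic) need no API call. Testing JSON stdin / exit code / stdout is enough
Hook integration tests (does the hook fire correctly inside a real Claude session?) need API calls
Mocking the API-calling part of Claude Code itself is technically difficult (proprietary). The hook scripts are JSON I/O only and easy to mock

Official Docker sandbox: an sbx run claude command exists [9]. Authentication is via environment variable or OAuth. A Squid proxy can allow only the Anthropic API while blocking other domains.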

§6.3 Recommended 3-step pattern for folio

Step	Method	API call	Phase
Step 1 (no API)	Unit-test hook scripts directly with `echo '{...}' | bash ./hooks/script.sh`	Not needed	Start of Phase X3 = MUST
Step 2 (lightweight)	Create a test project in a git worktree (`.claude/worktrees/`) and configure hooks in `.claude/settings.json`	Possible with an injected API key	Mid Phase X3 = SHOULD
Step 3 (full)	Run Claude Code under Docker Compose (with Squid) only when integration tests are needed	Required (allowlist the Anthropic API only)	Phase X4+ = MAY

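§6.4 Boundary between "folio lightweight" and "full hermetic"

Lightweight (fits the folio prototype stage)

git worktree: under .claude/worktrees/, file isolation only. Runtime isolation (port collisions etc.) needs separate measures
Devbox: commit devbox.json to the repository and distribute it to Layer 1 consumers; hides the complexity of Nix

Full hermetic (after Phase X3 / production CI)

Nix flake: flake.nix + flake.lock + nix flake check in CI; steep learning curve
Docker Compose + Squid proxy: internal: true + a domain allowlist to control external API access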

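§7. Q8 — Recommended design of the sandbox verification framework for folio

§7.1 Design principles (aligned with the folio constitution)

P-3 (WHAT-only) + P-11 (no HOW): the verification framework itself specifies only the WHAT (which REQs are verified). The HOW (concrete runner / shell syntax / file paths) is isolated under scratch/verification/
P-12 (Layer 0 distributed as one unit): in the finished form the verification framework is also a candidate for distribution as part of .claude-plugin/. At the Phase X3 stage, however, it lives in scratch/verification/ as a prototype separate from the plugin itself
Prototype-driven philosophy: start from the verification scenario of one use case and expand step by step to cover all 8 use cases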

§7.2 Layout (inside scratch)

scratch/
├── specs/                          (existing)
├── decisions/                      (existing)
├── research/                       (existing; includes this report)
├── verification/                   ← new (this framework)
│   ├── README.html                 entry point + list of all scenarios
│   ├── scenarios/                  one scenario file per use case
│   │   ├── caller-marker.yaml      REQ-CM-001 to REQ-CM-003
│   │   ├── path-boundary.yaml      check-path-boundary.sh
│   │   ├── jsonld-lint.yaml        related to REQ-CI-010 to REQ-CI-015
│   │   ├── readme-update.yaml      auto-update on new spec
│   │   ├── context-inject.yaml     SessionStart inventory inject
│   │   ├── bidir-link.yaml         REQ-REL-002
│   │   ├── inventory-regen.yaml    REQ-REL-004
│   │   └── ears-coverage.yaml      REQ-CI-014
│   ├── fixtures/                   test data
│   │   ├── valid-spec.html         passes REQ-REL-001
│   │   ├── invalid-jsonld.html     fails the JSON-LD lint
│   │   └── ...
│   ├── baselines/                  expected values (golden file pattern)
│   │   ├── reference/              tracked in VCS
│   │   └── local/                  generated at run time (.gitignore)
│   └── runner.sh                   lightweight bash runner (prototype)

Rationale: scratch/ is a box that folio rules do not bind (stated in CLAUDE.md). Verification itself is HOW, so it is kept apart from the main specs/ (consistent with P-11).

§7.3 Recommended scenario format: YAML (in the style of MCPSpec / Promptfoo / pytest-regressions)

Standard structure of a folio scenario file (example: caller-marker.yaml):

# scratch/verification/scenarios/caller-marker.yaml
schema_version: "0.1"
req_id: "REQ-CM-001"
ears_pattern: "event-driven"
description: |
  WHEN AI agent attempts to write to spec_path,
  the system SHALL verify caller marker env var.

setup:
  env:
    FOLIO_ARCHITECT_CONTEXT: "folio-architect"
  fixtures:
    - source: fixtures/valid-spec.html
      dest: scratch/specs/test-target.html

scenarios:
  - name: "caller marker set correctly → allow"
    given:
      env: { FOLIO_ARCHITECT_CONTEXT: "folio-architect" }
    when:
      tool: "Edit"
      file_path: "scratch/specs/test-target.html"
      tool_input: { ... }
    expect:
      exit_code: 0
      stderr: ""
      decision: "allow"

  - name: "caller marker missing → deny (exit 2 fallback)"
    given:
      env: { FOLIO_ARCHITECT_CONTEXT: "" }
    when:
      tool: "Edit"
      file_path: "scratch/specs/test-target.html"
    expect:
      exit_code: 2          # reliable fallback against Issue #37210/#33106
      stderr_contains: "FOLIO_ARCHITECT_CONTEXT"
      decision: "deny"

teardown:
  cleanup_files:
    - scratch/specs/test-target.html

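Rationale for the design decisions:

YAML (from the §5.7 comparison table): comments allowed, clear structure, proven in MCPSpec
1:1 mapping from EARS to Gherkin Given/When/Then (from §4.6): ties directly to the REQ-ID
Exit-code-centered assertions (from §2.6): permissionDecision: deny cannot be trusted, so verify exit 2 + stderr
Separate fixtures and baselines (from §5.1, §5.2, §5.4): a blend of the TypeScript baseline pattern and the WPT reftest pattern
Setup / Teardown (common pattern, §3.3): equivalent to before_each / after_each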
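
How the `expect` block might be evaluated can be sketched independently of how the YAML is parsed. In the illustration below the scenario is assumed to be already loaded into a dict, `execute` is a hypothetical callback that runs the hook and returns its exit code and output, and the `decision` key is deliberately left unevaluated because the source does not say how it would be derived.

```python
def check_expectations(expect, result):
    """Compare a scenario's `expect` mapping with a captured hook result.

    Returns a list of failure messages; an empty list means the scenario passed.
    """
    failures = []
    if "exit_code" in expect and result["exit_code"] != expect["exit_code"]:
        failures.append(f"exit_code: expected {expect['exit_code']}, got {result['exit_code']}")
    if "stderr" in expect and result["stderr"] != expect["stderr"]:
        failures.append(f"stderr: expected {expect['stderr']!r}, got {result['stderr']!r}")
    needle = expect.get("stderr_contains")
    if needle is not None and needle not in result["stderr"]:
        failures.append(f"stderr_contains: {needle!r} not found")
    return failures


def run_scenarios(scenarios, execute):
    """Run each scenario through `execute`; return (process exit status, report)."""
    report = {}
    for scenario in scenarios:
        report[scenario["name"]] = check_expectations(scenario["expect"], execute(scenario))
    status = 0 if all(not failures for failures in report.values()) else 1
    return status, report


if __name__ == "__main__":
    scenarios = [
        {
            "name": "caller marker missing → deny (exit 2 fallback)",
            "expect": {"exit_code": 2, "stderr_contains": "FOLIO_ARCHITECT_CONTEXT"},
        }
    ]

    def fake_execute(scenario):
        # Stand-in for actually piping JSON into the hook script.
        return {"exit_code": 2, "stdout": "", "stderr": "FOLIO_ARCHITECT_CONTEXT is not set"}

    print(run_scenarios(scenarios, fake_execute))
```

Returning 0 or 1 from the aggregate keeps the runner compatible with the exit-code CI gating described in §3.3.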

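§7.4 Runner selection trade-offs

Option	Advantages	Drawbacks	Fit for folio
Custom bash runner (`runner.sh`)	Fast to prototype, zero dependencies, easy P-11 isolation, close fit with hook scripts	No standardized scenario spec	Start of Phase X3 = recommended
Promptfoo (YAML + llm-rubric) [42]	LLM judge integration, OSS track record, natural YAML	Heavyweight, not built for plugin verification	Phase X4+ (when LLM behavior eval is needed)
MCPSpec (YAML + 10 assertion types) [31]	MCP-specific, but its general YAML scenario pattern is a useful reference	MCP-specific namespace, not a direct fit for folio	Reference only
Inspect AI (Python Task/Scorer) [38]	Feature-rich, sandbox integration, multi-agent	Python dependency, heavy for the prototype stage	Phase X4+ (when adopting the Anthropic eval methodology)
pytest-regressions + Hypothesis [68][50]	Golden files + property tests, Python standard	Requires Python, overkill for hook-script-centered folio	Can be adopted in Step 2-3 for testing `validate_jsonld.py`

Recommendation for Phase X3: the custom bash runner. It matches the prototype-driven philosophy. Inspect AI / Promptfoo are integration candidates for the finished form, but the first goal is something that works, starting from one scenario.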

§7.5 Verification scenario design for the 8 use cases

Eight use cases, derived from the REQs and consistent with the Phase X3 minimal plugin (1 skill + 4 hooks + 6 scripts + 1 CLI) in plugin-architecture-research.html §7.3:

#	Use case	Scenario file	REQ source	Step	Verification method
1	caller marker check	`caller-marker.yaml`	REQ-CM-001 to REQ-CM-003	Step 2	echo JSON pipe + exit code
2	path boundary check	`path-boundary.yaml`	(rules.html §10)	Step 2	echo JSON pipe + exit code
3	JSON-LD lint	`jsonld-lint.yaml`	rules §4.1 + relations §3	Step 3	fixture + lint script + golden output
4	README update	`readme-update.yaml`	(relations §4 inventory)	Step 4	add a new spec → README diff baseline
5	context injection (SessionStart)	`context-inject.yaml`	(Beads pattern)	Step 5	session start hook → injected content snapshot
6	bidirectional link materialization	`bidir-link.yaml`	REQ-REL-002	Step 8	forward declaration → automatically added reverse link baseline
7	inventory.json regen	`inventory-regen.yaml`	REQ-REL-004	Step 5	spec edit → inventory.json diff baseline
8	EARS coverage report	`ears-coverage.yaml`	REQ-CI-014	Step 7	grep for EARS markup completeness + REQ-ID uniqueness

§7.6 CI integration + observation instrumentation

Automatic sandbox verification on every PR: GitHub Actions runs bash scratch/verification/runner.sh and gates the merge on exit code 0/1
Phase X3 minimal scope: only scenarios 1-3 (caller marker + path boundary + JSON-LD lint) gate CI; the rest only warn → promoted to failures step by step
transcript capture: copy ~/.claude/projects/<path>/*.jsonl to scratch/verification/transcripts/ at session end (§5.5)
hook stdio capture: the runner writes stdin/stdout/stderr/exit_code to golden files (baselines/local/)
OpenTelemetry is optional: at the prototype stage, direct JSON Lines (.jsonl) output is enough (§5.6)
snapshot pattern: the TypeScript 2-dir model of baselines/reference/ (VCS) vs baselines/local/ (gitignore) + promotion to reference with runner.sh --accept (§5.2, §5.4)

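§8. Q9 — Explicit boundary with twill's experiment-verified

§8.1 twill's experiment-verified pattern (reference)

twill attaches an experiment-verified label to each individual piece of information (HOW level) inside a spec (memory hash a6d6b7c1 / 09550ec2):

4-state status: proposed / accepted / experiment-verified / archived
Each claim is verified on a real machine with bats / smoke tests
Individually hyperlinked by numbers such as EXP-019 to EXP-031
The verified label is embedded in the spec content itself (information-unit HOW)

§8.2 folio's policy: sandbox verification per functional unit (use case)

Axis	twill (information unit)	folio (functional unit)
Verification target	Each claim in the spec (HOW level)	Plugin use cases (functional units)
Verification marker	`experiment-verified` label inline in the spec	Scenario YAML file (outside the spec body, in `scratch/verification/`)
Verification granularity	1 claim = 1 EXP	1 use case = 1 scenario (per REQ, with EARS)
Spec content	Verified per information unit	WHAT-only (no verified label)
Constitution alignment	(twill-specific)	Complies with P-3 + P-11

§8.3 Why folio does not adopt the twill approach

Violates P-3 WHAT-only: writing information-unit verified labels in the spec makes the spec carry HOW (which EXP verified it, which commands were run, and so on)
P-11 absolute ban on HOW: the HOW of experiment-verified (executed commands, snippets of verification results) cannot be written in the spec body
Conflicts with Layer 0 unified distribution (P-12): information-unit verification makes the spec implementation-dependent, so the spec stops being the "future ideal anchor" (P-1)

§8.4 Grounds for splitting information-unit verification into a separate plugin later

There is design room to provide folio-claim-verifier (working name) in the future as an optional plugin for folio Layer 0:

Attach data-claim-id="C-001" etc. to each inline claim in a spec
Manage per-claim verify status in separate files (scratch/claims/C-001.html etc.)
Two-layer structure: folio core (this verification framework) does functional-unit sandbox verification; the separate plugin does information-unit verification
Out of scope for this research (only the functional unit is specified here; the information unit is isolated in a separate plugin)

§8.5 This verification framework specifies only the WHAT

scratch/verification/scenarios/*.yaml specifies only the WHAT (which REQ is verified and how, per use case)
The HOW (runner.sh implementation, fixture contents, binary format of baselines) is isolated under scratch/verification/
When porting to architecture/verification/ in the future, split the bindings into .claude-plugin/scripts/
Information-unit verification is isolated in a separate plugin (future work)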

§9. Q10 — ADR candidates

Candidate ADR ID	Title	Content	Related findings	Priority
ADR-0013	plugin verification framework (sandbox design + scenario format + runner)	Place the sandbox verification framework for folio Phase X3 in `scratch/verification/`. YAML scenarios + bash runner + 2-dir baseline. Derive the 8 use cases 1:1 from the REQs. Kept separate from the main spec for P-3/P-11 alignment.	All of §7	MUST (before starting Phase X3)
ADR-0014	Norm for deriving verification scenarios from EARS REQs	Convert the 5 EARS patterns 1:1 into Gherkin Given/When/Then. Formalize rules equivalent to RequireKit `/generate-bdd` for folio. 1 REQ = 1 scenario is a MUST.	§4.6 (EARS→Gherkin), §7.3 (scenario format)	MUST
ADR-0015	Boundary between sandbox verification and experiment-verified (twill approach)	folio adopts functional-unit sandbox verification; information-unit experiment-verified becomes a separate plugin in the future. States the alignment with P-3 (WHAT-only) + P-11 (no HOW) explicitly, and the grounds for not adopting the twill approach.	All of §8	MUST
ADR-0016	Exit-code-centered assertion norm (countermeasure for Issue #37210/#33106)	The `permissionDecision: deny` JSON cannot be trusted (bugs closed as not planned). Verification scenario assertions use `exit_code: 2` + `stderr_contains` as the reliable fallback.	§2.6 (issue verification), §7.3 (scenario format)	SHOULD (after confirming the fix status of Issue #39344)
ADR-0017	Boundary between hook unit tests and integration tests	Unit tests (echo JSON pipe + exit code) need no API call and are a MUST for Phase X3. Integration tests (hook firing inside a real Claude session) need API calls; Docker Compose + Squid is the candidate for Phase X4+.	§6.2, §6.3	SHOULD
ADR-0018	Golden baseline management for scenario files (2-dir model + accept workflow)	Adopt the TypeScript baseline pattern + the insta accept workflow. `baselines/reference/` (VCS) vs `baselines/local/` (gitignore), promoted with `runner.sh --accept`.	§5.2, §5.4, §7.3	SHOULD

Filing procedure: for each ADR, (1) present it to the user → (2) obtain user approval → (3) file it in scratch/decisions/ADR-NNNN-<slug>.html with Accepted status → (4) add cross-references to the related spec sections. Automatic filing is prohibited by the constraints in CLAUDE.md.

§10. Open Questions / Future Work

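§10.1 Gaps (blind spots just before implementation, detected by the critic)

Gap 1 (MUST): Fix for Issue #39344 not yet confirmed: §2.6 records it as "Closed (2026-03-26)", but whether a fix actually shipped is unconfirmed. ADR-0016's recommendation that "exit code 2 is the reliable fallback" rests on this unconfirmed state. If #39344 has been fixed, the permissionDecision: deny JSON may also be trustworthy, and the assertion strategy would change. Suggested searches: site:github.com/anthropics/claude-code/issues/39344 fix resolved commit、 "Claude Code" changelog v2.1 "permissions" "hook" fix release notes.
Gap 2 (MUST): runner.sh has no specification: §7.4 recommends runner.sh as the core of Phase X3, but its concrete interface (how YAML scenarios are parsed, the assertion evaluation logic, how exit codes are aggregated) is not described. Filing ADR-0013 requires at least the minimal design decisions for runner.sh (bash + yq, Python + PyYAML, and so on). Suggested searches: bash YAML parser yq script lightweight test runner exit code assertion、 "yq" YAML parsing bash script test framework minimal 2026.
Gap 3 (SHOULD): Full list of claude plugin validate checks: §2.1/§2.5 confirm that it performs "the same checks as the marketplace review", but the concrete list of all check items has not been obtained (the official docs say only "syntax and schema errors"). The conditions folio's plugin must meet to pass validate are therefore incomplete. Suggested searches: "claude plugin validate" checks items list "strict" mode output format、 site:github.com/anthropics/claude-code "plugin.json" schema required fields validation.
Gap 4 (SHOULD): URL for Issue #18312: mentioned in the 4-issue bug table of §2.6, but the Evidence Table has no URL. It should be filled in as the fourth data point. Suggested search: site:github.com/anthropics/claude-code/issues "18312" Bash allow list deny ignore.
Gap 5 (LOW): Direct check of the official fast-check docs: §4.5 carries the (deduced) label (GitHub OSS README level). Concrete usage examples of the fc.ModelBasedTesting API have not been obtained from the official docs (fast-check.dev). Direct reference may be needed when implementing a folio TypeScript plugin (candidate for Phase X4+). Suggested searches: fast-check "ModelBasedTesting" "commands" API documentation TypeScript example、 site:fast-check.dev model-based testing stateful.

§10.2 Citation WARNINGs (detected by the critic)

Issue #18312 URL missing: mentioned in §2.6 but the Evidence Table has no URL (overlaps with Gap 4). The effect on the main conclusion (exit code 2 fallback) is minor, but verification as an independent source is incomplete
Single source for the LangSmith integration: the LANGCHAIN_TRACING_V2=true claim in §2.4 has no dedicated URL in the Evidence Table. No dedicated LangSmith integration page could be found in the official Claude Code docs. The main transcript capture method (automatic JSONL saving) is officially verified; LangSmith should stay supplementary information
Confidence label for fast-check: §4.5 has been corrected to (deduced) (already reflected in this report). The functional description itself is accurate; not adopted in Phase X3

§10.3 Out of scope for this research (candidates for future research)

Separate plugin for information-unit experiment-verified (folio-claim-verifier): the future concept mentioned in §8.4. Combines the benefits of the twill approach with P-3/P-11 alignment. To be researched separately in Phase X4+
Details of integrating Inspect AI / Promptfoo into folio: positioned in §7.4 as Phase X4+ candidates. To be re-examined when designing the finished verification framework
Plugin test details for Continue.dev / Aider / Cursor: thinly covered in §3.1. The effect on the folio design is minor
Integration of Inspect AI with the Claude Code sandbox: the Docker/K8s sandbox feature exists, but how to integrate it with the official Claude Code sandbox is unconfirmed
Complete JSON schema of MANIFEST.json (WPT): hard to inspect because the file is large; may be re-examined when designing the folio inventory.json schema
Concrete fixtures of the LSP test suite (JSON-RPC request/response pairs): requires reading the actual vscode-languageserver-node code directly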

§11. References (Evidence Table)

87 sources in total (many official docs + many GitHub OSS repositories + many tech blogs + 4 GitHub Issues). Checked on 2026-05-23.

§12. Meta information

session_id: 59a916c0
created_at: 2026-05-23T14:31:01+09:00
researchers: 5 in parallel (researcher-1 to researcher-5, sonnet model)
sources_total: 87 (many official docs + GitHub OSS + tech blogs + community + 4 GitHub Issues)
fetch_success / failure: 67 / 4 (Obsidian testing GitHub, Aider GitHub, Specmatic AsyncAPI 404, insta.rs details)
critique_verdict: PASS (3 WARNINGs + 5 Gaps, already reflected in §10)
snapshot: /tmp/research-search-59a916c0/
Related memory hashes: f621abb6 (plugin-architecture-research, 2026-05-23) / 96b16f0c (twl Hooks/MCP, 2026-05-08) / 322df262 (Claude Code v2.1 update, 2026-03-31) / a6d6b7c1 (twill dig part 4, 2026-05-13, reference for experiment-verified)