systematic_debugging_skill_guide_with_condition_based_waiting.ts
# SKILL.md

---
name: systematic-debugging
description: Use when encountering any bug, test failure, or unexpected behavior, before proposing fixes
---

# Systematic Debugging

## Overview

Random fixes waste time and create new bugs. Quick patches mask underlying issues.

**Core principle:** ALWAYS find root cause before attempting fixes. Symptom fixes are failure.

**Violating the letter of this process is violating the spirit of debugging.**

## The Iron Law

```
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
```

If you haven't completed Phase 1, you cannot propose fixes.

## When to Use

Use for ANY technical issue:
- Test failures
- Bugs in production
- Unexpected behavior
- Performance problems
- Build failures
- Integration issues

**Use this ESPECIALLY when:**
- Under time pressure (emergencies make guessing tempting)
- "Just one quick fix" seems obvious
- You've already tried multiple fixes
- The previous fix didn't work
- You don't fully understand the issue

**Don't skip when:**
- The issue seems simple (simple bugs have root causes too)
- You're in a hurry (rushing guarantees rework)
- Your manager wants it fixed NOW (systematic is faster than thrashing)

## The Four Phases

You MUST complete each phase before proceeding to the next.

### Phase 1: Root Cause Investigation

**BEFORE attempting ANY fix:**

1. **Read Error Messages Carefully**
   - Don't skip past errors or warnings
   - They often contain the exact solution
   - Read stack traces completely
   - Note line numbers, file paths, error codes

2. **Reproduce Consistently**
   - Can you trigger it reliably?
   - What are the exact steps?
   - Does it happen every time?
   - If not reproducible → gather more data, don't guess

3. **Check Recent Changes**
   - What changed that could cause this?
   - Git diff, recent commits
   - New dependencies, config changes
   - Environmental differences

4. **Gather Evidence in Multi-Component Systems**

   **WHEN the system has multiple components (CI → build → signing, API → service → database):**

   **BEFORE proposing fixes, add diagnostic instrumentation:**
   ```
   For EACH component boundary:
   - Log what data enters the component
   - Log what data exits the component
   - Verify environment/config propagation
   - Check state at each layer

   Run once to gather evidence showing WHERE it breaks
   THEN analyze the evidence to identify the failing component
   THEN investigate that specific component
   ```

   **Example (multi-layer system):**
   ```bash
   # Layer 1: Workflow
   echo "=== Secrets available in workflow: ==="
   # Report SET/UNSET without echoing the secret value itself
   echo "IDENTITY: $([ -n "${IDENTITY:-}" ] && echo SET || echo UNSET)"

   # Layer 2: Build script
   echo "=== Env vars in build script: ==="
   env | grep -q '^IDENTITY=' && echo "IDENTITY present" || echo "IDENTITY not in environment"

   # Layer 3: Signing script
   echo "=== Keychain state: ==="
   security list-keychains
   security find-identity -v

   # Layer 4: Actual signing
   codesign --sign "$IDENTITY" --verbose=4 "$APP"
   ```

   **This reveals:** which layer fails (secrets → workflow ✓, workflow → build ✗)

5. **Trace Data Flow**

   **WHEN the error is deep in the call stack:**

   See `root-cause-tracing.md` in this directory for the complete backward-tracing technique.

   **Quick version:**
   - Where does the bad value originate?
   - What called this with the bad value?
   - Keep tracing up until you find the source
   - Fix at the source, not at the symptom

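The "fix at the source, not at the symptom" step can be sketched in TypeScript. This is a hypothetical example (all names invented, not from the skill's codebase): a `NaN` that would otherwise surface deep inside `computeTotal` is rejected at the point where the bad data enters.

```typescript
// Hypothetical sketch: a NaN surfaces deep in the call stack.
// A symptom fix would patch computeTotal to tolerate NaN;
// the root-cause fix validates where the bad value originates.

// Source: where the bad value actually enters the system
function parsePrice(raw: string | undefined): number {
  const price = Number(raw);
  if (raw === undefined || Number.isNaN(price)) {
    // Fix at the source: fail loudly where the bad data enters
    throw new Error(`Invalid price input: ${JSON.stringify(raw)}`);
  }
  return price;
}

// Intermediate layer: just passes the value along
function lineTotal(raw: string | undefined, qty: number): number {
  return parsePrice(raw) * qty;
}

// Deep in the stack: where the symptom (NaN total) would have appeared
function computeTotal(items: Array<{ raw?: string; qty: number }>): number {
  return items.reduce((sum, i) => sum + lineTotal(i.raw, i.qty), 0);
}
```

Tracing backward from `computeTotal` through `lineTotal` to `parsePrice` locates the origin; patching only `computeTotal` would mask it.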
### Phase 2: Pattern Analysis

**Find the pattern before fixing:**

1. **Find Working Examples**
   - Locate similar working code in the same codebase
   - What works that's similar to what's broken?

2. **Compare Against References**
   - If implementing a pattern, read the reference implementation COMPLETELY
   - Don't skim - read every line
   - Understand the pattern fully before applying it

3. **Identify Differences**
   - What's different between working and broken?
   - List every difference, however small
   - Don't assume "that can't matter"

4. **Understand Dependencies**
   - What other components does this need?
   - What settings, config, environment?
   - What assumptions does it make?

### Phase 3: Hypothesis and Testing

**Scientific method:**

1. **Form a Single Hypothesis**
   - State clearly: "I think X is the root cause because Y"
   - Write it down
   - Be specific, not vague

2. **Test Minimally**
   - Make the SMALLEST possible change to test the hypothesis
   - One variable at a time
   - Don't fix multiple things at once

3. **Verify Before Continuing**
   - Did it work? Yes → Phase 4
   - Didn't work? Form a NEW hypothesis
   - DON'T add more fixes on top

4. **When You Don't Know**
   - Say "I don't understand X"
   - Don't pretend to know
   - Ask for help
   - Research more

### Phase 4: Implementation

**Fix the root cause, not the symptom:**

1. **Create a Failing Test Case**
   - Simplest possible reproduction
   - Automated test if possible
   - One-off test script if no framework
   - MUST have this before fixing
   - Use the `superpowers:test-driven-development` skill for writing proper failing tests

2. **Implement a Single Fix**
   - Address the root cause identified
   - ONE change at a time
   - No "while I'm here" improvements
   - No bundled refactoring

3. **Verify the Fix**
   - Does the test pass now?
   - Are other tests still passing?
   - Is the issue actually resolved?

4. **If the Fix Doesn't Work**
   - STOP
   - Count: how many fixes have you tried?
   - If < 3: return to Phase 1 and re-analyze with the new information
   - **If ≥ 3: STOP and question the architecture (Step 5 below)**
   - DON'T attempt fix #4 without an architectural discussion

5. **If 3+ Fixes Failed: Question the Architecture**

   **Patterns indicating an architectural problem:**
   - Each fix reveals new shared state, coupling, or problems in a different place
   - Fixes require "massive refactoring" to implement
   - Each fix creates new symptoms elsewhere

   **STOP and question fundamentals:**
   - Is this pattern fundamentally sound?
   - Are we "sticking with it through sheer inertia"?
   - Should we refactor the architecture rather than continue fixing symptoms?

   **Discuss with your human partner before attempting more fixes.**

   This is NOT a failed hypothesis - this is a wrong architecture.

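The "one-off test script if no framework" option in Step 1 above can be as small as the following sketch. The bug it reproduces is hypothetical (a slug function that mishandled whitespace); every name here is invented for illustration.

```typescript
// One-off reproduction script (no test framework): fails loudly until
// the root cause is fixed, then prints PASS. Hypothetical example.

function slugify(title: string): string {
  // The behavior under test: trim, lowercase, collapse whitespace to hyphens
  return title.trim().toLowerCase().replace(/\s+/g, "-");
}

// Simplest possible reproduction of the reported input
const actual = slugify("  Hello   World  ");
const expected = "hello-world";

if (actual !== expected) {
  throw new Error(`FAIL: expected "${expected}", got "${actual}"`);
}
console.log("PASS");
```

Run it before the fix to confirm it fails, and after the fix to confirm it passes; that ordering is what proves the fix addressed the reproduction.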
## Red Flags - STOP and Follow the Process

If you catch yourself thinking:
- "Quick fix for now, investigate later"
- "Just try changing X and see if it works"
- "Add multiple changes, run tests"
- "Skip the test, I'll manually verify"
- "It's probably X, let me fix that"
- "I don't fully understand but this might work"
- "Pattern says X but I'll adapt it differently"
- "Here are the main problems: [lists fixes without investigation]"
- Proposing solutions before tracing data flow
- **"One more fix attempt" (when you've already tried 2+)**
- **Each fix reveals a new problem in a different place**

**ALL of these mean: STOP. Return to Phase 1.**

**If 3+ fixes failed:** question the architecture (see Phase 4, Step 5)

## Your Human Partner's Signals You're Doing It Wrong

**Watch for these redirections:**
- "Is that not happening?" - You assumed without verifying
- "Will it show us...?" - You should have added evidence gathering
- "Stop guessing" - You're proposing fixes without understanding
- "Ultrathink this" - Question fundamentals, not just symptoms
- "We're stuck?" (frustrated) - Your approach isn't working

**When you see these:** STOP. Return to Phase 1.

## Common Rationalizations

| Excuse | Reality |
|--------|---------|
| "Issue is simple, don't need process" | Simple issues have root causes too. The process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | The first fix sets the pattern. Do it right from the start. |
| "I'll write the test after confirming the fix works" | Untested fixes don't stick. A test written first proves it. |
| "Multiple fixes at once saves time" | You can't isolate what worked. Causes new bugs. |
| "Reference too long, I'll adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
| "I see the problem, let me fix it" | Seeing symptoms ≠ understanding the root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question the pattern, don't fix again. |

## Quick Reference

| Phase | Key Activities | Success Criteria |
|-------|---------------|------------------|
| **1. Root Cause** | Read errors, reproduce, check changes, gather evidence | Understand WHAT and WHY |
| **2. Pattern** | Find working examples, compare | Identify differences |
| **3. Hypothesis** | Form theory, test minimally | Confirmed or new hypothesis |
| **4. Implementation** | Create test, fix, verify | Bug resolved, tests pass |

## When the Process Reveals "No Root Cause"

If systematic investigation reveals the issue is truly environmental, timing-dependent, or external:

1. You've completed the process
2. Document what you investigated
3. Implement appropriate handling (retry, timeout, error message)
4. Add monitoring/logging for future investigation

**But:** 95% of "no root cause" cases are incomplete investigations.

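For genuinely environmental failures, the "appropriate handling" in step 3 above can be sketched as a bounded retry with exponential backoff plus logging for future investigation. This is a generic utility with invented names, not something taken from the skill itself.

```typescript
// Bounded retries with exponential backoff. Each failure is logged so a
// real root cause can still be investigated later; after the final
// attempt, the last error is rethrown rather than swallowed.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 100
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Monitoring hook: record every failed attempt
      console.warn(`Attempt ${i + 1}/${attempts} failed:`, err);
      if (i < attempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

Usage might look like `await withRetry(() => fetchRemoteConfig(), 3)` (where `fetchRemoteConfig` is a hypothetical flaky call). The point is that retries are a documented, bounded response to a completed investigation, not a substitute for one.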
## Supporting Techniques

These techniques are part of systematic debugging and available in this directory:

- **`root-cause-tracing.md`** - Trace bugs backward through the call stack to find the original trigger
- **`defense-in-depth.md`** - Add validation at multiple layers after finding the root cause
- **`condition-based-waiting.md`** - Replace arbitrary timeouts with condition polling

**Related skills:**
- **superpowers:test-driven-development** - For creating the failing test case (Phase 4, Step 1)
- **superpowers:verification-before-completion** - Verify the fix worked before claiming success

## Real-World Impact

From debugging sessions:
- Systematic approach: 15-30 minutes to fix
- Random-fixes approach: 2-3 hours of thrashing
- First-time fix rate: 95% vs 40%
- New bugs introduced: near zero vs common

# condition-based-waiting-example.ts

```typescript
// Complete implementation of condition-based waiting utilities
// From: Lace test infrastructure improvements (2025-10-03)
// Context: Fixed 15 flaky tests by replacing arbitrary timeouts

import type { ThreadManager } from '~/threads/thread-manager';
import type { LaceEvent, LaceEventType } from '~/threads/types';

/**
 * Wait for a specific event type to appear in a thread
 *
 * @param threadManager - The thread manager to query
 * @param threadId - Thread to check for events
 * @param eventType - Type of event to wait for
 * @param timeoutMs - Maximum time to wait (default 5000ms)
 * @returns Promise resolving to the first matching event
 *
 * Example:
 *   await waitForEvent(threadManager, agentThreadId, 'TOOL_RESULT');
 */
export function waitForEvent(
  threadManager: ThreadManager,
  threadId: string,
  eventType: LaceEventType,
  timeoutMs = 5000
): Promise<LaceEvent> {
  return new Promise((resolve, reject) => {
    const startTime = Date.now();

    const check = () => {
      const events = threadManager.getEvents(threadId);
      const event = events.find((e) => e.type === eventType);

      if (event) {
        resolve(event);
      } else if (Date.now() - startTime > timeoutMs) {
        reject(new Error(`Timeout waiting for ${eventType} event after ${timeoutMs}ms`));
      } else {
        setTimeout(check, 10); // Poll every 10ms (responsive without busy-waiting)
      }
    };

    check();
  });
}

/**
 * Wait for a specific number of events of a given type
 *
 * @param threadManager - The thread manager to query
 * @param threadId - Thread to check for events
 * @param eventType - Type of event to wait for
 * @param count - Number of events to wait for
 * @param timeoutMs - Maximum time to wait (default 5000ms)
 * @returns Promise resolving to all matching events once count is reached
 *
 * Example:
 *   // Wait for 2 AGENT_MESSAGE events (initial response + continuation)
 *   await waitForEventCount(threadManager, agentThreadId, 'AGENT_MESSAGE', 2);
 */
export function waitForEventCount(
  threadManager: ThreadManager,
  threadId: string,
  eventType: LaceEventType,
  count: number,
  timeoutMs = 5000
): Promise<LaceEvent[]> {
  return new Promise((resolve, reject) => {
    const startTime = Date.now();

    const check = () => {
      const events = threadManager.getEvents(threadId);
      const matchingEvents = events.filter((e) => e.type === eventType);

      if (matchingEvents.length >= count) {
        resolve(matchingEvents);
      } else if (Date.now() - startTime > timeoutMs) {
        reject(
          new Error(
            `Timeout waiting for ${count} ${eventType} events after ${timeoutMs}ms (got ${matchingEvents.length})`
          )
        );
      } else {
        setTimeout(check, 10);
      }
    };

    check();
  });
}

/**
 * Wait for an event matching a custom predicate
 * Useful when you need to check event data, not just type
 *
 * @param threadManager - The thread manager to query
 * @param threadId - Thread to check for events
 * @param predicate - Function that returns true when an event matches
 * @param description - Human-readable description for error messages
 * @param timeoutMs - Maximum time to wait (default 5000ms)
 * @returns Promise resolving to the first matching event
 *
 * Example:
 *   // Wait for TOOL_RESULT with a specific ID
 *   await waitForEventMatch(
 *     threadManager,
 *     agentThreadId,
 *     (e) => e.type === 'TOOL_RESULT' && e.data.id === 'call_123',
 *     'TOOL_RESULT with id=call_123'
 *   );
 */
export function waitForEventMatch(
  threadManager: ThreadManager,
  threadId: string,
  predicate: (event: LaceEvent) => boolean,
  description: string,
  timeoutMs = 5000
): Promise<LaceEvent> {
  return new Promise((resolve, reject) => {
    const startTime = Date.now();

    const check = () => {
      const events = threadManager.getEvents(threadId);
      const event = events.find(predicate);

      if (event) {
        resolve(event);
      } else if (Date.now() - startTime > timeoutMs) {
        reject(new Error(`Timeout waiting for ${description} after ${timeoutMs}ms`));
      } else {
        setTimeout(check, 10);
      }
    };

    check();
  });
}

// Usage example from an actual debugging session:
//
// BEFORE (flaky):
// ---------------
// const messagePromise = agent.sendMessage('Execute tools');
// await new Promise(r => setTimeout(r, 300)); // Hope tools start in 300ms
// agent.abort();
// await messagePromise;
// await new Promise(r => setTimeout(r, 50)); // Hope results arrive in 50ms
// expect(toolResults.length).toBe(2); // Fails randomly
//
// AFTER (reliable):
// ----------------
// const messagePromise = agent.sendMessage('Execute tools');
// await waitForEventCount(threadManager, threadId, 'TOOL_CALL', 2); // Wait for tools to start
// agent.abort();
// await messagePromise;
// await waitForEventCount(threadManager, threadId, 'TOOL_RESULT', 2); // Wait for results
// expect(toolResults.length).toBe(2); // Always succeeds
//
// Result: 60% pass rate → 100%, 40% faster execution
```
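The three helpers above share one polling pattern: check a predicate, resolve on success, reject on timeout, otherwise poll again. A generic, framework-free version of that pattern (an assumed generalization for illustration, not part of the Lace codebase) might look like:

```typescript
// Generic condition-based waiting, independent of ThreadManager.
// Polls an arbitrary probe function until it returns a defined value
// or the timeout elapses.
export function waitForCondition<T>(
  probe: () => T | undefined,
  description: string,
  timeoutMs = 5000,
  pollMs = 10
): Promise<T> {
  return new Promise((resolve, reject) => {
    const startTime = Date.now();

    const check = () => {
      const value = probe();
      // Compare against undefined (not truthiness) so 0, '' and false
      // count as valid results
      if (value !== undefined) {
        resolve(value);
      } else if (Date.now() - startTime > timeoutMs) {
        reject(new Error(`Timeout waiting for ${description} after ${timeoutMs}ms`));
      } else {
        setTimeout(check, pollMs);
      }
    };

    check();
  });
}
```

Usage would follow the same shape as the thread-specific helpers, e.g. `await waitForCondition(() => cache.get('user-1'), 'user-1 in cache')` for some hypothetical `cache`.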