What 160 Minutes of AI Debugging Taught Me About AI Coding Tools

“In God we trust, all others bring data.” This principle has guided my approach to evaluating AI coding tools, but I realized I needed to practice what I preach. 

Years ago, I took the Personal Software Process training and adapted it for delivery at Universidad ORT Uruguay. I published the educational value of that experience with my students in 2011 (see Dymenstein, Martin, Alfredo Etchamendi, Fernando Maidana, Santiago Matalonga, and Tomás San Feliu, ‘Towards a Student Oriented Approach to Teaching PSP Discipline’, CIESC 2011 (Quito), CLEI, 2011).

So, to evaluate my interaction with AI coding assistants, I decided to apply Personal Software Process (PSP) discipline to my own AI-assisted development work and track every single defect.
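For readers unfamiliar with PSP, a defect log entry boils down to a handful of fields: what the defect was, where it was injected, where it was removed, and how long it took to fix. Here is a minimal Python sketch of such a record; the field names follow the standard PSP defect log and are illustrative, not the exact columns of my spreadsheet.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DefectRecord:
    """One entry in a PSP-style defect log (field names are illustrative)."""
    logged_on: date        # when the defect was found
    defect_type: str       # e.g. "Boundary/Edge case", "Syntax", "Wrong generation"
    attribution: str       # "LLM" or "Human"
    inject_phase: str      # where the defect was introduced (e.g. "Code")
    remove_phase: str      # where it was caught (e.g. "Test", "Integration")
    fix_minutes: int       # debugging and fix time, in minutes
    description: str = ""  # short note on the root cause

# A hypothetical entry, not copied from the spreadsheet:
example = DefectRecord(
    logged_on=date(2025, 5, 20),
    defect_type="Boundary/Edge case",
    attribution="LLM",
    inject_phase="Code",
    remove_phase="Test",
    fix_minutes=10,
    description="Generated loop assumed a non-empty input list",
)
```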

Over four weeks of development—creating Jupyter notebooks for research analysis, developing programming exercises for my teaching, and building an LLM data extraction tool (for a paper currently under review)—I meticulously logged every defect that escaped my initial development and testing phases. The results provide a reality check that validates the concerns I’ve raised in my previous posts about tool investments, productivity measurements, and the need for disciplined evaluation.

The Data: 16 Defects, 160 Minutes, Hard Truths

Between May 9 and June 5, 2025, working with GitHub Copilot across Python, Java, and LaTeX projects, I recorded 16 defects that required debugging after I had considered the development task complete. The breakdown tells a story that every CTO investing in AI tools should understand:

Defect Attribution:

– LLM-caused defects: 12 (75%)

– Human-caused defects: 4 (25%)

Total debugging time: 160 minutes

– Average time per LLM defect: 8.7 minutes

– Average time per human defect: 4.25 minutes

One notable outlier: I spent 45 minutes trying to configure GitHub Actions using Visual Studio Code, and for most of that debugging time I assumed the issue was with my YAML file—when in fact, it wasn’t.

Most Common LLM Defect Types:

– Boundary/Edge cases not foreseen: 5 occurrences

– Syntax errors: 4 occurrences

– Variable type/initialization issues: 3 occurrences

You can access the Google spreadsheet here: https://docs.google.com/spreadsheets/d/1dqriQs0l1UZzwZjMxb7wFI8797vK2OnjYrOCAmyAaDI/edit?usp=sharing

The Hidden Cost of “Productivity”

This data shows the gap between the promised productivity gains and the reality of AI-assisted development. While Copilot undoubtedly helped me write code faster, the 135 minutes spent debugging LLM-generated defects represents a hidden tax on that productivity.

More concerning is the pattern of defect types. Boundary cases and edge conditions—the kind of defensive programming that separates robust software from fragile demos—consistently tripped up the AI. These aren’t typos or simple oversights; they’re fundamental gaps in understanding context and requirements that require human judgment to resolve.
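To make this concrete, here is a simplified, reconstructed illustration of the pattern rather than a verbatim defect from my log: the generated code handles the common case and silently omits the guard a defensive programmer would add.

```python
# Simplified illustration of a typical boundary defect, not actual logged code.

# The kind of completion the assistant tends to produce: fine for the common case.
def average_score(scores):
    return sum(scores) / len(scores)   # ZeroDivisionError on an empty list

# The defensive version a human reviewer ends up writing.
def average_score_safe(scores):
    if not scores:                     # boundary case: no data yet
        return 0.0
    return sum(scores) / len(scores)
```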

The fact that LLM defects took twice as long to debug on average (8.7 vs 4.25 minutes) also suggests something important: AI-generated errors are often more subtle and harder to diagnose than human mistakes. When I make a copy-paste error, I can usually spot it quickly. When Copilot generates code with incorrect API versions or misunderstood prompts, the debugging process requires deeper investigation.

The Context Problem Materialized

To me, this experience reinforces the arguments I’ve made in previous articles about the limitations of current productivity measurements. The 30-minute debugging session for a “Wrong Generation” defect, where the LLM misunderstood my prompt, is a perfect illustration of why measuring coding speed without accounting for context comprehension gives misleading results.

Consider this pattern: I would write a prompt, Copilot would generate code that looked correct (and passed my review), I would integrate it into my project, and only during execution would I discover that the AI had missed crucial context about my specific use case. The boundary condition errors—five separate occurrences—all followed this same pattern: the AI optimized for the common case while ignoring edge conditions that human developers learn to anticipate through experience.

The Integration Challenge

What this data doesn’t capture—but my experience highlighted—is the cognitive overhead of working with AI assistance. Each suggestion requires evaluation: Is this correct? Does it fit my context? Will it handle edge cases? This constant assessment creates a mental load that doesn’t appear in productivity metrics but affects overall development effectiveness.

The teaching project data is particularly revealing. When developing programming exercises that students would use, the boundary case defects created by Copilot would have directly impacted educational quality. A subtle error in exception handling or variable initialization in a student exercise could confuse learners or teach incorrect patterns—consequences that extend far beyond the 5-10 minutes needed to fix the bug.
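As a simplified illustration of that kind of subtle error (again, not the actual exercise code): a variable assigned only inside a try block looks fine on the happy path, but the moment a student triggers the exception branch, the exercise crashes with a confusing UnboundLocalError instead of teaching the intended validation pattern.

```python
# Illustrative only: a subtle initialization defect in a student-facing exercise.

def read_grade(raw: str) -> int:
    try:
        grade = int(raw)
    except ValueError:
        print("Invalid grade, please enter a number")
        # 'grade' is never assigned on this path...
    return grade       # ...so read_grade("abc") prints the message, then crashes

# The version students should actually see: every path returns a value.
def read_grade_fixed(raw: str) -> int:
    try:
        return int(raw)
    except ValueError:
        print("Invalid grade, please enter a number")
        return 0
```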

Connecting to Organizational Reality

This personal data helps explain the productivity paradox I’m seeing across coaching engagements. The same patterns that affected my small-scale projects—context misunderstanding, boundary case failures, prompt engineering overhead—get amplified in organizational settings where multiple developers use AI tools with varying skill levels and inconsistent practices, and where DevOps pipelines act as echo chambers that carry these mistakes straight into production.

A CTO dealing with unclear requirements will find that AI tools amplify the ambiguity rather than resolve it. A scaling organization struggling with coordination issues won’t solve them by generating code faster if that code introduces subtle defects that surface during integration. A Scrum Master tracking velocity might see story point completion rates improve while quality metrics deteriorate.

The Measurement Imperative

This exercise reinforced why objective measurement matters more than vendor promises or industry hype. The discipline of logging every defect, categorizing its source, and tracking resolution time provided insights that would be impossible to gain from high-level productivity metrics or subjective impressions.

For organizations investing in AI development tools, the lesson is clear: implement measurement systems that capture the full cost-benefit picture. Track not just code generation speed, but debugging time, defect rates, context switching overhead, and quality metrics. Measure what matters to your specific context, not what’s convenient for tool vendors to report.
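As a rough sketch of what that can look like at the smallest scale, assuming a defect log with an attribution and a fix-time field like the record sketched earlier, a few lines of aggregation already answer questions that generation-speed dashboards cannot:

```python
from collections import defaultdict

def summarize_by_attribution(defects):
    """Aggregate a PSP-style defect log by source (LLM vs. human). A sketch."""
    total_minutes = defaultdict(int)
    counts = defaultdict(int)
    for defect in defects:
        total_minutes[defect.attribution] += defect.fix_minutes
        counts[defect.attribution] += 1
    for source in counts:
        avg = total_minutes[source] / counts[source]
        print(f"{source}: {counts[source]} defects, "
              f"{total_minutes[source]} min debugging, {avg:.1f} min average")
```

Nothing about this is sophisticated; the point is that the numbers exist at all, and that they belong to you rather than to a vendor report.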

The Balanced Path Forward

I’m not advocating against AI coding tools—the data shows they provide genuine value. But this PSP exercise demonstrates why the balanced approach I’ve advocated across previous articles remains essential. AI tools are powerful amplifiers that require disciplined implementation, careful measurement, and realistic expectations.

The 160 minutes I spent debugging represents the hidden cost of the “Age of Tools.” It’s time that vendor productivity studies don’t capture, academic research struggles to measure, and organizational metrics often miss. But it’s also time well spent in understanding the true impact of these tools on development effectiveness.

The companies that will succeed with AI coding assistants are those that implement them with the same discipline they would apply to any other productivity investment: clear measurement criteria, realistic expectations, and systematic evaluation of actual outcomes rather than promised benefits.

Your organization’s data will differ from mine—different tools, different contexts, different defect patterns. But the principle remains: in an environment saturated with biased claims and incomplete metrics, the most valuable insights come from disciplined measurement of your own experience. Trust the data, especially when you collect it yourself. How are you measuring your AI productivity? Contact me to discuss and chart a path to high performance.

