A practical retrospective from an AI-assisted software delivery experiment
I recently tested DeepSeek V4 Pro on a real coding task from my AEIP (Agentic Execution Infrastructure Platform) project.
This was not a toy example. It was not a “generate a function” prompt. It was a production-style task involving a Rust/Axum backend, PostgreSQL, NATS JetStream, Elasticsearch, idempotency rules, recovery references, integration tests, and acceptance criteria tied to EPIC-04.
The task was clear:
Implement payload replacement endpoint and terminal result visibility for Agentic Execution Infrastructure Platform (AEIP) EPIC-04 final v1 acceptance proof.
My goal was simple: evaluate whether DeepSeek V4 Pro could act as a serious coding agent for non-trivial backend work.
The result was mixed.
DeepSeek was fast. It understood Rust patterns reasonably well. Once reviewers pointed out issues, it usually fixed them quickly. But it repeatedly failed to catch important correctness gaps before submitting code.
After 6 review rounds across two reviewers (Codex and Claude), the implementation still had critical unresolved issues, including an Elasticsearch deadlock risk, a missing intent check in one conflict branch, and a pending-state test that could time out.
That is the part that matters.
Not whether the code looked good.
Not whether it compiled.
Not whether the happy path worked.
The real question is:
Can the model protect the system from edge cases, concurrency bugs, replay problems, and infrastructure failure modes before a reviewer catches them?
In this case, the answer was: not reliably.
The Comparison That Changed My View
This DeepSeek task took 6 review rounds and still did not reach the quality level I expected.
In the same project, with similar types of backend work, I had a different experience using:
- ChatGPT 5.5 as reviewer
- Claude Code “Sonnet 4.6” as coding agent
That setup was able to close similar tasks in around 2 iterations.
That difference is not small.
In real software delivery, iteration count matters because every round costs time, focus, and reviewer energy. A coding agent that needs six rounds may still be useful, but only if the task is low risk or the reviewer has enough capacity to act as a strong quality gate.
For complex backend work, the difference between two rounds and six rounds is the difference between acceleration and hidden rework.
The Role of the Virtual Company Kernel
One important detail: this experiment was not an unmanaged AI coding session.
The full experience was governed and monitored by my Virtual Company Kernel.
That matters because the Kernel created the operating structure around the AI agents. It defined how the task should be planned, implemented, reviewed, validated, documented, and accepted. In other words, DeepSeek was not simply asked to “write some code.” It was working inside a governed delivery system.
This is important for three reasons.
First, it made the evaluation fairer. The model had access to structure, expectations, acceptance logic, and review discipline. So the result was not caused by a vague prompt or missing process. The task had governance around it.
Second, it made the failures more visible. Without the Kernel, many of the issues could have been hidden under the illusion of progress. The implementation might have looked complete because code was produced. But the Kernel forced review, rejection, evidence, and iteration. That exposed the real gap: DeepSeek could implement quickly, but it did not reliably self-verify against production-grade risks.
Third, it showed why governance is not optional in AI-assisted development. When working with coding agents, the biggest danger is not that they fail obviously. The bigger danger is that they produce plausible code with hidden defects. The Virtual Company Kernel acted as the control layer that prevented “looks done” from being accepted as “is done.”
This also changed how I judged the experiment.
The result was not only a test of DeepSeek V4 Pro. It was also a test of whether governed AI delivery can detect when a model is moving too fast without enough verification.
In this case, the Kernel did its job.
It exposed missing requirements, unstable replay semantics, concurrency risks, weak test design, and a compound Elasticsearch failure mode that could have damaged the reliability of the system.
That is exactly why I believe AI coding work should not rely only on model intelligence. It needs an execution framework around it.
For me, the Virtual Company Kernel is that framework: a governed operating system for AI-assisted delivery, where agents can move fast, but acceptance still depends on evidence, review, and traceable quality gates.
What DeepSeek Did Well
I do not want to oversimplify the result. DeepSeek was not useless. It showed real implementation capability.
It could:
- Write Rust API code.
- Work with database transactions.
- Adjust code quickly after review feedback.
- Follow some existing codebase patterns.
- Produce plausible integration tests.
- Explain its own mistakes clearly after the fact.
The important phrase is: after the fact.
Once a reviewer identified the issue, DeepSeek could usually fix it. That means the model had enough implementation ability. The failure was not basic coding skill.
The real weakness was pre-submission verification.
DeepSeek behaved like a fast developer who writes plausible code quickly, then depends on a strong reviewer to catch the dangerous parts.
That is useful in some workflows.
It is risky in others.
Where DeepSeek Failed
The biggest failure was not one bug. It was a repeated pattern.
DeepSeek submitted code before systematically proving that the code satisfied the acceptance requirements.
The attached self-assessment identified several concrete failures.
1. It misunderstood the task at the start
The first submission claimed that no code change was needed and treated the task as documentation-only.
That was wrong.
Two P1 acceptance gaps existed:
- payload replacement API
- terminal result retrieval
DeepSeek failed to build a checklist mapping requirement → existing evidence → gap → required implementation.
This is a serious issue. In product and engineering terms, it skipped discovery and jumped to a conclusion.
2. It missed schema-level constraints
The implementation inserted a new recovery reference action type without checking the database constraint that allowed only existing action values.
That would have failed on a migrated PostgreSQL database.
This is exactly the kind of bug a coding agent must catch by reading migrations and constraints before writing code.
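That check can be made mechanical before any code is written. Here is a minimal sketch, assuming sqlx and hypothetical names (a recovery_references table with a recovery_references_action_check constraint; the real definitions live under app/migrations/):

```rust
// Sketch: verify a new action value is actually permitted by the
// CHECK constraint before writing code that inserts it.
// Table and constraint names are hypothetical.
use sqlx::PgPool;

async fn assert_action_allowed(pool: &PgPool, action: &str) -> anyhow::Result<()> {
    // Read the constraint definition straight from the catalog.
    let def: String = sqlx::query_scalar(
        "SELECT pg_get_constraintdef(oid)
         FROM pg_constraint
         WHERE conname = 'recovery_references_action_check'",
    )
    .fetch_one(pool)
    .await?;

    // The definition text lists the whitelisted values, e.g.
    // CHECK (action IN ('requeue', 'cancel', ...)).
    anyhow::ensure!(
        def.contains(&format!("'{action}'")),
        "action '{action}' is not permitted by the constraint: {def}"
    );
    Ok(())
}
```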
3. It designed broken idempotency semantics
The first idempotency fingerprint was too generic. Different replacement payloads under the same recovery reference could have been incorrectly treated as the same replay.
Later fixes improved this, but another replay problem remained: some response values were derived from volatile task state instead of being persisted as stable replay output.
For a recovery system, this is not a small bug. Idempotency and replay stability are core correctness requirements.
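A content-sensitive fingerprint addresses the first half of this. The sketch below is illustrative only (the function name and the sha2/hex crates are my choices, not AEIP's actual scheme); the property that matters is that two different replacement payloads can never hash to the same replay key:

```rust
// Sketch: an idempotency fingerprint that covers the payload content,
// not just the recovery reference. Names are hypothetical.
use sha2::{Digest, Sha256};

fn replacement_fingerprint(
    recovery_reference_id: &str,
    payload: &serde_json::Value,
) -> anyhow::Result<String> {
    // With serde_json's default BTreeMap-backed map, serializing a
    // Value yields key-sorted, deterministic bytes. If the payload
    // arrives as a raw string, normalize it through Value first.
    let payload_bytes = serde_json::to_vec(payload)?;

    let mut hasher = Sha256::new();
    hasher.update(recovery_reference_id.as_bytes());
    hasher.update(b"\x00"); // domain separator between the two inputs
    hasher.update(&payload_bytes);
    Ok(hex::encode(hasher.finalize()))
}
```

The second half is replay stability: the original response must be persisted next to the fingerprint, so a replay returns the stored output instead of recomputing it from whatever the task state happens to be at that moment.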
4. It missed a concurrency race
Two concurrent replacement requests could both compute the same next payload version and both commit their transactions, allowing one request to silently overwrite the other.
That means the code was not safe under realistic concurrent usage.
This is a major red flag for backend systems where multiple workers, operators, or recovery actions may interact with the same task.
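The usual fix is to let PostgreSQL arbitrate the race. A minimal sketch, with hypothetical table and column names, assuming sqlx with its json feature and a UNIQUE (task_id, version) constraint added by a migration:

```rust
// Sketch: close the version race with a unique constraint, e.g.
//   ALTER TABLE task_payloads
//     ADD CONSTRAINT task_payloads_task_version_key UNIQUE (task_id, version);
// Both racing requests may compute the same next version, but only one
// insert can commit; the loser sees zero rows affected and should map
// that to 409 Conflict instead of silently overwriting.
use sqlx::PgPool;

async fn insert_payload_version(
    pool: &PgPool,
    task_id: &str,
    next_version: i64,
    payload: &serde_json::Value,
) -> anyhow::Result<bool> {
    let result = sqlx::query(
        "INSERT INTO task_payloads (task_id, version, payload)
         VALUES ($1, $2, $3)
         ON CONFLICT (task_id, version) DO NOTHING",
    )
    .bind(task_id)
    .bind(next_version)
    .bind(payload)
    .execute(pool)
    .await?;

    // rows_affected == 0 means a concurrent request won the race.
    Ok(result.rows_affected() == 1)
}
```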
5. It wrote unreliable integration tests
The test subscribed to NATS after the event it was waiting for had already been published, which means the message could be missed.
It also had loop and lifecycle issues that could cause hanging tests or false confidence.
This matters because bad tests are worse than no tests. They create the illusion of safety.
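The reliable pattern is: subscribe first, then trigger, then wait with a bound. A sketch using a recent async-nats client; the subject name and the trigger_payload_replacement helper are hypothetical:

```rust
// Sketch: subscribe-before-trigger with a bounded wait, so a missed
// message becomes a fast, explicit failure instead of a hung test.
use std::time::Duration;
use futures::StreamExt;

#[tokio::test]
async fn replacement_event_is_published() -> anyhow::Result<()> {
    let client = async_nats::connect("nats://localhost:4222").await?;

    // 1. Subscribe BEFORE doing anything that can publish.
    let mut sub = client.subscribe("aeip.task.payload.replaced").await?;

    // 2. Trigger the behavior under test (hypothetical helper that
    //    calls the replacement API).
    trigger_payload_replacement().await?;

    // 3. Bounded wait: fail loudly rather than hanging forever.
    let msg = tokio::time::timeout(Duration::from_secs(5), sub.next())
        .await
        .expect("timed out waiting for replacement event")
        .expect("subscription closed before a message arrived");

    assert!(!msg.payload.is_empty());
    Ok(())
}
```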
6. It missed the Elasticsearch deadlock failure mode
This was the most serious issue.
The system generated deterministic payload IDs, and Elasticsearch indexing used op_type=create. If Elasticsearch indexing succeeded, the PostgreSQL transaction later failed, and the best-effort Elasticsearch compensation delete also failed, an orphaned ES document was left behind. Because every retry regenerates the same deterministic ID, op_type=create would then reject each future attempt, permanently blocking the operation.
That is a compound failure path across:
- deterministic version generation
- Elasticsearch create semantics
- PostgreSQL transaction failure
- best-effort compensation
- retry behavior
This is exactly the type of issue that separates happy-path coding from production-grade engineering.
DeepSeek did not catch it.
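One possible mitigation, and this is only a sketch of one option rather than the project's chosen fix: make the retry path self-healing by consulting PostgreSQL as the source of truth when op_type=create conflicts on a deterministic ID. This uses the official elasticsearch crate; the index name and the pg_payload_committed flag are hypothetical:

```rust
// Sketch: on a create conflict for a deterministic ID, check whether
// PostgreSQL ever committed the row. If not, the ES document is an
// orphan from a failed transaction whose compensation delete also
// failed, and overwriting it is safe.
use elasticsearch::{http::StatusCode, params::OpType, Elasticsearch, IndexParts};

async fn index_payload_document(
    es: &Elasticsearch,
    doc_id: &str,
    doc: &serde_json::Value,
    pg_payload_committed: bool, // caller looks this up in PostgreSQL
) -> anyhow::Result<()> {
    let resp = es
        .index(IndexParts::IndexId("aeip-payloads", doc_id))
        .op_type(OpType::Create)
        .body(doc)
        .send()
        .await?;

    if resp.status_code().is_success() {
        return Ok(());
    }
    if resp.status_code() == StatusCode::CONFLICT && !pg_payload_committed {
        // Orphaned document: overwrite it (op_type=index) instead of
        // failing on every future retry.
        es.index(IndexParts::IndexId("aeip-payloads", doc_id))
            .op_type(OpType::Index)
            .body(doc)
            .send()
            .await?
            .error_for_status_code()?;
        return Ok(());
    }
    anyhow::bail!("indexing {doc_id} failed: {}", resp.status_code())
}
```

The design choice is that Elasticsearch holds derived state: if the system of record has no committed row for the ID, the conflicting document cannot be authoritative, so overwriting it is safe.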
The Root Cause: Not Coding, But Verification
The self-assessment was brutally honest:
“The gap is not in implementation skill. It is in the pre-submission quality gate.”
That matches my experience.
DeepSeek could implement.
DeepSeek could fix.
DeepSeek could explain.
But it did not reliably verify.
Its default behavior looked like this:
- Read the task.
- Produce code.
- Submit.
- Wait for reviewer rejection.
- Fix reactively.
That is not good enough for critical backend work.
For production systems, the coding agent must behave more like this:
- Read acceptance criteria.
- Build a requirement-to-evidence matrix.
- Identify similar existing patterns.
- Read migrations and constraints.
- Design the failure model.
- Implement.
- Run tests.
- Self-review against concurrency, replay, idempotency, and infrastructure failure.
- Submit only after the quality gate passes.
DeepSeek skipped too much of that.
Why ChatGPT 5.5 + Claude Code Worked Better For Me
My better experience came from using a split-role workflow:
- Claude Code “Sonnet 4.6” handled implementation.
- ChatGPT 5.5 acted as reviewer, architect, and quality gate.
This setup worked better because the roles were clearer.
Claude Code focused on producing the implementation. ChatGPT 5.5 challenged the result, reviewed assumptions, checked architecture, and looked for correctness risks.
That separation matters.
When one model both writes and reviews its own code, it often becomes too attached to the implementation path. It checks whether the code matches its intention, not whether the system survives the edge cases.
A separate reviewer model is more likely to ask:
- What happens if this request is replayed after completion?
- What happens if two requests race?
- What happens if PostgreSQL fails after Elasticsearch succeeds?
- Is the response stable under idempotency?
- Does the test actually fail when the implementation is broken?
- Are we following the existing handler pattern or inventing a risky new one?
This is why the ChatGPT 5.5 + Claude Code workflow closed similar tasks faster.
Not because one model magically knows everything.
Because the workflow created a stronger review structure.
My Practical Assessment of DeepSeek V4 Pro
DeepSeek V4 Pro is usable for coding, but I would not use it alone for critical backend tasks.
My current view:
Good fit
DeepSeek can be useful for:
- small implementation tasks
- simple CRUD endpoints
- code translation
- boilerplate
- local refactoring
- test scaffolding
- low-risk internal tools
- fixing issues after a reviewer identifies them
Risky fit
I would be careful using DeepSeek for:
- concurrency-sensitive code
- idempotency flows
- distributed systems
- event-driven architectures
- database migration logic
- recovery and retry systems
- financial or compliance-sensitive logic
- production infrastructure workflows
Best use
The best use of DeepSeek, based on this experiment, is as a fast implementer behind a strong reviewer.
It should not be the final authority.
The Workflow I Would Use Next Time
If I use DeepSeek again for a serious coding task, I would force a strict pre-submission gate.
Before writing code, the model must produce:
1. Requirement-to-implementation matrix
2. Similar existing handlers to copy or compare against
3. Migration and database constraint check
4. Idempotency and replay analysis
5. Concurrency risk analysis
6. Failure-mode analysis
7. Test reliability plan
Before submitting code, it must produce:
1. Diff against reference patterns
2. List of touched database constraints
3. Replay behavior proof
4. Concurrent request simulation (see the sketch after this list)
5. Transaction failure analysis
6. Message ordering check for tests
7. Evidence of commands/tests run
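For item 4, the simulation does not need to be elaborate. A sketch with reqwest (json feature) and tokio; the route, port, status codes, and body shape are all hypothetical:

```rust
// Sketch: two racing replacement requests against a locally running
// API. Exactly one should win; the loser must surface the conflict.
use serde_json::json;

#[tokio::test]
async fn concurrent_replacements_do_not_overwrite() -> anyhow::Result<()> {
    let client = reqwest::Client::new();
    let url = "http://localhost:18081/tasks/task-123/payload"; // hypothetical route

    let send = |payload: serde_json::Value| {
        let client = client.clone();
        async move {
            client
                .put(url)
                .json(&json!({ "payload": payload }))
                .send()
                .await
        }
    };

    // Fire both requests at the same time.
    let (a, b) = tokio::join!(send(json!({"v": 1})), send(json!({"v": 2})));
    let codes = [a?.status().as_u16(), b?.status().as_u16()];

    // Exactly one winner; the loser must get a conflict, not a silent overwrite.
    assert_eq!(codes.iter().filter(|c| **c == 200).count(), 1);
    assert_eq!(codes.iter().filter(|c| **c == 409).count(), 1);
    Ok(())
}
```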
Without that structure, I expect the same result again: fast output, hidden rework, and too much reviewer load.
The Main Lesson
The future of AI coding is not only about which model writes the most code.
The real question is:
Which workflow produces correct, reviewable, production-safe code with the fewest iterations?
For me, this experiment showed that DeepSeek V4 Pro can help with implementation, but it needs a strong external quality gate.
In contrast, my experience with Claude Code as implementer and ChatGPT 5.5 as reviewer has been stronger for complex backend work, because the workflow naturally separates generation from verification.
That separation reduced iteration count from six rounds to about two on similar project tasks.
For a serious engineering workflow, that is the metric I care about.
Not benchmark scores.
Not demo speed.
Not how confident the model sounds.
I care about how many review rounds it takes before the code is safe.
And in this case, DeepSeek V4 Pro was not yet good enough as a standalone coding partner.
It was useful.
It was fast.
But it was not sufficiently self-critical.
That is the difference between a coding assistant and an engineering teammate.
Full DeepSeek Self-Assessment Report
The full DeepSeek V4 Pro self-assessment report is available here:
Download the full retrospective report
The prompt I used:
You are DeepSeek V4 Pro acting as a senior Rust implementation agent in the AEIP repository.
Repository root:
/Volumes/webDev/BorrowBrain/agentic/AEIP
Target app:
/Volumes/webDev/BorrowBrain/agentic/AEIP/app
Task:
Execute EPIC-04 DEV-001. Implement only if a concrete final v1 acceptance gap or missing evidence hook exists. EPIC-04 is a final v1 acceptance and design-partner proof EPIC, not a feature-expansion EPIC.
Read these files first:
- AGENTS.md
- product_config.yaml
- project/odui/EPIC-04-v1-acceptance-and-design-partner-proof.md
- project/odui/EPIC-04-requirements-by-role.md
- project/odui/EPIC-04-wave1-ana-contracts.md
- project/odui/EPIC-04-wave1-arch-contracts.md
- project/odui/EPIC-04-wave2-prod-contracts.md
- project/odui/EPIC-04-wave3-dev-deliverables.md
- project/WF-v1-runtime-acceptance-checklist.md
- project/WF-v1-runtime-scope-and-design-partner-wedge.md
- project/ARCHITECTURE.md
- app/DEV-GUIDE.md
Inspect:
- app/crates/aeip-api/
- app/crates/aeip-domain/
- app/crates/aeip-store-pg/
- app/crates/aeip-store-es/
- app/crates/aeip-broker-nats/
- app/crates/aeip-worker-demo/
- app/crates/aeip-tests/tests/integration/
- app/crates/aeip-tests/tests/e2e/
- app/migrations/
Rules:
- Implement only approved EPIC-04 acceptance-gap fixes or evidence hooks needed for final v1 proof.
- Preserve closed EPIC-00, EPIC-01, EPIC-02, and EPIC-03 behavior.
- Prefer platform-truth evidence through APIs, read models, runtime events, metrics, health, and audit records.
- Raw PostgreSQL, Elasticsearch, or NATS inspection may support diagnostics but cannot replace product truth.
- If no code gap exists, do not invent one. Produce implementation notes explaining that existing runtime surfaces/tests are sufficient for Wave 4.
Do not implement:
- broad workflow orchestration
- rich operator-console UI or dashboards
- analytics products
- MCP, A2A, or other standards adapters
- marketplace or ecosystem features
- enterprise deployment or release automation
- worker-registration redesign
- broad heartbeat/capability lifecycle redesign
- mass migration of project artefact filenames
- unrelated refactors or formatting churn
Required output:
- Minimal code or test changes under app/ if a specific gap is found
- project/odui/evidence/epic-04/dev-001-implementation-notes.md
- Focused test additions only if needed to prove the final design-partner scenario or platform-truth evidence hooks
Run validation from app/:
cargo fmt
cargo check --workspace
cargo test -p aeip-domain
AEIP_API_BASE_URL=http://localhost:18081 AEIP_INTEGRATION_TESTS=1 NATS_URL=nats://localhost:4222 cargo test -p aeip-tests --test integration -- --test-threads=1
AEIP_API_BASE_URL=http://localhost:18081 AEIP_E2E_TESTS=1 NATS_URL=nats://localhost:4222 cargo test -p aeip-tests --test e2e -- --test-threads=1
AEIP_API_BASE_URL=http://localhost:18081 AEIP_INTEGRATION_TESTS=1 AEIP_E2E_TESTS=1 NATS_URL=nats://localhost:4222 cargo test --workspace -- --test-threads=1
Run validation from repo root:
git diff --check -- app project/odui/evidence/epic-04
Final response format:
1. Summary of inspected acceptance gaps.
2. Files changed.
3. Implementation notes location.
4. Validation commands run and results.
5. Any blocked validation, with exact reason.
6. Any residual risk or follow-up.
