In this chapter
Project phases.
What they are. A project lifecycle is the sequence of phases a project moves through from concept to retirement. The phase names vary across industries — aerospace calls them by different names than commercial software, defense calls them yet other things — but the underlying shape is similar. Each phase has a purpose, a deliverable, and a gate that determines whether the project is ready to advance.
Why the structure exists. Engineering work has dependencies that don’t respect schedule pressure. You cannot meaningfully design a system whose requirements you haven’t pinned down. You cannot meaningfully test a system you haven’t built. You cannot meaningfully retire a system you haven’t deployed. The phases are a recognition that earlier work has to settle before later work has anything to operate on. Skipping ahead doesn’t accelerate the project; it creates rework later, when the missing earlier work is discovered to have been necessary after all.
The mental model. Building a house. The architect cannot draw plans before knowing the lot, the budget, and what the family needs. The framers cannot frame before the foundation is poured. The electricians cannot wire before the framing is up. Each trade depends on the trade before it having done its work. Compressing the schedule by starting trades in parallel produces a house that doesn’t fit together. The phases of an engineering project are the same kind of dependency, applied to harder-to-see work.
The discipline. Each phase ends with a deliverable that allows the gate decision: is this project ready to enter the next phase? The deliverable is what makes the gate honest. Without a written deliverable, “ready” becomes opinion, and projects advance based on schedule pressure rather than readiness. With one, the gate is a fact-check.
Phase 0: Concept
Goal: Understand the problem, assess feasibility.
- Stakeholder interviews
- Problem definition
- High-level requirements
- Feasibility analysis
- Rough cost/schedule estimate
- Go/no-go decision
Phase 1: Requirements
Goal: Define what success looks like
- Functional requirements
- Non-functional requirements
- Interface definitions
- Acceptance criteria
- Requirements review with customer
Phase 2: Design
Goal: Decide how we'll meet the requirements
- System architecture
- Trade studies (commercial off-the-shelf, or COTS, vs custom)
- Interface control documents
- Design reviews (PDR)
- Risk mitigation plans
Phase 3: Implementation
Goal: Build it
- Detailed design & coding
- Unit testing as you go
- Integration (incremental)
- Code reviews & peer checks
- Critical Design Review (CDR)
Phase 4: Verification
Goal: Prove it meets requirements
- System-level testing
- Environmental testing
- Requirements traceability
- Test reports & documentation
- Test Readiness Review (TRR)
Phase 5: Delivery & Support
Goal: Deploy and maintain
- Customer acceptance
- Training & documentation
- Deployment support
- Bug fixes & patches
- Lessons learned
Key Deliverables by Phase
| Phase | Critical Deliverables | Review/Gate |
|---|---|---|
| Concept | Concept of Operations (ConOps), Feasibility Study, Cost/Schedule Estimate | Concept Review |
| Requirements | Requirements Specification, Interface Control Documents (ICDs) | Requirements Review (SRR) |
| Design | System Architecture Document, Trade Study Reports, Risk Register | Preliminary Design Review (PDR) |
| Implementation | Detailed Design Docs, Source Code, Unit Test Results | Critical Design Review (CDR) |
| Verification | Test Plans, Test Procedures, Test Reports, Requirements Traceability Matrix (RTM) | Test Readiness Review (TRR) |
| Delivery | As-Built Documentation, User Manuals, Training Materials | Acceptance Review |
Code freeze and release.
What it is. A code freeze is a deliberate stop on new feature development, after which only bug fixes are allowed into the build. The freeze creates a stable target for the integration, regression, and acceptance work that has to happen before the release ships.
Why it exists. Software has the property that any change can break anything. A new feature added on Monday may produce a regression that doesn’t surface until Friday’s system test. If features keep landing throughout the verification phase, the team is verifying a moving target — and every test result is invalidated by the next change. The freeze creates a window where the build is stable enough that test results mean something.
The trap. “Just one more feature” is the discipline-killer. Every feature that lands after the freeze restarts the verification clock for the parts of the system it touched. The team that allows three “small” post-freeze features has effectively had no freeze. The team that rejects them has a stable build at release time and a high-confidence ship.
Release process.
The exact timeline varies by project, but the shape is recognizable. Each milestone exists because of a dependency: regression testing needs feature-complete code, system testing needs a stable build, the release candidate needs verified test results. Compressing the timeline doesn’t move the milestones — it just removes the verification time between them, and that verification is what made the release a release rather than a hope.
- T-8 weeks: Feature complete. All planned functionality implemented.
- T-6 weeks: Code freeze. Only bug fixes allowed; each one requires approval.
- T-4 weeks: Integration testing complete. System-level tests passing.
- T-2 weeks: Regression testing. Verify nothing broke from bug fixes.
- T-1 week: Release candidate. Final smoke tests, documentation review.
- T-0: Release to customer. Deployment support begins.
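To make the dependency chain concrete, here is a small sketch that derives the milestone calendar from a target release date by walking backwards in whole weeks. The offsets are the ones in the list above; the target date, names, and output format are illustrative assumptions, not a real project's plan.

```cpp
#include <cstdio>
#include <ctime>

int main() {
    std::tm tm{};                        // assumed target release date (T-0)
    tm.tm_year = 2025 - 1900;            // illustrative: 30 June 2025
    tm.tm_mon  = 6 - 1;
    tm.tm_mday = 30;
    const std::time_t release = std::mktime(&tm);

    struct Milestone { const char* name; int weeks_before; };
    const Milestone plan[] = {
        {"Feature complete",             8},
        {"Code freeze",                  6},
        {"Integration testing complete", 4},
        {"Regression testing",           2},
        {"Release candidate",            1},
        {"Release to customer",          0},
    };

    for (const auto& m : plan) {
        // Each milestone is a fixed number of weeks before the release date;
        // compressing the calendar moves the date, not the dependency.
        std::time_t due = release - std::time_t(m.weeks_before) * 7 * 24 * 3600;
        char buf[16];
        std::strftime(buf, sizeof buf, "%Y-%m-%d", std::localtime(&due));
        std::printf("T-%d weeks  %-30s %s\n", m.weeks_before, m.name, buf);
    }
    return 0;
}
```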
Configuration management.
Every claim of the form “we tested this” depends on knowing what “this” refers to. Configuration management is the discipline that makes that knowable. Every artifact has a version. Every test result references the version it was run against. The system can answer — mechanically, without human memory — the question “which version of which component was tested with which version of which other component, on which date, by whom?”
When configuration management is absent, that answer is reconstructed from human memory and shared drives, which means it is approximate at best and wrong at worst. Tests run against builds that were since overwritten. Bug reports reference versions that nobody can pin down anymore. Regressions that should take minutes to investigate take days. The team doesn’t know what they shipped, what they tested, or what changed between them. Every claim becomes an estimate.
What the discipline actually requires.
Source control for everything that can change — code, configuration files, scripts, documentation, FPGA images, firmware. If it has a version, it lives in version control with a commit history. If it’s on someone’s laptop and nowhere else, it doesn’t exist as far as the project is concerned.
Build artifacts produced by automated systems, not by individual engineers’ laptops. The build system runs from a known commit hash and produces an artifact tagged with that hash. Builds are reproducible because they have to be — if you cannot rebuild last quarter’s release from source, you cannot debug last quarter’s bug.
Test records that name the artifact under test, by version. “We ran the regression suite against build 4.7.2 on date X, here are the results” is a record. “It worked when I tested it last week” is not. The test record is what makes the test result usable evidence three months later when something fails in the field.
Release tagging that pins exactly which versions of which components shipped together. The next time a customer reports a bug, you can answer “here is exactly what you have, here is what we tested it with, here is what has changed since.” That answer is the difference between a tractable investigation and a multi-week archaeology project.
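One way to make the "tagged with that hash" point concrete is to have the build system pass the commit hash and build identifier into the compile, so every artifact can report exactly what it is. A minimal sketch, assuming hypothetical compiler definitions supplied by CI (for example `-DGIT_COMMIT="\"3f9c2ab\""`); the macro and namespace names are illustrative, not a specific tool's convention.

```cpp
#include <cstdio>

// Fallbacks make an untracked local build visibly untracked rather than
// silently impersonating an official one.
#ifndef GIT_COMMIT
#define GIT_COMMIT "unversioned"
#endif
#ifndef BUILD_ID
#define BUILD_ID "local"
#endif

namespace version {
// Compiled into every artifact so a deployed unit, a log file, or a test
// record can always answer "which build is this?" without human memory.
constexpr const char* commit   = GIT_COMMIT;
constexpr const char* build_id = BUILD_ID;
}  // namespace version

int main() {
    // Typically printed at boot, embedded in log headers, and copied into
    // test records ("regression suite run against build X, commit Y").
    std::printf("build %s (commit %s)\n", version::build_id, version::commit);
    return 0;
}
```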
A new engineer onboarding to a project asked which version of the firmware was running on the test fixture. The answer came back from three different people: “the latest one I built last week,” “the December version, that’s the stable one,” and “I think Mike has a build with the fix in it.” There was no canonical answer because there was no canonical anything. Builds lived on engineers’ laptops. Tests had been run against builds that had since been overwritten. The fixture had whatever was on it last time someone flashed it.
Two weeks later, the customer reported a regression. Reproducing it required figuring out which build the customer had, which build had been on the fixture during the last successful test, and what had changed between them. Nobody knew. The team spent four engineer-days reconstructing what should have been a one-line answer from the commit log.
The lesson. Configuration management isn’t bookkeeping for its own sake. It is the precondition for every claim of the form “we tested this.” Without it, every test result is a story rather than a fact, and every regression is an investigation rather than a diff. The cost of the discipline is small and continuous. The cost of skipping it is large and lumpy — concentrated in the days when something has gone wrong and you can’t answer basic questions about what you shipped.
Hotfix, patch, update.
Why this distinction matters. Once a system is in production, every change is a risk. A change that fixes one bug can introduce three more — especially if the testing time is compressed because of urgency. The right discipline is to match the rigor of the change to the risk of the situation. Critical security flaws warrant rushed deployment with minimal testing because the alternative is worse. Minor UI bugs do not. Mixing up the two means either over-testing trivial changes or under-testing dangerous ones.
The trade. Speed and rigor are in tension. Faster deployment means less time for regression testing, which means more risk of introducing new problems. The discipline is to be honest about which trade you’re making each time, and to reserve the fastest paths for the situations that genuinely require them. Calling everything an emergency drains the meaning out of the word; the next real emergency gets the same response as the last fake one.
Hotfix.
Emergency repair.
- Critical bug in production
- Security vulnerability
- System down or data loss risk
- Deploy ASAP (hours to days)
- Minimal testing — focus on the fix
Patch.
Important bug fix.
- Non-critical bugs affecting users
- Performance issues
- Accumulated small fixes
- Deploy within weeks
- Regression testing required
Update / release.
Planned new version.
- New features
- Minor bug fixes
- Performance improvements
- Scheduled deployment
- Full test cycle
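One way to keep the three paths from blurring under schedule pressure is to write them down as data that the team and the release tooling both consult. A minimal sketch: the triggers, windows, and test scopes are taken from the lists above; the type names and structure are illustrative, not a real process definition.

```cpp
#include <cstdio>

// Illustrative encoding of the three release paths, so "which path does this
// change take?" is a lookup against an agreed table rather than a negotiation.
enum class ChangeClass { Hotfix, Patch, Update };

struct ReleasePath {
    ChangeClass kind;
    const char* trigger;         // what justifies this path
    const char* deploy_window;   // how fast it ships
    const char* test_scope;      // how much verification it gets
};

constexpr ReleasePath kPaths[] = {
    {ChangeClass::Hotfix, "critical bug, security hole, or data-loss risk",
     "hours to days", "targeted testing around the fix"},
    {ChangeClass::Patch, "non-critical bugs, performance, accumulated fixes",
     "within weeks", "regression testing"},
    {ChangeClass::Update, "planned features and improvements",
     "scheduled release", "full test cycle"},
};

int main() {
    for (const auto& p : kPaths) {
        std::printf("%-48s deploy: %-18s test: %s\n",
                    p.trigger, p.deploy_window, p.test_scope);
    }
    return 0;
}
```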
Anticipating Customer Needs
Customer says: "We need to log sensor data to a file."
Stated requirement: Data logging functionality.
Unstated but obvious needs:
- What happens when disk fills up? (rotate logs, alert operator)
- How do they retrieve logs? (USB export, network transfer)
- What if power fails during logging? (flush buffers, resume on reboot)
- Can they search/filter logs? (timestamps, event types)
- How long do they keep data? (retention policy, compression)
Result: anticipate these needs in the design; the customer will appreciate that you thought ahead.
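To show how cheaply some of those unstated needs can be anticipated at design time, here is a minimal sketch of a logger that timestamps entries, flushes after every record so a power loss costs at most one line, and rotates when the file grows past a limit. The class name, file names, and size limit are illustrative assumptions, not customer requirements; retention, search, and export would still come out of the requirements conversation.

```cpp
#include <cstdio>
#include <ctime>
#include <string>
#include <utility>

class SensorLog {
public:
    explicit SensorLog(std::string path, long max_bytes = 1024 * 1024)
        : path_(std::move(path)), max_bytes_(max_bytes) { open(); }
    ~SensorLog() { if (file_) std::fclose(file_); }

    // One timestamped CSV line per event; flushed immediately so a power
    // failure mid-run loses at most the record being written.
    void record(const char* event, double value) {
        if (!file_) return;
        std::time_t now = std::time(nullptr);
        char stamp[32];
        std::strftime(stamp, sizeof stamp, "%Y-%m-%dT%H:%M:%SZ",
                      std::gmtime(&now));
        std::fprintf(file_, "%s,%s,%.6f\n", stamp, event, value);
        std::fflush(file_);
        if (std::ftell(file_) > max_bytes_) rotate();
    }

private:
    void open() { file_ = std::fopen(path_.c_str(), "a"); }

    // Keep exactly one previous generation so the file cannot grow without
    // bound; a real retention policy (count, age, compression) belongs in
    // the requirements discussion.
    void rotate() {
        std::fclose(file_);
        const std::string old = path_ + ".1";
        std::remove(old.c_str());
        std::rename(path_.c_str(), old.c_str());
        open();
    }

    std::string path_;
    long max_bytes_;
    std::FILE* file_ = nullptr;
};

int main() {
    SensorLog log("sensor.csv");
    log.record("temperature_c", 23.4);
    return 0;
}
```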
MVP scope drift.
A specific failure mode is worth naming because it’s recognizable to anyone who’s been on a slipping project. The customer originally asked for a prototype — a demonstration of the concept, with the expectation that real engineering would follow. The project ran long. The customer waited. By the time something is ready to show, the customer’s patience for a “just a demo” deliverable is gone, and what they want now is a working system. They’re still calling it the MVP, because that’s the word the contract uses, but their definition of MVP has quietly migrated from “proves the concept” to “does the job.”
This isn’t the customer being unreasonable. It’s the customer naming the new reality: they’ve been patient long enough that they’re past wanting a demo. The team, meanwhile, is still building toward the original definition. Both parties think they’re aligned because they’re using the same word. Neither has noticed the word now means different things.
The dishonest version is what most teams do under pressure: hide the cost of the slip inside the price of the next change order. A simple change request comes in. The team prices it as if it includes some of the overrun work that wasn’t their fault. The customer either accepts the inflated change-order price (in which case the project recovers some cost, but the team has now lied about what the work cost) or rejects it (in which case the team is back to delivering a prototype against production expectations).
The honest move is to name the drift out loud: acknowledge what the original MVP definition was, state what the customer now expects, and renegotiate scope, schedule, or price in the open. It costs less in the long run because it preserves the relationship. The dishonest move buys budget at the cost of trust, and trust compounds. The teaching point is general: the test of whether you are managing a customer relationship honestly is whether you can describe the project’s true state without losing the contract. If you can’t, you’ve already lost something more important than the contract.
Engineering ROI: every hour has a cost.
Engineering decisions are business decisions. Every choice has cost implications. Custom hardware versus COTS, development time versus recurring cost, time spent on craft versus time spent on shipping — these are not just technical decisions. They affect ROI, time-to-market, profitability, and whether the project gets funded for a second round.
The deeper version of this principle is worth stating explicitly because most engineers resist it: every engineering hour has a cost, and a senior engineer’s job is partly recognizing which activities deliver value to the scope and which are ceremony. NRE — non-recurring engineering — is what you spend on the design, prototyping, testing, and documentation work that produces the product. It is not free, it is not unlimited, and it is one of the largest line items on most projects.
Software engineers in particular resist this framing because they’re trained to optimize for craft, but craft without budget awareness is just well-rendered overrun. An hour spent bikeshedding a C++ idiom that the latest YouTube argument settles one way this week and the other way next week is an hour that didn’t ship the thing. That doesn’t mean craft is wrong — craft compounds, and a team that never invests in it produces fragile work. It means craft has to be balanced against the budget that pays for it, with explicit awareness of what the trade actually buys.
Build vs Buy Analysis
COTS Option:
- Cost: $200/unit
- Available now
- Proven reliability
- Vendor support
- 100 units = $20,000 recurring
Custom Option:
- NRE: $50,000 (design + prototyping)
- Unit cost: $80 (at 100 qty)
- 6 months development time
- Maintenance burden on us
- 100 units = $50k NRE + $8k = $58,000
Break-even: custom becomes cheaper at roughly 417 units, the point where $200/unit × n overtakes $50,000 + $80/unit × n. But consider:
- Will we sell 416+ units? (market analysis)
- Can we afford 6-month delay? (time-to-market)
- Do we have resources for support? (long-term cost)
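The arithmetic behind that break-even figure is simple enough to sanity-check in a few lines. A quick sketch using the numbers from the comparison above; the quantities in the loop are illustrative.

```cpp
#include <cstdio>

int main() {
    const double cots_unit   = 200.0;    // recurring cost per unit, COTS
    const double custom_nre  = 50000.0;  // one-time design + prototyping
    const double custom_unit = 80.0;     // recurring cost per unit, custom

    // Crossover where 200*n = 50000 + 80*n, i.e. n = 50000 / 120 = 416.7;
    // custom is cheaper from 417 units on.
    const double breakeven = custom_nre / (cots_unit - custom_unit);
    std::printf("break-even at %.1f units\n", breakeven);

    const int quantities[] = {100, 416, 417, 1000};
    for (int n : quantities) {
        std::printf("n = %4d   COTS $%9.0f   custom $%9.0f\n",
                    n, cots_unit * n, custom_nre + custom_unit * n);
    }
    return 0;
}
```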
The peer review trap.
One specific place this matters: code reviews. A peer review by someone who understands the problem domain is worth several thousand dollars per review — they can ask the questions a linter cannot, surface the bug a static analyzer would never catch, and shape the design before it becomes infrastructure. A peer review by someone who doesn’t understand the problem domain costs the same per hour and is worth nothing. Without domain context the review collapses to syntax checking and style preferences, which the linter already does for free.
The remedy isn’t more reviewers. It’s reviewers who can see the actual risk. That requires investment — pairing junior engineers with seniors who know the domain, rotating reviewers to spread that knowledge, refusing to merge work that hasn’t been reviewed by someone who can evaluate it on the dimensions that matter. The team that runs reviews on every PR but only catches style issues isn’t doing reviews. They’re running expensive linters.
A senior engineer reviewed and merged a pull request in three minutes. The comments were all about formatting: a tab versus four spaces, a function name that should have been camelCase, a brace style preference. Solid review hygiene, as measured against the coding standard. The code in question was a sensor driver that handled the nominal case and crashed on NaN inputs at the operating temperature extreme. Six weeks later, the sensor on a field unit returned NaN, and the team spent two days reproducing the failure before someone noticed the function had no defensive check for non-finite values.
The reviewer didn’t lack skill. The reviewer lacked the domain context to know that this particular sensor returns NaN at -40C, and that NaN handling was the actual risk in this code. Without that context, the review collapsed to the only thing the reviewer could evaluate: style.
The lesson. A peer review by someone who doesn’t understand the problem domain costs the same as a peer review by someone who does — and one of them is worth nothing. If the only signal a review surfaces is style and formatting, that’s a process to fix, not a victory to celebrate.
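For concreteness, here is the kind of defensive check the review should have surfaced: reject non-finite readings at the driver boundary instead of letting NaN propagate. A minimal sketch; the function name, units, and the caller’s fallback behavior are illustrative, not the actual driver’s interface.

```cpp
#include <cmath>
#include <cstdio>
#include <limits>
#include <optional>

// Returns the temperature in degrees C, or nothing if the sensor reported a
// non-finite value (e.g. NaN at the cold extreme of the operating range).
std::optional<double> read_temperature_c(double raw_reading) {
    if (!std::isfinite(raw_reading)) {
        return std::nullopt;   // NaN/Inf from the sensor is a fault, not data
    }
    return raw_reading;
}

int main() {
    const double samples[] = {23.4, std::numeric_limits<double>::quiet_NaN()};
    for (double s : samples) {
        if (auto t = read_temperature_c(s)) {
            std::printf("temperature: %.1f C\n", *t);
        } else {
            std::printf("sensor fault: non-finite reading rejected\n");
        }
    }
    return 0;
}
```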
Engineering and management as one team.
Every principle in this guide assumes one thing that is rarely true: that the engineer applying it has support from the layer above them. Engineering rigor without management air cover gets ground down to the same patterns the team had before, just with more frustration on the engineer’s side. Rigor with management protection actually changes things. Both are required.
The principle assigns management a specific role. Management’s job in a healthy engineering culture isn’t directing the work. It is running interference for the work. The manager who protects time for a trade study, defends the dev-hardware budget against “we don’t need it anymore,” and lets the team push back on a senior engineer’s inherited-from-the-last-project pattern — that manager is doing the high-leverage work of the role. The manager who treats their job as schedule policing and ceremony enforcement is doing none of it.
The pattern underneath the patterns.
Most of the failures named in this guide look engineering-shaped on the surface but are management-shaped underneath. The team that drops dev-hardware support once final hardware arrives isn’t making an engineering decision — that is a management cost decision driven by a wrong mental model of where bugs get found. The team without configuration management isn’t committing an engineering oversight — that is a resourcing failure dressed up as one. The team running peer reviews that only catch syntax issues isn’t failing technically — they are missing the management investment in domain-deep reviewers.
The pattern repeats. A discipline costs engineer time. The cost shows up on someone’s budget. Schedule pressure mounts. The discipline gets eroded. The resulting failures look like engineering bugs. The engineer is blamed for the symptom. The actual decision was upstream.
Parkinson’s law in the ticket system.
A specific version of this misalignment is worth naming because it is nearly universal. Work expands to fill the process built around it. A team with a heavy ticket-tracking system — sprint planning, story-pointing, retro ceremonies, backlog grooming meetings — will produce work at the rate that fills those structures, not at the rate the structures were designed to enable. The structures were sold as productivity tools. They become the work.
The fix is not “more agile” or “less agile.” It is recognizing that process has a cost, the cost shows up in engineer-hours, and a team that spends 30% of its time on process ceremony is producing 70% of the work it could be producing. Whether that’s a good trade depends on what the process actually buys. Sometimes it buys real coordination value. Sometimes it buys nothing but the appearance of management. The discipline is to measure, not assume.
What it takes from the engineer.
If you are an engineer reading this without management air cover, the principles still apply — but applying them requires a different skill: making the cost case explicitly, with numbers, in language management can act on. “We need to keep the dev hardware support” is a request management can decline under budget pressure. “Dropping the dev hardware will cost us approximately X engineer-days per regression we hunt on production silicon, based on Y regressions per quarter from previous projects” is a request management can’t decline without owning the math.
The engineers who succeed in misaligned environments do this work explicitly. They translate engineering discipline into business cost. They make the invisible cost of skipping a discipline visible enough that the math is on the record before the decision is made. It is not enough to be right. You have to be right in a form management can use.
Engineering rigor is necessary. Management air cover is necessary. Either one alone is insufficient. The healthiest engineering cultures have both, and they’re structured so that neither has to fight the other to do their job. If you find yourself fighting your management to be allowed to apply basic discipline, that’s a flag — not about you, and not about your management individually, but about an alignment problem upstream of both of you that needs to be named and addressed.
Lessons Learned Process
Questions to Ask
What Went Well?
- Which processes worked smoothly?
- What decisions paid off?
- What tools/methods were effective?
- How do we repeat this success?
What Went Wrong?
- What caused delays or rework?
- Which assumptions were wrong?
- What would we do differently?
- How do we prevent this next time?
Process Improvements
- Update templates/checklists
- Add to risk register for similar projects
- Share with team (brown bag lunch)
- Review lessons before next project kickoff