# Roba Nath Basnet — Full Writing Archive

> Certified DevOps Engineer building production blockchain infrastructure, CI/CD pipelines, and multi-cloud deployments. Based in Thimphu, Bhutan. DevOps Engineer at ChainZeeper — owning CI/CD, blockchain node operations, and multi-cloud infrastructure across AWS, DigitalOcean, and Vercel.

This document contains the full text of every published article. Each
section is delimited by `---`. URLs at the top of each article are
canonical.

## Author identity
- Full name: Roba Nath Basnet
- Given name: Roba
- Middle name: Nath
- Family name: Basnet
- Short form: Roba
- Role: DevOps Engineer · Blockchain Infrastructure
- Employer: ChainZeeper (https://chainzeeper.io)
- Location: Thimphu, Bhutan
- Email: roba.eths@gmail.com
- GitHub: https://github.com/robanb
- LinkedIn: https://www.linkedin.com/in/robanbth
- Canonical site: https://basnet.dev
- Not the same as: Basnet Attorneys & Law (https://basnetl.com), a separate law firm in Thimphu.

Site map: https://basnet.dev/llms.txt
Sitemap (XML): https://basnet.dev/sitemap.xml

---

# Eight Principles That Shape the Way Systems Get Built

URL: https://basnet.dev/posts/eight-principles-that-shape-systems
Published: 2026-06-11
Category: Principles
Tags: engineering, reliability, operations, production
Reading time: 2 min read

> Convictions refined through real incidents, tight deadlines, and hard trade-offs — eight principles for building production systems that last.

Not rules handed down from a textbook — convictions refined through real incidents, tight deadlines, and hard trade-offs. Every line here has been earned in production.

## 1. Reliability Over Cleverness

*Simple wins. Clever fails at scale.*

A boring, predictable system that runs at 3 AM without paging anyone is worth more than an elegant architecture that needs a specialist to debug. Production uptime is the product.

## 2. Automate the Second Time

*Understanding before automation.*

The first time, do it manually and understand it deeply. The second time, automate it and never touch it again. Premature automation encodes assumptions that haven't been tested.

## 3. Observability Is Not Optional

*If you can't see it, you can't fix it.*

If a system can't explain its own behavior, it's unfinished. Every service ships with metrics, structured logs, and traces — not as an afterthought, but as a core deliverable.

## 4. Small Changes, Shipped Often

*Deploy frequency is a proxy for confidence.*

Large deployments are large risks. Every change should be small enough to understand in a code review, safe enough to deploy on Friday, and reversible within minutes.

## 5. Trade-offs, Not Best Practices

*Context over convention.*

There is no universal best practice — only trade-offs in context. The right answer depends on team size, budget, timeline, and risk tolerance. Engineering judgment is knowing which constraints matter most.

## 6. Incidents Are Investments

*Failure is feedback, not fault.*

Every outage reveals a gap the team didn't know existed. Blameless post-mortems, actionable follow-ups, and shared learning turn incidents from losses into the highest-value engineering work.

## 7. Multiply, Don't Accumulate

*The best engineers make others better.*

A senior engineer who hoards context is a liability. Documented runbooks, shared on-call rotations, and code that anyone on the team can deploy — the goal is making yourself unnecessary.

## 8. Ship With Conviction

*Decide, ship, measure, adapt.*

Analysis paralysis kills more projects than bad decisions. Make the best call with available information, commit to it, measure the outcome, and adjust. Velocity with feedback loops beats perfection.

---

These principles evolve. Every incident, retrospective, and architecture review sharpens the thinking. What matters is not getting it right once — it's getting it right repeatedly, under pressure, with a team counting on you.


---

# When a .map File Leaks Your Entire Codebase

URL: https://basnet.dev/posts/source-maps-leak-cicd-security
Published: 2026-04-04
Category: Security
Tags: security, ci-cd, npm, devops, build-tools
Reading time: 6 min read

> Lessons from the Claude Code source map incident — how default build settings can silently ship debug artifacts to production.

In March 2026, Anthropic accidentally published the entire source code for Claude Code through a `.map` file included in their npm package. Every file, every comment, every internal constant — all sitting in a JSON file anyone could download. It was the second time it happened in two months.

This wasn't a sophisticated attack. It was a build configuration oversight. And it's a mistake that's far more common than people realize.

## What Are Source Maps?

Source maps are files generated by JavaScript bundlers (Webpack, esbuild, Bun, Rollup) that map minified production code back to the original source. They're essential for debugging — when you see an error at `bundle.js:1:45892`, the source map tells your browser it's actually `src/auth/login.ts:47:12`.

```json title="example.js.map"
{
  "version": 3,
  "sources": ["src/auth/login.ts", "src/api/client.ts"],
  "names": ["authenticate", "refreshToken"],
  "mappings": "AAAA,SAAS,IAAI..."
}
```

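The problem: **source maps can contain enough information to reconstruct your entire original codebase.** The `sources` and `names` arrays expose file paths and identifiers, and when the bundler also embeds the optional `sourcesContent` array, the map carries every original file verbatim: comments, string literals, everything the bundler stripped out to create the minified version.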
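To make the exposure concrete, here is a minimal Python sketch of how little work recovery takes when a map embeds its sources. It assumes the map includes the optional `sourcesContent` array of the source map v3 format; the function name `recover_sources` and the sample data are illustrative, not taken from any real package.

```python
import json


def recover_sources(map_text: str) -> dict:
    """Return {original path: original source} for every embedded file."""
    source_map = json.loads(map_text)
    paths = source_map.get("sources", [])
    # sourcesContent is optional; an entry is null when a file was not embedded.
    contents = source_map.get("sourcesContent") or []
    return {
        path: text
        for path, text in zip(paths, contents)
        if text is not None
    }


# Illustrative map: one file embedded, one omitted.
sample_map = json.dumps({
    "version": 3,
    "sources": ["src/auth/login.ts", "src/api/client.ts"],
    "sourcesContent": ["// TODO: fix this security check\nexport {};", None],
    "names": [],
    "mappings": "",
})

for path, text in recover_sources(sample_map).items():
    print(f"{path}: {len(text)} characters recovered")
```

No decoding of the `mappings` string is needed: if the content is embedded, reading the JSON is enough.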

## How This Happens

Most bundlers generate source maps **by default**. If you don't explicitly disable them for production builds, they ship with your package.

```javascript title="bun-default-behavior.js"
// Bun generates source maps by default
// If you don't add this to your build config:
Bun.build({
  entrypoints: ["./src/index.ts"],
  outdir: "./dist",
  sourcemap: "none",    // THIS LINE prevents the leak
});
```

The same applies to other bundlers:

```javascript title="webpack.config.js"
module.exports = {
  mode: "production",
  // devtool controls source map generation
  // "source-map" generates .map files — DON'T use this in production builds
  // "hidden-source-map" generates maps but doesn't reference them in the bundle
  // false disables source maps entirely
  devtool: false,
};
```

```javascript title="esbuild-config.js"
require("esbuild").buildSync({
  entryPoints: ["src/index.ts"],
  bundle: true,
  minify: true,
  sourcemap: false,     // Explicitly disable for production
  outfile: "dist/bundle.js",
});
```

## The npm-Specific Risk

When publishing to npm, the problem compounds. By default, `npm publish` packs everything in your project directory that isn't excluded by `.npmignore` (or by `.gitignore` when no `.npmignore` exists).

There are two ways to control what gets published:

### Allowlist approach (preferred)

```json title="package.json"
{
  "name": "my-package",
  "files": [
    "dist/**/*.js",
    "dist/**/*.d.ts",
    "!dist/**/*.map"
  ]
}
```

The `files` field is an allowlist — only the listed patterns are included. This is safer because new file types are excluded by default.

### Blocklist approach

```text title=".npmignore"
*.map
src/
tests/
.env*
*.config.ts
```

`.npmignore` is a blocklist — everything is included unless explicitly excluded. This is riskier because new file types (like `.map` files from a bundler change) are included by default.

<Callout type="tip" title="Always use the allowlist approach">
  The `files` field in `package.json` is safer than `.npmignore` because it's deny-by-default. Only the files you explicitly list get published. A new bundler that generates `.map` files won't accidentally ship them.
</Callout>
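The difference between the two approaches shows up the day a bundler starts emitting a new file type. The sketch below simulates that with Python's `fnmatchcase`, which is only an approximation of npm's real matching rules; the patterns and file names are simplified, hypothetical examples.

```python
from fnmatch import fnmatchcase


def published_with_allowlist(files, allow):
    """Deny by default: a file ships only if some pattern lists it."""
    return [f for f in files if any(fnmatchcase(f, p) for p in allow)]


def published_with_blocklist(files, block):
    """Allow by default: a file ships unless some pattern excludes it."""
    return [f for f in files if not any(fnmatchcase(f, p) for p in block)]


# A bundler change has just started emitting index.js.map.
build_output = ["dist/index.js", "dist/index.d.ts", "dist/index.js.map"]

allow = ["dist/*.js", "dist/*.d.ts"]
block = ["src/*", "tests/*"]  # written before .map files existed

print(published_with_allowlist(build_output, allow))
print(published_with_blocklist(build_output, block))
```

With the allowlist, the unexpected `.map` file is left out without anyone updating the config. With the blocklist, it ships until someone notices.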

## What Gets Exposed

When a source map leaks, an attacker gets:

- **Full source code** — every file, reconstructed from the mapping
- **Internal comments** — `// TODO: fix this security check` becomes a roadmap
- **Hardcoded strings** — API endpoints, internal URLs, feature flag names
- **Architecture details** — file structure reveals how the system is organized
- **Potential secrets** — any API keys, tokens, or credentials in the source

For AI products specifically, system prompts and orchestration logic get exposed — revealing how the AI is instructed to behave and what guardrails exist.

## Preventing This in CI/CD

The fix isn't just "remember to disable source maps." It's building automated checks into your pipeline so a human mistake can't ship debug artifacts.

### Step 1: Build Configuration

```javascript title="build.config.ts"
const isProduction = process.env.NODE_ENV === "production";

export default {
  sourcemap: isProduction ? false : "inline",
  minify: isProduction,
};
```

### Step 2: CI Validation

```yaml title=".github/workflows/publish.yml"
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run build

      - name: Verify no source maps in dist
        run: |
          if find dist -name "*.map" | grep -q .; then
            echo "ERROR: Source map files found in dist/"
            find dist -name "*.map"
            exit 1
          fi

      - name: Verify no source maps in package
        run: |
          # npm prints the file list as notices on stderr, hence 2>&1
          if npm pack --dry-run 2>&1 | grep -iE '\.map[[:space:]]*$'; then
            echo "ERROR: .map files would be included in package"
            exit 1
          fi

      - run: npm publish
```
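Grepping human-readable output works, but the same check can be made structural. A minimal sketch, assuming `npm` is on `PATH` and that `npm pack --dry-run --json` prints a JSON array whose entries carry a `files` list with a `path` per file; the function names are illustrative, and lifecycle scripts that write to stdout could interfere with the parsing.

```python
import json
import subprocess


def map_files_in_pack(pack_json: str) -> list:
    """Return the paths of all .map files listed in npm's pack report."""
    packages = json.loads(pack_json)
    return [
        entry["path"]
        for package in packages
        for entry in package.get("files", [])
        if entry["path"].endswith(".map")
    ]


def check_current_package() -> list:
    """Run the dry-run pack in the current directory and inspect it."""
    result = subprocess.run(
        ["npm", "pack", "--dry-run", "--json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return map_files_in_pack(result.stdout)
```

A CI step could call `check_current_package()` and fail the job when the returned list is not empty.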

### Step 3: Pre-publish Hook

```json title="package.json"
{
  "scripts": {
    "prepublishOnly": "node scripts/verify-no-sourcemaps.js"
  }
}
```

```javascript title="scripts/verify-no-sourcemaps.js"
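// fs.globSync requires Node.js 22 or later, and the import syntax requires
// "type": "module" in package.json (or renaming this file to .mjs)
import { globSync } from "node:fs";

const maps = globSync("dist/**/*.map");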
if (maps.length > 0) {
  console.error("Source map files found — aborting publish:");
  maps.forEach((f) => console.error(`  ${f}`));
  process.exit(1);
}
```

## The Broader Lesson

This incident isn't really about source maps. It's about a fundamental principle: **build tool defaults are not production-safe defaults.** Every tool in your build pipeline has settings optimized for developer experience, not for production security.

Audit your build output:

```bash title="audit-build.sh"
#!/bin/bash
echo "=== Files in build output ==="
find dist -type f | sort

echo ""
echo "=== Checking for debug artifacts ==="
find dist -name "*.map" -o -name "*.map.js" -o -name "*.d.ts.map"

echo ""
echo "=== Checking for source references ==="
grep -r "sourceMappingURL" dist/ || echo "No source map references found"

echo ""
echo "=== What npm will publish ==="
npm pack --dry-run 2>&1
```
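A `grep` for `sourceMappingURL` tells you a reference exists, not where it points. The following sketch extracts the target as well, which helps distinguish an external `.map` file from an inline `data:` URI. It is an illustration under simple assumptions (UTF-8 text files, one reference per file); the helper names are hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# Matches both "//# sourceMappingURL=..." and the CSS form "/*# sourceMappingURL=... */".
SOURCE_MAP_COMMENT = re.compile(r"[#@]\s*sourceMappingURL=(\S+)")


def extract_reference(text: str) -> Optional[str]:
    """Return the source map target named in a file's contents, if any."""
    match = SOURCE_MAP_COMMENT.search(text)
    return match.group(1) if match else None


def find_source_map_references(root: str) -> dict:
    """Map each bundle under root to the source map it references."""
    found = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in {".js", ".mjs", ".cjs", ".css"}:
            continue
        target = extract_reference(path.read_text(encoding="utf-8", errors="replace"))
        if target is not None:
            found[str(path)] = target
    return found
```

Running `find_source_map_references("dist")` after a production build should return an empty dictionary.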

## Key Takeaways

1. **Source maps reconstruct your entire codebase** — they're not just "debug info"
2. **Most bundlers generate them by default** — you must explicitly disable them for production
3. **Use `files` in package.json, not `.npmignore`** — allowlist beats blocklist
4. **Add CI checks for debug artifacts** — don't rely on humans remembering build flags
5. **Audit your build output regularly** — run `npm pack --dry-run` and inspect what you're shipping

If a major AI company can ship source maps to npm twice in two months, it can happen to anyone. The difference is whether you have automated guardrails or just good intentions.

<Sources items={[
  { title: "Anthropic Employee Error Exposes Claude Code Source", url: "https://www.infoworld.com/article/4152856/anthropic-employee-error-exposes-claude-code-source.html", publisher: "InfoWorld" },
  { title: "Introduction to JavaScript Source Maps", url: "https://developer.chrome.com/docs/devtools/javascript/source-maps/", publisher: "Chrome DevTools" },
  { title: "npm package.json files field", url: "https://docs.npmjs.com/cli/v10/configuring-npm/package-json#files", publisher: "npm Docs" },
  { title: "Webpack Devtool Configuration", url: "https://webpack.js.org/configuration/devtool/", publisher: "Webpack" },
]} />


---

# Troubleshooting Terraform: Patterns Worth Knowing

URL: https://basnet.dev/posts/troubleshooting-terraform-in-production
Published: 2026-04-02
Category: Infrastructure
Tags: terraform, iac, devops, debugging
Reading time: 5 min read

> Apply failures, cycle errors, and state drift — the three categories of Terraform problems that surface in production, and how to fix them.

Terraform is the backbone of modern infrastructure stacks. It's also the tool that produces some of the most cryptic error messages in the DevOps ecosystem. Across multiple cloud providers and blockchain infrastructure deployments, a pattern emerges: most problems fall into three buckets.

## Apply Failures

The most common category. You run `terraform apply`, and it fails — sometimes with a helpful message, sometimes not.

### Provider Authentication Errors

The first thing to check when an apply fails unexpectedly:

```bash title="debug-auth.sh"
# Enable verbose logging to trace authentication flow
export TF_LOG=DEBUG
terraform plan 2>&1 | grep -i "auth\|credential\|token"
```

Nine times out of ten, it's an expired token or a misconfigured environment variable. A basic checklist covers most cases:

```bash
# Verify credentials are actually set
echo $AWS_ACCESS_KEY_ID | head -c 8    # Should show first 8 chars
echo $AWS_REGION                        # Should not be empty
aws sts get-caller-identity             # The definitive test
```

### Resource Validation Errors

Terraform validates resource properties against the provider schema, but some validations only happen at apply time — the provider sends the request to the API, and the API rejects it.

```hcl title="main.tf"
resource "aws_instance" "node" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"

  # This will fail at apply time if the subnet
  # doesn't exist or belongs to a different VPC
  subnet_id = var.subnet_id
}
```

<Callout type="tip" title="Validate before you apply">
  Run `terraform validate` as the first step in every CI pipeline. It catches syntax and basic configuration errors without touching any APIs. It won't catch runtime issues, but it eliminates an entire class of failures for free.
</Callout>

### Permission Errors

The most frustrating apply failures are permission errors that only surface on specific resource types. Your IAM role might have EC2 permissions but lack the `iam:PassRole` permission needed to attach an instance profile.

```bash
# When you get "AccessDenied", trace exactly which API call failed
export TF_LOG=TRACE
terraform apply 2>&1 | grep "HTTP/1.1\|Action\|AccessDenied"
```

## Cycle Errors

Cycle errors happen when Terraform detects a circular dependency in your resource graph. Resource A depends on Resource B, which depends on Resource A.

```
Error: Cycle: aws_security_group.app, aws_security_group_rule.app_to_db,
       aws_security_group.db, aws_security_group_rule.db_to_app
```

The fix is almost always to break the cycle by using standalone resource rules instead of inline blocks:

```hcl title="networking.tf"
# Instead of inline ingress/egress rules inside the security groups,
# use separate aws_security_group_rule resources

resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = var.vpc_id
}

resource "aws_security_group" "db" {
  name   = "db-sg"
  vpc_id = var.vpc_id
}

# These don't create cycles because they reference
# the security groups, not the other way around
resource "aws_security_group_rule" "app_to_db" {
  type                     = "egress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.app.id
  source_security_group_id = aws_security_group.db.id
}
```
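To see why the standalone rules help, it can be useful to model the resource graph directly. The sketch below is a generic depth-first cycle search, not Terraform's actual implementation; the graphs are hand-written approximations of the inline and standalone layouts.

```python
from typing import Optional


def find_cycle(graph: dict) -> Optional[list]:
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNVISITED for node in graph}
    stack = []

    def visit(node):
        state[node] = IN_PROGRESS
        stack.append(node)
        for dep in graph.get(node, []):
            if state.get(dep, UNVISITED) == IN_PROGRESS:
                # dep is already on the current path: that path is a cycle.
                return stack[stack.index(dep):] + [dep]
            if state.get(dep, UNVISITED) == UNVISITED:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state[node] == UNVISITED:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Inline rules: each security group references the other.
inline_rules = {
    "aws_security_group.app": ["aws_security_group.db"],
    "aws_security_group.db": ["aws_security_group.app"],
}

# Standalone rule: only the rule references the groups.
standalone_rules = {
    "aws_security_group.app": [],
    "aws_security_group.db": [],
    "aws_security_group_rule.app_to_db": [
        "aws_security_group.app",
        "aws_security_group.db",
    ],
}

print(find_cycle(inline_rules))
print(find_cycle(standalone_rules))
```

In the standalone layout every edge points from a rule to a group, so no path can lead back to its starting node.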

## State Issues

State problems are the scariest because they can cause Terraform to destroy and recreate resources you didn't intend to touch.

### State Drift

When someone manually changes infrastructure that Terraform manages:

```bash
# Detect drift without making changes
terraform plan -refresh-only

# If the drift is intentional, accept it into state,
# then update the configuration to match
terraform apply -refresh-only
```
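Drift detection is easier to automate when the outcome is machine-readable. A minimal sketch, assuming the `terraform` binary is on `PATH`: `terraform plan -detailed-exitcode` exits with 0 when nothing would change, 1 on error, and 2 when the plan is non-empty, which a scheduled job can map to an alert. The function names and status labels are illustrative.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Translate the -detailed-exitcode result into a status label."""
    if code == 0:
        return "in-sync"
    if code == 2:
        # Either real drift or configuration changes not yet applied.
        return "changes-detected"
    return "error"


def check_drift(workdir: str) -> str:
    """Run a non-interactive plan in workdir and classify the outcome."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

Run against a branch where every merged change has already been applied, a `changes-detected` result points to something modified outside Terraform.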

### State Lock Conflicts

When a previous `apply` crashed and left the state locked:

```bash
# Before force-unlocking, always verify no other
# apply is actually running

# Release the stale lock; the lock ID is printed in the
# "Error acquiring the state lock" message
terraform force-unlock LOCK_ID
```

<Callout type="warning" title="Never force-unlock without checking">
  Force-unlocking a state file while another apply is running can corrupt your state. Always verify that no other process is holding the lock before using `force-unlock`.
</Callout>

### State File Corruption

The nuclear option. If your state file is corrupted beyond repair:

```bash
# Back up the corrupted state
cp terraform.tfstate terraform.tfstate.corrupt

# Pull resources back into a fresh state
terraform import aws_vpc.main vpc-0abc123
terraform import aws_subnet.private subnet-0abc123
# ... repeat for every resource
```

This is painful. It's also why remote state with versioning enabled is a non-negotiable baseline:

```hcl title="backend.tf"
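terraform {
  backend "s3" {
    # Versioning is a property of the bucket itself; enable it
    # on the bucket, since this block cannot turn it on
    bucket         = "myproject-terraform-state"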
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```

## A Reliable Debugging Workflow

When something breaks, this sequence cuts through the noise:

1. **Read the full error** — not just the last line, the full output
2. **Check `TF_LOG=DEBUG`** — the verbose output usually reveals the root cause
3. **Run `terraform plan`** — see what Terraform thinks the current state is
4. **Check the provider changelog** — provider updates frequently introduce breaking changes
5. **Search the provider's GitHub issues** — someone else has usually hit the same problem

The fastest path to fixing a Terraform issue is understanding what Terraform thinks reality looks like versus what it actually looks like.

<Sources items={[
  { title: "Terraform CLI Documentation", url: "https://developer.hashicorp.com/terraform/cli", publisher: "HashiCorp" },
  { title: "Terraform State Management", url: "https://developer.hashicorp.com/terraform/language/state", publisher: "HashiCorp" },
  { title: "How To Troubleshoot Terraform", url: "https://www.digitalocean.com/community/tutorials/how-to-troubleshoot-terraform", publisher: "DigitalOcean" },
  { title: "Debugging Terraform", url: "https://developer.hashicorp.com/terraform/internals/debugging", publisher: "HashiCorp" },
]} />


---

# Using LLMs for Incident Response — What Works and What Doesn't

URL: https://basnet.dev/posts/llms-in-incident-response
Published: 2026-04-01
Category: AI
Tags: ai, llm, incident-response, devops, observability
Reading time: 5 min read

> After integrating AI into an on-call workflow, here's what actually reduced MTTR and what turned out to be expensive noise.

Everyone's talking about AI transforming DevOps. After six months of integrating LLMs into an incident response workflow, the picture is more nuanced: AI is genuinely useful for some parts of incident response and actively harmful for others.

## Where LLMs Actually Help

### Log Summarization

When a production incident generates thousands of log lines across multiple services, an LLM can summarize the pattern faster than any human:

```python title="log-summarizer.py"
import anthropic

# Reads ANTHROPIC_API_KEY from the environment
client = anthropic.Anthropic()


def summarize_incident_logs(logs: list[str], context: str) -> str:
    prompt = f"""You are an SRE analyzing a production incident.
Context: {context}

Here are the relevant log entries (newest first):
{chr(10).join(logs[:200])}

Summarize:
1. What service(s) are affected
2. The sequence of events leading to the failure
3. Any error patterns or recurring messages
4. Suggested areas to investigate"""

    # A Claude model has to be called through the Anthropic SDK,
    # not through the OpenAI client
    response = client.messages.create(
        model="claude-sonnet-4-6-20250414",
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```

In practice, this cuts the initial triage phase from 10-15 minutes down to 2-3 minutes. The LLM doesn't need to be right about the root cause — it just needs to point the on-call engineer in the right direction.
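Because only a fixed number of log lines fit in the prompt, it helps if each line carries new information. The sketch below is one possible pre-processing step, not part of the original workflow: it collapses lines that differ only in digits (timestamps, durations, replica numbers) into one representative with a count. The function name and the digit-masking heuristic are assumptions.

```python
import re
from collections import Counter

_DIGITS = re.compile(r"\d+")


def collapse_repeats(logs: list, limit: int = 200) -> list:
    """Group lines that match after masking digits; keep first occurrence."""
    counts = Counter()
    first_seen = {}
    for line in logs:
        key = _DIGITS.sub("#", line)
        counts[key] += 1
        first_seen.setdefault(key, line)
    # Dicts preserve insertion order, so output follows first appearance.
    return [
        f"[x{counts[key]}] {first_seen[key]}"
        for key in list(first_seen)[:limit]
    ]


sample_logs = [
    "12:00:01 timeout connecting to db-1 after 5000ms",
    "12:00:02 timeout connecting to db-1 after 5003ms",
    "12:00:03 OOMKilled pod api-7",
]

for line in collapse_repeats(sample_logs):
    print(line)
```

The trade-off is that masking every digit also merges lines that differ in a meaningful number, such as an HTTP status code, so the heuristic may need tuning per log format.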

### Runbook Retrieval

We indexed our runbooks and post-mortems into a vector database. When an alert fires, the system retrieves relevant past incidents:

```python title="runbook-search.py"
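import chromadb


def find_relevant_runbooks(alert_summary: str, n_results: int = 3):
    # Assumes the index was persisted to ./runbook-index (hypothetical path);
    # a bare in-memory Client() starts empty on every run
    db = chromadb.PersistentClient(path="./runbook-index")
    collection = db.get_collection("runbooks")

    results = collection.query(
        query_texts=[alert_summary],
        n_results=n_results,
    )

    return [
        {
            "title": meta["title"],
            "resolution": meta["resolution"],
            # Chroma returns distances: lower means more similar
            "distance": distance,
        }
        for meta, distance in zip(results["metadatas"][0], results["distances"][0])
    ]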
```

This is where AI shines — pattern matching across hundreds of past incidents that no human could recall on demand.

### Change Correlation

When an incident occurs, the LLM cross-references recent deployments, config changes, and infrastructure modifications:

```bash title="correlate-changes.sh"
#!/bin/bash
# Gather recent changes for AI analysis
echo "=== Deployments (last 4 hours) ==="
kubectl get events --field-selector reason=Pulling -A --sort-by='.lastTimestamp' | tail -20

echo "=== Config Changes ==="
git log --oneline --since="4 hours ago" -- "k8s/" "terraform/"

echo "=== Infrastructure Events ==="
aws cloudtrail lookup-events \
  --start-time "$(date -d '4 hours ago' -u +%FT%TZ)" \
  --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances
```

<Callout type="tip" title="Change correlation is the highest-ROI AI use case">
  In practice, 70%+ of production incidents are caused by a recent change. Automating the "what changed recently?" question saves the most time during the critical first minutes of an incident.
</Callout>

## Where LLMs Fail

### Automated Remediation

Letting an LLM execute fixes in production is a terrible idea. When tested, the model suggested scaling up a database replica to handle increased load — reasonable in theory, but it didn't account for the storage class limitations that would have caused the new replica to start without persistent storage.

```yaml title="what-the-llm-suggested.yml"
# The LLM generated this "fix"
# It would have created a replica with no persistent storage
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-replica
spec:
  replicas: 3  # was 2 — LLM scaled it up
  # Missing: volumeClaimTemplates
  # Missing: storage class specification
  # Missing: replication configuration
```

**Rule: AI suggests, humans execute.** Always.

### Root Cause Analysis

LLMs are confident but often wrong about root causes. They'll correlate events that happen to co-occur and present them as causal relationships. A memory spike and a deployment happening at the same time doesn't mean the deployment caused the memory spike.

### Real-Time Decision Making

During an active incident, you need speed and accuracy. LLM latency (even a few seconds) and hallucination risk make them unsuitable for real-time decisions. Use them for preparation and post-incident analysis, not during the heat of the moment.

<Callout type="warning" title="Never let AI make production decisions autonomously">
  An LLM that's wrong 5% of the time sounds impressive — until you realize that's one wrong production decision every 20 incidents. In infrastructure, a single wrong call can cascade into a much larger outage.
</Callout>
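The arithmetic behind that warning is worth spelling out. A 5% error rate means one wrong call per 20 incidents on average, and the chance of at least one wrong call grows quickly with volume. The sketch assumes each decision is independent with the same error rate, which is a simplification.

```python
def chance_of_at_least_one_error(error_rate: float, decisions: int) -> float:
    """Probability that at least one of n independent decisions is wrong."""
    return 1 - (1 - error_rate) ** decisions


def expected_errors(error_rate: float, decisions: int) -> float:
    """Average number of wrong decisions across n decisions."""
    return error_rate * decisions


for n in (1, 20, 50):
    probability = chance_of_at_least_one_error(0.05, n)
    print(f"{n} incidents: {probability:.0%} chance of at least one wrong call")
```

Over 20 incidents the chance of at least one wrong autonomous decision is roughly 64%, not 5%.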

## The Architecture That Works

After iterating, here's the setup that works in practice:

1. **Alert fires** → PagerDuty notifies on-call
2. **Automated context gathering** → script collects logs, metrics, recent changes
3. **AI summarization** → LLM summarizes the context and suggests investigation areas
4. **Runbook retrieval** → vector search finds relevant past incidents
5. **Human decision** → on-call engineer reads the AI summary and decides what to do
6. **Post-incident** → LLM drafts the post-mortem from the incident timeline

The AI never touches production. It reads, summarizes, and suggests. The human investigates and acts.

## Measuring the Impact

After six months:

- **MTTR reduced by 34%** — mostly from faster initial triage
- **On-call cognitive load decreased** — engineers report less "where do I start?" anxiety
- **Post-mortem quality improved** — AI-drafted timelines are more thorough than human-recalled ones
- **False positive rate unchanged** — AI doesn't help with alert tuning (yet)

## Key Takeaways

1. **AI is best at summarization and retrieval** — not decision-making
2. **Never let AI execute in production** — suggest only, humans approve
3. **Index your post-mortems** — they're your most valuable training data
4. **Measure before and after** — "AI-powered" means nothing without metrics
5. **Start with log summarization** — it's the easiest win with the lowest risk

<Sources items={[
  { title: "AI for IT Operations (AIOps)", url: "https://www.gartner.com/en/information-technology/glossary/aiops-artificial-intelligence-operations", publisher: "Gartner" },
  { title: "Reducing MTTR with Machine Learning", url: "https://sre.google/workbook/postmortem-culture/", publisher: "Google SRE" },
  { title: "Vector Databases for Retrieval-Augmented Generation", url: "https://docs.trychroma.com/", publisher: "Chroma" },
  { title: "Anthropic Claude API Documentation", url: "https://docs.anthropic.com/en/docs/intro-to-claude", publisher: "Anthropic" },
]} />


---

# Automating Code Review with AI — Architecture and Honest Results

URL: https://basnet.dev/posts/ai-code-review-automation
Published: 2026-03-30
Category: AI
Tags: ai, code-review, automation, ci-cd, devops
Reading time: 6 min read

> AI-powered code review integrated into a PR workflow. Here's the architecture, the prompt engineering, and the metrics after 3 months.

Manual code review is a bottleneck. Senior engineers spend hours daily reviewing PRs, context-switching between their own work and review queues. Integrating AI into a code review workflow — not to replace human reviewers, but to handle the repetitive checks so humans can focus on architecture and logic — is a practical way to reclaim that time.

## What AI Reviews Well

After three months, the pattern is clear. AI catches mechanical issues reliably:

- **Security vulnerabilities** — SQL injection, XSS, hardcoded credentials, insecure deserialization
- **Bug patterns** — null pointer risks, off-by-one errors, race conditions in obvious cases
- **Style consistency** — naming conventions, import ordering, dead code
- **Documentation gaps** — public APIs without JSDoc, missing error descriptions
- **Dependency risks** — known CVEs in added packages, license incompatibilities

What it doesn't do well: architectural decisions, business logic validation, performance implications in context.

## The Architecture

```yaml title=".github/workflows/ai-review.yml"
name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  ai-review:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get PR diff
        id: diff
        run: |
          git diff origin/${{ github.base_ref }}...HEAD > pr.diff
          echo "diff_size=$(wc -l < pr.diff)" >> $GITHUB_OUTPUT

      - name: Run AI review
        if: steps.diff.outputs.diff_size < 2000
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          # The gh CLI used to post the comment needs a token in Actions
          GH_TOKEN: ${{ github.token }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: node scripts/ai-review.js
```

AI review is skipped on large diffs (2000+ lines). LLMs lose accuracy on massive context windows, and large PRs should be split anyway.

### The Review Script

```typescript title="scripts/ai-review.ts"
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "fs";
import { execSync } from "child_process";

const client = new Anthropic();

const diff = readFileSync("pr.diff", "utf-8");
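// GITHUB_BASE_REF is set by Actions on pull_request events; fall back to main locally
const baseRef = process.env.GITHUB_BASE_REF ?? "main";
const changedFiles = execSync(`git diff --name-only origin/${baseRef}...HEAD`)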
  .toString()
  .trim()
  .split("\n");

const prompt = `You are a senior software engineer reviewing a pull request.

## Changed files
${changedFiles.join("\n")}

## Diff
${diff}

Review this PR for:
1. **Security issues** — injection, XSS, hardcoded secrets, insecure patterns
2. **Bugs** — null/undefined risks, incorrect logic, edge cases
3. **Performance** — obvious N+1 queries, unnecessary re-renders, memory leaks
4. **Best practices** — error handling, naming, code organization

Rules:
- Only comment on issues that are clearly wrong or risky
- Do NOT suggest stylistic preferences or minor refactors
- Do NOT comment on unchanged code
- Be specific: reference the file and line number
- If the PR looks good, say so briefly

Format each issue as:
**[SEVERITY]** \`file:line\` — description`;

const response = await client.messages.create({
  model: "claude-sonnet-4-6-20250414",
  max_tokens: 2000,
  messages: [{ role: "user", content: prompt }],
});

const review = response.content[0].type === "text"
  ? response.content[0].text
  : "";

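// Post as a PR comment via the gh CLI. The body goes through stdin so that
// quotes, backticks, or $(...) in the model output never reach the shell.
execSync(`gh pr comment ${process.env.PR_NUMBER} --body-file -`, {
  input: review,
});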
```

<Callout type="warning" title="Never auto-approve or auto-merge based on AI review">
  AI review is an additional signal, not a replacement for human judgment. The AI might miss a subtle business logic error that only a domain expert would catch. Always require at least one human approval.
</Callout>

## Prompt Engineering Matters

The prompt went through 15+ iterations before reaching its current state. Key lessons:

### Be explicit about what NOT to flag

Without "Do NOT suggest stylistic preferences," the AI generates dozens of comments about variable naming and formatting that clutter the review.

### Severity levels reduce noise

```
**[CRITICAL]** — Must fix before merge (security, data loss)
**[WARNING]**  — Should fix, potential bug or risk
**[INFO]**     — Suggestion, take it or leave it
```

Developers learned to ignore `[INFO]` and always address `[CRITICAL]`. Without severity, every comment felt equally urgent.
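Severity tags also make the review machine-readable. As a hedged illustration, the sketch below parses comments in the format the prompt requests (a bold severity tag, a backticked `file:line`, a dash, then the description) so that a pipeline could, for example, surface only `[CRITICAL]` items. The `Issue` type and function names are hypothetical, and real model output may not always follow the format.

```python
import re
from dataclasses import dataclass

# Accepts either an em dash (U+2014) or a plain hyphen as the separator.
ISSUE_PATTERN = re.compile(
    r"^\*\*\[(CRITICAL|WARNING|INFO)\]\*\*\s+`([^`:]+):(\d+)`\s+[\u2014-]\s+(.+)$"
)


@dataclass
class Issue:
    severity: str
    file: str
    line: int
    description: str


def parse_review(review: str) -> list:
    """Extract every well-formed issue line; ignore free-form text."""
    issues = []
    for raw in review.splitlines():
        match = ISSUE_PATTERN.match(raw.strip())
        if match:
            issues.append(
                Issue(match[1], match[2], int(match[3]), match[4])
            )
    return issues


def has_critical(issues: list) -> bool:
    return any(issue.severity == "CRITICAL" for issue in issues)


sample_review = "\n".join([
    "Overall the change looks reasonable.",
    "**[CRITICAL]** `src/db.ts:42` \u2014 query built by string concatenation",
    "**[INFO]** `src/util.ts:7` - consider a more descriptive name",
])

for issue in parse_review(sample_review):
    print(issue.severity, issue.file, issue.line)
```

Lines that do not match are skipped rather than treated as errors, so a malformed comment degrades to being ignored instead of breaking the pipeline.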

### Context window management

For PRs that approach the size limit, the script sends only the diff plus the full content of modified files, not the entire repo. This keeps the context focused and the review relevant:

```typescript title="context-management.ts"
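import { existsSync, readFileSync } from "fs";

function buildContext(changedFiles: string[], maxTokens: number): string {
  const fileContents: string[] = [];
  let estimatedTokens = 0;

  for (const file of changedFiles) {
    // Files deleted in the PR appear in the diff but no longer exist on disk
    if (!existsSync(file)) continue;

    const content = readFileSync(file, "utf-8");
    // Rough heuristic: about 4 characters per token
    const tokens = Math.ceil(content.length / 4);

    // Skip a file that would overflow the budget; smaller ones may still fit
    if (estimatedTokens + tokens > maxTokens) continue;

    fileContents.push(`--- ${file} ---\n${content}`);
    estimatedTokens += tokens;
  }

  return fileContents.join("\n\n");
}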
```

## Results After 3 Months

| Metric | Before | After |
|--------|--------|-------|
| Avg time to first review | 4.2 hours | 12 minutes (AI) + 2.1 hours (human) |
| Security issues caught in review | ~60% | ~92% |
| Review comments per PR (human) | 4.7 | 2.1 |
| Developer satisfaction with review process | 3.2/5 | 4.1/5 |

The biggest win isn't speed — it's that human reviewers now focus on higher-level feedback because the AI already handled the mechanical checks.

## Common Pitfalls

### False positives erode trust

If the AI flags non-issues, developers stop reading its comments. Tracking the false positive rate and tuning the prompt when it exceeds 15% is essential to maintaining trust.
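Tracking this takes very little machinery. A minimal sketch, assuming reviewers mark each AI comment they consider wrong (for example with a reaction or a label); how those counts are collected is outside the sketch, and the names are illustrative.

```python
def false_positive_rate(flagged: int, dismissed_as_wrong: int) -> float:
    """Share of AI comments that reviewers judged to be non-issues."""
    if flagged == 0:
        return 0.0
    return dismissed_as_wrong / flagged


def needs_prompt_tuning(
    flagged: int, dismissed_as_wrong: int, threshold: float = 0.15
) -> bool:
    """True when the rate exceeds the 15% threshold mentioned above."""
    return false_positive_rate(flagged, dismissed_as_wrong) > threshold


print(needs_prompt_tuning(flagged=40, dismissed_as_wrong=8))
print(needs_prompt_tuning(flagged=40, dismissed_as_wrong=4))
```

Computed weekly, the rate gives an early signal before developers start ignoring the comments.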

### Cost management

At ~$0.03 per review for average PRs (Sonnet), cost is negligible. But large PRs with full file context can hit $0.50+. The 2000-line diff limit keeps costs predictable:

```bash
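# Monthly cost tracking: count this month's AI review runs
# (monthly cost is roughly this count times the average cost per review)
echo "Reviews this month: $(gh run list --repo myorg/myapp \
  --workflow 'AI Code Review' --created ">=$(date +%Y-%m-01)" \
  --limit 1000 --json databaseId --jq 'length')"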
```

### Don't review generated code

Auto-generated files (GraphQL types, Prisma client, lock files) produce noise. Exclude them:

```yaml title="ai-review-config.yml"
exclude_patterns:
  - "*.generated.ts"
  - "*.lock"
  - "prisma/migrations/**"
  - "__generated__/**"
  - "*.min.js"
```
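
How the review script applies these patterns is not shown here. As a rough sketch, assuming simplified glob semantics (`*` stays within one path segment, `**` crosses segments, and patterns without a slash match the file name only), the filter could look like this. `globToRegExp` and `isExcluded` are hypothetical names, and a real glob library handles many more cases.

```typescript
// Convert a simple glob into a RegExp. Supports only `*` and `**`.
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters, leaving `*` for the glob translation below.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped
    .replace(/\*\*/g, "\u0000") // placeholder so `**` survives the next step
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, ".*");
  return new RegExp("^" + pattern + "$");
}

// True when the path matches any exclude pattern.
function isExcluded(filePath: string, patterns: string[]): boolean {
  const baseName = filePath.split("/").pop() || filePath;
  return patterns.some((glob) => {
    // Patterns without a slash are matched against the file name only.
    const target = glob.indexOf("/") !== -1 ? filePath : baseName;
    return globToRegExp(glob).test(target);
  });
}
```

Under these semantics `__generated__/**` only matches a top-level `__generated__` directory, so nested ones would need a pattern such as `**/__generated__/**`.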

## Key Takeaways

1. **AI handles mechanical checks — humans handle judgment** — this split is where value lives
2. **Prompt engineering is 80% of the work** — the same model with a bad prompt is useless
3. **False positive rate is the critical metric** — above 15%, developers ignore the AI entirely
4. **Skip large diffs** — AI accuracy degrades on massive PRs
5. **Track cost and adjust** — set context limits to keep costs predictable

<Sources items={[
  { title: "Anthropic Claude API", url: "https://docs.anthropic.com/en/docs/intro-to-claude", publisher: "Anthropic" },
  { title: "GitHub Actions — Pull Request Events", url: "https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request", publisher: "GitHub" },
  { title: "Effective Code Review", url: "https://google.github.io/eng-practices/review/", publisher: "Google Engineering" },
  { title: "AI-Assisted Software Development", url: "https://www.infoq.com/articles/ai-assisted-development/", publisher: "InfoQ" },
]} />


---

# Kubernetes Debugging Patterns for Production

URL: https://basnet.dev/posts/kubernetes-debugging-production
Published: 2026-03-25
Category: Kubernetes
Tags: kubernetes, debugging, devops, containers
Reading time: 6 min read

> CrashLoopBackOff, OOMKilled, stuck deployments, and networking mysteries — a field guide to debugging K8s when things go wrong.

Kubernetes debugging is its own skill. The error messages are often vague, the failure modes are distributed, and the logs you need are scattered across multiple layers. These are the patterns that come up most often in production.

## CrashLoopBackOff

The most common Kubernetes problem. Your pod starts, crashes, restarts, crashes again, and the delay between restarts grows exponentially, up to a five-minute cap.

```bash title="diagnose-crashloop.sh"
# Step 1: Check what the container is actually doing
kubectl logs pod/myapp-7b4f6d8c5-x2k9m --previous

# Step 2: If logs are empty, the process crashed before logging
kubectl describe pod myapp-7b4f6d8c5-x2k9m | grep -A5 "Last State"

# Step 3: Check the exit code
kubectl get pod myapp-7b4f6d8c5-x2k9m -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```

Common exit codes and what they mean:

| Exit Code | Meaning | Common Cause |
|-----------|---------|--------------|
| 1 | Application error | Unhandled exception, missing config |
| 137 | SIGKILL | Memory limit exceeded (OOMKilled) or a forced kill after the grace period |
| 139 | SIGSEGV | Segmentation fault |
| 143 | SIGTERM | Graceful shutdown (normal) |
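
The codes above 128 in this table follow the usual shell convention of 128 plus the signal number (137 = 128 + 9, 143 = 128 + 15). A small sketch of that decoding, with `describeExitCode` as a hypothetical helper and only the signals from the table included:

```typescript
// Signals commonly seen in container exit codes.
const SIGNAL_NAMES: { [signal: number]: string } = {
  9: "SIGKILL",
  11: "SIGSEGV",
  15: "SIGTERM",
};

// Exit codes above 128 conventionally mean "terminated by signal (code - 128)".
function describeExitCode(exitCode: number): string {
  if (exitCode === 0) return "success";
  if (exitCode > 128) {
    const signal = exitCode - 128;
    const name = SIGNAL_NAMES[signal] || "signal " + signal;
    return "killed by " + name;
  }
  return "application error";
}
```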

### OOMKilled Specifically

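Exit code 137 means the container was killed with SIGKILL. The usual cause is exceeding its memory limit, which Kubernetes reports as `OOMKilled`, so confirm the reason before acting: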

```bash
# Confirm OOMKilled
kubectl describe pod myapp-7b4f6d8c5-x2k9m | grep -i oom

# Check current memory usage vs limits
kubectl top pod myapp-7b4f6d8c5-x2k9m
kubectl get pod myapp-7b4f6d8c5-x2k9m -o jsonpath='{.spec.containers[0].resources}'
```

<Callout type="warning" title="Don't just increase the memory limit">
  OOMKilled usually indicates a memory leak, not an undersized limit. Increasing the limit just delays the crash. Profile the application first — check for unclosed connections, unbounded caches, or event listener leaks.
</Callout>

## Stuck Deployments

A deployment that never completes — pods are in `Pending`, `ContainerCreating`, or `Init` state indefinitely.

### Pending Pods

```bash
# Why is the pod pending?
kubectl describe pod myapp-pending-pod | grep -A10 "Events"

# Common causes:
# - Insufficient CPU/memory on nodes
# - No nodes matching nodeSelector/affinity rules
# - PersistentVolumeClaim not bound
```

Check cluster capacity:

```bash
# Node resource availability
kubectl describe nodes | grep -A5 "Allocated resources"

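# Nodes whose status is not plainly Ready (NotReady, or Ready,SchedulingDisabled)
# A plain `grep -v Ready` would also hide NotReady nodes
kubectl get nodes --no-headers | awk '$2 != "Ready"'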
```

### ContainerCreating Stuck

Usually an image pull problem or volume mount issue:

```bash
# Check events for the specific pod
kubectl describe pod myapp-stuck | grep -A20 "Events"

# Common causes:
# - ImagePullBackOff: wrong image name, private registry auth missing
# - Volume mount failure: PVC not bound, NFS server unreachable
```

```yaml title="fix-image-pull-secret.yml"
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: myapp
      image: registry.example.com/myapp:latest
  # If using a private registry, you need this
  imagePullSecrets:
    - name: registry-credentials
```

## Networking Issues

Kubernetes networking problems are the hardest to debug because the symptoms are indirect — timeouts, connection refused, intermittent failures.

### Service Not Reachable

```bash
# Step 1: Verify the service exists and has endpoints
kubectl get svc myapp-service
kubectl get endpoints myapp-service

# If endpoints are empty, the selector doesn't match any pods
kubectl get pods -l app=myapp --show-labels

# Step 2: Test from inside the cluster
kubectl run debug --rm -it --image=busybox -- sh
# Inside the pod:
wget -qO- http://myapp-service:8080/health
nslookup myapp-service
```

### DNS Resolution Failures

```bash
# Check if CoreDNS is healthy
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Test DNS resolution from a pod
kubectl run dns-test --rm -it --image=busybox -- nslookup kubernetes.default

# Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
```

<Callout type="tip" title="Always test from inside the cluster">
  Don't debug Kubernetes networking from your local machine. Network policies, service meshes, and DNS all behave differently inside vs outside the cluster. Always `kubectl exec` or `kubectl run` a debug pod.
</Callout>

### Network Policies Blocking Traffic

```bash
# List all network policies in the namespace
kubectl get networkpolicies -n production

# Describe a specific policy to see what it allows/denies
kubectl describe networkpolicy default-deny -n production

# Quick test: temporarily delete the policy and see if traffic flows
# (only in non-production environments)
```

## Resource Debugging Tools

### Ephemeral Debug Containers

Ephemeral containers, stable since Kubernetes 1.25, let you attach a debug container to a running pod without restarting it:

```bash
# Attach a debug container with networking tools
kubectl debug pod/myapp-7b4f6d8c5-x2k9m -it \
  --image=nicolaka/netshoot \
  --target=myapp

# Inside the debug container, you have full networking tools:
# tcpdump, dig, curl, netstat, ss, iperf, etc.
```

### Resource Inspection One-Liners

```bash
# Pods sorted by CPU usage
kubectl top pods --sort-by=cpu -A

# Pods sorted by memory
kubectl top pods --sort-by=memory -A

# Most recent 30 events, sorted by time
kubectl get events --sort-by='.lastTimestamp' -A | tail -30

# All pods not in Running state
kubectl get pods -A --field-selector status.phase!=Running

# Pods with restarts > 0 (one line per pod, restarts summed across containers)
kubectl get pods -A -o json | jq -r '
  .items[] |
  select(any(.status.containerStatuses[]?; .restartCount > 0)) |
  [.metadata.namespace, .metadata.name,
   ([.status.containerStatuses[].restartCount] | add | tostring)] |
  join("\t")'
```

## Debugging Checklist

When something goes wrong in Kubernetes, a reliable sequence to follow is:

1. **`kubectl get pods -A`** — what's the cluster-wide state?
2. **`kubectl describe pod <name>`** — what do the events say?
3. **`kubectl logs <pod> --previous`** — what happened before the crash?
4. **`kubectl top pods`** — is it a resource problem?
5. **`kubectl get events --sort-by=.lastTimestamp`** — what happened recently?
6. **Debug from inside the cluster** — `kubectl run` a debug pod

Most problems are answered by steps 1-3. Steps 4-6 are for the harder cases.

## Key Takeaways

1. **Always check events first** — `kubectl describe` tells you more than `kubectl get`
2. **Read the exit code** — it tells you the category of failure immediately
3. **OOMKilled means profile, not resize** — increasing limits masks the real problem
4. **Debug networking from inside** — external tools give misleading results
5. **Keep a debug pod image handy** — `nicolaka/netshoot` has every tool you need

<Sources items={[
  { title: "Kubernetes Debugging Guide", url: "https://kubernetes.io/docs/tasks/debug/", publisher: "Kubernetes" },
  { title: "Debug Running Pods with Ephemeral Containers", url: "https://kubernetes.io/docs/tasks/debug/debug-application/debug-running-pod/", publisher: "Kubernetes" },
  { title: "Kubernetes Troubleshooting Flowchart", url: "https://learnk8s.io/troubleshooting-deployments", publisher: "Learnk8s" },
  { title: "netshoot — Container Networking Troubleshooting", url: "https://github.com/nicolaka/netshoot", publisher: "GitHub" },
]} />


---

# Feature Flags and the Case for Progressive Delivery

URL: https://basnet.dev/posts/feature-flags-progressive-delivery
Published: 2026-03-18
Category: CI/CD
Tags: feature-flags, ci-cd, progressive-delivery, devops
Reading time: 5 min read

> How feature flags decouple deployment from release, reduce blast radius, and why every team shipping to production should use them.

The riskiest moment in any deployment isn't writing the code — it's the moment you flip the switch and expose it to real users. Feature flags change that equation entirely by separating the act of deploying code from the act of releasing a feature.

## The Problem with Big-Bang Releases

Traditional deployment is binary: code is either live or it isn't. You merge to main, the pipeline runs, and every user gets the new feature at once. If something breaks, your options are rolling back the entire deployment or pushing a hotfix under pressure.

There is a better way — and it starts with separating deployment from release.

## What Feature Flags Actually Are

A feature flag is a conditional wrapper around new functionality. At its simplest:

```typescript title="feature-check.ts"
function getPaymentProcessor(userId: string): PaymentProcessor {
  if (isFeatureEnabled("new-payment-flow", userId)) {
    return new StripeV2Processor();
  }
  return new StripeV1Processor();
}
```

The flag's state is evaluated at runtime, not at build time. This means you can:

- **Deploy code without releasing it** — the feature sits behind a flag, invisible to users
- **Roll out gradually** — enable for 1% of users, then 5%, then 25%, watching metrics at each step
- **Kill a feature instantly** — flip the flag off, no redeployment needed
- **Target specific users** — enable for internal teams, beta testers, or specific regions
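
Gradual rollout depends on each user getting a stable answer as the percentage grows. A common way to get that is to hash the flag name and user ID into a bucket from 0 to 99. The sketch below assumes that approach and uses a deliberately simple 32-bit string hash; `rolloutBucket` and `isInRollout` are hypothetical names, and hosted flag services typically use stronger hashing.

```typescript
// Map a user to a stable bucket in [0, 100) for a given flag.
function rolloutBucket(flagName: string, userId: string): number {
  const key = flagName + ":" + userId;
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    // Keep the running value within unsigned 32-bit range.
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0;
  }
  return hash % 100;
}

// A user is in the rollout when their bucket falls below the percentage.
function isInRollout(flagName: string, userId: string, percentage: number): boolean {
  return rolloutBucket(flagName, userId) < percentage;
}
```

Because the bucket never changes for a given flag and user, anyone included at 5% stays included at 25%, so users don't flip between old and new behavior as the rollout widens.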

## The Four Types of Flags

Not all feature flags serve the same purpose, and understanding the distinction matters for maintenance:

### Release Flags

Temporary flags for shipping incomplete work. These are the most common and should have the shortest lifespan.

```typescript title="release-flag.tsx"
// Ship the incomplete checkout redesign behind a flag
// Remove this flag once the feature is stable
if (flags.isEnabled("checkout-redesign")) {
  return <NewCheckoutFlow />;
}
return <LegacyCheckout />;
```

### Experiment Flags

For A/B testing. They're short-lived but need to track which variant each user sees:

```typescript title="experiment-flag.tsx"
const variant = flags.getVariant("pricing-page-test", userId);

switch (variant) {
  case "variant-a":
    return <SimplifiedPricingPage />;
  case "variant-b":
    return <ComparisonPricingPage />;
  case "control":
  default:
    // Unknown or missing variants fall back to the control experience
    return <CurrentPricingPage />;
}
```

### Ops Flags

Kill switches for features with uncertain performance characteristics. These are the ones that save you at 3 AM:

```typescript title="ops-flag.ts"
// If the new recommendation engine is overloading the database,
// flip this flag and we fall back to the static list
const recommendations = flags.isEnabled("ml-recommendations")
  ? await getMLRecommendations(userId)
  : await getStaticRecommendations();
```

### Permission Flags

Long-lived flags that gate premium or beta features:

```typescript title="permission-flag.tsx"
if (flags.isEnabled("advanced-analytics", { plan: user.plan })) {
  return <AdvancedAnalyticsDashboard />;
}
return <BasicAnalyticsDashboard />;
```

## The Hidden Cost: Flag Debt

Every feature flag is a branch in your code that doubles the testing surface. Two flags mean four possible states. Ten flags mean over a thousand combinations.

<Callout type="warning" title="Clean up your flags">
  Stale feature flags are technical debt with compound interest. Codebases with 50+ flags where nobody knows which ones are still active are more common than they should be. Set a rule: every release flag gets a removal ticket created at the same time as the flag itself.
</Callout>

A simple convention worth enforcing:

```typescript title="flag-registry.ts"
const FLAGS = {
  "checkout-redesign": {
    type: "release",
    owner: "payments-team",
    createdAt: "2026-03-01",
    removeBy: "2026-04-15",     // Flag must be removed by this date
    description: "New checkout flow with Stripe V2",
  },
} as const;
```
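
A registry like this only helps if something checks it. As a sketch, a CI step could list flags past their `removeBy` date and fail the build when the list is non-empty. The `FlagDefinition` type and `findExpiredFlags` function are assumptions for illustration, not part of an existing tool.

```typescript
interface FlagDefinition {
  type: "release" | "experiment" | "ops" | "permission";
  owner: string;
  removeBy?: string; // ISO date (YYYY-MM-DD); long-lived flags may omit it
}

// Return the names of flags whose removal deadline has passed.
function findExpiredFlags(
  registry: { [name: string]: FlagDefinition },
  today: Date
): string[] {
  return Object.keys(registry).filter((name) => {
    const removeBy = registry[name].removeBy;
    return removeBy !== undefined && new Date(removeBy).getTime() < today.getTime();
  });
}
```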

## Implementation in CI/CD

Feature flags integrate naturally into a CI/CD pipeline:

```yaml title=".gitlab-ci.yml"
deploy:
  stage: deploy
  script:
    - deploy-to-production
    - update-feature-flags --env production
  environment:
    name: production
    action: start

rollback:
  stage: deploy
  when: manual
  script:
    - disable-feature-flag checkout-redesign
    # No redeployment needed — the flag controls visibility
```

The pipeline deploys the code, but the feature flag controls whether users see it. Rollback becomes a flag toggle instead of a full redeployment.

## Progressive Delivery in Practice

Feature flags are one piece of progressive delivery. Combined with other techniques, they create a deployment model where risk is minimized at every step:

1. **Deploy to production** with the feature flagged off
2. **Enable for internal users** — your team dogfoods it first
3. **Canary release** — enable for 1-5% of traffic, monitor error rates
4. **Gradual rollout** — increase to 25%, 50%, 100% over days
5. **Remove the flag** — the feature is now the default

Each step includes an automatic rollback trigger: if error rates spike above a threshold, the flag is automatically disabled.
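
The trigger itself can be a small decision function fed by your monitoring. This is a minimal sketch, assuming error counts are available per rollout step; `evaluateRolloutStep` and the `maxErrorRate` parameter are hypothetical, and the caller is assumed to disable the flag when told to.

```typescript
interface RolloutMetrics {
  requests: number;
  errors: number;
}

type RolloutDecision = "continue" | "disable-flag";

// Decide whether a rollout step may continue, based on the observed error rate.
// `maxErrorRate` is an assumed threshold (for example 0.02 for 2%).
function evaluateRolloutStep(metrics: RolloutMetrics, maxErrorRate: number): RolloutDecision {
  // With no traffic yet there is nothing to judge, so keep the flag on.
  if (metrics.requests === 0) return "continue";
  const errorRate = metrics.errors / metrics.requests;
  return errorRate > maxErrorRate ? "disable-flag" : "continue";
}
```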

## Key Takeaways

1. **Decouple deployment from release** — deploy whenever you want, release when you're ready
2. **Start with ops flags** — even if you don't use release flags yet, having kill switches for risky features is invaluable
3. **Set expiration dates on flags** — every flag should have an owner and a removal deadline
4. **Monitor at each rollout stage** — the whole point of gradual rollout is catching problems before they reach everyone

<Sources items={[
  { title: "Feature Flags and Progressive Delivery", url: "https://about.gitlab.com/blog/feature-flags-continuous-delivery/", publisher: "GitLab" },
  { title: "Feature Toggles (Feature Flags)", url: "https://martinfowler.com/articles/feature-toggles.html", publisher: "Martin Fowler" },
  { title: "Progressive Delivery", url: "https://launchdarkly.com/blog/what-is-progressive-delivery/", publisher: "LaunchDarkly" },
  { title: "Testing in Production with Feature Flags", url: "https://docs.gitlab.com/ee/operations/feature_flags.html", publisher: "GitLab Docs" },
]} />


---

# GitOps with ArgoCD: What Teams Wish They Knew Before Starting

URL: https://basnet.dev/posts/gitops-argocd-production
Published: 2026-03-12
Category: GitOps
Tags: gitops, argocd, kubernetes, ci-cd, devops
Reading time: 5 min read

> Lessons from adopting GitOps in production — the wins, the gotchas, and the patterns that actually survive real-world complexity.

GitOps sounds simple: Git is the single source of truth for your infrastructure. Push a change, and the system converges to match. In practice, adopting GitOps with ArgoCD reveals that the concept is simple but the execution has sharp edges.

## What GitOps Actually Means

Traditional deployment: CI pipeline builds the image, then pushes it to the cluster. The pipeline has cluster credentials and executes `kubectl apply`.

GitOps deployment: CI pipeline builds the image, then updates a Git repo with the new image tag. ArgoCD watches that repo and applies the changes to the cluster.

```
Traditional:  Code Repo → CI Build → CI Deploys → Cluster
GitOps:       Code Repo → CI Build → Updates Config Repo → ArgoCD → Cluster
```

The critical difference: **nothing outside the cluster pushes changes to it.** ArgoCD pulls from Git. No CI pipeline needs cluster credentials.

## The Repository Structure

After trying several approaches, this structure works best:

```
infrastructure/
├── apps/
│   ├── api-service/
│   │   ├── base/
│   │   │   ├── deployment.yml
│   │   │   ├── service.yml
│   │   │   └── kustomization.yml
│   │   └── overlays/
│   │       ├── staging/
│   │       │   ├── kustomization.yml
│   │       │   └── replicas-patch.yml
│   │       └── production/
│   │           ├── kustomization.yml
│   │           └── replicas-patch.yml
│   └── worker-service/
│       └── ...
├── argocd/
│   ├── api-service.yml
│   └── worker-service.yml
└── README.md
```

<Callout type="tip" title="Separate your config repo from your code repo">
  Putting Kubernetes manifests in the same repo as application code means every code commit shows up as a "change" in ArgoCD, even if nothing infrastructure-related changed. A separate config repo keeps signal clean.
</Callout>

## ArgoCD Application Definition

```yaml title="argocd/api-service.yml"
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/infrastructure.git
    targetRevision: main
    path: apps/api-service/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 3
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 1m
```

Key settings:

- **`selfHeal: true`** — if someone `kubectl edit`s a resource manually, ArgoCD reverts it to match Git. This is the whole point of GitOps.
- **`prune: true`** — if you remove a manifest from Git, ArgoCD deletes the resource from the cluster. Without this, orphaned resources accumulate.
- **Retry with backoff** — transient failures (API server hiccups, webhook timeouts) don't leave the app in a failed state.
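
To see what `selfHeal` and `prune` change, it helps to model the reconciliation as a diff between desired state (Git) and live state (cluster). The sketch below is a conceptual model only, not ArgoCD's implementation: it covers the case where the cluster was changed by hand while Git stayed the same, and `planSync` and its types are hypothetical.

```typescript
type Manifests = { [resourceName: string]: string }; // name -> serialized spec

interface ReconcileOptions {
  selfHeal: boolean;
  prune: boolean;
}

type Action = { kind: "apply" | "delete"; resource: string };

// Compare desired state with live state and list the actions a pull-based
// controller would take.
function planSync(desired: Manifests, live: Manifests, options: ReconcileOptions): Action[] {
  const actions: Action[] = [];

  Object.keys(desired).forEach((name) => {
    const missing = !(name in live);
    const drifted = !missing && live[name] !== desired[name];
    // Missing resources are always created; manual drift is reverted only with selfHeal.
    if (missing || (drifted && options.selfHeal)) {
      actions.push({ kind: "apply", resource: name });
    }
  });

  Object.keys(live).forEach((name) => {
    // Resources no longer in Git are deleted only when prune is enabled.
    if (!(name in desired) && options.prune) {
      actions.push({ kind: "delete", resource: name });
    }
  });

  return actions;
}
```

With both options off, the manually edited resource and the orphan are left alone, which is how drift and leftover resources accumulate.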

## The Image Update Problem

The biggest gotcha in GitOps: how do you update the image tag after a CI build?

### Option 1: CI Updates the Config Repo

```yaml title=".github/workflows/build.yml"
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build and push image
        run: |
          docker build -t registry.example.com/api:${{ github.sha }} .
          docker push registry.example.com/api:${{ github.sha }}

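      - name: Update config repo
        env:
          # Assumed secret: a token with write access to the config repo
          CONFIG_REPO_TOKEN: ${{ secrets.CONFIG_REPO_TOKEN }}
        run: |
          git clone "https://x-access-token:${CONFIG_REPO_TOKEN}@github.com/myorg/infrastructure.git"
          cd infrastructure/apps/api-service/overlays/production
          # kustomize edit works on the kustomization file in the current directory
          kustomize edit set image \
            api=registry.example.com/api:${{ github.sha }}
          git config user.name "ci-bot"
          git config user.email "ci-bot@example.com"
          git add .
          git commit -m "deploy: api-service ${{ github.sha }}"
          git push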
```

This works but creates a tight coupling between CI and the config repo.

### Option 2: ArgoCD Image Updater (Preferred)

ArgoCD Image Updater watches your container registry and automatically updates image tags in Git:

```yaml title="api-service-with-image-updater.yml"
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-service
  annotations:
    argocd-image-updater.argoproj.io/image-list: api=registry.example.com/api
    argocd-image-updater.argoproj.io/api.update-strategy: semver
    argocd-image-updater.argoproj.io/write-back-method: git
```

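This decouples CI from GitOps entirely. CI pushes the image, Image Updater detects it, updates Git, ArgoCD syncs. Note that the `semver` strategy only considers tags that parse as semantic versions, so CI has to publish tags like `1.4.2` rather than commit SHAs.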
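
Conceptually, a `semver` update strategy picks the highest version among the registry's tags and skips anything that isn't a version. The sketch below illustrates that selection in simplified form (no pre-release handling or version constraints); it is not Image Updater's code, and the function names are hypothetical.

```typescript
// Parse "1.2.3" or "v1.2.3" into numeric parts; return null for other tags
// (such as commit SHAs), which this simplified selection skips.
function parseSemver(tag: string): number[] | null {
  const match = /^v?(\d+)\.(\d+)\.(\d+)$/.exec(tag);
  if (match === null) return null;
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Compare major, minor, and patch in order.
function compareParts(a: number[], b: number[]): number {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

// Pick the highest semantic version among the tags, ignoring non-semver tags.
function latestSemverTag(tags: string[]): string | null {
  let best: { tag: string; parts: number[] } | null = null;
  for (const tag of tags) {
    const parts = parseSemver(tag);
    if (parts === null) continue;
    if (best === null || compareParts(parts, best.parts) > 0) {
      best = { tag: tag, parts: parts };
    }
  }
  return best === null ? null : best.tag;
}
```

The comparison is numeric per component, so `1.10.0` ranks above `1.9.3` even though it sorts lower as a string.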

## Handling Secrets

Secrets are the hardest part of GitOps. You can't commit plaintext secrets to Git, but Git is supposed to be the single source of truth.

Sealed Secrets is a solid solution here — encrypt secrets locally, commit the encrypted version to Git, and the Sealed Secrets controller decrypts them in the cluster:

```bash title="seal-secret.sh"
# Create a regular secret
kubectl create secret generic db-credentials \
  --from-literal=password=supersecret \
  --dry-run=client -o yaml > secret.yml

# Seal it (encrypt for the cluster's public key)
kubeseal --format=yaml < secret.yml > sealed-secret.yml

# Commit the sealed version — safe to store in Git
rm secret.yml  # Never commit the plaintext version
git add sealed-secret.yml
git commit -m "chore: update db credentials"
```

<Callout type="warning" title="Rotate your sealing key">
  The Sealed Secrets controller generates an encryption key pair. If that key is compromised, all sealed secrets can be decrypted. Rotate the key periodically and re-encrypt all secrets when you do.
</Callout>

## What Can Go Wrong

### Sync Waves and Dependencies

If Service B depends on Service A, you need sync waves to ensure A is deployed first:

```yaml title="service-a.yml"
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "1"
```

```yaml title="service-b.yml"
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "2"
```

Without sync waves, every resource lands in the default wave 0, and ArgoCD does not wait for Service A to become healthy before applying Service B, so Service B may crash because Service A isn't ready yet.
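
The ordering rule can be pictured as grouping resources by their wave annotation and processing the groups in ascending order, with unannotated resources in wave 0. The sketch below models only that grouping; ArgoCD also orders by sync phase and resource kind, and `groupBySyncWave` and the `Resource` type are hypothetical.

```typescript
interface Resource {
  name: string;
  annotations?: { [key: string]: string };
}

const SYNC_WAVE_ANNOTATION = "argocd.argoproj.io/sync-wave";

// Group resources by wave (default 0) and return the groups in ascending order.
function groupBySyncWave(resources: Resource[]): Resource[][] {
  const waves: { [wave: number]: Resource[] } = {};
  resources.forEach((resource) => {
    const raw = resource.annotations ? resource.annotations[SYNC_WAVE_ANNOTATION] : undefined;
    const wave = raw !== undefined ? parseInt(raw, 10) : 0;
    (waves[wave] = waves[wave] || []).push(resource);
  });
  return Object.keys(waves)
    .map(Number)
    .sort((a, b) => a - b)
    .map((wave) => waves[wave]);
}
```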

### Drift Detection False Positives

Some Kubernetes controllers modify resources after creation (adding default fields, mutating webhooks). ArgoCD sees these as "drift" and tries to revert them, creating an endless sync loop.

Fix with ignore differences:

```yaml title="ignore-mutations.yml"
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas  # If using HPA, ignore replica count
    - group: ""
      kind: Service
      jsonPointers:
        - /spec/clusterIP  # Assigned by K8s, changes on recreation
```

## Key Takeaways

1. **Separate config repos from code repos** — keeps ArgoCD signal clean
2. **Enable selfHeal and prune** — without these, GitOps is just Git-triggered deploys
3. **Use ArgoCD Image Updater** — avoid coupling CI to your config repo
4. **Solve secrets before adopting GitOps** — Sealed Secrets or External Secrets Operator
5. **Plan for drift detection noise** — ignoreDifferences is not optional in production

<Sources items={[
  { title: "ArgoCD Documentation", url: "https://argo-cd.readthedocs.io/en/stable/", publisher: "Argo Project" },
  { title: "GitOps Principles", url: "https://opengitops.dev/", publisher: "OpenGitOps" },
  { title: "ArgoCD Image Updater", url: "https://argocd-image-updater.readthedocs.io/", publisher: "Argo Project" },
  { title: "Sealed Secrets", url: "https://sealed-secrets.netlify.app/", publisher: "Bitnami" },
]} />


---

# Setting Up a Private Docker Registry You Can Actually Trust

URL: https://basnet.dev/posts/private-docker-registry-setup
Published: 2026-03-05
Category: Docker
Tags: docker, registry, nginx, linux, security
Reading time: 5 min read

> Running your own registry with Nginx, TLS, and authentication — why relying solely on Docker Hub for production images falls short.

Docker Hub is fine for open-source images. But the moment you're building proprietary services — especially for blockchain infrastructure where image integrity is critical — you need a private registry you control.

The setup covered here is a production one: Docker Registry behind Nginx, with TLS terminated at Nginx and HTTP basic auth enforced by the registry.

## Why Self-Host a Registry

Three reasons to stop relying exclusively on Docker Hub:

- **Rate limits** — Docker Hub's pull rate limits have caused CI/CD pipeline failures during peak build times
- **Image integrity** — teams need to know exactly where images are stored and who has access
- **Network latency** — pulling from a local or same-region registry is significantly faster than pulling from Docker Hub on every deploy

## The Compose Stack

```yaml title="docker-compose.yml"
services:
  registry:
    image: registry:2
    restart: unless-stopped
    volumes:
      - registry-data:/var/lib/registry
      - ./auth:/auth:ro
    environment:
      REGISTRY_AUTH: htpasswd
      REGISTRY_AUTH_HTPASSWD_REALM: "Private Registry"
      REGISTRY_AUTH_HTPASSWD_PATH: /auth/htpasswd
      REGISTRY_STORAGE_DELETE_ENABLED: "true"
    networks:
      - internal

  nginx:
    image: nginx:alpine
    restart: unless-stopped
    ports:
      - "443:443"
    volumes:
      - ./nginx/registry.conf:/etc/nginx/conf.d/default.conf:ro
      - /etc/letsencrypt:/etc/letsencrypt:ro
    depends_on:
      - registry
    networks:
      - internal

volumes:
  registry-data:

networks:
  internal:
    driver: bridge
```

Key decisions:

- **Registry data on a named volume** — survives container recreation, easy to back up
- **Auth directory mounted read-only** — the registry can read credentials but can't modify them
- **Delete enabled** — without this, you can never clean up old images
- **Registry not exposed to host** — only Nginx is, on port 443

## Authentication

Generate credentials with `htpasswd`:

```bash title="setup-auth.sh"
#!/bin/bash
mkdir -p auth

# Create the first user
htpasswd -Bc auth/htpasswd deployer
# -B uses bcrypt hashing (the only format the registry accepts)
# -c creates the file (only use -c for the first user)

# Add additional users without -c
htpasswd -B auth/htpasswd ci-bot
```

<Callout type="warning" title="Use bcrypt, not MD5">
  The default `htpasswd` hashing (MD5) is weak, and it won't even work here: the registry's `htpasswd` authentication only accepts bcrypt hashes and ignores entries in any other format. Always pass the `-B` flag.
</Callout>

## Nginx Configuration

```nginx title="nginx/registry.conf"
upstream registry {
    server registry:5000;
}

server {
    listen 443 ssl;
    server_name registry.example.com;

    ssl_certificate     /etc/letsencrypt/live/registry.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/registry.example.com/privkey.pem;

    # Required for large image layer uploads
    client_max_body_size 0;
    chunked_transfer_encoding on;

    location / {
        # Required headers for Docker Registry V2 API
        proxy_pass http://registry;
        proxy_set_header Host $http_host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        proxy_read_timeout 900;
        proxy_send_timeout 900;
    }
}
```

Two settings that trip people up:

- **`client_max_body_size 0`** — disables the upload size limit. Docker image layers can be hundreds of megabytes. Without this, you'll get `413 Request Entity Too Large` errors on push.
- **`proxy_read_timeout 900`** — large image pushes take time. The default 60-second timeout will cause failures on slow connections.

## TLS with Let's Encrypt

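Docker requires HTTPS for any registry that isn't `localhost`, unless the host is explicitly listed under `insecure-registries` in the daemon configuration. That escape hatch has no place in production.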

```bash title="setup-tls.sh"
#!/bin/bash
# Install certbot
apt install -y certbot

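# Get the certificate. Standalone mode answers the HTTP-01 challenge on port 80,
# so that port must be free and reachable; Nginx loads the new files when it starts again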
docker compose stop nginx
certbot certonly --standalone -d registry.example.com
docker compose start nginx
```

Set up automatic renewal:

```bash title="crontab"
0 3 * * 1 certbot renew --quiet --pre-hook "docker compose -f /opt/registry/docker-compose.yml stop nginx" --post-hook "docker compose -f /opt/registry/docker-compose.yml start nginx"
```

## Testing the Registry

```bash title="test-registry.sh"
# Login
docker login registry.example.com
# Enter username and password when prompted

# Tag a local image for the private registry
docker tag myapp:latest registry.example.com/myapp:latest

# Push
docker push registry.example.com/myapp:latest

# Pull from another machine
docker pull registry.example.com/myapp:latest
```

If login fails with a `502 Bad Gateway`, the issue is almost always Nginx not being able to reach the registry container. Check that both services are on the same Docker network.

## Garbage Collection

Docker Registry doesn't automatically clean up deleted image layers. Without periodic garbage collection, disk usage grows indefinitely:

```bash title="gc.sh"
#!/bin/bash
# Run garbage collection on the registry
docker compose exec registry bin/registry \
  garbage-collect /etc/docker/registry/config.yml \
  --delete-untagged

echo "Registry garbage collection complete"
```

<Callout type="tip" title="Schedule GC during low-traffic windows">
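  Garbage collection is not safe to run while images are being pushed: a layer uploaded mid-run can be deleted, corrupting the image. Put the registry in read-only mode first, or run it during off-hours when no CI/CD pushes happen.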
</Callout>

## CI/CD Integration

In your CI pipeline, authenticate and push automatically:

```yaml title=".github/workflows/build.yml"
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Login to private registry
        env:
          REGISTRY_PASSWORD: ${{ secrets.REGISTRY_PASSWORD }}
        run: echo "$REGISTRY_PASSWORD" | docker login registry.example.com -u ci-bot --password-stdin

      - name: Build and push
        run: |
          docker build -t registry.example.com/myapp:${{ github.sha }} .
          docker push registry.example.com/myapp:${{ github.sha }}
```

## Key Takeaways

1. **Never expose the registry directly** — always put Nginx (or another reverse proxy) in front for TLS and auth
2. **Set `client_max_body_size 0`** — the single most common cause of registry push failures
3. **Use bcrypt for passwords** — the registry ignores `htpasswd` entries in any other format
4. **Schedule garbage collection** — disk usage will grow unbounded without it
5. **Back up your volume** — losing your registry data means rebuilding every image from source

<Sources items={[
  { title: "How To Set Up a Private Docker Registry on Ubuntu 22.04", url: "https://www.digitalocean.com/community/tutorials/how-to-set-up-a-private-docker-registry-on-ubuntu-22-04", publisher: "DigitalOcean" },
  { title: "Docker Registry Documentation", url: "https://docs.docker.com/registry/", publisher: "Docker" },
  { title: "Nginx Reverse Proxy for Docker Registry", url: "https://docs.docker.com/registry/recipes/nginx/", publisher: "Docker" },
  { title: "Docker Registry Garbage Collection", url: "https://docs.docker.com/registry/garbage-collection/", publisher: "Docker" },
]} />


---

# GitHub Actions: Reusable Workflows That Actually Scale

URL: https://basnet.dev/posts/github-actions-reusable-workflows
Published: 2026-02-28
Category: CI/CD
Tags: github-actions, ci-cd, automation, devops
Reading time: 5 min read

> How duplicated CI/CD configs across 30+ repos were eliminated with reusable workflows, composite actions, and a central workflow registry.

When you manage 30+ repositories, copy-pasting CI/CD workflows between repos becomes unsustainable. One security patch to the build process means updating 30 workflow files. Reusable workflows solve this — define once, reference everywhere.

## The Problem

Every repo had its own `.github/workflows/ci.yml`. They were mostly identical but had drifted over time. Some had Docker layer caching, some didn't. Some ran security scans, some forgot to. Updating the Node.js version meant 30 PRs.

## Reusable Workflows

A reusable workflow lives in a central repository and is called by other repos:

```yaml title=".github/workflows/node-ci.yml"
# Central repo: myorg/workflows
name: Node.js CI

on:
  workflow_call:
    inputs:
      node-version:
        description: "Node.js version"
        required: false
        default: "20"
        type: string
      run-e2e:
        description: "Run E2E tests"
        required: false
        default: false
        type: boolean
    secrets:
      NPM_TOKEN:
        required: false
      SONAR_TOKEN:
        required: false
      # Used by the security-scan job below
      SNYK_TOKEN:
        required: false

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
          cache: "npm"

      - run: npm ci

      - name: Lint
        run: npm run lint

      - name: Type check
        run: npm run typecheck

      - name: Unit tests
        run: npm test -- --coverage

      - name: Upload coverage
        uses: actions/upload-artifact@v4
        with:
          name: coverage
          path: coverage/

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Snyk
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
        with:
          args: --severity-threshold=high

      - name: Audit dependencies
        run: npm audit --audit-level=high

  e2e:
    if: inputs.run-e2e
    runs-on: ubuntu-latest
    needs: lint-and-test
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
          cache: "npm"
      - run: npm ci
      - name: E2E tests
        run: npm run test:e2e
```

Consuming repos call it from a single job:

```yaml title=".github/workflows/ci.yml"
# Any repo in the org
name: CI
on:
  push:
    branches: [main]
  pull_request:

jobs:
  ci:
    uses: myorg/workflows/.github/workflows/node-ci.yml@main
    with:
      node-version: "20"
      run-e2e: true
    secrets: inherit
```

One file per consuming repo. All the logic lives centrally.

<Callout type="tip" title="Pin to a SHA, not a branch">
  Using `@main` means every repo picks up workflow changes the moment they merge, which is convenient but risky. For production, pin to a tag such as `@v2.1.0` or, for the strongest guarantee, to a full commit SHA, since tags can be moved. Either way you get controlled rollouts of workflow changes.
</Callout>
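To audit how consuming repos pin the central workflow, the ref after `@` can be classified mechanically. The sketch below is illustrative only: `classifyRef` and its category names are hypothetical, and it assumes that a 40-character hex string is a commit SHA and that release tags follow the `v2` / `v2.1.0` convention used in this article. Anything else is treated as a branch.

```typescript
// Hypothetical helper: classify the ref part of a `uses:` value.
type RefKind = "sha" | "exact-tag" | "major-tag" | "branch-or-other";

function classifyRef(uses: string): RefKind {
  const at = uses.lastIndexOf("@");
  const ref = at >= 0 ? uses.slice(at + 1) : "";
  if (/^[0-9a-f]{40}$/.test(ref)) return "sha"; // full commit SHA
  if (/^v\d+\.\d+\.\d+$/.test(ref)) return "exact-tag"; // e.g. v2.1.0
  if (/^v\d+$/.test(ref)) return "major-tag"; // e.g. v2 (a moving tag)
  return "branch-or-other"; // e.g. main
}

const examples = [
  "myorg/workflows/.github/workflows/node-ci.yml@main",
  "myorg/workflows/.github/workflows/node-ci.yml@v2",
  "myorg/workflows/.github/workflows/node-ci.yml@v2.1.0",
];
for (const u of examples) {
  console.log(`${u} -> ${classifyRef(u)}`);
}
```

Running a check like this over every consuming repo's workflow files would show which ones still track a branch.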

## Composite Actions for Shared Steps

When you need to share individual steps rather than entire workflows, composite actions are more flexible:

```yaml title="actions/docker-build/action.yml"
# Central repo: myorg/workflows/actions/docker-build
name: Docker Build and Push
description: Build and push a Docker image with layer caching

inputs:
  registry:
    description: "Container registry URL"
    required: true
  image-name:
    description: "Image name"
    required: true
  tag:
    description: "Image tag"
    required: false
    default: ${{ github.sha }}
  dockerfile:
    description: "Path to Dockerfile"
    required: false
    default: "Dockerfile"

runs:
  using: composite
  steps:
    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v3

    - name: Login to registry
      uses: docker/login-action@v3
      with:
        registry: ${{ inputs.registry }}
        username: ${{ env.REGISTRY_USERNAME }}
        password: ${{ env.REGISTRY_PASSWORD }}

    - name: Build and push
      uses: docker/build-push-action@v5
      with:
        context: .
        file: ${{ inputs.dockerfile }}
        push: true
        tags: |
          ${{ inputs.registry }}/${{ inputs.image-name }}:${{ inputs.tag }}
          ${{ inputs.registry }}/${{ inputs.image-name }}:latest
        cache-from: type=gha
        cache-to: type=gha,mode=max
```

Usage in any workflow:

```yaml title="usage.yml"
steps:
  - uses: actions/checkout@v4
  - uses: myorg/workflows/actions/docker-build@v2
    with:
      registry: ghcr.io
      image-name: myorg/api-service
    env:
      REGISTRY_USERNAME: ${{ github.actor }}
      REGISTRY_PASSWORD: ${{ secrets.GITHUB_TOKEN }}
```

## Versioning Strategy

The central workflow repo is tagged with semantic versions:

```bash title="release-workflow.sh"
#!/bin/bash
# Tag a new version of the workflows. Usage: ./release-workflow.sh 2.1.0
set -euo pipefail

# Fail early with a usage message if no version was given
VERSION="${1:?usage: release-workflow.sh <major.minor.patch>}"

git tag -a "v${VERSION}" -m "Release v${VERSION}"
git push origin "v${VERSION}"

# Move the major version tag (v2 -> latest v2.x.x)
MAJOR="${VERSION%%.*}"
git tag -fa "v${MAJOR}" -m "Update v${MAJOR} to v${VERSION}"
git push origin "v${MAJOR}" --force
```

Repos can pin to:
- `@v2` — get all minor/patch updates automatically
- `@v2.1.0` — exact version, no surprises
- `@main` — bleeding edge (only for testing)
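To make the `@v2` option concrete: after each release the moving major tag should point at the highest `v2.x.y` release. The following is a minimal sketch of that resolution, assuming tags follow the `vMAJOR.MINOR.PATCH` scheme above. `parseTag` and `latestForMajor` are hypothetical helper names, not part of any GitHub tooling.

```typescript
// Hypothetical helper: given the release tags in the workflows repo,
// find the version a moving major tag (e.g. v2) should point at.
interface SemVer {
  major: number;
  minor: number;
  patch: number;
}

function parseTag(tag: string): SemVer | null {
  const m = /^v(\d+)\.(\d+)\.(\d+)$/.exec(tag);
  if (!m) return null; // skips moving tags like "v2" and anything malformed
  return { major: Number(m[1]), minor: Number(m[2]), patch: Number(m[3]) };
}

function latestForMajor(tags: string[], major: number): string | null {
  let best: SemVer | null = null;
  for (const tag of tags) {
    const v = parseTag(tag);
    if (!v || v.major !== major) continue;
    // Compare numerically so that 2.0.10 ranks above 2.0.3
    if (
      best === null ||
      v.minor > best.minor ||
      (v.minor === best.minor && v.patch > best.patch)
    ) {
      best = v;
    }
  }
  return best ? `v${best.major}.${best.minor}.${best.patch}` : null;
}

console.log(latestForMajor(["v1.9.0", "v2.0.3", "v2.1.0", "v2", "v2.0.10"], 2)); // v2.1.0
```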

## Workflow Dispatch for Manual Operations

Some workflows need to be triggered manually with parameters:

```yaml title=".github/workflows/deploy.yml"
name: Deploy
on:
  workflow_dispatch:
    inputs:
      environment:
        description: "Deploy target"
        required: true
        type: choice
        options:
          - staging
          - production
      version:
        description: "Image tag to deploy"
        required: true
        type: string
      dry-run:
        description: "Dry run (plan only)"
        required: false
        default: true
        type: boolean

jobs:
  deploy:
    uses: myorg/workflows/.github/workflows/k8s-deploy.yml@v2
    with:
      environment: ${{ inputs.environment }}
      version: ${{ inputs.version }}
      dry-run: ${{ inputs.dry-run }}
    secrets: inherit
```

## Monitoring Workflow Health

Track workflow reliability across the org:

```bash title="workflow-health.sh"
#!/bin/bash
# Check workflow success rates across repos
# gh repo list returns 30 repos by default; raise the limit to cover the whole org
for repo in $(gh repo list myorg -L 200 --json name -q '.[].name'); do
  TOTAL=$(gh run list -R "myorg/$repo" -L 20 --json conclusion -q 'length')
  FAILED=$(gh run list -R "myorg/$repo" -L 20 --json conclusion -q '[.[] | select(.conclusion=="failure")] | length')

  if [ "$TOTAL" -gt 0 ]; then
    RATE=$(echo "scale=0; ($TOTAL - $FAILED) * 100 / $TOTAL" | bc)
    echo "$repo: ${RATE}% success ($FAILED/$TOTAL failed)"
  fi
done
```
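Once the per-repo numbers exist, sorting them worst-first makes the report easier to act on. The sketch below mirrors the script's arithmetic in TypeScript. `summarize` and `worstFirst` are hypothetical names, and the input is assumed to be the list of `conclusion` strings that `gh run list --json conclusion` returns for one repo.

```typescript
// Hypothetical summary helpers for the health report.
interface RepoHealth {
  repo: string;
  total: number;
  failed: number;
  successRate: number; // integer percentage
}

function summarize(repo: string, conclusions: string[]): RepoHealth | null {
  const total = conclusions.length;
  if (total === 0) return null; // no runs, nothing to report
  const failed = conclusions.filter((c) => c === "failure").length;
  // Integer percentage, matching the `scale=0` arithmetic in the script
  const successRate = Math.floor(((total - failed) * 100) / total);
  return { repo, total, failed, successRate };
}

// Lowest success rate first, without mutating the input array.
function worstFirst(results: RepoHealth[]): RepoHealth[] {
  return results.slice().sort((a, b) => a.successRate - b.successRate);
}
```

As in the script, only runs whose conclusion is exactly `failure` count as failed, so cancelled or unfinished runs are treated as non-failures.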

## Key Takeaways

1. **Centralize workflows in a dedicated repo** — one source of truth for all CI/CD logic
2. **Use reusable workflows for full pipelines** — lint, test, build, deploy in one call
3. **Use composite actions for shared steps** — Docker builds, deployment scripts, notification steps
4. **Version your workflows** — pin consuming repos to tags, not branches
5. **`secrets: inherit` simplifies secret management** — no need to pass each secret individually

<Sources items={[
  { title: "GitHub Actions — Reusable Workflows", url: "https://docs.github.com/en/actions/sharing-automations/reusing-workflows", publisher: "GitHub" },
  { title: "Creating Composite Actions", url: "https://docs.github.com/en/actions/sharing-automations/creating-actions/creating-a-composite-action", publisher: "GitHub" },
  { title: "GitHub Actions Security Best Practices", url: "https://docs.github.com/en/actions/security-for-github-actions/security-guides/security-hardening-for-github-actions", publisher: "GitHub" },
  { title: "Docker Build Push Action", url: "https://github.com/docker/build-push-action", publisher: "Docker" },
]} />


---

# The Cloud Shared Responsibility Model Is Not Optional

URL: https://basnet.dev/posts/cloud-shared-responsibility-model
Published: 2026-02-20
Category: Security
Tags: cloud, security, aws, devops, compliance
Reading time: 6 min read

> What you own, what your cloud provider owns, and the gray areas in between — with real breach examples that prove why this matters.

Every major cloud breach in the last five years has one thing in common: the organization that got breached thought their cloud provider was handling something that was actually their responsibility. The shared responsibility model isn't a suggestion — it's a contract, and misunderstanding it has cost companies hundreds of millions of dollars.

## The Core Principle

Cloud providers secure the **infrastructure** — the physical data centers, hypervisors, managed service internals, and the global network. You secure **everything you put on that infrastructure** — your data, your configurations, your access controls, and your applications.

AWS puts it simply: they handle "security **of** the cloud." You handle "security **in** the cloud."

## What Changes by Service Model

Your responsibility shifts depending on how much abstraction you're using:

### IaaS — You Own Almost Everything

With services like EC2, Azure VMs, or GCE instances, you're responsible for:

```bash title="your-responsibilities-iaas.sh"
# OS patching — this is on you
sudo apt update && sudo apt upgrade -y

# Firewall rules: default-deny, open only what's needed
# (allow your SSH or management port first, or enabling ufw will lock you out)
sudo ufw default deny incoming
sudo ufw allow 443/tcp
sudo ufw enable

# Disk encryption — not always on by default
# You must verify and enable it
```

The provider gives you a virtual machine. Everything from the OS up is yours to secure.

### PaaS — Shared But Not Gone

With managed services like RDS, Cloud Functions, or App Engine, the provider handles OS patching and runtime updates. But you still own:

- **Access controls** — who can connect to your RDS instance
- **Encryption configuration** — enabling encryption at rest isn't always the default
- **Network exposure** — a publicly accessible database is still your mistake
- **Backup strategy** — managed doesn't mean backed up the way you need

```hcl title="rds-security.tf"
resource "aws_db_instance" "main" {
  engine               = "postgres"
  instance_class       = "db.t3.medium"

  # YOUR responsibility: encryption at rest
  storage_encrypted    = true
  kms_key_id           = aws_kms_key.db.arn

  # YOUR responsibility: not making it public
  publicly_accessible  = false

  # YOUR responsibility: backup retention
  backup_retention_period = 14
}
```

### SaaS — Less Surface, Same Core Duties

Even with SaaS products, you're responsible for:

- **User access management** — who has admin access to your SaaS tools
- **MFA enforcement** — your provider offers it, you must enable it
- **Data classification** — knowing what sensitive data lives in the service
- **Integration security** — API keys and OAuth tokens connecting your systems

## Real Breaches That Prove the Point

### Capital One (2019)

A misconfigured WAF on AWS allowed an attacker to exploit an SSRF vulnerability and access S3 buckets holding data on about 106 million people. The root cause was overly permissive IAM roles: the compromised service had access to every S3 bucket in the account.

AWS infrastructure was not compromised. Capital One's IAM configuration was.

**Cost: $80 million fine + $190 million settlement.**

### Twitch (2021)

125GB of internal data — including source code, internal tools, and creator payout information — was leaked due to a misconfigured server. The data was stored on infrastructure Twitch controlled and was responsible for securing.

**Lesson: your provider secures the storage service. You secure what you put in it.**

<Callout type="danger" title="The pattern is always the same">
  In every major cloud breach, the cloud provider's infrastructure was secure. The customer's configuration was not. Misconfigured IAM, open security groups, unencrypted data, and overly broad permissions are the four horsemen of cloud security failures.
</Callout>

## The Four Things You Must Get Right

### 1. IAM — Least Privilege, Always

```json title="iam-policy.json"
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-app-uploads/*"
    }
  ]
}
```

Rules to enforce on every project:

- **No wildcard permissions** — `"Action": "*"` is never acceptable in production
- **MFA on every human account** — no exceptions
- **IAM roles over static credentials** — EC2 instance profiles, not access keys
- **Regular access reviews** — permissions accumulate; prune them quarterly
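The "no wildcard permissions" rule is easy to check mechanically. Below is a minimal sketch of such a check over a policy document shaped like the JSON above. `findWildcards` is a hypothetical helper, and the types model only the fields it reads; real IAM policies also allow `NotAction`, conditions, and a single statement object, which this ignores.

```typescript
// Minimal policy shape: just the fields this check reads.
interface Statement {
  Effect: "Allow" | "Deny";
  Action: string | string[];
  Resource: string | string[];
}
interface PolicyDocument {
  Version: string;
  Statement: Statement[];
}

function toArray(value: string | string[]): string[] {
  return typeof value === "string" ? [value] : value;
}

// Returns a finding for every Allow statement that grants all actions ("*"),
// all actions of one service (e.g. "s3:*"), or all resources ("*").
function findWildcards(policy: PolicyDocument): string[] {
  const findings: string[] = [];
  policy.Statement.forEach((stmt, i) => {
    if (stmt.Effect !== "Allow") return;
    for (const action of toArray(stmt.Action)) {
      if (action === "*" || /:\*$/.test(action)) {
        findings.push(`Statement ${i}: wildcard action "${action}"`);
      }
    }
    for (const resource of toArray(stmt.Resource)) {
      if (resource === "*") {
        findings.push(`Statement ${i}: wildcard resource "*"`);
      }
    }
  });
  return findings;
}
```

The scoped policy shown earlier produces no findings; a statement with `"Action": "*"` and `"Resource": "*"` produces two.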

### 2. Encryption — At Rest and In Transit

```bash
# Verify S3 bucket encryption
aws s3api get-bucket-encryption --bucket my-bucket

# Check if RDS encryption is enabled
aws rds describe-db-instances \
  --query "DBInstances[*].[DBInstanceIdentifier,StorageEncrypted]" \
  --output table
```

### 3. Network Configuration

```bash
# Find security groups with any rule open to 0.0.0.0/0, then review which ports they expose
aws ec2 describe-security-groups \
  --filters "Name=ip-permission.cidr,Values=0.0.0.0/0" \
  --query "SecurityGroups[*].[GroupId,GroupName]" \
  --output table
```

A security group open to `0.0.0.0/0` on port 22 or 3389 is a breach waiting to happen. These should be audited weekly.
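For a weekly audit, the raw JSON from `aws ec2 describe-security-groups` can be narrowed to just the rules that expose SSH or RDP. The sketch below assumes that output has already been parsed. `findExposed` and `coversPort` are hypothetical helpers, only IPv4 ranges are checked, and the interfaces model only the fields used.

```typescript
// Field names follow the JSON printed by `aws ec2 describe-security-groups`.
interface IpPermission {
  IpProtocol: string; // "tcp", "udp", or "-1" for all traffic
  FromPort?: number;
  ToPort?: number;
  IpRanges: { CidrIp: string }[];
}
interface SecurityGroup {
  GroupId: string;
  GroupName: string;
  IpPermissions: IpPermission[];
}

const SENSITIVE_PORTS = [22, 3389]; // SSH and RDP, as named in the text

function coversPort(rule: IpPermission, port: number): boolean {
  if (rule.IpProtocol === "-1") return true; // all protocols, all ports
  if (rule.FromPort === undefined || rule.ToPort === undefined) return false;
  return rule.FromPort <= port && port <= rule.ToPort;
}

function findExposed(groups: SecurityGroup[]): string[] {
  const findings: string[] = [];
  for (const group of groups) {
    for (const rule of group.IpPermissions) {
      const openToWorld = rule.IpRanges.some((r) => r.CidrIp === "0.0.0.0/0");
      if (!openToWorld) continue;
      for (const port of SENSITIVE_PORTS) {
        if (coversPort(rule, port)) {
          findings.push(`${group.GroupId} (${group.GroupName}): port ${port} open to 0.0.0.0/0`);
        }
      }
    }
  }
  return findings;
}
```

Port ranges matter here: a rule for ports 0 to 65535 exposes port 22 just as much as a rule for port 22 alone.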

### 4. Patching — The IaaS Tax

If you run IaaS, OS patching is your problem:

```bash title="automated-patching.sh"
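#!/bin/bash
# Automated patching for Ubuntu servers (run as root)
set -euo pipefail
export DEBIAN_FRONTEND=noninteractive

apt-get update
# Log pending upgrades BEFORE applying them; afterwards the list is empty
apt list --upgradeable 2>/dev/null | tee "/var/log/patch-$(date +%F).log"
apt-get upgrade -y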
```

For managed services, the provider handles this — which is one of the strongest arguments for using PaaS/SaaS when you can.

## Operationalizing the Model

Understanding the model isn't enough. You need to enforce it continuously:

- **Infrastructure as Code** — define security controls in Terraform/Pulumi so they're versioned and reviewable
- **Policy as Code** — use AWS Config Rules, Azure Policy, or OPA to automatically detect misconfigurations
- **CSPM tools** (AWS Security Hub, Microsoft Defender for Cloud, or third-party tools like Wiz) scan continuously for compliance drift
- **Audit logging** — CloudTrail, Azure Monitor, and GCP Audit Logs should be enabled on every account, with alerts on high-signal events

## Key Takeaways

1. **Your provider secures their infrastructure — you secure your configurations** — this is non-negotiable
2. **The less abstraction you use, the more you own** — IaaS means you own almost everything
3. **IAM misconfigurations cause the majority of breaches** — invest time here first
4. **Automate compliance checking** — manual audits don't scale and drift happens between reviews
5. **When in doubt, assume it's your responsibility** — this mindset prevents the gaps that cause breaches

<Sources items={[
  { title: "Cloud Shared Responsibility Model Explained", url: "https://kodekloud.com/blog/cloud-shared-responsibility-model-explained/", publisher: "KodeKloud" },
  { title: "AWS Shared Responsibility Model", url: "https://aws.amazon.com/compliance/shared-responsibility-model/", publisher: "AWS" },
  { title: "Capital One Data Breach Analysis", url: "https://krebsonsecurity.com/2019/07/capital-one-data-theft-impacts-106m-people/", publisher: "Krebs on Security" },
  { title: "NIST Cloud Computing Security", url: "https://csrc.nist.gov/publications/detail/sp/800-144/final", publisher: "NIST" },
]} />


---

# Platform Engineering: Building an Internal Developer Portal That Gets Used

URL: https://basnet.dev/posts/platform-engineering-internal-developer-portal
Published: 2026-02-10
Category: Platform Engineering
Tags: platform-engineering, developer-experience, devops, backstage
Reading time: 5 min read

> Most internal platforms fail because they solve infrastructure problems, not developer problems. Here's how to build one that developers actually adopt.

Platform engineering is the hottest trend in DevOps, and half the implementations out there are shelfware. The team builds an elaborate internal platform, presents it at an all-hands, and six months later developers are still SSHing into servers and running manual scripts.

The difference between platforms that get adopted and platforms that collect dust is simple: successful platforms start with developer pain, not infrastructure elegance.

## What Developers Actually Want

Surveying the engineering team before building anything typically reveals the same pattern. The top pain points aren't about Kubernetes or CI/CD — they're about cognitive load:

1. **"Nobody knows which service owns this API endpoint"** — service discovery
2. **"Setting up a new service takes two weeks of tickets"** — scaffolding
3. **"The runbook for this alert is nowhere to be found"** — documentation discovery
4. **"There's no easy way to tell if a service is healthy in production"** — observability access

These are the problems worth solving first.

## The Backstage Foundation

Backstage works well as the foundation — not because it's the best tool, but because it's the most extensible. The software catalog is the core:

```yaml title="catalog-info.yml"
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-service
  description: Handles payment processing via Stripe
  annotations:
    github.com/project-slug: myorg/payment-service
    backstage.io/techdocs-ref: dir:.
    pagerduty.com/service-id: P1234AB
    grafana/dashboard-selector: "payment-service"
  tags:
    - python
    - payments
    - critical
  links:
    - url: https://grafana.internal/d/payments
      title: Dashboard
      icon: dashboard
spec:
  type: service
  lifecycle: production
  owner: team-payments
  system: checkout
  providesApis:
    - payment-api
  consumesApis:
    - user-api
    - notification-api
  dependsOn:
    - resource:postgres-payments
    - resource:redis-cache
```

Every service gets a `catalog-info.yml` in its repo. Backstage automatically discovers and indexes them. The result: a single place where any developer can find who owns what, what depends on what, and where to look when something breaks.
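To illustrate how the catalog answers "which service owns this API endpoint", here is a small sketch that builds an API-to-owner index from component entities. It models only the fields shown in the descriptor above. `indexApiOwners` is a hypothetical helper, not a Backstage API, and in a real portal the entities would come from the catalog backend.

```typescript
// Only the catalog fields used by this lookup are modelled.
interface ComponentEntity {
  metadata: { name: string };
  spec: { owner: string; providesApis?: string[]; consumesApis?: string[] };
}

interface ApiOwner {
  component: string;
  owner: string;
}

// Build an index from API name to the component (and team) that provides it.
function indexApiOwners(entities: ComponentEntity[]): Record<string, ApiOwner> {
  // Object.create(null) avoids inherited keys such as "constructor"
  const index: Record<string, ApiOwner> = Object.create(null);
  for (const entity of entities) {
    for (const api of entity.spec.providesApis || []) {
      index[api] = { component: entity.metadata.name, owner: entity.spec.owner };
    }
  }
  return index;
}

const catalog: ComponentEntity[] = [
  {
    metadata: { name: "payment-service" },
    spec: {
      owner: "team-payments",
      providesApis: ["payment-api"],
      consumesApis: ["user-api", "notification-api"],
    },
  },
];
const owners = indexApiOwners(catalog);
console.log(owners["payment-api"].owner); // team-payments
```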

## Golden Paths: Self-Service Scaffolding

The highest-impact feature is software templates — golden paths that let developers create new services without filing tickets:

```yaml title="templates/new-service.yml"
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: new-microservice
  title: Create a New Microservice
  description: Scaffolds a production-ready service with CI/CD, monitoring, and K8s manifests
spec:
  owner: team-platform
  type: service
  parameters:
    - title: Service Details
      required:
        - name
        - owner
        - language
      properties:
        name:
          title: Service Name
          type: string
          pattern: "^[a-z][a-z0-9-]*$"
        owner:
          title: Owning Team
          type: string
          ui:field: OwnerPicker
        language:
          title: Language
          type: string
          enum:
            - typescript
            - python
            - go
  steps:
    - id: fetch
      name: Fetch Template
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          owner: ${{ parameters.owner }}
          language: ${{ parameters.language }}

    - id: publish
      name: Create Repository
      action: publish:github
      input:
        repoUrl: github.com?owner=myorg&repo=${{ parameters.name }}
        defaultBranch: main

    - id: register
      name: Register in Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yml
```

What the developer gets in 30 seconds:

- A new GitHub repo with the codebase scaffolded
- CI/CD pipeline configured and running
- Kubernetes manifests with proper resource limits
- Grafana dashboard provisioned
- PagerDuty service created
- Service registered in the catalog

What it used to take: 2 weeks and 5 Jira tickets across 3 teams.

<Callout type="tip" title="Golden paths, not golden cages">
  The template should be the easy path, not the only path. Developers who need to deviate should be able to — the platform just makes the standard path so easy that most people choose it voluntarily.
</Callout>

## Observability Integration

Every service in the catalog links directly to its dashboards, logs, and alerts. No more "which Grafana dashboard is for this service?":

```typescript title="plugins/observability-card.tsx"
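// Backstage plugin that embeds service health in the catalog page
import React from "react";
import { InfoCard } from "@backstage/core-components";
import { useEntity } from "@backstage/plugin-catalog-react";

// Assumption: base URL of the internal Grafana; read it from app config in practice
const GRAFANA_URL = "https://grafana.internal";

export const ServiceHealthCard = () => {
  const { entity } = useEntity();
  // Assumes the annotation value is usable as the dashboard path segment
  const dashboard = entity.metadata.annotations?.["grafana/dashboard-selector"];

  // Nothing to embed if the service has not declared a dashboard
  if (!dashboard) return null;

  return (
    <InfoCard title="Service Health">
      <iframe
        title="Service health dashboard"
        src={`${GRAFANA_URL}/d/${encodeURIComponent(dashboard)}?kiosk`}
        width="100%"
        height="300"
        style={{ border: "none" }}
      />
    </InfoCard>
  );
};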
```

## Measuring Adoption

The platform only succeeds if developers use it. Useful adoption metrics to track:

- **Catalog coverage** — percentage of services registered (target: 100%)
- **Template usage** — new services created through templates vs manually
- **Time to first deploy** — how long from "we need a new service" to "it's running in staging"
- **Portal active users** — weekly active users in Backstage

```bash
# Quick catalog coverage check
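# gh repo list returns 30 repos by default, so raise the limit
TOTAL_REPOS=$(gh repo list myorg -L 1000 --json name -q '.[].name' | wc -l)
# jq's `length` filter counts array elements (`.length` would look up a key and fail)
REGISTERED=$(curl -s "$BACKSTAGE_URL/api/catalog/entities?filter=kind=component" | jq 'length')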
echo "Catalog coverage: $REGISTERED / $TOTAL_REPOS"
```
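A raw count hides which repos are missing, which is what a platform team needs in order to follow up. The sketch below computes coverage together with the list of unregistered repos. `catalogCoverage` is a hypothetical helper, and it assumes each component is named after its repository, which may not hold everywhere.

```typescript
interface Coverage {
  registered: number;
  total: number;
  percent: number; // rounded to the nearest whole percent
  missing: string[]; // repos with no matching catalog component
}

// repoNames: e.g. from `gh repo list`.
// componentNames: the metadata.name values returned by the catalog API.
function catalogCoverage(repoNames: string[], componentNames: string[]): Coverage {
  const missing = repoNames
    .filter((repo) => componentNames.indexOf(repo) === -1)
    .sort();
  const total = repoNames.length;
  const registered = total - missing.length;
  const percent = total === 0 ? 0 : Math.round((registered / total) * 100);
  return { registered, total, percent, missing };
}

const report = catalogCoverage(
  ["payment-service", "user-service", "legacy-cron", "api-gateway"],
  ["payment-service", "api-gateway"],
);
console.log(`${report.percent}% covered, missing: ${report.missing.join(", ")}`);
```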

## What to Do Differently

1. **Start smaller** — building too many features before validating is a common trap. Start with the catalog and one template.
2. **Don't migrate everything at once** — let teams adopt at their own pace
3. **Invest in documentation** — the platform needs better docs than the tools it replaces
4. **Assign a dedicated team** — a platform that's "maintained by everyone" is maintained by no one

## Key Takeaways

1. **Start with developer pain, not infrastructure** — survey your users before building
2. **The software catalog is the foundation** — everything else builds on knowing what exists and who owns it
3. **Golden paths reduce time-to-production by 10x** — self-service scaffolding is the highest-impact feature
4. **Measure adoption, not features** — a platform nobody uses is worse than no platform
5. **Treat the platform as a product** — it needs a roadmap, user research, and a dedicated team

<Sources items={[
  { title: "Backstage by Spotify", url: "https://backstage.io/docs/overview/what-is-backstage", publisher: "Spotify" },
  { title: "Platform Engineering on Kubernetes", url: "https://platformengineering.org/", publisher: "Platform Engineering" },
  { title: "Team Topologies", url: "https://teamtopologies.com/", publisher: "Matthew Skelton & Manuel Pais" },
  { title: "What is Platform Engineering?", url: "https://www.gartner.com/en/articles/what-is-platform-engineering", publisher: "Gartner" },
]} />


---

# Structured Logging That Actually Scales

URL: https://basnet.dev/posts/structured-logging-that-scales
Published: 2026-01-28
Category: Observability
Tags: logging, observability, devops, monitoring
Reading time: 5 min read

> Why replacing text logs with structured JSON, shipping them to a central stack, and adopting consistent query patterns cuts incident response time in half.

The first thing to check when inheriting a production system is the logs. Unstructured text like `ERROR: something went wrong in payment service` is a reliable signal that incident response is going to be painful. Structured logging is one of those practices that costs almost nothing to implement but transforms how fast you can diagnose problems.

## The Problem with Text Logs

Traditional log lines look like this:

```
2026-01-28 14:23:01 ERROR PaymentService - Failed to process payment for user 12345, order 67890, amount $150.00, error: timeout
```

Parsing this requires regex. Every service formats logs differently. Searching across services means writing different queries for each one. Correlating a single request across multiple services is nearly impossible.

## Structured Logging

The same event as structured JSON:

```json title="structured-log-entry.json"
{
  "timestamp": "2026-01-28T14:23:01.456Z",
  "level": "error",
  "service": "payment-service",
  "message": "Payment processing failed",
  "userId": "12345",
  "orderId": "67890",
  "amount": 150.00,
  "currency": "USD",
  "error": "upstream_timeout",
  "duration_ms": 30000,
  "traceId": "abc-123-def-456",
  "spanId": "span-789"
}
```

Every field is queryable. Every service uses the same format. Correlating a request across services is a single query on `traceId`.
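To show what "a single query on `traceId`" amounts to, here is a small sketch that parses newline-delimited JSON logs and rebuilds one request's timeline across services. `parseLines` and `traceTimeline` are hypothetical helpers, and the field names follow the example entry above. A log backend does this at scale; the sketch only illustrates why consistent fields make it trivial.

```typescript
// Fields follow the example entry above; any extra fields are kept as-is.
interface LogEntry {
  timestamp: string;
  level: string;
  service: string;
  message: string;
  traceId?: string;
  [field: string]: unknown;
}

// Parse newline-delimited JSON, skipping lines that are not JSON objects.
function parseLines(raw: string): LogEntry[] {
  const entries: LogEntry[] = [];
  for (const line of raw.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const parsed = JSON.parse(line);
      if (parsed && typeof parsed === "object" && !Array.isArray(parsed)) {
        entries.push(parsed as LogEntry);
      }
    } catch {
      // Unstructured line: ignore it rather than fail the whole batch
    }
  }
  return entries;
}

// All entries for one request, across services, in time order.
// ISO 8601 UTC timestamps sort correctly as plain strings.
function traceTimeline(entries: LogEntry[], traceId: string): LogEntry[] {
  return entries
    .filter((e) => e.traceId === traceId)
    .sort((a, b) => (a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0));
}
```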

## Implementation

### Node.js with Pino

Pino is one of the fastest JSON loggers for Node.js. It keeps per-call overhead low and can move formatting and shipping to a worker thread through its transports:

```typescript title="logger.ts"
import pino from "pino";

export const logger = pino({
  level: process.env.LOG_LEVEL ?? "info",
  formatters: {
    level(label) {
      return { level: label };
    },
  },
  base: {
    service: process.env.SERVICE_NAME,
    environment: process.env.NODE_ENV,
    version: process.env.APP_VERSION,
  },
});
```

Usage in application code:

```typescript title="payment-handler.ts"
import { logger } from "./logger";

async function processPayment(userId: string, orderId: string, amount: number) {
  const log = logger.child({ userId, orderId, amount });

  log.info("Processing payment");

  try {
    const result = await paymentGateway.charge(amount);
    log.info({ transactionId: result.id, duration_ms: result.duration }, "Payment succeeded");
    return result;
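  } catch (error) {
    // `error` is `unknown` under strict TypeScript, so narrow it before reading fields
    const err = error as Error & { code?: string };
    log.error({ error: err.message, code: err.code }, "Payment failed");
    throw error;
  }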
}
```

The `child()` method creates a logger with context fields that are automatically included in every log entry. No more manually including `userId` in every log call.
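The idea behind child loggers is simple enough to show in a few lines. The toy class below is not Pino's implementation and shares none of its internals. It is a hypothetical `TinyLogger` that only demonstrates how bound context fields are merged into every entry.

```typescript
// Toy logger for illustration only; this is not how Pino is implemented.
type Fields = Record<string, unknown>;

class TinyLogger {
  constructor(
    private readonly bindings: Fields,
    private readonly write: (line: string) => void,
  ) {}

  // A child copies the parent's bindings and adds its own on top.
  child(extra: Fields): TinyLogger {
    return new TinyLogger({ ...this.bindings, ...extra }, this.write);
  }

  // Per-call fields are merged after the bound context.
  info(fields: Fields, message: string): void {
    this.write(JSON.stringify({ level: "info", ...this.bindings, ...fields, message }));
  }
}

const captured: string[] = [];
const rootLogger = new TinyLogger({ service: "payment-service" }, (line) => {
  captured.push(line);
});
const requestLogger = rootLogger.child({ userId: "12345", orderId: "67890" });
requestLogger.info({ transactionId: "txn_1" }, "Payment succeeded");
console.log(captured[0]);
```

The printed entry carries `service`, `userId`, and `orderId` even though the call site passed only `transactionId`.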

### Go with zerolog

```go title="logger.go"
package main

import (
    "os"
    "github.com/rs/zerolog"
    "github.com/rs/zerolog/log"
)

func init() {
    zerolog.TimeFieldFormat = zerolog.TimeFormatUnix
    log.Logger = zerolog.New(os.Stdout).With().
        Str("service", "payment-service").
        Str("version", os.Getenv("APP_VERSION")).
        Timestamp().
        Logger()
}

func processPayment(userID string, amount float64) error {
    log.Info().
        Str("userId", userID).
        Float64("amount", amount).
        Msg("Processing payment")
    return nil
}
```

## Shipping Logs

Structured logs are only useful if they're aggregated in a central, searchable system. A minimal self-hosted stack looks like this:

```yaml title="docker-compose.logging.yml"
services:
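  vector:
    image: timberio/vector:latest-alpine
    # Point Vector at the TOML file explicitly; newer releases default to vector.yaml
    command: ["--config", "/etc/vector/vector.toml"]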
    volumes:
      - ./vector.toml:/etc/vector/vector.toml:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    depends_on:
      - loki

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  loki-data:
  grafana-data:
```

Vector collects logs from Docker containers, parses the JSON, and ships them to Loki. Grafana queries Loki for visualization and alerting.

```toml title="vector.toml"
[sources.docker]
type = "docker_logs"

[transforms.parse]
type = "remap"
inputs = ["docker"]
source = '''
. = parse_json!(.message)
'''

[sinks.loki]
type = "loki"
inputs = ["parse"]
endpoint = "http://loki:3100"
encoding.codec = "json"
labels.service = "{{ service }}"
labels.level = "{{ level }}"
```

<Callout type="tip" title="Why Loki over Elasticsearch?">
  Loki indexes only labels (service, level, environment), not the full log content. This makes it orders of magnitude cheaper to run. For most teams, the tradeoff — slightly slower full-text search in exchange for 10x lower infrastructure cost — is worth it.
</Callout>

## Query Patterns That Save Time

### Find all errors for a specific user in the last hour

```
{service="payment-service", level="error"} | json | userId = "12345"
```

### Trace a request across services

```
{level=~"info|error"} | json | traceId = "abc-123-def-456"
```

### Find slow requests

```
{service="api-gateway"} | json | duration_ms > 5000
```

### Error rate by service (last 15 minutes)

```
sum by (service) (rate({level="error"}[15m]))
```
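To make the last query less opaque: a rate over a window is a count of matching entries divided by the window length in seconds, and `sum by (service)` groups that per service. The sketch below reproduces that arithmetic on an in-memory list. `errorRateByService` and `LogPoint` are hypothetical names, and the exact window-boundary handling is an assumption rather than a statement about Loki's internals.

```typescript
interface LogPoint {
  service: string;
  level: string;
  epochSeconds: number;
}

// Per-second error rate for each service over the trailing window,
// approximating `sum by (service) (rate({level="error"}[15m]))`.
function errorRateByService(
  points: LogPoint[],
  nowSeconds: number,
  windowSeconds: number,
): Record<string, number> {
  const counts: Record<string, number> = Object.create(null);
  for (const p of points) {
    const inWindow = p.epochSeconds > nowSeconds - windowSeconds && p.epochSeconds <= nowSeconds;
    if (p.level !== "error" || !inWindow) continue;
    counts[p.service] = (counts[p.service] || 0) + 1;
  }
  const rates: Record<string, number> = Object.create(null);
  for (const service of Object.keys(counts)) {
    rates[service] = counts[service] / windowSeconds;
  }
  return rates;
}
```

With a 15-minute window (900 seconds), 450 error entries from one service give a rate of 0.5 per second, which is the threshold scale that rate-based alert expressions compare against.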

## Alerting on Logs

Logs aren't just for post-incident investigation. With structured data, you can alert proactively:

```yaml title="loki-alert-rules.yml"
groups:
  - name: log-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate({level="error"}[5m])) by (service) > 0.5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in {{ $labels.service }}"

      - alert: PaymentFailureSpike
        expr: |
          sum(rate({service="payment-service", level="error"} |= "Payment failed" [5m])) > 0.1
        for: 1m
        labels:
          severity: critical
```

## Key Takeaways

1. **Structured from day one** — retrofitting structured logging is painful; start with JSON from the beginning
2. **Use child loggers for context** — attach request-scoped fields once, not in every log call
3. **Include a trace ID in every log** — this is the single most valuable field for debugging distributed systems
4. **Centralize immediately** — logs on individual servers are useless during incidents when you need cross-service visibility
5. **Alert on log patterns** — don't wait for users to report problems that your logs already show

<Sources items={[
  { title: "Pino — Super Fast Node.js Logger", url: "https://getpino.io/", publisher: "Pino" },
  { title: "Grafana Loki Documentation", url: "https://grafana.com/docs/loki/latest/", publisher: "Grafana Labs" },
  { title: "Vector — A Lightweight Log Pipeline", url: "https://vector.dev/docs/", publisher: "Datadog" },
  { title: "Google SRE Book — Monitoring Distributed Systems", url: "https://sre.google/sre-book/monitoring-distributed-systems/", publisher: "Google" },
]} />


---

# Zero Trust Networking: A Practical Implementation Guide

URL: https://basnet.dev/posts/zero-trust-networking-practical-guide
Published: 2026-01-12
Category: Security
Tags: zero-trust, security, networking, devops, kubernetes
Reading time: 5 min read

> Moving beyond perimeter security — a practical approach to implementing zero trust across services, users, and infrastructure without boiling the ocean.

"Never trust, always verify" sounds great in a conference talk. Implementing it in a production environment with legacy services, tight deadlines, and engineers who just want to ship features is a different story. This guide covers how to roll out zero trust incrementally without breaking everything.

## Why Perimeter Security Fails

The traditional network model — hard outer shell, soft interior — assumes that anything inside the network is trusted. This fails because:

- **Lateral movement** — an attacker who compromises one service can reach everything on the internal network
- **Remote work** — the "inside" and "outside" distinction no longer maps to physical locations
- **Cloud services** — your perimeter now extends to AWS, GCP, SaaS tools, and third-party APIs
- **Supply chain attacks** — a compromised dependency runs with full network access inside your perimeter

## Zero Trust Principles

Every request — whether from a user, service, or device — must be:

1. **Authenticated** — prove who you are
2. **Authorized** — prove you're allowed to do this specific thing
3. **Encrypted** — all traffic encrypted, even internal
4. **Continuously verified** — authentication isn't a one-time event

## Starting Point: Service-to-Service Authentication

The highest-impact first step is ensuring services authenticate to each other. No more "if it's on the internal network, it's trusted."

### Mutual TLS (mTLS)

Every service gets a certificate. Every connection requires both sides to present valid certificates:

```yaml title="istio-peer-authentication.yml"
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT
```

With Istio's strict mTLS, any service that tries to communicate without a valid certificate is rejected. No exceptions.

### Service Identity with SPIFFE

SPIFFE provides a standard for service identity that works across platforms:

```
spiffe://myorg.com/ns/production/sa/payment-service
```

Every service gets a SPIFFE ID. Authorization policies reference these IDs instead of IP addresses or hostnames, which change constantly in dynamic environments.
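As an illustration of why identity-based policies are easier to reason about than IP rules, the sketch below parses an ID in the layout shown above and checks it against an allow-list. `parseSpiffeId` and `isAllowedCaller` are hypothetical helpers. The SPIFFE standard permits other path layouts, and this handles only the `ns/<namespace>/sa/<service-account>` form.

```typescript
interface WorkloadIdentity {
  trustDomain: string;
  namespace: string;
  serviceAccount: string;
}

// Parses IDs of the form spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>.
function parseSpiffeId(id: string): WorkloadIdentity | null {
  const m = /^spiffe:\/\/([^\/]+)\/ns\/([^\/]+)\/sa\/([^\/]+)$/.exec(id);
  if (!m) return null;
  return { trustDomain: m[1], namespace: m[2], serviceAccount: m[3] };
}

// Hypothetical allow-list check: the caller must present a well-formed ID
// that is explicitly listed; nothing is trusted by network location.
function isAllowedCaller(id: string, allowed: string[]): boolean {
  return parseSpiffeId(id) !== null && allowed.indexOf(id) !== -1;
}

const parsedId = parseSpiffeId("spiffe://myorg.com/ns/production/sa/payment-service");
console.log(parsedId);
```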

## Network Policies: Default Deny

The foundation of zero trust networking in Kubernetes — deny all traffic by default, then explicitly allow only what's needed:

```yaml title="default-deny.yml"
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```

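Then allow specific communication paths. The policy below admits ingress to Postgres from `api-service`. Because the default-deny policy also blocks egress, `api-service` needs a matching egress rule to Postgres, and pods need egress to cluster DNS (port 53), before traffic actually flows: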

```yaml title="allow-api-to-db.yml"
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-to-database
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-service
      ports:
        - port: 5432
          protocol: TCP
```

<Callout type="warning" title="Test network policies in staging first">
  A misconfigured default-deny policy will take down your entire application instantly. Deploy to staging, verify every service can still communicate, then promote to production. Have a rollback plan ready.
</Callout>
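
To reason about what the two policies above permit, it can help to model ingress evaluation in a few lines. This is a simplified model, not the Kubernetes implementation: it handles only `matchLabels` selectors and single ports within one namespace, and the `Policy` type is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    pod_selector: dict                            # matchLabels of the pods the policy applies to
    ingress: list = field(default_factory=list)   # (from_labels, port) rules


def matches(selector: dict, labels: dict) -> bool:
    """An empty selector matches every pod, as in Kubernetes."""
    return all(labels.get(k) == v for k, v in selector.items())


def ingress_allowed(policies, src_labels, dst_labels, port) -> bool:
    selecting = [p for p in policies if matches(p.pod_selector, dst_labels)]
    if not selecting:
        return True  # pods not selected by any policy accept all traffic
    # Policies are additive: traffic passes if any selecting policy has a matching rule.
    return any(
        matches(from_labels, src_labels) and rule_port == port
        for p in selecting
        for from_labels, rule_port in p.ingress
    )


default_deny = Policy(pod_selector={})
api_to_db = Policy(pod_selector={"app": "postgres"}, ingress=[({"app": "api-service"}, 5432)])
policies = [default_deny, api_to_db]
```

With both policies loaded, `ingress_allowed(policies, {"app": "api-service"}, {"app": "postgres"}, 5432)` passes, while the same call from any other source or to any other port is denied.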

## User Access: Beyond VPN

VPNs typically grant broad network access once a user connects, which is the opposite of zero trust. Replace VPN-based access with identity-aware proxies:

```nginx title="identity-aware-proxy.conf"
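server {
    listen 443 ssl;
    server_name internal-tool.example.com;

    # Placeholder certificate paths; adjust for your environment
    ssl_certificate     /etc/nginx/tls/internal-tool.crt;
    ssl_certificate_key /etc/nginx/tls/internal-tool.key;

    # Assumes oauth2-proxy (or a compatible auth service) at oauth2-proxy:4180
    location /oauth2/ {
        proxy_pass http://oauth2-proxy:4180;
        proxy_set_header Host $host;
    }

    location = /oauth2/auth {
        proxy_pass http://oauth2-proxy:4180;
        proxy_set_header Host $host;
        # The auth subrequest carries no body
        proxy_pass_request_body off;
        proxy_set_header Content-Length "";
    }

    location / {
        # Verify OAuth2 token on every request
        auth_request /oauth2/auth;
        error_page 401 = /oauth2/sign_in;

        auth_request_set $user $upstream_http_x_auth_request_user;
        auth_request_set $email $upstream_http_x_auth_request_email;
        auth_request_set $groups $upstream_http_x_auth_request_groups;

        proxy_pass http://internal-tool:8080;
        proxy_set_header X-Authenticated-User $user;
        proxy_set_header X-Authenticated-Email $email;
        proxy_set_header X-Authenticated-Groups $groups;
    }
}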
```

Each request is authenticated and authorized individually. No VPN. No "you're on the network, so you're trusted."
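
The proxy only forwards identity; the tool behind it still has to decide what each user may do. A minimal sketch of that check, assuming the `X-Authenticated-Groups` header carries a comma-separated group list. The `REQUIRED_GROUP` mapping and its group names are hypothetical.

```python
# Hypothetical mapping from path prefix to the group allowed to use it.
REQUIRED_GROUP = {"/admin": "platform-admins", "/reports": "finance"}


def authorize(path: str, headers: dict) -> bool:
    """Allow the request only if the user belongs to the group required for the path."""
    user = headers.get("X-Authenticated-User")
    if not user:
        return False  # the proxy did not vouch for anyone
    raw_groups = headers.get("X-Authenticated-Groups", "")
    groups = {g.strip() for g in raw_groups.split(",") if g.strip()}
    for prefix, group in REQUIRED_GROUP.items():
        if path.startswith(prefix):
            return group in groups
    return True  # paths without a group requirement only need an authenticated user
```

This only holds if the tool is reachable solely through the proxy; otherwise a client could set these headers itself.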

## Short-Lived Credentials

Long-lived API keys and service account tokens are the antithesis of zero trust. Every credential should expire:

```bash title="short-lived-aws-creds.sh"
# Instead of static AWS access keys, use STS for temporary credentials
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/deploy-role \
  --role-session-name ci-deploy \
  --duration-seconds 900  # 15 minutes — enough for one deployment

# In Kubernetes, use projected service account tokens
# that expire and auto-rotate
```

```yaml title="projected-token.yml"
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: app
      volumeMounts:
        - name: token
          mountPath: /var/run/secrets/tokens
  volumes:
    - name: token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 3600  # 1 hour
              audience: api.example.com
```
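
On the application side, a rotating token has to be re-read rather than cached at startup. The sketch below assumes the mount path from the manifest above and a JWT-formatted token. It reads the `exp` claim without verifying the signature, so it is suitable for logging or refresh timing only, never for trust decisions. Both function names are hypothetical.

```python
import base64
import json


def seconds_until_expiry(jwt_token: str, now: float) -> float:
    """Read the `exp` claim from a JWT payload without verifying the signature."""
    payload_b64 = jwt_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["exp"] - now


def read_token(path: str = "/var/run/secrets/tokens/token") -> str:
    """Re-read the file on each use: the kubelet replaces it before it expires."""
    with open(path) as f:
        return f.read().strip()
```

A client would call `read_token()` per request (or on a short timer) and could log `seconds_until_expiry(token, time.time())` to confirm rotation is happening.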

## Monitoring Zero Trust

Zero trust generates a lot of authentication and authorization events. Monitor them:

```yaml title="alert-rules.yml"
groups:
  - name: zero-trust-alerts
    rules:
      - alert: UnauthorizedServiceCommunication
        expr: |
          sum(rate(istio_requests_total{
            response_code="403",
            reporter="destination"
          }[5m])) by (source_workload, destination_workload) > 0
        for: 1m
        annotations:
          summary: "{{ $labels.source_workload }} denied access to {{ $labels.destination_workload }}"

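      - alert: MtlsHandshakeFailures
        # Metric name depends on which Envoy stats your sidecars expose;
        # confirm it against the proxy's /stats/prometheus output
        expr: |
          sum(rate(envoy_ssl_connection_error[5m])) by (pod) > 0.1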
        for: 2m
        annotations:
          summary: "mTLS handshake failures on {{ $labels.pod }}"
```

<Callout type="tip" title="403s are a feature, not a bug">
  In a zero trust environment, 403 (Forbidden) responses are expected and healthy. They mean the policy is working. Alert on unexpected 403s between services that should be communicating, not on 403s in general.
</Callout>
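
One way to separate expected denials from suspicious ones is to compare each denied pair against the communication paths you have deliberately allowed. A minimal sketch, where the `EXPECTED_PATHS` set and the workload names are hypothetical:

```python
# Hypothetical list of workload pairs that are supposed to talk to each other.
EXPECTED_PATHS = {("web", "api-service"), ("api-service", "payment-service")}


def unexpected_denials(denials):
    """Keep only 403s between workloads that should be able to communicate.

    `denials` is an iterable of (source_workload, destination_workload, count).
    """
    return [
        (src, dst, count)
        for src, dst, count in denials
        if (src, dst) in EXPECTED_PATHS and count > 0
    ]
```

Denials on pairs outside `EXPECTED_PATHS` are the policy doing its job; they are worth logging for audit but not worth paging on.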

## The Incremental Rollout

Don't try to implement everything at once. A proven rollout order:

1. **Week 1-2:** Enable mTLS in permissive mode (accept both plaintext and mTLS while you find the clients that still connect without certificates)
2. **Week 3-4:** Deploy default-deny network policies in staging
3. **Week 5-6:** Switch mTLS to strict mode in production
4. **Week 7-8:** Deploy network policies to production
5. **Month 3:** Replace VPN access with identity-aware proxy
6. **Month 4:** Migrate to short-lived credentials

Each step has a rollback plan. Each step is validated before moving to the next.

## Key Takeaways

1. **Start with service-to-service mTLS** — it's the highest-impact, lowest-risk first step
2. **Default-deny network policies are non-negotiable** — without them, compromised services have unlimited lateral movement
3. **Replace VPNs with identity-aware proxies** — VPNs are the opposite of zero trust
4. **Short-lived credentials reduce blast radius** — a leaked token that expires in 15 minutes is dramatically less dangerous
5. **Roll out incrementally** — zero trust is a journey, not a migration weekend

<Sources items={[
  { title: "NIST SP 800-207 — Zero Trust Architecture", url: "https://csrc.nist.gov/publications/detail/sp/800-207/final", publisher: "NIST" },
  { title: "Istio Security — mTLS", url: "https://istio.io/latest/docs/concepts/security/", publisher: "Istio" },
  { title: "SPIFFE — Secure Production Identity Framework", url: "https://spiffe.io/docs/latest/spiffe-about/overview/", publisher: "SPIFFE" },
  { title: "Kubernetes Network Policies", url: "https://kubernetes.io/docs/concepts/services-networking/network-policies/", publisher: "Kubernetes" },
]} />


---

# Container Security Scanning in CI/CD — Beyond the Basics

URL: https://basnet.dev/posts/container-security-scanning-pipeline
Published: 2026-01-05
Category: Cloud Native
Tags: containers, security, docker, ci-cd, devops
Reading time: 5 min read

> Image scanning alone isn't enough. This post walks through a multi-layer container security pipeline that catches vulnerabilities before they reach production.

Most teams add a Trivy scan to their CI pipeline, see a wall of CVEs, ignore most of them, and call it "container security." That's not security — it's checkbox compliance. Real container security is a multi-layer pipeline that filters noise, enforces policies, and blocks deployments that don't meet your standards.

## The Layers of Container Security

Container security isn't one thing. It's at least five:

1. **Base image selection** — which OS and runtime you start from
2. **Dependency scanning** — CVEs in your application dependencies
3. **Image scanning** — CVEs in the final built image
4. **Configuration analysis** — Dockerfile best practices and misconfigurations
5. **Runtime policies** — what the container is allowed to do when it runs

## Layer 1: Base Image Selection

Your base image choice determines most of your vulnerability surface. Alpine ships fewer packages (and so fewer CVEs) than Ubuntu. Distroless ships fewer still.

```dockerfile title="Dockerfile"
# Bad: full Ubuntu image, which carries many packages the app never uses
FROM ubuntu:22.04

# Better: Alpine — minimal package set
FROM node:20-alpine

# Best: Distroless — only your app and its runtime
FROM gcr.io/distroless/nodejs20-debian12
```

Tracking base image CVE counts as a metric is a good baseline:

```bash title="base-image-audit.sh"
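#!/bin/bash
# Requires trivy and jq on PATH
for image in "ubuntu:22.04" "node:20-alpine" "gcr.io/distroless/nodejs20-debian12"; do
  # JSON output avoids parsing the table view, which prints one "Total:" line per scan target
  COUNT=$(trivy image --quiet --severity HIGH,CRITICAL --format json "$image" 2>/dev/null \
    | jq '[.Results[]?.Vulnerabilities[]?] | length')
  echo "$image: $COUNT high/critical CVEs"
done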
```

<Callout type="tip" title="Pin your base image digest">
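  `FROM node:20-alpine` can change without warning when a new patch is published. Pin to a specific digest: `FROM node:20-alpine@sha256:abc123...`. This makes builds reproducible and keeps upstream changes from entering an image unreviewed. A pinned digest also stops receiving upstream security patches, so update it on a schedule (for example with an automated tool such as Renovate).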
</Callout>
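
Digest pinning is easy to enforce mechanically. The sketch below is one possible CI check, not part of any existing tool: it flags `FROM` lines whose image reference lacks a `sha256` digest, skipping `scratch` and references to earlier build stages. The function name is hypothetical.

```python
import re

# A pinned reference ends in @sha256:<64 hex characters>.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def unpinned_base_images(dockerfile_text: str) -> list:
    """Return the image references in FROM lines that lack a sha256 digest."""
    stage_names = set()
    unpinned = []
    for line in dockerfile_text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].upper() != "FROM":
            continue
        # Skip optional flags such as --platform=linux/amd64
        args = [t for t in tokens[1:] if not t.startswith("--")]
        if not args:
            continue
        image = args[0]
        is_stage_ref = image in stage_names or image == "scratch"
        if not is_stage_ref and not DIGEST_RE.search(image):
            unpinned.append(image)
        # Remember "AS <name>" so later "FROM <name>" lines are not flagged
        if len(args) >= 3 and args[1].upper() == "AS":
            stage_names.add(args[2])
    return unpinned
```

A pipeline step could fail the build whenever this returns a non-empty list.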

## Layer 2: Dependency Scanning

Before you even build the image, scan your application dependencies:

```yaml title=".github/workflows/security.yml"
jobs:
  dependency-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Audit npm dependencies
        run: npm audit --audit-level=high

      - name: Check for known vulnerabilities
        uses: aquasecurity/trivy-action@master  # moving branch; pin to a release tag or commit SHA in real pipelines
        with:
          scan-type: fs
          scan-ref: .
          severity: HIGH,CRITICAL
          exit-code: 1
```

## Layer 3: Image Scanning

After building the image, scan it for OS-level and application-level vulnerabilities:

```yaml title="image-scan.yml"
  image-scan:
    runs-on: ubuntu-latest
    needs: build
    steps:
      - name: Scan image
        uses: aquasecurity/trivy-action@master  # moving branch; pin to a release tag or commit SHA in real pipelines
        with:
          image-ref: ${{ env.IMAGE }}:${{ github.sha }}
          severity: HIGH,CRITICAL
          exit-code: 1
          format: sarif
          output: trivy-results.sarif

      - name: Upload to GitHub Security
        # Run even when the scan step fails, so findings still reach the Security tab
        if: always()
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: trivy-results.sarif
```

### Handling the CVE Noise

The raw scan output is overwhelming. Many CVEs have no fix available, or they sit in packages the app never calls. A `.trivyignore` file records the risks you have reviewed and accepted:

```text title=".trivyignore"
# No fix available, not exploitable in our context
CVE-2023-44487

# Fixed in next base image update, scheduled for March
CVE-2024-21626

# False positive — we don't use the affected function
CVE-2024-34156
```

Every ignored CVE must have a comment explaining why. This file is reviewed in every security audit.
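
The comment requirement can be checked automatically. A minimal sketch, assuming the file layout shown above (a `#` comment directly above each ID, with blank lines between entries). The function name is hypothetical.

```python
def undocumented_ignores(trivyignore_text: str) -> list:
    """Return IDs in a .trivyignore file that have no comment directly above them."""
    missing = []
    previous_was_comment = False
    for raw in trivyignore_text.splitlines():
        line = raw.strip()
        if not line:
            previous_was_comment = False  # a blank line separates entries
            continue
        if line.startswith("#"):
            previous_was_comment = True
            continue
        if not previous_was_comment:
            missing.append(line.split()[0])
        previous_was_comment = False  # one justification covers one entry
    return missing
```

Running this in CI and failing on a non-empty result keeps unexplained entries from slipping in between audits.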

## Layer 4: Dockerfile Analysis

Static analysis catches misconfigurations before the image is even built:

```bash
# Hadolint — Dockerfile linter
hadolint Dockerfile
```

Common issues Hadolint catches:

```dockerfile title="bad-practices.dockerfile"
# DL3007: Using latest is prone to errors
FROM node:latest

# DL3003: Use WORKDIR instead of cd
RUN cd /app && npm install

# DL3009: Delete apt-get lists after installing
RUN apt-get update && apt-get install -y curl

# DL3018: Pin versions in apk add
RUN apk add curl
```

The fixed version:

```dockerfile title="good-practices.dockerfile"
FROM node:20-alpine@sha256:abc123

WORKDIR /app

RUN apk add --no-cache curl=8.5.0-r0

COPY package*.json ./
RUN npm ci --omit=dev

COPY . .

USER node
EXPOSE 3000
CMD ["node", "dist/index.js"]
```

## Layer 5: Runtime Policies

Scanning the image isn't enough — you also need to restrict what it can do at runtime:

```yaml title="security-context.yml"
apiVersion: v1
kind: Pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
      resources:
        limits:
          memory: 512Mi
          cpu: 500m
```

<Callout type="warning" title="readOnlyRootFilesystem breaks some apps">
  Applications that write to `/tmp`, create lock files, or generate runtime configs will fail with a read-only filesystem. Mount specific writable paths as `emptyDir` volumes instead of disabling the restriction entirely.
</Callout>
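
These settings are simple enough to verify before a manifest is applied. The sketch below checks a pod spec, already parsed into a dictionary, against the hardening options used above. It is a simplification: it only looks at pod-level `runAsNonRoot`, although Kubernetes also accepts that field per container. The function name is hypothetical.

```python
def security_context_violations(pod_spec: dict) -> list:
    """List the hardening settings that a pod spec is missing."""
    problems = []
    if pod_spec.get("securityContext", {}).get("runAsNonRoot") is not True:
        problems.append("pod: runAsNonRoot is not true")
    for container in pod_spec.get("containers", []):
        ctx = container.get("securityContext", {})
        name = container.get("name", "<unnamed>")
        if ctx.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation is not false")
        if ctx.get("readOnlyRootFilesystem") is not True:
            problems.append(f"{name}: readOnlyRootFilesystem is not true")
        if "ALL" not in ctx.get("capabilities", {}).get("drop", []):
            problems.append(f"{name}: capabilities do not drop ALL")
    return problems
```

In a cluster, the same rules are usually enforced by an admission controller; a check like this just surfaces the problem earlier, at PR time.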

## The Complete Pipeline

Putting it all together:

```
PR Created
  ├── Dependency scan (npm audit + Trivy filesystem)
  ├── Dockerfile lint (Hadolint)
  └── Unit tests

PR Merged
  ├── Build image
  ├── Image scan (Trivy)
  │   ├── CRITICAL CVEs → Block deployment
  │   ├── HIGH CVEs → Warn, require approval
  │   └── MEDIUM/LOW → Log, proceed
  ├── Sign image (Cosign)
  └── Push to registry

Deploy
  ├── Verify image signature
  ├── Check admission policies (OPA/Kyverno)
  └── Apply with security contexts
```
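
The severity branching in the image-scan stage can be sketched as a small function over Trivy's JSON report (`Results[].Vulnerabilities[].Severity`). The decision labels `block`, `require-approval`, and `proceed` are hypothetical names for the three outcomes in the diagram.

```python
def gate(trivy_report: dict) -> str:
    """Map the highest severity in a Trivy JSON report to a pipeline decision."""
    severities = {
        vuln.get("Severity")
        for result in trivy_report.get("Results", [])
        # "Vulnerabilities" may be absent or null for clean targets
        for vuln in result.get("Vulnerabilities") or []
    }
    if "CRITICAL" in severities:
        return "block"
    if "HIGH" in severities:
        return "require-approval"
    return "proceed"
```

A CI step could load the report with `json.load`, call `gate`, and exit non-zero on `block` while routing `require-approval` to a manual approval job.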

## Key Takeaways

1. **Start with the base image** — Alpine or Distroless eliminates most CVEs before you write any code
2. **Pin base image digests** — reproducible builds prevent surprise vulnerabilities
3. **Filter scan noise with `.trivyignore`** — but require justification for every entry
4. **Lint Dockerfiles** — misconfigurations are as dangerous as CVEs
5. **Enforce runtime security contexts** — scanning without runtime restrictions is half the picture

<Sources items={[
  { title: "Trivy — Container Security Scanner", url: "https://trivy.dev/latest/", publisher: "Aqua Security" },
  { title: "Dockerfile Best Practices", url: "https://docs.docker.com/build/building/best-practices/", publisher: "Docker" },
  { title: "Kubernetes Security Context", url: "https://kubernetes.io/docs/tasks/configure-pod-container/security-context/", publisher: "Kubernetes" },
  { title: "Hadolint — Dockerfile Linter", url: "https://github.com/hadolint/hadolint", publisher: "GitHub" },
]} />