Wednesday, October 29, 2025

The Mini-Box Pattern: A Pragmatic Path to Resilient Event-Driven Architecture

In modern microservices, the ability to reliably react to changes is paramount. Event-driven architectures, often powered by message brokers like Google Pub/Sub, promise loose coupling and scalability. But this power comes with a significant challenge: ensuring that processing an event and notifying others about it is a reliable operation.

In this post, we'll explore a common pitfall in event-driven systems and introduce a simple, evolutionary pattern we call the "Mini-Box Pattern," a pragmatic stepping stone to the well-known Outbox Pattern.

The Dream: Seamless Event-Driven Communication

In our ideal system, events flow seamlessly, driving business processes forward. We primarily see two scenarios:

  • Scenario 1: Event Handler Chains. A microservice receives a change event from Pub/Sub, updates its own database, and produces a new notification event for other interested services to consume.
  • Scenario 2: API-Driven Events. A REST API updates the database and, as a side effect, must produce a notification event (e.g., for an audit service or to update a read model).

In both cases, the service must reliably do two things: update its database and send a new event.

The Nightmare: The Non-Atomic Reality

The core reliability problem stems from a simple fact: database transactions and network calls are not atomic.

A service must only acknowledge (ACK) the initial event or API request once both the database write and the new event publication are successful. If either fails, it should negatively acknowledge (NACK) to force a retry.

But consider this failure scenario:

  1. The service successfully commits its database transaction.
  2. It then tries to publish the resulting event to Pub/Sub, but a network partition occurs.
  3. The publication fails. The service must NACK the original event.
  4. The original event is retried, leading to a duplicate database update.

This is a classic "at-least-once" delivery problem, but it's compounded by the fact that the two critical operations can't be grouped. Even with robust retry logic and exponential backoff, the retries themselves can cause timeouts, leading to unnecessary NACKs and system instability.

We needed a way to break the tight, unreliable coupling between the database transaction and the event publication.
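To make the failure window concrete, here is a simplified sketch of the kind of naive handler we started from. The types (PubSubMessage, AckReply) and helper names are illustrative, not any specific client library's API:

// Naive handler: the database commit and the publish are two independent steps.
// If the publish fails after the commit succeeds, the NACK triggers a redelivery
// and the database update runs a second time.
public void onOrderChanged(PubSubMessage message, AckReply reply) {
    try {
        Order order = deserialize(message);          // hypothetical helper
        orderRepository.save(order);                 // step 1: commit the transaction
        pubSubPublisher.publish(buildEvent(order));  // step 2: notify downstream services
        reply.ack();                                 // only ACK when both steps succeed
    } catch (Exception e) {
        reply.nack();                                // retry -> possible duplicate database update
    }
}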

The Goal: The Transactional Outbox Pattern

The definitive solution to this problem is the Transactional Outbox Pattern. In this pattern, the outgoing event is stored as part of the same database transaction that updates the business data. A separate process then relays these stored events to the message broker.

This ensures atomicity—the event is guaranteed to be persisted if the transaction commits. However, implementing the full outbox pattern, with a reliable relay service, can be a significant undertaking.

The Bridge: Introducing the Mini-Box Pattern

Faced with time constraints but an urgent need for improved resilience, we designed an intermediate solution: the Mini-Box Pattern.

This pattern gives us the core durability benefit of the outbox without immediately building the asynchronous relay component. It's a pragmatic compromise that buys us critical time and creates a foundation we can evolve.

How the Mini-Box Pattern Works

The key is to treat the outgoing event as data first, and a message second.

  1. Transactional Write: Inside the same database transaction that handles the business logic:
    • The business data is updated.
    • The outgoing event payload is serialized (e.g., to JSON) and inserted into a dedicated outbox_messages table.
  2. Best-Effort Publish: After the transaction successfully commits, the service attempts to publish the event to Pub/Sub as before.
  3. The Safety Net: This is where the magic happens.
    • On Success: The event is sent, and the source is ACK'd. We have a record of the event in our database for potential debugging or replay.
    • On Failure: If the Pub/Sub call fails, the event is already safely stored. We can NACK the original request without fear of losing the event. Our system can now alert us that there are "stranded" messages in the outbox_messages table, which can be replayed manually via a console or script.

This approach decouples the fate of our event from the transient failures of the network. The synchronous part of our operation (the database transaction) now captures the full intent of the operation, including the need to notify others.

Our Implementation Plan

Adopting the Mini-Box pattern involved a clear, staged plan:

  1. Design the Foundation: Create the outbox_messages table via a Liquibase script.
  2. Refactor the Core: Update our shared publishing library to perform the dual write: first to the database table, then to Pub/Sub.
  3. Integrate Across Services: Roll out the updated library to all our REST APIs and event handlers. This work was parallelized across the team.
  4. Test Rigorously: Conduct end-to-end performance and integration tests to ensure the new flow was stable.
  5. Implement Alerting: Set up monitoring and alerting to notify us when messages fail to publish and land in the table.
  6. Evolve: This table and process are perfectly positioned to be evolved into a full Outbox Pattern by building a relay service that polls this table and publishes the events.
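When we get to step 6, the relay doesn't have to be elaborate. Here's a minimal sketch of what a polling relay could look like, assuming a Spring scheduled job and an OutboxMessage entity with payload, topic, and published fields (the names and polling interval are illustrative, not our production code):

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class OutboxRelay {

    private final OutboxRepository outboxRepository;
    private final PubSubPublisher pubSubPublisher;

    public OutboxRelay(OutboxRepository outboxRepository, PubSubPublisher pubSubPublisher) {
        this.outboxRepository = outboxRepository;
        this.pubSubPublisher = pubSubPublisher;
    }

    // Poll for rows that were persisted but never successfully published
    @Scheduled(fixedDelay = 5000)
    public void relayPendingMessages() {
        for (OutboxMessage message : outboxRepository.findByPublishedFalse()) {
            try {
                pubSubPublisher.publish(message.getTopic(), message.getPayload());
                message.markPublished();
                outboxRepository.save(message);   // record success so the row isn't resent
            } catch (Exception e) {
                // Leave the row untouched; the next poll (or an alert) will pick it up
            }
        }
    }
}

Once a relay like this is in place, the best-effort publish in step 2 can be retired and the relay becomes the single publishing path, which is the full Outbox Pattern.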

A Glimpse at the Code

The core logic is surprisingly straightforward. Here's a simplified conceptual example:

// Inside the service, within a single transaction (e.g., Spring's @Transactional)
@Transactional
public OrderCreatedEvent processOrder(Order order) throws JsonProcessingException {

    // 1. Update business data
    orderRepository.save(order);

    // 2. Serialize and persist the event within the SAME transaction
    OrderCreatedEvent event = new OrderCreatedEvent(order);
    String serializedEvent = objectMapper.writeValueAsString(event);
    outboxRepository.save(new OutboxMessage(serializedEvent, "order-topic"));

    // Transaction commits when this method returns. If it fails, everything rolls back.
    return event;
}

// After the transaction has committed, the caller attempts a best-effort publish
OrderCreatedEvent event = orderService.processOrder(order);
try {
    pubSubPublisher.publishAsync(event);
} catch (PublishException e) {
    // The event is safe in the database! We can alert and replay later.
    logger.warn("Publish failed, but event is persisted to outbox. Manual replay required.", e);
}
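And here is roughly what the OutboxMessage referenced above might look like. This is an illustrative JPA entity (javax.persistence here; jakarta.persistence on newer stacks); the column names and types are assumptions, not our exact Liquibase schema:

import java.time.Instant;
import javax.persistence.*;

@Entity
@Table(name = "outbox_messages")
public class OutboxMessage {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(nullable = false)
    private String payload;              // the serialized event (JSON)

    @Column(nullable = false)
    private String topic;                // destination Pub/Sub topic

    @Column(nullable = false)
    private boolean published = false;   // flipped once the event reaches Pub/Sub

    @Column(nullable = false)
    private Instant createdAt = Instant.now();

    protected OutboxMessage() { }        // required by JPA

    public OutboxMessage(String payload, String topic) {
        this.payload = payload;
        this.topic = topic;
    }

    public void markPublished() { this.published = true; }

    public String getPayload() { return payload; }
    public String getTopic()   { return topic; }
}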

Conclusion

The Mini-Box Pattern is a testament to pragmatic engineering. It acknowledges that while perfect, definitive solutions are excellent goals, sometimes the best path is an evolutionary one.

By making a small architectural change—treating events as data first—we dramatically increased the resilience of our system without a massive upfront investment. We've bought ourselves time, reduced operational anxiety, and built a solid foundation for the future. If you're struggling with event reliability but aren't ready for a full outbox implementation, the Mini-Box might be the perfect bridge for you.

Have you faced similar challenges? What interim patterns have you used? Share your thoughts in the comments below!

Sunday, September 7, 2025

The Design Doc Dilemma: Finding the "Just Enough" in Your Two-Week Sprint

How a little upfront design can prevent your "go fast" agile team from actually going slow.

If you’ve worked on an agile team, you know the rhythm: backlog grooming, sprint planning, a two-week burst of coding, and then a review. The mantra is often "working software over comprehensive documentation." But if you’ve been in the trenches, you’ve also likely seen this scenario: 

A complex story gets pulled into a sprint. The team huddles for a quick 15-minute discussion, then everyone jumps straight to code. Two days later, PR comments reveal a fundamental misunderstanding. A week in, two engineers realize their implementations are incompatible. By the end of the sprint, the feature is "done," but it’s a Rube Goldberg machine of code—brittle, overly complex, and difficult to test.
The subsequent sprints are then plagued with bug fixes, refactoring, and rework directly caused by that initial rushed implementation.

Going fast ended up making us go agonizingly slow.

This isn't an indictment of agile; it's a misapplication of it. The key to avoiding this trap isn't to revert to weeks of Big Design Upfront (BDUF), but to intelligently apply "Just Enough" Design Upfront (JEDUF). And the most effective tool I've found for this is the humble design document.

Why Jumping Straight to Code Fails for Complex Problems

For simple CRUD tasks or well-trodden paths, a story description and a quick conversation are perfectly sufficient. The problem space is understood, and the solution is obvious. But for complex, novel, or architecturally significant work, code is a terrible medium for exploring ideas.

Why?

  • Code is Final: Writing code is an act of commitment. Changing a core architectural decision after hundreds of lines have been written is expensive.
  • It Lacks Context: Code shows how something is done, but rarely explains the why: the considered alternatives, the trade-offs, and the rejected ideas.
  • It's Isolating: Without a shared artifact, engineers can head down divergent paths, only discovering their misalignment during a painful merge conflict.

The Antidote: The Targeted Design Doc

The solution isn't to document everything, but to recognize the stories that carry risk and complexity and treat them differently. For these, I advocate for a simple process:
  1. Identify the Candidate: During sprint planning or grooming, flag a story as "complex." This is usually obvious—it involves new system integrations, significant performance requirements, novel algorithms, or has a high degree of ambiguity.
  2. Time-Box the Design: The assigned engineer spends a few hours (not days!) drafting a concise design doc. This isn't a 50-page specification. It's a brief document that outlines:
    • The Problem: What are we actually solving?
    • The Proposed Solution: A high-level overview of the approach.
    • Considered Alternatives: What other paths did you consider? Why were they rejected?
    • Key Trade-offs: (e.g., "We chose faster performance over code simplicity here because of requirement X.")
    • Open Questions: What are you still unsure about?
  3. Review & Socialize: Share the doc with other senior engineers—often async, but sometimes in a quick 30-minute meeting. The goal isn't to achieve consensus, but to stress-test the idea. Does this make sense? Are there hidden pitfalls? Is there a simpler, more elegant solution we're all missing?
  4. Iterate or Implement: Based on the feedback, the design is improved, simplified, or sometimes rejected altogether in favor of a better approach. Now the team codes with a clear, vetted blueprint.

The Science: Knowing When to Use a Design Doc

The science is in the discernment. You don't do this for every story. That would be bureaucratic and slow. You do it for the ones where the cost of being wrong is high.

Use a design doc when the story involves:

  • Cross-team or cross-service dependencies.
  • New technology or patterns the team isn't familiar with.
  • Significant performance or scaling concerns.
  • High-risk areas of the codebase.
  • Fundamental changes to the application's architecture.

For the vast majority of stories, the "design" is a whiteboard sketch or a conversation. But for the 10-20% that are truly complex, the design doc process is an accelerator, not a hindrance.

The Result: Speed, Quality, and Alignment

This approach transforms your process:

  • Fewer Revisions: Catching design flaws in a doc is orders of magnitude cheaper than catching them in a PR.
  • Collective Ownership: The entire team understands the why behind the solution, leading to better maintenance and fewer regressions.
  • Knowledge Sharing: The document becomes a lasting artifact for future engineers wondering, "Why did we build it this way?"
  • True Agility: You're not just moving fast; you're moving fast in the right direction. You build quality in from the start, instead of trying to test it in or refactor it later.

So, the next time your team faces a gnarly story, resist the urge to dive headfirst into the IDE. Take a breath, write a page, and get a second opinion. You’ll find that a small investment in thinking saves a huge amount of time in coding.

How does your team handle complex design? Do you have a process for "just enough" documentation? Share your thoughts in the comments below!


Thursday, August 14, 2025

Bug Driven Development (BDD): When "Done" Really Means "Debugging Hell"

 

Introduction

In the world of Agile and Scrum, the term "Done" is sacred. A story is supposed to be complete: tested, reviewed, and ready for production. But in some teams, "Done" is just the beginning of a never-ending cycle of bug fixes. This anti-pattern has a name: Bug Driven Development (BDD)!

What is Bug Driven Development (BDD)?

BDD is a dysfunctional workflow where:

  1. A developer claims a story is "Done" in Sprint Review.
  2. QA (or worse, users) finds a flood of bugs that should never have existed.
  3. The next sprint is spent fixing what was supposedly "finished."
  4. The cycle repeats, creating technical debt, frustration, and burnout.

Unlike Behavior-Driven Development (the good BDD), where tests define requirements, Bug-Driven Development means bugs define the real scope of work.


The Face of BDD

Bug Driven Development (The Unintentional Anti-Pattern)

  • Symptoms:
    • "Works on my machine" mentality.
    • Zero (or flaky) unit/integration tests.
    • QA backlog grows faster than the dev sprint velocity.
  • Root Causes:
    • Poor estimating and planning, leading to rushed work to meet sprint deadlines without proper validation.
    • No code reviews (or rubber-stamp approvals).
    • Learned helplessness—engineers come to believe bugs are inevitable.
    • A culture that favors speed over quality and treats shipping bugs as normal.
    • Lack of accountability—no consequences for shipping broken code.
    • Low-skilled engineers who don’t understand defensive programming.

Why BDD is a Silent Killer

  • Wastes Time: Fixing preventable bugs drains 50%+ of dev capacity (Microsoft Research found devs spend 50-75% of time debugging).
  • Kills Morale: Engineers hate working in bug factories.
  • Destroys Trust: Stakeholders stop believing in "Done."
  • Increases Costs: Late-stage bug fixes are 100x costlier (IBM Systems Sciences Institute).

How to Escape BDD (Before It Kills Your Team)

1. Enforce Real "Definition of Done" (DoD)

  • No story is "Done" without:
    • ✅ Unit/Integration tests.
    • ✅ Peer-reviewed code.
    • ✅ Passing QA (not "mostly working").

2. Shift Left on Quality

  • Test-first mindset: Write tests before code (TDD).
  • Automate validation: CI/CD pipelines should block buggy code.

3. Stop Rewarding Speed Over Quality

  • Measure & penalize escaped defects (bugs found post-"Done").
  • Celebrate clean code, not just closed tickets.

4. Fire Bad Engineers (If Necessary)

  • Low-skilled engineers can learn, but Bad Engineers won't.
  • If someone refuses to improve, they’re a culture toxin.

Conclusion: From BDD to Brilliance

Bug-Driven Development isn’t Agile—it’s technical debt in motion. The fix? Stop accepting "Done" until it’s really done. Otherwise, prepare for a future where your sprints are just bug-fixing marathons.

Question for You:
Does your team practice BDD? Share your horror stories below!

Saturday, July 12, 2025

Why Feature Teams Beat Siloed Development: Lessons from a Cloud Migration

When I joined Company X (the name is withheld to protect its identity), they were in the midst of a massive modernization effort—replacing legacy monolithic systems with sleek, cloud-native microservices running on Google Cloud. On paper, it was a forward-thinking move. But there was a catch: the engineering teams were strictly divided into backend and frontend squads, each working in isolation.

The backend team built REST APIs. The frontend team consumed them. They coordinated via Google Chat, ADO tickets, and API contracts—yet, when it came time for User Acceptance Testing (UAT), chaos ensued. Bugs surfaced. Assumptions clashed. Finger-pointing began.

What went wrong?

The Problem with Siloed Teams

The traditional "Backend vs. Frontend" split seems logical at first:

  • Backend engineers focus on APIs, databases, and business logic.
  • Frontend developers build UIs, handling state management and user interactions.

But in practice, this separation creates three major headaches:

  1. Late Integration Surprises

    • Teams work on different timelines, delaying end-to-end testing until late in the cycle.
    • By the time APIs and UIs meet, mismatches in data structures, error handling, or performance become costly to fix.
  2. Communication Overhead

    • Instead of real-time collaboration, teams rely on documentation and meetings—which often lag behind actual development.
    • A backend engineer might design an API that "makes sense" to them but is awkward for frontend consumption.
  3. Lack of Ownership

    • When something breaks, it’s easy to say: "That’s a frontend issue" or "The backend payload is wrong."
    • No single team feels responsible for the entire user experience.

A Better Way: Feature Teams

What if, instead of splitting teams by technical layer, we organized them by features?

A feature team is a small, cross-functional pod that includes:
✔ Backend developers
✔ Frontend developers
✔ (Optional) QA, DevOps, Data Engineers 

Their mission? Deliver a complete, working slice of functionality—not just a backend API or a UI mockup.

Why This Works Better

  1. Early and Continuous Integration

    • Since the team builds vertically, they test integrations daily—not just in UAT.
    • Bugs are caught early, reducing last-minute fire drills.
  2. Tighter Collaboration

    • Backend and frontend devs sit together (or pair remotely), discussing API design in real-time.
    • No more "This isn’t what we agreed on in the spec!" surprises.
  3. End-to-End Ownership

    • The team owns the entire feature, from database to UI.
    • No more blame games—just collective problem-solving.
  4. Faster Delivery

    • Features move smoothly from development to testing to production.
    • Less waiting on external dependencies.

What I Wish We Had Done Differently

Looking back, Company X’s cloud migration could have been smoother and faster with feature teams. Instead of:
"Backend will deliver the API in Sprint 3, frontend will integrate in Sprint 4,"

We could have had:
"Team A ships the 'Checkout Flow' by Sprint 3—fully working, tested, and deployed."

Key Takeaways

  • Silos slow you down. Separation of frontend and backend creates friction.
  • Feature teams align with Agile & DevOps principles—focusing on working software, not just technical outputs.
  • Own the whole feature, not just a layer. This reduces risk and improves quality.

If you're leading a modernization effort (especially in microservices or cloud migrations), break the silos early. Build feature teams, not fragmented departments. Your UAT phase will thank you.


What’s your experience? Have you seen siloed teams cause integration nightmares? Or have you successfully shifted to feature-driven development? Share your thoughts below! 🚀


Tuesday, July 30, 2024

Interview Question: "Describe a Challenging Project You Worked On"

 

One common interview question is: "Describe a challenging project you worked on." Mine dates back to 2011, when AWS offered only a few RDS choices, unlike the many options available today.

Throughout my career, I have worked on many interesting projects, but I will focus on my time at Precor, a fitness equipment manufacturer. In 2011, Precor was building on the concept of "Connected Fitness" to allow fitness machines to connect to the internet. This enabled users to download workouts, save workouts, watch instructional videos, read e-books while running on a treadmill, and enjoy many other features.

Precor needed to build a team from the ground up for this project, as they previously only made fitness machines with basic embedded software. I was the second hire after the engineering manager and became Principal Engineer. My mission was to design and build on the vision of "connected fitness." Precor had two main teams: one working on the console (P80 equipment, a dedicated terminal attached to a fitness machine) and my team, working on the backend systems powering all machines in gyms, clubs, and hotels.

Team Composition:


My team consisted of an Engineering Manager, a Principal Engineer (me) as team lead/architect, five engineers, and a Product Owner. We followed a Scrum approach.
 

My Mission:


I was tasked with building two categories of APIs. Working with the team, I played a key role in leading the effort, defining much of the architecture, and writing core code modules:


1. APIs to help club operators understand machine utilization.  
2. APIs to empower exercisers.

Focus on the Exerciser's API:


Every console connected to the internet had a dedicated UI running on an embedded Linux machine, powered by an ARM Cortex CPU with decent video capability, such as rendering YouTube videos. The Fitness Equipment (FE) served as an API client. Another client type was the Mobile App, built by a third party.

The backend systems were built as microservices running on the AWS Cloud. The Exerciser API was a REST API leveraging OAuth2 for user authorization. The use case for exercisers was to create and track their fitness goals using both a mobile app and the fitness machines, regardless of location, as long as they were using Precor connected fitness equipment.

For club owners, the use case was to better serve their customers with modern machines and understand machine utilization, idle time, and receive custom alerts for machine malfunctions. They could also generate custom reports on user exercise frequency to predict membership cancellations.

Exerciser API Features:


The Exerciser API allowed users to log in via RFID, enabling the machine to adjust settings such as angle, inclination, and speed on a treadmill, and start recording exercise data, including calories and duration. Users could check their daily and weekly exercise progress towards their goals on the mobile app, which could be customized for goals like getting fit or losing weight. Users were awarded badges on the mobile app for achieving milestones, such as 1,000 steps, accompanied by a congratulatory message and a cool image.


The Exerciser API:

On the backend, we built the stack with Java and the Spring framework, using Apache as the HTTP server. The database was RDS (MySQL), and we used DynamoDB, a key-value NoSQL datastore, for high volume and write throughput. Redis was used to track denial-of-service attacks.

DynamoDB was used by the Fitness Machine APIs to store frequent heartbeat data and log messages. To buffer writes between the servers and storage, we implemented a message queue in front of DynamoDB.



The goals of this project, known as Preva, were to increase user retention, attract new members, drive secondary revenue generation, and help gym members achieve their goals.

Precor conducted several studies to measure each of these goals, and I will link to these studies on the Precor website:

  • Increase retention
  • Attract new members
  • Drive secondary revenue generation
  • Help gym members achieve their goals


Conclusion:


This project was both fun and challenging due to several factors: tight deadlines, the innovation involved, learning cloud computing, leading a team, and ultimately helping people improve their lives by promoting a healthy lifestyle.



Wednesday, January 7, 2015

Spring Boot Camper and REST Assured Testing Library

I almost never blog, but I will attempt once more. Hopefully I will make it a habit.

I have recently created a sample project to demonstrate how simple and cool it is to write REST APIs using Spring Boot, and how to test them with integration tests that run as fast as unit tests, using REST Assured.

This is going to be a series of samples, each focused on showcasing one aspect of the framework or technique.
The first one is camp_rest_assured, which demonstrates how to integrate Spring Boot with the REST Assured library, with an extra bonus: a technique I created that uses a custom converter service to do request validation, with custom messages collected from representation beans annotated with JSR-303 validations.

Check out the sampler project here:
https://github.com/phavelar/boot-camper
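
To give a flavor of what this looks like, here's a minimal sketch of a REST Assured integration test against a Spring Boot endpoint. The /campers/1 endpoint and the expected JSON are hypothetical, and I'm using the current io.restassured coordinates (the 2015-era library lived under com.jayway.restassured):

import org.junit.jupiter.api.Test;

import static io.restassured.RestAssured.given;
import static org.hamcrest.Matchers.equalTo;

class CamperApiIT {

    @Test
    void getCamperReturnsExpectedName() {
        given()
            .baseUri("http://localhost:8080")    // the locally running Spring Boot app
        .when()
            .get("/campers/1")                   // hypothetical endpoint
        .then()
            .statusCode(200)
            .body("name", equalTo("Alice"));     // asserts on a field of the JSON response
    }
}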

I will be adding detailed explanations soon. Stay Tuned!

Friday, November 2, 2012

Grails URL Encoding

Bug: Form data being encoded twice in Grails 2.1.1


I recently ran into a bug related to character encoding in a Grails 2.1.1 application that, at first, manifested only when the web app was deployed to production.
All tests were passing during CI builds as well as in the local development environment.
It was a tricky issue to figure out, so I want to share it here, hoping to save you some time if you encounter this problem and are fortunate enough to find this post :)

Say you're dealing with i18n and have UTF-8 form-encoded text data (application/x-www-form-urlencoded). Somehow, after posting text in Cyrillic, the contents were being doubly encoded, mangling the original text.

The issue does not manifest in development mode. Or rather, I discovered that if you use the IntelliJ IDE to launch the web application ("exploded mode"), all is normal: the Cyrillic text is properly encoded. But if you build the war using the "grails war" command and manually deploy it to Tomcat, the bug appears.

Digging deeper, we found the following sloppy code used to encode the form data:

URLEncoder.encode(formData)

As you can see from the Javadoc, this is a deprecated method, and the resulting string may vary depending on the platform's default encoding.

I'm not sure why the platform default changes when packaging the war via the "grails war" command versus running the war from within the IDE, but the fact is that this cost us a few hours of debugging.

Method Summary (from the JDK Javadoc):

static String encode(String s)
          Deprecated. The resulting string may vary depending on the platform's default encoding. Instead, use the encode(String, String) method to specify the encoding.

static String encode(String s, String enc)
          Translates a string into application/x-www-form-urlencoded format using a specific encoding scheme.

To fix, simply change the code to URLEncoder.encode(formData, "UTF-8")
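
For illustration, here's a small, self-contained Java example contrasting the two overloads (the Cyrillic payload is just an example):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormEncodingExample {

    public static void main(String[] args) throws UnsupportedEncodingException {
        String formData = "имя=Пётр"; // Cyrillic form data

        // Deprecated overload: uses the platform's default encoding, which is
        // exactly what differed between the exploded (IDE) run and the packaged war on Tomcat.
        String platformDependent = URLEncoder.encode(formData);

        // Explicit overload: always produces UTF-8 percent-encoding, on every platform.
        String utf8 = URLEncoder.encode(formData, "UTF-8");

        System.out.println(platformDependent);
        System.out.println(utf8);
    }
}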

So, discovering the packaged-versus-exploded-war behavior was half the battle in reproducing this bug. However, it could have been avoided altogether if the developer had paid attention to the compiler's deprecation warnings. Or not done this:

-Dgrails.log.deprecated=false  //to turn off for development mode  

Hope this post can help some fellow developers!

Happy Coding!