Diego Pacheco’s Software Architecture Library
This repository contains a curated collection of concepts and guidance on Software Architecture by Diego Pacheco.
I wrote this book because I have been doing software architecture for more than 20 years and I want to share my philosophy, practices, and important concepts and patterns with other architects and aspiring architects. This is also an experiment - a living piece of advice that I want to improve over time.
Disclaimer
This book does not represent the views of any of my employers or clients past or future. The opinions expressed here are my own and do not reflect the views of any organization I am affiliated with past or future. All of this book is provided with my own personal time, effort, and devices. Several pages have links to my personal blog and POCs made on my personal time.
What to Expect
- Honest and direct advice
- Highly opinionated content
- Practical and actionable guidance
What this book is NOT
- Not a tutorial or step-by-step guide
- Not a panacea for all architecture problems
About the Author
👨💻 Diego Pacheco Bio
Diego Pacheco is a seasoned, experienced 🇧🇷 Brazilian software architect, author, speaker, technology mentor, and DevOps practitioner with more than 20 years of solid experience. I have been building teams and mentoring people for more than a decade, teaching soft skills and technology daily: selling projects, hiring, building solutions, running coding dojos, long retrospectives, weekly 1:1s, design sessions, code reviews, and my favorite debate club, architects’ communities of practice and development groups. I have been living, breathing, and practicing real Agile since 2005, coaching teams and helping many companies discover better ways to work using Lean and Kanban, Agile principles, and methods like XP and DTA/TTA. I have led complex architecture teams and engineering teams at scale guided by SOA principles, using a variety of open-source languages like Java, Scala, Rust, Go, Python, Groovy, JavaScript, and TypeScript; cloud providers like AWS and Google GCP; solutions like Akka, ActiveMQ, Netty, Tomcat, and Gatling; NoSQL databases and messaging systems like Cassandra, Redis, ElastiCache Redis, Elasticsearch, OpenSearch, and RabbitMQ; libraries like Spring, Hibernate, and Spring Boot; and also the NetflixOSS stack: Simian Army, RxJava, Karyon, Dynomite, Eureka, and Ribbon. I have implemented complex security solutions at scale using AWS KMS, S3, containers (ECS and EKS), Terraform, and Jenkins. I have over a decade of experience as a consultant, coding, designing, and training people at big customers in Brazil, London, Barcelona, India, and the USA (Silicon Valley and the Midwest). I have a passion for functional programming and distributed systems, NoSQL databases, and observability, and I am always learning new programming languages.
🌱 Currently: Working as a principal software architect with the AWS public cloud and Kubernetes/EKS, performing complex cloud migrations, library migrations, server and persistence migrations, and security at scale with multi-level envelope encryption solutions using KMS and S3, while still hiring, teaching, mentoring, and growing engineers and architects. During my free time, I love playing with my daughter, playing guitar, gaming, coding POCs, and blogging. Active blogger at http://diego-pacheco.blogspot.com.br/
💻 Core skills and expertise:
- Architecture design and architecture coding for highly scalable systems
- Delivering distributed systems using SOA and microservices principles, tools, and techniques
- Driving and executing complex cloud migrations, library and server migrations at scale
- Performance tuning, troubleshooting & DevOps engineering
- Functional programming and Scala
- Technology mentor, agile coach & leader for architecture and engineering teams
- Consultant on development practices with XP / Kanban
- Hiring, developing, retaining, and truly growing talent at scale
🌐 Resources
📝 Tiny Essays:
🥇 Tiny Side Projects
- 🧝🏾♂️ Tupi lang: programming language written in Java 23
- 🥫 Jello: vanilla JS, web-apis, trello-like
- 📑 Zim: vim-like written in Zig 0.13
- 💻 Gorminator: simple and dumb Linux terminal written in Go
- 😸 kit: Git-like written in Kotlin
- 🦀 Shrust: Compress/Decompress tool written in Rust
- 🕵🏽 Smith: a security agent written in Scala 3.x
- 📟 ZOS: A very tiny OS written in Zig
- 🎮 Tiny Games: Collection of JS games
Table of Contents
- Chapter 0 - Zero
- Chapter 1 - Philosophy
- Why
- Crystal Ball - Think about future changes
- Defensive Programming - Anticipating and handling errors
- Doing Hard Things - Tackling complex problems head-on
- Frontend vs Backend - Design philosophy differences
- Open Source First - Favor open source solutions always
- Service Orientation - Services as first-class citizens
- Protect Your Time - Strategies for architects to safeguard time for deep work
- Chapter 2 - Anti-Patterns
- Why
- Tech Debt Plague - Fighting technical debt constantly
- Ignore Culture - Addressing ignored problems proactively
- Stagnation - Constant learning to avoid stagnation
- Requirements - Challenging and validating decisions
- Chapter 3 - Dilemmas
- Why
- Discovery vs Delivery - Balancing exploration and execution
- Move Fast vs Do It Right - Speed vs Quality
- Build vs Buy - When to build in-house vs buy and integrate
- Decide or Wait - Making timely decisions vs delaying for more info
- Chapter 4 - Properties
- Why
- Anti-Fragility - Systems Thriving on Failure
- State of the Art - Choosing the best and latest solutions
- Scalability - Designing for growth and load
- Observability - Monitoring and understanding system behavior
- Stability - Ensuring stable daily practices
- Secure - Embedding security in architecture
- Chapter 5 - Practices
- Why
- Attention to Detail - Architect with precision and care
- Architecture Review - Documenting architecture for communication and improvements
- Design First - Think first, act later
- Ownership - Extreme ownership and proactive behavior for architects
- Reading Code - Importance of reading code as an architect
- Monthly Review - See the whole picture
- Working in the Trenches - Hands-on, front-line work with the team
- Chapter 6 - Concepts
- Why
- ACID - Relational database transaction properties
- Authentication & Authorization - Identity verification and access control
- JWT - JSON Web Tokens for secure data exchange
- BASE - NoSQL consistency model
- Idempotency - Safe operation repetition
- Optimistic vs Pessimistic Locking - Concurrency control strategies
- Partition - Data distribution strategies
- Schema Evolution - Forward/backward compatibility for APIs
- Source of Truth - Authoritative data source
- Stateless vs Stateful Services - Service state management
- Chapter 7 - Patterns
- Why
- API Gateway - Single entry point for microservices
- BFF Pattern - Backend for Frontend pattern
- Cache - Data storage for faster access
- Connection Pool - Database connection reuse
- Feature Flags - Runtime configuration and gradual rollouts
- Load Balancer - Traffic distribution across servers
- Message Patterns - Publish/Subscribe, Point-to-Point messaging
- Message ID - Unique identifier for tracing requests
- Pagination - Breaking large datasets into pages
- Queue - Asynchronous message processing
- Retry - Handling transient failures with retry strategies
- Web Hook - Event-driven HTTP callbacks
- Chapter 8 - Tools
- Why
- Diagramming Tools - Visualizing architecture with diagrams
- Writing Tools - Documenting architecture effectively
- Thinking Tools - Tools for structured thinking and exploration
- Chapter 9 - Epilogue
- Epilogue - Final thoughts and next steps
- Resources - Recommended books and learning materials
- How I Wrote The Book - Behind the scenes of the book’s creation
- Changelog - Updates and revisions to the book
- References - External blog posts, articles, and technical documentation
- Glossary - Definitions of key terms used in the book
- Index - Complete index of topics covered in the book
Why Zero?
If you are an engineer, you know we are supposed to start counting at ZERO. For some reason, books always start at one, so I had to make this one start at 0.
Rationale
I wrote other books like Building Scala Applications, Principles of Software Architecture Modernisation, and Continuous Modernization. This book covers similar themes; however, it is not a traditional book. It is an open-source Git book: it is online and you do not need to pay.
The tool I am using here is very nice: it allows you to search the whole book. On any page, just press / on your keyboard and start typing what you want to search for. Also, because this book is written in Markdown, it is very easy to link to any page, since every page has its own link.
Because I am using Git, I have a history of all changes. Because this is made with software, I can update the book, so the book is versioned. Look at the footer of every page and you will see the version of the book.
This is a way for me to give it back. For free. I hope you enjoy it.
What is a Software Architect?
A software architect is a person that makes decisions or influences decisions about the software structure, the technologies to be used, the patterns to be followed, and the practices to be adopted. A software architect is responsible for the overall design and quality of the software system. A software architect codes, does code reviews, mentors developers, leads technical discussions, defines standards and guidelines, evaluates technologies, and ensures that the software meets the requirements and goals of the business.
I like to think about a software architect as a Monk and a Rockstar. An architect goes DEEP into the technology: understands the details, the trade-offs, the pros and cons of different approaches. An architect also goes WIDE across the technology landscape: understands the trends, the best practices, the patterns, the anti-patterns, and the tools and frameworks available.
What is Architecture?
Software architecture is the high-level structure and organization of a software system. It defines the components, modules, and their interactions, as well as the principles and guidelines that govern the design and development of the software. Software architecture encompasses both the technical and non-technical aspects of the system, including performance, scalability, security, maintainability, and usability.
In other words, software architecture is code. But not only code, but the decision on what to do and what not to do. Software architecture is about making choices that will impact the software system in the long term.
Why Philosophy?
Philosophy means “love of wisdom,” and this project is dedicated to exploring and sharing philosophical ideas in an accessible way. The goal is to foster critical thinking, encourage open dialogue, and promote a deeper understanding of the world around us.
How we do things matters. There are many ways to do something wrong, but not as many ways to do it right. Philosophy is “ways of thinking,” or “how to approach” problems, day-to-day life, difficult situations, decisions, and trade-offs.
Without a philosophy, we are likely to fall into bad habits, poor decisions, and unexamined beliefs. A well-defined philosophy helps us navigate life’s complexities with clarity and purpose.
Crystal Ball
Architects must have a good crystal ball. Requirements change, and a requirement change can break the architecture completely if it was not designed to accommodate change. Thinking ahead is a great practice because it allows you to predict the next moves, meaning you think big and ahead but execute small.
PS: Image created with Gemini 3 Banana Pro model
Code can and should be refactored. Refactoring should happen all the time and continuously. However, there are people who are allergic to refactoring, and that is a recipe for great disasters. Thinking ahead allows you to be prepared when you need it. Think about fire drills - you practice how to handle fire, so when you need it you are ready.
Think big, execute small.
How to Train your Crystal Ball skills?
One simple way is to pay attention to the industry and its trends. You can learn from the errors of others. Sometimes it is not that the technology or concept is bad, but that people apply it wrong, and depending on the case that can be a huge disaster, like microservices done wrong.
Another form of crystal balling is to predict the next step. This one might sound harder but it is not. Let me give you 3 examples:
- Your application has a product catalog: You can start with simple exact match on the database. But if you grow and have hundreds of products, you will need Full Text Search, something like OpenSearch or ElasticSearch. You can start thinking about how to integrate that in the future. You do not need it now, but you can get familiar with it, you can search the solution space, you can even get ready. When you need it, you are much better equipped to deal with it.
- You have updates for the user: Which happen based on backend events. It is okay to use an old school polling mechanism, but if you think ahead you can start thinking about WebSockets or Server Sent Events. You can start learning how to implement that, so when you need it you are ready.
- You need to store files: You can start with a relational database and store in a table (it is an anti-pattern), but if you think ahead you can start learning how to use S3 or MinIO. You can start learning how to integrate that, so when you need it you are ready.
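The product-catalog example above can be sketched in a few lines. This is a minimal, hypothetical sketch (the trait and class names are illustrative, not from the book): callers depend on a small search trait, so the day-one exact match can later be swapped for an OpenSearch-backed implementation without touching the rest of the system.

```scala
// Hypothetical sketch: hide the search engine behind a trait.
case class Product(id: String, name: String)

trait ProductSearch {
  def search(query: String): List[Product]
}

// Day one: a simple, case-insensitive exact match over the catalog.
class ExactMatchSearch(products: List[Product]) extends ProductSearch {
  def search(query: String): List[Product] =
    products.filter(_.name.equalsIgnoreCase(query))
}

// Later: an OpenSearch/Elasticsearch-backed class can implement the
// same trait, and only the wiring changes, not the callers.
```

When full text search becomes necessary, only the binding changes, which is exactly the "get ready before you need it" idea.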
Defensive Programming
Defensive programming is a design approach that emphasizes anticipating and handling potential errors or unexpected inputs in software development. For instance, you always:
- Validate inputs to ensure they meet expected formats and constraints.
- Handle exceptions gracefully to prevent crashes and provide meaningful error messages.
- Design functions and methods to be robust against invalid or unexpected inputs.
- Avoid nulls at all cost.
Your code should be resilient. It should not break that easily.
Consider the following pseudo-code in Scala 3.x:
class ProductService(
    productRepository: ProductRepository,
    logger: Logger
) {
  def saveProduct(product: Product, ctx: Context): Either[String, Product] = {
    val messageId = ctx.getMessageId()
    if (product.price < 0) {
      val errorMsg = "Product price cannot be negative"
      logger.warn(s"[MessageId: $messageId] $errorMsg. Product: ${product.id}")
      return Left(s"$errorMsg [MessageId: $messageId]")
    }
    if (isBannedProduct(product)) {
      val errorMsg = "This product is banned"
      logger.warn(s"[MessageId: $messageId] $errorMsg. Product: ${product.id}")
      return Left(s"$errorMsg [MessageId: $messageId]")
    }
    if (!isLegalProduct(product)) {
      val errorMsg = "This product is not legal in your country"
      logger.warn(s"[MessageId: $messageId] $errorMsg. Product: ${product.id}")
      return Left(s"$errorMsg [MessageId: $messageId]")
    }
    Right(productRepository.save(product))
  }
}
You see that there are validations. We do not blindly trust the input. We check if the price is negative, and if it is, we return a Left with a descriptive error (including a message ID for tracing) instead of saving. This is defensive programming in action.
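The "avoid nulls at all cost" rule from the list above deserves its own tiny example. In Scala, Option makes the absence of a value explicit in the type, so the compiler forces callers to handle it (a minimal sketch with illustrative names, not from the original):

```scala
// Minimal sketch: model a possibly-missing value with Option, never null.
case class User(id: String, nickname: Option[String])

def displayName(user: User): String =
  // The type forces us to handle the missing case; no NullPointerException.
  user.nickname.getOrElse("anonymous")
```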
Doing Hard Things
Proper software is hard. It is a never-ending war. Doing the right things means teaching, socializing, influencing, enforcing, convincing, pushing back, standing your ground, and giving all sweat and blood to make the right things happen.
The best things in software are HARD:
- Doing the right design
- Keeping discipline to write tests and have high diversity and coverage
- Calling out and teaching team members about poor practices and wrong beliefs
- Pushing back poorly written tickets and requirements
- Saying NO to “false shortcuts” (that lead to anti-patterns and tech debt)
- Doing the right things every day (invisible ant work)
Perhaps the most difficult thing ever is dealing with monoliths and distributed monoliths. Modernization requires a very specific and disciplined approach that is hard to do right. It requires patience, persistence, and a lot of hard work.
I’m convinced distributed monoliths are the #1 enemy of modern software architecture.
Architects must do HARD THINGS all the time.
Frontend VS Backend
Frontend by nature is specific. One React component, one page, one feature. Backend by nature is generic. One API endpoint serves many features, many pages.
Frontend Thinking
Tends to think 1:1. One component, one feature, one page. This might be fine for frontend but not for backend. It is common for frontend engineers to think that every single fetch they need to do should be a separate API endpoint (which leads to lack of conceptual integrity).
Backend Thinking
Tends to think 1:many. One API endpoint serves many features, many pages. This might be fine for backend but not for frontend. By default, backend services should be generic and reusable.
It is okay to have many pages or many components, but it is not okay to have that many services. They should be generic, more centralized, and reusable by design.
A tale of a Frontend Car
This is a car built by frontend engineers.
PS: Image created with Gemini 3 Banana Pro model
A backend car is generic: endpoints can do more and have reuse by design, not by accident. A frontend car has a wheel to turn left, a wheel to turn right, a wheel to go forward, and a wheel to go backward; every single thing is a new wheel. Now replace the word “wheel” with “API endpoint” and you get the idea.
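Replacing “wheel” with “API endpoint”, the contrast can be sketched as one generic, parameterized query serving many pages, instead of one endpoint per frontend fetch (a hypothetical sketch; all names are illustrative):

```scala
// Hypothetical sketch of backend thinking: one generic query handler.
case class Product(id: String, category: String, price: BigDecimal)

// Optional filters with defaults; every parameter is opt-in.
case class ProductQuery(category: Option[String] = None,
                        maxPrice: Option[BigDecimal] = None)

// One endpoint backs the catalog page, the search page, and the deals
// page: reuse by design, not a new endpoint per feature.
def findProducts(all: List[Product], q: ProductQuery): List[Product] =
  all.filter(p => q.category.forall(_ == p.category))
     .filter(p => q.maxPrice.forall(p.price <= _))
```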
Open Source First
Open Source must be the default. Avoid proprietary software. Do not build wrappers for open source libraries. Use Open Source software directly. Building internal shared libraries is a liability and will create many issues like security vulnerabilities, maintenance burden, and lack of community support / documentation. Therefore creating internal shared libraries must be a very conscious choice and must be well justified.
Why is it important?
Because Open Source gives us many advantages like:
- There is a whole community behind it: maintaining it, fixing bugs, improving performance, making it better.
- Portability: open source means you are not stuck with a single vendor; you are more likely able to run it yourself or switch vendors (this is related to open standards, like REST).
- Documentation & Training: developers often don’t like to write documentation. By using open source you get documentation for free: Stack Overflow threads, blog posts, books, tutorials, videos, and much more. Open source also means people can be trained by the market and not by you. If you have proprietary software that only exists in your company, you cannot use the market to train people.
Open source is also giving it back. Open source is giving visibility to engineers who do good work, open source is life and collaboration.
Service Orientation
Service Oriented Architecture (SOA) is a big deal. SOA should be the default way of thinking and operating services in modern software architecture. Service Orientation should be the default choice for solutions. Services are more important and better than internal libraries.
Services allow isolation, independence, and flexibility. When services are done right, with proper contracts, refactoring can happen under the hood. Services allow different stacks and technologies.
Services should be capability oriented. Services should be as generic as possible, and it’s okay to have more granularity and more code in a service. Services do not need to be micro (as in microservices).
The most important part of services is the contract. The contract is the API. The API should be as stable as possible. The API should be backward compatible as much as possible. The API should be well documented and versioned.
Contracts must be well designed and reviewed carefully. It is not difficult to make poor contracts because engineers, and even more so frontend engineers, are just thinking about what needs to change, and are not thinking if it is the right place to make the change or if it should be that way at all. Contracts must be explicit and not hidden. The cost of refactoring a contract is usually high, while refactoring the internal implementation is low, as long as you do not break the contract.
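A concrete illustration of a backward-compatible contract change (a hypothetical sketch, not from the book): adding a new optional field with a default keeps old clients working, while renaming or removing an existing field breaks the contract and calls for a new version.

```scala
// v1 of a response contract.
case class ProductResponseV1(id: String, name: String)

// Backward-compatible evolution: a new, optional, defaulted field.
// Old clients simply ignore it; new clients may use it.
// Renaming or removing `name` would be a breaking change.
case class ProductResponseV2(id: String,
                             name: String,
                             description: Option[String] = None)
```

This is the sense in which refactoring the internal implementation is cheap while refactoring the contract is expensive: the optional field costs nothing, the rename costs every consumer.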
Services require a lot of thinking, they are not a panacea.
SOA is a big Deal
Service Orientation should be the main operating model for software architecture. Services should be the default way of thinking about solutions. Services should be the default way of building solutions. Services should be the default way of operating solutions.
Because services allow us to have business capabilities that can be shared across different services, applications, products, and use cases. Services also allow isolation and decoupling of contract from implementation (if done right), allowing the engineering team to improve things under the hood without breaking or slowing down the business.
Protect Your Time
PS: Image created with Gemini 3 - Banana Pro model
It is very easy to be consumed with meetings 100% of the time. However, the architect must protect their time in order to do deep work: research, monthly reviews, crystal balling, thinking about trade-offs, evaluating solutions, and reading code.
At minimum, you should block time at least three times per week, for 3-6 hours, to do deep work. It does not matter if it is in the morning or afternoon, whatever is easier. Hold your ground, push back, and protect that time. Architects must think.
Why Anti-Patterns?
Everybody talks about patterns. Everything on the internet is “Best Practices.” But people do not understand principles correctly. People think they understand when they actually do not.
I have seen that lack of “proper application” happen over and over through the years: with Agile, with microservices, with DevOps, with Docker. If you do not understand the principles behind those “practices,” you will end up doing anti-patterns.
Some anti-patterns are obvious and related to technology, but many anti-patterns are related to people, process, culture, and how you work as a team. Also, there are anti-patterns related to how architects behave or how they do not behave (and should behave).
It is important to know what is wrong, so you can avoid doing it.
Tech Debt Plague
Tech debt is a plague. It is something that all companies suffer from. It is normal to have tech debt. What is not correct is architects not fighting tech debt constantly. Tech debt must be fought all the time, either by simplifying requirements to avoid feature bloat, or by making better decisions that lead to better systems.
PS: Image created with Gemini 3 - Banana Pro model
When I was thinking about this section, I was thinking in terms of “Tech Debt First”. Because often I see tech debt being the “default choice,” which is an anti-pattern. Some technical principles should be non-negotiable.
Imagine you are building a house. Pretty sure the buyer does not expect the house to collapse 6 months later after buying it. To cut corners, you would not decide to use paper for walls, styrofoam for the roof, and direct sunlight for electricity. However, when we build software, bad management, bad architects, and bad developers decide to cut corners to the point of making the software collapse after a few months of being in production.
One classical example is skipping tests. Tests are not nice to have; they are a must-have. If you do not have tests, you are building a house without a foundation. You can build it fast, but sooner or later it will collapse. How will you do refactoring without tests? Management often pushes for such bad practices, but even architects (the bad ones) push for that. Tests are non-negotiable.
You must fight the war: manage complexity by making improvements with the engineering team every week rather than “creating a Jira ticket” that will never be addressed.
Why do you need to know this?
Tech debt is everywhere. You will see it and even produce it. You must be aware of it and fight it constantly. Some ideas to fight tech debt:
- Always make the code better
- Take quick wins like: add 1-3 tests per day, refactor 1-2 methods per day, improve documentation a bit every day
- Explain, evangelize, and educate the team about tech debt and how to improve
- Push back, say NO to bad practices, bad decisions, bad shortcuts
- Explain the cost of tech debt to management
- Use your time wisely and do your part to reduce tech debt
Ignore Culture
If something is wrong, say something, do something. When you ignore problems consistently over a long period of time, you develop an anti-pattern I call “Ignore Culture”: ignoring warnings in the build, ignoring old versions of libraries, ignoring anti-patterns in the code, ignoring production monitoring, ignoring requirements that make no sense.
It does not matter whether you wrote the code or not. You must take ownership; under your watch, nothing should be ignored.
Ignoring can become a culture because managers can easily do this. Engineers, often pressured by managers, can also ignore many wrong things. What’s wrong should not be ignored; it should be fought and pushed back on.
A lack of fighting for what is right is immaturity. As an architect you must not ignore what’s wrong. Do not let this become a culture. Ignore culture happens easily when people say “It’s not my job,” “It’s not my code,” “It’s not my project,” or “It’s not my responsibility”; this extreme lack of ownership leads to ignore culture. When you ignore problems you create fertile ground for anti-patterns to grow. The tech debt plague requires ignore culture to thrive.
Why do you need to know this?
Because you will very likely see this happen at a small or large scale. The question is how you behave. Yes, an architect must care about culture. Caring is the first step to change. You can’t change things with software and technology alone; you must address culture by being part of the change you want to see.
An architect is a leader and a teacher, and therefore must set an example of caring, passion, and action-oriented behavior. Even if small, every PR counts, every little improvement counts.
Stagnation
Architects must be like the Rolling Stones, they don’t gather moss. Architects must be active in a constant state of learning, always reading books, always reading papers, always reading code, always ahead of the curve.
If an architect stops learning, they stagnate, and stagnation is the death of architecture. Think about a coin: a coin that has no circulation is worthless. An architect that has no new knowledge cannot function and cannot be a good architect.
As an architect you cannot become old school. You must always be looking for new ideas, new technologies, new patterns, new anti-patterns, and new ways of doing old things. Being stuck in the past has terrible consequences. The architect is, or should be, the most influential person, so what are you influencing?
Why do you need to know this?
Because the temptation is constant. It is very easy to fall into “auto-pilot” mode and stop challenging how you are working. Retrospectives are important thinking tools; even if you do not have retrospectives, you can take 30 minutes or an hour to think about what is going on and whether it is right or wrong. Stagnation must be fought. IMHO the best things to do are:
- Have passion and curiosity to drive you forward
- Read books, papers, articles, blogs, constantly
- Go to conferences, meetups, webinars
- Talk to other architects, share knowledge
- Experiment with new technologies, patterns, ideas
- Do POCs with new stuff or new ways of approaching the same stuff
Do not let yourself stagnate. Keep moving forward. Keep rolling, keep rocking.
Requirements
Mind blown by me saying “Requirements” is an anti-pattern? Lean believes that requirements are just a decision that somebody made. Now what you need to ask yourself is: how good is that decision? Architecture is the art of making good decisions. Bad decisions COST a lot: not only money but also time, effort, and frustration.
Architects MUST make good decisions. Good decisions are made when you have the right information, the right context, and the right experience. You must be prepared to make the right decisions. In the beginning you will make mistakes, which is okay if you learn from them; that is why you need Architecture Review.
Not all requirements are anti-patterns, but seeing them as “requirements” is an anti-pattern per se. The word requirements, and often how people take it, implies that something is fixed, immutable, and unchangeable. In reality, requirements are always changing, evolving, and adapting to new circumstances. Good architects embrace change and adapt their decisions accordingly.
When a project starts, or when you start a task, it is very common that product has not thought much about the consequences, trade-offs, corner cases, implications, or even what the engineering team needs to get it done. The most common anti-pattern is empty Jira tickets and tons of meetings that push the “discovery work” onto engineering.
There is nothing wrong with engineering collaborating with product, as long as we mark that as “discovery work”. The issue is that product pushes “not ready” tickets to development and asks, “Why is this not done yet?”; then corners are cut, quality is compromised, and technical debt is added. Remember the Tech Debt Plague and Ignore Culture: they are all connected.
You need to see requirements as “Temporary decisions”, not as martial law. You need to deliver requirements, but you must challenge them, you must question them, and you must validate them. If you don’t, you are just a delivery team, not an engineering team.
Why do you need to know this?
Everything you will do will be a requirement. Does not matter if people don’t use this word, if people call it user stories, issues, tickets, tasks, jobs, experiences, it’s all the same in the sense that you will have to deliver something that somebody else decided. So there are a couple of things you can do here to better handle requirements:
- Collaborate with product to help them think through the implications of their decisions.
- Collaborate with UX to understand user behavior and needs.
- Mark “discovery work” as such, and don’t push it to engineering as “ready”.
- Do POCs, which we call spikes (an XP technique), to learn and then figure out the requirements.
- Get code into production in order to validate requirements as soon as possible.
- Run experiments to validate requirements (assumptions).
- Research what the industry is doing to solve similar problems.
Why Dilemmas?
A dilemma is a situation where a difficult choice has to be made between two or more alternatives, especially ones that are equally undesirable. Dilemmas often arise in many aspects of architecture, such as decision-making and problem-solving.
Every single architect will face dilemmas. Some dilemmas are very common and part of any system or software development. It is important to talk about dilemmas because no one talks about them.
Having good perspectives allows you to make informed and balanced decisions. Consider this as a compass to navigate dilemmas.
Discovery Vs Delivery
Discovery is the moment where you are trying to figure out what to build in order to solve a problem for your users. Discovery is heavily around product and user research, prototyping, testing ideas and validating assumptions.
However, it is also necessary to have someone from technology doing discovery; otherwise, you transfer slowness and confusion, and you risk attempting to build something that is not feasible or too complex to deliver.
Delivery should focus on engineering: once we know what to build (often captured as requirements and Figma prototypes for the frontend), we can start building it. The question is how to build it right. Sounds simple, right? But here we have one of the biggest dilemmas in software development.
If you just go from discovery to delivery, you are very likely working in a waterfall fashion. The process is not linear, but an anti-pattern happens all the time: the moment you touch discovery, people assume discovery is done or correct, and from then on everyone only cares about shipping fast.
You need speed, but it’s not just speed of delivery but actually speed of learning. The goal is to learn fast. Sometimes you can learn before production but often times you can only go to production to learn.
If you need to go to production fast to learn, should we embrace the Tech Debt plague and be Tech Debt First? No. You might learn what the users want or don’t want, and you might find what can bring revenue to the company, but the code is there (likely forever), and we need to balance this. You can’t just pick a side; you need to balance both sides.
Why you need to know this?
Because this tension is constant; it is the daily reality of product and engineering teams. Being good at this “game” is a game changer for the company. As Mark Zuckerberg said: “Product strategy is: IF we can learn faster than any other company, we’re going to win.”
Move Fast or Do It Right?
Facebook and Silicon Valley culture often emphasize speed and rapid iteration in product development. The mantra “Move Fast and Break Things” encapsulates this ethos, encouraging teams to prioritize quick delivery and innovation over meticulous planning and perfection.
However, even Facebook admitted that they needed to shift to a more balanced approach where they move fast but with stable infrastructure. It’s important to do it right in engineering; however, if you build the wrong product, people don’t care, and then you are wasting your time and money.
However, it’s not as simple as building everything fast and piling up tech debt like the Tech Debt Plague. You will get the right product but a poor experience because it will be slow, full of bugs, and unreliable.
Again this refers to the dilemma of Discovery vs Delivery. You need to find the right balance between moving fast to discover what works and doing it right to ensure quality and reliability.
In theory you could “just move fast,” but the issue is that lots of companies are allergic to refactoring and never pay the price to improve the code, mostly because tech leadership is bad and/or immature. Either way, you need to pay the price at some point. If you never pay it, you end up with a big ball of mud that is hard to maintain and evolve.
Why you need to know this?
Because this tension is constant. It’s a regular day of product and engineering teams. Being good at this “game” is a game changer for the company. As Mark Zuckerberg said: “Move Fast with Stable Infra.”
Mark Zuckerberg on Fast Learning Cycles
Build vs Buy
Which one is best, build or buy? Well, it depends. Build is important when it is your core domain and is how your company makes money and differentiates from the competition. If all companies buy the same core business, how are they different? They are not.
However, when it is not your core business, buying makes a lot of sense because it frees time for you to focus on what matters: your core business. But there is a big mistake here: there is no buy without build, because even if you buy, you will need to integrate with your existing systems. Just because you buy does not mean it is perfect or that it fixes all your problems. It’s also common that a solution you buy introduces problems you did not have before. Buy has hidden costs.
When buying, it’s important to observe:
- Make sure there are APIs
- Evaluate the APIs before buying
- Buy the code (when possible)
- Consider integration as part of the cost
- Consider troubleshooting, debugging, observability as part of the cost
If I use an AWS service vs building in-house, is it cheaper? Well, it depends on where you want to put the money. In the beginning, buying might sound cheaper, but you will pay AWS forever. When you build, you might have a better solution (if executed right), but you now need to pay people to maintain it forever. So where do you want the money to go: your people or vendors? The real question is how good your in-house execution is. If your execution is terrible, go buy.
Why you need to know this?
Because this dilemma happens all the time. By running trade-off analyses, you can make better and more informed decisions. Always socialize those decisions with your team and stakeholders, and make sure everyone understands the implications of each choice.
Decide or Wait
Lean believes that late decisions are often better than rushed ones. However, waiting too long can lead to missed opportunities. As time passes, you become better equipped to know what is best, and it is always easier to refactor later (if the cost is not too high).
Decisions are a process, they should not be written in stone. Deciding something is a process, it allows us to move on. However, we should be challenging our past decisions, that is why time to reflect and think is important. Architects need to protect their time to have time to think. Engineering teams must have retrospectives to reflect on past decisions.
Sometimes it is too soon to decide, you might not have data or not be sure. You must balance the cost of waiting versus the cost of deciding now. Lean has a tool to deal with this called cost of delay. If the cost of delay is low, you can wait more. If it is high, you must decide now.
Experimentation is a good way to make temporary decisions, and then figure out what sticks and what does not stick. You can try A/B testing, feature flags, or prototypes to gather data and make informed decisions later. Deciding and waiting are not mutually exclusive, they require balance.
Why you need to know this?
Because this dilemma happens almost every day, if not every day. Knowing when to decide and when to wait is a key skill for architects and engineers. It allows you to make better decisions, avoid mistakes, and deliver value faster. For instance, if a decision could be catastrophic, it is better to wait; if the cost of waiting is high, it is better to move.
That’s why it’s important to reflect on decisions, usually in retrospectives. Nobody does this, but I believe in something I call Blameless Feature Reviews, which could help us learn 100x more from our decisions. Such reviews usually happen, but only with senior executives; IMHO we must use a proper DevOps blameless mindset and include the whole team.
Otherwise who is learning? Only the people who don’t build the software?
Why Properties?
Properties are characteristics or attributes that define or describe an object, system, or concept. Properties give us benefits. Without architecture properties, we would be lost in a sea of ambiguity and confusion.
Properties are things you really want for your architecture and systems. Think about a kid, you want your son or daughter to have good properties like honesty, kindness, and respectfulness. Similarly, in architecture and systems, we want properties that ensure scalability, security, and observability.
Some properties are obvious, others are non-obvious. But if you know they exist, you can make explicit work to make sure they are present in your architecture and systems.
Anti-Fragility
Anti-fragility is a property of systems that not only withstand shocks and stressors but also improve and grow stronger as a result of them. Unlike resilience, which focuses on maintaining stability in the face of adversity, anti-fragile systems thrive on volatility and uncertainty.
In order to achieve anti-fragility, you need Defensive Programming and Chaos Engineering. You must test that your system is tolerant and can recover from failure. So you induce and provoke failure in your system and infrastructure to prove it recovers when necessary.
You do not want to discover in production whether the system can recover from failure; you want to know beforehand. You want to know that your system is anti-fragile. Testing in production is not the same as discovering in production: testing in production means guard-rails, and testing in the production environment without impacting real users.
Your systems must be anti-fragile. Your architecture must be anti-fragile. Your infrastructure must be anti-fragile. To be anti-fragile you need lots of hypotheses and experiments to prove your system is anti-fragile. Anti-fragility requires science and creativity.
How to develop Anti-Fragile Systems?
Basically, we need to do two things. First, we must test our infrastructure and systems to prove they are anti-fragile. AWS is anti-fragile by nature, but you could be using it wrong, so it is always a good idea to test it.
Second, you want to design with anti-fragility in mind. For instance, we should not put all applications into a single database; we should have one database per service. That way we have isolation and limit the blast radius. So you see, it is not just testing (which is chaos engineering) but also designing with anti-fragility in mind.
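As a tiny illustrative sketch (all names here are hypothetical), failure can be induced deliberately, chaos-engineering style, to prove a caller degrades gracefully instead of crashing:

```python
import random

class FlakyDependency:
    """Hypothetical downstream service whose failure rate we control,
    so failure can be induced deliberately (chaos-engineering style)."""
    def __init__(self, failure_rate: float):
        self.failure_rate = failure_rate

    def call(self) -> str:
        if random.random() < self.failure_rate:
            raise ConnectionError("induced failure")
        return "ok"

def resilient_call(dep: FlakyDependency, retries: int = 5) -> str:
    """Defensive programming: retry, then degrade gracefully to a fallback."""
    for _ in range(retries):
        try:
            return dep.call()
        except ConnectionError:
            continue
    return "fallback"

# Induce 100% failure to prove the caller degrades instead of crashing.
assert resilient_call(FlakyDependency(failure_rate=1.0)) == "fallback"
# With no induced failure, the happy path still works.
assert resilient_call(FlakyDependency(failure_rate=0.0)) == "ok"
```

Real chaos experiments run against infrastructure with tools like the Simian Army, but the hypothesis-then-induce-failure loop is the same.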
State of the Art
If you are going to build something, why start with something deprecated? Why not use the latest versions? State of the art is not just about using the latest and greatest versions but also about picking the best solutions.
Why not pick the best database? Why not pick the best language? Why not pick the best framework? Why not pick the best architecture?
We should be creating the best architecture and the best solutions, just going with the “standards” is a path to stagnation.
This might sound impossible, but it’s not. It just requires more research and more effort. But it is 100% doable and should be done.
Scalability
A good architect produces good architecture. Architecture is embedded in code. Good code is scalable code.
Scalability is not only for systems, it’s also for engineering teams. A good architecture allows teams to work in parallel without stepping on each other’s toes.
Good structure, good design, allow engineers to be more productive, and focus on delivering value, instead of fighting with the codebase. You want the architectures you produce to have these properties.
Observable
As an architect, the systems you produce must have the ability to be observable. Such property is very important. Consider observability a subset of testing which happens in production.
If you do not know what is going on, you are driving in the dark. You want to be able to understand how your system behaves in production, and be able to detect issues before they become problems for your users.
Without such understanding, you are not doing the whole cycle correctly. Building something following a design is not enough. You need to observe the final product in production, and be able to learn from it and improve it.
Making Observable systems implies having:
- Proper Logging
- Exposing custom application metrics
- Latency distribution metrics for upstream and downstream dependencies
- Proper Message IDs
- The practice of looking at production monitoring dashboards and logs as part of your daily routine and retrofitting the learnings to the engineering team
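A minimal sketch of what some of these items can look like in code (service and dependency names are hypothetical): a message ID propagated into every log line, plus a latency metric recorded per downstream dependency:

```python
import logging
import time
import uuid
from collections import defaultdict

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("orders")

# Custom application metric: latency samples per downstream dependency.
latency_ms = defaultdict(list)

def call_downstream(name, fn, message_id):
    """Wrap a dependency call: log with a message ID and record latency."""
    start = time.perf_counter()
    try:
        return fn()
    finally:
        elapsed = (time.perf_counter() - start) * 1000
        latency_ms[name].append(elapsed)
        log.info("msg_id=%s dep=%s latency_ms=%.2f", message_id, name, elapsed)

msg_id = str(uuid.uuid4())  # proper message ID, propagated across the call chain
call_downstream("inventory-db", lambda: time.sleep(0.01), msg_id)
assert len(latency_ms["inventory-db"]) == 1  # one latency sample recorded
```

In a real system these samples would feed a metrics backend for latency-distribution dashboards; the point is that the instrumentation is explicit in the code.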
Stability
Not only your architecture but also your systems must be stable. Being stable is a state, which means they are not broken all the time. Systems should not be broken. It is impossible to never break the system; after all, we are only human. However, this is not an excuse for lack of attention, carelessness, or poor practices.
A stable system is one where most of the time:
- Build is passing
- Tests are passing, in all environments
- Deployments are successful
- Monitoring shows healthy metrics in production
- Tech debt is being managed
A stable architecture is one where:
- You can reason about concepts with confidence
- You can make changes without breaking things
- You can onboard new team members without much friction
Secure
Software must be secure. Architects should prioritize security in their designs to protect data and maintain user trust. Lack of security ruins brands, trust is damaged, and legal consequences may arise.
Security means implementing measures to protect systems from threats such as unauthorized access, data breaches, and cyber-attacks. This includes practices like encryption, authentication, authorization, tokenization, careful handling of passwords (where the best approach is to not have credentials at all), and regular security audits.
Common considerations for secure architecture:
- Data Protection: Ensure sensitive data is encrypted both in transit and at rest.
- Access Control: Implement robust authentication and authorization mechanisms to restrict access to authorized users only
- Regular Updates: Keep software and dependencies up to date to mitigate vulnerabilities.
Architects need to know security. Architecture must be designed with security in mind from the outset, rather than being an afterthought. This includes threat modeling, secure coding practices, and continuous monitoring for potential vulnerabilities.
By prioritizing security in software architecture, organizations can safeguard their systems, protect user data, and maintain trust with their users.
Nothing to Leak
If you have sensitive information, it can be leaked. The only sure defense is to not have it at all. For instance, there is a variety of tools and solutions that let you “not have secrets.” If you do not have secrets:
- You can’t leak them
- You don’t need to rotate them
- You don’t need to worry
Examples of “Nothing to Leak” solutions:
- AWS RDS IAM Authentication
- AWS KMS
- Not valid for all applications, but: not storing an OPENAI_API_KEY at all
When you have Credentials
You need a lot of work to make sure they are secure. For each credential or key you have, you need:
- Rotate them periodically
- Store them securely (eg: secrets manager, AWS KMS, HashiCorp Vault, etc)
- Audit their usage
- Monitor for leaks (eg: Have I Been Pwned)
- Limit their access (eg: least privilege principle) - for keys (called key scoping)
GitOps is Sexy for Security
GitOps is a practice that uses Git repositories as the single source of truth for declarative infrastructure and applications. By leveraging GitOps, organizations can enhance security.
The core idea with GitOps is that you do not give admin or super privileges IAM roles to developers. Instead, you have a system that applies changes based on approvals. For GitOps, such approvals come from Pull Requests that are merged. You have a history of all changes and can walk back in time thanks to Git.
GitOps is perfect for security because you reduce the blast radius of “sharing credentials”.
Why Practices?
Architecture is not just about technology. Architecture is also about people, and what and how you work with the team.
The way you work matters, what you do and what you do not do (consciously) also matter.
Practices are about “how you do things,” the way you approach problems, day-to-day life, difficult situations, decisions, and trade-offs.
Attention to Detail
Architects must be meticulous in their work, ensuring that every aspect of a design is carefully considered and executed. Attention to detail is crucial for creating functional, safe, and aesthetically pleasing structures. Here are some key areas where attention to detail is essential.
Architects pay close attention to:
- Code: Classes, Internal design, contracts, patterns/anti-patterns
- Tests: Are they passing? How good are they? Do we have good diversity? Coverage?
- Production Logs: Are there any errors? Warnings? Anomalies?
- Production Dashboards: Are all metrics within expected ranges?
- Error Tracking: Are there any new issues? Recurring issues? We should have ZERO EXCEPTIONS.
- Performance Metrics: Are we meeting our performance goals? Any regressions?
- Security Audits: Are there any vulnerabilities? Are we compliant with security standards?
- Requirements: Are we meeting all specified requirements? Any missing features?
Architects are critical and detail-oriented when it comes to every single tech detail.
Architecture Review
If you don’t have anything written, how can we review it? How can you communicate with the engineering team? How can you onboard new team members?
Architecture must be written down, at the minimum:
- Overall Architecture Diagram
- Key Decisions (with rationale)
- Important Trade-offs
Arch documentation can be done with a markdown file, here is a good template: https://github.com/diegopacheco/tech-resources/blob/master/arch-doc-template.md
When you write things down, principles, decisions, guidance, and trade-offs become explicit, easier to communicate, and easier to review. Diagrams help a lot, especially overall architecture diagrams and class diagrams, which are very useful.
Design First
Architects must think. They must produce the design before the implementation. However, that’s only possible if the architect masters the problem domain and the technology being used. If that does not happen, Proof of Concepts (PoCs) are a great way to explore the problem and the technology before the design is produced.
Writing down the design is also a great way to communicate it to the team and get feedback. The design should be a living document that evolves with the project. If you jump into executing all the time you are doing just tactical programming (Philosophy of software design).
If you move before you think, tech debt will happen, bad decisions will happen.
Ownership
Architects must have extreme ownership. They are responsible for the success and failure of the architecture they design. This includes:
- Ensuring the architecture meets business requirements.
- Continuously evaluating and improving the architecture.
- Collaborating with development teams to ensure proper implementation.
- Staying updated with industry trends and best practices.
- Documenting architectural decisions and their rationale.
- Advocating for the architecture within the organization.
Architects must be proactive. Architects are always:
- Thinking: Thinking ahead about future changes and how the architecture can accommodate them.
- Performing Research: Researching new technologies and patterns that can improve the architecture.
- Socializing: Effectively communicating, educating, and influencing stakeholders to ensure alignment and understanding of the architecture.
Reading Code
I know this might sound silly, but architects need to read the code: they should download the code and read it. Not only read it but analyze it. Reading the code allows the architect to:
- Understand the complexity of the system
- See the anti-patterns on the code
- Understand what the system does
Many meetings can be saved if people just read the code. Architects must be hackers; they need to read the code. Read the system code, read the libraries’ code, read the frameworks’ code. Reading code is not a one-time thing; it’s an everyday thing.
Monthly Review
Code review is great, and architects should do code review. However, code review is pretty much focused on deltas, which is not the whole story.
Code review happens in-cycle, meaning every week, every day, or pretty much every time a ticket/PR is done. We also need an off-cycle review.
As an architect, 1x per month or at least 1x per quarter you should look at the whole codebase. Why? Because then you are not looking at the deltas, you are looking at the whole picture and seeing the design.
Such practice is important to see:
- Architectural concept drift: Are we still following the intended architecture? Are there parts of the code that have diverged from the original design principles?
- Code quality trends: Is the overall code quality improving or deteriorating? Are there areas that need refactoring or technical debt reduction?
- Consistency: Are coding standards and best practices being followed consistently across the codebase?
- New patterns and anti-patterns: Are there new design patterns emerging that could be beneficial? Are there anti-patterns that need to be addressed?
- Testing Diversity: Is the test coverage adequate? Are there areas that lack sufficient testing? Do we need new forms of testing and failure induction?
Working in the Trenches
Lean has this principle of “Gemba” which means “the real place” in Japanese. It emphasizes the importance of going to the actual place where work is done to understand processes and identify opportunities for improvement.
Software is a war that never ends. Architects need to be on the battlefront, in the trenches with engineers. Architects should not fix all the problems and code all the stories. However, they need to code, they need to help the team to deal with complex problems. Architects going there and helping the team is a way to prevent a timeline disruption. Otherwise what kind of architect are you?
Architects on the front line can understand the real problems the team is facing. Architects cannot be hands off.
Why Concepts?
Without knowing key concepts, you have not really learned something properly. Knowing concepts allows you to build mental models that help you understand new information faster and better.
Without mental models, you are just memorizing information, and that is not a good way to learn and the worst way to operate. Mental models can only be built with proper understanding of concepts.
Having good mental models is a game changer, because it allows you to:
- Learn new things faster
- Understand complex topics better
- Make better decisions
- Solve problems more effectively
- Communicate ideas clearly
ACID
ACID is a property of database transactions intended to guarantee validity even in the event of errors or power failures. It is a relational database concept.
A == Atomicity
Atomicity ensures that a transaction is treated as a single unit, which either completely succeeds or completely fails.
If any part of the transaction fails, the entire transaction is rolled back, and the database remains unchanged.
C == Consistency
Consistency ensures that a transaction brings the database from one valid state to another valid state, maintaining all predefined rules, such as constraints, cascades, and triggers.
I == Isolation
Isolation ensures that concurrent transactions do not interfere with each other. The intermediate state of a transaction is invisible to other transactions until the transaction is committed.
This prevents transactions from reading uncommitted data from other transactions, which could lead to inconsistencies.
D == Durability
Durability guarantees that once a transaction has been committed, it will remain so, even in the event of a system failure. Committed data is saved to non-volatile storage, ensuring that it is not lost. Databases usually use a WAL (Write-Ahead Log) to achieve this.
Why you need to know this?
- Relational databases like PostgreSQL, MySQL, Oracle are ACID.
- Non-relational databases like DynamoDB, Cassandra or Redis are not ACID.
- Knowing that your database has ACID properties allows you to design simple systems because you can rely on ACID properties.
Let’s say you want to write a system where only one user can rent a given car. You can probably think of some complex solution with locks. However, you don’t need that, because you can simply rely on the ACID properties: if two users try to insert the same rental at the same time, one will succeed and the other will fail. So ACID allows your code to be simple.
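To make the car rental example concrete, here is a small sketch using SQLite (any ACID relational database behaves the same way): a primary-key constraint plus atomic transactions replaces explicit locking:

```python
import sqlite3

# In-memory SQLite database; the same idea applies to any ACID database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rental (car_id TEXT PRIMARY KEY, user_id TEXT)")

def rent(user_id: str, car_id: str) -> bool:
    """Try to rent a car. Atomicity plus the PRIMARY KEY constraint ensure
    only one rental per car succeeds; no explicit locks needed."""
    try:
        with conn:  # transaction: commits on success, rolls back on error
            conn.execute("INSERT INTO rental VALUES (?, ?)", (car_id, user_id))
        return True
    except sqlite3.IntegrityError:
        return False  # someone else already rented this car

assert rent("alice", "car-1") is True
assert rent("bob", "car-1") is False  # second rental of the same car fails
```

The application code stays simple because the database enforces the invariant.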
Authentication & Authorization & Entitlements
Authentication
Authentication is the process of verifying the identity of a user or system. It ensures that the entity requesting access is who they claim to be. Common methods of authentication include:
- Passwords (very bad)
- Multi-factor authentication (MFA)
- Biometric verification
- OAuth tokens
- API keys
- Single Sign-On (SSO)
Authorization
Authorization is the process of determining what an authenticated user or system is allowed to do. It defines the permissions and access levels for different resources. Common authorization models include:
- Access Control Lists (ACLs)
- Role-Based Access Control (RBAC)
- IAM (Identity and Access Management) systems
- Policy-Based Access Control (PBAC)
- Policy as Code
Entitlements
Entitlements refer to the specific rights or privileges granted to a user or system after authentication and authorization. They define what actions can be performed on specific resources. Examples of entitlements include:
- Access a product catalog (invisible by default)
- See premium content
- See premium features
Why you need to know this?
Authentication
- Not all our services need to be public or customer facing.
- Whatever you have that is customer facing, must have authentication.
- Internal services don’t require any authentication.
Authorization
- It’s the next step after authentication.
- In security we can give granular access to resources. This principle is called least privilege.
- Authorization is checking if the user has the fine grained access to a resource.
Entitlements
- It’s common in digital products to have tiers of products. Such tiers can be called: basic, premium, silver, gold, platinum, free, pro, enterprise. But all these tiers are also called entitlements.
- Entitlements means, given the user subscription or plan, can they see some feature or not.
- Consider entitlements the way to tell what features a user can see or not. Do not confuse entitlements with feature flags or experiments.
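A minimal sketch of an entitlement check, with a hypothetical plan-to-features table (plan and feature names are made up for illustration):

```python
# Hypothetical entitlement table: subscription plan -> visible features.
ENTITLEMENTS = {
    "free":    {"product_catalog"},
    "premium": {"product_catalog", "premium_content", "premium_features"},
}

def can_see(plan: str, feature: str) -> bool:
    """Entitlement check: given the user's plan, is this feature visible?"""
    return feature in ENTITLEMENTS.get(plan, set())

assert can_see("premium", "premium_content") is True
assert can_see("free", "premium_content") is False
```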
JWT
JSON Web Tokens (JWT) is a standard created in 2010. It defines a compact and self-contained way for securely transmitting information between parties as a JSON object. This information can be verified and trusted because it is digitally signed.
JWT has the following structure:
xxxxx.yyyyy.zzzzz
Where:
- xxxxx is the header: it typically consists of two parts: the type of the token (JWT) and the signing algorithm being used, such as HMAC SHA256 or RSA.
- yyyyy is the payload: it contains the claims. Claims are statements about an entity (typically, the user) and additional data. There are three types of claims: registered, public, and private. Payload example: { "id": "1234567890", "name": "John Doe", "admin": true }
- zzzzz is the signature: to create the signature part you take the encoded header, the encoded payload, a secret, and the algorithm specified in the header, and sign that.
The flow would be something like:
- User Logs in with credentials.
- Server verifies the credentials and generates a JWT.
- Server sends the JWT back to the user.
- User stores the JWT (usually in local storage).
- For subsequent requests, the user includes the JWT in the Authorization header using the Bearer schema.
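For illustration, here is a minimal HS256 sign/verify sketch built only from the standard library (a toy to show the header.payload.signature structure; in production, use a maintained library such as PyJWT):

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    """Base64url encoding without padding, as JWT uses."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(payload: dict, secret: bytes) -> str:
    """Build header.payload.signature, each part base64url-encoded."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = b64url(hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"

def verify_jwt(token: str, secret: bytes) -> dict:
    """Recompute the signature locally: no call to an auth server needed."""
    header, body, sig = token.split(".")
    expected = b64url(hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("invalid signature")
    padded = body + "=" * (-len(body) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))

secret = b"demo-secret"
token = sign_jwt({"id": "1234567890", "name": "John Doe", "admin": True}, secret)
assert verify_jwt(token, secret)["name"] == "John Doe"
```

Note how verification needs only the shared secret: this is what makes JWT verification local and stateless.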
Why you need to know this?
The main benefit of JWT is that it is self-contained: all the information needed to verify the token is contained within the token itself. This makes JWTs very efficient for stateless authentication, as the server does not need to store session information. You can skip calling an authentication endpoint on every request, because each service or component that has the JWT can verify and validate it locally without a remote call.
However, you should be rotating and expiring JWTs properly to mitigate security risks. Also, be cautious about what information you include in the payload, as JWTs can be decoded easily. Sensitive information should not be stored in the payload unless it is encrypted. You can debug JWTs easily using tools like jwt.io.
BASE
BASE is a concept from NoSQL databases. BASE systems prioritize availability over immediate consistency.
BASE Properties
BASE is an acronym that describes the consistency model used by many NoSQL systems. It stands for:
BA == Basically Available
The system guarantees availability - it will always respond to requests, even if the response contains stale or inconsistent data. The database remains operational even when parts of the system fail.
S == Soft state
The state of the system may change over time, even without new input, due to eventual consistency. Data doesn’t have to be immediately consistent across all nodes.
E == Eventual consistency
Given enough time without new updates, all replicas will eventually converge to the same value. The system doesn’t guarantee immediate consistency but promises that consistency will be achieved eventually.
Why you need to know this?
NoSQL databases like DynamoDB, Cassandra, Redis are often BASE. By understanding BASE properties you can better understand what to expect from your NoSQL database. For instance knowing you have a BASE set of properties in place means that if you insert some data, or update some data, you might not see it replicated across all nodes immediately. Eventually it will be all consistent across all nodes.
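The replication behavior described above can be sketched as a toy model (not a real replication protocol; just the BASE properties made visible):

```python
# Three replicas of the same record; writes land on one replica first.
replicas = [{"x": 1}, {"x": 1}, {"x": 1}]

def write(key, value):
    """Basically Available: the write is accepted by one replica immediately."""
    replicas[0][key] = value

def gossip():
    """Soft state: replicas converge later, when anti-entropy replication runs."""
    for r in replicas[1:]:
        r.update(replicas[0])

write("x", 2)
stale_reads = [r["x"] for r in replicas]  # some readers still see the old value
gossip()
fresh_reads = [r["x"] for r in replicas]  # Eventual consistency: all converge
assert stale_reads == [2, 1, 1] and fresh_reads == [2, 2, 2]
```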
This might sound silly, but it affects the user experience in the sense that nothing takes immediate effect. In a sync/blocking universe everything is immediate; you might think that is better, and for humans it might be, but for systems it sucks and it is much harder to scale. In AWS, everything is async and non-blocking.
Do not believe me? Watch this keynote from AWS re:Invent 2022 where Dr. Werner Vogels (CTO of AWS) explains this concept in detail around the 1 hour mark.
Idempotency
Idempotency is a property that allows the same operation to be performed multiple times without changing the result beyond the initial application.
In REST APIs, idempotency is an important concept, especially for HTTP methods. Common idempotent HTTP methods include GET, HEAD, OPTIONS, TRACE, and PUT.
- GET: The most famous and common idempotent method. Retrieving a resource multiple times does not change its state.
Idempotency is important because if the same request arrives twice, there are no extra side effects: the server can safely ignore the second request, return the same result as the first one, or just re-process the same thing again without any unintended consequences.
This principle keeps us from writing complex software. For instance, if a GET operation also did inserts and deletes, we would have to handle the case where the same GET request is sent multiple times, which could lead to data inconsistency and unexpected behavior.
So we want to honor the idempotency principle to keep our software simple and predictable.
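A small sketch of the idea: a PUT-style handler that sets a resource to a given state, so replaying the same request leaves the system exactly as after the first application (the in-memory store is illustrative only):

```python
# A tiny in-memory resource store standing in for a real database.
store = {}

def put_user(user_id: str, data: dict) -> dict:
    """PUT semantics: set the resource to this exact state. Applying the
    same request twice leaves the system as it was after the first."""
    store[user_id] = data
    return store[user_id]

first = put_user("42", {"name": "Ada"})
second = put_user("42", {"name": "Ada"})  # replay: no extra side effects
assert first == second and len(store) == 1
```

Contrast this with an append-style handler, which would create a second record on replay and break idempotency.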
Why you need to know this?
When you are designing and implementing REST APIs, which are pretty much the bread and butter of all backend development nowadays, you must guarantee that your GET, HEAD, OPTIONS, TRACE, and PUT methods are idempotent. This is not merely a rule of thumb; it is a must, because your consumers/clients will expect that behavior from your API. If you break that expectation, you will have a bad time debugging and fixing issues that could easily be avoided by following the idempotency principle.
Secondly if you break idempotency you will have more complex code to maintain. Because you will have to handle edge cases and weird things, it will be harder to test, everything will be worse. So just follow the principle and keep your code simple.
Pessimistic vs Optimistic Locking
When dealing with concurrent access to shared resources, two primary locking strategies can be employed: pessimistic locking and optimistic locking.
Pessimistic locking: assumes that conflicts will occur and therefore locks the resource for exclusive access when a transaction begins. This approach is suitable for environments with high contention, as it prevents other transactions from modifying the resource until the lock is released. However, it can lead to reduced concurrency and potential deadlocks.
Optimistic locking: assumes that conflicts are rare and allows multiple transactions to access the resource concurrently. Instead of locking the resource, it checks for conflicts only when a transaction attempts to commit changes. If a conflict is detected, the transaction is rolled back and can be retried. This approach is more efficient in low-contention environments, as it maximizes concurrency and minimizes locking overhead.
Optimistic locking is often implemented using versioning, where each resource has a version number that is checked and updated during transactions. If the version number has changed since the transaction began, it indicates a conflict.
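A minimal sketch of version-based optimistic locking (all names hypothetical): each commit checks that the version it read is still current, then bumps it.

```python
class StaleWrite(Exception):
    """Raised when another transaction committed since we read the record."""

# Hypothetical resource guarded by a version number.
record = {"balance": 100, "version": 1}

def commit(read_version: int, new_balance: int) -> None:
    """Optimistic commit: check the version, then update and bump it."""
    if record["version"] != read_version:
        raise StaleWrite("version changed; retry the transaction")
    record["balance"] = new_balance
    record["version"] += 1

v = record["version"]        # two transactions both read version 1
commit(v, 150)               # first writer wins; version becomes 2
conflict = False
try:
    commit(v, 90)            # second writer detects the conflict
except StaleWrite:
    conflict = True
assert conflict and record["balance"] == 150
```

The losing transaction is rolled back and can retry with fresh data; no lock was ever held.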
Why you need to know this?
Understanding the differences between pessimistic and optimistic locking is crucial for designing systems that handle concurrent access to shared resources effectively. Choosing the right locking strategy can significantly impact system performance, scalability, and user experience.
Pessimistic locking is better when:
- High contention for resources is expected.
- The cost of rolling back transactions is high.
Optimistic locking is better when:
- Contention for resources is low.
- The cost of rolling back transactions is low.
- High concurrency is desired.
It’s very appealing to just use optimistic locking everywhere. However, there are scenarios where you should use neither. Consider a classic example (an anti-pattern, but it happens a lot): React applications on the frontend that are not fully tested can have many re-renders, causing many requests to be sent to the backend. In that scenario, optimistic locking causes a lot of pain and adds unnecessary overhead. Just do nothing and rely on a last-write-wins strategy, or simply piggyback on the database’s ACID properties.
Of course, it really depends on what the application does and the criticality of the transaction. But the React scenario is not hypothetical; I’ve seen it happen many times in real life. Also, if the frontend did not implement a time-based debounce mechanism, a user with a nervous finger might click a button multiple times, causing multiple requests to be sent to the backend. In that scenario, optimistic locking will just add unnecessary pain and false positives.
Partition
A partition of a set is a way of dividing the data set into subsets such that every element in the original set is included in exactly one of the subsets. In other words, a partition breaks down a set into distinct parts where no part shares any elements with another, and all parts together cover the entire original set.
Partition by:
- A specific column or set of columns
- A specific number of partitions
- A specific size of each partition
- A specific percentage of data in each partition
- A specific condition or rule
- A random distribution of data into partitions
- Consistent hashing for distributed systems
Partitions matter because they can significantly impact the performance and efficiency of data processing tasks. Proper partitioning can lead to faster query execution, reduced data shuffling, and improved resource utilization in distributed computing environments.
When designing partitions, consider factors such as data distribution, query patterns, and the underlying storage system to ensure optimal performance.
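A minimal sketch of the first and most common strategy, hash partitioning by a key column. The function name and data shape are illustrative:

```python
# Sketch of hash partitioning: every element lands in exactly one partition,
# and the partitions together cover the entire original set.

def partition_by_hash(rows, key, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # A stable hash of the partition key picks exactly one bucket,
        # so rows with the same key always land in the same partition.
        idx = hash(row[key]) % num_partitions
        partitions[idx].append(row)
    return partitions

rows = [{"user": f"u{i}", "amount": i} for i in range(10)]
parts = partition_by_hash(rows, "user", 4)

# Disjoint and complete: each row appears in exactly one partition.
assert sum(len(p) for p in parts) == len(rows)
```

Keeping all rows for the same key in one partition is what lets distributed systems route queries to a single node instead of scanning everything.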
Why you need to know this?
When you don’t have a lot of data you can 100% survive without partitions. But as your data grows, partitions become crucial for maintaining performance and scalability. Proper partitioning can help you manage large datasets more effectively, improve query performance, and optimize resource usage.
Partitions are a MUST at scale. Another scenario where you want partitions is when the data grows quickly: a lot of data every day, recurrent batch jobs ingesting data non-stop. If you are just updating data in place you are fine, but if you keep inserting data without partitions, you will quickly run into performance issues as the dataset grows larger.
Schema Evolution
Schema evolution is the practice of changing data structures, API contracts, or message formats over time while maintaining compatibility with existing clients and services. This is critical for zero-downtime deployments in distributed systems.
Forward Compatibility
Forward compatibility means that old code can read data written by new code. The old system can safely ignore new fields it doesn’t understand.
When adding new fields to a schema:
- New fields should be optional with sensible defaults
- Old services can process new messages by ignoring unknown fields
- Allows deploying new producers before updating consumers
Backward Compatibility
Backward compatibility means that new code can read data written by old code. The new system must handle the absence of fields that didn’t exist in older versions.
When reading old data:
- New code must provide defaults for missing fields
- New services can process old messages correctly
- Allows deploying new consumers before updating producers
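Both directions of compatibility can be shown with a tiny message reader. The field names (`tier`, `beta`) are hypothetical:

```python
# Sketch of backward- and forward-compatible message handling.
# Old messages may lack new fields (backward compatibility: default them);
# new messages may carry unknown fields (forward compatibility: ignore them).

def read_user(message):
    return {
        "name": message["name"],              # always present in every version
        "tier": message.get("tier", "free"),  # new optional field, defaulted
    }

old_msg = {"name": "Ada"}                               # written by an old producer
new_msg = {"name": "Ada", "tier": "pro", "beta": True}  # "beta" is unknown to us

assert read_user(old_msg) == {"name": "Ada", "tier": "free"}  # backward compatible
assert read_user(new_msg)["tier"] == "pro"                    # "beta" safely ignored
```

Serialization frameworks like Protobuf and Avro bake these two rules (default missing fields, skip unknown fields) into the format itself.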
Breaking Changes
Breaking changes destroy compatibility and require coordinated deployments. Avoid these whenever possible:
- Removing required fields
- Changing field types
- Renaming fields without aliasing
- Changing field semantics
- Making optional fields required
Safe Schema Changes
Safe changes maintain compatibility:
- Adding optional fields with defaults
- Removing optional fields
- Adding new enum values at the end
- Adding new message types
- Deprecating fields instead of removing them
Versioning Strategies
URL Versioning: Different versions in the URL path like /api/v1/users and /api/v2/users
Header Versioning: Version specified in request headers like Accept: application/vnd.api.v2+json
Content Negotiation: Different media types for different versions
No Versioning: Evolve the schema compatibly without explicit versions. Requires discipline but provides the best flexibility.
Migration Strategies
Expand-Contract Pattern: Three-phase deployment for breaking changes:
- Expand: Add new field alongside old field
- Migrate: Update all services to use new field
- Contract: Remove old field after migration complete
Shadow Reading: New code reads both old and new formats, writes only new format
Feature Flags: Toggle between old and new behavior at runtime
Database Schema Evolution
Database schemas require special care because data persists:
- Use migrations that can run without downtime
- Add columns as nullable first, backfill data, then make required
- Drop columns in separate deployment after code stops using them
- Use views or aliases to maintain old column names during transitions
Schema evolution is not optional in production systems. Every change must consider compatibility to avoid outages during deployments.
Why you need to know this?
All backend systems have databases. Backend systems should never share database access with other services. We need to maintain and evolve database schemas without breaking existing applications. By following good schema evolution patterns we can ensure smooth deployments and maintain system reliability.
Backward compatibility is more important than forward compatibility in most backend systems, because backend systems are the source of truth for data. Also, if you need to roll back the application code because of a bug or a mistake, or because the business just changed its mind, you can do that easily without breaking the database. If you are smart about it, you might even be able to escape database migrations in some cases.
Forward compatibility is more important in event-driven systems, where multiple services consume the same events. In this case, you want to make sure that old services can still process new events without issues.
Source of Truth
Source of Truth is a concept that designates which database or system is considered the authoritative source for a particular piece of information. It is common in distributed systems to have multiple databases, or even multiple systems, that store the same data. In such cases, it is crucial to designate one of them as the "source of truth" to ensure consistency and reliability of the data across the entire system.
The same concept is used in migrations. While you are migrating from System A to System B, System A is the source of truth until the migration is complete and verified. After that, System B becomes the new source of truth.
Why you need to know this?
In distributed systems, having a clear source of truth is essential to avoid data inconsistencies and conflicts. When multiple services or databases can modify the same data, it can lead to situations where different parts of the system have different versions of the truth. This can cause errors, confusion, and data integrity issues.
It’s common to have one system, or one database behind a service, be the Source of Truth. This also means that many systems can READ or hold a COPY of the data, usually in the form of a CACHE for performance benefits, but when it comes to WRITEs there must be one, and only one, clear place.
This might sound silly, but this can either bring great sanity or a big mess to your system. Always make sure you have a clear Source of Truth for your data.
Stateless vs Stateful Services
When designing services, one of the key architectural decisions is whether to implement them as stateless or stateful. State is a big thing. Stateless services do not retain any information about previous interactions, while stateful services maintain state information across multiple requests.
It is much easier to work with stateless services because they can scale more easily, recover from failures faster, and are generally simpler to manage. However, there are scenarios where stateful services are necessary, such as when maintaining user sessions or handling transactions.
Stateless does not mean that the service cannot use state at all; rather, it means that the service itself does not store state between requests. Instead, any necessary state can be stored in external systems like databases or caches. Stateful services are more complex.
Why you need to know this?
As much as possible, try to create services as stateless. This will bring you a lot of benefits in terms of scalability, reliability, and maintainability. Most importantly, it will be much easier to reason about, maintain, and evolve your services over time.
When you need to create stateful services, be very careful about how you manage state. Consider using external systems to store state information and ensure that your services can handle failures gracefully. Always evaluate the trade-offs between stateless and stateful designs based on your specific use case and requirements.
Why Patterns?
Patterns are common “Recipes” to solve recurring problems. They are proven solutions that have been tested and refined over time.
Patterns allow us to save time, because if someone knows the pattern, then you do not need to explain. If everything in the universe must be explained, you end up with slow progress.
There are many patterns out there, some are better than others, some are very specific to a domain or problem. Other patterns are like a Swiss army knife that can be applied in many different situations.
API GATEWAY
It’s an architecture pattern that acts as a single entry point for a set of microservices, handling requests by routing them to the appropriate service, aggregating responses, and performing cross-cutting tasks such as authentication, logging, and rate limiting.
API Gateway vs Load Balancer
An API Gateway and a Load Balancer serve different purposes in a microservices architecture:
- API Gateway: Primarily focuses on managing and routing API requests. It handles tasks such as request transformation, response aggregation, authentication, and rate limiting. It operates at the application layer (Layer 7) of the OSI model.
- Load Balancer: Primarily focuses on distributing incoming network traffic across multiple servers to ensure high availability and reliability. It operates at the transport layer (Layer 4) or the application layer (Layer 7) of the OSI model, depending on the type of load balancer used.
Key Features
- Request Routing: Directs incoming requests to the appropriate microservice based on the request path, method, or other criteria.
- Response Aggregation: Combines responses from multiple microservices into a single response to the client.
- Cross-Cutting Concerns: Manages common functionalities such as authentication, logging, rate limiting, and caching.
- Protocol Translation: Converts requests and responses between different protocols (e.g., HTTP to WebSocket).
- Load Balancing: Distributes incoming requests across multiple instances of a microservice to ensure high availability and reliability.
Why you need to know this?
Because there are 3 common scenarios where using an API gateway is the right thing to do:
- Exposing Public facing APIs: When we need to expose an API to the world, an API gateway is a must have. It will help us to manage security, rate limiting, logging, monitoring, etc.
- Cross Cloud Communication: Let’s say you have 2 departments in your company, one is in AWS and the other in Azure. You can use an API gateway to manage the communication between both clouds, handling security, routing, etc.
- Migrations: When migrating from a monolith to microservices, an API gateway can help to manage the transition, routing requests to the monolith or the new microservices as needed.
Backend for Frontend Pattern
The Backend for Frontend (BFF) pattern involves creating separate backend services tailored to the specific needs of different frontend applications. BFFs are usually written in a language that is common for the frontend team, meaning Node.js with JavaScript or TypeScript.
BFFs should have Rendering Logic and should NEVER have Business Logic because that should be encapsulated in the core backend services.
BFFs Are
- Frontend-Specific: Each BFF is designed and optimized for a specific frontend application.
- Team Ownership: The team that builds the frontend typically owns and maintains its BFF.
- Aggregation Layer: BFFs aggregate data from multiple downstream services.
- Tailored Responses: Returns only the data the frontend needs in the format it expects.
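A minimal sketch of the aggregation and tailoring responsibilities. The downstream services are stubbed with plain functions; in a real BFF these would be HTTP calls to the core backend services:

```python
# Sketch of a BFF aggregating two downstream services and tailoring the
# response for a mobile client. Downstream calls are stubbed here.

def get_user(user_id):    # stub for the core user service
    return {"id": user_id, "name": "Ada", "email": "ada@example.com",
            "internal_flags": ["x", "y"]}

def get_orders(user_id):  # stub for the core order service
    return [{"id": 1, "total": 10.0}, {"id": 2, "total": 5.5}]

def mobile_profile(user_id):
    # Aggregate downstream responses and return only what the mobile
    # screen needs: rendering logic, no business logic.
    user = get_user(user_id)
    orders = get_orders(user_id)
    return {"name": user["name"], "orderCount": len(orders)}

assert mobile_profile("u1") == {"name": "Ada", "orderCount": 2}
```

Note that internal fields like `internal_flags` never leave the BFF; each frontend gets exactly the shape it renders.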
BFF vs API Gateway
While both BFF and API Gateway sit between frontends and backend services, they serve different purposes:
API Gateway: Provides common functionality like authentication, rate limiting, and routing. It serves all clients with the same generic interface.
BFF: Tailors the API specifically to each frontend’s needs. It aggregates, transforms, and optimizes responses for a specific client type. Rendering logic is part of the BFF.
Why you need to know this?
Nowadays the frontend is pretty much always written in TypeScript on Node.js or other modern runtimes like Deno or Bun. BFFs allow us to have common rendering logic for mobile and web frontends. It’s a very common pattern to have a BFF between the web/mobile clients and the backend services.
You will see this pattern a lot, and you should be leveraging it as well. However, always be careful about what you put in BFFs. Remember: no Business Logic in BFFs, only Rendering Logic.
Cache
A cache is a software component that stores data so that future requests for that data can be served faster. Caches are commonly used to improve performance and reduce latency in various applications, including web browsers, databases, and operating systems.
Types of Caches
- Memory Cache: Stores frequently accessed data in RAM for quick retrieval.
- Disk Cache: Stores data on a hard drive or SSD to reduce access times for frequently used files.
- Web Cache: Stores web pages and resources to reduce bandwidth usage and load times.
- Database Cache: Caches query results to speed up database access.
- CPU Cache: A small-sized type of volatile memory that provides high-speed data access to the processor.
Cache Strategies
- Write-Through Cache: Data is written to both the cache and the underlying storage simultaneously.
- Write-Back Cache: Data is written to the cache first and then to the underlying storage at a later time.
- Least Recently Used (LRU): Evicts the least recently accessed items when the cache is full.
- First In First Out (FIFO): Evicts the oldest items first when the cache is full.
- Time-to-Live (TTL): Items in the cache are assigned a lifespan, after which they are automatically removed.
Cache Invalidation
Cache invalidation is the process of removing or updating cached data when it becomes stale or outdated. Common strategies include:
- Manual Invalidation: Explicitly removing or updating cache entries.
- Automatic Invalidation: Using TTL or other mechanisms to automatically remove stale data.
- Event-Driven Invalidation: Invalidating cache entries based on specific events, such as data updates.
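A read-through cache with TTL invalidation can be sketched in a few lines. The clock is injected so expiry is easy to test; the class and method names are illustrative:

```python
# Minimal sketch of a read-through cache with TTL-based invalidation.
import time

class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}  # key -> (value, expires_at)

    def get_or_load(self, key, loader):
        entry = self.store.get(key)
        if entry and entry[1] > self.clock():
            return entry[0]                          # fresh hit: no backend call
        value = loader(key)                          # miss or stale: reload
        self.store[key] = (value, self.clock() + self.ttl)
        return value

now = [0.0]                                          # fake clock we can advance
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
calls = []
loader = lambda k: calls.append(k) or f"value-{k}"   # counts backend hits

assert cache.get_or_load("a", loader) == "value-a"   # miss -> loads
assert cache.get_or_load("a", loader) == "value-a"   # hit -> no load
now[0] = 61                                          # TTL expired
cache.get_or_load("a", loader)                       # stale -> reloads
assert len(calls) == 2                               # backend hit only twice
```

The same shape applies whether the backend is a database query, a downstream service, or a 3rd party API.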
Why you need to know this?
This pattern should be used often. However, you need to make sure you always have an invalidation mechanism in place to avoid serving stale data. Usually the simplest invalidation mechanism is a Time-to-Live (TTL) strategy, where cached data is automatically removed after a certain period; imagine the cache expiring after 2 hours or so.
Backend systems should have a cache to protect against database latency in case of slow queries or high traffic. Even when you call a slow downstream dependency or a slow 3rd party API, you should cache as much as possible.
This pattern not only reduces latency but also improves the user experience. If users see everything happening fast, they will be much happier than if everything is slow. You might argue that the first call will always be slow; for that case, you can pre-warm the cache by making the call before the user needs it.
Connection Pool
A connection pool is a cache of database connections maintained so that the connections can be reused when future requests to the database are required. Connection pools are used to enhance the performance of executing commands on a database. Opening and maintaining a database connection for each user, especially requests made to a dynamic database-driven website application, is costly and wastes resources.
It’s expensive to open and close connections to the database all the time. That’s why these objects are created at application startup and they are borrowed and returned to the pool as needed.
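The borrow/return lifecycle can be sketched with a blocking queue. The `Connection` class is a stub for an expensive real connection; a production system would use a library such as HikariCP (Java) rather than hand-rolling this:

```python
# Sketch of a connection pool: connections are created once at startup
# and borrowed/returned instead of being opened per request.
import queue

class Connection:
    opened = 0
    def __init__(self):
        Connection.opened += 1   # stands in for an expensive TCP + auth handshake

class ConnectionPool:
    def __init__(self, size):
        self.pool = queue.Queue()
        for _ in range(size):            # all connections created at startup
            self.pool.put(Connection())

    def borrow(self, timeout=5):
        return self.pool.get(timeout=timeout)  # blocks if the pool is exhausted

    def give_back(self, conn):
        self.pool.put(conn)

pool = ConnectionPool(size=2)
for _ in range(100):                     # 100 requests reuse the same 2 connections
    conn = pool.borrow()
    pool.give_back(conn)
assert Connection.opened == 2
```

The blocking `borrow` also acts as natural backpressure: when the pool is exhausted, requests wait instead of opening unbounded connections against the database.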
Java has great connection pooling libraries like HikariCP.
Why you need to know this?
You must use this pattern every time a backend system needs to access a database. Connection pools are used to reduce the overhead of establishing a connection each time a database access is required. Because connections are expensive and they must be reused.
Feature Flags
Feature flags are runtime configuration switches that enable or disable functionality without deploying new code. They decouple deployment from release, allowing gradual rollouts and instant rollbacks, migrations and A/B testing.
Strategies
Environment Variables: Simplest approach. Requires restart to change. Good for operational flags.
Configuration Files: Load from files on startup or reload periodically. No restart needed if hot-reloaded.
Database or Cache: Dynamic flags stored in database or distributed cache. Change instantly across all instances.
Feature Flag Services: Dedicated systems like LaunchDarkly, Split, or Unleash. Advanced targeting and analytics.
Targeting and Segmentation
Flags can target based on:
- User ID or email for specific users
- Percentage rollout for gradual releases
- Geographic region or data center
- User attributes like subscription tier
- Environment like staging vs production
- Random sampling for experiments
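A percentage rollout can be sketched by hashing the user id: unlike random sampling per request, the same user always gets the same answer. The flag name and hashing scheme are illustrative:

```python
# Sketch of a percentage-rollout feature flag. Hashing (flag, user) instead
# of sampling randomly keeps each user's experience stable across requests.
import hashlib

def is_enabled(flag_name, user_id, rollout_percent):
    # Stable bucket 0..99 per (flag, user); independent flags get independent
    # buckets because the flag name is part of the hashed input.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# The same user always gets the same answer for the same flag and percentage.
assert is_enabled("checkout.v2", "user-42", 50) == is_enabled("checkout.v2", "user-42", 50)
# 0% and 100% behave as expected for everyone.
assert not is_enabled("checkout.v2", "user-42", 0)
assert is_enabled("checkout.v2", "user-42", 100)
```

Raising the percentage only ever adds users to the enabled set, which is exactly what a gradual rollout needs.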
Anti-Patterns
Long-Lived Release Flags: Release flags should be temporary. Remove after release complete.
Nested Flags: Flags inside flags create exponential complexity. Avoid deeply nested conditions.
Flag Proliferation: Too many flags make system hard to reason about. Be selective.
Using Flags for Configuration: Feature flags are not general configuration. Use proper config systems for that.
Why you need to know this?
Because if you want to do experiments and A/B testing, you will need to use feature flags. Feature flags should:
- Have a clear time to die, i.e., this feature flag will last 3 sprints or 2 weeks.
- Never be nested (make it flat, make it simple).
- Have a highly descriptive name, i.e., sales.report.top.sales.experience.v2. Make sure you name all your feature flags with the same pattern.
- Not be confused with the CORE business (backend) logic or with permanent configuration, i.e., display.theme = dark | light is not a feature flag.
- Be documented in a central place so everyone knows what they are and what they do. A catalog for feature flags is a good idea.
- Be tested. You must have tests that cover the feature flag being both ON and OFF.
Load Balancer
A load balancer (LB) is a service that distributes incoming network traffic across multiple servers so that no single server becomes overwhelmed, improving application availability and responsiveness.
Load balancers can operate at different layers of the OSI model, such as Layer 4 (Transport Layer) and Layer 7 (Application Layer), providing various features like SSL termination, session persistence, and health monitoring of backend servers.
Common Load Balancing Algorithms
- Round Robin: Distributes requests sequentially across the servers.
- Least Connections: Directs traffic to the server with the fewest active connections.
- IP Hash: Uses the client’s IP address to determine which server will handle the request.
- Weighted Round Robin: Similar to Round Robin but allows assigning weights to servers based on their capacity.
- Random: Distributes requests randomly across the servers.
- Metric-Based: Uses specific metrics (like response time or server load) to make load balancing decisions.
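Two of the algorithms above can be sketched in a few lines each. Server names are illustrative, and a real load balancer would also track health checks:

```python
# Sketch of two load-balancing algorithms: round robin and least connections.
import itertools

class RoundRobin:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)   # endless a, b, c, a, b, c...
    def pick(self):
        return next(self._cycle)

class LeastConnections:
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}    # active connection count per server
    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1                 # caller decrements when request ends
        return server

rr = RoundRobin(["a", "b", "c"])
assert [rr.pick() for _ in range(4)] == ["a", "b", "c", "a"]

lc = LeastConnections(["a", "b"])
lc.active["a"] = 5                               # "a" is busy with long requests
assert lc.pick() == "b"                          # traffic goes to the idle server
```

Round robin assumes requests cost roughly the same; least connections adapts when some requests are much slower than others.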
Why you need to know this?
Backend services must have a load balancer in front of them. This is a good pattern because it allows us to make better use of infrastructure resources and improve the availability of our services. If you are using AWS, the default is usually to have an Application Load Balancer in front of your backend services. If you have a lot of traffic, you can also use a Network Load Balancer.
Message Patterns
There are different ways to structure message communication between systems.
Publish/Subscribe Pattern
Publishers send messages to a topic or channel without knowing who will receive them. Subscribers express interest in specific topics and receive all messages published to those topics.
Pub/Sub Nature:
- One-to-Many: A single message can be delivered to multiple subscribers.
- Decoupling: Publishers and subscribers don’t need to know about each other.
- Dynamic Subscription: Subscribers can join or leave at any time.
- Topic-Based: Messages are organized by topics or channels.
Pub/Sub Use cases:
- Real-time notifications across multiple services
- Event-driven architectures where multiple consumers need the same data
- Log aggregation and monitoring systems
Point-to-Point Pattern
Messages are sent directly from one sender to one specific receiver. Messages are typically placed in a queue where they are consumed by a single consumer.
P2P:
- One-to-One: Each message is consumed by exactly one receiver.
- Message Ordering: Messages are typically processed in the order they arrive.
- Load Distribution: Multiple consumers can read from the same queue for load balancing.
- Guaranteed Delivery: Messages remain in the queue until successfully processed.
P2P Use cases:
- Task distribution among worker nodes
- Request-response communication between services
- Sequential processing of transactions
- Background job processing
Request-Reply Pattern
A pattern where a sender sends a message and waits for a response from the receiver. The sender includes a reply-to address in the message so the receiver knows where to send the response. This pattern combines aspects of both synchronous and asynchronous communication, allowing for non-blocking request-response interactions.
Message Ordering
Message ordering guarantees vary by pattern and implementation:
- FIFO Ordering: Messages are processed in the order they are sent.
- Partition Ordering: Messages within the same partition maintain order.
- No Ordering: Messages may be processed in any order for maximum throughput.
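The contrast between the two main patterns can be shown with an in-memory sketch: pub/sub delivers each message to every subscriber, while point-to-point delivers it to exactly one consumer. A real system would use Kafka, SQS, RabbitMQ, or similar:

```python
# In-memory sketch contrasting publish/subscribe with point-to-point.
from collections import deque

class Topic:                         # publish/subscribe: one-to-many
    def __init__(self):
        self.subscribers = []
    def subscribe(self, handler):
        self.subscribers.append(handler)
    def publish(self, message):
        for handler in self.subscribers:   # every subscriber gets a copy
            handler(message)

class PointToPointQueue:             # point-to-point: one-to-one, FIFO
    def __init__(self):
        self.messages = deque()
    def send(self, message):
        self.messages.append(message)
    def receive(self):
        return self.messages.popleft()     # consumed by exactly one receiver

topic, seen_a, seen_b = Topic(), [], []
topic.subscribe(seen_a.append)
topic.subscribe(seen_b.append)
topic.publish("order-created")
assert seen_a == seen_b == ["order-created"]   # both subscribers received it

q = PointToPointQueue()
q.send("task-1")
assert q.receive() == "task-1"                 # only one consumer gets the task
```

The FIFO `deque` also illustrates the ordering guarantee: point-to-point consumers see messages in arrival order.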
Why you need to know this?
Basically there are 2 common scenarios where you want to use such patterns.
- You have multiple services that need to communicate with each other in a decoupled way. In this case you will use the Publish/Subscribe pattern. This pattern allows async communication, and multiple services can listen to the same event. Ideal for event-driven architectures, for instance using Kafka topics or similar messaging systems.
- You have a task that needs to be processed by a single service or worker. In this case you will use the Point-to-Point pattern. This pattern ensures that only one service will process the message. Ideal for task distribution and load balancing, for instance using AWS SQS or RabbitMQ.
Message ID
A Message ID, also known as a CORRELATION_ID, is a unique ID that every message or request must carry and that is passed to downstream services via headers or other means. Whenever you log anything, you also log the MESSAGE_ID.
Why this is important?
- Traceability: When you have a unique MESSAGE_ID for each request, you can trace the entire lifecycle of that request across multiple services. This is especially useful in microservices architectures where a single user action may trigger multiple service calls.
- Debugging: If an error occurs, having a MESSAGE_ID allows you to quickly locate all logs and events related to that specific request. This can significantly speed up the debugging process.
- Monitoring: MESSAGE_IDs can be used to monitor the performance of requests as they pass through different services.
Without a MESSAGE_ID, it becomes impossible to debug distributed systems.
Why you need to know this?
Every time you have services and distributed systems you must implement this pattern. If you are using messaging or kafka you also must implement this pattern. Otherwise it’s impossible to debug issues in distributed systems.
Pagination
When you have an endpoint that returns a large list of items, it is often useful to paginate the results. This means breaking the results into smaller chunks, or “pages”, that can be retrieved one at a time.
The basic benefit here is reducing the amount of data transferred in a single request, which can improve performance and reduce load on the server. Less latency and more responsive applications are the end result.
Pagination Strategies
- Offset-Based Pagination: This is the most common method, where the client specifies an offset (the starting point) and a limit (the number of items to return). For example, GET /items?offset=20&limit=10 would return items 21-30.
- Cursor-Based Pagination: Instead of using an offset, this method uses a cursor (a unique identifier for a specific item) to mark the starting point for the next page of results. For example, GET /items?cursor=abc123&limit=10 would return the next 10 items after the item with the cursor abc123.
- Page Number Pagination: This method uses page numbers to specify which page of results to return. For example, GET /items?page=2&limit=10 would return the second page of results, with 10 items per page.
- Keyset Pagination: This method uses the values of the last item in the current page to determine the starting point for the next page. For example, if the last item on the current page has an ID of 50, the next request might be GET /items?after_id=50&limit=10.
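Keyset pagination can be sketched over an in-memory, id-sorted dataset; in SQL the same query would be `WHERE id > :after_id ORDER BY id LIMIT :limit`. The function name is illustrative:

```python
# Sketch of keyset pagination. Items must be sorted by id for this to work.

def get_items(items, after_id=0, limit=10):
    page = [it for it in items if it["id"] > after_id][:limit]
    next_cursor = page[-1]["id"] if page else None   # client sends this back
    return page, next_cursor

items = [{"id": i} for i in range(1, 26)]            # ids 1..25

page1, cursor = get_items(items, limit=10)
assert [it["id"] for it in page1] == list(range(1, 11))

page2, cursor = get_items(items, after_id=cursor, limit=10)
assert [it["id"] for it in page2] == list(range(11, 21))

page3, cursor = get_items(items, after_id=cursor, limit=10)
assert len(page3) == 5                               # last, partial page
```

Unlike offset pagination, the keyset query stays fast on large tables (no rows to skip) and stays stable when new rows are inserted between page fetches.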
Why you need to know this?
When not to do pagination?
If a system needs to perform analytical transformations for reporting, aggregation, forecasting, or machine learning workloads, we are dealing with a big data scenario that requires a different stack and different solutions; we would not paginate in this case. We would use patterns like CDC (Change Data Capture) to stream data into a data warehouse or data lake, and then use specialized tools to process and analyze the data. Event Sourcing with Kafka plus stream processing (Kafka Streams, Flink, or Spark) could be another way to handle this problem.
When to do pagination?
Every time you have a lot of data. You need to be careful with findAll or SELECT * queries that can return a lot of data at once; they can lead to performance issues, timeouts, and high memory usage. Also, the client cannot make sense of all that data at once, so in those scenarios always use pagination.
Queue
Queue is a data structure that follows the First In First Out (FIFO) principle. Elements are added to the back of the queue and removed from the front.
In distributed systems, queues are often used to manage tasks, messages, or data that need to be processed asynchronously. They help in decoupling different parts of a system, allowing for better scalability and fault tolerance.
Queues use cases are:
- Asynchronous processing: Tasks can be added to a queue and processed by worker nodes at their own pace.
- Load balancing: Distributing tasks across multiple workers to ensure no single worker is overwhelmed.
- Message brokering: Facilitating communication between different services or components in a system.
- Rate limiting: Controlling the rate at which tasks are processed to avoid overwhelming downstream systems.
- Event sourcing: Storing a sequence of events that can be processed later to reconstruct the state of a system.
Common Issues with Queues:
- Message loss: If a queue is not properly configured, messages may be lost during transmission or processing.
- Duplicate messages: In some cases, messages may be delivered more than once, leading to redundant processing.
- Latency: Queues can introduce delays in processing, especially if they become overloaded.
- Backpressure: If the rate of incoming messages exceeds the processing capacity, it can lead to increased latency and potential message loss.
Why you need to know this?
Queues are very useful. In a couple of very common scenarios:
- You want to do asynchronous processing: You can use an internal in-memory queue, or you can use an external queue, depending on your durability needs.
- You want to decouple different parts of your system: Using a queue can help you achieve better scalability and fault tolerance, because if the receiver system is down, the sender can still send a message and the receiver can process it later, making the system more resilient.
- Queues are present in any HTTP server implementation: Behind the scenes, most HTTP servers use queues to manage incoming requests and distribute them to worker threads or processes for handling. Understanding how queues work can help you optimize the performance and scalability of your web applications. Queues allow your systems to “breathe” under high load.
However, you always want to monitor queue depth, arrival rate, and processing rate to make sure your system is healthy. These are good metrics to keep an eye on in any system using queues.
Retry
Retry is a technique used in distributed systems to handle transient failures by attempting an operation multiple times before giving up. This approach is particularly useful in distributed systems, where network issues or temporary unavailability of services can lead to failures that may be resolved with subsequent attempts.
Timeouts
When implementing retries, it is important to set appropriate timeouts for each attempt. A timeout defines how long the system should wait for a response before considering the attempt a failure. Setting too short a timeout may lead to unnecessary retries, while too long a timeout can delay the overall operation.
Progressive Backoff
Progressive backoff is a strategy used to increase the wait time between successive retry attempts. This approach helps to reduce the load on the system and increases the chances of success in subsequent attempts. Common backoff strategies include:
- Fixed Backoff: A constant wait time between retries.
- Exponential Backoff: The wait time increases exponentially with each retry attempt.
- Jitter: Adding randomness to the wait time to prevent synchronized retries from multiple clients.
Thundering Herd Problem
The thundering herd problem occurs when multiple clients simultaneously retry an operation after a failure, leading to a sudden surge in requests that can overwhelm the system. To mitigate this issue, techniques such as jitter in backoff strategies and limiting the number of concurrent retries can be employed.
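The three ideas above (retries, exponential backoff, jitter) combine into a small helper. `sleep` and the random source are injected so the backoff schedule can be inspected; per-attempt timeouts are left out of this sketch for brevity:

```python
# Sketch of retry with exponential backoff and full jitter.
import random
import time

def retry(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
          sleep=time.sleep, rng=random.random):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                    # retries exhausted, propagate the failure
            # Exponential backoff capped at max_delay, scaled by full jitter
            # so a fleet of clients does not retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt)) * rng()
            sleep(delay)

attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient failure")
    return "ok"

slept = []
# rng=1.0 disables jitter here so the backoff schedule is deterministic.
assert retry(flaky, sleep=slept.append, rng=lambda: 1.0) == "ok"
assert len(attempts) == 3
assert slept == [0.1, 0.2]    # exponential: base, then 2x base
```

With the real `random.random` as `rng`, each client sleeps a different fraction of the backoff window, which is exactly the jitter that defuses the thundering herd.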
Why you need to know this?
Every time you make a call, no matter if the service is internal or external, you want to combine retries with timeouts and progressive backoff to make your system more resilient to transient failures. This is especially important in distributed systems, where network issues and service unavailability are common.
Every single call to AWS follows the same principle. The good news is that the AWS SDKs already implement retries with timeouts and progressive backoff for you, so you just need to configure them properly.
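The ideas above (per-attempt timeouts, exponential backoff, and jitter) can be sketched in a few lines. This is a minimal illustration, not production code; `TransientError`, the delay values, and the `flaky` dependency are assumptions for the demo.

```python
import random
import time

class TransientError(Exception):
    """Stands in for a timeout or a temporarily unavailable service."""

def call_with_retry(operation, max_attempts=4, base_delay=0.05, max_delay=1.0):
    """Retry an operation on transient failures, with exponential backoff plus full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            # In a real client, also pass a per-attempt timeout to the call itself
            # (e.g. a socket/HTTP timeout) so one attempt cannot hang forever.
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2x, 4x... capped at max_delay.
            backoff = min(max_delay, base_delay * (2 ** (attempt - 1)))
            # Full jitter spreads clients out, mitigating the thundering herd.
            time.sleep(random.uniform(0, backoff))

# Simulated flaky dependency: fails twice, then succeeds.
calls = {"count": 0}
def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporarily unavailable")
    return "ok"

print(call_with_retry(flaky))  # ok
```

Note how jitter directly addresses the thundering herd problem: without the `random.uniform`, every client that failed at the same moment would retry at the same moment too.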
Webhook
A webhook is just an endpoint in an API or REST server. A webhook is used to notify internal/external services when certain events happen.
When an event occurs, the server makes an HTTP request (usually a POST) to the specified URL of the webhook, sending relevant data about the event.
Webhooks are commonly used for:
- Real-time notifications: Informing external systems about events as they happen.
- Data synchronization: Keeping data in sync between different systems.
- Integrations: Connecting different services and automating workflows.
- Event-driven architectures: Triggering actions in response to specific events.
Webhooks do not require special technology. You can apply them with any backend technology that can handle HTTP requests.
Why you need to know this?
Webhooks are an interesting pattern for two common scenarios:
- You are doing async processing and want to notify another system when the processing is done. Let’s say the other system is external: the webhook prevents the need for polling. The same pattern can be applied to internal systems as well.
- You want to integrate with 3rd party systems that support webhooks to get notified about events happening in those systems. For example, you can use webhooks to get notified about new orders in an e-commerce platform or new issues in a project management tool. Being event-driven is always better than polling.
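As a concrete sketch of the delivery side, the snippet below uses only the Python standard library; the local receiver, the `/hooks/orders` path, and the payload fields are hypothetical. The provider POSTs a JSON event to the subscriber’s registered URL, so the subscriber never has to poll.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # events the subscriber has accepted

class WebhookReceiver(BaseHTTPRequestHandler):
    """Plays the role of the subscriber's endpoint."""
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        received.append(json.loads(self.rfile.read(length)))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging

# Start the subscriber on an ephemeral local port.
server = HTTPServer(("127.0.0.1", 0), WebhookReceiver)
threading.Thread(target=server.serve_forever, daemon=True).start()

def notify_webhook(url, event):
    """Provider side: POST the event payload to the registered URL."""
    body = json.dumps(event).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status

url = f"http://127.0.0.1:{server.server_port}/hooks/orders"
status = notify_webhook(url, {"event": "order.created", "order_id": "hypothetical-123"})
print(status, received[0]["event"])  # 200 order.created
server.shutdown()
```

A real provider would also sign the payload (e.g. an HMAC header) and retry failed deliveries, which is where the Retry pattern above comes back into play.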
Why Tools?
An architect needs a couple of tools. I thought a lot about writing this section because this can easily get deprecated since tools keep evolving, but then I decided to talk about the essence of tools we need and give some current recommendations.
An architect does not need many tools, but for engineering, we need lots of tools. For architecture, we need a couple of tools for:
- Diagramming
- Documentation
- Thinking
Diagramming
Architecture must be visual; it’s important to have visual diagrams. In my humble opinion, there are at least 3 diagrams that are very, very useful for architects:
- Overall Architecture Diagram: You can see the big picture, you see how services talk to each other, you see the main components.
- Class Diagram: Can be useful for 2 common scenarios. A) Internal modeling of a system or B) Modeling tables in a database.
- Sequence/state diagrams: Useful to understand how a system works internally, how data flows through the system. I would not have hundreds of these, the goal here is not to have one per UI. The goal is to have them for the most important and complex things.
The best paid diagram tool is Lucidchart. The best free diagram tool is Diagrams.net.
Why you need to know this?
Visual diagrams help us see what is not obvious. A picture is worth a thousand words. Software is complex, and the problems we deal with in distributed systems at scale are very hard to talk about. Drawing boxes and arrows allows us to put everybody on the same page and actually make sense of things.
It does not matter if you follow or do not follow UML standards. The goal is to make the invisible into something visible. You can only do code review because you see code in front of you; how do you do architecture and design review if there is nothing in front of you? So diagrams are the way to go.
Writing
Architects must write. I have used a variety of writing tools like Evernote, Grammarly, Google Docs, Markdown files, MS Word, and many others. Today, the best tool for me is Markdown files.
Markdown files are simple and lightweight, and they work with any AI grammar tool. They are also universal and portable. In my humble opinion, I do not have the perfect writing tool yet. I might need to build my own one day, but I think less is more. You need to focus on your ideas, so fewer distractions are key.
Architects need to write:
- Principles
- Guidelines
- Trade-offs
- Decisions
When you write something down, you are dumping your brain and therefore doing the same as an engineer when they fire up a PR. You are allowing others to review. Written documents scale, because many people can read them, and they might last forever.
Why you need to know this?
This is probably the most non-obvious thing. Back to code review: you can only review because someone wrote code. How do you review architecture? Well, in part it can be done with diagrams, but you can’t do 100% of all things with diagrams, because architecture also drives:
- Technical Strategy
- Decisions
- Guidance to engineers
- Standards and practices
Even architecture and design can only be best captured by writing. When you write, again we have something to review, to socialize. Architects must write; otherwise, how do they review what they do? How do they convey their ideas? How do they socialize their vision? Writing also has a nice property: it enables scaling, because many people can read your writing, and you can reach a larger audience.
Thinking
An architect needs a thinking toolbox. In my humble opinion, you can use the same tools you use for Diagramming and Writing. POCs are a great laboratory for thinking as well. I used a lot of mind maps in the past, but I find that nowadays I prefer to use plain text files with a good structure (headings, lists, code blocks, etc.).
Markdown can be a great tool for thinking too, but you need a diagram tool to sketch some boxes and arrows. Without coding and experimentation, thinking is just an abstract exercise. When you know the systems and what they can do, then you can think about alternatives. In my humble opinion, thinking is about exploring alternatives, exploring different trade-offs and different ways to solve or model a problem. Thinking is about being creative and open-minded.
There is a difference when you are documenting what you solved already and that’s for review or for engineers to implement. Versus thinking about how to solve a problem that is not solved yet. In the first case you are more focused on clarity and precision, in the second case you are more focused on exploration and creativity.
Why you need to know this?
Because architecture is not just “choosing something”. It’s a deep analytical and critical thinking process. It’s very easy to just dump more software. Lateral thinking requires practice and discipline. Architects need to think, otherwise they are just engineers following instructions from product.
Thinking is a must for architects. It must happen, otherwise the role of the architect is not effective. However, nobody will tell you to think; actually the opposite: they will pressure you to deliver, whether you think or not. So you better master the art of thinking, the art of trade-offs, the art of doing POCs, and the art of making good decisions based on your thinking.
In The End
Great song. There is no end. Being an architect is a never-ending journey. Software never stops changing, learning never stops. As an architect, you must become a master student so you can become a master teacher.
Teaching is all that matters. The more people know, the fewer bad decisions are made. The more bad decisions are avoided, the better software we all create. There is no winning without learning. The journey does not end here.
How to keep learning?
- Keep reading books, articles, blogs.
- Attend conferences, meetups, webinars.
- Teach what you learn, write articles, give talks.
- Read code, explore open source projects.
- Research new technologies, experiment with them.
- Socialize with other architects, share experiences, ask for feedback.
Final Thoughts
Architecture is hard. It is a battle that must be fought every day. There is no done, there is no resting, so you cannot do all that without love. If you do not have love and passion for learning and teaching, you cannot be a good architect.
The good news is that you can be inspired and you can be a different person tomorrow. To create better architecture, you must become a better architect. Thinking is cheap but is hard at the same time. So spend more time thinking and experimenting.
One last piece of advice: do not forget about output. If you just read, that is not good enough. You need to put it out, so others can review, so you can see if you really know and if it is really good.
Architecture is all about decisions, it’s all about trade-offs, so you need to be sure that you are making the right decisions. The only way to be sure is to keep learning and teaching.
Resources
Diego Pacheco’s Books
Here is a curated list of books that will help you become a better version of you.
- The Art of Sense: A Philosophy of Modern AI
- Diego Pacheco’s Software Architecture Library
- Principles of Software Architecture Modernization
- Continuous Modernization
- Tech Resources
- Building Applications with Scala
Want to help me?
Consider buying one of my paid books:
- Principles of Software Architecture Modernization
- Continuous Modernization
- Building Applications with Scala
I also have FREE books:
- The Art of Sense: A Philosophy of Modern AI
- Diego Pacheco’s Software Architecture Library
- Tech Resources
Recommended Books
Here is a curated list of books that will help you become a better architect.
- Principles of Software Architecture Modernization
- Continuous Modernization
- Implementation Patterns
- Design Patterns: Elements of Reusable Object-Oriented Software
- Enterprise Integration Patterns
- Working Effectively with Legacy Code
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
- The Pragmatic Programmer
- A Philosophy of Software Design, 2nd Edition
External Links
For a comprehensive collection of blog posts, articles, technical documentation, and external resources, please see References.
How I wrote this book?
I have been blogging since 2007.
Every page in this book has one or more links to blog posts I wrote in the past.
I wrote this book in a very different way compared with my 3 previous books, which were written in a formal way. Let me explain the “formal process” and how it usually worked for me:
- You need to write a proposal, the proposal gets debated and approved, and then you write.
- Formal books have length requirements, usually around 300 pages.
- Formal books, once approved, are waterfall and have several phases.
- Once you deliver a chapter, there is an English reviewer.
- After the English reviewer, there is the technical reviewer.
- After that there is copy-editing, indexing, layout, and finally printing.
- The traditional process takes from 7 to 12 months.
- I wrote books alone and with other people; the more people you have, the more coordination you need, the longer it takes, and the more things can go wrong. It’s literally no different from a project.
I wanted a different experience, so I did several things differently here. I’m not saying I would never do a traditional book again, but this is for sure different, and there are some things here I like a lot, for instance:
- Because I used mdBook, the book is written with a markdown-based tool built in Rust.
- mdBook has 3 killer features for me:
  - It has a built-in search engine, and a very good one.
  - It provides a direct link to all pages of the book; every page has a unique URL.
  - It has a built-in way to generate code snippets with syntax highlighting, videos, and themes.
- The book is hosted on git, meaning I have version control over all the changes of the book. Want to see what I did differently? Just use git.
- If I want to say something different, with a traditional book I would need to write a new book and people would need to buy it to read it. Here I just do a git push and it’s live, because I have a workflow with GitHub Actions that publishes the book to a GitHub Pages site.
- It’s also a way for me to give back for free.
What tools did I use?
I basically used VSCode to write the book. I used GitHub Copilot and Claude Code.
I did not use AI to generate the entire book. The book is mine; I wrote all the content, but I used AI to generate the following:
- Index
- Glossary
- References
- Spell check and proofreading of my English (fixing typos and grammar issues, never writing whole paragraphs).
I used Claude Code custom commands to do all these tasks. I created a book-all custom command that automated all those workflows:
book-all.md:

```markdown
## Perform several Tasks to publish my book
- Read all markdown files
- Perform the following tasks
## Task 1 - # Create or Update my Glossary
- My glossary is on a GLOSSARY.md
- Make sure my glossary is up to date
## Task 2 - # Create or Update my References
- My references are in REFERENCES.md
- Make sure my external references/links are up to date
## Task 3 - # Create or update my book index
- Index is on a file INDEX.MD
- Make sure my index is up to date
## Task 4 - # Create or update book CHANGELOG.md
- Read commits from git history
- Make sure the changelog has meaning
- Only look for markdown files, ignore *.html.
## Task 5 - # Fix my english
- Fix all my english issues
- Fix my typos
- Don't touch the HTML files only the markdown files
- Only fix english or grammar mistake, dont change my words or writting style
- Make sure you dont break anything, make sure you dont loose content
## Task 6 - # Make sure you did not lost content
- You cannot loose content
- Make sure you did not broke links
- Make sure all content is there
- Make sure you did not delete anything wrongly
```
Running this custom command uses on average ~70k tokens. So I use AI for the boring and repetitive tasks, not to write the book itself. When I ran out of tokens on Claude Code, I would fall back to GitHub Copilot.
CI/CD
This book was written with CI/CD in mind from day one. I have a script called bump-version.sh that bumps the version of the book in a file in the root called “VERSION”. When I released the book, it had ~100 pages at version 1.0.0. During the first week of the book, I released content every day. For that first week I did 6 releases: 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.0.5, and 1.0.6. Each release had new content. I also used AI to generate a CHANGELOG.md file so you can track my changes. After release 1.0.5, the book has 132 pages.
This is a killer feature because I can keep releasing new content, in a very lean/agile way, directly to you, the reader.
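A script like bump-version.sh presumably just increments the patch component of the semver string stored in the VERSION file. A minimal sketch of that logic (the function name is mine, and the real script may well differ):

```python
def bump_patch(version: str) -> str:
    """Increment the patch component of a semver string, e.g. 1.0.5 -> 1.0.6."""
    major, minor, patch = (int(part) for part in version.strip().split("."))
    return f"{major}.{minor}.{patch + 1}"

print(bump_patch("1.0.5"))  # 1.0.6
```

The actual script would read the VERSION file, write the bumped value back, and let the release workflow pick it up from there.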
Did you like my work? Did it help you?
If you like my work, you can help me by buying one of my other books here:
- Continuous Modernization
- Principles of Software Architecture Modernization
- Building Applications with Scala
You can also help me by sharing this book on social media: X, LinkedIn, Facebook, Reddit, Hacker News, or any other place you like.
Changelog
All notable changes to this project will be documented in this file.
[1.0.8] - 2025-11-28
Added
- PROTECT_YOUR_TIME.md: Added Gemini 3 Banana Pro illustration for the Protect Your Time chapter
Changed
- INDEX.md: Corrected spelling, updated naming (Practices, Message ID), added How I Wrote the Book, and refreshed tool labels
- GLOSSARY.md, REFERENCES.md, and other markdown files: Minor grammar and consistency fixes across terms and navigation content
- TECH_DEBT_PLAGUE.md: Clarified wording around tests and Jira references
[1.0.7] - 2025-11-06
Added
- JWT.md: Comprehensive documentation on JSON Web Tokens including structure, flow, and security best practices
- GLOSSARY.md: Added JSON Web Token (JWT) entry with definition and explanation
- INDEX.md: Added JWT entry in alphabetical index and Chapter 6 concepts section
- REFERENCES.md: Added JWT.io debugger tool and Wikipedia JWT reference
[1.0.6] - 2025-11-02
Added
- HOW_I_WROTE_THE_BOOK.md: Behind the scenes of the book creation process
Changed
- GLOSSARY.md: Added Business Logic, Discovery Work, and Experimentation terms
[1.0.5] - 2025-10-31
Added
Pattern files - Added “Why you need to know this?” sections with practical guidance:
- WEB_HOOK.md: Added section explaining use cases for async processing notifications and 3rd party integrations, emphasizing event-driven architecture over polling
- RETRY.md: Added section on implementing retries with timeouts and progressive backoff for resilience, with AWS SDK examples
- QUEUE.md: Added section covering three scenarios: async processing, system decoupling, and HTTP server request management, plus monitoring guidance for queue depth and processing rates
- PAGINATION.md: Added section distinguishing when to use pagination vs big data patterns like CDC, Event Sourcing with Kafka, addressing findAll and Select * anti-patterns
- MESSAGE_PATTERNS.md: Added section explaining Publish/Subscribe for event-driven architectures and Point-to-Point for task distribution
- MESSAGE_ID.md: Added section emphasizing necessity for distributed systems and messaging/Kafka implementations
- LB.md: Added section explaining load balancer requirements for backend services with AWS ALB/NLB examples
- FEATURE_FLAGS.md: Added section with best practices including time-to-live, naming conventions, avoiding nesting, testing requirements, and catalog documentation
- CONNECTION_POOL.md: Added section on database connection pooling necessity and connection reuse benefits
- CACHE.md: Added section on TTL invalidation strategies, protecting against database latency, and cache pre-warming techniques
- BFF_PATTERN.md: Added section on TypeScript/NodeJS/Deno/Bun implementations, emphasizing separation of rendering logic from business logic
- API_GATEWAY.md: Added section covering three scenarios: public-facing APIs, cross-cloud communication, and monolith-to-microservices migrations
References and resources:
- REFERENCES.md: Added Grammarly under Writing Tools section
- REFERENCES.md: Added Figma under Design Tools section
- REFERENCES.md: Added V Language, Gleam, Deno, and Bun homepages under Language & Runtime Homepages section
- REFERENCES.md: Added AWS re:Invent 2022 keynote with Dr. Werner Vogels under Videos & Media section
- REFERENCES.md: Added Mark Zuckerberg videos on product strategy and fast learning cycles under Videos & Media section
Changed
Grammar and spelling corrections across multiple files:
- introduction.md: Changed “guidance about Software Architecture” to “guidance on Software Architecture”
- DEFENSIVE.md: Fixed typo “ProducService” to “ProductService” in Scala code example
- WORKING_ON_TRENCHES.md: Changed “frontend line” to “front line”
- OPLOCKING.md: Improved grammar in React re-rendering scenario explanation
- API_GATEWAY.md: Fixed “todo” to “to do”, “Comunication” to “Communication”, “departaments” to “departments”
- BFF_PATTERN.md: Fixed “mmobile” to “mobile”, “pattrern” to “pattern”
- CACHE.md: Fixed “simmple” to “simple”, “spires” to “expires”, “now only” to “not only”, “amd” to “and”
- CONNECTION_POOL.md: Changed “everything a backend system need access” to “every time a backend system needs to access”
- FEATURE_FLAGS.md: Fixed “feaure” to “feature”, “experrience” to “experience”, “buisness” to “business”, “themme” to “theme”, “falgs” to “flags”, “testd” to “tested”
- LB.md: Fixed “resorcues” to “resources”
- MESSAGE_ID.md: Improved section header formatting from “Why this is important?” to proper header
- MESSAGE_PATTERNS.md: Fixed “secnario” to “scenario”
- PAGINATION.md: Fixed “forecasrting” to “forecasting”, “learnign” to “learning”, “Everything” to “Every time”
- QUEUE.md: Fixed “reciver” to “receiver”, “needss” to “needs”, “a good metrics” to “good metrics”
- RETRY.md: Fixed “Everytime” to “Every time”, “Everysingle” to “Every single”
[1.0.4] - 2025-10-30
Added
Dilemmas - Added “Why you need to know this?” sections:
- DISCOVERY_VS_DELIVERY.md: Added section emphasizing learning speed as competitive advantage with Mark Zuckerberg product strategy reference, fixed TikTok embed iframe for proper display
- MOVE_FAST_VS_DO_IT_RIGHT.md: Enhanced content on balancing speed vs quality in product development
- DECIDE_OR_WAIT.md: Added section on cost of delay analysis, experimentation strategies, and importance of Blameless Feature Reviews for learning from decisions
- BUILD_VS_BUY.md: Added section on collaboration with product/UX, conducting trade-off analysis, and socializing decisions with stakeholders
Anti-patterns - Added “Why you need to know this?” sections:
- TECH_DEBT_PLAGUE.md: Added section with strategies to fight tech debt including incremental improvements, team education, pushing back on bad practices, and explaining costs to management
- STAGNATION.md: Added section on fighting auto-pilot mode through passion, continuous learning, conferences, knowledge sharing, experimentation, and POCs
- REQUIREMENTS.md: Added section on handling requirements through product collaboration, UX understanding, spike techniques, production validation, experiments, and industry research
- IGNORE_CULTURE.md: Enhanced content emphasizing architect’s role as leader and teacher in setting example for caring and action-oriented behavior
Changed
- GLOSSARY.md: Added terms including Blast Radius, Blameless Feature Reviews, Learning Cycles, OpenSearch/Elasticsearch, and Spike
[1.0.3] - 2025-10-29
Added
Concepts - Added “Why you need to know this?” sections:
- STATELESS_VS_STATEFULL_SVC.md: Added comprehensive content on service architecture patterns
- SOURCE_OF_TRUTH.md: Added section explaining importance in distributed systems, addressing write vs read patterns, and avoiding data inconsistencies
- SCHEMA_EVOLUTION.md: Added section on backward vs forward compatibility strategies, emphasizing backend systems as source of truth and rollback considerations
- PARTITION.md: Added section explaining when partitions become necessary at scale and for rapidly growing datasets
- OPLOCKING.md: Added section comparing pessimistic vs optimistic locking use cases, including React re-rendering scenario and high/low contention environments
- IDEMPOTENCY.md: Added section on REST API design requirements for GET/HEAD/OPTIONS/TRACE/PUT methods and code complexity implications
Changed
- BASE.md: Enhanced with additional details and clarity on eventual consistency patterns
- AUTHENT.md: Enhanced with comprehensive authentication and authorization coverage
- ACID.md: Enhanced with detailed transaction property explanations
[1.0.2] - 2025-10-28
Changed
- Enhanced ANTI-FRAGILITY.md with additional content
- Expanded OSS.md with more open source insights
- Extended PROTECT_YOUR_TIME.md with additional time management content
- Enhanced SO.md with more service orientation practices
- Updated CRYSTAL_BALL.md with additional insights
- Improved STAGNATION.md anti-pattern documentation
- Expanded GLOSSARY.md with more architectural terms
Added
- CHANGELOG.md file to track changes and updates
[1.0.1] - 2025-10-27
Added
- Chapter 0 (Zero) with rationale and reasoning behind the book
- Chapter 2 (Anti-Patterns): TECH_DEBT_PLAGUE, IGNORE_CULTURE, STAGNATION, REQUIREMENTS
- Chapter 3 (Dilemmas): DISCOVERY_VS_DELIVERY, MOVE_FAST_VS_DO_IT_RIGHT, BUILD_VS_BUY, DECIDE_OR_WAIT
- Chapter 4 (Properties): ANTI-FRAGILITY, STATE-OF-THE-ART, SCALABILITY, OBSERVABLE, STABILITY, SECURE
- Chapter 7 (Patterns) as separate section from Concepts
- Chapter 8 (Tools): DIAGRAMING, WRITING, THINKING
- Chapter 9 (Epilogue): IN_THE_END, RESOURCES, REFERENCES, GLOSSARY, INDEX
- CONNECTION_POOL pattern documentation
- WHY sections across all chapters explaining rationale
- GitHub Pages deployment workflow
- Glossary with architectural terms and definitions
- Index for topic navigation
- Resources section with books and learning materials
- References section with external links
Changed
- Reorganized book structure into numbered chapters
- Separated Concepts and Patterns into distinct chapters
- Moved pattern files from /concepts to /patterns directory
- Enhanced defensive programming content
- Expanded crystal ball philosophy
- Improved English grammar and readability
- Updated table of contents and navigation
Fixed
- Page rendering issues
- Documentation structure inconsistencies
[1.0.0] - 2025-10-27
Added
- Initial project setup with mdBook
- Introduction and cover page
- Philosophy section: CRYSTAL_BALL, DEFENSIVE, DOING_HARD_THINGS, FRONTEND_VS_BACKEND, OSS, SO
- Practices section: ATTENTION_TO_DETAIL, ARCH_REVIEW, DESIGN_FIRST, OWNERSHIP, READING_CODE, MONTHLY_REVIEW, WORKING_ON_TRENCHES
- Concepts and Patterns section: ACID, API_GATEWAY, AUTHENT, BASE, BFF_PATTERN, CACHE, CONNECTION_POOL, FEATURE_FLAGS, IDEMPOTENCY, LB, MESSAGE_PATTERNS, MESSAGE_ID, OPLOCKING, PAGINATION, PARTITION, QUEUE, RETRY, SCHEMA_EVOLUTION, SOURCE_OF_TRUTH, STATELESS_VS_STATEFULL_SVC, WEB_HOOK
- Build scripts and documentation workflow
- VERSION tracking system
- Theme customization
- .gitignore configuration
References
A comprehensive collection of external references, blog posts, articles, and technical resources that support and expand upon the concepts covered in this architecture library.
Diego Pacheco’s Resources
Main Platforms
- Main Blog
- Substack Newsletter
- Medium
- X (Twitter)
- Bluesky
- YouTube Tech Channel
- Amazon Author Profile
Tiny Essays
Side Projects
- Tupi Lang - A programming language written in Java 23
- Jello - Vanilla JS web APIs, Trello-like application
- Zim - A Vim-like editor written in Zig 0.13
- Gorminator - A simple and dumb Linux terminal written in Go
- Kit - A Git-like tool written in Kotlin
- Shrust - A compression/decompression tool written in Rust
- Smith - A security agent written with Scala 3.x
- ZOS - A very tiny OS written in Zig
- Tiny Games - A collection of JavaScript games
Architecture & Design
Architecture Documentation
Patterns & Best Practices
Core Architecture Principles
- Design is Not Subjective
- Architecture 101: Thinking About Design
- Design 101
- Internal System Design (Forgotten Art)
- UML Hidden Gems
- SOA, Microservices, and Isolation
- Double Down on Service Orientation
- It’s All About Contracts
- Leaky Contracts
- The Death of Microservices (Distributed Systems)
- Distributed Monolith
- Frontend Backend Distributed Monolith
- Legacy Systems and Distributed Monoliths
- DevOps Monolith
- S3 Next Distributed Monolith
- Trade-offs in Architecture
- State
- Services
- Service Chain
- Embedded Mocks and Hidden Contracts
Properties & Quality Attributes
- The Stability Mindset
- DevOps is About Anti-Fragility (Not Only)
- Observability: Domain, Observability from Code
- Proper Error Handling
- Null Validations and Exceptions
- Why Encryption is So Hard
- Tokenization, Encryption and Security
Code Review & Analysis
- Team Code Review and Design Sessions
- Getting the Best Out of GitHub with Remote Teams
- Lessons Learned and Experiences Doing Code Review
- Beyond Code Deltas
- Training Like Pros (Reading Code)
- Code Analysis at Scale
Technical Debt & Anti-Patterns
- Breaking the Debt Cycle
- Tech Debt First
- Don’t Outsmart Hard Choices - Why
- Quality Needs to be Managed
- Thoughts on Internal Complexity
- Ignoring Culture
- Requirements Are Dangerous
- Fighting Complexity
- Continuous Refactoring
- Refactoring: Making Sense of Invisible Changes
Practices & Decision Making
- Being an Active Architect
- Lean & Deming: The New Culture
- Multi-Track Agile with TTA (Tracking Test Automation)
- The Roads Approach (Balancing Speed and Quality)
- Architects as Gatekeepers
- Extreme Ownership & Expectations
- It’s All About Attention (Detail)
- Why Writing Matters
- Going Faster with Testing
- The Monk and Rockstar
- Shallow vs Deep Mindset
- Meetings Cost
Build vs Buy
Security
- Securely Storing Passwords in a Database
- AWS RDS IAM Authentication
- AWS KMS
- Have I Been Pwned
- GitOps
Reference Documentation
C2 Wiki
Wikipedia
- ACID
- Authentication
- Authorization
- Idempotence
- JSON Web Token
- Lock (Computer Science)
- Optimistic Concurrency Control
- Partition (Database)
- Schema Evolution
- Single Source of Truth
- Service Statelessness Principle
- Defensive Programming
- API Management
- Cache (Computing)
- Connection Pool
- Feature Toggle
- Load Balancing
- Message-ID
- Messaging Pattern
- Message Queue
- Webhook
- Spike (Software Development)
AWS Resources
Other Technical Resources
Tools
Diagramming Tools
Writing Tools
Design Tools
Language & Runtime Homepages
Videos & Media
- AWS re:Invent 2022 - Keynote with Dr. Werner Vogels
- Mark Zuckerberg on Product Strategy
- Mark Zuckerberg on Fast Learning Cycles
Glossary
A comprehensive glossary of technical terms, concepts, patterns, and methodologies covered in this architecture library.
A
A/B Testing An experimental technique to gather data and make informed decisions by comparing two versions of a system. Often implemented using feature flags and experimentation platforms.
Agile Methodology Development approach emphasizing iterative development, team collaboration, and responsiveness to change. Balances discovery and delivery through continuous feedback cycles.
Akka A high-performance toolkit for building concurrent, distributed, and resilient message-driven applications on the JVM, supporting actor-based concurrency.
AWS KMS (Key Management Service) Amazon Web Services managed service for creating and controlling encryption keys used to encrypt data across AWS services and applications.
AWS RDS (Relational Database Service) Amazon Web Services managed relational database service supporting multiple database engines including PostgreSQL and MySQL.
Architecture Decision Record (ADR) A document recording important architectural decisions, the rationale behind them, consequences, and alternatives considered. Essential for communicating and reviewing architecture.
Availability A system quality ensuring that services are accessible and operational to users. A key requirement of distributed systems and emphasized in BASE consistency model.
Access Control Lists (ACLs) A list-based authorization model that explicitly specifies which users or systems can access particular resources.
ACID Atomicity, Consistency, Isolation, Durability - a set of properties that guarantee database transactions are processed reliably even in case of errors or power failures.
Anti-Fragility A system property where systems not only withstand shocks and stressors but also improve and grow stronger as a result of them.
API Gateway A pattern that acts as a single entry point for microservices, handling request routing, response aggregation, authentication, logging, and rate limiting.
API Keys An authentication method where a unique key is assigned to clients for identifying and authenticating API requests.
Architecture Review The practice of documenting architecture decisions, trade-offs, and principles for communication and improvement.
Attention to Detail A practice of meticulous examination of code quality, tests, production logs, dashboards, and security audits.
Authentication The process of verifying the identity of a user or system to ensure the entity requesting access is who they claim to be.
Authorization The process of determining what an authenticated user or system is allowed to do, defining permissions and access levels for different resources.
B
Backpressure A situation where incoming message rate exceeds processing capacity, leading to latency and potential message loss.
Blue-Green Deployment A deployment strategy where two identical production environments are maintained, with traffic switched from the blue environment to the green environment, enabling instant rollback if issues occur.
Backward Compatibility When new code can read data written by old code, requiring new services to provide defaults for missing fields.
BASE Basically Available, Soft state, Eventually consistent - a consistency model used by many NoSQL databases that prioritizes availability over immediate consistency.
Backend for Frontend (BFF) A pattern creating separate backend services tailored to specific frontend applications’ needs, typically containing rendering logic.
Big Ball of Mud An anti-pattern where a system has no recognizable architecture, with haphazardly structured code that is difficult to maintain and extend.
Biometric Verification An authentication method using unique biological characteristics such as fingerprints, facial recognition, or iris scans to verify identity.
Blast Radius The scope of impact when a failure or security breach occurs in a system, limited through practices like least privilege principle and proper isolation.
Blameless Feature Reviews A practice of reviewing features without blaming individuals, focusing on learning and improvement rather than fault-finding.
Build vs Buy Dilemma The decision between building solutions in-house for core business advantages or buying external solutions.
Build Status The state of the automated build process, which should consistently pass in stable systems as an indicator of code quality.
Business Logic The core domain logic implementing business rules and operations, distinguished from rendering logic which prepares data for presentation.
C
Canary Release A deployment strategy where new code is gradually rolled out to a small percentage of users first, allowing observation of behavior and issues before full deployment. Often enabled through feature flags.
Code Review A practice where engineers examine and critique code changes before they are merged, enabling knowledge sharing, quality assurance, and identification of potential issues.
Cache A software component storing data for faster future access, commonly used in web browsers, databases, and operating systems.
Capability Oriented Services A principle that services should be organized around business capabilities rather than technical layers, promoting better alignment with business goals.
Cassandra A distributed NoSQL database designed for handling large amounts of data across many commodity servers, providing high availability with no single point of failure.
Chaos Engineering A practice of deliberately injecting failures into systems and infrastructure to prove system tolerance and recovery.
Circuit Breaker A design pattern that prevents cascading failures in distributed systems by detecting failures and temporarily stopping requests to failing services, allowing them time to recover.
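As a rough sketch (class and threshold names are illustrative, not from the book), a circuit breaker can be reduced to a small state machine in Python: count consecutive failures, fail fast while "open", and let a trial call through after a cooldown:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: opens after N consecutive failures,
    half-opens after a cooldown so one trial request can close it again."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # cooldown elapsed: half-open, allow one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0      # success closes the circuit again
        self.opened_at = None
        return result
```

Production implementations (Resilience4j, Hystrix) add per-call timeouts, metrics, and half-open request budgets on top of this core idea.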
Clojure A modern, functional Lisp dialect running on the JVM, emphasizing immutability and designed for concurrent programming.
Conceptual Drift An anti-pattern where systems gradually deviate from the intended architecture over time, losing coherence and consistency.
Conceptual Integrity A principle emphasizing consistency and coherence in system design, ensuring all parts work together harmoniously under a unified vision.
Community of Practice A group of people who share a common interest or profession and learn from each other through regular interaction, knowledge sharing, and collaboration on common problems.
Class Diagram A UML diagram useful for internal system design modeling or database table modeling.
Connection Pool A cache of database connections maintained so connections can be reused for future database requests, reducing the cost of opening/closing connections.
Core Domain The central, most important part of a business that provides competitive advantage and should receive the most focus, investment, and highest quality architecture and implementation.
Consistency The property that data in a system is in a valid state and follows all defined rules and constraints. Can be immediate as in ACID transactions or eventual as in BASE consistency model.
Consistent Hashing A partitioning technique for distributed systems ensuring stable distribution when nodes are added/removed.
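A minimal sketch of the idea (virtual-node count and naming are illustrative): each node is hashed onto a ring many times, a key belongs to the first node clockwise from its hash, and removing a node only remaps the keys in that node's arc:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes. Keys map to the first node
    clockwise from their hash, so membership changes remap few keys."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for (h, n) in self._ring if n != node]

    def get(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, ""))
        if idx == len(self._ring):
            idx = 0  # wrap around the ring
        return self._ring[idx][1]
```

Cassandra and DynamoDB-style stores use this scheme so that scaling the cluster does not reshuffle the whole keyspace.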
Continuous Refactoring The ongoing process of improving code structure and design without changing external behavior, essential for maintaining code quality and preventing technical debt accumulation.
Contract-Driven Design An approach where services are designed around their contracts first, ensuring well-defined boundaries and compatibility before implementation.
Cost of Delay A Lean concept measuring the cost of waiting versus deciding immediately.
Crystal Ball (Foresight) A practice of thinking ahead and predicting future changes to architecture, allowing preparation for upcoming needs.
Credential Rotation The process of periodically changing credentials and keys to reduce the risk from long-term exposure if a credential is compromised.
Cursor-Based Pagination A pagination method using a cursor (unique identifier) to mark the starting point for the next page.
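A toy sketch of cursor paging over an in-memory list (the data and function names are illustrative): the cursor is the last id the client saw, and each request returns the items after it:

```python
# Items ordered by a unique, sortable id; the cursor is the last id seen.
ITEMS = [{"id": i, "name": f"item-{i}"} for i in range(1, 26)]

def page_after(cursor=None, limit=10):
    """Return (page, next_cursor). Pass next_cursor back to fetch the
    following page; a None cursor means there are no more items."""
    if cursor is None:
        remaining = ITEMS
    else:
        remaining = [it for it in ITEMS if it["id"] > cursor]
    page = remaining[:limit]
    next_cursor = page[-1]["id"] if len(remaining) > limit else None
    return page, next_cursor
```

In a real database the `id > cursor` filter becomes an indexed `WHERE` clause, which is why cursor paging stays fast at any depth, unlike large offsets.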
D
Deadlock A situation in concurrent systems where two or more processes are unable to proceed because each is waiting for the other to release a resource, resulting in a permanent blocking state.
Docker A platform for developing, shipping, and running applications in containers, providing lightweight virtualization and consistent environments across different systems.
DynamoDB Amazon Web Services' fully managed NoSQL database service designed for high-performance applications requiring consistent, single-digit millisecond latency at any scale.
Database Migration The process of modifying database schema to support application changes, often requiring special care to execute without downtime.
Data Migration The process of transferring data from one system to another while maintaining integrity and minimizing downtime.
Data Synchronization The process of keeping data consistent across multiple systems.
Deployment Strategy The planned approach for releasing code to production, including methods like canary releases, blue-green deployments, and zero-downtime deployments to minimize risk.
Deployment Success The rate at which deployments to production complete without errors or rollbacks, indicating system reliability and confidence in release processes.
Design Document A living document that captures architectural design, including diagrams, key decisions, rationale, trade-offs, and principles. Serves as communication tool and basis for review and feedback.
Discovery Work Early phase of project work focused on understanding problems, exploring solutions, and validating assumptions through research, prototyping, and testing before committing to full implementation.
Deep Work Focused, uninterrupted time that architects must protect for research, thinking about trade-offs, evaluating solutions, and reading code, typically 3-6 hours at least 3 times per week.
Decide or Wait Dilemma The decision of whether to make timely decisions or wait for more information.
Defensive Programming A design approach emphasizing anticipating and handling potential errors or unexpected inputs to make code resilient.
Design First A practice of producing the design before implementation, documented and communicated to the team.
Diagrams.net A free, open-source diagram tool for creating architecture diagrams, flowcharts, and technical documentation visuals.
Discovery vs Delivery Dilemma The balance between exploring what to build (discovery) and executing the build (delivery).
Distributed Monolith An anti-pattern where a system is decomposed into services but remains tightly coupled, losing the benefits of distributed architecture while gaining its complexity.
Durability The guarantee that once data is committed, it will not be lost even in case of system failures. A core component of ACID transactions, typically achieved using Write-Ahead Logs.
Distributed Systems Computer systems where components are located on different networked computers that communicate and coordinate to achieve a common goal, presenting unique challenges in consistency, availability, and partition tolerance.
E
Encryption (In Transit and At Rest) Security measures to protect data during transmission and while stored.
Envelope Encryption A security practice where data is encrypted with a data encryption key, and that key is then encrypted with a master key, providing an additional layer of security and key management flexibility.
Error Tracking Systematic monitoring and recording of exceptions and errors in production, with the goal of achieving zero exceptions. Critical for identifying and fixing issues.
Event-Driven Architecture An architectural style where components communicate asynchronously through events, enabling loose coupling and supporting real-time data distribution across systems.
Entitlements Specific rights or privileges granted to a user or system after authentication and authorization, defining what actions can be performed on resources.
Evernote A note-taking and organization application that can be used by architects for documenting ideas, principles, and architectural decisions.
Event Sourcing A technique storing a sequence of events that can be processed later to reconstruct system state.
Expand-Contract Pattern A three-phase deployment strategy for breaking changes: expand (add new field), migrate (update services), contract (remove old field).
Experimentation The practice of using A/B tests, prototypes, and production trials to validate assumptions and make informed decisions about features and architecture.
Exponential Backoff A retry strategy where the wait time increases exponentially with each retry attempt.
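A small sketch of the delay schedule (base, cap, and the "full jitter" option are illustrative defaults, not from the book): each retry waits twice as long as the last, capped, optionally randomized to avoid synchronized retries:

```python
import random

def backoff_delays(base=0.5, cap=30.0, attempts=5, jitter=False, rng=random.random):
    """Delay before each retry: base * 2**attempt, capped at `cap`.
    With jitter=True, apply 'full jitter': a uniform draw in [0, delay]."""
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            delay = rng() * delay  # spread out synchronized clients
        delays.append(delay)
    return delays
```

The jittered variant is what mitigates the thundering herd problem: without it, every failed client retries at the same instant.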
Extreme Ownership A practice where architects take complete responsibility for the success and failure of the architecture they design.
F
Feature Bloat An anti-pattern where systems accumulate excessive features over time, many of which are rarely used, leading to increased complexity, maintenance burden, and reduced usability.
Feature Flags Runtime configuration switches that enable or disable functionality without deploying new code, decoupling deployment from release.
FIFO Ordering A message ordering guarantee where messages are processed in the order they were sent.
Figma A collaborative design tool used for creating user interface designs, prototypes, and design systems in a browser-based environment.
Forward Compatibility When old code can read data written by new code, allowing old services to safely ignore new fields they don’t understand.
Full Text Search A search capability, provided by engines like OpenSearch or Elasticsearch, for querying large text datasets such as product catalogs.
G
Gatling A powerful open-source load testing tool designed for testing web applications, microservices, and APIs, providing detailed performance metrics and reports.
Gemba A Lean principle meaning “the real place” in Japanese, emphasizing the importance of understanding work by being present where it actually happens. Architects practice gemba by working in the trenches with engineers.
GitOps A practice using Git repositories as the single source of truth for declarative infrastructure and applications, enhancing security by reducing the need to share admin credentials with developers and providing history of all changes through Git.
Grammarly An AI-powered writing assistant tool that helps improve grammar, spelling, and clarity in documentation and architectural writing.
Gatekeeping An anti-pattern where architects act as bottlenecks by requiring approval for all decisions, reducing team autonomy and slowing delivery.
Gleam A type-safe functional programming language that runs on the Erlang VM and JavaScript runtimes, designed for building scalable and maintainable systems.
Go A statically typed, compiled programming language designed by Google, known for its simplicity, efficiency, and excellent support for concurrent programming.
Guard-rails Safety mechanisms put in place to allow experimentation and testing in production while minimizing risk to users and systems.
H
Haskell A purely functional programming language with strong static typing and lazy evaluation, emphasizing mathematical correctness and type safety.
HashiCorp Vault An open-source tool for securely storing and accessing secrets, providing encryption services, credential rotation, and access control for sensitive data.
Hibernate An object-relational mapping framework for Java that provides a framework for mapping an object-oriented domain model to a relational database.
HikariCP A high-performance JDBC connection pool library for Java, known for being lightweight and fast.
HTTP Methods (GET, PUT, POST, DELETE) Standard HTTP request methods used in REST APIs; GET, PUT, and DELETE are idempotent, while POST typically is not.
I
Implicit Contract The hidden expectations and assumptions in API contracts that aren’t explicitly documented but can break integration when changed.
Internal Shared Libraries Reusable code libraries developed and maintained within an organization to promote consistency and reduce duplication across projects.
Identity and Access Management (IAM) A comprehensive system for managing user identities and controlling access to resources across an organization.
Idempotency A property where the same operation can be performed multiple times without changing the result beyond the initial application, essential for safe HTTP operations.
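One common way to make a non-idempotent operation safe to retry is an idempotency key, as a rough sketch (the service and field names are illustrative): store the result of the first call under the client-supplied key, and return that stored result for replays:

```python
class PaymentService:
    """Sketch of idempotent request handling: the first call with a given
    idempotency key executes and its result is stored; retries with the
    same key return the stored result instead of charging twice."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored response
        self.charges = 0    # side-effect counter, for illustration only

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges += 1                     # the real side effect
        response = {"status": "charged", "amount": amount}
        self._results[idempotency_key] = response
        return response
```

This is why retries and at-least-once message delivery are safe only when the receiving operation is idempotent.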
Ignore Culture An anti-pattern where problems are consistently ignored over time, making it acceptable to overlook warnings and anti-patterns.
IP Hash (Load Balancing) An algorithm using the client’s IP address to determine which server handles the request.
J
Java 23 A recent major version of the Java programming language and platform, continuing to evolve with modern features while maintaining backward compatibility and enterprise reliability.
Jenkins An open-source automation server used for continuous integration and continuous delivery, enabling automated building, testing, and deployment of applications.
Jitter Adding randomness to retry wait times to prevent synchronized retries from multiple clients.
JSON Web Token (JWT) An open standard (first drafted in 2010, standardized as RFC 7519) that defines a compact and self-contained way for securely transmitting information between parties as a JSON object. This information can be verified and trusted because it is digitally signed. JWTs consist of three parts: header, payload, and signature, enabling stateless authentication without server-side session storage.
K
Kanban An agile methodology and visual workflow management system that uses cards and boards to visualize work, limit work in progress, and maximize flow efficiency.
Keyset Pagination A pagination method using values of the last item in the current page to determine the next page’s starting point.
Kotlin A modern, statically typed programming language running on the JVM, designed to be fully interoperable with Java while providing more concise syntax and enhanced safety features.
Kubernetes An open-source container orchestration platform for automating deployment, scaling, and management of containerized applications across clusters of hosts.
L
Last Write Wins A conflict resolution strategy in distributed systems where the most recent write operation takes precedence, potentially leading to data loss if concurrent writes occur.
Layered Architecture A traditional architectural pattern where systems are organized into horizontal layers with communication flowing through layers. Can lead to problems when services are designed 1:1 with features.
Lean Principles A management and development philosophy emphasizing elimination of waste, flow efficiency, and continuous improvement. Influences architectural decisions, including cost of delay and decision-making timing.
Legacy System An existing system developed in the past, often with outdated technology or practices, requiring careful handling during modernization to maintain business continuity.
Least Privilege Principle A security practice where users and services are granted only the minimum permissions required to perform their tasks, reducing the blast radius of compromised accounts.
LucidChart A paid, professional diagramming tool for creating architecture diagrams, flowcharts, and system design visualizations.
Latency Distribution Metrics Metrics measuring response time distribution for upstream and downstream dependencies.
LaunchDarkly A commercial feature flag management service providing advanced targeting and experimentation capabilities.
Learning Cycles Iterative feedback loops for discovering what users want and validating assumptions through production experimentation and rapid testing.
Leaky Contracts An anti-pattern where service contracts expose internal implementation details, making it difficult to evolve services without breaking clients.
Least Connections (Load Balancing) An algorithm that directs traffic to the server with the fewest active connections.
Least Recently Used (LRU) A cache eviction strategy that removes the least recently accessed items when the cache is full.
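As a compact sketch, an LRU cache falls out of Python's `OrderedDict` almost directly (the class name and capacity are illustrative): every access moves the key to the "most recent" end, and overflow evicts from the "least recent" end:

```python
from collections import OrderedDict

class LRUCache:
    """LRU cache on top of OrderedDict: every access moves the key to the
    most-recent end; on overflow, the least-recent item is evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
```

Redis offers the same policy at scale via its `maxmemory-policy allkeys-lru` setting.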
Load Balancer A service that distributes incoming network traffic across multiple servers to ensure no single server becomes overwhelmed.
M
mdBook A command-line tool for creating online books with Markdown, commonly used for technical documentation and architecture libraries.
Markdown A lightweight markup language used for creating formatted text using plain text, commonly used for documentation, README files, and technical writing. Simple, portable, and works well with AI grammar tools.
Microservices An architectural approach where a system is composed of small, independently deployable services that are loosely coupled and organized around business capabilities. When done correctly with proper contracts, microservices allow teams to work in parallel and enable technology diversity.
Monolith A single, tightly integrated application where all components are bundled together. While monoliths can be well-structured initially, they often become difficult to scale and modify over time.
Monitoring Dashboard Visual displays of system metrics and health indicators that architects and engineers should regularly review as part of attention to detail. Critical for understanding production behavior.
Mental Models Structured ways of thinking and understanding concepts that help architects learn faster, understand complex topics better, make better decisions, and solve problems more effectively.
Message ID (Correlation ID) A unique identifier for each message or request that is passed through downstream services to enable traceability and debugging.
Metric-Based Load Balancing A load balancing algorithm that distributes traffic based on real-time metrics like CPU usage, memory consumption, or response times.
Mind Maps A visual thinking tool that helps organize information hierarchically, useful for brainstorming and exploring architectural alternatives, though some architects prefer plain text files with good structure.
MinIO An open-source object storage system compatible with Amazon S3 API, often used as a self-hosted alternative to AWS S3.
Multi-Track Agile An agile development approach that runs multiple parallel tracks for different types of work, balancing discovery, delivery, and technical excellence.
Monthly Review A practice of reviewing the entire codebase monthly to understand architectural concepts, code quality trends, and patterns.
Move Fast vs Do It Right Dilemma The tension between rapid iteration and delivery versus quality and stability.
MySQL An open-source relational database management system known for its speed, reliability, and ease of use, widely used for web applications and data warehousing.
Multi-Factor Authentication (MFA) An authentication method requiring multiple verification factors.
N
Netty An asynchronous event-driven network application framework for rapid development of maintainable high-performance protocol servers and clients in Java.
Nim A statically typed compiled systems programming language that combines successful concepts from mature languages like Python, Ada, and Modula with efficiency and expressiveness.
Null Validations A defensive programming practice of checking for null or undefined values before accessing or manipulating data to prevent runtime errors.
O
OAuth Tokens Security tokens used in OAuth protocol for authorization, allowing applications to access resources on behalf of users without exposing credentials.
Observability A system property enabling understanding of how a system behaves in production and detecting issues before they impact users.
Offset-Based Pagination A pagination method where the client specifies an offset (starting point) and limit (number of items).
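A toy sketch over an in-memory list (data and names are illustrative) shows why offset paging is simple but fragile: it is a plain slice, yet deep offsets force the store to skip every preceding row, and pages can shift if rows are inserted between requests:

```python
ROWS = list(range(1, 26))  # 25 rows, a stand-in for a database table

def page(offset=0, limit=10):
    """Offset-based pagination: skip `offset` rows, return up to `limit`."""
    return ROWS[offset:offset + limit]
```

In SQL this is the familiar `LIMIT ... OFFSET ...` clause; cursor- or keyset-based pagination avoids the deep-offset cost.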
OpenSearch/Elasticsearch Open-source full-text search and analytics engines used for searching large datasets, logging, and real-time analysis.
Open Source First A philosophy of making open source the default choice and avoiding proprietary software.
Optimistic Locking A concurrency control strategy that assumes conflicts are rare, allows multiple transactions concurrent access, and checks for conflicts only at commit time.
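A minimal sketch of the version-check mechanism (the store and exception names are illustrative): every row carries a version number, and a write succeeds only if the caller still holds the current version, otherwise it must re-read and retry:

```python
class StaleVersionError(Exception):
    """Raised when a write is based on an outdated version of the row."""

class VersionedStore:
    """Optimistic locking sketch: each row carries a version; an update
    succeeds only if the caller read the current version."""

    def __init__(self):
        self._rows = {}  # key -> (value, version)

    def read(self, key):
        return self._rows[key]  # returns (value, version)

    def write(self, key, value, expected_version):
        _, current = self._rows.get(key, (None, 0))
        if current != expected_version:
            raise StaleVersionError(f"expected v{expected_version}, found v{current}")
        self._rows[key] = (value, current + 1)
```

This is the pattern behind JPA's `@Version` column and DynamoDB's conditional writes: no locks are held, and conflicts surface only at commit time.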
OSI Model (Layers) A reference model dividing network communication into 7 layers, used to describe where different services operate.
Overall Architecture Diagram A visual representation showing the big picture of a system: its main components and how its services communicate.
P
PostgreSQL A powerful open-source relational database system known for its reliability, feature robustness, and performance, supporting advanced data types and extensibility.
Pagination A technique for breaking large datasets into smaller chunks (pages) that can be retrieved one at a time.
Performance Metrics Quantified measurements of system performance including latency, throughput, response times, and resource utilization. Must be tracked and analyzed to ensure systems meet performance goals.
Partition A way of dividing a dataset into subsets where every element belongs to exactly one subset, used to improve performance and efficiency in distributed computing.
Pessimistic Locking A concurrency control strategy that assumes conflicts will occur and locks resources for exclusive access when a transaction begins.
Point-to-Point Pattern A messaging pattern where messages are sent directly from one sender to one specific receiver in a queue.
Policy as Code An authorization approach where policies are expressed as code.
Policy-Based Access Control (PBAC) An authorization model that uses policies to define what users or systems can do.
Production Logging Comprehensive logging in production systems to capture application behavior, errors, and anomalies. Part of observability and attention to detail.
Proof of Concept (PoC) A small experimental implementation used to explore problems, validate technologies, and make informed architectural decisions before committing to full-scale design and implementation.
Protect Your Time A practice of safeguarding time for deep work by blocking calendar time, pushing back on excessive meetings, and maintaining focus periods for thinking and research.
Publish/Subscribe Pattern A messaging pattern where publishers send messages to topics and subscribers express interest in specific topics to receive all published messages.
Q
Queue A data structure following First In First Out (FIFO) principle used in distributed systems for asynchronous processing and load balancing.
R
RabbitMQ An open-source message broker that implements the Advanced Message Queuing Protocol, facilitating asynchronous communication between distributed systems.
Redis An in-memory data structure store used as a database, cache, message broker, and streaming engine, known for its high performance and versatility.
Random Load Balancing A load balancing algorithm that randomly selects a server for each request, providing simple distribution without state tracking.
Resilience The ability of a system to recover from failures and continue operating despite disruptions. Distinguished from anti-fragility, which requires systems to improve from stress.
Rate Limiting A technique controlling the rate at which requests are processed to avoid overwhelming systems.
Rendering Logic Application logic that prepares and formats data for presentation to specific clients, often placed in BFF layers to keep backends generic.
Retrospectives A practice where teams regularly reflect on their processes and performance to identify improvements and celebrate successes.
Reading Code A practice where architects regularly read and analyze system code, library code, and framework code.
Refactoring The process of restructuring code without changing its external behavior, essential for maintaining architecture.
Request-Reply Pattern A messaging pattern where a sender sends a message and waits for a response from the receiver.
Requirements as Immutable (Anti-Pattern) Treating requirements as fixed and unchangeable rather than viewing them as temporary decisions that evolve.
Retry A technique used in distributed systems to handle transient failures by attempting an operation multiple times before giving up.
Role-Based Access Control (RBAC) An authorization model that assigns permissions based on user roles rather than individual users.
Round Robin (Load Balancing) An algorithm that distributes requests sequentially across servers.
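The algorithm is small enough to sketch in a few lines of Python (class name illustrative): cycle through the server list so each request goes to the next server in order:

```python
import itertools

class RoundRobinBalancer:
    """Round-robin sketch: each request goes to the next server in order,
    wrapping around at the end of the list."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def next_server(self):
        return next(self._cycle)
```

Weighted round robin extends this by repeating each server in the cycle in proportion to its assigned weight.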
Rust A systems programming language focused on safety, speed, and concurrency, preventing memory errors through its ownership system without requiring a garbage collector.
RxJava A Java implementation of Reactive Extensions, providing a library for composing asynchronous and event-based programs using observable sequences.
S
Scala 3.x The latest major version of Scala, a programming language combining object-oriented and functional programming paradigms on the JVM with improved type system and syntax.
Scalability A property enabling systems to handle growth and load efficiently, also applies to engineering teams working in parallel.
Spring A comprehensive application framework for Java providing infrastructure support for developing enterprise applications with features like dependency injection and aspect-oriented programming.
Spring Boot An extension of the Spring framework that simplifies the setup and development of Spring applications through convention over configuration and embedded servers.
Secrets Management The practice of securely storing, rotating, accessing, and auditing sensitive credentials and keys using tools like AWS KMS, HashiCorp Vault, or secrets managers.
Security Audit Systematic examination of systems for security vulnerabilities, compliance issues, and adherence to security standards. Part of attention to detail.
Service Contract The explicit API specification defining how services communicate, including request and response formats, error handling, and behavioral guarantees. Well-designed contracts enable loose coupling and allow internal implementation changes without breaking clients.
Schema Evolution The practice of changing data structures, API contracts, or message formats over time while maintaining compatibility with existing clients and services.
Security Implementation of measures to protect systems from threats such as unauthorized access and data breaches.
Shadow Reading A migration strategy where new systems read and process data in parallel with old systems without affecting production, allowing validation before cutover.
Sequence/State Diagrams UML diagrams useful for understanding how a system works internally and how data flows through it.
Server Sent Events (SSE) A technology for pushing updates from server to client.
Service Oriented Architecture (SOA) An architectural approach treating services as first-class citizens, enabling isolation, independence, and flexibility.
Session Persistence A load balancer feature ensuring a client’s requests go to the same backend server.
Single Sign-On (SSO) An authentication method allowing users to access multiple systems with one login.
Split A commercial feature flag and experimentation platform for controlled feature rollouts and A/B testing.
S3 (Amazon Simple Storage Service) AWS object storage service, sometimes used as an anti-pattern when treated as a distributed monolith for all data storage needs.
Single Source of Truth (SSOT) The concept of designating one authoritative database or system as the definitive source for a particular piece of information in distributed systems.
Spike A time-boxed research activity to investigate a technical question, explore solutions, or reduce uncertainty before committing to implementation.
SSL Termination A load balancer feature that handles SSL/TLS encryption/decryption.
Stability A state where systems are not broken frequently, with passing builds, tests, successful deployments, and managed technical debt.
Stagnation An anti-pattern where architects stop learning and updating their knowledge, leading to outdated approaches.
State of the Art A principle of using the latest versions and best solutions available rather than deprecated technologies.
Stateful Services Services that maintain state information across multiple requests, more complex but necessary for scenarios like user sessions or transactions.
Stateless Services Services that do not retain information about previous interactions; any necessary state is stored in external systems like databases or caches.
System Modernization The process of updating legacy systems to use modern technologies, patterns, and practices while maintaining stability and reducing risk. Requires discipline and specific approaches.
T
Terraform An infrastructure as code tool for building, changing, and versioning infrastructure safely and efficiently across multiple cloud providers.
Test Automation (TTA) The practice of using automated software tools to execute tests, compare actual outcomes with predicted outcomes, and report results, improving testing efficiency and reliability.
Test Passing Rate The percentage of test suite tests that pass successfully, indicating code correctness and serving as a metric for stability.
Testing in Production A disciplined practice of validating system behavior, performance, and reliability with real production traffic and data while minimizing risk to users through techniques like canary releases, feature flags, and guard-rails.
Thinking Tools Tools and techniques architects use for exploring alternatives, evaluating trade-offs, and solving problems creatively, including plain text files, markdown, diagrams, and proof of concepts.
Trade-offs The balance between competing concerns in architecture decisions, such as performance versus maintainability, speed versus quality, or simplicity versus flexibility. Understanding and documenting trade-offs is essential for making informed architectural decisions.
Tactical Programming A short-term focused programming approach prioritizing immediate features over long-term design quality, often leading to technical debt.
Tech Debt First An anti-pattern where bad decisions are made as the default choice instead of maintaining technical principles.
Technical Principles Non-negotiable architectural standards and guidelines that define acceptable approaches to system design and implementation.
Tech Debt Plague An anti-pattern where technical debt accumulates unchecked and architects fail to fight technical debt continuously.
Thundering Herd Problem A situation where multiple clients simultaneously retry an operation, causing a sudden surge in requests that overwhelms the system.
Time-to-Live (TTL) A cache management technique where items are assigned a lifespan after which they are automatically removed.
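A small sketch of lazy TTL expiry (class name and the injected clock are illustrative): each entry stores its expiry time, and a read past the deadline deletes the entry and misses:

```python
import time

class TTLCache:
    """TTL cache sketch: each entry records its expiry time; reads past
    the deadline delete the entry and miss. A clock is injected so the
    behavior can be tested without sleeping."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}  # key -> (value, expires_at)

    def put(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]  # lazy expiry on read
            return default
        return value
```

Redis combines this lazy strategy with a background sweep; either way, TTLs bound how stale cached data can get.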
Tokenization A security technique that replaces sensitive data with non-sensitive tokens, reducing the risk of data exposure while maintaining data utility.
Technology Evaluation The process of researching and assessing new technologies, tools, and frameworks to determine if they fit architectural needs and can improve systems. Part of architect responsibility.
Tupi Lang A programming language designed for specific computational needs, focusing on simplicity and performance for targeted use cases.
TypeScript A strongly typed superset of JavaScript that compiles to plain JavaScript, adding static type checking and modern language features for building robust web applications.
U
Unleash An open-source feature flag management platform providing feature toggle capabilities and gradual rollouts.
V
V Language A simple, fast, and safe compiled programming language focused on building maintainable software with high performance.
W
Waterfall Traditional sequential software development approach where discovery is completed before delivery. Often leads to inefficiencies when done linearly without iterative feedback.
Weighted Round Robin A load balancing algorithm variant that distributes requests based on assigned weights, directing more traffic to more powerful servers.
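A naive but correct weighted round robin can be built by expanding each server according to its weight and cycling through the result (names and weights below are illustrative):

```python
import itertools

def weighted_round_robin(servers):
    """servers: list of (name, weight) pairs. Yields server names so
    that a server with weight w receives w requests per full cycle."""
    expanded = [name for name, weight in servers for _ in range(weight)]
    return itertools.cycle(expanded)

balancer = weighted_round_robin([("big", 3), ("small", 1)])
first_cycle = [next(balancer) for _ in range(4)]
assert first_cycle == ["big", "big", "big", "small"]
```

Real load balancers such as NGINX use "smooth" weighted round robin, which interleaves the picks instead of sending bursts to the heaviest server, but the per-cycle proportions are the same.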
Webhook An HTTP callback where a system notifies internal or external services by sending requests to a registered URL when certain events occur.
WebSockets A technology for real-time bidirectional communication between client and server.
Write Ahead Log (WAL) A database durability technique where changes are first written to a sequential log before being applied to the database, ensuring recoverability.
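The WAL invariant is easy to state in code: append and flush the log record first, mutate state second, and rebuild state on startup by replaying the log. A toy sketch (class name and JSON-lines record format are illustrative, far simpler than a real database's WAL):

```python
import json
import os
import tempfile

class TinyWAL:
    """Toy write-ahead log: every change is appended and fsynced to a
    sequential log before the in-memory state is mutated, so state can
    be rebuilt after a crash by replaying the log from the start."""
    def __init__(self, path: str):
        self.path = path
        self.state = {}
        if os.path.exists(path):            # recovery: replay the log
            with open(path) as f:
                for line in f:
                    record = json.loads(line)
                    self.state[record["key"]] = record["value"]
        self._log = open(path, "a")

    def put(self, key, value):
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())        # durable before applying
        self.state[key] = value             # only now mutate the state

path = os.path.join(tempfile.mkdtemp(), "tiny.wal")
db = TinyWAL(path)
db.put("balance", 100)
recovered = TinyWAL(path)                   # simulate a restart
assert recovered.state["balance"] == 100
```

Real databases add checkpoints so the log can be truncated instead of replayed in full on every restart.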
Writing (Documentation) A critical practice for architects to document principles, guidelines, trade-offs, and decisions. Writing scales communication across teams and creates lasting records that can be reviewed and improved over time.
Working in the Trenches (Gemba) A practice based on the Lean principle of “going to the real place” where work is done to understand processes and identify improvements.
Write-Back Cache A cache strategy where data is written to the cache first and then to underlying storage at a later time.
Write-Through Cache A cache strategy where data is written to both the cache and underlying storage simultaneously.
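The two write strategies above differ only in when the backing store is written, which a few lines can make concrete (class names and the in-memory `dict` standing in for storage are illustrative):

```python
class WriteThroughCache:
    """Writes go to cache AND storage synchronously: storage is always
    up to date, at the cost of write latency."""
    def __init__(self, storage: dict):
        self.cache, self.storage = {}, storage

    def write(self, key, value):
        self.cache[key] = value
        self.storage[key] = value      # synchronous write to storage

class WriteBackCache:
    """Writes go only to cache; dirty entries reach storage later via
    flush(). Faster writes, but data is at risk until flushed."""
    def __init__(self, storage: dict):
        self.cache, self.storage, self.dirty = {}, storage, set()

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)            # defer the storage write

    def flush(self):
        for key in self.dirty:
            self.storage[key] = self.cache[key]
        self.dirty.clear()

disk = {}
wb = WriteBackCache(disk)
wb.write("a", 1)
assert "a" not in disk                 # not yet persisted
wb.flush()
assert disk["a"] == 1                  # persisted only on flush
```

The trade-off is the classic one: write-through trades latency for durability, write-back trades durability for latency.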
X
XP (Extreme Programming) An agile software development methodology emphasizing frequent releases, continuous feedback, simplicity, and practices like pair programming and test-driven development.
Z
Zero-Downtime Deployment A deployment strategy for releasing new versions, including database schema changes, without interrupting service to users.
Zig A general-purpose programming language designed for robustness, optimality, and maintainability, providing low-level memory control without hidden allocations.
Index
A comprehensive index of all topics, patterns, concepts, and practices covered in this architecture library.
A
ACID Concepts
Anti-Fragility Properties
Anti-Patterns Tech Debt Plague | Ignore Culture | Stagnation | Requirements
API Gateway Patterns
Architecture Review Practices
Attention to Detail Practices
Authentication & Authorization Concepts
B
BASE Concepts
BFF Pattern Patterns
Blue-Green Deployment Glossary
Build vs Buy Dilemmas
C
Cache Patterns
Canary Release Glossary
Chaos Engineering Glossary
Connection Pool Patterns
Crystal Ball Philosophy
D
Decide or Wait Dilemmas
Defensive Programming Philosophy
Design First Practices
Diagramming Tools Tools
Dilemmas Discovery vs Delivery | Move Fast vs Do It Right | Build vs Buy | Decide or Wait
Discovery vs Delivery Dilemmas
Distributed Monolith Glossary
Doing Hard Things Philosophy
E
Event-Driven Architecture Glossary
Event Sourcing Glossary
F
Feature Flags Patterns
Frontend vs Backend Philosophy
G
Glossary Glossary
H
How I Wrote the Book Epilogue
I
Idempotency Concepts
Ignore Culture Anti-Patterns
Introduction Introduction
J
JWT (JSON Web Token) Concepts
L
Load Balancer Patterns
M
Message ID Patterns
Message Patterns Patterns
Microservices Glossary
Monolith Glossary
Monthly Review Practices
Move Fast vs Do It Right Dilemmas
O
Observability Properties
Open Source First Philosophy
Optimistic vs Pessimistic Locking Concepts
Ownership Practices
P
Pagination Patterns
Partition Concepts
Patterns API Gateway | BFF Pattern | Cache | Connection Pool | Feature Flags | Load Balancer | Message Patterns | Message ID | Pagination | Queue | Retry | Web Hook
Philosophy Crystal Ball | Defensive Programming | Doing Hard Things | Frontend vs Backend | Open Source First | Service Orientation | Protect Your Time
Practices Attention to Detail | Architecture Review | Design First | Ownership | Reading Code | Monthly Review | Working in the Trenches
Properties Anti-Fragility | State of the Art | Scalability | Observability | Stability | Secure
Protect Your Time Philosophy
Q
Queue Patterns
R
Rationale Zero
Reading Code Practices
References References
Requirements Anti-Patterns
Resources Resources
Retry Patterns
S
Scalability Properties
Schema Evolution Concepts
Secure Properties
Service Orientation Philosophy
Source of Truth Concepts
Spike Glossary
Stability Properties
Stagnation Anti-Patterns
State of the Art Properties
Stateless vs Stateful Services Concepts
T
Tech Debt Plague Anti-Patterns
Testing in Production Glossary
Thinking Tools Tools
Tools Diagramming Tools | Writing Tools | Thinking Tools
W
Web Hook Patterns
Why Sections Why Zero? | Why Philosophy? | Why Anti-Patterns? | Why Dilemmas? | Why Properties? | Why Practices? | Why Concepts? | Why Patterns? | Why Tools?
Working in the Trenches Practices
Writing Tools Tools
Z
Zero Rationale
By Chapter
Chapter 0 - Zero
Chapter 1 - Philosophy
- Why Philosophy?
- Crystal Ball
- Defensive Programming
- Doing Hard Things
- Frontend vs Backend
- Open Source First
- Service Orientation
- Protect Your Time
Chapter 2 - Anti-Patterns
Chapter 3 - Dilemmas
Chapter 4 - Properties
Chapter 5 - Practices
- Why Practices?
- Attention to Detail
- Architecture Review
- Design First
- Ownership
- Reading Code
- Monthly Review
- Working in the Trenches
Chapter 6 - Concepts
- Why Concepts?
- ACID
- Authentication & Authorization
- BASE
- Idempotency
- JWT
- Optimistic vs Pessimistic Locking
- Partition
- Schema Evolution
- Source of Truth
- Stateless vs Stateful Services
Chapter 7 - Patterns
- Why Patterns?
- API Gateway
- BFF Pattern
- Cache
- Connection Pool
- Feature Flags
- Load Balancer
- Message Patterns
- Message ID
- Pagination
- Queue
- Retry
- Web Hook