My Life as a CTO - Technology (Daily Work)

Okay, you've successfully made it through Day 1 and are now looking at the hundred things on your plate. How should you prioritize them? As a Tech Leader, even if you really enjoy coding, once there are excellent engineers on the team who can carry the development work, you need to shift your focus to higher-level tasks. These are the areas where others find it hardest to step in for you, because of your position, your experience, and your understanding of the company's context, and for a Tech Leader they tend to carry the most value.

Don't get me wrong; this doesn't mean you should only do these tasks. For example, I still spend a considerable amount of time developing and coding, especially when we're short on manpower. However, three tasks in particular, PR Review, Monitoring, and Tech Spec Review, matter most, and they are activities I engage in almost daily. I'll also take this opportunity to share some of the insights I've gained over the years.

🤖 Daily Work: PR Review

PR Review is an essential aspect of engineering teamwork, but I've found that it often lacks systematic norms. The effectiveness of a PR review is largely dependent on aspects "beyond the code." Let's set aside the code itself for a moment and discuss what makes a good PR, which can also serve as a guideline for Tech Leaders to set team standards.

  1. Length

    When asking someone to do a PR review, you are essentially asking them to understand what you are doing, which means engaging their working memory. Reviewers hold your PR's logic in working memory, scrolling up and down and constantly extracting, integrating, and piecing together your intended meaning. However, human working memory is notoriously limited. If your PR is too long and exceeds the reviewer's capacity, they may give up on understanding it or grasp it only superficially, unable to offer deep insights. Generally, I recommend keeping PRs to about 200 lines. If a PR exceeds this, break it down to keep your reviewer's cognitive load manageable. Treat your reviewers well, and they will return the favor when reviewing your code.

  2. Description

    I've found that many people tend to skimp on the description, possibly thinking the code speaks for itself. However, a good description establishes a clear context for the PR, especially the 'Why,' which is often the hardest part to discern from the code alone. It's like any conversation: if you don't set the context first, the other person has to spend more time and effort to understand you, or pepper you with questions to grasp your intent. For example, someone might ask for advice without saying what kind of advice they need or why they want it (did a supervisor correct them, or are they preparing for an interview?), which makes it hard to help effectively, or forces you to spend effort extracting the context yourself. Providing the 'Why' lets the reviewer judge not only how well something is done but also whether there might be a better way to address the 'Why' in the first place.

  3. Test Plan

    I believe it is the author's fundamental duty to test thoroughly before pushing a PR. Showing your tests gives the reviewer much more context and reduces the effort needed to understand the change. I highly recommend describing your test plan with a video, especially for UI work. Even for backend tasks, you can record your screen to show the reviewer how you tested, including the API calls and their results. If time allows, include a before-and-after comparison as well; it highlights the purpose of the PR even more clearly.

All of these can be implemented with GitHub's pull_request_template, which pre-fills every new PR with the format you want contributors to follow. You can even insist on requesting changes if the descriptions are not detailed enough. The stricter you are early on, the sooner everyone's standards will align.
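
For reference, here's a minimal sketch of what such a template might look like (the section names are just my suggestion; adapt them to your team). Save it as .github/pull_request_template.md and GitHub will pre-fill it into every new PR:

```markdown
## Why
<!-- Context and motivation: what problem does this solve, and why now? -->

## What
<!-- A short summary of the change. Keep the diff under ~200 lines; split the PR if it grows. -->

## Test Plan
<!-- How you verified the change: steps, screenshots, or a short screen recording.
     For backend changes, show the API calls and their results.
     A before/after comparison is a big plus. -->
```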

🤖 Daily Work: Monitoring

As a CTO, one duty that is absolutely your responsibility, and that no one else should carry for you, is monitoring. Monitoring is like a health check: it confirms there are no major issues with your 'health,' and the early signs of chronic disease can only be caught by checking regularly. If a major issue does arise, monitoring lets you trace back what happened. Without it, you're just gambling with fate and seeing how long your luck holds.

For a web service backend, here are a few things to monitor:

  1. Server Health

    The common recommendation for uptime is at least 99.99%, meaning the system is down for no more than about 52.6 minutes per year. Practically speaking, I don't think it's realistic for startups to pursue this unless your service is marketed on high availability. Startups simply don't have the resources, and services like Cooby's are often at the mercy of third parties. If a critical issue happens in the middle of the night and we don't have enough engineers on rotation to fix it immediately, it's far more important that customer support is well coordinated, affected customers receive adequate compensation, and the same mistakes are not repeated. Of course, this doesn't mean the number is unimportant, just that it shouldn't be a major focus for startups in the early stages. The nature of a startup is to continually update and change; if you chase the ultimate uptime figure, it can slow down the overall progress of the company.

  2. API Response Time

    This metric doesn't really differ between startups and large companies; the industry benchmark for p50 is around 200 ms. It's usually not hard to meet as long as you haven't written any particularly bad or lengthy queries. The p95 and p99 numbers deserve more attention. If your p50 is fine but your p95 sits consistently high, say three to five seconds, it points to a problem that needs resolving: your service is delivering a poor experience to certain users, and those users are likely your heaviest ones, or the ones with the largest data volumes. In time-management terms, such issues are important but not urgent. Slow API responses won't break anything immediately, but they carry significant hidden risk, and user experience is the biggest one. Poor experiences are like chronic diseases; they rarely prompt immediate complaints to customer support, but over time they erode the goodwill built up elsewhere and can eventually drive users to switch services. Additionally, a problem that first shows up only in p95 is likely to spread to p75 and eventually p50 as your service grows and gains more users, turning a 'minority' issue into a widespread one. When you see abnormal p95 numbers, treat them as high priority and make sure they land in upcoming work.

  3. Error Rate

    Needless to say, if errors are being thrown, something needs to be fixed. That sounds obvious, but in the daily grind of an engineer's schedule, it's often hard to make time for it. When it comes to errors, I believe in one principle: don't throw something as an error unless you intend to handle it; if you won't handle it, don't classify it as an error at all.

    This is akin to the boy who cried wolf: after enough false alarms, no one pays attention when the wolf actually appears. In practice, though, it's easy to fall into the habit of throwing everything as an error. If you throw errors you don't intend to address (for example, those stemming from third-party issues), you're just muddying your monitoring and obscuring the real problems. Every error that is thrown should be promptly addressed. That said, issues you don't intend to address immediately shouldn't fail silently either; they should be categorized differently (e.g., as info) or tagged with a lower priority, so you keep the ability to track them without drowning out the most critical errors (a minimal sketch of this appears after this list).

    For example, the distinction between 400 and 500 errors should be clear. By definition, 400 errors are client-side issues, essentially 'errors not intended to be handled' by the backend, and should not be defined as critical in monitoring; these errors are expected to exist. On the other hand, the goal for 500 errors should be to reduce them to zero, indicating issues that need immediate resolution.

    Error codes are important for communicating issues to users on the frontend, allowing them opportunities to retry actions. Hence, proper communication of error codes between the frontend and backend is crucial. This is often overlooked in early development (since everyone believes the happy path will work), but establishing these standards early can save a lot of time in future refactoring (you'll have to define error codes eventually, so why not do it right from the start?). It's easy to be lazy early on and let the frontend display raw backend error messages, but I assure you, those messages are not meant for users to see. I've used too many services that spit out chunks of code when something goes wrong, causing nothing but headaches. Please define error codes properly from the start and prepare frontend messages that allow users to retry, treating those who spend time on your service with respect.

  4. Database Health

    Over the years, I've come to see the database as the lifeblood of backend services. It is often the bottleneck for API response times, and Cooby's entire service has gone down several times because of database issues. The data to monitor includes, but is not limited to: query response time, number of connections, transaction rates, disk I/O, CPU, and memory. A database has one job: to let you write and retrieve data. How elegantly and efficiently you can retrieve it, however, depends on your schema design and the way you write your queries.

    I'd like to share a lesson I learned the hard way. Everyone knows that when designing a database, you need to design its indexes properly. A table without the right indexes is like a dictionary with no alphabetical ordering: every lookup becomes a full scan, and it's a miracle if that doesn't drag down the whole database's performance. I don't believe any engineer would neglect to design the best possible indexes at the time. The problem is that requirements change, features evolve, and, more importantly, people leave. Even if the initial schema and index design are highly efficient, given enough time and product evolution anything can change, and that tests whether the original design can be effectively conveyed and easily picked up by others. One of our biggest incidents came when we adjusted the product requirements and unknowingly started querying on an unindexed column. The issue wasn't noticeable at first, especially in a testing environment with little data, and only surfaced after the product had been live for a while. The database, silently suffering under the load, eventually buckled under this seemingly minor query and took down all of our major services. It took several layers of digging to find the cause, but once I added the index, response times improved by more than tenfold: a textbook case of one line of code solving a thousand problems. I haven't found a perfect preventive solution yet, but at minimum it's worth running these checks in an automated way after handovers or when people leave, so this class of issue gets caught early (a rough sketch of such a check follows).
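
To make that last point concrete, here is a rough sketch of the kind of automated check I have in mind. It uses SQLite's EXPLAIN QUERY PLAN purely as a stand-in for whatever database you actually run, and the table, column, and query are all made up:

```python
import sqlite3

# Toy schema standing in for a real production table (all names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, chat_id INTEGER, body TEXT)")

def uses_full_scan(query: str) -> bool:
    """Return True if SQLite would answer this query with a full table scan."""
    plan = conn.execute(f"EXPLAIN QUERY PLAN {query}").fetchall()
    return any("SCAN" in row[-1] for row in plan)  # the last column holds the plan detail

# A query pattern the product depends on, taken from a (hypothetical) list of
# critical queries. Without an index on chat_id this is a full scan:
query = "SELECT * FROM messages WHERE chat_id = 42"
assert uses_full_scan(query)

# After adding the index, the same check passes; in CI you would simply
# assert `not uses_full_scan(q)` for every query on the critical list.
conn.execute("CREATE INDEX idx_messages_chat_id ON messages (chat_id)")
assert not uses_full_scan(query)
```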
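
And going back to the error-rate principle above, here is a minimal sketch of what I mean by categorizing instead of failing silently, using nothing but Python's standard logging. The exception names, error codes, and handler shape are all hypothetical:

```python
import logging

logger = logging.getLogger("api")

class ClientError(Exception):
    """Expected, caller-caused failures (the 400 family): not ours to fix."""
    status_code = 400

class ThirdPartyError(Exception):
    """Upstream failures we won't fix ourselves but still want to be able to trace."""
    status_code = 502

def handle_request(fn):
    try:
        return fn()
    except ClientError as exc:
        # Expected to exist: log as info so it never pages anyone.
        logger.info("client error: %s", exc)
        return {"error_code": "INVALID_INPUT", "status": exc.status_code}
    except ThirdPartyError as exc:
        # Don't fail silently, but don't pollute the critical error stream either.
        logger.warning("third-party error: %s", exc)
        return {"error_code": "UPSTREAM_UNAVAILABLE", "status": exc.status_code}
    except Exception:
        # Only this path should feed the alert whose target is zero.
        logger.exception("unhandled server error")
        return {"error_code": "INTERNAL_ERROR", "status": 500}
```

The error_code strings are what the frontend maps to human-readable, retry-friendly messages, rather than showing raw backend errors to users.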

💫 Bonus: Hyrum’s Law

With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

— Hyrum’s law.

Hyrum's Law, named after Hyrum Wright and discussed in the book "Software Engineering at Google," boils down to this: no matter what the contract guarantees, any consistent, observable behavior of your API will eventually be relied upon by someone. For example, if your API does not guarantee that its output is sorted, but the implementation happens to return sorted results most of the time, that ordering will inevitably be depended on by future users, forcing you to maintain it.

This characteristic isn't necessarily bad, but API developers need to be aware of it. It's not enough to maintain only the explicit rules mentioned in the contract; the implicit ones must also be taken very seriously. If possible, all implicit rules should be treated as explicit and clearly defined.
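
A toy illustration of the sorted-output example, in Python (the function and its caller are made up):

```python
def list_active_user_ids():
    """Contract: returns the IDs of active users. Ordering is NOT promised."""
    active = {101: "alice", 7: "bob", 55: "carol"}
    # Implementation detail: we happen to sort before returning,
    # purely to make debugging output stable.
    return sorted(active)

# Somewhere in a consumer's codebase, someone quietly depends on the
# observable-but-unpromised ordering:
newest_user_id = list_active_user_ids()[-1]  # "works" only while the output stays sorted
```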

Monitoring Service

Datadog is probably the most widely used web monitoring service. It's the only one I've used personally, and I find it very comprehensive. The user interface is simple and straightforward, capable of monitoring everything from broad trends down to the finest details. I use the Dashboard and Alert features most often. Dashboards let me check key metrics periodically, and if there's a large display available, broadcasting them is a great way to get everyone monitoring regularly. Alerts let you set thresholds for the metrics you care about: if the p95 API response time exceeds five seconds, say, or the error rate exceeds a certain percentage over a given window, a notification is pushed to Slack. These proactive alerts exist to make sure you hear about critical issues immediately, sometimes by waking you in the middle of the night. This is exactly why I emphasize that only errors you intend to address should trigger them; otherwise, frequent false alarms will eventually train you to ignore the important ones. Thresholds should be tailored to each company's user characteristics and usually need continual adjustment to find the right balance.
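
As a small illustration, this is roughly how a custom latency metric can be fed into Datadog from Python through DogStatsD, so that dashboards and alerts can be built on its p95. The metric name, tag, and agent address are placeholders:

```python
import time
from datadog import initialize, statsd

# Point DogStatsD at the local Datadog agent (address/port are placeholders).
initialize(statsd_host="127.0.0.1", statsd_port=8125)

def timed(endpoint_name):
    """Decorator that reports a handler's latency as a histogram metric."""
    def decorator(handler):
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return handler(*args, **kwargs)
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                # The agent aggregates histogram samples into avg/median/p95/max,
                # which is what dashboard widgets and alert thresholds read.
                statsd.histogram("app.api.response_time", elapsed_ms,
                                 tags=[f"endpoint:{endpoint_name}"])
        return wrapper
    return decorator
```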

Another commonly used monitoring system is AWS's CloudWatch. Although we integrate most AWS services (EC2, API Gateway, ELB, etc.) with Datadog for its better monitoring UI, Datadog's pricing structure makes some services, like Lambda, especially expensive to integrate. Cooby uses Lambda extensively, so we save costs by monitoring it directly with CloudWatch. CloudWatch isn't free either, but it offers solid basic monitoring and data retention, and alerts can be routed to Slack through SNS, though juggling both systems can be cumbersome.
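
On the CloudWatch side, here is a sketch of wiring a Lambda error alarm to an SNS topic with boto3; the function name, threshold, and topic ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the function throws more than 5 errors in a 5-minute window;
# the SNS topic can then fan out to Slack, email, or a pager.
cloudwatch.put_metric_alarm(
    AlarmName="my-function-error-spike",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=5,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts-to-slack"],
)
```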

AWS Cost Explorer

Besides server health, another critical metric for any CTO is cost. Even if your service is stable and performs well, unexpected costs can cause significant issues, especially with auto-scaling services. Forgetting to set limits can lead to spending far beyond what you might anticipate.

We had a firsthand experience where a bug in our code caused excessive reads and writes on DynamoDB, which charges per read/write operation. It wasn't obvious in a testing environment because of the low data volume, and DynamoDB doesn't fail under heavy read/write load; it simply scales to meet demand, and that scaling isn't free. The more convenient the service, the more dangerous it is if not monitored in real time; without oversight, your funds silently drain into AWS's coffers, never to return. Another service that can quietly consume resources is Lambda: without limits in place, increased traffic or a bug will lead AWS to spin up as many concurrent Lambdas as needed rather than throttle, and your bill skyrockets. If those Lambdas connect to a database, the unexpected flood of requests can also overwhelm it, which we've experienced as well.
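
One cheap guardrail against the Lambda scenario above is a reserved-concurrency cap, which puts a hard ceiling on how many copies of a function can run at once. A sketch with boto3; the function name and limit are placeholders:

```python
import boto3

lambda_client = boto3.client("lambda")

# Cap the function at 50 concurrent executions: beyond that, extra invocations
# are throttled instead of silently scaling your bill (and hammering the database).
lambda_client.put_function_concurrency(
    FunctionName="my-webhook-handler",
    ReservedConcurrentExecutions=50,
)
```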

Each experience teaches a lesson. After this incident, I realized the importance of these considerations and share them here. There are enough pitfalls in the world, and avoiding even one by learning from others' painful lessons is beneficial.

🤖 Daily Work: Tech Spec Review

“There are no right or wrong answers in architecture — only trade-offs.”

— Neal Ford

Software design, like all design, is more an art than a science. There are always multiple ways to solve a problem, and there's never a perfect solution; most decisions involve choosing between various trade-offs.

However, this doesn't mean software design doesn't need to be scrutinized. Everyone has blind spots, and while the engineer responsible for the design has likely thought about it the most and knows the area best, there are still blind spots, and higher-level organizational goals, that may not be visible from an engineer's vantage point. This is where a tech leader can add the most value: supplying additional context, prompting the team to think more deeply, and making sure most foreseeable situations have been considered. I want to emphasize that I don't believe the leader's role is to provide all the answers. Their more critical responsibility is to ask the right questions, stimulate team discussion, and notice problems that haven't yet been identified. No one can be an expert in every area; what matters more is having a framework that can be applied across topics.

Here are some aspects that I believe should be included in every tech spec review. If they aren't covered in the review itself, it's the leader's responsibility to ask these questions:

Goals

  • Does the proposed design achieve the original objectives the project was set up to meet?

    In "The 7 Habits of Highly Effective People," the first principle is to begin with the end in mind, setting clear goals. Goals guide everything, and every project that reaches an engineer for design has already passed through several stages. It might have been confirmed for implementation after extensive debate between PMs and designers, scheduled after discussions with the business team, or decided upon during technical team meetings to address technical debt. Every project has a motive and an objective at its inception, but it's easy to get lost in the details and forget the initial goals. Engineers eager to solve more problems might add improvements that are good for the system but not directly related to the project's primary goal. At this point, asking "What is the project's goal?" can help realign everyone to the same direction and make the priorities clearer.

    For example (purely hypothetical), suppose a PM proposes a project to test how improvements to the onboarding process affect conversion rates. The onboarding flow sits right after a login flow that has several divergent implementations in the codebase, so consolidating the login code would make the onboarding changes easier to write, and the engineer's plan in the tech spec is therefore to refactor the login flow along with the new onboarding flow. This is a great opportunity to reconfirm the project's goals with the team: the login refactor is clearly a bonus, not a hard requirement. With that understanding, the team can still discuss whether to include the refactor in this iteration and weigh the other trade-offs, but at least there is consensus that it isn't mandatory. And if there is spare bandwidth, are there tasks more directly tied to the project's goal, such as the completeness of metrics logging or how thoroughly the onboarding flow itself is implemented? Those align more closely with the original goal and should be prioritized over other optimizations.

Implementation

  • Have other implementation methods been considered? Why was the current one chosen?

    Just as no design is perfect, the first solution that comes to mind is rarely the best and certainly not the only one. It's important to confirm that the presented solution was chosen after considering other options and weighing their trade-offs. There should be a convincing reason behind every choice, whether it's faster development, stability, compatibility with existing structures, or even a personal interest in trying a new technology. All of these are valid reasons, but there should never be no reason at all. A common example is the choice of data store: relational, NoSQL, S3, and so on, each with its own advantages and disadvantages. It's essential to discuss why a particular one was chosen; sometimes the choice rests purely on familiarity, without considering that another option might suit the circumstances better. This makes an excellent topic for tech spec review discussion (and it's why system design interviews focus on it so often!).

    If the engineer has already considered several implementation methods, the discussion can go deeper into the trade-offs of each option and whether those assessments are reasonable. If necessary, you can provide more context to adjust how the trade-offs are weighted. For example, the engineer might assume the PM team prioritizes speed, but you might know the PM has other projects still in testing. That's an excellent moment to share that context, and perhaps even to prioritize completeness over a quick implementation.

  • Is there a risk of overly optimistic estimates in implementation?

    This is slightly less critical than other items, not because it's unimportant, but because the responsible engineer usually handles it better than a leader, except perhaps for juniors or those unfamiliar with the codebase. Time estimation shouldn't be a major concern for leaders unless you notice a significant discrepancy from your intuitive estimates, which still offers a good opportunity to inquire deeply about the considerations behind the estimate.

Potential Risks

  • Are there any visible directions for expansion within the short term (e.g., three months)? Is there room for future extensions?

    If the company has good mid-to-long-term planning and can foresee future needs for product expansion, this is a good time to provide that context so the team can consider how to build scalability into their designs. More future scalability almost always means longer development time now. Is that something the company can afford at this stage, or would we rather pay more development time later in exchange for shipping quickly now to meet urgent short-term needs? These decisions need to be guided by goals at every level to make the best choice for the company at this moment.

    Beyond expanding the feature set, if the rollout succeeds and traffic grows, is there a plan for scaling the system itself? Are the current rough capacity estimates reasonable? This is another focal point during reviews.

  • Are there any conflicts with the directions other teams are currently developing?

    As the CTO, overseeing the various ongoing projects, you are in a unique position to know whether certain features might conflict with each other. For instance, if the infrastructure team is currently migrating and upgrading certain parts of the infrastructure, the feature team is best off avoiding those parts to prevent wasted time resolving conflicts or rewriting work. These issues should be raised early in the spec review process.

These are the three areas I believe a tech leader should focus on in terms of Technology: PR Review, Monitoring, and Tech Spec Review, along with the best practices I've gleaned from my experience. Again, these aren't necessarily the correct answers. Truth is refined through debate, and I warmly welcome like-minded individuals to discuss these topics with me. There are definitely ways to improve.

Next, let's talk about leadership.
