• 0 Posts
  • 81 Comments
Joined 1 year ago
Cake day: July 23rd, 2023





  • Speaking from 10+ YoE developing metrics, dashboards, uptime, all that shit and another 5+ on top of that at an exec level managing all that, this is bullshit. There is a disconnect between the automated systems that tell us something is down and the people that want to tell the outside world something is down. If you are a small company, there’s a decent chance you’ve launched your product without proper alerting and monitoring so you have to manually manage outages. If you are GitHub or AWS size, you know exactly when shit hits the fan because you have contracts that depend on that and you’re going to need some justification for downtime. Assuming a healthy environment, you’re doing a blameless postmortem but you’ve done millions of those at that scale and part of resolving them is ensuring you know before it happens again. Internally you know when there is an outage; exposing that externally is always about making yourself look good not customer experience.

    What you’re describing is the incident management process. That also doesn’t require management input because you’re not going to wait for some fucking suit to respond to a Slack message. Your alarms have severities that give you agency. Again, small businesses sure you might not, but at large scale, especially with anyone holding anything like a SOC2, you have procedures in place and you’re stopping the bleeding. You will have some level of leadership that steps in and translates what the individual contributors are doing to business speak; that doesn’t prevent you from telling your customers shit is fucked up.

    The only time a company actually needs to properly evaluate what’s going on before announcing is during a security incident. There’s a huge difference between “my honeypot blew up” and “the database in this region is fucked so customers can’t write anything to it; they probably can’t use our product.” My honeypot blowing up might be an indication I’m fucked or that the attackers blew up the honeypot instead of anything else. Can’t send traffic to a region? There’s literally no reason the customer would be able to either, so why am I not telling them?

    I read your response as either someone who knows nothing about the field or someone on the business side who doesn’t actually understand how single panes of glass work. If that’s not the case, I apologize. This is a huge pet peeve for basically anyone in the SRE/DevOps space who consumes these shitty status pages.


  • This is a common problem. Same thing happens with AWS outages too. Business people get to manually flip the switches here. It’s completely divorced from proper monitoring. An internal alert triggers, engineers start looking at it, and only when someone approves publishing the outage does it actually appear on the status page. Outages for places like GitHub and AWS are tied to SLAs that are tied to payouts or discounts for huge customers so there’s an immense incentive to not declare an outage even though everything is on fire. I have yelled at AWS, GitHub, Azure, and a few smaller vendors for this exact bullshit. One time we had a Textract outage for over six hours before AWS finally decided to declare one. We were fucking screaming at our TAM by the end because no one in our collective networks could use it but they refused to declare an outage.




  • That explanation runs counter to my experience with VC-funded companies, marketing budgets, and running in the red in general. Trying to hit as much of the total addressable market as possible means burning money. Notice how I expanded and included discounts? You don’t even get a 5% off code. Framework is making a profit so they can lose margin on a low percentage (if they’re not making a profit then there’s no reason to not throw away more to get closer to TAM anyway).

    Board games run in the thousands for some of the bigger ticket items. I’m not sure you understand either market. I regularly crowdfund packages that cost at least 25% of the Framework prices I’m skimming now.






  • The problem is the underlying API. parseInt("550e8400-e29b-41d4-a716-446655440000", 10) (this is a UUID) returns 550. If you’re expecting that input to not parse as a number, then JavaScript fails you. To some degree there is a need for things to provide common standards. If your team all understands how parseInt works and agrees that those strings should be numbers and continues to design for that, you’re golden.
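
    For anyone who does want that UUID rejected, here is a minimal sketch of one way to do it. parseStrictInt is a hypothetical helper name I made up, not part of JavaScript or any library, and the regex assumes you only care about plain base-10 integers with an optional sign.

```javascript
// Hypothetical helper (an assumption for illustration, not a built-in):
// accept only strings that are entirely a base-10 integer, and return NaN
// for anything with trailing or leading non-digit characters.
function parseStrictInt(input) {
  // Optional sign followed by one or more digits, and nothing else.
  if (!/^[+-]?\d+$/.test(input)) {
    return Number.NaN;
  }
  return Number.parseInt(input, 10);
}

const uuid = "550e8400-e29b-41d4-a716-446655440000";

// Built-in behavior: parsing stops at the first non-digit character.
console.log(Number.parseInt(uuid, 10)); // 550

// Strict behavior: the whole string must be an integer.
console.log(parseStrictInt(uuid)); // NaN
console.log(parseStrictInt("550")); // 550
```

    Whether you want the strict version or the built-in one is exactly the team agreement I’m talking about; the point is to pick one on purpose.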


  • While it’s certainly true that some classes of bugs are very easy to fix (“oh shit I forgot to apply the correct style”; “I meant to use this method whoops”), many bugs that exist in later-stage games require pulling a bunch of shit apart to figure it out. They’re usually in the same pool of difficulty as performance optimizations or balancing new functionality. Getting a successful test case can be difficult even if the bug is readily apparent. Getting the regression test to pass is the subject of a plethora of literature. It can be hard and difficulty often scales with codebase size. If the bug was obvious and easy, it would have been fixed before.

    If it was obvious and easy and wasn’t done before because of time constraints, devs can still charge more because their wages should have gone up. This whole thread OP is kinda nuts (not the commenter I’m vehemently agreeing with and expanding on).