Incident Report: May 19, 2026 – GCP Account Suspension (blog.railway.com)

420 points | by 0xedb 1 day ago

45 comments

  • shwetanshu21 17 hours ago
    This should be a warning to anyone running GCP. They suspend accounts left right and centre without even thinking about what they're doing. It seems like they use Gemini 3.1 Pro to run their production decisions.

    TK has a history of absolutely destroying the culture of the place like in OCI and has done something similar in GCP from what I've heard. GCP and Google are completely different entities with how they work. Don't expect Google quality from the name. It's just like those old brands which now have cheap licensed products like Nokia (An exaggeration I know but not far from truth).

    Not only that, they are known to shut off their services randomly, giving you about 6 months to migrate. They have lots of engineers not doing anything, so they can put them on migrating internal users off those services; most of their clients can't. There was a brilliant article on this by an ex-GCP employee that I can't find right now.

    Avoid GCP like the plague if you are serious about your business.

    Edit: Gemini (unironically) found the article on this, a very good read: https://steve-yegge.medium.com/dear-google-cloud-your-deprec...

    • cyco130 17 hours ago
      And this is Railway, a big enough name to top the HN main page and presumably find someone from Google to intervene at some point. I would have zero recourse if it was some little product that I built.
      • datadrivenangel 16 hours ago
        Their account was restored in 10 / 19 minutes! It just took 4-6 hours to get everything fully healthy. I look forward to seeing the google response to this hopefully.

        May 19, 22:10 UTC - Our automated monitoring detected API health check failures and paged our on-calls, who started investigating the issue.
        May 19, 22:11 UTC - Dashboard returning 503 errors. Users unable to log in.
        May 19, 22:19 UTC - Root cause identified: Google Cloud Platform has suspended Railway's production account.
        May 19, 22:22 UTC - P0 ticket filed with Google Cloud. Railway's GCP account manager engaged directly.
        May 19, 22:29 UTC - Incident declared.
        May 19, 22:29 UTC - GCP account access restored. All compute instances remained stopped and persistent disks inaccessible.

      • shwetanshu21 16 hours ago
        100% agree, I've seen on Twitter and HN small players facing similar issues with no recourse and response from Google. I don't know what kind of place they are trying to build there.

        They got TK to woo the enterprise customers who were forced to be hostage to OCI. But it seems they are still doing opposite of hostage here.

      • ryanisnan 16 hours ago
        This is the bigger point of all of this. Scary.
    • thebruce87m 16 hours ago
      > Don't expect Google quality from the name.

      It sounds exactly like what I have experienced in terms of Google quality over the decades.

    • adrr 15 hours ago
      GCP was never known for their support, and deprecation of services was always a huge risk. It's very sad because it's actually a quality product. They should easily be the number 2 provider. Azure is extremely unreliable and their documentation is subpar. GCP being in 3rd place is more of their own doing.
      • 8note 14 hours ago
        i wouldnt really call it a product without support, let alone a good product.

        it's a nicely designed hobby that somebody could make a good product out of by following the same abstractions

    • JohnMakin 17 hours ago
      All google products work like this. Should never be used for anything critical.
      • shwetanshu21 16 hours ago
        Yeah, sadly I know that from being burned by one of their deprecations. In fact, 2-3. But you live and you learn. And it is always better to learn from others' mistakes.
    • dylan604 16 hours ago
      Hasn't theGoog acted this way of quick to suspend accounts well before Gemini? I like to bash on LLMs as much as the next guy, but this seems very much like the memory of a gold fish. Or, you are just too young to remember pre-LLMs???
      • shwetanshu21 16 hours ago
        Haha, no. I know Google bans anyone randomly with usually no recourse in sight. I just wanted to take a dig at how bad their LLM is too while we were at it and thinks like Google themselves which is not surprising.
      • anakaine 16 hours ago
        You missed the humor part and focused on the tech part, it seems.
      • skywhopper 16 hours ago
        They probably do use similar tech to make some of these decisions, though. And they always have done that as well.
    • keeda 12 hours ago
      At least from the outside TK seems to be doing well given GCP's growth. My completely uninformed assessment is that he stepped in as the disciplined adult in the room to override Google's otherwise lackadaisical approach to enterprise. (Clearly still some ways to go, as this incident shows.) Now, that may have created a culture that is at odds with the rest of Google, but it was probably required to become a "serious" enterprise org.

      That said, did OCI, being an Oracle division, have a culture worth destroying? On the other hand, I could see TK importing that culture into Google...

    • hn_throwaway_99 15 hours ago
      What/who is TK? What is OCI?
      • westonplatter0 15 hours ago
        Thomas Kurian (GCP ceo)
        • pm90 15 hours ago
          OCI is oracle cloud infrastructure.
        • fragmede 13 hours ago
          That Google poached from Oracle Cloud, fwiw.
          • rdtsc 9 hours ago
            The tired and disappointed angel on the right shoulder was still whispering “don’t be evil” while the devil on the left shoulder, leaning over, poked him in the ribs with the pitchfork and told him to shut his yapping. “Think about all the money we can make, this is the guy we need!”
    • neya 11 hours ago
      This is just typical anti-Google FUD. I've been on GCP for nearly a decade, as have my peers. Sure, you hear about a few stories like these and in this particular case with Railway, I would actually wait to see what caused the trigger for the suspension - both from Railway and from GCP. But, this sort of thing happens with every cloud provider including AWS (you can google for the same thing "AWS shut down our account with no warning") and you'll find tons of stories like these.

      As a former GCP consultant, I can share that these sort of shut downs aren't random and it's usually due to the customer not being compliant - that breaks cloud compliance requirements for the big clouds, so automated systems flag it. Eg. Someone serving CP on their CDN, for instance.

      The Railway incident report also doesn't directly address this at all other than:

          May 19, 22:22 UTC - P0 ticket filed with Google Cloud. Railway's GCP account manager engaged directly.
      
      
      So, I would actually like to know more (What did the account manager say exactly?) before I just simply jump onto the Google hate train because it's cool to do so.
      • Shank 5 hours ago
        > As a former GCP consultant, I can share that these sort of shut downs aren't random and it's usually due to the customer not being compliant - that breaks cloud compliance requirements for the big clouds, so automated systems flag it. Eg. Someone serving CP on their CDN, for instance.

        If this was the case it would obviously be horrific. I did check to see, and I noticed that Railway is not listed as an ESP who sent any reports to NCMEC / CyberTipline in 2025, which seems like the wrong number for a hosting provider. Maybe they just have absurdly good customers.

        https://www.missingkids.org/content/dam/missingkids/pdfs/202...

      • cube00 9 hours ago
        > (What did the account manager say exactly?)

        I doubt we'll ever know, especially if it makes Google look negligent (ie. not reaching out to the customer first before restricting their production account)

        Whatever the account manager said didn't inspire confidence that this wouldn't happen again.

        > Finally, we are in planning to remove Google Cloud services from our data plane’s hot path, and keeping them only for secondary/failover.

        • neya 9 hours ago
          > Whatever the account manager said didn't inspire confidence that this wouldn't happen again.

          Or..they couldn't remain compliant with one of the strictest cloud vendors' policies.

          • scott_w 6 hours ago
            Yet their account was restored so clearly GCP themselves disagree with you.
            • neya 6 hours ago
              Or they fixed the issue to remain compliant so GCP would restore access. Again, I know it's fashionable to hate on Google here, but there are always 2 sides to every story.
    • justinclift 13 hours ago
      > Don't expect Google quality

      Google has an extremely poor reputation. Why are you thinking differently to that?

    • guluarte 17 hours ago
      This feels like google applying the same anti-spam mindset everywhere: detect risk, ban first, ask questions later.
      • praptak 16 hours ago
        It's pretty stupid that big customers like Railway are not somehow protected from this.
        • guluarte 13 hours ago
          I think all customers should be protected by at least one CSR doing a quick approve before banning the account.
      • shwetanshu21 16 hours ago
        That seems to be the case. But as we see, it backfires. Railway is very public, but we at Hacker News know Google has been doing this kind of thing for quite some time now.
        • guluarte 13 hours ago
          yep, in 2024 they deleted UniSuper's private cloud too
    • cubefox 16 hours ago
      > It seems like they use Gemini 3.1 Pro to run their production decisions.

      They said they are already using Gemini 3.5 Pro internally.

      • shwetanshu21 16 hours ago
        Then it's a bad endorsement for Gemini 3.5 Pro too. But jokes aside, I think they need customer-centric thinking instead of the self-centred kind they seem to have harboured since before TK joined (not everything can be blamed on him, although it should be his responsibility now).
        • ACCount37 16 hours ago
          Google? Customer-centric? The closest thing to that is their cloud division buttering up some big name clients.

          Other than that, Google prefers to act like "customers" are some kind of unfortunate rash they can't quite seem to get rid of, but would love to do so.

          • shwetanshu21 16 hours ago
            Yup, updated with the article I mentioned by Steve Yegge. Still holds true today.
    • tw04 10 hours ago
      > Don't expect Google quality from the name.

      Google “quality”? We’re talking about the same company that has killed off dozens of useful apps? The company that’s made 6 different chat apps? The one who will kill user accounts with no recourse or person to call?

      GCP literally had to spend the last 5 years trying to convince enterprises everywhere that they were nothing like Google proper. I’m not sure the last time you left Silicon Valley, but Google’s name across the rest of the US was synonymous with flaky commercial products that might not exist a year from now.

  • Animats 18 hours ago
    "Finally, we are in planning to remove Google Cloud services from our data plane’s hot path, and keeping them only for secondary/failover."

    That's pretty clear. Google can no longer be trusted as a B2B service provider.

    • nthypes 17 hours ago
      Meta is no different. I know a company that had their OAuth app on Meta rendered completely unusable just because one of their employees (a dev) had their personal Facebook account banned by Meta for no reason. They tried to escalate it multiple times but got nowhere, lol. Meta is even worse because accounts need to be 'personal'; if you have a Business Manager, the users added to it are all tied to their personal Meta/Facebook accounts. This is ludicrous.
      • dylan604 16 hours ago
        To me, building any business with dependencies on Meta is just a bad business plan.
      • matsemann 17 hours ago
        Yeah, people lose their business because a kid is logged in on their iPad, gets their Google account suspended, and Google knows it's the same household as the parent, and everything gets shut down.
        • subscribed 16 hours ago
          Can't find this now, but Google did at least once disable a company's accounts after a dev got their account suspended.

          And as we know from the recent Gemini ban wave, you can get suspended just because.

          • ihsw 10 hours ago
            [dead]
        • genxy 17 hours ago
          Everyone needs a defensible root of trust, this goes all the way down to the registrar you use for your domain.
        • londons_explore 17 hours ago
          > google knows it's the same household as the parent,

          Nearly all these linkages are due to people sharing recovery email addresses and phone numbers. Don't do that.

          • dylan604 16 hours ago
            Are you honestly saying that a kid should not use their parent's email address as a recovery option? Seems like that would be the natural way to do it.
            • true_religion 16 hours ago
              I don’t know about you, but I have a family account that we use as an email recovery for kids.

              Adults have multiple emails so they won’t have to share it.

              If something takes out the family email account, that’s fine. The only thing going there regularly are school notices, contractor receipts and recovery emails.

              • matsemann 6 hours ago
                Point is that if one account gets suspended, all your accounts might. Your kids', the family account, your separate one that you use for gcp billing etc
          • MrDarcy 15 hours ago
            It’s almost impossible not to any more. This is victim blaming at this point.
      • nthypes 17 hours ago
        Meta and Google B2B are both horrible. Their ad account bans are constant, and they have no real escalation process to get help. These companies are monopolies that should treat businesses more seriously, especially in these situations.
      • malfist 17 hours ago
        [flagged]
        • skullone 17 hours ago
          This is in the context of B2B, where Meta has a huge ecosystem and often rips away a company's revenue for hidden reasons.
          • budoso 16 hours ago
            Crazy considering this was their primary argument against the App Store's revenue share model. Not that they're wrong, but you'd think they would at least be consistent.
        • lotsoweiners 17 hours ago
          Seems relevant to me as it is still a service that their company relied on.
          • Kiro 17 hours ago
            Sure you're not misreading Metal?
        • simonw 17 hours ago
          They're a popular SSO provider.
        • skywhopper 16 hours ago
          A huge number of small businesses have no Internet presence beyond their Facebook and Insta pages, so … yes they are extremely relevant to a discussion about the risk to small business of flaky hyperscalers.
    • Zamicol 18 hours ago
      More businesses need to hear this message. Google has proven time and time again they cannot be trusted as a service provider, exactly because of this problem.
    • shimman 17 hours ago
      They trust them enough to still give them money, just goes to show how entrenched big tech is and why they need to be broken up into dozens of pieces.
      • jimbokun 16 hours ago
        In the meantime start by migrating away from them for anything serious.
    • jimbokun 16 hours ago
      There is a history going back many years of Google suspending or terminating accounts with no explanation, often having to backtrack after users published their frustration and the incident went viral.

      Google has always acted as if they have no obligations whatsoever to their paying customers.

    • tjwebbnorfolk 17 hours ago
      They have not explained WHY their account was suspended. That's the most important part, imo. Cloud Providers don't suspend entire accounts for no reason.
      • jodrellblank 16 hours ago
        > "Cloud Providers don't suspend entire accounts for no reason."

        Maybe I'm getting old but here[1] is a HN comment from 17 years ago complaining about Google banning accounts "by mistake" and having no recourse but to post on HN and hope Matt Cutts sees it and helps, and saying "there are literally 1000s of such stories for many years all over the blogoshphere and forums" which is something I remember from HN of years ago.

        [1] https://news.ycombinator.com/item?id=791004

        • robotnikman 12 hours ago
          And unfortunately nothing has changed since then regarding this.
      • ProfessorZoom 17 hours ago
        The cloud provider in question - GCP - who also deleted a 125 billion dollar company's entire account on accident?
        • linkregister 16 hours ago
          What company?
          • shwetanshu21 16 hours ago
            In May 2024, Google Cloud Platform (GCP) accidentally deleted the private cloud account and all backups belonging to UniSuper, an Australian pension fund managing over $125 billion.
            • bagels 15 hours ago
              I think that stretches what it means for a company to "be" a 125b company, but that is still awful.
              • kjs3 11 hours ago
                They are a pension fund; they literally had/have US$125 billion under management. What exactly is being stretched here? I can't for the life of me think of something that qualifies more for being a 125b company than actually having 125b in assets.
                • bmandale 10 hours ago
                  Having assets under management doesn't mean you have that money. You don't own it, you are just taking care of it for somebody. When describing a company as an $X billion company, conventionally this is referring to the market cap. You could use it to describe other things they possess if you wanted to, but assets they manage will never be something they possess.
                • ianburrell 10 hours ago
                  Companies are described by revenue. UniSuper made $110 million recently. It's deceptive to use the assets managed as the size since it makes it look like a much larger company. NVIDIA has revenue of $130 billion. $125 billion revenue would make it the largest company in Australia by a good amount.
                • lmm 10 hours ago
                  > I can't for the life of me think of something that qualifies more for being a 125b company than actually having 125b in assets.

                  Which this company didn't. They managed 125b of assets belonging to other people, they didn't have 125b of their own.

      • dmd 17 hours ago
        > Cloud Providers don't suspend entire accounts for no reason.

        You're joking, right?

      • subscribed 17 hours ago
        LOL, did you just wake up from hibernation?

        This is Google we're talking about. This absolutely happened many times in the past and will happen again.

      • jimbokun 16 hours ago
        Google has suspended entire accounts countless times for absolutely no reason.
      • JCTheDenthog 17 hours ago
        Unfortunately the cloud providers also rarely if ever tell you the reason.
        • rapfaria 16 hours ago
          Not defending them, but wouldn't it be a legal nightmare if they did?
        • londons_explore 17 hours ago
          My guess would be the credit card expired....

          If it were something out of Railway's hands, I think they would say something like "We have not yet identified the reason for the suspension, and are awaiting a response from Google".

          • stackskipton 14 hours ago
            At any company doing Enterprise work, you don't cut off someone for non payment without Account Manager doing multiple phone calls to whoever you have contact information for, emailing everyone listed on the account and whoever opened a support ticket and maybe even putting a banner in the panel with "ACCOUNT OVERDUE, CALL US TO SORT IT OUT!"

            Generally it takes 30 days past due and a complete lack of contact from anyone before suspension.

          • coreylane 16 hours ago
            No one pays $2m invoices with credit cards.
            • tyrshand 10 hours ago
              it surprised me a lot when I first encountered it, but some organisations do :)
      • kjs3 11 hours ago
        > Cloud Providers don't suspend entire accounts for no reason.

        Oh...my. Just starting out in the industry, are we? Those of us who have been here for a while know reality is very different than newbie hopes and dreams. Once you've been burned for the n+1 time, that optimism will fade.

      • sophacles 17 hours ago
        FTA:

        > Google Cloud placed Railway’s production account into a suspended status incorrectly, as part of an automated action. This action extended to many accounts within Google Cloud. As this was a platform-wide action, there was no proactive outreach to individual customers prior to the restriction.

        This might be 100% of what google told them.

    • nikanj 18 hours ago
      Never could. Google might block your entire company because one of your workers did something nasty on their personal account, and their ban hammer is mighty and blocks all related accounts to the Nth degree
      • dpkirchner 13 hours ago
        I've been wondering if I should be interrogating my friends before allowing them access to my wifi. "Have you or any of your family members ever been banned by Google?"
    • tantalor 13 hours ago
      Somebody else's computer
    • sneak 9 hours ago
      No US large cloud provider can. They all spy for their national military, as well.
    • FrustratedMonky 17 hours ago
      Hasn't every cloud provider had issues? Is the enshittification of services really isolated to Google, or are we all doomed?
      • jimbokun 16 hours ago
        Banning accounts for no stated reason is kind of a Google speciality. They have a long well documented history of this sort of thing.
      • Schiendelman 9 hours ago
        Has AWS ever done this? I think their lack of social network nonsense is a benefit here...
    • daninsea 17 hours ago
      Railway don't have a great reputation for building scalable systems (effects of vibe coding?). It's worth waiting for Google's response before jumping to conclusions. They can move to Azure/AWS/own datacenter, but there's a good chance this will repeat in a few months.
      • shwetanshu21 16 hours ago
        Sure, if this were a one-off isolated incident, people would have agreed with you. But it's not. Even bans on personal Google accounts have been used to ban the owner's other accounts, including ones spending thousands of dollars on ads or GCP or any other paid Google service, which is ridiculous.
      • linkregister 10 hours ago
        I understand this opinion, because their API keys / OAuth tokens had no permissions structure, so a user of the Railway MCP had their infrastructure destroyed by an overzealous LLM agent. However, this is orthogonal to their infrastructure capabilities.
      • anakaine 16 hours ago
        Their reputation is fine, and their uptake is due in part to their handling of scaling.

        If you're picking them instead of the underlying cloud provider, but you want all the knobs and dials the underlying provider has, you've made the wrong choice.

      • Citizen_Lame 16 hours ago
        There is always one bootlicker, fresh 1 day account no less.
        • daninsea 12 hours ago
          Been a passive reader here at HN for too long, finally registered today. Instead of viewing this incident objectively, you choose to insult me (?).

          I know multiple startup founders personally (2 of them are in the current YC batch), and the sheer callousness with which they look at infra, especially from security/scalability/reliability angle is shocking.

          I'll personally reserve judgement against GCP (replace with AWS/Azure/OCI/whatever) until we know more.

        • lightedman 15 hours ago
          Then let me be the not day-one account to say Railway is utterly bearing some responsibility here.

          "However, in this ring, there was still a hard dependency on workload discoverability being tied to the network control plane API that was hosted on the machines running in Google Cloud."

          They've gotta be kidding me that they deliberately left something so critical under the control of any entity other than themselves. That demonstrates a lack of critical planning and a lack of looking at their configuration from a first-principles approach.

          • Citizen_Lame 14 hours ago
            There is always responsibility with Railway, that's a given. But also taking into account how many big websites went down when AWS was down, building critical redundancy at such large scale is not cheap, and not many companies do it. Same as security theatre, we have redundancy theatre because they needed to sell the CLOUD.
    • sa46 17 hours ago
      Railway has an overwhelming incentive to pin the blame on Google. This report doesn't answer why Google suspended Railway's account.

      I'd wait for more details before adjudicating.

      • jerf 17 hours ago
        In principle, I agree with you.

        In practice, Google has earned the way my priors are ready to believe it's 100% their fault with mighty and sustained effort. Or lack thereof, depending on your point of view.

      • subscribed 16 hours ago
        That would be approximately the 6365262822nd time Google suspended someone for no good reason.

        So no, Google doesn't get the benefit of the doubt.

      • CPLX 17 hours ago
        They said it was automated and affected a bunch of other customers, which gives at least some hint.

        And in general Google lost any immediate benefit of the doubt status many years ago. Many such stories.

      • sophacles 16 hours ago
        To quote the article:

        > Google Cloud placed Railway’s production account into a suspended status incorrectly, as part of an automated action. This action extended to many accounts within Google Cloud. As this was a platform-wide action, there was no proactive outreach to individual customers prior to the restriction.

    • r0m4n0 16 hours ago
      I'm not sure that's the lesson to learn from this outage. Hell, Google resolved the problem in 7 minutes, which is as good as you could hope for.

      The resulting action should be you have proper disaster recovery, failover, etc.

      Not sure I would trust these folks if this is the conclusion they are coming to from this experience. Any cloud provider can/will do this to you.

      • MrDarcy 15 hours ago
        Google restored access but did not resolve the problem. VMs were still shut down.
      • 0cf8612b2e1e 15 hours ago
        Google resolved the problem in seven minutes for a billion dollar company. Good luck if you are a nobody.

        The best you could hope for is that if there is something fishy with your account, you are contacted by Google to address it.

  • tcdent 17 hours ago
    The interesting and yet-to-be-explained part is why google flagged the account?

    Put all the timestamps you want in the post mortem about what you observed, but you haven't addressed the root cause.

    The "this doesn't make sense" part of the story likely has a real explanation that nobody wants to reveal yet.

    • boylan 17 hours ago
      This exact thing happened to me when I ran https://www.fatherly.com/ circa 2017. Google just shut down our account without notice. We were spending like $10k/month. It also locked us out of our premium support account, so we couldn't even get anyone there to notice that they'd locked us out.

      After about 8 hours, a random Google support tech said it was because we were mining bitcoin, which was laughably untrue. We had CPU usage graphs and logs for the whole time and there was no spike. At around 12 hours, they turned it back on, said it was "misconfiguration of our abuse detection" and gave us like $100 in credit.

      Absurd. Say what you will about AWS, they would never do that to a customer without a rep reaching out to you first. I have not trusted GCP since.

      • titzer 17 hours ago
        Google thinks everything should be replaced with automation.

        Remember knowledge cards? Prior to the LLM AI revolution, they had an extraordinarily crappy AI system digest the entire internet to figure out the wrong facts about stuff and then present it to users as solid truth, with no human review and no way to report inaccuracies.

        They just don't care. If the task requires a person to look at a thing and tell if it's right, they only do that for like 5 examples and then train a classifier, then deploy said classifier without thinking twice because "at internet scale" or whatever crap.

        • dylan604 16 hours ago
          Google is the epitome of expecting happy path results to always be the end result. I could absolutely see someone writing this knowledge card system, but then realizing how much work it would be to edit it with some PM not wanting to say the project was a failure and needing serious amounts of human effort to correct and just releasing it as is. Gotta earn those KPIs for that next promotion, and then it's someone else's problem!
    • Aperocky 17 hours ago
      Shouldn't Google answer this if they are unhappy with this incident report? Are we even sure that Railway knows?
      • e40 17 hours ago
        I seriously doubt Railway knows. That's the MO for Google and others, suspend account without explanation.
      • tomComb 17 hours ago
        They can't - that would violate the privacy rights of their customer.

        They need to tell Railway and Railway needs to tell us, or Railway can tell us that Google is refusing to tell them.

        Either way, we need to hear about this from Railway.

      • SoftTalker 17 hours ago
        The report at this point is pretty much just a timeline of what happened. No explanation of why, no accusations, no blame. A PR piece, to Railway's customers, reassuring them that "we're not ignoring this."

        Now the lawyers are huddling. IMO there won't be a lot more said publicly by either side, at least until any threat of lawsuits for damages is settled.

      • kyrra 16 hours ago
        The Railway postmortem doesn't say they weren't told. It just sort of glosses over this. I would be interested to know if they were told (or not).
    • array_key_first 17 hours ago
      I don't think you're typically told why for these things, and it's mostly automated from what I can tell. The automated systems make mistakes but more importantly they're completely opaque. Nobody, not even Google, knows how they work exactly.
      • potatoman22 17 hours ago
        Google should know why a human accepted the automated suggestion, or if and why there wasn't any human oversight in the first place.
        • okanat 16 hours ago
          Google knows there is no oversight and wants it that way. Don't do business with any big tech if you don't want this kind of incident.
          • stingraycharles 10 hours ago
            AWS and Microsoft don’t do this, not like this.
        • advisedwang 15 hours ago
          Google knows why there is no human oversight: because that is expensive (both in terms of the labor doing review and the ongoing fraud likely happening while the human review happens).
      • llmslave 16 hours ago
        For big accounts like Railway, there is zero chance this was a hands-off, fully automated ban.
        • mjcl 14 hours ago
          Really? This isn't the first time their automation took down a big customer (UniSuper in 2024) by accident. In that case the automation actually deleted the resources and GCP had to recover them.
        • x0x0 16 hours ago
          That assumes a competent org. If this were AWS, I'd fully believe that. At GCP it's entirely plausible.
    • marginalx 16 hours ago
      Who is the "You" in "you haven't addressed the root cause"? If you are asking Railway to spend effort doing this rather than simply moving away from GCP, I'm not sure why they would unless they want to sue GCP to recover damages to brand and long term customer retention.

      The moment GCP shut off without any forewarning, it's a done deal, no need to ask any further questions.

    • neya 11 hours ago
      Top comments as usual are buried in deep hate for Google, I doubt that will pressure anyone at Railway to address this.
    • croes 17 hours ago
      That’s the point where Google tells you they won’t tell you the exact reason because of security reasons
      • apple4ever 17小时前
        Exactly this, which is the problem with all modern accounts. No person to talk to so you can understand what happened and maybe fix it.
        • tcdent 17小时前
          They most definitely have a person to talk to. They're not the largest Google Cloud user by far, but they are large enough to have human account reps.
          • xp84 17小时前
            And those reps might not be told what the reason is.
      • realusername 17小时前
        They also don't want to tell you because then they have to put rules and cannot ban people arbitrarily.

        Giving reasons is putting accountability on Google and they don't want that.

  • AlfieJones 1天前
    This isn’t the first time Google Cloud has seriously messed with a customer’s account: https://cloud.google.com/blog/products/infrastructure/detail...
  • shwetanshu21 16小时前
    "Railway owns our vendor choices, and we ultimately own this one. Your customers don't care whether the failure was Google or Railway; they see your product. Your uptime is our responsibility, and we'll keep delivering on it."

    Kudos to them for acknowledging it and not doing PR speak. It shows it was an architectural failure from their part of trusting GCP, and they are working to fix it. Should they have seen it coming? Yes. But better late than never.

    • 8note 14小时前
      it sounds to me like templated text from the UVic ESS office somewhere:P
  • Jgrubb 1天前
    Railway has not had the best month in the tech press have they? And in both cases it was an automated process belonging to some other party that put them there, damaging their reputation.

    I was going to talk to our google rep about their killing the Gemini cli but this is way more concerning.

    • ImPostingOnHN 16小时前
      In the case of them giving AI admin credentials to delete their production database, and it deleted their production database: that's on them. They were the only ones who put the admin account credentials into their AI.

      Then they took no personal responsibility. That definitely damaged their reputation. Here, they are taking at least some responsibility. Props to them on improving.

      Also, GCP does indeed have serious reliability issues, and Google does indeed have serious customer support issues.

      EDIT: It has been brought to my attention below that the first 2 paragraphs are misattributed, and were not Railway, but rather a customer of theirs. Sorry, Railway!

      • nightpool 16小时前
        Did Railway give admin credentials to delete their production database? My memory of the incident is that a customer of Railway's used an AI tool to delete their production database, and then blamed Railway for it. The customer was the one who put their own account credentials into their own AI, not Railway.
    • QuercusMax 18小时前
      Building on someone else's platform is always gonna be a risky move, and building a platform on top of someone else's platform is even riskier.

      My company used to use a hosting provider that was basically AWS plus some extra guarantees. We just finished migrating onto regular AWS because they now offer what we need directly.

      • gandreani 17小时前
        But... AWS is a platform too, no? Seems like you're in the same category of risk; you just moved to a more well-known name. Granted, Amazon is the most reliable even if they have their own quirks.
        • QuercusMax 17小时前
          Each critical dependency you stack multiplies your risk. Now you have to worry about Railway AND Google causing business-damaging outages.
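
          A rough sketch of that arithmetic, assuming failures are independent and that every layer must be up for the product to be up. The 99.9% figures are made-up placeholders, not measured availability for Railway, Google, or anyone else:

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def combined_availability(availabilities):
    """Availability of a chain in which every dependency must be up.

    Assumes failures are independent, which real outages often are not.
    """
    return prod(availabilities)


def expected_downtime_hours(availability):
    """Expected hours of downtime per year at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical figures for illustration only.
one_layer = combined_availability([0.999])
two_layers = combined_availability([0.999, 0.999])

print(f"one layer:  {one_layer:.6f} -> {expected_downtime_hours(one_layer):.2f} h/year")
print(f"two layers: {two_layers:.6f} -> {expected_downtime_hours(two_layers):.2f} h/year")
```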
          • stingraycharles 10小时前
            I was looking at this from Railway’s perspective. I really wonder what caused their account to be flagged, and they hint at more accounts being erroneously flagged as well.
  • majdalsado 17小时前
    Unfortunately we had to make an emergency migration to Azure yesterday due to this. Thankfully our DB was not hosted on Railway and we were back up in a couple of hours.

    As much as we loved the simplicity they provided us, there have just been too many mishaps and shortcomings for us to continue running a B2B enterprise app on their infrastructure.

    Sad day :(

    • jmaw 16小时前
      What were your reasons for going with railway in the first place? I'm not super familiar with them, but did you choose them for unique offerings, or essentially just VMs? If unique offerings, how rough was the migration out?
    • gandreani 17小时前
      Azure suspended your account as well?
      • jmaw 16小时前
        I think they meant that they migrated off of railway TO azure as opposed to FROM azure
  • ryanSrich 18小时前
    Question: for a smaller SaaS tool, or even an internal product, if a team doesn't want to manage AWS or another IaaS provider, what are the best alternatives to the following?

    1.) Vercel - having a bad month

    2.) Supabase - having a bad month

    3.) Railway - now having a bad month

    • levkk 18小时前
      DigitalOcean. Seriously. They have been around a long long time and built a lot of the core infrastructure you rely on every day (e.g. Ceph).
      • wouldbecouldbe 17小时前
        I've had my share of VPS & Managed DB outages at DO, so they are also not faultless.
        • efdee 17小时前
          I've been with DO since *checks mailbox* 2014. Honestly never experienced an unannounced outage.
          • wouldbecouldbe 17小时前
            Yeah, overall they are OK. I think a managed DB went down 3 times and once or twice a VPS just died. No issues in a year or so.

            They were always hardware failures and took about 45-120 min. Not the end of the world, but also not fun getting a lot of client complaints.

        • jasonlotito 17小时前
          Not if, but when. No one is faultless. Chasing after 100% is a fool's errand.
      • xp84 17小时前
        I have read plenty of snark about them on HN, but I found their product incredibly useful, well-designed, and easy to work with. If I was building a new startup from scratch, I'd definitely be giving them a look.

        I'm sure there are plenty of the like 1,000 AWS products that DO has no viable competitor for, but for what they do offer, they're great.

      • robotnikman 12小时前
        I've used DigitalOcean for personal projects for over a decade, no major issues so I definitely recommend!
      • ethagnawl 16小时前
        I've had nothing but good experiences with them and their docs and tutorials are excellent.
      • rathboma 16小时前
        Yes, I use DO with Hatchbox. It is a perfect combo. Been using for more and more projects.
    • haute_cuisine 5小时前
      Render has been solid so far
    • Illniyar 17小时前
      If you are unable to use IaaS directly, you need to accept that your service might be down.

      Even if you use AWS and the like, if you aren't building your app with redundancy across multiple AZs, then you'll have some downtime occasionally.

      And even if you do build redundancy with multiple AZ, some services might fail anyway as AWS is not entirely isolated. So you might have downtimes.

      So just accept downtimes and use the best tool for you (unless they are really bad, like GitHub level bad). If you cannot accept any downtime, you'll have to spend millions of dollars and months of work to have the confidence to expect no downtime. Something like Netflix's chaos monkey and infrastructure would be enough.

      • ndiddy 16小时前
        The advantage of going with AWS is that when us-east-1 goes down, half the internet goes down so you don't have to defend why you had a service outage.
        • true_religion 16小时前
          I just blame AWS for all outages whether that’s true or not.
    • danjl 17小时前
      I think the message here is that you can't trust any single cloud provider. You at least need two with full operational capability.
      • xp84 17小时前
        Yup. I don't know enough people at giant companies to know how many actually do this though. Not just talking having 2 AZs, I'm talking about ability in a DR scenario to fail over, within 5-10 minutes, to a different cloud provider, e.g. AWS → Hetzner, or GCP → Azure.

        My gut feeling is that the number of significant applications that have this capability can probably be counted on two hands. Especially since a lot of the largest footprints of software stacks running in the cloud belong to Google and Microsoft, who I'm pretty sure do not replicate their services into someone else's cloud.
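
        A minimal sketch of the decision logic such a failover would need, assuming a periodic health check against each provider. The `Provider` class, the `cloud-a`/`cloud-b` names, and the threshold of 3 are all hypothetical; with a 60-second check interval, a threshold of 3 would detect an outage in roughly 3 minutes, leaving the rest of a 5-10 minute budget for the actual cutover (DNS, data, traffic), which is the hard part and is not shown here:

```python
from dataclasses import dataclass


@dataclass
class Provider:
    """Tracks consecutive failed health checks for one cloud provider."""
    name: str
    consecutive_failures: int = 0

    def record_check(self, healthy: bool) -> None:
        # Reset the counter on success, increment it on failure.
        self.consecutive_failures = 0 if healthy else self.consecutive_failures + 1


def pick_active(primary: Provider, standby: Provider, threshold: int = 3) -> Provider:
    """Stay on the primary unless it has failed `threshold` checks in a row
    and the standby is currently passing its own checks."""
    if primary.consecutive_failures >= threshold and standby.consecutive_failures == 0:
        return standby
    return primary


# Simulated check results: the primary passes once, then fails three times.
primary = Provider("cloud-a")
standby = Provider("cloud-b")
for healthy in (True, False, False, False):
    primary.record_check(healthy)
    standby.record_check(True)

print(pick_active(primary, standby).name)
```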

        • jiggawatts 15小时前
          Before the cloud it was commonplace to have redundant data centres from two or more colocation provider companies. Similarly, Internet uplink diversity was commonplace.
    • zuzululu 16小时前
      Why does nobody consider that you can buy a bare-metal box or even a VPS, and that it will get you very far without paying a metered fee?
    • acdha 17小时前
      An intermediary can provide value but there’s also a risk so I’d consider why you don’t want to use AWS, GCP, etc. directly. All of the major cloud providers have services which are only slightly harder than what Railway does but allow you to grow into more advanced things as your needs expand without adding a third-party who controls your features, security, and availability.

      As an example, I note that GCP responded within 7 minutes according to their timeline. If you’d been using Cloud Run, that would have reduced downtime by over 7 hours — and there’s a good chance that you never would have gone down in the first place if the unknown trigger event was related to other customer activity or something odd Railway did.

      There’s also a complexity factor: note how much complex infrastructure they mentioned having to fix that you wouldn’t need for your own account. That code does useful things, I’m sure, but it’s also a lot of moving parts which a hosting provider needs and you don’t – this outage took everyone down, whereas individual AWS or bare metal users would’ve otherwise been unaffected. There isn’t a global optimum which is the same for everyone but I think developers are prone to wildly over-estimating how much time they save by removing a couple of deployment steps relative to the direct costs and the less obvious costs of working within someone else’s environment.

      • christophilus 15小时前
        This entire thread illustrates why you don’t want Google in any critical part of your business. AWS, sure. Azure? Maybe. I’m not familiar with Azure, but if I have to pick one, it’s AWS.
    • nightpool 18小时前
      Fly, Render, and even Heroku are all still better choices than working with Railway, I think.
      • nathancahill 14小时前
        I love Fly, but their docs are.. tough. They've had multiple iterations of the control plane API, and it's very hard to do things the "correct" way with conflicting official docs.
    • rathboma 16小时前
      Hatchbox + Digital Ocean is an unbeatable combo and provides Railway-like automation with self-owned infra.
    • dejaydev 17小时前
      Depending on exactly what you're building, all of these things sound like one VPS. There's a bit of maintenance/security burden managing the machine if you're not used to it, but as the others have said: Next.js can be self-hosted, unless you need the serverless/edge stuff; then I would go to Cloudflare Workers.
    • Saris 17小时前
      Maybe a VPS? Simple to manage and way cheaper.

      But really any service (or even on-site hosting) can have downtime, if that's not acceptable then I suppose building/using a tool that can be distributed between multiple hosts located in different geographical areas is the best option.

    • mattmatters 18小时前
      Haven't used railway but my understanding is they are something similar to Heroku. Fly.io has been pretty great for tiny projects in that niche.

      For Vercel if your nextjs site can be compiled statically you could probably throw it up on almost anything. We've self hosted before which is pretty straightforward but you lose a lot of the image optimization stuff unless you go deep into setting up open next.

    • d-cc 13小时前
      RamNode has always worked well for my projects.
    • nathanielks 16小时前
      Fly.io (AFAIK) still has a relatively good track record?
    • delduca 17小时前
      Hetzner (or any VM provider) + Dokku works best.
    • nozzlegear 17小时前
      [dead]
    • jiggawatts 15小时前
      I love how everyone in Silicon Valley acts like Microsoft doesn’t exist.

      Azure!

      It’s the enterprise cloud with enterprise support. They won’t randomly pull the plug on your account, unlike companies that have a wildly different cultural background:

      Google - ad tech (you’re the product)

      Amazon - shop front (you're a competitor)

      Oracle - lawyers (you’re a future lawsuit for license extortion)

      Etc…

    • fabianlindfors 18小时前
      Shameless self plug but check out: https://specific.dev (especially if you use coding agents)

      No code lock-in through SDKs and built on top of AWS with great DX for both developer and coding agents

  • teraflop 18小时前
    > May 19, 22:10 UTC - Our automated monitoring detected API health check failures and paged our on-calls, who started investigating the issue.

    > At 22:20 UTC on May 19, Google Cloud placed Railway’s production account into a suspended status incorrectly, as part of an automated action.

    If the timestamps are accurate, what was causing the errors 10 minutes before the account was suspended?

    The simplest explanation is just that one or the other of these timestamps is wrong, which wouldn't be a big deal. But if the timestamps aren't known with certainty, it seems very odd to include them in the writeup as though they are certain, even though they are very obviously inconsistent with each other.

    • Shank 18小时前
      > If the timestamps are accurate, what was causing the errors 10 minutes before the account was suspended?

      Assuming the timestamps are accurate, Google probably started terminating resources while the account was not "suspended" and only completed that after all resources were disabled.

      • sroussey 18小时前
        Or the account started doing something nefarious (assuming one of their customers as root cause, not railway itself) that started causing real problems and Google shut it down.

        The problem with not having the data is that it’s easy to make assumptions.

        • bink 17小时前
          The absence of any explanation for the suspension does seem intentional. If it were me that's one of the first things I would've asked so that I could make sure it doesn't happen again.
          • sroussey 8小时前
            My ebay account is partially suspended -- i can buy but i can not sell. Of course i asked why! But they refuse to tell me.
    • jonas21 17小时前
      The 22:20 timestamp from the body of the post is wrong. The timeline section (where the 22:10 timestamp came from) is consistent with itself, and also contains:

      > May 19, 22:19 UTC - Root cause identified: Google Cloud Platform has suspended Railway's production account.

      They couldn't have identified the root cause before it happened.
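
      The inconsistency is easy to check mechanically. A small sketch using only the timestamps quoted in this thread (the function and variable names are illustrative, not anything from the report): any timeline event that precedes the suspension time stated in the body is suspect, since the suspension is supposed to be the cause.

```python
from datetime import datetime, timezone


def utc(hhmm):
    """Parse an HH:MM time on May 19, 2026 (UTC), as quoted in the thread."""
    parsed = datetime.strptime(f"2026-05-19 {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


# Timeline entries as quoted from the report's timeline section.
TIMELINE = [
    ("API health check failures detected", utc("22:10")),
    ("dashboard returning 503 errors", utc("22:11")),
    ("root cause identified: account suspended", utc("22:19")),
]

# Suspension time as stated in the body of the report.
BODY_SUSPENSION_TIME = utc("22:20")


def events_before(cause_time, timeline):
    """Return labels of timeline events that happened before the claimed cause."""
    return [label for label, when in timeline if when < cause_time]


for label in events_before(BODY_SUSPENSION_TIME, TIMELINE):
    print(f"precedes the stated 22:20 suspension: {label}")
```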

    • thekevan 17小时前
      That 10 minutes is likely very normal. Possibly...

      * A Google employee messes up a setting (like in one of the previous incidents), which triggers something that looks like a suspension is warranted, and it takes 10 minutes to flow through the process to suspend.

      * A Railway customer does something corrupt, or seemingly corrupt; Google's system starts limiting access and takes 10 minutes to decide it should be a suspension.

      These are even more likely if there is a person in the loop to approve, who obviously did not dig deep enough to see that they should not have done so.

  • dantillberg 17小时前
    What drives Google to apply these actions so completely and immediately, versus a more deliberate approach, with notification and delay before action, manual review for paying customers, or a warning to resolve within X hours/days? Once or twice could be errors or bad implementation, but these can't explain away the pattern.

    It would seem that Google's counsel has deemed that whenever _____ is detected, the company must immediately and completely sever the business relationship. What is that driving concern? Is it sanctions enforcement? CSAM? Something else?

    • BitWiseVibe 17小时前
      It could be automated action based on abuse reports. TONS of spam comes from Railway associated networks.
      • spogbiper 16小时前
        I work in a security adjacent role and I know we have had a few incidents that involved Railway networks lately. Could be something to that, I don't know
    • e40 17小时前
      The problem is scale. Google uses automation and doesn't have the people to review the actions of that automation. I never worked at Google but this is the most obvious explanation from watching these things happen for years and years.

      Please, someone that worked at Google, please comment.

      • ihsw 9小时前
        [dead]
  • dan_sbl 17小时前
    > As a side effect, Terms-of-service acceptance records were also reset, prompting users to re-accept on their next visit to the dashboard.

    Don't get me wrong- the rest of this mess falls pretty clearly on Google Cloud, but this one feels like something Railway did to themselves.

  • Bender 17小时前
    I've read all the threads and their main page and I still don't really understand what this service is. Is this like a commercial alternative to Gerrit? What do people use this for?

    I'm not a developer, just curious what this is.

    • natbennett 17小时前
      The category is “Platform as a Service”

      Alternative to Fly or Heroku

      Here is my source code
      Run it on the cloud for me
      I do not care how

      In this case it looks like they also bundle together a bunch of the other services you would need to get code onto the platform, monitor it once it’s there and so on

      • Bender 17小时前
        Oh I see, so they manage the server hosting and application server configuration, optimization and all that jazz. Almost like one step away from managed hosting. Makes sense now, thank you!
  • nosefrog 15小时前
    It's highly unlikely that GCP banning their account without telling them is true, but GCP is probably not going to go public with the real reason.
    • christophilus 15小时前
      I’m 100% willing to believe it based on Google’s track record.
  • beauregardener 16小时前
    Sadly, my Railway project is still having issues 24 hours later. Already started an emergency migration away from the Railway backend :(
  • tptacek 18小时前
  • whirlwin 16小时前
    The RCA and preventive measures were a pleasant read. I have a lot of respect for companies putting a lot of effort into incident reports like these. Makes them appear very professional rather than just blaming the cloud provider outright.
  • indrex 18小时前
    Had a similar experience with GCP. They terminated our VMs six times and responded zero times.
  • tristanb 17小时前
    "Your customers don't care whether the failure was Google or Railway; they see your product. Your uptime is our responsibility, and we'll keep delivering on it." - Thanks Claude!
  • mellosouls 17小时前
    Even if it ultimately turns out to be "Google's fault" (as this report seems to be saying), Railway say they own the incident but make no apology here.
  • alansaber 16小时前
    Unfortunately we've also had a litany of problems with our GCP deployment and chose to remove them completely as a service provider.
  • myself248 12小时前
    How many trains were delayed or incorrectly routed as a result?
  • theredleft 18小时前
    back to on-prem
    • mxuribe 17小时前
      Honestly, i have been wanting to suggest to my leaders that we should go to on-prem for primary, and use cloud only as extra for peak traffic and/or failover, etc...but, the culture where i'm at is so bought into cloud as if it solves all problems...and then, in the next breath they all ask me to drastically reduce cloud costs and ensure 100% uptime at all times 24/7/365 (100% uptime without complexity and without any added costs!).
  • stefan_ 18小时前
    It's reassuring to know they will ban a million dollar enterprise customer just like they will ban your GMail of 20 years.
    • tedd4u 17小时前
      I can't believe Kurian has not put his foot down about this. Adverse action against accounts over $X ARR absolutely must have review by revenue-carrying people before the action is taken.
    • SoftTalker 17小时前
      It really is amazing that there is not some level at which "human review" becomes mandatory. Customers of that size already have dedicated account rep contacts.
  • rurban 23小时前
    Google, the new Microsoft!
    • cryo32 18小时前
      I think this is just the default endgame of large corporates which suck up large quantities of customers. They are a race to the bottom and you end up with service by footgun. My own company is responsible for doing this in our sector. Literally every technology decision favours automation over verification because it's cheaper to say sorry than do it right.
    • christophilus 15小时前
      I’m no Microsoft fan, but they are pretty good at long term support of enterprises. Way, way better than Google from what I can tell.
    • raverbashing 17小时前
      Amazon played AWS from day 1 as if they were the runner-up (and in a sense they were), and while it does look like it's day 2 there, they are not letting the momentum down

      Microsoft might have technical warts but commercially they are strong and Azure is a lot of times bundled with other services and you know you can get someone on the phone if needed

      Google has... ?

      • wiether 16小时前

          > Google has... ?
        
        A spot in the Top 3, and is neither Microslop nor AWS.

        At least that's my understanding from discussing with people praising GCP.

        Let's say you want a big cloud provider, but you don't want Azure because of Microslop's old and recent history, and you don't want AWS because it's the default cloud provider.

        You're left with GCP. And many people are stuck in the 00's, and still believe Google is the cool kid crushing the boring old corporations.

      • rescbr 17小时前
        > Google has... ?

        former Oracle salespeople

    • redwood 21小时前
      Honestly they really are starting to look that way. Total opinionated Walled Garden that's against an open and thriving ecosystem. Unlike Microsoft the technology is not yet garbage but I hope this isn't where they're going to end up
      • pesus 18小时前
        It sure is heading towards being garbage, though. Search is actively being degraded in favor of a barely functioning AI, and I'm sure it's not going to stop there. Seems like it was inevitable once ad/finance people got ahold of the company.
  • loxodrome 17小时前
    I will definitely not be signing up on GCP because of this.
  • in_a_society 18小时前
    Google has a culture problem. This is not something that can change easily nor will it change when it’s not recognized as being an issue within their organization.

    Between my peer c-suites, the conversation is that GCP cannot even be in the consideration set until such a time as a several-year period has elapsed without this kind of incident.

  • FajitaNachos 17小时前
    19 minutes from detection to getting the google account restored is pretty awesome honestly.
  • 1970-01-01 17小时前
    They forgot to get reimbursement for downtime. A free month of GCP is better than nothing.
  • siliconc0w 15小时前
    Why would you use an infrastructure provider on top of another infrastructure provider? It adds cost and risk, it's always going to be a leaky abstraction, and it's not hard to learn how to use GCP or AWS correctly - especially with agents.
    • bagels 15小时前
      What intermediate is involved?
  • koliber 17小时前
    Now given the logic that you can't be dependent on any one service to run your SaaS, how does Railway convince its customers to run their SaaS on a single service?
  • delduca 17小时前
    Flagged by some AI automation.
  • pm90 15小时前
    I don’t understand why Google still has TK helming GCP when it’s obviously not achieved the kind of success it should. Google infra is some of the best in the world yet GCP is meh. It continues to underperform and seems content to be a distant 3rd behind AWS and Azure.
  • phendrenad2 8小时前
    Nov 28, 2012, user fbuilesv posted: Google Compute Engine is on Limited Preview right now. If you're planning to offer a service that you care about you should consider this.

    Do we know if GCP has ever left limited preview..??!

  • ibejoeb 18小时前
    I've been getting serious, recently, about moving all my workloads to equipment that I control in datacenters with which I have professional relationships. It's less expensive, easier, and this kind of nonsense doesn't happen. These cloud providers need to step back and observe how terrible they've made these products. Footguns everywhere, pricing that is impossible to forecast or reason about, broken APIs, and automated self destruction. Then you have third-party providers sitting on top of them, adding another layer of each antifeature. Crazy.
    • lacewing 18小时前
      > These cloud providers need to step back and observe how terrible they've made these products.

      They don't, because the allure of effortless scaling is hard to resist: everyone thinks of themselves as the next tech unicorn. And if you actually become a unicorn, you're already too dependent on AWS / Azure / GCP to easily move somewhere else. At best, your strategy is to become "multi-cloud".

      • ibejoeb 17小时前
        That effortlessness is a fantasy. That's illustrated right here in this write-up by how complicated their system is.

        >Railway’s network is a mesh ring, built up of high availability fiber interconnects between Metal <> GCP <> AWS. However, in this ring, there was still a hard dependency on workload discoverability being tied to the network control plane API that was hosted on the machines running in Google Cloud

        What the hell is even that?

        • foobar1726 16小时前
          "We had all of our workers set up in an open office layout to make sure everyone could talk to each other without a single point of failure. But last night the boss got too drunk and didn't come in, so everyone spent the day scrolling tiktok."
    • Scaled 17小时前
      It's really surprising how much cheaper colo becomes if you have an even vaguely predictable workload. And you don't have to be a major customer, either -- the data centers will happily sell you single U's or a couple U's, even on a monthly basis if you ask, making it perfectly viable for startups or advanced personal projects.
    • mxuribe 17小时前
      > ...These cloud providers need to step back and observe how terrible they've made these products...

      I doubt that will happen because none of them want to stop the money-making machine they have! And, if your thought after my comment is that all us techies are making a fuss, so the cloud providers and businesses using them will hear our cries and trigger a backlash...? I doubt that too...because some senior business leaders that i see are bent on listening more to management consultants as opposed to a balance of folks including their own internal experts...but, alas, maybe i'm just having too cynical a day today. :-)

    • bombcar 18小时前
      The thing that's nice about physical datacenters with people is that they often have to physically walk over to disconnect you - it's not as easy as some automated system doing an AI.
      • Scaled 17小时前
        And if they do, you can walk over there too and ask a human why in person. (Or just call the NOC)
  • kittikitti 10小时前
    It's not my proudest moment, but at least being banned and suspended all the time brings some wisdom.
  • AtNightWeCode 17小时前
    So, what was the reason for the account suspension? Why did it happen? I know Google can be a bit stupid with their automatons but I am a bit skeptical here. There are sites more critical than Railway hosted on GCP.
  • corndoge 17小时前
    > Your customers don't care whether the failure was Google or Railway; they see your product.

    Refreshing. So tired of businesses blaming their vendors. Oh it wasn't us spamming you text messages and emails, it was Shopify. Oh, our delivery guarantee said 2 days and it's been a week? That's not us, it's UPS.

    I don't care. I didn't pay UPS or Shopify. I paid you.

  • llmslave 16小时前
    Major infra provider -> has no backups/game plan if GCP goes down
  • tamimio 17小时前
    > Railway’s production account into a suspended status incorrectly, as part of an automated action.

    Be it individuals or companies, this is the best time to ditch all dependence on anything cloud or SaaS. Since they are all using automated AI, more and more of these incidents will occur.

  • ChrisArchitect 19小时前
    Related discussion during the incident:

    https://news.ycombinator.com/item?id=48201484

  • _justme 14小时前
    Does this qualify for a list entry on killedbygoogle.com ?
  • charcircuit 16小时前
    >Google Cloud placed Railway’s production account into a suspended status incorrectly, as part of an automated action.

    There is no justification given on why this action was incorrect. It's possible they actually did something wrong.

    • dpkirchner 13小时前
      But not so wrong that they could get their account back in 10 minutes.
    • nosefrog 15小时前
      Yup...
  • ur-whale 17小时前
    Perfect reminder that it's time to use Google Takeout while I still can.
  • guluarte 17小时前
    tldr: AI suspended the account of an almost billion-dollar startup.