Podcast: Dave Blakey talks Nova with Running In Production

Reproduced courtesy of Running In Production.

In this episode of Running in Production, Dave Blakey goes over how their load balancing service (Nova) handles 33,000+ events per second across a 100+ server Kubernetes cluster that runs on both AWS and DigitalOcean. There’s a sprinkle of Serverless thrown in too.

Transcript

You're listening to the Running in Production podcast where developers and engineers talk about their tech stacks, lessons learned and general tips from running web apps in production. Here's Nick and today's guest.

Nick Janetakis: Welcome to Running in Production. Today I'm with Dave Blakey who is running a combination of Go and Kubernetes to help deliver a load balancer service called Nova. Dave, welcome to the show.

Dave Blakey: Thank you. Thanks for having me. I'm excited to chat.

Nick: Yeah, no problem. So do you want to start off by introducing yourself and letting people know a little bit more about the app that we're going to go over today?

Dave: Absolutely. As you mentioned, my name is Dave Blakey, I’m the CEO of Snapt, which is an ADC company, which is really just a fancy way of saying load balancing, web acceleration and application security. The space is very close to our hearts, and we're going to talk about our own app today, called Nova, which provides those services at large scale for cloud-native, next-gen businesses and deployments, and how we built it, how we structured it, and how we've prepared for scaling it out.

Nick: Nice. So how long has Nova been running in production?

Dave: Well, the product itself is still in alpha. It's going to beta in the middle of this month [February 2020], but we have quite a few production plans already. In our game, load balancing in alpha is pretty stable. I think our first production client went live on the platform about four months ago – actually, it's probably been more like three months.

Nick: OK, is it just you developing this application? Or do you have a small team around it?

Dave: We've got two teams working on it. There are multiple components to it, the backend and then the frontend. But in total, there are probably about nine people working on it.

Nick: Oh, wow. So that is a pretty decently sized team.

Dave: Yeah, it's too big to share a large pizza.

Nick: Right. Gotta get two of them.

Dave: Exactly.

Nick: Before we jumped on this call, you mentioned that GoLang is a primary component of the backend. Do you want to go into why you happened to choose GoLang? And what parts of the language benefit your application?

Dave: Absolutely. I mentioned briefly that we have these two components. Really, it's one system and you control it through a web interface. So, you're doing your settings, configuring what you want to run, and then ultimately deploying that. But what's really happening in the backend is that we are then shipping that configuration out to potentially hundreds or thousands or tens of thousands of your connected devices. And that configuration and everything is all done through our primarily PHP driven website. But that speaks to our GoLang server infrastructure, which then communicates with all of the clients, which will be running on actual clients’ machines, often in Kubernetes, containers, things like that, which is also a Go binary sitting there.

So the reason I go into that description is because you can already see where I'm headed: with tens of thousands of connected devices per client, we needed something that was very scalable. All of these clients connect to one central point of control, being this GoLang server, which had to be extremely scalable, do a lot of things in parallel, and in a very efficient way. And Go was a combination of a very good way of doing that and a way for us to develop quickly, if that makes sense. You'll see a lot of the choices we've made have been: "what is the most scalable, best way here that is also a very agile way of getting it done?" And Go really fits that profile in my opinion.
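
To make that concrete, here is a minimal, hypothetical Go sketch of the pattern Dave describes – not Nova's actual code – showing one central server accepting many long-lived client connections and handling each one in its own lightweight goroutine, which is what makes Go a natural fit for this kind of fan-in.

```go
// Minimal sketch (not Nova's code): a central Go server that accepts many
// long-lived client connections and handles each one concurrently.
package main

import (
	"bufio"
	"log"
	"net"
)

func handleClient(conn net.Conn) {
	defer conn.Close()
	scanner := bufio.NewScanner(conn)
	for scanner.Scan() {
		// Each line is a message from a connected client
		// (a heartbeat, a stat, a job result, etc.).
		log.Printf("client %s says: %s", conn.RemoteAddr(), scanner.Text())
	}
}

func main() {
	ln, err := net.Listen("tcp", ":9000") // hypothetical port
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Print(err)
			continue
		}
		go handleClient(conn) // one lightweight goroutine per client
	}
}
```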

Nick: Was there any prior experience that you had with Go that made it easier to use it or did you pick it based on your use cases for the app?

Dave: Our team is made up of many programming languages. You know, we work in a lot of different languages across the business from frontend stuff to core kernel stuff. So, we have a lot of experience with a lot of different languages, and Go was just suggested as the natural fit for that piece of the pie. On the client side, we're quite flexible. We've been looking at doing a version that's smaller, a smaller footprint, for which we’ll probably just use C. But for the server, Go was a very obvious choice for us. So it was quite an easy choice to make.

Nick: Maybe we should just rewind a little bit and cover the details of what your application is composed of. Because you mentioned you have a server, and you have a client. Now, is the client the binary that your end customer would install on their server?

Dave: Yes, exactly. So to be clear, typically, the client is actually deploying one of our containers or a VM image or something like that. But the real function of that is to run this thin client of ours. And our client’s job is to remain in constant communication with our server in order for us to control that instance – container or VM or whatever it might be. So the client basically just accepts instructions from the server, and has the requirements of being extremely low latency, very scalable and very lightweight. And then the reason for the server to exist is for us to obviously know what clients are connected and to be able to deploy configurations, issue commands, collect statistics, send jobs to the clients, etc. So there's a client-server infrastructure, and then there’s what we would call the cloud – the frontend to the users, which communicates with the server in order to tell which clients to do what jobs.
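
For illustration only, here is a rough Go sketch of what such a thin client could look like under those assumptions: it holds a persistent connection to the server, applies whatever instructions arrive, and reconnects if the link drops. The address and message format are placeholders, not Nova's protocol.

```go
// Rough sketch of a thin client: stay connected, apply instructions,
// reconnect on failure. Address and message format are placeholders.
package main

import (
	"bufio"
	"log"
	"net"
	"time"
)

func main() {
	for {
		conn, err := net.Dial("tcp", "server.example.com:9000") // hypothetical address
		if err != nil {
			log.Print("server unreachable, retrying: ", err)
			time.Sleep(5 * time.Second)
			continue
		}
		scanner := bufio.NewScanner(conn)
		for scanner.Scan() {
			instruction := scanner.Text()
			log.Print("applying instruction: ", instruction)
			// e.g. write out a new load balancer config, report stats, etc.
		}
		conn.Close() // connection dropped; loop around and reconnect
	}
}
```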

Nick: So that cloud frontend, that is what end users connect to just like a regular website and that's how they configure their load balancers, right?

Dave: Exactly right. And that’s written primarily in PHP.

Nick: Is that just PHP standard library or are you using some type of framework for that?

Dave: We use Laravel. Well, we use large parts of Laravel. One of the nice things about that is that it's flexible with being able to use some of your own stuff as well. But yeah, it's PHP with the Laravel framework.

Nick: Nice. And was that just based on what your developers knew how to use beforehand?

Dave: When we were looking at using a framework, there was never much question. We knew we would use Laravel if we did use a framework, because this is a question that's come up within our business and our kind of circle many times before: what frameworks do we like, what ones don’t we like, etc. And Laravel ticks a lot of the boxes we want. The bigger question was more if we were going to use a framework at all. There were a lot of concerns from the team like: well, this is a large project, it has a very long lifespan, and there were justifications for doing things ourselves. Like, we don't necessarily want to have to do things through an ORM in a framework and stuff like that. But one of the nice things for me personally about Laravel is that it doesn't really force you to use the Laravel components when it doesn't fit. So ultimately, we made the decision to use the framework in order to keep our development pace up. And then once we had made that decision, Laravel was the obvious choice for us.

Nick:  So I do not have very much experience with Laravel. Are there features of that framework that just fit nicely with the type of web UI that you're building?

Dave: Look, you can do anything in any framework ever created. It's really more about the fit for the team, because you can do anything. But Laravel, I would consider it to be a more modern framework. And it has a lot of the same kind of design principles as we have within our system. A lot of thought has gone into how one might scale such a platform out and things like that. And it's very flexible, it's not one size fits all – it's quite easy to customize, which was important for us. So yeah, it's gaining a lot of popularity. It is certainly one of the newer ones, but it seems to be quite well regarded in the PHP community, I would say.

Nick: Yeah, I know, there's a website called Laracasts and it has something like a million screencast videos for learning Laravel.

Dave: Exactly. And that's a big concern. You know, we have this business where no one we hire ever knows what we do, because we're doing things that people aren't doing, and in a lot of different combinations. So we also have to think about how easy it is to upskill on things or to get resources for things and stuff like that. And they have done a great job of building this community around that, you know, with documentation and screencasts and forums. Although, I think you probably could say the same for any framework. But yeah, I like the Laravel ecosystem.

Nick: So, is that application – the web frontend to drive the configuration for the load balancer – is it a monolithic application like one big Laravel app or something else?

Dave: No, it's quite spread out. The entire thing runs inside of Kubernetes. So we have, obviously, the traditional web frontend side that you'd expect. And then we have stuff like the server communication, how the servers scale out, and all of that sort of stuff. But interestingly, we actually don't run things like jobs inside that environment. Our job servers are actually serverless, so they're running on Lambda. We can run jobs basically instantly, and it just scales up or down as needed. So, it's quite broken up, because on the user experience side, any button click, any action you take, is really just resulting in a job, which is then being run on Lambda.

Nick: You mentioned jobs, and things being broken up across Kubernetes on the backend and AWS. Are there other components of your tech stack that you'd like to go over? You’re using Docker because it's in Kubernetes, but are you also using Docker in development?

Dave: Yeah, we are quite heavily actually. We use Vagrant for all of our development work. And then we have, you know, almost exact copies of what's running in production that people can run locally via various containers and things like that. So they can run their own server, they can fake their own clients because when you're working on the web frontend, you also have to then ultimately have it speak to a server, which ultimately deploys those configs to clients and so on. So, we make quite heavy use of Docker in development, because it allows us to easily simulate a live environment for developers.

Nick: This seems like it is not your run-of-the-mill, traditional web application. There are a lot of moving parts where I can definitely see a VM being necessary. So, what's the development experience like for a developer? Do they just run the VM and maybe run some kind of program to bring it all up, and that's it?

Dave:  Yeah, exactly. For the real, core Laravel app, if you will call it that, they can just run a Vagrant up and boot it up and start to work on it. However, that application will then depend on local databases, for example. It's very typical in Vagrant to bring up like a PostgreSQL database, but we have a time series database, which needs to come down. Then you have the server, then you have fake clients, so then we just have some tooling, like a couple of scripts to pull down our containers, like our test containers that automatically connect. You know, one of the nice things about Vagrant is that you can predict the IP address of all your developers’ systems, so you can have interconnected things like clients and servers that are connecting, all just be pre-allocated IPs, and then every developer has the same environment, which is fantastic. This is the first project that I've worked on, where we've gone so far down the road in terms of like tooling for development, and it's been great. Because even our CI/CD processes, you can run them all locally because you have all the containers to actually test running things on clients and so on.

Nick: This sounds like one of those projects where if you weren't doing all of that, and you didn't invest the time, then getting set up in development would be near impossible, right? It would be like a 700-step README file that would get out of date very fast.

Dave: Exactly. And it actually consumes some of our resources. It's important to maintain that stuff as well – the environments, what features are needed, and how developers work on it. So, it's something that we focus on quite a lot, but it comes at a cost. It does take up resources from our team.

Nick: I'm sure in your case that's time well spent.

Dave: Yeah, well, exactly. I mean, we've got no choice really.

Nick: So you mentioned a time series database. Do you want to mention what database you're using for that?

Dave: At the moment, we're using InfluxDB. It's quite a difficult question, and it's something that may even be interesting to listeners. When you look at the regular databases – MySQL, MariaDB, PostgreSQL – you can find the strengths and weaknesses in them all. When you get into time series, it's a whole different story. And it's quite hard to work out which one you should use, which one you shouldn't. They've all got pluses and minuses. So that's why I took a bit of a pause to say, yeah, we're using Influx. We are very happy with it right now and it's working well, but in some of our scaling tests and stuff like that, we've identified potential areas where there might be challenges. There’s no clear answer, and so that's an area where we're still doing a lot of experimentation. We anticipate learning even more as we scale the platform out. To give you an idea, right now, the product is four months old, it's still in alpha, and we're writing about 33,000 metrics a second. So our scale requirements are very large, and we can expect that to go up by probably 1,000 times in the next year.
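
As a rough illustration of what writing metrics at that kind of rate involves, here is a hedged Go sketch that batches points into InfluxDB's line protocol and ships them in a single HTTP write. The endpoint, database name and measurement are assumptions for the example, not Nova's actual pipeline.

```go
// Hedged sketch: batch metrics into InfluxDB line protocol and write them
// over HTTP in one request instead of one call per point.
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	var batch bytes.Buffer
	for i := 0; i < 5000; i++ {
		// Line protocol: <measurement>,<tag>=<value> <field>=<value> <timestamp>
		fmt.Fprintf(&batch, "adc_requests,node=node-%d value=%d %d\n",
			i%100, i, time.Now().UnixNano())
	}
	// InfluxDB 1.x write endpoint; assumes a database called "metrics" exists.
	resp, err := http.Post("http://localhost:8086/write?db=metrics",
		"text/plain", &batch)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Print("write status: ", resp.Status)
}
```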

Nick: Wow. 33,000 a second is no joke. I'm not going to bust out the calculator and do the math. But that is like millions upon millions a day, right?

Dave: Yeah, exactly. Because the thing is we have the central cloud platform, but it's also actually the central repository for all the information. So, if you've got 100 ADCs deployed, we're taking metrics from all of those ADCs all the time and storing them centrally. And that's per client. So yeah, the numbers get very big, very fast. And so we have these interesting challenges, like even having to use a time series database. There are a lot of companies and web apps that could just store reporting data in SQL and be fine. And I would recommend that if they can, but at our scale, that's obviously not really an option.

Nick: I have some projects where I'm just using PostgreSQL, and it's recording maybe dozens of events a second, which is a much different scale than tens of thousands.

Dave: Exactly. Even with PostgreSQL, you could probably do tens of thousands. But like I say, we're in alpha, and if you multiply that by 1,000, then all of a sudden you're talking about millions of events per second. And then you've got a really unique database scaling problem that I believe only time series databases can solve.

Nick: Now, I'm jumping around a little bit here. But since we're talking about so many requests per second, alpha software and a load balancer, this is a load balancer that would be basically an entry point to your application if you were hosting this on your own infrastructure, right?

Dave: Exactly. It's the kind of single source for all of your clients to connect to, and it does all of the communication to your, let's say, web servers. It's obviously not exclusively web servers, but primarily it is. So it sits in between client communication and the web servers.
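
The simplest possible version of "sitting in between" is a reverse proxy. The Go sketch below is illustrative only – a real ADC layers health checks, caching, a WAF and so on on top – and just forwards client traffic to a single backend web server.

```go
// Minimal "sits in between" example: a reverse proxy that accepts client
// connections and forwards them to one backend web server.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	backend, err := url.Parse("http://10.0.0.5:8080") // hypothetical web server
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(backend)
	log.Fatal(http.ListenAndServe(":80", proxy))
}
```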

Nick: Right. And now let's just say that you're scaling up, months go by, things are getting popular and awesome, and you have a little bit of growing pains. Would those growing pains affect your end users who are using that load balancer?

Dave: No. One of the interesting things when you run a web app is you've got to think to yourself, okay, well, what kind of uptime do I want to have? How many nines should I put on this application? And you never put 100%, because it's a web application, and you're hosting it centrally. So, by design, we have separated the two concepts. So the clients that I spoke about, which are the actual ADCs, can run with the cloud being offline, should it need to be. Obviously, that's not our design. The intention is that everything stays online as much as possible. So we've put a lot of effort into how we build that environment, and which clouds we’re in. But should the worst case happen and the site goes down, then the load balancers will continue to function fine.

Nick: Nice. That is definitely a big win, because uptime is important and knowing that it's decoupled from your infrastructure is really good stuff.

Dave: We didn't really have a choice. The thing about the load balancers is they have to have 100% uptime. They very importantly cannot go down. And that's why they will obviously be redundant and so on. We can't have them go down because of maintenance or something like that.

Nick: Can you give the TLDR on how your load balancer is different from what you might get on AWS or GCP or another platform?

Dave: Absolutely. So, cloud load balancers are basically saying: where should I accept traffic, and where should I send it? And they can get a little bit smarter than that. When they do get smarter than that, they often start to get expensive though. But, generally speaking, they're a commodity option, whereas what we provide is what the market will sometimes call an ADC. Really, it's looking at four primary pieces. The first is load balancing, so availability. The second is security, so it's got a web application firewall, so it's protecting against denial of service attacks, SQL injection, cross-site scripting, you can do geo-fencing and all the stuff you can imagine from that. Then, web acceleration – it will cache content from your servers, it can serve stale content if your servers go down, it can rewrite pages to minify content, all that kind of stuff. And then the big thing, which has become much more popular recently, is observability. So, which page is responding the slowest? What's the latency like? Is there an anomaly? Has something changed? Are there more failed logins from a specific country? So, load balancing is one piece of it, but it’s a very small piece.

Nick: So in your case, you're much more than just a traditional ELB on AWS. You're adding all sorts of monitoring, alerts and other stuff based on the information the load balancer gives back to your servers, right?

Dave: Exactly. And then the second piece, and what's also become quite interesting – and this is less about us and more about the state of hosting, it's not something that we've invented ourselves – is that more and more companies are having to go to multi-cloud deployments. Take us, for example: we're in AWS and Digital Ocean. So we're hosted in two different cloud providers in two different locations in the world, for exactly this reason, if something was to go wrong. And when you get to that point, it becomes very difficult to rely upon cloud-specific proprietary tech like ELB or ALB for your WAF requirements. So, people in those environments often want something that can be the same, like a mirror, in all their environments. So with Nova, for example, you could deploy an ADC that has the exact same config into Amazon, Azure and GCP. Then you know all those environments are the same and you're getting the same reports. So that's a big trend with high availability and being in multiple locations – especially once you get on-premises, it becomes very important.

Nick: You mentioned AWS and Digital Ocean. Are you running basically a replication of your service in those two providers?

Dave: Exactly. We run active in both. And it depends on certain workloads. For example, right now, we are making use of some cloud-provider-specific stuff, like we use AWS for all jobs, so that's not entirely true – those will be in two different AWS locations. But we have our databases, the time series information and the Nova client-server communication stuff in both clouds at the same time.

Nick: Roughly how many servers are you running on both of those clouds?

Dave: It's not that high, but it's probably higher than it could be. We tend towards using more servers that cost less. We're in the load balancing game, right? We'd rather have ten $20 servers than two $500 servers. And so we have quite a few. We have a very big Kubernetes cluster on Digital Ocean, because we use it for testing. If I exclude that, we've probably got 40 individual droplets. Including our Kubernetes cluster, it's probably like 140. But that's for testing – we have to scale up a million clients in order to put load on our platform. So that's almost separate. And then AWS, I think it's about 15 to 20.

Nick: Okay, and now as for these servers, spec-wise, you mentioned a $20 a month DO droplet. Is that what you're running or do you go a little bit bigger?

Dave: That is what we're running. For our load balancers, for the Nova load balancers that we deploy – you know, we eat our own meals – we use our own load balancers in front of our services. Those are $5 droplets.

Nick: What distro are you running on those servers?

Dave:  For us, if we can, we basically will always use Ubuntu. It's just familiar to us, it's easy to use, easy to manage and secure. We’ve got a lot of experience with it internally. So on anything if we can control it, it's going to be Ubuntu.

Nick: Is that 18.04, the latest LTS?

Dave: Yeah. To be honest, we've got one or two that are running 19, but most of it will just be whatever the LTS version is.

Nick: And then on the AWS side of things, you mentioned a couple of servers there as well. What class of instances are those?

Dave: Yeah, I think they're between $40 and $60 a month generally per server.

Nick: Okay.

Dave: I'm sure it's the same with most workloads, but especially with our workloads, more cores do not scale well in cloud environments. So, a two-core or four-core instance performs very well for us, but eight cores, 16 cores and 32 cores do not – you don't get the same scale. And the way cloud costing works, if your system is distributed like ours is – for example, we run everything in Kubernetes, so it doesn't matter if something is two servers or 10 servers – you get far more bang for your buck by running 10 four-cores than one 40-core.

Nick:  That makes total sense.

Dave: And then, also, you’re resilient against failure. So, it's naturally a self-sufficient system.

Nick: It’s cool how you mentioned before that to run your own infrastructure, you're using your own tool, the load balancers, to manage your own stuff.

Dave: Exactly. I don't want to get too feature specific on our stuff, but that's where you start to see like a lot of the benefit of using something that’s third party. When we do deployments, for example, it automatically handles blue-green deployments for us and takes servers out that don't make it back up. So to be honest, it helps a lot with our availability to do deployments.

Nick: Now, speaking of these servers, you have a whole bunch on Digital Ocean, a couple on AWS, what do you use to set up and provision those servers?

Dave: Most of our servers are one piece of a Kubernetes cluster. We're mostly just managing them with that; they have no real individual value, as far as an instance goes. In Digital Ocean I think probably 80% of our servers are managed by Kubernetes. The rest of them are template driven or use tooling that we've got internally. We don't use any large deployment stuff like Ansible.

Nick: Yeah, that makes sense since you're using a Kubernetes cluster. So how's that working out, by the way, having Kubernetes run all of that?

Dave: Good. It works great, but it requires a different mindset. We try not to have any kind of permanent storage inside of pods or containers that don't require it. So, it does wind up meaning that you have to develop things differently. But one of the great things is that that lends itself to a development style that is very scalable. To take a simple example, let's say you want to run a report, and it needs to create a PDF that you're then going to download. That's not actually possible in that environment, and if you were to scale it out you would have problems – you would have to tell your load balancer to keep users on one server, and if that server went down, the report would be gone. Whereas with us, we have to say: "Okay, we're going to have to send that to some kind of shared storage place, because it needs to be shared amongst all the containers." So it means your development has to be somewhat different, but it forces you to do it in a way that will then be easy to scale, if that makes sense.
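
Here is a small sketch of that design constraint, under assumed names: the report code writes to an abstract shared store (object storage, a shared volume, whatever you run) instead of the pod's local disk, so any replica can serve the download afterwards.

```go
// Sketch: never assume local disk inside a pod; push generated artifacts to
// a shared store so any replica can serve them. The endpoint is a stand-in.
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// ReportStore hides where reports actually live.
type ReportStore interface {
	Save(name string, data []byte) error
}

// httpObjectStore uploads to a shared, S3-style endpoint via HTTP PUT.
type httpObjectStore struct{ baseURL string }

func (s httpObjectStore) Save(name string, data []byte) error {
	req, err := http.NewRequest(http.MethodPut, s.baseURL+"/"+name, bytes.NewReader(data))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("upload failed: %s", resp.Status)
	}
	return nil
}

func main() {
	// Hypothetical shared store reachable by every replica.
	var store ReportStore = httpObjectStore{baseURL: "http://shared-store.internal/reports"}
	pdf := []byte("%PDF-1.4 ...") // pretend this is the generated report
	if err := store.Save("monthly-report.pdf", pdf); err != nil {
		panic(err)
	}
}
```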

Nick: Yeah, absolutely. And I can't agree more with that. Because a lot of people think it's like, well, I need to scale so let me just hit the Kubernetes button and then it's like rainbows and everything works nicely.

Dave: Yeah. My experience, you know, is very unfair, because at Snapt we have other software too. Our shop, for example, is not running in Kubernetes, it's running in a monolithic type of environment. So when we talk about Nova, it's like, oh yeah, it’s rainbows and sparkles everywhere, but that's because we started developing it on these platforms. We never had to migrate it to them. And I think that's much more challenging.

Nick: Yeah, I know I have some clients who want to move to Kubernetes, and they haven't really built their application in a way that works with that. So, yeah, it's a huge undertaking on the dev side of things. Like we said, if you need to upload a file and suddenly, that needs to go to S3, not on a little box. Or what about session state – you can't just keep that in a Laravel web process, it needs to be somewhere else.

Dave: Yeah, exactly. But you know, the funny thing is, we see a lot of people when we're selling our product, and they will say, what's the story with Kubernetes and how's the integration there? And fortunately for us, that's obviously good. But most of their workload actually isn't really well suited to Kubernetes or cloud-native in general. I don't think that cloud-native is like a destination. I think it's like a spectrum. There are some workloads that work really well in Kubernetes. There are some that work well in serverless environments. And there are some that work well on hardware, and there's some that work well in VMs. We try to account for that, so we're saying, whatever your platform is, we'll try and support that because I really don't think everyone should just say, "Oh, yeah, move everything we’ve got and stick it in Kubernetes".

Nick: Right. Now speaking of that, does that mean your load balancer service would work for people who are not using Kubernetes, like if they just have like a Digital Ocean droplet?

Dave: Yeah, absolutely. That's the default, actually. A lot of the time it's easier for them, because we create and maintain the droplet, so if anything happens, we just recreate it. They can easily just use that. In fact, most of our clients are not running entire Kubernetes environments, because that's the smallest side of their business. They will typically have 20% of their workload in Kubernetes and need a solution there. But the other 80% is still traditional – from monolithic to microservices to whatever – it's not necessarily cloud-native.

Nick: So do you want to walk us through how would that work? Let's say, I'm a regular web developer, I have two Digital Ocean servers, separate servers and I want to load balance them now and, let’s say, I have a web project running on both of them. How would your service let me do that?

Dave: Digital Ocean is a good example. Let me be clear and say it doesn't have to be a cloud provider, but it can be easier if it's a cloud provider. Let's use Digital Ocean as an example. When you sign up, you can put in your Digital Ocean API credentials – you just generate a token for it and then we can launch droplets on your account. And then we can look and see what systems you’ve got running. You can then say, "Okay, I want to launch a droplet". You choose the $5 one, for example – especially if you've got two servers, that will be more than enough. So, you'll tell Nova to provision a new Nova node in your account, and we’ll launch a droplet for you. Then that will connect back and you can configure an ADC there. In your example, it will say, "we see you've got two web servers, would you like to send the traffic there?" And you just say "yes", and it will maintain all of that for you and get their IPs automatically. It’s really easy with a cloud account.
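
As a hedged example of the kind of call a service can make once you hand it a DigitalOcean token, here is a sketch using the godo client library; the droplet name, region, size and image are illustrative, and this is not Nova's provisioning code.

```go
// Hedged sketch: create a small droplet on a customer's DigitalOcean account
// using an API token they provide. Names, region, size and image are examples.
package main

import (
	"context"
	"log"

	"github.com/digitalocean/godo"
)

func main() {
	client := godo.NewFromToken("YOUR_DO_API_TOKEN") // the token the user pastes in

	req := &godo.DropletCreateRequest{
		Name:   "nova-node-1", // hypothetical name
		Region: "nyc3",
		Size:   "s-1vcpu-1gb", // the "$5 droplet" tier at the time
		Image:  godo.DropletCreateImage{Slug: "ubuntu-18-04-x64"},
	}
	droplet, _, err := client.Droplets.Create(context.Background(), req)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("created droplet %d (%s)", droplet.ID, droplet.Name)
}
```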

Nick:  And now all the difficult stuff about dealing with a load balancer, like doing rolling restarts and only serving traffic for the one that's up, is that all completely handled by your service?

Dave: Yeah, it's all automated. You can get into the nuts and bolts of it and configure how we do health checks, and what we consider down versus up, and all of that stuff. By default, for example, we consider any 500 status code to mean a down server. You can tweak that, but by default it will just handle all of that. And it actually monitors replies, so you don't even have to mark a site as down. For example, for us, when we do an update and push our new code, the servers that are receiving the new code will generate a 501 HTTP status code, which Nova intercepts and then says the server's down, let me shift this over to another web server that's still online. It just handles blue-green deployment automatically.
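
A minimal sketch of that health-check behaviour: poll each backend and treat any 5xx response, or no response at all, as "down". The endpoints, interval and timeout here are assumptions for illustration.

```go
// Minimal health-check sketch: any 5xx (500, 501, ...) or unreachable backend
// is treated as down. Endpoints and timings are illustrative.
package main

import (
	"log"
	"net/http"
	"time"
)

func isUp(url string) bool {
	client := http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return false // unreachable counts as down
	}
	defer resp.Body.Close()
	return resp.StatusCode < 500 // 5xx marks the server down
}

func main() {
	backends := []string{"http://10.0.0.5:8080/health", "http://10.0.0.6:8080/health"}
	for range time.Tick(5 * time.Second) {
		for _, b := range backends {
			log.Printf("%s up=%v", b, isUp(b))
		}
	}
}
```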

Nick: That's super interesting, because load balancing is so hard and using Kubernetes for a smaller scale site like that might not be … I don't want to say it's not the right move, but it's pretty complicated to set up.

Dave: Well, for businesses it's hard. I mean, 80% of workloads in enterprise are still running on VMware, on-premises or in racks or in data centers. And even if you do have a workload in Kubernetes, you still need to be able to deploy ADCs and load balancers in your VMware environment. So, you know, it's all the same.

Nick: Right. Before we move on to talking a little bit more about how you deploy your own things, I just wanted to ask one more question: what would that deploy process look like for an end user who has your load balancer in front of two Digital Ocean servers? Let's say they push a new version of their code base to update something – how does your load balancer ensure that the new code gets onto both servers?

Dave: They will push directly to their servers. Let's say they run a deploy script that takes the site down, so everyone gets an error message, and then after they've deployed everyone gets an "everything's okay" message. If they deploy to both of their servers at the same time, then we’ll have nowhere to send the traffic. So you could optionally have us generate a maintenance page, for example, that says "everything is down" – you can customize your error pages, like “please try again in 60 seconds.” But in an ideal environment, what they would do is deploy to the one server, and we'll pick up that that server has gone down and move everyone over to the other server. Then when it comes back up, we'll shift the load back so that it's 50-50, then they deploy to the other one, and we move everyone off that one and then move them back when they’re done. So it works as long as they don't wipe out 100% of their servers. In our example we deploy to just 50% and then the other 50%, but if you want to be safer, you can deploy to 10%, then 20%, then 30%, or however you want to do it.

Nick: Right, that makes sense. So it is kind of like a rolling restart, but I don't know how you would classify that – maybe more like a manual rolling restart?

Dave: Yeah, exactly. So when you deploy, you just deploy to two different groups. That’s obviously an option – they can deploy to both, and if it takes 30 seconds, the load balancer can just generate a message saying, “Please try again in 30 seconds,” or something. But if they don't want any downtime, then they can do that. That's like your automated kind of manual approach, where you’re saying: let the load balancer automatically pick up my manual deployments. But of course, for bigger enterprises, what they'll do is send an API request over to say "drain these servers, because I’m going to do a test deployment to them", and then tell us to bring the traffic back, or put 10% of the traffic on a new server so they can test if it's working, and stuff like that.

Nick: Right. And I think that's where some people get caught up in the Kubernetes buzzword. They think maybe it's possible for this all to just happen – like this one magic bullet that manages all of that automatically and you don’t need to worry about breaking up your deploy.

Dave: Exactly, that’s the thing. With Kubernetes, for example, we can point to a service behind us that's made up of a bunch of containers or pods. And we'll send traffic to any that are online. So if your deployment is set up in such a way that once you deploy, new containers come up when old containers go down, we'll pick that up. But if you’ve got a user speaking to a specific container in a Kubernetes environment and they’re halfway through a download and that container disappears, that user is going to get disconnected and have to start again. So that's the problem. If you've got a load balancer sitting in between, it can buffer that connection, find another server that will send the last half of that file, if it’s available, and try to maintain this communication.

Nick: Right. There's definitely a lot of value in that. So we’ve been talking a little bit about deployment for your customers’ perspective, but do you want to go into your workflow for deploying a new version of your application?

Dave: Yeah, absolutely. Ours is actually quite simple. We’ve tried to keep it simple – any time we move towards it becoming less simple, we have to justify it, because we try to have deployment be as simple as possible. Mostly, what happens is that for our two web servers, the frontends, once code is in Git and we’ve merged it, we will then manually trigger a deploy. So we control our deploys – that's just by choice right now – and we’ll deploy to some of the servers, and deploy to the rest of them once those are back online. We could deploy to the whole lot if need be, but we obviously don't, because we don't want the site to go offline.

And then the more complicated part is then updating things like the job servers and stuff like that. But we've got scripts and custom tools built to do that, which will take place in our deployment script when we click the button to deploy. And then for the client server side, that's a bit complicated. So the server runs in Kubernetes, we will deploy a new version of that container, and then switch users over via the load balancer so it will take the connections of the old one and then move them over to the new one. But it then has to check, "do we need to update the clients?", and all that kind of stuff.

Nick: Now, you mentioned that you do manual deployments, which I think is a good idea too. I always like that human element of pushing the red button when you're ready instead of an automated thing. But that deploy button, is that like a custom script you wrote or are you using Jenkins or something else?

Dave: No. We have a CI/CD environment, which is primarily run on CircleCI, actually. We collect all of that stuff, so on our deployment platform, it will have the build status and all that kind of stuff. But that deployment script, that job to actually go and do it, is all custom and stuff we've done.

Nick: And now when you're dealing with secrets during that deployment process, is that just through environment variables via CircleCI and Kubernetes? How are you dealing with that?

Dave: Almost exclusively environment variables. There's a bit of a headache when you get to stuff like Lambda because certain things can’t be certain sizes and you can't store … for example, the certificates for our API are too big, so you have to come up with different solutions. But primarily, it's just environment variables.

Nick: Right. So it sounds like you may have hit your head against the wall with those limitations. What have you done to get around that?

Dave: Well, there are ways – there are storage options they have which can store a larger number of bytes, 4K and then 4MB or something like that. I don’t know the exact details, but you just have to adapt to that. One of the challenges of that – and that's why I say we’re using it right now but will likely change to our own solution – is that it locks us into that cloud provider for that service, which is something that, by design, we don't really want to do for two reasons. One, we want to be flexible with the way we host and where we might be online and so on. But secondly, we also intend to offer this as a self-hosted solution for large enterprises, and they’re not going to have those functions.

Nick:  Yeah, that's definitely an important aspect if you plan to support that.

Dave: It's quite a big trend at the moment. So I don’t know what you would call it, but almost like cloud neutral. The ability to literally lift and shift your entire platform and move it to another cloud is quite a big trend that we see in our enterprise clients for cost savings, security, safety, whatever it might be. It's a big concern and it’s quite difficult to achieve.

Nick: Yeah, definitely. Especially when you're dealing with the data, moving the data is always the really hard part.

Dave: Exactly. And replication and keeping things in-sync between two locations. It's quite a challenge.

Nick: So, end-to-end, when you do add new code to deploy – let's say you're updating the Laravel apps to change some component of the web UI that your customers would use – what's the turnaround time from you pushing code to it being in production?

Dave: It's quite short. We're not at Twitter or Facebook levels of deploying thousands of times a day. From a pull request going through, it would have to go to our staging system, and it’s slow because of manual processes. We’ll push it to staging for actual user testing on the system, and often it can be something like a reporting change, which means we need to wait a day in order to see that the midnight changeover happened and things like that. But ultimately, our deployment process itself probably takes between five and 10 seconds, most of which is NPM compiling things.

Nick: That's actually really, really fast, especially if you're dealing with dependencies and things like that.

Dave: Yeah, it's very quick to deploy. We can actually get it down even lower. Literally, what we do right now is on our production servers we’ll compile all of the assets like our JavaScripts and stuff like that, which is not necessary. You know, we could push compiled stuff, but for now it's perfectly fine for us.

Nick: Right. So that takes care of the web frontends, but now let's say you need to push a change to something running on the client side or the Go backend for the server side of that. How does that look or is that at about the same?

Dave: That's different. Because what that actually involves is changing a container, updating a container. So we will push a new container, which we build with our internal tooling. We'll get the latest versions of everything we need from our repositories, which is all just on GitHub, compile it all, create the container and push it, then when we want to actually deploy that we just push it – put a new container up in Kubernetes.  And when I say we've got tooling and scripts, that script, for example, is probably like five lines. It's like just a shell script, five commands, because it doesn't have to be anything fancy.

Nick: Yeah, I'm a big fan of shell scripts. You can get a lot accomplished with very little.

Dave: Yeah, exactly. What we’ll often do is integrate it in some way. So, we might have a web control panel which shows you things like whether the build has passed and whether the PR has been merged, and then when you click the Go button, it might just run a shell script.

Nick: So now that the application is running in production, you’ve pushed a new deploy, what do you have set up for doing error reporting and logging and metrics and things like that?

Dave: We've had to do a lot of stuff ourselves, just because of how much we care about the performance of the platform, such as latency and all that kind of stuff. An error, for example, with us might not be an error. It might be someone tried to run a job on a node that just disconnected because it went down on the client side. So there's a lot of our own dashboarding. But for error reporting, we actually use Sentry, which I’m quite a fan of. So, for example, the web app, any Laravel stuff, will go straight to Sentry, and then everything's integrated with Datadog, which we use for a lot of our logging and reporting. So we have our own custom StatsD thing which we send reporting metrics to, which then get shipped off to Datadog, and all of that stuff sends messages to Slack.

Nick: Nice. That seems to be a popular pipeline:  Datadog, Sentry and Slack.

Dave: There is some overlap, to be honest, but it just depends what you use and for what purpose. For example, we don't have any Datadog agents on any of our servers. So their traditional way of monitoring – installing their little Python script or whatever it is and having it report things like your CPU usage – we don't use that at all. Literally, the only thing we use them for is StatsD: we send timings for communication with the server process, how long pages are taking to load, how long a command takes to execute, how many commands are running per second, sample rates and all that stuff. That's what we use it for. It's really wherever you find a fit for the tool. What's nice about using third-party tools is that, if you're using Laravel and Sentry, you just install the Laravel-Sentry plugin and you've got it working perfectly. So there's no reason not to use people’s software, and it's great.
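
On the wire, "sending timings to StatsD" is just plain-text metrics over UDP, which an agent such as DogStatsD then ships onward. Here is a small Go sketch with assumed metric names and the default agent address.

```go
// Sketch: emit StatsD timings and counters over UDP to a local agent.
// Metric names and the agent address are assumptions for illustration.
package main

import (
	"fmt"
	"log"
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("udp", "127.0.0.1:8125") // default StatsD/DogStatsD port
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	start := time.Now()
	// ... do the work you want to measure, e.g. talk to the server process ...
	elapsed := time.Since(start).Milliseconds()

	// StatsD line format: <name>:<value>|<type>[|@<sample rate>]
	fmt.Fprintf(conn, "nova.server.command_time:%d|ms\n", elapsed)
	fmt.Fprintf(conn, "nova.server.commands:1|c|@0.1\n") // counter, 10 percent sample rate
}
```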

Nick: Yeah, absolutely. Are you also taking advantage of things that certain cloud providers give you like Digital Ocean alerts and stuff like that or no?

Dave: No, not really, because our cloud servers are almost like fake servers – really, they’re just operating as part of a cluster. And so we need to manage that cluster ourselves, at least right now, is the reality. If an instance is just one of 20 of the exact same instances, then if it were to fail it doesn’t really have any kind of effect on us. We manage the infrastructure within Kubernetes. We don't really worry too much about their CPU alerts and memory usage alerts and stuff like that. We take the layer sitting on top of that and worry about the health of each individual device.

Nick: That makes sense. And it also goes back to what you mentioned before where, you know, it'd be nice if you could just switch from Digital Ocean to a different provider and it not be the end of the world.

Dave: Exactly. For example, we have a client that spends $5 million to $6 million on cloud computing fees a month, and they can switch clouds multiple times a month, because they get a better price on another cloud and they just move. The reality is that they’re not leaving one and joining another – they're scaling up in one and scaling down in another. To give you an idea, we've got clients who have 500 load balancers deployed, and behind each load balancer there could be 10 to 20 servers. They might have 5,000 servers in production with potentially tens of thousands of applications on them. So, a 10% difference in the cloud pricing bill might make it worth designing in this fashion, where you can actually cloud hop. When you get to that kind of scale, the way you look at an infrastructure provider is very different.

Nick: Yeah, for sure. I mostly operate at a much smaller scale – low tens of servers at the most. What you just mentioned are things I don't even think about. Spending millions of dollars a month on cloud hosting is a whole other level.

Dave: Exactly. And you’d be surprised. Look at us – we have nine developers. Our company is a lot bigger than that, but there are nine developers, and we’ve got 150 servers sitting in Kubernetes at the moment. There’s so much tooling in the DevOps world, so much enablement from open source and all these kinds of cloud providers nowadays, that a small team might have a pretty big set of infrastructure. And in a team like ours, we don’t have people that are experts in colocation or stuff like that. You really have to start thinking about that from the software side instead of just from IT. It used to be an IT Ops problem – the developers wrote this thing and I’ve just got to find a way to host it in both places. Today, it’s a developer problem. That’s the rise of DevOps: how can these things merge so that we can create the software in a way that is cloud neutral.

Nick: That’s a great point. I forget the stats, exactly, but a couple of years ago I think it was WhatsApp, they had such a low number of engineers but they were serving billions of events per month.

Dave: Instagram as well. I could be totally wrong here, but I think Instagram had something like 11 employees and they were so massive that their value per employee was something like $120 million or something. And they had such a high amount of infrastructure. It just shows how the game is changing.

Nick: We definitely live in an interesting time now, right? One solo developer can run some pretty crazy stuff.

Dave: It’s exciting, though. We could never do what we do now even five years ago. For our test environment, we need to see how high we can scale. So, we’re running a million Nova clients on that large Kubernetes cluster that I mentioned. How would we possibly launch a million servers to test something even three years ago? And we’re a relatively small team compared to big organizations that could still never have achieved that. I don’t even want to call it tooling, but that stuff has come so far that it’s really enabling small teams to do really cool stuff.

Nick: You mentioned testing and making sure things stay up – what type of plans do you have for disaster recovery, like malicious users, weird events? How do you do your database backups and all that fun stuff?

Dave: Database backups are mostly automated, because everything of ours is redundant and also co-located. We have two different data centers that everything goes to. Nova is routing our traffic to both data centers, and if one fails it will move it to the other one. So we’re not that worried about having a database backup every minute. Instead, we just run hourly database backups at the moment, in case of some unforeseen event across the entire infrastructure that we have. Obviously, the hosts of the PostgreSQL servers are backed up as well. But our infrastructure runs in multiple locations, everything is highly available, so there are at least two of everything – the database or the load balancer or whatever it might be. And then we use GSLB – it’s actually a component of Nova, but there are other people that provide that service as well – to do geographic routing. So, we’ll send Europeans to Europe and Americans to America, and if America is down, then we’ll send the Americans to Europe.

Nick: Right. So, you have quite a lot of good stuff going on to protect against downtime of your own system.

Dave: Like I said, we’re users of our own product. But that’s our business. We’re lucky that we have experience in that space.

Nick: Since you’re using your own tool in your own system, now other people can take advantage of that as well.

Dave: Exactly. That’s why I like doing stuff like this. We’ve learnt these lessons from working with our clients. We work with a lot of Fortune 500 businesses that are on the bleeding edge of solving a lot of these scalability issues. So that’s how we’ve learnt how to solve these things. And that’s why it’s nice to do stuff like this because it gets out to the community of smaller developers that are starting stuff up and thinking "how should we design this and how should we build it?" We have a free version of Nova for smaller teams for exactly that. It’s so difficult if you’re just starting out to follow best practices and this kind of thing can take on huge technical debt if you don’t know how to start.

Nick: You mentioned some clients spending millions of dollars per month. I have a feeling a lot of listeners of this podcast are not in that position. Is this something that a regular person would use on a smaller project and still get value from it?

Dave: Absolutely. By definition, you think that load balancing means I must have at least two servers. But that’s not even the requirement because, as I mentioned, we do a lot of application firewalling and Layer 7 security and observability and all that stuff, so you can use it with even one. But from a cost point of view, we allow you to have up to five nodes – a node would be a server or a container in a cloud – with any number of ADCs, so you can do that for free without any cost at all. So, for people that are hobbyists or got a small site, SMEs for example, they can just use it for free. It’s also really nice because I think if you’re in our game – DevOps – and you’re looking at stuff like this, these are important technologies to know about and to play with. And it even helps developers. It’s actually quite cool. Some of the new developers we brought on for this project, the way they develop things has changed a lot due to the exposure to how you take things into production, like load balancing and all the concerns around that. It’s a really good experience for people in the DevOps space.

Nick: There’s nothing that beats just real-world, actual experience, discovering workflows, automating things…

Dave: Exactly. So for all of that, it’s totally free.

Nick: That’s great. And beyond that five-node limitation, are there any other limitations imposed? Is it one of those pay-to-win type of things where you only get 20% of the features and have to pay for the rest?

Dave: No. To be completely honest with you, there are some limitations but they’re all entirely practical. So, for example, we can only offer so much support on the free version, and we can’t do SLAs and stuff like that. And we also only keep your data, your reporting, for seven days, because we’re actually storing that stuff for you for free and storing it for longer is expensive for us. But functionality-wise – being able to deploy that stuff, use the security, everything I mentioned – you can use all of it for free.

Nick: So, people can use it for free, but then it costs money if they want to go beyond five nodes. We didn’t really get a chance to talk about how you’re able to deal with payments within your application – are you using Stripe or some other payment gateway?

Dave: We use Stripe. On our older platform, what we call our shop, which is where we sell our traditional product called Snapt Aria, we use 2Checkout, Stripe and PayPal. But for Nova, we decided to standardize on Stripe and use only that.

Nick: Stripe is a good service. Have you upgraded to Payment Intents?

Dave: Yeah. Our billing is the most recent thing that we’ve worked on in the platform, so everything is using all their new stuff. Where Stripe is great is that we bill per month – there’s a fee per node per month – but what we really do is bill per hour. If you run one for five hours, you pay just for five hours. And if you want to do that type of billing, where you bill in arrears at the end of the month for this unknown amount, which could be less than what the client originally signed up for, Stripe is really nice for that.

Nick: That is such a common thing to do in the cloud world.

Dave: Variable billing. People hear that and they think you’re going to bill me for more than what I signed up for. But really, what a lot of people like us are trying to deal with is billing people less. When you’re charging depending on the number of things that are running, you need to be able to alter the bill within some reasonable range, and Stripe has done a good job of finding the middle ground between protecting the consumer and empowering the developer. With PayPal, for example, it’s very difficult to charge an amount to someone that they didn’t originally agree to.

Nick: I’m not sure of the rules for Stripe, but is it that as soon as you get that token to allow you to make a charge, it could be for that amount or less, but not more?

Dave: It depends. You can actually request one to have a larger amount or to increase. But it’s quite transparent – the user gets a good sense of what they’re allowing. For us, because we charge by the node and you could launch 10 nodes, we have to be able to charge you more, but then we add our own developer-side protections. For example, you can say "never run more than this" or "warn me when I go over this amount of money".

Nick: Stripe is just getting better and better and accepting payments is just getting harder and harder.

Dave: Having a provider for that makes a lot of sense. Another reason why working with frameworks often makes sense. Interestingly enough, for billing we had to move away from Laravel systems and do our own because of the complexity. But we were still able to use a lot of the existing classes that they have.

Nick: In a semi-related matter, when it comes to your service notifying users of things going wrong over email, what transactional email service do you use?

Dave: We use SendGrid. As to why, to be honest, I’ve used them for the longest time and have always been happy with them.  I haven’t recently evaluated others, so I don’t know. But I can say that SendGrid has always done a good job for us.

Nick: Do you happen to know off the top of your head how many emails you send out a day?

Dave: Not that many. Probably less than 1,000 at the moment. Users can subscribe to notifications for alerts on their nodes and information about the devices, if there’s a spike in traffic, and a lot of it is that sort of thing. But not a huge amount.

Nick: One last tech question before we wrap this up. How do you deal with SSL certificates with your setup? Does your load balancer terminate them?

Dave: Nova will provision Let’s Encrypt certificates automatically, so we just use that functionality, so it self-provisions certificates for the site.

Nick: As the end developer setting this up, if they were to use your load balancer, they don’t really need to think about setting up Let’s Encrypt on their own at all?

Dave: No. Literally, they can just select Let’s Encrypt, say "yes" to accept the terms and conditions of Let’s Encrypt, and then type in whatever domain names they want us to issue certificates for. Also, importantly, we don’t get those certificates – they reside on the node. We just have the configuration that says you want those certificates; we don’t get the keys. And the node is in your infrastructure, so it will automatically just get the Let’s Encrypt certificates and renew them and keep them all up to date.
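
In Go terms, the idea of a node provisioning and renewing its own Let’s Encrypt certificates looks roughly like this autocert sketch; the domain and cache directory are placeholders, and this is not Nova’s implementation.

```go
// Sketch: automatic Let's Encrypt provisioning and renewal with autocert.
// The domain and cache directory are placeholders.
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/crypto/acme/autocert"
)

func main() {
	m := &autocert.Manager{
		Prompt:     autocert.AcceptTOS,                    // accept Let's Encrypt terms
		HostPolicy: autocert.HostWhitelist("example.com"), // domains to issue certs for
		Cache:      autocert.DirCache("certs"),            // keys stay on this node
	}

	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello over TLS")
	})

	srv := &http.Server{Addr: ":443", Handler: mux, TLSConfig: m.TLSConfig()}
	go http.ListenAndServe(":80", m.HTTPHandler(nil)) // serve ACME HTTP-01 challenges
	log.Fatal(srv.ListenAndServeTLS("", ""))          // certs come from the autocert manager
}
```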

Nick: What is that one saying: push button, receive bacon?

Dave: Exactly. Except it’s complicated SSL certificates. It’s much easier to have the load balancer provision that all for you. What can be a concern is that typically, when you use the word termination – and SSL termination is what we would normally call that – you think you’re sending plain text out the back of the load balancer. But if you’re in public cloud and you worry about security, you can re-encrypt. So, you can have client-certificate sorts of environments where we then re-encrypt to your backends if need be. But it’s often very nice to have it centrally on the load balancer, because if you look at vulnerabilities in software in the last five years, there have been a lot in SSL-related things on web servers, and it’s very nice to have one central thing that is maintained by us and managed by us. We were the first ADC company in the world to patch Heartbleed, for example. So, you’ve got this company that’s taking care of this stuff for you, and you don’t have to worry about user error. If you have 50 servers, updating the certificates on all 50 servers is quite a hassle, whereas if you only have one place, it’s much easier. I’m a big believer in having your load balancer do your SSL as opposed to having your web servers do it.

Nick: It almost goes back to what you said before about developer patterns that you need to take into account when you’re using something like Kubernetes – you can’t just upload a file to the server directly, you have to put it somewhere else. This is similar.

Dave: Exactly, and it’s nice to have responsibilities condensed. Your container then becomes as simple as possible. And the web server is suddenly no longer important. If you wind up serving some functions from a serverless environment, it doesn’t matter, because you’ve got that stuff contained in the load balancer. It just makes for better design for people who are ultimately looking to scale things out.

Nick: Do you have any best tips and lessons learned for anyone who might be developing some type of similar application as you? Or at least any type of Laravel web app?

Dave: Let me say, our project is a big one. This is a very large-scale system and it’s something that is going to be worked on and run for many years, so perhaps my tips are somewhat different. There are a lot of startups that are creating things with the vision of becoming this type of thing. So one of the best decisions we made was to develop this thing we call our code contract, which is eight bullet points on how we have to write code in order to ensure that it’s maintainable, that it’s secure, that it’s clear, that it’s scalable, etc. And believe it or not, you can do all that with eight bullet points, if you’re very specific about them. And every pull request we put through is evaluated against this contract, and pull requests are rejected if they don’t pass. We started that on Day One. Sometimes it’s meant that developing a certain feature took a little bit longer – but not a lot longer – and it’s meant that we’ve wound up with a system that meets these criteria that are otherwise very hard to meet.

Nick: That almost sounds like Heroku’s 12-factor application, but you have an eight-factor application?

Dave: Sort of. It’s like eight instructions for if you don’t want to get an angry meme on your pull request. It’s very specific things – the difference between something that writes to a system and something that reads from a system, for example; we had to design that in a very specific way as well, because we are a security company. It’s eight core concepts, and it doesn’t have to be eight – another company wouldn’t use our eight. The idea is sitting down and saying: what is the purpose of the system? For us, we said it has to be extremely scalable, it’s going to be worked on for many years, it has to be very secure, etc. So, for example, something that’s going to be worked on for many years changes the way you name functions, because someone else is going to come in in three years’ time and look at it. We just try to live by that contract, and it’s served us really well. It wasn’t my idea – it came from our development team – and it was a great idea.

Nick: Sounds like a great idea. On a similar note, I’m a huge fan of just having a checklist that you can go through. It just helps so much.

Dave: It really helps. Pull requests can be quite funny – you have differences of opinion on certain things. But with this, you will often see "this violates the code contract" on a pull request on a piece of code. And then there’s no discussion, people don’t debate it, they say, "oh yeah, good point" and then it gets fixed. It’s like this independent third party that’s not disagreeing with your code but rather just enforcing our rules. And it works well.

Nick: Given that your app has been in development for a little bit, did you make any mistakes early on that you had to correct later?

Dave: I don’t want to say no, because we definitely have. But I can’t think of any large-scale things. Time series databases were difficult – we struggled to choose one and get all of the features we wanted, so it took longer than it should have. But otherwise, no, not really. I think we’re pretty happy with where we are so far in the journey. But like I say, we are still early in our journey of development on this app.

Nick: We’ll have to set another date for a year from now and see what happens. So, Dave, thanks so much for coming on the Running in Production podcast, it was great having you on the show.

Dave: Thanks for having me on, it’s been great to chat.

Nick: Do you want to share any links?

Dave: Absolutely. Our business is called Snapt, you can find us at corp.snapt.net. And we’re on Twitter, Facebook.