Tag: glitchtip

  • Monitor network endpoints with Python asyncio and aiohttp

    My motivation – I wanted to make a network monitoring service in Python. Python isn’t known for its async abilities, but with asyncio it’s possible. I wanted to include it in a larger Django app, GlitchTip. Keeping everything as a monolithic code base makes it easier to maintain and deploy. Go and Node handle concurrent IO a little more naturally, but neither has a web framework anywhere near as feature complete as Django.

    How asyncio works compared to JavaScript

    I’m used to synchronous Python and asynchronous JavaScript. asyncio is a little strange at first, and far more verbose than stringing along a few JS promises. Let’s compare equivalent examples in JS and Python.

    fetch('http://example.com/example.json')
      .then(response => response.json())
      .then(data => console.log(data));

    import asyncio

    import aiohttp

    async def main():
        async with aiohttp.ClientSession() as session:
            async with session.get('http://example.com/example.json') as response:
                data = await response.json()
                print(data)

    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

    There’s more boilerplate in Python. aiohttp has three chained async calls, while fetching JSON in JS requires just two chained promises. Let’s break these differences down a bit:

    • An async call to GET/POST/etc. the resource. At this point, we don’t have the body of the response. fetch vs session.get are about the same here.
    • An async call to get the body contents (and perhaps process them in some manner, such as converting a JSON payload to an object or dictionary). If we only need, say, the status code, there is no need to spend time doing this. Both have async text() and json() functions that work similarly.
    • aiohttp has a ClientSession context manager that closes the connection; the only async IO in it happens when the connection closes. It’s possible to reuse a session for some performance benefit (see the sketch after this list). This is often useful in Python, as our async code block will often live nested in synchronous code. fetch does not have an equivalent (as far as I’m aware at the time of this writing).
    • get_event_loop and run_until_complete let us run async functions from synchronous code. Python is synchronous by default, so this is necessary. When running Django, Celery, or a plain Python script, everything blocks until explicitly run async. JavaScript, on the other hand, lets you run async code with zero boilerplate.
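
    Here’s a minimal sketch of that session reuse in action (the URLs are placeholders), checking several endpoints concurrently with one shared ClientSession:

    import asyncio

    import aiohttp

    async def fetch_status(session, url):
        # Reusing one session lets aiohttp pool connections across requests
        async with session.get(url) as response:
            return response.status

    async def main():
        urls = ["http://example.com/a", "http://example.com/b"]
        async with aiohttp.ClientSession() as session:
            statuses = await asyncio.gather(
                *[fetch_status(session, url) for url in urls]
            )
            print(statuses)

    asyncio.get_event_loop().run_until_complete(main())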

    One other thing to note is that both Python and JavaScript are single threaded. While you can “do multiple things” while waiting for IO, you cannot use multiple CPU cores without starting multiple processes, for example by running several uwsgi workers. Hence the name asyncio in Python: asynchronous IO, not parallel computation.

    Source: docs.aiohttp.org

    Network Monitoring with aiohttp

    Network monitoring can easily start as a couple-line script or grow into a very complex, massive service, depending on scale. I won’t claim my method is the best mega-scale method ever, but I think it’s quite sufficient for small to medium scale projects. Let’s start with requirements:

    • Must handle 1 million network checks per minute
    • Must run at least every 30 seconds (at smaller scale the interval could probably be much shorter)
    • Must be pure Python and embed into a Django code base
    • Must not require anything other than a Celery-compatible message broker and a Django-compatible database

    And a few non-functional requirements that I believe will help it scale:

    • Must scale to run from many servers (Celery workers)
    • Must batch database writes as efficiently as possible to avoid bottlenecks

    Overview of architecture

    A Celery beat scheduler will run a “dispatch_checks” task every 30 seconds. dispatch_checks will determine which “monitors” need to be checked based on their set interval frequency and last check. It will then batch these into groups and dispatch further parallel Celery tasks called “perform_checks” to actually perform the network checks. The perform_checks task will fetch the additional monitor data in one query and asynchronously check each network asset. Once done, it saves the results to the database using the standard Django ORM. By batching inserts, we should be able to improve scalability. It also means we don’t need a massive number of Celery tasks, which would be unnecessary overhead. In real life we may only have a few Celery workers at “small or medium scale,” so it would waste resources to dispatch 1 million Celery tasks. If we batch inserts in groups of 1,000 and really hit our max target of 1 million monitors, we would want up to 1,000 Celery workers. Another variable is the timeout for each check: lowering it means workers finish sooner instead of waiting on the slowest request.

    See the full code on GlitchTip’s GitLab.

    Celery Tasks

    import asyncio
    from typing import List

    from celery import shared_task
    from django.db.models import DateTimeField, ExpressionWrapper, F, OuterRef, Subquery
    from django.utils import timezone

    # Monitor and MonitorCheck are GlitchTip models; fetch_all is defined in the
    # aiohttp code below.

    @shared_task
    def dispatch_checks():
        now = timezone.now()
        latest_check = Subquery(
            MonitorCheck.objects.filter(monitor_id=OuterRef("id"))
            .order_by("-start_check")
            .values("start_check")[:1]
        )
        monitor_ids = (
            Monitor.objects.filter(organization__is_accepting_events=True)
            .annotate(
                last_min_check=ExpressionWrapper(
                    now - F("interval"), output_field=DateTimeField()
                ),
                latest_check=latest_check,
            )
            .filter(latest_check__lte=F("last_min_check"))
            .values_list("id", flat=True)
        )
        batch_size = 1000
        batch_ids = []
        for i, monitor_id in enumerate(monitor_ids.iterator(), 1):
            batch_ids.append(monitor_id)
            if i % batch_size == 0:
                perform_checks.delay(batch_ids, now)
                batch_ids = []
        if len(batch_ids) > 0:
            perform_checks.delay(batch_ids, now)
    
    @shared_task
    def perform_checks(monitor_ids: List[int], now=None):
        if now is None:
            now = timezone.now()
        # Convert queryset to raw list[dict] for asyncio operations
        monitors = list(Monitor.objects.filter(pk__in=monitor_ids).values())
        loop = asyncio.get_event_loop()
        results = loop.run_until_complete(fetch_all(monitors, loop))
        MonitorCheck.objects.bulk_create(
            [
                MonitorCheck(
                    monitor_id=result["id"],
                    is_up=result["is_up"],
                    start_check=now,
                    reason=result.get("reason", None),
                    response_time=result.get("response_time", None),
                )
                for result in results
            ]
        )
    

    The fancy Django ORM subquery ensures we can determine which monitors need to be checked while staying as performant as possible. While some may prefer complex queries in raw SQL, for some reason I prefer the ORM, and I’m impressed to see how many use cases Django can cover these days. Anything to avoid writing lots of join table SQL 🤣️
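
    If you’re curious what SQL this produces, you can ask Django directly; a quick sketch, using the monitor_ids queryset from dispatch_checks above:

    # Print Django's generated SQL for the annotated queryset (approximate;
    # parameters are not quoted exactly as the database driver would send them)
    print(monitor_ids.query)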

    aiohttp code

    import asyncio
    import time
    from datetime import timedelta
    from ssl import SSLError

    import aiohttp

    # MonitorType, MonitorCheckReason, and the PING/DEFAULT_AIOHTTP_TIMEOUT
    # constants are defined elsewhere in GlitchTip.

    async def process_response(monitor, response):
        if response.status == monitor["expected_status"]:
            if monitor["expected_body"]:
                if monitor["expected_body"] in await response.text():
                    monitor["is_up"] = True
                else:
                    monitor["reason"] = MonitorCheckReason.BODY
            else:
                monitor["is_up"] = True
        else:
            monitor["reason"] = MonitorCheckReason.STATUS
    
    async def fetch(session, monitor):
        url = monitor["url"]
        monitor["is_up"] = False
        start = time.monotonic()
        try:
            if monitor["monitor_type"] == MonitorType.PING:
                async with session.head(url, timeout=PING_AIOHTTP_TIMEOUT):
                    monitor["is_up"] = True
            elif monitor["monitor_type"] == MonitorType.GET:
                async with session.get(url, timeout=DEFAULT_AIOHTTP_TIMEOUT) as response:
                    await process_response(monitor, response)
            elif monitor["monitor_type"] == MonitorType.POST:
                async with session.post(url, timeout=DEFAULT_AIOHTTP_TIMEOUT) as response:
                    await process_response(monitor, response)
            monitor["response_time"] = timedelta(seconds=time.monotonic() - start)
        except SSLError:
            monitor["reason"] = MonitorCheckReason.SSL
        except asyncio.TimeoutError:
            monitor["reason"] = MonitorCheckReason.TIMEOUT
        except OSError:
            monitor["reason"] = MonitorCheckReason.UNKNOWN
        return monitor
    
    async def fetch_all(monitors, loop):
        async with aiohttp.ClientSession(loop=loop) as session:
            results = await asyncio.gather(
                *[fetch(session, monitor) for monitor in monitors], return_exceptions=True
            )
            return results

    That’s it. Ignoring my models and plenty of Django boilerplate, we have the core of a reasonably performant uptime monitoring system in about 120 lines of code. GlitchTip is MIT licensed, so feel free to use it as you see fit. I also run a small SaaS service at app.glitchtip.com, which helps fund development.

    On testing

    I greatly prefer testing in Python over JavaScript. I’m pretty sure this 15-line integration test would require fairly complex Jasmine boilerplate and run about infinitely slower in CI. I will gladly put up with some asyncio boilerplate to avoid testing anything in JavaScript. In my experience, there are Python test-driven-development fans, and there are JS developers who intended to write tests.

        # Uses aioresponses, model_bakery's baker, and freezegun's freeze_time;
        # this method lives in a django.test.TestCase subclass.
        @aioresponses()
        def test_monitor_checks_integration(self, mocked):
            test_url = "https://example.com"
            mocked.get(test_url, status=200)
            with freeze_time("2020-01-01"):
                mon = baker.make(Monitor, url=test_url, monitor_type=MonitorType.GET)
            self.assertEqual(mon.checks.count(), 1)
    
            mocked.get(test_url, status=200)
            with freeze_time("2020-01-01"):
                dispatch_checks()
            self.assertEqual(mon.checks.count(), 1)
    
            with freeze_time("2020-01-02"):
                dispatch_checks()
            self.assertEqual(mon.checks.count(), 2)

    There’s a lot going on in little code. I use aioresponses to mock network requests, model_bakery’s baker to quickly generate DB test data, and freezegun to simulate time changes. assertEqual comes from Django’s TestCase. And not seen: CELERY_ALWAYS_EAGER in settings.py forces Celery to run synchronously for convenience. I didn’t write any async test code, yet I have a pretty decent test covering the core functionality, from having monitors in the DB to ensuring they were checked properly.
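
    For reference, a minimal sketch of the relevant test settings (the second line is an assumption on my part, though it’s commonly paired with eager mode):

    # settings.py used by tests
    CELERY_ALWAYS_EAGER = True  # run Celery tasks synchronously in tests
    CELERY_EAGER_PROPAGATES_EXCEPTIONS = True  # assumption: surface task errors as test failures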

    JS equivalent

    describe("test uptime", function() {
      it("should work", function() {
        // TODO
      });
    });

    Joking aside, I find it quite hard to find a good Node-based task queue like Celery, an ORM, and a test framework that really work well together. There are many little niceties, like running Celery in always-eager mode, that make testing a joy in Python. Let me know in a comment if you disagree and have any JavaScript based solutions you like.

  • Deploy Django with Helm to Kubernetes

    This guide attempts to document how to deploy a Django application with Kubernetes while using continuous integration. It assumes basic knowledge of Docker and running Kubernetes, and will instead focus on using helm with CI. Goals:

    • Must be entirely automated and deploy on git pushes
    • Must run database migrations once and only once per deploy
      • Must revert deployment if migrations fail
    • Must allow easy management of secrets via environment variables

    My need for this is to deploy GlitchTip staging builds automatically. GlitchTip is an open source error tracking platform that is compatible with Sentry. You can find the finished helm chart and gitlab CI script here. I’m using DigitalOcean and Gitlab CI but this guide will generally work for any Kubernetes provider or Docker based CI tool.

    Building Docker

    This guide assumes you have basic familiarity with running Django in Docker. If not, consider a local build first using docker compose. I prefer using compose for local development because it’s very simple and easy to install.
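
    To give an idea, here’s a minimal docker-compose sketch for local Django development (the service layout and image version are assumptions, not GlitchTip’s actual compose file):

    version: "3"
    services:
      db:
        image: postgres:12
        environment:
          POSTGRES_HOST_AUTH_METHOD: "trust"  # local development only
      web:
        build: .
        command: python manage.py runserver 0.0.0.0:8000
        ports:
          - "8000:8000"
        environment:
          DATABASE_URL: postgres://postgres@db:5432/postgres
        depends_on:
          - db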

    Build a Docker image and tag it with the git short hash. This will allow us to specify an exact image build later on and will ensure code builds are tied to specific helm deployments. If we used “latest” instead, we might end up accidentally upgrading the Docker image. Using Gitlab CI, the script may look like this:

    docker build -t ${CI_REGISTRY_IMAGE}:${CI_COMMIT_REF_NAME} -t ${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA} .

    This uses -t to tag the new build, with the Gitlab CI environment variables specifying the docker registry and tags. It uses “ref name,” which is the tag or branch name; this will result in a tag such as “1.3” or a branch name such as “dev”. This tagging is intended for users who may just want a specific named version or branch. The second -t tags the image with the git short hash. This tag will be referenced later on by helm.
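
    The matching push step might look like this (a sketch; it assumes the job has already authenticated to the registry with docker login):

    docker push ${CI_REGISTRY_IMAGE}:${CI_COMMIT_REF_NAME}
    docker push ${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}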

    Before moving on – make sure you can now docker pull your CI built image and run it. Make sure to set the Dockerfile CMD to use gunicorn, uwsgi, or another production ready server. We’ll deal with Django migrations later using Helm.

    Setting up Kubernetes

    This guide assumes you know how to set up Kubernetes. I chose DigitalOcean because they provide managed Kubernetes, it’s reasonably priced, and I like supporting smaller companies. DigitalOcean limits choice, which makes it easier to use for typical projects; it doesn’t offer the level of customization and services AWS does. If you decide to use DigitalOcean and want to help offset the cost of my open source projects, consider using this affiliate link. My goals for a hosting platform are:

    • Easy to use
    • Able to be managed via terraform
    • Managed Postgres
    • Managed Kubernetes
    • Able to restrict network access for internal services such as the database

    Whichever platform you are using, make sure you have a database and its connection string, and can authenticate to Kubernetes. If you are new to Kubernetes, I suggest deploying any docker image manually (without tooling like helm) to get a little more familiar. Technically, you could also run your database in Kubernetes via Helm. However, I prefer managed stateful services and will not cover running the database in Kubernetes in this guide.

    Deploy to Kubernetes with Helm in Gitlab CI

    Update Feb 2021
    The GlitchTip Helm Chart is now a generic Django + Celery Helm chart. Read more here.


    Now that you have a Docker image and Kubernetes infrastructure, it’s time to write a Helm chart and deploy your image automatically from CI. A Helm chart allows you to write Kubernetes yaml configuration templates using variables. The chart I use for GlitchTip should be a good starting point for most Django apps. At a minimum, read the getting started section of Helm’s documentation. The GlitchTip chart includes one web server deployment and a Django migration job with a Helm lifecycle hook. You may need to set up an additional deployment if you use a worker such as Celery. The steps are the same; just override the container command (the image’s CMD) to start Celery instead of your web server.
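
    For reference, here’s a minimal sketch of how a migration Job with a Helm lifecycle hook can be wired up (this is not the exact GlitchTip chart; the names and values are placeholders):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: "{{ .Release.Name }}-migrate"
      annotations:
        # Run before install/upgrade; if the migration fails, the deploy aborts
        "helm.sh/hook": pre-install,pre-upgrade
        "helm.sh/hook-delete-policy": before-hook-creation
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: migrate
            image: "your-registry/your-app:{{ .Values.image.tag }}"
            command: ["python", "manage.py", "migrate"]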

    Run the initial helm install locally. This is necessary to set initial values, such as the database connection, that don’t need to be set in CI on each deploy. Declare each value you want to override in your chart’s values.yaml. If following my GlitchTip example, those will be databaseURL and secretKey. databaseURL is the database connection string; I use django-environ to read it. You could also define separate databaseUser, databasePassword, etc. values if you like making more work for yourself. The key to making this work is to ensure that, one way or another, the database credentials and other configuration get passed in as environment variables that are read by your settings.py file. Ensure your CI server has built at least one docker image, and place your chart files in the same git repo as your Django project in a directory named “chart”.

    Run helm install your-app-name ./chart --set databaseURL=string --set secretKey=random_string --set image.tag=git_short_hash

    If you use GlitchTip’s chart, it will not set up a load balancer, but it will show output that explains how to connect locally just to test that everything is working. The Django migration job should also run and migrate your database. This guide will not cover the many options you have for load balancing. I chose to use DigitalOcean’s load balancer and have it directly select the deployment’s pods. Note that in Kubernetes, a service of type LoadBalancer may provision a service provider’s load balancer and allow you to configure it through Kubernetes config yaml; this will vary between providers. Here’s a sample load balancer that can be applied with kubectl --namespace your-namespace apply -f load-balancer.yaml. Note that it uses a selector to send traffic from the load balancer directly to pods. It also contains DigitalOcean specific annotations, which is why I can’t document a universal way to do this.

    apiVersion: v1
    kind: Service
    metadata:
      name: your-app-staging
      annotations:
        service.beta.kubernetes.io/do-loadbalancer-certificate-id: long-id
        service.beta.kubernetes.io/do-loadbalancer-healthcheck-path: /
        service.beta.kubernetes.io/do-loadbalancer-protocol: http
        service.beta.kubernetes.io/do-loadbalancer-redirect-http-to-https: "true"
        service.beta.kubernetes.io/do-loadbalancer-tls-ports: "443"
    spec:
      type: LoadBalancer
      ports:
      - name: http
        port: 80
        protocol: TCP
        targetPort: 8080
      - name: https
        port: 443
        protocol: TCP
        targetPort: 8080
      selector:
        app.kubernetes.io/instance: your-app-staging
        app.kubernetes.io/name: your-app
    
    

    At this point you should have a fully working Django application.

    Updating in CI using Helm

    Now set up CI to upgrade your app on git pushes (or other criteria). While technically optional, I suggest making separate namespaces and service accounts for each environment. Unfortunately, this process can feel obtuse at first; I felt it was the hardest part of this project. For each environment, we need the following:

    • Service Account
    • Role Binding
    • Secret with CA Cert and token

    For a rough analogy, the service account is a “user,” but for a bot instead of a human. A role binding defines the permissions that something (say, a service account) has; the role binding should have the “edit” permission for the namespace. The secret is like the “password,” but is actually a certificate and token. Read more in the Kubernetes documentation.
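
    Here’s a minimal sketch of the first two pieces (the names are placeholders; on Kubernetes versions of this era, creating the service account automatically generated its token secret):

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: ci-deployer
      namespace: your-namespace
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: ci-deployer-edit
      namespace: your-namespace
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: edit
    subjects:
    - kind: ServiceAccount
      name: ci-deployer
      namespace: your-namespace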

    Once this is set up, test it out locally. For example, use the new service account auth in your ~/.kube/config and run kubectl get pods --namespace=your-namespace. The CA cert and token from your recently created secret should be what is in your kube config file. I found no sane manner of editing multiple Kubernetes configurations and resorted to manually editing the config file.

    apiVersion: v1
    clusters:
    - cluster:
        certificate-authority-data: big-long-base64 
        server: https://stuff.k8s.ondigitalocean.com
      name: some-name
    
    ...
    
    users:
    - name: default
      user:
        token: big-long-token-from-secret

    Notice I used certificate-authority-data so I could reference the cert inline as base64. Next, save the entire config file in Gitlab CI under Settings, CI/CD, Variables.

    [Screenshot: the kube config saved in Gitlab CI as a protected variable of type File with key KUBECONFIG]

    There’s actually a lot happening in this little bit of configuration. The File type in Gitlab CI causes the value to be saved into a random tmp file, and the key “KUBECONFIG” is set to the file’s location. KUBECONFIG is also the environment variable helm uses to locate the kube config file. Protected makes the variable available only to protected git branches/tags. If we didn’t set protected, someone with only limited git access could make their own branch that runs cat $KUBECONFIG and view the very confidential data! If set up right, you should now be able to run helm with authentication that just works.

    Finally add the deploy step to Gitlab CI’s yaml file.

    deploy-staging:
      stage: deploy
      image: lwolf/helm-kubectl-docker
      script:
        - helm upgrade your-app-staging ./chart --set image.tag=${CI_COMMIT_SHORT_SHA} --reuse-values
      environment:
        name: staging
        url: https://staging.example.com
      only:
        - master
    

    stage ensures it runs after the docker build. For the image, use lwolf/helm-kubectl-docker, which has helm already installed. The script is amazingly just one line, thanks to the authentication and Gitlab CI variable tricks done previously. It runs helm upgrade with --set image.tag set to the new git short hash; --reuse-values allows it to set this new value without overriding previously set values. Using helm this way allows you to keep database secrets outside of Gitlab. Do note, however, that anyone with helm access can read these values. If you need a more robust system, then you’ll need something like Vault. But even without Vault, we can isolate basic git users who can create branches from admin users who have access to helm and the master branch.

    The environment section is optional and lets Gitlab track deploys. “only” causes the script to run only on the master branch. Alternatively, it could be set for other branches or tags.

    If you need to change an environment variable, run the same upgrade command locally and --set as many variables as needed. Keep the --reuse-values. Because the databaseURL value is marked as required, helm will error instead of erasing previous values should you forget the important --reuse-values.
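
    For example, a settings change might look like this (newSetting is a placeholder name):

    helm upgrade your-app-staging ./chart --set newSetting=new-value --reuse-values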

    Conclusion

    I like Kubernetes for its reliability, but I find it creates a large amount of decision fatigue. I hope this guide provides one way to do things that I find works. If you have a better way, let me know by commenting here or even opening an issue on GlitchTip. I’m sure there’s room for improvement. For example, I’d rather generate the Django secret key automatically, but helm’s random function doesn’t let you store the result persistently.

    I don’t like Kubernetes’ at times maddening complexity. Kubernetes is almost never a solution by itself and requires additional tools to make it work for even very basic use cases. I found Openshift handles a lot of common use cases, like deploy hooks and user/service management, much more easily. Openshift “routes” are also defined in standard yaml config rather than forcing the user to deal with proprietary annotations on a load balancer. However, I’m leery of using Openshift Online considering it hasn’t been updated to version 4 and no roadmap seems to exist. It’s also quite a bit more expensive (not that it’s bad to pay more for good open source software).

    Finally, if you need error tracking for your Django app and prefer open source solutions, give GlitchTip a try. Contributors are preferred, but you can also support the project by using the DigitalOcean affiliate link or donating. Burke Software also offers paid consulting services for open source software hosting and software development.