How I leveraged GitLab CI to keep my application fed with data (with example)

17.07.2018

GitLab’s integrated CI/CD is advertised as a way to quickly and automatically build, test and deploy code. But what if we used it to continuously deliver the product of our code, rather than the code itself? It turned out to be a perfect - and achievable - use case for my Cinema Citizen web app.

Manual simply didn’t work

In short, CCtzn helps you plan movie marathons. The calculations run on JSON data I need to prepare beforehand for each cinema and day of the week. The script responsible for the “download & convert” work is a PHP piece proudly called timetables-maker. It works reasonably well, given how unstructured the original data is, and how many separate datasets it generates. Currently it spits out about 1000 files in roughly 4 minutes on a laptop.

Since my app relies on real-world timetables provided by cinema chains, it’s a bit time-sensitive. The same day they update information about screenings, I should run the script for fresh data. My shared hosting supports cron, but it also limits max_execution_time, so the script could be killed mid-run! If you read my previous post, you know that I’m too cheap to upgrade the server, especially for a non-profit-yet app’s needs.

I started thinking about changing the approach. Instead of generating hundreds of JSONs at once every week, I could let the script run on demand, producing a single (cached) file each time a user requests the timetable for a selected venue and date. This approach had its own downsides, though:

“Cold start”: if a specific JSON hasn’t been requested yet, generating it takes some time, which the user perceives as lag.
I couldn’t know whether the script failed (for example, if the format of the data I fetch changed slightly and the converter needs tuning).
It could still time out (for example, if the external API or website is down).

So until recently, I was the actual backend for my app, having to manually run the script and upload the generated files via FTP (another trait of simple, old-school hosting). I was too lazy for it, so CCtzn just lay there for months, starving for JSONs. But this time, being lazy also motivated me to delegate this process to the machine.

Automatic did work

When I started playing with GitLab’s solution, my primary focus was to automate the FTP upload; after that came everything else (e.g. building the JavaScript bundle). I used to write Gulp tasks for the job, but that still required me to trigger the file transfer myself, and it sometimes failed. Being able to skip the local step and start directly from the repository feels awesome.

The following paragraphs contain some quoted terms (like “stage”) from CI/CD nomenclature, which should be interpreted as in the docs; they read pretty intuitively anyway.

A typical setup consists of a few “stages” (for example, “build”, “test” and “deploy”), whose ultimate goal is to deliver an updated version of your code to some environment. This is what most tutorials say. The fact is that you can do pretty arbitrary things with the CI/CD engine, not necessarily aimed at releasing. This is what I did - triggered side effects (i.e. generated JSONs with movie timetables) from my code, and pushed the files to FTP. All for free, within the monthly allowance of 2000 minutes of “runner” usage; and without a (visible) time limit for my script.

The process step by step:

I figured out the stages and “jobs” required for my task. I defined just two: the first, naturally, downloads and converts timetables for my app, and the second uploads them.
I saved some variables in GitLab’s “CI / CD Settings” view so I don’t expose my server credentials in the code.
I defined the jobs in a special .gitlab-ci.yml file, placed in the repository root.
Again in the GitLab web interface, I set up a schedule for running the “pipeline”, using cron syntax.

And voila! Finally I have a humanless backend :) As promised (and because I work openly on the CCtzn project), you can see the final configuration here.
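For readers who just want the shape of it, a two-stage pipeline like the one described above might look roughly like the sketch below. This is not the actual CCtzn configuration: the script name, the output/ directory, the remote timetables/ folder, the FTP_HOST, FTP_USER and FTP_PASSWORD variable names, the chosen Docker images and the use of lftp are all assumptions made for illustration.

```yaml
# Sketch only: file names, paths, images and variable names are assumptions,
# not copied from the CCtzn repository.
stages:
  - generate
  - upload

generate_timetables:
  stage: generate
  image: php:cli                 # an image that matches the job (PHP script)
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'   # run only in scheduled pipelines
  script:
    # Hypothetical entry point, assumed to write its JSON files into output/
    - php timetables-maker.php
  artifacts:
    paths:
      - output/                  # hand the generated files to the next stage
    expire_in: 1 day

upload_timetables:
  stage: upload
  image: alpine:latest
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    - apk add --no-cache lftp
    # FTP_HOST, FTP_USER and FTP_PASSWORD are hypothetical names for variables
    # saved in the project's CI/CD settings, so no credentials live in the repo.
    - lftp -u "$FTP_USER,$FTP_PASSWORD" -e "mirror --reverse output/ timetables/; quit" "$FTP_HOST"
```

The schedule itself lives in the web interface, not in this file. A cron expression such as 0 5 * * 4 (05:00 every Thursday) is one illustrative choice; the right value depends on when the cinema chains publish their updates.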

What surprised me when working with GitLab CI/CD (a.k.a. friction log)

That I am free to choose the Docker image my job will run on. The “Shared Runners” section in the CI/CD Settings view suggests (by the tags under the runner name) that jobs on the free tier should be compatible with Ruby. Without setting the image field in the job definitions, my jobs indeed failed, because by default the runners pick up a generic, Ruby-centric image. Now I always try to find a suitable, “minimal for the job” image.
I expected that consecutive stages would automatically have access to the products of previously finished jobs. The fact is that every job runs in its own, freshly built, independent environment. To pass such “artifacts” between jobs, the appropriate config must be added explicitly. It’s logical, but wasn’t obvious at first.