If you ever ran a serious production database (is there any other kind, really? 🤔 ), you are most likely using some sort of database migration tool. Tools like Flyway, Liquibase or Knex for JS are pretty good at that and haven’t caused much trouble in our lives until we started treating our apps as cattle and run them in K8s, replicated, rescheduled every now and then, without us even knowing. They scale up and down automatically, sometimes even in the middle of the night while we sleep soundly in our beds. Then something interesting starts to happen….

Problem statement: Lock it & throw away the Pod

All of the mentioned tools above have a safety mechanism that introduces a temporary lock on a table while the schema validation or migration is taking place in order to ensure consistency. However, when we are talking in a Kubernetes context (pun intended), it could happen that the app takes longer to validate the schema or fails the healthcheck exactly in that moment when the lock is present on the database.

Which will lead to the Pod being deleted and replaced by another one that will start but as soon as it sees that there is a lock from a different Pod, it will fail to start crash the Pod. Welcome to CrashLoopBackOff.

Fixing the problem in less than 80 lines of Go

I am basing this solution to be specific for Liquibase but it can easily be adapted to other DB migration tools out there with a few tweaks. Our goal (and hence, our logic) is simple: We want to check the table for any locks and remove them. Hmmm, wait, but what if the lock is legit, we don’t want to accidentally remove a lock that we’re not supposed to, right? In that case, let’s set up a threshold and say that if our lock is older than X minutes, we remove it. Let’s dig into it!

Pseudocode (it’s still code, bro)

package main


func main() {
    // Connect to DB
    // Check table for locks
    // If locked
        // check time of lock
            // if lock time is higher than X minutes
                // remove lock
    // end
}

Less pseudo and more code

In order to connect to the database, we will need a driver. The official package go-sql-driver looks good for the job. Looking at the documentationm we can see that connecting to the database is pretty straighforward:

package main

import (
    "database/sql"
    "log"
    "time"

    _ "github.com/go-sql-driver/mysql"
)

func main() {
    // Connect to DB
    db, err := sql.Open("mysql", "username:pass@tcp(localhost:3306)/schema_name")
    if err != nil {
        log.Fatal(err)
    }
    db.SetConnMaxLifetime(time.Minute * 3)
    db.SetMaxOpenConns(1)
    db.SetMaxIdleConns(1)
    defer db.Close()

    // Check table for locks

    // If locked
        // check time of lock
            // if lock time is higher than X minutes
                // remove lock
    // end
}

At this point, we can check the table for locks by running a query. We only need 3 fields from the table. LOCKED - a boolean that tells us if the lock is present or not; LOCKGRANTED - timestamp of the lock; LOCKEDBY - to show us which Pod has locked the table, purely for traceability purposes. We will also need a struct to work with the data:

type lock struct {
    isLocked    bool
    lockGranted string
    lockedBy    string
}

// connect to database....
result := db.QueryRow("SELECT LOCKED, LOCKGRANTED, LOCKEDBY FROM DATABASECHANGELOGLOCK")
var l = lock{}
err := result.Scan(&.isLocked, &l.lockGranted, &l.lockedBy)
if err != nil {
    log.Fatal(err)
}
// ....

However, here I have ran into a few issues. First, my LOCKED field was of type tinyint so I couldn’t serialize it properly to a bool field resulting in error messages like ...sql/driver: couldn't convert "\x01" into type bool....

I have seen that the sql package has some pretty useful field wrappers like NullBool and NullString that could be used in case of empty results and similar. Unfortunatelly, it still didn’t work for me so I ended up using a small hack found in this Github issue which made me rewrite my query like so:

result := db.QueryRow("SELECT (LOCKED = b'1'), LOCKGRANTED, LOCKEDBY FROM DATABASECHANGELOGLOCK")

I did ended up, however, using the sql.NullString for my other two struct fields which are useful in case our query returns an empty result.

The end result

Now, this is how all the code looks together along with the logic for checking the lock and removing it:

package main

import (
    "database/sql"
    "log"
    "time"

    _ "github.com/go-sql-driver/mysql"
)

type lock struct {
    isLocked    bool
    lockGranted sql.NullString
    lockedBy    sql.NullString
}

func main() {
    // Connect to DB
    db, err := sql.Open("mysql", "username:pass@tcp(localhost:3306)/schema_name")
    if err != nil {
        log.Fatal(err)
    }
    db.SetConnMaxLifetime(time.Minute * 3)
    db.SetMaxOpenConns(1)
    db.SetMaxIdleConns(1)
    defer db.Close()

    // Check table for locks
    result := db.QueryRow("SELECT (LOCKED = b'1'), LOCKGRANTED, LOCKEDBY FROM DATABASECHANGELOGLOCK")
    var l = lock{}
    err := result.Scan(&.isLocked, &l.lockGranted, &l.lockedBy)
    if err != nil {
        log.Fatal(err)
    }

    // If locked
    if l.isLocked {
        // check time of lock
        lockTime, err := time.Parse("2006-01-02 15:04:05", l.lockGranted.String)
        if err != nil {
            log.Fatal(err)
        }
        now := time.Now()
        diff := now.Sub(lockTime)
        maxLockTime, err := time.ParseDuration("3m") // we are choosing 3 minutes as our max lock time
        if err != nil {
            log.Fatal(err)
        }
        // if lock time is higher than X minutes
        if diff > maxLockTime {
            lockedBy, _ := l.lockedBy.Value()
            log.PrintF("Lock from %v older than %v detected, will try to delete...", lockedBy, diff.Minutes())
            // remove lock
            res, err := db.Exec("DELETE FROM DATABASECHANGELOGLOCK")
            if err != nil {
                log.Fatal(err)
            }
            rows, err := res.RowsAffected()
            if err != nil {
                log.Fatal(err)
            }
            log.Print("Deleted locks: %v", rows)
        }
    }
    // end
}

STOP! Refactor time

Drake refuse

The code above works but it’s very ugly so we will proceed to refactor it. The following things are eye stingers:

  • Everything is crammed into the main function. We want to separate the code into smaller, easier to read functions.
  • Remove the comments.
  • Extract strings into constants.
  • Extract database connection details into environment variables since we don’t want to store them in plain text in the code.
  • Extract the maxLockTime into an environment variable because we want this value to be configurable easily.
  • Move error handling to a separate function since it’s making the code very unreadable.
  • Leave a comment about our SQL query hack for the tinyint so the next person (or ourselves in 6 months) knows why this is here.

After all of the above, here is how our code looks like now:

package main

import (
	"database/sql"
	"log"
	"os"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

type lock struct {
	isLocked      bool
	lockGranted sql.NullString
	lockedBy    sql.NullString
}

const (
	// (LOCKED = b'1') is done to convert tinyint to boolean. See: https://github.com/go-sql-driver/mysql/issues/440
	DB_GET_LOCK_QUERY    = "SELECT (LOCKED = b'1'), LOCKGRANTED, LOCKEDBY from DATABASECHANGELOGLOCK"
	DB_DELETE_LOCK_QUERY = "DELETE FROM DATABASECHANGELOGLOCK"
	DB_ENGINE            = "mysql"
	DB_PORT              = "3306"
	TIME_LAYOUT          = "2006-01-02 15:04:05"
)

var (
	maxLockTime = os.Getenv("MAX_LOCK_TIME")
	DBUser      = os.Getenv("DB_USER")
	DBPass      = os.Getenv("DB_PASS")
	DBURL       = os.Getenv("DB_URL")
	DBName      = os.Getenv("DB_NAME")
)

func main() {
	db := connectToDb()
	defer db.Close()
	result := db.QueryRow(DB_GET_LOCK_QUERY)
	var l = Lock{}
	e := result.Scan(&l.isLocked, &l.lockGranted, &l.lockedBy)
	handleError(e)
	if l.isLocked {
		handleDBLock(l, db)
	}
}

func handleDBLock(l lock, db *sql.DB) {
	lockTime, e := time.Parse(TIME_LAYOUT, l.lockGranted.String)
	handleError(e)
	now := time.Now()
	diff := nowTime.Sub(lockTime)
	maxLockTime, e := time.ParseDuration(maxLockTime)
	handleError(e)
	if diff > maxLockTime {
		lockedBy, _ := l.lockedBy.Value()
		log.Printf("Lock from %v older than %v detected, will try to delete...", lockedBy, diff.Minutes())
		res, e := db.Exec(DB_DELETE_LOCK_QUERY)
		handleError(e)
		rows, e := res.RowsAffected()
		handleError(e)
		log.Printf("Deleted locks: %v", rows)
	}
}

func connectToDb() *sql.DB {
	connstr := DBUser + ":" + DBPass + "@tcp(" + DBURL + ":" + DB_PORT + ")/" + DBName
	db, err := sql.Open(DB_ENGINE, connstr)
	handleError(err)
	db.SetConnMaxLifetime(time.Minute * 3)
	db.SetMaxOpenConns(1)
	db.SetMaxIdleConns(1)
	return db
}

func handleError(e error) {
	if e != nil {
		log.Fatal(e)
	}
}

Now that’s more like it….

Drake approves


Containerize and K8s…ize

We want to run this inside of a Kubernetes cluster as a CronJob. My other alternative was to run it as an AWS Lambda function but the issue was that the database credentials needed to be stored as plain text environment variables. I’ve done some reasearch and saw that we can use AWS SSM and store the credentials there but then we would have to write our code with serverless framework which feels like adding much unecessary bloat for such a simple task. So let’s put the code in a container and run it as a K8s CronJob.

For building the Docker image, we will use some pretty standard stuff - Docker multi-stage build and run the binary from a scratch image:

FROM golang:alpine AS builder
WORKDIR $GOPATH/src/app/

COPY main.go .
COPY go.mod .

RUN go get -d -v
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -ldflags='-w -s -extldflags "-static"' -a -o /go/bin/app .

FROM scratch
COPY --from=builder /go/bin/app /go/bin/app

ENTRYPOINT ["/go/bin/app"]

The resulting image is pretty small (4MB!) and and doesn’t have a lot of room for exploits. I am sure it can be made more secure by using an unpriviliged user but I had some issues trying to set that up and couldn’t be bothered since my time was limited for this task.

The last piece of the puzzle is the CronJob manifest as well as the Secret where we will store our database credentials, which results in something like this:

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: liquibase-lock-release
  namespace: example
spec:
  schedule: "*/3 * * * *"
  concurrencyPolicy: Replace
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: liquibase-lock-release
            image: my-secure-registry.org/liquibase-lock-release:v1
            imagePullPolicy: Always
            env:
              - name: MAX_LOCK_TIME
                value: "3m"
            envFrom:
              - secretRef:
                  name: liquibase-lock-release-secret
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: my-image-pull-secret
---
apiVersion: v1
kind: Secret
metadata:
  name: liquibase-lock-release-secret
  namespace: example
type: Opaque
data:
  DB_USER: <base64-encoded-db-user>
  DB_PASS: <base64-encoded-db-pass>
  DB_URL: <base64-encoded-db-url>
  DB_NAME: <base64-encoded-schema-name>

At this point, we only need to kubectl apply -f this baby and we are ready to profit!

Hope you have enjoyed this post. The code for this can be found in this GitHub repository!

Keep on hacking!