Updates on PHP performance increase

While waiting for PHP 7.4 production release and its strong type new feature, I was wondering how the performance increased. A while ago I wrote a small stress test which is not the most reliable in all cases, it just points out some major changes that started occurring in PHP 7.

Since that test, PHP 7.3 was released and now there is a release candidate for 7.4. While the memory usage is the same for my test case since 7.0, the execution time looks better with every new version.

#!/usr/bin/env bash

test=$(cat << 'eot'
$time = microtime(true);

$array = [];
for ($i = 0; $i < 10000; $i++) {
    if (!array_key_exists($i, $array)) {
        $array[$i] = [];
    }

    for ($j = 0; $j < 1000; $j++) {
        if (!array_key_exists($j, $array[$i])) {
            $array[$i][$j] = true;
        }
    }
}

echo sprintf(
    "Execution time: %f seconds\nMemory usage: %f MB\n\n",
    microtime(true) - $time,
    memory_get_usage(true) / 1042 / 1042
);
eot
)

versions=( 5.6 7.0 7.1 7.2 7.3 7.4-rc )

for v in "${versions[@]}"
do
    cmd="docker run --rm -ti php:${v}-cli-alpine php -d memory_limit=2048M -r '$test'"
    sh -c "echo ${v} && ${cmd}"
done

Results (ignore the absolute values, just watch the differences):

5.6
Execution time: 4.573347 seconds
Memory usage: 1332.009947 MB

7.0
Execution time: 1.464059 seconds
Memory usage: 347.669807 MB

7.1
Execution time: 1.315205 seconds
Memory usage: 347.669807 MB

7.2
Execution time: 0.653521 seconds
Memory usage: 347.669807 MB

7.3
Execution time: 0.614016 seconds
Memory usage: 347.669807 MB

7.4-rc
Execution time: 0.528052 seconds
Memory usage: 347.669807 MB

Merge JSON arrays by key in PostgreSQL

UPDATE:

While testing more cases, I’ve come up with a solution which better covers my needs. Instead of taking distinct values, I used a FULL JOIN between the two data sets:

SELECT json_agg(data) FROM (
	SELECT coalesce(new.id, old.id) AS id, coalesce(new.value, old.value) AS value
	FROM jsonb_to_recordset('[{"id": 1, "value": "A"}, {"id": 2, "value": "B"}, {"id": 3, "value": "C"}, {"id": 6, "value": "6"}]') AS old(id int, value text)
	FULL JOIN jsonb_to_recordset('[{"id": 1, "value": "NEW A"}, {"id": 2, "value": "NEW B"}, {"id": 5, "value": "5"}]') AS new(id int, value text)
	ON old.id = new.id
) data

Given a JSON column which holds an array of objects, each object having “id” and “value” properties, I wanted to update the array with a new one, by merging the values by the “id” property.
If same id is present in an object of both arrays, the new one will be used.

Old:

[
    {"id": 2, "value": "B"},
    {"id": 1, "value": "A"}
]

New:

[
    {"id": 1, "value": "X"},
    {"id": 3, "value": "C"},
    {"id": 4, "value": "C"}
]

Result:

[
    {"id":1, "value":"X"},
    {"id":2, "value":"B"},
    {"id":3, "value":"C"},
    {"id":4, "value":"C"}
]

And I wanted to do this in one database roundtrip, not take the data in back end, merge it, and send it back to the database.

First, I transformed the array objects into records and prioritized the new array by giving its values a position; and I put both sets together:

SELECT 1 AS position, * FROM json_to_recordset('[{"id": 2, "value": "B"}, {"id": 1, "value": "A"}]') AS src(id int, value text) -- old
UNION ALL
SELECT 0 AS position, * FROM json_to_recordset('[{"id": 1, "value": "X"}, {"id": 3, "value": "C"}, {"id": 4, "value": "C"}]') AS dst(id int, value text) -- new

Then I extracted only the id and the value, the position being used only to sort data:

SELECT id, value FROM (
	SELECT 1 AS position, * FROM json_to_recordset('[{"id": 2, "value": "B"}, {"id": 1, "value": "A"}]') AS src(id int, value text) -- old
	UNION ALL
	SELECT 0 AS position, * FROM json_to_recordset('[{"id": 1, "value": "X"}, {"id": 3, "value": "C"}, {"id": 4, "value": "C"}]') AS dst(id int, value text) -- new
) AS ordered ORDER BY position

Next step, I took the distinct rows by id:

SELECT DISTINCT ON (id) id, value FROM (
	SELECT id, value FROM (
		SELECT 1 AS position, * FROM json_to_recordset('[{"id": 2, "value": "B"}, {"id": 1, "value": "A"}]') AS src(id int, value text) -- old
		UNION ALL
		SELECT 0 AS position, * FROM json_to_recordset('[{"id": 1, "value": "X"}, {"id": 3, "value": "C"}, {"id": 4, "value": "C"}]') AS dst(id int, value text) -- new
	) AS ordered ORDER BY position
) data

And finally, the result needs to be converted to JSON:

SELECT json_agg(uniqueid) FROM (
	SELECT DISTINCT ON (id) id, value FROM (
		SELECT id, value FROM (
			SELECT 1 AS position, * FROM json_to_recordset('[{"id": 2, "value": "B"}, {"id": 1, "value": "A"}]') AS src(id int, value text) -- old
			UNION ALL
			SELECT 0 AS position, * FROM json_to_recordset('[{"id": 1, "value": "X"}, {"id": 3, "value": "C"}, {"id": 4, "value": "C"}]') AS dst(id int, value text) -- new
		) AS ordered ORDER BY position
	) data
) uniqueid

PostgreSQL: cannot call json_to_recordset on a scalar

I have a nullable JSON column in a PostgreSQL table. Data stored in the column is an array with key/value objects which have to be converted into records at some point. It’s a job for json_to_recordset, which takes a JSON array as input and returns the records.

create table people (
	properties json
);

insert into people values (null), ('[{"key": "a"}, {"key": "b"}]');

select * from people;

select key
from
	people,
	json_to_recordset(properties) as properties(key text);

But JSON also allows ‘null’ as value. And I don’t mean it allows null values because it’s a nullable column, it actually allows the string ‘null’, which is recognized as a null JSON value.

insert into people values ('null');

The value of the properties column in the new row is not null, so now json_to_recordset will complain about being passed a scalar.

Be careful when updating a JSON column, make sure you really set it to null if you want no value.

Abstract dependencies

Projects depend on packages, internal ones or from 3rd parties. In all cases, your architecture should dictate what it needs and not evolve around a package, otherwise changing the dependency with another package and mocking it in tests could take unnecessary effort and time.

While it’s very easy to include a package in any file you need and start using it, time will show it’s painful to spread dependencies all over. Someday you’ll want to change a package because it’s not maintained anymore or you discover security issues which lead to loss of trust, or maybe you just want to experiment with another package.

This is where abstraction comes in, helping to decouple an implementation in your project, to set the rules which a dependency must follow. Each package has its own API, thus you need to wrap it by your API.

To show an example, I’ve chosen to integrate a validation package into Echo framework. The validator package requires you to tag your structs with its validation rules. Continue reading Abstract dependencies

Error handling in Echo framework

If misused, error handling and logging in Go can become too verbose and tangled. If the two are decoupled, you can just pass the error forward (maybe add some context to it if necessary, or have specific error structs) and have it logged or sent to various systems (like New Relic) somewhere else in the code, not where you receive it from a function call.

One of the features I appreciate the most in Echo framework is the HTTPErrorHandler which can be customized to any needs. And combined with the recover middleware, error management becomes an easy task.

package main

import (
   "errors"
   "fmt"
   "net/http"

   "github.com/labstack/echo/v4"
   "github.com/labstack/echo/v4/middleware"
   "github.com/labstack/gommon/log"
)

func main() {
   server := echo.New()
   server.Use(
      middleware.Recover(),   // Recover from all panics to always have your server up
      middleware.Logger(),    // Log everything to stdout
      middleware.RequestID(), // Generate a request id on the HTTP response headers for identification
   )
   server.Debug = false
   server.HideBanner = true
   server.HTTPErrorHandler = func(err error, c echo.Context) {
      // Take required information from error and context and send it to a service like New Relic
      fmt.Println(c.Path(), c.QueryParams(), err.Error())

      // Call the default handler to return the HTTP response
      server.DefaultHTTPErrorHandler(err, c)
   }

   server.GET("/users", func(c echo.Context) error {
      users, err := dbGetUsers()
      if err != nil {
         return err
      }

      return c.JSON(http.StatusOK, users)
   })

   server.GET("/posts", func(c echo.Context) error {
      posts, err := dbGetPosts()
      if err != nil {
         return err
      }

      return c.JSON(http.StatusOK, posts)
   })

   log.Fatal(server.Start(":8088"))
}

func dbGetUsers() ([]string, error) {
   return nil, errors.New("database error")
}

func dbGetPosts() ([]string, error) {
   panic("an unhandled error occurred")
   return nil, nil
}

Avoid logging in functions:

server.GET("/users", func(c echo.Context) error {
   users, err := dbGetUsers()
   if err != nil {
      log.Errorf("cannot get users: %s", err)
      return err
   }

   return c.JSON(http.StatusOK, users)
})

Keep the handler functions clean, check your errors and pass them forward.

Automated Jenkins CI setup for Go projects

I was helping a friend on a project with some tasks amongst which code quality. I’ve set up some tools (gometalinter first, then golangci-lint) and integrated them with Travis CI.

Then GitHub decided to give free private repositories and my friend made the project private. Travis and other CI systems require paid plans for private repositories so I decided to set up a Jenkins environment. I’ve used Jenkins before, but never configured it myself. I thought it’s going to be a quick click-click install process, but I’ve found myself in front of various plugins, configurations, credentials and all sort of requirements.

One of the first things that come in my mind when I have to do something is if I’ll have to do it again later and if I should automate the process. And that’s how a pet project was born. My purpose was to write a configuration file, install the environment, then configure jobs for the repositories I need.

I’ve created two job templates, one for triggering a build on branch push, one for pull requests (required webhooks are created automatically). Based on the templates, I just wanted to create a new job, insert the GitHub repository url and start using the job.

The job templates are aiming at Go projects to run tests and code quality tools on (all required tools are installed automatically). Other configured actions are automatic backups and cleanups. And you’ll also find some setup scripts for basic server security and requirements.

There are things which could be improved, but now I have a click-click CI system setup called Go Jenkins CI.

 

 

Protect Docker secrets files

I have a Docker container managed with Docker compose, which defines the unless-stopped restart policy. But my container never starts after I reboot the machine. But it does restart if I just restart the Docker service. I have a similar setup on another machine which I have no issues with.

I kept on searching the issue until it hit me all of a sudden. I’m using secrets read from files. The files are in the /tmp  directory which gets cleared, thus the container fails to start.

When I defined the secrets files, I just threw them away from the source code repository, without thinking too much where. I wasn’t even sure if I was going to use them for a long time.

Dev testing

A user story is not done after the code has been written. Also, the developer’s part is not done after the code has been written. I strongly believe in dev testing: after implementing, the developer will once again read each line of the requirements (they had already done this at least once before implementing, right?!?!) and manually test the feature (even if they’ve written automated tests). At least a sanity test to make sure the QA colleagues will not get back to them after a few minutes of testing or every few minutes with another issue.

The interaction with the QA colleagues should be more in breaks. I want to see those people get bored like hell because they just need to check the implementation against the requirements and get back to chatting because everything is OK. If they get back with special situations, corner cases, weird exceptions, everyone’s doing a great job.

We’re human, we forget things while developing, it’s OK. But if the number of things is too big to fit in memory and it increases with every feature, maybe we should read the requirements a few more times and test each one of them. And we should ask questions. The code is done after we understand, implement, test.

And always ask ourselves what happens to the other parts of the app if we do just a small change in the old code. Then test. Yes, we, the developers, must test!

The mid-language crisis

A while after I wrote the Apixu libraries for Go and PHP, they asked me if I could help them with some issue a client had when trying to install the Python library. I quickly tested and sent them steps on how to install all requirements. And I noticed the library could use some additions.

My experience with Python was of a few lines and I felt like it would be a good context to write a few more, to understand the language a little better, to learn about its requirements, ways to set up a library and write some tests to validate JSON schemas. A clean language with a simple package manager and clear error handling. Basic integrations with Flask and Django were pretty smooth.

One thing I like about Apixu is they have multiple libraries and after freshening the Python one I was inspired to do more. And there was where to choose from.

JavaScript is an old friend but we met only in the browser some years ago. It was time to face its server side of the moon and add some touches to the NodeJS library. I’m not a huge fan of callbacks, but Promises are indeed a nice way to handle responses. Express was easy to start with, I like its micro framework feeling. The package manager, npm, doesn’t seem too far from PHP’s composer, we got along well. Continue reading The mid-language crisis

Encrypt column in PostgreSQL

PostgreSQL has a nice encryption (and hashing) module called pgcrypto which is easy to use. Using a provided key, you can quickly encrypt a column which contains sensitive information.

While the setup is fast and the usage is simple, there could be some disadvantages in some contexts:

  • be careful how you send the encryption key to the database server (if public, use SSL for transport, else keep it in a private network at least)
  • data is encrypted/decrypted in the database, so the transport is in plain (watch out for memory dump attacks)
  • some queries are very slow, as the decrypt operation is performed on the entire table if you want to sort or filter by encrypted columns

Continue reading Encrypt column in PostgreSQL