String processing: Mask an email address

While studying STD algorithms in C++, one simple exercise I did was masking an email address: turning johndoe@emailprovider.tld into j*****e@emailprovider.tld, while also handling edge cases such as very short addresses and strings that are not email addresses at all. (One could impose a precondition that the input must be a valid email address in order to get a valid output, but for this exercise I wanted some edge cases.)

To know what kinds of inputs I’m dealing with and what the corresponding valid outputs should be, I’ll start with the test data:

const std::map<std::string, std::string> tests{
        {"johndoe@emailprovider.tld", "j*****e@emailprovider.tld"},
        {"jde@emailprovider.tld",     "j*e@emailprovider.tld"},
        {"jd@emailprovider.tld",      "**@emailprovider.tld"},
        {"j@emailprovider.tld",       "*@emailprovider.tld"},
        {"@emailprovider.tld",        "@emailprovider.tld"},
        {"wrong",                     "w***g"},
        {"wro",                       "w*o"},
        {"wr",                        "**"},
        {"w",                         "*"},
        {"",                          ""},
        {"@",                         "@"},
};

Besides solving the task itself, I was also curious about one aspect: what would the differences be between an implementation using no STD algorithms and one using various STD algorithms? I looked at how the code reads and how it performs.

The first approach was the classic one: a single pass over the input string, during which each character is either copied to the output as is or replaced with the mask character. When the @ character is reached, masking stops and, if more than two characters precede it, the first and last of them are restored. After the loop, if no @ was found, the same restoration is applied to the whole string.

std::string mask(const std::string &email, const char mask) {
    if (email[0] == '@') {
        return email;
    }

    std::string masked;
    masked.reserve(email.size());

    bool hide = true;
    bool is_email = false;

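    for (size_t i = 0; i < email.size(); ++i) {
        if (email[i] == '@') {
            // From here on, characters are copied unchanged.
            is_email = true;
            hide = false;

            // More than two characters before '@': restore the first and last of them.
            if (i > 2) {
                masked[0] = email[0];
                masked[i - 1] = email[i - 1];
            }
        }

        masked += hide ? mask : email[i];
    }

    // No '@' at all: treat the whole string like the part before '@'.
    if (!is_email && masked.size() > 2) {
        masked[0] = email[0];
        masked[masked.size() - 1] = email[masked.size() - 1];
    }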

    return masked;
}

At first glance, it might not be obvious how all the requirements are implemented, but it becomes clear quickly: the approach is imperative, so every step is spelled out. Some parts could perhaps be extracted into functions to better explain what’s happening. The code is all there and it’s simple; you just have to read through it. But the function is a little long.
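As a rough sketch of what extracting parts into functions could look like, the work can be split into two steps: find how long the part to mask is, then mask it. The names maskable_length, mask_prefix and mask_in_steps are hypothetical, chosen only for this illustration, and this is not one of the versions measured later.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length of the part before the first '@',
// or the whole string when there is no '@'.
std::size_t maskable_length(const std::string &email) {
    const auto at = email.find('@');
    return at == std::string::npos ? email.size() : at;
}

// Hypothetical helper: mask the first `length` characters of `text`,
// keeping the first and last of them visible when there are more than two.
std::string mask_prefix(std::string text, std::size_t length, char mask) {
    const std::size_t keep = length > 2 ? 1 : 0;
    for (std::size_t i = keep; i + keep < length; ++i) {
        text[i] = mask;
    }
    return text;
}

// Hypothetical entry point combining the two steps.
std::string mask_in_steps(const std::string &email, char mask = '*') {
    return mask_prefix(email, maskable_length(email), mask);
}
```

For instance, mask_in_steps("jde@emailprovider.tld") gives j*e@emailprovider.tld, in line with the test data. The trade-off is more names to read in exchange for shorter bodies.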

The STD algorithms approach gives another perspective.

std::string mask_with_find_and_transform(const std::string &email, const char mask = '*') {
    std::string masked = email;

    auto p = std::find(masked.begin(), masked.end(), '@');
    auto offset = (p - masked.begin() > 2) ? 1 : 0;
    auto begin = masked.begin() + offset;
    auto end = p - offset;

    std::transform(begin, end, begin, [mask](const char &) { return mask; });

    return masked;
}

The first thing that stands out is the length of the function, reduced to about half of the previous one. Then, the algorithm names find and transform give you some hints about what’s happening. This version operates on a copy of the input string: it finds the range of characters to be masked and then transforms that range by replacing each character with the mask character. The range limits are the only non-obvious part. An offset of 1 is used when more than two characters precede the @ (or the end of the string, if there is no @), so that the first and last of them stay visible; otherwise the offset is 0 and the whole range is masked.
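Since the lambda passed to transform ignores its argument, the same range could also be overwritten directly. Here is a minimal sketch of that idea using std::fill; the name mask_with_find_and_fill is hypothetical, and this variant is not among the versions timed in this post.

```cpp
#include <algorithm>
#include <string>

// Hypothetical variant: same range computation as the find/transform
// version, but the range is overwritten with std::fill.
std::string mask_with_find_and_fill(const std::string &email, const char mask = '*') {
    std::string masked = email;

    // p points at the first '@', or at end() when there is none.
    const auto p = std::find(masked.begin(), masked.end(), '@');

    // With more than two characters before p, skip one character at each
    // end of the range; otherwise mask the whole range.
    const auto offset = (p - masked.begin() > 2) ? 1 : 0;

    std::fill(masked.begin() + offset, p - offset, mask);

    return masked;
}
```

With "wrong" as input there is no @, so p is the end iterator, the offset is 1, and only the three middle characters are overwritten.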

I don’t consider the performance differences between the two versions relevant for tasks like this one, but time-critical applications may, in some contexts, see differences between these kinds of approaches. Out of curiosity, I wrote two more versions of this task (not the cleanest, but I just wanted to see other ways to write it), measured their speed multiple times, and computed some stats (min, max, average, median). I don’t intend to favor one version over another because of the timing results; it was more of an exploratory approach. I also tried to see what happens if a function returns early or not when the input cannot be processed (see the comments in the first function below). And I used different compilers (GCC, Clang) and standards (C++11, C++14, C++17).

A run of the tests looks like this:

Execution time (nanoseconds)

min:  67   max: 2283  avg: 179.78       med: 124      mask (2283, 152, 85, 84, 81, 81, 81...)

min: 110   max: 1160  avg: 163.91       med: 156      mask_with_find_and_transform (1160, 137, 119, 118, 115, 115, 118...)

min: 110   max: 364   avg: 158.68       med: 152      mask_with_find_and_replace_if (215, 114, 115, 113, 112, 116, 113...)

min:  67   max: 2139  avg: 102.82       med:  98      mask_with_find_first_of_and_replace (398, 90, 69, 70, 69, 71, 71...)
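The four statistics can be computed from a list of timings roughly as sketched here. The names stats and summarize are hypothetical, and the median of an even-sized sample is taken as the mean of the two middle values, which is one common convention and not necessarily the one that produced the numbers above.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical result type for the summary.
struct stats {
    long min;
    long max;
    double avg;
    double med;
};

// Summarize a non-empty list of timings (taken by value so it can be sorted).
stats summarize(std::vector<long> data) {
    std::sort(data.begin(), data.end());

    const std::size_t n = data.size();
    const double sum = std::accumulate(data.cbegin(), data.cend(), 0.0);

    // Odd size: the middle element. Even size: mean of the two middle elements.
    const double med = (n % 2 == 1)
        ? static_cast<double>(data[n / 2])
        : (data[n / 2 - 1] + data[n / 2]) / 2.0;

    // After sorting, the minimum and maximum are the first and last elements.
    return {data.front(), data.back(), sum / n, med};
}
```

The median is less sensitive than the average to outliers such as a slow first call, which is why it is worth reporting both.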

And the full source code:

/**
 * Mask an email address: johndoe@emailprovider.tld -> j*****e@emailprovider.tld
 */

#include <string>
#include <algorithm>
#include <map>
#include <chrono>
#include <iostream>
#include <functional>
#include <numeric>
#include <iomanip>
#include <vector>
#include <stdexcept>

using func = std::function<std::string(const std::string, const char)>;

// no STL algorithms (verbose)
std::string mask(const std::string &email, const char mask) {
//    if (email.empty() || email == "@" || email[0] == '@') {
//        return email;
//    }
//    vs
    if (email[0] == '@') {
        return email;
    }
//    vs
//    no if statement

    std::string masked;
    masked.reserve(email.size());

    bool hide = true;
    bool is_email = false;

    for (size_t i = 0; i < email.size(); ++i) {
        if (email[i] == '@') {
            is_email = true;
            hide = false;

            if (i > 2) {
                masked[0] = email[0];
                masked[i - 1] = email[i - 1];
            }
        }

        masked += hide ? mask : email[i];
    }

    if (!is_email && masked.size() > 2) {
        masked[0] = email[0];
        masked[masked.size() - 1] = email[masked.size() - 1];
    }

    return masked;
}

// STL find/transform (compact)
std::string mask_with_find_and_transform(const std::string &email, const char mask = '*') {
    std::string masked = email;

    auto p = std::find(masked.begin(), masked.end(), '@');
    auto offset = (p - masked.begin() > 2) ? 1 : 0;
    auto begin = masked.begin() + offset;
    auto end = p - offset;

    std::transform(begin, end, begin, [mask](const char &) { return mask; });

    return masked;
}

// STL find/replace_if (the predicate feels like a hack)
std::string mask_with_find_and_replace_if(const std::string &email, const char mask = '*') {
    std::string masked = email;

    auto p = std::find(masked.begin(), masked.end(), '@');
    auto offset = (p - masked.begin() > 2) ? 1 : 0;
    auto begin = masked.begin() + offset;
    auto end = p - offset;

    std::replace_if(begin, end, [](const char &) { return true; }, mask);

    return masked;
}

// STL strings find_first_of/replace
std::string mask_with_find_first_of_and_replace(const std::string &email, const char mask = '*') {
    std::string masked = email;
    if (masked.empty()) {
        return masked;
    }

    auto pos = masked.find_first_of('@');
    auto start = 1;
    auto n = pos - 2;

    if (pos < 3) {
        start = 0;
        n = pos;
    }

    if (pos == std::string::npos) {
        start = 1;
        n = masked.size() - 2;

        if (n < 1 || n == std::string::npos) {
            start = 0;
            n = masked.size();
        }
    }

    masked.replace(start, n, n, mask);

    return masked;
}

long call(const func &f, const std::string &input, const std::string &expected) {
    auto start = std::chrono::high_resolution_clock::now();

    auto result = f(input, '*');

    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::high_resolution_clock::now() - start
    );

    if (result != expected) {
        // Build the message as a std::string to avoid overflowing a fixed-size buffer.
        throw std::runtime_error("actual: \"" + result + "\"; expected: \"" + expected + "\"");
    }

    return duration.count();
}

enum class measure {
    // show execution time for each function
    function,

    // show execution time for each function and input
    input,
};

int main() {
    const auto measure_type = measure::function;

    const std::map<std::string, std::string> tests{
            {"johndoe@emailprovider.tld", "j*****e@emailprovider.tld"},
            {"jde@emailprovider.tld",     "j*e@emailprovider.tld"},
            {"jd@emailprovider.tld",      "**@emailprovider.tld"},
            {"j@emailprovider.tld",       "*@emailprovider.tld"},
            {"@emailprovider.tld",        "@emailprovider.tld"},
            {"wrong",                     "w***g"},
            {"wro",                       "w*o"},
            {"wr",                        "**"},
            {"w",                         "*"},
            {"",                          ""},
            {"@",                         "@"},
    };

    std::map<std::string, std::vector<long>> results;
    for (const auto &test : tests) {
        for (int i = 1; i <= 100; ++i) {
            if (measure_type == measure::function) {
                results["mask"].push_back(call(&mask, test.first, test.second));
                results["mask_with_find_and_transform"].push_back(call(&mask_with_find_and_transform, test.first, test.second));
                results["mask_with_find_and_replace_if"].push_back(call(&mask_with_find_and_replace_if, test.first, test.second));
                results["mask_with_find_first_of_and_replace"].push_back(call(&mask_with_find_first_of_and_replace, test.first, test.second));
            } else {
                results["mask(" + test.first + ")"].push_back(call(&mask, test.first, test.second));
                results["mask_with_find_and_transform(" + test.first + ")"].push_back(call(&mask_with_find_and_transform, test.first, test.second));
                results["mask_with_find_and_replace_if(" + test.first + ")"].push_back(call(&mask_with_find_and_replace_if, test.first, test.second));
                results["mask_with_find_first_of_and_replace(" + test.first + ")"].push_back(call(&mask_with_find_first_of_and_replace, test.first, test.second));
            }
        }
    }

    std::cout << "\nExecution time (nanoseconds)\n\n";

    for (const auto &result : results) {
        auto data = result.second;

        auto all = std::accumulate(data.cbegin(), data.cend(),
                                   std::string{},
                                   [](const std::string &a, long v) { return a + std::to_string(v) + ", "; });

        std::sort(data.begin(), data.end());

        auto max = *std::max_element(data.cbegin(), data.cend());
        auto min = *std::min_element(data.cbegin(), data.cend());
        auto avg = std::accumulate(data.cbegin(), data.cend(), 0.0) / data.size();
        // Middle element of the sorted data (the upper median for even sizes).
        auto med = data[data.size() / 2];

        std::cout << "min: " << std::setw(3) << min
                  << "\tmax: " << std::setw(3) << max
                  << "\tavg: " << std::setw(3) << std::fixed << std::setprecision(2) << avg
                  << "\t\tmed: " << std::setw(3) << med
                  << "\t\t"
                  << result.first
                  << " (" + all.substr(0, all.size() - 2) + ")"
                  << "\n\n";
    }
}
