Apache Spark MapReduce with PHP

Aug 19, 2019 by Thibault Debatty



When it comes to Big Data processing, I'm a huge fan of the Apache Spark project. Spark is a very powerful tool for analysing very large datasets in parallel, and at the same time it provides a nice API that lets you write clean distributed code.

JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);

Sadly enough, there is currently no PHP driver to run PHP code on a Spark cluster.

However, to process smaller datasets, you can use php-spark. This library is a wrapper around (local) arrays that implements the same methods and syntax as Apache Spark. If you are used to Spark, this library makes writing PHP MapReduce code a breeze!

$data = new Dataset([1, 2, 3, 4]);
$result = $data
    ->map(function ($v) {
        return 2 * $v;
    })
    ->reduce(function ($v, $agg) {
        return $agg + $v;
    });

// $result == 20
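
For comparison, here is a minimal sketch of the same computation in plain PHP, without php-spark, using only the built-in array_map and array_reduce functions. It is only meant to show what the map and reduce steps compute. Note that array_reduce passes the accumulator as the first callback argument, which is the opposite of the order used in the php-spark snippet above.

```php
<?php
// Plain-PHP equivalent of the map/reduce pipeline, without php-spark.
$data = [1, 2, 3, 4];

// map: double every value
$doubled = array_map(function ($v) {
    return 2 * $v;
}, $data);

// reduce: sum the doubled values (accumulator first, then current item)
$total = array_reduce($doubled, function ($agg, $v) {
    return $agg + $v;
}, 0);

echo $total, PHP_EOL; // 20, i.e. 2 + 4 + 6 + 8
```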


The easiest way to install php-spark is with Composer:

composer require cylab/php-spark


The main component of the library is the Dataset. To create a dataset, simply pass an array of data to the constructor:

use Cylab\Spark\Dataset;

$d = new Dataset([1, 2, 3, 4]);

A dataset is immutable, like a Resilient Distributed Dataset in Spark. Each operation creates a new Dataset:

$d2 = $d->map(function ($v) { return 2 * $v; });
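
To make the immutability idea concrete, here is a minimal sketch of how such a wrapper could behave. The MiniDataset class and its collect() method are hypothetical names made up for this illustration; this is not the php-spark implementation, only a toy class showing that map() returns a new object and leaves the original untouched.

```php
<?php
// Hypothetical MiniDataset: illustrates the immutability idea only,
// it is NOT the php-spark Dataset class.
final class MiniDataset
{
    private $items;

    public function __construct(array $items)
    {
        $this->items = array_values($items);
    }

    // Returns a NEW MiniDataset; $this->items is never modified.
    public function map(callable $f): MiniDataset
    {
        return new MiniDataset(array_map($f, $this->items));
    }

    // Hypothetical accessor that returns the wrapped values as a plain array.
    public function collect(): array
    {
        return $this->items;
    }
}

$d = new MiniDataset([1, 2, 3, 4]);
$d2 = $d->map(function ($v) { return 2 * $v; });

print_r($d->collect());  // still 1, 2, 3, 4
print_r($d2->collect()); // 2, 4, 6, 8
```

Because no method ever changes the wrapped array, intermediate results can be kept and reused safely, which is the property RDDs rely on in Spark.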

As in Spark, some methods, such as reduceByKey(func), expect a dataset containing <key, value> tuples:

use Cylab\Spark\Dataset;
use Cylab\Spark\Tuple;

$strings = ["foe", "bar", "foe"];
$d = new Dataset($strings);
$d2 = $d->map(function ($s) { return new Tuple($s, 1); });

$counts = $d2->reduceByKey(function ($count, $sum) {
    return $sum + $count;
});

# Tuple<"foe", 2>
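
The grouping that a reduce-by-key operation performs can be sketched in plain PHP as follows. The reduce_by_key function is a hypothetical helper written for this illustration, not part of php-spark, and it uses plain [key, value] arrays instead of Tuple objects. It only assumes the usual semantics: values that share a key are folded together with the supplied function.

```php
<?php
// Hypothetical helper, NOT part of php-spark: folds the values of
// [key, value] pairs that share the same key using $f.
function reduce_by_key(array $pairs, callable $f): array
{
    $result = [];
    foreach ($pairs as [$key, $value]) {
        $result[$key] = array_key_exists($key, $result)
            ? $f($value, $result[$key])  // (new value, running aggregate)
            : $value;                    // first value seen for this key
    }
    return $result;
}

$strings = ["foe", "bar", "foe"];
$pairs = array_map(function ($s) { return [$s, 1]; }, $strings);

$counts = reduce_by_key($pairs, function ($count, $sum) {
    return $sum + $count;
});

print_r($counts); // foe => 2, bar => 1
```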

The complete list of methods and documentation is available on the website of the project: https://gitlab.cylab.be/tibo/php-spark

Interactive shell

I really like to use php-spark with an interactive PHP shell such as psysh.

To install psysh:

composer global require psy/psysh:@stable

Install php-spark globally as well:

composer global require cylab/php-spark

Rebuild the global autoloader of composer:

composer global dump-autoload

You can now use php-spark interactively from the psysh prompt.

This blog post is licensed under CC BY-SA 4.0
