MARk: using the built-in file data source

Jul 15, 2021 by Thibault Debatty


The Multi-Agent Ranking framework (MARk) lets you quickly build ranking and detection systems by combining building blocks. In this blog post, we show how to use the file data source to inject data into the system.


The easiest way to run MARk is with docker-compose. First create a dedicated directory (you can call it mark), and inside it create a subdirectory called modules:

mkdir mark
cd mark
mkdir modules

Now, in the mark directory, create a file called docker-compose.yml with the following content:

version: '2.0'
services:
  web:  # the service name is arbitrary
    image: cylab/mark-web:1.4.5
    container_name: mark-web
    environment:
      - MARK_HOST=mark-server
      - MARK_PORT=8080
      - APP_URL=""
    ports:
      - "8000:80"
    depends_on:
      - mark
  mark:
    image: cylab/mark:2.5.3
    container_name: mark-server
    volumes:
      - ./modules:/mark/modules
    environment:
      - MARK_MONGO_HOST=mark-mongo
    ports:
      - "8080:8080"
    depends_on:
      - mongo
  mongo:
    image: mongo:4.4
    container_name: mark-mongo

You can now start the server with:

docker-compose up

For now the server is empty as there is no data flowing in, and no detector configured...

You can stop the server with Ctrl + C.

The built-in file data source

MARk supports the concept of built-in data sources. These data sources are started automatically when the server starts. From a technical point of view, a built-in data source is a Java class that implements the DataAgentInterface.

The current version of MARk (2.5.3) has one built-in data agent: the FileSource. The FileSource is a generic data agent that reads a file and parses it line by line using a named regular expression. So it can be used for any kind of object to be ranked. This data source takes 2 mandatory parameters:

  • file : the file to read
  • regex : the named regex to use

To illustrate how this data source works, we will build an example, where:

  • we use the logs of a proxy server (stored in a text file)
  • the subjects of interest (that we want to rank) are the internal computers, identified by their IP address.

You can download an example log file, and save it to the modules directory, with the following command:

wget -O modules/1000_http_requests.txt

This file contains lines like these:

1472083251.488    575 TCP_MISS/200 1411 GET - DIRECT/ text/html
1472083251.573    920 TCP_MISS/200 765 GET - DIRECT/ text/html
1472083251.613    444 TCP_MISS/200 755 GET - DIRECT/ text/html
1472083251.658    590 TCP_MISS/200 1083 GET - DIRECT/ text/html
1472083251.724    683 TCP_MISS/200 1419 GET - DIRECT/ text/html
1472083251.862    442 TCP_MISS/200 1960 GET - DIRECT/ text/html
1472083251.938    276 TCP_MISS/200 111 GET - DIRECT/ text/html
1472083252.040    888 TCP_MISS/200 1055 GET - DIRECT/ text/html
1472083252.155    209 TCP_MISS/200 713 GET - DIRECT/ text/html
1472083252.163    106 TCP_MISS/200 680 GET - DIRECT/ text/html

Now we must build a named regex (a regex where capture groups can receive a name), to extract from each line:

  • a timestamp (in seconds);
  • the different components of the subject (in our case there is only one: the IP of the client).

You can use a site like to help you build the regex:

Now you can create the appropriate configuration file for the file data agent, that you should call and place in the modules directory. Note that each \ character in the regex must be escaped as \\:

label: data.proxy
  file: 1000_http_requests.txt
  regex: "^(?<timestamp>\\d+\\.\\d+)\\s+\\d+\\s(?<client>\\d+\\.\\d+\\.\\d+\\.\\d+)"
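
Before starting the server, it can help to check the regex against a sample line. The sketch below is written in Python rather than Java, so it makes two conversions that are assumptions of this example: it undoes the \\ escaping used in the configuration file, and it rewrites the Java-style named groups (?<name>...) into Python's (?P<name>...) form. The sample line and its client address 198.51.100.7 are hypothetical, as are the helper names to_python_regex and parse_line. This is only a way to test the expression, not a description of how the FileSource is implemented.

```python
import re

# The regex exactly as typed in the configuration file (backslashes doubled).
CONFIG_REGEX = r"^(?<timestamp>\\d+\\.\\d+)\\s+\\d+\\s(?<client>\\d+\\.\\d+\\.\\d+\\.\\d+)"

# Hypothetical log line: the client address is made up for this example.
SAMPLE_LINE = "1472083251.488    575 198.51.100.7 TCP_MISS/200 1411 GET - DIRECT/ text/html"


def to_python_regex(config_regex):
    """Undo the config-file escaping and convert named groups to Python syntax."""
    unescaped = config_regex.replace("\\\\", "\\")  # a doubled backslash means one
    # Java writes (?<name>...), Python's re module expects (?P<name>...).
    return re.sub(r"\(\?<([A-Za-z][A-Za-z0-9]*)>", r"(?P<\1>", unescaped)


def parse_line(pattern, line):
    """Return (timestamp, subject) for a matching line, or None otherwise."""
    match = pattern.match(line)
    if match is None:
        return None
    groups = match.groupdict()
    timestamp = float(groups.pop("timestamp"))
    return timestamp, groups  # the remaining named groups form the subject


PATTERN = re.compile(to_python_regex(CONFIG_REGEX))

if __name__ == "__main__":
    print(parse_line(PATTERN, SAMPLE_LINE))
```

Treating every named group other than timestamp as a component of the subject mirrors the description above, where the only component is the client IP. A line that does not match returns None, which makes it easy to spot a regex that is too strict.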


Now you can start the MARk server:

docker-compose up

After a few seconds, the server will be up and running, and the data agent will start reading and parsing the file.

The web interface will be available on port 8000 of the Docker host (see the "8000:80" port mapping in docker-compose.yml), with the following default credentials:

  • E-mail:
  • Password: change-me!

At the bottom of the Status page, you can see that the configured data source is properly listed. You can also see that the database ingested 1000 data records. Finally, by clicking on "Inspect", we can see that the data records have been correctly parsed.

Controlling speed

However, all records seem to have the same timestamp. Indeed, by default the FileSource tries to read the file as fast as possible. To respect the time interval between lines, you can use the speed configuration parameter:

label: data.proxy
  file: 1000_http_requests.txt
  regex: "^(?<timestamp>\\d+\\.\\d+)\\s+\\d+\\s(?<client>\\d+\\.\\d+\\.\\d+\\.\\d+)"
  speed: "1"

This time the data records will be processed according to their timestamp. In the example data file, the lines span 8 seconds. If needed, you can slow processing down further by setting a smaller speed value, such as 0.1.
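
To make the effect of the speed parameter concrete, here is a minimal Python sketch of one plausible interpretation: the pause between two records is the difference between their timestamps divided by speed. This matches the behaviour described above (1 keeps the original intervals, a smaller value slows things down), but it is an assumption and is not taken from the FileSource code. The function name replay_delays and the timestamps are hypothetical.

```python
def replay_delays(timestamps, speed=1.0):
    """Return the pause, in seconds, to observe before each record after the first."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    # Assumed model: the original interval is divided by the speed factor.
    return [(current - previous) / speed
            for previous, current in zip(timestamps, timestamps[1:])]


if __name__ == "__main__":
    stamps = [0.0, 2.0, 8.0]  # hypothetical timestamps, in seconds
    for speed in (1.0, 0.5):
        print(speed, replay_delays(stamps, speed))
```

Under this model, a file whose lines span 8 seconds would take about 8 seconds to replay with a speed of 1, and about 80 seconds with a speed of 0.1.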

Going further

Now that data is flowing into your server, you can add and configure built-in detectors or create your own detectors to build your detection pipeline.