Jul 15, 2021 by Thibault Debatty
The Multi-Agent Ranking framework (MARk) lets you quickly build ranking and detection systems by combining building blocks. In this blog post, we show how to use the file data source to inject data into the system...
The easiest way to run MARk is with docker-compose. First create a dedicated directory, which you can call mark, and inside it create a subdirectory called modules:
mkdir mark
cd mark
mkdir modules
Now, in the mark directory, create a file called docker-compose.yml with the following content:
version: '2.0'
services:
  mark-web:
    image: cylab/mark-web:1.4.5
    container_name: mark-web
    environment:
      - MARK_HOST=mark-server
      - MARK_PORT=8080
      - APP_URL="http://127.0.0.1:8000"
    ports:
      - "8000:80"
    depends_on:
      - mark
  mark:
    image: cylab/mark:2.5.3
    container_name: mark-server
    volumes:
      - ./modules:/mark/modules
    environment:
      - MARK_MONGO_HOST=mark-mongo
    ports:
      - "8080:8080"
    depends_on:
      - mongo
  mongo:
    image: mongo:4.4
    container_name: mark-mongo
You can now start the server with:
docker-compose up
For now the server is empty as there is no data flowing in, and no detector configured...
You can stop the server with ctrl + c.
MARk supports the concept of built-in data sources. These data sources are started automatically when the server starts. From a technical point of view, a built-in data source is a Java class that implements the DataAgentInterface.
The current version of MARk (2.5.3) has one built-in data agent: the FileSource. The FileSource is a generic data agent that reads a file and parses it line by line using a named regular expression. So it can be used for any kind of object to be ranked. This data source takes 2 mandatory parameters:
file: the file to read
regex: the named regex to use
To illustrate how this data source works, we will build an example, where:
You can download an example log file, and save it to the modules directory, with the following command:
wget https://cylab.be/s/bIJf4 -O modules/1000_http_requests.txt
This file contains lines like these:
1472083251.488 575 184.108.40.206 TCP_MISS/200 1411 GET http://ajdd.rygxzzaid.mk/xucjehmkd.html - DIRECT/220.127.116.11 text/html
1472083251.573 920 18.104.22.168 TCP_MISS/200 765 GET http://epnazrk.wmaj.ga/zlrsmtcc.html - DIRECT/22.214.171.124 text/html
1472083251.613 444 126.96.36.199 TCP_MISS/200 755 GET http://epnazrk.wmaj.ga/zjeglwir.html - DIRECT/188.8.131.52 text/html
1472083251.658 590 184.108.40.206 TCP_MISS/200 1083 GET http://kfiger.wfltjx.cc/uxmt.html - DIRECT/220.127.116.11 text/html
1472083251.724 683 18.104.22.168 TCP_MISS/200 1419 GET http://isogbg.hgwpxah.nz/roeefw.html - DIRECT/22.214.171.124 text/html
1472083251.862 442 126.96.36.199 TCP_MISS/200 1960 GET http://rkfko.apyeqwrqg.cm/rdhufye.html - DIRECT/249.70.126.8 text/html
1472083251.938 276 188.8.131.52 TCP_MISS/200 111 GET http://ootlgeqo.fomu.ve/sfidbhq.html - DIRECT/243.179.195.173 text/html
1472083252.040 888 184.108.40.206 TCP_MISS/200 1055 GET http://qddggmg.rtvw.ru/uwwmy.html - DIRECT/220.127.116.11 text/html
1472083252.155 209 18.104.22.168 TCP_MISS/200 713 GET http://swienzd.uzqwmbs.nu/tzxpdxdq.html - DIRECT/22.214.171.124 text/html
1472083252.163 106 126.96.36.199 TCP_MISS/200 680 GET http://lisbnwk.hhafb.sb/uenyswiuf.html - DIRECT/188.8.131.52 text/html
Now we must build a named regex (a regex whose capture groups can be given a name) to extract the timestamp and the client IP address from each line.
You can use a site like regex101.com to help you build the regex.
Now create the configuration file for the file data agent: name it file.data.yml and place it in the modules directory. Note that the \ characters must be escaped:
class_name: be.cylab.mark.data.FileSource
label: data.proxy
parameters:
  file: 1000_http_requests.txt
  regex: "^(?<timestamp>\\d+\\.\\d+)\\s+\\d+\\s(?<client>\\d+\\.\\d+\\.\\d+\\.\\d+)"
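Before starting the server, it can help to check the regex locally against one line of the example file. The following is a minimal sketch, not MARk code: the class ProxyLineParser and its parse method are hypothetical names used only for illustration. It relies on the standard java.util.regex API, which accepts the same named-group syntax, and a Java string literal requires the same doubling of the \ characters as the double-quoted YAML value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it is not part of MARk.
class ProxyLineParser {

    // Same pattern as in file.data.yml. In a Java string literal, as in a
    // double-quoted YAML value, every backslash has to be written twice.
    static final Pattern LINE = Pattern.compile(
        "^(?<timestamp>\\d+\\.\\d+)\\s+\\d+\\s(?<client>\\d+\\.\\d+\\.\\d+\\.\\d+)");

    // Returns {timestamp, client}, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("timestamp"), m.group("client") };
    }

    public static void main(String[] args) {
        // First line of the example log file.
        String line = "1472083251.488 575 184.108.40.206 TCP_MISS/200 1411 GET "
            + "http://ajdd.rygxzzaid.mk/xucjehmkd.html - DIRECT/220.127.116.11 text/html";
        String[] fields = parse(line);
        System.out.println("timestamp=" + fields[0] + " client=" + fields[1]);
    }
}
```

If parse returns null for a line from your own log file, the regex probably needs adjusting (for example, when fields are separated by several spaces) before it goes into file.data.yml.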
Now you can start the MARk server:
docker-compose up
After a few seconds, the server will be up and running, and the data agent will start reading and parsing the file.
The web interface will be available at http://127.0.0.1:8000 with the following default credentials:
At the bottom of the Status page, you can see that the configured data source is properly listed. You can also see that the database ingested 1000 data records. Finally, by clicking on "Inspect", we can see that the data records have been correctly parsed.
However, all records seem to have the same timestamp. Indeed, by default the FileSource tries to read the file as fast as possible. To respect the time interval between lines, you can use the speed configuration parameter:
class_name: be.cylab.mark.data.FileSource
label: data.proxy
parameters:
  file: 1000_http_requests.txt
  regex: "^(?<timestamp>\\d+\\.\\d+)\\s+\\d+\\s(?<client>\\d+\\.\\d+\\.\\d+\\.\\d+)"
  speed: "1"
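To get a feel for what a speed value could mean in practice, here is a small sketch. It rests on an assumption that is not taken from the MARk source code: that the wait before each record equals the gap between consecutive timestamps divided by speed, so that a speed of 1 replays in real time and a smaller value replays more slowly. The class ReplayDelays and its method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; MARk's actual scheduling may differ.
class ReplayDelays {

    // Wait time in milliseconds before emitting each record after the first,
    // under the ASSUMPTION that delay = (gap between timestamps) / speed.
    static List<Long> delaysMillis(double[] timestamps, double speed) {
        List<Long> delays = new ArrayList<>();
        for (int i = 1; i < timestamps.length; i++) {
            double gapSeconds = timestamps[i] - timestamps[i - 1];
            delays.add(Math.round(gapSeconds * 1000.0 / speed));
        }
        return delays;
    }

    public static void main(String[] args) {
        // Timestamps of the first three lines of the example log file.
        double[] ts = { 1472083251.488, 1472083251.573, 1472083251.613 };
        System.out.println(delaysMillis(ts, 1.0)); // real-time replay
        System.out.println(delaysMillis(ts, 0.5)); // twice as slow
    }
}
```

Under this assumption, halving the speed doubles every wait, and the total replay time is the span of the file divided by speed.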
This time the data records will be processed according to their timestamps. In the example data file, the lines span 8 seconds. If needed, you can even slow down processing by specifying a smaller speed value, like