APT-GRAPH

Development takes place at https://gitlab.cylab.be/cylab/apt-graph

The focus of APT-GRAPH is the detection of Advanced Persistent Threats (APT). More specifically, the aim is to study proxy log files and to detect a domain used as Command and Control (C2) by an APT. The implemented algorithm models the traffic as a graph and tries to detect infections by looking for anomalies within this graph. The algorithm is designed to work closely with an analyst, who can interactively tune a set of parameters and adapt the algorithm to focus on a specific type of APT.

Requirements

Maven modules

Main modules

Auxiliary modules

Quick Start

Get the latest version from GitHub.

git clone https://github.com/RUCD/apt-graph.git

Build all the modules together.

cd apt-graph/apt-graph
mvn clean install

Run the Batch Processor to build the graphs.

cd ../batch
./analyze.sh -i <proxy log file> -o <graphs directory>

There is a test file in batch/src/test/resources/. Use the following command to check the Batch Processor with the test file:

./analyze.sh -i ./src/test/resources/1000_http_requests.txt -o /tmp/mytest/

Run the server.

cd ../server
./start.sh -i <graphs directory>

By default, the UI is available at http://127.0.0.1:8000 and the JSON-RPC Server is at http://127.0.0.1:8080.

There is a folder in server/src/test/resources/ containing dummy graphs. Use the following command to check the Server with these graphs:

./start.sh -i ./src/test/resources/dummyDir/

Connect to the UI using a browser (http://127.0.0.1:8000). Choose the parameters as shown on the screenshots below and click on "Apply" to get the result.

If everything went well, you should get something like this:

UI-example screenshots

Usage

batch

./analyze.sh -h
usage: java -jar batch-<version>.jar
-c <arg>   Select only temporal children (option, default: true)
-f <arg>   Specify format of input file (squid or json) (option, default: squid)
-h         Show this help
-i <arg>   Input log file (required)
-k <arg>   Impose k value of k-NN graphs (option, default: 20)
-o <arg>   Output directory for graphs (required)
-x <arg>   Overwrite existing graphs (option, default: false)

A typical command to start the preprocessing is:

./analyze.sh -i <proxy log file> -o <graphs directory> -k 50 -f squid

server

./start.sh -h
usage: java -jar server-<version>.jar
-h             Show this help
-i <arg>       Input directory with graphs (required)
-study <arg>   Study output mode (false = web output, true = study output) (option, default: false)

A typical command to start the Server is:

./start.sh -i <graphs directory>

infection

./infect.sh -h
usage: java -jar infection-<version>.jar
 -d <arg>            APT domain name (required)
 -delay <arg>        Delay between start of the burst and injection of APT
                     (option for traffic APT, default: middle of the burst)
 -delta <arg>        Duration between two requests of the same burst (required for traffic APT)
 -distance <arg>     Minimal time distance between two injections
                     (option for traffic APT, default: no limitation)
 -duration <arg>     Minimal duration of a burst to allow APT injection (required for traffic APT)
 -f <arg>            Specify format of input file (squid or json) (option, default: squid)
 -h                  Show this help
 -i <arg>            Input log file (required)
 -injection <arg>    Maximal daily number of injections
                     (option for traffic APT, default: no limitation)
 -o <arg>            Output log file (required)
 -proportion <arg>   Injection rate in the possible bursts (1 = inject in all possible bursts)
                     (option for traffic APT, default: 1)
 -step <arg>         Specify time step between periodic injections in milliseconds
                     (required for periodic APT)
 -t <arg>            Type (periodic or traffic) (required)
 -u <arg>            Targeted user or subnet (required)

A typical command to simulate a periodic infection is:

./infect.sh -i <log file path> -o <output file path> -u <user ip> -d APT.FINDME.apt -t periodic -step 43200000

A typical command to simulate a traffic-based infection is:

./infect.sh -i <log file path> -o <output file path> -u <user ip> -d APT.FINDME.apt -t traffic -duration 5000 -delta 1000

traffic

./traffic.sh -h
usage: java -jar traffic-<version>.jar
 -f <arg>   Specify format of input file (squid or json) (option, default: squid)
 -h         Show this help
 -i <arg>   Input log file (required)
 -o <arg>   Output CSV file (required)
 -r <arg>   Time resolution in milliseconds (required)

A typical command to compute a traffic histogram is:

./traffic.sh -i <input log file> -o <output CSV file> -r 1000

config

./config.sh -h
usage: java -jar config-<version>.jar
 -field <arg>   Configuration field to sweep (required)
 -h             Show this help
 -i <arg>       Input configuration file (default configuration line) (required)
 -multi <arg>   Sweep the given field as complement to stop value of the first field
                (option, default: no second field)
 -o <arg>       Output configuration file (required)
 -start <arg>   Start value of sweep (required)
 -step <arg>    Step of sweep (required)
 -stop <arg>    Stop value of sweep (required)

A typical default configuration line is:

{"input_dir":"<input directory>",
"output_file":"<output file path>/ROC_anon.csv",
"n_apt_tot":"2","user":"108.142.213.0","feature_weights_time":"0.1",
"feature_weights_domain":"0.9","feature_weights_url":"0.0",
"feature_ordered_weights_1":"0.8","feature_ordered_weights_2":"0.2",
"prune_threshold":"0.00","max_cluster_size":"1000000",
"prune_z":"true","cluster_z":"false","whitelist":"true",
"white_ongo":"","number_requests":"5","ranking_weights_parents":"0.4",
"ranking_weights_children":"0.4","ranking_weights_requests":"0.2",
"apt_search":"true"}

A typical command to create a configuration file that sweeps the pruning threshold is:

./config.sh -i <default configuration file> -o <output configuration file> -field prune_threshold -start 0.0 -stop 1.0 -step 0.1

study

./study.sh -h
usage: java -jar study-<version>.jar
 -h         Show this help
 -i <arg>   Input configuration file (required)
 -x <arg>   Overwrite existing files (option, default: false)

A typical command to produce several ROC curves based on the provided configuration file is:

./study.sh -i <input configuration file>

Data representation

Request

The Request Object contains all the needed information about a request. Two types of proxy log files are supported: Squid and JSON.

The following example is a typical line of a Squid format file:

1425971539.000   1364 108.142.226.170 TCP_NC_MISS/200 342 GET http://77efee5dcb3635e09435eb33a8351364.3da9819b747e806d78f83f22c703d178.an/59a543c185f3330a33a47736d6879e16 - -/146.159.80.113 image/gif
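
As an illustration only (this is not the project's actual parser), a Squid-format line such as the one above can be split on whitespace into its usual columns: timestamp, elapsed time, client IP, cache result/status, size, method, URL, ident, hierarchy/peer and content type. The SquidLineExample class and its SquidFields record below are hypothetical stand-ins for APT-GRAPH's own Request class.

// Minimal, illustrative parser for a Squid access-log line.
// The record and field names are hypothetical; APT-GRAPH's own Request class differs.
import java.net.URI;

public final class SquidLineExample {

    record SquidFields(long timestampMs, int elapsedMs, String clientIp,
                       String resultCode, int bytes, String method,
                       String url, String domain, String contentType) {}

    static SquidFields parse(String line) {
        String[] f = line.trim().split("\\s+");
        // 0: timestamp, 1: elapsed, 2: client IP, 3: cache result/status,
        // 4: size, 5: method, 6: URL, 7: ident, 8: hierarchy/peer, 9: MIME type
        long timestampMs = (long) (Double.parseDouble(f[0]) * 1000);
        String domain = URI.create(f[6]).getHost();
        return new SquidFields(timestampMs, Integer.parseInt(f[1]), f[2],
                f[3], Integer.parseInt(f[4]), f[5], f[6], domain, f[9]);
    }

    public static void main(String[] args) {
        // Shortened example line, same column layout as above.
        String line = "1425971539.000   1364 108.142.226.170 TCP_NC_MISS/200 342 GET "
                + "http://example.an/path - -/146.159.80.113 image/gif";
        System.out.println(parse(line));
    }
}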

The following example is a typical line of a JSON format file:

{"@version":"1","@timestamp":"2014-10-10T23:12:24.000Z","type":"proxy_fwd_iwsva","timestamp":"Sat, 11 Oct 2014 01:12:24,CEST","tk_username":"192.168.2.167","tk_url":"http://weather.service.msn.com/data.aspx?src=Windows7&amp;wealocations=wc:8040075&amp;weadegreetype=F&amp;culture=en-US","tk_size":0,"tk_date_field":"2014-10-11 01:12:24+0200","tk_protocol":"http","tk_mime_content":"text/xml","tk_client_ip":"192.168.2.167","tk_server_ip":"92.122.122.162","tk_domain":"weather.service.msn.com","tk_path":"data.aspx","tk_file_name":"data.aspx","tk_operation":"GET","tk_uid":"0271674894-29fe6b562438c1f7e996","tk_category":"40","tk_category_type":"0","geoip":{"ip":"92.122.122.162","country_code2":"EU","country_code3":"EU","country_name":"Europe","continent_code":"EU","latitude":47.0,"longitude":8.0,"location":[8.0,47.0]},"category":"Search Engines/Portals"}

Domain

A Domain Object is defined as a list of Requests. Each list is named after the related domain name. The similarity between two domains is defined as the sum of the similarities between the requests of these domains.
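
The following is a minimal sketch of that definition, assuming a simplified Request type and a placeholder request-level similarity; the project's own Core and Domain classes differ.

// Illustrative sketch: domain similarity as the sum of pairwise request similarities.
// Request and requestSimilarity(...) are simplified, hypothetical stand-ins.
import java.util.List;

final class DomainSimilarityExample {

    record Request(long time, String client, String domain, String url) {}

    static double domainSimilarity(List<Request> domainA, List<Request> domainB) {
        double sum = 0.0;
        for (Request a : domainA) {
            for (Request b : domainB) {
                sum += requestSimilarity(a, b);   // any of the request similarities
            }
        }
        return sum;
    }

    // Placeholder: in APT-GRAPH the request similarities are defined in the Core.
    static double requestSimilarity(Request a, Request b) {
        return a.domain().equals(b.domain()) ? 1.0 : 0.0;
    }
}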

Graph

The graphs of requests built by the Batch Processor are k-NN graphs. All other graphs are general graphs. The tools used to compute and process the graphs are implemented in java-graphs.
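
Conceptually, a k-NN graph keeps, for each node, its k most similar neighbours. The brute-force sketch below only illustrates the idea with generic types; it is neither the java-graphs API nor the project's implementation.

// Conceptual brute-force k-NN graph: for each node, keep the k most similar other nodes.
// Illustration only; APT-GRAPH delegates this work to the java-graphs library.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

final class KnnSketch<T> {

    Map<T, List<T>> build(List<T> nodes, int k, BiFunction<T, T, Double> similarity) {
        Map<T, List<T>> graph = new HashMap<>();
        for (T node : nodes) {
            List<T> others = new ArrayList<>(nodes);
            others.remove(node);
            // Sort candidates by decreasing similarity and keep the k best.
            others.sort(Comparator.comparingDouble((T other) -> similarity.apply(node, other)).reversed());
            graph.put(node, others.subList(0, Math.min(k, others.size())));
        }
        return graph;
    }
}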

Clusters

A graph of clusters is modeled as a list of graphs. Each of these graphs represents a cluster.
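
In code terms this is nothing more than a list of graph objects, one per cluster. The Graph record below is a simplified, hypothetical stand-in for the project's graph type:

// Illustrative only: a "graph of clusters" is just a list of graphs, one per cluster.
import java.util.ArrayList;
import java.util.List;

final class ClustersSketch {
    record Graph<T>(List<T> nodes) {}   // simplified stand-in for the real graph type

    public static void main(String[] args) {
        List<Graph<String>> clusters = new ArrayList<>();   // one Graph per cluster
        clusters.add(new Graph<>(List.of("cnn.com", "edition.cnn.com")));
        clusters.forEach(c -> System.out.println("cluster size: " + c.nodes().size()));
    }
}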

Algorithm

Core

The Core defines the similarities used to compute the k-NN graphs of each user. The chosen similarities are the following:

e.g.: edition.cnn.com and cnn.com have eq_6, eq_7 and eq_8

Batch Processor

The Batch Processor is composed of the following processing steps:

  1. parse a proxy log file (Squid or JSON format);
  2. split the data by user (see the sketch after this list);
  3. build k-NN graphs of requests for each similarity and each user;
  4. select the children requests among the neighbour requests (optional);
  5. compute graphs of domains for each similarity and each user;
  6. store all necessary data in the graphs directory: user graphs (ip.address.ser, e.g.: 192.168.2.1.ser), list of users (users.ser), list of subnets (subnets.ser), k value (k.ser).
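
As a rough illustration of steps 2 and 6 (simplified, hypothetical types; the real Batch Processor serializes the per-user graphs rather than raw requests), requests can be grouped by client IP and each user's data written to <ip>.ser:

// Illustrative only: group requests per user (step 2) and serialize per-user data (step 6).
// The Request record is a simplified stand-in for APT-GRAPH's own classes.
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BatchSketch {

    record Request(long time, String clientIp, String domain) implements Serializable {}

    static Map<String, List<Request>> splitByUser(List<Request> requests) {
        Map<String, List<Request>> byUser = new HashMap<>();
        for (Request r : requests) {
            byUser.computeIfAbsent(r.clientIp(), ip -> new ArrayList<>()).add(r);
        }
        return byUser;
    }

    static void store(Path outputDir, String user, Serializable data) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new FileOutputStream(outputDir.resolve(user + ".ser").toFile()))) {
            out.writeObject(data);   // e.g. 192.168.2.1.ser
        }
    }
}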

Server

The Server is composed of the following processing steps:

  1. load the data of the users selected by the analyst (ip.address.ser, users.ser, subnets.ser, k.ser);
  2. merge the similarity graphs of each user using a weighted sum of similarities (see the sketch after this list);
  3. merge all user graphs using a sum of similarities;
  4. prune the merged graph;
  5. compute clusters in the graph (deprecated);
  6. filter large clusters (deprecated);
  7. clean the graph based on whitelisting (optional);
  8. compute the ranked list of suspicious domains.
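
For step 2, the sketch below shows a weighted sum of edge similarities over several per-feature graphs, here represented as plain "edge -> similarity" maps. This is a simplification; the weights correspond to the feature_weights_time, feature_weights_domain and feature_weights_url fields of the configuration shown above.

// Illustrative only: merge several similarity graphs into one by a weighted sum of
// edge weights. Graphs are represented here as simple "edge -> similarity" maps.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MergeSketch {

    static Map<String, Double> weightedMerge(List<Map<String, Double>> graphs, double[] weights) {
        Map<String, Double> merged = new HashMap<>();
        for (int i = 0; i < graphs.size(); i++) {
            double w = weights[i];   // e.g. time, domain and URL feature weights
            for (Map.Entry<String, Double> edge : graphs.get(i).entrySet()) {
                merged.merge(edge.getKey(), w * edge.getValue(), Double::sum);
            }
        }
        return merged;
    }
}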

UI

The UI gives access to the following parameters:

Further details

Documentation is available here. Further details can be found in the code itself, where each method has been documented.

License

Source code is released under the MIT license. See the LICENSE file for more information.

Check this project on GitLab