java-wowa-training


The WOWA operator (Torra) is a powerful aggregation operator that combines multiple input values into a single score. This is particularly interesting for detection and ranking systems that rely on multiple heuristics: such a system can use WOWA to produce a single meaningful score.

A Java implementation of WOWA is available at https://github.com/tdebatty/java-aggregation.

The WOWA operator requires two sets of parameters: p weights and w weights. In this project we use a genetic algorithm to compute the best values for p and w weights. For the training, the algorithm uses a dataset of input vectors together with the expected aggregated score of each vector.
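The operator itself is provided by the java-aggregation library linked above. As background, here is a minimal, self-contained sketch of the WOWA computation: the inputs are visited in decreasing order, the cumulative w weights are interpolated into a monotone function, and each input is re-weighted by the increment of that function over the accumulated p weights. All class and method names below are illustrative, not the library's API, and both weight vectors are assumed non-negative and summing to 1.

```java
import java.util.Arrays;
import java.util.Comparator;

public final class WowaSketch {

    // Piecewise-linear interpolation of the monotone function w* that maps
    // k/n to the cumulative sum of the first k ordering weights w.
    static double interpolate(final double x, final double[] w) {
        int n = w.length;
        if (x <= 0) return 0;
        if (x >= 1) return 1;                // assumes the w weights sum to 1
        double pos = x * n;
        int k = (int) Math.floor(pos);       // number of complete segments
        double cum = 0;
        for (int j = 0; j < k; j++) cum += w[j];
        return cum + (pos - k) * w[k];       // linear part of the next segment
    }

    // WOWA of 'values' with ordering weights w and importance weights p.
    static double wowa(final double[] w, final double[] p, final double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        // visit indices so that values appear in decreasing order
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> values[i]).reversed());
        double result = 0;
        double acc = 0;  // accumulated p weight of the sorted prefix
        for (int i = 0; i < n; i++) {
            double prev = interpolate(acc, w);
            acc += p[order[i]];
            double omega = interpolate(acc, w) - prev;
            result += omega * values[order[i]];
        }
        return result;
    }

    public static void main(String[] args) {
        double[] w = {0.25, 0.25, 0.25, 0.25};  // uniform ordering weights
        double[] p = {0.25, 0.25, 0.25, 0.25};  // uniform importance weights
        // With uniform weights WOWA reduces to the plain average.
        System.out.println(wowa(w, p, new double[]{0.1, 0.2, 0.3, 0.4}));
    }
}
```

With uniform w and p weights WOWA degenerates to the arithmetic mean; with w concentrated on the first position it behaves like a maximum, which is why tuning these weights matters.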

This project is a Java implementation of the PHP wowa-training project.

Installation

Using Maven:

<dependency>
  <groupId>be.cylab</groupId>
  <artifactId>java-wowa-training</artifactId>
  <version>0.0.4</version>
</dependency>

https://mvnrepository.com/artifact/be.cylab/java-wowa-training

Usage

public static void main(String[] args) {

    Logger logger = Logger.getLogger(Trainer.class.getName());
    logger.setLevel(Level.INFO);
    int population_size = 100;
    int crossover_rate = 60;
    int mutation_rate = 10;
    int max_generation = 110;
    int selection_method = TrainerParameters.SELECTION_METHOD_RWS;
    int generation_population_method = TrainerParameters.POPULATION_INITIALIZATION_RANDOM;

    TrainerParameters parameters = new TrainerParameters(logger, population_size,
            crossover_rate, mutation_rate, max_generation, selection_method,
            generation_population_method);

    //Input data
    List<List<Double>> data = new ArrayList<>();
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.2, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.8, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.2, 0.6, 0.3, 0.4)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.2, 0.5, 0.8)));
    data.add(new ArrayList<>(Arrays.asList(0.5, 0.1, 0.2, 0.3)));
    data.add(new ArrayList<>(Arrays.asList(0.1, 0.1, 0.1, 0.1)));

    //Expected aggregated value for each data vector
    List<Double> expected = new ArrayList<>(Arrays.asList(0.1, 0.2, 0.3, 0.4, 0.5, 0.6));
    //Create object for the type of Solution (fitness score evaluation)
    SolutionDistance solution_type = new SolutionDistance(data.get(0).size());
    //Create trainer object
    Trainer trainer = new Trainer(parameters, solution_type);

    AbstractSolution solution = trainer.run(data, expected);
    //Display solution
    System.out.println(solution);

}

The example above will produce something like:

SolutionDistance{
    weights_w=[0.1403303611048977, 0.416828569516884, 0.12511121306189063, 0.1872211165629538, 0.1305087298401635],
    weights_p=[0.0123494228072248, 0.10583088288437666, 0.5459452827654444, 0.17470250892324257, 0.1611718492107217],
    distance=8.114097675242476}

The run method returns a solution object consisting of the p weights and w weights to use with the WOWA operator, together with the total distance between the expected aggregated values given as a parameter and the aggregated values computed by WOWA with these weights.
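That distance is, in essence, the accumulated gap between the expected scores and the scores a candidate's weights actually produce. A minimal illustration of the idea (the library may accumulate the gap differently, e.g. squared rather than absolute; this sketch assumes absolute differences):

```java
import java.util.List;

public final class DistanceSketch {

    // Total absolute distance between the expected scores and the
    // aggregated scores that WOWA produced with candidate weights.
    // A smaller distance means a fitter chromosome.
    static double distance(final List<Double> expected, final List<Double> aggregated) {
        double total = 0;
        for (int i = 0; i < expected.size(); i++) {
            total += Math.abs(expected.get(i) - aggregated.get(i));
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(distance(List.of(0.1, 0.2), List.of(0.15, 0.1)));
    }
}
```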

The run method can be used with List arguments, as in the example above, or with two JSON file names: one file contains the data vectors and the other the expected results.
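The exact JSON layout is defined by the library; assuming plain nested arrays mirroring the in-memory example above, the data file might look like:

```json
[[0.1, 0.2, 0.3, 0.4],
 [0.1, 0.8, 0.3, 0.4],
 [0.2, 0.6, 0.3, 0.4]]
```

with the expected-results file holding a matching flat array such as `[0.1, 0.2, 0.3]`.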

Parameters description

Solution type

The algorithm is designed to work with different methods for evaluating the fitness score of each chromosome. Two criteria are already implemented: distance and AUC (area under the ROC curve).
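As a reference for the second criterion, a compact, self-contained way to compute an AUC from scores and binary labels is the rank-sum (Mann-Whitney) formulation. This is an illustration of the metric, not the library's implementation:

```java
public final class AucSketch {

    // AUC as the probability that a randomly chosen positive example
    // is scored higher than a randomly chosen negative one.
    // Ties count for one half; labels are 1 (alert) or 0 (normal).
    static double auc(final double[] scores, final int[] labels) {
        double pairs = 0;
        double wins = 0;
        for (int i = 0; i < scores.length; i++) {
            if (labels[i] != 1) continue;
            for (int j = 0; j < scores.length; j++) {
                if (labels[j] != 0) continue;
                pairs++;
                if (scores[i] > scores[j]) wins++;
                else if (scores[i] == scores[j]) wins += 0.5;
            }
        }
        return wins / pairs;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.4, 0.2};
        int[] labels = {1, 0, 1, 0};
        System.out.println(auc(scores, labels));  // 3 of 4 pairs ranked correctly
    }
}
```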

It is possible to create a new Solution type with a new evaluation criterion. The new Solution type must extend the AbstractSolution class and override the computeScoreTo method. It is also necessary to modify the createSolutionObject method in the Factory class.

Cross-validation

Example

public static void main(String[] args) {

        Logger logger = Logger.getLogger(Trainer.class.getName());
        logger.setLevel(Level.INFO);
        int population_size = 100;
        int crossover_rate = 60;
        int mutation_rate = 10;
        int max_generation = 110;
        int selection_method = TrainerParameters.SELECTION_METHOD_RWS;
        int generation_population_method = TrainerParameters.POPULATION_INITIALIZATION_RANDOM;

        TrainerParameters parameters = new TrainerParameters(logger, population_size,
                crossover_rate, mutation_rate, max_generation, selection_method, generation_population_method);

        //Input data
        List<List<Double>> data = new ArrayList<>();
        data.add(new ArrayList<>(Arrays.asList(0.1, 0.2, 0.3, 0.4)));
        data.add(new ArrayList<>(Arrays.asList(0.1, 0.8, 0.3, 0.4)));
        data.add(new ArrayList<>(Arrays.asList(0.2, 0.6, 0.3, 0.4)));
        data.add(new ArrayList<>(Arrays.asList(0.1, 0.2, 0.5, 0.8)));
        data.add(new ArrayList<>(Arrays.asList(0.5, 0.1, 0.2, 0.3)));
        data.add(new ArrayList<>(Arrays.asList(0.1, 0.1, 0.1, 0.1)));
        data.add(new ArrayList<>(Arrays.asList(0.1, 0.2, 0.3, 0.4)));
        data.add(new ArrayList<>(Arrays.asList(0.1, 0.8, 0.3, 0.4)));
        data.add(new ArrayList<>(Arrays.asList(0.2, 0.6, 0.3, 0.4)));
        data.add(new ArrayList<>(Arrays.asList(0.5, 0.1, 0.2, 0.3)));
        data.add(new ArrayList<>(Arrays.asList(0.1, 0.1, 0.1, 0.1)));

        //Expected aggregated value for each data vector
        List<Double> expected = new ArrayList<>(Arrays.asList(1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0,1.0));
        //Create object for the type of Solution (fitness score evaluation)
        SolutionDistance solution_type = new SolutionDistance(data.get(0).size());
        //Create trainer object
        Trainer trainer = new Trainer(parameters, solution_type);

        HashMap<AbstractSolution, Double> solution = trainer.runKFold(data, expected, 2, 2);
        //Display solution
        for (Map.Entry<AbstractSolution, Double> val : solution.entrySet()) {
            System.out.println(val);
        }
    }

The runKFold method runs a k-fold cross-validation. Concretely, it splits the dataset into k folds. In each round, a single fold is retained as validation data for testing the model and the remaining k − 1 folds are used as training data. The process is repeated k times, so that each fold is used exactly once as validation data, and the k results can then be averaged to produce a single estimate. For each tested fold, the Area Under the Curve (AUC) is also computed to evaluate classification performance (this only works when the expected vector contains only 0 and 1 values).
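The fold construction itself can be sketched as follows. This is an illustrative partitioning of indices into k near-equal folds (the library may assign vectors to folds differently, e.g. after shuffling):

```java
import java.util.ArrayList;
import java.util.List;

public final class KFoldSketch {

    // Split the indices 0..size-1 into k folds of near-equal size.
    // In round r, fold r is the validation set and the other k-1 folds
    // are concatenated into the training set.
    static List<List<Integer>> folds(final int size, final int k) {
        List<List<Integer>> result = new ArrayList<>();
        for (int r = 0; r < k; r++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < size; i++) {
            result.get(i % k).add(i);   // round-robin assignment
        }
        return result;
    }

    public static void main(String[] args) {
        // 11 vectors and 2 folds, as in the example above:
        // validation folds of sizes 6 and 5
        for (List<Integer> fold : folds(11, 2)) {
            System.out.println(fold);
        }
    }
}
```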

The code above produces a result similar to:

SolutionDistance{
weights_w=[0.8673383311511217, 0.04564604584006219, 0.0647437341741078, 0.022271888834708403], 
weights_p=[0.5933035227430291, 0.10784413855996985, 0.03387258778518031, 0.26497975091182074], 
fitness score=2.2260299633096268}=
0.16666666666666666

SolutionDistance{
weights_w=[0.7832984118592771, 0.12307744745817546, 0.07982187970335382, 0.013802260979193624], 
weights_p=[0.01945033161182157, 0.3466399858254755, 0.18834296208558235, 0.44556672047712065], 
fitness score=1.7056044468736795}=
0.4166666666666667

As output, the runKFold method returns a HashMap that maps the best solution found for each fold to the AUC achieved by that solution. It takes as arguments the dataset (data and expected results), the number of folds used in the cross-validation, and a value that can increase the number of alert vectors if that number is too low. This is useful to increase the penalty for failing to detect an alert.

As with classical training, the runKFold method can be used as in the example above or with JSON files. In that case, the arguments are Strings containing the file names.

References

Check this project on GitLab