Pseudonymisation by random tables

Apr 10, 2024 by Charles Beumier

https://cylab.be/blog/333/pseudonymisation-by-random-tables

Pseudonymisation is a technique for data privacy protection that replaces personal information with artificial identifiers called pseudonyms. Some level of identification is retained, typically for analysis purposes. In contrast, anonymisation removes all identifying information.

This blog post justifies pseudonymisation in the context of mobile telephony research on attack detection and presents a method based on substituting sensitive identifiers using random tables.

The SS7 suite of protocols, used by 2G and 3G equipment, is still necessary today as a fallback solution when 5G or 4G antennas are out of reach. Many vulnerabilities in SS7 have been discovered, and their mitigation is limited due to design choices dating back to the 1970s.

Why ?

SS7 traffic is composed of messages that may leak subscribers’ private information, such as their position, calls, or SMS content.

Detecting suspicious behaviors within SS7 traffic often necessitates analyzing message details for each subscriber, as attackers can spoof any user. Although subscriber identifiers must be masked to comply with data protection rules (GDPR), behavioral analysis still requires a unique identifier per subscriber. Therefore, SS7 attack detection needs pseudonymisation, entailing a one-to-one mapping of original identifiers to pseudonyms. This mapping should be kept confidential and be hard to guess, even with knowledge of some mapping values.

The possibility of collecting some mapping values should not be underestimated. In our digital era, more and more information is accessible at low cost and in little time, and the risk of reidentification through cross-database analysis is not zero.

What ?

To comply with GDPR regulations, the details contained in SS7 traffic may not be directly linked to individuals. Two main subscriber identifiers are present in SS7 traffic: the International Mobile Subscriber Identity (IMSI), used by the network and stored on the SIM card, and the public telephone number (MSISDN). For the sake of simplicity, we will limit explanations to these two identifiers although a real application must consider pseudonymisation of other SS7 fields, such as the Global Titles, which may contain a mix of equipment and subscriber identifiers.

The IMSI, typically not publicly accessible, holds significant value for attackers as it enables them to target specific users. More information on this topic is available in this blog: https://cylab.be/blog/202/mobile-phones-should-you-be-afraid-of-disclosing-your-imsi. While knowledge of MSISDNs may be less critical, the link between an MSISDN and an IMSI can be established through specific SS7 functions or through databases from the black market.

As mentioned earlier, reidentification may occur through database cross-referencing. The content of an SMS (possibly in clear text in an SS7 message), the location of the connected antenna, or call schedules could lead to the identification of a specific subscriber. For confidentiality reasons, these pieces of information should be kept to a minimum in the SS7 data used for research.

When ?

Pseudonymisation should be implemented as soon as possible, preferably at the source (e.g., within the facilities of the mobile operator collecting data), to mitigate risks during data transportation.

How ?

The design of a pseudonymisation procedure must consider several criteria to ensure confidentiality and provide a practical software solution. To minimize the risk of private data leaks, the procedure should incorporate a mechanism for verifying that files have been effectively pseudonymised.

Random tables for confidentiality and uniqueness

The pseudonymisation algorithm must ensure that reversing the process, i.e. recovering the original identifiers from pseudonyms, is impossible, even when partial information (knowledge of some real identifiers and their pseudonyms) is available.

A random table offers a straightforward and fast solution: a real identifier serves as an index in a table containing randomly assigned unique values. Once the pseudonymization process is complete, the table is destroyed to prevent any potential leak.

A random table with unique values can be created in two steps. Table values are first initialized with their respective index (table[i] = i). Then the values are shuffled by swapping each table entry with another, randomly selected table entry. Because values are only ever swapped, the table values remain both unique and complete. We also added the condition that a table entry cannot be equal to its index, so that an identifier and its pseudonym always differ, which may prove useful to check that values were pseudonymised.
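The two steps above can be sketched in Python as follows. This is a minimal illustration, not the code used in the project: the function name `make_random_table`, the choice of a Fisher-Yates shuffle, the use of `secrets.SystemRandom` as random source, and the strategy of reshuffling until no entry equals its index are all assumptions.

```python
import secrets


def make_random_table(size, rng=None):
    """Return a permutation of range(size) in which table[i] != i for all i."""
    if size < 2:
        raise ValueError("a table without fixed points needs at least 2 entries")
    rng = rng or secrets.SystemRandom()  # OS-backed randomness (assumption)
    while True:
        # Step 1: initialise every entry with its own index.
        table = list(range(size))
        # Step 2: Fisher-Yates shuffle, swapping entry i with a random entry j <= i.
        for i in range(size - 1, 0, -1):
            j = rng.randint(0, i)  # both bounds are inclusive
            table[i], table[j] = table[j], table[i]
        # Extra condition from the text: no entry may equal its index.
        # Assumption: we simply reshuffle until this holds.
        if all(value != index for index, value in enumerate(table)):
            return table
```

Since about one random permutation in e (roughly 2.7) has no entry equal to its index, the loop is expected to run fewer than three times on average.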

Designing a random table that converts D-digit identifiers into D-digit pseudonyms offers a significant advantage: it ensures that identifiers keep their length after pseudonymisation. As a result, both the original and pseudonymized files possess identical lengths, differing only in the digits that are pseudonymized.

Practical solution

While the random-table algorithm presented above appears straightforward, its execution time can become prohibitive when dealing with the vast number of identifiers present in SS7 traffic. Also, IMSIs contain 15 digits, so a table indexed by full IMSIs would need 10^15 entries, far more memory than can reasonably be allocated. Fortunately, not all digits within IMSIs and MSISDNs need to be pseudonymized.

In fact, IMSIs and MSISDNs start with 5 digits for the country code and the operator code. These are necessary for traffic analysis and do not need to be pseudonymised. The remaining digits identify a subscriber. Due to the varying size of MSISDNs, it was decided to pseudonymise the last 5 digits of identifiers. This imposes a table length of 100,000 values. Each randomly generated table is one possibility among an astronomically large number of permutations: roughly 100000!/e once tables with an entry equal to its index are excluded. Notably, the randomness of table generation ensures that knowledge of a few entries provides practically no insight into the rest of the table.
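As an illustration of the last-5-digits substitution, here is a minimal Python sketch. The function name `pseudonymise`, the example IMSI, and the seeded stand-in table are assumptions made for the example; a real table would be generated from a strong random source and would also satisfy the "entry differs from index" condition.

```python
import random

DIGITS = 5                 # number of trailing digits that are replaced
TABLE_SIZE = 10 ** DIGITS  # 100,000 table entries

# Stand-in table for the example only (assumption): a seeded shuffle.
demo_table = list(range(TABLE_SIZE))
random.Random(0).shuffle(demo_table)


def pseudonymise(identifier, table):
    """Replace the last DIGITS digits of `identifier` using `table`."""
    prefix, suffix = identifier[:-DIGITS], identifier[-DIGITS:]
    if len(suffix) != DIGITS or not (suffix.isascii() and suffix.isdigit()):
        raise ValueError(f"expected at least {DIGITS} trailing digits")
    # Zero-pad so that the pseudonym keeps the same length as the original.
    return prefix + f"{table[int(suffix)]:0{DIGITS}d}"


imsi = "206011234567890"  # made-up 15-digit IMSI
pseudo = pseudonymise(imsi, demo_table)
print(imsi, "->", pseudo)
```

The leading digits are copied unchanged and the result always has the same length as the input, whatever value the table holds for the suffix.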

Separate random tables

When pseudonymizing multiple identifiers like the IMSI and MSISDN, it is safer to utilize separate random tables. This precaution is crucial within the SS7 context, where criminals may link an MSISDN with an IMSI (possibly with illegal databases), facilitating targeted attacks on public numbers.

If only one random table is used for IMSI and MSISDN pseudonymization, the chance of finding pseudonymised-real pairs is higher, thanks to cross-referencing of databases about IMSIs, databases about MSISDNs and possibly illegal databases from the black market.
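A minimal Python sketch of this separation is shown below. The names `TABLES`, `random_table` and `pseudonymise_field`, as well as the example identifiers, are assumptions for illustration, and the "entry differs from index" condition is left out for brevity.

```python
import secrets

DIGITS = 5


def random_table(size):
    """Return a random permutation of range(size) (illustrative helper)."""
    table = list(range(size))
    secrets.SystemRandom().shuffle(table)
    return table


# One independent table per identifier type.
TABLES = {kind: random_table(10 ** DIGITS) for kind in ("imsi", "msisdn")}


def pseudonymise_field(kind, identifier):
    """Pseudonymise the last DIGITS digits with the table for this field type."""
    table = TABLES[kind]
    suffix = int(identifier[-DIGITS:])
    return identifier[:-DIGITS] + f"{table[suffix]:0{DIGITS}d}"


# Both made-up identifiers end in 67890, but each goes through its own table.
print(pseudonymise_field("imsi", "206011234567890"))
print(pseudonymise_field("msisdn", "32470167890"))
```

With separate tables, learning that a given 5-digit ending maps to a certain pseudonym in MSISDNs reveals nothing about how the same ending is mapped in IMSIs.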

Processing Time

An important criterion for practical implementation is processing time, especially when dealing with high volumes of traffic data. Pseudonymizing such data can take days, depending on the computer infrastructure and programming choices.

One bottleneck in the process can be file access. Identifier values need to be located in files, pseudonymized and then written back to files. One optimized approach involves copying a file, locating the identifiers of interest, pseudonymizing their last 5 digits and overwriting those 5 digits in place in the copy. When a file contains only a few identifiers, only a small percentage of it needs to be updated.
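The copy-then-overwrite approach could look like the following Python sketch. It rests on strong assumptions that are not stated in this post: identifiers are taken to be stored as ASCII digits, and the byte offsets of their last 5 digits (`suffix_offsets`) are assumed to come from a format-specific locator that is not shown. The function name `pseudonymise_copy` is hypothetical.

```python
import shutil

DIGITS = 5


def pseudonymise_copy(src_path, dst_path, suffix_offsets, table):
    """Copy src to dst, then overwrite 5 digits at each offset in the copy."""
    shutil.copyfile(src_path, dst_path)
    with open(dst_path, "r+b") as copy:  # read/write, no truncation
        for offset in suffix_offsets:
            copy.seek(offset)
            original = copy.read(DIGITS)
            if len(original) != DIGITS or not original.isdigit():
                raise ValueError(f"no {DIGITS}-digit value at offset {offset}")
            index = int(original.decode("ascii"))
            replacement = f"{table[index]:0{DIGITS}d}".encode("ascii")
            copy.seek(offset)        # go back to the start of the digits
            copy.write(replacement)  # same length, so the file size is unchanged
```

Only the bytes at the given offsets are rewritten; the rest of the copy is never touched after the initial file copy.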

Checking pseudonymisation

Before leaving the data collection site, pseudonymized SS7 files should undergo a verification process to confirm the pseudonymization of sensitive identifiers.

To check pseudonymisation, each original file is compared with its corresponding pseudonymised version, first ensuring that both files of the pair have the same length. The pseudonymised value of each pseudonymised identifier (last 5 digits) is written in a table indexed by the original value (last 5 digits). As soon as any table entry receives two different values when scanning the files of the database, the test exits and reports the value and position of the inconsistency. If all files are scanned without any inconsistencies, the test is declared successful.
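The check described above can be sketched in Python as follows. As before, this is an illustration under assumptions: file contents are passed as byte strings, identifiers are ASCII digits, and their offsets (`suffix_offsets`) are supplied by a locator that is not shown. The function name `check_file_pair` is hypothetical.

```python
DIGITS = 5


def check_file_pair(original, pseudonymised, suffix_offsets, seen):
    """Compare one original/pseudonymised pair of byte strings.

    `seen` is a list indexed by the original 5-digit value. It is shared
    across all file pairs so that inconsistencies between files are caught.
    Returns None if the pair is consistent, otherwise a message describing
    the first problem found.
    """
    if len(original) != len(pseudonymised):
        return "length mismatch between original and pseudonymised file"
    for offset in suffix_offsets:
        real = int(original[offset:offset + DIGITS].decode("ascii"))
        pseudo = int(pseudonymised[offset:offset + DIGITS].decode("ascii"))
        if seen[real] is None:
            seen[real] = pseudo  # first time this original value is met
        elif seen[real] != pseudo:
            return (f"value {real:0{DIGITS}d} at offset {offset} maps to "
                    f"{pseudo:0{DIGITS}d}, previously {seen[real]:0{DIGITS}d}")
    return None


seen = [None] * 10 ** DIGITS  # one slot per possible 5-digit value
problem = check_file_pair(b"AA12345BB12345", b"AA54321BB54321", [2, 9], seen)
print("OK" if problem is None else problem)
```

Running the same function over every file pair with the same `seen` list reproduces the database-wide scan: the first non-None result stops the test.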

Conclusion

We presented why and how to pseudonymise sensitive identifiers in 2G/3G SS7 data used for research, in order to protect data privacy. We explained how random tables applied to the last few digits can offer an elegant and efficient solution.

This blog post is licensed under CC BY-SA 4.0