In the digital era, personal data is worth its weight in bits. Over the last two decades, the internet and digital technology have radically transformed economies and societies. One of these transformations is the creation and monetization of personal data, and of information generally: the phenomenon of ‘infonomics’. It is estimated that a typical Silicon Valley giant (Amazon, Facebook, Google, etc.) makes an average of $35.00 per month from the average American adult’s data, and more than 4 billion people worldwide are now on social media alone. When something is ‘free’, the consumer is the product. Needless to say, establishing internal infonomics is seriously lucrative for online companies.
There is a dark side to infonomics, however. These mountains of data do not just affect the advertising preferences on our favorite free websites; they provide the substrate on which the complex algorithms that shape our lives are built. Our data is fed into AI black boxes that calculate how best to keep us, as individuals, engaged online, whether that means shopping for certain products, watching videos they know we find compelling, or creating associations between things they know we like and things they think we might like. As more and more consumers become aware of the value of their personal data and privacy, solutions and services are clamoring to take money for the promise of safeguarding your digital profile.
There are myriad services you can use, depending on what data you want to safeguard. If it’s your browsing or location data, there are lots of VPNs that let you browse otherwise region-locked streaming services like Netflix and Prime, or let you sail the high seas for anything you want. There are also apps and browser add-ons that prevent sites from collecting personal browsing data or cookies, which protects your ‘big data’ components and makes it much harder for advertisers to build a direct advertising profile for you. There is a new kind of service, however, that promises to reduce your personal data exposure conclusively, using blockchain technology.
Verified.Me currently exists as a service and application for encrypting highly personal information such as health records, Social Insurance Numbers (SINs), and government-issued IDs. By partnering with personal information stakeholders like banks, insurance companies, and telecoms, the app allows only one (highly encrypted) input and output for personal data, tokenizing your digital identity and then sharing it on a closed and permissioned ERC-20 chain. Banks were the first to adopt the technology and network in order to better protect and securely access client data. This allows your bank, insurer, employer, and government to use your identity ‘token’ without the need for special handling procedures and without ever directly seeing the entirety of your profile, only the specific data they need for their purposes.
While this might be the first app promising to ‘tokenize’ your identity, the concept of, and demand for, such a product are quickly developing as more and more people come to rely on digital means of earning their livelihoods and communicating everything in their personal lives. Tokenization using distributed ledger technology seems to be the most secure and most efficient way to protect and use personal data in the new Web 3.0.
How does any of this work?
Tokenization can protect privacy by ensuring that only tokens, rather than a permanent identity number or other PII, are exposed or stored during a transaction. In addition—where the same person is represented by different tokens in different databases—tokenization can limit the propagation of a single identifier (e.g., a unique ID number). This can help limit the ability to correlate a person’s data across different databases, which can be a privacy risk and also increases the possibility of fraud.
The essential features of a token are: (1) it should be unique, and (2) service providers and other unauthorized entities cannot “reverse engineer” the original identity or PII from the token. There are two primary types of tokenization:
- Front-end tokenization: “Front-end” tokenization is the creation of a token by the user as part of an online service that can later be used in digital transactions in place of the original identifier value. This is the approach taken by Aadhaar to create a Virtual ID derived from India’s Aadhaar Number. The problem with front-end tokenization is that it is very user driven, requiring users to be digitally literate and technically capable of both understanding why they would need a token and how to create one online. This could easily lead to a digital divide with regard to privacy protection.
- Back-end tokenization: “Back-end” tokenization is when the identity provider (or token provider) tokenizes identifiers before they are shared with other systems, limiting the propagation of the original identifier and controlling the correlation of data. Back-end tokenization is done automatically by the system without user intervention, meaning that people do not need to do anything manually or understand why they would need to create tokens, eliminating any potential digital divide and protecting identifiers and PII at source. Austria’s virtual citizen card is one example of this type of tokenization, and India has also implemented back-end tokenization of the Aadhaar number in addition to its Virtual ID.
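To make back-end tokenization a little more concrete, here is a minimal Python sketch of a reference-table approach, in which the identity provider hands each relying party a different random token for the same person. The class name `TokenVault`, the `tokenize`/`detokenize` methods, and the sample identifier are all hypothetical; this is an illustration of the idea, not how any of the systems named above are actually built.

```python
import secrets


class TokenVault:
    """Hypothetical identity-provider-side table mapping tokens to identifiers."""

    def __init__(self):
        self._token_by_key = {}   # (identifier, relying_party) -> token
        self._key_by_token = {}   # token -> (identifier, relying_party)

    def tokenize(self, identifier: str, relying_party: str) -> str:
        """Return a stable random token for this identifier and relying party."""
        key = (identifier, relying_party)
        if key not in self._token_by_key:
            # The token is random, so it carries no information about the identifier.
            token = secrets.token_hex(16)
            self._token_by_key[key] = token
            self._key_by_token[token] = key
        return self._token_by_key[key]

    def detokenize(self, token: str) -> str:
        """Map a token back to the identifier; only the vault owner can do this."""
        identifier, _relying_party = self._key_by_token[token]
        return identifier


vault = TokenVault()
bank_token = vault.tokenize("ID-0001", "bank")        # made-up identifier
insurer_token = vault.tokenize("ID-0001", "insurer")
```

Because the bank and the insurer hold different tokens, neither can join its records with the other's on the identifier alone, while the vault can still resolve either token when it needs to.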
The data contained on Austria’s virtual citizen card is called “Identity Link” and consists of full name, date of birth, cryptographic keys required for encryption and digital signatures, and the “SourcePIN”, a unique identifier created by strong encryption of the 12-digit unique ID (CRR) number. To ensure integrity and authenticity, the Identity Link data structure is digitally signed by the SourcePIN Register Authority at issuance. Access to the SourcePIN and cryptographic keys on a citizen card is protected by a PIN.
To safeguard user privacy, the eGovernment Act stipulates that a different identifier be used for each of the country’s 26 public administration sectors (e.g., tax, health, education) that a person accesses. A sector-specific personal identifier (ssPIN) is created from the SourcePIN using one-way derivation, a tokenization method through which a sector-specific PIN is algorithmically computed from the SourcePIN.
Unlike the SourcePIN, the ssPIN can be stored in administrative procedures. Public authorities can use the same ssPIN to retrieve a citizen’s data stored within the same procedural sector, for example, if they need to view the citizen’s records or use it to pre-fill forms. However, authorities do not have access to ssPINs from other sectors.
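The one-way derivation described above can be sketched in a few lines of Python. This is an assumption-laden illustration: the function name `derive_sspin`, the use of SHA-256, the `+` separator, and the sector codes `"TAX"` and `"HEALTH"` are all made up for the example and are not taken from the Austrian specification.

```python
import base64
import hashlib


def derive_sspin(source_pin: str, sector_code: str) -> str:
    """One-way derivation of a sector-specific identifier (illustrative only)."""
    # Combine the SourcePIN with the sector code, then hash the result.
    material = f"{source_pin}+{sector_code}".encode("utf-8")
    digest = hashlib.sha256(material).digest()
    # Encode the digest as text so it can be stored as an identifier.
    return base64.b64encode(digest).decode("ascii")


source_pin = "EXAMPLE-SOURCE-PIN"                  # made-up value
tax_pin = derive_sspin(source_pin, "TAX")
health_pin = derive_sspin(source_pin, "HEALTH")
```

The same SourcePIN and sector always give the same ssPIN, so records within a sector stay linked, while the tax and health identifiers differ and cannot be turned back into the SourcePIN by inverting the hash.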
Administrative procedures often require authorities from different sectors to work together. If authority “A” requires information about a person from authority “B” in another sector, authority “A” can request the identifier for sector “B” from the SourcePIN Register Authority by providing the identifier from its own sector, the person’s first and last name, and their date of birth. The SourcePIN Register Authority then sends the ssPIN for authority “B” to authority “A” in encrypted form; however, this can only be decrypted by authority “B”. In order to access the data, authority “A” then sends the encrypted ssPIN to authority “B,” which decrypts it and returns the requested data. Source: Privacy by Design: Current Practices in Estonia, India, and Austria.
Although tokenization and encryption both obscure personal data, they do so in different ways, as shown in Figure 14. In general, tokenization is often simpler and cheaper to implement than encryption and has a lower impact on relying parties, as they do not need to decrypt data in order to use it. Tokens also have the advantage that, because they replace PII rather than hiding it like encryption, it is impossible to recover the original data in the case of a data breach.
Figure 14. Tokenization vs. encryption
At the same time, however, tokenization requires a means of mapping tokens to the actual identifier or PII data values (e.g. a token vault or algorithm)—with the most obvious options being through cryptography or reference tables. This can create issues with scalability, particularly where there is a need to access the actual user data in order to complete a transaction. For authentication, this is not always the case, as there does not necessarily need to be a disclosure of any personal data in order to prove that the individual is who they say they are. Implementations such as GOV.UK Verify and Aadhaar are capable of managing the tokenization of identifiers at scale by avoiding the need to share data.
In January 2018, the Unique Identification Authority of India (UIDAI) announced the introduction of two services for the Aadhaar unique ID system: (a) Virtual ID, and (b) UID token and limited KYC. Both features use tokenization to enhance the privacy and protection of Aadhaar holders’ personal data.
The Virtual ID service involves front-end tokenization. It allows users to keep their unique, 12-digit Aadhaar number hidden from service providers by generating a random, 16-digit Virtual ID number. To do so, users access the resident portal and authenticate themselves using an OTP sent to their registered mobile number. The Virtual ID is mapped to the Aadhaar number by UIDAI. Once a person has generated a Virtual ID, they can provide that 16-digit number instead of their Aadhaar number for authentication; new Virtual ID numbers can be generated once every 24 hours.
A key privacy-enhancing aspect is that the Virtual ID is temporary and revocable. As a result, service providers cannot rely on it or use it for correlation across databases. Users can change their Virtual ID as needed, just as one would reset their computer password/PIN.
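As a rough illustration of a temporary, revocable front-end token, the Python sketch below issues a random 16-digit number, maps it to the underlying ID number, and revokes the previous one whenever a new one is generated. The class `VirtualIdRegistry` and its methods are hypothetical, the sample ID number is made up, and the OTP step and 24-hour limit are left out for brevity; UIDAI's real implementation is not described in this article.

```python
import secrets


class VirtualIdRegistry:
    """Hypothetical issuer-side mapping between Virtual IDs and ID numbers."""

    def __init__(self):
        self._id_by_vid = {}   # active Virtual ID -> underlying ID number
        self._vid_by_id = {}   # underlying ID number -> active Virtual ID

    def generate(self, id_number: str) -> str:
        """Issue a fresh random 16-digit Virtual ID and revoke the previous one."""
        old_vid = self._vid_by_id.pop(id_number, None)
        if old_vid is not None:
            del self._id_by_vid[old_vid]
        while True:
            vid = "".join(secrets.choice("0123456789") for _ in range(16))
            # Never reuse an active Virtual ID or the one just revoked.
            if vid not in self._id_by_vid and vid != old_vid:
                break
        self._id_by_vid[vid] = id_number
        self._vid_by_id[id_number] = vid
        return vid

    def resolve(self, vid: str):
        """Return the ID number, or None if the Virtual ID is unknown or revoked."""
        return self._id_by_vid.get(vid)


registry = VirtualIdRegistry()
first_vid = registry.generate("123412341234")    # made-up ID number
second_vid = registry.generate("123412341234")
```

After the second call, `first_vid` no longer resolves, which is why a service provider that stored it cannot keep using it as a long-lived key for correlation.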
As a complement to the Virtual ID, UIDAI also introduced back-end tokenization to address the storage of Aadhaar numbers in service provider databases. Now, when a user gives their Aadhaar number or Virtual ID to a service provider for authentication, the system uses a cryptographic hash function to generate a 72-character alphanumeric token, specific to that service provider and Aadhaar number, which can be stored in the service provider database. Because different agencies receive different tokens for the same person, this prevents the linkability of information across databases based on the Aadhaar number. Only UIDAI and the Aadhaar system know the mapping between the Aadhaar number and the tokens provided to the service providers.
Subsequently, when the user authenticates with the service provider, the ID system again computes the token using the same hash function, with the Aadhaar number, the service provider code, and the secret message as inputs, and generates the same UID token. The UID token is always the same for a given combination of Aadhaar number and service provider code. The combination of the Virtual ID and UID token increases the level of privacy and security, as shown in the figure below:
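A deterministic, provider-specific token of this kind can be sketched with a keyed hash. The Python below is only an illustration under stated assumptions: the function name `uid_token`, the choice of HMAC-SHA256, the `|` separator, and all sample values are hypothetical, and the output is 64 hexadecimal characters rather than the 72-character format mentioned above.

```python
import hashlib
import hmac


def uid_token(id_number: str, provider_code: str, secret: bytes) -> str:
    """Deterministic token for one ID number and one service provider."""
    # The ID number and provider code form the message; the secret is the key.
    message = f"{id_number}|{provider_code}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


secret = b"demo-secret-held-by-the-id-system"     # made-up value
token_bank = uid_token("123412341234", "BANK01", secret)
token_telco = uid_token("123412341234", "TELCO7", secret)
```

Recomputing the token for the same person and provider yields the same value, so nothing needs to be looked up in a table, while two providers comparing their stored tokens find no match for the same person.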
Certain service providers (“global AUAs”) are allowed to store and use Aadhaar numbers and use the full eKYC API, which returns both the Aadhaar number and the token, along with the KYC data. Other service providers (“local AUAs”) can only use the limited eKYC API using the token, and do not receive the Aadhaar number. This will limit the linkability of personal information across databases, as shown in the figure below.
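The difference between the two kinds of provider can be pictured as a simple filter on what the API returns. In this hypothetical Python sketch, the function `ekyc_response`, its field names, and the sample values are invented for illustration, and it assumes both provider types receive the same KYC data, which the real limited eKYC API may not do.

```python
def ekyc_response(aua_type: str, id_number: str, token: str, kyc_data: dict) -> dict:
    """Build an eKYC-style response; only 'global' providers get the ID number."""
    if aua_type not in ("global", "local"):
        raise ValueError(f"unknown AUA type: {aua_type!r}")
    # Every provider receives its own token and the KYC data.
    response = {"token": token, "kyc": dict(kyc_data)}
    if aua_type == "global":
        # Global AUAs are additionally allowed to see and store the ID number.
        response["id_number"] = id_number
    return response


kyc = {"name": "Example Person"}                  # made-up data
global_view = ekyc_response("global", "123412341234", "tok-abc", kyc)
local_view = ekyc_response("local", "123412341234", "tok-abc", kyc)
```

A local provider's database therefore never contains the ID number itself, only a token that means nothing to any other provider.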