"TL;DR: No more an excuse!": On making privacy policies easier to read and interpret
Introduction
Many websites collect, share, and use users' Personally Identifiable Information (PII), and regulatory bodies around the globe have long required that privacy policies be posted online. Over the past few decades, however, the collection, sharing, and use of PII have grown into a major privacy concern on the internet, so much so that newer laws have gone into effect to protect user privacy. Prominent examples are the General Data Protection Regulation (GDPR) in the European Union and the California Consumer Privacy Act (CCPA) in the United States. Research has shown, time and again, that privacy policies are too long and too hard for their intended audience to comprehend, so users rarely take the time and effort to read them, and the policies are only getting longer and less readable over time. To address this poor readability, we have developed tools that leverage machine learning (ML) to automatically summarize privacy policies, along with a question answering system. We make these tools publicly available.
Goal
Methodology
Figure: Homepage of the app
Jurassic-1 Chatbot
The Jurassic-1 model is accessible through an API, and we use it for our Question Answering module. We also release our initial RoBERTa-based QnA model on the Hugging Face Hub, along with training scripts on GitHub.
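As a sketch of how such a QnA module could be queried, the snippet below uses the Hugging Face `pipeline` API with a generic public extractive QA checkpoint (an assumption for illustration, not our released `priv_qna` model), plus a small chunking helper so long policies fit the model's context window:

```python
def chunk_text(text, max_words=384, stride=128):
    """Split a long policy into overlapping word windows so every
    passage fits within the QA model's context."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += max_words - stride  # overlap avoids cutting answers in half
    return chunks

def answer_question(question, policy_text):
    """Extractive QA over a policy; the checkpoint is a generic
    public RoBERTa QA model used only as a stand-in."""
    from transformers import pipeline  # lazy import: heavy dependency
    qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
    # Answer against each chunk and keep the highest-scoring span.
    results = [qa(question=question, context=c) for c in chunk_text(policy_text)]
    return max(results, key=lambda r: r["score"])["answer"]
```

The overlapping stride ensures that an answer span straddling a chunk boundary still appears whole in at least one window.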
FTC Paragraph extractor
We fine-tuned BERT with a classification layer on top on the OPP-115 dataset. We then designed a module that, given a privacy policy P, splits it into paragraphs/segments p1, p2, ..., pn and, for each segment, predicts the FTC guideline it corresponds to. The user interface provides a textbox for the policy text and a multi-select dropdown where the user picks the guidelines for which matching text should be extracted.
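The extraction logic can be sketched independently of the model. Below, `predict` stands in for the fine-tuned BERT classifier; the keyword rules are a toy placeholder, and the labels follow the OPP-115 annotation scheme:

```python
def extract_paragraphs(policy_text, predict, selected):
    """Split a policy into paragraphs, classify each one, and keep
    the paragraphs whose predicted guideline the user selected."""
    results = []
    for paragraph in policy_text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        label = predict(paragraph)
        if label in selected:
            results.append((paragraph, label))
    return results

def keyword_predict(paragraph):
    """Toy keyword stand-in for the fine-tuned BERT classifier."""
    text = paragraph.lower()
    if "third part" in text or "share" in text:
        return "Third Party Sharing/Collection"
    if "collect" in text:
        return "First Party Collection/Use"
    return "Other"
```

In the deployed app, `predict` would call the BERT classifier on each segment instead of matching keywords.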
Summarizer
The summarization module is a T5 language model fine-tuned on the summarization dataset we curated. T5 showed strong summarization capabilities relative to its size in the paper that introduced it, which made it a good fit given the limited compute available at inference time once the app is deployed. We release the fine-tuned T5 model and training scripts as part of the project.
The summarization module has a textbox where the user enters the privacy policy along with the minimum and maximum number of words allowed in the summary, and it returns the summarized policy as output. It is worth noting that we perform abstractive summarization with T5 rather than the more common extractive methods.
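A minimal sketch of the summarization call, assuming the public `t5-small` checkpoint in place of our fine-tuned model; the `clamp_lengths` helper sanitises the user-supplied word bounds before generation:

```python
def clamp_lengths(min_words, max_words, hard_cap=512):
    """Sanitise user-supplied summary length bounds."""
    min_words = max(1, int(min_words))
    max_words = min(hard_cap, max(min_words + 1, int(max_words)))
    return min_words, max_words

def summarize(policy_text, min_words=30, max_words=120):
    """Abstractive summarization with a T5 checkpoint ('t5-small'
    is a public stand-in for the project's fine-tuned model)."""
    from transformers import T5ForConditionalGeneration, T5Tokenizer  # lazy import
    min_len, max_len = clamp_lengths(min_words, max_words)
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    # T5 uses a task prefix; generation bounds are in tokens,
    # a rough proxy for the user's word limits.
    inputs = tokenizer("summarize: " + policy_text, return_tensors="pt",
                       truncation=True, max_length=512)
    ids = model.generate(inputs.input_ids, min_length=min_len, max_length=max_len)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```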
Explainable AI
Recently there has been growing interest in identifying which features of the input led a deep learning model to its prediction. Several papers have shown how inspecting a model's gradients, attention, and related signals can reveal the parts of the input that carried the most weight in the output. The Python library Transformers Interpret implements these techniques for transformer-based models, and we use it in our Explainable AI module to present attributions in a clean, readable interface.
The use case we found for this module is that users can learn from the model: they can see why it focused on particular cues and then apply those cues themselves when reading policies later. In this way, the module can train a non-technical person to read privacy policies quickly and effectively.
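A sketch of how the module calls Transformers Interpret; the checkpoint named below is a public sentiment model used for illustration, not our FTC classifier:

```python
def explain_prediction(text,
                       model_name="distilbert-base-uncased-finetuned-sst-2-english"):
    """Return (token, attribution) pairs showing which input words
    most influenced the classifier's prediction."""
    # Lazy imports: both are heavy optional dependencies.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    from transformers_interpret import SequenceClassificationExplainer
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    explainer = SequenceClassificationExplainer(model, tokenizer)
    # Returns a list of (token, score) tuples; positive scores pushed
    # the model toward its predicted class.
    return explainer(text)
```

The returned attributions are what the module renders as a highlighted, readable overlay of the input text.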
Telegram bot
We also use the trained models as the backend for a Telegram bot for easy access and usage. The bot is written in Python: it listens for specific user inputs and executes the corresponding functions. The main bot is hosted on Heroku as a worker and has API access to both the Jurassic model and the Telegram servers. It communicates with both asynchronously for responsiveness, acting as a communication channel between the model and the user.
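The bot's "listens for specific inputs" step boils down to routing each message to a handler. A minimal sketch of that dispatch logic (command and handler names here are hypothetical, not the bot's actual commands):

```python
def parse_command(message):
    """Split '/summarize <text>' into ('summarize', '<text>');
    plain messages are routed to the QnA handler as ('ask', message)."""
    message = message.strip()
    if message.startswith("/"):
        parts = message.split(maxsplit=1)
        return parts[0].lstrip("/").lower(), parts[1] if len(parts) > 1 else ""
    return "ask", message
```

In the real bot this pair would select the backend call (summarizer, classifier, or QnA) whose result is sent back through the Telegram API.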
Analysis
We apply our models to a range of privacy policies and perform exploratory data analysis.
Longitudinal Analysis
First, we run our FTC guideline classifier on policy documents spanning the years 2000 to 2019. We clean the text, segment the HTML, and classify each segment of a document into one of the FTC guidelines from the OPP-115 dataset. We then plot how the percentage distribution of these guidelines changes over time, as shown in the figure.
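The cleaning and aggregation steps can be sketched as follows; the regex-based tag stripping is a simplification of the actual HTML segmentation:

```python
import re
from collections import Counter

def segment_html(html):
    """Strip tags and split a policy page into non-empty segments."""
    text = re.sub(r"<[^>]+>", "\n", html)
    return [s.strip() for s in text.split("\n") if s.strip()]

def guideline_distribution(labels):
    """Percentage of segments falling under each predicted guideline,
    ready to be plotted per year."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {g: 100.0 * c / total for g, c in counts.items()}
```

Running this per year and stacking the resulting percentages gives the longitudinal plot described above.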
We observe a spike in 2018, with an increase in text related to the guidelines. We believe this is due to the GDPR, which became enforceable that year. The General Data Protection Regulation is an EU law on data protection and privacy in the European Union and the European Economic Area; it also governs the transfer of personal data outside the EU and EEA. The GDPR was adopted on 14 April 2016 and became enforceable on 25 May 2018.
Geographical variation based on GDPR
We noticed that some companies maintain different versions of their websites for different regions, so we analyzed whether their privacy policies differ across regions. We collected the privacy policies of 500 major websites drawn from the Forbes list, in two variants: one served to GDPR-compliant regions and one served elsewhere. We then passed each pair through a sentence transformer to generate sentence embeddings for comparison.
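To compare each regional pair of embeddings, cosine similarity is the natural measure (pairs scoring near 1.0 are exact or near matches). A minimal pure-Python version:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

In practice the embeddings would come from a sentence-transformers model's `encode(...)`, and `sentence_transformers.util.cos_sim` computes the same quantity in bulk on tensors.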
We observe that although a large number of documents are exact matches, some documents do differ, suggesting that these websites follow different policies in different locations depending on the strictness of the local laws. We believe a unified, strict privacy law like the GDPR is needed worldwide; our analysis suggests such enactment would be beneficial.
How do the Privacy Policies of Forbes 500 Companies differ from the top Indian Startups?
We collected the privacy policies of the Forbes Fortune 500 companies and of the top 300 Indian startups. After cleaning, we classify the text into one of the FTC guidelines from the OPP-115 dataset and plot the distribution of percentages for each guideline.
We observe that most guidelines follow a similar distribution, except for International and Specific Audiences, which appears more often in Forbes 500 policies, which makes sense. But why are the Indian startups not addressing international audiences? This question needs further analysis, which we leave for future work. We also plot word count distributions for the Forbes and Indian startup documents.
We observe that the two sets mostly follow a similar distribution, but the Forbes 500 documents are generally lengthier. Next, we plot the Dale-Chall readability index distributions for both categories. The higher the index, the harder the text is to read, and the more age and education it demands of the reader. We observe that the Indian policies are generally easier to read, owing to the simpler English used, but this may come at the cost of reduced information coverage, which needs further study.
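The (new) Dale-Chall formula combines the percentage of "difficult" words (those outside a roughly 3,000-word familiar-word list) with average sentence length. A sketch with a tiny illustrative easy-word set (the real list is far larger; the `textstat` package implements the full metric):

```python
import re

# Tiny illustrative subset; the real Dale-Chall list has ~3,000 familiar words.
EASY_WORDS = {"we", "use", "your", "data", "and", "you", "can", "the", "to"}

def dale_chall(text, easy_words=EASY_WORDS):
    """New Dale-Chall readability score: higher means harder to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        return 0.0
    pdw = 100.0 * sum(1 for w in words if w not in easy_words) / len(words)
    asl = len(words) / len(sentences)
    score = 0.1579 * pdw + 0.0496 * asl
    if pdw > 5.0:
        score += 3.6365  # adjustment when difficult words exceed 5%
    return score
```

The 5% adjustment is why jargon-heavy policies jump sharply in score even when their sentences stay short.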
Survey
We carried out user evaluation through a live demo in a controlled environment, surveying 25 users from varying demographics. Most users preferred the QnA model over the others, mostly citing ease of use.
Presentation and Team
Team Members:
- T H Arjun, CSD, UG3, 2019111012
- Arvindh A, CSD, UG3, 2019111010
- Jaywant Patel, ECE, UG3, 2019102001
- Aryan Kharbanda, CSE, UG3, 2019101018
- Ansh Khandelwal, ECE, UG3, 2019102008
- Aditya Kadam, CSD, UG3, 2020121009
Important Links:
Dataset: online_privacy_qna
FTC Classifier: priv_ftc
Privacy Policy Question Answering Model: priv_qna
Privacy Policy Summarization Model: priv_sum
App Repo: policy-park