GitHub and LinkedIn User Profiling
Introduction
GitHub and LinkedIn are widely used by working professionals. However, the activity these platforms expose raises privacy concerns, because it allows us to infer more than the user intended to reveal. Such inferences could be used by companies to make hiring and firing decisions.
From a person's LinkedIn and GitHub activity, we were able to infer whether they want to switch jobs, are actively looking for a job, or have received a work opportunity.
We can also infer Twitter usernames for a large percentage of GitHub users, even though they have not explicitly linked their Twitter handles. This connects the professional and private lives of people and has serious privacy implications.
LinkedIn Activity Analysis
For the LinkedIn activity analysis, we first obtained 33 random LinkedIn profiles. We created a new LinkedIn account for scraping and used Selenium to collect the activity data from each profile. The activity data contained the type of activity (like, love, insightful, curious, celebrate, comment, post) and the time at which the activity occurred. The image below shows an example of an activity on a person's LinkedIn activity page.
We then analyzed these profiles to see whether a change or rise in activity correlated positively with a significant change in the profile. We found that 66% of the profiles showed a positive correlation between a rise in activity and a significant change in profile.
Of these profiles, 46% had a job change, 25% got an internship, 18% had a change in role, and 11% had a change in credentials. This is represented in the pie chart below.
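One way a rise in activity could be flagged and lined up with a profile change is sketched below. This is a minimal illustration, not our exact procedure: the function names, the monthly bucketing, the `factor` threshold and the `window` size are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def monthly_counts(timestamps):
    """Count activities per (year, month) bucket."""
    return Counter((t.year, t.month) for t in timestamps)


def activity_spikes(timestamps, factor=2.0):
    """Return the months whose activity count is at least `factor` times
    the median monthly count.

    `factor` is a hypothetical threshold. Months with no activity at all
    are not part of the median, since they never appear as a bucket.
    """
    counts = monthly_counts(timestamps)
    if not counts:
        return []
    baseline = median(counts.values())
    return sorted(month for month, c in counts.items() if c >= factor * baseline)


def spike_near_change(spikes, change_month, window=1):
    """True if any spike month lies within `window` months of the
    (year, month) in which the profile changed."""
    target = change_month[0] * 12 + change_month[1]
    return any(abs(y * 12 + m - target) <= window for y, m in spikes)


# Example: one activity in January, February and April, six in March.
example = [datetime(2021, 1, 5), datetime(2021, 2, 9), datetime(2021, 4, 2)]
example += [datetime(2021, 3, day) for day in range(1, 7)]
spikes = activity_spikes(example)
```

With this example data, March 2021 is the only month flagged, and a job change recorded in April 2021 would count as coinciding with the spike.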
Interviews
To further understand how people use LinkedIn, we interviewed 10 participants. 60% of the participants were from prestigious colleges in India, 30% were from other Indian colleges and 20% were from colleges abroad. 60% of the participants were internship holders and 40% were recent graduates.
We asked them a series of questions and follow-up questions about their reasons for using LinkedIn.
One observation from the interviews was that students from prestigious colleges in India did not use LinkedIn for job searches, since they could rely on their connections. However, students from other colleges in India and abroad relied on LinkedIn more heavily for internship and job searches.
All of the participants said that they were not frequent users of LinkedIn. 60% of the participants said they use LinkedIn once or twice a month and 40% said that they open the website once a week. All of the participants said they spend less than 10 minutes on LinkedIn in most sessions. However, participants who relied on LinkedIn for jobs and internships became more active during job and internship season.
When asked how they would feel if their LinkedIn activity could be used to predict a change in job, they said that they would be shocked and would see it as a violation of privacy.
The interviews therefore agree with our activity analysis, which showed very little activity most of the time and a peak in activity when a person was looking for a job.
GitHub - Twitter Identity Resolution
We use attribute matching to search for the Twitter account that belongs to a given GitHub user and link the two. GitHub allows users to explicitly link a Twitter account to their GitHub profile, but only a small percentage of profiles do so. We collected 100,000 GitHub profiles at random. Each GitHub account has its own integer id, so to make the sample random we drew 100,000 random integers and fetched the details of the corresponding profiles through the GitHub API. Only about 1,000 of the 100,000 profiles had a Twitter account linked to them. We used these Twitter accounts as ground truth to fine-tune our technique and see how well it performs.
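The sampling step can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual scraper: it assumes GitHub's REST endpoint for looking up a user by numeric id (`/user/{id}`) and the `twitter_username` field of the returned profile, and `max_id`, `token` and the function names are hypothetical.

```python
import json
import random
import urllib.error
import urllib.request

# Assumed endpoint: GitHub REST API lookup of a user by numeric account id.
API_URL = "https://api.github.com/user/{id}"


def sample_ids(n, max_id, seed=None):
    """Draw n distinct account ids uniformly from 1..max_id.

    `max_id` is a hypothetical upper bound on existing account ids.
    """
    rng = random.Random(seed)
    return rng.sample(range(1, max_id + 1), n)


def fetch_profile(user_id, token=None):
    """Fetch one profile as a dict, or None if the request fails
    (for example because the id belongs to a deleted account)."""
    request = urllib.request.Request(
        API_URL.format(id=user_id),
        headers={"Accept": "application/vnd.github+json"},
    )
    if token:  # unauthenticated requests are heavily rate limited
        request.add_header("Authorization", f"Bearer {token}")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return json.load(response)
    except urllib.error.URLError:
        return None


def with_twitter(profiles):
    """Keep only the profiles that explicitly link a Twitter handle."""
    return [p for p in profiles if p and p.get("twitter_username")]
```

The profiles returned by `with_twitter` would play the role of the ground truth: the linked handle is hidden from the matcher and only used to check its output.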
Both GitHub and Twitter expose a variety of profile data. To measure how similar a GitHub profile and a Twitter profile are, we chose a common set of attributes. Because indexing every profile on Twitter is not feasible, we instead selected a set of candidate accounts for each GitHub user: using Twitter's search API, we retrieved candidate profiles with a combination of name, username, and location. Once all of the candidate profiles are gathered, the next step is to rank them according to their attribute similarities.
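As a rough sketch of the candidate retrieval step, the search strings could be built from whichever attributes a GitHub profile provides. The function name and the particular combinations are assumptions for illustration, and the call to the Twitter search API itself is deliberately left out.

```python
def candidate_queries(name=None, username=None, location=None):
    """Build search strings from the available attributes, ordered from
    the most specific to the least specific, without duplicates."""
    raw = []
    if name and location:
        raw.append(f"{name} {location}")
    if name:
        raw.append(name)
    if username:
        raw.append(username)

    queries, seen = [], set()
    for query in raw:
        query = query.strip()
        if query and query not in seen:
            seen.add(query)
            queries.append(query)
    return queries


# Example with a made-up profile.
queries = candidate_queries("Ada Lovelace", "adalovelace", "London")
```

Each query would be sent to the search API, and the union of the returned accounts forms the candidate set for that GitHub user.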
The table above lists the GitHub and Twitter attributes that can be compared.
To compare face similarity, we used the deepface package, which calculates the cosine similarity of the VGG-Face embeddings. We used Levenshtein similarity for both the name and the username. Because different profiles had different numbers of attributes, it was important to normalise by the number of attributes available. The similarity of each attribute is computed, and the weighted sum of these similarities is used to rank the candidate profiles.
This method of resolving profiles presented a number of difficulties. We reviewed several examples manually to see what went wrong when our technique failed.
A majority of the profiles were missing information, and information present on a GitHub profile is not always visible on the corresponding Twitter profile. Face matching also failed when no face could be detected in the profile picture. In the majority of cases, the resolution failed because too few attributes were available.
The similarity function used for the common attributes.
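The scoring described above can be sketched as follows. This is a minimal illustration, not our tuned implementation: the weights are hypothetical, the face similarity is assumed to be computed separately and passed in as `face_score`, and normalising by the total weight of the attributes that are present is one possible reading of the normalisation step.

```python
def levenshtein(a, b):
    """Edit distance between two strings (two-row dynamic programme)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution
            ))
        previous = current
    return previous[-1]


def string_similarity(a, b):
    """Levenshtein similarity in [0, 1], ignoring case."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical weights, not the values used in our experiments.
WEIGHTS = {"name": 0.4, "username": 0.4, "face": 0.2}


def profile_score(similarities, weights=WEIGHTS):
    """Weighted sum over the attributes that are present (not None),
    divided by the total weight of those attributes."""
    present = {k: v for k, v in similarities.items()
               if v is not None and k in weights}
    total_weight = sum(weights[k] for k in present)
    if total_weight == 0:
        return 0.0
    return sum(weights[k] * v for k, v in present.items()) / total_weight


def compare(github, twitter, face_score=None):
    """Per-attribute similarities; None marks a missing attribute."""
    similarities = {"face": face_score}
    for attribute in ("name", "username"):
        g, t = github.get(attribute), twitter.get(attribute)
        similarities[attribute] = string_similarity(g, t) if g and t else None
    return similarities


def rank_candidates(github, candidates):
    """Return (score, username) pairs, best match first."""
    scored = [
        (profile_score(compare(github, c, c.get("face_score"))), c["username"])
        for c in candidates
    ]
    return sorted(scored, reverse=True)


# Example with made-up profiles.
github_user = {"name": "Ada Lovelace", "username": "adalovelace"}
twitter_candidates = [
    {"name": "A. Lovell", "username": "alovell"},
    {"name": "Ada Lovelace", "username": "ada_lovelace"},
]
ranking = rank_candidates(github_user, twitter_candidates)
```

Dividing by the weight of the attributes that are present keeps a sparse profile from being penalised simply for having fewer fields, but it also means a candidate can score highly on a single matching attribute, which is consistent with the failure cases we saw for profiles with little information.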
From the GitHub data we initially scraped, we filtered the accounts that had a Twitter account connected and used part of this data as ground truth to test the effectiveness of our algorithm. We were able to link accounts with an accuracy of 35%.
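A minimal sketch of how such an accuracy figure can be computed is shown below. It assumes that accuracy means the share of ground-truth GitHub accounts whose top-ranked candidate is the explicitly linked Twitter handle; the function and variable names are hypothetical, and the example data is made up.

```python
def top1_accuracy(predictions, ground_truth):
    """Share of ground-truth accounts whose predicted handle matches the
    linked one. Handles are compared case-insensitively, and a missing
    prediction counts as a miss."""
    if not ground_truth:
        return 0.0
    hits = sum(
        1
        for user, handle in ground_truth.items()
        if (predictions.get(user) or "").lower() == handle.lower()
    )
    return hits / len(ground_truth)


# Made-up example: one of three accounts is resolved correctly.
truth = {"gh_user_1": "Handle_One", "gh_user_2": "handle_two", "gh_user_3": "handle_three"}
predicted = {"gh_user_1": "handle_one", "gh_user_2": "someone_else", "gh_user_3": None}
accuracy = top1_accuracy(predicted, truth)
```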
Conclusion
Video Poster Presentation
Team Members
- Joseph Cherukara, UG2K18, CSD
- George Tom, UG2K18, CSD
- Arpan Dasgupta, UG2K18, CSD
- Ishan Upadhyay, UG2K18, CLD
- Vipul Chhabara, UG2K18, CSD
- Rishav Kundu, UG2K18, CSD
- Yash Bhansali, UG2K18, CSE