How LinkedIn Uses Hadoop to Leverage Big Data Analytics

With more than 400 million profiles (122 million in the US and 33 million in India) across 200+ countries, more than 100 million unique monthly visitors, 3 million company pages, 2 new members joining the network every second, 5.7 billion professional searches in 2012, 7,600 full-time employees, $780 million in revenue as of October 2015 and earnings of 78 cents per share (phew!), LinkedIn is the largest social network for professionals. People prefer to share their expertise and connect with like-minded professionals on a platform like LinkedIn because it lets them present themselves formally, yet in a less traditional manner. Its members could be skilled professionals searching for a job or head-hunters looking for top talent.

The Big Data Ecosystem at LinkedIn

Wondering how LinkedIn keeps up with your job preferences, your connection suggestions and the stories you prefer to read? Big data analytics is the success mantra that lets LinkedIn predict what kind of information you need to know and when you need it. At LinkedIn, big data is more about business than data. Here’s a case study exploring how LinkedIn uses its data goldmine to be a game changer in the professional networking space.

“Our ultimate dream is to develop the world’s first economic graph,” said Jeff Weiner, describing a sort of digital map of skills, workers and jobs across the global economy. Ambitions, in other words, that are a far cry from the industry’s early stabs at modernising the old-fashioned jobs board.

LinkedIn is a huge social network platform, not just in terms of revenue or members but also in terms of its multiple data products. LinkedIn processes thousands of events every day and tracks every activity by its users. Big data plays a vital role for the data engineers, data analysts, data scientists and business experts who seek an in-depth understanding of the interactions happening in the social graph. Data scientists and analysts use big data to derive performance metrics and business insights that lead to profitable decisions in marketing, sales and other functional areas.

LinkedIn uses data for its recommendation engine to build various data products. The data from user profiles and various network activities is used to build a comprehensive picture of a member and her connections. LinkedIn knows whom you should connect with, where you should apply for a job and how your skills stack up against your peers as you look for your dream job.

LinkedIn Hadoop and Big Data Analytics

Several technical accomplishments and contributions pepper LinkedIn’s 13-year journey as a pioneer in the professional networking space. Apache Hadoop forms an integral part of the technical environment at LinkedIn and powers some of the commonly used features on the mobile app and desktop site. As of May 6, 2013, LinkedIn had a team of 407 Hadoop-skilled employees. The biggest professional network consumes tons of data from multiple sources for analysis in its Hadoop-based data warehouses. Funnelling data into Hadoop systems is not as easy as it appears, because the data has to be transferred from many locations into one large centralized system. The batch processing and analytics workload at LinkedIn is primarily handled by Hadoop. LinkedIn uses Hadoop to develop predictive analytics applications like “Skill Endorsements” and “People You May Know”, for ad-hoc analysis by data scientists and for the descriptive statistics behind internal dashboards.

Let’s take a look at the big data ecosystem at LinkedIn -

  • Hadoop
  • Pig
  • Hive
  • Azkaban (Workflow)
  • Avro Data
  • Zookeeper
  • Data In - Apache Kafka
  • Data Out - Apache Kafka and Voldemort

Here’s a quick look at LinkedIn big data technologies, that are powered by Apache Hadoop –

  • Voldemort - A distributed NoSQL key-value storage system used at LinkedIn for various critical services that fuel a large portion of the website. 70% of all Hadoop data deployments at LinkedIn employ key-value access using Voldemort.
  • Decomposer - Contains large matrix decomposition algorithms implemented in Java.
  • White Elephant - Parses Hadoop logs and provides visualization dashboards that include the number of slots used, count of failed jobs, total disk time and CPU time for different Hadoop clusters.
  • Giraph - Used for social graph computations and interpretations on Hadoop clusters.
  • Avatar - LinkedIn’s scalable and highly available OLAP system, used in the “Who’s Viewed My Profile” feature. It serves queries in real time.
  • Kafka - A publish-subscribe messaging system that unifies online and offline processing by providing a mechanism for parallel load into Hadoop. Kafka at LinkedIn is used for tracking hundreds of different event types, such as page views, profile views, network updates, impressions, logins and searches, amounting to over a billion records every day.
  • Azkaban - Open source workflow system for Hadoop that provides make-like dependency analysis and cron-like scheduling.
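
To make the "make-like dependency analysis" mentioned for Azkaban concrete, here is a minimal Python sketch of ordering jobs so that each one runs only after the jobs it depends on. It does not use Azkaban's own job format or API; the job names and the `execution_order` helper are hypothetical, and the ordering relies on the standard library's `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "extract_profiles": set(),
    "extract_connections": set(),
    "join_graph": {"extract_profiles", "extract_connections"},
    "score_pairs": {"join_graph"},
    "push_to_store": {"score_pairs"},
}

def execution_order(jobs):
    """Return one valid order in which every job follows its dependencies."""
    return list(TopologicalSorter(jobs).static_order())

order = execution_order(workflow)
print(order)
```

A real scheduler would also run independent jobs (here the two extract steps) in parallel and trigger the whole flow on a cron-like timer; the sketch only shows the dependency-resolution part.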

The output of Hadoop jobs is fetched into the live Voldemort servers through large-scale parallel transfers, which also warm up the cache. Voldemort then performs an atomic switchover to the next day’s data on each server. An index-building step in the Hadoop pipeline produces multiple terabytes of hash-based lookup structures. This division of work helps balance computing resources across the clusters and keeps responses fast. Hadoop is used to process huge batch workloads: it takes approximately 90 minutes to create a 900 GB data store on a Hadoop development cluster with 45 nodes. Hadoop clusters at LinkedIn go down for periodic maintenance and upgrades, but the Voldemort servers are always up and running.
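
The atomic switchover idea can be illustrated with a small Python sketch: the new day's data is built completely off to the side and becomes visible in a single reference swap, so readers see either the old snapshot or the new one, never a mixture. This is not Voldemort's implementation; the `ReadOnlyStore` class and its methods are hypothetical names used only for illustration.

```python
import threading

class ReadOnlyStore:
    """Toy in-memory store that serves reads from one immutable snapshot."""

    def __init__(self, initial):
        self._active = dict(initial)
        self._lock = threading.Lock()

    def get(self, key):
        # Reads always go to whichever snapshot is currently active.
        return self._active.get(key)

    def swap(self, new_data):
        """Stage the new snapshot fully, then switch to it in one step."""
        staged = dict(new_data)  # built before it becomes visible
        with self._lock:         # serialize concurrent swaps
            previous, self._active = self._active, staged
        return previous          # kept so a rollback is possible

store = ReadOnlyStore({"member:1": ["member:7"]})
old = store.swap({"member:1": ["member:7", "member:9"]})
print(store.get("member:1"))
```

Returning the previous snapshot mirrors the common practice of keeping the last data set around so that a bad load can be rolled back quickly.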

LinkedIn Big Data Products

LinkedIn is injecting big data analytics into various features on its platform by building novel data products:

1) People You May Know

If you are a LinkedIn user, you probably know about LinkedIn’s star feature, “People You May Know”. This feature suggests other LinkedIn users that a member would probably be interested in connecting with. “People You May Know” began as a huge Python script in 2006 and has driven immense growth on the platform since 2008.

Most of LinkedIn’s data is offline and moves fairly slowly, so LinkedIn’s data infrastructure uses Hadoop for batch processing. LinkedIn pre-computes the data for the “People You May Know” product by recording close to 120 billion relationships per day in a Hadoop MapReduce pipeline that runs 82 Hadoop jobs and requires 16 TB of intermediate data. The feature is implemented by a job that uses a statistical model to predict the probability of two people knowing each other. The data infrastructure uses Bloom filters to accelerate join operations while running these jobs, which yields 10 times better performance. There are 5 test algorithms running continually, producing approximately 700 GB of output data for the “People You May Know” feature.
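
As a rough illustration of how a Bloom filter can speed up a join, the Python sketch below builds a filter over the keys of the smaller data set and uses it to discard most non-matching records from the larger one before the exact lookup. This is a single-process toy, not LinkedIn's MapReduce code; the `BloomFilter` class, the `filtered_join` function and the sizing parameters are all illustrative assumptions.

```python
import hashlib

class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

def filtered_join(small, large):
    """Join (key, value) records in `large` against the dict `small`."""
    bloom = BloomFilter()
    for key in small:
        bloom.add(key)
    joined = []
    for key, value in large:
        # Cheap filter first; the exact check removes false positives.
        if bloom.might_contain(key) and key in small:
            joined.append((key, small[key], value))
    return joined

members = {1: "alice", 2: "bob"}
views = [(1, "viewed:5"), (3, "viewed:1"), (2, "viewed:9"), (4, "viewed:2")]
print(filtered_join(members, views))
```

In a distributed join the benefit comes from shipping the compact filter to every mapper, so that records with no possible match are dropped before they are shuffled across the network.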

So, the next time LinkedIn suggests someone you never expected to find on the network, from a completely different part of your online life, do not worry! LinkedIn tracks everything, from your browser settings and login details to the InMails you send and the profiles you view, to bring you a list of people you can connect with who match your preferences.

2) Skill Endorsements

Skill Endorsements is another interesting data product built by LinkedIn, which recruiters use to check the skills and expertise of a particular candidate. A member can endorse another member in their network for a skill, which is then shown on the endorsed person’s profile. Skill endorsement is a deep information-extraction data problem.

The workflow first determines the various skills that exist for a member, which requires synonym detection and resolving any ambiguities. The skills are then joined with the member’s profile, social graph, groups and any other member activity that helps in identifying the person’s skills. After the skills are resolved, endorsement recommendations are computed by measuring the affinity between two members and the propensity of a member to have a particular skill. The resulting skill recommendations are delivered through Voldemort as a key-value store that maps a member id to a list of other members, skill ids and scores. The front-end team at LinkedIn uses this output to display the recommendations in a user-friendly manner.
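
The final scoring step described above can be sketched in a few lines of Python. The article does not say how affinity and propensity are combined, so multiplying them is purely an assumption here, as are the function name, the input dictionaries and the sample values.

```python
def endorsement_recommendations(viewer_id, affinity, propensity, top_n=3):
    """Map a member id to a ranked list of (other member, skill id, score).

    affinity[viewer][other]  -> closeness of the two members (0..1)
    propensity[other][skill] -> likelihood that `other` has `skill` (0..1)
    Combining the two by multiplication is an illustrative assumption.
    """
    candidates = []
    for other_id, aff in affinity.get(viewer_id, {}).items():
        for skill_id, prop in propensity.get(other_id, {}).items():
            candidates.append((other_id, skill_id, aff * prop))
    candidates.sort(key=lambda c: c[2], reverse=True)
    # Shaped like the key-value record described in the text.
    return {viewer_id: candidates[:top_n]}

# Hypothetical sample data.
affinity = {"m1": {"m2": 0.9, "m3": 0.4}}
propensity = {"m2": {"hadoop": 0.8, "pig": 0.5}, "m3": {"hive": 0.9}}

recs = endorsement_recommendations("m1", affinity, propensity, top_n=2)
print(recs)
```

The returned dictionary has the member id as the key and the ranked list as the value, which is the shape that would be bulk-loaded into a key-value store for serving.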

 

3) Jobs You May Be Interested In

Searchable job titles, connections and skills are LinkedIn’s greatest assets, which employers can use when looking for top talent. LinkedIn is joining the dots for corporates by leveraging big data for intelligent workforce planning through the “Jobs You May Be Interested In” feature. 90% of Fortune 100 companies use LinkedIn to hire top talent and 89% of professionals use LinkedIn to land a job. According to LinkedIn, 50% of the website’s engagement comes from the “Jobs You May Be Interested In” feature. Machine learning plays a vital role in everything at LinkedIn, whether it is job recommendations, group recommendations, news story recommendations, personalization of the social feed or any personalized search.

 

LinkedIn uses various machine learning and text analysis algorithms to show relevant jobs on a LinkedIn member’s profile. Textual content such as skills, experience and industry is extracted from a member’s profile, and similar features are extracted from the job listings available on LinkedIn. A logistic regression model is then applied to the extracted features to rank the relevant jobs for a particular member.
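
A minimal Python sketch of ranking jobs with a logistic regression score follows. The feature names, weights and bias are invented for illustration; in practice they would be learned from historical data, and nothing here reflects LinkedIn's actual model.

```python
import math

# Hypothetical, hand-picked coefficients; a real model learns these.
WEIGHTS = {"skill_overlap": 2.0, "same_industry": 1.0, "experience_gap": -0.5}
BIAS = -1.0

def relevance(features):
    """Logistic regression: sigmoid of the weighted feature sum."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def rank_jobs(job_features):
    """Return job ids sorted from most to least relevant."""
    return sorted(job_features,
                  key=lambda job: relevance(job_features[job]),
                  reverse=True)

# Features derived from comparing one member profile with each job listing.
jobs = {
    "data_engineer": {"skill_overlap": 0.8, "same_industry": 1, "experience_gap": 1},
    "sales_manager": {"skill_overlap": 0.1, "same_industry": 0, "experience_gap": 3},
}
print(rank_jobs(jobs))
```

Because the sigmoid is monotonic, sorting by the probability gives the same order as sorting by the raw weighted sum; the probability is mainly useful when a cut-off threshold is needed.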

The machine learning algorithms that power the “Jobs You May Be Interested In” module do not merely consider a member’s city of residence and current field. Multiple activities are tracked before a job recommendation is made. For instance, the algorithm analyses members’ migration patterns: it has determined that an employee in San Francisco will be more interested in a job opportunity in New York than in one in Fresno. The algorithm also tracks how often a member changes jobs. If a member is promoted quickly, the algorithm recommends jobs that are a step up for them.

4) News Feed Updates

LinkedIn incorporates data analytics and intelligence to understand what kind of information you’d like to read, what subjects interest you most and what kind of updates you like, and to put together an aggregated real-time news feed for you. A LinkedIn member receives an update when any of their connections updates their profile. Showing deeper analytic insights, such as highlighting the company that most of a member’s connections now work at, requires multiple join computations on different data sources, which is time-consuming. As this is a batch, compute-intensive process that requires joining the company data of different member profiles, Hadoop is used for rapid prototyping and testing of new updates.
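
The "company that most connections now work at" insight boils down to joining a connection list with profile data and counting. The in-memory Python sketch below shows that join on toy data; the function name and sample records are hypothetical, and at LinkedIn's scale the same logic would run as batch jobs rather than in a single process.

```python
from collections import Counter

def top_company_among_connections(member_id, connections, current_company):
    """Join a member's connections with their current company and
    return the most common (company, count) pair, or None."""
    companies = [
        current_company[conn]
        for conn in connections.get(member_id, [])
        if conn in current_company  # skip connections without company data
    ]
    if not companies:
        return None
    return Counter(companies).most_common(1)[0]

# Hypothetical sample data from two different sources.
connections = {"m1": ["m2", "m3", "m4", "m5"]}
current_company = {"m2": "Acme", "m3": "Acme", "m4": "Globex"}

print(top_company_among_connections("m1", connections, current_company))
```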

With its data-driven strategy, LinkedIn continues to grow rapidly in revenue and member base through innovative data products. Let us know in the comments if we have missed any other important LinkedIn data product that leverages analytics.
