At Keen IO, we believe that companies that learn to use event data will have a competitive advantage, and the world's leading technology companies bear this out. We are still amazed by what the data teams at Facebook, Amazon, Airbnb, Pinterest, and Netflix do. Their work sets new standards for what software and businesses can learn from data.
Because their products have so many users, these teams must continually redefine what it means to do analytics at scale. They have invested millions of dollars in their data architectures, and their data teams are larger than the entire engineering departments of most companies.
We built Keen IO so that most software engineering teams can use the latest large-scale event data technology without having to build everything from scratch. However, if you are curious about how the giants do it, here is a collection of architectures from some of the best companies.
Netflix
With 93 million users, Netflix has no shortage of interactions to capture. As their engineering team describes in "Evolution of the Netflix Data Pipeline," they capture about 500 billion events per day, which amounts to about 1.3 PB of data per day. At peak hours they record 8 million events per second. They employ more than 100 data engineers and analysts.
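To put those figures in perspective, here is a rough back-of-the-envelope calculation. It uses only the numbers quoted above and assumes decimal petabytes (1 PB = 10^15 bytes); the function and variable names are illustrative, not anything from Netflix.

```python
# Back-of-the-envelope figures derived from the numbers quoted above.
EVENTS_PER_DAY = 500e9        # about 500 billion events per day
BYTES_PER_DAY = 1.3e15        # about 1.3 PB per day (decimal petabytes assumed)
PEAK_EVENTS_PER_SEC = 8e6     # 8 million events per second at peak
SECONDS_PER_DAY = 24 * 60 * 60


def daily_averages(events_per_day, bytes_per_day, peak_events_per_sec):
    """Return average events/second, average bytes/event, and peak-to-average ratio."""
    avg_events_per_sec = events_per_day / SECONDS_PER_DAY
    avg_bytes_per_event = bytes_per_day / events_per_day
    peak_to_avg = peak_events_per_sec / avg_events_per_sec
    return avg_events_per_sec, avg_bytes_per_event, peak_to_avg


avg_rate, avg_size, peak_ratio = daily_averages(
    EVENTS_PER_DAY, BYTES_PER_DAY, PEAK_EVENTS_PER_SEC
)
print(f"average rate: {avg_rate / 1e6:.1f} million events/s")
print(f"average size: {avg_size / 1e3:.1f} KB per event")
print(f"peak / average: {peak_ratio:.2f}x")
```

Under these assumptions the pipeline averages a little under 6 million events per second at roughly 2.6 KB per event, so the quoted peak is only about 1.4 times the daily average.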
The following is a simplified view of their data architecture from that article, showing Apache Kafka, the Elasticsearch search server, Amazon's S3 cloud storage service, Apache Spark for big data processing, the Apache Hadoop framework, and Amazon's EMR big data service as the main components.
Facebook
With more than 1 billion active users, Facebook has one of the largest data warehouses in the world, storing more than 300 PB. This data is used for a wide range of applications, from traditional batch processing to graph analytics, machine learning, and real-time interactive analytics.
To run interactive queries at scale, Facebook engineers built Presto, a custom distributed SQL query engine optimized for ad hoc analysis. More than 1,000 Facebook employees use Presto every day, running more than 30,000 queries daily across pluggable backend data stores such as Hive, HBase, and Scribe.
Airbnb
Airbnb supports more than 100 million users browsing about 2 million listings. Its ability to intelligently suggest new travel options to these users has a great impact on its growth.
Airbnb's data science manager, Elena Grewal, said at last year's "Building a World-Class Analytics Team" conference that they have scaled the Airbnb data team to more than 30 engineers. That is an investment of more than $5 million per year in headcount alone.
In the "Data Infrastructure at Airbnb" blog post, AirbnbEng architects James Mayfield, Krishna Puttaswamy, Swaroop Jagadish, and Kevin Long describe the basic elements of building a data architecture and how they provide higher reliability for mission-critical data. They rely heavily on Hive and Apache Spark, and have used Facebook's Presto.
Pinterest
Pinterest has more than 100 million users generating more than 10 billion page views per month. As of 2015, they had scaled their data team to more than 250 engineers. Their infrastructure relies on Apache Kafka, the Storm stream processing framework, Hadoop, the HBase database, and the Amazon Redshift data warehouse.
The Pinterest team not only needs to track a large amount of customer-related data. Like other social platforms, they also need to provide detailed analytics to advertisers. In "Behind the Pins: Building Analytics at Pinterest," Tongbo Huang describes how they revamped their analytics stack to meet this need. Here is how the Pinterest team uses Apache Kafka, AWS S3, and HBase:
Twitter / Crashlytics
In "Handling 5 Billion Sessions a Day in Real Time," Ed Solovey describes the architecture that the Crashlytics Answers team built to handle billions of daily mobile device events.
Keen IO data architecture
As I mentioned before, we built the Keen IO data API so that any developer can use a world-class data architecture without needing a huge team to build a lot of infrastructure. Thousands of engineering teams use Keen's API to capture, analyze, stream, and embed event data in both real-time and batch applications.
While developers using Keen do not need to know what happens behind the scenes when they send an event or run a query, the following is the architecture that processes their requests:
Keen IO information processing structure
On the ingestion side, load balancers handle billions of incoming POST requests. Events stream in from apps, websites, connected devices, servers, billing systems, and so on. Events are validated, queued, and optionally enriched with additional metadata, such as details derived from the IP address. This all happens within seconds.
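As a rough illustration of that ingestion path, the sketch below checks each incoming event, holds the accepted ones in a queue, and adds metadata before storage. It is not Keen's implementation: the event shape, the field names, and the IP prefix table standing in for a real geo lookup are all assumptions made for the example.

```python
import queue
import time

# Hypothetical lookup table standing in for a real IP-based enrichment service.
GEO_BY_IP_PREFIX = {"203.0.113.": "AU", "198.51.100.": "US"}

REQUIRED_FIELDS = ("collection", "properties")  # assumed event shape


def validate(event):
    """Accept only dict events that carry every required field."""
    return (
        isinstance(event, dict)
        and all(field in event for field in REQUIRED_FIELDS)
        and isinstance(event["properties"], dict)
    )


def enrich(event, now=None):
    """Return a copy of the event with server-side metadata added."""
    enriched = dict(event)
    enriched["received_at"] = time.time() if now is None else now
    ip = event["properties"].get("ip", "")
    for prefix, country in GEO_BY_IP_PREFIX.items():
        if ip.startswith(prefix):
            enriched["geo"] = {"country": country}
            break
    return enriched


def ingest(raw_events, pending):
    """Validate each event and queue the valid ones; return the rejected count."""
    rejected = 0
    for event in raw_events:
        if validate(event):
            pending.put(event)
        else:
            rejected += 1
    return rejected


def drain(pending, now=None):
    """Take queued events in arrival order and enrich them before storage."""
    stored = []
    while not pending.empty():
        stored.append(enrich(pending.get(), now=now))
    return stored


pending = queue.Queue()
rejected = ingest(
    [
        {"collection": "purchases", "properties": {"ip": "203.0.113.7", "amount": 12}},
        {"collection": "pageviews", "properties": {"path": "/pricing"}},
        {"properties": {"orphan": True}},  # missing "collection", so rejected
    ],
    pending,
)
stored = drain(pending, now=0.0)
print(rejected, [event.get("geo") for event in stored])
```

Separating validation from enrichment with a queue means a burst of incoming requests can be accepted quickly while the slower enrichment work catches up. A production system would use a durable, distributed queue rather than the in-process one shown here.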
Once safely stored in Apache Cassandra, event data can be queried through a REST API. Our architecture (using technologies such as Apache Storm, DynamoDB, Redis, and AWS Lambda) supports a range of query needs, from real-time exploration of the raw incoming data to cached queries for applications and customer-facing reports. Keen queries trillions of event properties per day and builds reports, automation, and data mining interfaces for thousands of customers.
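The difference between querying raw data and serving a cached query can be sketched in a few lines. This is a simplified, in-memory illustration under assumed names (`CachedCount`, `ttl_seconds`), not Keen's API or storage layer: a count is recomputed from the raw events only when no sufficiently recent result exists.

```python
import time


class CachedCount:
    """Count events in a collection, reusing a recent result when one exists."""

    def __init__(self, events, ttl_seconds=60.0, clock=time.monotonic):
        self._events = events      # list of dicts; stands in for the event store
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}           # collection -> (computed_at, count)
        self.scans = 0             # how many times the raw data was scanned

    def count(self, collection):
        now = self._clock()
        hit = self._cache.get(collection)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]          # fresh enough: serve the cached result
        self.scans += 1
        result = sum(1 for e in self._events if e["collection"] == collection)
        self._cache[collection] = (now, result)
        return result


events = [{"collection": "purchases"}, {"collection": "pageviews"}]
fake_now = [0.0]                   # controllable clock so the example is deterministic
query = CachedCount(events, ttl_seconds=60.0, clock=lambda: fake_now[0])

first = query.count("purchases")   # scans the raw events
events.append({"collection": "purchases"})
second = query.count("purchases")  # within the TTL: cached, new event not yet visible
fake_now[0] = 61.0
third = query.count("purchases")   # TTL expired: rescans and sees the new event
print(first, second, third, query.scans)
```

The trade-off is the usual one: cached results load instantly in a report but can lag behind the raw data by up to the cache lifetime, while real-time exploration pays for a scan on every request.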
Thank you to the collaborative data engineering community, which keeps inventing new data technologies, open sourcing them, and sharing what it learns. Our team could not exist without the foundational work of so many engineering teams, and without those we work with every day. Comments and feedback are welcome.
Special thanks to the authors and architects of the articles mentioned above: Steven Wu at Netflix, Martin Traverso at Facebook Presto, AirbnbEng, Pinterest Engineering, and Ed Solovey at Crashlytics Answers.
Thanks also to Terry Horner, Dan Kador, Manu Mahajan, and Ryan Spraetz for their help with editing.