Like many other companies in the social networking world, Zoosk inherits a vast amount of data every day from user interactions, web logs, financial transactions, as well as standard business metric data. Making sense of the data and turning it into actionable intelligence is of utmost importance to Zoosk, where we are constantly trying to optimize our product offerings and business processes. The question is: how do we most effectively leverage our data, and turn it into business intelligence?
There are a few typical approaches to answer this question.
Traditionally to gain business intelligence, one can leverage a star schema data warehouse with a multi-dimensional OLAP engine, to provide the business with a user-friendly toolset to quickly “slice and dice” data to identify trends and patterns. These toolsets can be something that users are familiar with, such as Microsoft Excel and web dashboards. However, traditional data warehouses are typically not as flexible with semi-structured data, especially when data volume gets very large.
More recently, one can leverage what is known as “Big Data”, using tools such as Hadoop and Hive, to cost effectively deal with large amounts of data in unprecedented variety, velocity and volume. Using MapReduce, Hadoop can effectively process large amounts of data cheaply by splitting the data into smaller nodes. It also provides the advantage of being able to store unstructured data. However, unlike traditional DW, analytics via Hadoop/Hive is still relatively primitive and currently lacks very user friendly tools for the business user, who may not want to write their own Hive-SQL queries.
So what do we do at Zoosk? Simple, do both!
The overall Zoosk Analytics Platform high level architecture is illustrated as follows:
1) We first start with collecting as much as data as we need. Our website writes user behavior data to a scribe log, while the web server collects vital web statistics in its web logs. The web logs are considered our “Big Data” portion of our source data.
2) In parallel, we also pull mission critical, structured, production data directly from our user cluster database. These database clusters include user profile data, financial information and other transaction based information. These are considered the “Small Data” portion of our source data.
3) “Big Data” such as scribe logs / web logs are staged in Hadoop, where we also have Hive on top of it to provide ad hoc analytical access to the underlying data. It also acts as a first stage data integration process to structure and aggregate our “Big Data” into more manageable information.
4) Database scripts and Extract, Transform, Load (ETL) programs are used to load both our “Big Data” and “Small Data” sets into our RDBMS staging tables.
5) By using industry proven methodologies in data warehouse / dimensional modeling, we conform all of our data with over 100 dimensions across 20+ fact tables, and store them in a centralized Enterprise Data Warehouse (EDW). The EDW can be used for ad hoc query analysis as well, but its main purpose is to populate our OLAP multidimensional data marts.
6) OLAP data marts have all the dimensional structures, KPIs, and metrics defined so users can simply browse our cube via Excel or web reports, easily navigate our user friendly BI universe, or create their own reports based on a single version of truth. They are aggregation aware, and can automatically direct users’ traffic on most used reports to our aggregation cache providing them with sub second query response time.
7) The Zoosk analytical platform enables our users to gain insights from the data via different complimentary vehicles. Power users can perform ad hoc analysis directly from Hive or SQL EDW. Further, power users can connect to the OLAP cube directly via toolsets such as MS Excel and perform data pivot analysis. Advance visualization techniques, such as charts, gauges, sparklines, in-cell data bars and conditional formatting, can be employed to provide executive dashboard reporting. Standard template reports are also available online and in mobile platform for 24/7 anywhere data availability.
Below is a few interesting statistics of our analytics platform:
8 OLAP cubes
20+ Fact tables
90 conformed dimensions in the main cube alone
150+ cube dimensions
450+ measures across the measure groups in the main cube alone
3.6 billion session logs processed a year
10 billion tracked interactions tracked a year
80 billion web traffic count tracked a year
130 billion rows of data ingest to Hadoop a year
40 Terabytes of data ingest to Hadoop a year
By combining traditional DW / BI architecture with newer technologies such as Hadoop / Hive, we are able to provide:
- Highly scalable infrastructure that easily handles unstructured data, as well as scale to large amounts of data typical to websites
- Structured, time proven, enterprise-wide information repository consolidating all incoming data and provide a single version of truth to all of our users.
- Conformed analytic platform which enable cross analysis to be performed across various different fact metrics via conformed dimensions.
- User friendly analytical platform which allow browsing and creation of custom reports without deep technical knowledge
- Performance optimized analytical platform where OLAP / Usage based aggregations help provide sub-second query response time for standard queries / reports

What kind of infrastructure is in play here especially in the staging and SSAS side? Did you prototype any of the emerging OLAP-on-Hadoop products like “SaasBase Analytics”?
Hi Ram. Thanks for the question. The hadoop world and DW/BI world is mostly on different infrastructure even though data is exchanged between the two. Our DW/OLAP servers are on SQL Server platform, hence they sit on Windows 2008. We did not prototype emerging analytics on hadoop products, mostly because we also have data coming from non-hadoop source which we staged in a SQL based DW now. Not all our data are “big data”, and this allows us the flexibility to deal with “small data”, such as traditional finance / user data in a standard DW/OLAP data flow. It also allows users to use analytics tool which they are familiar with now, such as Excel or Tableau, or even straight t-SQL. The landscape is changing all the time too, so I am sure we will revisit these emerging technologies in the short term future.
I love SQL Server and Analysis Services. I’ve spent a big part of my career working with both. But I really hate loading and managing big data in this environment. I’m guessing this is why Microsoft was quick to bring some of the same ideas via their PDW, e.g. Horton Works based Hadoop etc.
So far I haven’t identified any good substitutes for OLAP/Analysis Services in the Open Source world — have you?
SQL Server & SSAS really have awesome value propositions for a SMB BI solution, especially when business users love Excel (and most do). The new SSAS Tabular mode (columnar in-memory architecture) really makes using the cube fun again, by providing sub second response time even with billion rows of data. In the open source world, I haven’t identified a good substitute for OLAP per se (at least not one that work so well with Excel, Powerview, and other tools), but I do believe some of the newer columnar db (cassandra, impala, or even Redshift, etc) provide many analytical capabilities without being a true OLAP cube… (at least from a performance perspective, but may be not from a usability perspective…imo)
Thanks for the reply! For our Hadoop distribution I’ve gotten pretty far with Cloudera’s CDH-4 release. Your diagram reminds me of my own I guess simply because there are a limited number of choices here
I was trained at Kimball several years ago by Margy Ross on the ins and outs of Star Schemas. Of all the choices I could come up with it still seems to fit that the Data Warehouse stays on the Microsoft side but all the file loading stuff works out pretty well on the Hadoop side. Did you struggle with that choice as well?
I think the choice fell into place by itself, especially when we are also dealing with other non-hadoop data source, such as financial data from production. At the end of the day, IMO, technologies are enablers, and I believe it’s still true that business users tend to think in terms of “metrics BY”, such as sales BY product BY date. That natural business analytical thinking is still best support by dimensional modeling, and dimensional modeling is still best presented in OLAP type engine. Not the only way, but possibly easiest way for non-technologist.