A Federated Information Infrastructure that Works

Xavier Gumara Rigol
Apr 12, 2020 · 7 min read

This article is based on the presentation I gave at the Data Council Conference in Barcelona in October 2019, titled A Federated Information Infrastructure that Works. I went through the slides writing down what I said over them, edited the text, and added some of the most relevant slides and code snippets in between paragraphs.

Introduction

In this article we describe how we provided local and central teams of Data Analysts and Data Scientists at Adevinta with easy access to data.

Adevinta is a leading online marketplaces specialist operating in 16 countries. Our marketplaces help everyone and everything find new purpose: we help people find jobs; buy, sell or rent apartments; and buy and sell second-hand items, among other things. In Spain we own very well-known brands like Fotocasa, Habitaclia, Milanuncios, Coches.net and Infojobs.

We also have a global services department split between Barcelona and Paris, and this is the part of the organization where I work. I manage a team of Data Engineers working on building and governing curated datasets (Business Intelligence at scale) for analytics purposes.

Given this multi-tenancy set-up and the fact that both local and central teams need to make data-informed decisions, we identified the following problems to solve:

  • Provide easy access to key facts about our marketplaces (tenants)
  • Eliminate data-quality discussions, establish trust in the facts
  • Reduce the burden of manual data requests on each tenant
  • Minimize regional effort needed for global data collection
  • Provide a framework and infrastructure that can be extended locally

The journey of building the information architecture that could solve all the above problems started a couple of years ago, and today we can proudly say that the majority of these problems have been solved. During this journey we’ve identified three main challenges:

  • How to find the right level of authority
  • How to govern data sets
  • How to build common infrastructure as a platform

Let’s go into detail on how we solved each one of them.

How to find the right level of authority

If we classify companies using the authority dimension, we can distinguish between centralised and decentralised organizations. In a decentralised organization, all the authority is delegated to the different operations whereas in a centralised organization, the authority resides in a central body.

Over the last few years, Adevinta’s data strategy has swung between both ends of this spectrum. When I joined in 2013, the company (named Schibsted Classified Media back then) was a portfolio of completely decentralised operations. Each brand was executing fast thanks to its autonomy, but granular data sets for analytics, or for building data products at a global scale, were non-existent.

Each marketplace had its own storage system and some of them, depending on their level of maturity, had a data warehouse too. On the global side, the source of truth was a simple corporate KPI database with highly aggregated data that the different brands sent via an API.

A couple of years later the company’s strategy switched and we became a more centralised organization, and so did our data strategy. A central data platform was built with the goal of storing all of the company’s datasets for consumers to use. The problem with this approach was that the data was too raw, too dirty and often incomplete, which reduced its accessibility.

This made our data strategy pivot again, this time to a federated architecture in the middle of the authority spectrum. We kept some pieces of the monolithic data platform, such as its physical storage on AWS S3, and adopted new approaches to become a federation.

A Federated Information Architecture That Works

In this architecture, which is the one we have settled on, each operation keeps its autonomy and its local storage systems, even its data warehouses. Data generated by common event-tracking systems and shared components is cleansed globally and used to calculate metrics and to segment users. This work is done once by a central team (ours) and provides the ability to compare and benchmark all the different tenants or operations.

Operations can use these global data sets thanks to the downwards federation. Each regional data warehouse is a separate Redshift instance, and thanks to Redshift Spectrum we can very easily expose data from corporate data sources in the data lake, which lives in S3. Central teams can query the same data using a global Athena instance.

This downwards federation works well because data is not physically duplicated: Redshift Spectrum and Athena are just views on top of S3. This has enormously reduced data-quality discussions.
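
To give a rough idea of how this can be wired up (a generic sketch, not our exact setup: the host, schema, database and IAM role names below are made up), a regional Redshift instance can mount the data lake’s catalog database as an external schema, so that queries on it read directly from S3:

import java.sql.DriverManager

// Connect to the regional Redshift cluster (illustrative host and credentials)
val conn = DriverManager.getConnection(
  "jdbc:redshift://regional-dwh.example.com:5439/analytics", "dwh_user", "dwh_password")

// Expose the data lake's catalog database through Redshift Spectrum:
// queries on datalake.* read the files in S3 directly, nothing is copied
conn.createStatement().execute(
  """CREATE EXTERNAL SCHEMA IF NOT EXISTS datalake
    |FROM DATA CATALOG DATABASE 'corporate_metrics'
    |IAM_ROLE 'arn:aws:iam::123456789012:role/spectrum-read-role'
    |CREATE EXTERNAL DATABASE IF NOT EXISTS""".stripMargin)
conn.close()

In a setup like this, the global Athena instance would query the same catalog tables, which is how central and local teams end up reading exactly the same files in S3.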

How to govern data sets

In order to scale our federated Business Intelligence architecture, we embraced the concept of “datasets as products” as described in the article How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. Quoting Zhamak Dehghani:

“For a distributed data platform to be successful, domain data teams must apply product thinking with similar rigor to the datasets that they provide; considering their data assets as their products and the rest of the organization’s data scientists, ML and data engineers as their customers”.

The main characteristics a dataset as a product needs to have are:

  • Discoverable
  • Addressable
  • Trustworthy
  • Self-describing
  • Interoperable
  • Secure

Let’s go into detail on each one of these characteristics and explain how we have addressed it.

Discoverable

In order for datasets to be discoverable, we’ve built a search engine for datasets on top of S3. The first version of the search engine just indexed all buckets and paths, which was very useful but lacked any proper governance.

After several iterations, we ended up with a dataset registry where you have to actively register your dataset so that it appears in the search results. This also allowed us to request mandatory metadata fields at the moment of registration, which made addressability and search easier and more powerful.

Addressable

When you register a dataset, you need to add some metadata: its name, the Athena table and S3 path where the data can be found, a description, the list of fields, example data, and so on.
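
As an illustration, a registration record could look roughly like the sketch below; the exact field and type names are made up for this example:

// Hypothetical shape of a registry entry (illustrative field names)
case class FieldSpec(name: String, dataType: String, description: String)

case class DatasetRegistration(
  name: String,                 // unique dataset name used in search
  athenaTable: String,          // fully qualified Athena table, e.g. "analytics.ads_daily"
  s3Path: String,               // S3 location of the underlying files
  description: String,          // what the dataset contains and is for
  fields: Seq[FieldSpec],       // name, type and meaning of every column
  exampleData: Option[String]   // small inline sample, or a link to one
)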

Having addressable datasets has made the teams that work with data more productive. On the one hand, Data Analysts and Data Scientists are autonomous in finding and using the data they need. On the other hand, Data Engineers get far fewer interruptions from people asking where they can find data about X.

Trustworthy

Checking data quality regularly and automatically is a must to fulfill the trustworthy characteristic of datasets as products, and the owners of the datasets need to react accordingly to the results of these checks.
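
As a rough illustration of the kind of automated check we mean (the helper, thresholds and column names here are invented for the example, not our actual implementation):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Return a list of problems found in a dataset; an empty list means the checks passed.
// Owners can wire the result into alerting and react before consumers are affected.
def checkDataset(df: DataFrame, minRows: Long, notNullColumns: Seq[String]): Seq[String] = {
  val rowCountIssues =
    if (df.count() < minRows) Seq(s"expected at least $minRows rows") else Nil
  val nullIssues = notNullColumns.flatMap { c =>
    val nulls = df.filter(col(c).isNull).count()
    if (nulls > 0) Some(s"column $c has $nulls null values") else None
  }
  rowCountIssues ++ nullIssues
}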

We also knew that some people were not using the dashboards or the data because they didn’t know whether the data was accurate. One success story in this area is that we started providing contextual data-quality information to consumers of the data, for example in Tableau dashboards.

Having this contextual data quality information has increased trust in the data and again made Analysts and Product Managers more autonomous in using the dashboards to make data-informed decisions.

Self-describing

Creating and registering a new dataset is not enough on its own for people to start using it. As we mentioned before, metadata is very important for adoption. The characteristics that describe each dataset are:

  • Data location
  • Data provenance
  • Data mapping
  • Example data
  • How often the data is generated
  • Input preconditions
  • Link to a Jupyter notebook with example code using the dataset

Some of this metadata is generated automatically, but the rest is requested from the creator and maintainer of the dataset.

Interoperable

A common nomenclature is mandatory to make datasets interoperable. Some basic rules we implemented are:

  • Fields that have the same name in different data sets should contain the same data
  • If those fields are primary keys or foreign keys, you should be able to join these data sets by those keys (see the sketch below)
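
For example, assuming two curated datasets that share an ad_id key (the paths and column name are purely illustrative), the second rule means a join like this just works:

// Both datasets use "ad_id" with the same meaning, so they can be joined directly
val ads   = spark.read.parquet("s3://datalake/curated/ads/")
val leads = spark.read.parquet("s3://datalake/curated/leads/")
val adsWithLeads = ads.join(leads, Seq("ad_id"), "left")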

Secure

Registered data sets should not automatically be available to everyone. Employees need to request access to each one of them, and data controllers need to grant or deny access individually.

When requesting access, it is mandatory to specify until when the access is needed and for what purpose. This has worked well for us because, again, it gives autonomy to our data users.
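
As a sketch of the information such a request carries (the field names are illustrative, not the actual schema):

import java.time.LocalDate

// Hypothetical shape of an access request reviewed by a data controller
case class AccessRequest(
  requester: String,    // employee asking for access
  datasetName: String,  // dataset as it appears in the registry
  purpose: String,      // why access is needed
  expiresOn: LocalDate  // access is revoked after this date
)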

How to build common data infrastructure as a platform

As stated earlier, our team computes core business metrics for reporting and analytics. While doing so, we’ve realised that all metric calculations follow a simple pattern:

  • Metrics are computed from specific events (filter)
  • Some transformations should be applied before aggregating
  • Data is aggregated by grouping over several dimensions (using count, count distinct, sum, average, and so on)
  • Some transformations should be applied after aggregating
  • Every metric needs to be calculated by day, week and month, plus rolling averages over 7 and 28 days

So it made sense to abstract our code and provide Engineers with a simple framework to reuse and extend our core set of business metrics:

// Count distinct ads that received a lead, broken down by the listed dimensions
val simpleMetric: Metric = withSimpleMetric(
  metricId = AdsWithLeads,
  cleanupTransformations = Seq(filterEventTypes(List(isLeadEvent(EventType, ObjectType)))),
  dimensions = Seq(DeviceType, ProductType, TrackerType),
  aggregate = countDistinct(AdId),
  // add constant period and client columns after aggregation
  postTransformations = Seq(
    withConstantColumn(Period, period)(_),
    withConstantColumn(ClientId, client)(_)
  )
)

By reusing this simple abstraction, we can calculate as many metrics as we want. The configuration is then passed to Spark’s cube() function, which calculates subtotals and a grand total for every combination of the dimensions specified. This way we get OLAP cubes that are already pre-aggregated.
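
For illustration, this is roughly what that pre-aggregation looks like in plain Spark, assuming a filteredEvents DataFrame that already contains the dimension columns and AdId (the names are illustrative):

import org.apache.spark.sql.functions.countDistinct

// cube() groups by every combination of the dimensions, adding subtotal rows
// (nulls in the grouping columns) and a grand total: a pre-aggregated OLAP cube
val adsWithLeadsCube = filteredEvents
  .cube("DeviceType", "ProductType", "TrackerType")
  .agg(countDistinct("AdId").as("ads_with_leads"))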

Another example of these reusable frameworks and libraries is the one we use to run user segmentation models. By default, we provide a standard user segmentation based on an RFM model, but we also provide the functions to run any type of segmentation. The snippet below is a custom example taken from a Jupyter notebook:

// Aggregate raw events into per-user features
val df = spark.read.parquet(path)
  .groupBy("user_id")
  .agg(
    count(col("event_id")).as("total_events"),
    countDistinct(col("session_id")).as("total_sessions")
  )

// Assign each user to a segment based on percentile thresholds per dimension
val dfWithSegments = df.transform(
  withSegment(
    "segment_chain",
    Seq(
      SegmentDimension(col("total_events"), "events_percentile", 0.5, 0.8),
      SegmentDimension(col("total_sessions"), "sessions_percentile", 0.5, 0.8)
    )
  )
)

This enables local and central teams to calculate their own user segmentation and share it with others, again providing a lot of autonomy to every Data Analyst and Data Scientist in the organization.

Conclusions

As we’ve mentioned several times during the article, one of the main achievements in building a federated information architecture is the autonomy we’ve given to data consumers.

The non-invasive governance program we deployed and the development of datasets as products have been key to scaling our journey towards being more data-driven in all parts of the organization.


Xavier Gumara Rigol

Passionate about data product management, distributed data ownership and experimentation. Engineering Manager at oda.com