Sweden’s Klarna and Spotify design big data architectures to aid business growth

Many organisations still don't know how to capitalise on big data. Swedish firms Klarna and Spotify are making use of its possibilities

There is a lot of talk about big data these days, but many organisations still don't know how to use it. Swedish companies Klarna and Spotify are among those that make good use of the new possibilities.

Klarna, an e-commerce business founded in Sweden in 2005, provides customer payment services for online stores. The business idea is to simplify buying. One aspect of this vision is to eliminate risk for seller and buyer, which means it is essential to make good assessments of a buyer’s trustworthiness.

“To increase precision of risk assessment, we have invested in a big data infrastructure. We began a year and a half ago and went into production this spring,” says Erik Zeitler, who holds a PhD in database technology and is the technology lead in data infrastructure at Klarna.

The new infrastructure makes it possible to use several different sources at the same time when making a risk assessment; previously a single database was used.

“This of course means we can take into account more kinds of data when doing risk assessment, and it also means we can try out new risk models by maintaining several alternative transformations at the same time,” says Zeitler.

“We have implemented a continuous delivery pipeline on top of Hadoop that makes it easy to deploy new transformations in production in a traceable fashion. The single database used to be considered as the source of the truth, so to speak, and that's not the case anymore.”

Klarna won't reveal exactly which data is used, but it comes from three main sources: internal customer history data, data specific to the purchase, and external vendor data.

“Each risk assessment is based on a pretty limited amount of data – it's not many megabytes. But this data is derived from nightly batch transformations, and they are pretty big,” says Zeitler.

This means that every night Klarna prepares preliminary variables for all customers who have previously made purchases.

“And when a person wants to make a new purchase, we use those variables. There are a number of factors that can stop a transaction, and one of those is if you are buying unexpected products. For example, are you expected to suddenly buy a hundred USB sticks? If not, it might be a fraud.”
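
The check Zeitler describes could look roughly like the following minimal sketch. The `NightlyVariables` class, the `typical_max_quantity` field, the `is_unexpected` function and the tolerance `factor` are all hypothetical names chosen for illustration; Klarna does not disclose its actual variables or models.

```python
from dataclasses import dataclass


@dataclass
class NightlyVariables:
    """Hypothetical per-customer variables prepared by a nightly batch job."""
    typical_max_quantity: int  # largest quantity of one item seen in past orders


def is_unexpected(variables: NightlyVariables, quantity: int, factor: int = 10) -> bool:
    """Flag a purchase whose quantity far exceeds the customer's history.

    `factor` is an assumed tolerance, not a value taken from the article.
    """
    return quantity > variables.typical_max_quantity * factor


# Variables computed overnight for a customer who usually buys one or two items.
history = NightlyVariables(typical_max_quantity=2)

print(is_unexpected(history, quantity=100))  # a hundred USB sticks
print(is_unexpected(history, quantity=3))    # an ordinary order
```

The point of the split is that the expensive work happens in the nightly batch, so the check at purchase time only has to read a few small, precomputed values.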

While it is easy to measure how many transactions Klarna fails to stop – it's just a matter of counting how many purchases don't get paid – it is harder to assess how many purchases are denied on false grounds.

“It's an ongoing project to try to estimate the loss we make from our false negatives. We continually tweak our models to try to minimise both the false positives and false negatives. And the more data we use, the better our chances are to make a good estimate,” says Zeitler.
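
The two error types can be made concrete with a small sketch. It assumes that "positive" means a purchase was denied as risky, so a false positive is a denied purchase that would have been paid and a false negative is an accepted purchase that was never paid. The `error_rates` function and the sample data are hypothetical. In practice the `would_have_paid` label for denied purchases is exactly the part that is hard to observe, so here it is simply assumed to be known.

```python
def error_rates(outcomes):
    """Return (false_positive_rate, false_negative_rate).

    Each outcome is a pair (denied, would_have_paid). Rates are taken over
    the good and the bad purchases respectively.
    """
    false_positives = sum(1 for denied, good in outcomes if denied and good)
    false_negatives = sum(1 for denied, good in outcomes if not denied and not good)
    good_total = sum(1 for _, good in outcomes if good)
    bad_total = len(outcomes) - good_total
    fpr = false_positives / good_total if good_total else 0.0
    fnr = false_negatives / bad_total if bad_total else 0.0
    return fpr, fnr


sample = [
    (False, True),   # accepted and paid
    (False, True),   # accepted and paid
    (True, True),    # denied a good customer: false positive
    (False, False),  # accepted but never paid: false negative
    (True, False),   # denied a bad purchase: correct
    (False, True),   # accepted and paid
]

print(error_rates(sample))
```

Tightening a model usually lowers one rate while raising the other, which is why both have to be tracked together.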

If the customer does not pay, Klarna takes the hit. The e-stores are charged a fixed price combined with a fee for each transaction.

“The online stores that begin using Klarna usually increase their sales by about 30%. One of the reasons is that we eliminate friction for the customer by separating buying from paying. Customers buy first, and then decide how they want to pay,” he says.

To complete a purchase, Klarna only asks for information the customer can easily provide, according to Zeitler.

“We ask for things like email address, and post code or social security number, until we have enough information to identify the person, and to make the risk assessment. If we accept the purchase we send the customer an invoice. That means the customer does not have to part with their money before they get the product.”

The data used to identify the customer, and the data used to make the risk assessment, vary from country to country, mainly because of differences in public registers. This – in combination with differences in payment cultures – means it is a lot of work for Klarna to enter new markets, according to Zeitler.

Klarna recently stepped into the UK market, and its system is now offered by around 45,000 e-stores in 15 European countries. With the growing number of customers, the total amount of data handled also grows.

“But the amount of data is still not enormous. In our nightly batch transformation we go through somewhere between 10 terabytes and one petabyte of data. When you talk about big data you look at volume, variety and velocity – called the 'three Vs' – and our challenge is not the volume; the big challenge is to handle the variety of the data,” he says.

To meet the challenge posed by the large number of data sources and their complexity, it is not enough to deploy ordinary relational databases, in Klarna’s view. The company’s front-end systems are built on the NoSQL database Riak from the vendor Basho. The risk assessment is made in the next layer, which is actually a cluster of relational databases.

“Those are the two online systems. Back office – where the nightly batches are made – we have a system built on Apache Hadoop. And Hadoop is two things – it’s scalable storage in the form of Hadoop Distributed File System, HDFS, and scalable execution in the form of the programming model MapReduce,” says Zeitler.
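
For readers unfamiliar with the MapReduce programming model, the following single-process sketch shows its three steps: map, group by key, reduce. It is plain Python, not Hadoop code, and the purchase records and function names are invented for illustration. On a real cluster each step runs in parallel across many machines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical input: (customer, purchase amount) records.
purchases = [("alice", 120), ("bob", 40), ("alice", 30)]


def mapper(record):
    customer, amount = record
    yield customer, amount


def reducer(customer, amounts):
    return sum(amounts)


totals = reduce_phase(shuffle(map_phase(purchases, mapper)), reducer)
print(totals)
```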

One of the big advantages of this setup is the ability to use an SQL-like language called HiveQL on Hadoop, and then feed the result of the batch transformation into the front-end system every morning, according to Zeitler.

“Another pro is that the front-end SQL stays up to date with a steady stream of new transactions from the online system. That is, the information from the online system simmers down to Hadoop for offline analysing, but some of the information also goes directly into the cluster of relational databases. And then we write over the information in the online databases every morning with the outcome from Hadoop.”

This architecture, called Lambda, is a way to eliminate complexity, and it gives Klarna great flexibility, according to Zeitler.
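
The flow Zeitler describes can be sketched in a few lines: a store that combines the latest nightly batch result with transactions that have arrived since, and that is overwritten each morning. The `ServingStore` class and its method names are hypothetical, and the sketch assumes the nightly batch covers every transaction recorded before it is loaded.

```python
class ServingStore:
    """Toy serving store combining a batch view with recent transactions."""

    def __init__(self):
        self.batch_view = {}  # per-customer totals from the nightly batch
        self.recent = {}      # per-customer totals streamed in since then

    def record_transaction(self, customer, amount):
        """Online path: new transactions are visible immediately."""
        self.recent[customer] = self.recent.get(customer, 0) + amount

    def load_batch(self, batch_view):
        """Morning overwrite: replace everything with the batch outcome.

        Assumes the batch already includes all transactions seen so far.
        """
        self.batch_view = dict(batch_view)
        self.recent = {}

    def total_spent(self, customer):
        """Query path: batch result plus whatever arrived afterwards."""
        return self.batch_view.get(customer, 0) + self.recent.get(customer, 0)


store = ServingStore()
store.load_batch({"alice": 150})
store.record_transaction("alice", 20)
print(store.total_spent("alice"))  # batch total plus today's purchase

store.load_batch({"alice": 170})   # next morning's batch has caught up
print(store.total_spent("alice"))
```

Because the batch result replaces the online data every morning, any error in the online path lasts at most a day, and a new transformation can be tried out simply by changing what the batch produces.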

“It took a great deal of reading and discussing before we settled for this solution,” says Zeitler, who was the one who took the initiative for investing in a big data infrastructure at Klarna.

“I proposed it to the upper management two years ago, when I was relatively new at Klarna, and they listened to me. I don't know if this is something that is more common in Swedish organisations, which are usually pretty non-hierarchical, but I think it's pretty cool that you get listened to as a newcomer,” he says.

Spotify and big data technologies

Music streaming service Spotify, founded in 2006, is another Swedish-born company that has invested in a big data infrastructure: a 690-node Hadoop cluster in the back office, and on top of that clusters of the open source NoSQL database Apache Cassandra.

“We have 40 million monthly active users, and they generate a lot of data. We process that data to generate new data to hand back to the users. For example, we give song and playlist recommendations,” says Jimmy Mårdell, tech product owner at Spotify, and responsible for the data being delivered back to users.

The first big data infrastructure was built when the company started eight years ago, and much has happened since. In the beginning Spotify ran a small Hadoop cluster with 35 nodes, and data was imported into Hadoop using a store-and-forward approach. Today data is imported through a streaming system using the messaging system Kafka.

“All user activity generates a lot of logs and data, and then we have to ship that from all over the world to our Hadoop cluster – that is what we use Kafka for,” says Mårdell.

The main processing language used to be Python, but Spotify is now moving to Apache Crunch and Scalding instead. Spotify has also built its own workflow manager, called Luigi, which is used to coordinate all the analysis jobs run on its data.

“We need Luigi to stay away from total chaos. We have open sourced Luigi, so other companies can use it as well; we like open source and use it a lot at Spotify. You get help from the community to develop the open source software – it's a win-win situation,” says Mårdell.
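
The core job of a workflow manager is to run tasks only after the tasks they depend on have finished. The sketch below illustrates that idea with Python's standard-library `graphlib` module (Python 3.9 or later). It does not use Luigi's own API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
dependencies = {
    "import_logs": set(),
    "aggregate_plays": {"import_logs"},
    "recommendations": {"aggregate_plays"},
    "export_to_serving": {"recommendations"},
}


def run_order(deps):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

A real workflow manager adds scheduling, retries and tracking of which outputs already exist, but the dependency graph is what keeps hundreds of interlocking jobs from descending into the chaos Mårdell mentions.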

Spotify has also recently started using Spark, an alternative to MapReduce for big data processing. When it’s time to send the processed data back to the user, the data is first transported back from Hadoop to Cassandra, using tools Spotify has written in-house.

Then Cassandra is used to serve data back to the user. Since it’s important the data is geographically close to the user, Spotify has several Cassandra clusters dispersed around the world.

“We can’t have the Cassandra clusters in one single place, like the Hadoop cluster. If we had the data in Sweden, and an Australian user asked for it, the user experience wouldn't be that good – it would be too slow.”

Spotify has also chosen to run many separate instances of SQL and Cassandra databases to ensure the stability of the system, according to Mårdell.

“For example, the data that delivers your playlists and the data that delivers the discovery page are separated in totally different database clusters. This means that if the playlist databases would get sick for some reason, everything else in Spotify would still work perfectly. Decoupling is the key to scalability,” says Mårdell.
