How TrueAccord Scaled its Email Sends to Millions a Month

By on May 1st, 2018 in Data Science, Engineering and Data, Product and Technology

Scaling to sending millions of emails a month is a difficult task, and sending debt collection emails is exponentially harder. To prevent spam and abuse, email providers and infrastructure vendors developed tools and tactics that can easily hurt, blacklist, and eliminate not only the “bad guys” but also the uninitiated sender. Still, we scaled to sending millions of emails a month while enjoying high open and click-through rates that allow us to service consumers the way they want to be serviced (we use other channels as well, but focus on email here). We learned important lessons about scale along the way, through trial and error and calculated planning, and we’re sharing them today.

Challenges With Scaling Email

Email is one of the most heavily policed communication platforms. There are no filters, blockers, or spam buttons when you receive a phone call or pick up a letter from your mailbox, but email is equipped to keep the bad guys out and let the good guys in. ISPs (email inbox providers) design algorithms to keep users happy and engaged, and an inbox full of spam is not very pleasant. Unfortunately, sometimes the algorithm gets it wrong, and an email with good intentions from a trusted sender gets filtered as spam.

To further complicate the issue for email senders, each ISP filters against a different set of rules. What is an acceptable email at Google may be flagged as spam at Yahoo, and vice versa. There is no clear rule book to refer to when attempting to scale email to a very high volume. The algorithms are also constantly changing in reaction to real spammer behavior, further complicating any attempt to create one clear, step-by-step process for success.

The signals spam prevention algorithms rely on touch many aspects of email, including content, design, timing, volume and frequency of sends, consumer engagement, digital signatures, and more. Getting everything right is complex, and if you get any of these wrong, you may find yourself indefinitely blacklisted and banned from emailing.

TrueAccord’s Unique Perspective

Operating in the debt collection space further complicates scaling email. Even if consumers agreed to be contacted via email, they do not necessarily welcome those emails, leading to lower inbox placement than eCommerce brands see. Despite this enormous hurdle, TrueAccord has engagement rates similar to those of eCommerce companies, with open rates of up to 30% and click-through rates of 14+%. It took a lot of work and careful attention to detail to get us there.

TrueAccord uses machine learning algorithms to pick the best email to send to a specific person at the right time in their debt collection process. The team customizes the content, timing, and frequency of emails, slowly ramping up scale while monitoring performance. In addition, many of TrueAccord's contact attempts are reactive, made in response to consumer action or feedback. Contacting consumers in context adds credibility and attracts their attention while they are still engaged, further improving response rates. This close attention to detail, coupled with engaging content and data-driven targeting, makes a significant difference. TrueAccord increases consumer engagement and signals to ISPs that its emails are legitimate, creating a virtuous cycle that improves inbox placement and consumer exposure to emails, which in turn improves engagement again.

Our Top Tips for Emails at Scale

We've polled our product and deliverability experts to offer you our top tips for building a scalable email program. If you follow these, you'll have a better chance of replicating our success and experiencing engagement rates that support, rather than hurt, your long-term inbox placement.

Create Valuable Content

The most important aspect of scaling email is writing good content that looks reputable and is well designed. It's important to earn the consumer's trust and stay away from words and phrases that trigger spam engines. TrueAccord accomplishes this by personalizing every email and sending the right email at the right time during the debt collection process, while also passing every email through a robust approval process to maintain quality.

Consult Experts

Because consumer engagement and open rates are a cornerstone of our business, we work closely with a team of email deliverability experts and providers. They provide specific industry knowledge about each ISP and assist with the warm-up strategy for each domain and IP address. Experts help audit deliverability programs, deal with ISP-specific challenges, and bring hard-won know-how.

Segment Domains and IP addresses

Utilizing segmented domains and IP addresses allows for growth and scale while limiting the risk to your reputation from a single mistake, which is one of the biggest traps for new email programs. TrueAccord segments email sends to manage sender reputation and distribute potential issues across multiple domains and IPs, so that none of them sees too many bounces, receives too many spam complaints, or carries too high a proportion of unopened emails.

Start Methodically and Slowly

Scaling your program too early is heavily penalized, even for high-engagement senders. Most established companies that add an email strategy to an existing customer base make this mistake, which often cannot be undone. TrueAccord places strict limits on email volume growth to make sure ISPs don't flag our systems.

This is especially important when starting out with a new client. When a new portfolio is added, we send a small group of several hundred test emails over a few days to measure general deliverability and bounce rates. This test cycle provides insight into the appropriate strategy for that specific portfolio. If bounce rates are normal, we can begin to send emails freely, but if the levels are higher than expected, we use high-risk mitigation strategies.
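
For illustration, here is a minimal Scala sketch of what such a bounce-rate gate could look like. The threshold and the strategy names are hypothetical, not our actual production parameters.

// Hypothetical sketch: thresholds and strategy names are illustrative only.
object RampUpGate {

  sealed trait SendStrategy
  case object NormalRampUp extends SendStrategy
  case object HighRiskMitigation extends SendStrategy

  final case class TestCycleStats(sent: Int, bounced: Int) {
    def bounceRate: Double = if (sent == 0) 0.0 else bounced.toDouble / sent
  }

  // Decide how to treat a new portfolio after the initial test sends.
  def chooseStrategy(stats: TestCycleStats, maxNormalBounceRate: Double = 0.02): SendStrategy =
    if (stats.bounceRate <= maxNormalBounceRate) NormalRampUp else HighRiskMitigation

  def main(args: Array[String]): Unit = {
    println(chooseStrategy(TestCycleStats(sent = 500, bounced = 6)))  // NormalRampUp
    println(chooseStrategy(TestCycleStats(sent = 500, bounced = 40))) // HighRiskMitigation
  }
}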

Measure Measure Measure

Set, measure, track. Data is the lifeblood of a scalable email program because you must track the performance of multiple indicators across multiple segments to detect any developing issue. TrueAccord created smart alerts that highlight engagement, spam issues, email features, and other indicators across sending IPs, sending domains, receiver domains, and other dimensions. Together they give us a realistic view of how the program is doing as it scales, and where we may have opportunities for improvement.
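
As a rough illustration, a segment-level alert check might look like the sketch below; the metric names and thresholds are invented for the example and are not our production values.

// Hypothetical sketch of per-segment deliverability alerts; thresholds are illustrative.
object DeliverabilityAlerts {

  final case class Segment(sendingDomain: String, sendingIp: String, receiverDomain: String)
  final case class SegmentStats(delivered: Int, opened: Int, bounced: Int, spamComplaints: Int)
  final case class Alert(segment: Segment, reason: String)

  def checkSegment(segment: Segment, stats: SegmentStats): Seq[Alert] = {
    def rate(n: Int): Double = if (stats.delivered == 0) 0.0 else n.toDouble / stats.delivered
    Seq(
      (rate(stats.bounced) > 0.02)         -> "bounce rate above 2%",
      (rate(stats.spamComplaints) > 0.001) -> "spam complaint rate above 0.1%",
      (rate(stats.opened) < 0.10)          -> "open rate below 10%"
    ).collect { case (true, reason) => Alert(segment, reason) }
  }

  def main(args: Array[String]): Unit = {
    val segment = Segment("mail.example.com", "192.0.2.10", "gmail.com")
    val stats   = SegmentStats(delivered = 10000, opened = 600, bounced = 300, spamComplaints = 15)
    checkSegment(segment, stats).foreach(println) // fires all three alerts for this segment
  }
}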

It’s taken TrueAccord two years of trial and error and obsessing over data to scale to millions of emails sent each month. Our email scale will continue to grow as our consumer base and business grows, and we are confident that this strategy will support our growth.

How We Created Heartbeat

By on April 24th, 2018 in Data Science, Engineering and Data, Industry Insights, Machine Learning, Product and Technology

Sophie Benbenek, TrueAccord’s Head of Data Science, discusses the early days of building our machine learning based engine, Heartbeat, and how it has evolved since. Hear about our approach to machine learning, how we move from heuristics to statistical models, and other anecdotes from the early days of TrueAccord.

Scaling TrueAccord’s Infrastructure

By on April 12th, 2018 in Data Science, Engineering and Data, Industry Insights, Machine Learning, Product and Technology

TrueAccord’s machine learning based system handles millions of consumer interactions a month and is growing fast. In this podcast, hear our Head of Engineering Mike Higuera talk about scaling challenges, prioritizing work on bugs vs. features, and other pressing topics he’s had to deal with while building our system.

Conversion At TrueAccord: Tuning A Machine Learning Engine

By on April 3rd, 2018 in Data Science, Engineering and Data, Industry Insights, Machine Learning, Product and Technology

TrueAccord's system is machine learning based, but every new product type requires a little bit of tuning to beat the competition. Hear our CSO and VP of Finance in this short podcast about the Conversion Team and what it does to make sure TrueAccord stays ahead of the competition.

 

Using an Experimentation Engine to Improve the Debt Collection User Experience

By on March 27th, 2018 in Data Science, Engineering and Data, Industry Insights, Machine Learning, Product and Technology

Innovative automation processes are finally gaining traction in debt collection, as companies increasingly distance themselves from costly and unmanageable call centers. And now, with an eye on continuous process improvement, a new focus on experimentation is enhancing the way these companies recover revenue and create a more effective user experience. Experimentation engines – whereby various collection scenarios and features are tested and evaluated based on real-time data – empower creative and customized contact and offer strategies that improve liquidation as well as customer satisfaction.  

Typical Challenges for the Call Center Model

The traditional debt collection call center model faces multiple challenges. Because of their commission-based compensation model, collection agents often use aggressive tactics on the phone, pushing for an immediate lump-sum payment or a short-term installment option to speed payment. Even if the consumer picks up the phone at all (which in today's smartphone culture is becoming far less likely), they feel pressured and may commit to a plan they simply can't afford. The result is an installment plan that breaks, often after the first payment, and consumers frequently charge back the phone payment because they felt antagonized by the pressure in the first place. The call center cost structure also cannot support highly customized plans with irregular payment schedules, missing out on another segment of consumers. All of this adds up to a significant disadvantage given today's consumers and their financial needs.

Flip the System on Its Head with a Machine Learning Based Approach

The modern approach to debt collection is omnichannel, digital-first, consumer-centric and leverages data and experimentation to determine the best course of action based on consumer preference and behavior.

TrueAccord's system communicates with consumers automatically through a wide range of digital channels, including email, text, and social channels. And because it is digital-first and fully reactive to consumer behavior and preferences, it creates a far less aggressive, much more personalized collection environment that delivers superior results compared with call centers. Historical data collected over several years, combined with machine learning algorithms that evaluate individual behavior and preferences, enables this highly targeted and personalized treatment. Two to three email interactions per week serve as a baseline, with additional channels in support and reactive communications that respond to consumer interactions when needed.

This approach is also highly collaborative, focused on educating consumers and treating them the way they want to be treated. When they’re ready to commit to a plan, they just view payment options online and choose the one that makes the most sense. The result is higher liquidation rates in the long run, higher payer rates, and higher consumer satisfaction that leads to fewer complaints.

Machine Learning Drives the Experimentation Engine

The most important asset in the TrueAccord model is the data collected and analyzed over time that enables us to accurately predict what messages people respond to, what payment offers work best, and for which type of consumer. This complex data-driven system is part of our DNA and entails a lot of moving parts that allow us to truly understand what resonates with each consumer.

The driving force behind the system's ever-evolving performance is an experimentation engine that allows us to test various scenarios to see how collection processes work and how they can be improved. Since digital-first channels are highly instrumented and offer real-time tracking on our website, we can learn in short cycles and continuously improve. To launch an experiment, we establish a hypothesis we want to test, monitor what's happening in the conversion funnel at each touchpoint, and see how each product or plan is being used and where consumers are dropping off. Even when an experiment fails, we learn from the data and feed future iterations in a continually improving system. We partner strategically with our clients to customize experiments for their product lines and make experimentation-based optimization an ongoing process.
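
As an illustration of the mechanics (this is not our actual Heartbeat code), deterministic bucketing is a common way to assign a consumer to an experiment variant so that repeat visits always see the same treatment:

import java.nio.ByteBuffer
import java.security.MessageDigest

// Illustrative sketch of deterministic experiment bucketing.
object ExperimentAssignment {

  final case class Experiment(name: String, variants: Seq[String])

  // Hash the experiment name and consumer id so the same consumer always lands
  // in the same variant for a given experiment.
  def assign(consumerId: String, experiment: Experiment): String = {
    val digest = MessageDigest.getInstance("MD5")
      .digest(s"${experiment.name}:$consumerId".getBytes("UTF-8"))
    val bucket = ByteBuffer.wrap(digest.take(4)).getInt & 0x7fffffff
    experiment.variants(bucket % experiment.variants.size)
  }

  def main(args: Array[String]): Unit = {
    val paydayExperiment = Experiment("payday-alignment", Seq("control", "friday-default", "date-picker"))
    println(assign("consumer-123", paydayExperiment))
    println(assign("consumer-123", paydayExperiment)) // same variant every time
  }
}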

A few sample experiments:

Aligning Payments to Income

The number one reason payment plans fail is that consumers don't have enough money on their card or in their bank account. Our hypothesis was that if you align debt payments with paydays, consumers are more likely to have funds available and payment plan breakage is reduced. The experiment tested three scenarios: one as a control, one defaulting to payments on Fridays, and one where consumers used a date-picker to align payments with their actual payday. After testing and analysis, we found that the date-picker approach worked best, lowering breakage without negatively impacting conversion.

Self-service Payment Experiences Reduce Costs and Breakage

Consumers with debt often can't predict when they'll be paid or how much. Our hypothesis was that by allowing them to self-service their payment plans and make modifications along the way (based on changes in their lives), we would reduce the need for interaction and improve the customer experience while reducing breakage. This experiment was also a success, reducing the breakage rate and lowering call rates: before its launch, consumers had to call to change their plan. By making the desired functionality readily available, we were able to increase the payment plan success rate and save agent time.

Even Failures Are a Learning Experience

One hypothesis we tested was that customers who dropped off our radar after not choosing a plan could be enticed to sign up if offered longer payment plans. After sending texts and emails based on their behavior, we found that new sign-ups simply didn't materialize by offering longer payment plans without referring to the consumer's specific life situation. The offers had high open and click rates, but no sign-ups. This indicated that we were on the right track but needed to iterate and come up with an alternative solution.

An experimentation engine allows every company to test its own hypotheses to see whether its customized solutions work. A digital-first, highly instrumented experience allows us to run dozens of experiments concurrently, learning from each one so we can progressively improve our experience and results. Even when experiments fail, they unearth insights that can be used to improve performance next time as part of follow-on experiments. In the world of debt collection, testing and continuous improvement mean better results in the long run.

Building An Experimentation Engine

By on March 20th, 2018 in Data Science, Engineering and Data, Product and Technology, Testing

TrueAccord beats the competition on many levels, and does that through rigorous testing and improvement. Hear a talk from our CTO Paul Lucas and Director of Product Roger Lai on our approach to experimentation.

 


My First Month On The TrueAccord Engineering Team

By on December 4th, 2017 in Culture, Engineering and Data, Industry Insights

I can't believe it's been a month since I started at TrueAccord. About eleven months ago, my husband and I moved to the Bay Area from Australia. It took us a few months to settle down with two young kids, a five-year-old girl and an 18-month-old boy. Three months ago I decided to look for a job as the final step of settling down. As you can imagine, finding a good job opportunity is never easy. So when my friend, who happened to work next door to our San Jose office, told me TrueAccord was hiring an engineer, I immediately applied and got an offer after a few interviews.

Settling into a new job is not easy, and it takes time and effort to process a ton of new information. I received a warm welcome and plenty of help from everybody. Through a short but informative onboarding process, I got to know people with different backgrounds and perspectives and felt their energy and enthusiasm for the industry. With their help and support, I feel comfortable and ready for the challenges yet to come. The fact that the company is committed to maintaining a diverse workforce and ensuring an inclusive work environment is very attractive, and I am really excited to join a multicultural environment like this.

My onboarding journey started with an orientation at our headquarters in San Francisco. I was introduced to the various teams I'd be working with (engineering, data science, analytics, product) and got to know them, the projects they were working on, and how they collaborate. I was impressed with how well structured and organized TrueAccord is for a startup company, with well-established collaboration between teams and clearly defined procedures that save time and resources. Then I was ready to start in our San Jose office, where the engineering team is based, and officially became a member of this innovative and collaborative team.

My first task was to set up my development environment, which was more challenging than I thought it would be. It took some time to get everything up and running. It was not the smoothest setup I have experienced, but it was a great opportunity to learn and understand our tech stack and development environment. It was also a great opportunity to see how supportive my team is: when the process got frustrating, they jumped in to troubleshoot so I could get up and running quickly.

Setting up the workstation was just the start of a bigger journey. There was so much information I needed to absorb to do my job. Coming from a Java background, the first thing I had to tackle was learning Scala, since our main backend system depends on it. In addition, there were many related technologies I needed to get my hands on. Again, my team was there to make sure I succeeded, especially my manager, who made a great plan to help me navigate the process. Every week I had great discussions with my teammates about our system and tech stack, which allowed me to jump in and start on some real tasks.

It has been a great month. There were challenges, frustrations, and excitement. I feel lucky that TrueAccord offered me my first job in the US, and I am so thankful to everyone who has helped me on this journey. With such a great start, I believe I'm on the right track to ramping up quickly and effectively, and I look forward to contributing more to our projects.

How Much Testing is Enough Testing?

By on February 2nd, 2017 in Engineering and Data, Product and Technology, Testing

[Image: the Golden Gate Bridge by night]


One hundred years ago, a proposal took hold to build a bridge across the Golden Gate Strait at the mouth of San Francisco Bay.  For more than a decade, engineer Joseph Strauss drummed up support for the bridge throughout Northern California.  Before the first concrete was poured, his original double-cantilever design was replaced with Leon Moisseiff’s suspension design.  Construction on the latter began in 1933, seventeen years after the bridge was conceived.  Four years later, the first vehicles drove across the bridge.  With the exception of a retrofit in 2012, there have been no structural changes since.  21 years in the making.  Virtually no changes for the next 80.

Now, compare that with a modern Silicon Valley software startup.  Year one: build an MVP.  Year two: funding and product-market fit.  Year three: profitability?…growth? Year four: make it or break it.  Year five: if the company still exists at this point, you’re lucky.

Software in a startup environment is a drastically different engineering problem than building a bridge.  So is the testing component of that problem.  The bridge will endure 100+ years of heavy use and people’s lives depend upon it.  One would be hard-pressed to over-test it.  A software startup endeavor, however, is prone to monthly changes and usually has far milder consequences when it fails (although being in a regulated environment dealing with financial data raises the stakes a bit).  Over-testing could burn through limited developer time and leave the company with an empty bank account and a fantastic product that no one wants.

I want to propose a framework to answer the question of how much testing is enough.  I'll outline six criteria, then throw them at a few examples.  Skip to the charts at the end and come back if you are a highly visual person like me.  In general, I am proposing that testing efforts be assessed on a spectrum according to the nature of the product under test.  A bridge would be on one end of the spectrum, whereas a prototype for a free app that makes funny noises would be on the other.

Assessment Criteria

Cost of Failure

What is the material impact if this thing fails?  If a bridge collapses, it’s life and death and a ton of money.  Similarly, in a stock trading app, there are potentially big dollar and legal impacts when the numbers are wrong.  On the contrary, an occasional failure in a dating app would annoy customers and maybe drive a few of them away, but wouldn’t be catastrophic. Bridges and stock trading have higher costs of failure and thus merit more rigorous testing.

Amount of Use

How often is this thing used and by how many people?  In other words, if a failure happens in this component, how widespread will the impact be?  A custom report that runs once a month gets far less use than the login page.  If the latter fails, a great number of users will feel the impact immediately.  Thus, I really want to make sure my login page (and similar) are well-tested.

Visibility

How visible is the component?  How easy will it be for customers to see that it’s broken?  If it’s a backend component that only affects engineers, then customers may not know it’s broken until they start to see second-order side effects down the road.  I have some leeway in how I go about fixing such a problem.  In contrast, a payment processing form would have high visibility.  If it breaks, it will give the impression that my app is broken big-time and will cause a fire drill until it is fixed.  I want to increase testing with increased visibility.

Lifespan

This is a matter of return on effort.  If the thing I’ve built is a run-once job, then any bugs will only show up once.  On the other hand, a piece of code that is core to my application will last for years (and produce bugs for years).  Longer lifespans give me greater returns on my testing efforts.  If a little extra testing can avoid a single bug per month, then that adds up to a lot of time savings when the code lasts for years.

Difficulty of Repair

Back to the bridge example, imagine there is a radio transmitter at the top.  If it breaks, a trained technician would have to make the climb (several hours) to the top, diagnose the problem, swap out some components (if he has them on hand), then make the climb down.  Compare that with a small crack in the road.  A worker spends 30 minutes squirting some tar into it at 3am.  The point here is that things which are more difficult to repair will result in a higher cost if they break.  Thus, it’s worth the larger investment of testing up front.  It is also worth mentioning that this can be inversely related to visibility.  That is, low visibility functionality can go unnoticed for long stretches and accumulate a huge pile of bad data.

Complexity

Complex pieces of code tend to be easier to break than simple code.  There are more edge cases and more paths to consider.  In other words, greater complexity translates to greater probability of bugs.  Hence, complex code merits greater testing.
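
To make the framework concrete, here is a small Scala sketch that encodes the six criteria and turns the scores into a rough testing recommendation. The plain sum and the cutoffs are my own simplification, not a formal rule:

// Sketch of the scoring framework; the unweighted sum and cutoffs are illustrative.
object TestingEffort {

  final case class Assessment(
      costOfFailure: Int,      // 1 (trivial) to 5 (catastrophic)
      amountOfUse: Int,
      visibility: Int,
      lifespan: Int,
      difficultyOfRepair: Int,
      complexity: Int) {
    def total: Int =
      costOfFailure + amountOfUse + visibility + lifespan + difficultyOfRepair + complexity
  }

  def recommendation(a: Assessment): String =
    if (a.total >= 24) "test continually, as much as possible"
    else if (a.total >= 15) "moderate to heavy testing"
    else "light testing"

  def main(args: Array[String]): Unit = {
    val goldenGateBridge = Assessment(5, 5, 5, 5, 5, 4) // scores from the example below
    val catDatingApp     = Assessment(1, 4, 4, 1, 1, 1)
    println(recommendation(goldenGateBridge)) // test continually, as much as possible
    println(recommendation(catDatingApp))     // light testing
  }
}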

Examples

Golden Gate Bridge

This is a large last-forever sort of project.  If we get it wrong, we have a monumental (literally) problem to deal with.  Test continually as much as possible.

Criterion Score
Cost of failure 5
Amount of use 5
Visibility 5
Lifespan 5
Difficulty of repair 5
Complexity 4

Cat Dating App

Once the word gets out, all of the cats in the neighborhood will be swiping in a cat-like unpredictable manner on this hot new dating app.  No words, just pictures.  Expect it to go viral then die just as quickly.  This thing will not last long and the failure modes are incredibly minor.  Not worth much time spent on testing.

Criterion Score
Cost of failure 1
Amount of use 4
Visibility 4
Lifespan 1
Difficulty of repair 1
Complexity 1

Enterprise App — AMEX Payment Processing Integration

Now, we get into the nuance.  Consider an American Express payment processing integration i.e. the part of a larger app that sends data to AMEX and receives confirmations that the payments were successful.  For this example, let’s assume that only 1% of your customers are AMEX users and they are all monthly auto-pay transactions.  In other words, it’s a small group that will not see payment failures immediately.  Even though this is a money-related feature, it will not merit as much testing as perhaps a VISA integration since it is lightly used with low visibility.

Criterion Score
Cost of failure 2
Amount of use 1
Visibility 1
Lifespan 5
Difficulty of repair 2
Complexity 2

Enterprise App — De-duplication of Persons Based on Demographic Info

This is a real problem for TrueAccord.  Our app imports “people” from various sources.  Sometimes, we get two versions of the same “person”.  It is to our advantage to know this and take action accordingly in other parts of our system.  Person-matching can be quite complex given that two people can easily look very similar from a demographic standpoint (same name, city, zip code, etc.) yet truly be different people.  If we get it wrong, we could inadvertently cross-pollinate private financial information.  To top it all off, we don’t know what shape this will take long term and are in a pre-prototyping phase. In this case, I am dividing the testing assessment into two parts: prototyping phase and production phase.

Prototyping

The functionality will be in dry-run mode.  Other parts of the app will not know it exists and will not take action based on its results.  Complexity alone drives light testing here.

Criterion Score
Cost of failure 1
Amount of use 1
Visibility 1
Lifespan 1
Difficulty of repair 1
Complexity 4

Production

Once adopted, this would become rather core functionality with a wide-sweeping impact.  If it is wrong, then other wrong data will be built upon it, creating a heavy cleanup burden and further customer impact.  That being said, it will still have low visibility since it is an asynchronous backend process.  Moderate to heavy testing is needed here.

Criterion Score
Cost of failure 4
Amount of use 3
Visibility 1
Lifespan 3
Difficulty of repair 4
Complexity 4

Testing at TrueAccord

TrueAccord is three years old.  We've found product-market fit and are on the road to success (fingers crossed).  At this juncture, engineering time is a bit scarce, so we have to be wise in how it is allocated.  That means we don't have the luxury of 100% test coverage.  Though we don't formally apply the above heuristics, they are evident in the automated tests that exist in our system.  For example, two of our larger test suites are PaymentPlanHelpersSpec and PaymentPlanScannerSpec at 1500 and 1200 lines respectively.  As you might guess, these are related to handling customers' payment plans.  This is fairly complex, highly visible, highly used core functionality for us.  Contrast that with TwilioClientSpec at 30 lines.  We use Twilio very lightly, with low visibility and low cost of failure.  Since we are only calling a single endpoint on their API, this is a very simple piece of code.  In fact, the testing that exists is just for a helper function, not the API call itself.

I’d love to hear about other real world examples, and I’d love to hear if this way of thinking about testing would work for your software startup.  Please leave us a comment with your point of view!

Skipping Photoshop: How we made ID Badge creation 10x faster by using facial recognition

By on November 1st, 2016 in Engineering and Data, Product and Technology

Recently TrueAccord has grown to the size where our compliance stance requires the addition of photo ID badges. It’s a rite of passage all small-but-growing companies endure and ours is no different.

Since I have previous experience setting up badge systems and dealing with the printers, I volunteered to kick off this process. I've evaluated pre-existing badge creation software in the past and found it all significantly lacking. In a previous environment, I wrote my own badge creation software, which fit the needs at the time. The key phrase being “at the time”. For tech startups, it's not unusual to go from onboarding one person every other week to 10 people a week in a year or two. That means every manual onboarding step goes from “oh well, it's just once every other week” to “we need to dedicate several hours of someone's time every week to this process.” Typically, that same growth period also happens to be when your operations organizations (IT, Facilities, and Office Admin) are the most short-staffed and the least likely to have the free time to do that. “Where is this going?” and “How much work does this mean for me?”, you ask? Allow me to share how I automated our badge system – Photoshop included.


Repos: How we use MySQL as a key-value store

By on July 21st, 2016 in Engineering and Data, Product and Technology

When we started TrueAccord in 2013, we used MySQL to store our data in a pretty traditional way. As business requirements came in, we found ourselves continuously migrating our table schemas to add more columns and more tables. Before MySQL 5.6, these schema changes would lock down the database for the entire duration of a change, causing brief downtime. When the company was smaller and just starting out this was tolerable, but as we grew, the increase in schema complexity was getting harder to manage via SQL migration scripts.

We were looking for an alternative, something like Bigtable, the key-value store I used back at Google. Using a key-value store lets us store an entire document as a value, eliminating the need for migrations. We investigated several publicly available key-value stores, but none of them met our major requirements at the time. As a small engineering team, we wanted a hosted, fully managed database solution, so that backups and server migrations would be taken care of for us. Additionally, we wanted security features like encryption at rest. DynamoDB came the closest to matching our requirements, but was missing encryption at rest.

We came across this old post from FriendFeed that describes a high-level design meeting our requirements, and it inspired our implementation. First, we chose MySQL (now Aurora) managed by Amazon RDS as our backing datastore. This satisfies the requirement for a hosted, managed, encrypted database, and it is a battle-tested database. Then, for the key-value interface (to avoid schema migrations), we built a thin library called Repos that provides a key-value interface implemented on top of MySQL. Now we have something that allows us to move quickly on top of a reliable datastore.
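
Conceptually, Repos exposes a small key-value interface. The sketch below is hypothetical (the method names are illustrative, not the actual library API), but it captures the shape of what the following sections describe:

import java.util.UUID

// Hypothetical interface sketch; the real Repos API may differ.
trait Repo[T] {
  // Append a new version to the log table and upsert it into the "latest" table.
  def put(id: UUID, value: T): Unit

  // Read the latest version of an entity, if any.
  def get(id: UUID): Option[T]

  // Full history of an entity, oldest first, reconstructed from the log table.
  def history(id: UUID): Seq[T]
}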

Enter Repos

Each repo represents a map from a UUID (key) to an arbitrary array of bytes (value). Each repo is stored in MySQL using two tables. The first table is the log table: every time we want to insert or update an entity, we insert a new row into this table.

Column name   Type         Description
pk            bigint(20)   Auto-incremented primary key
uuid          binary(16)   Unique id for each entry
time_msec     bigint(20)   Time inserted, in milliseconds
format        char(1)      Describes the format of the entry_bin column
entry_bin     longblob     The value

We always append to this table, never updating an existing row. By doing so, we get the full history of every object. This has proven to be really handy for debugging why a change has occurred, and when.

The format column can take two possible values: ‘1’ means the value in entry_bin is a serialized protocol buffer, and ‘2’ means it is compressed using Snappy (a compression scheme that aims for high speed and reasonable compression).
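
For illustration, reading and writing that format flag might look like the sketch below, assuming the org.xerial.snappy library and protobuf bytes produced elsewhere; the helper names are our own for this example.

import org.xerial.snappy.Snappy

// Illustrative codec for the format flag described above; helper names are hypothetical.
object ValueCodec {
  val FormatProtobuf: Char       = '1' // entry_bin is a serialized protocol buffer
  val FormatSnappyProtobuf: Char = '2' // entry_bin is Snappy-compressed

  // Return the raw protobuf bytes for a row, decompressing if needed.
  def decode(format: Char, entryBin: Array[Byte]): Array[Byte] = format match {
    case FormatProtobuf       => entryBin
    case FormatSnappyProtobuf => Snappy.uncompress(entryBin)
    case other                => throw new IllegalArgumentException(s"Unknown format: $other")
  }

  // Compress when it actually saves space, and record which format was written.
  def encode(protoBytes: Array[Byte]): (Char, Array[Byte]) = {
    val compressed = Snappy.compress(protoBytes)
    if (compressed.length < protoBytes.length) (FormatSnappyProtobuf, compressed)
    else (FormatProtobuf, protoBytes)
  }
}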

To optimize look-ups, we have another table, the “latest” table, with the following format:

Column name   Type         Description
parent_pk     bigint(20)   PK of this entry in the log table
uuid          binary(16)   The unique id of the entry (here it is the primary key)
format        char(1)      Describes the format of the entry_bin column
entry_bin     longblob     The value

 

Whenever we insert an element into the log table, we also upsert it into this table, so it always holds the latest version of each element. We do both writes in a single transaction to ensure the tables stay in sync.
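
A simplified JDBC sketch of that write path is shown below. Table and column names follow the description above, but connection handling and error handling are stripped down, so treat it as an outline rather than the actual implementation:

import java.sql.Connection

// Simplified write path: append to the log table and upsert into "latest" in one transaction.
object RepoWriter {

  def put(conn: Connection, uuid: Array[Byte], timeMsec: Long,
          format: String, entryBin: Array[Byte]): Unit = {
    conn.setAutoCommit(false)
    try {
      val insertLog = conn.prepareStatement(
        "INSERT INTO log (uuid, time_msec, format, entry_bin) VALUES (?, ?, ?, ?)")
      insertLog.setBytes(1, uuid)
      insertLog.setLong(2, timeMsec)
      insertLog.setString(3, format)
      insertLog.setBytes(4, entryBin)
      insertLog.executeUpdate()

      // LAST_INSERT_ID() picks up the auto-incremented pk of the log row we just wrote.
      val upsertLatest = conn.prepareStatement(
        "INSERT INTO latest (parent_pk, uuid, format, entry_bin) " +
          "VALUES (LAST_INSERT_ID(), ?, ?, ?) " +
          "ON DUPLICATE KEY UPDATE parent_pk = VALUES(parent_pk), " +
          "format = VALUES(format), entry_bin = VALUES(entry_bin)")
      upsertLatest.setBytes(1, uuid)
      upsertLatest.setString(2, format)
      upsertLatest.setBytes(3, entryBin)
      upsertLatest.executeUpdate()

      conn.commit()
    } catch {
      case e: Exception =>
        conn.rollback()
        throw e
    }
  }
}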

Secondary Index Implementation

The first hurdle when going down this route is secondary indexes. For example, if your repo maps a user id to account information (email, hashed password, full name), how would you look up an account by email? To do so, we implemented index tables. An index table maps values from the key-value store to a primitive value that MySQL can index. A single repo may have multiple indexes, and each one gets its own table. Index tables have the following layout:

Column name   Type         Description
parent_pk     bigint(20)   PK of this entry in the log table
uuid          binary(16)   The id of the indexed entity
value         *            The indexed value (for example, the email address of the user)

 

We always insert into the secondary index, so over time it will contain stale values. To handle that, when querying we join uuid and parent_pk against the latest table and return a result only if there is a match.

For example, if we have a person with id “idA” who changed their email, the log table would look like this:

pk    uuid   time_msec   value (format, entry_bin)
501   idA    t1          {"user": "john", "email": "john@example.com"}
517   idA    t2          {"user": "john", "email": "john@domain.com"}

 

The latest table would have only the updated row:

parent_pk   uuid   value (format, entry_bin)
517         idA    {"user": "john", "email": "john@domain.com"}

 

The email index table would have the email value for each version of the object:

parent_pk   uuid   value
501         idA    john@example.com
517         idA    john@domain.com

 

Now, to find an account whose latest email value is “john@domain.com”, the Repos library would build a query similar to this:

SELECT l.uuid, l.format, l.entry_bin FROM latest AS l, email_index AS e
  WHERE e.value = 'john@domain.com' AND
        e.uuid = l.uuid AND e.parent_pk = l.parent_pk

Our Repos library provides a nice Scala API for querying by index. For example,

accountsRepo.byEmail.all("john@domain.com")

returns all the accounts that currently have this email address.
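
For a sense of how little code an index involves, here is a purely hypothetical sketch of how a repo and its secondary index might be declared; the real declaration API is not shown in this post, so all names here are invented.

import java.util.UUID

final case class Account(id: UUID, user: String, email: String)

// Hypothetical declaration API; in the real library, declaring an index is enough for the
// table janitor (described below) to create and backfill the corresponding MySQL table.
trait Index[T, V] {
  def all(value: V): Seq[T] // every entity whose latest version has this indexed value
}

class RepoDefinition[T](val tableName: String) {
  // Stub implementation standing in for the real MySQL-backed index tables.
  protected def index[V](name: String)(extract: T => V): Index[T, V] =
    new Index[T, V] { def all(value: V): Seq[T] = Seq.empty }
}

object AccountsRepo extends RepoDefinition[Account]("accounts") {
  val byEmail: Index[Account, String] = index("email")(_.email)
}

With a declaration like this, the byEmail.all(...) call above resolves to the join query over the latest and index tables shown earlier.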

Using Table Janitor to Manage Our Tables and Indexes

The table janitor is a process implemented as an Akka actor that runs on our JVMs. This actor is responsible for two main tasks:

  1. Ensuring that the underlying MySQL tables are created. It does this by reflecting on all of the Repos and indices defined in the code and then creating the corresponding MySQL tables. This makes adding a new repo or a new index as simple as defining it in the code.
  2. Ensuring that the indices are up to date. This is necessary because when a new index is added, there may still be servers running an old version of the code that does not write to the new index. The table janitor regularly monitors the log tables and (re-)indexes every new record. Adding an index to an existing repo is easy – we just declare it in the code.

How We Do Analytics

We use AWS Data Pipeline to incrementally dump our log tables into S3. We then use Spark (with ScalaPB) for big data processing, and we also upload a snapshot to Google's BigQuery. Since all our repos use protocol buffers as their value type, we can automatically generate BigQuery schemas for each repo.

Pros and Cons of Our Approach

By writing Repos and having all our database access go through it, we get a lot of benefits:

  • Uniformity: having all our key-value maps be repos means that every optimization and every improvement applies to all our tables. For example, when we build a view that shows an object's history, it works for all of our repos.
  • Schema evolution is free when using protocol buffers as values. We can add optional fields, rename existing fields, or convert an optional field to a repeated one, and it just works.
  • Security: storing data securely on RDS is a breeze. Encryption at rest? Click a checkbox. Require data encryption in transit? SSL is supported by default.
  • Reliability: we have never had the RDS MySQL (later Aurora) instances go down, aside from rare scheduled maintenance windows that require a reboot, and we have never lost data. Additionally, RDS lets us recover the database to any point in time by replaying binary logs on top of a snapshot.
  • Ease of use: adding a repo or an index is trivial. All of our ~60 repos work in exactly the same way and are accessed through the same programmatic interface, so our engineers can easily work with any of them.
  • Optimization/monitoring/debugging: since MySQL is a mature and well-understood technology, there is a plethora of documentation on how to tune it and how to debug problems. In addition, AWS provides a lot of metrics for monitoring how an RDS instance is doing.

However, there are also downsides:

  • Storing binary data in MySQL limits what can be done using the command line MySQL client. We had to write a command line tool (and a UI) to look up elements by key so we can debug. For more complex queries, we use Spark and BigQuery for visibility into our data.
  • Being a homegrown solution, we occasionally had to spend time tuning our SQL queries as our repos grew in size. On the positive side, scaling up due to business growth is a good problem to have, and fixing it for one repo improved all the others.
  • The JDBC stack has multiple layers (JDBC, HikariCP, the MySQL connector), and we had quite a few issues where it was tricky to pinpoint the source of a problem.

Alternatives: What the Future Looks Like

As much as we like our homegrown solution, we are continuously thinking about what our next storage solution will look like.

  • Current versions of both MySQL and Postgres come with built-in support for indexing JSON documents.
  • Google now offers a publicly hosted version of Bigtable.
  • We are moving towards having our data represented as a stream of events which may benefit from a different data store.

Success

The Repos implementation has enabled our engineering team to quickly develop a lot of new functionality and to iterate on the data schema. By building on top of RDS, we have the peace of mind that our data is safe and our servers are up to date with all the security patches. At the same time, having full control over the implementation details of Repos allowed us to quickly implement additional security measures to satisfy the stringent requirements of card issuers and other financial institutions, without sacrificing development speed.