Upgrading our underwriting stack

Learn how we migrated our offer computation process and how it's helping us scale now.

Matthew Zhu

Parafin offers fast, flexible, and fairly priced capital to small businesses at scale.

Underwriting and offer computation are the base ingredients of a financially sustainable capital program. They determine the sizing and pricing of capital offers. Parafin works with platform partners to generate offers for hundreds of thousands of their small business merchants every day. These offers range from a few thousand dollars to millions of dollars. Nailing underwriting and offer computation ensures that we rarely have to turn down pre-approved merchants, creating the best possible experience for our customers.

Check out some of the problems we solved and lessons we learned from the offer computation migration we completed earlier this year to get an idea of what it’s like working on Parafin’s engineering team.

Context

Historically, we had split the two processes of underwriting and offer computation between our Data Science and Engineering teams:

- Data Science owned underwriting, answering questions like: “How can we predict this business’s future cash flow?” and “What range of parameters (such as capital amounts, fees, durations, repayment rates, etc.) are we willing to give them in their offer?” Data Science then passed a risk profile to Engineering via AWS S3.

- Engineering owned offer computation, which read a risk profile from S3 and found an offer that satisfied all of Data Science’s constraints. Engineering handled the offer lifecycle, which included things like surfacing the offer to the merchant, offer expiration, and providing visibility to the software platform partner for administrative purposes.

Both processes require high reliability and observability to keep everything running smoothly at scale. Our previous separation of responsibility worked well at a time when our team and lending volumes were much smaller, our testing infra was far better in our application backend than in our data warehouse, and we anticipated Engineering needing to rapidly iterate on offer computation methods.

Motivations and goals

As we were scaling, we made a couple of major observations:

- Seemingly simple questions like “Why doesn’t this business have an offer even though they should be eligible?” became increasingly complex. This was because both Data Science and Engineering had added incremental tuning, filtering, and bug fixes to the offer computation logic over time. Changes usually heavily involved both Data Science and Engineering.

- Unit tests in our Scala backend repo were the best and fastest way to validate offer computation initially, but we had significantly upgraded our Databricks-based underwriter since then. We had also built out Python workflows for our growing Data Science team to create experiments, run back-tests, and monitor daily underwriting results.

In response, we decided to transition ownership of offer computation from Engineering to Data Science, resulting in this final state:

- Data Science underwrites a business to create a risk profile and also computes an offer balancing all of the parameters. This would unblock a major change to the computation method, which Data Science would redesign from the ground up. It would also prevent minor changes from being bottlenecked by Engineering.

- Engineering directly loads the computed offer and only handles the offer lifecycle. This would reuse Data Science’s newer underwriting and monitoring infrastructure in a more straightforward, efficient way. It also helps Engineering focus on the offer lifecycle and product behavior, which sees much more frequent changes.

Process

This combined two strategic changes we'd been wanting to make for some time: changing the method of offer generation and transitioning our Scala-based offer computation component to Python.

We started by aligning on the schema for the offers to be loaded. This unlocked parallelization whereby Data Science could start writing and iterating on new offer files adhering to the schema while Engineering was able to load and read them even in their incomplete state. Once we had this workflow in place, we were off to the races.
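A shared schema like the one below made this parallelization possible. The field names here are purely illustrative, not Parafin's actual schema; the point is that both teams can validate files against the same shape independently.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical offer schema sketch -- these fields are made up for
# illustration. Data Science writes offer files matching this shape
# while Engineering builds and tests the loader against it.

@dataclass(frozen=True)
class ComputedOffer:
    merchant_id: str
    product_type: str        # e.g. "merchant_cash_advance"
    amount_cents: int        # principal offered
    fee_rate: float          # e.g. 0.10 for a 10% fee
    repayment_rate: float    # share of daily sales withheld
    expires_on: date

    def validate(self) -> None:
        """Cheap structural checks either team can run independently."""
        assert self.amount_cents > 0, "amount must be positive"
        assert 0.0 <= self.fee_rate < 1.0, "fee rate out of range"
        assert 0.0 < self.repayment_rate <= 1.0, "repayment rate out of range"

offer = ComputedOffer("m_123", "merchant_cash_advance",
                      1_000_000, 0.10, 0.15, date(2025, 1, 31))
offer.validate()  # raises AssertionError if any field is malformed
```

Freezing the schema early means incomplete offer files are still structurally loadable, so neither team blocks the other.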

Data science: Changing the method of offer generation

To change the method of offer generation, Data Science created a fresh Python implementation using first-principles thinking. Applying lessons from years of Data Science–Engineering collaboration, the new implementation gave us the kinds of control we wanted and a more realistic path for future evolution.

We used novel mathematical techniques and validated them during development by running the new implementation alongside our existing method and reusing our existing safety checks. Data Science’s safety checks block offers from going out if something looks wrong, like amounts being far too low or high. We ran these checks on both methods and used summary statistics to ensure the old and new methods would not deviate too much across our merchant population.
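A safety check of this kind can be sketched as follows. The bounds and function names here are illustrative, not the actual checks: the idea is to block a batch of offers if any amount looks wrong, and to compute summary statistics on both methods' outputs so the distributions can be compared.

```python
# Illustrative sketch (bounds are made up): block a batch of offers if
# any amount falls outside a sane range, and summarize a batch so the
# old and new methods' distributions can be compared side by side.

MIN_AMOUNT = 1_000       # dollars; hypothetical floor
MAX_AMOUNT = 5_000_000   # dollars; hypothetical ceiling

def check_batch(amounts):
    """Return (ok, violations) for a batch of proposed offer amounts."""
    violations = [a for a in amounts if not (MIN_AMOUNT <= a <= MAX_AMOUNT)]
    return len(violations) == 0, violations

def summarize(amounts):
    """Summary statistics used to compare the two methods' outputs."""
    return {
        "n": len(amounts),
        "mean": sum(amounts) / len(amounts),
        "min": min(amounts),
        "max": max(amounts),
    }

ok, bad = check_batch([25_000, 40_000, 120])  # 120 is suspiciously low
```

Running both methods through the same gate means a regression in either one is caught before any offer reaches a merchant.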

Engineering: Transitioning offer generation from Scala to Python

To transition to the new Python-computed offers, Engineering created scheduled jobs to load the new offer files alongside the existing ones, and then we migrated partners onto the new code paths in phases.

In doing so, we ran into a few sub-problems:

- How do we fight scope creep in unfamiliar territory?

- What is the fastest way to validate our work as we go?

- How do we safely roll everything out?

How do we fight scope creep in unfamiliar territory?

Bundling the Engineering migration with Data Science’s computation change was key in allowing us to discard the various filters and bug fixes that had become difficult to maintain.

The Engineering system for offer computation and creation had two main implementations in its past. The first was when it was initially created, and the second was when we generalized it from merchant cash advances to multiple capital products, creating two parallel pipelines. Both had taken place over a year prior to this effort, and the drift and lack of institutional memory around the two pipelines made detailed scoping difficult.

We wanted to have some kind of direction while avoiding getting bogged down in detail beforehand, especially around code we knew we’d be throwing away. To do this, we set out with a fairly under-scoped approach:

- Create a new “load and create” pipeline to replace the two existing “compute and create” pipelines for merchant cash advances and general capital products.

- Among the creation steps that would be duplicated, sift through the main steps and side effects that Engineering should own. Once identified, blindly refactor them in their existing usages to share logic and generate a dry-runnable plan before executing. In this way, we end up with lots of small no-op changes that are validated early in existing pipelines and can also be flipped on in the new “load and create” pipeline with a single dry-run flag.

- Throw out (or, more precisely, entrust to Data Science’s new method) everything else involving any kind of nontrivial numerical computation or that can be attributed to Data Science requests through git blame.
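The "generate a dry-runnable plan before executing" idea from the second step can be sketched like this (class and step names are hypothetical, not our actual code). Each creation step contributes a described side effect to a plan; a single flag decides whether the plan is printed or executed.

```python
# Sketch of the plan-then-execute pattern described above (names are
# illustrative). Side effects are collected as described actions, and a
# dry_run flag switches between previewing and executing them.

from dataclasses import dataclass, field

@dataclass
class Action:
    description: str      # human-readable summary for dry-run output
    execute: callable     # the actual side effect, deferred

@dataclass
class OfferCreationPlan:
    actions: list = field(default_factory=list)

    def add(self, description, fn):
        self.actions.append(Action(description, fn))

    def run(self, dry_run: bool = True):
        results = []
        for action in self.actions:
            if dry_run:
                results.append(f"DRY RUN: {action.description}")
            else:
                results.append(action.execute())
        return results

plan = OfferCreationPlan()
plan.add("insert offer row", lambda: "inserted")
plan.add("notify partner dashboard", lambda: "notified")
preview = plan.run(dry_run=True)  # no side effects executed
```

Because each refactored step is a no-op until the flag flips, the same steps can be validated in the existing pipelines and then enabled in the new "load and create" pipeline with a single switch.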

Our largest source of scope creep came from cleaning up the multiple product-specific pipelines back into a general capital product pipeline. This brought the number of distinct offer creation pipelines from three (one new and two existing) back down to one — just the new one. It also completed the largest remaining step of a different migration that had otherwise wrapped up work several months earlier.

What is the fastest way to validate our work as we go?

As we built momentum, we used a development branch containing the initial offer-loading job, merging in changes from the main branch as incremental refactors completed.

In order to minimize massive scary moments towards the end, we ran this in our production environment with the dry-run flag to analyze its results every day on the latest offers in S3.

It took several weeks before all the refactors were complete and the initial offer-loading job was merged into our main branch and added into our pipeline to dry-run on a scheduled basis. At that point, after acting as its human schedulers for so long, we already had a strong understanding of its capabilities and failure modes.

How do we safely roll everything out?

We chose to roll out on a partner-by-partner basis using a feature flag, segmented into phases by how many merchants were eligible per partner. During rollout, we monitored core metrics like conversion rates and origination volume to ensure they were not immediately negatively impacted. The rollout itself was smooth due to all of the safety checks and monitoring we added to understand potential issues.
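The phased gating can be sketched as below. The phase thresholds and function names are made up for illustration; the mechanism is simply that partners are bucketed by eligible-merchant count, and the flag records which buckets are live.

```python
# Illustrative sketch of partner-by-partner gating (thresholds and
# names are hypothetical). Partners are bucketed into rollout phases by
# eligible-merchant count; the feature flag stores the live phases.

ROLLOUT_PHASES = {1: 1_000, 2: 50_000, 3: float("inf")}  # phase -> max eligible merchants

def phase_for(eligible_merchants: int) -> int:
    """Smallest phase whose cutoff covers this partner's merchant count."""
    for phase, cutoff in sorted(ROLLOUT_PHASES.items()):
        if eligible_merchants <= cutoff:
            return phase
    return max(ROLLOUT_PHASES)  # unreachable while a cutoff is infinite

def use_new_pipeline(eligible_merchants: int, live_phases: set) -> bool:
    return phase_for(eligible_merchants) in live_phases

# With only phase 1 live, small partners get the new path first.
small_partner = use_new_pipeline(400, {1})        # True
large_partner = use_new_pipeline(200_000, {1})    # False
```

Rolling out smallest partners first limits the blast radius of any issue while still exercising the full pipeline end to end.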

In addition to aggregate summary statistics generated during underwriting, we added checks on the engineering side for the product experience.

As this was more significant than a typical tweak to offer computation logic, we wanted to understand whether we needed additional time and planning to communicate impact on our partners’ capital programs. Aggregating by partner was already helping isolate issues to individual partners’ offers. We also wanted to gain insight into impacts on individual businesses’ offers, which partners sometimes ask about directly for their platform’s most engaged businesses.

Before rollout, we generated visualizations like these every day to quickly understand the impact on aspects of offers we cared about. They showed that, for every merchant, the offer just written to the database by the old method did not diverge significantly from the new offer loaded in dry-run, which would have been written had the rollout been complete.

This is the monitoring output for one select partner on a day of shadow mode preceding the rollout

This diagram allowed us to conclude that, for this partner on this day:

- All merchants would have seen a max offer amount deviating no more than -4% to +10% relative to the old method’s amount (i.e. if a merchant’s old max offer amount was $10,000, we can be certain their new offer would have been between $9,600 and $11,000).

- All merchants would have seen a fee rate deviating by no more than +4.5% relative to the old method’s rate (i.e. if a merchant’s old offer fee was 10%, we can be certain their new offer fee would have been no higher than 10.45%).

The diagrams weren’t a perfect check — they were just a step above manually sifting through individual offers to ensure they made sense — but discussions with Data Science on how the computation works gave us sufficient confidence to proceed with rolling out.

Results and what’s next

Since the migration, we’ve had a much easier time debugging offer creation on the Engineering side and answering questions about offers. This has allowed us to continue scaling our lending products. We’ve also unlocked new ways to control offers for Data Science:

- Previously, our Data Science team could only optimize the final offer amount with adjustable parameters. Now, Data Science can hold known variables constant while tuning the adjustable parameters to optimize for a wider variety of objectives. We’re actively exploring an even cleaner separation of fixed known variables and adjustable parameters.

- A partner recently asked us to modify fees for some select large merchants. This previously would have required a request to Engineering or hacks to modify an internal target value that indirectly expresses the constraint. After the migration, Data Science was able to directly express this constraint using the much richer parameter set.

We’re also exploring ways to interpret and debug the space of potential offers as parameters change, a workflow that would have been much more difficult to untangle before the migration. Check out this anonymized view of a tool to explore the space of constraint interactions below.

The slider allows us to manipulate an individual variable to view different slices of the offer space, which gives us a better understanding of which offers are possible, which ones are not, and why

We’re looking forward to continued collaboration so we can serve small businesses even more effectively, efficiently, and delightfully.

If you found this interesting, please reach out. We’re hiring!
