Prospects sometimes ask us what the difference is between BioRaptor and SIMCA (by Sartorius). We hope this post clarifies the two technologies and their best use cases.
TL;DR - SIMCA is great for multivariate analysis to find patterns and classify datasets into groups, but you need clean and structured data for it to work. It also shows you that there is a deviation between batches without telling you the root cause of those deviations. With BioRaptor, you don’t need to start with clean and organized data, something that usually takes a long time to achieve. BioRaptor ingests your online, in-line and off-line data in whatever state it comes, cleans and structures it, and eliminates the manual data handling you would otherwise have to do before you can even start the analysis. Then it easily compares as many batches as you want, shows deviations, and lets you dig in to find the root cause of those deviations.
What is SIMCA
SIMCA has been part of upstream and MSAT toolkits for decades for a good reason. When you have a clean and structured dataset, it is one of the most effective ways to do a multivariate analysis without building custom pipelines.
SIMCA uses Principal Component Analysis (PCA) and related methods (PLS, OPLS) to help engineers. It can classify batch results into groups and then tell you whether additional datasets fit into one group or another, or don’t fit into any group at all. That works for detecting deviations, BUT it doesn’t tell you WHY a batch is deviating.
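To make that concrete, here is a minimal sketch of the kind of PCA-based grouping SIMCA performs, written with scikit-learn rather than SIMCA itself; the batch names, variables and values are hypothetical.

```python
# Minimal sketch of PCA-based batch grouping (illustrative only, not SIMCA itself).
# Assumes each row is one batch and each column is an aggregated process variable.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical batch-level summary table: one row per batch, one column per variable.
batches = pd.DataFrame({
    "final_titer":     [4.1, 4.3, 4.2, 2.9, 3.0],
    "peak_OD":         [52, 55, 53, 41, 43],
    "total_base_ml":   [310, 305, 315, 420, 415],
    "avg_DO_post_ind": [30, 31, 29, 22, 23],
}, index=["B01", "B02", "B03", "B04", "B05"])

# Scale to unit variance, then project onto the first two principal components.
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(batches))

# In a scores plot, "good" and "bad" batches separate into clusters --
# which flags that B04/B05 deviate, but says nothing about why.
print(pd.DataFrame(scores, index=batches.index, columns=["PC1", "PC2"]))
```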
The most important piece is that you need clean and structured data sets for SIMCA to work. And getting your data clean and structured usually takes a lot of manual work.
If you want SIMCA to work well during deviation investigations, what you really need is a team of interns: exporting time series, aligning timestamps, reconstructing phases, extracting aggregate results, hunting for lot numbers, and translating metadata that lives as free text into something you can analyze.
SIMCA limitations
The biggest limitations are 1) the time it takes to prepare a dataset you can use with SIMCA, 2) it tells you that batches are different without attempting to explain why, and 3) it reflects your own experience and biases - what you get at the end is limited by what you decide to derive from your data.
Building datasets
Before you can run SIMCA analysis, you need a dataset worth modeling, and that’s where the first bottleneck comes in.
The time spent finding, cleaning and contextualizing data, compared with the time spent actually modeling, is hugely underestimated. Preparing a dataset can take days or even weeks.
It usually involves operational archaeology, tracking down:
- Bioreactor time series from historian/SCADA/vendor exports (DO, pH, agitation, gas flows, feed rates, base addition, temperature, setpoints, alarms)
- Offline assays from LIMS/ELN/spreadsheets (titer, OD/CDW, metabolites, quality readouts, induction timing)
- Batch record / MES events (holds, manual overrides, interventions, sensor calibrations)
- Materials + lots: media components, feeds, trace elements, antifoam, supplements… and our hero: IPTG
- Free text: shift handovers, operator notes, emails like “we switched the bottle” or “new lot arrived”
Then the interns (or that one unfortunate process engineer) do the real work: aligning timestamps across systems, normalizing units and tag names, reconstructing phases (“when did induction actually start?”), and deciding which batches are actually comparable rather than which ones just look “kind of similar.” Only after that do you get to open SIMCA and paste in your data.
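For a taste of what that prep work looks like, here is a hedged sketch of just one of those steps - matching hand-logged offline samples to the nearest reading in an online historian export - using pandas; the tag and column names are invented.

```python
# One small slice of the manual prep work: aligning offline assay timestamps
# to the online reactor trace. Column and tag names here are hypothetical.
import pandas as pd

# Online historian export: one value per hour for a single tag.
online = pd.DataFrame({
    "timestamp": pd.date_range("2024-05-01 08:00", periods=4, freq="60min"),
    "DO_percent": [45.0, 44.2, 38.5, 30.1],
})

# Offline samples logged by hand, with timestamps that never line up exactly.
offline = pd.DataFrame({
    "sample_time": pd.to_datetime(["2024-05-01 09:07", "2024-05-01 10:55"]),
    "glucose_g_per_L": [12.3, 8.1],
})

# Match each offline sample to the nearest online reading within 15 minutes.
aligned = pd.merge_asof(
    offline.sort_values("sample_time"),
    online.sort_values("timestamp"),
    left_on="sample_time", right_on="timestamp",
    direction="nearest", tolerance=pd.Timedelta("15min"),
)
print(aligned)
```

Multiply that by every tag, every assay and every batch, and the days-to-weeks estimate stops sounding exaggerated.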
SIMCA finds patterns but not the reasons
SIMCA does what it’s designed to do. It finds patterns, but doesn’t explain them.
After running the usual multivariate analyses, PCA and PLS, the team finally sees something emerge. The “bad” batches separate cleanly from the rest. They cluster together, clearly distinct. A handful of variables start to stand out. Perhaps base addition behavior shifts after induction. Or dissolved oxygen (DO) control drifts late in the run. Induction timing isn’t quite as consistent as expected.
What multivariate analysis gives you, especially in deviation work, is a map of symptoms. It tells you which variables moved together and when the process began to diverge. What it doesn’t tell you is why any of that happened.
You’re left with questions to investigate. And investigation capabilities are not part of what SIMCA offers.
SIMCA relies on your experience
Before SIMCA can find patterns, you have to decide which variables to include and how to represent them. That means you're making critical modeling decisions upfront - often before you know what's actually driving your deviation.
Should you include the raw DO trace, or the cumulative oxygen uptake? Feed rate as an absolute value, or normalized to biomass? Do you calculate specific productivity, or let the model figure it out from titer and VCD separately? What about the rate of base addition during induction versus total base consumed? Do you take the OD at 6 hours, or at 40 hours?
These choices define what the model can see. If you don't calculate the right derived variable - say, the ratio of glutamine consumption to lactate production during the exponential phase - SIMCA will never surface it as a contributor, even if that's exactly where your process is drifting.
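As a rough illustration (with made-up numbers and column names), here is what computing that kind of derived variable by hand looks like - and why it only exists in your dataset if someone thought to build it:

```python
# Illustration of the "derived variable" problem: a ratio SIMCA will never see
# unless someone computes it first. Data and column names are hypothetical.
import pandas as pd

samples = pd.DataFrame({
    "age_h":        [0, 6, 12, 18, 24],
    "phase":        ["lag", "exp", "exp", "exp", "stat"],
    "glutamine_mM": [4.0, 3.6, 2.9, 2.1, 1.8],
    "lactate_mM":   [0.1, 0.4, 1.1, 1.9, 2.3],
})

# Restrict to the exponential phase, then compare consumption vs. production.
exp_phase = samples[samples["phase"] == "exp"]
gln_consumed = exp_phase["glutamine_mM"].iloc[0] - exp_phase["glutamine_mM"].iloc[-1]
lac_produced = exp_phase["lactate_mM"].iloc[-1] - exp_phase["lactate_mM"].iloc[0]

# This single number becomes one more column in the batch-level dataset --
# but only if you thought to calculate it.
print("gln/lac ratio (exp phase):", round(gln_consumed / lac_produced, 2))
```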
This is why SIMCA rewards experience. The senior engineer who's spent years on the process knows which transformations reveal signals. Everyone else is guessing.
Why this isn’t SIMCA’s fault (and also why SIMCA isn’t enough)
SIMCA is an analysis workbench. It does exactly what it was built to do: identify correlations in wide, multivariate datasets. For that purpose, it’s extremely effective. When you want to see which variables move together across runs, SIMCA delivers.
The problem is that most deviations are not, at their core, chemometrics problems. They are data integration and context problems.
The reason the process behaved the way it did is rarely encoded in the time series alone. The real “why” usually lives in the surrounding context:
- which media formulation was used
- which raw material lots were involved
- what interventions happened during the run
- when they occurred
Often, it’s buried in free-text notes: a bottle was swapped, a pump started sounding off, a hold was introduced during a shift change. Sometimes it’s in offline assays that don’t quite line up with the online data, or in sampling timestamps that are slightly off but never corrected.
You can run a flawless PCA, see clean separation, and still come to the wrong conclusion, simply because the one contextual variable that actually mattered never made it into the dataset at all.
How BioRaptor is different from SIMCA
From the start, BioRaptor addresses the dataset problem. Because BioRaptor ingests all your online, in-line and off-line data in whatever format it comes and structures it, it eliminates the manual data handling that takes up most of the time. Additionally, it takes in and organizes all the context that is essential to understanding the root cause of any deviation.
BioRaptor treats a run as a joined object:
- Bioreactor time series (raw + contextualized, with tag harmonization)
- Offline assays aligned to the run timeline
- Events (alarms, holds, interventions, calibrations)
- Structured metadata (strain, recipe version, reactor, scale, seed train, etc.)
- Materials + lots (media components, feeds, supplements, IPTG, antifoam, buffers)
- Free text captured and made usable alongside structured data (not buried in a description field)
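As a loose, hypothetical sketch - not BioRaptor's actual schema - a "joined run object" along those lines could be thought of as something like this:

```python
# Hypothetical sketch of a run as one joined object. The class and field names
# are illustrative assumptions, not BioRaptor's real data model.
from dataclasses import dataclass, field
from typing import Any
import pandas as pd

@dataclass
class Run:
    run_id: str
    metadata: dict[str, Any]          # strain, recipe version, reactor, scale, seed train
    timeseries: pd.DataFrame          # harmonized online tags on one shared timeline
    offline_assays: pd.DataFrame      # assay results aligned to that timeline
    events: list[dict[str, Any]]      # alarms, holds, interventions, calibrations
    material_lots: dict[str, str]     # component name -> lot number
    notes: list[str] = field(default_factory=list)  # free text kept alongside the data
```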
Having all your data structured, harmonized and contextualized allows you to dive into analysis right away. You can easily batch your runs by any criteria you choose, quickly change those criteria, and see the results immediately.
Because you’re not spending days or weeks preparing your datasets, the number and speed of questions you can ask fundamentally change. You’re no longer confined to a handful of parameters that take forever to check against your yield.
But this has an even more important advantage: because BioRaptor automatically generates hundreds of process-relevant features - specific rates, phase-segmented metrics, cumulative values, ratios, slopes - it does not depend on your hypothesis about what might matter. The platform evaluates all potential contributors against your outcome of interest and surfaces the factors with the strongest correlation.
This means the analysis isn't constrained by what you thought to include or what the most experienced person on the team would have checked first. A media lot effect that no one suspected gets flagged the same as the obvious DO excursion everyone noticed. The data leads; your assumptions don't filter what the model can see.
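A highly simplified sketch of the idea - with invented features and outcomes, and not BioRaptor's actual feature set or algorithm - is to generate many candidate features per run and then rank them by how strongly they relate to the outcome:

```python
# Simplified illustration: rank many candidate features by correlation with the
# outcome instead of pre-selecting the ones you already suspect. All values are
# hypothetical and this is not BioRaptor's actual algorithm.
import pandas as pd

# Hypothetical batch-level table: generated features plus the outcome (titer).
runs = pd.DataFrame({
    "titer":              [4.1, 4.3, 4.2, 2.9, 3.0, 4.0],
    "max_base_rate":      [1.1, 1.0, 1.2, 2.4, 2.3, 1.1],
    "DO_slope_post_ind":  [-0.2, -0.1, -0.2, -0.9, -0.8, -0.3],
    "media_lot_B_used":   [0, 0, 0, 1, 1, 0],   # encoded metadata, not just time series
    "exp_phase_duration": [11.8, 12.1, 12.0, 11.9, 12.2, 12.0],
})

# Rank every candidate feature by the strength of its correlation with titer.
ranking = (
    runs.corr()["titer"]
        .drop("titer")
        .abs()
        .sort_values(ascending=False)
)
# The unsuspected media-lot effect can rank right alongside the obvious DO excursion.
print(ranking)
```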
Conclusion
Simply put, SIMCA is limited in what it can do and requires a lot of pre-work before you can even run a multivariate analysis. The analysis tells you what is different, but not why, and finding the why requires additional investigative work.
Because BioRaptor ingests, organizes, structures and contextualizes all your data in one place, the data prep work is no longer needed. You have your historical and real-time data at your fingertips ready for analysis at any time. BioRaptor also does what SIMCA offers - batching runs and identifying deviations. But the core value is that it also tells you WHY your yields are different and gives you immediate insights that you can act on.
If you would like to see BioRaptor in action, book a demo with us and we’ll show how the magic happens.