CalcTune
⚠️ This tool is for reference only. Consult a healthcare professional for medical decisions.

Sensitivity & Specificity Calculator

Enter your 2×2 confusion matrix values (TP, FP, FN, TN) to compute all key diagnostic performance metrics: sensitivity, specificity, positive and negative predictive values, accuracy, F1 score, and likelihood ratios.

Example values are shown below; enter your own counts to recompute them.

Diagnostic Performance Metrics

Sensitivity: 81.8%
Specificity: 88.9%
PPV: 90.0%
NPV: 80.0%
Accuracy: 85.0%
F1 Score: 85.7%
Prevalence: 55.0%
LR+: 7.36
LR−: 0.20
Total Samples: 200

Results are estimates based on the entered confusion matrix. These figures are not a substitute for clinical judgment or professional medical advice.

Understanding Sensitivity, Specificity, and Diagnostic Test Performance

When evaluating a diagnostic test — whether a medical screening tool, a machine learning classifier, or a quality-control system — understanding how well it distinguishes between positive and negative cases is essential. Sensitivity and specificity are the two foundational metrics, but a complete picture requires several additional statistics derived from the same 2×2 confusion matrix.

The 2×2 Confusion Matrix

Every binary classification test produces four possible outcomes. A True Positive (TP) occurs when both the test and the actual condition are positive — the test correctly identifies a case. A True Negative (TN) occurs when both the test and the actual condition are negative — the test correctly clears a non-case. A False Positive (FP) occurs when the test is positive but the condition is absent — a healthy subject is incorrectly flagged. A False Negative (FN) occurs when the test is negative but the condition is present — a sick subject is incorrectly cleared.

These four counts form the raw material for every diagnostic performance metric. By systematically arranging them in a 2×2 table, researchers and clinicians can calculate a comprehensive set of statistics that characterize the test's strengths and weaknesses.
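As a concrete sketch, every metric on this page can be computed from the four counts. The specific values here (TP = 90, FP = 10, FN = 20, TN = 80) are an assumption, chosen to be consistent with the example figures shown at the top of the page:

```python
# Minimal sketch: deriving all diagnostic metrics from the four
# confusion-matrix counts. TP=90, FP=10, FN=20, TN=80 is one set of
# counts consistent with the example results above (total = 200).
tp, fp, fn, tn = 90, 10, 20, 80

total = tp + fp + fn + tn
sensitivity = tp / (tp + fn)                      # recall / true positive rate
specificity = tn / (tn + fp)                      # true negative rate
ppv = tp / (tp + fp)                              # precision
npv = tn / (tn + fn)
accuracy = (tp + tn) / total
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean of PPV and sensitivity
prevalence = (tp + fn) / total

print(f"Sensitivity {sensitivity:.1%}  Specificity {specificity:.1%}")
print(f"PPV {ppv:.1%}  NPV {npv:.1%}  Accuracy {accuracy:.1%}  F1 {f1:.1%}")
```

Each line maps one formula from the sections below onto the same four counts, which is why all of these metrics can be reported from a single 2×2 table.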

Sensitivity (Recall / True Positive Rate)

Sensitivity measures how reliably a test detects true cases. It is calculated as TP ÷ (TP + FN) — the fraction of all actual positives that the test correctly identifies. A sensitivity of 95% means the test catches 95 out of every 100 true cases, missing only 5.

High sensitivity is critical when the cost of missing a case is severe. In cancer screening or infectious disease surveillance, failing to detect a true case (a false negative) can have serious consequences, so tests used for initial screening typically prioritize sensitivity. Tests with high sensitivity are useful for ruling out a condition: a negative result from a highly sensitive test provides strong reassurance.

Specificity (True Negative Rate)

Specificity measures how reliably a test excludes true non-cases. It is calculated as TN ÷ (TN + FP) — the fraction of all actual negatives that the test correctly identifies. A specificity of 90% means the test correctly clears 90 out of every 100 healthy subjects, incorrectly flagging only 10.

High specificity is critical when the cost of a false alarm is high — for example, in confirmatory testing before an invasive procedure. Tests with high specificity are useful for ruling in a condition: a positive result from a highly specific test provides strong evidence that the condition is truly present.

Positive and Negative Predictive Values (PPV and NPV)

While sensitivity and specificity are properties of the test itself, Positive Predictive Value (PPV) and Negative Predictive Value (NPV) describe performance in the context of a specific population. PPV is calculated as TP ÷ (TP + FP) — the fraction of positive test results that are truly positive. NPV is TN ÷ (TN + FN) — the fraction of negative test results that are truly negative.

A crucial point is that PPV and NPV depend heavily on prevalence. In a low-prevalence population, even a highly specific test will generate many false positives relative to true positives, yielding a low PPV. Conversely, in a high-prevalence population, false negatives accumulate relative to true negatives, so NPV falls even for a reasonably sensitive test. Understanding this relationship is essential for interpreting screening results in clinical practice.
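The prevalence dependence can be made concrete with a short sketch. Holding sensitivity and specificity fixed at 90% each (illustrative values, not the example above), PPV is recomputed at several prevalences via Bayes' theorem:

```python
# Sketch: PPV as a function of prevalence for a fixed test with
# sensitivity = specificity = 90% (illustrative values).
sens, spec = 0.90, 0.90

def ppv_at(prevalence):
    # Bayes: P(disease | positive) = P(pos | disease) P(disease) / P(pos)
    true_pos_rate = sens * prevalence
    false_pos_rate = (1 - spec) * (1 - prevalence)
    return true_pos_rate / (true_pos_rate + false_pos_rate)

for prev in (0.001, 0.01, 0.10, 0.50):
    print(f"prevalence {prev:>6.1%} -> PPV {ppv_at(prev):.1%}")
```

The same test swings from a PPV below 1% at 0.1% prevalence to 90% at 50% prevalence, without sensitivity or specificity changing at all.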

Accuracy and F1 Score

Accuracy is the overall fraction of correct classifications: (TP + TN) ÷ Total. It is intuitive but can be misleading when classes are imbalanced. If 95% of cases in a dataset are negative, a test that always predicts negative achieves 95% accuracy while being completely useless at detecting positives.

The F1 score addresses this limitation. It is the harmonic mean of PPV (precision) and sensitivity (recall): 2 × (PPV × Sensitivity) ÷ (PPV + Sensitivity). By combining both metrics, F1 balances the tradeoff between missing cases and generating false alarms. It is particularly informative when class imbalance makes accuracy unreliable, and is widely used in both clinical research and machine learning benchmarks.
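A minimal sketch of the imbalance problem, using a hypothetical dataset of 95 negatives and 5 positives and a "model" that always predicts negative:

```python
# Sketch: accuracy flatters a useless classifier under class imbalance.
# Hypothetical data: 95 negatives, 5 positives; the model always
# predicts negative, so TP = 0 and FP = 0.
tp, fp, fn, tn = 0, 0, 5, 95

accuracy = (tp + tn) / (tp + fp + fn + tn)   # 95% — looks impressive
recall = tp / (tp + fn)                      # 0% — catches no positives
# With TP = 0, precision and recall are both zero, so F1 is taken as 0.
# (2*TP / (2*TP + FP + FN) is an equivalent form of the F1 formula.)
f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

print(f"Accuracy {accuracy:.0%}  Recall {recall:.0%}  F1 {f1:.0%}")
```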

Likelihood Ratios

Likelihood ratios express how much a test result changes the probability that a patient has a condition. The Positive Likelihood Ratio (LR+) is Sensitivity ÷ (1 − Specificity). It describes how much more likely a positive test result is in a person with the condition than in a person without it. LR+ values above 10 are generally considered to provide strong evidence of disease.

The Negative Likelihood Ratio (LR−) is (1 − Sensitivity) ÷ Specificity. It describes how much more likely a negative test result is in a person with the condition compared to one without it. LR− values below 0.1 are considered strong evidence against disease. Likelihood ratios are particularly powerful because they can be applied directly to pre-test probability using Bayes' theorem, allowing clinicians to update their estimate of disease probability based on test results regardless of population prevalence.
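The odds-form Bayes update can be sketched in a few lines. The counts match the page's example figures (TP = 90, FP = 10, FN = 20, TN = 80), and the 30% pre-test probability is an assumed illustrative value:

```python
# Sketch: updating a pre-test probability with a likelihood ratio
# (probability -> odds, multiply by LR, odds -> probability).
tp, fp, fn, tn = 90, 10, 20, 80          # example counts from above
sens = tp / (tp + fn)
spec = tn / (tn + fp)
lr_pos = sens / (1 - spec)               # positive likelihood ratio
lr_neg = (1 - sens) / spec               # negative likelihood ratio

def post_test_probability(pre_test_prob, lr):
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# Assumed pre-test probability of 30%:
print(f"After a positive result: {post_test_probability(0.30, lr_pos):.1%}")
print(f"After a negative result: {post_test_probability(0.30, lr_neg):.1%}")
```

A positive result raises the 30% estimate to roughly 76%, while a negative result lowers it to roughly 8%, using only the likelihood ratios and the pre-test probability.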

Prevalence and Its Effect on Predictive Values

Prevalence — the proportion of truly positive cases in the population being tested — connects sensitivity and specificity (fixed test properties) to PPV and NPV (population-dependent values). At low prevalence, false positives accumulate relative to true positives, driving PPV downward. At high prevalence, false negatives accumulate relative to true negatives, driving NPV downward.

This relationship has practical consequences for population screening programs. A serological test with 99% sensitivity and 99% specificity sounds impressive, but if the disease prevalence is only 0.1%, then among 100,000 tested individuals there will be roughly 100 true positives and 999 false positives — a PPV of only about 9%. Understanding prevalence is therefore just as important as understanding the test's intrinsic properties.
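The screening arithmetic above can be reproduced directly:

```python
# Sketch: the low-prevalence screening example (99% sensitivity,
# 99% specificity, 0.1% prevalence, 100,000 people tested).
n, prevalence = 100_000, 0.001
sens, spec = 0.99, 0.99

diseased = n * prevalence            # 100 people actually have the disease
healthy = n - diseased               # 99,900 do not
tp = sens * diseased                 # ~99 true positives
fp = (1 - spec) * healthy            # ~999 false positives
ppv = tp / (tp + fp)

print(f"TP ~ {tp:.0f}, FP ~ {fp:.0f}, PPV ~ {ppv:.1%}")
```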

Applications: Medicine, Machine Learning, and Quality Control

The confusion matrix framework originated in medical diagnostic testing but has since become universal in any field that involves binary classification. In machine learning, sensitivity is called recall and PPV is called precision. The F1 score is a standard benchmark metric for classification models. In quality control and industrial testing, sensitivity and specificity inform decisions about detection thresholds and acceptable false alarm rates.

The mathematics is identical across all these domains. What changes is the interpretation: a 'false positive' in medical screening means an unnecessary follow-up test, while in fraud detection it might mean a legitimate transaction blocked. The relative costs of false positives and false negatives — and the prevalence of the condition being detected — must always be considered when choosing a test or setting a decision threshold.

Interpreting Results: A Note on Context

No single metric captures the full picture of diagnostic performance. A test optimized purely for sensitivity will sacrifice specificity, and vice versa. The right balance depends on the clinical or practical context: the severity of missing a true case, the burden of investigating false positives, and the prevalence in the target population.

These calculations provide quantitative estimates based on the confusion matrix you enter. They are tools for structured thinking about test performance, not definitive verdicts. All results should be interpreted alongside clinical judgment, study design considerations, and an understanding of the population in which the test will be used.

Frequently Asked Questions

What is the difference between sensitivity and specificity?

Sensitivity (true positive rate) measures how well a test identifies those who have the condition: TP ÷ (TP + FN). Specificity (true negative rate) measures how well a test identifies those who do not have the condition: TN ÷ (TN + FP). High sensitivity minimizes missed cases; high specificity minimizes false alarms. In practice, there is often a trade-off between the two.

What is PPV and how does it differ from sensitivity?

Positive Predictive Value (PPV) is the probability that a positive test result is a true positive: TP ÷ (TP + FP). Unlike sensitivity, which is an intrinsic property of the test, PPV depends on the prevalence of the condition in the tested population. In low-prevalence settings, even a specific test can have a low PPV because false positives accumulate relative to true positives.

What do likelihood ratios tell us?

Likelihood ratios express how much a test result shifts the probability of disease. LR+ = Sensitivity ÷ (1 − Specificity): values above 10 provide strong evidence for disease. LR− = (1 − Sensitivity) ÷ Specificity: values below 0.1 provide strong evidence against disease. They are valuable because they can be applied to any pre-test probability using Bayes' theorem, making them more flexible than PPV and NPV across populations.

When should I use F1 score versus accuracy?

Accuracy is the overall fraction of correct results and works well when both classes are roughly balanced. F1 score is the harmonic mean of PPV and sensitivity, making it more informative when one class is much rarer than the other (class imbalance). For example, in rare disease screening, a test that always predicts 'negative' achieves high accuracy but an F1 score near zero, immediately revealing its uselessness.

Can I use this calculator for machine learning model evaluation?

Yes. The confusion matrix and the metrics calculated here — precision (PPV), recall (sensitivity), F1 score, and accuracy — are standard evaluation tools in machine learning classification tasks. Simply enter your model's TP, FP, FN, and TN counts from the test set to obtain a full performance summary.