Methodology

What Embarke is doing under the hood, in language a methods section can quote. Versioned as the product evolves; current revision 2026-05-13.

PRISMA 2020

The Preferred Reporting Items for Systematic Reviews and Meta-Analyses 2020 (PRISMA 2020; Page et al., BMJ 2021) is the dominant reporting standard for systematic reviews of health interventions. Embarke's PRISMA 2020-aligned systematic synthesis output format produces the full PRISMA section structure (Rationale, Objectives, Eligibility, Information sources, Study selection, Data items + synthesis methods, Risk of bias, Certainty of evidence, Results, Discussion, Implications, Other information).

Every PRISMA 2020-aligned run auto-generates a flow diagram SVG from pipeline counts (sources identified at Scout time, records included at citation persistence) and embeds it under the Study Selection section. The Critic agent enforces methodology completeness — drafts missing the Methods block, the certainty rollup, or the limitations acknowledgment fail review with a critical issue.
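
As a rough illustration of generating a flow diagram from the two pipeline counts named above, here is a minimal Python sketch. The function name, box layout, and the sanity check that included records cannot exceed identified sources are assumptions made for the example, not Embarke's actual renderer.

```python
from xml.sax.saxutils import escape


def flow_diagram_svg(identified: int, included: int) -> str:
    """Render a two-box flow diagram (identified -> included) as an SVG string.

    Hypothetical helper: the layout and labels are illustrative only.
    """
    if included > identified:
        # Assumption: a run cannot include more records than it identified.
        raise ValueError("included cannot exceed identified")
    boxes = [
        (20, f"Sources identified (n = {identified})"),
        (120, f"Records included (n = {included})"),
    ]
    parts = ['<svg xmlns="http://www.w3.org/2000/svg" width="320" height="180">']
    for y, label in boxes:
        parts.append(
            f'<rect x="20" y="{y}" width="280" height="40" fill="none" stroke="black"/>'
        )
        parts.append(
            f'<text x="160" y="{y + 25}" text-anchor="middle" font-size="13">'
            f"{escape(label)}</text>"
        )
    # Connector from the bottom of the first box to the top of the second.
    parts.append('<line x1="160" y1="60" x2="160" y2="120" stroke="black"/>')
    parts.append("</svg>")
    return "\n".join(parts)
```

Calling flow_diagram_svg(120, 18) returns markup that can be embedded directly under the Study Selection heading.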

The boundary, stated plainly: this is an AI-assisted synthesis aligned to PRISMA 2020's reporting structure — it is not a registered, dual-reviewer systematic review. There is no prospective protocol registration, no independent dual screening, and the search strategy is generated rather than pre-specified. Each report's Limitations section names these boundaries so a reviewer sees exactly where machine assistance ends and human responsibility begins. Human-oversight gates (screening confirmation, findings approval) are on the roadmap and will be recorded in the methodology stamp when present.

Lighter PRISMA-aligned outputs are also available on the Pro tier: the Evidence brief (1,500–3,000 words) and the Scoping review, which follows PRISMA-ScR (Tricco et al., Ann Intern Med 2018).

GRADE-informed certainty estimates

Embarke estimates a certainty rating for every finding, informed by the Grading of Recommendations Assessment, Development and Evaluation (GRADE) working group's five published domains:

Risk of bias in the included evidence
Inconsistency of results across studies
Indirectness of evidence to the question
Imprecision (effect-size confidence)
Publication bias risk

The rating is shown with the GRADE certainty glyph used in Summary-of-Findings tables — filled circles for the level of certainty, so it's recognizable at a glance:

High: Very confident the true effect is close to the estimate.
Moderate: Moderately confident; the true effect is likely close, but may differ.
Low: Confidence is limited; the true effect may be substantially different.
Very low: Very little confidence; the true effect is likely substantially different.

Embarke's Calibrator agent runs after the Writer on every PRISMA-aligned framework (configurable via FrameworkSpec.calibrator_mode). A worst-of-domains rule rolls the per-domain estimates up to a single High / Moderate / Low / Very low rating per finding; the rating is rendered inline in the report and summarized in the Certainty section.
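
The worst-of-domains rollup can be sketched in a few lines of Python. The domain keys, level names, and function name below are hypothetical; only the rule itself (the overall rating is the worst per-domain rating) comes from the description above.

```python
# Ordered worst to best, so a lower index means lower certainty.
LEVELS = ["very_low", "low", "moderate", "high"]

# Hypothetical keys for the five domains listed above.
DOMAINS = (
    "risk_of_bias",
    "inconsistency",
    "indirectness",
    "imprecision",
    "publication_bias",
)


def rollup_certainty(domain_ratings: dict) -> str:
    """Return the worst rating across all five domains."""
    missing = [d for d in DOMAINS if d not in domain_ratings]
    if missing:
        raise ValueError(f"missing domain ratings: {missing}")
    return min((domain_ratings[d] for d in DOMAINS), key=LEVELS.index)
```

A finding rated high on four domains and low on imprecision therefore rolls up to low.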

What this is and is not: these are machine-estimated certainty ratings produced by an LLM applying GRADE's domain structure — a starting point for your judgment, always labeled as machine-estimated. A formal GRADE assessment requires structured outcome-level appraisal by trained reviewers; Embarke does not claim to replace that, and these estimates should not be presented to an HTA or regulatory body as a formal GRADE rating without human review.

Non-PRISMA frameworks (Convergent assessment, ACH) use an “analytical confidence” calibrator mode instead — same H/M/L bucketing without the 5-domain GRADE substructure, since the question shapes don't suit GRADE.

Risk-of-bias tools

Embarke ships four published risk-of-bias instruments, applicable per cited study based on its design:

RoB 2 (Sterne et al., BMJ 2019) for randomized trials. 5 domains. Judgments in low / some_concerns / high. Overall = worst across domains.
ROBINS-I (Sterne et al., BMJ 2016) for non-randomized interventional studies. 7 domains. Judgments in low / moderate / serious / critical / no_information. Per the published rule, any no_information domain short-circuits the overall to no_information.
AMSTAR 2 (Shea et al., BMJ 2017) for appraising systematic reviews. 16 items, of which Embarke treats 4 as critical (protocol registration, comprehensive search, RoB assessment of included studies, RoB accounting in synthesis). Any critical no drives the overall to critically_low.
QUADAS-2 (Whiting et al., Ann Intern Med 2011) for diagnostic accuracy studies. 4 domains with low / high / unclear judgments, worst-wins rollup.

Per-domain judgments are LLM-proposed with a short rationale citing the source evidence; the rollup to an overall judgment is algorithmic (not LLM-guessed) so the published rules apply deterministically. Reviewers can override any per-domain judgment in the UI.
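
Because the rollups are rule-based, they are easy to express as plain functions. The sketch below is one possible Python rendering of the four rules as stated above, not Embarke's code. It assumes that QUADAS-2 ranks unclear between low and high (the text only says worst-wins), and the AMSTAR 2 item keys are hypothetical names for the four critical items.

```python
ROB2_ORDER = ["low", "some_concerns", "high"]
ROBINS_I_ORDER = ["low", "moderate", "serious", "critical"]
# Assumption: "unclear" sits between "low" and "high" for the worst-wins rollup.
QUADAS2_ORDER = ["low", "unclear", "high"]
# Hypothetical keys for the four items the text names as critical.
AMSTAR2_CRITICAL = (
    "protocol_registration",
    "comprehensive_search",
    "rob_assessment",
    "rob_in_synthesis",
)


def worst_of(judgments, order):
    """Return the judgment that appears latest in `order`, i.e. the worst one."""
    return max(judgments, key=order.index)


def rob2_overall(domains: dict) -> str:
    return worst_of(domains.values(), ROB2_ORDER)


def robins_i_overall(domains: dict) -> str:
    values = list(domains.values())
    # Any no_information domain short-circuits the overall judgment.
    if "no_information" in values:
        return "no_information"
    return worst_of(values, ROBINS_I_ORDER)


def quadas2_overall(domains: dict) -> str:
    return worst_of(domains.values(), QUADAS2_ORDER)


def amstar2_critically_low(answers: dict) -> bool:
    """True when any critical item is answered 'no'."""
    return any(answers.get(item) == "no" for item in AMSTAR2_CRITICAL)
```

Keeping the ordering tables as data means a reviewer override only changes the per-domain input; the overall judgment is then recomputed by the same rule.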

Reporting-standard checks

The Critic agent detects the dominant study type across cited findings (regex over title + URL — RCT, observational, diagnostic accuracy, prognostic prediction, systematic review, narrative review, guideline, case report) and applies the corresponding reporting standard's key checklist items:

RCT → CONSORT 2010 (Schulz et al.)
Observational → STROBE (von Elm et al.)
Diagnostic accuracy → STARD 2015 (Bossuyt et al.)
Prognostic model → TRIPOD+AI (Collins et al., 2024 update of TRIPOD with AI/ML extensions)
Systematic review → PRISMA 2020 (Page et al.)

When the draft skips an item the dominant study design requires, the Critic raises a missing_reporting_standard_item issue naming the specific item. Narrative reviews, case reports, and guidelines don't carry an audit-grade reporting bar; the check is skipped for those.
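
A minimal Python sketch of the detect-then-map step follows. The regular expressions are illustrative guesses rather than the Critic's real patterns, and "dominant" is assumed here to mean the most frequent detected type. Types without an audit-grade reporting bar simply have no entry in the mapping.

```python
import re
from collections import Counter
from typing import Optional

# Hypothetical patterns, checked in order; the first match wins.
PATTERNS = {
    "systematic_review": re.compile(r"systematic review|meta-analysis", re.I),
    "rct": re.compile(r"randomi[sz]ed|\brct\b", re.I),
    "diagnostic_accuracy": re.compile(r"diagnostic accuracy", re.I),
    "prognostic_prediction": re.compile(r"prediction model|prognostic", re.I),
    "observational": re.compile(r"cohort|case-control|cross-sectional", re.I),
}

STANDARDS = {
    "rct": "CONSORT 2010",
    "observational": "STROBE",
    "diagnostic_accuracy": "STARD 2015",
    "prognostic_prediction": "TRIPOD+AI",
    "systematic_review": "PRISMA 2020",
}


def classify(title: str, url: str) -> Optional[str]:
    """Return the first study type whose pattern matches title + URL."""
    text = f"{title} {url}"
    for study_type, pattern in PATTERNS.items():
        if pattern.search(text):
            return study_type
    return None


def dominant_standard(citations: list) -> Optional[str]:
    """Map the most frequent detected study type to its reporting standard."""
    detected = (classify(title, url) for title, url in citations)
    counts = Counter(t for t in detected if t is not None)
    if not counts:
        return None  # nothing detected, so the check is skipped
    dominant = counts.most_common(1)[0][0]
    return STANDARDS.get(dominant)
```

Checking the systematic-review pattern first keeps a title such as "systematic review of randomized trials" from being counted as an RCT.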

Retraction-aware citations

Every DOI cited in a synthesis is enriched against Crossref's metadata API, which carries Retraction Watch's retraction-status flags via the updated-by field. Cited papers flagged retracted or expression-of-concern surface in three places:

Inline badges in the rendered report (and inside the DOCX / PDF exports).
A top-of-report retraction banner summarizing flagged citations.
A critical Critic issue (cited_retracted_paper) that blocks Writer approval unless the draft explicitly acknowledges the retraction in context.

Verified live against the Wakefield 1998 retraction (DOI 10.1016/s0140-6736(97)11096-0) on 2026-05-10 — Crossref returns retracted status with date 2010-02-06.
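
The enrichment step can be approximated as follows. This Python sketch fetches a work from the public Crossref REST endpoint (https://api.crossref.org/works/{doi}) and scans its updated-by entries. The entry shape (type, DOI, updated.date-parts) and the type strings are assumptions modelled on Crossref's update metadata, so verify them against a live response before relying on this.

```python
import json
import urllib.parse
import urllib.request

# Assumed type labels for the two statuses the text says are flagged.
FLAGGED_TYPES = {"retraction", "expression_of_concern"}


def fetch_work(doi: str) -> dict:
    """Fetch one work record from the Crossref REST API (network call)."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["message"]


def retraction_flags(message: dict) -> list:
    """Extract retraction / expression-of-concern notices from a work record."""
    flags = []
    for update in message.get("updated-by", []):
        # Normalise spelling variants such as "expression-of-concern".
        kind = str(update.get("type", "")).lower().replace("-", "_").replace(" ", "_")
        if kind not in FLAGGED_TYPES:
            continue
        parts = update.get("updated", {}).get("date-parts", [[]])[0]
        date = "-".join(f"{p:02d}" for p in parts) if parts else None
        flags.append({"type": kind, "date": date, "notice_doi": update.get("DOI")})
    return flags
```

Separating the network call from the parsing keeps the flag logic testable against saved responses.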

Source-tier classification

Every URL Scout returns is classified into one of ten source categories (peer-reviewed, regulatory filing, preprint, primary interview, analyst report, trade publication, blog post, press release, social media, other) which map to a six-step quality tier (A → F).

Each FrameworkSpec declares a minimum source tier. PRISMA 2020 enforces tier B (peer-reviewed, preprint, regulatory, primary-interview only), and the PRISMA-light and Evidence brief frameworks also declare tier B; lighter business-research frameworks accept tier F (everything). Subscription tier additionally caps the ladder: Free is capped at A+B regardless of framework.
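
To make the two-level gating concrete, here is a small Python sketch. The category-to-tier table is hypothetical: the text only implies that tiers A and B together cover peer-reviewed, preprint, regulatory, and primary-interview sources, so the exact split and the placement of the other six categories are invented for illustration.

```python
# Hypothetical mapping; only the A/B grouping is implied by the text.
CATEGORY_TIER = {
    "peer_reviewed": "A",
    "regulatory_filing": "A",
    "preprint": "B",
    "primary_interview": "B",
    "analyst_report": "C",
    "trade_publication": "C",
    "press_release": "D",
    "blog_post": "E",
    "social_media": "F",
    "other": "F",
}


def is_allowed(category: str, framework_min_tier: str, plan_cap_tier=None) -> bool:
    """Check a source against the framework's minimum tier and the plan cap.

    Tiers are single letters A..F, so plain string comparison orders them
    from best ("A") to worst ("F").
    """
    worst_allowed = framework_min_tier
    if plan_cap_tier is not None:
        # The stricter (alphabetically earlier) of the two limits applies.
        worst_allowed = min(framework_min_tier, plan_cap_tier)
    return CATEGORY_TIER[category] <= worst_allowed
```

With this shape, a blog post passes a tier-F framework on an uncapped plan but is rejected once the plan cap is B.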

Reproducibility ZIP

Every report can be exported as a tamper-evident reproducibility package — a ZIP containing:

The full Markdown / DOCX / PDF rendering of the report.
manifest.json — model IDs per agent, temperature, prompt SHA-256 hash, corpus version, generation timestamp, framework + output format IDs.
The full source list with URLs, source tiers, Crossref enrichment status, retraction flags.
BibTeX + RIS exports of the citation set.
A SIGNATURE file — SHA-256 hash of the manifest, so a reviewer can confirm the package hasn't been edited after issue.

The audit-grade positioning is “drop the ZIP into your regulatory binder.” The manifest documents exactly what produced this report; the signature lets a reviewer detect tampering.
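
A reviewer-side check might look like the Python sketch below. It assumes the SIGNATURE file holds the hex-encoded SHA-256 digest of the raw manifest.json bytes and that both files sit at the root of the ZIP; the exact file layout and encoding are assumptions, so adjust them to the package you receive.

```python
import hashlib
import hmac
import io
import zipfile


def manifest_matches_signature(zip_bytes: bytes) -> bool:
    """Recompute the manifest's SHA-256 and compare it to the SIGNATURE file."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        manifest = zf.read("manifest.json")
        # Assumption: SIGNATURE is an ASCII hex digest, possibly newline-terminated.
        recorded = zf.read("SIGNATURE").decode("ascii").strip().lower()
    actual = hashlib.sha256(manifest).hexdigest()
    # Constant-time comparison; equivalent to == for correctness purposes.
    return hmac.compare_digest(actual, recorded)
```

Typical use is manifest_matches_signature(open(path, "rb").read()). A False result means the manifest no longer matches the hash recorded at issue.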

Methodology stamp + audit trail

Every Markdown / PDF / DOCX export opens with a methodology stamp: framework name, output format, total cited sources, retracted-paper count, generation timestamp, prompt hash, corpus version. Provenance up top, body below.

Behind the scenes, every pipeline run persists agent steps, Critic verdicts (per iteration, including revision cycles), Calibrator outputs, RoB assessments, and the raw prompt hashes — queryable via the admin dashboard for audit reconstruction.

PROSPERO pre-check

When a user starts a project on any PRISMA-aligned framework (PRISMA 2020, PRISMA-ScR, Evidence brief, PRISMA-light), the new-project form surfaces a banner linking to PROSPERO — the International Prospective Register of Systematic Reviews at the University of York's Centre for Reviews and Dissemination — pre-filled with the user's question.

The pre-check doesn't block project creation; it surfaces and lets the user decide. Duplicating an already-registered review is a credibility issue, not a technical error.

What we don't claim

The audit-grade label is earned with explicit limitations stated in every report:

Search strategy is not externally registered.
Per-query search logs are not preserved in the v0.1 pipeline.
Screening is single-reviewer (the Critic acts as second reviewer for methodology completeness, not inclusion decisions).
Per-study data-extraction tables are a Tier-2 add-on, not standard on every synthesis.
The Calibrator's GRADE-informed certainty estimates are LLM-proposed; reviewers should validate them before regulatory use.

All five limitations are named in the Limitations section of every PRISMA 2020 output by default. The Critic enforces that they appear.
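
One simple way to enforce such a rule is a phrase check over the Limitations section, sketched below in Python. The marker phrases and issue keys are hypothetical; how the Critic actually decides that a limitation has been named is not described here.

```python
# Hypothetical marker phrases, one per limitation listed above.
REQUIRED_LIMITATIONS = {
    "search_not_registered": "not externally registered",
    "no_query_logs": "search logs are not preserved",
    "single_reviewer": "single-reviewer",
    "extraction_addon": "data-extraction tables",
    "llm_proposed_certainty": "llm-proposed",
}


def missing_limitations(limitations_text: str) -> list:
    """Return the keys of required limitations not found in the text."""
    lowered = limitations_text.lower()
    return [
        key
        for key, phrase in REQUIRED_LIMITATIONS.items()
        if phrase not in lowered
    ]
```

An empty list means all five limitations were found; anything else names the ones a draft still has to add.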

Questions on methodology not answered here? Email [email protected] — we maintain this page based on what reviewers ask.
