Behavioral npm Security Scanning: The Complete Guide (2026)

TL;DR — Advisory-based scanners (npm audit, Snyk, Dependabot) check packages against a database of known-bad versions. They give you a green checkmark during the exact window when a brand-new attack is live. Behavioral scanners read the actual source code and flag what it does — install hooks, credential access, network exfiltration, obfuscation, worm propagation. Dependency Guardian runs 53 behavioral detectors (32 npm + 21 PyPI) plus a 78-rule correlator across every scanned package. This guide explains exactly how that works.

Axios has 100 million weekly downloads. In March 2026, an attacker stole the maintainer's credentials and published compromised versions containing a remote access trojan. It reached OpenAI's signing pipeline before anyone noticed.

npm audit said Axios was clean. Snyk said it was clean. Dependabot didn't raise a flag. No CVE existed. There was nothing for any of them to find.

This keeps happening. Six major supply chain attacks hit the npm ecosystem in the past year. The Shai-Hulud worm infected 796 packages. The chalk/debug hijack hit packages with 2.6 billion combined weekly downloads. tj-actions put 23,000 GitHub repositories at risk. Every one passed advisory-based scanners clean during the window that mattered.

What is behavioral npm security scanning?

Behavioral npm security scanning is the practice of analyzing what a package's code does — install-time behavior, runtime behavior, dependency graph shape, metadata anomalies — rather than whether the package appears in an advisory database of known vulnerabilities.

A CVE scanner answers: "Has someone already reported a problem with this version?"

A behavioral scanner answers: "Does this package exhibit patterns we know map to malicious behavior?"

Both are useful. They solve different problems. The CVE scanner catches known-bad versions of legitimate packages (an old vulnerable lodash, a Log4j release affected by Log4Shell). The behavioral scanner catches the brand-new malicious package that no one has classified yet.

CVE scanning vs behavioral scanning: a structural comparison

| | CVE / advisory scanners | Behavioral scanners |
| --- | --- | --- |
| Input | Package name + version | Package source code + metadata |
| Mechanism | Database lookup against known-bad versions | Static analysis + pattern detection + correlation |
| Time to detection for zero-day | Hours to weeks (someone has to discover and file) | Seconds (runs on publish) |
| Catches novel malicious packages | No: no record yet | Yes: based on behavior, not prior knowledge |
| Catches known vulnerable versions of legitimate packages | Yes: core use case | Partially: only if the vulnerability leaves a behavioral trace |
| False positives on legitimate tools that use suspicious APIs | Very low | Requires correlation to avoid |
| Typical scanners | `npm audit`, Snyk, Dependabot, Socket (partial), GitHub Security Advisories | Socket.dev, Dependency Guardian |

You need both. This guide is about the second column — where advisory-based scanners structurally cannot help.

Why CVE scanners miss supply chain attacks: the structural gap

CVE scanners check your packages against a database of known problems. But someone has to discover the vulnerability first, report it, get a CVE assigned, and wait for it to propagate through the advisory ecosystem. That takes days to weeks.

During that entire window, every scanner in your pipeline gives you a green checkmark while compromised code ships to production.

It's like checking an airport watch list for someone who has never committed a crime. The attack works because the attacker has no record.

Advisory databases matter. You need them for patching known vulnerabilities. But they were never designed to catch novel malicious code, and no amount of updating the database faster changes that. It's a structural limitation of the approach, not an implementation bug.

The Axios attack demonstrated this cleanly. Malicious code was live on npm for about three hours. During those three hours, every download — tens of thousands of npm install commands across companies worldwide — came with a RAT. Every CVE scanner looked at those installs and reported clean, because the advisory process hadn't started yet.
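
The gap can be illustrated with a toy advisory lookup. This is only a sketch: the package name, versions, and advisory ID below are hypothetical, and real advisory databases are far richer than a dictionary. It still shows why a freshly published version comes back clean.

```python
# Toy advisory database: (package, version) pairs that someone has already reported.
# Every entry here is hypothetical and exists only for this illustration.
ADVISORIES = {
    ("example-lib", "1.0.0"): "EXAMPLE-2025-0001",
}


def advisory_verdict(name: str, version: str) -> str:
    """Return the advisory ID for a known-bad version, or 'clean' if none is on file."""
    return ADVISORIES.get((name, version), "clean")


# A version published minutes ago has no record, so the lookup reports clean
# regardless of what the code inside the tarball actually does.
print(advisory_verdict("example-lib", "1.0.0"))  # known-bad version
print(advisory_verdict("example-lib", "1.0.1"))  # freshly published, no advisory yet
```

The lookup never opens the package, so shipping the database faster shortens the window but cannot close it.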

How behavioral npm scanning works: the 4-phase pipeline

Behavioral scanning is typically a four-phase pipeline. Here's how Dependency Guardian structures it:

Phase 1: Extract

The tarball is downloaded from the registry, verified, and extracted. File size limits apply (4MB per file, 15MB per package on the standard tier) to prevent resource exhaustion from adversarial packages. Extraction diagnostics — missing manifests, path traversal attempts, tampered archive headers — become findings before any detector runs.
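
A minimal sketch of what extraction diagnostics could look like, using Python's standard tarfile module. The finding labels and function names are assumptions made for illustration, not Dependency Guardian's actual identifiers. The size limits are the ones quoted above, and the package/ prefix is the layout npm tarballs use.

```python
import io
import tarfile
from pathlib import PurePosixPath

MAX_FILE_BYTES = 4 * 1024 * 1024      # per-file limit quoted above
MAX_PACKAGE_BYTES = 15 * 1024 * 1024  # per-package limit quoted above


def extraction_findings(tarball: bytes) -> list[str]:
    """Inspect tarball members without writing anything to disk."""
    findings = []
    total = 0
    names = set()
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:*") as archive:
        for member in archive.getmembers():
            names.add(member.name)
            path = PurePosixPath(member.name)
            # Absolute paths or ".." components would escape the extraction root.
            if path.is_absolute() or ".." in path.parts:
                findings.append(f"path_traversal: {member.name}")
            if member.size > MAX_FILE_BYTES:
                findings.append(f"file_too_large: {member.name}")
            total += member.size
    if total > MAX_PACKAGE_BYTES:
        findings.append("package_too_large")
    if "package/package.json" not in names:
        findings.append("missing_manifest")
    return findings


def build_tarball(files: dict[str, bytes]) -> bytes:
    """Helper for the demo: build an in-memory .tgz from a name -> content mapping."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


suspicious = build_tarball({"package/../../evil.js": b"// payload"})
print(extraction_findings(suspicious))
```

Checking member names before unpacking means a hostile archive is reported as a finding instead of being trusted with the filesystem.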

Phase 2: Preprocess (the step most scanners skip)

Attackers know scanners pattern-match, so they write payloads that no literal pattern will hit: they hide them behind zero-width spaces, backtick template strings, hex-encoded literals, Unicode-escaped identifiers, and comment-embedded code. A naive regex for eval( never fires on \x65\x76\x61\x6c.

Dependency Guardian runs three preprocessing passes before any detector executes:

Strip comments — including multi-line blocks that have been used to hide payload strings
Strip invisible Unicode — zero-width spaces, right-to-left override marks, byte-order marks
Decode all string escapes — hex, unicode-4 (\u0041), unicode-curly (\u{0041}), octal, in that order

Detectors then see the decoded source. If an attacker wrote \x65\x76\x61\x6c, the detector sees eval. This step alone neutralizes 73 specific bypass vectors we've catalogued across 9 detector families.
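
The three passes can be sketched in a few lines. This is an illustrative approximation, not the production implementation: it uses regular expressions where a real preprocessor would need a tokenizer (for example, the comment pass below would also strip a // that appears inside a string literal), and all function names are hypothetical.

```python
import re

# Zero-width space, non-joiner, joiner, word joiner, right-to-left override, byte-order mark.
INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\u202e\ufeff"))


def _from_codepoint(match: re.Match, base: int) -> str:
    value = int(match.group(1), base)
    # Leave out-of-range escapes untouched instead of raising.
    return chr(value) if value <= 0x10FFFF else match.group(0)


def strip_comments(source: str) -> str:
    # Simplification: ignores string context, so it is not safe for general use.
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", source)


def strip_invisible(source: str) -> str:
    return source.translate(INVISIBLE)


def decode_escapes(source: str) -> str:
    # Order follows the text: hex, unicode-4, unicode-curly, octal.
    passes = [
        (r"\\x([0-9a-fA-F]{2})", 16),
        (r"\\u([0-9a-fA-F]{4})", 16),
        (r"\\u\{([0-9a-fA-F]{1,6})\}", 16),
        (r"\\([0-7]{1,3})", 8),
    ]
    for pattern, base in passes:
        source = re.sub(pattern, lambda m, b=base: _from_codepoint(m, b), source)
    return source


def preprocess(source: str) -> str:
    return decode_escapes(strip_invisible(strip_comments(source)))


raw = r"\x65\x76\x61\x6c(payload)"
print(re.search(r"\beval\(", raw))              # None: the naive regex misses it
print(re.search(r"\beval\(", preprocess(raw)))  # a match object after decoding
```

Running detectors on the decoded text means each detector can keep simple patterns instead of enumerating every encoding of the same identifier.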

Phase 3: Detect (35 detectors)

Each preprocessed file runs through a prefilter (a fast Aho-Corasick scan for detector keywords), then through the detectors whose keywords fired. The DG detector set has 35 detectors total, grouped into three classes:
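
For readers unfamiliar with Aho-Corasick, here is a compact sketch of a keyword prefilter built on it. The automaton itself is the standard algorithm. The keyword-to-detector mapping and the class name are assumptions for illustration only, since the real keyword lists are not described here.

```python
from collections import deque


class KeywordPrefilter:
    """Minimal Aho-Corasick automaton mapping keywords to detector names."""

    def __init__(self, keywords: dict[str, str]):
        self.goto = [{}]     # per-state character transitions
        self.fail = [0]      # failure links
        self.out = [set()]   # detectors reported when a state is reached
        for word, detector in keywords.items():
            state = 0
            for ch in word:
                if ch not in self.goto[state]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append(set())
                    self.goto[state][ch] = len(self.goto) - 1
                state = self.goto[state][ch]
            self.out[state].add(detector)
        # Breadth-first pass to compute failure links (depth-1 states keep link 0).
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                fallback = self.fail[state]
                while fallback and ch not in self.goto[fallback]:
                    fallback = self.fail[fallback]
                self.fail[nxt] = self.goto[fallback].get(ch, 0)
                self.out[nxt] |= self.out[self.fail[nxt]]

    def detectors_for(self, text: str) -> set[str]:
        """Single pass over the text; returns every detector whose keyword occurs."""
        state, hits = 0, set()
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            hits |= self.out[state]
        return hits


# Hypothetical keyword map for the demo.
prefilter = KeywordPrefilter({
    "eval": "Obfuscation",
    "child_process": "ChildProcess",
    "postinstall": "InstallScript",
})
print(prefilter.detectors_for('require("child_process").exec(cmd); eval(x)'))
```

The prefilter is deliberately over-inclusive (it would also wake a detector for a word like medieval). Its job is to skip detectors that cannot possibly fire, and the detectors do the precise work afterwards.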

File-scanning detectors (20) — analyze source code: InstallScript, ChildProcess, NetworkExfil, Obfuscation, TokenTheft, CiSecretAccess, SensitivePath, BehaviorDrift, BrowserPhishing, BunRuntime, WormBehavior, SuspiciousApi, LegitimateApiExfil, PurposeMismatch, DiffRisk, DataFlow, AttackPatterns, StructuralAnomaly, BinaryAddon, FilesystemPersistence.

V2-only detectors (3) — newer additions: ProtoPollution (prototype-pollution gadgets), RegexDos (catastrophic-backtracking patterns), Steganography (payloads hidden in images or whitespace).

Metadata detectors (12) — analyze registry metadata and dependency graph: DependencyConfusion, Typosquat, FreshPublish, MaintainerChange, SuspiciousDeps, EmptyPackage, VersionHistory, Provenance, KnownVulnerability, License, GithubReputation, TeaSpam.

Each detector emits findings with a confidence level (Low / Medium / High / Definitive), a severity (Info / Low / Medium / High / Critical), and concrete evidence — file path, line number, and a code snippet.
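
As a rough picture of that shape, a finding could be modeled like this. The level names come from the text above. The class and field names, and the example values, are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Confidence(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    DEFINITIVE = 4


class Severity(IntEnum):
    INFO = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Finding:
    detector: str           # which detector produced the finding
    confidence: Confidence  # how sure the detector is
    severity: Severity      # how bad it is if true
    file_path: str          # evidence: where
    line: int               # evidence: which line
    snippet: str            # evidence: what the code says


# Illustrative values only.
finding = Finding(
    detector="install_script",
    confidence=Confidence.HIGH,
    severity=Severity.MEDIUM,
    file_path="package/package.json",
    line=12,
    snippet='"postinstall": "node setup.js"',
)
print(finding)
```

Keeping confidence and severity as separate ordered scales lets a later stage reason about "likely but minor" and "unlikely but catastrophic" differently.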

Phase 4: Correlate and score (78 amplifier rules + 50 critical-block rules)

Individual signals aren't enough. A package that makes HTTP requests? That's half of npm. One that reads environment variables? Normal for a CI tool.

The correlator runs 78 amplifier rules and 50 critical-block rules across the findings. An amplifier says "if signals A and B co-occur, upgrade both." A critical-block rule says "if signals A, B, and C co-occur, this is a confirmed attack pattern — force the score to 100 and block the merge."

Example from the Axios attack:

install_script (postinstall hook) fired → common on its own
network_exfil (HTTP POST every 60s to unrelated domain) fired → common on its own
obfuscation (reversed Base64 + XOR) fired → suspicious on its own

Any one of the three alone: noise. All three in the same package: the obfuscated_install critical-block rule fires, the score is pinned at 100, and the PR is stopped automatically. No human triage needed. The behavior is the evidence.
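
A simplified sketch of that correlation step follows. The rule name obfuscated_install and the three signal names come from the example above. The base weights, the amplifier bonus, and the data structures are invented for illustration, and the amplifier here just adds to the score rather than upgrading individual findings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AmplifierRule:
    required: frozenset[str]
    bonus: int


@dataclass(frozen=True)
class CriticalBlockRule:
    name: str
    required: frozenset[str]


# Hypothetical weights: each signal on its own stays well below a blocking score.
BASE_SCORES = {"install_script": 10, "network_exfil": 15, "obfuscation": 25}

AMPLIFIERS = [
    AmplifierRule(frozenset({"install_script", "network_exfil"}), bonus=20),
]

CRITICAL_BLOCKS = [
    CriticalBlockRule(
        "obfuscated_install",
        frozenset({"install_script", "network_exfil", "obfuscation"}),
    ),
]


def correlate(signals: set[str]) -> tuple[int, bool, list[str]]:
    """Return (score, blocked, names of critical-block rules that fired)."""
    score = sum(BASE_SCORES.get(signal, 0) for signal in signals)
    for rule in AMPLIFIERS:
        if rule.required <= signals:
            score += rule.bonus
    fired = [rule.name for rule in CRITICAL_BLOCKS if rule.required <= signals]
    if fired:
        return 100, True, fired       # confirmed pattern: pin the score and block
    return min(score, 99), False, fired


print(correlate({"install_script"}))
print(correlate({"install_script", "network_exfil", "obfuscation"}))
```

Expressing rules as required-signal sets keeps each one auditable: a blocked merge can point to the exact combination that triggered it.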

What behavioral detectors look for (by threat category)

Detectors cover six threat categories:

Code execution — install scripts, child-process spawning, preinstall-timing abuse, worm-propagation patterns that replicate the package across a developer's account.

Data exfiltration — outbound network calls to hardcoded IPs or recently-registered domains, reads of ~/.ssh/, .npmrc, .pypirc, environment variables containing _TOKEN/_KEY, CI secret file paths (GITHUB_TOKEN, NPM_TOKEN, AWS credential files).

Evasion — obfuscation (reversed Base64, XOR, eval-of-encoded-string, JSFuck), behavior drift (different code paths between first install and subsequent runs), browser-based phishing patterns, dynamic code execution via Function() or new AsyncFunction().

Package integrity — typosquatting against popular package names, dependency confusion (public package shadowing an internal scope), source-vs-registry mismatch (GitHub release absent or different from published tarball), suspicious version history.

Structural analysis — purpose mismatch (a package named "is-even" that spawns child processes), binary add-ons, filesystem persistence (writes outside the package tree), structural anomaly compared to the known-clean corpus.

Metadata and reputation — fresh-publish windows (new package, days old), recent maintainer change, missing GitHub presence, dependency tree shape, empty-package spam, TeaSpam filter, license risk.
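
To make one of these categories concrete, here is a deliberately small sketch of a data-exfiltration style check for credential access. The patterns and labels are assumptions chosen to mirror the examples above (~/.ssh/, .npmrc, .pypirc, environment variables ending in _TOKEN or _KEY). A real detector would be far more thorough.

```python
import re

# Hypothetical pattern set for illustration.
SENSITIVE_PATTERNS = {
    "ssh_dir": re.compile(r"~/\.ssh/|\.ssh/id_"),
    "registry_credentials": re.compile(r"\.npmrc|\.pypirc"),
    "secret_env_var": re.compile(r"process\.env\.[A-Z0-9_]*(?:_TOKEN|_KEY)\b"),
}


def sensitive_access_findings(source: str) -> list[tuple[int, str, str]]:
    """Return (line number, label, snippet) for each line touching a sensitive target."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for label, pattern in SENSITIVE_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, label, line.strip()))
    return findings


sample = "\n".join([
    "const token = process.env.NPM_TOKEN;",
    "const port = process.env.PORT;",
    "const rc = fs.readFileSync(os.homedir() + '/.npmrc', 'utf8');",
])
for finding in sensitive_access_findings(sample):
    print(finding)
```

On its own, a hit like this is weak evidence, since plenty of legitimate CI tools read NPM_TOKEN. It becomes meaningful once the correlator sees it alongside other signals.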

Evasion handling: neutralizing the techniques attackers actually use

The preprocessing pipeline above is not theoretical. Every one of these techniques has been seen in real npm malware:

Hex-encoded identifiers — \x65\x76\x61\x6c for eval. Used in the event-stream / flatmap-stream attack (2018, still imitated in 2026).
Unicode-escaped strings — \u0065\u0076\u0061\u006c. Used in crypto-stealer families.
Zero-width spaces in identifiers — identifier with a ZWSP in the middle looks identical in an editor but doesn't match a plain regex for the name.
Comment-embedded payloads — payload string sits inside a block comment, extracted and eval'd at runtime.
Backtick template literal concatenation: `e`+`v`+`a`+`l` constructs eval at runtime without the string ever appearing in the source.

Each one of these defeats a pattern-match-only scanner. Each one is neutralized by preprocessing before the detector sees the code.

What behavioral scanning caught in 2025–2026

| Attack | Window | CVE scanner verdict during window | Behavioral detector signals |
| --- | --- | --- | --- |
| Axios `1.14.1` / `0.30.4` (March 2026) | ~3 hours | Clean | install_script + network_exfil + obfuscation + phantom_dependency + child_process → critical block |
| Shai-Hulud worm (2025) | Days across 796 packages | Clean | worm_behavior + install_script + token_theft → critical block |
| chalk / debug hijack (2025) | Hours across ~2.6B weekly DLs | Clean | behavior_drift + network_exfil + credential_access → critical block |
| tj-actions compromise (2025) | Days across 23K repos | Clean | ci_secret_access + obfuscation → high/block |
| `ahmed_salem_ph` trojanized AI tool (caught by feed scanner) | 2 minutes after publish | Clean | binary_addon + suspicious_api + fresh_publish → block. Only 3/76 VirusTotal engines flagged the binary payload. |

Each of these was scanned clean by every advisory-based tool in the standard security stack during the window when the attack was live. Each was caught by behavioral analysis within seconds of publishing.

Does it actually work: published accuracy

You don't have to take the story on trust. Here are the numbers, validated against a public methodology:

npm: 95.20% catch rate (8,903 of 9,352 disclosed-malware packages)
PyPI: 93.88% catch rate (4,276 of 4,555 disclosed-malware packages)
False positive rate: 0.44% on 1,967 clean npm packages, 0.29% on 2,000 clean PyPI packages
Validation corpus: 17,874 packages (OSSF MAL + Datadog disclosed-malware corpus + matched clean baseline). Methodology and full data at /benchmark.

No LLM in the loop. No per-scan API cost. Fully deterministic. Same package, same verdict, every time.

During a live feed scan, Dependency Guardian flagged a trojanized package called ahmed_salem_ph two minutes after it was published. Inside was a Windows keylogger disguised as an AI coding assistant. Only 3 of 76 antivirus engines on VirusTotal detected the binary it shipped.

Two minutes. Before most teams would have finished their morning standup.

Where behavioral scanning fits in your stack

This isn't a replacement for Snyk or Dependabot. You still need those for patching known vulnerabilities in legitimate packages. Behavioral scanning handles the part they're structurally incapable of: catching attacks that don't have an advisory yet.

A complete supply chain security posture looks like this:

Advisory scanners (npm audit, Snyk, Dependabot) → patching known-vulnerable versions of legitimate packages.
Behavioral scanning (Dependency Guardian, Socket.dev) → catching novel malicious packages and compromised legitimate ones.
SBOM + provenance → audit trail and compliance.
Lockfile review in CI → every dependency change goes through a gate.
Runtime sandboxing (optional) → catch time-bomb payloads and runtime-only behaviors that static analysis misses.

The goal is to narrow the window where a compromised package can reach production from "days to weeks" (advisory-only) to "seconds" (behavioral + gated CI).
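
The lockfile gate is the piece most teams can prototype themselves. The sketch below assumes a package-lock.json in the v2/v3 layout, where a top-level packages object maps paths such as node_modules/name to entries with a version field. It lists what a pull request added or changed, so that only those packages need a scan. The function name and sample data are hypothetical.

```python
import json
from typing import Optional


def changed_dependencies(old_lock: str, new_lock: str) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Map package path -> (old version, new version) for additions and version changes."""
    old = json.loads(old_lock).get("packages", {})
    new = json.loads(new_lock).get("packages", {})
    changes = {}
    for path, meta in new.items():
        if not path:  # the "" key describes the root project, not a dependency
            continue
        before = old.get(path, {}).get("version")
        after = meta.get("version")
        if before != after:
            changes[path] = (before, after)
    return changes


# Hypothetical lockfiles before and after a pull request.
base = json.dumps({"packages": {
    "": {"name": "app"},
    "node_modules/left-pad": {"version": "1.3.0"},
}})
head = json.dumps({"packages": {
    "": {"name": "app"},
    "node_modules/left-pad": {"version": "1.3.0"},
    "node_modules/example-lib": {"version": "2.0.0"},
}})
print(changed_dependencies(base, head))
```

A CI job could feed each changed entry to whichever scanners are in the stack and fail the check if any of them blocks.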

Try it

npm install -g @westbayberry/dg

After install, npm install and pip install scan automatically — no prefix, no setup wizard.

Each scan covers every dependency in your lockfile. The free tier gives you 1,000 scans per month with all 53 detectors enabled. No credit card.

For CI, the GitHub App scans pull requests automatically when your lockfile changes and blocks merge on critical findings. The CLI works in any pipeline where Node runs.

FAQ

What is behavioral npm security scanning?

Behavioral npm security scanning is the practice of analyzing what a package's code does — install-time behavior, runtime behavior, metadata, dependency shape — rather than looking the package up in an advisory database. It catches novel malicious packages that have no prior CVE or advisory.

How is behavioral scanning different from npm audit?

npm audit checks your packages against a database of versions with known CVEs. Behavioral scanners read the actual code and flag behavior patterns that map to malicious activity (install hooks, credential access, network exfiltration, obfuscation). npm audit cannot catch a brand-new malicious package because no CVE exists for it yet. Behavioral scanners can.

Do I still need npm audit, Snyk, or Dependabot if I use behavioral scanning?

Yes. They solve a different problem — patching known vulnerabilities in legitimate packages. Behavioral scanning solves the zero-day supply chain attack problem. Run both.

How many detectors does Dependency Guardian have?

53 detectors across both ecosystems (32 npm + 21 PyPI), spanning file-scanning, AST-aware, and metadata classes. The detector list is defined in the DetectorId enum in dg-scan and is identical across every pricing tier, including free.

What happens when multiple detectors fire on the same package?

The correlator runs 78 amplifier rules and 50 critical-block rules across the findings. Some combinations (install_script + network_exfil + obfuscation, for example) force the score to 100 and block the merge automatically. Single signals without corroboration usually don't block.

Can behavioral scanners catch obfuscated malware?

Yes — if they preprocess before scanning. Dependency Guardian runs three preprocessing passes (strip comments, strip invisible Unicode, decode hex/unicode/octal escapes) before any detector executes. \x65\x76\x61\x6c is scanned as eval. This neutralizes 73 specific evasion techniques we've catalogued.

What's the false positive rate?

0.44% on 1,967 clean npm packages (9 false positives), validated against a public methodology at /benchmark. Most false positives come from legitimate dev tools that use suspicious APIs (child_process, eval, network calls) for legitimate reasons; the correlator downgrades single-signal findings to avoid blocking them.

Is behavioral scanning slow enough to hurt CI?

No. Standard scans finish in seconds per package. A typical dg scan across a full lockfile runs in well under a minute. The Aho-Corasick prefilter plus per-file worker parallelism means most scanning time is I/O, not CPU.

What does behavioral scanning cost?

Dependency Guardian is free for 1,000 scans/month with the full detector set. Paid tiers start at $15/month (flat) for small teams. See /pricing for the full breakdown.