- Overview
- Schema Architecture
- Core Schema Definition
- Category-Specific Schemas
- Evidence Specifications
- Backwards Compatibility
- Sample Reports
- Internal Metadata Usage
This document provides the complete technical specification for XARF v4 schema validation, field definitions, and format requirements. It serves as the authoritative reference for implementing XARF v4 parsers, validators, and generators.
For a high-level introduction to XARF v4, see the Introduction & Overview. For guidance on building parsers, validators, and generators against this spec, see the Implementer's Guide.
Core Philosophy:
- Source-Centric: Focus on "what did this IP do wrong" rather than "what attack occurred"
- Evidence-Based: Every report must include actionable evidence
- Real-Time First: Optimized for fast operational response, not research/analysis
- Automation-Friendly: JSON first, email is just transport
- Backwards Compatible: v3 reports should work transparently
- Extensible: Add new types without breaking existing parsers
Real-Time Reporting Philosophy:
- One incident = One report (immediately when detected)
- No bundling multiple incidents together
- No daily batching or delayed reporting
- Exception: Threshold-based reporting (e.g., wait for >5 login attempts before reporting brute force)
- Goal: Minimize time between abuse detection and ISP notification
- JSON Schema based - Formal validation using JSON Schema Draft 2020-12
- Conditional validation - Type-specific requirements based on category/type combinations
- Extensible design -
additionalProperties: trueat the report level permits unknown fields for forward compatibility - Multi-level validation - Standard and strict modes
XARF v4 organizes abuse reports into seven primary categories, each with specific types and validation rules:
| Category | Purpose | Required Fields | Evidence Focus |
|---|---|---|---|
messaging |
Communication abuse | protocol, smtp_from, subject |
Original messages, headers |
connection |
Network attacks | destination_ip, protocol |
Attack logs, connection data |
infrastructure |
Compromised systems | None beyond base | Traffic analysis, indicators |
content |
Malicious web content | url |
Screenshots, webpage samples |
copyright |
IP infringement | work_title, rights_holder |
DMCA notices, content samples |
vulnerability |
Security disclosures | service |
Vulnerability scans, CVE data |
reputation |
Threat intelligence | threat_type |
Intelligence feeds, analysis |
Different abuse categories target different infrastructure layers and require distinct response procedures:
Infrastructure Layer Targeting:
- Client-side abuse (
infrastructure.bot) → End-user ISP remediation and quarantine - Server-side abuse (
infrastructure.compromised_server) → Hosting provider hardening and patching - Content-side abuse (
content.phishing,content.fraud) → Content takedown by hosting provider
ISP Action Categories:
- Immediate blocking → Login attacks, active DDoS traffic
- Content takedown → Phishing sites, copyright infringement, spamvertised content
- Client remediation → Bot infections, compromised end-user systems
- Infrastructure hardening → Vulnerable services, compromised servers, CVE patching
XARF v4 uses a three-tier system for attribute categorization to provide clear implementation guidance:
- Required Attributes: Must be present for valid XARF reports
- Recommended Attributes: Should be included when available to improve report utility
- Optional Attributes: May be included for additional context or specific use cases
This categorization helps implementers prioritize which fields to support while maintaining interoperability.
To enable both human readability and programmatic detection of field requirements, XARF v4 schemas use two complementary mechanisms:
All field descriptions are prefixed with their requirement level:
REQUIRED:- Field must be present (also inrequiredarray)RECOMMENDED:- Field should be included when availableOPTIONAL:- Field may be included for additional context
Example:
{
"source_port": {
"type": "integer",
"description": "RECOMMENDED: Source port number - critical for identifying sources in CGNAT environments"
}
}For programmatic detection of recommended fields, schemas use the JSON Schema extension property x-recommended: true:
{
"source_port": {
"type": "integer",
"minimum": 1,
"maximum": 65535,
"x-recommended": true,
"description": "RECOMMENDED: Source port number - critical for identifying sources in CGNAT environments"
}
}Determining Field Requirements Programmatically:
| Condition | Requirement Level |
|---|---|
Field is in required array |
Required |
Field has x-recommended: true |
Recommended |
| Neither of the above | Optional |
This dual approach ensures that:
- Developers can quickly understand requirements by reading descriptions
- Validators and tools can programmatically determine field requirements
- The schema remains compatible with standard JSON Schema validators
The authoritative base schema is defined in xarf-core.json. All XARF v4 reports are validated against it.
Every report carries a small set of required identity and routing fields: the schema version, a unique report ID, when the incident occurred, who is reporting it, what the abuse source is, and how the abuse is classified (category + type). Beyond those, the schema defines recommended fields that improve report utility — such as evidence and a confidence score — and optional fields for additional context.
Contact information (reporter and sender) follows a shared structure capturing the organization name, a contact email, and a domain for verification. The distinction between reporter and sender matters when a third-party service files on behalf of a client.
| Field | Required | Type | Notes |
|---|---|---|---|
org |
Yes | string | Organization name; max 200 chars |
contact |
Yes | Contact email address | |
domain |
Yes | hostname | Organization domain for verification |
Evidence items each carry a MIME type and a base64-encoded payload. Integrity hashes and a human-readable description are recommended; size metadata is optional.
| Field | Requirement | Type | Notes |
|---|---|---|---|
content_type |
Required | string | MIME type of the evidence (e.g. message/rfc822, image/png) |
payload |
Required | string | Base64-encoded evidence data |
description |
Recommended | string | Human-readable description; max 500 chars |
hash |
Recommended | string | Integrity hash; format: algorithm:hexvalue (e.g. sha256:e3b0c4…) |
size |
Optional | integer | Size in bytes; max 5 MB per item |
The _internal object is a free-form container for operational metadata (ticket IDs, analyst assignments, SLA data, etc.) that parsers must strip before transmitting any report.
Each specific category + type combination has its own schema that extends it with additional required, recommended, and optional fields. The following section describe those per-type schemas which are defined in /schemas/v4/types/.
Purpose: Communication-based abuse including spam and unwanted messaging
Valid Types: spam, bulk_messaging
Evidence Sources: spamtrap, user_complaint, automated_filter, honeypot
Category-Specific Fields:
| Field | Requirement | Applies To | Type | Notes |
|---|---|---|---|---|
protocol |
Required | both | enum | Values differ slightly per type; signal and chat only valid for spam |
smtp_from |
Required | both | Required when protocol=smtp; source_port also required in that case |
|
recipient_count |
Required | bulk_messaging |
integer | Minimum 100 |
subject |
Recommended | both | string | Max 500 characters |
unsubscribe_provided |
Recommended | bulk_messaging |
boolean | Whether the message includes an unsubscribe mechanism |
message_id |
Recommended | spam |
string | Helps with deduplication |
smtp_to |
Recommended | spam |
SMTP envelope recipient | |
sender_name |
Optional | both | string | Display name of the sender |
bulk_indicators |
Optional | bulk_messaging |
object | Structured sub-fields: high_volume, template_based, commercial_sender |
opt_in_evidence |
Optional | bulk_messaging |
boolean | Whether there is evidence of recipient opt-in |
language |
Optional | spam |
string | ISO 639-1 (e.g. en, en-US) |
spam_indicators |
Optional | spam |
object | Structured sub-fields: suspicious_links, commercial_content, bulk_characteristics |
user_agent |
Optional | spam |
string | From message headers |
Schema Definitions:
Examples and formal schema definition: messaging-spam.json, messaging-bulk-messaging.json
Purpose: Malicious or abusive web content — from phishing and malware to fraud, data exposure, brand abuse, and compromised sites
Valid Types: phishing, malware, fraud, csam, csem, exposed_data, brand_infringement, suspicious_registration, remote_compromise
Evidence Notes:
- Screenshot strongly recommended for most types; include HTML source alongside it when possible
- For
csamandcsem, do not attach visual content — use hash values and reference IDs instead
Category-Specific Fields:
| Field | Requirement | Applies To | Type | Notes |
|---|---|---|---|---|
url |
Required | all | uri | URL of the abusive content |
infringement_type |
Required | brand_infringement |
enum | e.g. typosquatting, lookalike, counterfeit |
legitimate_site |
Required | brand_infringement |
uri | URL of the legitimate brand website |
classification |
Required | csam |
enum | Legal classification: baseline, A1, A2, B1, B2 |
detection_method |
Required | csam, csem |
enum | Different valid values per type |
exploitation_type |
Required | csem |
enum | e.g. grooming, sextortion, trafficking |
data_types |
Required | exposed_data |
array | Min 1 item; e.g. credentials, financial, personal_information |
exposure_method |
Required | exposed_data |
enum | e.g. misconfigured_server, git_repository, paste_site |
fraud_type |
Required | fraud |
enum | e.g. investment, romance, tech_support, advance_fee |
compromise_type |
Required | remote_compromise |
enum | e.g. webshell, backdoor, defacement, malicious_redirect |
registration_date |
Required | suspicious_registration |
datetime | When the domain was registered |
suspicious_indicators |
Required | suspicious_registration |
array | Min 1 item; e.g. typosquatting, brand_keyword, fast_flux |
domain |
Recommended | all | string | FQDN of the abusive content |
target_brand |
Recommended | all | string | Most relevant for phishing and brand_infringement |
verification_method |
Recommended | all | enum | How content was verified: manual, automated_crawler, user_report, honeypot, threat_intelligence |
verified_at |
Recommended | all | datetime | When content was last confirmed active |
infringing_elements |
Recommended | brand_infringement |
array | e.g. logo, brand_name, color_scheme, domain_name |
similarity_score |
Recommended | brand_infringement |
number | 0.0–1.0 visual/textual similarity to legitimate site |
content_removed |
Recommended | csam |
boolean | Whether the content has been removed |
hash_values |
Recommended | csam |
object | Sub-fields: sha256 (recommended), photodna (recommended), md5, sha1 |
media_type |
Recommended | csam |
enum | image, video, audio, text, mixed |
ncmec_report_id |
Recommended | csam |
string | NCMEC CyberTipline report ID |
evidence_type |
Recommended | csem |
array | e.g. chat_logs, images, user_profile |
platform |
Recommended | csem |
enum | e.g. social_media, messaging_app, gaming_platform |
reporting_obligations |
Recommended | csem |
array | e.g. NCMEC, IWF, local_law_enforcement |
victim_age_range |
Recommended | csem |
enum | infant, toddler, prepubescent, pubescent, unknown |
affected_organization |
Recommended | exposed_data |
string | Organization whose data was exposed |
encryption_status |
Recommended | exposed_data |
enum | unencrypted, encrypted, partially_encrypted, hashed, unknown |
record_count |
Recommended | exposed_data |
integer | Number of records exposed |
sensitive_fields |
Recommended | exposed_data |
array | Specific data fields exposed (e.g. ssn, credit_card) |
claimed_entity |
Recommended | fraud |
string | Organization or person the fraudster claims to represent |
payment_methods |
Recommended | fraud |
array | e.g. cryptocurrency, wire_transfer, gift_cards |
distribution_method |
Recommended | malware |
enum | e.g. drive_by_download, email_attachment, exploit_kit |
file_hashes |
Recommended | malware |
object | Sub-fields: md5, sha1, sha256 (recommended), ssdeep |
malware_family |
Recommended | malware |
string | e.g. Emotet, TrickBot |
malware_type |
Recommended | malware |
enum | e.g. trojan, ransomware, infostealer, rat |
cloned_site |
Recommended | phishing |
uri | Legitimate site being impersonated |
credential_fields |
Recommended | phishing |
array | Form fields present on the phishing page (e.g. username, password) |
lure_type |
Recommended | phishing |
enum | e.g. account_suspension, security_alert, shipping_notification |
submission_url |
Recommended | phishing |
uri | Where credentials are submitted |
affected_cms |
Recommended | remote_compromise |
enum | e.g. wordpress, joomla, drupal |
compromise_indicators |
Recommended | remote_compromise |
array | Structured IoCs: each item has type and value |
malicious_activities |
Recommended | remote_compromise |
array | e.g. spam_sending, hosting_phishing, data_exfiltration |
persistence_mechanisms |
Recommended | remote_compromise |
array | e.g. cron_job, modified_core_files, htaccess_modification |
webshell_details |
Recommended | remote_compromise |
object | Sub-fields: family, capabilities, password_protected |
days_since_registration |
Recommended | suspicious_registration |
integer | |
predicted_usage |
Recommended | suspicious_registration |
array | e.g. phishing, spam, botnet_c2 |
registrant_details |
Recommended | suspicious_registration |
object | Sub-fields: email_domain, country, privacy_protected, bulk_registrations |
risk_score |
Recommended | suspicious_registration |
number | 0.0–1.0 |
targeted_brands |
Recommended | suspicious_registration |
array | Brands potentially targeted |
asn |
Optional | all | integer | Autonomous System Number |
attack_vector |
Optional | all | enum | e.g. phishing, malware, data_leak, remote_compromise |
country_code |
Optional | all | string | ISO 3166-1 alpha-2 |
dns_records |
Optional | all | object | Key DNS records (A, AAAA, MX, TXT) |
dns_response |
Optional | all | object | DNS query metadata |
hosting_provider |
Optional | all | string | e.g. AWS, Cloudflare, DigitalOcean |
nameservers |
Optional | all | array | DNS nameservers |
registrar |
Optional | all | string | Domain registrar |
screenshot_url |
Optional | all | uri | Reference URL to screenshot evidence |
ssl_certificate |
Optional | all | object | Sub-fields: issuer, subject, valid_from, valid_to, fingerprint |
whois |
Optional | all | object | Sub-fields: registrant, created_date, expiry_date, registrar_abuse_contact |
previous_enforcement |
Optional | brand_infringement |
array | Past actions: each item has date, action, result |
products_offered |
Optional | brand_infringement |
array | Products or services offered on the infringing site |
trademark_details |
Optional | brand_infringement |
object | Sub-fields: registration_number, jurisdiction, category (Nice classes) |
account_suspended |
Optional | csam |
boolean | Whether associated accounts were suspended |
perpetrator_indicators |
Optional | csem |
object | Sub-fields: account_id, ip_addresses, pattern_of_behavior |
accessibility |
Optional | exposed_data |
enum | public, requires_authentication, dark_web, removed |
data_format |
Optional | exposed_data |
enum | e.g. csv, sql, json |
discovery_source |
Optional | exposed_data |
enum | e.g. security_researcher, breach_monitoring |
sample_records |
Optional | exposed_data |
array | Redacted samples for verification; max 5 items |
cryptocurrency_addresses |
Optional | fraud |
array | Each item requires currency and address |
loss_amount |
Optional | fraud |
object | Sub-fields: currency (ISO 4217), amount |
c2_servers |
Optional | malware |
array | Each item: address, port, protocol |
exploit_cve |
Optional | malware |
array | CVEs exploited; format: CVE-YYYY-NNNNN |
file_metadata |
Optional | malware |
object | Sub-fields: filename, file_size, file_type, mime_type |
persistence_mechanism |
Optional | malware |
array | e.g. registry, scheduled_task, dll_hijacking |
sandbox_analysis |
Optional | malware |
object | Sub-fields: sandbox_name, analysis_url, verdict, score |
targeted_platforms |
Optional | malware |
array | e.g. windows, linux, android |
detection_evasion |
Optional | phishing |
array | e.g. geo_blocking, captcha, user_agent_filtering |
phishing_kit |
Optional | phishing |
string | Known kit identifier e.g. 16Shop, LogoKit |
redirect_chain |
Optional | phishing |
array | URL redirect sequence leading to phishing page |
cleanup_status |
Optional | remote_compromise |
enum | not_cleaned, partially_cleaned, cleaned, reinfected, unknown |
vulnerability_exploited |
Optional | remote_compromise |
object | Sub-fields: cve, description, component |
activation_behavior |
Optional | suspicious_registration |
object | Sub-fields: time_to_activation, initial_content |
related_domains |
Optional | suspicious_registration |
array | Other domains linked by registrant, nameserver, IP, or pattern |
ssl_certificate_details |
Optional | suspicious_registration |
object | Sub-fields: issued_immediately, free_certificate, wildcard |
Schema Definitions:
Examples and formal schema definition: content-base.json, content-phishing.json, content-malware.json, content-fraud.json, content-csam.json, content-csem.json, content-exposed-data.json, content-brand_infringement.json, content-suspicious_registration.json, content-remote_compromise.json
Purpose: Intellectual property infringement across file hosting, link sites, P2P networks, Usenet, UGC platforms, and direct web hosting
Valid Types: copyright, cyberlocker, link_site, p2p, usenet, ugc_platform
Category-Specific Fields:
| Field | Requirement | Applies To | Type | Notes |
|---|---|---|---|---|
infringing_url |
Required | copyright, cyberlocker, link_site, ugc_platform |
uri | URL of the infringing content |
hosting_service |
Required | cyberlocker |
string | Name of the file hosting service |
site_name |
Required | link_site |
string | Name of the link aggregation site |
p2p_protocol |
Required | p2p |
enum | bittorrent, edonkey, gnutella, kademlia, other; swarm_info with info_hash or magnet_uri also required (via anyOf) |
platform_name |
Required | ugc_platform |
string | Name of the UGC platform (e.g. YouTube, TikTok) |
newsgroup |
Required | usenet |
string | Newsgroup name; message_info.message_id also required (via anyOf) |
rights_holder |
Recommended | all | string | Organization or person holding the copyright |
work_title |
Recommended | all | string | Title of the copyrighted work |
work_category |
Recommended | all except copyright |
enum | e.g. movie, tv_show, music, software, ebook; values vary slightly per type |
infringement_type |
Recommended | copyright, ugc_platform |
enum | Different valid values per type |
file_info |
Recommended | cyberlocker |
object | Sub-fields: filename, file_size, file_hash, upload_date, download_count |
uploader_info |
Recommended | cyberlocker, ugc_platform |
object | Sub-fields vary per type |
link_info |
Recommended | link_site |
object | Sub-fields: page_title, posting_date, uploader, download_count, link_count |
linked_content |
Recommended | link_site |
array | Each item requires target_url and link_type; max 50 items |
site_category |
Recommended | link_site |
enum | e.g. torrent_index, direct_download_links, streaming_links |
swarm_info |
Recommended | p2p |
object | info_hash or magnet_uri required; also: torrent_name, file_count, total_size |
content_info |
Recommended | ugc_platform |
object | Sub-fields: content_id, content_title, upload_date, content_duration, view_count |
match_details |
Recommended | ugc_platform |
object | Sub-fields: match_confidence, match_duration, match_percentage, reference_id |
message_info |
Recommended | usenet |
object | message_id sub-field required within object |
original_url |
Optional | copyright |
uri | URL of the legitimate/original content |
access_method |
Optional | cyberlocker |
enum | direct_link, password_protected, premium_only, time_limited, captcha_protected |
takedown_info |
Optional | cyberlocker |
object | Sub-fields: previous_requests, service_response_time, automated_removal |
search_terms |
Optional | link_site |
array | Terms used to find the infringing links; max 10 items |
site_ranking |
Optional | link_site |
object | Sub-fields: alexa_rank, popularity_score (0.0–10.0) |
peer_info |
Optional | p2p |
object | Sub-fields: peer_id, client_version, upload_amount, download_amount |
release_date |
Optional | p2p |
date | Official release date of the work |
detection_method |
Optional | p2p, usenet |
enum | Different valid values per type |
monetization_info |
Optional | ugc_platform |
object | Sub-fields: monetized, ad_revenue, premium_content |
encoding_info |
Optional | usenet |
object | Sub-fields: encoding_format, par2_recovery, rar_compression |
nzb_info |
Optional | usenet |
object | Sub-fields: nzb_name, nzb_url, indexer_site, completion_percentage |
server_info |
Optional | usenet |
object | Sub-fields: nntp_server, server_group, retention_days |
Schema Definitions:
Examples and formal schema definition: copyright-copyright.json, copyright-cyberlocker.json, copyright-link-site.json, copyright-p2p.json, copyright-usenet.json, copyright-ugc-platform.json
Purpose: Network-level attacks, automated abuse, and unauthorized connection attempts
Valid Types: login_attack, port_scan, ddos, scraping, sql_injection, vulnerability_scan, infected_host, reconnaissance
Category-Specific Fields:
| Field | Requirement | Applies To | Type | Notes |
|---|---|---|---|---|
first_seen |
Required | all | datetime | When activity was first observed |
protocol |
Required | all | enum | Valid values vary by type; icmp and sctp only available for some types |
bot_type |
Required | infected_host |
enum | e.g. malicious, search_engine, ai_agent, unknown |
probed_resources |
Required | reconnaissance |
array | Specific paths probed (e.g. /.env, /.git/config, /.aws/credentials) |
total_requests |
Required | scraping |
integer | Minimum 1 |
scan_type |
Required | vulnerability_scan |
enum | e.g. port_scan, web_vuln_scan, os_fingerprinting, service_enumeration |
destination_ip |
Recommended | all | string | Target IPv4 or IPv6 address; source_port also required for login_attack, port_scan, ddos when source_identifier is an IP |
destination_port |
Recommended | all except vulnerability_scan |
integer | 1–65535 |
attack_vector |
Recommended | ddos |
string | e.g. syn_flood, udp_flood, dns_amplification, memcached_amplification |
peak_bps |
Recommended | ddos |
integer | Peak bits per second |
peak_pps |
Recommended | ddos |
integer | Peak packets per second |
behavior_pattern |
Recommended | infected_host |
enum | e.g. aggressive_crawling, api_abuse, vulnerability_probing |
bot_name |
Recommended | infected_host |
string | Known bot/tool name (e.g. GPTBot, UptimeRobot) |
verification_status |
Recommended | infected_host |
enum | verified, unverified, spoofed, unknown |
user_agent |
Recommended | infected_host, scraping |
string | User-Agent string of the bot or scraper |
resource_categories |
Recommended | reconnaissance |
array | e.g. environment_files, credential_files, version_control, backup_files |
successful_probes |
Recommended | reconnaissance |
array | Resources that returned success responses (200, 301, 302) |
scraping_pattern |
Recommended | scraping |
enum | e.g. sequential, deep_crawling, api_harvesting, sitemap_following |
target_content |
Recommended | scraping |
enum | e.g. product_data, pricing_information, user_profiles, api_data |
attack_technique |
Recommended | sql_injection |
enum | e.g. union_based, boolean_blind, time_blind, stacked_queries |
http_method |
Recommended | sql_injection |
enum | GET, POST, PUT, DELETE, PATCH, HEAD, OPTIONS |
injection_point |
Recommended | sql_injection |
enum | query_parameter, post_body, cookie, header, path, json_parameter |
target_url |
Recommended | sql_injection |
uri | Full URL targeted by the injection attempt |
scanner_signature |
Recommended | vulnerability_scan |
string | Known tool name (e.g. Nmap, Nikto, Nessus, Burp Scanner) |
targeted_ports |
Recommended | vulnerability_scan |
array | List of scanned port numbers |
last_seen |
Optional | all | datetime | When activity was last observed |
amplification_factor |
Optional | ddos |
number | Amplification factor for reflection/amplification attacks |
duration_seconds |
Optional | ddos |
integer | Attack duration in seconds |
mitigation_applied |
Optional | ddos |
boolean | Whether mitigation was applied during the attack |
service_impact |
Optional | ddos |
enum | none, degraded, unavailable |
threshold_exceeded |
Optional | ddos |
datetime | When the detection threshold was exceeded |
accepts_cookies |
Optional | infected_host |
boolean | |
api_endpoints_accessed |
Optional | infected_host |
array | |
follows_crawl_delay |
Optional | infected_host |
boolean | Whether bot honours crawl-delay directive |
javascript_execution |
Optional | infected_host |
boolean | Whether bot executes JavaScript |
request_rate |
Optional | infected_host, scraping |
number | Average requests per second |
respects_robots_txt |
Optional | infected_host, scraping |
boolean | |
automated_tool |
Optional | reconnaissance |
boolean | Whether activity appears to be from an automated tool |
http_methods |
Optional | reconnaissance |
array | HTTP methods used during probing |
response_codes |
Optional | reconnaissance |
array | HTTP response codes received |
total_probes |
Optional | reconnaissance |
integer | Total number of probe attempts |
user_agent |
Optional | reconnaissance, sql_injection, vulnerability_scan |
string | |
bot_signature |
Optional | scraping |
string | Known scraper identity (e.g. Scrapy, AhrefsBot, SemrushBot) |
concurrent_connections |
Optional | scraping |
integer | |
data_volume |
Optional | scraping |
integer | Total bytes transferred |
session_duration |
Optional | scraping |
integer | Duration of scraping session in seconds |
unique_urls |
Optional | scraping |
integer | Number of unique URLs accessed |
attempts_count |
Optional | sql_injection |
integer | Number of injection attempts observed |
payload_sample |
Optional | sql_injection |
string | Sanitized sample of injection payload; max 1000 chars |
scan_rate |
Optional | vulnerability_scan |
number | Requests per second |
targeted_services |
Optional | vulnerability_scan |
array | Services targeted (e.g. http, ssh, mysql, mongodb) |
total_requests |
Optional | vulnerability_scan |
integer | |
vulnerabilities_probed |
Optional | vulnerability_scan |
array | Specific CVEs or vulnerability identifiers probed |
Schema Definitions:
Examples and formal schema definition: connection-login-attack.json, connection-port-scan.json, connection-ddos.json, connection-scraping.json, connection-sql-injection.json, connection-vulnerability-scan.json, connection-infected-host.json, connection-reconnaissance.json
Purpose: Exposed services, known CVEs, and security misconfigurations that require remediation
Valid Types: cve, misconfiguration, open_service
Category-Specific Fields:
| Field | Requirement | Applies To | Type | Notes |
|---|---|---|---|---|
service |
Required | all | string | Vulnerable service or component name |
cve_id |
Required | cve |
string | CVE identifier; format: CVE-YYYY-NNNNN |
service_port |
Required | cve |
integer | Port where vulnerable service is running (1–65535) |
cvss_score |
Recommended | cve |
number | CVSS score 0.0–10.0 |
evidence_source |
Recommended | cve |
enum | vulnerability_scan, researcher_analysis, automated_discovery, penetration_testing |
exploitability |
Recommended | cve |
enum | theoretical, poc_available, functional, weaponized |
patch_available |
Recommended | cve |
boolean | Whether a patch is available |
risk_level |
Recommended | cve |
enum | info, low, medium, high, critical |
service_version |
Recommended | cve |
string | Version of the vulnerable service |
severity |
Recommended | cve |
enum | informational, low, medium, high, critical |
cve_ids |
Optional | cve |
array | Additional CVE identifiers; max 10 items |
cvss_vector |
Optional | cve |
string | CVSS v3 vector string (e.g. CVSS:3.1/AV:N/...) |
cvss_version |
Optional | cve |
enum | 2.0, 3.0, 3.1 |
disclosure_date |
Optional | cve |
datetime | When CVE was publicly disclosed |
impact_assessment |
Optional | cve |
object | Sub-fields: confidentiality, integrity, availability (each none/low/high) |
patch_url |
Optional | cve |
uri | URL to patch or fix information |
patch_version |
Optional | cve |
string | Version that fixes the vulnerability |
remediation_priority |
Optional | cve |
enum | low, medium, high, critical, emergency |
vendor_advisory |
Optional | cve |
uri | URL to vendor security advisory |
Schema Definitions:
Examples and formal schema definition: vulnerability-cve.json, vulnerability-misconfiguration.json, vulnerability-open-service.json
Purpose: Botnet infections and compromised servers requiring remediation or takedown
Valid Types: botnet, compromised_server
Category-Specific Fields:
| Field | Requirement | Applies To | Type | Notes |
|---|---|---|---|---|
compromise_evidence |
Required | botnet |
string | Evidence of how the compromise was detected (e.g. C2 communication observed) |
compromise_method |
Required | compromised_server |
string | Method used to compromise the server |
bot_capabilities |
Recommended | botnet |
array | Capabilities observed: ddos, spam, proxy, keylogger, file_download, remote_shell, cryptocurrency_mining, data_theft |
c2_protocol |
Recommended | botnet |
enum | Protocol used for C2 communications: http, https, tcp, udp, dns, irc, p2p, custom |
c2_server |
Recommended | botnet |
string | Command and control server domain or IP |
malware_family |
Recommended | botnet |
string | Malware family classification (e.g. mirai, emotet, zeus); max 200 chars |
Schema Definitions:
Examples and formal schema definition: infrastructure-botnet.json, infrastructure-compromised-server.json
Purpose: IP/domain blocklist inclusion and threat intelligence reports
Valid Types: blocklist, threat_intelligence
Category-Specific Fields:
| Field | Requirement | Applies To | Type | Notes |
|---|---|---|---|---|
threat_type |
Required | all | string | Type of threat for blocklist inclusion or intelligence report |
Schema Definitions:
Examples and formal schema definition: reputation-blocklist.json, reputation-threat-intelligence.json
Text Evidence:
text/plain- Log entries, configuration files, command outputtext/csv- Structured data exportsapplication/json- JSON-formatted data
Message Evidence:
message/rfc822- Complete email messages with headerstext/email- Email fragments or processed content
Image Evidence:
image/png- Screenshots (preferred for web content)image/jpeg- Photos, screenshotsimage/gif- Animated content or screenshots
Document Evidence:
application/pdf- Reports, documentationtext/html- Webpage snapshots
Binary Evidence:
application/octet-stream- Malware samples, unknown binariesapplication/zip- Archive files (must be password protected)
Base64 Encoding:
- All evidence payloads must be valid base64 encoded
- No line breaks or whitespace in encoding
- Proper padding required
- Use standard base64 alphabet (RFC 4648)
Size Management:
- Compress large text evidence before encoding
- Use screenshots instead of full webpage HTML when possible
- Truncate log files to relevant portions
- Consider external reference for very large evidence
For Messaging Class:
{
"evidence": [
{
"content_type": "message/rfc822",
"description": "Original spam email with full headers",
"payload": "base64-encoded-email"
}
]
}For Content Class:
{
"evidence": [
{
"content_type": "image/png",
"description": "Screenshot of phishing page",
"payload": "base64-encoded-screenshot",
"hash": "sha256:abcd1234..."
},
{
"content_type": "text/html",
"description": "HTML source of phishing page",
"payload": "base64-encoded-html"
}
]
}For Connection Class:
{
"evidence": [
{
"content_type": "text/plain",
"description": "Failed authentication log entries",
"payload": "base64-encoded-logs"
}
]
}XARF v4 parsers must support automatic conversion of v3 reports. The conversion process:
- Detect Format: Check for
Versionfield (v3) vsxarf_versionfield (v4) - Generate UUID: Create new UUID v4 for
report_id - Map Fields: Convert v3 structure to v4 structure
- Add Metadata: Include
legacy_version: "3"field - Convert Evidence: Map
Samplesarray toevidenceformat
Conversion Example:
// v3 Input
{
"Version": "3",
"ReporterInfo": {
"ReporterOrg": "Example Security",
"ReporterOrgEmail": "abuse@example.com"
},
"Report": {
"ReportClass": "Activity",
"ReportType": "Spam",
"SourceIp": "192.0.2.1",
"SourcePort": 25,
"Date": "2024-01-01T12:00:00Z",
"SmtpMailFromAddress": "spam@example.com",
"Samples": [{"ContentType": "message/rfc822", "Payload": "..."}]
}
}
// v4 Output (auto-converted)
{
"xarf_version": "4.2.0",
"report_id": "generated-uuid-v4-here",
"legacy_version": "3",
"timestamp": "2024-01-01T12:00:00Z",
"reporter": {
"org": "Example Security",
"contact": "abuse@example.com",
"type": "unknown"
},
"source_identifier": "192.0.2.1",
"source_port": 25,
"category": "messaging",
"type": "spam",
"evidence_source": "unknown",
"protocol": "smtp",
"smtp_from": "spam@example.com",
"evidence": [{"content_type": "message/rfc822", "payload": "..."}],
"tags": []
}| v3 Field | v4 Field | Conversion Notes |
|---|---|---|
Version |
xarf_version |
Set to "4.x.x", add legacy_version: "3" |
ReporterInfo.ReporterOrg |
reporter.org |
Direct mapping |
ReporterInfo.ReporterOrgEmail |
reporter.contact |
Direct mapping |
| N/A | reporter.type |
Set to "unknown" for v3 |
| N/A | report_id |
Generate new UUID v4 |
Report.Date |
timestamp |
Direct mapping |
Report.SourceIp |
source_identifier |
Direct mapping |
Report.SourcePort |
source_port |
Direct mapping |
Report.ReportType |
category + type |
Map to appropriate category/type |
Report.Samples |
evidence |
Convert array structure |
Dual-Format Processing:
1. Receive report (email/API/stream)
2. Detect format (v3 or v4)
3. If v3: convert to v4 automatically
4. Process with v4 logic
5. Store both original and converted versions
Gradual Migration:
Phase 1: Accept both v3 and v4, process as v4 internally
Phase 2: Generate v4 reports, maintain v3 compatibility
Phase 3: Deprecate v3 support with advance notice
Phase 4: v4-only processing
Complete sample reports for all 32 valid category/type combinations are in samples/v4/. Each schema file also contains an inlined example.
The _internal field provides a completely flexible container for operational metadata that is NEVER transmitted between systems. Organizations have complete freedom to define any internal structure needed for their specific operational workflows, integrations, and business processes.
- Never Transmitted: The
_internalfield must be stripped from all transmitted reports - Complete Freedom: Organizations define any structure they need for internal operations
- No Standards Required: No predefined fields - use whatever makes sense for your workflow
- Operational Focus: Design internal fields to support your specific processes and integrations
Example 1 - Security Research Organization:
{
"_internal": {
"threat_id": "THR-2024-001234",
"honeypot_source": "trap_bank_001",
"campaign_cluster": "fake_bank_wave_2024",
"ml_confidence": 0.94,
"analyst_reviewed": false,
"feed_destinations": ["partner_a", "partner_b"]
}
}Example 2 - ISP Abuse Department:
{
"_internal": {
"ticket": "ABUSE-5678",
"customer": "cust_premium_12345",
"assigned_analyst": "john.doe",
"sla_hours": 4,
"auto_actions": ["temp_block", "customer_email"],
"billing_impact": true,
"escalation_path": "tier2_team"
}
}Example 3 - Minimalist Approach:
{
"_internal": {
"id": "internal_12345",
"processed": "2024-01-15T14:30:25Z"
}
}Organizations typically use internal metadata for:
- Workflow tracking: Status, assignments, deadlines, escalations
- Integration data: Ticket IDs, case references, external system links
- Quality metrics: Confidence scores, validation flags, review requirements
- Business data: Customer info, billing, SLA tracking, priority levels
- Technical metadata: Processing times, system versions, batch IDs
- Analysis data: Campaign correlation, threat intelligence, ML scores
- Transmission Stripping: Parsers MUST remove
_internalbefore transmission - Local Storage: Internal fields can be stored alongside transmitted fields
- API Responses: Internal fields should only appear in internal APIs
- Logging: Internal fields are valuable for operational logging
- Custom Fields: Use
customobject for organization-specific needs
- Internal fields may contain sensitive operational data
- Never log internal fields to external systems
- Customer data in internal fields must follow privacy regulations
- Access controls should apply to internal metadata viewing