Support RFC 9309 in robotparser #138907
Status: Open (1 of 6 issues completed)
Labels: 3.13 (bugs and security fixes), 3.14 (bugs and security fixes), 3.15 (new features, bugs and security fixes), stdlib (Standard Library Python modules in the Lib/ directory), type-bug (An unexpected behavior, bug, or error)
Bug report

The urllib.robotparser module implements an unofficial standard originally specified at http://www.robotstxt.org/orig.html, with some additions: it supports not only "Disallow" but also "Allow" rules, plus the extra fields "Crawl-delay", "Request-rate", and "Sitemap". The practice of using robots.txt files now differs significantly from that original specification. The new standard, RFC 9309, was published in 2022, but its drafts had served as a de facto standard for many years before that.

There are several open issues about the module's inconsistency with current practice. They can be addressed separately, but to resolve the problem for good we need to implement support for RFC 9309. I consider this a bug fix rather than a feature request, because incorrect handling of robots.txt files can make Python code that uses robotparser behave like a malicious crawler.

See also https://discuss.python.org/t/about-robotparser/103683
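A small sketch of two divergences from RFC 9309 (the robots.txt rules and the "mybot" user agent below are made up for illustration): the current parser applies the first matching rule in file order and matches paths literally, whereas RFC 9309 requires longest-match precedence between Allow/Disallow and support for `*` and `$` wildcards.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",          # blanket disallow...
    "Allow: /private/public.html",  # ...with a longer, more specific allow
    "Disallow: /*.png$",            # RFC 9309 wildcard pattern
    "Crawl-delay: 10",
])

# First-match-in-file-order: the broad Disallow shadows the later,
# longer Allow rule.  RFC 9309's longest-match rule would give True.
print(rp.can_fetch("mybot", "/private/public.html"))   # False

# The wildcard pattern is matched as a literal prefix, so the image
# is not blocked; under RFC 9309 it would be.
print(rp.can_fetch("mybot", "/pic.png"))               # True

# Non-standard extension the module already supports.
print(rp.crawl_delay("mybot"))                         # 10
```

Fixing the cases above (longest-match precedence and wildcard matching) is the core of what RFC 9309 support would require.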
Linked PRs