Development of "GovScape," a system that can search millions of government documents (GovScape lets you easily search millions of government documents)

2026-06-24 University of Washington (UW)

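A research team at the University of Washington has developed GovScape, a multimodal search system for efficiently searching the vast body of public U.S. government documents. It covers government PDF documents collected in the End of Term Web Archive, which is preserved at the end of each U.S. presidential term, and currently indexes about 10 million PDFs (about 71 million pages). In addition to keyword search, GovScape supports AI-based semantic search and visual search based on the features of images and figures, so users can find material by its content or appearance, such as "redacted documents" or "pie charts." The system splits each PDF into per-page images and text and represents each as an embedding, which enables fast, accurate search. By adopting highly efficient AI models, the team also kept the cost of processing about 10 million PDFs to about $1,500, demonstrating high scalability. The team plans to expand coverage to about 70 million PDFs accumulated from 2008 to 2024 and to government documents in other formats, which is expected to improve access to government information and support its use in research, journalism, and policy analysis.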
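The page-level embedding idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the file names, the three-dimensional vectors, and the `search` helper are hypothetical, and real systems use learned text and image encoders with far larger vectors and an approximate nearest-neighbor index. The sketch only shows the general pattern of ranking pages by cosine similarity between a query embedding and each page's text or image embedding; it is not GovScape's actual pipeline.

```python
import math

# Hypothetical toy index: one record per PDF page, each with a text embedding
# and an image embedding. The vectors are made up for illustration only.
PAGES = [
    {"pdf": "foia_release.pdf", "page": 3,
     "text_vec": [0.2, 0.1, 0.9], "image_vec": [0.9, 0.1, 0.1]},
    {"pdf": "budget_report.pdf", "page": 1,
     "text_vec": [0.9, 0.2, 0.1], "image_vec": [0.1, 0.9, 0.2]},
    {"pdf": "budget_report.pdf", "page": 2,
     "text_vec": [0.8, 0.3, 0.2], "image_vec": [0.2, 0.1, 0.9]},
]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query_vec, field, top_k=2):
    """Rank pages by similarity between the query and the chosen embedding field.

    field is "text_vec" for semantic text search or "image_vec" for visual search.
    Returns a list of (score, pdf, page) tuples, best match first.
    """
    scored = [(cosine(query_vec, p[field]), p["pdf"], p["page"]) for p in PAGES]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Assume an encoder mapped the query "redacted documents" to this vector.
    for score, pdf, page in search([1.0, 0.0, 0.0], "image_vec"):
        print(f"{pdf} p.{page}: {score:.3f}")
```

Because every page is indexed separately, a query returns individual pages rather than whole documents, which matches the page-level search described above.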

A University of Washington-led research team created GovScape, an efficient search system for PDFs from the End of Term Web Archive. Users can look up exact keywords, like “FAFSA,” or use a visual search option to query for qualities like “redacted documents.” Photo: University of Washington

<Related information>

GovScape: A Public Multimodal Search System for 70 Million Pages of Government PDFs

Ying-Hsiang Huang, Claire Gong, Shreya Shaji, Alison Yan, Leslie Harka, Albert Du, Anjali Gopal, Samuel J Klein, Shannon Zejiang Shen, Mark Phillips, Trevor Owens, Kyle Deeds, Benjamin Charles Germain Lee
arXiv, last revised 18 May 2026 (this version, v3)
DOI:https://doi.org/10.48550/arXiv.2511.11010

Abstract

Efforts over the past three decades have produced web archives containing billions of webpage snapshots and petabytes of data. The End of Term Web Archive alone contains, among other file types, millions of PDFs produced by the federal government. While preservation with web archives has been successful, significant challenges for access and discoverability remain. For example, current affordances for browsing the End of Term PDFs are limited to downloading and browsing individual PDFs, as well as performing basic keyword search across them. In this paper, we introduce GovScape, a public search system that supports multimodal searches across 10,015,993 federal government PDFs from the 2020 End of Term crawl (70,958,487 total PDF pages) – to our knowledge, all renderable PDFs in the 2020 crawl that are 50 pages or under. GovScape supports four primary forms of search over these 10 million PDFs: in addition to providing (1) filter conditions over metadata facets including domain and crawl date and (2) exact text search against the PDF text, we provide (3) semantic text search and (4) visual search against the PDFs across individual pages, enabling users to structure queries such as “redacted documents” or “pie charts.” We detail the constituent components of GovScape, including the search affordances, embedding pipeline, system architecture, and open source codebase. Significantly, the total estimated compute cost for GovScape’s pre-processing pipeline for 10 million PDFs was approximately $1,500, equivalent to 47,000 PDF pages per dollar spent on compute, demonstrating the potential for immediate scalability. Accordingly, we outline steps that we have already begun pursuing toward multimodal search at the 100+ million PDF scale. GovScape can be found at this https URL.
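The cost figures in the abstract can be checked with a short calculation. The sketch below uses only the numbers stated above (10,015,993 PDFs, 70,958,487 pages, roughly $1,500 of compute). The `projected_cost` helper is hypothetical and assumes that cost scales linearly with the number of PDFs and that the pages-per-PDF ratio stays the same, which the abstract does not claim.

```python
# Figures taken from the abstract; the dollar amount is an approximate estimate.
PDFS = 10_015_993
PAGES = 70_958_487
COST_USD = 1_500

pages_per_dollar = PAGES / COST_USD  # roughly 47,300, reported as about 47,000
pages_per_pdf = PAGES / PDFS         # average page count per indexed PDF


def projected_cost(n_pdfs):
    """Naive linear extrapolation of compute cost to n_pdfs.

    Assumption (not from the source): cost per PDF stays constant at scale.
    """
    return COST_USD * n_pdfs / PDFS


if __name__ == "__main__":
    print(f"pages per dollar: {pages_per_dollar:,.0f}")
    print(f"average pages per PDF: {pages_per_pdf:.2f}")
    print(f"naive cost for 100 million PDFs: ${projected_cost(100_000_000):,.0f}")
```

Under these assumptions, pre-processing 100 million PDFs would cost on the order of $15,000, though real costs would depend on page counts and hardware pricing at that scale.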
