Directory of Software Resources

Tools and utilities for writing, reading, inspecting and managing WARC files.

Resources in This Category

DeDuplicator (Heritrix Add-on)

http://landsbokasafn.github.io/DeDuplicator/
An add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.
GitHub: Alard/warc-proxy

https://github.com/alard/warc-proxy
Viewer for browsing the contents of a WARC file.
GitHub: archiveteam-megawarc-factory

https://github.com/ArchiveTeam/archiveteam-megawarc-factory
Scripts to bundle Archive Team uploads and upload them to Archive.org.
GitHub: cc-warc-examples

https://github.com/Smerity/cc-warc-examples
CommonCrawl WARC/WET/WAT examples and processing code.
GitHub: CDX-Writer

https://github.com/rajbot/CDX-Writer
Python script to create CDX index files of WARC data.
GitHub: Heritrix-Cassandra

https://github.com/openplaces/heritrix-cassandra
A library for writing Heritrix output directly to Cassandra.
GitHub: Megawarc

https://github.com/alard/megawarc
Nondestructive WARC-in-tar to WARC conversion.
GitHub: python-heritrix

https://github.com/gwu-libraries/python-heritrix
Simple Python wrapper around the Heritrix API.
GitHub: warc-mapreduce

https://github.com/vadali/warc-mapreduce
WARC and WET support for Hadoop's MapReduce API.
GitHub: warc-tools

https://github.com/kbullaughey/warc-tools
Miscellaneous tools for processing WARC files from the CommonCrawl.
GitHub: WarcMiddleware

https://github.com/odie5533/WarcMiddleware
Lets you download a mirror copy of a website when running a crawl with Scrapy, the Python web crawler.
GitHub: WarcMITMProxy

https://github.com/odie5533/WarcMITMProxy
HTTP(S) proxy that saves traffic to a WARC file, using libmitmproxy.
GitHub: WarcProxy

https://github.com/odie5533/WarcProxy
Saves proxied HTTP traffic to a WARC file.
GitHub: WarcQtViewer

https://github.com/odie5533/WarcQtViewer
UI to view and manage .warc and .warc.gz files.
GitHub: warctozip-service

https://github.com/alard/warctozip-service
An HTTP-based warc-to-zip converter.
GitHub: Wpull

https://github.com/chfoo/wpull
Wget-compatible web downloader and crawler.
Heritrix

https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
The Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
IIPC: Open Wayback Development

http://netpreserve.org/openwayback
Landing site for open source Wayback development.
Java Web Archive Toolkit (JWAT)

https://sbforge.org/display/JWAT/JWAT
A package to read and validate WARC, ARC and GZip files.
NetarchiveSuite

https://sbforge.org/display/NAS/NetarchiveSuite
A complete web archiving package whose primary function is to plan, schedule and run web harvests of parts of the Internet. It is built around the Heritrix web crawler.
SiteStory

http://mementoweb.github.io/SiteStory/
Transactional archiving, which consists of selectively capturing and storing transactions that take place between a web client (browser) and a web server.
WARCAT

http://warcat.readthedocs.io/en/latest/
Python tool and library for handling Web ARChive (WARC) files.
WarcManager

https://wiki.umiacs.umd.edu/adapt/index.php/WarcManager
Database web application that indexes a collection of WARC data and provides a browsing and search interface to it.
WARCreate

http://warcreate.com/
Extension that allows a user to create a Web ARChive (WARC) file from any browsable webpage. The resulting files can then be used with other tools such as the Internet Archive's open source Wayback Machine.
Web Archiving Integration Layer (WAIL)

http://machawk1.github.io/wail/
A graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages.

Thanks to DMOZ, which built a great web directory for nearly two decades and freely shared it with the web.