ArchiveSpark

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

License

MIT License

Creator

helgeho

Related apps

HadoopConcatGz

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

Java, 9

6 years ago

hadoop, spark, warc

WarcPartitioner

Partition (W)ARC Files by MIME Type and Year

Java, 1, MIT

7 years ago

hadoop, warc, web-archiving

Web2Warc

An easy-to-use and highly customizable crawler that enables you to create your o

Scala, 24, MIT

7 years ago