RepoCoreWeaveCoreWeavepublished Jul 21, 2023seen 6d

coreweave/gutenberg-epub

Go

Open original ↗

Captured source

source ↗
published Jul 21, 2023seen 6dcaptured 8hhttp 200method plain

coreweave/gutenberg-epub

Language: Go

License: MIT

Stars: 5

Forks: 0

Open issues: 0

Created: 2023-07-21T19:59:35Z

Pushed: 2023-10-01T20:04:41Z

Default branch: main

Fork: no

Archived: yes

README:

fast-epubtotxt

A fast epub to txt converter implemented in Golang. This converter acheives speeds of 4-6 million characters/second in testing. For example, it converts the novel _1984_ in ~100-150ms.

Usage

./gutenberg-epub-converter \
-inputDir [INPUT_DIRECTORY] \
-outputDir [OUTPUT_DIRECTORY] \
-writeHeader=[true|false] \
-writeMetadata=[true|false] \
-cleanOutput=[true|false] \
-seperateFolders=[true|false] \
-stopEarly=[INT_NUMBER_OF_BOOKS] \
-silent=[true|false] \
-skipCopyRight=[true|false] \
-gutenbergCleaning=[true|false]

Example:

./gutenberg-epub-converter \
-inputDir ../data/test-lib \
-outputDir ./output \
-writeHeader=true \
-writeMetadata=true \
-cleanOutput=true \
-seperateFolders=true \
-stopEarly=100 \
-skipCopyRight=false \
-gutenbergCleaning=true

Arguments

| Argument | Type | Description | Default Value | | -------- | ---- | ----------- | ------------- | | inputDir | _string_ | Input folder path | ./input | | outputDir | _string_ | Output folder path | ./output | | writeHeader | _bool_ | Write a metadata header to the *.txt file. | true | | writeMetadata | _bool_ | Write metadata to a seperate file. | false | | cleanOutput | _bool_ | Remove strange characters and spacing from the output. | true | | gutenbergCleaning | _bool_ | Perform additional output cleaning for Gutenberg format books. | false | | seperateFolders | _bool_ | Write epub and metadata to a seperate folder per book. | false | | stopEarly | _int_ | The number of books to process before stopping. | 0 (unlimited) | | silent | _bool_ | Suppress console output. | false | | skipCopyRight | _bool_ | Skip all books marked as copyrighted in the metadata. | false |

Build instructions

Build the converter with golang.

go build gutenberg-epub-converter.go

Official icon

![Icon](./iconEpub.png)

Parsing benchmark

This converter processed 55,756 books from the Project Gutenberg library in less than 45 minutes.

Parsing took 44m56.324860172s, parsed 16465734085 characters at a rate of 6106732 characters per second.
Parsed 55756 books, 55340 finished and 416 skipped due to copy right.

Notes

The converter is not exhaustively tested. Please contact me or raise an issue if errors are discovered.

Significant references

  • https://github.com/soskek/bookcorpus/blob/master/epub2txt.py
  • https://github.com/taylorskalyo/goreader/tree/master/epub
  • Taken from a section of code I wrote while working as Coreweave in February 2023.