blog

here's what i'm working on

Concat to PDF

September 17, 2021 — ~kai

I wrote a script a few months ago to scrape web novels. I extracted them as plaintext and saved every chapter as its own file. Since some books had over a hundred chapters, I wanted to concatenate them so they would be easier to read. Concatenation can be done quickly with cat, and tools like pandoc can generate pretty PDFs.
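For the simple case where every chapter goes into the book, a minimal sketch could look like the following. The function name merge_chapters and the chapter-N.md naming are assumptions for illustration, and sort -V (version sort) assumes GNU coreutils or a recent BSD sort.

```shell
# merge_chapters DIR OUT: concatenate DIR/*.md into OUT in natural order.
# Hypothetical layout: DIR/chapter-1.md, DIR/chapter-2.md, ... DIR/chapter-120.md
merge_chapters() {
    dir=$1
    out=$2
    # sort -V puts chapter-2 before chapter-10; plain glob order would not.
    # Assumes file names contain no newlines.
    printf '%s\n' "$dir"/*.md | sort -V | while IFS= read -r f; do
        cat "$f"
        printf '\n'   # blank line so chapters do not run into each other
    done > "$out"
}
```

Call it as merge_chapters ./chapters ./book.out. Keep the output name from ending in .md if it lives in the same directory, so it is not picked up by the glob on a second run.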


Concatenation was a little difficult if the files were sorted oddly or if only select files needed to be concatenated. I used a little KDE servicemenu to achieve this task in a fun way (check kf5-config --path services to see where it can be placed):

[Desktop Entry]
Type=Service
Icon=smiley-shape
X-KDE-ServiceTypes=KonqPopupMenu/Plugin
MimeType=all/allfiles;
Actions=mergeEntry;
Encoding=UTF-8

[Desktop Action mergeEntry]
Name=Merge selected file(s)
Icon=document-send
Exec=kdialog --msgbox "Will merge the following files:\n$(echo %F | head -c 500)..." && awk 'FNR==1{print ""}1' %F > "./merged$(date +'%s').out"
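
The awk program in the Exec line can be tried on its own in a terminal. This sketch wraps the same one-liner in a shell function (merge_with_blank_lines is a made-up name, not part of the service menu):

```shell
# merge_with_blank_lines FILE...: same awk program as in the Exec line.
# FNR resets to 1 at the start of each input file, so an empty line is
# printed before every file; the trailing 1 prints each input line as-is.
merge_with_blank_lines() {
    awk 'FNR==1{print ""}1' "$@"
}
```

The files are merged in the order they are passed, so this also covers the "only select files" case: merge_with_blank_lines ch3.md ch1.md > merged.out.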

The neat thing about my process is that I was concatenating Markdown files. You know what that means? pandoc can parse these easily and give me a pretty PDF!

To get a larger font size (https://stackoverflow.com/a/46055046), use the extarticle document class, which also supports 14pt, 17pt, and 20pt:

pandoc -V geometry:margin=1in -V documentclass="extarticle" -V fontsize=14pt ...

If the input is Markdown:

pandoc -V geometry:margin=1in -V fontsize=12pt -f markdown -t pdf ... ...

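If you're converting books, try fonts (e.g. Garamond) installed on your computer. pandoc only applies mainfont with the xelatex or lualatex engines, so add --pdf-engine=xelatex if the font seems to be ignored: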

pandoc -V geometry:margin=1in -V fontsize=12pt -V mainfont="Garamond" -f markdown -t pdf ... ...

Or point at a font file (check your ~/.fonts directory):

pandoc -V geometry:margin=1in -V fontsize=12pt -V mainfont="pala.ttf" -f markdown -t pdf ... ...
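
Before passing a family name to mainfont, it can help to check that the system actually knows it. A small sketch, assuming fontconfig's fc-list is installed (it is on most Linux desktops); font_installed is a hypothetical helper name:

```shell
# font_installed NAME: succeeds if an installed font family contains NAME.
# "fc-list : family" prints only the family names of all known fonts.
font_installed() {
    fc-list : family | grep -qiF -- "$1"
}
```

For example: font_installed "Garamond" && echo "ok to use as mainfont". The match is a case-insensitive substring match, so it also succeeds for families like "EB Garamond"; the exact family name still has to be passed to pandoc.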

If you want a table of contents and chapters:

pandoc -V geometry:margin=1in -V fontsize=12pt -f markdown -t pdf myinput.md -o myoutput.pdf --toc

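If you want top-level headings to become chapters (this needs a document class that has chapters, such as book or report): --top-level-division=chapter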

If the text contains CJK characters, read https://stackoverflow.com/a/48090656 first. You must use xelatex and set a valid CJK font:

pandoc -V geometry:margin=1in -V fontsize=12pt -V CJKmainfont="Noto Sans CJK JP" -f markdown -t pdf myinput.md -o myoutput.pdf --toc --pdf-engine=xelatex --standalone

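If you want a complete document with its own header and footer rather than a fragment (pandoc already turns this on for PDF output):

-s aka --standalone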

And the best settings, all together:

markdown to pdf:

pandoc -V geometry:margin=1in -V documentclass="extarticle" -V fontsize=14pt -V mainfont="Garamond" -V CJKmainfont="Noto Sans CJK JP" -f markdown -t pdf myinput.md -o myoutput.pdf --toc --pdf-engine=xelatex --standalone
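
To avoid retyping all of that, the command above can be wrapped in a shell function. This is only a sketch: md2pdf is a made-up name, the fonts are the ones from this post and must be installed, and the flags are copied unchanged from the command above.

```shell
# md2pdf INPUT.md [OUTPUT.pdf]: run the "best settings" command on one file.
md2pdf() {
    src=$1
    dst=${2:-${src%.md}.pdf}   # default output: same name with .pdf
    pandoc \
        -V geometry:margin=1in \
        -V documentclass="extarticle" \
        -V fontsize=14pt \
        -V mainfont="Garamond" \
        -V CJKmainfont="Noto Sans CJK JP" \
        -f markdown -t pdf "$src" -o "$dst" \
        --toc --pdf-engine=xelatex --standalone
}
```

Put it in ~/.bashrc or similar, then run md2pdf merged.md, or md2pdf merged.md book.pdf to choose the output name.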

epub to pdf:

pandoc -V geometry:margin=1in -V fontsize=12pt -V mainfont="Garamond" -V CJKmainfont="Noto Sans CJK JP" -f epub -t pdf myinput.epub -o myoutput.pdf --pdf-engine=xelatex

tags: code