Note: this is a technical post for all 3 people who might want to know about this subject. I’m guessing that people who need this information already know how to use CLTK locally but have less experience deploying software on platforms, and I’ve written it with that in mind.
So, I updated my Vergilbot to make owl tweets in Latin. As I’d expected, the hardest part wasn’t writing the bot code; it was getting the CLTK and its data modules up on Heroku, where the bot lives. I kept an ongoing list of my trial and error on this in the README, but here’s more context for those who need it:
Ideally you don’t want to keep third-party code (like JS or Python libraries) in your repo because a) it’s not technically your code, b) it makes the file structure messier and c) you’ll need to update it from time to time. Updating libraries yourself requires laborious manual command line/folder shenanigans, and the whole point of automation (nay, programming!) is to avoid manual labor.
When you put a Python app up on Heroku, loading its dependencies is as easy as putting a list of libraries in a requirements.txt file. That’s how you initially load the CLTK as a dependency, and it works fine.
But with language processing libraries like NLTK and CLTK, you also need to download data modules (corpora, if you will) for them to use. This is harder because, unlike the library itself, the data files do actually live in your Heroku repo, but still, in an ideal world, shouldn’t be uploaded from your local repo. NLTK became common enough that Heroku (well, Kenneth Reitz) made a separate NLTK modules file that Heroku loads automatically. In the case of CLTK, there’s no automagic way to download your files yet; not shocking, I guess, given the relative popularity of the two.
Normally, to add your own dependencies to Heroku you’d use a post_compile hook (like this one suggested for the NLTK before it was automagic). “Post-compile” just means you tell Heroku’s app builder to do some stuff after it’s already finished its own build and compile tasks. And normally, you’d tell Heroku to run the Python commands indicated in the docs as the post-compile step. But as I found out through trial and error, that won’t work; in a cleverly hidden “more” comment, I found out that as of January 2017, you can’t build files in a post_compile command and have them stick around. They disappear unless you can control the directory they’re downloading to. And AFAIK you can’t tell CLTK where to download its corpora files (it’s always going to put them in the root folder), so we have a pretty big conflict there.
But I noticed that when you run the import_corpora command in CLTK, all it does is clone a directory with git. So, since the Python commands weren’t working, I wrote out the git commands in the post_compile, and that worked.
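To make that concrete, here is a minimal sketch of doing by hand what the importer does for a single corpus. It assumes CLTK keeps each corpus at data-dir/language/type/name and that the corpus is a repository under the CLTK organization on GitHub. The function name clone_cltk_corpus and the CLTK_GIT_BASE variable are made up for this example and are not part of CLTK.

```shell
# Hypothetical helper: clone one CLTK corpus into a data directory we choose.
# Usage: clone_cltk_corpus <data_dir> <language> <type> <corpus_name>
clone_cltk_corpus() {
  local data_dir="$1"
  local language="$2"
  local corpus_type="$3"
  local name="$4"
  # Assumption: corpora are repositories under the CLTK GitHub organization.
  local base="${CLTK_GIT_BASE:-https://github.com/cltk}"
  local target="$data_dir/$language/$corpus_type/$name"

  # Skip the clone if the corpus is already in place.
  if [ -d "$target/.git" ]; then
    echo "already present: $target"
    return 0
  fi

  # Create the language/type folders, then clone the corpus into them.
  mkdir -p "$(dirname "$target")"
  git clone "$base/$name.git" "$target"
}
```

Calling clone_cltk_corpus cltk_data latin model latin_models_cltk would then put the Latin models in cltk_data/latin/model/latin_models_cltk, relative to wherever you run it. The point is that you, not CLTK, pick the data directory.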
In order to create a post-compile installation, here’s what you do (you can see the full code in my Vergilbot repo):
- Create a bin directory at the root of your app.
- Make an install_cltk script in the bin directory that uses git to create and clone the directories you need.
- Make a post_compile script in the bin directory that loads your install_cltk script. You have to include this, because it’s what Heroku looks for.
- Don’t forget to include the corpora in your path.
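As a rough illustration of those steps (not the actual Vergilbot scripts), the commands below scaffold both files from the root of an app. The sketch assumes the only corpus needed is latin_models_cltk, that Heroku runs the hook with the app's build directory as the working directory, and that a relative cltk_data folder is therefore the right place for the data. Adjust all three to your own bot.

```shell
# Run from the root of the app repository.
mkdir -p bin

# bin/install_cltk: fetch the corpora with git instead of CLTK's importer.
cat > bin/install_cltk <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# Assumption: the hook runs with the app's build directory as the working
# directory, so this relative path ends up inside the deployed app.
DATA_DIR="cltk_data/latin/model"

mkdir -p "$DATA_DIR"
# Assumption: the bot only needs the Latin models corpus.
# --depth 1 skips the git history to keep the download small.
git clone --depth 1 https://github.com/cltk/latin_models_cltk.git \
  "$DATA_DIR/latin_models_cltk"
EOF

# bin/post_compile: the file name Heroku looks for after its own build steps.
cat > bin/post_compile <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

echo "-----> Installing CLTK corpora"
# Find install_cltk next to this script rather than relying on the caller.
"$(dirname "$0")/install_cltk"
EOF

# Both scripts need to be executable.
chmod +x bin/install_cltk bin/post_compile
```

Keeping the git commands in their own install_cltk file means post_compile stays a one-line dispatcher, so adding another corpus later only touches one script.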
It’s not an ideal solution, but it does fulfill the requirements of a) not keeping the data files in the repo and b) keeping the data modules up-to-date with the source code. An alternative might be to include a cltk_data directory in the repo with a yaml file, which is the method suggested for including user-defined CLTK corpora; I haven’t tried that yet, but I think it might still require git in the post_compile, and so I’m not sure it’s much better. And there’s always the possibility of making a separate Heroku buildpack, like this one. That looks more promising.
But frankly, I just wanted to get my owls working this weekend, which I did.