In ripping my DVDs, I try to future-proof the rips as much as I can by putting in as many elements as I *think* I might need or want someday down the road. One of those elements is subtitles. As far as I can tell, there are three types of subtitles that can be on DVDs: VobSub, closed captioning and SDH. The first two can be extracted fairly easily. I have no idea how to access the SDH ones. I think you need either a newer DVD player or a Blu-ray one.
I’ve been ripping my TV shows, and so far I haven’t seen any really hard and fast rules on what to expect with them on DVD. Part of the reason is that I just haven’t been paying much attention to subtitles until recently.
I was playing with ripping one show last night, and I saw the CC logo on the back of the case, so I went to check the rest of my library to see which other ones had it. Nearly my entire library of Warner Bros. DVDs displayed the logo, even for much older cartoons (Looney Tunes, Scooby-Doo), which is consistent with how much effort the studio puts into the quality of their releases.
I only just started playing with extracting CC though, and only recently added the code to my DVD ripper to extract them, so I have no idea what the other series are like, or whether they have subtitles at all, VobSub or CC. I usually don’t find out until I actually go to rip them.
Extracting the closed captioning subtitles is a lot easier and faster than getting the VobSub streams. For Linux (and Mac and Windows) there’s a nifty OSS program called ccextractor. Once you have your VOB video file on your hard drive, just run ccextractor on it, and it will create an SRT subtitle file of the closed captioning text. It’s great, and really fast, taking probably under a minute on a 60-minute video on my box. By comparison, when ripping a VobSub stream, you need to read the DVD directly, which causes its own bottleneck, and then demux the entire stream. That takes probably around 3 to 5 minutes for an episode of the same length.
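As a rough illustration of the step above, here is a minimal Python sketch that wraps a ccextractor run. The function names and the default output naming are my own assumptions for the example, not part of ccextractor or of my ripper. The only ccextractor usage it relies on is an input file plus `-o` for the output file. I am not certain how every ccextractor version reports a run that finds no captions, so the sketch hands back the exit code instead of raising on a non-zero one.

```python
import shutil
import subprocess
from pathlib import Path


def build_ccextractor_cmd(vob_path, srt_path):
    """Return the argument list for a basic ccextractor run (hypothetical helper)."""
    return ["ccextractor", str(vob_path), "-o", str(srt_path)]


def extract_captions(vob_path, srt_path=None):
    """Run ccextractor on a VOB file.

    Returns a (srt_path, exit_code) tuple. The SRT defaults to the VOB's
    name with an .srt suffix, which is an assumption made for this sketch.
    """
    vob_path = Path(vob_path)
    srt_path = Path(srt_path) if srt_path else vob_path.with_suffix(".srt")

    # Fail early with a clear message if the program is not installed.
    if shutil.which("ccextractor") is None:
        raise FileNotFoundError("ccextractor was not found on PATH")

    result = subprocess.run(build_ccextractor_cmd(vob_path, srt_path))
    return srt_path, result.returncode
```

Calling `extract_captions("episode.vob")` would, under those assumptions, leave the captions in `episode.srt` next to the video.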
Another thing I like about the closed captioning subtitles is that because they are extracted as SRT, it’s easy to look through them since they are just text files. If you’re really anal, you can correct typos yourself. The VobSub subtitles are all bitmaps. I’ve also noticed that on some DVDs where there were issues with framerates or something else, the VobSub timestamps will be off: sometimes the subtitles show up clumped together at the beginning of the film, and sometimes the sync is way off. I think that this has to do with the dumping process, somewhere, but I’m not sure. I’ve never really taken the time to pin down the source.
So, with closed captioning being easier and faster to extract, editable, and free of timestamp issues for me (so far), it’s quickly becoming my preferred subtitle format.
There’s only one small drawback to using ccextractor: you won’t know if there are any captions in the VOB until after it has finished its run. The program will create an .srt file regardless when you run it, but the file will be empty if it couldn’t find any captions. With VobSub, you can know if there are subtitles just by probing the DVD using lsdvd or something similar.
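Since the only way to find out is to look at the output, a small check after the run can save some trouble. This is a minimal Python sketch, assuming the output is a standard SRT file. The function name `srt_has_captions` is made up for the example. It treats a file as having captions only if it contains at least one SRT timing line, so a missing file, a zero-byte file, or a file with nothing but whitespace all count as "no captions".

```python
import re
from pathlib import Path

# An SRT cue timing line looks like: 00:00:01,000 --> 00:00:04,000
CUE_TIMING = re.compile(
    r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}"
)


def srt_has_captions(srt_path):
    """Return True if the SRT file contains at least one cue timing line."""
    path = Path(srt_path)

    # A missing or zero-byte file means nothing was extracted.
    if not path.is_file() or path.stat().st_size == 0:
        return False

    # Replace undecodable bytes rather than failing on odd encodings.
    text = path.read_text(encoding="utf-8", errors="replace")
    return CUE_TIMING.search(text) is not None
```

A ripper could delete the .srt file whenever this returns False, so that empty subtitle tracks never get muxed in.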
Muxing the SRT file into Matroska is simple, too. Just pass it as a file argument and you’re done.
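For example, assuming mkvmerge (from MKVToolNix) is the muxer, the command can be assembled like this in Python. The helper names are hypothetical, and the optional language tag is my own addition to the example. It uses mkvmerge’s `--language TID:lang` option, which applies to the next input file on the command line, and assumes the subtitles are track 0 of the SRT file.

```python
import subprocess


def build_mkvmerge_cmd(output_mkv, video_path, srt_path, language=None):
    """Return an mkvmerge argument list that muxes a video file and an SRT file."""
    cmd = ["mkvmerge", "-o", str(output_mkv), str(video_path)]

    # --language must come before the file it applies to.
    if language:
        cmd += ["--language", f"0:{language}"]

    cmd.append(str(srt_path))
    return cmd


def mux(output_mkv, video_path, srt_path, language=None):
    """Run mkvmerge and return its exit code."""
    cmd = build_mkvmerge_cmd(output_mkv, video_path, srt_path, language)
    return subprocess.run(cmd).returncode
```

With no language given, `build_mkvmerge_cmd("ep1.mkv", "ep1.vob", "ep1.srt")` is just the output option followed by the two input files, which is all "pass it as a file argument" amounts to.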
As a sidenote, while the bend application that I wrote and use to rip DVDs would be a major pain for someone else to set up, I’ve rewritten it recently so that it uses individual classes to access every object directly: DVD, DVD track, DVD VOB, Matroska file. They are standalone classes written in PHP, so if anyone wants to use them, feel free. You would also need my tiny class of shell functions, since they all make calls to it.
The DVDVOB one makes it simple to extract the subtitle stream. In fact, all the classes make things relatively simple. They have made writing my code so much simpler.
Just rip the entire image. K3b does a nice job of this nowadays. Disk space is growing cheaper by the minute (1 TB for under US$100 currently), so invest your money there. With a full image, you can work from the original data (including subtitles), so as your code improves, so does your ability to transfer to new formats.