Based, the transcripts thing might wind up being a little wonky. I've noticed that the tool I use, whisper.cpp, has a very hard time producing accurate transcripts for videos that lead in with music, like a stream intro or something, so I end up having to cut the video down to just the main content without the music. That usually works.
If you've got multilingual audio it also shits the bed, but I haven't actually looked into whether there are ways of handling multilingual audio, so I'm not sure if that will be a headache or not.
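For anyone who wants to script the intro-trimming workaround, here's a rough sketch of how it could look. It isn't necessarily how I do it by hand. It assumes ffmpeg is installed and on your PATH and that you already know roughly how long the intro runs. The function names, file names, and the 90-second value are made up for illustration. The output is 16 kHz mono 16-bit WAV, which is the input format the whisper.cpp README asks for.

```python
import subprocess


def build_trim_command(src, dst, intro_seconds):
    """Build an ffmpeg command that skips the intro and writes 16 kHz mono WAV.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    return [
        "ffmpeg",
        "-y",                        # overwrite dst if it already exists
        "-ss", str(intro_seconds),   # seek past the musical intro (seconds)
        "-i", src,
        "-vn",                       # drop the video stream, keep audio only
        "-ac", "1",                  # downmix to mono
        "-ar", "16000",              # resample to 16 kHz
        "-c:a", "pcm_s16le",         # 16-bit PCM WAV
        dst,
    ]


def trim_intro(src, dst, intro_seconds):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_trim_command(src, dst, intro_seconds), check=True)


if __name__ == "__main__":
    # Example values only: skip the first 90 seconds of a hypothetical file.
    print(" ".join(build_trim_command("stream.mp4", "talk.wav", 90)))
```

Calling `trim_intro("stream.mp4", "talk.wav", 90)` would then produce a WAV you can hand to whisper.cpp. You still have to eyeball where the music ends, so this only saves the manual cutting, not the guessing.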
I was confused when I uploaded a longer video and it wasn't playable for a brief period, but now it makes sense.