6 hours ago by bmn__
Has anyone had any success getting the software to work?
It's entirely unpackaged: https://repology.org/projects/?search=voice2json https://pkgs.org/search/?q=voice2json
Docker image is broken, how'd that happen?
  $ voice2json --debug train-profile
  ImportError: numpy.core.multiarray failed to import
  Traceback (most recent call last):
    File "/usr/lib/voice2json/.venv/lib/python3.7/site-packages/deepspeech/impl.py", line 14, in swig_import_helper
      return importlib.import_module(mname)
    File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
      return _bootstrap._gcd_import(name[level:], package, level)
    File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
    File "<frozen importlib._bootstrap>", line 983, in _find_and_load
    File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
    File "<frozen importlib._bootstrap>", line 670, in _load_unlocked
    File "<frozen importlib._bootstrap>", line 583, in module_from_spec
    File "<frozen importlib._bootstrap_external>", line 1043, in create_module
    File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  ImportError: numpy.core.multiarray failed to import
5 hours ago by xrd
I tried Docker (both Debian versions of the Dockerfile) and building from scratch; none of them work.
6 hours ago by nerdponx
The source package does have installation instructions and appears to use Autotools: https://voice2json.org/install.html#from-source. Hopefully at least building from source works.
4 hours ago by mdaniel
Building the v2.0 tag (or even master) using Docker does not work either:
  E: The repository 'http://security.ubuntu.com/ubuntu eoan-security Release' does not have a Release file.
And just bumping the image tag to ":groovy" caused subsequent silliness, so this project is obviously only for folks who enjoy fighting with build systems (and that matches my experience of anything in the world that touches NumPy and friends).
7 hours ago by marcodiego
Good FLOSS speech recognition and TTS are badly needed. Such interaction should not be left to an oligopoly with a bad history of not respecting users' freedom and privacy.
7 hours ago by sodality2
Mozilla Common Voice is definitely trying. I do a few validations and record a few clips whenever I have a few minutes to spare, and I recommend everyone do the same. They need volunteers to validate and upload speech clips to create a dataset.
6 hours ago by teraflop
I like the idea, and decided to try doing some validation. The first thing I noticed is that it asks me to make a yes-or-no judgment of whether the sentence was spoken "accurately", but nowhere on the site is it explained what "accurate" means, or how strict I should be.
(The first clip I got was spoken more or less correctly, but a couple of words are slurred together and the prosody is awkward. Without having a good idea of the standards and goals of the project, I have no idea whether including this clip would make the overall dataset better or worse. My gut feeling is that it's good for training recognition, and bad for training synthesis.)
This seems to me like a major issue, since it should take a relatively small amount of effort to write up a list of guidelines, and it would be hugely beneficial to establish those guidelines before asking a lot of volunteers to donate their time. I don't find it encouraging that this has been an open issue for four years, with apparently no action except a bunch of bikeshedding: https://github.com/common-voice/common-voice/issues/273
5 hours ago by xwx
I downloaded the (unofficial) Common Voice app [1] and it provides a link to some guidelines [2], which also aren't official but look sensible and seem like the best there is at the moment.
[1] https://f-droid.org/packages/org.commonvoice.saverio/
[2] https://discourse.mozilla.org/t/discussion-of-new-guidelines...
4 hours ago by cptskippy
After listening to about 10 clips, your point becomes abundantly clear.
One speaker, who sounded like they were from the midwestern United States, was dropping the S off words in a couple of clips. I wasn't sure if it was misreads or some accent I'd never heard.
Another speaker, with a thick accent that sounded European, sounded out all the vowels in "circuit". Had I not had the text of the line being read, I don't think I'd have understood the word.
I heard a speaker with an Indian accent who added a preposition to the sentence; it was inconsequential but incorrect nonetheless.
I hear these random prepositions added as flourishes frequently from some Indian coworkers; does anyone know the reason? It's kind of like how Americans interject "Umm..." or drop prepositions (e.g. "Are you done your meal?"), and I almost didn't pick up on it. For that matter, where did the American habit of dropping prepositions come from? It seems like it's primarily people in the Northeast.
7 hours ago by tootie
If you read the docs, it says voice2json is a layer on top of the actual voice recognition engine, and it supports Mozilla DeepSpeech, Pocketsphinx, and a few others as the underlying engine.
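If I'm reading the docs right, the engine is selected by the profile you use, along the lines of (profile name from memory, so double-check against the docs):
  voice2json --profile en-us_deepspeech-mozilla transcribe-wav test.wav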
7 hours ago by wcarss
I've used the DeepSpeech project a fair amount and it is good. It's not perfect, certainly, and in my mind it honestly isn't good enough yet for accurate transcription, but it's good: easy to work with, pretty good results, and all the right kinds of free.
Thanks for taking time to contribute!
7 hours ago by jfarina
I wonder if they use movies and TV: recordings where the script is already available.
7 hours ago by wongarsu
That's fine for training your own model, but I don't think you could distribute the training set. That seems like a clear copyright violation, against one of the groups that cares most about copyright.
Maybe you could convince a couple of indie creators or state-run programs to licence their audio? But I'm not sure if negotiating that is more efficient than just recording a bit more audio, or promoting the project to get more volunteers.
6 hours ago by kelnos
I expect that wouldn't be perfect, though. Sometimes the cut that makes it into the final product doesn't exactly match the script. Sometimes that's due to an edit; other times an actor says something similar to, but not exactly, what the script says, and the director decides to just go with it.
What might work better is using closed captions or subtitles, but I've also seen enough cases where those don't exactly match the actual speech either.
35 minutes ago by cf
I'd check out Coqui: https://coqui.ai/
It's well documented and works basically out of the box. I wish the bundled STT models were closer to Kaldi's quality, but the ease of use is unmatched.
And maybe with time it will surpass Kaldi in quality too.
7 hours ago by londons_explore
Good speech recognition generally requires massive mountains of training data, both labelled and unlabelled.
Massive mountains of data tend to be incompatible with open-source projects. Even Mozilla collecting user statistics is pretty controversial. Imagine someone like Mozilla trying to collect hundreds of voice clips from each of tens of millions of users!!
7 hours ago by sodality2
> Imagine someone like Mozilla trying to collect hundreds of voice clips from each of tens of millions of users!!
They do, and it's working! https://commonvoice.mozilla.org/en
6 hours ago by londons_explore
Except they have 12k hours of audio, when really they could do with 12B hours of audio...
7 hours ago by marcodiego
Really complicated question, but considering the free world got Wikipedia and OpenStreetMap, I'd bet we'll find a way.
5 hours ago by JadeNB
> Really complicated question, but considering the free world got Wikipedia and OpenStreetMap, I'd bet we'll find a way.
Both of those involve entering data about external things. Asking people to share their own data is another thing entirely—I suspect most people, me included, are much more suspicious about that.
7 hours ago by posmonerd
Not an expert on any of this, but wouldn't already published content (public or proprietary) such as YouTube videos, audiobooks, TV interviews, movies, TV programs, radio programs, podcasts, etc. be useful and exempt from privacy concerns?
Do user-collected clips have something so special that it's critical to collect them?
5 hours ago by eliaspro
Movies etc. would need to be transcribed accurately to be useful for training, and even then they would provide just a single sample for each specific item.
2 hours ago by GekkePrutser
Well, speech recognition for personal use doesn't have to recognise everyone. In fact it's a feature, not a bug, if it recognises only me as the user.
2 hours ago by GekkePrutser
Indeed, and it doesn't have to be as "machine learning" as the big ones.
A FLOSS system would only have my voice to recognise, and I would be willing to spend some time training it. It's a very different use case from a massive cloud service that has to recognise everyone's voice and accent.
7 hours ago by hirundo
I wonder if it would be possible to map vim keybindings to sounds and effectively drive the editor with the mouth when the hands are otherwise occupied. It might be possible to use sounds that compose into pronounceable words with minimal syllables for combinations. What would vim bindings look like as a concise command language suited to human vocalization?
E.g. maybe "dine" maps to d$ and "chine" to c$. So as in keyboard vim you can guess what "dend" and "chend" do.
3 hours ago by krysp
I do this successfully for work using https://talonvoice.com/ - the initial learning curve is steep, but once you learn how to configure and hack on the commands, you can be very effective. I use it maybe half the day to combat lingering RSI symptoms, and with some work I could probably use it for 98% of my computer input. Some people do use it for 100%, afaik.
7 hours ago by twobitshifter
https://youtu.be/8SkdfdXWYaI?t=600
This guy is already there: slurp, slap, scratch, buff, yank.
3 hours ago by skratlo
I now get the joke about Emacs and OS
2 hours ago by synesthesiam
Author here. Thanks to everyone for checking out voice2json!
The TLDR of this project is: a unified command-line interface to different offline speech recognition projects, with the ability to train your own grammar/intent recognizer in one step.
My apologies for the broken packages; I'll get those fixed shortly. My focus lately has been on Rhasspy (https://github.com/rhasspy/rhasspy), which has a lot of the same ideas but a larger scope (full voice assistant).
Questions, comments, and suggestions are welcomed and appreciated!
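For the curious, the basic flow looks roughly like this (simplified; see the docs for the full sentences.ini template language):
  # 1. define intents and example sentences in the profile's sentences.ini, e.g.:
  #    [ChangeLightState]
  #    turn (on | off){state} the (living room | kitchen){name} light
  # 2. train the speech and intent models for the profile
  voice2json train-profile
  # 3. transcribe audio, then pipe the transcription into intent recognition
  voice2json transcribe-wav turn_on.wav | voice2json recognize-intent
  # -> one line of JSON per utterance, roughly:
  #    {"text": "turn on the living room light",
  #     "intent": {"name": "ChangeLightState"},
  #     "slots": {"state": "on", "name": "living room"}}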
7 hours ago by offtop5
Fantastic.
Might use this with a Raspberry Pi to set up some projects around the house. Is it possible to buy higher quality voice data?
4 hours ago by nmstoker
If you're interested in projects on a Pi, you might also be interested in this: https://github.com/rhasspy/rhasspy
It's from the same author.
2 hours ago by GekkePrutser
I like Rhasspy, but the problem I have with it is that it's too much of a toolkit and not enough of an application. There are too many choices to pick for the different components. I think they should pick one of each and really tune them so it works really well. That way they'd take a lot of complexity away from the user.
23 minutes ago by dokem
I can finally build the Jarvis home assistant I dreamt of when first learning to code in high school. Too bad I now know that voice assistant widgets are generally useless.
16 minutes ago by tootie
Speech recognition is actually orthogonal to AI. In my day the AI prototypes (like ELIZA) were basically chat bots. Speech recognition is now very sophisticated and accurate. Determining meaning from human language (spoken or written) is far more advanced than it used to be but still kinda sucks.
8 hours ago by marcodiego
For those who care: MIT license.
6 hours ago by nwalker85
Really interesting use of intents and entities. I feel like some of this is reinventing the wheel, since there is already a grammar specification (https://www.w3.org/TR/speech-grammar/), but the use of intents/entities is novel.
4 hours ago by Edman274
Yeah, in my experience no one uses or supports that specification, which is a shame, because if you're using something like AWS Connect with AWS Lex for telephony IVR, you can't just create a grammar and have AWS Lex figure out how to turn its recognized speech-to-text into something that matches a grammar rule. Lex will return speech-to-text results that follow general English grammar rules, rather than what you might have prompted the user to reply with. You'll be unpleasantly surprised if you think that defining a custom entity as alphanumeric always prevents the utterance "[wʌn]" from sometimes matching "won" instead of "one" or "1".
Edit - Sorry, I realize that's a tangent. What I'm saying is that when I was evaluating speech to text engines for things like IVR systems using AWS and Google, neither of them supported SRGS. Microsoft does, I think, but they didn't have a telephony component, and IBM was ignored from the get go, so "no one" really means "two very large companies."