
Automatic Speech Recognition


The input parameters below apply to the different attack types. To start working with the APIs, see Audio Speech Recognition.

Audio Speech Recognition is an early-access feature with limited functionality. It is not available as part of the AIShield PyPI package. For early access, please contact [email protected]


Supported ASR Models:

  1. OpenAI-Whisper: All variants in PyTorch and Hugging Face formats are supported. Refer to the OpenAI-Whisper model card for the complete list of available variants. [Ref: OpenAI-Whisper-ModelCard]



Common Parameters

The parameters below are common to the Evasion attack type.

| Parameter | Data type | Description | Remark |
| --- | --- | --- | --- |
| model_Id | String | Model ID received during model registration. Provide this model ID as a query parameter in the URL. | You only need to register a model once to perform model analysis. Registration also lets you track the number of API calls made and their success metrics. |

Request Body (JSON format)

| Parameter | Data type | Description | Remark |
| --- | --- | --- | --- |
| model_framework | String | Framework on which the model was trained. | Currently supported frameworks are: onnx for Evasion and pytorch for data poisoning. |
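As a hedged illustration of how these two pieces fit together (the base URL, endpoint path, and model ID below are hypothetical, not taken from this guide), the registered model ID travels as a URL query parameter while model_framework is part of the JSON request body:

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the actual AIShield analysis URL.
BASE_URL = "https://api.example.com/asr/analysis"

model_id = "abcd1234"  # hypothetical model_Id returned at registration
url = f"{BASE_URL}?{urlencode({'model_id': model_id})}"

# Request body carrying the common parameter described above.
body = json.dumps({"model_framework": "onnx"})

print(url)   # https://api.example.com/asr/analysis?model_id=abcd1234
print(body)  # {"model_framework": "onnx"}
```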



To access all sample artifacts, please visit Artifacts.





File upload format

  • Data: The processed audio data, ready to be passed to the model for prediction, should be saved in a folder.
  • Label: A CSV file should be created with two columns: "audio" and "label". The first column should contain the audio file name, and the second column should contain the label (text). Check the sample label file attached.
  • Model: The model should be saved in ONNX format. This can be ignored when the model is hosted as an API.

Note:

  1. Format: All uploaded files must be in a zipped format.
  2. Dataset: The provided files are sample audio data from the LibriSpeech dataset.
  3. Audio File Properties: Each audio file must not exceed 30 seconds and must have a sampling rate of 16,000 Hz.
  4. Sample Count: Provide between 300 and 500 samples.
  5. Data Sampling Strategy: Ensure that the uploaded samples are representative of the complete dataset, for example by using normally distributed sampling for even coverage.
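A label file in the format described above can be generated with a short script. A minimal sketch using Python's csv module (the file names and transcripts are hypothetical placeholders):

```python
import csv

# Hypothetical audio file names and transcripts; replace with your own data.
samples = [
    ("sample_0001.wav", "hello world"),
    ("sample_0002.wav", "speech recognition test"),
]

with open("labels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["audio", "label"])  # required column headers
    writer.writerows(samples)
```

Zip the folder of processed audio, this label CSV, and (unless the model is hosted as an API) the ONNX model before uploading.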

Parameters

Request Body (JSON format)

| Parameter | Data type | Description | Remark |
| --- | --- | --- | --- |
| model_api_details | String | API details of the hosted model, provided as an encrypted JSON string. | Provide this only if use_model_api is "yes". |
| use_model_api | String | Use a model API endpoint instead of uploading the model as a zip file. | When this parameter is "yes", you do not need to upload the model as a zip file. Instead, pass the API URL along with the other verification credentials in a JSON file. |
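A hedged sketch of a request body for the hosted-model path. The field names inside model_api_details are illustrative assumptions (the guide does not list them), and the encryption step required by AIShield is not shown:

```python
import json

# Hypothetical contents of model_api_details; the actual field names and the
# required encryption are defined by AIShield, not by this sketch.
api_details = {
    "url": "https://models.example.com/whisper/predict",  # hypothetical
    "auth_token": "<your-token>",
}

request_body = {
    "use_model_api": "yes",
    # Shown unencrypted purely for illustration.
    "model_api_details": json.dumps(api_details),
}

print(json.dumps(request_body, indent=2))
```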

Convert PyTorch model to ONNX format for ASR

Converting the PyTorch model to ONNX produces two models (encoder and decoder) in ONNX format.

  • To load an OpenAI model:
    • load_model: "tiny.en", "base.en", "small.en", "medium.en", "large"
  • To load a Distil-Whisper model:
    • load_model: model.bin or "distil-whisper/distil-large-v3"

Loading the Whisper model



Convert to ONNX Format

First, load a sample audio file to extract the audio signal features:



Prepare the input for the decoder model:



Convert and save the model (encoder and decoder) in ONNX format:



File upload format

  • Reference Data: The processed clean audio data, ready to be passed to the model for prediction, should be saved in a folder.
  • Reference Label: A CSV file corresponding to the clean data should be created with two columns: "audio" and "label". The first column should contain the audio file name, and the second column should contain the label (text). Check the sample label file attached.
  • Model: The model should be saved in ONNX format. This can be ignored when the model is hosted as an API.
  • Universal Data: The audio samples under test (poisoned and clean samples) should be saved in a folder.
  • Universal Label: A CSV file corresponding to the universal data should be created with two columns: "audio" and "label". The first column should contain the audio file name, and the second column should contain the label (text). Check the sample label file attached.



Note:

  1. Format: All uploaded files must be in a zipped format.
  2. Dataset: The provided files are sample audio data from the LibriSpeech dataset.
  3. Audio File Properties: Each audio file must not exceed 30 seconds and must have a sampling rate of 16,000 Hz.
  4. Sample Count: Provide between 500 and 700 samples.
  5. Data Sampling Strategy: Ensure that the uploaded samples are representative of the complete dataset, for example by using normally distributed sampling for even coverage.

Conversion of OpenAI-Whisper models to HuggingFace format
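A hedged sketch of one way to obtain a Hugging Face-format Whisper model: each OpenAI-Whisper variant has an equivalent checkpoint published on the Hugging Face Hub (the name below is an assumption of that mapping), which can be loaded with the transformers package and saved locally. transformers also ships a conversion script for raw OpenAI .pt checkpoints, whose exact invocation varies by version:

```python
# Assumes the transformers package and network access to the Hub.
from transformers import WhisperForConditionalGeneration, WhisperProcessor

hf_name = "openai/whisper-tiny.en"  # Hub counterpart of the "tiny.en" variant

model = WhisperForConditionalGeneration.from_pretrained(hf_name)
processor = WhisperProcessor.from_pretrained(hf_name)

# Save locally in Hugging Face format.
model.save_pretrained("whisper-tiny-en-hf")
processor.save_pretrained("whisper-tiny-en-hf")
```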





Updated 04 Feb 2025