The automatic estimation of speaker characteristics, such as height, age, and gender, has applications in forensics, surveillance, customer service, and human-robot interaction. These applications often need to produce a response promptly. This work proposes a novel approach to speaker profiling that combines filter bank initializations, such as continuous wavelets and gammatone filter banks, with one-dimensional (1D) convolutional neural networks (CNNs) and residual blocks. The proposed end-to-end model maps the raw waveform to estimates of the speaker's height, age, and gender by learning speaker representations directly from the audio signal, without relying on handcrafted and pre-computed acou...
Automatic Speaker Profiling (ASP) is concerned with estimating the physical traits of a person from their voice. These traits include gender, age, ethnicity, and physical parameters. Reliable ASP has a wide range of applications, such as mobile shopping, customer service, robotics, forensics, security, and surveillance systems. Research in ASP has gained interest in the last decade; however, it has focused on individual tasks, such as age, height, or gender estimation. This work reviews existing studies on the different speaker-profiling tasks, which include age estimation and classification, gender detection, and height and weight estimation. This study aims to provide insight into the work of ASP, available dat...
Beyond the immediate content of speech, the voice can provide rich information about a speaker's demographics, including age and gender. Estimating a speaker's age and gender has a wide range of applications, from voice forensic analysis to personalized advertising, healthcare monitoring, and human-computer interaction. However, estimating a precise age remains difficult because of age ambiguity: utterances from individuals of adjacent ages are frequently indistinguishable. To address this, we propose a novel end-to-end approach, built on Mozilla's Common Voice dataset, that transforms raw audio into high-quality feature representations using Wav2Vec2.0 embeddings. These are then channeled into our self-attentio...