Why the right data input is key: A Machine Learning example
Finding the ‘sweet spot’ of data needs and consumption is critical to a business. With too little data, the business model underperforms. With too much, you run the risk of compromised security and protection. Measuring what data intake is needed, like a balanced diet, is key to optimum performance and output. A healthy diet of data will set a company on the road to maximum results without drifting into the red areas on either side.
Machine learning is not black magic. A simple definition is the application of learning algorithms to data to uncover useful aspects of the input. There are clearly two parts to this process, though: the algorithms themselves and the data being processed and fed in.
Without question, the system needs an ample amount of data to produce the best outcomes. What is crucial, though, is that the data collected is representative of the tasks you intend to perform.
Within speech recognition, for example, this means that you might be interested in any or all of the following attributes:
- formal speech/informal speech
- prepared speech/unprepared speech
- trained speakers/untrained speakers
- general speech/specific speech
- professional recording/amateur recording
In reality, all of these attributes affect how accurately a speech recognition system can perform the tasks required of it. The data needed to tick all the boxes therefore differs from case to case and involves varying degrees of difficulty to obtain. Bear in mind that it is not just the audio that is needed: accurate transcripts are also required to perform training. That probably means that most data will need to be listened to by humans who transcribe or validate it, and that can create a security issue.
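As a rough illustration of checking whether collected data is representative, the sketch below totals the transcribed hours available for each value of an attribute and flags the values that fall short. The manifest layout, the field names (`style`, `recording`, `has_transcript`, `hours`) and the thresholds are all hypothetical, chosen only for this example.

```python
# Hypothetical manifest: one record per batch of collected audio.
# Field names and figures are illustrative assumptions, not real data.
manifest = [
    {"style": "formal", "recording": "professional", "has_transcript": True, "hours": 120.0},
    {"style": "informal", "recording": "amateur", "has_transcript": True, "hours": 4.5},
    {"style": "informal", "recording": "amateur", "has_transcript": False, "hours": 30.0},
    {"style": "formal", "recording": "amateur", "has_transcript": True, "hours": 12.0},
]


def transcribed_hours(records, attribute):
    """Sum the hours of audio that have transcripts, per attribute value."""
    # Start every observed value at zero so that values with no
    # transcribed audio at all still show up in the result.
    totals = {record[attribute]: 0.0 for record in records}
    for record in records:
        if record["has_transcript"]:
            totals[record[attribute]] += record["hours"]
    return totals


def underrepresented(records, attribute, min_hours):
    """List attribute values with fewer transcribed hours than min_hours."""
    totals = transcribed_hours(records, attribute)
    return sorted(value for value, hours in totals.items() if hours < min_hours)


print(transcribed_hours(manifest, "style"))
print(underrepresented(manifest, "style", min_hours=10.0))
```

Audio without a transcript is deliberately excluded from the totals, since untranscribed audio cannot be used for training as described above.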
An automatic speech recognition (ASR) system operates in two modes: training and operation.
Training is most likely managed by the AI/ML company providing the service, which means the company needs access to large amounts of relevant data. In some cases, this is readily available in the public domain anyway: for example, content that has already been broadcast on television or radio and therefore has no associated privacy issues. But this sort of content cannot help with many of the other scenarios in which ASR technology can be used, such as phone call transcription, which has many different characteristics. Obtaining this sort of data can be tied up with contracts for data ownership, privacy and usage restrictions.
In operational use, there is no need to collect audio. You just use the models that have previously been trained. But the obvious temptation is to capture the operational data and use it. However, as mentioned, this is where the challenge begins: ownership of the data. Many cloud solution providers want to use the data openly, as it will enable continuous improvement for the required use cases. Data ownership becomes the lynchpin.
The challenge is to build great models that work really well in any scenario without capturing privately owned data. A balance between quality and security must be struck. This trade-off arises in many computer systems, but data involving people’s voices often, and understandably, generates a great deal of concern.
Finding a solution
To satisfy an ASR system, just enough data must be provided for training so that good systems can be built. There is an option for companies to train their own models, which enables them to maintain ownership of the data. This often requires a complex professional services agreement and a good investment of time to set up, but it can then provide a solution at a reasonable cost very quickly.
ML algorithms are in a constant state of evolution, and techniques now exist that allow smaller data sets to bias systems already trained on big data. In some cases, smaller amounts of data can achieve ‘good enough’ accuracy. The overall issue of data acquisition is not removed, but sometimes less data can provide solutions.
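One simple, well-known way to bias a general model with a small data set is linear interpolation of language models. The sketch below is a toy illustration of that idea only, using unigram word probabilities; it is not a description of how any particular ASR product adapts its models. The example sentences, the function names and the interpolation weight of 0.3 are assumptions made for illustration.

```python
from collections import Counter


def unigram_model(sentences, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    total = sum(counts[word] for word in vocab) + len(vocab)
    return {word: (counts[word] + 1) / total for word in vocab}


def interpolate(general, domain, weight):
    """Blend two models; `weight` is the share given to the domain model."""
    return {
        word: (1 - weight) * general[word] + weight * domain[word]
        for word in general
    }


# Stand-in for the large, general training data (hypothetical sentences).
general_text = [
    "the weather today is sunny",
    "the match ended in a draw",
    "the minister gave a speech today",
]
# Stand-in for a small, customer-owned data set (hypothetical sentences).
domain_text = [
    "i want to check my claim",
    "my claim number is ready",
]

vocab = sorted({w for s in general_text + domain_text for w in s.lower().split()})
general = unigram_model(general_text, vocab)
domain = unigram_model(domain_text, vocab)
adapted = interpolate(general, domain, weight=0.3)

# The domain word "claim" becomes more likely after adaptation.
print(general["claim"], adapted["claim"])
```

The weight controls how strongly the small data set pulls the model toward the target domain; in practice it would be tuned on held-out data from that domain rather than fixed in advance.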
A balanced data diet, achieved through better algorithm tuning and careful filtering and selection of data, can get the best results without collecting everything that has ever been said. More effort may be needed to achieve the best equilibrium. And, without doubt, the industry must maintain its search for ways to make the technology work better without people’s privacy being compromised.
Author: Ian Firth