NLP, our area of expertise, employs artificial intelligence to simulate the human ability to read and understand text.
SentiSquare’s NLP technology is based on Distributional Semantics. This approach makes it possible to represent the meaning of text without any supervision. The principle goes, “You shall know a word by the company it keeps” (Firth 1957).
Essentially, words are presumed to have similar meanings if they occur in similar contexts. That opens an opportunity to quantify meaning: textual expressions can be represented as vectors in a high-dimensional semantic space that encode their distribution over contexts (that is where the name Distributional Semantics comes from).
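To make the idea concrete, here is a minimal sketch of distributional vectors, assuming a toy corpus, whitespace tokenization, and a context window of two words on each side. The function names and the corpus are hypothetical and only illustrate the principle; this is not SentiSquare's implementation.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (an assumption for illustration only).
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "the invoice arrived late",
]

vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_invoice = cosine(vectors["cat"], vectors["invoice"])

# "cat" and "dog" share more contexts than "cat" and "invoice".
print(sim_cat_dog > sim_cat_invoice)
```

No labels are used anywhere: the similarity between "cat" and "dog" comes purely from the contexts they share in the raw text.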
This way, our algorithm learns the meaning of any text. Excitingly, it means that SentiSquare AI is language-independent.
Once we have created such a meaning representation, we still need to interpret it in the desired way.
Take the word "loud", for instance. If you see it in the context of "music festival", you can assume it is meant positively. However, if you see it in the context of "noise", it will probably be negative. Still, "loud" means the same thing in both cases.
What to do? Supervision is required to distinguish such cases. That is why we add it to the mix and call the result "semi-supervised machine learning".
Our use of multiple machine learning methods is why our AI boasts exceptional accuracy. The secret sauce is knowing how to combine unsupervised and supervised learning in just the right way.
In the interpretation phase, we combine all patterns to interpret the meaning in the desired way. Millions of context-based rules are created for each of our models and tuned to its specific functions.
As our algorithms learn directly from our clients’ data, they become a perfect fit for their operational and business needs.
How does SentiSquare AI deliver superior results in real-life deployment?
These are the 3 key success factors.
In any task that concerns text classification, the rate at which the AI returns a correct answer is crucial to its operational impact. While AI accuracy can never reach 100%, SentiSquare AI is getting pretty close to human level.
Not all NLP models can deal with difficult, messy text data – and very few can adapt to different languages. Call transcripts, for instance, are tricky because they contain a lot of errors. Emails are messy because they contain fluff such as footers. SentiSquare AI can deal with any text in any language because it learns patterns directly from data.
For a classification model to be successful, we need to know what it should look for. Otherwise, it will dump a lot of incoming communication into the “other” category. But there is no way to create meaningful rules without clarity about the content of the data. In big data, it is nearly impossible to create a dictionary of rules that covers everything. SentiSquare AI does just that – on its own, through clustering. The result is a thorough understanding of topics and patterns in the data.
How to increase accuracy?
One of the greatest obstacles is the power law problem – annotating data is expensive and time-consuming, but beyond a certain point, adding more annotated data does not improve the model by much. Moreover, the data sparsity problem means that the training data will never be large enough to cover everything the model may encounter.
Overcoming these challenges is a tough task. For example, making neural networks deeper usually yields only a small percentage improvement and brings a risk of overfitting the model.
We approach the problem in two steps:
1) Use unsupervised learning to create a meaning representation.
2) Interpret the representation through supervised learning.
To boost accuracy as far as possible, we use techniques such as neural networks, probabilistic graphical models, and SVD (Singular Value Decomposition) based techniques. That way, we mitigate the power law problem and the sparsity problem, pushing our AI’s accuracy up to near-human level.
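The two-step idea can be sketched as follows. This is a simplified illustration under stated assumptions, not SentiSquare's method: step 1 learns co-occurrence vectors from unlabeled text, and step 2 swaps in a plain nearest-centroid classifier trained on just two labeled examples. All function names, texts, and labels are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def learn_word_vectors(unlabeled_texts, window=2):
    """Step 1 (unsupervised): co-occurrence vectors learned from raw text."""
    vectors = defaultdict(Counter)
    for text in unlabeled_texts:
        tokens = text.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def embed(text, word_vectors):
    """Represent a text as the sum of the vectors of its known words."""
    total = Counter()
    for token in text.lower().split():
        total.update(word_vectors.get(token, {}))
    return total


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train_centroids(labeled, word_vectors):
    """Step 2 (supervised): one centroid per label from a few labeled texts."""
    centroids = defaultdict(Counter)
    for text, label in labeled:
        centroids[label].update(embed(text, word_vectors))
    return centroids


def classify(text, centroids, word_vectors):
    vec = embed(text, word_vectors)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Plenty of unlabeled text (toy data, assumed for illustration).
unlabeled = [
    "the delivery arrived late again",
    "the shipment arrived late again",
    "the parcel arrived damaged",
    "the shipment arrived damaged",
    "the invoice shows a wrong amount",
    "the bill shows a wrong amount",
    "the invoice was charged twice",
    "the bill was charged twice",
]
# Very little labeled text.
labeled = [
    ("my delivery is missing", "logistics"),
    ("my invoice is wrong", "billing"),
]

word_vectors = learn_word_vectors(unlabeled)
centroids = train_centroids(labeled, word_vectors)

# "shipment" and "bill" never appear in the labeled examples.
label_shipment = classify("shipment problem", centroids, word_vectors)
label_bill = classify("bill problem", centroids, word_vectors)
print(label_shipment, label_bill)
```

The words "shipment" and "bill" are absent from the labeled data, yet they are classified sensibly because the unsupervised step placed them near "delivery" and "invoice". This is the sense in which an unsupervised representation can soften the sparsity problem.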
Difference in data
NLP engines are not always versatile enough for the diverse needs businesses have. From language coverage to channel-specific challenges, NLP has long struggled to create value in the face of messy and difficult data. For example, only in recent years has the accuracy of state-of-the-art voice transcription become high enough for real-world applications. Even so, ontology-based models do not work very well on call transcripts. Text from different channels needs to be treated properly – and the NLP models must reflect that. Further, top-notch NLP is often only available in a few languages, leaving many out.
To create real business value, NLP engines need to be fine-tuned to work with the specific types of data they process. NLP scientists at SentiSquare have mastered the fine art of adapting NLP algorithms and preprocessing text data. That includes practices such as tokenization, word normalization, typo detection & correction, email header & footer clearing, and more. For example, we are able to treat different forms and declensions of a word as one without the need to tell the AI – our machine learning takes care of that.
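A few of the preprocessing steps named above can be sketched with the Python standard library. This is a rough illustration only: the header pattern, the footer markers, the vocabulary, and the similarity cutoff are all assumptions, and the typo correction uses `difflib.get_close_matches` as a stand-in for whatever a production system would do.

```python
import re
from difflib import get_close_matches

# Hypothetical markers; real emails need far more robust rules.
HEADER_PATTERN = re.compile(r"^(from|to|cc|subject|date):", re.IGNORECASE)
FOOTER_MARKERS = ("best regards", "kind regards", "sent from my", "--")


def strip_email_noise(text):
    """Drop header lines and everything from the first footer marker on."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if HEADER_PATTERN.match(stripped):
            continue
        if stripped.lower().startswith(FOOTER_MARKERS):
            break
        kept.append(stripped)
    return " ".join(part for part in kept if part)


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def correct_typos(tokens, vocabulary, cutoff=0.8):
    """Replace unknown tokens with the closest vocabulary word, if any."""
    corrected = []
    for token in tokens:
        if token in vocabulary:
            corrected.append(token)
        else:
            match = get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
            corrected.append(match[0] if match else token)
    return corrected


email = (
    "From: jane@example.com\n"
    "Subject: Late parcel\n"
    "\n"
    "Hello, my parcle has not arived yet.\n"
    "\n"
    "Best regards,\n"
    "Jane\n"
    "Sent from my phone"
)
vocabulary = ["hello", "my", "parcel", "has", "not", "arrived", "yet"]

body = strip_email_noise(email)
tokens = correct_typos(tokenize(body), vocabulary)
print(tokens)
```

In this sketch the vocabulary is given by hand; in practice it would more plausibly be derived from the data itself, in line with the learning-from-data approach described above.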
That is what makes our engines versatile and language-independent, including for languages with rich morphology such as Czech, German, and Hungarian. Messy and difficult data is made ready for machine learning – no matter the language or channel.
Not knowing what is in the data
A common obstacle to building an effective classification system is not quite knowing what is really in the data. Sometimes there are hidden high-value patterns or unexpected trends in the data.
That is why, every time we get new data, we put our AI in Discovery Mode – we use unsupervised learning to generate clusters of text pieces with similar meanings, and identify the words that carry the most meaning in the dataset. This way, we quickly uncover the most important themes and patterns within the dataset, and flag possible false assumptions about the data. We use the resulting knowledge to build classification systems that reflect what customers are saying.
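As a rough illustration of clustering texts by meaning without labels, the sketch below uses single-pass "leader" clustering over bag-of-words vectors with a cosine threshold. The algorithm, the stopword list, the threshold, and the sample texts are all assumptions chosen for brevity; the source does not say which clustering method Discovery Mode uses.

```python
from collections import Counter
from math import sqrt

# Tiny hypothetical stopword list for the toy data below.
STOPWORDS = {"the", "my", "is", "was", "a", "on"}


def bag_of_words(text):
    return Counter(t for t in text.lower().split() if t not in STOPWORDS)


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def leader_clustering(texts, threshold=0.3):
    """Assign each text to the most similar cluster, or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [str]}
    for text in texts:
        vec = bag_of_words(text)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(text)
            best["centroid"].update(vec)  # centroid is the sum of member vectors
        else:
            clusters.append({"centroid": Counter(vec), "members": [text]})
    return clusters


def top_terms(cluster, n=3):
    """Most frequent non-stopword terms in a cluster."""
    return [term for term, _ in cluster["centroid"].most_common(n)]


texts = [
    "my parcel arrived damaged",
    "the invoice amount is wrong",
    "parcel delivery was late",
    "wrong amount on the invoice",
    "delivery arrived late",
]

clusters = leader_clustering(texts)
for cluster in clusters:
    print(top_terms(cluster), cluster["members"])
```

On this toy data the delivery complaints and the billing complaints end up in separate clusters, and the most frequent terms in each cluster hint at its theme. Note that the result of leader clustering depends on the order of the input and on the chosen threshold.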
Clustering not only provides insight to our clients to support CX improvement; it also provides a basis for building models for use cases from feedback categorisation to routing automation to churn prediction. In sum, our machine finds the best way to process data and offers solutions by itself. SentiSquare AI can do this in two weeks!