With a growing number of regulations allowing paper personnel records to be transferred to electronic record-keeping systems, the 21st century has seen a massive move toward paperless, digital operations that save money, increase data safety and speed up manual processes in the long term.
Many startups and IT companies now offer products to store employees' records, create templates or sign documents, significantly easing administrative tasks.
How can companies manage their Human Resources archives?
Managing your HR archives cannot be about owning gigabytes of useless, hard-to-manage scans. It needs to be about extracting vital knowledge of the dynamic between your company and your employees to create momentum and improve everyone's daily work life.
To manage them efficiently, your organisation has to answer questions such as:
How will we collect the data hidden in scans?
Should we extract the data manually from scans, even with OCR software?
Is robotic process automation software capable of helping us since our templates have changed many times over the years?
Data entry is a significant expense. So what can your business do when manual rewriting is a poor option, and cramming your new digital system with scans is no better alternative?
The answer is NER.
Named Entity Recognition (NER) technology can extract and copy valuable pieces of information (names, dates, budgets, addresses, social security numbers, etc.) from an employment contract, whether or not its template has changed.
How does it work?
Natural Language Processing (NLP), one of the numerous fields of Machine Learning (ML), can identify named entities and assign them to predetermined categories of your choosing, such as person, company, date and money.
The accuracy of the NER technology is the same as manual work, but its information processing is a thousand times faster.
Docmatic's NER model extracts the text from the document while maintaining its structure (sentences, paragraphs, pages).
Then, the model identifies named entities and assigns each of them to one of the predetermined categories.
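To make the shape of this output concrete, here is a minimal sketch in Python. It is not Docmatic's model: a few hand-written regular expressions stand in for the trained network, and the names `Entity`, `PATTERNS` and `tag_entities` are hypothetical. What it illustrates is the result of the two steps above: each entity keeps its position in the original text and receives one predetermined category.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    text: str
    label: str
    start: int  # character offset where the entity begins
    end: int    # character offset just past the entity


MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Hypothetical rules standing in for a trained model's predictions.
PATTERNS = {
    "PERSON": re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\. [A-Z][a-z]+(?: [A-Z][a-z]+)*"),
    "DATE": re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b"),
    "MONEY": re.compile(r"(?:EUR|USD|GBP) ?\d[\d,]*(?:\.\d+)?"),
}


def tag_entities(text):
    """Return every match as an Entity, ordered by position in the text."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(Entity(match.group(), label, match.start(), match.end()))
    return sorted(entities, key=lambda e: e.start)


contract = "Ms. Jane Doe joins on 1 March 2021 with a salary of EUR 45,000."
for entity in tag_entities(contract):
    print(entity.label, repr(entity.text), entity.start, entity.end)
```

Because the offsets point back into the untouched text, the document's structure (sentences, paragraphs, pages) can be preserved alongside the extracted values.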
Once the document analysis is complete, the output appears for the user as their original document enriched with highlighted words and groups of words. The different colours of the highlights represent the various categories.
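The highlighted view described above can be approximated with a few lines of Python. This is only a sketch under assumptions: the colour palette, the HTML `<mark>` rendering and the `highlight` function are illustrative choices, not a description of Docmatic's interface.

```python
from html import escape

# Hypothetical colour per category; the real palette is not specified here.
COLOURS = {"PERSON": "#ffd54f", "DATE": "#81d4fa", "MONEY": "#a5d6a7"}


def highlight(text, entities):
    """Wrap each (start, end, label) span in a coloured <mark> tag.

    Spans are assumed not to overlap.
    """
    parts, cursor = [], 0
    for start, end, label in sorted(entities):
        parts.append(escape(text[cursor:start]))  # untouched text before the entity
        parts.append(
            f'<mark style="background:{COLOURS[label]}" title="{label}">'
            f"{escape(text[start:end])}</mark>"
        )
        cursor = end
    parts.append(escape(text[cursor:]))  # remaining text after the last entity
    return "".join(parts)


text = "Jane Doe starts on 1 March 2021."
spans = [(0, 8, "PERSON"), (19, 31, "DATE")]
print(highlight(text, spans))
```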
The model's strength comes from the rich technical corpora that the Docmatic R&D team has created and the auto-tagging technology that categorises the extracted data.
Docmatic's technology is available as a SaaS product and as a plugin for Google Docs and, soon, Microsoft Word. Want to know more?
Docmatic's models are built on Deep Learning technology, a term that refers to the use of Deep Neural Networks.
What is a neural network?
A neural network is a computing architecture in which many simple processing units connect in a manner suggestive of the connections between neurones in a human brain. Like young humans, neural networks can learn through trial and error.
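Learning through trial and error can be shown with the smallest possible network: a single artificial neurone (a perceptron). The sketch below is a textbook illustration, not part of Docmatic's product. The neurone guesses, compares its guess with the expected answer, and nudges its weights whenever it is wrong. The task (the logical AND of two inputs) and all names are chosen purely for illustration.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs is positive, else stay silent (0)."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Perceptron learning rule: adjust the weights only after a wrong guess."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)  # -1, 0 or +1
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Truth table of the logical AND, used as toy training data.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_SAMPLES)
for inputs, target in AND_SAMPLES:
    print(inputs, "->", predict(weights, bias, inputs), "(expected", target, ")")
```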
What are deep neural networks?
Deep neural networks (DNNs) are neural networks with a certain level of complexity: they have more than two layers. They use sophisticated mathematical modelling to process data in complex ways.
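As a rough illustration of what stacking layers means, the following sketch passes an input through three successive layers of weights (two hidden layers plus an output layer). The layer sizes and random weights are arbitrary assumptions, and conventions for counting layers vary; a real network would learn its weights from data rather than draw them at random.

```python
import random


def relu(values):
    """Non-linearity applied between layers: negative values become 0."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum per output unit, plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def make_layer(n_in, n_out, rng):
    """Random, untrained weights, used here only to show the data flow."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(inputs, layers):
    """Feed the input through every layer in turn."""
    values = inputs
    for index, (weights, biases) in enumerate(layers):
        values = dense(values, weights, biases)
        if index < len(layers) - 1:  # no ReLU after the final layer
            values = relu(values)
    return values


rng = random.Random(0)
sizes = [4, 8, 8, 3]  # input size, two hidden layers, three output scores
layers = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
scores = forward([0.5, -1.0, 0.25, 2.0], layers)
print(len(layers), "weight layers ->", len(scores), "output scores")
```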
How does Docmatic's NER technology use DNNs?
In short, when extracting data, Docmatic's NER model maintains the document's structure and its sequential data (the words, the sentences, the phrases, etc.). The model then turns these textual sequences into sequential vector representations containing only numbers (each word is represented by a vector built from 300 values). A vectorisation algorithm built inside the model performs this "transformation" of the textual data into vectors.
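The idea of turning a sequence of words into a sequence of 300-value vectors can be sketched as follows. The vectorisation algorithm inside Docmatic's model is not described here, so this example substitutes a deliberately simple stand-in: each word deterministically seeds a pseudo-random generator. Real embeddings are learned so that similar words get similar vectors, which this toy version does not attempt. `word_vector` and `vectorise` are hypothetical names.

```python
import hashlib
import random

DIM = 300  # vector size stated in the article


def word_vector(word):
    """Map a word to a fixed list of DIM numbers (stand-in for a learned embedding)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def vectorise(sentence):
    """Keep the word order: one vector per word, in sequence."""
    return [word_vector(token) for token in sentence.split()]


sequence = vectorise("Jane Doe joins Docmatic")
print(len(sequence), "words,", len(sequence[0]), "values per word")
```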
Finally, the model analyses the vectors and, drawing on what it learned from its training datasets, decides which category best fits each word. One of our competitive advantages is entity grouping, where the model replicates human interpretation by treating multi-word compounds that are likely to belong together as a single entity.
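One common way to express entity grouping is the BIO tagging convention, where each word is tagged as the Beginning of an entity, Inside one, or Outside any entity. Whether Docmatic uses this exact scheme is an assumption; the sketch below only shows how per-word categories can be merged into multi-word entities. The sample sentence and its tags are invented for the example.

```python
def group_entities(tokens, tags):
    """Merge per-word BIO tags into (entity text, category) pairs."""
    entities = []
    current_tokens, current_label = [], None

    def flush():
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)  # continue the current entity
        else:  # "O": the word belongs to no entity
            flush()
            current_tokens, current_label = [], None
    flush()
    return entities


tokens = ["Jane", "Doe", "joined", "Acme", "Corp", "on", "1", "March", "2021"]
tags = ["B-PERSON", "I-PERSON", "O", "B-ORG", "I-ORG", "O",
        "B-DATE", "I-DATE", "I-DATE"]
print(group_entities(tokens, tags))
```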
Lastly, our solutions use Convolutional Neural Networks (CNNs) to recognise images; CNNs are effective at analysing multi-dimensional data such as images, PDFs or scans. Thanks to CNNs, Docmatic's solutions support every format (scans, PDF, Word, Google Docs, etc.).
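For readers curious about what a CNN actually computes, its core operation is a small grid of weights (a kernel) sliding across the image. The sketch below is a generic illustration of that operation, not Docmatic's image pipeline: a hand-picked kernel detects a vertical dark-to-bright edge in a tiny 4x4 "scan". In a trained CNN, the kernel values are learned rather than chosen by hand.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    k_height, k_width = len(kernel), len(kernel[0])
    out_height = len(image) - k_height + 1
    out_width = len(image[0]) - k_width + 1
    return [
        [
            sum(
                image[row + i][col + j] * kernel[i][j]
                for i in range(k_height)
                for j in range(k_width)
            )
            for col in range(out_width)
        ]
        for row in range(out_height)
    ]


# Toy 4x4 image: dark left half (0), bright right half (1).
image = [[0, 0, 1, 1] for _ in range(4)]

# Hand-picked kernel that responds where brightness rises from left to right.
kernel = [[-1, 1],
          [-1, 1]]

for row in conv2d(image, kernel):
    print(row)  # the middle column lights up where the edge sits
```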