Extracting Rules - Algodocs

What is an Extractor?

An extractor is a core feature of Algodocs: a set of rules (extracting rules) used to extract data from documents. Your data extraction workflow therefore starts with creating an extractor.

In general, you should create a separate extractor, with its own extracting rules, for every document layout. When your documents have similar layouts, however, a single extractor with more flexible extracting rules can often cover them. You can find more information on creating extractors with various extracting rules in the other articles under the Extracting Rules category.

What do PDF Parser / OCR options in Extracting Rules mean?

Algodocs lets you select, for each extractor, the engine to apply when extracting data from a document. There are three options available:

Both (when PDF Parser fails, then apply OCR)
PDF Parser only
OCR only

The first option is the default: Algodocs first applies PDF Parser and, if it fails to extract data from the document, automatically applies OCR (Optical Character Recognition). So although there are three options, there are only two engines that actually extract data from documents.
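
The fallback behaviour of the default option can be pictured with a minimal Python sketch. This is not Algodocs code: the function names and the two stand-in engines are hypothetical, and treating "empty text" as a failure is an assumption made only for illustration.

```python
from typing import Callable, Optional

Engine = Callable[[bytes], Optional[str]]

def extract_with_fallback(document: bytes, pdf_parser: Engine, ocr: Engine) -> Optional[str]:
    """Try the fast text parser first; fall back to OCR if it yields no text."""
    text = pdf_parser(document)
    if text and text.strip():
        return text
    return ocr(document)

# Hypothetical stand-ins for the two engines, used only to exercise the logic.
def fake_pdf_parser(doc: bytes) -> Optional[str]:
    # Pretend that only documents starting with "TEXT:" carry a text layer.
    return doc.decode() if doc.startswith(b"TEXT:") else None

def fake_ocr(doc: bytes) -> Optional[str]:
    return "ocr:" + doc.decode()

result_text = extract_with_fallback(b"TEXT:Invoice 42", fake_pdf_parser, fake_ocr)
result_scan = extract_with_fallback(b"SCAN:Invoice 42", fake_pdf_parser, fake_ocr)
print(result_text)
print(result_scan)
```

In this sketch the text document is handled by the parser alone, while the "scanned" one falls through to the OCR stand-in.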

What does each of these engines do, and how do you know which one to use?

The PDF Parser engine works only on generated PDF files, that is, PDF files that contain actual text rather than scanned pages. The OCR engine, by contrast, can be applied to all types of documents: text PDF files, scanned PDF files and images.

This raises another question: why do you need the PDF Parser engine at all if OCR handles all types of documents?

The reason is speed. The PDF Parser engine is much faster than the OCR engine: PDF Parser can extract data from a single page in 1–2 seconds, whereas OCR takes 10–15 seconds per page. We therefore advise you to prefer PDF Parser when your PDF files contain text only.

Please note that when a PDF file contains a mixture of text and scanned images, PDF Parser extracts only the text sections and ignores the scanned sections. To extract data from the scanned sections as well, select the ‘OCR only’ option.

If you have trouble extracting data from your documents, please contact our support team.

How to create an extracting rule

We did our best to keep creation of extracting rules in Algodocs as simple as possible. In general, steps for creating an extracting rule are as follows:

Click on ‘Add Rule’ button
Select the data type field (Text, Number, Date, etc.)
(optional step) Select an area of the data you want to extract (do this only if your data is always in a fixed position, i.e. it will not change its position when you upload other documents later)
Click on ‘Extract’ button
Use default filters or add new filters by using ‘Add Filter’ to refine your data until you get the desired output

Watch the introductory tutorial video to get started with creating extracting rules. (Video)

How to extract text from a document

Text extraction is one of the most common extracting rules used in Algodocs, and it is easy to set up. We even advise using the Text data type rather than Number for fields such as Invoice Number or Account Number, because these values sometimes include letters, in which case a Number data type will fail.

To create an extracting rule for the Text data type, first click on ‘Add Rule’ and then select ‘Text’ from the list of data types. Next, decide whether the data you want to extract is always at a fixed position. If it is, select an area around the data by drawing a rectangle around it. If the data might change its position in later documents, skip selecting an area and click on the ‘Extract’ button.

Once the data is extracted, you can refine it further with ‘Add Filter’ until you get the desired output in the required format. For example, if the extracted text spans several lines and you want it on a single line, click on ‘Add Filter’ › ‘Format Text’ › ‘Remove line breaks’.

Another common issue with text extraction is multiple spaces between the extracted words. To remove them, click on ‘Add Filter’ › ‘Format Text’ › ‘Remove blank spaces’ and then select the ‘Multiple Blank Spaces’ option from the dropdown list.
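
To make the effect of these two filters concrete, here is a small Python sketch of the same clean-up. It only illustrates what ‘Remove line breaks’ and ‘Multiple Blank Spaces’ are described as doing; the function names are hypothetical and this is not how Algodocs implements the filters.

```python
import re

def remove_line_breaks(text: str) -> str:
    """Join all non-empty lines into a single line, separated by one space."""
    return " ".join(line.strip() for line in text.splitlines() if line.strip())

def collapse_multiple_spaces(text: str) -> str:
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", text)

raw = "ACME   Trading\nLtd.   Berlin"
cleaned = collapse_multiple_spaces(remove_line_breaks(raw))
print(cleaned)  # ACME Trading Ltd. Berlin
```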

Watch the video tutorial that covers most of the scenarios related to text data extraction. (Video)

Fixed vs variable position extraction

Data extraction from a fixed position in Algodocs is performed by selecting an area around the data you need to extract. So, you simply draw a rectangle around the data. (Image)

Keep in mind that you should select an area only when the data you want to extract is always in a fixed position. If the data can move, do not select an area: once the data falls outside the selected area, it will no longer be extracted.

For data extraction from a variable position, on the other hand, we do not select an area. There are several ways to do this. One is to extract everything from the document and then apply filters to narrow the result down to the data we need, using filters such as ‘Specify Start Position’ and ‘Specify End Position’. Because these filters search for keywords/labels, they can capture the data you need even when its position varies.

With ‘Specify Start Position’ you can use the ‘Text match after’ option, for which you enter at least one keyword. To specify several keywords, separate them with a vertical bar ‘|’. Similarly, with ‘Specify End Position’ you can use, for example, ‘End of line’, which captures everything up to the end of that line.
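
The combination of ‘Text match after’ and ‘End of line’ behaves roughly like the following Python sketch. The helper name and sample text are hypothetical, and the regular expression is only an approximation of the behaviour described above, not the actual filter logic.

```python
import re
from typing import Optional

def text_after_keyword(text: str, keywords: str) -> Optional[str]:
    """Return the text between the first matching keyword and the end of its line.

    `keywords` holds one or more alternatives separated by a vertical bar.
    """
    alternatives = "|".join(
        re.escape(k.strip()) for k in keywords.split("|") if k.strip()
    )
    # With MULTILINE, `$` matches at the end of each line, and `.` never
    # crosses a line break, so the capture stops at the end of the line.
    match = re.search(rf"(?:{alternatives})(.*)$", text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None

page = "ACME Ltd.\nInvoice No: INV-1042\nTotal: 99.00"
value = text_after_keyword(page, "Invoice #|Invoice No:")
print(value)  # INV-1042
```

Because the keywords are searched for rather than located by coordinates, the value is found wherever the label appears on the page.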

Another way of extracting data from a variable position is using ‘Keyword Based Search’ from the list of ‘Data Extraction Field Types’. In this case you will need to specify all possible keywords to which your data is related.

Please contact our support team in case you have difficulties in setting up an extracting rule for extracting data from a variable position.

How to extract numerical data (amounts, totals, etc.)

Extracting numerical data works just like extracting text or other data, except that you select ‘Number’ as the data field type. If you have already selected another data field type, say ‘Text’, and started adding filters to the extracting rule, you can still apply a number filter by clicking on ‘Add Filter’ › ‘Find …’ › ‘Find numbers’.

The ‘Find numbers’ filter searches for valid numbers in the document and outputs only the valid numbers it finds.

The following are sample valid numbers:

1,456.00
5555
$ 2,546.00

The following are sample invalid numbers and will not be captured by ‘Find numbers’ filter:

4564-231-2121
18%
8/11/2020

The ‘Find numbers’ filter lets you select a culture for number formats. You can also search for integers only, floating-point numbers only, or all numbers, and you can specify a range so that only numbers within it are returned. In addition, you can apply ‘Format numbers’, which, for example, removes currency symbols. (Image)
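
The valid and invalid samples listed earlier can be reproduced with a rough Python approximation. The pattern below assumes a US-style culture (comma as thousands separator, dot as decimal separator) and a small set of currency symbols; it is a sketch of the idea, not the rule Algodocs actually applies, and all names are hypothetical.

```python
import re

# Optional currency symbol, then either grouped thousands or plain digits,
# then an optional decimal part.
NUMBER = re.compile(r"(?:[$€£]\s?)?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?")

def looks_like_number(candidate: str) -> bool:
    """True if the whole candidate string is a number in the assumed format."""
    return NUMBER.fullmatch(candidate.strip()) is not None

def to_float(candidate: str) -> float:
    """Drop currency symbols, spaces and thousands separators, then convert."""
    return float(re.sub(r"[^\d.]", "", candidate))

valid_samples = ["1,456.00", "5555", "$ 2,546.00"]
invalid_samples = ["4564-231-2121", "18%", "8/11/2020"]

for sample in valid_samples + invalid_samples:
    print(sample, looks_like_number(sample))
```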

How to extract dates in different formats

Extracting dates from documents usually begins by selecting ‘Date’ as the data field type or by applying a ‘Find dates’ filter via ‘Add Filter’ › ‘Find …’ › ‘Find dates’.

‘Find dates’ filter searches for valid dates in the document based on the format you specify in ‘Format on document’ field. (Image)

You can change the output format of dates by specifying the format in ‘Output format’ field. (Image)

The following table contains detailed information on the date format syntax used in Algodocs.

Day Formatting Syntax
d — Day of the month without leading zeros (1 to 31)
dd — Day of the month, 2 digits with leading zeros (01 to 31)
s — English ordinal suffix for the day of the month, 2 characters (st, nd, rd or th). Works well with ‘d’

Month Formatting Syntax
M — Numeric representation of a month, without leading zeros (1 through 12)
MM — Numeric representation of a month, with leading zeros (01 through 12)
MMM — A short textual representation of a month, three letters (Jan through Dec)
MMMM — A full textual representation of a month (January through December)

Year Formatting Syntax
yyyy — A full numeric representation of a year, 4 digits (e.g., 1991, 2014)
yy — A two digit representation of a year (e.g., 14, 19)

Examples
MM/dd/yyyy → 01/24/2021
M/d/yyyy → 1/24/2021
yyyy-MM-dd → 2021-01-24
MMMM d, yyyy → January 24, 2021
ds MMMM yyyy → 24th January 2021
MMM-dd-yy → Jan-24-21
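
To check how the tokens combine, the syntax above can be emulated with a short Python sketch. The formatter below is a hypothetical illustration that handles only the tokens listed in this section; it treats every occurrence of a token letter as a token, so literal text containing d, M, y or s would need escaping that is not shown here.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Longer tokens come first so that, for example, MMMM is not read as four M tokens.
TOKEN = re.compile(r"MMMM|MMM|MM|M|dd|d|yyyy|yy|s")

def ordinal_suffix(day: int) -> str:
    """English ordinal suffix: 1st, 2nd, 3rd, 4th, 11th, 12th, 13th, 21st, ..."""
    if 11 <= day % 100 <= 13:
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")

def format_date(value: date, pattern: str) -> str:
    """Render `value` using the day, month and year tokens described above."""
    replacements = {
        "d": str(value.day),
        "dd": f"{value.day:02d}",
        "s": ordinal_suffix(value.day),
        "M": str(value.month),
        "MM": f"{value.month:02d}",
        "MMM": MONTHS[value.month - 1][:3],
        "MMMM": MONTHS[value.month - 1],
        "yyyy": f"{value.year:04d}",
        "yy": f"{value.year % 100:02d}",
    }
    return TOKEN.sub(lambda m: replacements[m.group(0)], pattern)

sample = date(2021, 1, 24)
for fmt in ["MM/dd/yyyy", "M/d/yyyy", "yyyy-MM-dd",
            "MMMM d, yyyy", "ds MMMM yyyy", "MMM-dd-yy"]:
    print(fmt, "->", format_date(sample, fmt))
```

Run against 24 January 2021, the sketch reproduces the six example outputs listed above.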

How to extract table rows

Table rows data extraction is the most widely used data extraction type in Algodocs. Algodocs offers very flexible tabular data extraction with no limit on the number of rows, columns or pages, even when a table spans hundreds or thousands of pages.

In order to create an extracting rule for table rows select ‘Table’ as the data field type. You will see that several default column splitters will be placed over your document. You may remove or add more splitters based on your table’s number of columns by clicking on the ‘+’ button on the left side of your document.

As an optional step, you can select an area of your table by drawing a rectangle around the table you want to extract rows from. (Image)

Please note that selecting an area for the table is appropriate only when your table is at a fixed position and will never grow to span multiple pages. If your table may span multiple pages, or may move up or down within the same page, do not select an area for it.

After clicking on ‘Extract’ button the extracted data is displayed. In order to remove column headers that were also extracted, we apply ‘Keep Rows’ filter with a condition ‘Where column 2 has digits only’. (Image)
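
The ‘Keep Rows’ condition used here can be thought of as a simple row filter. The Python sketch below uses made-up sample rows and a hypothetical function name; it only illustrates the condition ‘Where column 2 has digits only’.

```python
from typing import List

Row = List[str]

def keep_rows_where_digits_only(rows: List[Row], column: int) -> List[Row]:
    """Keep rows whose 1-based `column` contains ASCII digits only."""
    index = column - 1
    kept = []
    for row in rows:
        cell = row[index].strip() if len(row) > index else ""
        if cell.isascii() and cell.isdigit():
            kept.append(row)
    return kept

extracted = [
    ["DESCRIPTION", "QTY", "AMOUNT"],  # header row, dropped by the condition
    ["Widget", "2", "10.00"],
    ["Gadget", "15", "45.50"],
]
line_items = keep_rows_where_digits_only(extracted, column=2)
print(line_items)
```

The header row is removed because ‘QTY’ is not made of digits, while both line items are kept.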

As a final step we can set the column headers of our extracted table by clicking ‘Add Filter’ › ‘Alter Columns’ › ‘Set Column Headers’. (Image)

Extract table rows without selecting an area

We discussed table rows extraction in detail in the previous section. This section covers a more advanced approach: extracting table rows without selecting an area, i.e. without drawing a rectangle around the table.

This time Algodocs will extract all data from the document and place it into table columns. To keep only the line items, we need to specify the beginning and the end of our table. To do this, apply ‘Add Filter’ › ‘Alter Rows’ › ‘Keep Section’ and, in ‘Keep Section’, select ‘With Condition’ to specify the start and end of the table section.

When we look at our data we see that our actual line items begin after the word ‘DESCRIPTION’ in column 1 and end just before the word ‘Subtotal’ in column 3. (Images)
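
A ‘Keep Section’ condition of this kind amounts to slicing the rows between a start marker and an end marker. The following Python sketch shows the idea on invented sample rows; the function name and the choice to exclude both marker rows are assumptions for illustration.

```python
from typing import List

Row = List[str]

def keep_section(rows: List[Row], start_col: int, start_word: str,
                 end_col: int, end_word: str) -> List[Row]:
    """Return the rows after the start marker and before the end marker.

    Columns are 1-based; the marker rows themselves are left out.
    """
    section: List[Row] = []
    inside = False
    for row in rows:
        if not inside:
            if start_word in row[start_col - 1]:
                inside = True
            continue
        if end_word in row[end_col - 1]:
            break
        section.append(row)
    return section

all_rows = [
    ["ACME Ltd.", "", "", ""],
    ["DESCRIPTION", "QTY", "PRICE", "AMOUNT"],
    ["Widget", "2", "5.00", "10.00"],
    ["Gadget", "15", "3.00", "45.00"],
    ["", "", "Subtotal", "55.00"],
    ["", "", "Tax", "5.50"],
]
line_items = keep_section(all_rows, 1, "DESCRIPTION", 3, "Subtotal")
print(line_items)
```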

As a final step we can set the column headers of our extracted table by clicking ‘Add Filter’ › ‘Alter Columns’ › ‘Set Column Headers’. (Image)

Extract table rows when cells have multiple lines (merge rows)

Algodocs was designed to handle data extraction from tables of any complexity. One such complex situation is a cell containing multiple lines: the lines are extracted as separate rows even though they belong to a single row. Algodocs offers a ‘Merge Rows’ filter for such cases.

First, place column splitters and click ‘Extract’. Algodocs will extract the data and turn it into a table. Then define the beginning and the end of the table rows you want to extract by applying ‘Keep Section’ (Alter Rows › Keep Section) with ‘With Condition’.

The next step is to merge the rows so that every row begins with the date in the first column. Apply ‘Merge Rows’ (Alter Rows › Merge Rows) and, inside this filter, use a condition such as ‘Where column 1 has a value’. The final table will have merged rows. Other approaches (including RegEx patterns) can also be used. (Images)
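
The merge condition ‘Where column 1 has a value’ can be sketched in Python as follows. The sample rows and the function name are hypothetical, and joining continuation cells with a single space is an assumption about the desired output.

```python
from typing import List

Row = List[str]

def merge_rows(rows: List[Row]) -> List[Row]:
    """Start a new row whenever column 1 has a value; otherwise append the
    non-empty cells of the row to the previous row, column by column."""
    merged: List[Row] = []
    for row in rows:
        if row[0].strip() or not merged:
            merged.append(list(row))
            continue
        previous = merged[-1]
        for i, cell in enumerate(row):
            if cell.strip():
                previous[i] = (previous[i] + " " + cell.strip()).strip()
    return merged

statement = [
    ["01/05/2021", "Card payment", "25.00"],
    ["", "ACME Store Berlin", ""],  # continuation of the row above
    ["01/06/2021", "Transfer", "100.00"],
]
merged = merge_rows(statement)
print(merged)
```

The sketch assumes every row has the same number of columns, which holds once column splitters have been applied.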

Optionally set column headers via ‘Alter Columns’ › ‘Set Column Headers’.

How to extract email addresses

Extracting email addresses is probably the easiest extraction to set up in Algodocs. Select ‘Email Address’ as the data field type. You can also add a filter at any time by clicking on ‘Add Filter’ › ‘Find …’ › ‘Find email addresses’.

‘Find email addresses’ searches for valid email addresses. (Image)
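
For readers curious what such a search looks like in code, here is a simplified Python sketch. The pattern is a common approximation that requires a dot-separated domain ending; it is an assumption for illustration and is neither a full email validator nor the pattern Algodocs uses.

```python
import re

# Simplified pattern: local part, "@", domain, and a final label of 2+ letters.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

text = "Contact billing@example.com or sales.team@example.org, not user@localhost."
found = EMAIL.findall(text)
print(found)  # ['billing@example.com', 'sales.team@example.org']
```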

How to extract data based on keywords

Most of the time, the data we need to extract comes with keywords/labels that identify it, for example Account Number, Invoice #, ID, Statement Date, Purchase Order ID, Vendor, Page, etc. Algodocs therefore offers flexible and easy-to-use approaches for keyword-based extraction.

There are two keyword-based data extraction approaches in Algodocs:

Text match [before | after | including] filters
Advanced keyword-based search

Using Text match filters: Often used when keywords are positioned inline to the left of the data to extract (e.g., Phone, INVOICE #). Use ‘Specify Start Position’ with ‘Text match after’ to enter keywords, and ‘Specify End Position’ with ‘Text match before’ for stopping criteria. Remove trailing spaces if needed (Format Text › Remove blank spaces › Trailing Blank Spaces). (Images)

Advanced Keyword-Based Search: Lets you define a set of keywords or phrases, not limited to inline layouts. Useful when keywords are above the values or in complex layouts. Select ‘Advanced Keyword-Based Search’ as the field type (or Add Filter › Advanced Keyword-Based Search), then define keywords via ‘Edit Keywords’. The engine searches for keywords in order, then looks to the right/bottom areas for their values. (Images)

Data extraction for dates, amounts and other fields is implemented similarly. Contact support for help as needed.

How to extract data using Regex (regular expression)

Data extraction using RegEx is widely used in Algodocs; in some complex or special cases it gives full flexibility. To create a rule that uses RegEx, select ‘Regular Expression’ as the data field type. Another way is via ‘Add Filter’ › ‘Find …’ › ‘Find by Regex’.

Example: Find all dates in a document with ‘M/d/yyyy’ format using the pattern \d{1,2}/\d{1,2}/\d{4}. Check ‘Global’ to return all matches. (Image)
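
The same pattern can be tried outside Algodocs with Python's standard `re` module. In this sketch, `search` returns only the first match and `findall` returns every match, which is presumably comparable to leaving ‘Global’ unchecked or checked; that mapping is an assumption, and the sample text is invented.

```python
import re

pattern = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")
text = "Issued 8/11/2020, due 9/10/2020. Ref 4564-231-2121."

first_match = pattern.search(text).group(0)  # first match only
all_matches = pattern.findall(text)          # every match in the text

print(first_match)  # 8/11/2020
print(all_matches)  # ['8/11/2020', '9/10/2020']
```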

You may visit regex101.com for more information on regular expressions.

How to search and replace extracted data

When you need to replace extracted data, use the ‘Search & Replace’ filter in Algodocs. Click ‘Add Filter’ › ‘Search & Replace’ and select either ‘Search By Text’ or ‘Search By RegEx’.

With ‘Search By Text’ enter the plain text to replace and the replacement value. With ‘Search By RegEx’ enter a regular expression for the search field instead of plain text. (Image)
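
The difference between the two modes can be illustrated with a short Python sketch. The helper names and sample values are hypothetical; the sketch only mirrors the distinction between a literal search and a pattern search.

```python
import re

def replace_by_text(value: str, search: str, replacement: str) -> str:
    """Literal replacement: `search` is matched character for character."""
    return value.replace(search, replacement)

def replace_by_regex(value: str, pattern: str, replacement: str) -> str:
    """Pattern replacement: `pattern` is interpreted as a regular expression."""
    return re.sub(pattern, replacement, value)

amount = replace_by_text("Total: USD 1,250.00", "USD ", "")
invoice = replace_by_regex("INV- 001 / 2021", r"\s+", "")
print(amount)   # Total: 1,250.00
print(invoice)  # INV-001/2021
```

A literal search is the safer choice when the text to replace contains characters such as ‘.’ or ‘$’, which have a special meaning in regular expressions.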

What does Reprocess mean?

There are cases when you might need to tweak your extracting rules after uploading a bunch of documents. Since Algodocs extracts data based on your extracting rules immediately after upload is complete, all documents will be processed and data extracted automatically. However, you might wish to get those documents processed again based on your updated (tweaked) extracting rules.

To apply extraction again, go to the ‘Extracted Data’ section and click ‘Reprocess’. Before reprocessing you might filter documents by folder or upload dates, then click ‘Reprocess’ to queue documents for extraction based on the latest rules.

Note: Reprocessing does not use extra quota. You can reprocess as many times as you need. Your quota is deducted only once per processed page.

Processing time per page

Algodocs has two engines for extracting data from documents — PDF Parser and OCR.

PDF Parser average processing time: 2 seconds per single page
OCR average processing time: 20 seconds per single page or image
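
As a rough planning aid, the averages above can be turned into a back-of-the-envelope estimate. The Python sketch below assumes pages are processed one after another, which may not reflect how Algodocs schedules work, so treat the result as an upper-level guess rather than a guarantee.

```python
PDF_PARSER_SECONDS_PER_PAGE = 2   # average stated above
OCR_SECONDS_PER_PAGE = 20         # average stated above

def estimate_seconds(text_pages: int, scanned_pages: int) -> int:
    """Estimated total time, assuming strictly sequential processing."""
    return (text_pages * PDF_PARSER_SECONDS_PER_PAGE
            + scanned_pages * OCR_SECONDS_PER_PAGE)

total = estimate_seconds(text_pages=100, scanned_pages=10)
print(f"{total} seconds (about {total / 60:.1f} minutes)")
```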

How to delete an extractor

Extractors in Algodocs are the basis for data extraction from documents. Before deleting an extractor, make sure all documents and extracted data associated with it are no longer needed. Go to the ‘Extractors’ section, find the extractor to delete, click the ‘Delete’ button, and confirm the warning. All documents along with extracted data associated with this extractor will be deleted. This operation cannot be undone.

How to delete an individual extracting rule

To delete an extracting rule, go to the extracting rules by clicking on ‘Extracting Rules’ of the relevant extractor. Hover over the rule you want to delete and click the ‘Delete’ button that appears, then confirm the warning.

Note: If you delete an extracting rule, extracted data related to it will not be available anymore. This operation cannot be undone.

How to duplicate an individual extracting rule

Duplicating extracting rules is useful when you need to create a complex rule and already have a similar one. Instead of rebuilding filters, duplicate the existing rule and tweak it.

Go to extracting rules via the relevant extractor, hover over the rule you want to duplicate, and click ‘Duplicate’. Enter a name for the new rule and specify the extractor where the copy will be created. (Video)

How to change a sample document used for creating extracting rules

When creating an extracting rule it is important to use the right sample document, since this document plays a vital role in creating accurate rules. Sometimes extracting rules might fail to extract properly from specific documents. In that case, use the specific “problematic” document as the sample to test and tweak your rules.

To change the sample document: go to extracting rules from the relevant extractor, click the rule you want to test, then at the bottom of the page click ‘Change Sample Document’. In the popup, choose one of the last 50 uploaded documents to use as the sample. (Video)
