API Reference

This section documents the classes and functions that make up serveliza. Consulting it should help you understand the library's logic for programmatic use.

Electoral Roll

Main class

class serveliza.roll.roll.ElectoralRoll(source, auto=False, *args, **kwargs)[source]

ElectoralRoll represents an electoral roll of the Chilean Electoral Service (Servicio Electoral de Chile, SERVEL), built from PDF files.

The constructor of this class handles several parameters. The class itself handles the source parameter, which sets the path in which to look for pdf files (a directory or a single file), the recursive parameter, which determines whether to search only the root of that path or also each of its subdirectories, and the auto parameter, which determines whether the extraction starts automatically. It also inherits the processor parameter from PDFProcessorMixin. In addition, the constructor instantiates other nested classes, routing their parameters to them.

Parameters:
  • source (str) – The source path of pdf files.
  • auto (bool) – Run the extraction at instantiation.
  • recursive (bool) – Determines whether the search for pdf files in the given source is recursive or limited to the root of the indicated directory.
  • processor (str) – Processor to use (default=’pdftotext’, see more in PDFProcessorMixin).
  • memorize (bool) – Store the data in the memory of the instance (default=True, see more in RollMemorizer).
  • export (bool) – Whether to export the data to csv file(s) (default=False, see more in RollExporter).
  • output (str) – Directory in which to store the csv file(s) (see more in RollExporter).
  • mode (str) – Determines the data export mode. If it is unified (default), a single csv file is created with the data; if it is separated, several files are created according to communal or regional criteria (see more in RollExporter).
  • mode_sep (str) – Criteria for separating files when exporting in separated mode (commune or region, default=”commune”, see more in RollExporter).
  • random_suffix (bool) – Determines whether exported files have a random text string appended to the end (see more in RollExporter).
  • summary (bool) – Determines whether to generate a summary file of the export and the extracted data (see more in RollExporter).

Only the source parameter is required:

>>> obj = ElectoralRoll(source='/path/to/the/pdf/file(s)')
>>> obj.run()  # Start the analysis and extraction of data.

Setting the auto parameter to true in the constructor calls the run method automatically:

>>> obj = ElectoralRoll(source='/path/to/the/pdf/file(s)', auto=True)

See the run method for a better understanding of this class.

inner_class_parser

alias of serveliza.roll.parsers.RollParser

inner_class_adapter

alias of serveliza.roll.adapters.RollAdapter

inner_class_printer

alias of serveliza.roll.printer.RollPrinter

inner_class_memorizer

alias of serveliza.roll.memorizer.RollMemorizer

inner_class_exporter

alias of serveliza.roll.exporter.RollExporter
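
The inner_class_* attributes suggest that each collaborator can be swapped by subclassing. The following is a minimal, self-contained sketch of that pattern, not serveliza's actual code: the Host, QuietHost, Printer and QuietPrinter names are hypothetical, and the way ElectoralRoll really routes its keyword arguments is an assumption.

```python
class Printer:
    """Hypothetical stand-in for a collaborator such as RollPrinter."""

    def __init__(self, verbose=False, **kwargs):
        # Keep only the parameter this collaborator understands.
        self.verbose = verbose

    def emit(self, text):
        return text if self.verbose else ""


class QuietPrinter(Printer):
    """Hypothetical replacement that never emits anything."""

    def emit(self, text):
        return ""


class Host:
    """Hypothetical host that builds its collaborator from a class attribute."""

    inner_class_printer = Printer

    def __init__(self, **kwargs):
        # Route the keyword arguments to the nested class (assumed behaviour).
        self.printer = self.inner_class_printer(**kwargs)


class QuietHost(Host):
    # Overriding the class attribute changes which collaborator is built.
    inner_class_printer = QuietPrinter


loud = Host(verbose=True)
quiet = QuietHost(verbose=True)
```

Under these assumptions, replacing a collaborator only requires overriding one class attribute; the constructor stays untouched.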

run()[source]

ElectoralRoll.run is the main method within the class logic that executes the complete flow of data analysis and extraction:

  • It iterates over the files found, ordered by size from smallest to largest, calling the run_file method with the file, its index and the total.
  • It iterates over each page of each file:
    • Processing it with the library determined in the processor property and defined in the constructor (see more in PDFProcessorMixin).
    • Adapting the rendered page if required by the application and the processor used (see more in RollAdapter).
    • Analyzing the text content to extract its data (see more in RollParser).
    • Memorizing the data in the consolidated storage of the memorizer. This step can be skipped by setting the memorize parameter to false in the constructor (see more in RollMemorizer).
    • Exporting the data to one or more csv files, depending on how the exporter is configured. This step can be enabled by setting the export parameter to true in the constructor (see more in RollExporter).
  • The printer (RollPrinter instance) is called at each stage of the flow and determines whether and how to print to the screen (as declared in the constructor).
>>> roll.run()
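
The first step of the flow (ordering files by size and passing each one with its index and the total) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: order_by_size and iterate_files are hypothetical helpers, and starting the index at 1 is a guess.

```python
import os
import tempfile


def order_by_size(paths):
    """Return the paths sorted by file size, smallest first."""
    return sorted(paths, key=os.path.getsize)


def iterate_files(paths):
    """Yield (path, index, total), mirroring the arguments described for run_file."""
    ordered = order_by_size(paths)
    total = len(ordered)
    # The 1-based index is an assumption made for this sketch.
    for index, path in enumerate(ordered, start=1):
        yield path, index, total


# Demonstration with throwaway files of different sizes.
with tempfile.TemporaryDirectory() as tmp:
    sizes = {"big.pdf": 300, "small.pdf": 10, "medium.pdf": 120}
    paths = []
    for name, size in sizes.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as handle:
            handle.write(b"x" * size)
        paths.append(path)
    visited = [(os.path.basename(p), i, t) for p, i, t in iterate_files(paths)]
```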
run_file(file, file_num, file_total)[source]
Parameters:
  • file (dict) – data of the file.
  • file_num (int) – the number of the file to analyze.
  • file_total (int) – the total number of files to analyze. This parameter and the previous one are needed by the printer.

The run_file method is called by the run method and iterates on each page of each file:

  • Processing it with the library determined in the processor property and defined in the constructor (see more in PDFProcessorMixin).
  • Adapting the rendered page if required by the application and the processor used (see more in RollAdapter).
  • Analyzing the content text to extract its data (see more in RollParser).
  • Memorizing the data in the consolidated storage of the memorizer. This step can be skipped by setting the memorize parameter to false in the constructor (see more in RollMemorizer).
  • Exporting the data to one or more csv files, depending on how the exporter is configured. This step can be enabled by setting the export parameter to true in the constructor (see more in RollExporter).

It also stores the extraction metadata of each file.

sheet_parse(sheet, *args, **kwargs)[source]
Parameters:sheet (str) – sheet in string.
Returns:instance of RollParser.

Method that calls the class defined in the inner_class_parser class attribute, initializing it with the sheet argument.

sheet_memorize(parsed, *args, **kwargs)[source]
Parameters:parsed (obj) – instance of RollParser.

Method that routes a parsed page to the memorize method of the memorizer.

sheet_export(parsed, *args, **kwargs)[source]
Parameters:parsed (obj) – instance of RollParser.
Returns:Absolute path of the csv file where the data was exported.

Method that routes a parsed page to the export_sheet method of the exporter.

printer
Returns:inner instance of RollPrinter.

Property that gives access to the RollPrinter object instantiated in the constructor.

>>> roll.printer.__class__
serveliza.roll.printer.RollPrinter
memorizer
Returns:inner instance of RollMemorizer.

Property that gives access to the RollMemorizer object instantiated in the constructor.

>>> roll.memorizer.__class__
serveliza.roll.memorizer.RollMemorizer
exporter
Returns:inner instance of RollExporter.

Property that gives access to the RollExporter object instantiated in the constructor.

>>> roll.exporter.__class__
serveliza.roll.exporter.RollExporter
is_runned
Returns:boolean.

Boolean property that indicates whether the instance has executed the run method or not.

>>> roll.is_runned
True  # or False
metadata
Returns:dictionary with all metadata.

Property that stores the analysis metadata. It integrates the metadata of each electoral register detected in the analysis.

>>> roll.is_runned
False
>>> roll.metadata
{'files': {'filename.pdf': {'name': 'filename.pdf',
   'bytes': 10000,
   'relative': 'relative/path/filename.pdf',
   'absolute': '/absolute/path/filename.pdf',
   'mtime': datetime.datetime(...),
   'atime': datetime.datetime(...),
   'durations': {'processing': datetime.timedelta(0),
    'adapting': datetime.timedelta(0),
    'parsing': datetime.timedelta(0),
    'memorizing': datetime.timedelta(0),
    'exporting': datetime.timedelta(0)}}},
 'analysis': {'started': None,
  'finalized': None,
  'durations': {'processing': datetime.timedelta(0),
   'adapting': datetime.timedelta(0),
   'parsing': datetime.timedelta(0),
   'memorizing': datetime.timedelta(0),
   'exporting': datetime.timedelta(0)}},
 'rolls': {}}
>>> roll.run()
>>> roll.metadata
{'files': {'filename.pdf': {'name': 'filename.pdf',
   ...
   'rid': 'RID-XXXX',
   'roll': 'PADRON ELECTORAL X - ELECCIONES X XXXX',
   'year': XXXX,
   'region': 'REGION',
   'province': 'PROVINCE',
   'commune': 'COMMUNE',
   'entries': {'total': 999, 'rescue': 0, 'errors': 0},
   'duration': datetime.timedelta(...)}},
 'analysis': {'started': datetime.datetime(...) ...},
 'rolls': {'RID-XXXX': {'roll': 'PADRON ELECTORAL X - ELE...',
   'year': XXXX,
   'regions': ['REGION', ...],
   'communes': ['COMMUNE', ...],
   'provinces': ['PROVINCE', ...],
   'nulls': {'total': 0},
   'entries': {'total': 999, 'rescue': 0, 'errors': 0}}}}
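
Once the analysis has run, the metadata dictionary can be traversed like any nested dictionary. The sketch below totals the entry counters and durations over a hand-written dictionary shaped like the documented output; the summarize helper and all sample values are made up for illustration.

```python
import datetime

# Sample dictionary shaped like the documented output; every value is invented.
metadata = {
    "files": {
        "a.pdf": {
            "commune": "COMMUNE A",
            "entries": {"total": 10, "rescue": 1, "errors": 0},
            "duration": datetime.timedelta(seconds=2),
        },
        "b.pdf": {
            "commune": "COMMUNE B",
            "entries": {"total": 5, "rescue": 0, "errors": 2},
            "duration": datetime.timedelta(seconds=3),
        },
    },
}


def summarize(metadata):
    """Add up the per-file entry counters and durations."""
    totals = {"total": 0, "rescue": 0, "errors": 0}
    elapsed = datetime.timedelta(0)
    for info in metadata["files"].values():
        for key in totals:
            totals[key] += info["entries"][key]
        elapsed += info["duration"]
    return totals, elapsed


totals, elapsed = summarize(metadata)
```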
rid
Returns:string with first roll identifier.

Property that returns the identifier of the electoral roll analyzed. It returns only the first identifier detected; this should not cause problems unless pdf files from different electoral rolls are loaded.

>>> roll.rid
'RID-XXXX'

If the instance did not run, it returns None.

roll
Returns:name of electoral roll.

Property that returns the full name of electoral roll analyzed.

>>> roll.roll
'PADRON ELECTORAL X - ELECCIONES X XXXX'

Internally it uses the rid property. If the instance has not run, it returns None.

entries
Returns:list of data entries of memorizer.

Property that accesses the data entries of the electoral roll analyzed. The data is stored in the RollMemorizer instance.

>>> roll.entries
[[...]...]

Internally it uses the rid property. If the instance has not run, it returns None.

fields
Returns:list of fields of electoral roll.

Property that returns the fields of the electoral roll analyzed.

>>> roll.fields
['nombre',
 'c-identidad',
 'sex',
 'region',
 'provincia',
 'comuna',
 'domicilio-electoral',
 'circunscripcion',
 'mesa',
 'reference']

Internally it uses the rid property. If the instance has not run, it returns None.

errors
Returns:list of errors found.

Property that stores the list of errors found in the analysis. Each error is a dictionary with data that helps keep track of it. The purpose of recording them is to improve the development of serveliza.

>>> roll.errors
[...]
to_dataframe
Returns:Pandas DataFrame instance.
Raises:UserWarning – You need to run the application before converting the result to Pandas DataFrame.

Property that returns the electoral roll data in a new Pandas DataFrame instance.

recursive

Property that determines whether the search for pdf files in the given source is recursive or limited to the root of the indicated directory.

source
Returns:

list of paths to valid pdf files.

Raises:
  • TypeError – source param must be string or list.
  • TypeError – source does not contain valid PDF files.

Property that stores paths of pdf files obtained from a list or string with file paths or directories.

>>> roll.source
['relative/path/to/file.pdf']

The source is passed to the constructor through the parameter of the same name. It can also be redefined through the property setter:

>>> roll.source = ['path/to/file.pdf', '/path/to/dir']
>>> roll.source = '/path/to/dir/or/file.pdf'
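
How a string or a list of paths could be resolved into a list of pdf files, with or without recursion, is sketched below. This is an assumption-laden illustration, not the library's code: collect_pdfs is a hypothetical helper, the error messages are paraphrased, and the real property may validate files differently.

```python
import tempfile
from pathlib import Path


def collect_pdfs(source, recursive=False):
    """Resolve a path or a list of paths into a sorted list of pdf file paths."""
    if isinstance(source, str):
        source = [source]
    elif not isinstance(source, list):
        raise TypeError("source must be a string or a list")

    found = []
    for item in source:
        path = Path(item)
        if path.is_dir():
            # "**/*.pdf" also descends into subdirectories.
            pattern = "**/*.pdf" if recursive else "*.pdf"
            found.extend(p for p in path.glob(pattern) if p.is_file())
        elif path.is_file() and path.suffix.lower() == ".pdf":
            found.append(path)
    if not found:
        raise TypeError("no valid pdf files found in source")
    return sorted(str(p) for p in found)


# Demonstration on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "top.pdf").write_bytes(b"%PDF-")
    (root / "sub" / "nested.pdf").write_bytes(b"%PDF-")
    (root / "notes.txt").write_text("ignored")
    flat = [Path(p).name for p in collect_pdfs(tmp)]
    deep = [Path(p).name for p in collect_pdfs(tmp, recursive=True)]
```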

Roll adapters

exception serveliza.roll.adapters.RollNoisedError[source]

Exception that indicates an analysis error caused by a noisy roll, that is, one that has watermarks intended to prevent the coherent extraction of text.

class serveliza.roll.adapters.PdftotextAdapterMixin[source]

PdftotextAdapterMixin is an adapter for the pdftotext processor.

It is a mixin designed to be inherited in RollAdapter.

adapter_pdftotext(sheet)[source]
Parameters:

sheet (str) – sheet in text string.

Raises:
  • ValueError – Unexpected type of sheet.
  • RollNoisedError – the pdftotext processor cannot process a noised roll. Try the pdfminersix processor.
Returns:

sheet adapted.

Method to adapt a sheet processed by pdftotext before being passed to the parser.

class serveliza.roll.adapters.PdfminersixAdapterMixin[source]

PdfminersixAdapterMixin is an adapter for the pdfminersix processor.

It is a mixin designed to be inherited in RollAdapter.

adapter_pdfminersix(sheet)[source]
Parameters:sheet (list) – sheet in list of pdfminersix elements.
Returns:sheet adapted in text string.

Method to adapt a sheet processed by pdfminersix before being passed to the parser.

It is also capable of eliminating possible noise with watermarks.

class serveliza.roll.adapters.RollAdapter(sheet, processor, *args, **kwargs)[source]
Parameters:
  • sheet (obj) – sheet of the type according to the processor used.
  • processor (str) – processor used in the sheet.

RollAdapter is a class that is instantiated for a sheet by routing it to the appropriate method according to the processor defined in the constructor parameter of the same name.

Practical use:

>>> adapted = RollAdapter(processed_sheet, 'processor-name').sheet
sheet

Property that holds the adapted sheet, which is set in the constructor.
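
Routing a sheet to a method chosen by processor name can be sketched with getattr. The classes below are hypothetical stand-ins that only borrow the documented method names (adapter_pdftotext, adapter_pdfminersix); the adaptation bodies are placeholders, and whether RollAdapter dispatches this way is an assumption.

```python
class PdftotextSketch:
    def adapter_pdftotext(self, sheet):
        # Placeholder adaptation: only validate the type and trim whitespace.
        if not isinstance(sheet, str):
            raise ValueError("Unexpected type of sheet.")
        return sheet.strip()


class PdfminersixSketch:
    def adapter_pdfminersix(self, sheet):
        # Placeholder adaptation: join a list of elements into one string.
        return "\n".join(str(element) for element in sheet)


class AdapterSketch(PdftotextSketch, PdfminersixSketch):
    """Pick the adapter method whose name matches the processor."""

    def __init__(self, sheet, processor):
        method = getattr(self, f"adapter_{processor}", None)
        if method is None:
            raise ValueError(f"Unknown processor: {processor!r}")
        self.sheet = method(sheet)


from_text = AdapterSketch("  text \n", "pdftotext").sheet
from_list = AdapterSketch(["a", "b"], "pdfminersix").sheet
```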

Roll parsers

class serveliza.roll.parsers.RollParser(sheet, auto=True, more_fields=True, *args, **kwargs)[source]

RollParser is intended to be instantiated once for each sheet.

Class attributes beginning with “regex_” correspond to the regular expressions used to detect fields in the header. The regexs_entries class attribute contains a dictionary with the regular expressions for the fields in each record and a key name for each. Finally, the dpa_fixture_path class attribute defines the path of the .json file that contains a compressed dictionary with communes and constituencies.

regex_roll = 'PADRO?Ó?N\\s+ELECTORAL\\s+[A-Z,\\s-]+\\d+'

roll name regex

regex_region = "REGIO?Ó?N[0,]*\\s*:\\s*([A-ZÑ\\'\\s.]*\\s{3})"

region regex

regex_commune = "COMUNA[0,]*\\s*:\\s*([A-ZÑ\\' -]*\\s{3})"

commune regex

regex_province = "PROVINCIA[0,]*\\s*:\\s*([A-ZÑ\\' ]*)"

province regex

regex_total_entries = 'Registros\\s*:\\s*(\\d+)'

total entries regex (optional)

regex_pagination = '[PAaáGgIiNn]+\\s*:?\\s*(\\d*)\\s*de\\s*(\\d*)'

pagination regex (optional)

regexs_entries = {'name': '^[A-ZÑa-z\\s]+', 'rut': '\\d*\\.?\\d+\\.\\d+-[0-9kK]', 'sex': '\\s(VAR|MUJ)[ONER]*\\s', 'table': '\\s(\\d+\\s?\\w?)\\s*\\d*$'}

regex’s for parsing entries.

dpa_fixture_path = '../utils/DPA-commune-circuns.json'

path to commune-circuns json.
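
To see what some of the header patterns capture, they can be run against a small made-up header. The patterns are copied from the class attributes above; the sample text is invented and real SERVEL sheets may be laid out differently.

```python
import re

# Patterns copied from the documented class attributes.
regex_roll = 'PADRO?Ó?N\\s+ELECTORAL\\s+[A-Z,\\s-]+\\d+'
regex_total_entries = 'Registros\\s*:\\s*(\\d+)'
regex_pagination = '[PAaáGgIiNn]+\\s*:?\\s*(\\d*)\\s*de\\s*(\\d*)'

# Invented header text, used only to exercise the patterns.
header = (
    "PADRON ELECTORAL AUDITADO - ELECCIONES MUNICIPALES 2016\n"
    "Registros: 350\n"
    "Pagina: 3 de 120\n"
)

roll_name = re.search(regex_roll, header).group(0)
total_entries = int(re.search(regex_total_entries, header).group(1))
page, pages = (int(n) for n in re.search(regex_pagination, header).groups())
```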

run()[source]

Method that starts the analysis of the electoral roll sheet by executing, in order:

  • decompose
  • parse_header
  • parse_fields
  • parse_entries

It measures the duration of each method executed and saves it under the times key of the metadata property.
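
Measuring the duration of each step can be sketched as follows. The run_timed helper and the key names are hypothetical; the sketch only shows one way of recording a timedelta per step plus a total, not how RollParser actually does it.

```python
import datetime
import time


def run_timed(steps):
    """Run each (name, callable) pair and record how long it took."""
    times = {}
    for name, step in steps:
        started = time.perf_counter()
        step()
        times[name] = datetime.timedelta(seconds=time.perf_counter() - started)
    # The total is the sum of the individual step durations.
    times["total"] = sum(times.values(), datetime.timedelta(0))
    return times


calls = []
steps = [
    ("header", lambda: calls.append("parse_header")),
    ("fields", lambda: calls.append("parse_fields")),
    ("entries", lambda: calls.append("parse_entries")),
]
times = run_timed(steps)
```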

decompose()[source]

Method that decomposes a sheet of the electoral roll from a text string into a list of its lines.

is_decomposed
Returns:boolean.

Property that indicates whether the ‘sheet’ property is decomposed into a list of lines or not.

sheet

Property that contains a text string of the entire sheet to parse or the list of text lines if it is decomposed.

>>> parser.is_decomposed
False
>>> parser.sheet
'text-\n-string'
>>> parser.decompose()
>>> parser.sheet
['text-', '-string']
parse_header()[source]

Method that parses the header of the sheet and extracts the roll, election, year, region, province and commune data, storing them in the header property.

It also builds a unique identifier of the electoral roll that it stores in the metadata property with the rid key.

header

Property that contains the result of method parse_header. It consists of a dictionary with the data from the header of the electoral roll sheet.

>>> parser.header
{
    'roll': 'PADRON...',
    'election': 'ELECCION...',
    'year':     2020,
    'region':   'METRO...',
    'commune':  'SANTIA...',
    'province': 'SANTIA...',
}
parse_fields()[source]

Method that analyzes and extracts the fields of the electoral roll. The fields found directly in the sheet (nombre, c-identidad, sex|o, domicilio-electoral, circunscripcion and mesa) are taken, and commune, province and region (comuna, provincia, region) are added. The result is stored in the fields property; the method returns nothing.

fields

Property that contains the fields of the electoral roll detected in the sheet through the parse_fields method.

parse_entries()[source]

Method that analyzes and extracts each data entry from the voter registration sheet.

First it determines whether each line of text is well formed, that is, whether it begins with at least one letter and ends with a number or with a space followed by a single letter. Each well-formed line is then analyzed as an entry through the parse_entry method.

Afterwards, the lines considered malformed are processed internally, joining them according to whether they start with a letter or a space. The parse_entry method is then applied again to each joined line. Those that are rescued are counted in the metadata property under the keys entries > rescue.
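
The two passes described above can be sketched with a single regular expression and a small joining loop. Everything here is an assumption made for illustration: the WELL_FORMED pattern, the split_lines and rejoin helpers, and the sample lines are not taken from serveliza, whose actual rules may be stricter.

```python
import re

# Hypothetical check: starts with a letter, ends with a digit or with a
# space followed by a single letter.
WELL_FORMED = re.compile(r"^[A-Za-zÑñ].*(\d|\s[A-Za-z])$")


def split_lines(lines):
    """Separate well-formed lines from malformed ones, keeping their order."""
    good, bad = [], []
    for line in lines:
        (good if WELL_FORMED.match(line) else bad).append(line)
    return good, bad


def rejoin(bad):
    """Glue lines that start with a space onto the previous malformed line."""
    joined = []
    for line in bad:
        if line.startswith(" ") and joined:
            joined[-1] = joined[-1] + " " + line.strip()
        else:
            joined.append(line)
    return joined


# Invented sample: the second entry is broken across two lines.
lines = [
    "PEREZ SOTO JUAN 12.345.678-5 VAR CALLE UNO 123 45",
    "ROJAS DIAZ ANA 9.876.543-K MUJ PASAJE",
    "   DOS 456 46 M",
    "GONZALEZ LEON LUIS 7.654.321-0 VAR AVENIDA TRES 7 47 V",
]
good, bad = split_lines(lines)
rescued = [line for line in rejoin(bad) if WELL_FORMED.match(line)]
```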

parse_entry(line)[source]

A method that extracts the data from a voter registry entry in text line format.

It finds the fields using the regular expressions stored in the regexs_entries class attribute.

Then it looks up the circunscription in a list according to the commune and, based on it, determines the location of the electoral domicile.
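
Applying the documented regexs_entries patterns to a single line can be sketched as below. The patterns are copied from the class attribute; the extract helper and the sample line are invented, and the real parse_entry method also resolves the circunscription and the electoral domicile, which this sketch does not attempt.

```python
import re

# Patterns copied from the documented regexs_entries class attribute.
regexs_entries = {
    'name': '^[A-ZÑa-z\\s]+',
    'rut': '\\d*\\.?\\d+\\.\\d+-[0-9kK]',
    'sex': '\\s(VAR|MUJ)[ONER]*\\s',
    'table': '\\s(\\d+\\s?\\w?)\\s*\\d*$',
}


def extract(line):
    """Return the value matched by each pattern, or None when it is absent."""
    found = {}
    for key, pattern in regexs_entries.items():
        match = re.search(pattern, line)
        if match is None:
            found[key] = None
            continue
        # Prefer the first capture group when the pattern defines one.
        value = match.group(1) if match.groups() else match.group(0)
        found[key] = value.strip()
    return found


# Invented entry line, used only to exercise the patterns.
line = "PEREZ SOTO JUAN   12.345.678-5   VAR   CALLE UNO 123   SANTIAGO   45 V"
entry = extract(line)
```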

entries

Property containing a list of entries from the electoral roll sheet. Each entry corresponds to a list of data in the order of the fields defined in the fields property.
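
Because each entry follows the order of the fields property, the two can be paired into dictionaries. The field names below are those shown for ElectoralRoll.fields; the entry values are invented.

```python
# Field names as listed in the documentation of the fields property.
fields = [
    'nombre', 'c-identidad', 'sex', 'region', 'provincia', 'comuna',
    'domicilio-electoral', 'circunscripcion', 'mesa', 'reference',
]

# One invented entry, in the same order as the fields.
entries = [
    ['PEREZ SOTO JUAN', '12.345.678-5', 'VAR', 'REGION', 'PROVINCE',
     'COMMUNE', 'CALLE UNO 123', 'CIRCUNSCRIPCION', '45 V', ''],
]

# Pair every value with its field name.
records = [dict(zip(fields, entry)) for entry in entries]
```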

metadata

Property that contains the metadata extracted while parsing the electoral roll sheet.

The metadata is stored as a dictionary. The rid key holds the unique identifier of the electoral roll the sheet belongs to; the times key stores how long the analysis took (in total and for the header, fields and entries); the entries key contains the total number of entries extracted, the number rescued and the number of errors. Finally, the nulls key contains the total number of null values found in the entries, together with a breakdown by field, if there are any.

>>> parser.metadata
{
    'rid': 'PEA-EM-2016',
    'entries': {'total': 1, 'rescue': 0, 'errors': 0},
    'nulls': {'total': 0}
}
errors

Property that contains a list of the errors found in the sheet analysis. Each error is a dictionary with at least two keys: ‘code’, a semantic slug describing the error, and ‘target’, which contains what generated the error.

fields_index

Property that contains the index where the fields are located in the decomposed sheet as a list.

circuns
Returns:a dictionary with communes as key and list of circunscriptions as value.

Property that contains the possible electoral circunscriptions within the commune defined in the header property. It will return None if the parse_header method has not been executed.

more_fields
Returns:boolean.

Property that stores the option of whether to add extra fields to each entry.

Roll memorizer

class serveliza.roll.memorizer.RollMemorizer(*args, **kwargs)[source]
Parameters:memorize (bool) – If the memorizer is activated (default True)

RollMemorizer is a class that stores data and errors from the electoral roll. It is instantiated within an instance of ElectoralRoll.

storage
Returns:dictionary with all data.

Property where all the memorized data are stored.

errors
Returns:list with errors.

Property where the errors found are stored.

is_active
Returns:boolean.

Property that indicates if the memorizer is active as defined in the constructor.

memorize(parsed)[source]
Parameters:parsed (obj) – an instance of RollParser.

memorize is the main method of RollMemorizer. It always memorizes the metadata of an analyzed sheet; if the memorizer is active (see the is_active property), it also memorizes the analyzed entries. The metadata is memorized by the prepare_rid, store_metadata_places, store_metadata_entries and store_metadata_nulls methods documented below.

It then stores, if active, the entries and errors.

prepare_rid(parsed)[source]
Parameters:parsed (obj) – an instance of RollParser.

prepare_rid is a method that prepares the storage property for storing metadata and electoral roll data. It uses the electoral roll identifier (see more in rid) as the key in the storage property.

store_metadata_places(parsed)[source]
Parameters:parsed (obj) – an instance of RollParser.

store_metadata_places is a method to memorize the places (regions, provinces and communes) present in the parsed sheet.

store_metadata_entries(parsed)[source]
Parameters:parsed (obj) – an instance of RollParser.

store_metadata_entries is a method to memorize the metadata of the entries (total, rescued, errors) present in the parsed sheet. If the total number of entries is declared in the header, it is added as declared.

store_metadata_nulls(parsed)[source]
Parameters:parsed (obj) – an instance of RollParser.

store_metadata_nulls is a method to memorize the metadata of the null data in the entries (total and for each field with null data) present in the parsed sheet.
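
The bookkeeping described in this section (one storage slot per rid, accumulated places, counters and entries) can be sketched with plain dictionaries. The function names echo the documented methods, but the storage layout, the signatures and the sample data are assumptions made for this illustration.

```python
def prepare_rid(storage, rid):
    """Create the slot for an electoral roll identifier if it is missing."""
    if rid not in storage:
        storage[rid] = {
            "regions": [], "provinces": [], "communes": [],
            "entries": {"total": 0, "rescue": 0, "errors": 0},
            "nulls": {"total": 0},
            "data": [],
        }
    return storage[rid]


def memorize(storage, rid, header, metadata, entries, active=True):
    """Accumulate the metadata of one sheet and, if active, its entries."""
    slot = prepare_rid(storage, rid)
    for key, plural in (("region", "regions"), ("province", "provinces"),
                        ("commune", "communes")):
        place = header.get(key)
        if place and place not in slot[plural]:
            slot[plural].append(place)
    for key in slot["entries"]:
        slot["entries"][key] += metadata["entries"].get(key, 0)
    slot["nulls"]["total"] += metadata["nulls"].get("total", 0)
    if active:
        slot["data"].extend(entries)
    return slot


# Two invented sheets of the same roll, from different communes.
storage = {}
memorize(storage, "RID-XXXX",
         {"region": "REGION", "province": "PROVINCE", "commune": "COMMUNE A"},
         {"entries": {"total": 2, "rescue": 0, "errors": 0}, "nulls": {"total": 0}},
         [["row-1"], ["row-2"]])
slot = memorize(storage, "RID-XXXX",
                {"region": "REGION", "province": "PROVINCE", "commune": "COMMUNE B"},
                {"entries": {"total": 1, "rescue": 1, "errors": 1}, "nulls": {"total": 2}},
                [["row-3"]])
```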

Roll exporter

class serveliza.roll.exporter.RollExporter(*args, **kwargs)[source]

RollExporter is a class for exporting electoral roll data in csv files.

Parameters:
  • export (bool) – If the export is activated (default False)
  • output (str) – directory to store the data in .csv (see more in output).
  • mode (str) – determines the data export mode in files (unified or separated, see more in mode).
  • mode_sep (str) – Criteria for separating files when exporting in separated mode (region or commune, see more in mode_sep).
  • random_suffix (bool) – Determines whether exported files have a random text string appended to the end.
  • summary (bool) – Determines whether to generate a summary file of the export and the extracted data.

It is instantiated within an instance of ElectoralRoll.

modes = ['unified', 'separated']

Available export modes.

mode_sep_opts = ['commune', 'region']

Available file separation modes.

export_sheet(parsed)[source]
Parameters:parsed (obj) – an instance of RollParser.
Returns:the absolute path of the file where the data was exported.

export_sheet is a method that exports the data from a parsed sheet into files, as configured in the constructor.

export_summary(rid, metadata)[source]
Parameters:
  • rid (str) – identifier of the electoral roll
  • metadata (dict) – metadata of the electoral roll.

export_summary is a method that exports the metadata of the electoral roll as a summary in a yaml file.

is_active
Returns:boolean.

Property that indicates if the exporter is active as defined in the constructor.

random_suffix

Determines whether exported files have a random text string appended to the end.

summary

Determines whether to generate a summary file of the export and the extracted data.

output

Directory to store the data in .csv.

mode

Determines the data export mode. If it is “unified” (default), a single csv file is created with the data; if it is “separated”, several files are created according to communal or regional criteria.

mode_sep

Criteria for separating files when exporting in separated mode.
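
How the mode, mode_sep and random_suffix options might combine into a file name is sketched below, together with csv serialization through the standard library. The naming scheme and the export_filename, random_suffix and to_csv_text helpers are hypothetical; only the option values come from the modes and mode_sep_opts attributes.

```python
import csv
import io
import secrets

MODES = ['unified', 'separated']        # documented export modes
MODE_SEP_OPTS = ['commune', 'region']   # documented separation criteria


def export_filename(rid, header, mode='unified', mode_sep='commune', suffix=''):
    """Build a csv file name; the naming scheme is invented for this sketch."""
    if mode not in MODES:
        raise ValueError(f"Unknown mode: {mode!r}")
    parts = [rid]
    if mode == 'separated':
        if mode_sep not in MODE_SEP_OPTS:
            raise ValueError(f"Unknown mode_sep: {mode_sep!r}")
        parts.append(header[mode_sep].lower().replace(' ', '-'))
    if suffix:
        parts.append(suffix)
    return '_'.join(parts) + '.csv'


def random_suffix(nbytes=4):
    """Return a random hexadecimal string to append to a file name."""
    return secrets.token_hex(nbytes)


def to_csv_text(fields, entries):
    """Serialize a header row and the entries as csv text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(fields)
    writer.writerows(entries)
    return buffer.getvalue()


header = {'commune': 'LA FLORIDA', 'region': 'METROPOLITANA'}
unified_name = export_filename('RID-XXXX', header)
by_commune = export_filename('RID-XXXX', header, mode='separated')
by_region = export_filename('RID-XXXX', header, mode='separated', mode_sep='region')
```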

Roll printer

class serveliza.roll.printer.ColorMixin[source]

Mixin that grants methods to color text if the colors property is true.

info(text)[source]

Color an information text (blue).

ok(text)[source]

Color a success text (green).

warn(text)[source]

Color a warning text (yellow).

subtle(text)[source]

Color a subtle text (gray).

error(text)[source]

Color an error text (red).
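
A mixin of this kind is commonly built on ANSI escape codes. The sketch below is a guess at the general idea rather than the library's code: the ColorSketch class, its paint helper and the specific codes chosen are assumptions.

```python
class ColorSketch:
    """Wrap text in ANSI color codes when the colors flag is true."""

    # Standard ANSI foreground codes: blue, green, yellow, gray, red.
    CODES = {'info': '34', 'ok': '32', 'warn': '33', 'subtle': '90', 'error': '31'}
    RESET = '\033[0m'

    def __init__(self, colors=True):
        self.colors = colors

    def paint(self, kind, text):
        # Return the text unchanged when colors are disabled.
        if not self.colors:
            return text
        return f"\033[{self.CODES[kind]}m{text}{self.RESET}"

    def info(self, text):
        return self.paint('info', text)

    def ok(self, text):
        return self.paint('ok', text)

    def warn(self, text):
        return self.paint('warn', text)

    def subtle(self, text):
        return self.paint('subtle', text)

    def error(self, text):
        return self.paint('error', text)


plain = ColorSketch(colors=False).ok('done')
red = ColorSketch().error('failed')
```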

class serveliza.roll.printer.RollPrinter(*args, **kwargs)[source]
Parameters:
  • verbose (bool) – Whether to print the progress to the screen (default False).
  • colors (bool) – Whether to print to the screen with colors (default True).

RollPrinter is a class that prints the progress of the application to the screen. It is instantiated within an instance of ElectoralRoll.

verbose

Property that determines whether to print the application progress to the screen.

colors

Property that determines whether to print on screen with colors.

Method that prints on constructor search.

init_founded(files)[source]

Method that prints the search result of the constructor.

init_auto()[source]

Method that prints if the start was automatic.

run_started(started, files)[source]

Method that prints the start of the analysis.

run_file_start(file, number)[source]

Method that prints the start of the analysis of a file.

run_file_progress(pro)[source]

Method that prints the progress of the analysis of a file.

run_file_end(metadata)[source]

Method that prints the completion of the analysis of a file.

run_finalized(finalized, metadata)[source]

Method that prints the completion of the analysis.

percent(of, total)[source]

Utility that returns a percentage as a string. It receives two parameters (of and total) from which the ratio is calculated.
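
A utility with this signature could look like the following sketch. The decimals parameter, the output format and the handling of a zero total are assumptions; the documentation does not specify them.

```python
def percent(of, total, decimals=1):
    """Return of/total as a percentage string, for example '25.0%'."""
    # Avoid dividing by zero when there is nothing to count (assumed behaviour).
    ratio = (of / total * 100) if total else 0
    return f"{ratio:.{decimals}f}%"


quarter = percent(1, 4)
third = percent(1, 3)
empty = percent(5, 0)
```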

clean_line()[source]

Utility that cleans the last line printed on the screen.

repr(obj)[source]

Method that prints the representation of the ElectoralRoll class.

is_runned_tag(is_runned)[source]

Utility that returns a text string formatted to indicate whether ElectoralRoll has run.

Mixins

PDF processor mixin

class serveliza.mixins.pdf.PDFProcessorMixin[source]

Mixin that gives an instance the ability to process PDF files with certain libraries. The processor to use is defined in the constructor through the argument of the same name, which binds the process_pdf property to the corresponding method.

Available PDF processors:

  • pdftotext (0.1.0 release) with processor_pdftotext
  • pdfminersix (0.1.0 release) with processor_pdfminersix
processor

Processor (library) to extract text from pdf file.

process_pdf

Property that calls the method corresponding to the PDF file processor configured in the instance initialization.

>>> obj.process_pdf(*args)
process_pdf_page

Available PDF processors

class serveliza.mixins.pdf_processors.PdftotextMixin[source]
processor_pdftotext(pathfile)[source]

Method that runs pdftotext on the file whose path is given as the argument.

>>> obj.processor_pdftotext('/path/to/file.pdf')
list # without processing
processor_pdftotext_page(page)[source]

pdftotext does not need this step.

class serveliza.mixins.pdf_processors.PdfminersixMixin[source]
processor_pdfminersix(pathfile)[source]
processor_pdfminersix_page(page)[source]