API Reference¶
This section documents the classes and functions that make up serveliza. Consulting it will help you understand the library's logic for programmatic use.
Electoral Roll¶
- Main module:
serveliza.roll.roll
Main class¶
-
class
serveliza.roll.roll.
ElectoralRoll
(source, auto=False, *args, **kwargs)[source]¶ ElectoralRoll
allows you to instantiate an electoral roll of the Chilean Electoral Service (Servicio Electoral de Chile, SERVEL) from PDF files. The constructor handles several parameters. The class itself handles the source parameter, which sets the path where PDF files are looked up (a directory or a single file); the recursive parameter, which determines whether to search only the root of that path or also each of its subdirectories; and the auto parameter, which determines whether the extraction starts automatically. It also inherits parameters from
PDFProcessorMixin
(processor). Likewise, the constructor instantiates other nested classes and routes their parameters to them.
Parameters: - source (str) – The source path of the PDF files.
- auto (bool) – Run the extraction at instantiation.
- recursive (bool) – Determines whether the search for PDF files in the given source is recursive or limited to the root of the indicated directory.
- processor (str) – Processor to use (default=’pdftotext’, see more in
PDFProcessorMixin
). - memorize (bool) – Store the data in the memory of the instance (default=True, see more in
RollMemorizer
). - export (bool) – Whether to export the data to a csv file (default=False, see more in
RollExporter
). - output (str) – Directory in which to store the csv file(s) (see more in
RollExporter
). - mode (str) – Determines the data export mode in files. If it is unified (default) it creates a single csv file with the data, or if it is separated into several according to communal or regional criteria (see more in
RollExporter
). - mode_sep (str) – Criteria for separating files in export in separate mode (commune or region, default=”commune”, see more in
RollExporter
). - random_suffix (bool) – Determines whether exported files have a random text string appended to the end (see more in
RollExporter
). - summary (bool) – Determines whether to generate a summary file of the export and the extracted data (see more in
RollExporter
).
Only the source parameter is required:
>>> obj = ElectoralRoll(source='/path/to/the/pdf/file(s)')
>>> obj.run() # Start the analysis and extraction of data.
Setting the auto parameter to true in the constructor automatically starts the
run
method:
>>> obj = ElectoralRoll(source='/path/to/the/pdf/file(s)', auto=True)
See the
run
method for a better understanding of this class.
-
inner_class_parser
¶ alias of
serveliza.roll.parsers.RollParser
-
inner_class_adapter
¶ alias of
serveliza.roll.adapters.RollAdapter
-
inner_class_printer
¶ alias of
serveliza.roll.printer.RollPrinter
-
inner_class_memorizer
¶
-
inner_class_exporter
¶ alias of
serveliza.roll.exporter.RollExporter
-
run
()[source]¶ ElectoralRoll.run
is the main method of the class; it executes the complete flow of data analysis and extraction:
- It iterates over the files found, ordered by size from smallest to largest, executing the
run_file
method with the file, its index and the total.
- It iterates over each page of each file:
- Processing it with the library determined in the processor property and defined in the constructor (see more in
PDFProcessorMixin
). - Adapting the rendered page if required by the application and the processor used (see more in
RollAdapter
). - Analyzing the content text to extract its data (see more in
RollParser
). - Memorizing its data in the consolidated storage of the memorizer. This step can be skipped by setting the memorize parameter to false in the constructor (see more in
RollMemorizer
). - Exporting its data to one or more csv files, depending on how the exporter is configured. This step is activated by setting the export parameter to true in the constructor (see more in
RollExporter
).
- The printer (
RollPrinter
instance) is executed in each part of the flow and it determines if and how it prints on the screen (as declared in the constructor).
>>> roll.run()
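The flow described above can be sketched with plain functions. This is a minimal illustration, not serveliza's implementation: run_sketch and the process, adapt, parse, memorize and export callables are hypothetical stand-ins, and each file is assumed to be a dictionary with name and bytes keys, as in the metadata property.

```python
def run_sketch(files, process, adapt, parse, memorize=None, export=None):
    """Hypothetical stand-in for the documented flow (not serveliza's code)."""
    ordered = sorted(files, key=lambda f: f["bytes"])  # smallest file first
    total = len(ordered)
    visited = []
    for num, file in enumerate(ordered, start=1):
        for page in process(file):      # processor: yields one item per page
            sheet = adapt(page)         # adapter: normalise the rendered page
            parsed = parse(sheet)       # parser: extract the data
            if memorize is not None:    # optional step
                memorize(parsed)
            if export is not None:      # optional step
                export(parsed)
        visited.append((num, total, file["name"]))
    return visited


# Stand-in data and callables, invented for the example.
files = [
    {"name": "b.pdf", "bytes": 300, "pages": [" p1 ", " p2 "]},
    {"name": "a.pdf", "bytes": 100, "pages": [" p0 "]},
]
memory = []
order = run_sketch(
    files,
    process=lambda f: f["pages"],
    adapt=str.strip,
    parse=str.upper,
    memorize=memory.append,
)
```

Passing the steps as callables keeps the optional ones (memorizing and exporting) easy to switch off, which mirrors the memorize and export constructor parameters.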
-
run_file
(file, file_num, file_total)[source]¶ Parameters: - file (dict) – data of file
- file_num (int) – the number of the file to analyze.
- file_total (int) – the total number of files to analyze. This parameter and the previous one are needed by the printer.
The
run_file
method is called by the
run
method and iterates over each page of the file:
- Processing it with the library determined in the processor property and defined in the constructor (see more in
PDFProcessorMixin
). - Adapting the rendered page if required by the application and the processor used (see more in
RollAdapter
). - Analyzing the content text to extract its data (see more in
RollParser
). - Memorizing its data in the consolidated storage of the memorizer. This step can be skipped by setting the memorize parameter to false in the constructor (see more in
RollMemorizer
). - Exporting its data to one or more csv files, depending on how the exporter is configured. This step is activated by setting the export parameter to true in the constructor (see more in
RollExporter
).
It also stores the extraction metadata of each file.
-
sheet_parse
(sheet, *args, **kwargs)[source]¶ Parameters: sheet (str) – sheet in string. Returns: instance of RollParser
.Method that calls the class defined in the
inner_class_parser
class attribute, initializing it with the sheet argument.
-
sheet_memorize
(parsed, *args, **kwargs)[source]¶ Parameters: parsed (str) – instance of RollParser
.Method that routes a parsed page to the
memorize
method of the memorizer.
-
sheet_export
(parsed, *args, **kwargs)[source]¶ Parameters: parsed (str) – instance of RollParser
.Returns: Absolute path of the csv file where the data was exported. Method that routes a parsed page to the
export_sheet
method of the exporter.
-
printer
¶ Returns: inner instance of RollPrinter
.Property to call the
RollPrinter
object instantiated in the constructor.
>>> roll.printer.__class__
serveliza.roll.printer.RollPrinter
-
memorizer
¶ Returns: inner instance of RollMemorizer
.Property to call the
RollMemorizer
object instantiated in the constructor.
>>> roll.memorizer.__class__
serveliza.roll.memorizer.RollMemorizer
-
exporter
¶ Returns: inner instance of RollExporter
.Property to call the
RollExporter
object instantiated in the constructor.
>>> roll.exporter.__class__
serveliza.roll.exporter.RollExporter
-
is_runned
¶ Returns: boolean. Boolean property that indicates whether the instance has executed the
run
method or not.
>>> roll.is_runned
True # or False
-
metadata
¶ Returns: dictionary with all metadata. Property that stores the analysis metadata. It integrates the metadata of each electoral register detected in the analysis.
>>> roll.is_runned
False
>>> roll.metadata
{'files': {'filename.pdf': {'name': 'filename.pdf',
                            'bytes': 10000,
                            'relative': 'relative/path/filename.pdf',
                            'absolute': '/absolute/path/filename.pdf',
                            'mtime': datetime.datetime(...),
                            'atime': datetime.datetime(...),
                            'durations': {'processing': datetime.timedelta(0),
                                          'adapting': datetime.timedelta(0),
                                          'parsing': datetime.timedelta(0),
                                          'memorizing': datetime.timedelta(0),
                                          'exporting': datetime.timedelta(0)}}},
 'analysis': {'started': None,
              'finalized': None,
              'durations': {'processing': datetime.timedelta(0),
                            'adapting': datetime.timedelta(0),
                            'parsing': datetime.timedelta(0),
                            'memorizing': datetime.timedelta(0),
                            'exporting': datetime.timedelta(0)}},
 'rolls': {}}
>>> roll.run()
>>> roll.metadata
{'files': {'filename.pdf': {'name': 'filename.pdf',
                            ...
                            'rid': 'RID-XXXX',
                            'roll': 'PADRON ELECTORAL X - ELECCIONES X XXXX',
                            'year': XXXX,
                            'region': 'REGION',
                            'province': 'PROVINCE',
                            'commune': 'COMMUNE',
                            'entries': {'total': 999, 'rescue': 0, 'errors': 0},
                            'duration': datetime.timedelta(...)}},
 'analysis': {'started': datetime.datetime(...) ...},
 'rolls': {'RID-XXXX': {'roll': 'PADRON ELECTORAL X - ELE...',
                        'year': XXXX,
                        'regions': ['REGION', ...],
                        'communes': ['COMMUNE', ...],
                        'provinces': ['PROVINCE', ...],
                        'nulls': {'total': 0},
                        'entries': {'total': 999, 'rescue': 0, 'errors': 0}}}}
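Because each durations dictionary holds datetime.timedelta values, the per-stage times can be added up directly. The following is a minimal sketch on a hand-built dictionary shaped like the analysis key above; total_duration is a hypothetical helper, not part of serveliza, and the sample values are invented.

```python
from datetime import timedelta


def total_duration(durations):
    """Add up the per-stage timedeltas of a 'durations' dictionary."""
    return sum(durations.values(), timedelta(0))


# Invented sample shaped like metadata['analysis'] in the example above.
metadata = {
    "analysis": {
        "durations": {
            "processing": timedelta(seconds=2),
            "adapting": timedelta(0),
            "parsing": timedelta(seconds=5),
            "memorizing": timedelta(seconds=1),
            "exporting": timedelta(0),
        }
    }
}
elapsed = total_duration(metadata["analysis"]["durations"])
```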
-
rid
¶ Returns: string with first roll identifier. Property that returns the identifier of the electoral roll analyzed. It returns only the first identifier detected; this should not cause problems unless PDF files from different electoral rolls are loaded.
>>> roll.rid 'RID-XXXX'
If the instance did not run, it returns None.
-
roll
¶ Returns: name of electoral roll. Property that returns the full name of electoral roll analyzed.
>>> roll.roll 'PADRON ELECTORAL X - ELECCIONES X XXXX'
Internally uses the
rid
property. If the instance did not run, it returns None.
-
entries
¶ Returns: list of data entries of memorizer. Property that accesses the data entries of the electoral roll analyzed. The data is stored in the
RollMemorizer
instance.
>>> roll.entries
[[...]...]
Internally uses the
rid
property. If the instance did not run, it returns None.
-
fields
¶ Returns: list of fields of electoral roll. Property that returns the fields of the electoral roll analyzed.
>>> roll.fields ['nombre', 'c-identidad', 'sex', 'region', 'provincia', 'comuna', 'domicilio-electoral', 'circunscripcion', 'mesa', 'reference']
Internally uses the
rid
property. If the instance did not run, it returns None.
-
errors
¶ Returns: list of errors found. Property that stores the errors found in the analysis. Each error is a dictionary with data that helps to trace it. Errors are recorded in order to improve the development of serveliza.
>>> roll.errors [...]
-
to_dataframe
¶ Returns: Pandas DataFrame instance. Raises: UserWarning – You need to run the application before converting the result to Pandas DataFrame. Property that returns the electoral roll data in a new Pandas DataFrame instance.
-
recursive
¶ Property that determines whether the search for PDF files in the given source is recursive or limited to the root of the indicated directory.
-
source
¶ Returns: list of paths to valid pdf files.
Raises: - TypeError – source param must be string or list.
- TypeError – source doesn't have valid PDF files.
Property that stores paths of pdf files obtained from a list or string with file paths or directories.
>>> roll.source
['relative/path/to/file.pdf']
The source is passed to the constructor through the parameter of the same name. It can also be redefined through the property setter:
>>> roll.source = ['path/to/file.pdf', '/path/to/dir']
>>> roll.source = '/path/to/dir/or/file.pdf'
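The behaviour of this property, combined with the recursive option, can be approximated with pathlib. This is a minimal sketch under assumptions: resolve_pdf_sources is a hypothetical helper, a file counts as a PDF here only by its .pdf extension, and the library's actual validation may differ.

```python
from pathlib import Path


def resolve_pdf_sources(source, recursive=False):
    """Hypothetical helper: expand files and directories into PDF paths."""
    if isinstance(source, str):
        source = [source]
    elif not isinstance(source, list):
        raise TypeError("source param must be string or list.")
    found = []
    for item in source:
        path = Path(item)
        if path.is_dir():
            # '**/*.pdf' also descends into subdirectories
            pattern = "**/*.pdf" if recursive else "*.pdf"
            found.extend(sorted(str(p) for p in path.glob(pattern)))
        elif path.is_file() and path.suffix.lower() == ".pdf":
            found.append(str(path))
    if not found:
        raise TypeError("source doesn't have valid PDF files.")
    return found
```

Both error cases raise TypeError, following the Raises section above.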
Roll adapters¶
-
exception
serveliza.roll.adapters.
RollNoisedError
[source]¶ Exception that indicates an error when analyzing because the pattern is noisy, that is, it has watermarks to prevent the coherent extraction of text.
-
class
serveliza.roll.adapters.
PdftotextAdapterMixin
[source]¶ PdftotextAdapterMixin
is an adapter for the pdftotext processor. It is a mixin designed to be inherited in
RollAdapter
.-
adapter_pdftotext
(sheet)[source]¶ Parameters: sheet (str) – sheet in text string.
Raises: - ValueError – Unexpected type of sheet.
- RollNoisedError – the pdftotext processor can't process a noised roll. Try with the pdfminersix processor.
Returns: sheet adapted.
Method to adapt a sheet processed by pdftotext before being passed to the parser.
-
-
class
serveliza.roll.adapters.
PdfminersixAdapterMixin
[source]¶ PdfminersixAdapterMixin
is an adapter for the pdfminersix processor. It is a mixin designed to be inherited in
RollAdapter
.-
adapter_pdfminersix
(sheet)[source]¶ Parameters: sheet (list) – sheet in list of pdfminersix elements. Returns: sheet adapted in text string. Method to adapt a sheet processed by pdfminersix before being passed to the parser.
It is also capable of eliminating possible noise with watermarks.
-
-
class
serveliza.roll.adapters.
RollAdapter
(sheet, processor, *args, **kwargs)[source]¶ Parameters: - sheet (obj) – sheet of the type according to the processor used.
- processor (str) – processor used in the sheet.
RollAdapter
is a class that is instantiated for a sheet and routes it to the appropriate method according to the processor given in the constructor parameter of the same name.
Practical use:
>>> adapted = RollAdapter(processed_sheet, 'processor-name').sheet
-
sheet
¶ Property where the adapted sheet is stored in the constructor.
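One way to route a sheet by processor name is to look up a method called adapter_<processor>, matching the adapter_pdftotext and adapter_pdfminersix names documented above. The class below is a hypothetical stand-in that only illustrates the dispatch; its method bodies are placeholders and do not reproduce serveliza's adaptation logic.

```python
class AdapterSketch:
    """Hypothetical stand-in showing name-based routing (not serveliza's code)."""

    def __init__(self, sheet, processor):
        # Look up the method that matches the processor name.
        method = getattr(self, f"adapter_{processor}", None)
        if method is None:
            raise ValueError(f"Unknown processor: {processor!r}")
        self.sheet = method(sheet)

    def adapter_pdftotext(self, sheet):
        # Placeholder: only validates that the sheet is a text string.
        if not isinstance(sheet, str):
            raise ValueError("Unexpected type of sheet.")
        return sheet

    def adapter_pdfminersix(self, sheet):
        # Placeholder: assumes a list of elements and joins their text.
        return "\n".join(str(element) for element in sheet)


adapted_text = AdapterSketch("line one", "pdftotext").sheet
adapted_list = AdapterSketch(["line one", "line two"], "pdfminersix").sheet
```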
Roll parsers¶
-
class
serveliza.roll.parsers.
RollParser
(sheet, auto=True, more_fields=True, *args, **kwargs)[source]¶ RollParser
is intended to be instantiated once per sheet.
Class attributes beginning with “regex_” hold the regular expressions used to detect fields in the header. The regexs_entries class attribute contains a dictionary that maps a key name to the regular expression for each field of a record. Finally, the dpa_fixture_path class attribute defines the path of the .json file that contains a compressed dictionary of communes and constituencies.
-
regex_roll
= 'PADRO?Ó?N\\s+ELECTORAL\\s+[A-Z,\\s-]+\\d+'¶ roll name regex
-
regex_region
= "REGIO?Ó?N[0,]*\\s*:\\s*([A-ZÑ\\'\\s.]*\\s{3})"¶ region regex
-
regex_commune
= "COMUNA[0,]*\\s*:\\s*([A-ZÑ\\' -]*\\s{3})"¶ commune regex
-
regex_province
= "PROVINCIA[0,]*\\s*:\\s*([A-ZÑ\\' ]*)"¶ province regex
-
regex_total_entries
= 'Registros\\s*:\\s*(\\d+)'¶ total entries regex (optional)
-
regex_pagination
= '[PAaáGgIiNn]+\\s*:?\\s*(\\d*)\\s*de\\s*(\\d*)'¶ pagination regex (optional)
-
regexs_entries
= {'name': '^[A-ZÑa-z\\s]+', 'rut': '\\d*\\.?\\d+\\.\\d+-[0-9kK]', 'sex': '\\s(VAR|MUJ)[ONER]*\\s', 'table': '\\s(\\d+\\s?\\w?)\\s*\\d*$'}¶ regex’s for parsing entries.
-
dpa_fixture_path
= '../utils/DPA-commune-circuns.json'¶ path to commune-circuns json.
-
run
()[source]¶ Method that starts the analysis of the electoral roll sheet by executing, in order:
- decompose
- parse_header
- parse_fields
- parse_entries
It measures the duration of each method executed and saves the results in the
metadata[times]
property.
-
decompose
()[source]¶ Method that decomposes the
sheet
of the electoral roll from a text string into a list of its lines.
-
is_decomposed
¶ Returns: boolean. Property that indicates whether the ‘sheet’ property is decomposed into a list of lines or not.
-
sheet
¶ Property that contains a text string of the entire sheet to parse or the list of text lines if it is decomposed.
>>> parser.is_decomposed
False
>>> parser.sheet
'text-\n-string'
>>> parser.decompose()
>>> parser.sheet
['text-', '-string']
-
parse_header
()[source]¶ Method that parses the head of the sheet and extracts the data from roll, election, year, region, province and commune to store it in the
header
property. It also builds a unique identifier of the electoral roll, which it stores in the
metadata
property with the rid key.
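The header regular expressions listed among the class attributes can be tried on their own. The sketch below copies regex_roll, regex_region, regex_commune and regex_province and applies them line by line. The sample header is invented and its layout (fields separated by three spaces, province last) is an assumption; parse_header_sketch is a hypothetical helper that does not build the rid or the election field.

```python
import re

# Patterns copied from the RollParser class attributes.
REGEX_ROLL = r"PADRO?Ó?N\s+ELECTORAL\s+[A-Z,\s-]+\d+"
REGEX_REGION = r"REGIO?Ó?N[0,]*\s*:\s*([A-ZÑ\'\s.]*\s{3})"
REGEX_COMMUNE = r"COMUNA[0,]*\s*:\s*([A-ZÑ\' -]*\s{3})"
REGEX_PROVINCE = r"PROVINCIA[0,]*\s*:\s*([A-ZÑ\' ]*)"


def parse_header_sketch(lines):
    """Hypothetical helper: collect header fields from a list of lines."""
    header = {}
    places = (
        ("region", REGEX_REGION),
        ("commune", REGEX_COMMUNE),
        ("province", REGEX_PROVINCE),
    )
    for line in lines:
        roll = re.search(REGEX_ROLL, line)
        if roll:
            header["roll"] = roll.group(0)
            # The roll name ends with the year digits.
            header["year"] = int(re.search(r"\d+$", roll.group(0)).group(0))
        for key, pattern in places:
            found = re.search(pattern, line)
            if found:
                header[key] = found.group(1).strip()
    return header


# Invented sample; the real sheet layout may differ.
sample_header = [
    "PADRON ELECTORAL AUDITADO - ELECCIONES MUNICIPALES 2016",
    "REGION : METROPOLITANA DE SANTIAGO   COMUNA : MAIPU   PROVINCIA : SANTIAGO",
]
header = parse_header_sketch(sample_header)
```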
-
header
¶ Property that contains the result of method
parse_header
. It consists of a dictionary with the data from the header of the electoral roll sheet.
>>> parser.header
{
    'roll': 'PADRON...',
    'election': 'ELECCION...',
    'year': 2020,
    'region': 'METRO...',
    'commune': 'SANTIA...',
    'province': 'SANTIA...',
}
-
parse_fields
()[source]¶ Method that analyzes and extracts the fields of the electoral roll. The fields read directly from the sheet (nombre, c-identidad, sex|o, domicilio-electoral, circunscripcion and mesa) are taken, and commune, province and region (comuna, provincia, region) are added. The result is stored in the
fields
property; the method returns nothing.
-
fields
¶ Property that contains the fields of the electoral roll detected in the sheet through the
parse_fields
method.
-
parse_entries
()[source]¶ Method that analyzes and extracts each data entry from the voter registration sheet.
First it determines whether each line of text is well formed, that is, whether it begins with at least one letter and ends with a number or with a space followed by a single letter. Each well-formed line is then analyzed as an entry through the
parse_entry
method.
Afterwards, the lines considered malformed are processed internally, joining them according to whether they start with a letter or a space. The
parse_entry
method is then applied again to each of them. Those that are rescued are counted in the
metadata
property under the keys entries > rescue.
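The well-formedness rule stated above can be written as a single regular expression. This is only a reading of the prose description: the WELL_FORMED pattern and the split_lines helper are assumptions, the sample lines are invented, and the joining of malformed lines is not reproduced.

```python
import re

# Assumed pattern: starts with a letter and ends with a digit,
# or with a space followed by a single letter.
WELL_FORMED = re.compile(r"^[A-Za-zÑñ].*(?:\d|\s[A-Za-z])$")


def split_lines(lines):
    """Hypothetical helper: separate well-formed lines from the rest."""
    good, malformed = [], []
    for line in lines:
        if WELL_FORMED.search(line):
            good.append(line)
        else:
            malformed.append(line)
    return good, malformed


# Invented sample lines.
sample_lines = [
    "PEREZ SOTO JUAN 12.345.678-5 VARON AV SIEMPRE VIVA 123 SANTIAGO 12 V",
    "GONZALEZ ROJAS MARIA 9.876.543-K MUJER",
    "   LOS ALAMOS 45 SANTIAGO 7",
]
good, malformed = split_lines(sample_lines)
```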
-
parse_entry
(line)[source]¶ A method that extracts the data from a voter registry entry in text line format.
It finds the fields with the regular expressions stored in the regexs_entries class attribute.
Then it looks for the district from a list according to its commune and in relation to this it determines the place of the electoral domicile.
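The first step, matching the regexs_entries patterns against a line, can be sketched independently of the rest of the parser. The patterns below are copied from the class attribute; parse_entry_sketch is a hypothetical helper, the sample line is invented, and the lookup of the constituency and the electoral domicile is left out.

```python
import re

# Patterns copied from RollParser.regexs_entries.
REGEXS_ENTRIES = {
    "name": r"^[A-ZÑa-z\s]+",
    "rut": r"\d*\.?\d+\.\d+-[0-9kK]",
    "sex": r"\s(VAR|MUJ)[ONER]*\s",
    "table": r"\s(\d+\s?\w?)\s*\d*$",
}


def parse_entry_sketch(line):
    """Hypothetical helper: apply each pattern and collect the matches."""
    entry = {}
    for key, pattern in REGEXS_ENTRIES.items():
        found = re.search(pattern, line)
        if found is None:
            entry[key] = None
            continue
        # Use the capture group when the pattern defines one.
        value = found.group(1) if found.groups() else found.group(0)
        entry[key] = value.strip()
    return entry


# Invented sample line.
entry = parse_entry_sketch(
    "PEREZ SOTO JUAN 12.345.678-5 VARON AV SIEMPRE VIVA 123 SANTIAGO 12 V"
)
```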
-
entries
¶ Property containing a list of entries from the electoral roll sheet. Each entry corresponds to a list of data in the order of the fields defined in the
fields
property.
-
metadata
¶ Property that contains the metadata extracted during the parser analysis of the electoral roll sheet.
The metadata is stored as a dictionary. The rid key holds the unique identifier of the electoral roll of the sheet; the times key stores how long the analysis took (in total, and for the header, fields and entries); the entries key contains the total number of entries extracted, plus the number rescued and the number of errors. Finally, the nulls key contains the total number of null values inside the entries, as well as the detail per field, if there are any.
>>> parser.metadata { 'rid': 'PEA-EM-2016', 'entries': {'total': 1, 'rescue': 0, 'errors': 0}, 'nulls': {'total': 0} }
-
errors
¶ Property that contains a list of the errors found in the sheet analysis. Each error is a dictionary with at least two keys: code, a semantic slug describing the error, and target, which contains what generated the error.
-
fields_index
¶ Property that contains the index where the fields are located in the decomposed sheet as a list.
-
circuns
¶ Returns: a dictionary with communes as keys and lists of constituencies as values. Property that contains the possible electoral constituencies within the commune defined in the
header
property. It will return None if the parse_header
method has not been executed.
-
more_fields
¶ Returns: boolean. Property where the option of whether to add fields to the input is stored.
-
Roll memorizer¶
-
class
serveliza.roll.memorizer.
RollMemorizer
(*args, **kwargs)[source]¶ Parameters: memorize (bool) – If the memorizer is activated (default True) RollMemorizer
is a class that stores data and errors from the electoral roll. It is instantiated within an instance of ElectoralRoll
.-
storage
¶ Returns: dictionary with all data. Property where all the memorized data are stored.
-
errors
¶ Returns: list with errors. Property where the errors found are stored.
-
is_active
¶ Returns: boolean. Property that indicates if the memorizer is active as defined in the constructor.
-
memorize
(parsed)[source]¶ Parameters: parsed (obj) – an instance of RollParser
. memorize is the main method of RollMemorizer. It always memorizes the metadata of an analyzed sheet; if the memorizer is active (see the is_active property) it also memorizes the analyzed entries. The methods executed to memorize the metadata are, in order:
- prepare_rid
- store_metadata_places
- store_metadata_entries
- store_metadata_nulls
It then stores, if active, the entries and errors.
-
prepare_rid
(parsed)[source]¶ Parameters: parsed (obj) – an instance of RollParser
. prepare_rid is a method that prepares the storage property for storing metadata and electoral roll data. It uses the electoral roll identifier (see more in rid) as the key in the storage property.
-
store_metadata_places
(parsed)[source]¶ Parameters: parsed (obj) – an instance of RollParser
.store_metadata_places
is a method to memorize the places (regions, provinces and communes) present in the parsed sheet.
-
store_metadata_entries
(parsed)[source]¶ Parameters: parsed (obj) – an instance of RollParser
.store_metadata_entries
is a method to memorize the metadata of the entries (total, rescued, errors) present in the parsed sheet. If the total number of entries is declared in the header, it is added as declared.
-
store_metadata_nulls
(parsed)[source]¶ Parameters: parsed (obj) – an instance of RollParser
.store_metadata_nulls
is a method to memorize the metadata of the null data in the entries (total and for each field with null data) present in the parsed sheet.
-
Roll exporter¶
-
class
serveliza.roll.exporter.
RollExporter
(*args, **kwargs)[source]¶ RollExporter
is a class for exporting electoral roll data to csv files.
Parameters: - export (bool) – Whether the export is activated (default False).
- output (str) – directory in which to store the csv data (see more in
output
). - mode (str) – determines the data export mode in files (unified or separated, see more in
mode
). - mode_sep (str) – Criteria for separating files when exporting in separated mode (region or commune, see more in
mode
). - random_suffix (bool) – Determines whether exported files have a random text string appended to the end.
- summary (bool) – Determines whether to generate a summary file of the export and the extracted data.
It is instantiated within an instance of
ElectoralRoll
.-
modes
= ['unified', 'separated']¶ Available export modes.
-
mode_sep_opts
= ['commune', 'region']¶ Available file separation modes
-
export_sheet
(parsed)[source]¶ Parameters: parsed (obj) – an instance of RollParser
.Returns: the absolute path of the file where the data was exported. export_sheet
is a method of exporting the data from a parsed sheet into files as configured in the constructor.
-
export_summary
(rid, metadata)[source]¶ Parameters: - rid (str) – identifier of the electoral roll
- metadata (dict) – metadata of the electoral roll.
export_summary
is a method that exports the metadata of the electoral roll as a summary in a yaml file.
-
is_active
¶ Returns: boolean. Property that indicates whether the exporter is active, as defined in the constructor.
-
random_suffix
¶ Determines whether exported files have a random text string appended to the end.
-
summary
¶ Determines whether to generate a summary file of the export and the extracted data.
-
output
¶ Directory to store the data in .csv.
-
mode
¶ Determines the data export mode in files. If it is “unified” (default), a single csv file is created with all the data; if it is “separated”, several files are created according to communal or regional criteria.
-
mode_sep
¶ Criteria for separating files in export in separate mode.
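The difference between the unified and separated modes can be illustrated with the standard csv module. This is a minimal sketch under assumptions: export_sketch is a hypothetical function, the file names are invented, and the mapping from mode_sep values to the comuna and region fields is a guess based on the field names shown for ElectoralRoll.fields.

```python
import csv
from collections import defaultdict
from pathlib import Path

MODES = ["unified", "separated"]
# Assumed mapping from mode_sep values to field names.
SEP_FIELD = {"commune": "comuna", "region": "region"}


def export_sketch(entries, fields, output, mode="unified", mode_sep="commune"):
    """Hypothetical helper: write entries to one or several csv files."""
    if mode not in MODES or mode_sep not in SEP_FIELD:
        raise ValueError("unsupported mode or mode_sep")
    groups = defaultdict(list)
    for entry in entries:
        if mode == "unified":
            key = "roll"  # invented name for the single file
        else:
            key = entry[fields.index(SEP_FIELD[mode_sep])]
        groups[key].append(entry)
    paths = []
    for key, rows in groups.items():
        path = Path(output) / f"{key}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(fields)   # header row
            writer.writerows(rows)
        paths.append(str(path.resolve()))
    return sorted(paths)
```

Grouping the rows before writing keeps the two modes on one code path: the unified mode is simply the case where every row falls into the same group.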
Roll printer¶
-
class
serveliza.roll.printer.
ColorMixin
[source]¶ Mixin that grants methods to color text if the colors property is true.
-
class
serveliza.roll.printer.
RollPrinter
(*args, **kwargs)[source]¶ Parameters: - verbose (bool) – Whether to print the progress on the screen (default False).
- colors (bool) – Whether to print on the screen with colors (default True).
RollPrinter
is a class that prints the progress of the application on the screen. It is instantiated within an instance of ElectoralRoll
.-
verbose
¶ Property that determines whether to print the application progress to the screen.
-
colors
¶ Property that determines whether to print on screen with colors.
-
percent
(of, total)[source]¶ Utility that returns a percentage as a string. It receives two parameters (of and total) from which the ratio is calculated.
-
repr
(obj)[source]¶ Method to print representation of
ElectoralRoll
class.
-
is_runned_tag
(is_runned)[source]¶ Utility that returns a text string formatted to print whether
ElectoralRoll
ran.
Mixins¶
- PDF processor mixin:
serveliza.mixins.pdf
- Available PDF processors:
serveliza.mixins.pdf_processors
PDF processor mixin¶
-
class
serveliza.mixins.pdf.
PDFProcessorMixin
[source]¶ Mixin that gives an instance the ability to process PDF files with certain libraries. In the constructor, the processor to be used is defined with the argument of the same name, binding the
process_pdf
property to the method related to it.
Available PDF processors:
- pdftotext (0.1.0 release) with
pdftotext_processor
- pdfminersix (0.1.0 release) with
pdftotext_processor
-
processor
¶ Processor (library) to extract text from pdf file.
-
process_pdf
¶ Property that calls the method corresponding to the PDF file processor configured in the instance initialization.
>>> obj.process_pdf(*args)
-
process_pdf_page
¶