How to convert an Excel document to XML format?


This recipe helps you convert an Excel document to XML format


Recipe Objective

In big data scenarios, converting data from Excel to XML with Python is a common way to exchange data between different applications. Complex XML files are often processed into a relational format or a big data format such as Parquet or ORC. XML is human-readable and widely used in network communication, especially with the rise of REST-based web services. The tools used to manage big data analytics, often NoSQL databases, map to XML only partially: XML is considered to be at odds with relational schemas, which puts it in the same camp as most NoSQL technology.

System requirements:

  • Install the yattag Python module: pip install yattag
  • Install the openpyxl Python module: pip install openpyxl
  • The code below can be run in a Jupyter notebook or any Python console
  • In this example we use an Excel employee dataset (empdata.xlsx)
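The employee dataset itself is not included with this recipe. As a minimal sketch, a sample empdata.xlsx matching the column positions the conversion loop reads later (row[0], row[1], row[2], row[8], row[10]) could be built like this; the names, phone numbers, and emails are hypothetical values chosen only for illustration:

```python
# Sketch: build a small sample "empdata.xlsx" for this recipe.
# The column positions are assumptions inferred from the indices
# used in the conversion loop below (row[0], row[1], row[2], row[8], row[10]).
from openpyxl import Workbook

headers = ["FirstName", "LastName", "Company", None, None, None, None, None,
           "Phone", None, "Email", None]
rows = [
    ["Jane", "Doe", "Acme", None, None, None, None, None,
     "555-0100", None, "jane@example.com", None],
    ["John", "Smith", "Globex", None, None, None, None, None,
     "555-0101", None, "john@example.com", None],
]

wb = Workbook()
ws = wb.active
ws.append(headers)          # row 1: header row
for r in rows:              # rows 2..n: data rows
    ws.append(r)
wb.save("empdata.xlsx")
```

With a file shaped like this, the rest of the recipe can be run as written.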

Step 1: Import the modules

import openpyxl
import yattag
from openpyxl import load_workbook
from yattag import Doc, indent

Step 2: Load the Workbook

Create a Yattag document using the yattag module, then load the workbook and select the first worksheet.

# Load the Excel file (adjust the path to where the file exists)
wb = load_workbook("empdata.xlsx")
ws = wb.worksheets[0]

# Create Yattag doc, tag and text objects
doc, tag, text = Doc().tagtext()

Step 3: Fill the headers with Yattags

Fill the headers with Yattag's asis() method, which lets us insert an arbitrary string as the next line of the document without escaping it.

xml_header = '<?xml version="1.0" encoding="UTF-8"?>'
xml_schema = ''  # an optional schema or stylesheet declaration can go here
doc.asis(xml_header)
doc.asis(xml_schema)

Step 4: Create the main tag

Create the main Employees tag, which will wrap each Employee record:

with tag('Employees'):

Step 5: Loop through the sheet

Now loop through the sheet with the iter_rows method. iter_rows returns a generator over the cells in each row; we use a list comprehension to extract the values from the cells.

for row in ws.iter_rows(min_row=2, max_row=100, min_col=1, max_col=12):
    row = [cell.value for cell in row]
    with tag("Employee"):
        with tag("FirstName"):
            text(row[0])
        with tag("LastName"):
            text(row[1])
        with tag("Email"):
            text(row[10])
        with tag("Phone"):
            text(row[8])
        with tag("Company"):
            text(row[2])

Step 6: Adding the Yattag indent

When all the Employee elements have been added, we indent the result with Yattag's indent method and write it to a file.

result = indent(
    doc.getvalue(),
    indentation=' ',
    indent_text=True
)
print(result)

with open("employee1.xml", "w") as f:
    f.write(result)

The complete code looks as follows:

import openpyxl
import yattag
from openpyxl import load_workbook
from yattag import Doc, indent

wb = load_workbook("empdata.xlsx")
ws = wb.worksheets[0]

# Create Yattag doc, tag and text objects
doc, tag, text = Doc().tagtext()

xml_header = '<?xml version="1.0" encoding="UTF-8"?>'
xml_schema = ''
doc.asis(xml_header)
doc.asis(xml_schema)

with tag('Employees'):
    # Use ws.max_row instead of 100 to cover all rows
    for row in ws.iter_rows(min_row=2, max_row=100, min_col=1, max_col=12):
        row = [cell.value for cell in row]
        with tag("Employee"):
            with tag("FirstName"):
                text(row[0])
            with tag("LastName"):
                text(row[1])
            with tag("Email"):
                text(row[10])
            with tag("Phone"):
                text(row[8])
            with tag("Company"):
                text(row[2])

result = indent(
    doc.getvalue(),
    indentation=' ',
    indent_text=True
)
print(result)

with open("employee.xml", "w") as f:  # give the path where you want to create the file
    f.write(result)

The output of the above code looks as follows:
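The original output screenshot is not reproduced here. Assuming a first data row with hypothetical values such as Jane Doe at Acme, the generated XML has the shape sketched below, and we can confirm it is well-formed with the standard library's XML parser:

```python
# Illustrative shape of the generated XML (employee values are hypothetical),
# parsed with the standard library to confirm it is well-formed.
import xml.etree.ElementTree as ET

sample = """<?xml version="1.0" encoding="UTF-8"?>
<Employees>
 <Employee>
  <FirstName>Jane</FirstName>
  <LastName>Doe</LastName>
  <Email>jane@example.com</Email>
  <Phone>555-0100</Phone>
  <Company>Acme</Company>
 </Employee>
</Employees>"""

root = ET.fromstring(sample)      # raises ParseError if the XML is malformed
first = root.find("Employee")
print(first.find("Email").text)   # -> jane@example.com
```

The same parse check can be run against the employee.xml file produced by the recipe (ET.parse("employee.xml")) to validate the real output.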
