How to convert an Excel document to XML format?

This recipe helps you convert an Excel document to XML format


Recipe Objective

In big data scenarios, converting data from Excel to XML with Python is a common way to exchange data between applications. Complex XML files are often processed into a relational format or a big data format such as Parquet/ORC. XML is human-readable and widely used in network communication, especially with the rise of REST-based web services. The tools used to manage big data analytics, often NoSQL databases, map onto XML only partially: XML is considered to be at odds with relational schemas, which puts it in the same camp as much of NoSQL technology.

System requirements:

  • Install the Yattag Python module as follows: pip install yattag
  • Install the OpenPyXL Python module as follows: pip install openpyxl (a quick check of both installs follows this list)
  • The code below can be run in a Jupyter notebook or any Python console
  • In this example we use the Excel employee dataset empdata.xlsx
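
A minimal sketch to confirm both modules are importable and to print their installed versions (assuming Python 3.8+ for importlib.metadata):

# Verify that both modules are installed and importable
from importlib.metadata import version

import openpyxl
import yattag

print("openpyxl", version("openpyxl"))
print("yattag", version("yattag"))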

Step 1: Import the modules

import openpyxl
import yattag
from openpyxl import load_workbook
from yattag import Doc, indent

Step 2: Load the Workbook

Create a Yattag document using the yattag module, then load the workbook and select the worksheet.

# Load the Excel file (include the path where it exists)
wb = load_workbook("empdata.xlsx")
ws = wb.worksheets[0]

# Create Yattag doc, tag and text objects
doc, tag, text = Doc().tagtext()
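
Before building any XML, it can help to confirm what was loaded; a quick check using standard openpyxl worksheet attributes:

# Inspect the loaded worksheet
print(ws.title)       # name of the first worksheet
print(ws.max_row)     # number of the last row containing data
print(ws.max_column)  # number of the last column containing data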

Step 3: Fill in the headers with Yattag

Fill in the XML headers with Yattag's asis() method, which inserts any string verbatim as the next line of the document.

xml_header = '<?xml version="1.0" encoding="UTF-8"?>'
xml_schema = ''  # optionally a second header line, e.g. a stylesheet processing instruction
doc.asis(xml_header)
doc.asis(xml_schema)
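
The distinction between asis() and text() matters here: text() escapes XML special characters, while asis() inserts the string verbatim, which is why it is used for the header lines. A small throwaway sketch:

from yattag import Doc

d, t, tx = Doc().tagtext()
tx("<raw>")      # escaped: appended as &lt;raw&gt;
d.asis("<raw>")  # verbatim: appended as <raw>
print(d.getvalue())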

Step 4: Create the main tag

Create the main Employees tag as below:

with tag('Employees'):
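
Here tag() acts as a context manager: entering the with block writes the opening tag and leaving it writes the matching closing tag, so nested with blocks produce nested elements. A minimal throwaway sketch:

from yattag import Doc

d, t, tx = Doc().tagtext()
with t("outer"):
    with t("inner"):
        tx("value")
print(d.getvalue())  # <outer><inner>value</inner></outer>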

Step 5: Loop through the sheet

Now loop through the sheet with the iter_rows() method. iter_rows() returns a generator over the cells in each row, and a list comprehension collects the values from those cells.

for row in ws.iter_rows(min_row=2, max_row=100, min_col=1, max_col=12):
    row = [cell.value for cell in row]
    with tag("Employee"):
        with tag("FirstName"):
            text(row[0])
        with tag("LastName"):
            text(row[1])
        with tag("Email"):
            text(row[10])
        with tag("Phone"):
            text(row[8])
        with tag("Company"):
            text(row[2])
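
Note that the hard-coded max_row=100 only covers the first 100 data rows. A hedged variant (assuming row 1 holds the column headers) that uses ws.max_row instead and substitutes an empty string for blank cells, since passing None to text() raises an error:

# Variant: cover every data row and guard against empty cells
for row in ws.iter_rows(min_row=2, max_row=ws.max_row, min_col=1, max_col=12):
    values = ["" if cell.value is None else str(cell.value) for cell in row]
    # ...then write the Employee element from values exactly as above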

Step 6: Add the Yattag indent

With all the Employee elements added, we indent the result with Yattag's indent() method and write it to a file.

result = indent(
    doc.getvalue(),
    indentation = ' ',
    indent_text = True
)
print(result)

with open("employee1.xml", "w") as f:
    f.write(result)
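
To confirm the generated file is well-formed, it can be parsed back with the standard library; a minimal sketch:

# Parse the file back to verify it is well-formed XML
import xml.etree.ElementTree as ET

root = ET.parse("employee1.xml").getroot()
print(root.tag, len(root))  # 'Employees' and the number of Employee elements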

The complete code looks as follows:

import openpyxl, yattag
from openpyxl import load_workbook
from yattag import Doc, indent

# Load the Excel file (include the path where it exists)
wb = load_workbook("empdata.xlsx")
ws = wb.worksheets[0]

# Create Yattag doc, tag and text objects
doc, tag, text = Doc().tagtext()

xml_header = '<?xml version="1.0" encoding="UTF-8"?>'
xml_schema = ''  # optionally a second header line, e.g. a stylesheet processing instruction
doc.asis(xml_header)
doc.asis(xml_schema)

with tag('Employees'):
    # Use ws.max_row instead of 100 to cover all rows
    for row in ws.iter_rows(min_row=2, max_row=100, min_col=1, max_col=12):
        row = [cell.value for cell in row]
        with tag("Employee"):
            with tag("FirstName"):
                text(row[0])
            with tag("LastName"):
                text(row[1])
            with tag("Email"):
                text(row[10])
            with tag("Phone"):
                text(row[8])
            with tag("Company"):
                text(row[2])

result = indent(
    doc.getvalue(),
    indentation = ' ',
    indent_text = True
)
print(result)

# Give the path where you want to create the file
with open("employee.xml", "w") as f:
    f.write(result)

The output of the above code looks as follows:
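
The element structure is fixed by the code above; the values below are hypothetical placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<Employees>
    <Employee>
        <FirstName>Jane</FirstName>
        <LastName>Doe</LastName>
        <Email>jane.doe@example.com</Email>
        <Phone>555-0100</Phone>
        <Company>Example Corp</Company>
    </Employee>
    ...
</Employees>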
