How to convert an Excel document to XML format?

This recipe helps you convert an Excel document to XML format.

Recipe Objective

In big data scenarios, converting data from Excel to XML with Python is a common way to exchange data between applications. Complex XML files are often processed into a relational format or a big data format such as Parquet or ORC. XML is human-readable and widely used in network communication, especially with the rise of REST-based web services. The tools used to manage big data analytics, often NoSQL databases, map onto XML only partially: XML is generally at odds with relational schemas, which places it in the same camp as much of NoSQL technology.

System requirements:

  • Install the Yattag Python module: pip install yattag
  • Install the OpenPyXL Python module: pip install openpyxl
  • The code below can be run in a Jupyter notebook or any Python console.
  • In this example we use an Excel employee dataset (empdata.xlsx); if you don't have one at hand, the sketch below creates a minimal version.
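
A minimal sketch for creating a small empdata.xlsx to experiment with. The 12-column layout is an assumption inferred from the column indices the recipe reads later (columns 1, 2 and 3 for name and company, column 9 for phone, column 11 for email):

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
# Assumed 12-column layout; columns 1, 2, 3, 9 and 11 are the ones the recipe reads
ws.append(["FirstName", "LastName", "Company", "Address", "City", "County",
           "State", "Zip", "Phone", "Fax", "Email", "Web"])
ws.append(["Jane", "Doe", "Acme Corp", "1 Main St", "Springfield", "Clark",
           "OH", "45501", "555-0100", "555-0101", "jane.doe@example.com",
           "example.com"])
wb.save("empdata.xlsx")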

Step 1: Import the modules

import openpyxl
import yattag
from openpyxl import load_workbook
from yattag import Doc, indent

Step 2: Load the Workbook

Create a Yattag document using the yattag module, then load the workbook and select the worksheet.

# Load the Excel file (adjust the path to where the file lives)
wb = load_workbook("empdata.xlsx")
ws = wb.worksheets[0]

# Create Yattag doc, tag and text objects
doc, tag, text = Doc().tagtext()
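
To see what these three objects do, here is a minimal standalone sketch (with illustrative names): tag is a context manager that opens and closes an element, text() escapes and inserts character data, and doc.getvalue() returns the accumulated XML string.

from yattag import Doc

demo_doc, demo_tag, demo_text = Doc().tagtext()
with demo_tag('Greeting'):
    demo_text('Hello & welcome')  # special characters are escaped automatically
print(demo_doc.getvalue())  # prints: <Greeting>Hello &amp; welcome</Greeting>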

Step 3: Fill the headers with Yattag

Fill in the XML header lines with Yattag's asis() method, which inserts any string as-is (unescaped) as the next line.

xml_header = '<?xml version="1.0" encoding="utf-8"?>'
xml_schema = '<?xml-stylesheet type="text/xsl" href="employees.xsl"?>'  # optional: point this at your own schema or stylesheet
doc.asis(xml_header)
doc.asis(xml_schema)

Step 4: Create the main tag

Create the main Employees tag; every row of the sheet will become one Employee element nested inside it:

with tag('Employees'):

Step 5: Loop through the sheet

Now loop through the sheet with the iter_rows() method. iter_rows() returns a generator that yields the cells of each row; we use a list comprehension to pull the values out of the cells.

for row in ws.iter_rows(min_row=2, max_row=100, min_col=1, max_col=12):
    row = [cell.value for cell in row]
    with tag("Employee"):
        with tag("FirstName"):
            text(row[0])
        with tag("LastName"):
            text(row[1])
        with tag("Email"):
            text(row[10])
        with tag("Phone"):
            text(row[8])
        with tag("Company"):
            text(row[2])
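
As an aside, if your openpyxl version is 2.6 or newer, iter_rows() also accepts values_only=True, which yields plain cell values and makes the list comprehension unnecessary. A sketch of the equivalent loop:

for row in ws.iter_rows(min_row=2, max_row=ws.max_row, min_col=1, max_col=12,
                        values_only=True):
    with tag("Employee"):
        with tag("FirstName"):
            text(row[0])
        # ...remaining fields exactly as above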

Step 6: Add the Yattag indent

Once the Employee elements have been added, we indent the result with Yattag's indent() method and write it to a file.

result = indent(
    doc.getvalue(),
    indentation = ' ',
    indent_text = True
)
print(result)

with open("employee1.xml", "w") as f:
    f.write(result)
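
As a quick sanity check (not part of the original recipe), the file can be parsed back with the standard library; if the generated XML were not well-formed, ET.parse() would raise a ParseError:

import xml.etree.ElementTree as ET

tree = ET.parse("employee1.xml")
print(tree.getroot().tag)  # prints: Employees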

The complete code looks as follows:

import openpyxl, yattag
from openpyxl import load_workbook
from yattag import Doc, indent

# Load the Excel file (adjust the path to where the file lives)
wb = load_workbook("empdata.xlsx")
ws = wb.worksheets[0]

# Create Yattag doc, tag and text objects
doc, tag, text = Doc().tagtext()

xml_header = '<?xml version="1.0" encoding="utf-8"?>'
xml_schema = '<?xml-stylesheet type="text/xsl" href="employees.xsl"?>'  # optional: point this at your own schema or stylesheet
doc.asis(xml_header)
doc.asis(xml_schema)

with tag('Employees'):
    # Use ws.max_row instead of 100 to process all rows
    for row in ws.iter_rows(min_row=2, max_row=100, min_col=1, max_col=12):
        row = [cell.value for cell in row]
        with tag("Employee"):
            with tag("FirstName"):
                text(row[0])
            with tag("LastName"):
                text(row[1])
            with tag("Email"):
                text(row[10])
            with tag("Phone"):
                text(row[8])
            with tag("Company"):
                text(row[2])

result = indent(
    doc.getvalue(),
    indentation = ' ',
    indent_text = True
)
print(result)

with open("employee.xml", "w") as f:  # give the path where you want the file created
    f.write(result)

The output of the above code looks as follows:
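
The shape of the generated file follows directly from the code; with illustrative placeholder values (matching the hypothetical sample row sketched earlier, and the assumed stylesheet line) it looks roughly like this:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="employees.xsl"?>
<Employees>
 <Employee>
  <FirstName>Jane</FirstName>
  <LastName>Doe</LastName>
  <Email>jane.doe@example.com</Email>
  <Phone>555-0100</Phone>
  <Company>Acme Corp</Company>
 </Employee>
 ...
</Employees>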
