Parsing XML data by URL with BeautifulSoup

First, create a virtual environment for the project:

$ python3 -m venv parsing_venv
$ source ./parsing_venv/bin/activate
A virtual environment is not strictly required, but it is worth using if you want to keep your system-wide Python installation clean.
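
To confirm that the environment is actually active before installing anything, a small check can be run from Python. This is a minimal sketch that assumes the environment was created with the standard venv module; the helper name in_virtualenv is hypothetical and not part of any library.

```python
import sys


def in_virtualenv() -> bool:
    """Hypothetical helper: report whether Python runs inside a venv environment."""
    # Inside an environment created by the venv module, sys.prefix points at the
    # environment, while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```

If this prints False after running the activate script, the shell is most likely still using the system interpreter.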

Then install the modules the script needs:

$ pip install bs4 requests lxml
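
A quick way to verify that the installation succeeded is to ask Python whether each module can be found. The sketch below uses only the standard library; the function name missing_modules is an assumption made for illustration.

```python
import importlib.util


def missing_modules(names):
    """Hypothetical helper: return the names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The three top-level modules that the pip command above is expected to provide.
missing = missing_modules(["bs4", "requests", "lxml"])
print("missing modules:", missing or "none")
```

An empty result suggests that all three packages are visible to the interpreter in the active environment.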

Next, we examine the XML tree served at the target URL. In this example we use https://www.mapi.gov.il/ProfessionalInfo/Documents/dataGov/CITY.xml, which has the following structure:

<esri:Workspace xmlns:esri="http://www.esri.com/schemas/ArcGIS/9.3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <WorkspaceDefinition xsi:type="esri:WorkspaceDefinition">
...
    </WorkspaceDefinition>
    <WorkspaceData xsi:type="esri:WorkspaceData">
        <DatasetData xsi:type="esri:TableData">
            <DatasetName>city</DatasetName>
            <DatasetType>esriDTFeatureClass</DatasetType>
            <Data xsi:type="esri:RecordSet">
...
                <Records xsi:type="esri:ArrayOfRecord">
                    <Record xsi:type="esri:Record">
                        <Values xsi:type="esri:ArrayOfValue">
                            <Value xsi:type="xs:int">1</Value>
                            <Value xsi:type="esri:PointN">
                                <X>184689.8424</X>
                                <Y>640598.3157</Y>
                            </Value>
                            <Value xsi:type="xs:int">1</Value>
                            <Value xsi:type="xs:short">862</Value>
                            <Value xsi:type="xs:string">גני יוחנן</Value>
                            <Value xsi:type="xs:int">536</Value>
                            <Value xsi:type="xs:short">31</Value>
                            <Value xsi:type="xs:string">מושבים (כפרים שיתופיים) (ב)</Value>
                            <Value xsi:type="xs:string">GANNE YOHANAN</Value>
                        </Values>
                    </Record>
...
                </Records>
            </Data>
        </DatasetData>
    </WorkspaceData>
...
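
Each Value in the listing above carries an xsi:type attribute, which can guide the conversion of the element text into a Python type. The sketch below covers only the types visible in the excerpt (xs:int, xs:short, xs:string and the X/Y pair of esri:PointN); the function names convert_value and parse_point are hypothetical, and the real file may contain other types that would need their own handling.

```python
def convert_value(xsi_type, text):
    """Hypothetical helper: convert the text of a simple Value by its xsi:type."""
    if xsi_type in ("xs:int", "xs:short"):
        return int(text)
    # xs:string and any type not seen in the excerpt are returned unchanged.
    return text


def parse_point(x_text, y_text):
    """Hypothetical helper: turn the X and Y texts of an esri:PointN into floats."""
    return float(x_text), float(y_text)


# (xsi:type, text) pairs copied from the sample Record shown above.
pairs = [
    ("xs:int", "1"),
    ("xs:short", "862"),
    ("xs:int", "536"),
    ("xs:string", "GANNE YOHANAN"),
]
print([convert_value(xsi_type, text) for xsi_type, text in pairs])
print(parse_point("184689.8424", "640598.3157"))
```

With BeautifulSoup and the XML parser, the attribute would typically be read from a parsed element as value.get("xsi:type"), the same key the lookup in this article relies on.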

To get all the child elements of the Records tag, we first locate the WorkspaceData element by its tag name and xsi:type attribute, then collect every Record element beneath it and iterate over them, reading each Value of a Record by its position:
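
The lookup and iteration can be tried offline on a small inline document before touching the network. This is a minimal sketch, assuming bs4 and lxml are installed; SAMPLE is a shortened, hand-made excerpt of the structure above with only three Value elements, and extract_rows is a hypothetical name.

```python
import bs4 as bs

# Shortened excerpt for illustration only; the real file has more Value elements.
SAMPLE = """<esri:Workspace xmlns:esri="http://www.esri.com/schemas/ArcGIS/9.3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <WorkspaceData xsi:type="esri:WorkspaceData">
    <Records xsi:type="esri:ArrayOfRecord">
      <Record xsi:type="esri:Record">
        <Values xsi:type="esri:ArrayOfValue">
          <Value xsi:type="xs:int">1</Value>
          <Value xsi:type="esri:PointN"><X>184689.8424</X><Y>640598.3157</Y></Value>
          <Value xsi:type="xs:string">GANNE YOHANAN</Value>
        </Values>
      </Record>
    </Records>
  </WorkspaceData>
</esri:Workspace>"""


def extract_rows(xml_text):
    """Hypothetical helper: return (id, x, y, name) tuples from the shortened sample."""
    soup = bs.BeautifulSoup(xml_text, features="xml")
    # Search by tag name plus attribute, then walk every Record underneath.
    workspace_data = soup.find("WorkspaceData", {"xsi:type": "esri:WorkspaceData"})
    rows = []
    for record in workspace_data.find_all("Record"):
        values = record.find_all("Value")
        point = values[1]
        rows.append((values[0].text, point.find("X").text, point.find("Y").text, values[2].text))
    return rows


print(extract_rows(SAMPLE))
```

Working on a local string first makes it easier to see which index holds which value before applying the same pattern to the full download.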

# import required modules
import bs4 as bs
import requests


def get_parsed_cities():
    # URL of the XML document to parse
    URL = 'https://www.mapi.gov.il/ProfessionalInfo/Documents/dataGov/CITY.xml'

    # download the document, fail early on an HTTP error, and decode it as UTF-8
    response = requests.get(URL, timeout=30)
    response.raise_for_status()
    response.encoding = 'utf-8'
    soup = bs.BeautifulSoup(response.text, features="xml")

    # find the WorkspaceData element and collect every Record inside it
    workspace_data = soup.find('WorkspaceData', {"xsi:type": "esri:WorkspaceData"})
    records = workspace_data.find_all('Record')

    cities = []

    for record in records:
        # look up the Value elements once per record; each is read by position
        values = record.find_all('Value')
        point = values[1]  # the second Value is an esri:PointN holding X and Y

        city = {
            "record_id": values[0].text,
            "X": point.find('X').text,
            "Y": point.find('Y').text,
            "record_id_2": values[2].text,
            "city_id": values[3].text,
            "city_name_heb": values[4].text,
            "secondary_id": values[5].text,
            "city_type_id": values[6].text,
            "city_type_name": values[7].text,
            "city_name_eng": values[8].text,
        }
        cities.append(city)

    return cities
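
The returned list of dictionaries can then be saved or passed on. As one possible follow-up, the sketch below serializes such a list to CSV text with the standard library. The function name cities_to_csv is hypothetical, and the sample row is built from the single Record shown earlier rather than fetched from the URL.

```python
import csv
import io


def cities_to_csv(cities):
    """Hypothetical helper: serialize city dicts to CSV text, first dict's keys as header."""
    if not cities:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(cities[0].keys()))
    writer.writeheader()
    writer.writerows(cities)
    return buffer.getvalue()


# One row with the same keys that get_parsed_cities() produces.
sample = [{
    "record_id": "1",
    "X": "184689.8424",
    "Y": "640598.3157",
    "record_id_2": "1",
    "city_id": "862",
    "city_name_heb": "גני יוחנן",
    "secondary_id": "536",
    "city_type_id": "31",
    "city_type_name": "מושבים (כפרים שיתופיים) (ב)",
    "city_name_eng": "GANNE YOHANAN",
}]
print(cities_to_csv(sample))
```

In a real run, cities_to_csv(get_parsed_cities()) would produce one line per Record. When writing the text to disk, opening the file with encoding="utf-8" and newline="" keeps the Hebrew names and the line endings intact.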

WebSofter
