Article type: Research article
Authors
1 Department of Information Technology Management, Faculty of Management, Central Tehran Branch, Islamic Azad University, Tehran, Iran
2 Department of Industrial Management, Faculty of Management, Central Tehran Branch, Islamic Azad University, Tehran, Iran
Abstract
Today, data is one of the most valuable assets of organizations and industries and plays an important role in the development and progress of businesses. Every organization collects its data from a variety of sources. One of these is the web, where large volumes of data are produced and published every day by users, and even by bots, around the world. Collecting and analyzing such data can provide useful information for an organization. To this end, various tools have been developed over the past decades that greatly help with harvesting information from the web, including the Requests, Selenium, Scrapy, and Beautiful Soup libraries, among others, in the Python programming language. However, each of these libraries faces challenges. In this article, we study the Selenium library and, given its numerous challenges, propose a solution for managing time and mitigating its asynchronous behavior. Our experiments show that the proposed solution increases the accuracy of the information harvested from the web, thereby mitigating the asynchrony challenge, and also reduces the time needed to harvest that information.
Keywords
Subjects
Article title [English]
A Mechanism to Manage Time and Increase Data Accuracy When Using the Selenium Library
Authors [English]
- Farnaz Taghizadeh Kourayem 1
- Mohammadreza Kabaranzad Ghadim 2
- Seyed Abdollah Amin Mousavi 1
1 Department of Information Technology Management, Faculty of Management, Central Tehran Branch, Islamic Azad University, Tehran, Iran
2 Department of Industrial Management, Faculty of Management, Central Tehran Branch, Islamic Azad University, Tehran, Iran
Abstract [English]
Introduction
The Internet is a powerful source of information that can be collected with various tools and techniques and, after analysis, used to make better and more efficient decisions. According to previous researchers, Selenium is the preferred option for automatically extracting information from the web. However, this library has many challenges. Two of them are its asynchronous behavior and its slowness, which we investigate and try to improve in this article.
Research Question(s): How can the slowness and asynchronous behavior of the Selenium library be mitigated?
Literature Review
The Selenium library, one of the best web scraping tools, has been used in different studies for different purposes. It is a free, open-source automated testing framework used to verify web applications across multiple platforms and browsers. Various programming languages, such as Java, C#, and Python, can be used to create Selenium test scripts (Teotia et al., 2023). Despite its many advantages, Selenium also has disadvantages, including: 1. Slowness, 2. Brittleness, 3. Flakiness, 4. Maintainability, 5. Asynchronous behavior, 6. Time-consuming execution, 7. Cross-browser issues, 8. Failure analysis, 9. Infrastructure, 10. Scalability, 11. Assertability, 12. Documentation, and 13. Support (Leotta et al., 2023).
Methodology
The first issue we examined in the Selenium library is the lack of time management: the waiting time needed to download information from the web is not known in advance. To address the slowness and asynchronous behavior of the Selenium library, we used a solution that consists of three steps:
Step 1) Based on manual checks, we first define the variable t1:
t1 = 0.5
where t1 is the value used in the sleep function (time required to open the main page in normal mode).
Step 2) We use a while loop with a try/except block inside it. If the desired page and its products still do not appear after six attempts with increasing sleep times (that is, once t1 would exceed 3 seconds), an error is printed stating that the site is unavailable or the internet is slow:
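import time
from selenium import webdriver

driver = webdriver.Chrome()  # assumption: the source does not show how the driver is created
t1 = 0.5
while t1 <= 3:
    try:
        driver.get("""The Study site""")  # open the page (placeholder for the studied site's URL)
        time.sleep(t1)  # wait for the page to open
        # …
        # Code related to page scrolling
        time.sleep(t1)  # wait for information to be displayed after scrolling
        # Information collection codes
        break  # success: keep the current t1 for the next pages
    except Exception:
        t1 += 0.5  # failure: wait half a second longer on the next attempt
else:
    # reached only when every attempt (t1 = 0.5 to 3) has failed
    print('The internet speed is very slow or the intended site is down')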
In this code, the program first tries to display the full page information with t1 = 0.5. If that fails, it adds half a second to t1 and tries again, up to six times in total. Once the page is displayed in full, the current value of t1 is used for the next pages.
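The retry logic of Step 2 can also be sketched as a small helper that does not depend on Selenium. This is a minimal sketch, not the authors' implementation: the names `load_with_adaptive_wait` and `attempt` are hypothetical, and `attempt` stands for whatever code opens the page, scrolls, and collects the data using the current wait time.

```python
import time


def load_with_adaptive_wait(attempt, start=0.5, step=0.5, limit=3.0):
    """Call attempt(t1) with a growing wait time until it succeeds.

    attempt is a hypothetical callable that opens the page, sleeps for
    t1 seconds where needed, and collects the data; it must raise an
    exception when the page content is incomplete.
    Returns (t1, result) on success, or (None, None) if every wait
    time up to limit failed.
    """
    t1 = start
    while t1 <= limit:
        try:
            result = attempt(t1)
            return t1, result  # keep this t1 for the next pages
        except Exception:
            t1 += step  # wait longer on the next attempt
    print('The internet speed is very slow or the intended site is down')
    return None, None


if __name__ == "__main__":
    def fake_attempt(t1):
        # Stand-in for the Selenium code: pretend the page needs 1.5 s.
        time.sleep(0)  # a real attempt would call time.sleep(t1)
        if t1 < 1.5:
            raise RuntimeError("page not fully loaded")
        return ["product A", "product B"]

    print(load_with_adaptive_wait(fake_attempt))
```

Passing the page-loading code in as a callable keeps the timing policy separate from the scraping code, so the same helper can be reused for every page, starting from the last successful value of t1.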
Step 3) If the page opens, we move to the third step, in which the basic product information is collected and the asynchrony problem must be avoided.
Because each product has five different types of information (product name, price, discount amount, price after discount, and discount type), we first define five separate empty lists for the product information.
Then, using the information-collection commands, we take the information for each product and append it to the corresponding list. Finally, we run the following script to prevent misaligned items in the lists:
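# the four lists that are compared against product_name
check1 = [product_prices, product_off, product_prices2, product_off_type]
for j in range(0, 4):
    if len(product_name) < len(check1[j]):
        product_name.append('error')  # the name is missing for this product
    elif len(product_name) > len(check1[j]):
        check1[j].append('error')  # this field is missing for this product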
In other words, after the information for a product has been added to the predefined main lists, the program gathers the lists in check1 and uses the len function to obtain the number of items in each. The for loop then compares the length of product_name with the length of each of the other lists, and whenever one of the two is shorter, the word "error" is appended to the shorter list.
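The same length check can be written as a helper that pads every shorter list in a single pass. This is a sketch under assumptions: `pad_to_equal_length` is a hypothetical name, the sample values are invented, and padding all lists to the length of the longest one is one reading of the description rather than the authors' exact procedure.

```python
def pad_to_equal_length(columns, placeholder='error'):
    """Pad every list in columns (a dict of name -> list) in place so
    that all lists are as long as the longest one."""
    target = max((len(values) for values in columns.values()), default=0)
    for values in columns.values():
        values.extend([placeholder] * (target - len(values)))
    return columns


# Illustrative data: the second product has no discount information.
columns = {
    'product_name': ['A', 'B'],
    'product_prices': ['100', '200'],
    'product_off': ['10%'],
    'product_prices2': ['90'],
    'product_off_type': ['special'],
}
pad_to_equal_length(columns)
print(columns['product_off'])
```

Calling the helper after each product keeps the rows aligned, so a missing field shows up as an explicit "error" entry instead of shifting later values onto the wrong product.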
Results
To evaluate the solution presented in this research, we collected the information on supplementary medicines listed on the DigiKala site, almost 3000 different medicines, 13 times on 13 different dates between September 23, 2023, and March 15, 2024. Each time, the code written to retrieve information from the site was executed twice: first with the proposed solution and then without it. In each execution, both the retrieval time and the number of items collected in each list (that is, each column) were recorded and compared. The error rate and its percentage were calculated from the difference in collection time between the two executions and from the difference in the number of items collected for each column. The results show that, with the proposed solution, the accuracy and correctness of the collected information increased compared with not using it, and the collection time also improved.
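The comparison between the two executions can be expressed as a small calculation. The source does not give the exact formulas, so the definitions below are assumptions: the time difference is taken as a percentage of the run without the proposed solution, and the per-column error is the difference in item counts. The function names are hypothetical and the numbers are invented purely for illustration; they are not results from the study.

```python
def percent_difference(with_solution, without_solution):
    """Difference between the two runs as a percentage of the run
    without the proposed solution (assumed definition)."""
    if without_solution == 0:
        raise ValueError("baseline value must be non-zero")
    return (without_solution - with_solution) * 100 / without_solution


def column_errors(counts_with, counts_without):
    """Per-column difference in the number of collected items."""
    return {name: counts_with[name] - counts_without[name]
            for name in counts_with}


# Invented example values, not measurements from the article.
time_with, time_without = 400, 500  # seconds
counts_with = {'product_name': 10, 'product_prices': 10}
counts_without = {'product_name': 10, 'product_prices': 8}

print(percent_difference(time_with, time_without))
print(column_errors(counts_with, counts_without))
```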
Discussion and Conclusion
In this article, we studied and evaluated the slowness, time consumption, and asynchronous behavior of the Selenium library. Our studies were conducted using the Python programming language. They show that it is very important to check that all lists have the same length after collecting each product from the web: without this check, errors occurred in 12 of the 13 collection runs. In addition, using a constant value for the sleep function significantly increases retrieval time compared with using a variable value. Overall, the findings show that applying the proposed solution when extracting information from the web with the Selenium library increases the accuracy of the information and reduces the time needed to retrieve it completely.
Keywords [English]
- Web Crawler
- Web Scraping
- Selenium Library
- Asynchronous
- Data Accuracy
- Data Retrieval Time