Article type: Research article

Authors

1 Department of Information Technology Management, Faculty of Management, Central Tehran Branch, Islamic Azad University, Tehran, Iran

2 Department of Industrial Management, Faculty of Management, Central Tehran Branch, Islamic Azad University, Tehran, Iran

Abstract

Today, data, as one of the valuable assets of organizations and industries, plays an important role in the development and progress of businesses. Every organization collects its data from a variety of sources, one of which is the web, where a large volume of data is produced and published every day by users, and even bots, around the world. Collecting and analyzing such data can provide useful information for an organization. To this end, various tools that greatly assist in harvesting information from the web have been developed over the past decades, including the Requests, Selenium, Scrapy, and Beautiful Soup libraries for the Python programming language. However, each of these libraries faces challenges. In this article, we study the Selenium library and, given its numerous challenges, present a solution for managing time and mitigating its asynchrony challenge. Our experiments show that the proposed solution increases the accuracy of the information harvested from the web, thereby mitigating the asynchrony challenge, and also reduces the time needed to harvest information from the web.

Article title [English]

A Mechanism to Manage Time and Increase Data Accuracy When Using the Selenium Library

Authors [English]

  • Farnaz Taghizadeh Kourayem 1
  • Mohammadreza Kabaranzad Ghadim 2
  • Seyed Abdollah Amin Mousavi 1

1 Department of Information Technology Management, Faculty of Management, Central Tehran Branch, Islamic Azad University, Tehran, Iran

2 Department of Industrial Management, Faculty of Management, Central Tehran Branch, Islamic Azad University, Tehran, Iran

Abstract [English]

Introduction

The Internet is a powerful source of information that can be collected with a variety of tools and techniques and, once analyzed, used to make better and more efficient decisions. According to previous researchers, Selenium is often regarded as the best option for automatically extracting information from the web; however, the library poses many challenges. Two of them are its asynchronous behavior and its slowness, which we investigate and try to improve in this article.
 
Research Question(s): How can the slowness and asynchrony challenges of the Selenium library be mitigated?
 

Literature Review

The Selenium library, one of the best web scraping tools, has been used in different studies for different purposes. It is a free, open-source automated testing framework used to verify web applications across multiple platforms and browsers. Selenium test scripts can be written in various programming languages, such as Java, C#, and Python (Teotia et al., 2023). Despite its many advantages, Selenium also has disadvantages, including: 1. slowness, 2. brittleness, 3. flakiness, 4. maintainability, 5. asynchronous behavior, 6. time consumption, 7. cross-browser issues, 8. failure analysis, 9. infrastructure, 10. scalability, 11. assertability, 12. documentation, and 13. support (Leotta et al., 2023).

Methodology

The first issue we examined in the Selenium library is the lack of time management: the waiting time needed to download information from the web is not known in advance. To address the slowness and asynchrony of the Selenium library, we use a solution that consists of three steps:
Step 1) Based on manual checks, we first define the variable t1:
t1 = 0.5
where t1 is the value, in seconds, passed to the sleep function (the time required to open the main page under normal conditions).
Step 2) We use a while loop with a try/except block inside it. If the desired page and its products still do not appear after 6 attempts with increasing sleep times, that is, once t1 exceeds 3 seconds, a message is printed stating that the site is unavailable or the internet connection is slow:
import time

# driver is assumed to be an already created Selenium WebDriver instance
t1 = 0.5  # initial sleep time in seconds
while t1 <= 3:
    try:
        driver.get("""The Study site""")  # open the page
        time.sleep(t1)  # wait for the page to open

        # Code related to page scrolling
        time.sleep(t1)  # wait for information to be displayed after scrolling
        # Information collection codes
        break  # success: keep the current t1 for the next pages
    except Exception:
        t1 += 0.5  # page not ready: wait half a second longer and retry
else:
    # reached only when all 6 attempts have failed (t1 has exceeded 3 seconds)
    print('The internet speed is very slow or the intended site is down')
In this code, the program first tries to display the page information in full with t1 = 0.5; if it fails, it adds half a second to t1 and retries, for up to 6 attempts in total. If the page is displayed in full, the resulting value of t1 is used for the subsequent pages.
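The retry logic of Step 2 can also be sketched as a self-contained function that runs without a browser. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `load_with_adaptive_wait` and `fake_loader`, the `step`, `max_wait`, and `sleep` parameters, and the `load_page` callable (which stands in for the open, scroll, and collect sequence and is assumed to raise an exception when the page is incomplete) are all hypothetical.

```python
import time


def load_with_adaptive_wait(load_page, t1=0.5, step=0.5, max_wait=3.0, sleep=time.sleep):
    """Retry load_page with a growing sleep time and return (result, wait_used).

    load_page is a hypothetical callable that receives the current wait time
    and a sleep function, and raises an exception if the page is incomplete.
    """
    while t1 <= max_wait:
        try:
            result = load_page(t1, sleep)
            return result, t1  # success: the caller can reuse t1 for later pages
        except Exception:
            t1 += step  # page not ready: wait longer on the next attempt
    raise TimeoutError("The internet speed is very slow or the intended site is down")


# Stand-in loader (an assumption for this sketch): it succeeds only once the
# wait reaches 1.5 seconds, imitating a page that renders slowly.
def fake_loader(wait, sleep):
    sleep(wait)
    if wait < 1.5:
        raise RuntimeError("page not fully rendered")
    return ["product A", "product B"]


# A no-op sleep keeps the demonstration instantaneous.
products, wait_used = load_with_adaptive_wait(fake_loader, sleep=lambda seconds: None)
print(products, wait_used)
```

Returning the wait that finally worked lets the caller pass it back in as `t1` for the next page, which mirrors the reuse of t1 described above. Raising an exception instead of printing is a design choice of this sketch, so that the calling code can decide how to report the failure.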
Step 3) If the page opens, we move on to the third step, which collects the basic information of the products and must avoid the asynchrony problem.
At this stage, because each product has five different types of information (product name, price, discount amount, price after discount, and discount type), we first define five empty lists for the product information.
Then, using the information-collection commands, we retrieve the information for each product and append it to the corresponding list. We then run the following script to keep the lists aligned:
 
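# The four lists that must stay aligned with product_name
check1 = [product_prices, product_off, product_prices2, product_off_type]

for j in range(4):
    if len(product_name) < len(check1[j]):
        product_name.append('error')  # the name is missing for this product
    elif len(product_name) > len(check1[j]):
        check1[j].append('error')  # this field is missing for this product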
 
In other words, after a product's information has been added to the predefined lists, the program gathers the other four lists in check1 and uses the len function to obtain the number of items in each list. Then, in a for loop, the length of the product name list is compared with the length of each of the other lists, and whenever one of the two lists is shorter, the word "error" is appended to it.
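The length check can be illustrated with a small standalone example. This is a generalized sketch rather than the authors' script: instead of comparing each list with product_name one item at a time, the hypothetical helper `equalize_lengths` pads every list up to the length of the longest one. The sample product data is invented for illustration.

```python
def equalize_lengths(columns, filler="error"):
    """Pad every list in columns (in place) with filler up to the longest length."""
    target = max((len(values) for values in columns.values()), default=0)
    for values in columns.values():
        values.extend([filler] * (target - len(values)))
    return columns


# Invented sample data: the second product has no discount information,
# so two of the five lists are one item short.
columns = {
    "product_name": ["Vitamin C", "Zinc"],
    "product_prices": [120, 95],
    "product_off": [10],
    "product_prices2": [108, 95],
    "product_off_type": ["special"],
}

equalize_lengths(columns)
print(columns["product_off"])       # [10, 'error']
print(columns["product_off_type"])  # ['special', 'error']
```

Running the check after every product, as the text describes, means a missing value is marked in the row where it occurred, so later products do not shift into the wrong row.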

Results

To evaluate the proposed solution, we collected the information on supplementary drugs listed on the DigiKala site, almost 3000 different medicines, 13 times on 13 different dates between September 23, 2023, and March 15, 2024. On each date, the code written to retrieve information from the site was executed once with the proposed solution and once without it. In each execution, both the retrieval time and the number of items collected in each list (that is, each column) were recorded. The error rate and its percentage were then calculated from the difference in collection time between the two executions and from the difference in the number of items collected for each column. The results show that, with the proposed solution, the accuracy and correctness of the collected information increased and the collection time improved compared with not using it.
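The text does not give the exact formula behind the error percentages, so the following is only one plausible reading: a relative difference between the two executions, expressed as a percentage of a chosen baseline. The helper name `percent_change`, the choice of baseline, and all numbers are assumptions made for illustration; none of them are measurements from the study.

```python
def percent_change(baseline, value):
    """Return the change from baseline to value as a percentage of baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100 * (value - baseline) / baseline


# Illustrative numbers only; they are NOT results reported in the article.
time_without = 400.0  # seconds, execution without the proposed solution
time_with = 300.0     # seconds, execution with the proposed solution
rows_with = 3000      # items collected in one column with the solution
rows_without = 2940   # items collected in the same column without it

# Negative value: the run with the solution took less time than the baseline.
time_change = percent_change(time_without, time_with)
# Share of items missing from the column when the solution is not used.
row_shortfall = -percent_change(rows_with, rows_without)

print(f"time change: {time_change:.1f}%, missing items: {row_shortfall:.1f}%")
```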

Discussion and Conclusion

In this article, we studied and evaluated the slowness, time consumption, and asynchronous behavior of the Selenium library. Our studies were conducted using the Python programming language. The results show that it is very important to check that the lists have the same length after collecting each product from the web: without this check, we encountered an error in 12 of the 13 collection runs. Also, using a constant value for the sleep function significantly increases the retrieval time compared with using a variable value. Overall, the findings show that applying the proposed solution when using the Selenium library to extract information from the web increases the accuracy of the information and reduces the time needed for complete information retrieval.

Keywords [English]

  • Web Crawler
  • Web Scraping
  • Selenium Library
  • Asynchronous
  • Data Accuracy
  • Data Retrieval Time
Boeing, G., & Waddell, P. (2017). New insights into rental housing markets across the United States: Web scraping and analyzing craigslist rental listings. Journal of Planning Education and Research, 37(4), 457-476.
Henrys, K. (2021). Importance of web scraping in e-commerce and e-marketing. Available at SSRN 3769593.
Krotov, V., & Tennyson, M. (2018). Scraping Financial Data from the Web Using the R Language. Journal of Emerging Technologies in Accounting, 15(1), 169-181.
Krotov, V., Johnson, L., & Silva, L. (2020). Tutorial: Legality and ethics of web scraping.
Leotta, M., García, B., Ricca, F., & Whitehead, J. (2023, April). Challenges of end-to-end testing with selenium WebDriver and how to face them: A survey. In 2023 IEEE Conference on Software Testing, Verification and Validation (ICST) (pp. 339-350). IEEE.
Neumann, M., Steinberg, J., & Schaer, P. (2017). Web-Scraping for non-programmers: Introducing OXPath for digital library metadata harvesting. Code4Lib Journal, 38.
Suganthi, V., & Varun, M. M. (2024). Automation Using Selenium. International Journal of Multidisciplinary Research in Science, Engineering and Technology, e-ISSN: 2582-7219.
Teotia, H., Shishodia, G., Tyagi, E., Prakash, A., & Avasthi, S. (2023, April). Instagram Analysis and Activity Automation: Using Python and Selenium Automation Tools. In 2023 International Conference on Computational Intelligence, Communication Technology and Networking (CICTN) (pp. 522-526). IEEE.
Yuan, S. (2023). Design and Visualization of Python Web Scraping Based on Third-Party Libraries and Selenium Tools. Academic Journal of Computing & Information Science, 6(9), 25-31.