How to read the contents of a web page into a string in Java?

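There are several ways to read the contents of a web page in Java. This article discusses three of them.

Using the openStream() method

The URL class of the java.net package represents a Uniform Resource Locator, which points to a resource (a file, a directory, or a reference) on the World Wide Web.

The openStream() method of this class opens a connection to the URL represented by the current object and returns an InputStream object from which you can read the data at that URL.

Therefore, to read data from a web page (using the URL class) -

Instantiate the java.net.URL class by passing the URL of the desired web page to its constructor as a parameter.
Invoke the openStream() method and obtain the InputStream object.
Instantiate the Scanner class by passing the InputStream object obtained above as a parameter.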
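
One detail of the last step is worth seeing in isolation: by default a Scanner splits its input on whitespace, so appending the tokens returned by next() one after another drops the spaces between words. The sketch below compares that with setting the delimiter to "\\A" (start of input), which returns the whole input as a single token. It reads from an in-memory string rather than a network stream, and the class and method names (ScannerDelimiterDemo, joinTokens, readAll) are hypothetical names chosen for illustration.

```java
import java.util.Scanner;

class ScannerDelimiterDemo {
    // Appends whitespace-delimited tokens one after another (spaces are lost)
    static String joinTokens(String input) {
        StringBuilder sb = new StringBuilder();
        try (Scanner sc = new Scanner(input)) {
            while (sc.hasNext()) {
                sb.append(sc.next());
            }
        }
        return sb.toString();
    }

    // Reads the whole input as one token by using "\\A" as the delimiter
    static String readAll(String input) {
        try (Scanner sc = new Scanner(input).useDelimiter("\\A")) {
            return sc.hasNext() ? sc.next() : "";
        }
    }

    public static void main(String[] args) {
        String html = "<h1>It works!</h1>";
        System.out.println(joinTokens(html)); // prints <h1>Itworks!</h1>
        System.out.println(readAll(html));    // prints <h1>It works!</h1>
    }
}
```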

Example

import java.io.IOException;
import java.net.URL;
import java.util.Scanner;
public class ReadingWebPage {
   public static void main(String args[]) throws IOException {
      //Instantiating the URL class
      URL url = new URL("https://www.something.com/");
      String result;
      //Retrieving the contents of the specified page (assuming it is UTF-8 encoded)
      try (Scanner sc = new Scanner(url.openStream(), "UTF-8")) {
         //"\\A" matches only the start of the input, so the whole page is one token;
         //the default whitespace delimiter would drop the spaces between words
         sc.useDelimiter("\\A");
         result = sc.hasNext() ? sc.next() : "";
      }
      System.out.println(result);
      //Removing the HTML tags
      result = result.replaceAll("<[^>]*>", "");
      System.out.println("Contents of the web page: "+result);
   }
}

Output

<html><body><h1>It works!</h1></body></html>
Contents of the web page: It works!
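
The tag-removal step can be tried without a network connection. The following is a minimal sketch that applies the same regular expression to fixed strings; the class name TagStripper and the method stripTags are hypothetical. It also shows a limitation: the regex only deletes anything that looks like <...>, so character entities such as &lt; are left undecoded, and it is not a substitute for a real HTML parser.

```java
class TagStripper {
    // Hypothetical helper: deletes every substring of the form <...>
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Simple markup is reduced to its text
        System.out.println(stripTags("<html><body><h1>It works!</h1></body></html>")); // prints It works!
        // Entities are not decoded by this approach
        System.out.println(stripTags("<p>1 &lt; 2</p>")); // prints 1 &lt; 2
    }
}
```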

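Using HttpClient

HttpClient (from Apache HttpComponents) is a transfer library that resides on the client side and sends and receives HTTP messages. It provides an up-to-date, feature-rich, and efficient implementation of recent HTTP standards.

A GET request (of the HTTP protocol) retrieves information from a given server using a given URI. Requests that use GET should only retrieve data and should have no other effect on it.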
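
To make the GET request concrete without depending on an external site or library, here is a rough sketch that uses only JDK classes (not the Apache HttpClient API discussed in this section). It starts a throwaway server on the local machine with com.sun.net.httpserver.HttpServer, sends a GET request to it with HttpURLConnection, and reports the method the server saw, the status code, and the body. The class name LocalGetDemo, the method fetchFromLocalServer, and the served page are assumptions made for illustration; readAllBytes() requires Java 9 or later.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicReference;

class LocalGetDemo {
    // Page served by the local test server (an assumed example page)
    static final String PAGE = "<html><body><h1>It works!</h1></body></html>";

    // Returns { request method seen by the server, status code, response body }
    static String[] fetchFromLocalServer() throws IOException {
        AtomicReference<String> seenMethod = new AtomicReference<>();
        // Port 0 asks the operating system for any free port
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            seenMethod.set(exchange.getRequestMethod());
            byte[] body = PAGE.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        try {
            URL url = new URL("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET"); // GET is also the default method
            int status = conn.getResponseCode();
            String body;
            try (InputStream in = conn.getInputStream()) {
                body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
            return new String[] { seenMethod.get(), String.valueOf(status), body };
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws IOException {
        String[] result = fetchFromLocalServer();
        System.out.println(result[0]); // prints GET
        System.out.println(result[1]); // prints 200
        System.out.println(result[2]); // prints the served page
    }
}
```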

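The HttpClient API provides a class named HttpGet, which represents the GET request method. To execute a GET request and retrieve the contents of a web page -

The createDefault() method of the HttpClients class returns a CloseableHttpClient object, which is the base implementation of the HttpClient interface. Using this method, create an HttpClient object.
Create an HTTP GET request by instantiating the HttpGet class. The constructor of this class accepts a String value representing the URI of the web page to which the request is sent.
Execute the HttpGet request by invoking the execute() method.
Retrieve an InputStream object representing the content of the web page from the response as -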

httpresponse.getEntity().getContent()

Example

import java.util.Scanner;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
public class HttpClientExample {
   public static void main(String args[]) throws Exception{
      //Creating a HttpGet object
      HttpGet httpget = new HttpGet("https://www.something.com/");
      String result;
      //Creating a HttpClient object and executing the Get request;
      //try-with-resources closes the client, the response and the scanner
      try (CloseableHttpClient httpclient = HttpClients.createDefault();
           CloseableHttpResponse httpresponse = httpclient.execute(httpget);
           Scanner sc = new Scanner(httpresponse.getEntity().getContent(), "UTF-8")) {
         //"\\A" matches only the start of the input, so the whole page is one token;
         //the default whitespace delimiter would drop the spaces between words
         sc.useDelimiter("\\A");
         result = sc.hasNext() ? sc.next() : "";
      }
      System.out.println(result);
      //Removing the HTML tags
      result = result.replaceAll("<[^>]*>", "");
      System.out.println("Contents of the web page: "+result);
   }
}

Output

<html><body><h1>It works!</h1></body></html>
Contents of the web page: It works!

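Using the Jsoup library

Jsoup is a Java library for working with HTML content. It provides a very convenient API for extracting and manipulating data, using DOM, CSS, and jquery-like methods. It implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.

To retrieve the contents of a web page using the Jsoup library -

The connect() method of the Jsoup class accepts the URL of a web page and returns a Connection object for it. Connect to the desired web page using the connect() method.
The get() method of the Connection interface sends/executes the GET request and returns the HTML document as an object of the Document class. Send the GET request to the page by invoking the get() method.
Retrieve the contents of the obtained document as a String -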

String result = doc.body().text();

Example

import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupExample {
   public static void main(String args[]) throws IOException {
      String page = "https://www.something.com/";
      //Connecting to the web page
      Connection conn = Jsoup.connect(page);
      //executing the get request
      Document doc = conn.get();
      //Retrieving the contents (body) of the web page
      String result = doc.body().text();
      System.out.println(result);
   }
}

Output

It works!