Are you looking to create your very own dataset for a new and innovative application? Or maybe you’re trying to collect data for analysis for a college project and have become weary of manually downloading each image or CSV. Worry not, in this article I’ll explain the building blocks needed in order to automate downloading files for these kinds of tasks.
Before you can create an application to download and create datasets for you, you’ll need to know the basics required for automating file downloads via Java code. Getting the basics right will help you use them to your own specific set of needs, whether it’s for a backend server application or Android app.
There are multiple ways to download a file using Java code. Here are just a few ways of how you can accomplish the task:
The most easily available and a basic package available for downloading a file from internet using Java code is the Java IO package. Here we will be using the
BufferedInputStream and the
URL classes to open and read a file on a given address to a file on our local system. The reason we use the
BufferedInputStream class instead of the
InputStream is its buffering ability that gives our code a performance boost.
Before we dive deeper into the coding aspect let’s take an overview of the classes and the individual functions we will be using in the process.
java.net.URL class in Java is a built-in library that offers multiple methods to access and manipulate data on the internet. In this case, we will be using the
openStream() function of the
URL class. The method signature for the
openStream() function is:
public final InputStream openStream() throws IOException
openStream() function works on an object of the
URL class. The
URL class opens up a connection to the given URL and the
openStream() method returns an input stream which is used to read data from the connection.
The second class we will be using is the
BufferedInputStreamReader and the
FileOutputStream. These classes are used for reading from a file and writing to it, respectively.
Here is the complete code:
try (BufferedInputStream inputStream = new BufferedInputStream(new URL("http://example.com/my-file-path.txt").openStream());
Note: You may need to add the ‘User-Agent’ header to the HTTP request since some servers don’t allow downloads from unknown clients.
As you can see we open up a connection using the
URL object and then read it via the
BufferedInputStreamReader object. The contents are read as bytes and copied to a file in the local directory using the
To lower the number of lines of code we can use the
Files class available from Java 7. The
Files class contains methods that read all the bytes at once and then copies it into another file. Here is how you can use it:
InputStream inputStream = new URL("http://example.com/my-file-path.txt").openStream();
Java NIO is an alternative package to handle networking and input-output operations in Java. The main advantage that the Java NIO package offers is that it’s non-blocking, and has channeling and buffering capabilities. When we use the Java IO library we work with streams that read data byte by byte. However, the Java NIO package uses channels and buffers. The buffering and channeling capabilities allow the system to copy contents from a URL directly into the intended file without needing to save the bytes in application memory, which would be an intermediary step. The ability to work with channels boosts performance.
In order to download the contents of a URL, we will use the
ReadableByteChannel and the
ReadableByteChannel readChannel = Channels.newChannel(new URL("http://example.com/my-file-path.txt").openStream());
ReadableByteChannel class creates a stream to read content from the URL. The downloaded contents will be transferred to a file on the local system via the corresponding file channel.
FileOutputStream fileOS = new FileOutputStream("/Users/username/Documents/file_name.txt");
After defining the file channel we will use the
transferFrom() method to copy the contents read from the
readChannel object to the file destination using the
transferTo() methods are much more efficient than working with streams using a buffer. The transfer methods enable us to directly copy the contents of the file system cache to the file on the system. Thus direct channeling restricts the number of context switches required and enhances the overall code performance.
Now, in the following sections, we will be looking at ways to download files from a URL using third-party libraries instead of core Java functionality components.
Apache Commons IO
The Apache Commons IO library offers a list of utility classes to manage IO operations. Now you may be thinking why would we use this when Java has its own set of libraries to handle IO operations. However, Apache Commons IO overcomes the problem of code rewriting and helps avoid writing boilerplate code.
In order to start using the Apache Commons IO library, you will need to download the jar files from the official website. When you are done downloading the jar files, you need to add them to use them. If you are using an Integrated Development Environment (IDE) such as Eclipse, you will need to add the files to the build path of your project. To add files to your project you would need to right click on it, select build path option by navigating through “configure build path-> build path”, and then choose the add external archives option.
To download a file from a given URL using the Apache Commons IO we will require the
FileUtils class of the package. There is only a single line of code required to download a file, which looks like:
The connection and read timeouts convey the permissible time for which either the connection may stay idle or reading from the URL may stop.
Another class of the Apache Commons IO package that can be used to download a file over the internet is the IOUtils class. We will use the
copy(inputStream, fileOS) method to download a file into the local system.
InputStream inputStream = new URL("http://example.com/my-file-path.txt").openStream();
The function returns the number of bytes copied. If the value of the variable i is -1, then it indicates that the contents of the file are over 2GB. When the returned value is -1, you can use the function
copyLarge(inputStream, fileOS) in place of the
copy(inputstream, fileOS) function to handle this load. Both of these functions buffer the
inputstream internally. The internal buffer means we do not have to use the
BufferedInputStream class to enhance our code performance and helps us avoid writing boilerplate code.
Using Apache HTTP Components
Another library managed by the Apache organization is the HttpComponents package. This library uses the request-response mechanism to download the file from a given URL.
The first step to downloading a file is to create an HTTP client object that would issue the request to the server. For this, we will be using the
CloseableHttpClient class. The
CloseableHttpClient class is an abstract class that requires
HttpClientBuilder class to create instances. The code snippet that creates a new HTTP client is as follows:
CloseableHttpClient client = HttpClientBuilder.create().build();
We then need to create an
HttpPost object to send the request to the server. The request is created by the following line of code:
HttpGet request = new HttpGet("url from where the file is intended to be downloaded");
execute(request) function is applied to the client object and returns with a response from the server. Once the request is sent to the server we need a response object to receive the data sent from the server. To catch the response from the server we use the
HttpResponse class object.
HttpResponse response = client.execute(request);
The data sent by the server in the form of a message is obtained through the
HttpEntity entity = response.getEntity();
You can also obtain the response code sent by the server through the
response object and use it to your specific need.
int responseCode = response.getStatusLine().getStatusCode();
The data to be downloaded is encapsulated within the
entity object and can be extracted using the
getContent() function. The
getContent() function returns an
InputStream object that can be further used with a
BufferedInputStreamReader to enhance performance.
InputStream inputStream = entity.getContent();
Now all you need to do is read from the stream byte by byte and write the contents into a file using the
String fileName = "D:\\Demo\file.txt";
The last thing required to be done is closing all the open resources in order to ensure that the system resources are not overutilized and that there are no memory leaks.
So there you have it - these are the simplest ways to download a file using the basic Java code and other third party libraries. Now that we are done with the basics, you can be as creative as you want and utilize the knowledge to suit your needs. So see you next time with a new set of concepts to help you become a better coder. We wish you happy coding till then.
If you like this blog or find it useful for you, you are welcome to comment on it. You are also welcome to share this blog, so that more people can participate in it. If the images used in the blog infringe your copyright, please contact the author to delete them. Thank you !